Partitioning
Individual compute nodes are grouped into larger subsets of the cluster to form so-called partitions.
MOGON NHR
Partition | Nodes | Limit | RAM | Intended Use |
---|---|---|---|---|
smallcpu | CPU-Nodes | 6 days | $\space256\thinspace\text{GB}$ $\space512\thinspace\text{GB}$ | for jobs using $\text{n} \le 32$; max. running jobs per user: 3,000 |
parallel | CPU-Nodes | 6 days | $\space256\thinspace\text{GB}$ $\space512\thinspace\text{GB}$ | jobs using $\text{n}\times128$ for $\text{n}\in\{1,2,3,\ldots\}$ |
longtime | CPU-Nodes | 12 days | $\space256\thinspace\text{GB}$ $\space512\thinspace\text{GB}$ | long running jobs $\ge \text{6 days}$ |
largemem | CPU-Nodes | 6 days | $\space1024\thinspace\text{GB}$ | — |
hugemem | CPU-Nodes | 6 days | $\space2048\thinspace\text{GB}$ | — |
quick | CPU-Nodes | 8 hours | $\space256\thinspace\text{GB}$ | for jobs using $\text{n} \le 32$ max. running jobs per user: 4 |
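As a minimal sketch (the account name and executable are placeholders), a batch script for the `parallel` partition on MOGON NHR could look like this, requesting whole multiples of 128 cores as intended for that partition:

```bash
#!/bin/bash
#SBATCH --job-name=cpu-example        # placeholder job name
#SBATCH --account=<your_account>      # replace with your project account
#SBATCH --partition=parallel          # MOGON NHR CPU partition
#SBATCH --nodes=2                     # 2 x 128 = 256 cores, i.e. whole nodes
#SBATCH --ntasks-per-node=128
#SBATCH --time=1-00:00:00             # must stay below the 6 day partition limit

srun ./my_mpi_program                 # placeholder executable
```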
Partitions supporting Accelerators
Partition | Nodes | Limit | RAM | Intended Use |
---|---|---|---|---|
mi250 | AMD-Nodes | 6 days | $\space1024\thinspace\text{GB}$ | — |
a40 | A40 | 6 days | $\space1024\thinspace\text{GB}$ | — |
a100dl | A100 | 6 days | $\space1024\thinspace\text{GB}$ | — |
a100ai | A100 | 6 days | $\space2048\thinspace\text{GB}$ | — |
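A job on one of these accelerator partitions could be requested along the following lines; the generic `--gres=gpu:1` specification and the executable are assumptions, since the exact GRES names configured on the cluster may differ:

```bash
#!/bin/bash
#SBATCH --account=<your_account>   # replace with your project account
#SBATCH --partition=a100dl         # one of the accelerator partitions above
#SBATCH --gres=gpu:1               # request one GPU (exact GRES name may differ)
#SBATCH --cpus-per-task=16
#SBATCH --time=06:00:00            # well below the 6 day partition limit

srun ./my_gpu_program              # placeholder executable
```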
Private Partitions within MOGON NHR
Partition | Nodes | Limit | RAM | Accelerators |
---|---|---|---|---|
topml | gpu0601 | 6 days | $1\thinspace\text{TB}$ | NVIDIA H100 80GB HBM3 |
komet | floating partition | 6 days | $256\thinspace\text{GB}$ | - |
czlab | gpu0602 | 6 days | $1.5\thinspace\text{TB}$ | NVIDIA L40 |
MOGON II
- Only ~5% of nodes are available for small jobs ($\text{n} \ll 40$).
- Each account has a `GrpTRESRunLimit`. Check it using `sacctmgr -s list account <your_account> format=account,GrpTRESRunMin`; to get your accounts, you can use `sacctmgr -n -s list user $USER format=Account%20 | grep -v none`. The default is `cpu=22982400`, which is the equivalent of using 700 nodes for 12 hours in total:
Partition | Nodes | Limit | RAM | Interconnect | Intended Use |
---|---|---|---|---|---|
smp | z-nodes x-nodes | 5 days | $\space64\thinspace\text{GB}$ $\space96\thinspace\text{GB}$ $128\thinspace\text{GB}$ $192\thinspace\text{GB}$ $256\thinspace\text{GB}$ | Intel Omnipath | for jobs using $\text{n} \ll 40$ or $\text{n} \ll 64$; max. running jobs per user: 3,000 |
devel | z-nodes x-nodes | 4 hours | $\space64\thinspace\text{GB}$ $\space96\thinspace\text{GB}$ $128\thinspace\text{GB}$ | Intel Omnipath | max. 2 Jobs per User, max. 320 CPUs in total |
parallel | z-nodes x-nodes | 5 days | $\space64\thinspace\text{GB}$ $\space96\thinspace\text{GB}$ $128\thinspace\text{GB}$ $192\thinspace\text{GB}$ $256\thinspace\text{GB}$ | Intel Omnipath | jobs using $\text{n}\times40$ or $\text{n}\times64$ for $\text{n}\in\{1,2,3,\ldots\}$ |
bigmem | z-nodes x-nodes | 5 days | $384\thinspace\text{GB}$ $512\thinspace\text{GB}$ $1\thinspace\text{TB}$ $1.5\thinspace\text{TB}$ | Intel Omnipath | for jobs needing more than $256\thinspace\text{GB}$ of memory |
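For quick tests, the `devel` partition can also be used interactively, for example (the account is a placeholder):

```bash
# request 4 cores on one node for 30 minutes in the devel partition
salloc -p devel -A <your_account> -N 1 -n 4 -t 00:30:00
```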
Partitions supporting Accelerators
Partition | Nodes | Limit | Interconnect | Accelerators | Comment |
---|---|---|---|---|---|
deeplearning | dgx-nodes | 12 hours | Infiniband | 8 Tesla V100-SXM2 per node | for access, get in touch with us |
m2_gpu | s-nodes | 5 days | Infiniband | 6 GeForce GTX 1080 Ti per node | - |
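On `m2_gpu`, a single GPU could be requested roughly as follows; the plain `gpu` GRES name and the script name are assumptions:

```bash
sbatch -p m2_gpu -A <your_account> --gres=gpu:1 -c 4 -t 02:00:00 my_gpu_job.sh
```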
Private Partitions within MOGON II
Partition | Nodes | Limit | RAM | Interconnect | Accelerators |
---|---|---|---|---|---|
himster2_exp | x0753 - x0794, x2001 - x2023 | 5 days | $96\thinspace\text{GB}$ | Intel OmniPath | - |
himster2_th | x2024 - x2320 | 5 days | $96\thinspace\text{GB}$ | Intel OmniPath | - |
Most partitions have a default runtime of 10 minutes, after which jobs are automatically killed unless more time is requested using the `-t` flag. The default runtime for a partition can be checked with
scontrol show partition <partition>
The Limit given in the tables above is the maximum requestable runtime; jobs that need more time have to be split up and continued in separate jobs.
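A longer runtime than the default can be requested at submission time, for example (the script name is a placeholder):

```bash
sbatch -p parallel -t 2-00:00:00 my_job.sh   # request 2 days instead of the 10 minute default
```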
Memory limits
The technical specification for RAM on our nodes (as described above) differs slightly from the memory that is effectively available: a small part is always reserved for the operating system, the parallel file system, the scheduler, etc. Therefore, the tables below list the memory limits that are relevant for a job, for example when specifying the `--mem` option.
Memory [MB] | Number of Nodes | Type |
---|---|---|
$\space57{,}000$ | 584 | broadwell |
$\space88{,}500$ | 576 | skylake |
$120{,}000$ | 120 | broadwell |
$177{,}000$ | 120 | skylake |
$246{,}000$ | 40 | broadwell |
Memory [MB] | Number of Nodes | Type |
---|---|---|
$\space\space354{,}000$ | 32 | skylake |
$\space\space498{,}000$ | 20 | broadwell |
$1{,}002{,}000$ | 2 | broadwell |
$1{,}516{,}000$ | 2 | skylake |
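For instance, a job intended for the nodes with 57,000 MB of usable memory could request it explicitly; note that `--mem` is interpreted in megabytes by default (the script name is a placeholder):

```bash
sbatch -p parallel --mem=57000 my_job.sh   # request at most 57,000 MB per node
```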
You can, of course, also use the SLURM command sinfo
to query all these limits. A helpful alias might use the following options:
sinfo -e -o "%20P %16F %8z %.8m %.11l %18f" -S "+P+m"
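To keep this handy, the options could be wrapped in a shell alias, e.g. in your `~/.bashrc` (the alias name is just a suggestion):

```bash
alias partinfo='sinfo -e -o "%20P %16F %8z %.8m %.11l %18f" -S "+P+m"'
```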
The output returns a list of our partitions and
- information on their nodes (`allocated/idle/other/total`),
- the CPU specs of these nodes (`sockets:cores:threads`),
- the size of real memory in megabytes,
- the walltime limits for job requests,
- and feature constraints.
At the time of writing, the output looks like this for MOGON II:
PARTITION NODES(A/I/O/T) S:C:T MEMORY TIMELIMIT AVAIL_FEATURES
bigmem 8/5/19/32 2:16:2 354000 5-00:00:00 anyarch,skylake
bigmem 0/8/12/20 2:10:2 498000 5-00:00:00 anyarch,broadwell
bigmem 0/0/2/2 2:10:2 1002000 5-00:00:00 anyarch,broadwell
bigmem 0/2/0/2 2:16:2 1516000 5-00:00:00 anyarch,skylake
deeplearning 0/1/1/2 2:20:2 490000 18:00:00 anyarch,broadwell
devel 438/14/140/592 2:10:2 57000 4:00:00 anyarch,broadwell
devel 473/9/138/620 2:16:2 88500 4:00:00 anyarch,skylake
devel 99/23/46/168 2:10:2 120000 4:00:00 anyarch,broadwell
devel 60/16/44/120 2:16:2 177000 4:00:00 anyarch,skylake
devel 20/5/15/40 2:10:2 246000 4:00:00 anyarch,broadwell
devel 8/5/19/32 2:16:2 354000 4:00:00 anyarch,skylake
himster2_exp 1/57/7/65 2:16:2 88500 5-00:00:00 anyarch,skylake
himster2_interactive 1/1/0/2 2:16:2 88500 5-00:00:00 anyarch,skylake
himster2_th 272/7/17/296 2:16:2 88500 5-00:00:00 anyarch,skylake
kph_NT 0/14/2/16 2:10:2 246000 5-00:00:00 anyarch,broadwell
m2_gpu 20/5/5/30 2:12:2 115500 5-00:00:00 anyarch,broadwell
m2_gpu-compile 20/5/5/30 2:12:2 115500 1:00:00 anyarch,broadwell
m2_gputest 2/0/0/2 2:12:2 115500 5-00:00:00 anyarch,broadwell
parallel 438/14/140/592 2:10:2 57000 5-00:00:00 anyarch,broadwell
parallel 473/9/138/620 2:16:2 88500 5-00:00:00 anyarch,skylake
parallel 99/23/46/168 2:10:2 120000 5-00:00:00 anyarch,broadwell
parallel 60/16/44/120 2:16:2 177000 5-00:00:00 anyarch,skylake
parallel 20/5/15/40 2:10:2 246000 5-00:00:00 anyarch,broadwell
parallel 8/5/19/32 2:16:2 354000 5-00:00:00 anyarch,skylake
smp 438/14/140/592 2:10:2 57000 5-00:00:00 anyarch,broadwell
smp 473/9/138/620 2:16:2 88500 5-00:00:00 anyarch,skylake
smp 99/23/46/168 2:10:2 120000 5-00:00:00 anyarch,broadwell
smp 60/16/44/120 2:16:2 177000 5-00:00:00 anyarch,skylake
smp 20/5/15/40 2:10:2 246000 5-00:00:00 anyarch,broadwell
smp 8/5/19/32 2:16:2 354000 5-00:00:00 anyarch,skylake
and for MOGON NHR:
PARTITION NODES(A/I/O/T) S:C:T MEMORY TIMELIMIT
a100ai 1/2/1/4 2:64:2 1992000 6-00:00:00
a100dl 1/8/2/11 2:64:1 1016000 6-00:00:00
a40 1/6/0/7 2:64:1 1016000 6-00:00:00
czlab 0/1/0/1 2:64:1 1031828 6-00:00:00
hugemem 0/1/3/4 2:64:1 1992000 6-00:00:00
komet 355/43/34/432 2:64:1 248000 6-00:00:00
largemem 0/19/9/28 2:64:1 1016000 6-00:00:00
longtime 9/0/1/10 2:64:1 248000 12-00:00:00
longtime 10/0/0/10 2:64:1 504000 12-00:00:00
mi250 0/2/0/2 2:64:1 1016000 6-00:00:00
mod 167/4/5/176 2:64:1 504000 6-00:00:00
parallel 355/43/34/432 2:64:1 248000 6-00:00:00
parallel 167/4/5/176 2:64:1 504000 6-00:00:00
quick 355/43/34/432 2:64:1 248000 8:00:00
smallcpu 355/43/34/432 2:64:1 248000 6-00:00:00
topml 0/1/0/1 2:48:2 1547259 6-00:00:00
Hidden Partitions
Information on hidden partitions can be viewed by anyone. These partitions are set to hidden merely to avoid cluttering the output of every poll: they are "private" to certain projects or groups and of interest only to those groups.
To list all jobs of a user across all partitions, supply the `-a` flag:
squeue -u $USER -a
Likewise, `sinfo` can be supplemented with `-a` to gather information on hidden partitions. All other commands work as expected without this flag.
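For example, to list all partition names including the hidden ones:

```bash
sinfo -a -o "%P"
```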