Partitioning

Individual compute nodes are grouped into larger subsets of the cluster to form so-called partitions.

MOGON NHR

| Partition | Nodes | Limit | RAM | Intended Use |
|---|---|---|---|---|
| smallcpu | CPU-Nodes | 6 days | $256\thinspace\text{GB}$ / $512\thinspace\text{GB}$ | for jobs using $\text{n} \le 32$; max. running jobs per user: 3,000 |
| parallel | CPU-Nodes | 6 days | $256\thinspace\text{GB}$ / $512\thinspace\text{GB}$ | jobs using $\text{n}\times128$ for $\text{n}\in[1,2,3,\ldots]$ |
| longtime | CPU-Nodes | 12 days | $256\thinspace\text{GB}$ / $512\thinspace\text{GB}$ | long-running jobs $\ge \text{6 days}$ |
| largemem | CPU-Nodes | 6 days | $1024\thinspace\text{GB}$ | |
| hugemem | CPU-Nodes | 6 days | $2048\thinspace\text{GB}$ | |
| quick | CPU-Nodes | 8 hours | $256\thinspace\text{GB}$ | for jobs using $\text{n} \le 32$; max. running jobs per user: 4 |
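
As an illustration, a batch script header for the parallel partition might look like the following sketch (job name, account, and program are placeholders, not values taken from this documentation):

#!/bin/bash
#SBATCH --job-name=nhr_parallel_example   # placeholder job name
#SBATCH --partition=parallel              # full-node partition: request multiples of 128 tasks
#SBATCH --nodes=2                         # 2 nodes x 128 cores = 256 tasks
#SBATCH --ntasks-per-node=128
#SBATCH --time=2-00:00:00                 # must stay below the 6-day limit
#SBATCH --account=<your_account>          # placeholder account

srun ./my_mpi_program                     # placeholder executable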

Partitions supporting Accelerators

| Partition | Nodes | Limit | RAM | Intended Use |
|---|---|---|---|---|
| mi250 | AMD-Nodes | 6 days | $1024\thinspace\text{GB}$ | |
| a40 | A40 | 6 days | $1024\thinspace\text{GB}$ | |
| a100dl | A100 | 6 days | $1024\thinspace\text{GB}$ | |
| a100ai | A100 | 6 days | $2048\thinspace\text{GB}$ | |
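
A single-GPU job on the a40 partition could be requested along these lines; the GRES string gpu:1 and the other values are assumptions, so check the actual GRES configuration of the nodes (e.g. with scontrol show node) before relying on them:

#!/bin/bash
#SBATCH --partition=a40          # A40 partition, 6-day limit
#SBATCH --gres=gpu:1             # one GPU; GRES name assumed to be "gpu"
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16       # example value
#SBATCH --time=12:00:00

srun ./my_gpu_program            # placeholder executable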

Private Partitions within MOGON NHR

| Partition | Nodes | Limit | RAM | Accelerators |
|---|---|---|---|---|
| topml | gpu0601 | 6 days | $1\thinspace\text{TB}$ | NVIDIA H100 80GB HBM3 |
| komet | floating partition | 6 days | $256\thinspace\text{GB}$ | - |
| czlab | gpu0602 | 6 days | $1.5\thinspace\text{TB}$ | NVIDIA L40 |

MOGON II

  • Only ~5% of nodes are available for small jobs ($\text{n} \ll 40$).
  • Each account has a GrpTRESRunMins limit.

Check it using sacctmgr -s list account <your_account> format=account,GrpTRESRunMin; to find your accounts, you can use sacctmgr -n -s list user $USER format=Account%20 | grep -v none. The default is cpu=22982400, which is the equivalent of using 700 nodes for 12 hours in total (22,982,400 CPU-minutes divided by 12 × 60 minutes corresponds to 31,920 CPUs busy at once).
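
For convenience, the two queries again as copy-and-paste commands (replace <your_account> with one of your account names):

# show the run-minute limit of one of your accounts
sacctmgr -s list account <your_account> format=account,GrpTRESRunMin

# list the accounts you belong to
sacctmgr -n -s list user $USER format=Account%20 | grep -v none

The MOGON II partitions are: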

| Partition | Nodes | Limit | RAM | Interconnect | Intended Use |
|---|---|---|---|---|---|
| smp | z-nodes, x-nodes | 5 days | $64\thinspace\text{GB}$ / $96\thinspace\text{GB}$ / $128\thinspace\text{GB}$ / $192\thinspace\text{GB}$ / $256\thinspace\text{GB}$ | Intel OmniPath | for jobs using $\text{n} \ll 40$ or $\text{n} \ll 64$; max. running jobs per user: 3,000 |
| devel | z-nodes, x-nodes | 4 hours | $64\thinspace\text{GB}$ / $96\thinspace\text{GB}$ / $128\thinspace\text{GB}$ | Intel OmniPath | max. 2 jobs per user, max. 320 CPUs in total |
| parallel | z-nodes, x-nodes | 5 days | $64\thinspace\text{GB}$ / $96\thinspace\text{GB}$ / $128\thinspace\text{GB}$ / $192\thinspace\text{GB}$ / $256\thinspace\text{GB}$ | Intel OmniPath | jobs using $\text{n}\times40$ or $\text{n}\times64$ for $\text{n}\in[1,2,3,\ldots]$ |
| bigmem | z-nodes, x-nodes | 5 days | $384\thinspace\text{GB}$ / $512\thinspace\text{GB}$ / $1\thinspace\text{TB}$ / $1.5\thinspace\text{TB}$ | Intel OmniPath | for jobs needing more than $256\thinspace\text{GB}$ of memory |
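
Analogously, a full-node job on the MOGON II parallel partition might be sketched as follows (placeholders as above; whether 40 or 64 tasks per node apply depends on requesting broadwell or skylake nodes, and the exact task placement depends on the Slurm configuration):

#!/bin/bash
#SBATCH --partition=parallel
#SBATCH --constraint=broadwell    # 40 logical CPUs per node; use skylake for 64
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=40      # n x 40 tasks in total
#SBATCH --time=1-00:00:00         # below the 5-day limit

srun ./my_mpi_program             # placeholder executable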

Partitions supporting Accelerators

| Partition | Nodes | Limit | Interconnect | Accelerators | Comment |
|---|---|---|---|---|---|
| deeplearning | dgx-nodes | 12 hours | InfiniBand | 8 Tesla V100-SXM2 per node | for access get in touch with us |
| m2_gpu | s-nodes | 5 days | InfiniBand | 6 GeForce GTX 1080 Ti per node | - |

Private Partitions within MOGON II

| Partition | Nodes | Limit | RAM | Interconnect | Accelerators |
|---|---|---|---|---|---|
| himster2_exp | x0753 - x0794, x2001 - x2023 | 5 days | $96\thinspace\text{GB}$ | Intel OmniPath | - |
| himster2_th | x2024 - x2320 | 5 days | $96\thinspace\text{GB}$ | Intel OmniPath | - |

Default Runtime

Most partitions have a default runtime of 10 minutes, after which jobs are automatically killed unless more time is requested using the -t flag. The default runtime for a partition can be checked with

scontrol show partition <partition>

The Limit listed above is the maximum runtime that can be requested for a job. Jobs that need more time have to be split up and continued in a separate job.
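
For example, a longer runtime can be requested on the command line or inside the job script (job.sh and the time values are placeholders):

# request 30 minutes instead of the 10-minute default
sbatch -t 00:30:00 job.sh

# or, equivalently, inside the job script
#SBATCH --time=2-00:00:00    # 2 days; must not exceed the partition limit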

Memory limits

The technical specification for RAM on our nodes (as described above) is slightly different from the memory that is effectively available: a small part is always reserved for the operating system, the parallel file system, the scheduler, etc. The table below therefore lists the memory limits that are relevant for a job, for example when specifying the --mem option.

| Memory [MB] | Number of Nodes | Type |
|---|---|---|
| 57,000 | 584 | broadwell |
| 88,500 | 576 | skylake |
| 120,000 | 120 | broadwell |
| 177,000 | 120 | skylake |
| 246,000 | 40 | broadwell |
| 354,000 | 32 | skylake |
| 498,000 | 20 | broadwell |
| 1,002,000 | 2 | broadwell |
| 1,516,000 | 2 | skylake |
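
When requesting memory, stay at or below these limits. For example, to fit on any of the smaller broadwell nodes one could request at most 57,000 MB (job.sh is a placeholder):

# request the full memory of a small broadwell node
sbatch -p smp --mem=57000 job.sh

# or per CPU: 40 hyperthreads x 1425 MB = 57000 MB
sbatch -p parallel --mem-per-cpu=1425 job.sh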

You can, of course, also use the SLURM command sinfo to query all these limits. A helpful alias might use the following options:

sinfo -e -o "%20P %16F %8z %.8m %.11l %18f" -S "+P+m"
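
Such an alias could be defined in your ~/.bashrc, for instance (the alias name is arbitrary):

alias partinfo='sinfo -e -o "%20P %16F %8z %.8m %.11l %18f" -S "+P+m"'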

The output of this command lists our partitions together with

  • information on their nodes (allocated/idle/other/total)
  • CPU specs of these nodes (sockets:cores:threads)
  • size of real memory in megabytes
  • walltime limits for job requests
  • and feature constraints.

At the moment of writing, the output looks like this for MOGON II:

PARTITION            NODES(A/I/O/T)   S:C:T      MEMORY   TIMELIMIT AVAIL_FEATURES    
bigmem               8/5/19/32        2:16:2     354000  5-00:00:00 anyarch,skylake   
bigmem               0/8/12/20        2:10:2     498000  5-00:00:00 anyarch,broadwell 
bigmem               0/0/2/2          2:10:2    1002000  5-00:00:00 anyarch,broadwell 
bigmem               0/2/0/2          2:16:2    1516000  5-00:00:00 anyarch,skylake   
deeplearning         0/1/1/2          2:20:2     490000    18:00:00 anyarch,broadwell 
devel                438/14/140/592   2:10:2      57000     4:00:00 anyarch,broadwell 
devel                473/9/138/620    2:16:2      88500     4:00:00 anyarch,skylake   
devel                99/23/46/168     2:10:2     120000     4:00:00 anyarch,broadwell 
devel                60/16/44/120     2:16:2     177000     4:00:00 anyarch,skylake   
devel                20/5/15/40       2:10:2     246000     4:00:00 anyarch,broadwell 
devel                8/5/19/32        2:16:2     354000     4:00:00 anyarch,skylake   
himster2_exp         1/57/7/65        2:16:2      88500  5-00:00:00 anyarch,skylake   
himster2_interactive 1/1/0/2          2:16:2      88500  5-00:00:00 anyarch,skylake   
himster2_th          272/7/17/296     2:16:2      88500  5-00:00:00 anyarch,skylake   
kph_NT               0/14/2/16        2:10:2     246000  5-00:00:00 anyarch,broadwell 
m2_gpu               20/5/5/30        2:12:2     115500  5-00:00:00 anyarch,broadwell 
m2_gpu-compile       20/5/5/30        2:12:2     115500     1:00:00 anyarch,broadwell 
m2_gputest           2/0/0/2          2:12:2     115500  5-00:00:00 anyarch,broadwell 
parallel             438/14/140/592   2:10:2      57000  5-00:00:00 anyarch,broadwell 
parallel             473/9/138/620    2:16:2      88500  5-00:00:00 anyarch,skylake   
parallel             99/23/46/168     2:10:2     120000  5-00:00:00 anyarch,broadwell 
parallel             60/16/44/120     2:16:2     177000  5-00:00:00 anyarch,skylake   
parallel             20/5/15/40       2:10:2     246000  5-00:00:00 anyarch,broadwell 
parallel             8/5/19/32        2:16:2     354000  5-00:00:00 anyarch,skylake   
smp                  438/14/140/592   2:10:2      57000  5-00:00:00 anyarch,broadwell 
smp                  473/9/138/620    2:16:2      88500  5-00:00:00 anyarch,skylake   
smp                  99/23/46/168     2:10:2     120000  5-00:00:00 anyarch,broadwell 
smp                  60/16/44/120     2:16:2     177000  5-00:00:00 anyarch,skylake   
smp                  20/5/15/40       2:10:2     246000  5-00:00:00 anyarch,broadwell 
smp                  8/5/19/32        2:16:2     354000  5-00:00:00 anyarch,skylake 

and like this for MOGON NHR:

PARTITION            NODES(A/I/O/T)   S:C:T      MEMORY   TIMELIMIT
a100ai               1/2/1/4          2:64:2    1992000  6-00:00:00
a100dl               1/8/2/11         2:64:1    1016000  6-00:00:00
a40                  1/6/0/7          2:64:1    1016000  6-00:00:00
czlab                0/1/0/1          2:64:1    1031828  6-00:00:00
hugemem              0/1/3/4          2:64:1    1992000  6-00:00:00
komet                355/43/34/432    2:64:1     248000  6-00:00:00
largemem             0/19/9/28        2:64:1    1016000  6-00:00:00
longtime             9/0/1/10         2:64:1     248000 12-00:00:00
longtime             10/0/0/10        2:64:1     504000 12-00:00:00
mi250                0/2/0/2          2:64:1    1016000  6-00:00:00
mod                  167/4/5/176      2:64:1     504000  6-00:00:00
parallel             355/43/34/432    2:64:1     248000  6-00:00:00
parallel             167/4/5/176      2:64:1     504000  6-00:00:00
quick                355/43/34/432    2:64:1     248000     8:00:00
smallcpu             355/43/34/432    2:64:1     248000  6-00:00:00
topml                0/1/0/1          2:48:2    1547259  6-00:00:00
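
The AVAIL_FEATURES column in the MOGON II output lists the features that can be requested with the --constraint (-C) option, for instance to restrict a job to Skylake nodes (job.sh is a placeholder):

sbatch -p parallel -C skylake job.sh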

Hidden Partitions

Information on hidden partitions can be viewed by anyone. They are hidden only to avoid cluttering the output of every query: these partitions are “private” to certain projects or groups and of interest only to them.

To display all jobs of a user across all partitions, including hidden ones, supply the -a flag:

squeue -u $USER -a

Likewise, sinfo can be supplemented with -a to gather information on hidden partitions. All other commands work as expected without this flag.
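
For example, to show the state of one of the hidden partitions listed above:

sinfo -a -p himster2_th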