General Notes
On MOGON we differentiate between public partitions (those readily visible with sinfo) and non-public ones. The latter have restricted access, are set to be hidden, and are not described here.
Detailed information on partitions can be retrieved with
scontrol show partition <partition_name>
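For example, to inspect the parallel partition described below:

$ scontrol show partition parallel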
Quality of service (QoS) values can be viewed with
sacctmgr show qos <qos_of_that_partition_name>
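For instance, assuming the QoS is named like its partition (the actual name is shown in the QoS field of the scontrol output above):

$ sacctmgr show qos parallel format=Name,MaxWall,MaxTRESPerUser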
Information regarding jobs running or pending within a partition can be obtained by
squeue -p <partition_name>,
while a status overview is given by
sinfo -p <partition_name>.
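For example, for the parallel partition:

$ squeue -p parallel
$ sinfo -p parallel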
Submitting to Partitions
In SLURM a partition can be selected in your jobscript by
#SBATCH -p <partitionname>
or on the command line: $ sbatch -p <partitionname> … <jobscript>
Several partitions can be selected with
#SBATCH -p <partition1>,<partition2>
This can be useful for users with "private" hardware: it allows a job to be scheduled onto general-purpose hardware when the group-owned hardware is occupied.
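A minimal jobscript sketch putting these directives together; the account name, resource values and program are placeholders:

#!/bin/bash
#SBATCH -p <partition1>,<partition2>
#SBATCH -A <your_account>
#SBATCH -n 4
#SBATCH -t 01:00:00
#SBATCH --mem-per-cpu=300M

srun ./my_program

SLURM will start the job in whichever of the listed partitions can run it first.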
MOGON II
Only ~5% of the nodes are available for small jobs (n ≪ 40). Each account has a GrpTRESRunLimit. Check yours using
sacctmgr -s list account <your_account> format=account,GrpTRESRunMin
To list your accounts, you can use
sacctmgr -n -s list user $USER format=Account%20 | grep -v none
The default is cpu=22982400 (CPU-minutes), which is the equivalent of using 700 nodes for 12 hours in total.
Partition | Nodes | Max wall time | RAM | Interconnect | Accelerators | Comment |
---|---|---|---|---|---|---|
parallel | z-nodes, x-nodes | 5 days | 64GB, 96GB, 128GB, 192GB, 256GB nodes | Intel Omni-Path | - | jobs using n × 40 or n × 64 cores |
smp | z-nodes, x-nodes | 5 days | up to 5% of the 64GB, 96GB, 128GB, 192GB, 256GB nodes | Intel Omni-Path | - | jobs using n ≪ 40 or n ≪ 64; max. 3,000 running jobs per user |
bigmem | z-nodes, x-nodes | 5 days | 384GB, 512GB, 1TB, 1.5TB nodes | Intel Omni-Path | - | for jobs requiring 256GB of memory or more |
devel | z-nodes, x-nodes | 4 hours | 10 of the 64GB, 96GB, 128GB nodes | Intel Omni-Path | - | max. 2 jobs per user, max. 320 CPUs in total |
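As a sketch, a parallel job filling one 40-core broadwell node completely (account and executable are placeholders):

#!/bin/bash
#SBATCH -p parallel
#SBATCH -A <your_account>
#SBATCH -N 1
#SBATCH --ntasks-per-node=40
#SBATCH -t 05:00:00

srun ./my_mpi_program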
Partitions for Applications using Accelerators
Partition | Nodes | Max wall time | Interconnect | Accelerators | Comment |
---|---|---|---|---|---|
m2_gpu | s-nodes | 5 days | InfiniBand | 6 GeForce GTX 1080 Ti per node | - |
deeplearning | dgx-nodes | 12 hours | InfiniBand | 8 Tesla V100-SXM2 per node | for access, get in touch with us |
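A sketch for a single-GPU job in the m2_gpu partition, assuming the cards are requested as a generic resource (gres) named gpu; account and executable are placeholders:

#!/bin/bash
#SBATCH -p m2_gpu
#SBATCH -A <your_account>
#SBATCH -n 1
#SBATCH --gres=gpu:1
#SBATCH -t 01:00:00

srun ./my_gpu_program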
Memory limits in the parallel partition
For the parallel partition we find:
Memory [MiB] | No. of Nodes | Type |
---|---|---|
57000 | 584 | broadwell |
88500 | 576 | skylake |
120000 | 120 | broadwell |
177000 | 120 | skylake |
246000 | 40 | broadwell |
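For example, a single-node job that requests at most 57000 MiB remains eligible for every node type in the parallel partition:

#SBATCH -p parallel
#SBATCH --mem=57000M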
Likewise for the bigmem partition:
Memory [MiB] | No. of Nodes | Type |
---|---|---|
354000 | 32 | skylake |
498000 | 20 | broadwell |
1002000 | 2 | broadwell |
1516000 | 2 | skylake |
Private Partitions
Partition | Nodes | Max wall time | RAM | Interconnect | Accelerators | Comment |
---|---|---|---|---|---|---|
himster2_exp | x0753-x0794, x2001-x2023 | 5 days | 96GB | Intel Omni-Path | - | - |
himster2_th | x2024-x2320 | 5 days | 96GB | Intel Omni-Path | - | - |
Hidden Partitions
Information on hidden partitions can be viewed by anyone. These partitions are set to be hidden to avoid cluttering the output of every poll: they are "private" to certain projects or groups and of interest only to those groups.
To list all jobs of a user across all partitions, supply the -a flag:
$ squeue -u $USER -a
Likewise, sinfo can be supplemented with -a to gather information. All other commands work as expected without this flag.
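For example, to see the state of one of the private partitions listed above (assuming you have access to it):

$ sinfo -a -p himster2_exp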
Out of Service
Under the following link you will find clusters that have been taken out of service for various reasons: