Partitions
How to select Partitions and work with batch queues
General Notes
On MOGON we differentiate between public partitions (those readily visible with `sinfo`) and non-public ones. The latter have restricted access, are set to be hidden, and will not be described here.
Detailed information on partitions can be retrieved with
scontrol show partition <partition_name>
The quality of service (QoS) values of a partition can be viewed with
sacctmgr show qos <qos_name>
Information regarding jobs running or pending within a partition can be obtained by
squeue -p <partition_name>
while a status overview is given by
sinfo -p <partition_name>
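For example, to inspect the `parallel` partition described below:

```bash
scontrol show partition parallel   # limits, defaults, node list
squeue -p parallel                 # running and pending jobs
sinfo -p parallel                  # node states
```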
Submitting to Partitions
In SLURM a partition can be selected in your jobscript with
#SBATCH -p <partition_name>
or on the command line: $ sbatch -p <partition_name> ... <jobscript>
Several partitions can be selected with
#SBATCH -p <partition1>,<partition2>
This can be useful for users with access to group-owned hardware: it allows a job to be scheduled onto general-purpose hardware when the group-owned nodes are occupied.
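As a sketch, a jobscript header combining these options might look as follows; the job name, task count, runtime, and program are placeholders to adapt:

```bash
#!/bin/bash
#SBATCH -J myjob                   # job name (placeholder)
#SBATCH -p himster2_exp,parallel   # group-owned hardware or the public fallback
#SBATCH -n 40                      # number of tasks (placeholder)
#SBATCH -t 02:00:00                # requested runtime, see 'Default Runtime' below

srun ./my_program                  # placeholder executable
```

When several partitions are listed, SLURM starts the job in whichever one offers the earliest initiation.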
MOGON II
- Only ~5% of the nodes are available for small jobs (n << 40).
- Each account has a `GrpTRESRunLimit`. Check it with `sacctmgr -s list account <your_account> format=Account,GrpTRESRunMin`; to get your accounts, use `sacctmgr -n -s list user $USER format=Account%20 | grep -v none`. The default is `cpu=22982400` CPU-minutes, which is the equivalent of using 700 nodes for 12 hours in total.
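For example, both checks combined (`<your_account>` is a placeholder for one of your account names):

```bash
# list the accounts you belong to
sacctmgr -n -s list user $USER format=Account%20 | grep -v none
# show the running limit of one of them
sacctmgr -s list account <your_account> format=Account,GrpTRESRunMin
```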
Partition | Nodes | Limit | RAM | Interconnect | Comment |
---|---|---|---|---|---|
parallel | z-nodes, x-nodes | 5 days | 64 GB, 96 GB, 128 GB, 192 GB, 256 GB | Intel Omni-Path | jobs using n*40 or n*64 cores |
smp | z-nodes, x-nodes | 5 days | 64 GB, 96 GB, 128 GB, 192 GB, 256 GB | Intel Omni-Path | jobs using n << 40 or n << 64 cores; max. 3,000 running jobs per user |
bigmem | z-nodes, x-nodes | 5 days | 384 GB, 512 GB, 1 TB, 1.5 TB | Intel Omni-Path | for jobs needing 256 GB of memory or more |
devel | z-nodes, x-nodes | 4 hours | 64 GB, 96 GB, 128 GB | Intel Omni-Path | max. 2 jobs per user, max. 320 CPUs in total |
Default Runtime
Most partitions have a default runtime of 10 minutes, after which jobs are automatically killed unless more time is requested using the `-t` flag. The default runtime of a partition can be checked with
scontrol show partition <partition>
The Limit is the maximum requestable runtime per job; computations that need more time have to be split up and continued in a separate job.
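For example, to request two hours instead of the default (the time format is `[days-]hours:minutes:seconds`):

```bash
#SBATCH -t 02:00:00
```

or, equivalently, on the command line: $ sbatch -t 02:00:00 ... <jobscript>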
Partitions for Applications using Accelerators
Partition | Nodes | Limit | Interconnect | Accelerators | Comment |
---|---|---|---|---|---|
deeplearning | dgx-nodes | 12 hours | InfiniBand | 8 Tesla V100-SXM2 per node | for access get in touch with us |
m2_gpu | s-nodes | 5 days | InfiniBand | 6 GeForce GTX 1080 Ti per node | - |
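GPU jobs additionally need to request the accelerators as a generic resource. A minimal sketch, assuming the GRES is named `gpu` (the actual name can be checked with `sinfo -p m2_gpu -o "%N %G"`):

```bash
#!/bin/bash
#SBATCH -p m2_gpu          # GPU partition
#SBATCH --gres=gpu:2       # request 2 of the 6 GPUs of a node
#SBATCH -t 01:00:00

srun ./my_gpu_program      # placeholder executable
```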
Memory limits
For the `parallel` partition:
Memory [MiB] | No. of Nodes (if all nodes are functional) | Type |
---|---|---|
57000 | 584 | broadwell |
88500 | 576 | skylake |
120000 | 120 | broadwell |
177000 | 120 | skylake |
246000 | 40 | broadwell |
For the `bigmem` partition:
Memory [MiB] | No. of Nodes (if all nodes are functional) | Type |
---|---|---|
354000 | 32 | skylake |
498000 | 20 | broadwell |
1002000 | 2 | broadwell |
1516000 | 2 | skylake |
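These values are the usable memory per node; a per-node request above a type's limit excludes that type from scheduling. For example, to restrict a job to the larger nodes of the `parallel` partition (a sketch; `--mem` takes MiB with the `M` suffix):

```bash
#SBATCH -p parallel
#SBATCH --mem=177000M      # only the 177000 MiB and 246000 MiB node types qualify
```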
Private Partitions
Partition | Nodes | Limit | RAM | Interconnect | Accelerators | Comment |
---|---|---|---|---|---|---|
himster2_exp | x0753-x0794, x2001-x2023 | 5 days | 96 GB | Intel Omni-Path | - | - |
himster2_th | x2024-x2320 | 5 days | 96 GB | Intel Omni-Path | - | - |
Hidden Partitions
Information on hidden partitions can be viewed by anyone. These partitions are set to be hidden merely to avoid cluttering the default output: they are “private” to certain projects or groups and of interest to those groups only.
To see all jobs of a user across all partitions, including hidden ones, supply the `-a` flag:
$ squeue -u $USER -a
Likewise, `sinfo` can be supplemented with `-a` to gather information. All other commands work as expected without this flag.