Partitions
How to select Partitions and work with batch queues
General Notes
On MOGON we differentiate between public partitions (those readily visible with `sinfo`) and non-public ones. The latter have restricted access, are set to be hidden, and will not be described here.
Detailed information on partitions can be retrieved with
scontrol show partition <partition_name>
The quality of service (QoS) values of a partition can be viewed with
sacctmgr show qos <qos_name>
Information regarding jobs running or pending within a partition can be obtained by
squeue -p <partition_name>
while a status overview is given by
sinfo -p <partition_name>
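For example, to inspect the `parallel` partition described below:

```bash
scontrol show partition parallel   # limits, defaults, node list
squeue -p parallel                 # running and pending jobs
sinfo -p parallel                  # node states
```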
Submitting to Partitions
In SLURM a partition can be selected in your jobscript with
#SBATCH -p <partition_name>
or on the command line: $ sbatch -p <partition_name> ... <jobscript>
Several partitions can be selected with
#SBATCH -p <partition1>,<partition2>
This can be useful for users with access to group-owned hardware: it allows a job to be scheduled onto general-purpose hardware when the group-owned nodes are occupied.
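As a sketch, a jobscript header combining these options might look as follows; the job name, task count, runtime, and program are placeholders to adapt:

```bash
#!/bin/bash
#SBATCH -J myjob                   # job name (placeholder)
#SBATCH -p himster2_exp,parallel   # group-owned hardware or the public fallback
#SBATCH -n 40                      # number of tasks (placeholder)
#SBATCH -t 02:00:00                # requested runtime, see 'Default Runtime' below

srun ./my_program                  # placeholder executable
```

When several partitions are listed, SLURM starts the job in whichever one offers the earliest initiation.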
MOGON II
- Only ~5% of the nodes are available for small jobs (n << 40).
- Each account has a `GrpTRESRunLimit`. Check it with `sacctmgr -s list account <your_account> format=Account,GrpTRESRunMin`; to get your accounts, use `sacctmgr -n -s list user $USER format=Account%20 | grep -v none`. The default is `cpu=22982400` CPU-minutes, which is the equivalent of using 700 nodes for 12 hours in total.
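For example, both checks combined (`<your_account>` is a placeholder for one of your account names):

```bash
# list the accounts you belong to
sacctmgr -n -s list user $USER format=Account%20 | grep -v none
# show the running limit of one of them
sacctmgr -s list account <your_account> format=Account,GrpTRESRunMin
```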
Partition | Nodes | Limit | RAM | Interconnect | Comment |
---|---|---|---|---|---|
parallel | z-nodes, x-nodes | 5 days | 64 GB, 96 GB, 128 GB, 192 GB, 256 GB | Intel Omni-Path | jobs using n*40 or n*64 cores |
smp | z-nodes, x-nodes | 5 days | 64 GB, 96 GB, 128 GB, 192 GB, 256 GB | Intel Omni-Path | jobs using n << 40 or n << 64 cores; max. 3,000 running jobs per user |
bigmem | z-nodes, x-nodes | 5 days | 384 GB, 512 GB, 1 TB, 1.5 TB | Intel Omni-Path | for jobs needing 256 GB of memory or more |
devel | z-nodes, x-nodes | 4 hours | 64 GB, 96 GB, 128 GB | Intel Omni-Path | max. 2 jobs per user, max. 320 CPUs in total |
Default Runtime
Most partitions have a default runtime of 10 minutes, after which jobs are automatically killed unless more time is requested using the `-t` flag. The default runtime of a partition can be checked with
scontrol show partition <partition>
The Limit is the maximum requestable runtime per job; computations that need more time have to be split up and continued in a separate job.
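For example, to request two hours instead of the default (the time format is `[days-]hours:minutes:seconds`):

```bash
#SBATCH -t 02:00:00
```

or, equivalently, on the command line: $ sbatch -t 02:00:00 ... <jobscript>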
Partitions for Applications using Accelerators
Partition | Nodes | Limit | Interconnect | Accelerators | Comment |
---|---|---|---|---|---|
deeplearning | dgx-nodes | 12 hours | InfiniBand | 8 Tesla V100-SXM2 per node | for access get in touch with us |
m2_gpu | s-nodes | 5 days | InfiniBand | 6 GeForce GTX 1080 Ti per node | - |
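GPU jobs additionally need to request the accelerators as a generic resource. A minimal sketch, assuming the GRES is named `gpu` (the actual name can be checked with `sinfo -p m2_gpu -o "%N %G"`):

```bash
#!/bin/bash
#SBATCH -p m2_gpu          # GPU partition
#SBATCH --gres=gpu:2       # request 2 of the 6 GPUs of a node
#SBATCH -t 01:00:00

srun ./my_gpu_program      # placeholder executable
```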
Memory limits
For the `parallel` partition:
Memory [MiB] | No. of Nodes (if all nodes are functional) | Type |
---|---|---|
57000 | 584 | broadwell |
88500 | 576 | skylake |
120000 | 120 | broadwell |
177000 | 120 | skylake |
246000 | 40 | broadwell |
For the `bigmem` partition:
Memory [MiB] | No. of Nodes (if all nodes are functional) | Type |
---|---|---|
354000 | 32 | skylake |
498000 | 20 | broadwell |
1002000 | 2 | broadwell |
1516000 | 2 | skylake |
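These values are the usable memory per node; a per-node request above a type's limit excludes that type from scheduling. For example, to restrict a job to the larger nodes of the `parallel` partition (a sketch; `--mem` takes MiB with the `M` suffix):

```bash
#SBATCH -p parallel
#SBATCH --mem=177000M      # only the 177000 MiB and 246000 MiB node types qualify
```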
Private Partitions
Partition | Nodes | Limit | RAM | Interconnect | Accelerators | Comment |
---|---|---|---|---|---|---|
himster2_exp | x0753-x0794, x2001-x2023 | 5 days | 96 GB | Intel Omni-Path | - | - |
himster2_th | x2024-x2320 | 5 days | 96 GB | Intel Omni-Path | - | - |
Hidden Partitions
Information on hidden partitions can be viewed by anyone. These partitions are set to be hidden merely to avoid cluttering the default output: they are “private” to certain projects or groups and of interest to those groups only.
To see all jobs of a user across all partitions, including hidden ones, supply the `-a` flag:
$ squeue -u $USER -a
Likewise, `sinfo` can be supplemented with `-a` to gather information. All other commands work as expected without this flag.