Memory Reservation

How to reserve memory on MOGON

SLURM imposes a memory limit on each job. By default, it is deliberately small: 300 MB per CPU (i.e. per task) or 115500 MB per node for full-node jobs. If your job uses more than that without requesting it, the job will fail with an error stating that it exceeded the job memory limit. To set a larger limit, add to your job submission:

#SBATCH --mem X

where X is the maximum amount of memory your job will use per node, in MB. Different units can be specified using one of the suffixes [K|M|G|T]. The larger your working data set, the larger this needs to be, but the smaller the number, the easier it is for the scheduler to find a place to run your job. To determine an appropriate value, start relatively large and then use sacct to look at how much memory your job is actually using or has used:

sacct -o JobID%20,ReqMem,MaxRSS,AveRSS,Elapsed,CPUTime -j <JOBID>

where JOBID is the ID of the job you are interested in. This sample command prints the output for all job steps and compares the used CPU time with the actual elapsed time, which is sometimes useful for performance hints. If your job completed long ago, you may have to tell sacct to look further back in time by adding a start time with -S YYYY-MM-DD. Note that for parallel jobs spanning multiple nodes, this is the maximum memory used on any one node; if you are not enforcing an even distribution of tasks per node (e.g. with --ntasks-per-node), the same job could report very different values when run at different times.
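
For example, to look up a job that finished a while ago (the job ID and start date below are made up), the call could look like:

sacct -S 2024-01-01 -o JobID%20,ReqMem,MaxRSS,AveRSS,Elapsed,CPUTime -j 1234567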

The ReqMem value shows the memory requested at submission, in MB, suffixed with either c (per CPU) or n (per node).
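
For instance, using the defaults mentioned above as illustration, a ReqMem of 300Mc would mean 300 MB were requested per CPU, while 115500Mn would mean 115500 MB were requested per node.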

The MaxRSS and AveRSS values are reported in KB, so divide them by 1024 to get a rough idea of what to use with --mem (and set it to something a little larger than that, since you are defining a hard upper limit).
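
As a sketch with invented numbers: if sacct reports a MaxRSS of 3500000K, that is about 3500000 / 1024 ≈ 3418 MB, so requesting a bit more per node leaves a safety margin:

#SBATCH --mem=4G    # observed peak was roughly 3418 MB, so request a little more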

For very short jobs, the sampled values may not match the memory that was actually used. Jobs that abort quickly obviously do not yield the statistics one needs.

Run sacct -e to get a full list of the available fields and see man sacct for more detailed information.

The CPU time divided by the number of CPUs used should roughly equal the elapsed run time. If it does not, this is an indication of poor parallelisation.
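
As a rough illustration with invented numbers: a job running on 4 CPUs that reports a CPUTime of 08:00:00 should show an Elapsed time of roughly 02:00:00 (8 h / 4 = 2 h); an Elapsed time close to 08:00:00 instead would suggest that most of the time only one CPU was actually busy.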

On MOGON II we use a job-submit plugin to manage the memory reservation in case the user did not specify one. If the job goes to the broadwell nodes, the default memory per node is 57000 MB; if it goes to the skylake nodes, the default memory per node is set to 88500 MB. For the bigmem partition the logic works the other way around: the requested memory determines the node type. If you request more than 1 TB of memory per node without specifying the constraint broadwell or skylake, you will get skylake, since those are the only nodes that support up to 1.5 TB. For anything below 1 TB you will get broadwell unless you specify otherwise.
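
If you want to avoid relying on these defaults, you can state both the node type and the memory explicitly in your job script; a minimal sketch (the memory value is made up) might be:

#SBATCH --constraint=skylake    # explicitly request skylake nodes
#SBATCH --mem=1200G             # more than 1 TB per node; only skylake supports up to 1.5 TB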