Slurm Options

With Slurm there are three commands to reserve resource allocations and to submit jobs:

  • salloc: to reserve allocations for interactive tasks
  • srun: to run so-called job steps or small interactive jobs
  • sbatch: to submit jobs to a queue for processing

Extensive documentation on the salloc, srun and sbatch commands can be found in the Slurm documentation (salloc, srun, sbatch) or in the man pages for each command, e.g. $ man sbatch.

The most commonly used parameters for these commands are listed below. Detailed information on important options can also be found in separate articles.

Parameter List

-A, --account
    The project account that is billed for your job. Mandatory. Looking for your account?
    For example:
        -A m2_zdvhpc
        --account=hpckurs

-p, --partition
    The partition your job should run in. Mandatory. Look up available partitions.
    For example:
        -p parallel
        --partition=smp

-n, --ntasks
    Controls the number of tasks to be created for the job (= cores, if no advanced topology is given).
    For example:
        -n 4

-N, --nodes
    The number of nodes you need. For example:
        --nodes=2

-t, --time
    Sets the runtime limit of your job (within the partition constraints). For example, to specify 1 hour:
        -t 01:00:00
    More details on the format here.

-J, --job-name
    Sets an arbitrary name for your job that is used when listing jobs. Defaults to the script name.
    For example:
        --job-name=myjob

--ntasks-per-node
    Controls the maximum number of tasks per allocated node.

-c, --cpus-per-task
    Number of CPUs per task.

-C, --constraint
    Which processor architecture to use. For example:
        -C broadwell
        --constraint=skylake
    Read more about this constraint here.

--mem
    The amount of memory per node. Different units can be specified using the suffixes [K|M|G|T] (default is M for megabytes). See the memory reservation page for details and hints, particularly with respect to partition default memory settings.

--mem-per-cpu
    The amount of memory per CPU. See above for the units.

-o, --output
    Directs stdout and stderr into one file. (Slurm writes buffered; shell-based solutions do not write buffered.)
        -o <filename>.log
        -e <filename>.err
    Specifying both directs stdout to the log file and stderr to the error log file.

-i <filename>
    Instructs Slurm to connect the batch script's standard input directly to the specified file.

In the output and error filenames, you may use one or more replacement symbols: a percent sign “%” followed by a letter (e.g. %j). For example, job%4j.out yields job0128.out.

%A    Job array's master job allocation number.
%a    Job array ID (index) number.
%J    jobid.stepid of the running job (e.g. “128.0”).
%j    jobid of the running job.
%s    stepid of the running job.
%u    User name.
%x    Job name.
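
For example, the job name and job id can be combined into the log file names from within a jobscript:

#SBATCH -o %x.%j.out
#SBATCH -e %x.%j.err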

Other important parameters and features on MOGON are described in separate articles.

Once a job has been submitted, you can get information on it or control it with the commands described further below.

CPU Architecture

On MOGON II a third important parameter is available:

You may select the CPU type to be either skylake or broadwell for the Skylake and Broadwell nodes, respectively. If the architecture is not relevant for your application, select anyarch.

This can be set with:

  • -C <selection list> or
  • --constraint=<selection list>

to sbatch (on the command line or within a jobscript).
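
For example, on the command line:

sbatch -C skylake myjobscript

Or within a jobscript:

#SBATCH --constraint=broadwell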

The defaults are:

If nothing is specified, you will get broadwell, except for the himster2 partition, where it is going to be skylake. On the bigmem partition the default depends on your requested memory per node.

You can get a list of features and resources of each node with:

sinfo -o "%32N %5c %10m %20f %15G"

You will get an output similar to:

NODELIST                         CPUS  MEMORY     AVAIL_FEATURES       GRES
s[0020,0023],z[0001-0838]        40+   57000+     anyarch,broadwell    (null)
x[0001-0814,0901-0902,2001-2320] 64    88500+     anyarch,skylake      (null)
s[0027-0030]                     48    115500     anyarch,broadwell    gpu:gtx1080ti:6
s[0001-0019,0021-0022,0024-0026] 48    115500     anyarch,broadwell    gpu:gtx1080ti:6
dgx01                            80    490000     anyarch,broadwell    gpu:V100_16g:8
dgx02                            80    490000     anyarch,broadwell    gpu:V100_32g:8

Specifying Runtime

Requesting runtime is straightforward: The -t or --time flag can be used in srun/salloc and sbatch alike:

srun --time <time reservation>

Or within a script

#SBATCH -t <time reservation>

where <time reservation> can be any of the acceptable time formats:

  • minutes,
  • minutes:seconds,
  • hours:minutes:seconds,
  • days-hours,
  • days-hours:minutes and
  • days-hours:minutes:seconds.

Time resolution is one minute and second values are rounded up to the next minute. A time limit of zero requests that no time limit is imposed, meaning that the maximum runtime of the partition will be used.
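
For example, the following values are all valid time requests:

-t 30            (30 minutes)
-t 02:30:00      (2 hours and 30 minutes)
-t 2-12          (2 days and 12 hours)
-t 2-12:00:00    (2 days and 12 hours, equivalent to the previous line)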

Default Runtime

Most partitions have a default runtime of 10 minutes, after which jobs are automatically killed unless more time was requested using the -t flag. The default runtime for a partition can be checked with

scontrol show partition <partition>

The max wall time is the maximum requestable runtime within a partition. Jobs that need more time have to be split up and continued in a separate job.
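
For example, for the parallel partition used above, a possible way to filter the output for the relevant fields (DefaultTime and MaxTime):

scontrol show partition parallel | grep -Eo '(DefaultTime|MaxTime)=[^ ]*'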

Receiving mail notifications

Specify which types of mail notifications you want to receive with:

--mail-type=<TYPE>

<TYPE> can be any of:

  • NONE,
  • BEGIN,
  • END,
  • FAIL,
  • REQUEUE,
  • STAGE_OUT (burst buffer stage out and teardown completed),
  • INVALID_DEPEND (dependency never satisfied) or
  • ALL (equivalent to BEGIN, END, FAIL, INVALID_DEPEND, REQUEUE, and STAGE_OUT)

Specify the receiving mail address using:

--mail-user=<username>@uni-mainz.de

The default value is the submitting user. We highly recommend using an internal address rather than relying on a third-party service.
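
For example, within a jobscript (with <username> as a placeholder, as above):

#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=<username>@uni-mainz.de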

Signals

Slurm does not send signals unless requested. However, there are situations when you may want to trigger a signal (e.g. in some IO workflows). You can request a specific signal with --signal, passed either to srun or to sbatch (on the command line or from within a script). The flag is used like --signal=<sig_num>[@<sig_time>]: when a job is within sig_time seconds of its end time, the signal sig_num is sent. If sig_num is specified without a sig_time, the default time will be 60 seconds. Due to the resolution of event handling by Slurm, the signal may be sent up to 60 seconds earlier than specified.

An example would be:

sbatch --signal=SIGUSR2@600 ...

Or within a script:

#SBATCH --signal=SIGUSR2@600

Here, the signal SIGUSR2 is sent to the application ten minutes before the job hits its walltime. Note once more that the Slurm documentation states that there is an uncertainty of up to one minute.
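
A minimal sketch of a jobscript that handles the signal itself: according to the sbatch documentation, the B: prefix is needed for the signal to be delivered to the batch shell rather than only to the job steps. The handler and ./my_app are placeholders for your own checkpointing logic and executable:

#!/bin/bash
#SBATCH --signal=B:SIGUSR2@600

# Hypothetical handler: a real one would e.g. trigger a checkpoint or
# forward the signal to the application before the walltime is reached.
checkpoint_handler() {
    echo "SIGUSR2 received, walltime is near" >&2
}
trap checkpoint_handler SIGUSR2

# Run the payload in the background so the batch shell can handle the signal.
srun ./my_app &
wait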

Cancel Jobs

Use the

scancel <jobid>

command with the jobid of the job you want to cancel.

In case you want to cancel all of your jobs, use -u or --user=:

scancel -u <username>

You can also restrict the operation to jobs in a certain state with -t or --state=:

scancel -t <jobstate>

where <jobstate> can be:

  • PENDING
  • RUNNING
  • SUSPENDED
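
The options can be combined, for example to cancel only your own pending jobs:

scancel -u <username> -t PENDING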

Using sbatch

You have to prepare a job script to submit jobs using sbatch. You can pass options to sbatch directly on the command-line or specify them in the job script file.
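
A minimal sketch of such a jobscript, reusing the example account, partition and options from the parameter list above; ./my_app is a placeholder for your executable:

#!/bin/bash
# account and partition (mandatory), see the parameter list above
#SBATCH -A m2_zdvhpc
#SBATCH -p parallel
# resources and runtime
#SBATCH -n 4
#SBATCH -t 01:00:00
# job name and log file (using the replacement symbols %x and %j)
#SBATCH -J myjob
#SBATCH -o %x.%j.out

srun ./my_app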

To submit your job use:

sbatch myjobscript

When does my Job start

A job is started either when it has the highest priority and the required resources are available, or when it has the opportunity to be backfilled. The following command gives an estimate of the time and date when your job is supposed to start, but note that the estimate is based on the current workload:

squeue --start

Slurm cannot anticipate that higher-priority jobs will be submitted after yours, that machine downtime will leave fewer resources for jobs, or that job crashes will let large jobs start earlier than expected, causing smaller jobs scheduled for backfilling to lose that backfill opportunity.
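
The estimate can be limited to your own jobs with the -u option of squeue:

squeue --start -u <username>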

Slurm-based Job Monitoring

For running jobs, you can retrieve information on memory usage with sstat. Detailed information on exactly which slots your job is assigned to can be retrieved with the following command:

scontrol show -d job <jobid>
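
A sketch of such an sstat call for a running job (the exact set of available fields may vary between Slurm versions):

sstat -j <jobid> --format=JobID,MaxRSS,MaxVMSize,AveCPU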

For completed jobs, this information is provided by sacct, e.g.:

sacct --format JobID,Jobname,NTasks,Nodelist,MaxRSS,MaxVMSize,AveRSS,AveVMSize

For completed jobs, you can also use seff, which reports on the efficiency of a job’s CPU and memory utilisation.

seff <jobid>