Job Monitoring

Manage and Monitor Jobs using SLURM

Information on Jobs

List job                                   Command
own active                                 squeue -u $USER
in <partition>                             squeue -u $USER -p <partition>
show priority                              sprio -l
list running                               squeue -u $USER -t RUNNING
list pending                               squeue -u $USER -t PENDING
show details                               scontrol show jobid -dd <jobid>
status info                                sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j <jobid> --allsteps
statistics on completed (per job)          sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed
statistics on completed (per username)     sacct -u <username> --format=JobID,JobName,MaxRSS,Elapsed
summary statistics on completed job        seff <jobid>
Completed jobs can only be seen with sacct. Note that without the -S flag (the start date to search from), only recent jobs are displayed. For example, -S 0901 looks up jobs from September 1st onwards. See the man page for more information on time-related lookup options.
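
As a quick sketch (the job ID 1234567 below is just a placeholder), querying your own completed jobs since September 1st and summarising a single one of them could look like this:

    # completed jobs of the current user since September 1st
    sacct -u $USER -S 0901 --format=JobID,JobName,MaxRSS,Elapsed

    # efficiency summary for one finished job
    seff 1234567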

Controlling Jobs

Job operation                Command
cancel one                   scancel <jobid>
cancel all                   scancel -u <username>
cancel all your pending      scancel -u $USER -t PENDING
cancel one or more by name   scancel --name <myJobName>
pause one                    scontrol hold <jobid>
resume one                   scontrol resume <jobid>
requeue one                  scontrol requeue <jobid>
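
As a sketch of a typical hold/release cycle (the job ID 1234567 is a placeholder): scontrol release undoes a hold, while scontrol resume continues a job that was suspended.

    # stop a pending job from being scheduled
    scontrol hold 1234567

    # let it be scheduled again
    scontrol release 1234567

    # get rid of all of your own pending jobs in one go
    scancel -u $USER -t PENDING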

Modifying Pending Jobs

Sometimes squeue --start indicates a faulty requirement specification, e.g. BadConstraints. In this case you can figure out the mismatch with scontrol show job <jobid> (which may require some experience). Wrong requirements can be corrected as follows:

To correct a job's               Command
memory requirement (per node)    scontrol update job <jobid> MinMemoryNode=<mem in MB>
memory requirement (per CPU)     scontrol update job <jobid> MinMemoryCPU=<mem in MB>
number of requested CPUs         scontrol update job <jobid> NumCPUs=<number>
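
A possible repair workflow for a pending job (the job ID 1234567 and the 16000 MB value are placeholders) is to inspect the job first and then shrink the offending request:

    # inspect the full requirement specification of the pending job
    scontrol show job 1234567

    # lower the requested memory per node to 16000 MB
    scontrol update job 1234567 MinMemoryNode=16000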

For more information see man scontrol.

Job State Codes

Status       Code   Description
COMPLETED    CD     The job has completed successfully.
COMPLETING   CG     The job is finishing but some processes are still active.
FAILED       F      The job terminated with a non-zero exit code and failed to execute.
PENDING      PD     The job is waiting for resource allocation. It will eventually run.
PREEMPTED    PR     The job was terminated because of preemption by another job.
RUNNING      R      The job is currently allocated to a node and is running.
SUSPENDED    S      A running job has been stopped with its cores released to other jobs.
STOPPED      ST     A running job has been stopped with its cores retained.

Pending Reasons

So, why does my job not start? SLURM lists a reason for every pending job (those labelled PD in squeue output; a formatted squeue call that shows the reason column is sketched after the table). The more frequent reasons are:

Reason                         Brief Explanation
Priority                       At first, every job gets this reason. If it is not scheduled for a while (> several minutes), the job simply lacks the priority to start.
AssocGrpCPURunMinutesLimit     The partition-associated quality of service (CPU time) is exhausted for the user/project account in question. This budget recovers over time.
QOSMaxCpuPerNode               The job may violate the maximum number of CPUs allowed per node in the chosen partition.
QOSMaxJobsPerUserLimit         For certain partitions the number of running jobs per user is limited.
QOSMaxJobsPerAccountLimit      For certain partitions the number of running jobs per account is limited.
QOSGrpGRESRunMinutes           For certain partitions the generic resources (e.g. GPUs) are limited. See GPU Queues.
QOSGrpMemLimit                 The requested partition may only take a limited fraction of the cluster's memory and this limit has been reached: jobs need to end before new ones may start.
QOSMinMemory                   The job does not request enough memory for the requested partition.
QOSGrpCpuLimit                 The requested partition may only take a limited fraction of the cluster's CPUs and this limit has been reached: jobs need to end before new ones may start.
Resources                      The job is eligible to run but the resources are not available at this time. This usually just means that your job will start once nodes finish their current jobs.
ReqNodeNotAvail                No node with the required resources is currently available. SLURM lists all unavailable nodes, which can be confusing. This reason is similar to Resources: the job has to wait for a resource to be released.
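
To see which reason SLURM currently assigns to your pending jobs, squeue's output format can include the %R field; the exact format string below is only an example:

    # pending jobs of the current user, with the reason in the last column
    squeue -u $USER -t PENDING -o "%.18i %.9P %.25j %.8T %.10M %R"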

In addition, there are limits on the number of jobs a group (a.k.a. account) may run at a given time. More information on the partitions can be found on their respective wiki pages.