

Information on Jobs

List job(s) …                            Command
for you (or a different user)            squeue -u $USER
in <partition>                           squeue -u $USER -p <partition>
priority                                 sprio -l
running                                  squeue -u $USER -t RUNNING
pending                                  squeue -u $USER -t PENDING
details                                  scontrol show jobid -dd <jobid>
status info                              sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j <jobid> --allsteps
statistics on completed (per job)        sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed
statistics on completed (per username)   sacct -u <username> --format=JobID,JobName,MaxRSS,Elapsed

Note that completed jobs can only be seen with sacct.
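
For example, a quick check of your own pending jobs and the memory footprint of a finished job might look like this (the job id 12345 is a placeholder):

  # your currently pending jobs
  squeue -u $USER -t PENDING
  # peak memory and run time of a completed job
  sacct -j 12345 --format=JobID,JobName,MaxRSS,Elapsed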

Controlling Jobs

To … job(s)                              Command
cancel one                               scancel <jobid>
cancel all of a user                     scancel -u <username>
cancel all pending of a user             scancel -t PENDING -u <username>
cancel one or more by name               scancel --name <myJobName>
hold (pause) one                         scontrol hold <jobid>
release a held one                       scontrol release <jobid>
requeue one                              scontrol requeue <jobid>
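
As a concrete sketch (the job id 12345 is a placeholder):

  # keep job 12345 from being scheduled while you inspect it
  scontrol hold 12345
  # allow it to be scheduled again
  scontrol release 12345
  # cancel all of your own pending jobs
  scancel -t PENDING -u $USER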

Modifying Pending Jobs

Sometimes squeue --start indicates a wrong requirement specification, e.g. BadConstraints. In this case the mismatch can be tracked down with scontrol show job <jobid> (which may require some experience). Wrong requirements can be corrected as follows:

To correct a job's …                     Command
memory requirement (per node)            scontrol update job <jobid> MinMemoryNode=<mem in MB>
memory requirement (per CPU)             scontrol update job <jobid> MinMemoryCPU=<mem in MB>
number of requested CPUs                 scontrol update job <jobid> NumCPUs=<number>

For more information see man scontrol.
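
A typical sequence to track down and correct a bad request might be (the job id 12345 and the memory value are placeholders):

  # when would the job start, and why is it pending?
  squeue --start -j 12345
  # inspect the job's requirements in detail
  scontrol show job 12345
  # lower the per-node memory request to 4 GB (value in MB)
  scontrol update job 12345 MinMemoryNode=4096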

Pending Reasons

So, why do my jobs not start? SLURM may list a number of reasons for pending jobs (those labelled PD in the output of squeue).

Reason                       Brief Explanation
Priority                     Initially every job carries this reason. If the job is not scheduled for a while (more than several minutes), it simply lacks the priority to start.
AssocGrpCPURunMinutesLimit   The CPU-time quota of the quality of service associated with the partition is exhausted for the account/association in question. This quota recovers over time.
QOSMaxJobsPerUserLimit       For certain partitions the number of running jobs per user is limited.
QOSGrpMemLimit               The requested partition may only use a limited fraction of the cluster's resources, and this limit has been reached: running jobs have to finish before new ones can start.
Resources                    The partition would allow the requested resources, but at the moment it cannot provide nodes to run on (e.g. because a memory request cannot be satisfied).

In addition, there are limits on the number of jobs a user or group (a.k.a. account) may run at any given time.
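
To see the reason SLURM reports for each of your pending jobs, the reason field (%r) can be added to the squeue output format, e.g.:

  # job id, partition, name, state and pending reason of your pending jobs
  squeue -u $USER -t PENDING -o "%.10i %.12P %.20j %.8T %r"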
