Information on Jobs
List job(s) … | Command |
---|---|
for you (or a different user) | squeue -u $USER |
in <partition> | squeue -u $USER -p <partition> |
priority | sprio -l |
running | squeue -u $USER -t RUNNING |
pending | squeue -u $USER -t PENDING |
details | scontrol show jobid -dd <jobid> |
status info | sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j <jobid> --allsteps |
statistics on completed (per job) | sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed |
statistics on completed (per username) | sacct -u <username> --format=JobID,JobName,MaxRSS,Elapsed |
Completed jobs can only be seen with sacct.
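The MaxRSS column reported by sacct can serve as a basis for choosing a memory request. The following is a minimal sketch, not a definitive tool. It assumes sacct is called with --parsable2 --noheader --units=K, so that fields are separated by | and MaxRSS carries a K suffix. The function name peak_rss_mb is hypothetical.

```shell
# peak_rss_mb: read sacct lines of the form JobID|JobName|MaxRSS|Elapsed
# from stdin and print the largest MaxRSS, rounded up to whole MB.
# Assumption: MaxRSS values look like "2048K" (sacct --units=K).
# Lines with an empty MaxRSS field are skipped.
peak_rss_mb() {
    awk -F'|' '
        $3 != "" {
            kb = $3
            sub(/K$/, "", kb)                 # strip the unit suffix
            if (kb + 0 > max) max = kb + 0    # keep the largest value seen
        }
        END { printf "%d\n", (max + 1023) / 1024 }
    '
}

# Intended use on the cluster (not executed here):
#   sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed \
#         --parsable2 --noheader --units=K | peak_rss_mb
```

The result is the peak over all job steps listed for the job, so it may be a reasonable starting point for a later memory request, plus some headroom.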
Now you know why your job is pending, and you want to know what the reason means. There can be several reasons. The most frequent are:
QOSGrpMemLimit
- the requested partition is limited in the fraction of cluster resources it may use, and this limit has been reached: running jobs need to end before new ones can start.
Resources
- while the partition may allow the resources you requested, it cannot provide the nodes to run on at the moment (e.g. because of a memory request which cannot be satisfied).
Priority
- there are sufficient resources, but your job does not have sufficient priority to start yet.
- in addition, there are limits on the number of jobs a user or group (a.k.a. account) may run at a given time.
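To see at a glance which of these reasons affect your own jobs, the reasons reported by squeue can be counted. The following is a minimal sketch. It assumes squeue is called with --noheader --format=%r so that each line holds exactly one reason. The function name tally_reasons is hypothetical.

```shell
# tally_reasons: read one pending reason per line from stdin and print
# "count reason" pairs, most frequent first (ties sorted by reason).
tally_reasons() {
    awk 'NF { n[$0]++ } END { for (r in n) print n[r], r }' |
        sort -k1,1nr -k2
}

# Intended use on the cluster (not executed here):
#   squeue -u "$USER" -t PENDING --noheader --format=%r | tally_reasons
```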
Controlling Jobs
To… job(s) | Command |
---|---|
cancel one | scancel <jobid> |
cancel all | scancel -u <username> |
cancel all the pending | scancel -t PENDING -u <username> |
cancel one or more by name | scancel --name <myJobName> |
hold one (keep it pending) | scontrol hold <jobid> |
release a held one | scontrol release <jobid> |
requeue one | scontrol requeue <jobid> |
Modifying Pending Jobs
Sometimes squeue --start might indicate a wrong requirement specification, e.g. BadConstraints. In this case a user can figure out the mismatch with scontrol show job <jobid> (which might require some experience). Wrong requirements can be fixed as follows:
To correct a job's | Command |
---|---|
memory requirement (per node) | scontrol update jobid=<jobid> MinMemoryNode=<mem in MB> |
memory requirement (per CPU) | scontrol update jobid=<jobid> MinMemoryCPU=<mem in MB> |
number of requested CPUs | scontrol update jobid=<jobid> NumCPUs=<number> |
For more information see man scontrol.
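Before changing a requirement, it can help to pull the current value of a single field out of the scontrol show job output. The following is a minimal sketch. It assumes the output consists of whitespace-separated Key=Value pairs. The function name job_field is hypothetical.

```shell
# job_field KEY: read "scontrol show job" output from stdin and print the
# value of the first Key=Value pair whose key is exactly KEY.
# Only the first "=" separates key and value, so values such as
# "cpu=1,mem=4G" are returned unchanged.
job_field() {
    tr -s ' \t' '\n\n' |
        awk -v key="$1" '
            index($0, key "=") == 1 {
                print substr($0, length(key) + 2)
                exit
            }'
}

# Intended use on the cluster (not executed here):
#   scontrol show job <jobid> | job_field MinMemoryNode
```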
Pending Reasons
So, why do my jobs not start? SLURM may list a number of reasons for pending jobs (those labelled PD in the squeue output).
Reason | Brief Explanation |
---|---|
Priority | At first, every job gets this reason. If the job is not scheduled for a while (more than several minutes), it simply lacks the priority to start. |
AssocGrpCPURunMinutesLimit | Indicates that the CPU time allowed by the partition's associated quality of service is used up. |