slurm_manage


Information on Jobs

List job(s) …                          | Command
for you (or a different user)          | squeue -u $USER
in <partition>                         | squeue -u $USER -p <partition>
priority                               | sprio -l
running                                | squeue -u $USER -t RUNNING
pending                                | squeue -u $USER -t PENDING
details                                | scontrol show jobid -dd <jobid>
status info                            | sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j <jobid> --allsteps
statistics on completed (per job)      | sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed
statistics on completed (per username) | sacct -u <username> --format=JobID,JobName,MaxRSS,Elapsed

Completed jobs are visible only with sacct; squeue lists only jobs that are still pending or running.
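To read the sacct statistics more easily, the peak memory of a finished job can be pulled out of the MaxRSS column. The following is a minimal sketch, not a site-provided tool: the function name peak_rss_mb is made up for illustration, and it assumes that sacct is called with --parsable2 --noheader and that MaxRSS values carry a K, M or G suffix (values without a suffix are assumed to be bytes).

```shell
# peak_rss_mb (hypothetical helper): read lines of the form
#   JobID|JobName|MaxRSS|Elapsed
# from stdin and print the largest MaxRSS value in MB.
peak_rss_mb() {
    awk -F'|' '
        $3 != "" {
            value = $3 + 0                  # numeric part, e.g. 2048 from "2048K"
            unit  = substr($3, length($3))  # last character, the unit suffix if any
            if      (unit == "K") mb = value / 1024
            else if (unit == "M") mb = value
            else if (unit == "G") mb = value * 1024
            else                  mb = value / (1024 * 1024)  # assumption: plain bytes
            if (mb > max) max = mb
        }
        END { printf "%.1f\n", max }
    '
}

# Intended use on the cluster (replace <jobid> with a real job id):
#   sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed --parsable2 --noheader | peak_rss_mb
```

The resulting number can serve as a starting point for the memory request of the next, similar job.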

Common Reasons for Pending Jobs

Once you know the reason your job is pending, you will want to know what it means. There can be several reasons; the most frequent are:

  • QOSGrpMemLimit - the requested partition is limited in the share of cluster resources (here: memory) it may use, and this limit has been reached: running jobs need to end before new ones can start.
  • Resources - the partition would allow the resources you requested, but it cannot provide the nodes to run on at the moment (e.g. because of a memory request that cannot currently be satisfied).
  • Priority - there are sufficient resources, but your job does not have sufficient priority to start yet.
  • In addition, there are limits on the number of jobs a user or group (a.k.a. account) may run at a given time.
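If many jobs are waiting, it can help to see how often each reason occurs. A small sketch, assuming squeue is asked to print only the reason column (format specifier %r) without a header; the function name count_reasons is hypothetical.

```shell
# count_reasons (hypothetical helper): read one pending reason per line
# from stdin and print "<count> <reason>", most frequent reason first.
count_reasons() {
    awk 'NF { count[$0]++ }
         END { for (r in count) print count[r], r }' |
        sort -k1,1nr -k2,2
}

# Intended use on the cluster:
#   squeue -u "$USER" -t PENDING --noheader --format="%r" | count_reasons
```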

Controlling Jobs

To … job(s)                     | Command
cancel one                      | scancel <jobid>
cancel all                      | scancel -u <username>
cancel all the pending          | scancel -t PENDING -u <username>
cancel one or more by name      | scancel --name <myJobName>
pause one (hold a pending job)  | scontrol hold <jobid>
resume one (release a held job) | scontrol release <jobid>
requeue one                     | scontrol requeue <jobid>
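scancel --name matches complete job names. To cancel all pending jobs whose names merely start with a common prefix, the job ids can be selected from squeue output first. This is only a sketch: the function name jobs_with_prefix and the prefix sim_ are made up, job names are assumed to contain no spaces, and xargs -r is a GNU extension.

```shell
# jobs_with_prefix (hypothetical helper): read "<jobid> <jobname>" lines
# from stdin and print the ids of all jobs whose name starts with $1.
jobs_with_prefix() {
    awk -v prefix="$1" 'index($2, prefix) == 1 { print $1 }'
}

# Intended use on the cluster; check the list of ids before cancelling:
#   squeue -u "$USER" -t PENDING --noheader --format="%i %j" | jobs_with_prefix sim_
#   squeue -u "$USER" -t PENDING --noheader --format="%i %j" | jobs_with_prefix sim_ | xargs -r scancel
```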

Modifying Pending Jobs

Sometimes squeue --start indicates a wrong requirement specification, e.g. BadConstraints. In this case you can track down the mismatch with scontrol show job <jobid> (which might require some experience). Wrong requirements can be corrected as follows:

To correct a job's             | Command
memory requirement (per node)  | scontrol update JobId=<jobid> MinMemoryNode=<mem in MB>
memory requirement (per CPU)   | scontrol update JobId=<jobid> MinMemoryCPU=<mem in MB>
number of requested CPUs       | scontrol update JobId=<jobid> NumCPUs=<number>

For more information see man scontrol.
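The output of scontrol show job is a long list of Key=Value pairs, so it can be handy to pick out just the fields you intend to change. A minimal sketch under the assumption that the value of interest contains no spaces; the function name job_field is hypothetical.

```shell
# job_field (hypothetical helper): read `scontrol show job <jobid>` output
# from stdin and print the value of the first Key=Value pair whose key is $1.
job_field() {
    awk -v key="$1" '{
        for (i = 1; i <= NF; i++) {
            eq = index($i, "=")
            if (eq > 0 && substr($i, 1, eq - 1) == key) {
                print substr($i, eq + 1)
                exit
            }
        }
    }'
}

# Intended use on the cluster (replace <jobid> with a real job id):
#   scontrol show job <jobid> | job_field Reason
#   scontrol show job <jobid> | job_field MinMemoryNode
```

Comparing the printed values with what the partition actually offers usually shows which requirement needs to be updated.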

Pending Reasons

So, why do my jobs not start? SLURM may list a number of reasons for pending jobs (those labelled PD in the squeue output).

Reason                     | Brief Explanation
Priority                   | At first, every job gets this reason. If the job is not scheduled for a while (more than several minutes), it simply lacks the priority to start.
AssocGrpCPURunMinutesLimit | Indicates that the CPU-time budget (CPU run-minutes) available to the job's association, i.e. its account, is used up by currently running jobs.
slurm_manage.1502424081.txt.gz · Last modified: 2017/08/11 06:01 by meesters