|List job(s) … for you (or a different user)||Command|
|statistics on completed (per job)||
|statistics on completed (per username)||
You can see completed jobs only with sacct. Note that only recent jobs are displayed unless you specify the -S flag (the start date to search from). For example, -S 0901 would look up jobs from September 1st onward. See the manpage for more information on time-related lookup options.
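As a rough sketch of the lookup described above, the start date can be computed instead of typed by hand. The function names are hypothetical, GNU date is assumed for the -d option, and the YYYY-MM-DD form is used as an alternative to the MMDD form so that the year is unambiguous.

```shell
#!/bin/bash
# Sketch only: assumes GNU date and that sacct is available on the login node.

# Print the date N days in the past as YYYY-MM-DD.
start_date_days_ago() {
    local days="$1"
    date -d "${days} days ago" +%Y-%m-%d
}

# Hypothetical wrapper: list your own jobs since N days ago (default: 7).
recent_jobs() {
    local days="${1:-7}"
    sacct -u "$(id -un)" -S "$(start_date_days_ago "$days")"
}
```

Calling recent_jobs 14 would then list your jobs of the last two weeks.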
|cancel all the pending||
|cancel one or more by name||
squeue --start might indicate a wrong requirement specification, e.g. BadConstraints. In this case a user can figure out the mismatch with scontrol show job <jobid> (which might require some experience). Wrong requirements can be corrected as follows:
|To correct a job's||Command|
|number of requested CPUs||
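When looking for the mismatch, it can help to pull just the reason out of the scontrol show job output. The following is a minimal sketch; job_reason is a hypothetical helper, and it assumes the output contains a token of the form Reason=<value>. The sample line is illustrative, not real output from this cluster.

```shell
#!/bin/bash
# Sketch only: reads scontrol output on stdin and prints the value
# that follows "Reason=" (first occurrence).
job_reason() {
    grep -o 'Reason=[^ ]*' | head -n 1 | cut -d= -f2
}

# Illustrative input line (an assumption about the output layout):
sample='   JobState=PENDING Reason=BadConstraints Dependency=(null)'
printf '%s\n' "$sample" | job_reason   # prints: BadConstraints
```

On the cluster, the same helper would be fed from scontrol show job with the job ID in question.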
For more information see
So, why do my jobs not start? SLURM may list a number of reasons for pending jobs (those shown as pending, state PD, when squeue is run).
| ||At first, every job gets this reason. If not scheduled for a while (> several minutes), the job simply lacks priority to start.|
| ||Indicates that the partition's associated quality of service, in terms of CPU time, is exhausted for the account / association in question. This number will recover.|
| ||For certain partitions the number of running jobs per user is limited.|
| ||For certain partitions the number of running jobs per account is limited.|
| ||The requested partition is limited in the fraction of cluster resources it can take, and this limit has been reached: running jobs need to end before new ones can start.|
| ||While the partition may allow the resources you requested, it cannot currently provide the nodes to run on (e.g. because of a memory request which cannot be satisfied).|
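To get an overview of which of these reasons affect your pending jobs, the reasons can be tallied. This is a minimal sketch; count_reasons is a hypothetical helper that expects one reason per input line, and the reason names in the comment are sample values only.

```shell
#!/bin/bash
# Sketch only: count how often each reason occurs, most frequent first.
# Input: one reason per line, e.g. "Priority" or "Resources".
count_reasons() {
    sort | uniq -c | sort -rn
}
```

It could be fed from squeue, for instance with squeue -h -u "$USER" -t PD -o "%r" | count_reasons, where -h suppresses the header, -t PD selects pending jobs and %r prints the reason.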
And then there are limitations due to the number of jobs a user or group (a.k.a. account) may run at a given time. More information on partitions can be found on their respective wiki pages.
Each partition limits the maximum allowed runtime of a job and provides default values for the estimated job runtime and memory usage per core. A job should request appropriate values for those resources using the --time option (and --mem, if deviating from the partition defaults). A job is killed if one of these limits is exceeded. In both cases, the error file provides appropriate information:
(...)
slurmstepd: error: *** JOB xxxxxxx ON a0125 CANCELLED AT 2017-11-30T11:22:57 DUE TO TIME LIMIT ***
(...)
(...)
slurmstepd: error: Job xxxxxxx exceeded memory limit (120000 > 115500), being killed
slurmstepd: error: Exceeded job memory limit
slurmstepd: error: *** JOB xxxxxxx ON a0543 CANCELLED AT 2017-11-30T10:21:37 ***
(...)
sacct will display an informative State string. The error code is 1 for the whole job. The job step error code depends on how the jobscript and the application handle this situation.
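The two messages quoted above can be searched for in a job's error file. This is a minimal sketch; kill_reason is a hypothetical helper, and the patterns are taken from the messages shown here, so they may need adjusting if your SLURM version words them differently.

```shell
#!/bin/bash
# Sketch only: guess from an error file why SLURM killed the job.
# Argument: path to the job's error file.
kill_reason() {
    local errfile="$1"
    if grep -q 'DUE TO TIME LIMIT' "$errfile"; then
        echo "time limit"
    elif grep -qi 'exceeded.*memory limit' "$errfile"; then
        echo "memory limit"
    else
        echo "unknown"
    fi
}
```

It would be called with the path of the job's error file as its only argument.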
The exit code of a job is captured by SLURM and saved as part of the job record. For sbatch jobs the exit code of the batch script is captured. For srun or job steps, the exit code is the return value of the executed command. Any non-zero exit code is considered a job failure and results in a job state of FAILED. When a signal was responsible for a job/step termination, the signal number is also captured and displayed after the exit code (separated by a colon).
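The exit code and signal pair described above can be split at the colon. This is a minimal sketch; describe_exit is a hypothetical helper that takes a value such as 1:0 as its argument.

```shell
#!/bin/bash
# Sketch only: interpret an "exitcode:signal" pair.
describe_exit() {
    local code signal
    IFS=: read -r code signal <<< "$1"
    if [ "${signal:-0}" -ne 0 ]; then
        echo "terminated by signal $signal"
    elif [ "${code:-0}" -ne 0 ]; then
        echo "failed with exit code $code"
    else
        echo "succeeded"
    fi
}

describe_exit "0:0"   # prints: succeeded
describe_exit "1:0"   # prints: failed with exit code 1
describe_exit "0:9"   # prints: terminated by signal 9
```

The argument would typically come from the ExitCode column of sacct, with surrounding whitespace removed first.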
Depending on the execution order of the commands in the batch script, it is possible that a specific command fails but the batch script will return zero indicating success. Consider the following simplified example:
#!/bin/bash
#SBATCH --job-name="A script which fails, but displays 0 as exit code"
#SBATCH --time=00:05:00
#SBATCH --mem-per-cpu=3G
#SBATCH -p short/smp
#SBATCH -A <your account>

module load lang/R/3.4.1-foss-2017a

# Put your code below this line
R --no-save --slave -f fail.r
echo "Script finished"
We submit this job:
$ sbatch submit_fail_r.sh
Submitted batch job 3695216
The exit code and state wrongly indicate that the job finished successfully:
$ sacct -j 3695216
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
3695216      A script +      short     zdvhpc          1  COMPLETED      0:0
3695216.bat+      batch                zdvhpc          1  COMPLETED      0:0
3695216.ext+     extern                zdvhpc          1  COMPLETED      0:0
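The reason for the 0:0 is that a shell script returns the exit status of its last command, here the echo. This can be reproduced without SLURM; in the sketch below, false is a stand-in for the failing R call and status_unchecked is just an illustrative variable name.

```shell
#!/bin/bash
# Sketch only: the failing command is followed by a succeeding one,
# so the script as a whole reports success.
bash -c 'false; echo "Script finished"' > /dev/null
status_unchecked=$?
echo "exit status: $status_unchecked"   # prints: exit status: 0
```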
There are several solutions to this problem. One is to run the failing command as a job step with srun, so that
R --no-save --slave -f fail.r
becomes
srun R --no-save --slave -f fail.r
The output will be a lot more informative:
$ sacct -j 3713748
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
3713748      A script +      short     zdvhpc          1  COMPLETED      0:0
3713748.bat+      batch                zdvhpc          1  COMPLETED      0:0
3713748.ext+     extern                zdvhpc          1  COMPLETED      0:0
3713748.0             R                zdvhpc          1     FAILED      1:0
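Where the batch script itself should report the failure, two common shell techniques may help; they are offered here as assumptions, not as the approach this page prescribes. The sketch uses false as a stand-in for the failing R call and runs without SLURM.

```shell
#!/bin/bash
# Sketch only, two ways to make a script return the failure.

# Variant 1: "set -e" aborts the script at the first failing command.
bash -c 'set -e; false; echo "Script finished"' > /dev/null
status_set_e=$?

# Variant 2: save the exit status and return it at the end.
bash -c 'false; rc=$?; echo "Script finished"; exit "$rc"' > /dev/null
status_saved=$?

echo "set -e: $status_set_e, saved status: $status_saved"
# prints: set -e: 1, saved status: 1
```

With either variant in the batch script, the batch step would be expected to end with a non-zero exit code instead of 0:0.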