| statistics on completed (per job) | ''%%sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed%%'' |
| statistics on completed (per username) | ''%%sacct -u <username> --format=JobID,JobName,MaxRSS,Elapsed%%'' |
| summary statistics on a completed job | ''%%seff <jobid>%%'' |
    
<WRAP center round info 80%>
You can see completed jobs only with ''sacct''. Note that only recent jobs will be displayed unless you specify the ''-S'' flag (the start date to search from). For example, ''-S 0901'' would look up the jobs from September 1st. See the manpage for more information on time-related lookup options.
</WRAP>
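As a minimal sketch of such a lookup (the username and the selection of format fields are placeholders, adapt them to your needs):

<code bash>
# list all of your jobs since September 1st, including their final state
sacct -u <username> -S 0901 --format=JobID,JobName,State,Elapsed
</code>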
  
  
====== Pending Reasons ======
  
So, why do my jobs not start? SLURM may list a number of reasons for pending jobs (those labelled ''PD'' when ''squeue'' is invoked). Here, we show some of the more frequent reasons:
  
^ Reason ^ Brief Explanation ^
| ''AssocGrpCPURunMinutesLimit'' | Indicates that the quality of service associated with the partition, in terms of CPU time, is exhausted for the [[accounts|account / association in question]]. This number will recover. |
| ''QOSMaxJobsPerUserLimit'' | For certain partitions the number of running jobs per user is limited. |
| ''QOSMaxJobsPerAccountLimit'' | For certain partitions the number of running jobs per account is limited. |
| ''QOSGrpGRESRunMinutes'' | For certain partitions the generic resources (e.g. GPUs) are limited. See [[gpu|GPU Queues]]. |
| ''QOSGrpMemLimit'' | The requested partition is limited in the fraction of memory it may take from the cluster and this amount has been reached: jobs need to end before new ones may start. |
| ''QOSGrpCpuLimit'' | The requested partition is limited in the fraction of CPUs it may take from the cluster and this amount has been reached: jobs need to end before new ones may start. |
| ''Resources'' | While the partition may allow the resources you requested, it cannot -- at the time -- provide the nodes to run on (e.g. because of a memory request which cannot be satisfied). |
| ''ReqNodeNotAvail'' | Simply means that no node with the required resources is available. SLURM will list //all// non-available nodes, which can be confusing. This reason is similar to ''Priority'' as it means that a specific job has to wait for a resource to be released. |
  
And then there are limitations due to the number of jobs a user or group (a.k.a. account) may run at a given time. More information on partitions can be found [[partitions|on their respective wiki site]].
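To quickly check the reason for your own pending jobs, the reason field (''%R'') can be included in the ''squeue'' output. A minimal sketch (the chosen columns are just one possible selection):

<code bash>
# show only your pending jobs together with the reason they are still waiting
squeue -u <username> -t PENDING -o "%.18i %.9P %.25j %.8T %.10M %R"
</code>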

====== Investigating Job Failures ======

===== Exceeding Resource Limits =====

Each partition limits the maximal allowed runtime of a job and provides default values for the estimated job runtime and the memory usage per core. A job should request appropriate values for these resources using the ''--time'' and ''--mem-per-cpu'' (or ''--mem'' if deviating from the [[partitions|partition defaults]]) options. A job is killed if one of these limits is exceeded. In both cases, the error file provides appropriate information:

Time limit:

<code>
(...)
slurmstepd: error: *** JOB xxxxxxx ON a0125 CANCELLED AT 2017-11-30T11:22:57 DUE TO TIME LIMIT ***
(...)
</code>

Memory limit:

<code>
(...)
slurmstepd: error: Job xxxxxxx exceeded memory limit (120000 > 115500), being killed
slurmstepd: error: Exceeded job memory limit
slurmstepd: error: *** JOB xxxxxxx ON a0543 CANCELLED AT 2017-11-30T10:21:37 ***
(...)
</code>

In addition, ''sacct'' will display an informative ''State'' string. The exit code is 1 for the whole job; the exit code of a job step depends on how the jobscript and the application handle this situation.
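For example, one way to check the state and exit code of a finished job (the selection of format fields below is just a suggestion):

<code bash>
# state, exit code and key resource figures of a completed or killed job
sacct -j <jobid> --format=JobID,JobName,State,ExitCode,Elapsed,Timelimit,MaxRSS,ReqMem
</code>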

===== Software Errors =====

The exit code of a job is captured by SLURM and saved as part of the job record. For ''sbatch'' jobs the exit code of the batch script is captured. For ''srun'' or job steps, the exit code will be the return value of the executed command. Any non-zero exit code is considered a job failure and results in a job state of FAILED. When a signal was responsible for a job or step termination, the signal number will also be captured and displayed after the exit code (separated by a colon).

Depending on the execution order of the commands in the batch script, it is possible that a specific command fails while the batch script still returns zero, indicating success. Consider the following simplified example, where the R snippet is saved as ''fail.r'' and called from the subsequent jobscript (//note for non-R users//: ''sq'' does not exist without loading a library which provides it):

<code Rsplus>
var <- sq(1, 1000000000)
</code>

<code bash>
#!/bin/bash

#SBATCH --job-name="A script which fails, but displays 0 as exit code"
#SBATCH --time=00:05:00
#SBATCH --mem-per-cpu=3G
#SBATCH -p short/smp
#SBATCH -A <your account>

module load lang/R/3.4.1-foss-2017a

# Put your code below this line
R --no-save --slave -f fail.r
echo "Script finished"
</code>

We submit this job:

<code bash>
$ sbatch submit_fail_r.sh
Submitted batch job 3695216
</code>

The exit code and state wrongly indicate that the job finished successfully:
<code bash>
$ sacct -j 3695216
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
3695216      A script +      short    account          1  COMPLETED      0:0
3695216.bat+      batch               account          1  COMPLETED      0:0
3695216.ext+     extern               account          1  COMPLETED      0:0
</code>

There are several solutions to this problem:

  * The //**preferred**// solution is to create genuine job steps, where
<code bash>
R --no-save --slave -f fail.r
</code>
would become
<code bash>
srun R --no-save --slave -f fail.r
</code>
The output will be a lot more informative:
<code bash>
$ sacct -j 3713748
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
3713748      A script +      short    account          1  COMPLETED      0:0
3713748.bat+      batch               account          1  COMPLETED      0:0
3713748.ext+     extern               account          1  COMPLETED      0:0
3713748.0             R               account          1     FAILED      1:0
</code>

  * In the case where the batch script shall handle all job steps itself (only sensible if confined to a single node), you can set your own exit codes:
<code bash>
R --no-save --slave -f fail.r || exit 42
</code>
which now translates into a batch script failure:
<code bash>
$ sacct -j 3714719
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
3714719      A script +      short    account          1     FAILED     42:0
3714719.bat+      batch               account          1     FAILED     42:0
3714719.ext+     extern               account          1  COMPLETED      0:0
</code>
  * Finally, it is possible to make the script exit on every error (e.g. in bash with ''set -e''), as sketched below. This, however, is recommended only if you know how to script well.
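A minimal sketch of this last option (the resource directives are placeholders and the R call is taken from the example above):

<code bash>
#!/bin/bash
#SBATCH --job-name="fail fast example"
# ... further #SBATCH directives as in the script above ...

# abort the batch script on the first failing command;
# 'pipefail' also catches failures inside pipelines
set -e
set -o pipefail

R --no-save --slave -f fail.r
echo "Only reached if the R script succeeded"
</code>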

<WRAP center round info 80%>
The most useful information can be derived from the application-specific output, usually written to the job log files.
</WRAP>

===== Hardware Errors =====

Alas, sometimes((For brand new and very old systems more frequently than "sometimes".)) you might experience node failures or network issues (particularly with very big jobs). In such cases, your job might get aborted with weird messages, e.g. from MPI. If you re-submit, SLURM will with great probability schedule your new job on the same nodes where your previous job tried to compute -- with the same consequence.

We try our best to detect hardware issues with scripts //prior// to the execution of a job, but sometimes a glitch passes undetected, with the consequences described above.

If this happens, **please notify us**.

Also, when resubmitting, you can exclude the nodes where the failed jobs ran. First you ask SLURM where your previous jobs ran:

<code bash>
$ sacct -o JobID,ExitCode,NodeList
</code>

and then resubmit with the copied node list for the job(s) in question -- without modifying your jobscript:

<code bash>
$ sbatch --exclude <nodelist> <jobscript>
</code>
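
A concrete usage sketch (the job id and node name are reused from the examples above purely for illustration):

<code bash>
# suppose job 3695216 failed on node a0543
$ sacct -j 3695216 -o JobID,ExitCode,NodeList
# resubmit the same jobscript while avoiding that node
$ sbatch --exclude=a0543 submit_fail_r.sh
</code>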