| statistics on completed jobs (per job ID) | ''%%sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed%%'' |
| statistics on completed jobs (per username) | ''%%sacct -u <username> --format=JobID,JobName,MaxRSS,Elapsed%%'' |
| summary statistics on a completed job | ''%%seff <jobid>%%'' |
    
<WRAP center round info 80%>
You can see completed jobs only with ''sacct''. Note that without the ''-S'' flag (the start date to search from), only recent jobs are displayed. For example, ''-S 0901'' looks up jobs from September 1st onwards. See the man page for more information on time-related lookup options.
</WRAP>
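
For example, to list your jobs since September 1st together with their state and elapsed time (a sketch; ''<username>'' is a placeholder):

<code bash>
# all of your jobs started on or after September 1st
$ sacct -u <username> -S 0901 --format=JobID,JobName,State,Elapsed
</code>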
  
====== Controlling Jobs ======

====== Pending Reasons ======

So, why do my jobs not start? SLURM may list a number of reasons for pending jobs (those labelled ''PD'' in the output of ''squeue''). Here are some of the more frequent reasons:
  
^ Reason ^ Brief Explanation ^
| ''Priority'' | Every job gets this reason at first. If it is not scheduled for a while (more than several minutes), the job simply lacks the priority to start. |
| ''AssocGrpCPURunMinutesLimit'' | Indicates that the CPU-time quota of the quality of service associated with the partition is exhausted for the [[accounts|account / association in question]]. This number will recover over time. |
| ''QOSMaxJobsPerUserLimit'' | For certain partitions the number of running jobs per user is limited. |
| ''QOSMaxJobsPerAccountLimit'' | For certain partitions the number of running jobs per account is limited. |
| ''QOSGrpGRESRunMinutes'' | For certain partitions the generic resources (e.g. GPUs) are limited. See [[gpu|GPU Queues]]. |
| ''QOSGrpMemLimit'' | The requested partition is limited in the fraction of memory it may take from the cluster and this amount has been reached: jobs need to end before new ones may start. |
| ''QOSGrpCpuLimit'' | The requested partition is limited in the fraction of CPUs it may take from the cluster and this amount has been reached: jobs need to end before new ones may start. |
| ''Resources'' | While the partition may allow the resources you requested, it cannot -- at the moment -- provide the nodes to run on (e.g. because a memory request cannot be satisfied). |
| ''ReqNodeNotAvail'' | Simply means that no node with the required resources is available. SLURM will list //all// non-available nodes, which can be confusing. This reason is similar to ''Priority'' as it means that a specific job has to wait for a resource to be released. |

In addition, there are limits on the number of jobs a user or group (a.k.a. account) may run at a given time. More information on partitions can be found [[partitions|on their respective wiki page]].
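
To see which reason SLURM currently reports for your own pending jobs, you can ask ''squeue'' directly (a sketch; ''<username>'' is a placeholder):

<code bash>
# show only pending jobs of the given user; the last column (%R) holds the reason
$ squeue -u <username> -t PENDING -o "%.18i %.9P %.25j %.2t %R"
</code>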

====== Investigating Job Failures ======

===== Exceeding Resource Limits =====

Each partition limits the maximum allowed runtime of a job and provides default values for the estimated job runtime and the memory usage per core. A job should request appropriate values for these resources using the ''--time'' and ''--mem-per-cpu'' (or ''--mem'') options if deviating from the [[partitions|partition defaults]]. A job is killed if one of these limits is exceeded. In both cases, the error file provides the relevant information:

Time limit:

<code>
(...)
slurmstepd: error: *** JOB xxxxxxx ON a0125 CANCELLED AT 2017-11-30T11:22:57 DUE TO TIME LIMIT ***
(...)
</code>

Memory limit:

<code>
(...)
slurmstepd: error: Job xxxxxxx exceeded memory limit (120000 > 115500), being killed
slurmstepd: error: Exceeded job memory limit
slurmstepd: error: *** JOB xxxxxxx ON a0543 CANCELLED AT 2017-11-30T10:21:37 ***
(...)
</code>

In addition, ''sacct'' will display an informative ''State'' string. The exit code is 1 for the whole job; the exit code of a job step depends on how the job script and the application handle this situation.
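
For example, to check the state of a finished job together with a few resource figures (''<jobid>'' is a placeholder):

<code bash>
# a State of TIMEOUT points to an exceeded time limit; an exceeded memory limit
# shows up as CANCELLED/FAILED or, depending on the SLURM version, OUT_OF_MEMORY
$ sacct -j <jobid> -o JobID,State,ExitCode,Elapsed,Timelimit,MaxRSS,ReqMem
</code>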

===== Software Errors =====

The exit code of a job is captured by SLURM and saved as part of the job record. For ''sbatch'' jobs, the exit code of the batch script is captured. For ''srun'' or job steps, the exit code is the return value of the executed command. Any non-zero exit code is considered a job failure and results in a job state of FAILED. When a signal was responsible for a job or job step termination, the signal number is also captured and displayed after the exit code (separated by a colon).
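
For instance, if you terminate a job yourself by sending a signal (a sketch; ''<jobid>'' is a placeholder):

<code bash>
# send SIGTERM to a running job; if a job step is terminated by the signal,
# the signal number appears after the colon in sacct's ExitCode column
$ scancel --signal=TERM <jobid>
</code>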

Depending on the execution order of the commands in the batch script, it is possible that a specific command fails while the batch script still returns zero, indicating success. Consider the following simplified example (//note for non-R users//: ''sq'' does not exist unless a library providing it is loaded):

<code Rsplus>
# fail.r -- 'sq' is undefined, so this script aborts with an error
var <- sq(1, 1000000000)
</code>

<code bash>
#!/bin/bash

#SBATCH --job-name="A script which fails, but displays 0 as exit code"
#SBATCH --time=00:05:00
#SBATCH --mem-per-cpu=3G
#SBATCH -p short/smp
#SBATCH -A <your account>

module load lang/R/3.4.1-foss-2017a

# Put your code below this line
R --no-save --slave -f fail.r
echo "Script finished"
</code>

We submit this job:

<code bash>
$ sbatch submit_fail_r.sh
Submitted batch job 3695216
</code>

The exit code and state wrongly indicate that the job finished successfully:
<code bash>
$ sacct -j 3695216
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
3695216      A script +      short    account          1  COMPLETED      0:0
3695216.bat+      batch               account          1  COMPLETED      0:0
3695216.ext+     extern               account          1  COMPLETED      0:0
</code>

There are several solutions to this problem:

  * The //**preferred**// solution is to create genuine job steps, where
<code bash>
R --no-save --slave -f fail.r
</code>
would become
<code bash>
srun R --no-save --slave -f fail.r
</code>
The output will be a lot more informative:
<code bash>
$ sacct -j 3713748
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
3713748      A script +      short    account          1  COMPLETED      0:0
3713748.bat+      batch               account          1  COMPLETED      0:0
3713748.ext+     extern               account          1  COMPLETED      0:0
3713748.0             R               account          1     FAILED      1:0
</code>

  * In the case where the batch script shall handle all job steps (only sensible if the job is confined to a single node), you can set your own exit codes:

<code bash>
R --no-save --slave -f fail.r || exit 42
</code>
which now translates into a batch script failure:
<code bash>
$ sacct -j 3714719
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
3714719      A script +      short    account          1     FAILED     42:0
3714719.bat+      batch               account          1     FAILED     42:0
3714719.ext+     extern               account          1  COMPLETED      0:0
</code>
  * Finally, it is possible to make the script exit on every error (e.g. ''set -e'' in bash); see the sketch below. This, however, is recommended only if you are comfortable with shell scripting.
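
A minimal sketch of this approach, reusing the example from above (''#SBATCH'' header omitted):

<code bash>
#!/bin/bash
set -e                         # abort the batch script as soon as any command fails

R --no-save --slave -f fail.r  # if this fails, the script stops right here ...
echo "Script finished"         # ... and this line is never reached
</code>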

<WRAP center round info 80%>
The most useful information can be derived from the application-specific output, usually written to the job log files.
</WRAP>

===== Hardware Errors =====

Alas, sometimes((For brand-new and very old systems more frequently than "sometimes".)) you might experience node failures or network issues (particularly with very big jobs). In such cases, your job might get aborted with cryptic messages, e.g. from MPI. If you re-submit, SLURM will most likely schedule your new job on the very nodes where your previous job tried to compute -- with the same consequence.

We try our best to detect hardware issues with scripts //prior// to the execution of a job, but sometimes a glitch passes undetected, with the consequences described above.

If this happens, **please notify us**.

Also, when resubmitting, you can exclude the nodes on which the failed jobs ran. First, ask SLURM where your previous jobs ran:

<code bash>
$ sacct -o JobID,ExitCode,NodeList
</code>

and then, with the copied nodelist for the job(s) in question -- without modifying your jobscript:

<code bash>
$ sbatch --exclude <nodelist> <jobscript>
</code>
  
  