| statistics on completed (per job) | ''%%sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed%%'' |
| statistics on completed (per username) | ''%%sacct -u <username> --format=JobID,JobName,MaxRSS,Elapsed%%'' |
| summary statistics on completed job | ''%%seff <jobid>%%'' |
    
<WRAP center round info 80%>
You can see completed jobs only with ''sacct''. Note that only recent jobs will be displayed unless you specify the ''-S'' flag (the start date to search from). For example, ''-S 0901'' would look up jobs from September 1st. See the manpage for more information on time-related lookup options.
</WRAP>
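For example, a minimal lookup sketch (the username and the date are placeholders to substitute):

<code bash>
# list your jobs started since September 1st, together with some summary columns
sacct -u <username> -S 0901 --format=JobID,JobName,State,Elapsed,MaxRSS
</code>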
====== Pending Reasons ======
  
So, why do my jobs not start? SLURM may list a number of reasons for pending jobs (those labelled ''PD'' when ''squeue'' is invoked). Here, we show some of the more frequent reasons:
  
^ Reason ^ Brief Explanation ^
| ''AssocGrpCPURunMinutesLimit'' | Indicates that the CPU time of the quality of service associated with the partition is exhausted for the [[accounts|account / association in question]]. This number will recover over time. |
| ''QOSMaxJobsPerUserLimit'' | For certain partitions the number of running jobs per user is limited. |
| ''QOSMaxJobsPerAccountLimit'' | For certain partitions the number of running jobs per account is limited. |
| ''QOSGrpGRESRunMinutes'' | For certain partitions the generic resources (e.g. GPUs) are limited. See [[gpu|GPU Queues]]. |
| ''QOSGrpMemLimit'' | The requested partition is limited in the fraction of memory it may take from the cluster and this amount has been reached: jobs need to end before new ones may start. |
| ''QOSGrpCpuLimit'' | The requested partition is limited in the fraction of CPUs it may take from the cluster and this amount has been reached: jobs need to end before new ones may start. |
| ''Resources'' | While the partition may allow you to take the resources you requested, it cannot -- at the time -- provide the nodes to run on (e.g. because of a memory request which cannot be satisfied). |
| ''ReqNodeNotAvail'' | Simply means that no node with the required resources is available. SLURM will list //all// non-available nodes, which can be confusing. This reason is similar to ''Priority'' as it means that a specific job has to wait for a resource to be released. |
  
And then there are limitations due to the number of jobs a user or group (a.k.a. account) may run at a given time. More information on partitions can be found [[partitions|on their respective wiki site]].
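
To check which reason SLURM currently lists for your own pending jobs, a query along these lines can be used (a sketch; the username is a placeholder and the format string can be adjusted):

<code bash>
# show only pending (PD) jobs of the given user; the last column is the pending reason
squeue -u <username> -t PD -o "%.12i %.12P %.10T %.30r"
</code>
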
The exit code of a job is captured by SLURM and saved as part of the job record. For ''sbatch'' jobs the exit code of the batch script is captured. For ''srun'' jobs or job steps, the exit code will be the return value of the executed command. Any non-zero exit code is considered a job failure and results in a job state of FAILED. When a signal was responsible for a job/step termination, the signal number will also be captured and displayed after the exit code (separated by a colon).
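
A quick way to inspect this field (a sketch; substitute the job id):

<code bash>
# the ExitCode column is displayed as <exit code>:<signal number>
sacct -j <jobid> --format=JobID,JobName,State,ExitCode
</code>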
  
Depending on the execution order of the commands in the batch script, it is possible that a specific command fails but the batch script still returns zero, indicating success. Consider the following simplified example (//note for non-R users//: ''sq'' does not exist without loading a library which provides it):
  
<code Rsplus>
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
3695216       A script+      short    account          1  COMPLETED      0:0
3695216.bat+      batch              account          1  COMPLETED      0:0
3695216.ext+     extern              account          1  COMPLETED      0:0
</code>
  
There are several solutions to this problem:
  
  * The //**preferred**// solution is to create genuine job steps, where
<code bash>
R --no-save --slave -f fail.r
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
3713748       A script+      short    account          1  COMPLETED      0:0
3713748.bat+      batch              account          1  COMPLETED      0:0
3713748.ext+     extern              account          1  COMPLETED      0:0
3713748.0             R              account          1     FAILED      1:0
</code>

  * In case the batch script shall handle all job steps itself (only sensible if the job is confined to a single node), you can set your own error codes:

<code bash>
R --no-save --slave -f fail.r || exit 42
</code>
which now translates into a batch script failure:
<code bash>
$ sacct -j 3714719
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
3714719       A script+      short    account          1     FAILED     42:0
3714719.bat+      batch              account          1     FAILED     42:0
3714719.ext+     extern              account          1  COMPLETED      0:0
</code>
  * Finally, it is possible to trigger a script exit on every error (e.g. in bash via ''set -e''). This, however, is to be recommended only if you are well versed in shell scripting; see the sketch below.
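
A minimal sketch of such a batch script (reusing the hypothetical ''fail.r'' example from above; the job name is illustrative only):

<code bash>
#!/bin/bash
#SBATCH --job-name=fail_fast    # illustrative job name only
set -e                          # abort the batch script as soon as any command fails

R --no-save --slave -f fail.r   # a failure here ends the script with R's exit code
echo "only reached if the R script succeeded"
</code>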

<WRAP center round info 80%>
The most useful information can be derived from the application-specific output, usually written to the job log files.
</WRAP>

===== Hardware Errors =====

Alas, sometimes((For brand new and very old systems more frequently than "sometimes".)) you might experience node failures or network issues (particularly with very big jobs). In such cases, your job might get aborted with weird messages, e.g. from MPI. If you re-submit, SLURM will with great probability schedule your new job on those nodes where your previous job tried to compute -- with the same consequence.

We try our best to detect hardware issues with scripts //prior// to the execution of a job, but sometimes a glitch passes undetected, with the consequences described above.

If this happens, **please notify us**.

Also, when resubmitting you can exclude the nodes where failed jobs ran. First, ask SLURM where your previous jobs ran:

<code bash>
$ sacct -o JOBID,EXITCODE,NODELIST
</code>
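
If the list is long, you can restrict the output to unsuccessful jobs, e.g. (a sketch; the start date is a placeholder):

<code bash>
# only show jobs that failed (or ran on a failing node) since the given start date
sacct -o JOBID,EXITCODE,NODELIST --state=FAILED,NODE_FAIL -S <start-date>
</code>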

and then resubmit with the copied nodelist for the job(s) in question -- without modifying your jobscript:

<code bash>
$ sbatch --exclude <nodelist> <jobscript>
</code>