====== Information on Jobs ======

^ List job(s) ... for you (or a different user) ^ Command ^
| | ''%%squeue -u $USER%%'' |
| in <partition> | ''%%squeue -u $USER -p <partition>%%'' |
| priority | ''%%sprio -l%%'' |
| running | ''%%squeue -u $USER -t RUNNING%%'' |
| pending | ''%%squeue -u $USER -t PENDING%%'' |
| details | ''%%scontrol show jobid -dd <jobid>%%'' |
| status info | ''%%sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j <jobid> --allsteps%%'' |
| statistics on completed (per job) | ''%%sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed%%'' |
| statistics on completed (per username) | ''%%sacct -u <username> --format=JobID,JobName,MaxRSS,Elapsed%%'' |
| summary statistics on completed job | ''%%seff <jobid>%%'' |

<WRAP center round info 80%>
You can see completed jobs only with ''sacct''. Note that only recent jobs will be displayed unless you specify the ''-S'' flag (the start date to search from). For example, ''-S 0901'' would look up jobs from September 1st onwards. See the manpage for more information on time-related lookup options.
</WRAP>
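
A short sketch of such a lookup (the date and the selection of format fields are just examples and can be adapted):

<code bash>
# list all of your jobs since September 1st together with their final state
sacct -u $USER -S 0901 --format=JobID,JobName,State,Elapsed,MaxRSS
</code>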

====== Controlling Jobs ======

^ To ... job(s) ^ Command ^
| cancel one | ''%%scancel <jobid>%%'' |
| cancel all | ''%%scancel -u <username>%%'' |
| cancel all pending | ''%%scancel -u <username> -t PENDING%%'' |
| cancel one or more by name | ''%%scancel --name <myJobName>%%'' |
| pause one | ''%%scontrol hold <jobid>%%'' |
| release a held one | ''%%scontrol release <jobid>%%'' |
| requeue one | ''%%scontrol requeue <jobid>%%'' |

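The ''scancel'' options can be combined. A small sketch (the partition name is a placeholder) that cancels only your //pending// jobs in a given partition and leaves running ones untouched:

<code bash>
# cancel only the pending jobs of the current user in one partition
scancel -u $USER -t PENDING -p <partition>
</code>
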
====== Modifying Pending Jobs ======

Sometimes ''%%squeue --start%%'' might indicate a wrong requirement specification, e.g. ''BadConstraints''. In this case you can figure out the mismatch with ''scontrol show job <jobid>'' (which might require some experience). Wrong requirements can be corrected as follows:

^ To correct a job's ^ Command ^
| memory requirement (per node) | ''%%scontrol update job <jobid> MinMemoryNode=<mem in MB>%%'' |
| memory requirement (per CPU) | ''%%scontrol update job <jobid> MinMemoryCPU=<mem in MB>%%'' |
| number of requested CPUs | ''%%scontrol update job <jobid> NumCPUs=<number>%%'' |

For more information see ''man scontrol''.
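
A sketch of the complete workflow, assuming a pending job ''1234567'' (a placeholder) whose per-CPU memory request is to be lowered to 2000 MB:

<code bash>
# check the estimated start and the reported reason
squeue --start -j 1234567
# inspect the full requirement specification
scontrol show job 1234567
# lower the requested memory per CPU to 2000 MB
scontrol update job 1234567 MinMemoryCPU=2000
</code>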

====== Pending Reasons ======

So, why do my jobs not start? SLURM may list a number of reasons for pending jobs (those labelled ''PD'' in the ''squeue'' output). Here we list some of the more frequent reasons:

^ Reason ^ Brief Explanation ^
| ''Priority'' | At first, every job gets this reason. If it has not been scheduled for a while (> several minutes), the job simply lacks the priority to start. |
| ''AssocGrpCPURunMinutesLimit'' | Indicates that the CPU-time quota of the quality of service associated with the partition is exhausted for the [[accounts|account / association in question]]. This number will recover over time. |
| ''QOSMaxJobsPerUserLimit'' | For certain partitions the number of running jobs per user is limited. |
| ''QOSMaxJobsPerAccountLimit'' | For certain partitions the number of running jobs per account is limited. |
| ''QOSGrpGRESRunMinutes'' | For certain partitions the generic resources (e.g. GPUs) are limited. See [[gpu|GPU Queues]]. |
| ''QOSGrpMemLimit'' | The requested partition is limited in the fraction of memory it may take from the cluster and this amount has been reached: jobs need to end before new ones may start. |
| ''QOSGrpCpuLimit'' | The requested partition is limited in the fraction of CPUs it may take from the cluster and this amount has been reached: jobs need to end before new ones may start. |
| ''Resources'' | While the partition may allow the resources you requested, it cannot, at the moment, provide the nodes to run on (e.g. because a memory request cannot be satisfied). |
| ''ReqNodeNotAvail'' | Simply means that no node with the required resources is available. SLURM will list //all// non-available nodes, which can be confusing. This reason is similar to ''Priority'' in that the job has to wait for a resource to be released. |

In addition, there are limits on the number of jobs a user or group (a.k.a. account) may run at a given time. More information on partitions can be found [[partitions|on their respective wiki page]].
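
To see which reason applies to your own pending jobs, the reason column of ''squeue'' can be requested explicitly. A small sketch (the format string is just one possible choice):

<code bash>
# show job ID, partition, name, state and pending reason for your jobs
squeue -u $USER -t PENDING -o "%.18i %.9P %.25j %.8T %r"
</code>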

====== Investigating Job Failures ======

===== Exceeding Resource Limits =====

Each partition limits the maximum allowed runtime of a job and provides default values for the estimated job runtime and the memory usage per core. A job should request appropriate values for these resources using the ''--time'' and ''--mem-per-cpu'' (or ''--mem'', if deviating from the [[partitions|partition defaults]]) options. A job is killed if one of these limits is exceeded. In both cases, the error file provides appropriate information:

Time limit:

<code>
(...)
slurmstepd: error: *** JOB xxxxxxx ON a0125 CANCELLED AT 2017-11-30T11:22:57 DUE TO TIME LIMIT ***
(...)
</code>

Memory limit:

<code>
(...)
slurmstepd: error: Job xxxxxxx exceeded memory limit (120000 > 115500), being killed
slurmstepd: error: Exceeded job memory limit
slurmstepd: error: *** JOB xxxxxxx ON a0543 CANCELLED AT 2017-11-30T10:21:37 ***
(...)
</code>

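If a job runs into one of these limits, the corresponding request has to be raised in the job script (within the limits of the chosen partition) before re-submitting. A sketch with placeholder values:

<code bash>
#!/bin/bash
#SBATCH --time=02:00:00       # e.g. raise the runtime limit to 2 hours
#SBATCH --mem-per-cpu=4G      # e.g. request 4 GB of memory per core
</code>
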
In addition, ''sacct'' will display an informative ''State'' string (e.g. ''TIMEOUT''). The exit code is 1 for the whole job; the exit code of a job step depends on how the job script and the application handle the situation.
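
For example (the job ID is a placeholder), the state and exit code of such a cancelled job can be checked with:

<code bash>
# for a job killed due to its time limit, State will read TIMEOUT; ExitCode is typically 1:0
sacct -j <jobid> --format=JobID,State,ExitCode,Elapsed,Timelimit
</code>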

===== Software Errors =====

The exit code of a job is captured by SLURM and saved as part of the job record. For ''sbatch'' jobs the exit code of the batch script is captured. For ''srun'' or job steps, the exit code will be the return value of the executed command. Any non-zero exit code is considered a job failure and results in a job state of FAILED. When a signal was responsible for the termination of a job or step, the signal number will also be captured and displayed after the exit code (separated by a colon).

Depending on the execution order of the commands in the batch script, it is possible that a specific command fails while the batch script still returns zero, indicating success. Consider the following simplified example (//note for non-R users//: ''sq'' does not exist without loading a library which provides it):

<code Rsplus>
var<-sq(1,1000000000)
</code>

<code bash>
#!/bin/bash

#SBATCH --job-name="A script which fails, but displays 0 as exit code"
#SBATCH --time=00:05:00
#SBATCH --mem-per-cpu=3G
#SBATCH -p short/smp
#SBATCH -A <your account>

module load lang/R/3.4.1-foss-2017a

# Put your code below this line
R --no-save --slave -f fail.r
echo "Script finished"
</code>

We submit this job:

<code bash>
$ sbatch submit_fail_r.sh
Submitted batch job 3695216
</code>

The exit code and state wrongly indicate that the job finished successfully:
<code bash>
$ sacct -j 3695216
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
3695216      A script +      short    account          1  COMPLETED      0:0
3695216.bat+      batch               account          1  COMPLETED      0:0
3695216.ext+     extern               account          1  COMPLETED      0:0
</code>

There are several solutions to this problem:

  * The //**preferred**// solution is to create genuine job steps where
<code bash>
R --no-save --slave -f fail.r
</code>
would become
<code bash>
srun R --no-save --slave -f fail.r
</code>
The output will be a lot more informative:
<code bash>
$ sacct -j 3713748
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
3713748      A script +      short    account          1  COMPLETED      0:0
3713748.bat+      batch               account          1  COMPLETED      0:0
3713748.ext+     extern               account          1  COMPLETED      0:0
3713748.0                           account          1     FAILED      1:0
</code>

  * In the case where the batch script shall handle all job steps (only sensible if confined to a single node), you can set your own exit codes:

<code bash>
R --no-save --slave -f fail.r || exit 42
</code>
which now translates into a batch script failure:
<code bash>
$ sacct -j 3714719
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
3714719      A script +      short    account          1     FAILED     42:
3714719.bat+      batch               account          1     FAILED     42:
3714719.ext+     extern               account          1  COMPLETED      0:0
</code>
  * Finally, it is possible to make the script exit on every error (e.g. with ''set -e'' in bash), as sketched below. This, however, is recommended only if you are experienced with shell scripting.
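
A minimal, hedged sketch of this last approach (resource requests omitted, the R call is the same placeholder as above):

<code bash>
#!/bin/bash
#SBATCH ...                 # resource requests as in the example above

# abort the batch script as soon as any command fails
set -e

module load lang/R/3.4.1-foss-2017a

R --no-save --slave -f fail.r
echo "Script finished"      # only reached if the R script succeeded
</code>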

<WRAP center round info 80%>
The most useful information can be derived from the application-specific output, usually written to the job log files.
</WRAP>

===== Hardware Errors =====

Alas, sometimes((For brand new and very old systems more frequently than "sometimes".)) you might experience node failures or network issues (particularly with very big jobs). In such cases, your job might get aborted with weird messages, e.g. from MPI. If you simply re-submit, SLURM will most likely schedule your new job on the very nodes where your previous job tried to compute, with the same consequence.

We try our best to detect hardware issues with scripts //prior// to the execution of a job, but sometimes a glitch passes undetected with the consequences described above.

If this happens, **please notify us**.

Also, when re-submitting you can exclude the nodes on which failed jobs ran. First, ask SLURM where your previous jobs ran:

<code bash>
$ sacct -o JOBID,EXITCODE,NODELIST
</code>

and then re-submit with the copied node list for the job(s) in question, without modifying your job script:

<code bash>
$ sbatch --exclude <nodelist> <jobscript>
</code>
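
A hedged example of the whole procedure, assuming (purely for illustration) that the failed job ran on the nodes ''a0125'' and ''a0543'' and used the job script from the section above:

<code bash>
# find out on which nodes the failed job(s) ran
$ sacct -o JOBID,EXITCODE,NODELIST

# re-submit the unchanged job script, avoiding those nodes
$ sbatch --exclude a0125,a0543 submit_fail_r.sh
</code>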