slurm_manage

====== Information on Jobs ======

^ List job(s) ... for you (or a different user) ^ Command ^
| all | ''%%squeue -u $USER%%'' |
| in <partition> | ''%%squeue -u $USER -p <partition>%%'' |
| priority | ''%%sprio -l%%'' |
| running | ''%%squeue -u $USER -t RUNNING%%'' |
| pending | ''%%squeue -u $USER -t PENDING%%'' |
| details | ''%%scontrol show jobid -dd <jobid>%%'' |
| status info | ''%%sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j <jobid> --allsteps%%'' |
| statistics on completed (per job) | ''%%sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed%%'' |
| statistics on completed (per username) | ''%%sacct -u <username> --format=JobID,JobName,MaxRSS,Elapsed%%'' |

<WRAP center round info 80%>
Completed jobs can only be seen with ''sacct''.
</WRAP>
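For scripting, ''sacct'' can also emit machine-readable, pipe-delimited output via its ''%%--parsable2%%'' option. The sketch below parses one sample record of that form with standard tools; the job data is invented for illustration:

```shell
#!/bin/sh
# Sample of what `sacct --parsable2 --format=JobID,JobName,MaxRSS,Elapsed -j <jobid>`
# might print. The values are made up for illustration only.
sample='JobID|JobName|MaxRSS|Elapsed
1234|myjob|2048K|00:10:32'

# Extract the MaxRSS column (third pipe-delimited field) of the data row.
maxrss=$(printf '%s\n' "$sample" | awk -F'|' 'NR==2 {print $3}')
echo "$maxrss"
```

Against real ''sacct'' output, the same ''awk'' filter can be applied to the command's stdout directly.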

===== Why is my job pending? =====

Now you know that your job is pending and you want to know: what does the reason mean? There can be several reasons. The most frequent are:

  * ''QOSGrpMemLimit'' - the requested partition is limited in the fraction of resources it may take from the cluster, and this amount has been reached: jobs need to end before new ones may start.
  * ''Resources'' - while the partition may allow the resources you requested, it cannot currently provide the nodes to run on (e.g. because a memory request cannot be satisfied).
  * ''Priority'' - there are sufficient resources, but you do not have sufficient priority to get your job running.
  * In addition, there are limits on the number of jobs a user or group (a.k.a. account) may run at a given time.

====== Controlling Jobs ======

^ To ... job(s) ^ Command ^
| cancel one | ''%%scancel <jobid>%%'' |
| cancel all | ''%%scancel -u <username>%%'' |
| cancel all pending | ''%%scancel -u <username> -t PENDING%%'' |
| cancel one or more by name | ''%%scancel --name <myJobName>%%'' |
| hold one | ''%%scontrol hold <jobid>%%'' |
| release one | ''%%scontrol release <jobid>%%'' |
| requeue one | ''%%scontrol requeue <jobid>%%'' |
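Since ''scancel'' acts immediately, it can be worth assembling the call first and inspecting it before running it. A minimal sketch; the job name and username below are placeholders:

```shell
#!/bin/sh
# Cautious pattern: build the scancel command, review it, then run it.
# "myJobName" and "alice" are placeholder values.
jobname="myJobName"
user="alice"

cmd="scancel --name $jobname -t PENDING -u $user"
echo "$cmd"      # review the exact command first ...
# $cmd           # ... then uncomment this line to actually cancel
```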

====== Modifying Pending Jobs ======

Sometimes ''%%squeue --start%%'' indicates a wrong requirement specification, e.g. ''BadConstraints''. In this case a user can track down the mismatch with ''scontrol show job <jobid>'' (which may require some experience). Wrong requirements can be corrected like this:

^ To correct a job's ^ Command ^
| memory requirement (per node) | ''%%scontrol update job <jobid> MinMemoryNode=<mem in MB>%%'' |
| memory requirement (per CPU) | ''%%scontrol update job <jobid> MinMemoryCPU=<mem in MB>%%'' |
| number of requested CPUs | ''%%scontrol update job <jobid> NumCPUs=<number>%%'' |

For more information see ''man scontrol''.
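Note that per-node and per-CPU memory are two views of the same request, so when switching from ''MinMemoryNode'' to ''MinMemoryCPU'' the value has to be divided by the CPU count per node. A minimal sketch with invented numbers:

```shell
#!/bin/sh
# Convert a per-node memory request into the per-CPU value expected by
# `scontrol update job <jobid> MinMemoryCPU=<mem in MB>`.
# The numbers below are illustrative only.
mem_per_node_mb=4800
cpus_per_node=16

mem_per_cpu_mb=$(( mem_per_node_mb / cpus_per_node ))
echo "MinMemoryCPU=${mem_per_cpu_mb}"
```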

====== Pending Reasons ======

So, why do my jobs not start? SLURM may list a number of reasons for pending jobs (those labelled ''PD'' in the ''squeue'' output).

^ Reason ^ Brief Explanation ^
| ''Priority'' | At first, every job gets this reason. If it is not scheduled for a while (> several minutes), the job simply lacks the priority to start. |
| ''AssocGrpCPURunMinutesLimit'' | Indicates that the quality-of-service limit on CPU time associated with the partition has been reached. |
| ''QOSMaxJobsPerUserLimit'' | For certain partitions the number of running jobs per user is limited. |
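The reason appears in ''squeue'''s ''REASON'' column; with the format codes ''%i'' (job id) and ''%r'' (reason) the output can be restricted to exactly these fields, e.g. ''%%squeue -u $USER -t PENDING -o "%i %r"%%''. The snippet below extracts the reason from one such output line; the line itself is invented:

```shell
#!/bin/sh
# One sample line as `squeue -u $USER -t PENDING -o "%i %r"` might print it;
# job id and reason are made up for illustration.
line='4711 QOSGrpMemLimit'

# The second whitespace-separated field is the pending reason.
reason=$(printf '%s\n' "$line" | awk '{print $2}')
echo "$reason"
```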
  
Last modified: 2017/08/11 06:02 by meesters