====== Information on Jobs ======

^ List job(s) ... ^ Command ^
| for you (or a different user) | ''%%squeue -u $USER%%'' |
| in <partition> | ''%%squeue -u $USER -p <partition>%%'' |
| priority | ''%%sprio -l%%'' |
| running | ''%%squeue -u $USER -t RUNNING%%'' |
| pending | ''%%squeue -u $USER -t PENDING%%'' |
| details | ''%%scontrol show jobid -dd <jobid>%%'' |
| status info | ''%%sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j <jobid> --allsteps%%'' |
| statistics on completed (per job) | ''%%sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed%%'' |
| statistics on completed (per username) | ''%%sacct -u <username> --format=JobID,JobName,MaxRSS,Elapsed%%'' |

<WRAP center round info 80%>
You can see completed jobs only with ''sacct''.
</WRAP>

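As a quick illustration, the following sketch (using a hypothetical job id ''12345'') monitors a running job with ''sstat'' and queries its statistics with ''sacct'' once it has completed:

<code bash>
# While the job runs: per-step CPU and memory usage
# (12345 is a hypothetical job id -- replace it with your own)
sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j 12345 --allsteps

# After completion: statistics are only available via the accounting database
sacct -j 12345 --format=JobID,JobName,MaxRSS,Elapsed
</code>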

====== Controlling Jobs ======

^ To ... job(s) ^ Command ^
| cancel one | ''%%scancel <jobid>%%'' |
| cancel all | ''%%scancel -u <username>%%'' |
| cancel all the pending | ''%%scancel -t PENDING -u <username>%%'' |
| cancel one or more by name | ''%%scancel --name <myJobName>%%'' |
| pause one | ''%%scontrol hold <jobid>%%'' |
| release a held one | ''%%scontrol release <jobid>%%'' |
| requeue one | ''%%scontrol requeue <jobid>%%'' |
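For instance, a job can be put on hold while fixing its input and released afterwards; a minimal sketch with the hypothetical job id ''54321'':

<code bash>
# Keep the pending job from starting (it remains in the queue)
scontrol hold 54321

# ... fix whatever needed fixing, then let the scheduler consider it again
scontrol release 54321
</code>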

====== Modifying Pending Jobs ======

Sometimes ''%%squeue --start%%'' indicates a faulty requirement specification, e.g. ''BadConstraints''. In this case the mismatch can be tracked down with ''scontrol show job <jobid>'' (which may require some experience). Wrong requirements can then be corrected as follows:

^ To correct a job's ^ Command ^
| memory requirement (per node) | ''%%scontrol update JobId=<jobid> MinMemoryNode=<mem in MB>%%'' |
| memory requirement (per CPU) | ''%%scontrol update JobId=<jobid> MinMemoryCPU=<mem in MB>%%'' |
| number of requested CPUs | ''%%scontrol update JobId=<jobid> NumCPUs=<number>%%'' |

For more information see ''man scontrol''.
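A typical correction might look like the following sketch; the job id ''12345'' and the memory value are purely illustrative:

<code bash>
# Inspect the pending job's requirements (12345 is a hypothetical job id)
scontrol show job 12345

# Suppose the job requested more memory per node than any node offers:
# lower the request to a value the nodes can satisfy (illustrative number)
scontrol update JobId=12345 MinMemoryNode=115000
</code>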

====== Pending Reasons ======

So, why do my jobs not start? SLURM may list a number of reasons for pending jobs (those labelled ''PD'' in the ''squeue'' output).

^ Reason ^ Brief Explanation ^
| ''Priority'' | Initially, every job gets this reason. If it persists for more than several minutes, the job simply lacks the priority to start. |
| ''AssocGrpCPURunMinutesLimit'' | The CPU-time quota of the quality of service associated with the partition is exhausted for the [[accounts|account / association in question]]. This budget recovers over time. |
| ''QOSMaxJobsPerUserLimit'' | For certain partitions the number of running jobs per user is limited. |
| ''QOSGrpMemLimit'' | The requested partition may only take a limited fraction of the cluster's resources, and this limit has been reached: running jobs need to end before new ones may start. |
| ''Resources'' | While the partition would allow the requested resources, it cannot currently provide the nodes to run on (e.g. because a memory request cannot be satisfied). |

In addition, there are limitations on the number of jobs a user or group (a.k.a. account) may run at a given time.
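To display the pending reason for your own jobs directly, the ''%r'' output field of ''squeue'' can be used, e.g.:

<code bash>
# Job id, partition, name, state and pending reason for your pending jobs
# (see 'man squeue' for all available output fields)
squeue -u $USER -t PENDING -o "%.10i %.12P %.20j %.2t %r"
</code>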