====== Information on Jobs ======

^ List job(s) ... for you (or a different user) ^ Command ^
| all current | ''squeue -u <username>'' |
| in <partition> | ''squeue -u <username> -p <partition>'' |
| priority | ''sprio -l'' |
| running | ''squeue -u <username> -t RUNNING'' |
| pending | ''squeue -u <username> -t PENDING'' |
| details | ''scontrol show jobid -dd <jobid>'' |
| status info | ''sstat -j <jobid> --allsteps'' |
| statistics on completed (per job) | ''sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed,State,ExitCode'' |
| statistics on completed (per username) | ''sacct -u <username> --format=JobID,JobName,MaxRSS,Elapsed,State,ExitCode'' |
| summary statistics on completed job | ''seff <jobid>'' |

<WRAP center round info 80%>
You can see completed jobs only with ''sacct''; ''squeue'' only shows jobs which are still queued or running.
</WRAP>

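For example, a quick overview of recently finished jobs can be obtained with ''sacct''; the field list below is one reasonable choice, not the only one:

<code bash>
# Jobs of the current user started since yesterday; sacct reads the
# accounting database, so it also covers jobs no longer shown by squeue.
sacct -u $USER --starttime=$(date -d yesterday +%F) \
      --format=JobID,JobName,State,Elapsed,MaxRSS,ExitCode
</code>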
====== Controlling Jobs ======

^ To ... job(s) ^ Command ^
| cancel one | ''scancel <jobid>'' |
| cancel all | ''scancel -u <username>'' |
| cancel all the pending | ''scancel -t PENDING -u <username>'' |
| cancel one or more by name | ''scancel --name <jobname>'' |
| pause one | ''scontrol hold <jobid>'' |
| resume one | ''scontrol release <jobid>'' |
| requeue one | ''scontrol requeue <jobid>'' |

====== Modifying Pending Jobs ======

Sometimes a job is submitted with a resource request which cannot be satisfied or is simply wrong (e.g. too little memory). As long as the job is still pending, such requests can be corrected with ''scontrol update'' instead of cancelling and resubmitting.

^ To correct a job's ^ Command ^
| memory requirement (per node) | ''scontrol update job <jobid> MinMemoryNode=<mem in MB>'' |
| memory requirement (per CPU) | ''scontrol update job <jobid> MinMemoryCPU=<mem in MB>'' |
| number of requested CPUs | ''scontrol update job <jobid> NumCPUs=<number>'' |

For more information see ''man scontrol''.

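As a concrete illustration (the job ID 12345 is hypothetical), lowering the per-CPU memory request of a pending job to 2000 MB:

<code bash>
# Only works while the job is still pending, not once it runs.
scontrol update job 12345 MinMemoryCPU=2000
</code>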
====== Pending Reasons ======

So, why do my jobs not start? SLURM may list a number of reasons for pending jobs (those labelled ''PD'' in the ''squeue'' output). Among the most common are:

^ Reason ^ Brief Explanation ^
| ''Priority'' | One or more higher-priority jobs are queued ahead of this one. |
| ''Resources'' | The job is waiting for sufficient resources to become available. |
| ''Dependency'' | The job is waiting for a dependent job to complete. |
| ''ReqNodeNotAvail'' | A node explicitly required by the job is currently not available (e.g. down or reserved). |
| ''PartitionTimeLimit'' | The job's requested runtime exceeds the current time limit of the partition. |
| ''PartitionNodeLimit'' | The job requests more nodes than the partition allows. |
| ''AssociationJobLimit'' | The job's association (account) has reached its maximum job count. |

And then there are limitations due to the number of jobs a user or group (a.k.a. account) may run at a given time. More information on partitions can be found [[partitions|on their respective wiki site]].

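To see the pending reason for your own jobs directly, ''squeue'' can be asked for the reason column explicitly; the format string below is one possible choice:

<code bash>
# %i = job ID, %j = job name, %T = state, %r = pending reason
squeue -u $USER -t PENDING -o "%.10i %.20j %.10T %r"
</code>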
====== Investigating Job Failures ======

===== Exceeding Resource Limits =====

Each partition limits the maximal allowed runtime of a job and provides default values for the estimated job runtime and memory usage per core. A job should request appropriate values for those resources (e.g. with ''--time'' and ''--mem-per-cpu''), because SLURM kills jobs which exceed their requested limits:

Time limit:

<code>
(...)
slurmstepd: error: *** JOB xxxxxxx ON a0125 CANCELLED AT 2017-11-30T11:xx:xx DUE TO TIME LIMIT ***
(...)
</code>

Memory limit:

<code>
(...)
slurmstepd: error: Job xxxxxxx exceeded memory limit (120000 > 115500), being killed
slurmstepd: error: Exceeded job memory limit
slurmstepd: error: *** JOB xxxxxxx ON a0543 CANCELLED AT 2017-11-30T10:xx:xx ***
(...)
</code>

In addition, ''sacct'' can show the actual resource consumption of a completed job (e.g. the ''MaxRSS'' and ''Elapsed'' fields), which helps to choose better requests when resubmitting.

===== Software Errors =====

The exit code of a job is captured by SLURM and saved as part of the job record. For ''sbatch'' jobs it is the exit code of the batch script itself which is captured.

Depending on the execution order of the commands in the batch script, it is possible that a specific command fails but the batch script will return zero, indicating success. Consider the following simplified example (//note for non-R users//: the object referenced in ''fail.r'' is undefined, hence the script aborts with an error):

<code Rsplus>
var <- undefined_object   # 'undefined_object' does not exist, R aborts here
</code>

<code bash>
#!/bin/bash

#SBATCH --job-name="fail"
#SBATCH --time=00:05:00
#SBATCH --mem-per-cpu=3G
#SBATCH -p short/smp
#SBATCH -A <your account>

module load lang/R

# Put your code below this line
R --no-save --slave -f fail.r
echo "Script finished."
</code>

We submit this job:

<code bash>
$ sbatch submit_fail_r.sh
Submitted batch job 3695216
</code>

The exit code and state wrongly indicate that the job finished successfully:

<code bash>
$ sacct -j 3695216
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
3695216            fail      short  <account>          1  COMPLETED      0:0
3695216.bat+      batch             <account>          1  COMPLETED      0:0
3695216.ext+     extern             <account>          1  COMPLETED      0:0
</code>

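This masking is plain shell behaviour and can be reproduced without SLURM: a script's exit status is simply that of its //last// command. A minimal sketch, with ''false'' standing in for the failing R call:

<code bash>
#!/bin/bash
false                    # fails with exit status 1 (stand-in for the R call)
echo "Script finished."  # succeeds, so the script as a whole exits with 0
</code>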
There are several solutions to this problem:

  * The //recommended// way is to launch every command which constitutes a job step with ''srun''. The line
<code bash>
R --no-save --slave -f fail.r
</code>
would become
<code bash>
srun R --no-save --slave -f fail.r
</code>
The output will be a lot more informative, since the failing step is now recorded:
<code bash>
$ sacct -j 3713748
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
3713748            fail      short  <account>          1  COMPLETED      0:0
3713748.bat+      batch             <account>          1  COMPLETED      0:0
3713748.ext+     extern             <account>          1  COMPLETED      0:0
3713748.0             R             <account>          1     FAILED      1:0
</code>

  * In the case where the batch script shall handle all job steps itself (only sensible if confined to a single node), you can set your own error codes:
<code bash>
R --no-save --slave -f fail.r || exit 42
</code>
which now translates into a batch script failure:
<code bash>
$ sacct -j 3714719
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
3714719            fail      short  <account>          1     FAILED     42:0
3714719.bat+      batch             <account>          1     FAILED     42:0
3714719.ext+     extern             <account>          1  COMPLETED      0:0
</code>
  * Finally, it is possible to trigger a script exit with every error (e.g. in bash with ''set -e'' at the top of the script); see the documentation of your shell for details.

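The ''set -e'' behaviour can also be sketched without SLURM: the script below aborts at the first failing command and exits with its status, so SLURM would record the failure:

<code bash>
#!/bin/bash
set -e                  # abort the script at the first failing command
false                   # stand-in for the failing application call
echo "never reached"    # not executed: the script already exited with status 1
</code>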
<WRAP center round info 80%>
The most useful information can usually be derived from the application-specific output, written to the job log files.
</WRAP>

===== Hardware Errors =====

Alas, sometimes((For brand-new and very old systems more frequently than "sometimes".)) hardware fails, and a running job may crash as a consequence.

We try our best to detect hardware issues with scripts //prior// to the execution of a job, but sometimes a glitch passes undetected, with the consequences described above.

If this happens, **please notify us**.

Also, when resubmitting, you can exclude the nodes on which the failed jobs ran. First ask SLURM where your previous jobs ran:

<code bash>
$ sacct -o JobID,NodeList -j <jobid>
</code>

and then resubmit with the copied node list for the job(s) in question -- without modifying your jobscript:

<code bash>
$ sbatch --exclude=<nodelist> <jobscript>
</code>