Job Failures

How to investigate job errors

Exceeding Resource Limits

Each partition limits the maximum allowed runtime of a job and provides default values for the estimated job runtime and the memory usage per core. A job should request appropriate values for these resources using the --time and --mem-per-cpu (or --mem) options whenever it deviates from the partition defaults. A job is killed if one of these limits is exceeded. In both cases, the error file provides the relevant information:
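
For example, a job expected to run for at most two hours and to need about 4 GB of memory per core could request (the values are purely illustrative):

#SBATCH --time=02:00:00
#SBATCH --mem-per-cpu=4G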

Time limit:

(...)
slurmstepd: error: *** JOB xxxxxxx ON a0125 CANCELLED AT 2017-11-30T11:22:57 DUE TO TIME LIMIT ***
(...)

Memory limit:

(...)
slurmstepd: error: Job xxxxxxx exceeded memory limit (120000 > 115500), being killed
slurmstepd: error: Exceeded job memory limit
slurmstepd: error: *** JOB xxxxxxx ON a0543 CANCELLED AT 2017-11-30T10:21:37 ***
(...)

In addition, sacct will display an informative State string. The exit code is 1 for the whole job; the exit code of a job step depends on how the jobscript and the application handle this situation.
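
To check how and why a job ended, you can query the accounting database with sacct; the selection of output fields below is just one possible choice:

sacct -j <jobid> -o JobID,JobName,State,ExitCode,Elapsed,Timelimit,MaxRSS,ReqMem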

Software Errors

The exit code of a job is captured by SLURM and saved as part of the job record. For sbatch jobs, the exit code of the batch script is captured. For srun commands or job steps, the exit code is the return value of the executed command. Any non-zero exit code is considered a job failure and results in a job state of FAILED. When a signal was responsible for the termination of a job or step, the signal number is also captured and displayed after the exit code, separated by a colon.

Depending on the execution order of the commands in the batch script, it is possible that a specific command fails but the batch script still returns 0, indicating success. Consider the following simplified example (note for non-R users: the function sq does not exist unless a library providing it has been loaded):

# fail.r file
var<-sq(1,1000000000)
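
# submit_fail_r.sh file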
#!/bin/bash

#SBATCH --job-name="A script which fails, but displays 0 as exit code"
#SBATCH --time=00:05:00
#SBATCH --mem-per-cpu=3G
#SBATCH -p short/smp
#SBATCH -A <your account>

module load lang/R/3.4.1-foss-2017a

# Put your code below this line
R --no-save --slave -f fail.r
echo "Script finished"

We submit this job:

sbatch submit_fail_r.sh
Submitted batch job 3695216

The exit code and state wrongly indicate that the job finished successfully:

sacct -j 3695216
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
3695216      A script +      short    account          1  COMPLETED      0:0
3695216.bat+      batch               account          1  COMPLETED      0:0
3695216.ext+     extern               account          1  COMPLETED      0:0

Since the exit code of the batch script is the exit code of its last command (here the echo, which succeeds), SLURM sees no failure. There are several solutions to this problem:

  • The preferred solution is to create genuine job steps where
    R --no-save --slave -f fail.r
    
    would become
    srun R --no-save --slave -f fail.r
    
    The output will be a lot more informative:
    sacct -j 3713748
         JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
    ------------ ---------- ---------- ---------- ---------- ---------- --------
    3713748      A script +      short    account          1  COMPLETED      0:0
    3713748.bat+      batch               account          1  COMPLETED      0:0
    3713748.ext+     extern               account          1  COMPLETED      0:0
    3713748.0             R               account          1     FAILED      1:0
    
  • In the case where the batch script itself shall handle all job steps (only sensible if the job is confined to a single node), you can set your own exit codes:
    R --no-save --slave -f fail.r || exit 42
    
    which now translates into a batch script failure:
    sacct -j 3714719
         JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
    ------------ ---------- ---------- ---------- ---------- ---------- --------
    3714719      A script +      short    account          1     FAILED     42:0
    3714719.bat+      batch               account          1     FAILED     42:0
    3714719.ext+     extern               account          1  COMPLETED      0:0
    
  • Finally, it is possible to make the script exit on every error (e.g. in bash with set -e), as sketched below this list. This, however, is only recommended if you are comfortable with shell scripting.
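
A minimal sketch of this approach, based on the failing example above (the job name is illustrative, everything else reuses the directives shown earlier):

#!/bin/bash

#SBATCH --job-name="abort at the first failing command"
#SBATCH --time=00:05:00
#SBATCH --mem-per-cpu=3G
#SBATCH -p short/smp
#SBATCH -A <your account>

# Abort the batch script as soon as any command returns a non-zero exit code.
set -e

module load lang/R/3.4.1-foss-2017a

# Put your code below this line
R --no-save --slave -f fail.r
echo "Script finished"    # not reached if the R command fails
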
The most useful information can be derived from the application-specific output, usually written to the job log files.
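
By default, i.e. unless you redirect it with the --output/--error options, sbatch writes this output to a file named slurm-<jobid>.out in the directory from which the job was submitted. For the example above you would thus inspect:

less slurm-3695216.out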

Hardware Errors

Sometimes you might experience node failures or network issues (particularly with very big jobs). In such cases, your job might get aborted with strange messages, e.g. from MPI. If you simply re-submit, SLURM will very likely schedule your new job on the same nodes on which your previous job tried to compute, with the same consequence.

We try our best to detect hardware issues with scripts prior to the execution of a job, but sometimes a glitch passes undetected, with the consequences described above.

If this happens, please notify us.

Also, when resubmitting, you can exclude the nodes on which the failed jobs ran. First, ask SLURM where your previous jobs ran:

sacct -o JOBID,EXITCODE,NODELIST

and then resubmit with the copied nodelist of the job(s) in question, without modifying your jobscript:

sbatch --exclude <nodelist> <jobscript>
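
For example, if the failed job ran on the nodes a0125 and a0543 (the node names appearing in the error messages above), the resubmission could look like this:

sbatch --exclude=a0125,a0543 submit_fail_r.sh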