Workflow Organization

How to organize your workflow

Job Dependency

Job dependencies are used to defer the start of a job until the specified dependencies have been satisfied. They are specified with the --dependency option to sbatch in the format

sbatch --dependency=<type:job_id[:job_id] [,type:job_id[:job_id] ]> ...

Dependency types are:

Condition                      Explanation
after:jobid[:jobid…]           job can begin after the specified jobs have started
afterany:jobid[:jobid…]        job can begin after the specified jobs have terminated
afternotok:jobid[:jobid…]      job can begin after the specified jobs have failed
afterok:jobid[:jobid…]         job can begin after the specified jobs have run to completion with an exit code of zero (see the user guide for caveats)
singleton                      job can begin after all previously launched jobs with the same name and user have ended; this is useful to collate results of an array job or similar

To set up pipelines using job dependencies, the most useful types are afterany, afterok, and singleton. The simplest way is to use the afterok dependency for single consecutive jobs. For example:

sbatch <further arguments> job1.sh
  Submitted batch job 253507
sbatch <further arguments> --dependency=afterok:253507 job2.sh

Now when job1 ends with an exit code of zero, job2 will become eligible for scheduling. However, if job1 fails (ends with a non-zero exit code), job2 will not be scheduled but will remain in the queue and needs to be canceled manually.
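
Such a pending job can be removed with scancel, e.g. (job ID placeholder):

scancel <jobid>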

As an alternative, the afterany dependency can be used, and checking for successful execution of the prerequisites can be done in the jobscript itself.
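
A minimal sketch of such an in-script check (hypothetical convention: the predecessor job ID is passed to the jobscript as its first argument, and sacct is used to query its final state):

#!/bin/bash
#SBATCH <further options>

PRED_JOBID=$1   # predecessor job ID, handed over at submission time (hypothetical convention)

# COMPLETED means the predecessor ended with an exit code of zero
STATE=$(sacct -j "$PRED_JOBID" -X --noheader --format=State | awk '{print $1}')
if [[ "$STATE" != "COMPLETED" ]]; then
    echo "Predecessor job $PRED_JOBID ended in state $STATE - aborting." >&2
    exit 1
fi

# ... actual work of the job starts here ...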

Example in Bash

The ${jid1##* } constructs below are needed because sbatch returns a statement like Submitted batch job <JOBID>, and only the trailing job ID is wanted for the --dependency specification.

#!/bin/bash

# first job - no dependencies
jid1=$(sbatch -p short -A <your account> --mem=12g --cpus-per-task=4 job1.script)

# multiple jobs can depend on a single job
jid2=$(sbatch -p short -A <your account> --dependency=afterany:${jid1##* } --mem=20g job2.script)
jid3=$(sbatch -p short -A <your account> --dependency=afterany:${jid1##* } --mem=20g job3.script)

# a single job can depend on multiple jobs
jid4=$(sbatch -p short -A <your account> --dependency=afterany:${jid2##* }:${jid3##* } job4.script)

# a single job can depend on an array job
# it will start executing when all array-tasks have finished
jid5=$(sbatch -p short -A <your account> --dependency=afterany:${jid4##* } job5.script)

# a single job can depend on all jobs by the same user with the same name
jid6=$(sbatch -p short -A <your account> --dependency=afterany:${jid5##* } --job-name=dtest job6.script)
jid7=$(sbatch -p short -A <your account> --dependency=afterany:${jid5##* } --job-name=dtest job7.script)
sbatch -p short -A <your account> --dependency=singleton --job-name=dtest job8.script

# show dependencies in squeue output:
squeue -u $USER -o "%.8A %.4C %.10m %.20E"

Unsatisfiable Dependencies

Sometimes a dependency condition cannot be satisfied, for example when a predecessor job is required to finish successfully (afterok:<jobid>) but fails.

In such a case, --kill-on-invalid-dep=<yes|no> can be specified to sbatch to control whether the dependent job is removed from the queue or remains pending without ever becoming eligible to run.
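
For example, to have the dependent job removed automatically instead of lingering in the queue:

sbatch --dependency=afterok:253507 --kill-on-invalid-dep=yes job2.sh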

Job Arrays

Purpose

According to the Slurm Job Array Documentation, “job arrays offer a mechanism for submitting and managing collections of similar jobs quickly and easily.” In general, job arrays are useful for applying the same processing routine to a collection of multiple input data files. Job arrays offer a very simple way to submit a large number of independent processing jobs.

We strongly recommend using parallel processing within your jobs in addition, and treating job arrays as a convenience feature rather than a substitute for performance optimization.

Maximal Job Array Size

You can look up the maximal job array size using the following command:

scontrol show config | grep MaxArraySize

The reason we do not document this value is that it is subject to change.

Why is there an array size limit at all?

If the limit were absent or huge, some users would see this as an invitation to submit a large number of otherwise unoptimized jobs. The idea behind job arrays is to ease workflows, particularly the submission of a larger number of jobs. However, the motivation to pool jobs and to optimize them still applies.

Job Arrays in Slurm

By submitting a single job array sbatch script, a specified number of “array-tasks” will be created based on this “master” sbatch script. An example job array script is given below:

#!/bin/bash

#SBATCH --job-name=arrayJob
#SBATCH --output=arrayJob_%A_%a.out # redirecting stdout
#SBATCH --error=arrayJob_%A_%a.err  # redirecting stderr
#SBATCH --array=1-16
#SBATCH --time=01:00:00
#SBATCH --partition=parallel        
#SBATCH --ntasks=1                  # number of tasks per array job
#SBATCH --mem-per-cpu=4000


######################
# Begin work section #
######################

# Print this sub-job's task ID
echo "My SLURM_ARRAY_TASK_ID: " $SLURM_ARRAY_TASK_ID

# Do some work based on the SLURM_ARRAY_TASK_ID
# For example:
# ./my_process $SLURM_ARRAY_TASK_ID
#
# where my_process is your executable

In the above example, the --array=1-16 option will cause 16 array-tasks (numbered 1, 2, …, 16) to be spawned when this master job script is submitted. The array-tasks are simply copies of this master script that are automatically submitted to the scheduler on your behalf. However, in each array-task an environment variable called SLURM_ARRAY_TASK_ID will be set to a unique value (in this example, a number in the range 1, 2, …, 16). In your script, you can use this value to select, for example, a specific data file that each array-task will be responsible for processing.
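
For instance, a common pattern (a sketch; my_process and the data_<NNN>.in naming are placeholders analogous to the examples in this document) is to derive the input file name from the task ID:

# select the input file for this array-task, e.g. data_001.in ... data_016.in
INPUT=$(printf "data_%03d.in" "$SLURM_ARRAY_TASK_ID")
./my_process "$INPUT" > "result_${SLURM_ARRAY_TASK_ID}.out"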

Job array indices can be specified in several ways. For example:

#A job array with index values between 0 and 31:
#SBATCH --array=0-31

#A job array with index values of 1, 2, 5, 19, 27:
#SBATCH --array=1,2,5,19,27

#A job array with index values between 1 and 7 with a step size of 2 (i.e. 1, 3, 5, 7):
#SBATCH --array=1-7:2

The %A_%a construct in the output and error file names is used to generate unique output and error files based on the master job ID (%A) and the array-task ID (%a). In this fashion, each array-task will be able to write to its own output and error file.

Limiting the number of concurrent jobs of an array

It is possible to limit the number of concurrently running array-tasks of an array, e.g. to reduce the I/O load caused by a single array, with this syntax:

#SBATCH --array=1-1000%50

where a limit of 50 concurrently running array-tasks would be in place.
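
To our knowledge, the throttle of an already submitted array can also be adjusted afterwards with scontrol (job ID is a placeholder):

scontrol update JobId=<jobid> ArrayTaskThrottle=50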

Multiprog for “Uneven” Arrays

The --multi-prog option of srun allows you to assign a different executable and arguments to each parallel task of your job. More information can be found in the sections on script based parallelization and SLURM multiprog below.

Script Based Parallelization

There are some use cases where you would simply want to request a full cluster node from Slurm and then run many small tasks on it (e.g. many more than 64 tasks, each taking only a fraction of the total job runtime). In that case you need some local scheduling on the node to ensure proper utilization of all cores.

To accomplish this, we suggest you use the GNU Parallel program. The program is installed to /cluster/bin, but you can also load the modulefile software/gnu_parallel, which additionally gives you access to its man page.

For more documentation on how to use GNU Parallel, please read man parallel and man parallel_tutorial, where you’ll find a great number of examples and explanations.
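
As a quick illustration of the basic syntax (the *.log files are hypothetical), the following compresses all log files in the current directory, running at most 4 gzip processes at a time:

parallel -j 4 gzip ::: *.log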

MOGON Usage Example

Let’s say we have some input data files that contain differing parameters that are going to be processed independently by our program:

ls data_*.in
  data_001.in
  data_002.in
  [...]
  data_149.in
  data_150.in
cat data_001.in
  1 2 3 4 5
  6 7 8 9 0

Now of course we could submit 150 jobs to Slurm, or we could use one job that processes the files one after another, but the most elegant way is to submit one job for 64 cores (e.g. a whole node on Mogon) and process the files in parallel. This is especially convenient since we can then use the nodeshort queue, which has better scheduling characteristics than short (while both show better scheduling compared to their long counterparts):

#!/bin/bash

#SBATCH --job-name=demo_gnu_parallel
#SBATCH --output=res_gnu_parallel.txt
#SBATCH --ntasks=4
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100
#SBATCH -p smp
#SBATCH -A <your account>

# will load the most recent version of `parallel`
module load tools/parallel

# Store working directory to be safe
SAVEDPWD=$(pwd)
# set jobdir
export JOBDIR=/localscratch/$SLURM_JOB_ID

# suppose we want to process 150 data files; for the purpose of this example we create them first:
for ((i=1; i <= 150; i++)); do
    fname="data_$(printf "%03d" $i).in"
    echo {0..4} >  "$fname"
    echo {5..9} >> "$fname"
done

# First, we copy the input data files and the program to the local filesystem of our node
# (we pretend it is useful - an actual use case are programs with random I/O) on those files
cp "${SAVEDPWD}"/data_*.in $JOBDIR

# Change directory to jobdir
cd $JOBDIR

# we could set the number of threads for the program to use like this:
# export OMP_NUM_THREADS=4
# but in this case the program is not threaded

# -t enables verbose output to stderr
# We could also set -j $((SLURM_NPROCS / OMP_NUM_THREADS)) to be more dynamic (if OMP_NUM_THREADS were set)
# The --delay parameter should be used to distribute I/O load at the beginning of program execution by
#   introducing a delay of 1 second before starting the next task
# --progress will output the current progress of the parallel task execution
# {} will be replaced by each filename
# {#} will be replaced by the consecutive job number
# Both variants will have equal results:
#parallel -t -j 16 --delay 1 --progress "./program {/} > {/.}.out" ::: data_*.in
find . -name 'data_*.in' | parallel -t -j $SLURM_NPROCS  "wc {/} > {/.}.out"

# See the GNU Parallel documentation for more examples and explanation

# Now capture exit status code, parallel will have set it to the number of failed tasks
STATUS=$?

# Copy output data back to the previous working directory
cp $JOBDIR/data_*.out $SAVEDPWD/

exit $STATUS

sbatch parallel_example_script.sh

After this job has run, we should have the results/output data (in this case, it’s just the output of wc, for demonstration):

ls data_*.out
  data_001.out
  data_002.out
  [...]
  data_149.out
  data_150.out
cat data_001.out
  2 10 20 data_001.in

Multithreaded Programs

Let’s further assume that our program can work in parallel itself using OpenMP. We determined that OMP_NUM_THREADS=8 is the best amount of parallel work for one set of input data. This means we can launch 64/8 = 8 processes with GNU Parallel on the one node we have:

#!/bin/bash
#SBATCH --job-name=demo_gnu_parallel
#SBATCH --output=res_gnu_parallel.txt
#SBATCH --cpus-per-task=8
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100
#SBATCH -p smp
#SBATCH -A <your account>

# will load the most recent version of `parallel`
module load tools/parallel

# Store working directory to be safe
SAVEDPWD=$(pwd)

JOBDIR=/localscratch/$SLURM_JOBID
RAMDISK=$JOBDIR/ramdisk

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# -t enables verbose output to stderr
# We could also set -j $((SLURM_CPUS_ON_NODE / OMP_NUM_THREADS)) to be more dynamic
# The --delay parameter is used to distribute I/O load at the beginning of program execution by
#   introducing a delay of 1 second before starting the next task
# --progress will output the current progress of the parallel task execution
# {} will be replaced by each filename
# {#} will be replaced by the consecutive job number
# Both variants will have equal results:
#parallel -t -j 16 --delay 1 --progress "./program {/} > {/.}.out" ::: data_*.in
find . -name 'data_*.in' | parallel -t -j 8 --delay 1 --progress "./program {/} > {/.}.out"
# See the GNU Parallel documentation for more examples and explanation

# Now capture exit status code, parallel will have set it to the number of failed tasks
STATUS=$?

exit $STATUS

Running on several hosts

We do not recommend supplying a host list to GNU Parallel with the -S option, as GNU Parallel then attempts to ssh to the respective nodes (including the master host) and therefore loses the environment. You can script around this, but you will run into quotation hell. Instead, let srun place the tasks within the allocation, as in the following example:

#!/bin/bash
#SBATCH -J <your meaningful job name>
#SBATCH -A <your account>
#SBATCH -p parallel  
#SBATCH --nodes=3 # appropriate number of Nodes
#SBATCH -n 24    # example value for Mogon I, see below
#SBATCH -t 300
#SBATCH -c 8 # we assume an application which scales to 8 threads, but
             # -c / --cpus-per-task can be omitted (default is 1)
             # or set to a different value.
#SBATCH -o <your logfile prefix>_%j.log

# adjust / overwrite these two commands to enhance readability & overview
# parameterize srun
srun="srun -N1 -n1 -c $SLURM_CPUS_PER_TASK --jobid $SLURM_JOBID --cpu_bind=q --mem-per-cpu=$((SLURM_MEM_PER_NODE / SLURM_NTASKS))"
# parameterize parallel
parallel="parallel -j $SLURM_NTASKS --no-notice"

# your preprocessing goes here

# start the run with GNU parallel
$parallel $srun <command> ::: <parameter list>

The number of tasks (given by -n) times the number of CPUs per task (given by -c) needs to equal the number of nodes (given by --nodes) times the number of CPUs per node (to be inferred from scontrol show node or from the wiki). Or (in pseudo bash):

# ensure
(( SLURM_CPUS_PER_TASK * SLURM_NTASKS == SLURM_JOB_NUM_NODES * SLURM_CPUS_ON_NODE ))

SLURM multiprog for uneven arrays

The SLURM multiprog option for srun essentially provides a master-slave setup. You need to run it within a SLURM job allocation and invoke srun with the --multi-prog option and an appropriate multiprog configuration file:

#!/bin/bash
#
#SBATCH --job-name=test_ms
#SBATCH --output=res_ms.txt
# parameters of this snippet, choose sensible values for your setup
#SBATCH --ntasks=4
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100

# for the purpose of this course
#SBATCH -A <your account>
#SBATCH -p short

srun <other parameters> --multi-prog multi.conf

Then, of course, the multi.conf file has to exist:

0      echo     'I am the Master'
1-3    bash -c 'printenv SLURM_PROCID'

Indeed, as the naming suggests, you can use such a setup to emulate a master-slave environment. But then the processes have to take care of their communication themselves (sockets, regular files, etc.). And the most cumbersome aspect is: you have to maintain two files and keep them consistent whenever the setup changes, since all parameters have to match.

The configuration file contains three fields, separated by blanks. These fields are:

  • Task number
  • Executable File
  • Argument

Parameters available:

  • %t - The task number of the responsible task
  • %o - The task offset (task’s relative position in the task range).
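
A small sketch of a configuration file using these parameters (the program names and arguments are placeholders):

# task 0 acts as the master, tasks 1-3 run the same worker with their own rank
0      ./master
1-3    ./worker --rank=%t --offset=%o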

Checkpoint Restarting

Experimental Feature!!!

Introducing wall times is one measure to ensure a balanced distribution of resources in every HPC cluster. Yet, some applications need to have extremely long run times. The solution is Application Checkpointing, where a snapshot of the running application is saved in pre-defined intervals. This provides the ability to restart an application from the point where the checkpoint has been saved.

We want to provide integrated checkpointing with Slurm eventually. Until then, only third-party tools are offered, without additional documentation on our part.

Third party tools

Checkpointing multithreaded applications with dmtcp

dmtcp is a versatile checkpointing application that provides good documentation (including a video).

We provide at least one module for dmtcp, check:

tools/DMTCP/2.4.5
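
A minimal sketch of how a checkpointed run could look inside a job script (the application name and the one-hour checkpoint interval are placeholders):

module load tools/DMTCP/2.4.5

# start the application under checkpoint control, writing a checkpoint every 3600 seconds
dmtcp_launch --interval 3600 ./my_long_running_app

# a subsequent job can resume from the last checkpoint via the restart script
# that dmtcp generates in the working directory:
# ./dmtcp_restart_script.sh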