====== Node-local scheduling ======

There are some use cases in which you simply want to request a **full cluster node** from the batch system and then run **many** //(e.g. many more than 64)// **smaller** //(e.g. only a fraction of the total job runtime)// tasks on this full node. You will then need some **local scheduling** on this node to ensure proper utilization of all cores.

To accomplish this, we suggest you use the [[http://www.gnu.org/software/parallel/|GNU Parallel]] program. The program is installed to ''/cluster/bin'', but you can also simply load the [[modules|modulefile]] ''software/gnu_parallel'' so that you can also access its man page.

For more documentation on how to use GNU Parallel, please read ''[[http://www.gnu.org/software/parallel/man.html#name|man parallel]]'' and ''[[http://www.gnu.org/software/parallel/parallel_tutorial.html#gnu_parallel_tutorial|man parallel_tutorial]]'', where you will find a great number of examples and explanations.
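
As a quick orientation, here is a minimal sketch of the basic ''parallel'' syntax (the file names are hypothetical and this is not yet a Mogon batch recipe): arguments after '':::'' are distributed over the available cores, and ''{}'' is replaced by the current argument.

<code bash>
# minimal GNU Parallel sketch with hypothetical file names:
# run one gzip process per core, one for each listed file
parallel gzip {} ::: file1.txt file2.txt file3.txt

# limit the number of concurrent jobs to 4
parallel -j 4 "echo processing {}" ::: *.in
</code>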

===== Mogon Usage Example =====

Let's say we have a number of input data files that contain differing parameters which are going to be processed independently by our program:

<file bash>
$ ls data_*.in
data_001.in
data_002.in
[...]
data_149.in
data_150.in
$ cat data_001.in
1 2 3 4 5
6 7 8 9 0
</file>

Now of course we could submit 150 separate jobs, or we could use one job which processes the files one after another, but the most elegant way is to submit one job for a full node (e.g. 64 cores on Mogon I) and process the files in parallel. This is especially convenient, since we can then use the ''nodeshort'' queue, which has better scheduling characteristics than ''short'' (while both show better scheduling than their ''long'' counterparts).

<file bash parallel_job>
#!/bin/bash

#SBATCH --job-name=demo_gnu_parallel
#SBATCH --output=res_gnu_parallel.txt
#SBATCH --ntasks=4
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100
#SBATCH -p short
#SBATCH -A <your account>

# will load the most recent version of 'parallel'
module load tools/parallel

# Store working directory to be safe
SAVEDPWD=$(pwd)
# set jobdir
export JOBDIR=/localscratch/$SLURM_JOB_ID

# suppose we want to process 150 data files; we create them for the purpose of this example:
for ((i=1; i <= 150; i++)); do
    fname="data_$(printf "%03d" $i).in"
    echo "1 2 3 4 5" >  "$fname"
    echo "6 7 8 9 0" >> "$fname"
done

# First, we copy the input data files to the local filesystem of our node
# (we pretend it is useful - an actual use case are programs with random I/O on those files)
cp "${SAVEDPWD}"/data_*.in $JOBDIR

# Change directory to jobdir
cd $JOBDIR

# we could set the number of threads for the program to use like this:
# export OMP_NUM_THREADS=4
# but in this case the program is not threaded

# -t enables verbose output to stderr
# We could also set -j $((SLURM_NPROCS/OMP_NUM_THREADS)) to be more dynamic
# The --delay parameter should be used to distribute I/O load at the beginning of program execution by
#   introducing a delay of 1 second before starting the next task
# --progress will output the current progress of the parallel task execution
# {} will be replaced by each filename
# {#} will be replaced by the consecutive job number
# Both variants will have equal results:
#parallel -t -j 16 --delay 1 --progress "./program {/} > {/.}.out" ::: data_*.in
find . -name 'data_*.in' | parallel -t -j $SLURM_NPROCS  "wc {/} > {/.}.out"

# See the GNU Parallel documentation for more examples and explanation

# Now capture the exit status code; parallel will have set it to the number of failed tasks
STATUS=$?

# Copy output data back to the previous working directory
cp $JOBDIR/data_*.out $SAVEDPWD/

exit $STATUS
</file>

<code bash>
$ sbatch parallel_job
</code>

After this job has run, we should have the results/output data (in this case, it is just the output of ''wc'', for demonstration):

<file bash>
$ ls data_*.out
data_001.out
data_002.out
[...]
data_149.out
data_150.out
$ cat data_001.out
2 10 20 data_001.in
</file>

===== Multithreaded Programs =====

Let's further assume that our program is able to work in parallel itself using OpenMP.
We determined that ''OMP_NUM_THREADS=8'' is the best amount of parallel work for one set of input data.
This means we can launch ''64/8=8'' processes using GNU Parallel on the one node we have:

<file bash parallel_job2>
#!/bin/bash
#SBATCH --job-name=demo_gnu_parallel
#SBATCH --output=res_gnu_parallel.txt
#SBATCH --ntasks=8          # 8 processes with 8 threads each = 64 cores
#SBATCH --cpus-per-task=8
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100
#SBATCH -p short
#SBATCH -A <your account>

# Store working directory to be safe
SAVEDPWD=$(pwd)

JOBDIR=/localscratch/$SLURM_JOBID
RAMDISK=$JOBDIR/ramdisk

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# -t enables verbose output to stderr
# We could also set -j $((SLURM_NPROCS/OMP_NUM_THREADS)) to be more dynamic
# The --delay parameter is used to distribute I/O load at the beginning of program execution by
#   introducing a delay of 1 second before starting the next task
# --progress will output the current progress of the parallel task execution
# {} will be replaced by each filename
# {#} will be replaced by the consecutive job number
# Both variants will have equal results:
#parallel -t -j 8 --delay 1 --progress "./program {/} > {/.}.out" ::: data_*.in
find . -name 'data_*.in' | parallel -t -j 8 --delay 1 --progress "./program {/} > {/.}.out"
# See the GNU Parallel documentation for more examples and explanation

# Now capture the exit status code; parallel will have set it to the number of failed tasks
STATUS=$?

exit $STATUS
</file>

==== An example showing the use of functions, variables and redirection ====

This example shows how to use user-defined functions, variables and anonymous pipes in bash. It uses [[http://bio-bwa.sourceforge.net/|bwa]], a bioinformatics tool that maps unknown DNA sequences to a given reference. It was chosen because it requires different kinds of input: a reference file or directory and at least two input files. These input files may not be compressed, but, as the script shows, uncompressing them by means of zcat and redirection is a working solution.

  * Note that this example uses ''set -x'' to print command traces. This is intended merely to ease comprehension; for a script in production, this line should be commented out.
  * Also note that information about the function is carried to the sub-shells with the ''export -f'' statement.
  * Variables which are not stated upon the call of GNU ''parallel'' can be made available to a function with additional ''export'' statements (here: ''$OUTPUTDIR'').
  * Positional variables (here, just ''$1'' in our example function) are given by the call through GNU ''parallel''. This is useful for parameters which should change for every iteration, here: the input file name. A minimal sketch of this mechanism is shown below, before the full script.

//In particular//: bwa can be slow on files bigger than a few GB. Hence, the proposed trade-off between threads and concurrently running processes (the load balancing) might not be optimal. Many smaller files will probably be analyzed faster with only 4 threads and 16 concurrently running processes (the ''-j'' option).
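
The following minimal sketch (using a hypothetical ''my_tool'' command and ''data_*.in'' files, not the actual bwa calls) illustrates the mechanism of exporting a function and a variable to the sub-shells spawned by GNU ''parallel'':

<code bash>
#!/bin/bash
# minimal sketch: exported function + exported variable, assuming a hypothetical 'my_tool'

export OUTPUTDIR="results"    # made visible to the sub-shells via export
mkdir -p "$OUTPUTDIR"

function process_one {
    # $1 is the positional argument supplied by GNU parallel
    my_tool "$1" > "$OUTPUTDIR/$(basename "$1").out"
}
export -f process_one         # bash-only: export the function itself

# one call of process_one per input file, 8 at a time
parallel -j 8 process_one ::: data_*.in
</code>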

<file bash>
#!/bin/bash

#SBATCH --job-name=bwa_demo_gnu_parallel
#SBATCH --output=res_bwa_gnu_parallel.log
#SBATCH -N 1
#SBATCH --time=300
#SBATCH -p nodeshort
#SBATCH -A <your account>
#SBATCH --gres=ramdisk:30G


# This script is written by Christian Meesters (HPC-team, ZDV, Mainz)

# Please note: It is valid for Mogon I. The following restrictions apply:
# - if your fastq-files in the defined input directory are big, the given
#   memory might not be sufficient. In this case restrict yourself to
#   a data subset.

# in order to see the output of all commands, we set this:
set -x

#1 we purge all loaded modules to avoid a mangled setup
module purge

#2 we load our GNU parallel module (latest version)
module load tools/parallel

#3 in order to perform our alignment (here: bwa mem) and the subsequent
#  conversion to BAM we load bwa and samtools
module load bio/BWA
module load bio/SAMtools

#4 make a pipeline return the exit status of the last command that fails (if any)
set -o pipefail

#5 set a path to the reference genome and extract its directory path
REFERENCEGENOME="reference/hg19.fasta"
REFERENCEDIR=$(dirname $REFERENCEGENOME)

#6 select a base directory for all input and traverse through it:
INPUTBASEDIR=./input
#6b now we gather all input files:
#  - a first fastq file (ending on _1.fastq)
#  - its mate (ending on _2.fastq)
# If your file names use a different scheme, adjust this script
FORWARD_READS=$(find -L $INPUTBASEDIR -type f -name '*_1.fastq*')

#7 create an output directory, here: according to bwa and samtools versions
BWA_VERSION=$(bwa |& grep Version | cut -d ' ' -f2 | cut -d '-' -f1 )
export OUTPUTDIR="bwa${BWA_VERSION}_samtools_${SLURM_JOB_ID}"

if [ ! -d "$OUTPUTDIR" ]; then
    mkdir -p "$OUTPUTDIR"
fi

#8 copy the reference to the ramdisk
NEWREFERENCEDIR=/localscratch/${SLURM_JOB_ID}/ramdisk/reference
mkdir -p $NEWREFERENCEDIR

for FILE in $REFERENCEDIR/*; do
  sbcast -f $FILE $NEWREFERENCEDIR/$(basename $FILE)
done

REFERENCEGENOME=$NEWREFERENCEDIR/$(basename $REFERENCEGENOME)
REFERENCEDIR=$NEWREFERENCEDIR

#9 create an alignment function with the appropriate calls for bwa and samtools
function bwa_aln {
  TEMPOUT=$(basename $1)
  # check the file ending: is the file gzip-compressed?
  if [[ "$1" == *.gz ]]; then
        #bwa sampe $REFERENCEGENOME <( bwa aln -t 4 $REFERENCEGENOME <(zcat $1) ) \
        #                           <( bwa aln -t 4 $REFERENCEGENOME <(zcat ${1/_1/_2} ) ) \
        #                           <(zcat $1) <(zcat ${1/_1/_2}) | \
        #samtools view -Shb /dev/stdin > "$OUTPUTDIR/${TEMPOUT%_1.fastq.gz}_aligned.bam"
        bwa mem -M -t 8 $REFERENCEGENOME <(zcat $1) <(zcat ${1/_1/_2})  |
        samtools view -Shb /dev/stdin > "$OUTPUTDIR/${TEMPOUT%_1.fastq.gz}_aligned.bam"
  else
        #bwa sampe $REFERENCEGENOME <( bwa aln -t 4 $REFERENCEGENOME $1 ) \
        #                           <( bwa aln -t 4 $REFERENCEGENOME ${1/_1/_2} ) \
        #                           $1 ${1/_1/_2} | \

        bwa mem -M -t 8 $REFERENCEGENOME $1 ${1/_1/_2} |
        samtools view -Shb /dev/stdin > "$OUTPUTDIR/${TEMPOUT%_1.fastq}_aligned.bam"
  fi
}

#9b we need to export this function, such that all subprocesses will see it (only works in bash)
export -f bwa_aln

# finally we start processing
#  we use 8 threads for each call of bwa mem, and 64 / 8 = 8 concurrent processes.
#  This results in slight oversubscription, as samtools runs in addition to bwa mem.
#  Note the ungrouping of output with the -u option.
parallel -v -u --env bwa_aln --no-notice -j 8 bwa_aln ::: $FORWARD_READS
</file>
===== Running on several hosts =====

We do not recommend supplying a hostlist to GNU parallel with the ''-S'' option, as GNU parallel attempts to ssh to the respective nodes (including the master host) and therefore loses the environment. You can script around this, but you will run into a quoting hell.

Instead, we recommend a setup similar to:

<file bash multi_host>
#!/bin/bash
#SBATCH -J <your meaningful job name>
#SBATCH -A <your account>
#SBATCH -p nodeshort # for Mogon I
#SBATCH -p parallel  # for Mogon II
#SBATCH --nodes=3 # appropriate number of nodes
#SBATCH -n 192    # example value for Mogon I, see below
#SBATCH -t 300
#SBATCH --cpus-per-task=8 # we assume an application which scales to 8 threads, but
                          # -c / --cpus-per-task could also be omitted (default is 1)
                          # or set to a different value.
#SBATCH -o <your logfile prefix>_%j.log

# adjust / overwrite those two commands to enhance readability & overview
# parameterize srun
srun="srun -N1 -n 1 -c $SLURM_CPUS_PER_TASK  --jobid $SLURM_JOBID --cpu_bind=q"
# parameterize parallel
parallel="parallel -j $SLURM_NTASKS --no-notice "

# your preprocessing goes here

# start the run with GNU parallel
$parallel $srun <command> ::: <parameter list>
</file>

The number of tasks given by ''-n'' should be the number of cores per node times the number of nodes if ''--cpus-per-task'' is left at its default of 1; if you request more than one CPU per task, divide that product by the value of ''--cpus-per-task''. However, bear in mind that the a-nodes of Mogon I have one FPU per two-CPU module and the z-nodes of Mogon II have 20 CPUs each, with hyperthreading enabled. Which number you can best take as the number of cores is application dependent and should be determined experimentally; a sketch of the arithmetic follows.
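
As a rough, hypothetical back-of-the-envelope calculation (assuming nodes with 64 cores each; adjust the numbers for your hardware and application):

<code bash>
# hypothetical calculation, assuming 64 cores per node:
NODES=3
CORES_PER_NODE=64
CPUS_PER_TASK=8

# number of tasks such that all cores are busy without oversubscription:
echo $(( NODES * CORES_PER_NODE / CPUS_PER_TASK ))   # -> 24, i.e. '#SBATCH -n 24' in this scenario
</code>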

===== Using SLURM multiprog =====

The [[https://slurm.schedmd.com/srun.html|SLURM multiprog]] option of ''srun'' essentially provides a master-slave setup. You need to run it within a SLURM job allocation and trigger ''srun'' with the ''%%--multi-prog%%'' option and an appropriate multiprog file:

<file bash master_slave_simple.sh>
#!/bin/bash
#
#SBATCH --job-name=test_ms
#SBATCH --output=res_ms.txt
# parameters of this snippet, choose sensible values for your setup
#SBATCH --ntasks=4
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100

# for the purpose of this course
#SBATCH -A <your account>
#SBATCH -p short

srun <other parameters> --multi-prog multi.conf
</file>

Then, of course, the ''multi.conf'' file has to exist:

<file bash multi.conf>
0      echo     'I am the Master'
1-3    bash -c 'printenv SLURM_PROCID'
</file>

Indeed, as the naming suggests, you can use such a setup to emulate a master-slave environment. But then the processes have to take care of their communication themselves (sockets, regular files, etc.). And the most cumbersome aspect is: you have to maintain two files, and whenever the setup changes, all parameters in both files have to match again.

The configuration file contains three fields, separated by blanks. These fields are:

  * task number
  * executable file
  * argument

The following parameters are available in the argument field (see the example below):

  * ''%t'' - the task number of the responsible task
  * ''%o'' - the task offset (the task's relative position in the task range)
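
As a brief illustration (a hypothetical configuration, not tied to the job script above), the following configuration file uses both tokens:

<code bash>
# hypothetical multi.conf: task number, executable, arguments
0       echo     master
1-3     echo     worker task:%t offset:%o
# task 1 prints "worker task:1 offset:0", task 2 "worker task:2 offset:1", and so on
</code>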

====== The ZDV taskfarm script: an alternative to multiprog ======

The script is hosted on [[https://github.com/cmeesters/staskfarm|github]], forked from [[https://www.tchpc.tcd.ie/person/paddy-doyle|Paddy Doyle, Trinity College, Dublin]] and adapted for Mogon (I and II).

The slurm multi-prog setup can be difficult for some scenarios:

  * only one executable can be specified per task (e.g. no chain of commands or shell loops is possible, such as ''cd dir01; ./my_exec'')
  * there is a limit on the maximum number of characters per task description (256)
  * building the multi-prog file can be onerous, if you do not have the luxury of using the '%t' tokens in your commands or arguments
  * the number of commands must match exactly the number of slurm tasks (''-n''), which means updating two files if you wish to add or remove tasks

[[job_arrays|Slurm Job Arrays]] are a better option than multi-prog, unless you are using the ''parallel'' (Mogon II) or ''node*'' (Mogon I) partitions with scalable software anyway; a minimal job array sketch is shown below.
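
For comparison, a minimal job array sketch (assuming the 150 ''data_*.in'' files from the example above and a hypothetical ''./program'') could look like this:

<code bash>
#!/bin/bash
#SBATCH --job-name=array_example
#SBATCH --output=array_%A_%a.out
#SBATCH --ntasks=1
#SBATCH --time=10:00
#SBATCH -A <your account>
#SBATCH -p short
#SBATCH --array=1-150        # one array task per input file

# each array task handles exactly one file, e.g. data_042.in
FNAME=$(printf "data_%03d.in" ${SLURM_ARRAY_TASK_ID})
./program "$FNAME" > "${FNAME%.in}.out"
</code>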

The taskfarm script makes using multi-prog setups easy. Please only use it if your tasks have approximately the same run time; otherwise large parts of the reserved nodes may be left idle.

For a full listing of the command line interface you can load the module and ask the script itself for help:
<code shell>
$ module load tools/staskfarm
$ staskfarm -h
</code>

===== Taskfarm: Working with one application on many files =====

<file bash taskfarm_file>
#!/bin/bash

#SBATCH -J taskfarm_example
#SBATCH -o taskfarm_example_%j.out
#SBATCH -N2 # in this example we take 2 nodes
#SBATCH -n 128 # optional argument - the optimal setting (or omission) has to be tried on a case-by-case basis
#SBATCH -A <your account>
#SBATCH -p nodeshort

# will load the most recent module version of the taskfarm
module load tools/staskfarm

# - suppose we have a program which requires 2 inputs:
#    'input_<number>_R1.fastq' and 'input_<number>_R2.fastq'
# - assume further we have 302 such file pairs
# - and we want to work on them in a round robin manner

# 1st we "produce" such dummy files:
for ((i=1; i <= 302; i++)); do
   touch "input_${i}_R1.fastq"
   touch "input_${i}_R2.fastq"
done

# 2nd, we specify our input command.
#      Instead of a 'real' application we drop in 'echo'.
#      And we use pattern expansion to retrieve the 2nd file name,
#      as we cannot (always) loop over several expressions.
echo '#!/bin/bash' > cmd_file.sh
echo 'echo "working on node $(hostname) on files $1 and ${1%_R1.fastq}_R2.fastq"' >> cmd_file.sh
chmod +x cmd_file.sh
cmd=$(pwd)/cmd_file.sh

# 3rd, start the taskfarm:
staskfarm $cmd *_R1.fastq

# finally, we need to clean up our mess:
rm *fastq cmd_file.sh
</file>

===== Taskfarm: Screening many parameters with one application =====

<WRAP center round alert 80%>
As stated, the most sensible use case for the taskfarm are applications with approximately equal run times for all inputs. When screening parameters in a simulation, it is likely that run times depend strongly on the parameters. In that case it may be better to use GNU parallel or another parallelisation scheme, e.g. as sketched below.
</WRAP>
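
A minimal GNU parallel sketch for such a parameter screen (with a hypothetical ''./simulate'' executable) could look like this:

<code bash>
# hypothetical parameter screen: all combinations of arg1 in 1..5 and arg2 in {0,1}
# {1} and {2} refer to the first and second input source, respectively
parallel -j $SLURM_NTASKS "./simulate {1} {2} > run_{1}_{2}.log" ::: {1..5} ::: 0 1
</code>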

<file bash taskfarm_file2>
#!/bin/bash

#SBATCH -J taskfarm_example
#SBATCH -o taskfarm_example_%j.out
#SBATCH -N1 # in this example we use less than one full node
#SBATCH -n 40
#SBATCH --cpus-per-task=2
#SBATCH -A <your account>
#SBATCH --time 10
#SBATCH -p short

# will load the most recent module version of the taskfarm
module load tools/staskfarm

# - suppose we have a program which requires 2 inputs:
#    arg1 takes the natural numbers between 1 and 5, arg2 just 0 or 1
# - assume further we want to run a permutation test

# 1st we "produce" the parameter combinations:
permutations=$(echo {1..5},{0..1})

# 2nd, we specify our input command.
#      Instead of a 'real' application we drop in 'echo'.
echo '#!/bin/bash' > cmd_file.sh
echo 'echo "working on node $(hostname) on args $1 and with $OMP_NUM_THREADS threads"' >> cmd_file.sh
chmod +x cmd_file.sh
cmd=$(pwd)/cmd_file.sh

# 3rd, start the taskfarm:
#      NOTE: We start each application with 2 threads (-t 2)
#      The '--parameters' flag skips the test for input files (as parameters aren't files)
staskfarm -t 2 --parameters $cmd $permutations

# finally, we need to clean up our mess:
rm cmd_file.sh
</file>