Differences
This shows you the differences between two versions of the page.
start:working_on_mogon:io_odds_and_ends:slurm_localscratch [2020/10/19 13:51] meesters [Copy files via job script] |
start:working_on_mogon:io_odds_and_ends:slurm_localscratch [2022/06/20 18:05] meesters [Signalling in SLURM -- difference between signalling submission scripts and applications] - minor grammar fixes and removed doubled lines |
||
---|---|---|---|
Line 1: | Line 1: | ||
====== Local Scratch Space ====== | ====== Local Scratch Space ====== | ||
- | On every node, there is local scratch space available to your running jobs that you should use if possible. | + | On every node, there is local scratch space available to your running jobs. |
Every job can therefore use a directory called ''/ | Every job can therefore use a directory called ''/ | ||
+ | |||
+ | <callout type=" | ||
+ | If your job merely reads and writes big files sequentially, there is no need to use the local scratch or a ramdisk. However, using the local scratch might be beneficial in these scenarios: | ||
+ | * if your job produces many temporary files | ||
+ | * if your job reads a file or a set of files in a directory repeatedly during run time (multiple threads or concurrent jobs doing so result in a random access pattern on the global file system, which is a true performance killer) | ||
+ | </ | ||
<callout type=" | <callout type=" | ||
Line 12: | Line 18: | ||
</ | </ | ||
- | **Attention: | ||
If your job runs on multiple nodes, you cannot use the local scratch space on one node from the other nodes.\\ | If your job runs on multiple nodes, you cannot use the local scratch space on one node from the other nodes.\\ | ||
If you need your input data on every node, please refer to the section [[slurm_localscratch# | If you need your input data on every node, please refer to the section [[slurm_localscratch# | ||
Line 21: | Line 26: | ||
Assume you would normally start the program in the current working directory where it will read and write its data like this: | Assume you would normally start the program in the current working directory where it will read and write its data like this: | ||
<code bash> | <code bash> | ||
- | $ sbatch -N1 -p nodeshort ./ | + | $ sbatch -N1 -p parallel ./ |
- | # | + | |
- | $ sbatch -N1 -p parallel ./ | + | |
</ | </ | ||
Now to get the performance of local disk access, you want to use the aforementioned local scratch space on the compute node. | Now to get the performance of local disk access, you want to use the aforementioned local scratch space on the compute node. | ||
Line 89: | Line 91: | ||
$ sbatch --signal=SIGUSR2@600 ... | $ sbatch --signal=SIGUSR2@600 ... | ||
</ | </ | ||
- | This would send the signal '' | + | This would send the signal '' |
**Usually** this requires you to use | **Usually** this requires you to use | ||
Line 97: | Line 99: | ||
</ | </ | ||
- | or rather | + | within |
- | + | ||
- | <code bash> | + | |
- | #SBATCH --signal=B: | + | |
- | </ | + | |
- | + | ||
- | withing | + | |
- | + | ||
- | <code bash> | + | |
- | # list of process IDs (PIDs) to signal | + | |
- | QUEUE="" | + | |
- | + | ||
- | function queue { | + | |
- | QUEUE=" | + | |
- | } | + | |
- | + | ||
- | function forward_signal() { | + | |
- | # this function might fulfill additional purposes, like | + | |
- | # forwarding the signal, waiting a checkpoint to be written | + | |
- | # and then copying the last checkpoint back to the parallel file system | + | |
- | + | ||
- | # just send the desired signal, e.g. SIGUSR2 | + | |
- | kill -s SIGUSR2 $1 | + | |
- | } | + | |
- | + | ||
- | # trap the signal within the bash script | + | |
- | # it is possible to connect several functions with a signal | + | |
- | trap ' | + | |
- | + | ||
- | # start the desired application(s) - note the & | + | |
- | eval "my command and its parameters &" | + | |
- | # store the PID of the desired application(s) | + | |
- | queue $! | + | |
- | # The sequence above needs to be carried out for every application instance | + | |
- | # you want to be signalled. | + | |
- | </ | + | |
</ | </ | ||
- | |||
- | |||
===== Copy files to multiple nodes via job script ===== | ===== Copy files to multiple nodes via job script ===== | ||
The following script can be used to ensure that input files are present in the job directory on **all** nodes.\\ | The following script can be used to ensure that input files are present in the job directory on **all** nodes.\\ | ||
- | This is required for e.g. [[software: | ||
- | This script is very verbose, you might want to delete or comment out the '' | + | The demonstrated '' |
- | + | ||
- | Also, this script copies data from **all** nodes back into separate directories named '' | + | |
- | \\If your application //only// needs to read on every node but does not write on every node, you want to use the cleanup function from the script posted above. | + | |
<file bash job_multinode.sh> | <file bash job_multinode.sh> | ||
#!/bin/bash | #!/bin/bash | ||
- | #SBATCH -N 2 # assuming mogon I ' | + | #SBATCH -N 2 |
- | #SBATCH -J ' | + | # use other parameterization as appropriate |
- | #SBATCH -p nodeshort | + | |
- | #SBATCH -mem 1800M | + | |
JOBDIR="/ | JOBDIR="/ | ||
- | HOSTLIST=$(scontrol show hostname $SLURM_JOB_NODELIST | paste -d, -s | tr ',' | ||
- | echo $HOSTLIST | ||
- | # Store working directory to be safe | ||
- | SAVEDPWD=$(pwd) | ||
- | |||
- | # We define a bash function to do the cleaning when the signal is caught | ||
- | cleanup() { | ||
- | | ||
- | exit 0 | ||
- | } | ||
- | |||
- | # Register the cleanup function when SIGUSR2 is sent, | ||
- | # ten minutes before the job gets killed | ||
- | trap ' | ||
# copy the input file on all nodes | # copy the input file on all nodes | ||
- | sbcast | + | sbcast |
- | # some applications only need the file on the ' | + | |
- | # in this case you can restrict yourself to: | + | |
- | cp ${HOME}/ | + | |
- | # Go to jobdir | + | # NOTE: Unlike ' |
- | cd " | + | # the destination file carries the same name, ' |
- | + | # | |
- | $@ " | + | |
- | # Call the cleanup function when everything went fine | ||
- | cleanup | ||
</ | </ | ||
- | This script is used as follows: | ||
- | <code bash> | ||
- | $ chmod +x ./ | ||
- | $ namd2 # after loading the appropriate module | ||
- | </ | ||
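
The signalling section above distinguishes between signalling the submission script and signalling the application. The following is a minimal sketch of how a job script might catch a signal and forward it to an application running in the background. It is not taken from either revision verbatim: ''fake_app'', ''MARKER'' and ''PIDS'' are hypothetical names, the stand-in application only writes a marker file, and the script signals itself for demonstration, where a real job would receive the signal from SLURM (e.g. via ''--signal''). ''USR2'' is the same signal as ''SIGUSR2'', spelled without the prefix.

```shell
#!/bin/bash
# Sketch only: forward a signal from the job script to a background application.
# 'fake_app', MARKER and PIDS are hypothetical names used for illustration.

MARKER=$(mktemp)   # file the stand-in application writes its "checkpoint" to
PIDS=""            # list of process IDs (PIDs) to signal

# Stand-in for a real application that writes a checkpoint upon USR2.
fake_app() {
    trap 'echo "checkpoint written" > "$MARKER"; exit 0' USR2
    while true; do sleep 0.1; done
}

# Forward the signal caught by the job script to every stored PID.
forward_signal() {
    for pid in $PIDS; do
        kill -s USR2 "$pid" 2>/dev/null
    done
}

# Trap the signal within the job script itself.
trap forward_signal USR2

# Start the application in the background (note the '&') and store its PID.
fake_app &
PIDS="$PIDS $!"

sleep 1              # demo only: let the application install its handler
kill -s USR2 "$$"    # demo only: in a real job, SLURM delivers this signal

# Wait for the application(s) to finish, then pick up the result.
wait $PIDS
RESULT=$(cat "$MARKER")
rm -f "$MARKER"
echo "$RESULT"
```

The application has to run in the background, because a shell only executes its trap handler after the current foreground command has returned. In a real job script, ''forward_signal'' would be the place to wait for the checkpoint and copy it back to the parallel file system.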