**Attention:** If your job runs on multiple nodes, you cannot use the local scratch space on one node from the other nodes.\\
If you need your input data on every node, please refer to the section [[slurm_localscratch#copy_files_to_multiple_nodes_via_job_script|Copy files to multiple nodes via job script]] below.
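Whether a file staged on one node is visible on the others can be checked from within a multi-node job. A minimal sketch, assuming the per-job scratch directory ''/localscratch/${SLURM_JOB_ID}'' used in the examples below:

<code bash>
# run once per allocated node: every node lists only its own local job directory
srun --ntasks-per-node=1 bash -c 'hostname; ls -l /localscratch/${SLURM_JOB_ID}'
</code>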
Assume you would normally start the program in the current working directory, where it will read and write its data, like this:
<code bash>
$ sbatch -N1 -p parallel ./my_script    # 'my_script' is a placeholder for the script starting your program
</code>
Now to get the performance of local disk access, you want to use the aforementioned local scratch space on the compute node.
===== Copy files via job script =====

This method requires you to wrap your program in a small shell script like the following:

<file bash job.sh>
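#!/bin/bash
# A minimal sketch of such a wrapper script - 'inputfile', 'outputfile' and
# 'my_program' are placeholders; adapt the names and resources to your job.
# Submit it e.g. with:  sbatch -N1 -p parallel ./job.sh

# per-job directory on the node-local scratch disk
# (adapt the path if your site uses a different location)
JOBDIR="/localscratch/${SLURM_JOB_ID}"

# remember the submit directory on the parallel file system
SAVEDPWD="$(pwd)"

# stage the input data to the local scratch space
cp "${SAVEDPWD}/inputfile" "${JOBDIR}/"

# run the program inside the local scratch directory
cd "${JOBDIR}"
"${SAVEDPWD}/my_program" inputfile

# copy the results back to the parallel file system
cp "${JOBDIR}/outputfile" "${SAVEDPWD}/"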
</file>

===== Sending signals to jobs within SLURM =====

SLURM can be instructed to send a signal to a job a configurable time before the job is killed.
Use the ''B:'' prefix of the ''--signal'' option, e.g. ''#SBATCH --signal=B:USR2@600'', within a submission script to signal the batch job itself; without the prefix the signal is delivered to all children of the batch job, but not to the batch job itself. The reason this matters: if you use a submission script like the one above, you trap the signal within the script, not within the application.
The signal therefore has to be forwarded to the application, for example like this:

<code bash>
# list of process IDs (PIDs) to signal
QUEUE=""

function queue {
    # append a PID to the list
    QUEUE="${QUEUE} $1"
}

function forward_signal() {
    # this function might fulfill additional purposes, like
    # forwarding the signal, waiting for a checkpoint to be written
    # and then copying the last checkpoint back to the parallel file system

    # just send the desired signal, e.g. SIGUSR2
    kill -s SIGUSR2 $1
}

# trap the signal within the bash script;
# it is possible to connect several functions with a signal
trap 'forward_signal "${QUEUE}"' SIGUSR2

# start the desired application(s) - note the &
eval "my command and its parameters &"
# store the PID of the desired application(s)
queue $!
# The sequence above needs to be carried out for every application instance
# you want to be signalled.

# wait for the application(s) to finish - the trapped signal is handled while waiting
wait
</code>
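For testing, the same signal can also be sent manually to a running job. A small sketch, where ''12345'' stands for the job ID and ''USR2'' for the signal trapped above; the ''--batch'' option restricts the signal to the batch shell:

<code bash>
# send SIGUSR2 to the batch shell of job 12345 (replace with your job ID)
scancel --signal=USR2 --batch 12345
</code>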
- | |||
- | |||
===== Copy files to multiple nodes via job script =====
The following script can be used to ensure that input files are present in the job directory on **all** nodes.\\
The demonstrated ''sbcast'' command copies a given file to all nodes allocated to the job.
<file bash job_multinode.sh>
#!/bin/bash
#SBATCH -N 2
# use other parameterization (partition, job name, memory, etc.) as appropriate
JOBDIR="/ | JOBDIR="/ | ||
# copy the input file on all nodes
# ('inputfile' is a placeholder for your actual input file)
sbcast ${HOME}/inputfile ${JOBDIR}/inputfile
# NOTE: Unlike 'cp', 'sbcast' expects a full file name as destination,
#       not a directory; here the destination file carries the same name,
#       'inputfile', as the source file.
</file>
The script is submitted as usual:
<code bash>
$ sbatch job_multinode.sh
</code>