Checkpointing & Restarting Jobs
Motivation & Introduction
Wall-time limits are one measure to keep resources fairly distributed on an HPC cluster. Some applications, however, need run times far beyond those limits. The solution is application checkpointing: a snapshot of the running application is saved at predefined intervals, so that a job stopped at the wall-time limit can be restarted from the latest snapshot instead of from the beginning.
We plan to provide checkpointing integrated with Slurm eventually. Until then, only third-party tools are offered, without additional documentation on our part.
Third-party tools
Checkpointing multithreaded applications with DMTCP
DMTCP is a versatile checkpointing tool with good documentation (including a video).
We provide at least one module for DMTCP. Check the available modules for entries such as:
tools/DMTCP/2.4.5
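As a rough illustration of how such a module might be used, the following is a minimal sketch of a helper for a job script. It is not an officially supported recipe. It assumes that DMTCP writes its checkpoint images and a `dmtcp_restart_script.sh` into the current working directory (the default behaviour as far as we know), and the function name `run_with_dmtcp`, the variables `CKPT_INTERVAL` and `DRY_RUN`, and the one-hour interval are all hypothetical choices made for this example.

```shell
#!/bin/bash
# Sketch only: start a program under DMTCP, or resume it if an earlier run
# already left a restart script in the current directory.
# Assumed, not taken from this page: the interval value and all names below.

CKPT_INTERVAL="${CKPT_INTERVAL:-3600}"   # seconds between checkpoints (assumed)

run_with_dmtcp() {
    if [ -f ./dmtcp_restart_script.sh ]; then
        # A previous run was checkpointed here: resume from that checkpoint.
        set -- bash ./dmtcp_restart_script.sh
    else
        # First run: launch the program under DMTCP with periodic checkpoints.
        set -- dmtcp_launch --interval "$CKPT_INTERVAL" "$@"
    fi

    if [ -n "${DRY_RUN:-}" ]; then
        echo "$@"    # only print the command that would be run
    else
        "$@"
    fi
}
```

In a job script one would typically load the module first (presumably with `module load tools/DMTCP/2.4.5`) and then call something like `run_with_dmtcp ./my_app`, where `./my_app` is a placeholder. Resubmitting the same script after the wall time expires would then resume from the latest checkpoint rather than start over. Setting `DRY_RUN=1` prints the chosen command without running it, which is useful for checking the logic before a long job.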