Wall-time limits are one measure to ensure a balanced distribution of resources on every HPC cluster. Yet some applications need extremely long run times. The solution is application checkpointing, where a snapshot of the running application is saved at predefined intervals. This makes it possible to restart the application from the point at which the last checkpoint was saved.
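The idea can be illustrated with a minimal application-level sketch in Python. This is only an illustration under stated assumptions: the file name, the `run` function, and the toy workload (summing step numbers) are hypothetical, and the `stop_after` parameter merely simulates a job being cut off by its wall time. Tools that checkpoint transparently, without changes to the application, work differently.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temporary file, then rename it into place.

    The rename is atomic, so an interruption during the write cannot
    leave a half-written checkpoint behind.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path) as handle:
        return json.load(handle)


def run(path, n_steps, interval, stop_after=None):
    """Toy workload: add up the step numbers 0 .. n_steps - 1.

    A checkpoint is saved every `interval` steps. `stop_after` is a
    hypothetical knob that simulates the job hitting its wall time.
    """
    state = load_checkpoint(path)
    executed = 0
    while state["step"] < n_steps:
        if stop_after is not None and executed >= stop_after:
            return state  # interrupted: work since the last checkpoint is lost
        state["total"] += state["step"]
        state["step"] += 1
        executed += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

Calling `run("ckpt.json", 10, 3, stop_after=5)` and then `run("ckpt.json", 10, 3)` resumes from the last saved step instead of from the beginning. Only the steps since the last checkpoint are repeated, which is why the interval is a trade-off between checkpoint overhead and lost work.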
We want to provide integrated checkpointing with Slurm eventually. Until then, only third-party tools are offered, without additional documentation on our part.
DMTCP is a versatile checkpointing application with good documentation (incl. a video).
We provide at least one module for DMTCP; check: