
Checkpointing & Restarting Jobs

Wall time limits are one measure to ensure a balanced distribution of resources on any HPC cluster. Yet some applications need run times far longer than those limits allow. The solution is application checkpointing: a snapshot of the running application is saved at predefined intervals, so that an interrupted job can be restarted from the most recent snapshot instead of from the beginning.
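To illustrate the idea, here is a minimal sketch of application-level checkpointing in Python. It is not how dmtcp or Slurm implement checkpointing; the function names, the JSON state file and the `stop_after` parameter (which simulates a job killed at its wall time) are assumptions made only for this example.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temporary file, then rename it over the old checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    # Replacing in one step avoids leaving a half-written checkpoint behind.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"step": 0, "total": 0}


def run(path, n_steps, interval, stop_after=None):
    """Sum the integers 0..n_steps-1, checkpointing every `interval` steps.

    `stop_after` (hypothetical) ends the run early after that many steps,
    imitating a job that is killed when its wall time expires.
    """
    state = load_checkpoint(path)
    executed = 0
    while state["step"] < n_steps:
        if stop_after is not None and executed >= stop_after:
            return None  # interrupted; the last checkpoint stays on disk
        state["total"] += state["step"]
        state["step"] += 1
        executed += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]
```

Calling `run` a second time with the same checkpoint path resumes from the last saved step, so only the work done since that snapshot is repeated. A shorter interval loses less work on interruption but spends more time writing snapshots.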

We eventually want to provide checkpointing integrated with Slurm. Until then, only third-party tools are offered, without additional documentation on our part.

DMTCP (dmtcp) is a versatile checkpointing tool with good documentation (including a video).

We provide at least one module for dmtcp; check:

  • checkpoint_restart.1499139064.txt.gz
  • Last modified: 2017/07/04 05:31
  • by meesters