This is an old revision of the document!


Checkpointing & Restarting Jobs

Motivation & Introduction

Wall-time limits are one way to ensure a balanced distribution of resources on an HPC cluster. Yet some applications need extremely long run times. The solution is application checkpointing, where a snapshot of the running application is saved at pre-defined intervals. The application can then be restarted from the point at which the most recent checkpoint was saved, instead of from the beginning.
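The idea can be illustrated with a minimal sketch in Python. It assumes an application that manages its own state explicitly; this is not how transparent tools such as dmtcp work internally, since those snapshot the whole process. All names here (`save_checkpoint`, `load_checkpoint`, `run`, `CHECKPOINT_EVERY`, the `stop_after` parameter) are hypothetical and chosen for illustration only.

```python
import os
import pickle
import tempfile

CHECKPOINT_EVERY = 100  # hypothetical interval, counted in loop iterations


def save_checkpoint(path, state):
    """Write the state to a temporary file, then rename it over the target.

    The rename is atomic on the same filesystem, so a job killed while
    writing leaves the previous checkpoint intact.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as handle:
        pickle.dump(state, handle)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path, "rb") as handle:
            return pickle.load(handle)
    return {"step": 0, "total": 0}


def run(path, n_steps, stop_after=None):
    """Sum the integers 0 .. n_steps-1, checkpointing at regular intervals.

    stop_after simulates the job being terminated at the wall-time limit.
    """
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        if stop_after is not None and state["step"] >= stop_after:
            return None  # simulated termination; work since last checkpoint is lost
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % CHECKPOINT_EVERY == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

Calling `run` a second time with the same checkpoint path resumes from the last saved step rather than from zero. Only the work done since the most recent checkpoint has to be repeated, which is the trade-off behind choosing the checkpoint interval.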

We eventually want to provide checkpointing integrated with Slurm. Until then, only third-party tools are offered, without additional documentation on our part.

Third-party tools

Checkpointing multithreaded applications with dmtcp

dmtcp is a versatile checkpointing tool with good documentation (including a video).

We provide at least one module for dmtcp, for example:

tools/DMTCP/2.4.5
checkpoint_restart.1499139117.txt.gz · Last modified: 2017/07/04 05:31 by meesters