
====== Checkpointing & Restarting Jobs ====== 
===== Motivation & Introduction ===== 
Wall-time limits are one measure to ensure a balanced distribution of resources on every HPC cluster. Some applications, however, need extremely long run times. The solution is [[|Application Checkpointing]], where a snapshot of the running application is saved at predefined intervals so that a later job can resume from the most recent snapshot instead of starting over.
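
The idea can be illustrated with a minimal sketch of application-level checkpointing in plain bash. This is not how any particular tool works internally. The function name ''run_steps'', its arguments, and the state file are hypothetical and exist only for this illustration. The optional fourth argument stands in for a wall-time limit by capping how many steps a single run may perform.

<code bash>
# run_steps STATE_FILE TOTAL EVERY [BUDGET]
# Hypothetical helper: performs up to BUDGET steps of a TOTAL-step job,
# saves the current step to STATE_FILE every EVERY steps, and prints the
# last step reached. BUDGET defaults to TOTAL (no simulated time limit).
run_steps() {
    local state_file=$1 total=$2 every=$3 budget=${4:-$2}
    local step=0 done_now=0

    # Resume from the last checkpoint if one exists.
    if [[ -f "$state_file" ]]; then
        step=$(<"$state_file")
    fi

    while (( step < total && done_now < budget )); do
        step=$(( step + 1 ))            # stand-in for one unit of real work
        done_now=$(( done_now + 1 ))
        if (( step % every == 0 )); then
            # Write to a temporary file and rename it, so an interrupted
            # write never leaves a half-written checkpoint behind.
            printf '%s\n' "$step" > "$state_file.tmp"
            mv "$state_file.tmp" "$state_file"
        fi
    done

    echo "$step"
}
</code>

Calling ''run_steps job.state 10 3 7'' stops after step 7 with the last checkpoint taken at step 6. A second call ''run_steps job.state 10 3'' resumes from step 6 and runs to step 10, so only the work done after the last checkpoint is repeated.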
<WRAP center round info 95%> 
We intend to provide integrated checkpointing with Slurm eventually. Until then, only third-party tools are offered, without additional documentation on our part.
</WRAP>
===== Third party tools ===== 
==== Checkpointing multithreaded applications with dmtcp ==== 
[[|dmtcp]] (Distributed MultiThreaded Checkpointing) is a versatile checkpointing tool with good documentation (including a video).
We provide at least one module for dmtcp. To see which versions are available, check:
<code bash> 
