This shows you the differences between two versions of the page.

checkpoint_restart [2017/07/04 05:31]
meesters created
— (current)
Line 1: Line 1:
-====== Checkpointing & Restarting Jobs ====== 
-===== Motivation & Introduction ===== 
-Wall-time limits are one measure to ensure a balanced distribution of resources on an HPC cluster. Yet some applications need extremely long run times. One solution is [[|Application Checkpointing]], where a snapshot of the running application is saved at pre-defined intervals, so that a later job can resume from the most recent snapshot instead of starting over.
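The idea can be illustrated with a minimal sketch of application-level checkpointing, assuming a job whose entire state is a single step counter. The names ''CKPT_FILE'', ''TOTAL_STEPS'', ''CKPT_INTERVAL'' and ''run_with_checkpoints'' are hypothetical and chosen for illustration only; real applications have far more state to save, which is what dedicated checkpointing tools take care of.

```shell
#!/bin/sh
# Sketch only: a counter loop that saves its progress at fixed intervals
# and resumes from the last saved step when it is started again.
set -eu

# Hypothetical settings; each can be overridden from the environment.
CKPT_FILE="${CKPT_FILE:-counter.ckpt}"   # where the snapshot is stored
TOTAL_STEPS="${TOTAL_STEPS:-10}"         # total amount of work
CKPT_INTERVAL="${CKPT_INTERVAL:-3}"      # save a snapshot every N steps

run_with_checkpoints() {
    step=0
    # Restart: pick up the last saved step if a snapshot exists.
    if [ -f "$CKPT_FILE" ]; then
        step=$(cat "$CKPT_FILE")
    fi
    while [ "$step" -lt "$TOTAL_STEPS" ]; do
        step=$((step + 1))               # stand-in for one unit of real work
        if [ $((step % CKPT_INTERVAL)) -eq 0 ]; then
            # Checkpoint: write to a temporary file, then rename it, so an
            # interrupted write never leaves a half-written snapshot behind.
            printf '%s\n' "$step" > "$CKPT_FILE.tmp"
            mv "$CKPT_FILE.tmp" "$CKPT_FILE"
        fi
    done
    echo "finished at step $step"
}
```

Calling ''run_with_checkpoints'' in a job that is killed at its wall-time limit and then resubmitted continues from the last snapshot, so at most ''CKPT_INTERVAL'' steps of work are repeated.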
-<WRAP center round info 95%> 
-We eventually want to provide integrated checkpointing with SLURM. Until then, only third-party tools are offered, without additional documentation on our part.
-===== Third-party tools =====
-==== Checkpointing multithreaded applications with dmtcp ==== 
-[[|dmtcp]] is a versatile checkpointing application with good documentation (including a video).
-We provide at least one module for dmtcp. To check which modules are available:
-<code bash> 
  • Last modified: 2017/07/04 05:31
  • by meesters