checkpoint_restart

Differences

checkpoint_restart [2020/10/02 15:00]
jrutte02 removed
— (current)
Line 1: Line 1:
-====== Checkpointing & Restarting Jobs ====== 
- 
-<WRAP center round todo 90%> 
-''This feature is experimental. We hope to provide more information soon.''
-</WRAP> 
- 
-===== Motivation & Introduction ===== 
- 
-Wall-time limits are one measure to ensure a balanced distribution of resources on an HPC cluster. Yet some applications need extremely long run times. The solution is [[https://en.wikipedia.org/wiki/Application_checkpointing|Application Checkpointing]], where a snapshot of the running application is saved at predefined intervals. This makes it possible to restart an application from the point at which the last checkpoint was saved.
- 
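The idea of saving a snapshot at intervals and resuming from it can be sketched at the application level. The following is only a minimal illustration, not how dmtcp works internally: the function name ''run_with_checkpoints'', the variables ''CKPT_FILE'', ''TOTAL_STEPS'' and ''CKPT_INTERVAL'', and the checkpoint path are all hypothetical, and the "application state" is just a step counter.

```bash
#!/usr/bin/env bash
# Hypothetical application-level checkpointing sketch (not dmtcp itself).
set -euo pipefail

ckpt_file="${CKPT_FILE:-./counter.ckpt}"   # hypothetical checkpoint path
total_steps="${TOTAL_STEPS:-10}"           # total amount of "work"
interval="${CKPT_INTERVAL:-3}"             # save a checkpoint every N steps

# Run until step $1 (default: total_steps), resuming from a checkpoint if one exists.
run_with_checkpoints() {
    local stop_after="${1:-$total_steps}"
    local step=0

    # Restart: continue from the last saved step instead of from zero.
    if [[ -f "$ckpt_file" ]]; then
        step="$(<"$ckpt_file")"
    fi

    while (( step < stop_after )); do
        step=$(( step + 1 ))               # stand-in for one unit of real work

        # Checkpoint: write to a temporary file first, then rename, so an
        # interruption during the write cannot leave a half-written checkpoint.
        if (( step % interval == 0 )); then
            printf '%s\n' "$step" > "${ckpt_file}.tmp"
            mv "${ckpt_file}.tmp" "$ckpt_file"
        fi
    done

    echo "$step"                           # last step reached in this run
}
```

Calling ''run_with_checkpoints 7'' simulates a job that is cut off at step 7; a later call to ''run_with_checkpoints'' then resumes from the last saved checkpoint rather than from the beginning, so only the work done after that checkpoint is repeated.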
-<WRAP center round info 95%> 
-We eventually want to provide checkpointing integrated with Slurm. Until then, only third-party tools are offered, without additional documentation on our part.
-</WRAP> 
- 
-===== Third-party tools =====
- 
-==== Checkpointing multithreaded applications with dmtcp ==== 
- 
-[[http://dmtcp.sourceforge.net/index.html|dmtcp]] is a versatile checkpointing tool with good documentation (including a video).
- 
-We provide at least one module for dmtcp, e.g.:
-<code bash> 
-tools/DMTCP/2.4.5 
-</code> 
- 
- 
- 
- 
  • Last modified: 2020/10/02 15:00
  • by jrutte02