DMTCP Checkpointing

DMTCP (Distributed MultiThreaded Checkpointing) is a checkpointing package for applications. Checkpointing lets you resume a simulation that was interrupted by a resource failure (e.g. a hardware or software fault, or an exceeded time or memory limit).

DMTCP supports both sequential and multi-threaded applications. Examples of programs on Linux that can be used with DMTCP include OpenMP, MATLAB, Python, Perl, MySQL, bash, gdb, X-Windows, etc.

DMTCP supports several resource managers, including SLURM, the resource manager used at HCC. The DMTCP module is available on Crane and is loaded by typing:

module load dmtcp

After the module is loaded, the first step is to run the command:

[<username>@login.crane ~]$ dmtcp_launch --new-coordinator --rm --interval <interval_time_seconds> <your_command>

where the --rm option enables SLURM support, <interval_time_seconds> is the time in seconds between automatic checkpoints, and <your_command> is the actual command you want to run and checkpoint.

Besides the general options shown above, more dmtcp_launch options can be seen by using:

[<username>@login.crane ~]$ dmtcp_launch --help

dmtcp_launch creates a few files that are used to resume the cancelled job, such as ckpt_*.dmtcp and dmtcp_restart_script*.sh. Unless otherwise stated (using the --ckptdir option), these files are stored in the current working directory.
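
To keep checkpoint files separate from job output, the --ckptdir option mentioned above can point at a dedicated directory. The following is a minimal sketch, not a prescribed layout: the "checkpoints" directory name and the launch_with_ckptdir helper are assumptions made for illustration.

```shell
#!/bin/sh
# Keep checkpoint files out of the working directory.
# The "checkpoints" directory name is an assumption; any writable path works.
CKPT_DIR=${CKPT_DIR:-"$PWD/checkpoints"}
mkdir -p "$CKPT_DIR"

# Wrap a command with dmtcp_launch so its checkpoint files go to $CKPT_DIR.
# launch_with_ckptdir is a hypothetical helper name, not part of DMTCP.
launch_with_ckptdir() {
    interval=$1     # seconds between automatic checkpoints
    shift           # the remaining arguments are the command to run
    dmtcp_launch --new-coordinator --rm --ckptdir "$CKPT_DIR" \
        --interval "$interval" "$@"
}

echo "checkpoint files will be written to $CKPT_DIR"
```

With the module loaded, a call such as launch_with_ckptdir 3600 <your_command> would then run the command with hourly checkpoints stored in that directory.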

The second step of DMTCP is to restart the cancelled job, and there are two ways of doing that:

  • dmtcp_restart ckpt_*.dmtcp <options> (before running this command, delete any old ckpt_*.dmtcp files in your current directory)

  • ./dmtcp_restart_script.sh <options>

If no options are given in the <options> field, DMTCP keeps running with the options defined in the initial dmtcp_launch call (such as interval time, output directory, etc.).
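
For the first restart method, old checkpoint images need to be out of the way before the wildcard is expanded. Rather than deleting them, one option is to move them aside. The sketch below assumes you keep a marker file whose timestamp separates old images from current ones (for example, a file touched at the start of the most recent job). The marker, the archive_stale_ckpts helper, and the "old_ckpts" directory name are all hypothetical.

```shell
#!/bin/sh
# Move checkpoint images that are not newer than a reference file into an
# archive directory instead of deleting them.
# archive_stale_ckpts and "old_ckpts" are hypothetical names for illustration.
archive_stale_ckpts() {
    ref=$1          # marker file; images not newer than it count as old
    dir=${2:-.}     # directory holding the ckpt_*.dmtcp files
    mkdir -p "$dir/old_ckpts"
    for f in "$dir"/ckpt_*.dmtcp; do
        [ -e "$f" ] || continue     # the glob matched nothing
        # find prints the file only if it is newer than the marker
        if [ -z "$(find "$f" -prune -newer "$ref")" ]; then
            mv "$f" "$dir/old_ckpts/"
        fi
    done
}
```

Calling archive_stale_ckpts run_start.marker before dmtcp_restart ckpt_*.dmtcp leaves only the images written after the marker in the current directory.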

A simple example of using DMTCP with BLAST on Crane is shown below:

dmtcp_blastx.submit
#!/bin/sh
#SBATCH --job-name=BlastX
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --time=50:00:00
#SBATCH --mem=20gb
#SBATCH --output=BlastX_info_1.txt
#SBATCH --error=BlastX_error_1.txt
 
module load dmtcp
module load blast/2.4

cd $WORK/<project_folder>
cp -r /work/HCC/DATA/blastdb/nr/ /tmp/  
cp input_reads.fasta /tmp/

dmtcp_launch --new-coordinator --rm --interval 3600 blastx -query \
/tmp/input_reads.fasta -db /tmp/nr/nr -out blastx_output.alignments \
-num_threads $SLURM_NTASKS

In this example, DMTCP takes checkpoints every hour (--interval 3600), and the actual command we want to checkpoint is blastx with some general BLAST options defined with -query, -db, -out, -num_threads.

If this job is killed for any reason, it can be restarted using the following submit file:

dmtcp_restart_blastx.submit
#!/bin/sh
#SBATCH --job-name=BlastX
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --time=50:00:00
#SBATCH --mem=20gb
#SBATCH --output=BlastX_info_2.txt
#SBATCH --error=BlastX_error_2.txt

module load dmtcp
module load blast/2.4

cd $WORK/<project_folder>
cp -r /work/HCC/DATA/blastdb/nr/ /tmp/
cp input_reads.fasta /tmp/

# Start DMTCP
dmtcp_coordinator --daemon --port 0 --port-file /tmp/port
export DMTCP_COORD_HOST=$(hostname)
export DMTCP_COORD_PORT=$(cat /tmp/port)

# Restart job 
./dmtcp_restart_script.sh
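
The coordinator start-up in the script above can be made a little more defensive. The sketch below shows two precautions: a port file named per job, so that jobs sharing a node do not overwrite each other's /tmp/port, and a short wait until the port file has content before it is read. Whether the wait is ever needed depends on how quickly the coordinator writes the file, so treat it as a precaution only. The wait_for_file helper and the file naming scheme are hypothetical.

```shell
#!/bin/sh
# Wait until a file exists and is non-empty, or give up after N tries
# (one second apart). wait_for_file is a hypothetical helper, not part of DMTCP.
wait_for_file() {
    file=$1
    tries=${2:-10}
    while [ "$tries" -gt 0 ]; do
        [ -s "$file" ] && return 0
        sleep 1
        tries=$((tries - 1))
    done
    return 1
}

# A per-job port file avoids clashes when several jobs share a node.
# Outside of SLURM, fall back to the shell's process ID.
PORT_FILE="/tmp/dmtcp_port.${SLURM_JOB_ID:-$$}"
echo "port file: $PORT_FILE"
```

In the restart submit file, this would mean passing --port-file "$PORT_FILE" to dmtcp_coordinator, then running wait_for_file "$PORT_FILE" 10 || exit 1 before reading the port with $(cat "$PORT_FILE").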

dmtcp_restart generates new ckpt_*.dmtcp and dmtcp_restart_script*.sh files. Therefore, if the restarted job is also killed because of unavailable or exceeded resources, you can resubmit the same job again without changing the submit file shown above. If you restart from the ckpt_*.dmtcp files instead of dmtcp_restart_script.sh, remember to delete the old ckpt_*.dmtcp files first.
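
Because a restart script appears only after the first checkpoint has been written, a single submit file could in principle serve for both the first run and every resubmission by checking whether that script exists. The sketch below shows only the decision step and prints what it would run. The choose_mode helper is hypothetical, and the sketch assumes the restart script sits in the job's working directory.

```shell
#!/bin/sh
# Decide whether a job should start fresh or resume from a checkpoint.
# choose_mode is a hypothetical helper; it only inspects the given directory.
choose_mode() {
    dir=${1:-.}
    if [ -x "$dir/dmtcp_restart_script.sh" ]; then
        echo restart
    else
        echo launch
    fi
}

case $(choose_mode .) in
    restart)
        # A coordinator must be started first, as in the restart submit file.
        echo "would run: ./dmtcp_restart_script.sh"
        ;;
    launch)
        echo "would run: dmtcp_launch --new-coordinator --rm --interval 3600 <your_command>"
        ;;
esac
```

In a real submit file, the echo lines would be replaced by the corresponding commands from the two submit files shown earlier.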

Even though DMTCP tries to support most mainstream and commonly used applications, there is no guarantee that every application can be checkpointed and restarted.