DMTCP (Distributed MultiThreaded Checkpointing) is a checkpointing package for applications. Checkpointing allows a simulation to be resumed after it fails because of resource problems (e.g. hardware or software failures, or exceeded time and memory limits).
DMTCP supports both sequential and multi-threaded applications. Some examples of binary programs on Linux distributions that can be used with DMTCP are OpenMP, MATLAB, Python, Perl, MySQL, bash, gdb, X-Windows etc.
DMTCP provides support for several resource managers, including SLURM, the resource manager used at HCC. The DMTCP module is available on Crane and is enabled by typing:
module load dmtcp
After the module is loaded, the first step is to run the command:
[<username>@login.crane ~]$ dmtcp_launch --new-coordinator --rm --interval <interval_time_seconds> <your_command>
The --rm option enables SLURM support, <interval_time_seconds> is the time in seconds between automatic checkpoints, and <your_command> is the actual command you want to run and checkpoint.
Besides the general options shown above, more can be seen by using:
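Since the interval is given in seconds, it can be convenient to compute it from a more readable unit. The following is a minimal sketch and not part of DMTCP: the helper name build_launch_cmd and the program ./my_simulation are hypothetical, and the function only prints the command line so that it can be inspected before it is run.

```shell
#!/bin/sh
# Hypothetical helper: print a dmtcp_launch command line for a given
# checkpoint interval (in minutes) followed by the command to run.
build_launch_cmd() {
    minutes=$1
    shift
    # dmtcp_launch expects the interval in seconds.
    seconds=$((minutes * 60))
    echo "dmtcp_launch --new-coordinator --rm --interval $seconds $*"
}

# Example: checkpoint the placeholder program ./my_simulation every 30 minutes.
build_launch_cmd 30 ./my_simulation
# prints: dmtcp_launch --new-coordinator --rm --interval 1800 ./my_simulation
```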
[<username>@login.crane ~]$ dmtcp_launch --help
dmtcp_launch creates a few files that are used to resume the cancelled job, such as ckpt_*.dmtcp and dmtcp_restart_script*.sh. Unless otherwise stated (with the --ckptdir option), these files are stored in the current working directory.
The second step of DMTCP is to restart the cancelled job, and there are two ways of doing that:
dmtcp_restart ckpt_*.dmtcp <options> (before running this command, delete any old ckpt_*.dmtcp files in your current directory)
./dmtcp_restart_script.sh <options>
If there are no options defined in the <options> field, DMTCP will keep running with the options defined in the initial dmtcp_launch call (such as interval time, output directory etc).
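When restarting directly from checkpoint images, a stale ckpt_*.dmtcp file left over from an earlier run would also match the wildcard. As a rough sketch, assuming a single-process job that leaves one image per run, the hypothetical helper below picks the most recently modified image in a directory so that only that file is passed to dmtcp_restart. It is not a DMTCP feature, and it relies on file modification times and on file names without spaces or newlines.

```shell
#!/bin/sh
# Hypothetical helper: print the most recently modified ckpt_*.dmtcp file
# in the given directory (default: current directory); fail if none exist.
newest_checkpoint() {
    dir=${1:-.}
    # ls -t sorts by modification time, newest first.
    newest=$(ls -t "$dir"/ckpt_*.dmtcp 2>/dev/null | head -n 1)
    [ -n "$newest" ] || return 1
    echo "$newest"
}

# Dry run: print the restart command instead of executing it.
if image=$(newest_checkpoint .); then
    echo "dmtcp_restart $image"
else
    echo "no checkpoint image found" >&2
fi
```

For jobs that consist of several processes, each process has its own image, so this selection would not be appropriate there.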
A simple example of using DMTCP with BLAST on Crane is shown below:
#!/bin/sh
#SBATCH --job-name=BlastX
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --time=50:00:00
#SBATCH --mem=20gb
#SBATCH --output=BlastX_info_1.txt
#SBATCH --error=BlastX_error_1.txt

module load dmtcp
module load blast/2.4

cd $WORK/<project_folder>

# Copy the database and the input reads to /tmp
cp -r /work/HCC/DATA/blastdb/nr/ /tmp/
cp input_reads.fasta /tmp/

# SLURM_NTASKS matches --ntasks above; SLURM_NTASKS_PER_NODE is only set
# when --ntasks-per-node is requested
dmtcp_launch --new-coordinator --rm --interval 3600 blastx -query \
/tmp/input_reads.fasta -db /tmp/nr/nr -out blastx_output.alignments \
-num_threads $SLURM_NTASKS
In this example, DMTCP takes checkpoints every hour (--interval 3600), and the actual command we want to checkpoint is blastx with some general BLAST options defined with -query, -db, -out and -num_threads.
If this job is killed for any reason, it can be restarted using the following submit file:
#!/bin/sh
#SBATCH --job-name=BlastX
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --time=50:00:00
#SBATCH --mem=20gb
#SBATCH --output=BlastX_info_2.txt
#SBATCH --error=BlastX_error_2.txt

module load dmtcp
module load blast/2.4

cd $WORK/<project_folder>
cp -r /work/HCC/DATA/blastdb/nr/ /tmp/
cp input_reads.fasta /tmp/

# Start DMTCP
dmtcp_coordinator --daemon --port 0 --port-file /tmp/port
export DMTCP_COORD_HOST=`hostname`
export DMTCP_COORD_PORT=$(</tmp/port)

# Restart job
./dmtcp_restart_script.sh
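The restart submit file reads the coordinator port from /tmp/port right after starting the coordinator. Whether that file is always complete at that point is not established here, so as a defensive sketch the hypothetical helper below waits until a file exists and is non-empty, and gives up after a timeout. The demonstration uses a temporary file and a placeholder port value instead of a real coordinator.

```shell
#!/bin/sh
# Hypothetical helper: wait until FILE exists and is non-empty, checking
# once per second, and give up after TIMEOUT seconds (default 10).
wait_for_file() {
    file=$1
    timeout=${2:-10}
    waited=0
    while [ ! -s "$file" ]; do
        [ "$waited" -lt "$timeout" ] || return 1
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Demonstration: a temporary file stands in for the port file, and the
# placeholder value 7779 is written to it after a short delay.
port_file=$(mktemp)
( sleep 1; echo 7779 > "$port_file" ) &
if wait_for_file "$port_file" 5; then
    echo "port: $(cat "$port_file")"
else
    echo "timed out waiting for $port_file" >&2
fi
rm -f "$port_file"
```

In a submit file, such a check could be placed between the dmtcp_coordinator call and the export of DMTCP_COORD_PORT, for example as wait_for_file /tmp/port 10 || exit 1.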
dmtcp_restart generates new ckpt_*.dmtcp and dmtcp_restart_script*.sh files. Therefore, if the restarted job is also killed due to unavailable or exceeded resources, you can resubmit the same job again without any changes to the submit file shown above (just don't forget to delete the old ckpt_*.dmtcp files if you are using these files instead of dmtcp_restart_script.sh).
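Because the restart submit file can be resubmitted unchanged, the first-run and restart logic could in principle live in a single script. The following is a sketch of that idea, assuming that an executable dmtcp_restart_script.sh in the working directory means an earlier run left a checkpoint. The function name choose_action is hypothetical, and the commands are only printed, not executed.

```shell
#!/bin/sh
# Hypothetical helper: decide whether a job should start fresh or resume,
# based on whether a DMTCP restart script exists in the given directory.
choose_action() {
    dir=${1:-.}
    if [ -x "$dir/dmtcp_restart_script.sh" ]; then
        echo restart
    else
        echo launch
    fi
}

case $(choose_action .) in
    restart)
        # An earlier run left a restart script: resume from it.
        echo "would run: ./dmtcp_restart_script.sh"
        ;;
    launch)
        # No restart script found: start the command under DMTCP.
        # <your_command> is a placeholder for the program to checkpoint.
        echo "would run: dmtcp_launch --new-coordinator --rm --interval 3600 <your_command>"
        ;;
esac
```

When resuming, the coordinator setup from the restart submit file would still be needed before the restart script is called.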
Even though DMTCP tries to support most mainstream and commonly used applications, there is no guarantee that every application can be checkpointed and restarted.