DMTCP (Distributed MultiThreaded Checkpointing) is a checkpointing package for applications. Checkpointing allows a simulation to be resumed after it fails because of resource problems (e.g. hardware or software failures, or exceeded time and memory limits).
DMTCP supports both sequential and multi-threaded applications. Some examples of binary programs on Linux distributions that can be used with DMTCP are OpenMP, MATLAB, Python, Perl, MySQL, bash, gdb, X-Windows etc.
DMTCP provides support for several resource managers, including SLURM, the resource manager used in HCC. The DMTCP module is available on Swan, and is enabled by typing:
module load dmtcp
After the module is loaded, the first step is to run the command:
[<username>@login1.swan ~]$ dmtcp_launch --new-coordinator --rm --interval <interval_time_seconds> <your_command>
where the --rm option enables SLURM support, <interval_time_seconds> is the time in seconds between automatic checkpoints, and <your_command> is the actual command you want to run and checkpoint.
Besides the general options shown above, more dmtcp_launch options can be seen by using:
[<username>@login1.swan ~]$ dmtcp_launch --help
dmtcp_launch creates a few files that are used to resume the cancelled job, such as ckpt_*.dmtcp and dmtcp_restart_script*.sh. Unless otherwise stated (using the --ckptdir option), these files are stored in the current working directory.
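Since these files land in the working directory by default, it can be useful to check what a previous run left behind before restarting. The following is a minimal sketch that relies only on the file name patterns mentioned above; list_dmtcp_files is a hypothetical helper for illustration, not a DMTCP command.

```shell
#!/bin/bash
# Hypothetical helper (not part of DMTCP): print the checkpoint images and
# restart scripts found in a directory (default: current directory).
list_dmtcp_files() {
    local dir="${1:-.}"
    local f
    for f in "$dir"/ckpt_*.dmtcp "$dir"/dmtcp_restart_script*.sh; do
        # An unmatched glob stays literal, so only print paths that exist.
        if [ -e "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
    return 0
}
```

Calling list_dmtcp_files with the directory given to --ckptdir (or with no argument for the current directory) prints one matching file per line, and prints nothing if no checkpoint has been written yet.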
The second step of DMTCP is to restart the cancelled job, and there are two ways of doing that:
dmtcp_restart ckpt_*.dmtcp <options> (before running this command, delete any old ckpt_*.dmtcp files in your current working directory)
running the generated dmtcp_restart_script*.sh script with <options>
If there are no options defined in the <options> field, DMTCP will keep running with the options defined in the initial dmtcp_launch call (such as interval time, output directory etc).
A simple example of using DMTCP with BLAST on Swan is shown below:
#!/bin/bash
#SBATCH --job-name=BlastX
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --time=50:00:00
#SBATCH --mem=20gb
#SBATCH --output=BlastX_info_1.txt
#SBATCH --error=BlastX_error_1.txt
module load dmtcp
module load blast/2.4
cd $WORK/<project_folder>
cp -r /work/HCC/DATA/blastdb/nr/ /tmp/
cp input_reads.fasta /tmp/
dmtcp_launch --new-coordinator --rm --interval 3600 blastx -query \
/tmp/input_reads.fasta -db /tmp/nr/nr -out blastx_output.alignments \
In this example, DMTCP takes checkpoints every hour (--interval 3600), and the actual command we want to checkpoint is blastx with some general BLAST options such as -query, -db, and -out.
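The interval is a trade-off: a shorter interval means less work is lost when the job is killed (roughly at most one interval since the last completed checkpoint), but more time is spent writing checkpoints. As a rough sketch of the arithmetic for the submit file above, the hypothetical helper max_checkpoints (not part of DMTCP) gives an upper bound on how many automatic checkpoints fit into the time limit.

```shell
#!/bin/bash
# Hypothetical helper (not part of DMTCP): upper bound on the number of
# automatic checkpoints that fit into a wall-time limit. It ignores the time
# each checkpoint itself takes, so the real number may be lower.
max_checkpoints() {
    local walltime_s=$1
    local interval_s=$2
    echo $(( walltime_s / interval_s ))
}

# --time=50:00:00 and --interval 3600 from the example above
max_checkpoints $(( 50 * 3600 )) 3600   # prints 50
```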
If this job is killed for any reason, it can be restarted using the following submit file:
#!/bin/bash
#SBATCH --job-name=BlastX
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --time=50:00:00
#SBATCH --mem=20gb
#SBATCH --output=BlastX_info_2.txt
#SBATCH --error=BlastX_error_2.txt
module load dmtcp
module load blast/2.4
cd $WORK/<project_folder>
cp -r /work/HCC/DATA/blastdb/nr/ /tmp/
cp input_reads.fasta /tmp/
# Start DMTCP
dmtcp_coordinator --daemon --port 0 --port-file /tmp/port
export DMTCP_COORD_HOST=$(hostname)
export DMTCP_COORD_PORT=$(</tmp/port)
# Restart job
Restarting the job generates new ckpt_*.dmtcp and dmtcp_restart_script*.sh files. Therefore, if the restarted job is also killed due to unavailable/exceeded resources, you can resubmit the same job again without any changes in the submit file shown above (just don't forget to delete the old ckpt_*.dmtcp files if you are using these files instead of the restart script).
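Because a restarted job leaves the same kinds of files behind as the first run, one possible design is a single submit file that decides for itself whether to launch or restart. The sketch below shows only that decision; dmtcp_mode is a hypothetical helper, and it assumes the restart script sits in the directory being checked (the default location unless --ckptdir was used).

```shell
#!/bin/bash
# Hypothetical helper (not part of DMTCP): print "restart" if a restart
# script from a previous run exists in the given directory, else "launch".
dmtcp_mode() {
    local dir="${1:-.}"
    local f
    for f in "$dir"/dmtcp_restart_script*.sh; do
        if [ -e "$f" ]; then
            echo "restart"
            return 0
        fi
    done
    echo "launch"
}

# Sketch of use inside a submit file:
#   if [ "$(dmtcp_mode .)" = "restart" ]; then
#       ... start the coordinator and restart the job as shown above ...
#   else
#       dmtcp_launch --new-coordinator --rm --interval 3600 <your_command>
#   fi
```

With this approach the same file can be resubmitted after every kill, at the cost of having to clear out leftover files by hand when a genuinely fresh run is wanted.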
Even though DMTCP tries to support most mainstream and commonly used applications, there is no guarantee that every application can be checkpointed and restarted.