DMTCP (Distributed MultiThreaded Checkpointing) is a checkpointing package for applications. Checkpointing allows a simulation to be resumed after it fails because of resource problems (e.g. hardware or software failures, or exceeded time and memory limits).
DMTCP supports both sequential and multi-threaded applications. Some examples of binary programs on Linux distributions that can be used with DMTCP are OpenMP, MATLAB, Python, Perl, MySQL, bash, gdb, X-Windows etc.
DMTCP provides support for several resource managers, including SLURM, the resource manager used in HCC. The DMTCP module is available on Swan, and is enabled by typing:
module load dmtcp
After the module is loaded, the first step is to run the command:
[<username>@login1.swan ~]$ dmtcp_launch --new-coordinator --rm --interval <interval_time_seconds> <your_command>
where the --rm option enables SLURM support, <interval_time_seconds> is the time in seconds between automatic checkpoints, and <your_command> is the actual command you want to run and checkpoint.
Besides the general options shown above, more dmtcp_launch options can be seen by using:
[<username>@login1.swan ~]$ dmtcp_launch --help
dmtcp_launch creates a few files that are used to resume the cancelled job, such as ckpt_*.dmtcp and dmtcp_restart_script*.sh. Unless otherwise stated (using the --ckptdir option), these files are stored in the current working directory.
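To see which checkpoint files a run has produced so far, a small helper along the following lines can be used. This is only a minimal sketch: the function name list_checkpoint_files is hypothetical, and it relies solely on the ckpt_*.dmtcp and dmtcp_restart_script*.sh naming described above.

```shell
#!/bin/bash
# Hypothetical helper (not part of DMTCP): list checkpoint artifacts in a directory.
# Defaults to the current working directory, which is where dmtcp_launch
# stores them unless --ckptdir is given.
list_checkpoint_files() {
    local dir="${1:-.}"
    # Match both checkpoint images and generated restart scripts,
    # without descending into subdirectories.
    find "$dir" -maxdepth 1 \
        \( -name 'ckpt_*.dmtcp' -o -name 'dmtcp_restart_script*.sh' \) | sort
}
```

Calling list_checkpoint_files with no argument inspects the current directory; pass the directory given to --ckptdir if that option was used.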
The second step of DMTCP is to restart the cancelled job, and there are two ways of doing that:
dmtcp_restart ckpt_*.dmtcp <options> (before running this command, delete any old ckpt_*.dmtcp files in your current directory)
./dmtcp_restart_script.sh <options>
If no options are given in the <options> field, DMTCP keeps running with the options defined in the initial dmtcp_launch call (such as interval time, output directory, etc.).
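The choice between the two restart methods can be sketched as a small shell function. This is an illustration under assumptions, not a DMTCP feature: the name restart_command is hypothetical, and the function only prints the command it would use rather than running it.

```shell
#!/bin/bash
# Hypothetical helper (not part of DMTCP): print the restart command for a directory.
# Prefers the generated restart script and falls back to the checkpoint images.
restart_command() {
    local dir="${1:-.}"
    if [ -e "$dir/dmtcp_restart_script.sh" ]; then
        echo "$dir/dmtcp_restart_script.sh"
    elif ls "$dir"/ckpt_*.dmtcp >/dev/null 2>&1; then
        # The glob is printed literally; it expands when the command is run.
        echo "dmtcp_restart $dir/ckpt_*.dmtcp"
    else
        echo "no checkpoint found in $dir" >&2
        return 1
    fi
}
```

The printed command can be inspected before it is run. When falling back to the ckpt_*.dmtcp images, remember that stale images from earlier runs should be removed first, as noted above.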
A simple example of using DMTCP with BLAST on Swan is shown below:
#!/bin/bash
#SBATCH --job-name=BlastX
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --time=50:00:00
#SBATCH --mem=20gb
#SBATCH --output=BlastX_info_1.txt
#SBATCH --error=BlastX_error_1.txt
module load dmtcp
module load blast/2.4
cd $WORK/<project_folder>
cp -r /work/HCC/DATA/blastdb/nr/ /tmp/
cp input_reads.fasta /tmp/
dmtcp_launch --new-coordinator --rm --interval 3600 blastx -query \
/tmp/input_reads.fasta -db /tmp/nr/nr -out blastx_output.alignments \
-num_threads $SLURM_NTASKS
In this example, DMTCP takes checkpoints every hour (--interval 3600), and the actual command we want to checkpoint is blastx with some general BLAST options defined with -query, -db, -out, and -num_threads.
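The arithmetic behind the interval can be made explicit with two small helpers. Both function names are hypothetical and serve only to illustrate the calculation; the second gives a rough upper bound, because it ignores the time spent writing each checkpoint.

```shell
#!/bin/bash
# Hypothetical helper: convert hours to the number of seconds passed to --interval.
hours_to_seconds() {
    echo $(( $1 * 3600 ))
}

# Hypothetical helper: rough upper bound on the number of automatic checkpoints
# that fit in a wall-time limit.
# Arguments: wall time in hours, checkpoint interval in seconds.
checkpoints_in_walltime() {
    echo $(( $1 * 3600 / $2 ))
}
```

For the job above, hours_to_seconds 1 yields the 3600 passed to --interval, and checkpoints_in_walltime 50 3600 shows that at most 50 checkpoints fit in the 50-hour limit.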
If this job is killed for any reason, it can be restarted using the following submit file:
#!/bin/bash
#SBATCH --job-name=BlastX
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --time=50:00:00
#SBATCH --mem=20gb
#SBATCH --output=BlastX_info_2.txt
#SBATCH --error=BlastX_error_2.txt
module load dmtcp
module load blast/2.4
cd $WORK/<project_folder>
cp -r /work/HCC/DATA/blastdb/nr/ /tmp/
cp input_reads.fasta /tmp/
# Start DMTCP
dmtcp_coordinator --daemon --port 0 --port-file /tmp/port
export DMTCP_COORD_HOST=`hostname`
export DMTCP_COORD_PORT=$(</tmp/port)
# Restart job
./dmtcp_restart_script.sh
dmtcp_restart generates new ckpt_*.dmtcp and dmtcp_restart_script*.sh files. Therefore, if the restarted job is also killed due to unavailable/exceeded resources, you can resubmit the same job again without any changes in the submit file shown above (just don’t forget to delete the old ckpt_*.dmtcp files if you are using these files instead of dmtcp_restart_script.sh).
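Because the restart script appears only after a checkpoint has been taken, its presence can be used to tell a first run from a resubmission. The following is a minimal sketch of that check; the function name job_mode is hypothetical, and the check assumes the checkpoint files are in the job's working directory.

```shell
#!/bin/bash
# Hypothetical helper (not part of DMTCP): decide whether a submission is a
# first run or a restart, based on whether a restart script already exists.
# Prints "restart" or "launch".
job_mode() {
    local dir="${1:-.}"
    if [ -e "$dir/dmtcp_restart_script.sh" ]; then
        echo "restart"
    else
        echo "launch"
    fi
}
```

A submit file could branch on the output of job_mode, running the dmtcp_launch command from the first example when it prints launch, and the coordinator setup followed by ./dmtcp_restart_script.sh when it prints restart.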
Even though DMTCP tries to support most mainstream and commonly used applications, there is no guarantee that every application can be checkpointed and restarted.