CAP3 (Contig Assembly Program) is a DNA sequence assembly program for small-scale assembly with or without quality values.
The basic usage of CAP3 is:
$ cap3 input_reads.fasta [options] > output.txt
`input_reads.fasta` is an input file of sequence reads in FASTA format, and `options` are optional parameters that can be listed by typing:
$ cap3
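As an illustration of passing options, the sketch below tightens the overlap criteria. It assumes the `-o` (overlap length cutoff) and `-p` (overlap percent identity cutoff) flags as they appear in the CAP3 option listing; the values 50 and 95 are arbitrary examples, not recommendations, so check them against the output of `cap3` on your system. CAP3's documentation also describes reading quality values from a file named after the reads file with a `.qual` suffix, which the sketch only reports on.

```shell
reads="input_reads.fasta"

# Assumed example values: overlaps of at least 50 bases (-o)
# with at least 95% identity (-p). Confirm both flags with `cap3`.
min_overlap=50
min_identity=95

# CAP3 is documented to use "<reads>.qual" for quality values when it exists.
if [ -f "$reads.qual" ]; then
    echo "quality file found: $reads.qual" >&2
fi

# Run only when the cap3 binary is available on PATH.
if command -v cap3 >/dev/null 2>&1; then
    cap3 "$reads" -o "$min_overlap" -p "$min_identity" > output.txt
else
    echo "cap3 not on PATH; on Swan run 'module load cap3/122107' first" >&2
fi
```

Keeping the cutoffs in variables makes it easy to record in the job log which settings produced a given assembly.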
A basic SLURM script for running CAP3 on Swan is shown below:
cap3.submit
#!/bin/bash
#SBATCH --job-name=CAP3
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=168:00:00
#SBATCH --mem=10gb
#SBATCH --output=CAP3.%J.out
#SBATCH --error=CAP3.%J.err
module load cap3/122107
cap3 input_reads.fasta > output.txt
CAP3 is a single-threaded program, so both `#SBATCH --nodes` and `#SBATCH --ntasks-per-node` are set to 1.
CAP3 returns several output files: `input_reads.fasta.cap.singlets`, `input_reads.fasta.cap.contigs`, `input_reads.fasta.cap.contigs.links`, `input_reads.fasta.cap.contigs.qual`, `input_reads.fasta.cap.ace`, and `input_reads.fasta.cap.info`.
The consensus sequences are saved in FASTA format in the file `input_reads.fasta.cap.contigs`, while the reads that are not used in the assembly are stored in the FASTA file `input_reads.fasta.cap.singlets`.
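A quick way to sanity-check a finished run is to count the records in the contig and singlet files. The following is a minimal sketch; the helper name `count_fasta_records` is hypothetical, and it assumes the default output prefix that results from the `input_reads.fasta` run above.

```shell
# Count FASTA records (header lines starting with '>') in a file.
count_fasta_records() {
    # grep -c prints 0 but exits non-zero when nothing matches;
    # "|| true" keeps the exit status clean in that case.
    grep -c '^>' "$1" || true
}

prefix="input_reads.fasta.cap"   # assumed output prefix from the run above

for f in "$prefix.contigs" "$prefix.singlets"; do
    if [ -f "$f" ]; then
        echo "$f: $(count_fasta_records "$f") sequences"
    else
        echo "$f: not found" >&2
    fi
done
```

A large number of singlets relative to the number of input reads may suggest that the overlap criteria are too strict for the dataset.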
To test the performance of CAP3 (cap3/122107) on Swan, we created three separate nucleotide datasets: `small.fasta`, `medium.fasta`, and `large.fasta`. Some statistics about the input datasets, along with the time and memory that CAP3 used on Swan, are shown in the table below:
dataset | total # of sequences | total # of bases | total size in MB | used time | used memory | # of used CPUs |
---|---|---|---|---|---|---|
small.fasta | 41,715 | 35,581,740 | 37.627 | ~ 1.6 hours | ~ 1.5 GB | 1 |
medium.fasta | 110,478 | 147,543,113 | 149 | ~ 2 hours | ~ 5 GB | 1 |
large.fasta | 592,593 | 827,629,204 | 836 | ~ 12 hours | ~ 28 GB | 1 |
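
To compare your own input with these datasets before choosing `--time` and `--mem`, the first three columns can be computed with standard tools. This is a sketch only: the helper name `fasta_stats` is hypothetical, and it assumes that 1 MB means 10^6 bytes, which may differ from how the sizes in the table were measured.

```shell
# Print the number of sequences, total bases, and file size in MB for a FASTA file.
fasta_stats() {
    awk '
        /^>/ { nseq++; next }                # header line: one more sequence
             { gsub(/[ \t\r]/, "")           # strip whitespace from sequence lines
               nbases += length($0) }        # then count the remaining residues
        END  { printf "sequences=%d bases=%d\n", nseq, nbases }
    ' "$1"
    # wc -c reports the size in bytes; convert to MB (assumed 10^6 bytes).
    awk -v bytes="$(wc -c < "$1")" 'BEGIN { printf "size_MB=%.3f\n", bytes / 1e6 }'
}

# Report statistics for whichever of the example datasets are present.
for f in small.fasta medium.fasta large.fasta; do
    if [ -f "$f" ]; then
        echo "== $f"
        fasta_stats "$f"
    fi
done
```

An input close in size to `large.fasta` would need more memory than the 10 GB requested in the example script, according to the measurements in the table.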