SOAPdenovo is a de novo genome assembler for short reads. It is specially designed for Illumina GA short reads and large plant and animal genomes. SOAPdenovo2 is a newer version of SOAPdenovo with an improved algorithm that reduces memory consumption, resolves more repeat regions, increases coverage, and optimizes the assembly for large genomes.

SOAPdenovo2 has two commands, SOAPdenovo-63mer and SOAPdenovo-127mer. The first is suitable for assemblies with k-mer values up to 63 bp, requires less memory, and runs faster. The second supports k-mer values up to 127 bp.

In order to see the options available for SOAPdenovo-63mer just type:

$ SOAPdenovo-63mer

SOAPdenovo2 can run the whole workflow at once, or as 4 separate steps.

The basic usage of SOAPdenovo2 is:

$ SOAPdenovo-63mer all -s configFile -o output_directory/outputGraph -K <kmer_value> [options]
where configFile is a defined configuration file, outputGraph is the prefix of the output files, and kmer_value is the value of k-mer used for building the assembly (<=63 for SOAPdenovo-63mer and <=127 for SOAPdenovo-127mer).
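
The k-mer limits above determine which of the two commands can be used. A minimal sketch of that choice is shown below; pick_soapdenovo_binary is a hypothetical helper name introduced here for illustration and is not part of SOAPdenovo2.

```shell
#!/bin/sh
# Hypothetical helper (not part of SOAPdenovo2): print the command name
# that supports the given k-mer value, based on the limits stated above.
pick_soapdenovo_binary() {
    k="$1"
    # Reject empty or non-numeric input.
    case "$k" in
        ''|*[!0-9]*)
            echo "k-mer value must be an integer" >&2
            return 1
            ;;
    esac
    if [ "$k" -le 63 ]; then
        # Smaller k: the 63mer build needs less memory and runs faster.
        echo "SOAPdenovo-63mer"
    elif [ "$k" -le 127 ]; then
        echo "SOAPdenovo-127mer"
    else
        echo "k-mer value $k is above the 127 limit" >&2
        return 1
    fi
}

# Example: show which command would be used for two k-mer values.
pick_soapdenovo_binary 31
pick_soapdenovo_binary 95
```

The helper prefers SOAPdenovo-63mer whenever the k-mer value allows it, since that command is described above as the lighter and faster of the two.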

If you want to run the assembly process step by step, then use the following sequential commands:

SOAPdenovo2 Step 1 Options (use either pregraph or sparse_pregraph)
SOAPdenovo-63mer pregraph -s configFile -o outputGraph [options]
SOAPdenovo-63mer sparse_pregraph -s configFile -K <kmer_value> -z <genome_size> -o outputGraph [options]
SOAPdenovo2 Step 2 Options
SOAPdenovo-63mer contig -g inputGraph [options]
SOAPdenovo2 Step 3 Options
SOAPdenovo-63mer map -s configFile -g inputGraph [options]
SOAPdenovo2 Step 4 Options
SOAPdenovo-63mer scaff -g inputGraph [options]
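
The four steps can be chained in a small wrapper script. The sketch below is one possible way to do this, not the method recommended by the SOAPdenovo2 authors. CONFIG, PREFIX, KMER, DRY_RUN and the run_step helper are names and placeholder values chosen here for illustration, and passing -K to pregraph is assumed to work the same way as for the all command. By default the script only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch: run the four SOAPdenovo2 steps in order, stopping at the first failure.
# All values below are placeholders chosen for illustration.
CONFIG="configFile"
PREFIX="output_directory/output31"
KMER=31
# DRY_RUN=1 (default) prints each command; DRY_RUN=0 executes it.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print or execute one command.
run_step() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@" || { echo "step failed: $*" >&2; exit 1; }
    fi
}

# Step 1: build the pregraph (sparse_pregraph is the alternative for this step).
run_step SOAPdenovo-63mer pregraph -s "$CONFIG" -K "$KMER" -o "$PREFIX"
# Step 2: build contigs from the graph written in step 1.
run_step SOAPdenovo-63mer contig -g "$PREFIX"
# Step 3: map the reads back to the contigs.
run_step SOAPdenovo-63mer map -s "$CONFIG" -g "$PREFIX"
# Step 4: build scaffolds.
run_step SOAPdenovo-63mer scaff -g "$PREFIX"
```

The output prefix of step 1 is reused as the inputGraph argument of steps 2 to 4, which is why a single PREFIX variable is enough.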

As the commands above show, running SOAPdenovo2 first requires a config file (configFile) that describes the read files (read length, insert size, read locations). SOAPdenovo2 accepts read files in 3 formats: fasta, fastq and bam.

The example configuration file configFile for 2 paired-end fastq files, 1 paired-end fasta file and 1 single-end fastq file looks like:

#maximal read length
#average insert size of the library
#if sequences are forward-reverse or reverse-forward
#in which part(s) the reads are used (only contigs, only scaffolds, both contigs and scaffolds, only gap closure)
#cut the reads to the given length
#in which order the reads are used while scaffolding
# cutoff of pair number for a reliable connection (at least 3 for short insert size)
#minimum aligned length to contigs for a reliable read location (at least 32 for short insert size)
#paired-end fastq files, read 1 file should always be followed by read 2 file
#another pair of paired-end fastq files, read 1 file should always be followed by read 2 file
#paired-end fasta files, read 1 file should always be followed by read 2 file
#fastq file for single reads
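
Each comment above describes one setting in the config file. The sketch below shows how such a file might be written from the shell, using the key names from the SOAPdenovo2 documentation (max_rd_len, avg_ins, reverse_seq, asm_flags, rd_len_cutoff, rank, pair_num_cutoff, map_len, q1/q2, f1/f2, q). Every numeric value and every /path/to/... location is a placeholder assumed for illustration and must be replaced with values that match your own libraries.

```shell
#!/bin/sh
# Write an illustrative SOAPdenovo2 config file.
# All numeric values and /path/to/... locations are placeholders, not recommendations.
cat > configFile <<'EOF'
#maximal read length
max_rd_len=100
[LIB]
#average insert size of the library
avg_ins=200
#if sequences are forward-reverse (0) or reverse-forward (1)
reverse_seq=0
#in which part(s) the reads are used (1: only contigs, 2: only scaffolds, 3: both, 4: only gap closure)
asm_flags=3
#cut the reads to the given length
rd_len_cutoff=100
#in which order the reads are used while scaffolding
rank=1
#cutoff of pair number for a reliable connection (at least 3 for short insert size)
pair_num_cutoff=3
#minimum aligned length to contigs for a reliable read location (at least 32 for short insert size)
map_len=32
#paired-end fastq files, read 1 file should always be followed by read 2 file
q1=/path/to/pair1_read_1.fq
q2=/path/to/pair1_read_2.fq
#another pair of paired-end fastq files, read 1 file should always be followed by read 2 file
q1=/path/to/pair2_read_1.fq
q2=/path/to/pair2_read_2.fq
#paired-end fasta files, read 1 file should always be followed by read 2 file
f1=/path/to/pair3_read_1.fa
f2=/path/to/pair3_read_2.fa
#fastq file for single reads
q=/path/to/single_reads.fq
EOF

# Quick sanity check: list the read-file entries that were written.
grep -E '^(q1|q2|f1|f2|q)=' configFile
```

The quoted 'EOF' delimiter keeps the shell from expanding anything inside the file body, so the content is written exactly as typed.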

After creating the configuration file configFile, the next step is to run the assembler using this file.

A simple SLURM script for running SOAPdenovo2 with k-mer=31, 8 CPUs and 50 GB of RAM is shown below:

#!/bin/bash
#SBATCH --job-name=SOAPdenovo2
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=168:00:00
#SBATCH --mem=50gb
#SBATCH --output=SOAPdenovo2.%J.out
#SBATCH --error=SOAPdenovo2.%J.err

module load soapdenovo2/r240

SOAPdenovo-63mer all -s configFile -K 31 -o output_directory/output31 -p $SLURM_NTASKS_PER_NODE

SOAPdenovo2 Output

SOAPdenovo2 writes a number of files to output_directory/ after each executed step. The final assembly output is in the .contig file.

Output directory after SOAPdenovo2
$ ls
output31.Arc            output31.ContigIndex       output31.gapSeq    output31.newContigInde
output31.bubbleInScaff  output31.contigPosInscaff  output31.kmerFreq  output31.peGrads
output31.contig         output31.edge.gz           output31.links     output31.preArc
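
A quick way to inspect the result is to count the sequences in the .contig file. The sketch below assumes the file is in FASTA format, where every sequence starts with a header line beginning with ">". count_contigs is a hypothetical helper name, and the file path reuses the output prefix from the SLURM script above.

```shell
#!/bin/sh
# Hypothetical helper: count sequences in a FASTA-formatted file
# by counting header lines (those starting with ">").
count_contigs() {
    grep -c '^>' "$1"
}

# Example usage with the output prefix used above; skipped if the file is absent.
CONTIG_FILE="output_directory/output31.contig"
if [ -f "$CONTIG_FILE" ]; then
    echo "Contigs in $CONTIG_FILE: $(count_contigs "$CONTIG_FILE")"
else
    echo "No contig file found at $CONTIG_FILE" >&2
fi
```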

Useful Information

To test the performance of SOAPdenovo2 (soapdenovo2/r240), we used three input files of different sizes. Statistics about the input files and the time and memory used by SOAPdenovo2 are shown in the table below:

              Total # of sequences  Total # of bases  Total size (GB)  Time used    Memory used  k-mer value  # of CPUs used
Input data 1  49,720,374            7,295,342,636     10               ~ 1.5 hours  ~ 32 GB      31           8
Input data 2  166,040,440           25,072,106,440    62               ~ 6.5 hours  ~ 125 GB     31           8
Input data 3  318,681,730           48,120,941,230    115              ~ 13 hours   ~ 125 GB     31           8

In general, SOAPdenovo2 is a memory-intensive assembler that requires approximately 30-60 GB of memory to assemble 50 million reads. It is also a fast assembler: with 8 CPUs, assembling about 50 million reads took roughly an hour and a half (Input data 1 above).