SOAPdenovo2

SOAPdenovo is a de novo genome assembler for short reads. It is specially designed for Illumina GA short reads and for large plant and animal genomes. SOAPdenovo2 is a newer version of SOAPdenovo with an improved algorithm that reduces memory consumption, resolves more repeat regions, increases coverage, and optimizes the assembly for large genomes.

SOAPdenovo2 has two commands, SOAPdenovo-63mer and SOAPdenovo-127mer. The first is suitable for assemblies with k-mer values up to 63, requires less memory, and runs faster. The second works for k-mer values up to 127.

To see the options available for SOAPdenovo-63mer, run it without arguments:

$ SOAPdenovo-63mer

SOAPdenovo2 can run the whole workflow at once, or in 4 separate steps.

The basic usage of SOAPdenovo2 is:

$ SOAPdenovo-63mer all -s configFile -o output_directory/outputGraph -K <kmer_value> [options]
where configFile is a defined configuration file, outputGraph is the prefix of the output files, and kmer_value is the value of k-mer used for building the assembly (<=63 for SOAPdenovo-63mer and <=127 for SOAPdenovo-127mer).
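The k-mer limits above determine which of the two commands to call. A minimal sketch of that choice follows; the function name pick_soap_binary is hypothetical and is not part of SOAPdenovo2.

```shell
# Hypothetical helper: print the SOAPdenovo2 command that supports a given k-mer value.
pick_soap_binary() {
    k=$1
    # Reject anything that is not a plain number.
    case $k in
        ''|*[!0-9]*)
            echo "k-mer value must be an integer: '$k'" >&2
            return 1
            ;;
    esac
    if [ "$k" -le 63 ]; then
        echo SOAPdenovo-63mer      # less memory, faster
    elif [ "$k" -le 127 ]; then
        echo SOAPdenovo-127mer
    else
        echo "k-mer value $k exceeds 127" >&2
        return 1
    fi
}
```

For example, pick_soap_binary 31 prints SOAPdenovo-63mer, and pick_soap_binary 95 prints SOAPdenovo-127mer.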

If you want to run the assembly process step by step, then use the following sequential commands:

SOAPdenovo2 Step 1 Options
SOAPdenovo-63mer pregraph -s configFile -o outputGraph [options]
OR
SOAPdenovo-63mer sparse_pregraph -s configFile -K <kmer_value> -z <genome_size> -o outputGraph [options]
SOAPdenovo2 Step 2 Options
SOAPdenovo-63mer contig -g inputGraph [options]
SOAPdenovo2 Step 3 Options
SOAPdenovo-63mer map -s configFile -g inputGraph [options]
SOAPdenovo2 Step 4 Options
SOAPdenovo-63mer scaff -g inputGraph [options]
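The four steps above can be chained in one script, with every step sharing the same graph prefix. The following is a sketch only: the variable names, the run and run_steps helpers, and the DRY_RUN switch are assumptions made for illustration, and passing -K to pregraph is carried over from the "all" usage shown earlier. By default the script only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch: run the SOAPdenovo2 steps one after another with a shared prefix.
# All names below are illustrative defaults, not values required by SOAPdenovo2.
SOAP=${SOAP:-SOAPdenovo-63mer}
CONFIG=${CONFIG:-configFile}
PREFIX=${PREFIX:-output_directory/output31}
KMER=${KMER:-31}
DRY_RUN=${DRY_RUN:-1}   # 1 = print commands only, 0 = execute them

# Print a command and, unless in dry-run mode, execute it.
run() {
    echo "+ $*"
    [ "$DRY_RUN" = 1 ] || "$@"
}

# Each step starts only if the previous one succeeded.
run_steps() {
    run "$SOAP" pregraph -s "$CONFIG" -K "$KMER" -o "$PREFIX" &&
    run "$SOAP" contig -g "$PREFIX" &&
    run "$SOAP" map -s "$CONFIG" -g "$PREFIX" &&
    run "$SOAP" scaff -g "$PREFIX"
}

run_steps
```

Stopping at the first failing step avoids running later steps on an incomplete graph.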

As the commands above show, running SOAPdenovo2 first requires a config file (configFile) that describes the read files (read length, insert size, read file locations). SOAPdenovo2 accepts read files in 3 formats: fasta, fastq and bam.

An example configuration file configFile for two pairs of paired-end fastq files, one pair of paired-end fasta files and one single-end fastq file looks like:

configFile
#maximal read length
max_rd_len=150
[LIB]
#average insert size of the library
avg_ins=300
#if sequences are forward-reverse or reverse-forward
reverse_seq=0
#in which part(s) the reads are used (only contigs, only scaffolds, both contigs and scaffolds, only gap closure)
asm_flags=3
#cut the reads to the given length
rd_len_cutoff=100
#in which order the reads are used while scaffolding
rank=1
# cutoff of pair number for a reliable connection (at least 3 for short insert size)
pair_num_cutoff=3
#minimum aligned length to contigs for a reliable read location (at least 32 for short insert size)
map_len=32
#paired-end fastq files, read 1 file should always be followed by read 2 file
q1=input_reads1_pair_1.fq
q2=input_reads1_pair_2.fq
#another pair of paired-end fastq files, read 1 file should always be followed by read 2 file
q1=input_reads2_pair_1.fq
q2=input_reads2_pair_2.fq
#paired-end fasta files, read 1 file should always be followed by read 2 file
f1=input_reads_pair_1.fa
f2=input_reads_pair_2.fa
#fastq file for single reads
q=input_reads.fq
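Because the assembler depends on every read file named in the config, it can help to confirm that those files are readable before submitting a long job. The sketch below is an assumption-laden convenience, not a SOAPdenovo2 feature: check_config_reads is a hypothetical name, and it only looks at the q1, q2, f1, f2 and q keys used in the example above.

```shell
# Hypothetical pre-flight check: succeed only if every read file listed
# in the config (keys q1, q2, f1, f2, q) exists and is readable.
check_config_reads() {
    grep -E '^(q1|q2|f1|f2|q)=' "$1" | cut -d= -f2- | {
        missing=0
        while IFS= read -r path; do
            if [ ! -r "$path" ]; then
                echo "missing read file: $path" >&2
                missing=$((missing + 1))
            fi
        done
        # The status of this test becomes the function's return status.
        [ "$missing" -eq 0 ]
    }
}
```

A possible use is check_config_reads configFile && sbatch soapdenovo2.submit, so that the job is submitted only when all inputs are in place. Relative paths are resolved against the current directory, as they would be when the assembler is started from there.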

After creating the configuration file configFile, the next step is to run the assembler using this file.

A simple SLURM script for running SOAPdenovo2 with k-mer=31, 8 CPUs and 50 GB of RAM on Tusker is shown below:

soapdenovo2.submit
#!/bin/sh
#SBATCH --job-name=SOAPdenovo2
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=168:00:00
#SBATCH --mem=50gb
#SBATCH --output=SOAPdenovo2.%J.out
#SBATCH --error=SOAPdenovo2.%J.err

module load soapdenovo2/r240

SOAPdenovo-63mer all -s configFile -K 31 -o output_directory/output31 -p $SLURM_NTASKS_PER_NODE

SOAPdenovo2 Output

SOAPdenovo2 writes a number of files to its output_directory/ after each executed step. The final assembly output is in the .contig file.

Output directory after SOAPdenovo2
$ ls
output31.Arc            output31.ContigIndex       output31.gapSeq    output31.newContigInde
output31.bubbleInScaff  output31.contigPosInscaff  output31.kmerFreq  output31.peGrads
output31.contig         output31.edge.gz           output31.links     output31.preArc
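A quick sanity check on the result is to count the sequences in the .contig file. The sketch below assumes that file is FASTA-formatted, with one header line starting with ">" per contig; the function name count_contigs is hypothetical.

```shell
# Hypothetical helper: print the number of contigs in a FASTA-formatted .contig file.
count_contigs() {
    contig_file=$1
    # -s is true only if the file exists and is not empty.
    if [ ! -s "$contig_file" ]; then
        echo "no contig file at $contig_file" >&2
        return 1
    fi
    # Count header lines (lines starting with '>').
    grep -c '^>' "$contig_file"
}
```

For the run above, this would be called as count_contigs output_directory/output31.contig.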

Useful Information

To test the performance of SOAPdenovo2 (soapdenovo2/r240) on Tusker, we used three input files of different sizes. Statistics about the input files, and the time and memory used by SOAPdenovo2, are shown in the table below:

              total # of sequences  total # of bases  total size in GB  used time    used memory  used k-mer value  # of used CPUs
Input data 1  49,720,374            7,295,342,636     10                ~ 1.5 hours  ~ 32 GB      31                8
Input data 2  166,040,440           25,072,106,440    62                ~ 6.5 hours  ~ 125 GB     31                8
Input data 3  318,681,730           48,120,941,230    115               ~ 13 hours   ~ 125 GB     31                8

In general, SOAPdenovo2 is a memory-intensive assembler that requires approximately 30-60 GB of memory to assemble 50 million reads. It is also a fast assembler: assembling about 50 million reads takes roughly an hour to an hour and a half (Input data 1 above took ~ 1.5 hours).
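To compare the three test runs on equal terms, the sequence counts and run times from the table can be turned into a throughput figure. This is plain arithmetic on the approximate values reported above, not an additional measurement; the function name million_reads_per_hour is hypothetical.

```shell
# Hypothetical helper: millions of reads assembled per hour,
# given a read count and a run time in hours.
million_reads_per_hour() {
    awk -v reads="$1" -v hours="$2" 'BEGIN { printf "%.1f\n", reads / hours / 1e6 }'
}

million_reads_per_hour 49720374 1.5    # Input data 1 -> 33.1
million_reads_per_hour 166040440 6.5   # Input data 2 -> 25.5
million_reads_per_hour 318681730 13    # Input data 3 -> 24.5
```

On these numbers, throughput is somewhat lower for the two larger inputs, so run time grows slightly faster than the number of reads.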