SOAPdenovo is a de novo genome assembler for short reads. It is specially designed for Illumina GA short reads and large plant and animal genomes. SOAPdenovo2 is a newer version of SOAPdenovo with an improved algorithm that reduces memory consumption, resolves more repeat regions, increases coverage, and optimizes the assembly for large genomes.
SOAPdenovo2 has two commands, SOAPdenovo-63mer and SOAPdenovo-127mer. The first is suitable for assemblies with k-mer values up to 63 bp, requires less memory, and runs faster. The second supports k-mer values up to 127 bp.
In order to see the options available for SOAPdenovo-63mer just type:
$ SOAPdenovo-63mer
SOAPdenovo2 provides a mechanism to run the whole workflow at once, or as four separate steps (pregraph or sparse_pregraph, contig, map, scaff).
The basic usage of SOAPdenovo2 is:
$ SOAPdenovo-63mer all -s configFile -o output_directory/outputGraph -K <kmer_value> [options]
where <kmer_value> is the k-mer size (<=63 for SOAPdenovo-63mer and <=127 for SOAPdenovo-127mer).
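The k-mer limits above suggest a simple rule for choosing between the two commands. The following is a minimal sketch of that rule; pick_soap_binary is a hypothetical helper name introduced here for illustration and is not part of SOAPdenovo2.

```shell
#!/bin/bash
# pick_soap_binary is a hypothetical helper, not part of SOAPdenovo2.
# It prints the name of the command that supports the requested k-mer size.
pick_soap_binary() {
    k="$1"
    # Reject empty or non-numeric input.
    case "$k" in
        ''|*[!0-9]*)
            echo "k-mer size must be an integer" >&2
            return 1
            ;;
    esac
    if [ "$k" -le 63 ]; then
        echo "SOAPdenovo-63mer"      # needs less memory and runs faster
    elif [ "$k" -le 127 ]; then
        echo "SOAPdenovo-127mer"
    else
        echo "k-mer size $k is above the 127 limit" >&2
        return 1
    fi
}

pick_soap_binary 31
pick_soap_binary 95
```

The printed name could then be used in place of a hard-coded command, for example "$(pick_soap_binary 31)" all -s configFile -K 31 -o output_directory/outputGraph.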
If you want to run the assembly process step by step, then use the following sequential commands:
SOAPdenovo-63mer pregraph -s configFile -o outputGraph [options]
OR
SOAPdenovo-63mer sparse_pregraph -s configFile -K <kmer_value> -z <genome_size> -o outputGraph [options]
SOAPdenovo-63mer contig -g inputGraph [options]
SOAPdenovo-63mer map -s configFile -g inputGraph [options]
SOAPdenovo-63mer scaff -g inputGraph [options]
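When running step by step, each command depends on the files written by the previous one, so it helps to stop at the first failure. The wrapper below is a rough sketch of that idea: the variable names (DRY_RUN, CONFIG, PREFIX, KMER) and the run_step function are assumptions made for this example, and the pregraph step passes the k-mer size with -K as in the basic usage line. By default it only prints the commands.

```shell
#!/bin/bash
# Sketch: run the assembly steps in order and stop at the first failure.
# DRY_RUN, CONFIG, PREFIX, KMER and run_step are illustrative names only.
DRY_RUN="${DRY_RUN:-1}"          # 1 = only print the commands, 0 = execute them
CONFIG="${CONFIG:-configFile}"
PREFIX="${PREFIX:-outputGraph}"
KMER="${KMER:-31}"

run_step() {
    # Show the command, then run it unless this is a dry run.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@" || { echo "step failed: $1 $2" >&2; exit 1; }
    fi
}

run_step SOAPdenovo-63mer pregraph -s "$CONFIG" -K "$KMER" -o "$PREFIX"
run_step SOAPdenovo-63mer contig -g "$PREFIX"
run_step SOAPdenovo-63mer map -s "$CONFIG" -g "$PREFIX"
run_step SOAPdenovo-63mer scaff -g "$PREFIX"
```

Setting DRY_RUN=0 in the environment executes the steps instead of printing them.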
As the commands above show, to run SOAPdenovo2 you first need to create a config file (configFile) that contains information about the read files (read length, insert size, reads location). SOAPdenovo2 accepts read files in 3 formats: fasta, fastq and bam.
An example configuration file configFile for 2 pairs of paired-end fastq files, 1 pair of paired-end fasta files and 1 single-end fastq file looks like:
configFile
#maximal read length
max_rd_len=150
[LIB]
#average insert size of the library
avg_ins=300
#if sequences are forward-reverse or reverse-forward
reverse_seq=0
#in which part(s) the reads are used (only contigs, only scaffolds, both contigs and scaffolds, only gap closure)
asm_flags=3
#cut the reads to the given length
rd_len_cutoff=100
#in which order the reads are used while scaffolding
rank=1
# cutoff of pair number for a reliable connection (at least 3 for short insert size)
pair_num_cutoff=3
#minimum aligned length to contigs for a reliable read location (at least 32 for short insert size)
map_len=32
#paired-end fastq files, read 1 file should always be followed by read 2 file
q1=input_reads1_pair_1.fq
q2=input_reads1_pair_2.fq
#another pair of paired-end fastq files, read 1 file should always be followed by read 2 file
q1=input_reads2_pair_1.fq
q2=input_reads2_pair_2.fq
#paired-end fasta files, read 1 file should always be followed by read 2 file
f1=input_reads_pair_1.fa
f2=input_reads_pair_2.fa
#fastq file for single reads
q=input_reads.fq
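Because the assembler reads the file locations from the config file, a mistyped path is an easy way to lose a queued job. The sketch below checks that every read file named in a config file exists. list_read_files and check_read_files are hypothetical helpers written for this example, and they only look at the keys used in the example config above (q1, q2, f1, f2, q).

```shell
#!/bin/bash
# list_read_files and check_read_files are hypothetical helpers for this example.

# Print the path of every read file named in a config file.
list_read_files() {
    # Keep only the read-file keys from the example config and
    # strip everything up to and including the "=" sign.
    grep -E '^(q1|q2|f1|f2|q)=' "$1" | cut -d= -f2-
}

# Report read files that do not exist; return non-zero if any are missing.
check_read_files() {
    if [ ! -f "$1" ]; then
        echo "config file not found: $1" >&2
        return 1
    fi
    missing=$(list_read_files "$1" | while IFS= read -r path; do
        [ -f "$path" ] || echo "$path"
    done)
    if [ -n "$missing" ]; then
        echo "missing read files:" >&2
        echo "$missing" >&2
        return 1
    fi
}

# Example use (run from the directory the paths are relative to):
#   check_read_files configFile && echo "all read files found"
```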
After creating the configuration file configFile, the next step is to run the assembler using this file.
A simple SLURM script for running SOAPdenovo2 with k-mer=31, 8 CPUs and 50GB of RAM is shown below:
soapdenovo2.submit
#!/bin/bash
#SBATCH --job-name=SOAPdenovo2
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=168:00:00
#SBATCH --mem=50gb
#SBATCH --output=SOAPdenovo2.%J.out
#SBATCH --error=SOAPdenovo2.%J.err
module load soapdenovo2/r240
SOAPdenovo-63mer all -s configFile -K 31 -o output_directory/output31 -p $SLURM_NTASKS_PER_NODE
SOAPdenovo2 writes a number of files to its output_directory/ after each executed step. The final assembly output is in the .contig file.
Output directory after running SOAPdenovo2
$ ls
output31.Arc output31.ContigIndex output31.gapSeq output31.newContigInde
output31.bubbleInScaff output31.contigPosInscaff output31.kmerFreq output31.peGrads
output31.contig output31.edge.gz output31.links output31.preArc
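A quick way to inspect the assembly is to summarize the sequences in the .contig file. The sketch below assumes that file is in FASTA format and uses a hypothetical helper, contig_stats, to print the number of sequences, their total length and the N50 (the length of the sequence at which the running total of the lengths, sorted longest first, reaches half of the total).

```shell
#!/bin/bash
# contig_stats is a hypothetical helper: it reads a FASTA file and prints
# the number of sequences, the total length and the N50.
contig_stats() {
    # 1) print one length per sequence, 2) sort the lengths in descending order,
    # 3) walk the sorted lengths until half of the total length is covered.
    awk '/^>/ { if (seen) print len; seen = 1; len = 0; next }
         { len += length($0) }
         END { if (seen) print len }' "$1" |
    sort -rn |
    awk '{ lens[NR] = $1; total += $1 }
         END {
             half = total / 2
             for (i = 1; i <= NR; i++) {
                 run += lens[i]
                 if (run >= half) { n50 = lens[i]; break }
             }
             printf "sequences=%d total_bp=%.0f N50=%d\n", NR, total, n50
         }'
}

# Example use:
#   contig_stats output_directory/output31.contig
```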
In order to test the SOAPdenovo2 (soapdenovo2/r240) performance, we used three different size input files. Some statistics about the input files and the time and memory resources used by SOAPdenovo2 are shown in the table below:
input data | total # of sequences | total # of bases | total size in GB | used time | used memory | used k-mer value | # of used CPUs |
---|---|---|---|---|---|---|---|
Input data 1 | 49,720,374 | 7,295,342,636 | 10 | ~ 1.5 hours | ~ 32 GB | 31 | 8 |
Input data 2 | 166,040,440 | 25,072,106,440 | 62 | ~ 6.5 hours | ~ 125 GB | 31 | 8 |
Input data 3 | 318,681,730 | 48,120,941,230 | 115 | ~ 13 hours | ~ 125 GB | 31 | 8 |
In general, SOAPdenovo2 is a memory-intensive assembler that requires approximately 30-60 GB of memory to assemble 50 million reads. However, it is also fast: in the first test above, assembling about 50 million reads took around an hour and a half with 8 CPUs.