Ray

Ray is a de novo de Bruijn genome assembler that works with next-generation sequencing data (Illumina, 454, SOLiD). Ray is scalable, parallel software that takes advantage of multiple nodes and multiple CPUs using MPI (Message Passing Interface).

Ray can be used for building multiple applications:

de novo genome assembly
de novo meta-genome assembly
de novo transcriptome assembly
quantification of contig abundances, microbiome consortia members, transcript expression
taxonomy and gene ontology profiling of samples
comparing DNA samples using words

In order to see all options available for running Ray, just type:

$ mpiexec Ray -help

All options used for Ray can be defined on the command line:

$ mpiexec Ray -k <kmer_value> -p input_reads_pair_1.[fa|fq] input_reads_pair_2.[fa|fq] -s input_reads.[fa|fq] -o <output_directory>

or stored in a .conf configuration file (one option per line):

$ mpiexec Ray Ray.conf
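
As a minimal sketch, a Ray.conf equivalent to the command line above might be created as follows. The read file names and the output directory are placeholders, and the layout simply follows the "one option per line" description given here; consult Ray's manual for the authoritative format.

```shell
# Write a hypothetical Ray.conf: one option (with its arguments) per line.
# The read file names and the output directory are placeholders.
cat > Ray.conf <<'EOF'
-k 31
-p input_reads_pair_1.fastq input_reads_pair_2.fastq
-s input_reads.fasta
-o output_directory
EOF

# Ray would then be started with the configuration file instead of
# command-line options:
#   mpiexec Ray Ray.conf
```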

Ray supports both paired-end (-p) and single-end reads (-s). Moreover, Ray can detect the input files automatically if the input directory is provided (-detect-sequence-files input_directory).

Ray supports odd k-mer values equal to or greater than 21 (-k <kmer_value>). Ray supports multiple file formats: fasta, fa, fasta.gz, fa.gz, fasta.bz2, fa.bz2, fastq, fq, fastq.gz, fq.gz, fastq.bz2, fq.bz2, sff, csfasta, csfa.
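
The k-mer rule above (odd and at least 21) can be checked before submitting a job. The sketch below uses a hypothetical helper, is_valid_kmer, which is not part of Ray; it only encodes the rule as stated here and does not check any upper limit that a particular Ray build may impose.

```shell
# Return success if the argument is an odd integer >= 21.
# is_valid_kmer is a hypothetical helper, not a Ray command.
is_valid_kmer() {
    case "$1" in
        # Reject empty input, leading zeros, and anything non-numeric.
        ''|0*|*[!0-9]*) return 1 ;;
    esac
    [ "$1" -ge 21 ] && [ $(( $1 % 2 )) -eq 1 ]
}

# Try a few candidate values.
for k in 19 21 30 31; do
    if is_valid_kmer "$k"; then
        echo "k=$k: accepted"
    else
        echo "k=$k: rejected"
    fi
done
```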

A simple SLURM script for running Ray on both paired-end and single-end data with k-mer=31, 8 CPUs, and 4 GB of RAM per CPU is shown below:

#!/bin/bash
#SBATCH --job-name=Ray
#SBATCH --ntasks=8
#SBATCH --time=168:00:00
#SBATCH --mem-per-cpu=4gb
#SBATCH --output=Ray.%J.out
#SBATCH --error=Ray.%J.err

module load compiler/gcc/4.7 openmpi/1.6 ray/2.3

mpiexec Ray -k 31 -p input_reads_pair_1.fastq input_reads_pair_2.fastq -s input_reads.fasta -o output_directory

where input_reads_pair_1.fastq and input_reads_pair_2.fastq are the paired-end input files in fastq format, and input_reads.fasta is the single-end input file in fasta format.

Note: It is not necessary to specify the number of processes with the -n option to mpiexec. OpenMPI will determine that automatically from SLURM based on the value of the --ntasks option.

Ray Output

In the output folder (-o output_directory), Ray writes many files with information about the different steps of the run and statistics from the execution. Information about all output files can be found in Ray's manual.

Some of the most important results are:

Scaffolds.fasta: scaffold sequences in FASTA format
ScaffoldComponents.txt: components of each scaffold
Contigs.fasta: contiguous sequences in FASTA format
OutputNumbers.txt: overall numbers for the assembly
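
As a quick sanity check after a run, the number of sequences in the FASTA outputs listed above can be counted from their header lines. This is a sketch only: count_fasta_records is a hypothetical helper, and output_directory stands for whatever was passed to -o.

```shell
# Count FASTA records in a file (each record starts with a '>' header line).
# count_fasta_records is a hypothetical helper, not part of Ray.
count_fasta_records() {
    grep -c '^>' "$1"
}

out=output_directory   # placeholder: the directory given to Ray with -o
for f in Contigs.fasta Scaffolds.fasta; do
    if [ -f "$out/$f" ]; then
        echo "$f: $(count_fasta_records "$out/$f") sequences"
    fi
done
```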

Useful Information

To test Ray's performance, we used three pairs of paired-end input fastq files: small_1.fastq and small_2.fastq, medium_1.fastq and medium_2.fastq, and large_1.fastq and large_2.fastq. Statistics about the input files, along with the time and memory used by Ray for each pair, are shown in the table below:

input file	total # of sequences	total # of bases	total size in MB	used time	used memory	# of used CPUs
small_1.fastq	50,121	2,506,050	8.010	~ 3 minutes	~ 1 GB	8
small_2.fastq	50,121	2,506,050	8.010	~ 3 minutes	~ 1 GB	8
medium_1.fastq	786,742	59,792,392	152	~ 15 minutes	~ 1.5 GB	8
medium_2.fastq	786,742	59,792,392	152	~ 15 minutes	~ 1.5 GB	8
large_1.fastq	10,174,715	1,027,646,215	3,376	~ 5 hours	~ 17 GB	8
large_2.fastq	10,174,715	1,027,646,215	3,376	~ 5 hours	~ 17 GB	8
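
The table also implies the read length of each dataset, since total bases divided by total sequences gives the mean read length. The sketch below does that division with the numbers from the table; mean_read_length is a hypothetical helper and uses integer division.

```shell
# Mean read length = total bases / total sequences (integer division).
# mean_read_length is a hypothetical helper; the numbers come from the table.
mean_read_length() {
    echo $(( $1 / $2 ))
}

echo "small:  $(mean_read_length 2506050 50121) bases per read"
echo "medium: $(mean_read_length 59792392 786742) bases per read"
echo "large:  $(mean_read_length 1027646215 10174715) bases per read"
```

For these inputs the division is exact, giving 50, 76, and 101 bases per read for the small, medium, and large files respectively.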