Ray

Ray is a de novo de Bruijn genome assembler that works with next-generation sequencing data (Illumina, 454, SOLiD). Ray is scalable, parallel software that takes advantage of multiple nodes and multiple CPUs using MPI (Message Passing Interface).

Ray can be used for several applications:

  • de novo genome assembly
  • de novo meta-genome assembly
  • de novo transcriptome assembly
  • quantification of contig abundances, microbiome consortia members, transcript expression
  • taxonomy and gene ontology profiling of samples
  • comparing DNA samples using words

To see all options available for running Ray, type:

$ mpiexec Ray -help

All options used for Ray can be defined on the command line:

$ mpiexec Ray -k <kmer_value> -p input_reads_pair_1.[fa|fq] input_reads_pair_2.[fa|fq] -s input_reads.[fa|fq] -o <output_directory>

Alternatively, the options can be stored in a configuration file with the .conf extension (one option per line):

$ mpiexec Ray Ray.conf
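
As a minimal sketch, the options from the example command above could be placed in a configuration file like this. The file names and the k-mer value are the placeholders used above, and the exact layout Ray accepts should be confirmed with mpiexec Ray -help.

```shell
# Write the options of the example command to Ray.conf, one option per line.
# File names and the k-mer value are placeholders, not recommendations.
cat > Ray.conf <<'EOF'
-k 31
-p input_reads_pair_1.fastq input_reads_pair_2.fastq
-s input_reads.fasta
-o output_directory
EOF

# Ray would then be started with the file instead of command-line options:
# mpiexec Ray Ray.conf
```

Keeping the options in a file makes it easier to rerun the same assembly and to keep a record of the parameters next to the results.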

Ray supports both paired-end (-p) and single-end reads (-s). Moreover, Ray can detect the input files automatically if the input directory is provided (-detect-sequence-files input_directory).

Ray supports odd k-mer values equal to or greater than 21 (-k <kmer_value>). Ray supports multiple file formats: fasta, fa, fasta.gz, fa.gz, fasta.bz2, fa.bz2, fastq, fq, fastq.gz, fq.gz, fastq.bz2, fq.bz2, sff, csfasta, and csfa.
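
The k-mer rule can be checked before a job is submitted. The helper below is a hypothetical sketch, not part of Ray; it tests only the two conditions stated above (odd, and at least 21).

```shell
# Hypothetical helper (not part of Ray): succeed only if k is an odd integer >= 21.
is_valid_k() {
    k=$1
    # Reject empty input, non-digits, and values with a leading zero.
    case "$k" in
        ''|*[!0-9]*|0*) return 1 ;;
    esac
    [ "$k" -ge 21 ] && [ $((k % 2)) -eq 1 ]
}

if is_valid_k 31; then echo "k=31 is acceptable"; fi
if ! is_valid_k 32; then echo "k=32 is rejected (even)"; fi
if ! is_valid_k 19; then echo "k=19 is rejected (below 21)"; fi
```

A check like this could be placed at the top of a submit script so that an invalid value fails immediately instead of after the job has waited in the queue.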

A simple SLURM script for running Ray on Tusker with both paired-end and single-end data, k-mer=31, 8 CPUs, and 4 GB RAM per CPU is shown below:

ray.submit
#!/bin/sh
#SBATCH --job-name=Ray
#SBATCH --ntasks=8
#SBATCH --time=168:00:00
#SBATCH --mem-per-cpu=4gb
#SBATCH --output=Ray.%J.out
#SBATCH --error=Ray.%J.err

module load compiler/gcc/4.7 openmpi/1.6 ray/2.3

mpiexec Ray -k 31 -p input_reads_pair_1.fastq input_reads_pair_2.fastq -s input_reads.fasta -o output_directory

Here, input_reads_pair_1.fastq and input_reads_pair_2.fastq are the paired-end input files in fastq format, and input_reads.fasta is the single-end input file in fasta format.

It is not necessary to specify the number of processes with the -n option to mpiexec. OpenMPI will determine that automatically from SLURM based on the value of the --ntasks option.

Ray Output

In the output folder (-o output_directory), Ray writes many files with information about the different steps of the run and statistics from the execution. Information about all output files can be found in Ray’s manual.

Some of the most important output files are:

  • Scaffolds.fasta: scaffold sequences in FASTA format
  • ScaffoldComponents.txt: components of each scaffold
  • Contigs.fasta: contiguous sequences in FASTA format
  • OutputNumbers.txt: overall numbers for the assembly
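
A quick way to inspect these files is to count the sequences they contain. The sketch below assumes standard FASTA formatting (one header line starting with ">" per sequence) and that output_directory is the folder passed to -o; the helper name is hypothetical.

```shell
# Hypothetical helper: count the sequences in a FASTA file by counting header lines.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Report the counts only for output files that actually exist.
for f in output_directory/Contigs.fasta output_directory/Scaffolds.fasta; do
    if [ -f "$f" ]; then
        echo "$f: $(count_fasta_records "$f") sequences"
    fi
done
```

For the summary statistics of the assembly itself, OutputNumbers.txt remains the reference.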

Useful Information

To test Ray's performance on Tusker, we used three pairs of paired-end input fastq files: small_1.fastq and small_2.fastq, medium_1.fastq and medium_2.fastq, and large_1.fastq and large_2.fastq. Statistics about the input files and the time and memory used by Ray on Tusker for each pair are shown in the table below:

File            Total sequences  Total bases    Total size (MB)  Time used     Memory used  CPUs used
small_1.fastq   50,121           2,506,050      8.010            ~ 3 minutes   ~ 1 GB       8
small_2.fastq   50,121           2,506,050      8.010
medium_1.fastq  786,742          59,792,392     152              ~ 15 minutes  ~ 1.5 GB     8
medium_2.fastq  786,742          59,792,392     152
large_1.fastq   10,174,715       1,027,646,215  3,376            ~ 5 hours     ~ 17 GB      8
large_2.fastq   10,174,715       1,027,646,215  3,376
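
To compare your own data against these test sets, the same per-file numbers can be gathered with a short helper. This is a sketch that assumes uncompressed fastq files with exactly four lines per record; the function name is hypothetical. The last lines divide the total bases by the total sequences from the table to get the mean read length of each test set.

```shell
# Hypothetical helper: print "<sequences> <bases>" for an uncompressed FASTQ file.
# Assumes four lines per record, with the sequence on the second line of each record.
fastq_stats() {
    awk 'NR % 4 == 2 { n++; bases += length($0) }
         END { printf "%d %d\n", n, bases }' "$1"
}

# Mean read length from the table: total bases divided by total sequences.
echo "small:  $((2506050 / 50121)) bases per read"
echo "medium: $((59792392 / 786742)) bases per read"
echo "large:  $((1027646215 / 10174715)) bases per read"
```

The divisions come out to 50, 76, and 101 bases per read for the small, medium, and large files, so the three test sets differ in read length as well as in read count.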