Ray is a de novo de Bruijn genome assembler that works with next-generation sequencing data (Illumina, 454, SOLiD). Ray is scalable and parallel software that takes advantage of multiple nodes and multiple CPUs using MPI (message passing interface).
Ray can be used for building multiple applications:
In order to see all options available for running Ray, just type:
$ mpiexec Ray -help
All options used for Ray can be defined on the command line:
$ mpiexec Ray -k <kmer_value> -p input_reads_pair_1.[fa|fq] input_reads_pair_2.[fa|fq] -s input_reads.[fa|fq] -o <output_directory>
or stored in a `.conf` configuration file (one option per line):
$ mpiexec Ray Ray.conf
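As an illustration, the command-line example above could be expressed as a configuration file along the following lines. This is a minimal sketch: the file name `Ray.conf` comes from the command above, but the exact layout of each line (an option followed by its arguments) is an assumption based on the "one option per line" rule, so check it against Ray's manual before relying on it.

```bash
#!/bin/bash
# Write a hypothetical Ray.conf with one option per line,
# mirroring the command-line example (k-mer 31 is an assumed value).
cat > Ray.conf <<'EOF'
-k 31
-p input_reads_pair_1.fastq input_reads_pair_2.fastq
-s input_reads.fasta
-o output_directory
EOF

# Display the file that would be passed to: mpiexec Ray Ray.conf
cat Ray.conf
```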
Ray supports both paired-end (`-p`) and single-end (`-s`) reads. Moreover, Ray can detect the input files automatically if the input directory is provided (`-detect-sequence-files input_directory`).
Ray supports odd k-mer values equal to or greater than 21 (`-k <kmer_value>`). Ray supports multiple file formats such as `fasta`, `fa`, `fasta.gz`, `fa.gz`, `fasta.bz2`, `fa.bz2`, `fastq`, `fq`, `fastq.gz`, `fq.gz`, `fastq.bz2`, `fq.bz2`, `sff`, `csfasta`, and `csfa`.
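The two constraints above (an odd k-mer value of at least 21, and an input file with one of the listed extensions) can be checked before submitting a job. The helpers below are a sketch only: `is_valid_kmer` and `is_supported_format` are hypothetical names, not part of Ray, and they encode nothing beyond the rules stated in this section.

```bash
#!/bin/bash
# Hypothetical helper: succeed if the k-mer value is an odd integer >= 21.
is_valid_kmer() {
    local k="$1"
    # Reject anything that is not a plain non-negative integer.
    [[ "$k" =~ ^[0-9]+$ ]] || return 1
    # 10# forces base 10 so values with leading zeros are not read as octal.
    (( 10#$k >= 21 && 10#$k % 2 == 1 ))
}

# Hypothetical helper: succeed if the file name ends in a listed extension.
is_supported_format() {
    case "$1" in
        *.fasta|*.fa|*.fasta.gz|*.fa.gz|*.fasta.bz2|*.fa.bz2) return 0 ;;
        *.fastq|*.fq|*.fastq.gz|*.fq.gz|*.fastq.bz2|*.fq.bz2) return 0 ;;
        *.sff|*.csfasta|*.csfa) return 0 ;;
        *) return 1 ;;
    esac
}

# Example usage
is_valid_kmer 31 && echo "k=31 is accepted"
is_supported_format reads.fq.gz && echo "reads.fq.gz is accepted"
```

Running such checks in the submit script before the `mpiexec` line lets a job fail immediately on a bad parameter instead of after waiting in the queue.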
A simple SLURM script for running Ray with both paired-end and single-end data, using k-mer=31, 8 CPUs, and 4 GB RAM per CPU, is shown below:
ray.submit
#!/bin/bash
#SBATCH --job-name=Ray
#SBATCH --ntasks=8
#SBATCH --time=168:00:00
#SBATCH --mem-per-cpu=4gb
#SBATCH --output=Ray.%J.out
#SBATCH --error=Ray.%J.err
module load compiler/gcc/4.7 openmpi/1.6 ray/2.3
mpiexec Ray -k 31 -p input_reads_pair_1.fastq input_reads_pair_2.fastq -s input_reads.fasta -o output_directory
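where `input_reads_pair_1.fastq` and `input_reads_pair_2.fastq` are the paired-end input files in `fastq` format, and `input_reads.fasta` is the single-end input file in `fasta` format.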
It is not necessary to specify the number of processes with the `-n` option to `mpiexec`. OpenMPI will determine that automatically from SLURM based on the value of the `--ntasks` option.
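The same command may also be run outside a SLURM job, where no task count is available and `-n` has to be given explicitly. The sketch below shows one way to handle both cases. `ray_launch_prefix` is a hypothetical helper, and it assumes that SLURM exports the `SLURM_NTASKS` environment variable inside a job started with `--ntasks`.

```bash
#!/bin/bash
# Hypothetical helper: print the launch prefix for Ray.
# Inside a SLURM job (SLURM_NTASKS is set) no -n is added, since the MPI
# launcher takes the task count from SLURM; otherwise the process count
# supplied by the caller (default 1) is passed explicitly with -n.
ray_launch_prefix() {
    local fallback_procs="${1:-1}"
    if [[ -n "${SLURM_NTASKS:-}" ]]; then
        echo "mpiexec Ray"
    else
        echo "mpiexec -n ${fallback_procs} Ray"
    fi
}

# Example usage: print the prefix that would be used with 8 fallback processes.
ray_launch_prefix 8
```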
In the output folder (`-o output_directory`), Ray writes many files with information about the different steps and statistics of the execution process. Information about all output files can be found in Ray’s manual.
Some of the most important results are:
To test Ray's performance, we used three pairs of paired-end input fastq files: `small_1.fastq` and `small_2.fastq`, `medium_1.fastq` and `medium_2.fastq`, and `large_1.fastq` and `large_2.fastq`. Some statistics about the input files and the time and memory resources used by Ray are shown in the table below:
| input file | total # of sequences | total # of bases | total size in MB | used time | used memory | # of used CPUs |
|---|---|---|---|---|---|---|
| small_1.fastq | 50,121 | 2,506,050 | 8.010 | ~ 3 minutes | ~ 1 GB | 8 |
| small_2.fastq | 50,121 | 2,506,050 | 8.010 | | | |
| medium_1.fastq | 786,742 | 59,792,392 | 152 | ~ 15 minutes | ~ 1.5 GB | 8 |
| medium_2.fastq | 786,742 | 59,792,392 | 152 | | | |
| large_1.fastq | 10,174,715 | 1,027,646,215 | 3,376 | ~ 5 hours | ~ 17 GB | 8 |
| large_2.fastq | 10,174,715 | 1,027,646,215 | 3,376 | | | |
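The sequence and base counts in the table also give the average read length of each data set, which may help when comparing your own input to these benchmarks. A small sketch of the arithmetic, using only the numbers from the table (`avg_read_length` is a hypothetical helper name):

```bash
#!/bin/bash
# Hypothetical helper: average read length = total bases / total sequences.
# Integer division is used; the table values divide evenly.
avg_read_length() {
    local total_bases="$1" total_sequences="$2"
    echo $(( total_bases / total_sequences ))
}

avg_read_length 2506050 50121        # small  -> 50
avg_read_length 59792392 786742      # medium -> 76
avg_read_length 1027646215 10174715  # large  -> 101
```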