SRAtoolkit

SRA (Sequence Read Archive) is an NCBI-defined format for NGS data. All data submitted to NCBI must be in SRA format. The SRA Toolkit provides tools for downloading data, converting data from other formats into SRA format, and extracting SRA data into other formats.

The SRA Toolkit can convert data from the SRA format into the following formats: ABI SOLiD native, fasta, fastq, sff, sam, and Illumina native. It can also convert data from fasta, fastq, AB SOLiD-SRF, AB SOLiD-native, Illumina SRF, Illumina native, sff, and bam format into the SRA format.

The SRA Toolkit supports downloading SRA data using the "prefetch" command:

$ prefetch <sra_id>

where <sra_id> is the assigned SRA identification in NCBI (e.g., SRR1482462).
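When several runs need to be downloaded, the prefetch call can be wrapped in a small loop. The following is a minimal sketch and not part of the SRA Toolkit: the function names is_run_accession and prefetch_all and the DRY_RUN variable are hypothetical, and the check assumes run accessions of the form SRR, ERR, or DRR followed by digits.

```shell
# Return success if $1 looks like a run accession (SRR/ERR/DRR + digits).
# The accepted pattern is an assumption made for this sketch.
is_run_accession() {
    case "$1" in
        [SED]RR*[!0-9]*) return 1 ;;  # a non-digit appears after the prefix
        [SED]RR[0-9]*)   return 0 ;;  # prefix followed by digits only
        *)               return 1 ;;
    esac
}

# Call prefetch once per argument, skipping malformed identifiers.
# With DRY_RUN=1 the commands are printed instead of executed.
prefetch_all() {
    for id in "$@"; do
        if ! is_run_accession "$id"; then
            echo "skipping malformed accession: $id" >&2
            continue
        fi
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "prefetch $id"
        else
            prefetch "$id"
        fi
    done
}
```

Running it first as DRY_RUN=1 prefetch_all SRR1482462 prints the prefetch command that would be executed, which makes it easy to review a list of identifiers before starting the downloads.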

The SRA Toolkit contains multiple "format"-dump commands, where format is the file format the SRA data is converted to: abi-dump, fastq-dump, illumina-dump, sam-dump, sff-dump, and vdb-dump.

One of the most commonly used commands is fastq-dump:

$ fastq-dump [options] input_reads.sra

This command can be applied to SRA data that was downloaded with "prefetch".

An example of running fastq-dump on Swan to convert an SRA file containing paired-end reads is:

#!/bin/bash
#SBATCH --job-name=SRAtoolkit
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=168:00:00
#SBATCH --mem=10gb
#SBATCH --output=SRAtoolkit.%J.out
#SBATCH --error=SRAtoolkit.%J.err

module load SRAtoolkit/2.11

fastq-dump --split-files input_reads.sra

This script outputs two fastq files containing the paired-end reads, input_reads_1.fastq and input_reads_2.fastq.
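Since a job can end before both files are written, it can help to check for the two outputs at the end of the script. The following is a minimal sketch under the assumption that fastq-dump wrote to the current directory (no output directory option was given); the function names expected_fastq_names and check_split_output are hypothetical.

```shell
# Print the two file names that `fastq-dump --split-files` is expected to
# write for a paired-end input such as input_reads.sra.
expected_fastq_names() {
    base=$(basename "$1" .sra)
    echo "${base}_1.fastq ${base}_2.fastq"
}

# Return success only if both expected files exist in the current
# directory and are non-empty.
check_split_output() {
    base=$(basename "$1" .sra)
    for f in "${base}_1.fastq" "${base}_2.fastq"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            return 1
        fi
    done
}
```

In the job script, a line such as check_split_output input_reads.sra || exit 1 placed after the fastq-dump command makes the job exit with a non-zero status when an output file is missing.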

To download bam files from NCBI using the SRA identification, the following commands can be used:

$ module load SRAtoolkit/2.11 samtools
$ sam-dump <sra_id> | samtools view -bS - > <sra_id>.bam

where <sra_id> is the assigned SRA identification in NCBI (e.g., SRR1482462).
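Because the BAM file is written through a pipe, an interrupted run can leave an incomplete file under the final name. The following is a minimal sketch that writes to a temporary name first and keeps the result only if samtools quickcheck accepts it; the function name sra_to_bam and the .partial suffix are hypothetical. Note that quickcheck only confirms that the BAM file is structurally complete, not that sam-dump delivered every record.

```shell
# Convert one accession to BAM; keep the file only if samtools accepts it.
sra_to_bam() {
    id=$1
    tmp="${id}.bam.partial"
    # Same pipeline as above, written to a temporary name first.
    # The exit status checked here is that of samtools view only.
    if ! sam-dump "$id" | samtools view -bS - > "$tmp"; then
        rm -f "$tmp"
        return 1
    fi
    # quickcheck exits non-zero for truncated or clearly malformed files.
    if samtools quickcheck "$tmp"; then
        mv "$tmp" "${id}.bam"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

After loading the SRAtoolkit and samtools modules, calling sra_to_bam SRR1482462 leaves SRR1482462.bam in the current directory on success and removes the partial file otherwise.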

The SRAtoolkit commands shown here are single threaded, and therefore both #SBATCH --nodes and #SBATCH --ntasks-per-node in the SLURM script are set to 1.

The SRA Toolkit contains multiple "format"-load commands, where format is the file format of the input data that is converted into SRA format for submission to NCBI: srf-load, sff-load, refseq-load, pacbio-load, illumina-load, helicos-load, fastq-load, cg-load, bam-load, and abi-load.

An example of converting the bam file input_alignments.bam into SRA format is shown below:

$ bam-load -o input_reads.sra input_alignments.bam

Other frequently used SRAtoolkit tools are:

sra-stat: generate statistics about SRA data
sra-pileup: generate pileup statistics on aligned SRA data
vdb-config: display and modify VDB configuration information
vdb-encrypt: encrypt non-SRA dbGaP data
vdb-decrypt: decrypt non-SRA dbGaP data
vdb-validate: validate the integrity of downloaded SRA data

Note

If needed, the per-user cache location can be changed with vdb-config -i.