SRA (Sequence Read Archive) is an NCBI-defined format for NGS data. All data submitted to NCBI needs to be in SRA format. The SRA Toolkit provides tools for converting data in other formats into SRA format and, conversely, for extracting SRA data into other formats.
The SRA Toolkit allows converting data from the SRA format to the following formats: ABI SOLiD native, fasta, fastq, sff, sam, and Illumina native. Also, the SRA Toolkit allows converting data from fasta, fastq, AB SOLiD-SRF, AB SOLiD-native, Illumina SRF, Illumina native, sff, and bam format into the SRA format.
The SRA Toolkit contains multiple “format”-dump commands, where format is the file format the SRA data is converted to: abi-dump, fastq-dump, illumina-dump, sam-dump, sff-dump, and vdb-dump.
One of the most commonly used commands is fastq-dump:
$ fastq-dump [options] input_reads.sra
An example of running fastq-dump on Crane to convert an SRA file containing paired-end reads is:
sratoolkit.submit
#!/bin/bash
#SBATCH --job-name=SRAtoolkit
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=168:00:00
#SBATCH --mem=10gb
#SBATCH --output=SRAtoolkit.%J.out
#SBATCH --error=SRAtoolkit.%J.err
module load SRAtoolkit/2.9
fastq-dump --split-files input_reads.sra
This job produces two fastq files, input_reads_1.fastq and input_reads_2.fastq, one for each read of the pair.
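A quick sanity check after the conversion is to confirm that both output files contain the same number of reads. The sketch below assumes the common layout of four lines per fastq record; the function names count_reads and check_pairs are made up for this illustration and are not part of the SRA Toolkit.

```shell
# Count reads in a fastq file, assuming exactly 4 lines per record.
count_reads() (
    lines=$(wc -l < "$1") || exit 1
    echo $(( lines / 4 ))
)

# Succeed only if both mate files hold the same number of reads.
check_pairs() (
    r1=$(count_reads "$1") || exit 1
    r2=$(count_reads "$2") || exit 1
    if [ "$r1" -eq "$r2" ]; then
        echo "OK: $r1 read pairs"
    else
        echo "MISMATCH: $r1 vs $r2 reads" >&2
        exit 1
    fi
)

# Usage, with the file names from the example job:
# check_pairs input_reads_1.fastq input_reads_2.fastq
```

A mismatch does not necessarily mean the conversion failed, but it is worth investigating before passing the files to a tool that expects properly paired input.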
All SRAtoolkit commands are single threaded, and therefore both #SBATCH --nodes and #SBATCH --ntasks-per-node in the SLURM script are set to 1.
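Since each command uses a single core, several SRA files can simply be converted one after another inside the same one-task job. The following is a minimal sketch of such a loop, not an official recipe: the function name dump_all_sra and the DUMP_CMD variable are assumptions introduced here, with DUMP_CMD allowing a dry run that only prints the commands.

```shell
# Convert every .sra file in a directory sequentially.
# DUMP_CMD (hypothetical helper variable) defaults to fastq-dump;
# set DUMP_CMD="echo fastq-dump" to print the commands instead of running them.
dump_all_sra() (
    in_dir="$1"
    out_dir="$2"
    mkdir -p "$out_dir" || exit 1
    for sra in "$in_dir"/*.sra; do
        # If nothing matches, the pattern stays literal; skip it.
        [ -e "$sra" ] || continue
        ${DUMP_CMD:-fastq-dump} --split-files --outdir "$out_dir" "$sra" || exit 1
    done
)

# Usage inside the SLURM script, after "module load SRAtoolkit/2.9":
# dump_all_sra sra_files fastq_files
```

The directory names sra_files and fastq_files are placeholders. For a large number of files, submitting one job per file may finish sooner than a single sequential loop.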
The SRA Toolkit contains multiple “format”-load commands, where format is the file format of the data that is uploaded to NCBI: srf-load, sff-load, refseq-load, pacbio-load, illumina-load, helicos-load, fastq-load, cg-load, bam-load, and abi-load.
An example of converting the bam file input_alignments.bam into SRA format for upload to NCBI is shown below:
$ bam-load -o input_reads.sra input_alignments.bam
Other frequently used SRAtoolkit tools are:
Prefetch instructions:
When prefetch is used, the files are downloaded in ${HOME}/ncbi/public by default.
Since the /home directory ($HOME) is not writable from the worker nodes, the files cannot be saved in ${HOME}/ncbi/public when submitting a SLURM job.
To change the default output directory for prefetch to ${WORK}/ncbi/public, please follow these three steps:
$ wget https://raw.githubusercontent.com/ncbi/ncbi-vdb/master/libs/kfg/default.kfg -P $HOME/.ncbi/
$ vim $HOME/.ncbi/default.kfg
Here, set "/repository/user/main/public/root" to "/work/group/username/ncbi/public", where group is the name of your HCC group, and username is your HCC username.
$ export VDB_CONFIG=$HOME/.ncbi/default.kfg
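You need to download and edit the file only once. The export command affects only the current shell session, so add it to your ~/.bashrc or to your SLURM script if the setting should persist.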
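The manual edit of default.kfg can also be scripted. The sketch below assumes that the setting is stored on a single line of the form key = "value"; the function name set_prefetch_root is made up for this illustration and is not part of the SRA Toolkit.

```shell
# Point the prefetch download directory at a new location.
# Assumes the key sits on one line as: key = "value".
# Paths containing "|", "&" or "\" would need escaping for sed.
set_prefetch_root() (
    kfg="$1"
    new_root="$2"
    key="/repository/user/main/public/root"
    if grep -q "^${key}[[:space:]]*=" "$kfg"; then
        # Rewrite the existing line through a temporary copy.
        tmp=$(mktemp) || exit 1
        sed "s|^${key}[[:space:]]*=.*|${key} = \"${new_root}\"|" "$kfg" > "$tmp" || exit 1
        cat "$tmp" > "$kfg" && rm -f "$tmp"
    else
        # Key not present: append it.
        printf '%s = "%s"\n' "$key" "$new_root" >> "$kfg"
    fi
)

# Usage, matching the steps above:
# set_prefetch_root "$HOME/.ncbi/default.kfg" "$WORK/ncbi/public"
```

It is worth opening the file afterwards to confirm that the line looks as expected before running prefetch.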