Trinity

Trinity is a method for efficient and robust de novo reconstruction of transcriptomes from RNA-Seq data. Trinity combines four independent software modules: Normalization, Inchworm, Chrysalis and Assembly. All these modules can be applied sequentially to process large RNA-Seq datasets.

The basic usage of Trinity is:

$ Trinity --seqType [fa|fq] --max_memory <maximum_memory> --left input_reads_pair_1.[fa|fq] --right input_reads_pair_2.[fa|fq] [options]
where input_reads_pair_1.[fa|fq] and input_reads_pair_2.[fa|fq] are the paired-end input files of sequence reads in fasta/fastq format, and --seqType is the type of these input reads. The option --max_memory specifies the maximum memory to use with Trinity.
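
As a small illustration of how --seqType relates to the input files, the sketch below infers the value from the file extension and prints the resulting command instead of running it. The helper names guess_seqtype and trinity_cmd and the file names are hypothetical and not part of Trinity; only the options from the usage line above are used.

```shell
# guess_seqtype FILE: print "fa" or "fq" based on the file extension.
# Hypothetical helper, not part of Trinity.
guess_seqtype() {
    case "$1" in
        *.fa|*.fasta) echo fa ;;
        *.fq|*.fastq) echo fq ;;
        *) echo "cannot infer --seqType from: $1" >&2; return 1 ;;
    esac
}

# trinity_cmd MAX_MEMORY LEFT RIGHT: print (do not run) the Trinity command.
trinity_cmd() {
    seqtype=$(guess_seqtype "$2") || return 1
    echo "Trinity --seqType $seqtype --max_memory $1 --left $2 --right $3"
}

# Example with placeholder file names:
trinity_cmd 10G reads_1.fastq reads_2.fastq
```

Printing the command first makes it easy to check the options before submitting a long-running job. Extensions the helper does not recognize are reported as an error rather than guessed.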

Trinity produces many intermediate files, which can put a heavy load on the file system. To avoid any issues, copy all the input data to the faster local storage called “scratch”, write the output to “scratch”, and finally copy all the needed output files from “scratch” to /work. The “scratch” directories are unique per job and are deleted when the job finishes, so the results must be copied back before the job ends. This can greatly improve performance!
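
The staging pattern described above could look roughly like the following sketch. The helpers stage_in and stage_out are hypothetical, and the scratch location, the /work destination, the file names, and the memory value in the comments are assumptions to be replaced with the paths and limits valid for your job.

```shell
# stage_in DEST FILE...: copy input files into a scratch directory.
stage_in() {
    dest=$1
    shift
    mkdir -p "$dest" && cp "$@" "$dest"/
}

# stage_out SRC_DIR DEST_DIR: copy an output directory back to permanent storage.
stage_out() {
    mkdir -p "$2" && cp -r "$1" "$2"/
}

# Intended use inside a job script (all paths are assumptions):
#   SCRATCH=/scratch
#   stage_in "$SCRATCH" reads_1.fq reads_2.fq
#   Trinity --seqType fq --max_memory 100G \
#       --left "$SCRATCH/reads_1.fq" --right "$SCRATCH/reads_2.fq" \
#       --output "$SCRATCH/trinity_out"
#   stage_out "$SCRATCH/trinity_out" /work/group/user/results
```

Because the scratch directory disappears with the job, the stage_out call belongs at the end of the same job script that ran Trinity.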

Additional Trinity options can be found on the Trinity website, or by running Trinity without any arguments:

$ Trinity

Running the Trinity pipeline from beginning to end on large datasets may exceed the walltime limit for a single job. Therefore, Trinity provides a mechanism to run the workflow in four separate steps, where each step resumes from the previous one. The same Trinity command and options are used for every step. Steps 1 to 3 each add one extra option that stops the run at a given stage, and the last step runs the Trinity command as normal.

Step 1 Options
Trinity [options] --no_run_inchworm
Step 2 Options
Trinity [options] --no_run_chrysalis
Step 3 Options
Trinity [options] --no_distributed_trinity_exec
Step 4 Options
Trinity [options]
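
To make the relationship between the four commands explicit, the sketch below keeps the common options in one place and adds only the step-specific option. The function step_flag and the values in COMMON are placeholders chosen for illustration; the three extra options are the ones listed above.

```shell
# step_flag N: print the extra option for step N (empty for the last step).
step_flag() {
    case "$1" in
        1) echo "--no_run_inchworm" ;;
        2) echo "--no_run_chrysalis" ;;
        3) echo "--no_distributed_trinity_exec" ;;
        4) echo "" ;;
        *) echo "unknown step: $1" >&2; return 1 ;;
    esac
}

# Options shared by all steps (placeholder values).
COMMON="--seqType fq --max_memory 100G --left reads_1.fq --right reads_2.fq"

# Print the command for each step; nothing is executed here.
for step in 1 2 3 4; do
    echo Trinity $COMMON $(step_flag "$step")
done
```

In practice each printed command would go into its own job script, so that the options stay identical across steps and only the final flag differs.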

Each step may be run as its own job, providing a workaround for the single job walltime limit. To see how to run each step of Trinity as a single job under the SLURM scheduler on HCC, please check:
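
One possible way to run the steps as separate jobs is to chain them with SLURM job dependencies, so that each job starts only after the previous one has finished successfully. The sketch below is an assumption about how this could be set up, not the site's official procedure: submit_chain and the job script names are hypothetical, while --parsable and --dependency=afterok are standard sbatch options.

```shell
# SBATCH can be overridden for a dry run; it defaults to the real sbatch.
SBATCH=${SBATCH:-sbatch}

# submit_chain SCRIPT...: submit the scripts in order, each one depending on
# successful completion (afterok) of the previously submitted job.
submit_chain() {
    prev=""
    for script in "$@"; do
        if [ -z "$prev" ]; then
            jobid=$("$SBATCH" --parsable "$script") || return 1
        else
            jobid=$("$SBATCH" --parsable --dependency=afterok:"$prev" "$script") || return 1
        fi
        # --parsable prints "jobid" or "jobid;cluster"; keep only the job ID.
        prev=${jobid%%;*}
        echo "submitted $script as job $prev"
    done
}

# Usage with hypothetical job scripts, one per Trinity step:
#   submit_chain trinity_step1.sh trinity_step2.sh trinity_step3.sh trinity_step4.sh
```

With afterok, a failed step keeps the later jobs from starting, which avoids resuming from incomplete output.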

Useful Information

To test the performance of Trinity (trinity/r2014-04-13p1), we used three pairs of paired-end input fastq files: small_1.fastq and small_2.fastq, medium_1.fastq and medium_2.fastq, and large_1.fastq and large_2.fastq. Statistics about the input files and the time and memory resources used by Trinity are shown below:

Input files:

File              Total # of sequences   Total # of bases   Total size (MB)
small_1.fastq     50,121                 2,506,050          8.010
small_2.fastq     50,121                 2,506,050          8.010
medium_1.fastq    786,742                59,792,392         152
medium_2.fastq    786,742                59,792,392         152
large_1.fastq     10,174,715             1,027,646,215      3,376
large_2.fastq     10,174,715             1,027,646,215      3,376

Resources used by Trinity for each pair of input files:

Resource               small           medium          large
Step 1 used time       ~ 1 minute      ~ 3 minutes     ~ 58 minutes
Step 1 used memory     ~ 35 GB         ~ 68 GB         ~ 80 GB
Step 2 used time       ~ 0.01 hours    ~ 0.1 hours     ~ 5 hours
Step 2 used memory     ~ 0.6 GB        ~ 3 GB          ~ 30 GB
Step 3 used time       ~ 0.2 minutes   ~ 0.8 minutes   ~ 35 minutes
Step 3 used memory     ~ 0.07 GB       ~ 0.6 GB        ~ 8 GB
Step 4 used time       ~ 0.008 hours   ~ 0.3 hours     ~ 13 hours
Step 4 used memory     ~ 0.8 GB        ~ 5 GB          ~ 30 GB
# of used CPUs         8               8               8

The Inchworm (step 1) and Chrysalis (step 2) steps can be memory intensive. A basic recommendation is to have 1 GB of RAM per 1 million ~76-base Illumina paired-end reads.
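
The rule of thumb can be turned into a quick estimate, as in the sketch below. The helper names are hypothetical, and two assumptions are made: reads are counted in one file of the pair, and the FASTQ file uses exactly four lines per record. The result is rounded up to the next whole GB.

```shell
# count_fastq_reads FILE: number of reads, assuming 4 lines per FASTQ record.
count_fastq_reads() {
    lines=$(wc -l < "$1")
    echo $(( lines / 4 ))
}

# recommended_mem_gb READS: 1 GB per started million reads, at least 1 GB.
recommended_mem_gb() {
    gb=$(( ($1 + 999999) / 1000000 ))
    if [ "$gb" -lt 1 ]; then
        gb=1
    fi
    echo "$gb"
}

# Example: the read count of large_1.fastq from the test above.
recommended_mem_gb 10174715    # prints 11
```

For a file on disk, the two helpers can be combined as recommended_mem_gb "$(count_fastq_reads reads_1.fastq)", where reads_1.fastq is a placeholder name. Treat the result as a starting point for --max_memory rather than a guarantee.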