Trinity is a method for efficient and robust de novo reconstruction of transcriptomes from RNA-Seq data. Trinity combines four independent software modules: Normalization, Inchworm, Chrysalis, and Assembly. These modules are applied sequentially to process large RNA-Seq datasets.
The basic usage of Trinity is:
$ Trinity --seqType [fa|fq] --max_memory <maximum_memory> --left input_reads_pair_1.[fa|fq] --right input_reads_pair_2.[fa|fq] [options]
Trinity produces many intermediate files, which can put a heavy load on the file system. To avoid issues, copy all the input data to the faster local storage called “scratch”, write the output to “scratch”, and finally copy all the needed output files from “scratch” to /work. The “scratch” directories are unique per job and are deleted when the job finishes, so copy the results back before the job ends. Using “scratch” this way can greatly improve performance.
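The copy-in, run, copy-out pattern described above can be sketched with two small helper functions. This is only an illustration: the function names `stage_inputs` and `collect_outputs`, the `input/` and `output/` subdirectories, and the `SCRATCH_DIR` variable in the commented usage are assumptions made for this example, not names defined by HCC or Trinity. Substitute the actual scratch and /work paths for your job.

```shell
#!/bin/bash
# Hypothetical helpers for staging data through per-job scratch storage.

# Copy the given input files into <scratch_dir>/input.
stage_inputs() {
    local scratch_dir="$1"; shift
    mkdir -p "$scratch_dir/input"
    cp -- "$@" "$scratch_dir/input/"
}

# Copy selected files from <scratch_dir>/output to a permanent directory.
collect_outputs() {
    local scratch_dir="$1" dest_dir="$2"; shift 2
    local f
    mkdir -p "$dest_dir"
    for f in "$@"; do
        cp -- "$scratch_dir/output/$f" "$dest_dir/"
    done
}

# Example use inside a job script (commented out; all paths are placeholders):
#   stage_inputs "$SCRATCH_DIR" input_reads_pair_1.fq input_reads_pair_2.fq
#   Trinity --seqType fq --max_memory <maximum_memory> \
#       --left  "$SCRATCH_DIR/input/input_reads_pair_1.fq" \
#       --right "$SCRATCH_DIR/input/input_reads_pair_2.fq" \
#       --output "$SCRATCH_DIR/output/trinity_out"
#   collect_outputs "$SCRATCH_DIR" /work/<your_directory> <result_files>
```

The names of the result files depend on the Trinity version, so they are passed as arguments rather than hard-coded.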
Additional Trinity options can be found on the Trinity website, or by typing:
$ Trinity
Running the Trinity pipeline from beginning to end on large datasets may exceed the walltime limit for a single job. Therefore, Trinity provides a mechanism to run the workflow in four separate steps, where each step resumes from where the previous one stopped. Each step uses the same Trinity command and options, plus one additional option that tells Trinity where to stop. The last step runs the Trinity command as normal.
Trinity [options] --no_run_inchworm
Trinity [options] --no_run_chrysalis
Trinity [options] --no_distributed_trinity_exec
Trinity [options]
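As a small sketch of how the four commands above relate, the helper below maps a step number to its extra option and prints the resulting command lines without running them. The function names `trinity_step_option` and `print_trinity_steps` are made up for this example. The option names are the ones listed above.

```shell
#!/bin/bash
# Return the extra Trinity option for a given step (1-4), as listed above.
trinity_step_option() {
    case "$1" in
        1) echo "--no_run_inchworm" ;;
        2) echo "--no_run_chrysalis" ;;
        3) echo "--no_distributed_trinity_exec" ;;
        4) echo "" ;;   # last step: no extra option
        *) echo "unknown step: $1" >&2; return 1 ;;
    esac
}

# Print (do not run) the Trinity command for each of the four steps.
# $1 holds the options shared by all steps.
print_trinity_steps() {
    local common_opts="$1" step extra
    for step in 1 2 3 4; do
        extra=$(trinity_step_option "$step")
        # Append the extra option only when it is non-empty.
        echo "Trinity ${common_opts}${extra:+ $extra}"
    done
}

print_trinity_steps "--seqType fq --left input_reads_pair_1.fq --right input_reads_pair_2.fq"
```

Because the shared options are identical in every step, keeping them in one variable helps avoid accidental differences between the step commands.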
Each step may be run as its own job, providing a workaround for the single job walltime limit. To see how to run each step of Trinity as a single job under the SLURM scheduler on HCC, please check:
To test the performance of Trinity (trinity/r2014-04-13p1), we used three pairs of paired-end input fastq files: small_1.fastq and small_2.fastq, medium_1.fastq and medium_2.fastq, and large_1.fastq and large_2.fastq. Some statistics about the input files and the time and memory resources used by Trinity are shown in the table below:
Input file | Total # of sequences | Total # of bases | Total size (MB) | Step 1 time | Step 1 memory | Step 2 time | Step 2 memory | Step 3 time | Step 3 memory | Step 4 time | Step 4 memory | # of CPUs used |
---|---|---|---|---|---|---|---|---|---|---|---|---|
small_1.fastq | 50,121 | 2,506,050 | 8.010 | ~ 1 minute | ~ 35 GB | ~ 0.01 hours | ~ 0.6 GB | ~ 0.2 minutes | ~ 0.07 GB | ~ 0.008 hours | ~ 0.8 GB | 8 |
small_2.fastq | 50,121 | 2,506,050 | 8.010 | | | | | | | | | |
medium_1.fastq | 786,742 | 59,792,392 | 152 | ~ 3 minutes | ~ 68 GB | ~ 0.1 hours | ~ 3 GB | ~ 0.8 minutes | ~ 0.6 GB | ~ 0.3 hours | ~ 5 GB | 8 |
medium_2.fastq | 786,742 | 59,792,392 | 152 | | | | | | | | | |
large_1.fastq | 10,174,715 | 1,027,646,215 | 3,376 | ~ 58 minutes | ~ 80 GB | ~ 5 hours | ~ 30 GB | ~ 35 minutes | ~ 8 GB | ~ 13 hours | ~ 30 GB | 8 |
large_2.fastq | 10,174,715 | 1,027,646,215 | 3,376 | | | | | | | | | |

Time, memory, and CPU values are reported once per pair of input files, on the row of the first file.
The Inchworm (step 1) and Chrysalis (step 2) steps can be memory intensive. A basic recommendation is to have about 1 GB of RAM per 1 million ~76-base Illumina paired-end reads.
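The rule of thumb above can be turned into a quick estimate. The sketch below counts the reads in an uncompressed fastq file (assuming the usual four lines per record) and rounds the recommendation up to whole gigabytes. The function names `count_fastq_reads` and `estimate_memory_gb` are made up for this example. Treat the result as a rough starting point only, since actual usage depends on the data.

```shell
#!/bin/bash
# Count reads in an uncompressed fastq file, assuming 4 lines per record.
count_fastq_reads() {
    local lines
    lines=$(wc -l < "$1")
    echo $(( lines / 4 ))
}

# Apply the rule of thumb: 1 GB of RAM per 1 million reads, rounded up.
estimate_memory_gb() {
    local reads="$1"
    echo $(( (reads + 999999) / 1000000 ))
}

# Example with the read count of large_1.fastq from the table above.
estimate_memory_gb 10174715   # prints 11
```

For a real file, the two functions can be combined as `estimate_memory_gb "$(count_fastq_reads large_1.fastq)"`.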