TopHat/TopHat2

TopHat is a fast splice junction mapper for RNA-Seq data. It first aligns RNA-Seq reads to reference genomes using the ultra high-throughput short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons.

TopHat and TopHat2 accept the same options and produce the same number of output files, but TopHat2 incorporates many significant improvements over TopHat. The TopHat package at HCC supports both tophat and tophat2.

The basic usage of TopHat/TopHat2 is:

$ [tophat|tophat2] [options] index_prefix [input_reads_pair_1.[fasta|fastq] input_reads_pair_2.[fasta|fastq] | input_reads.[fasta|fastq]]
where index_prefix is the basename of the genome index to be searched. This index is generated prior to running TopHat/TopHat2 by using Bowtie/Bowtie2.

TopHat2 accepts a single file or a comma-separated list of files containing paired-end and single-end reads in fasta or fastq format. The single-end reads need to be provided after the paired-end reads.
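
As a minimal sketch of the comma-separated form, the snippet below joins several paired-end files into the two lists that TopHat2 expects and prints the resulting command instead of running it. The file names (runA_1.fastq and so on) and the join_by_comma helper are hypothetical and are introduced only for illustration.

```shell
#!/bin/sh
# Join all arguments with commas (hypothetical helper, not part of TopHat2).
join_by_comma() {
    ( IFS=,; printf '%s\n' "$*" )
}

# Hypothetical paired-end files from two sequencing runs of the same sample.
left_reads=$(join_by_comma runA_1.fastq runB_1.fastq)
right_reads=$(join_by_comma runA_2.fastq runB_2.fastq)

# Print the command rather than executing it, so it can be inspected first.
tophat_cmd="tophat2 index_prefix $left_reads $right_reads"
echo "$tophat_cmd"
```

The files must appear in the same order in both lists so that each file in the first list is paired with its mate file in the second.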

More advanced TopHat2 options can be found in its manual, or by typing:

$ tophat2 -h

Prior to running TopHat/TopHat2, an index of the reference genome should be built using Bowtie/Bowtie2. Moreover, TopHat2 requires both the index files and the reference file to be in the same directory. If the reference file is not available, TopHat2 reconstructs it from the index in its initial step.
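
The layout described above can be checked before a job is submitted. The sketch below assumes a Bowtie2 index (files ending in .bt2, or .bt2l for large genomes) and a reference file named index_prefix.fa next to it. The check_index_layout helper is hypothetical, and reference.fa is a placeholder name for the genome fasta file.

```shell
#!/bin/sh
# The index itself would be built beforehand with Bowtie2, for example:
#   bowtie2-build reference.fa index_prefix
# (reference.fa is a placeholder for the genome fasta file.)

# Hypothetical helper: check that the index and the reference sit together.
# Returns 0 if a Bowtie2 index is present for the prefix, 1 otherwise.
check_index_layout() {
    prefix=$1
    if [ ! -e "${prefix}.1.bt2" ] && [ ! -e "${prefix}.1.bt2l" ]; then
        echo "ERROR: no Bowtie2 index found for prefix ${prefix}"
        return 1
    fi
    if [ ! -e "${prefix}.fa" ]; then
        # Not fatal: TopHat2 rebuilds the reference from the index.
        echo "NOTE: ${prefix}.fa not found; TopHat2 will reconstruct it"
    fi
    return 0
}
```

Calling check_index_layout index_prefix prints nothing when both the index and the reference file are in place.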

An example of how to run TopHat2 on Swan with paired-end fastq files input_reads_pair_1.fastq and input_reads_pair_2.fastq, reference index index_prefix and 8 CPUs is shown below:

tophat2_alignment.submit
#!/bin/bash
#SBATCH --job-name=Tophat2
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=168:00:00
#SBATCH --mem=10gb
#SBATCH --output=Tophat2.%J.out
#SBATCH --error=Tophat2.%J.err

module load samtools/1.3 bowtie/2.3 tophat/2.0

tophat2 -p $SLURM_NTASKS_PER_NODE index_prefix input_reads_pair_1.fastq input_reads_pair_2.fastq

TopHat2 generates its own output directory tophat_out/ that contains multiple TopHat2-generated files.

TopHat2 Output

TopHat2 produces a number of files in its tophat_out/ output directory. Some of the generated files are:

  • accepted_hits.bam: list of read alignments in BAM format
  • unmapped.bam: list of unmapped reads in BAM format
  • junctions.bed: BED track of reported junctions
  • insertions.bed: BED track of insertions reported by TopHat
  • deletions.bed: BED track of deletions reported by TopHat
  • prep_reads.info: statistics about the input sequencing data (min/max read length, number of reads)
  • align_summary.txt: summary of the alignment counts (number of mapped reads, overall read mapping rate)
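
After a run finishes, it can be useful to confirm that the files listed above were actually written. The following is a minimal sketch; list_missing_outputs is a hypothetical helper, and the file names it checks are the ones from the list above.

```shell
#!/bin/sh
# Hypothetical helper: print the name of each expected TopHat2 output file
# that is missing from the given directory (default: tophat_out).
list_missing_outputs() {
    outdir=${1:-tophat_out}
    for f in accepted_hits.bam unmapped.bam junctions.bed insertions.bed \
             deletions.bed prep_reads.info align_summary.txt; do
        [ -e "${outdir}/${f}" ] || echo "$f"
    done
}
```

For example, list_missing_outputs tophat_out prints nothing when all seven files are present. The alignment summary is plain text and can then be read directly with cat tophat_out/align_summary.txt.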