Scythe

Scythe is a 3’ end adapter trimmer that uses a Naive Bayesian approach to classify contaminant substrings in sequence reads. 3’ ends often include poor-quality bases, so adapter contaminants there need to be removed prior to quality-based trimming, mapping, assembly, and further analysis.

The basic usage of Scythe is:

$ scythe -a adapter_file.fasta input_reads.fastq -o output_reads.fastq
where adapter_file.fasta is the fasta input file of the adapter sequences that need to be removed from the 3’ end of the sequence data, and input_reads.fastq is the input sequencing data in fastq format.

The file output_reads.fastq contains the sequencing reads with the adapters removed. If the adapter sequences are unknown, Scythe provides two adapter files that can be used with the -a option: illumina_adapters.fa and truseq_adapters.fasta.

More information about Scythe can be found by typing:

$ scythe --help

A simple Scythe submit script for Tusker that uses the illumina_adapters.fa file and input_reads.fastq is shown below:

scythe.submit
#!/bin/sh
#SBATCH --job-name=Scythe
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=168:00:00
#SBATCH --mem=10gb
#SBATCH --output=Scythe.%J.out
#SBATCH --error=Scythe.%J.err

module load scythe/0.991

scythe -a ${SCYTHE_HOME}/illumina_adapters.fa input_reads.fastq -o output_reads.fastq

Scythe is a single-threaded program, and therefore both #SBATCH --nodes and #SBATCH --ntasks-per-node are set to 1.

The two adapter files provided by Scythe are stored in $SCYTHE_HOME. To access the Illumina adapter file, use $SCYTHE_HOME/illumina_adapters.fa; to access the TruSeq file, use $SCYTHE_HOME/truseq_adapters.fasta.
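Before choosing between the two files, it can help to see which adapters each one contains. The sketch below is one way to do that, assuming the files are ordinary fasta files whose header lines start with ">". The helper name list_adapters is hypothetical and not part of Scythe.

```shell
# list_adapters is a hypothetical helper, not part of Scythe.
# It prints the name of every sequence in a fasta file.
list_adapters() {
    # fasta header lines start with ">"; strip that marker to leave the name.
    grep '^>' "$1" | sed 's/^>//'
}

# Only run against the bundled file when the scythe module is loaded.
if [ -n "${SCYTHE_HOME:-}" ] && [ -f "${SCYTHE_HOME}/illumina_adapters.fa" ]; then
    list_adapters "${SCYTHE_HOME}/illumina_adapters.fa"
fi
```

The same function can be pointed at $SCYTHE_HOME/truseq_adapters.fasta or at a custom adapter file.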

Scythe Output

Scythe returns a fastq file of reads with the adapter sequences removed.
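A quick way to see how much was trimmed is to compare the number of reads and bases before and after the run. The sketch below assumes the common fastq layout of exactly four lines per read. The helper fastq_stats is hypothetical and not part of Scythe. For each file it prints the read count followed by the total base count.

```shell
# fastq_stats is a hypothetical helper, not part of Scythe.
# It assumes exactly four lines per read (header, sequence, "+", qualities).
fastq_stats() {
    # The sequence is the second line of every four-line record.
    awk 'NR % 4 == 2 { bases += length($0) }
         END { printf "%d %d\n", NR / 4, bases }' "$1"
}

# Report reads and bases for the input and the trimmed output, if present.
for f in input_reads.fastq output_reads.fastq; do
    if [ -f "$f" ]; then
        printf '%s: ' "$f"
        fastq_stats "$f"
    fi
done
```

A lower base count in output_reads.fastq than in input_reads.fastq indicates that adapter sequence was removed.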

Useful Information

To test the performance of Scythe (scythe/0.991) on Tusker, we used three paired-end datasets, each consisting of two input fastq files: small_1.fastq and small_2.fastq, medium_1.fastq and medium_2.fastq, and large_1.fastq and large_2.fastq. Some statistics about the input files and the time and memory resources used by Scythe on Tusker are shown in the table below:

| file | total # of sequences | total # of bases | total size in MB | used time | used memory | # of used CPUs |
|---|---|---|---|---|---|---|
| small_1.fastq | 50,121 | 2,506,050 | 8.010 | ~ 0.04 minutes | ~ 0.014 GB | 1 |
| small_2.fastq | 50,121 | 2,506,050 | 8.010 | ~ 0.04 minutes | ~ 0.014 GB | 1 |
| medium_1.fastq | 786,742 | 59,792,392 | 152 | ~ 1 minute | ~ 0.2 GB | 1 |
| medium_2.fastq | 786,742 | 59,792,392 | 152 | ~ 1 minute | ~ 0.2 GB | 1 |
| large_1.fastq | 10,174,715 | 1,027,646,215 | 3,376 | ~ 13 minutes | ~ 3 GB | 1 |
| large_2.fastq | 10,174,715 | 1,027,646,215 | 3,376 | ~ 17 minutes | ~ 6.5 GB | 1 |
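The basic usage takes one input file per run, and the table reports each mate file separately, so a paired-end dataset can be handled by running Scythe once per file. The loop below is a minimal sketch of that approach, not necessarily how the benchmark was run. The helper trimmed_name and the ".trimmed.fastq" output suffix are hypothetical choices made for illustration.

```shell
# trimmed_name is a hypothetical helper: small_1.fastq -> small_1.trimmed.fastq
trimmed_name() {
    printf '%s.trimmed.fastq\n' "${1%.fastq}"
}

# Trim each mate file in its own Scythe run.
for reads in small_1.fastq small_2.fastq; do
    # Skip quietly when the input or the scythe executable is missing.
    if [ -f "$reads" ] && command -v scythe >/dev/null 2>&1; then
        scythe -a "${SCYTHE_HOME}/illumina_adapters.fa" "$reads" -o "$(trimmed_name "$reads")"
    fi
done
```

Because each run is independent and single threaded, the two files of a pair can also be submitted as separate jobs.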