CD-HIT

CD-HIT (Cluster Database at High Identity with Tolerance, http://weizhong-lab.ucsd.edu/cd-hit) is a program for clustering and comparing nucleotide or protein sequences. It is very fast and can handle large sequence datasets.

Some of the most frequently used executables from the CD-HIT package are CD-HIT, CD-HIT-2D, CD-HIT-EST, CD-HIT-EST-2D, CD-HIT-454, CD-HIT-PARA, PSI-CD-HIT, CD-HIT-OTU, CD-HIT-LAP, and CD-HIT-DUP. Several of them are described here:

CD-HIT (proteins) and CD-HIT-EST (nucleotides) group similar sequences into clusters that meet a defined similarity threshold
CD-HIT-2D (CD-HIT-EST-2D) compares two datasets and identifies the sequences in db2 that are similar to db1 above a given threshold
CD-HIT-454 identifies natural and artificial duplicates from pyrosequencing reads
CD-HIT-OTU clusters rRNA tags into OTUs
CD-HIT-DUP identifies duplicates from single or paired Illumina reads
CD-HIT-LAP identifies overlapping reads
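
As a rough illustration of the two-dataset comparison, the sketch below assembles a CD-HIT-2D command. The file names db1.fasta, db2.fasta, and db2_novel are hypothetical placeholders, and the threshold and word length are only example settings. The command is executed only if cd-hit-2d is on the PATH and both inputs exist; otherwise it is just printed.

```shell
#!/bin/bash
# Hypothetical file names; replace them with your own datasets.
db1="db1.fasta"
db2="db2.fasta"
out="db2_novel"

# -i: first dataset, -i2: second dataset, -o: output prefix,
# -c: sequence identity threshold, -n: word length
cmd_2d=(cd-hit-2d -i "$db1" -i2 "$db2" -o "$out" -c 0.9 -n 5)

if command -v cd-hit-2d >/dev/null 2>&1 && [ -f "$db1" ] && [ -f "$db2" ]; then
    "${cmd_2d[@]}"
else
    echo "Would run: ${cmd_2d[*]}"
fi
```

According to the CD-HIT user's guide, the FASTA output of CD-HIT-2D holds the db2 sequences that are not similar to db1, while the db2 sequences that match db1 are reported in the accompanying .clstr file.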

A detailed overview of the whole CD-HIT package and its executables can be found in the CD-HIT user's guide.

The basic usage of CD-HIT is:

$ cd-hit -i input_reads.fasta -o output [options]

where input_reads.fasta is an input file of sequence reads in FASTA format, output is the prefix of the output files, and options are optional parameters, which are listed when the program is run without arguments:

$ cd-hit
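
Two options that are commonly adjusted together are the sequence identity threshold (-c) and the word length (-n). The sketch below is only an illustration: the helper word_length is a hypothetical function, the ranges it encodes follow the protein recommendations in the CD-HIT user's guide and should be checked against the help output of the installed version, and the output prefixes are made up. It prints the commands rather than running them.

```shell
#!/bin/bash
# Hypothetical helper: pick a protein word length (-n) for a given
# identity threshold (-c), using the ranges from the CD-HIT user's guide.
word_length() {
    awk -v c="$1" 'BEGIN {
        if (c >= 0.7) print 5        # thresholds 0.7 to 1.0
        else if (c >= 0.6) print 4   # thresholds 0.6 to 0.7
        else if (c >= 0.5) print 3   # thresholds 0.5 to 0.6
        else print 2                 # thresholds 0.4 to 0.5
    }'
}

# Print (not run) one cd-hit command per example threshold.
for c in 0.9 0.65; do
    n=$(word_length "$c")
    echo "cd-hit -i input_reads.fasta -o output_c${c} -c ${c} -n ${n}"
done
```

The protein program cd-hit is not intended for thresholds below 0.4, so the last branch should be read as covering the 0.4 to 0.5 range only.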

CD-HIT is a multi-threaded program, so running it with multiple threads is recommended. With the CD-HIT parameter -T 0, all CPUs defined in the SLURM script are used. The parameter -M 0 removes the memory limit, allowing CD-HIT to use all of the available memory.
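
An explicit alternative to -T 0 and -M 0 is to pass the allocated values directly. The following is a minimal sketch, assuming the job runs under SLURM so that the SLURM_CPUS_ON_NODE environment variable is set; the 8000 MB cap is a hypothetical value chosen for illustration, not a recommendation. The command is only printed here.

```shell
#!/bin/bash
# Assumption: inside a SLURM job, SLURM_CPUS_ON_NODE holds the number of
# CPUs allocated on the node; outside a job we fall back to 1 thread.
threads="${SLURM_CPUS_ON_NODE:-1}"

# Hypothetical memory cap; CD-HIT's -M is given in MB (0 means unlimited).
mem_mb=8000

cmd=(cd-hit -i input_reads.fasta -o output -M "$mem_mb" -T "$threads")
echo "Would run: ${cmd[*]}"
```

Setting -M somewhat below the job's memory request is one way to keep CD-HIT inside the allocation; whether a cap is appropriate depends on the dataset.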

A simple SLURM script that runs CD-HIT on Swan with 8 CPUs is given below:

#!/bin/bash
#SBATCH --job-name=CD-HIT
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=168:00:00
#SBATCH --mem=10gb
#SBATCH --output=CD-HIT.%J.out
#SBATCH --error=CD-HIT.%J.err

module load cd-hit/4.6

cd-hit -i input_reads.fasta -o output -M 0 -T 0

CD-HIT Output

CD-HIT writes two files: output and output.clstr. The output file contains the final clustered, non-redundant sequences in FASTA format, while output.clstr lists the clusters and the sequences assigned to each of them.
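
The cluster file can be summarized with standard shell tools. The sketch below is an illustration only: example.clstr is a small made-up file written in the .clstr layout (a ">Cluster N" header per cluster, one line per member, and a trailing "*" on the representative sequence), and the sequence names are invented. For real data, point the commands at output.clstr instead.

```shell
#!/bin/bash
# Made-up example data in the .clstr layout; real files are written by cd-hit.
printf '%b\n' \
    '>Cluster 0' \
    '0\t250aa, >seqA... *' \
    '1\t245aa, >seqB... at 97.14%' \
    '>Cluster 1' \
    '0\t180aa, >seqC... *' > example.clstr

# Number of clusters: every cluster starts with a ">Cluster N" header line.
n_clusters=$(grep -c '^>Cluster' example.clstr)
echo "clusters: $n_clusters"

# Members per cluster: count the lines between consecutive headers.
sizes=$(awk '
    /^>Cluster/ { if (seen) print id "\t" n; id = $2; n = 0; seen = 1; next }
    { n++ }
    END { if (seen) print id "\t" n }' example.clstr)
echo "$sizes"

# Representative sequences: member lines that end with "*".
reps=$(awk '/\*$/ { name = $3; sub(/^>/, "", name); sub(/\.\.\.$/, "", name); print name }' example.clstr)
echo "$reps"
```

The size table is tab-separated (cluster number, member count), so it can be piped into sort -k2,2nr to list the largest clusters first.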