CD-HIT (Cluster Database at High Identity with Tolerance), http://weizhong-lab.ucsd.edu/cd-hit, is a program for clustering and comparing nucleotide or protein sequences. CD-HIT is fast and can handle large sequence datasets.
Some of the most frequently used executables from the CD-HIT package are CD-HIT, CD-HIT-2D, CD-HIT-EST, CD-HIT-EST-2D, CD-HIT-454, CD-HIT-PARA, PSI-CD-HIT, CD-HIT-OTU, CD-HIT-LAP and CD-HIT-DUP.
A detailed overview of the whole CD-HIT package and its executables can be found in the CD-HIT user’s guide.
The basic usage of CD-HIT is:
$ cd-hit -i input_reads.fasta -o output [options]
Here, input_reads.fasta is an input file of sequence reads in FASTA format, output is the prefix of the output files, and options are optional parameters that can be listed by running:
$ cd-hit
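As an illustration of how options combine with the basic command, the sketch below sets the sequence identity threshold with -c and picks a matching word length with -n. The threshold-to-word-length ranges follow the guidance in the CD-HIT user's guide for protein clustering with cd-hit; the helper name word_length_for and the chosen threshold are hypothetical, and the script only prints the resulting command rather than running it.

```shell
#!/bin/bash
# Hypothetical helper: pick the cd-hit word length (-n) for a protein
# identity threshold (-c), following the ranges in the user's guide.
word_length_for() {
    awk -v c="$1" 'BEGIN {
        if      (c >= 0.7) print 5
        else if (c >= 0.6) print 4
        else if (c >= 0.5) print 3
        else               print 2   # intended for thresholds down to 0.4
    }'
}

identity=0.9                          # example threshold (assumption)
n="$(word_length_for "$identity")"

# Print the command instead of running it, so it can be inspected first.
echo "cd-hit -i input_reads.fasta -o output -c $identity -n $n"
```

Other executables in the package, such as cd-hit-est, use different word length ranges, so the mapping should be checked against the user's guide before reuse.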
CD-HIT is a multi-threaded program, so using multiple threads is recommended. Setting the CD-HIT parameter -T 0 uses all CPUs defined in the SLURM script. Setting the parameter -M 0 allows unlimited usage of the available memory.
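As an alternative to the 0 values, the thread count and memory limit can be passed explicitly so that they mirror the job's allocation. The following is a minimal sketch, assuming the job runs under SLURM and that the SLURM_CPUS_ON_NODE and SLURM_MEM_PER_NODE environment variables are set (the latter only when --mem is requested, in megabytes, which matches the unit of -M). The variable names threads, mem_mb and cdhit_args are illustrative, and the script prints the command instead of executing it.

```shell
#!/bin/bash
# Take the CPU count from SLURM; fall back to 1 thread outside a job.
threads="${SLURM_CPUS_ON_NODE:-1}"

# Take the memory request (MB) from SLURM; fall back to 0 (unlimited in CD-HIT).
mem_mb="${SLURM_MEM_PER_NODE:-0}"

# Assemble the arguments in an array so each value stays a separate word.
cdhit_args=(-i input_reads.fasta -o output -T "$threads" -M "$mem_mb")

# Print the command for inspection; replace echo with a direct call to run it.
echo "cd-hit ${cdhit_args[*]}"
```

Passing explicit values makes the resources CD-HIT will use visible in the job output, which can help when a job is terminated for exceeding its memory request.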
A simple SLURM script for running CD-HIT on Swan with 8 CPUs is shown below:
cd-hit.submit
#!/bin/bash
#SBATCH --job-name=CD-HIT
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=168:00:00
#SBATCH --mem=10gb
#SBATCH --output=CD-HIT.%J.out
#SBATCH --error=CD-HIT.%J.err
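# Load the CD-HIT module
module load cd-hit/4.6
# -T 0 uses all allocated CPUs; -M 0 removes CD-HIT's internal memory limit
cd-hit -i input_reads.fasta -o output -M 0 -T 0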
CD-HIT writes two files: output and output.clstr. The output file contains the final clustered non-redundant sequences in FASTA format, while output.clstr contains information about the clusters and their associated sequences.
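The cluster file can be summarized with standard tools. The sketch below counts the sequences in each cluster, relying on the .clstr layout in which every cluster starts with a ">Cluster N" header line followed by one line per member sequence. The sample file example.clstr and its sequence names are made up for illustration, and the function name cluster_sizes is hypothetical; for real data, pass output.clstr instead.

```shell
#!/bin/bash
# Write a small made-up sample in the .clstr layout (real files come from cd-hit).
printf '%s\n' \
    '>Cluster 0' \
    '0 120aa, >seqA... *' \
    '1 118aa, >seqB... at 95.00%' \
    '>Cluster 1' \
    '0 98aa, >seqC... *' > example.clstr

# Print "cluster_id<TAB>member_count" for every cluster in the given file.
cluster_sizes() {
    awk '/^>Cluster/ { if (seen) print id "\t" n; id = $2; n = 0; seen = 1; next }
         NF          { n++ }                         # each non-empty line is one member
         END         { if (seen) print id "\t" n }' "$1"
}

cluster_sizes example.clstr
```

In each cluster, the line ending with an asterisk marks the representative sequence, which is the one written to the FASTA output file.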