Clustal Omega

Clustal Omega is a general-purpose multiple sequence alignment (MSA) tool used mainly with protein sequences, but also with DNA and RNA sequences. It is a fast and scalable aligner that can align datasets of hundreds of thousands of sequences in a reasonable time.

The general usage of Clustal Omega is:

$ clustalo -i input_file.fasta -o output_file.fasta [options]
where input_file.fasta is the input sequence file in FASTA format, and output_file.fasta is the multiple sequence alignment output file in FASTA format.

Clustal Omega accepts 3 types of sequence input files:

  • a sequence file with aligned or unaligned sequences
  • a profile, i.e. a file containing a multiple alignment of sequences
  • a Hidden Markov Model (HMM)

These input files must contain at least 2 sequences and must be in one of the following MSA file formats: a2m, fa[sta], clu[stal], msf, phy[lip], selex, st[ockholm], vie[nna]. If no output format is specified, the output file is written in FASTA format.
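Because the output format is chosen with --outfmt, the format and the output file's extension can easily drift apart. The sketch below shows one way to keep them in sync. The helper outfmt_for and its extension-to-format pairs are hypothetical and only illustrate the idea; the format values themselves come from the list above.

```shell
# Hypothetical helper: map an output file extension to a --outfmt value.
# The extension-to-format pairs are an assumption made for illustration.
outfmt_for() {
    case "${1##*.}" in
        fa|fasta)      echo fa    ;;
        aln|clu)       echo clu   ;;
        msf)           echo msf   ;;
        phy|phylip)    echo phy   ;;
        selex)         echo selex ;;
        sto|stockholm) echo st    ;;
        vie|vienna)    echo vie   ;;
        a2m)           echo a2m   ;;
        *)             echo fa    ;;   # unknown extension: use the FASTA default
    esac
}

# Print (not run) the command that would be used for a Stockholm output file.
out=output_msa.sto
echo "clustalo -i input_file.fasta -o $out --outfmt=$(outfmt_for "$out")"
```

The example only echoes the command so it can be checked before it is pasted into a job script.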

More Clustal Omega options can be found by typing:

$ clustalo -h

The following SLURM submit script runs Clustal Omega on Crane with the input file input_reads.fasta, 8 threads, and 10 GB of memory:

clustal_omega.submit
#!/bin/sh
#SBATCH --job-name=Clustal_Omega
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=10:00:00
#SBATCH --mem=10gb
#SBATCH --output=ClustalOmega.%J.out
#SBATCH --error=ClustalOmega.%J.err

module load clustal-omega/1.2

clustalo -i input_reads.fasta -o output_msa.sto --outfmt=st --threads=$SLURM_NTASKS_PER_NODE

The output file output_msa.sto contains the resulting multiple sequence alignment in Stockholm format (--outfmt=st).
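The thread count in the submit script comes from the SLURM_NTASKS_PER_NODE environment variable, which SLURM defines only when --ntasks-per-node is requested. A minimal sketch of a fallback for runs where the variable is unset, such as a quick interactive test, might look like this. The function name pick_threads is hypothetical.

```shell
# Hypothetical helper: use the SLURM task count when it is set,
# otherwise fall back to a single thread.
pick_threads() {
    echo "${SLURM_NTASKS_PER_NODE:-1}"
}

# Print (not run) the resulting command line.
echo "clustalo -i input_reads.fasta -o output_msa.sto --outfmt=st --threads=$(pick_threads)"
```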

If you replace the clustalo command in the submit script with:

$ clustalo -i input_reads.sto --dealign -v
Clustal Omega will read the input file in Stockholm format, de-align the sequences, and then re-align them, printing a progress report along the way (-v). Because no output format is specified, the alignment is produced in the default FASTA format; because no output file (-o) is given, it is written to standard output.

Clustal Omega Output

The basic Clustal Omega run produces one alignment file in the specified output format. Additional intermediate outputs can be generated with specific options, such as --distmat-out= (pairwise distance matrix output file) and --guidetree-out= (guide tree output file).
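As an illustration, the two options can be combined in one run. The wrapper below is a sketch, not a prescribed workflow: the function name and the output file names (.aln.fasta, .distmat, .dnd) are hypothetical. The --full flag is added on the assumption that the distance matrix is only written when the full pairwise matrix is computed, which can be slow and memory-hungry for large inputs.

```shell
# Hypothetical wrapper: align one file and keep the intermediate outputs.
# Output file names are derived from the input name and are illustrative only.
align_with_intermediates() {
    in=$1
    base=${in%.*}    # e.g. input_reads.fasta -> input_reads

    # --full is an assumption: it requests the full distance matrix so that
    # --distmat-out has something to write; it may be slow on large datasets.
    clustalo -i "$in" -o "$base.aln.fasta" \
        --full \
        --distmat-out="$base.distmat" \
        --guidetree-out="$base.dnd" \
        --threads="${SLURM_NTASKS_PER_NODE:-1}"
}
```

Inside a submit script, after the module is loaded, it would be called as: align_with_intermediates input_reads.fasta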

Useful Information

To test the performance of Clustal Omega on Tusker, we used three DNA and protein input FASTA files: data_1.fasta, data_2.fasta, and data_3.fasta. Some statistics about the input files, and the time and memory used by Clustal Omega on Tusker, are shown in the table below:

Input file     Total # of sequences   Average sequence length   Total size   Time used      Memory used   # of CPUs used
data_1.fasta   1,200                  510.17                    641 KB       ~ 5 minutes    ~ 65 MB       8
data_2.fasta   5,715                  174.20                    1,100 KB     ~ 5 minutes    ~ 140 MB      8
data_3.fasta   93,675                 94.29                     11,000 KB    ~ 30 minutes   ~ 2 GB        8
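
To compare your own input with the files in the table, the sequence count and average sequence length can be computed with a few lines of awk. This is a minimal sketch: the helper name fasta_stats is hypothetical, the demonstration input is made up, and this is not necessarily how the statistics above were obtained.

```shell
# Hypothetical helper: print the sequence count and mean sequence length
# of a FASTA file.
fasta_stats() {
    awk '
        /^>/ { n++; next }                               # header line: one more sequence
        { gsub(/[ \t\r]/, ""); total += length($0) }     # residue line: add its length
        END {
            if (n > 0) printf "%d sequences, average length %.2f\n", n, total / n
            else       print "0 sequences"
        }
    ' "$1"
}

# Tiny made-up input, used only to demonstrate the helper.
tmp=$(mktemp)
printf '>seq1\nACGTACGT\n>seq2\nACGT\nAC\n' > "$tmp"
fasta_stats "$tmp"    # prints: 2 sequences, average length 7.00
rm -f "$tmp"
```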