Clustal Omega

Clustal Omega is a general-purpose multiple sequence alignment (MSA) tool used mainly with protein sequences, as well as DNA and RNA sequences. It is a fast and scalable aligner that can align datasets of hundreds of thousands of sequences in a reasonable time.

The general usage of Clustal Omega is:

$ clustalo -i input_file.fasta -o output_file.fasta [options]

where input_file.fasta is the multiple sequence input file in fasta format, and output_file.fasta is the multiple sequence alignment output file in fasta format.

Clustal Omega accepts 3 types of sequence input files:

sequence file with aligned/unaligned sequences
multiple alignment in a file/profile of aligned sequences
Hidden Markov Model (HMM)

These input files must contain at least two sequences and must be in one of the following MSA file formats: a2m, fa[sta], clu[stal], msf, phy[lip], selex, st[ockholm], vie[nna]. If no output format is specified, the output file is written in fasta format.
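As a minimal sketch of how the format keywords above might be passed explicitly, the helper below guesses a keyword from a file extension and prints a command that uses --infmt and --outfmt. The function name clustalo_format_for, the file names, and the extension-to-format mapping are assumptions made for illustration; they are not part of Clustal Omega.

```shell
#!/bin/bash
# Hypothetical helper: guess a clustalo format keyword from a file extension.
# The extension-to-keyword mapping is an assumption for illustration only.
clustalo_format_for() {
    case "${1##*.}" in
        a2m)               echo "a2m" ;;
        fa|fasta)          echo "fa" ;;
        aln|clu|clustal)   echo "clu" ;;
        msf)               echo "msf" ;;
        phy|phylip)        echo "phy" ;;
        selex)             echo "selex" ;;
        sto|st|stockholm)  echo "st" ;;
        vie|vienna)        echo "vie" ;;
        *)                 return 1 ;;   # unknown extension: let the caller decide
    esac
}

# Placeholder file names, not taken from a real dataset.
in_file="input_file.sto"
out_file="output_file.phy"

if in_fmt=$(clustalo_format_for "$in_file") && out_fmt=$(clustalo_format_for "$out_file"); then
    # Printed rather than executed so the sketch runs without clustalo installed.
    echo "clustalo -i $in_file --infmt=$in_fmt -o $out_file --outfmt=$out_fmt"
fi
```

Clustal Omega normally detects the input format on its own, so forcing it with --infmt is mostly useful when auto-detection picks the wrong format.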

More Clustal Omega options can be found by typing:

$ clustalo -h

A SLURM script for running Clustal Omega on Swan with the input file input_reads.fasta, 8 threads, and 10 GB of memory is shown below:

#!/bin/bash
#SBATCH --job-name=Clustal_Omega
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=10:00:00
#SBATCH --mem=10gb
#SBATCH --output=ClustalOmega.%J.out
#SBATCH --error=ClustalOmega.%J.err

module load clustal-omega/1.2

clustalo -i input_reads.fasta -o output_msa.sto --outfmt=st --threads=$SLURM_NTASKS_PER_NODE

The output file output_msa.sto contains the resulting multiple sequence alignment in Stockholm format (--outfmt=st).
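The --threads value in the script comes from a variable that SLURM sets only inside a job. If the same command is tried outside a job, for example in a quick interactive test, that variable is empty. A small sketch of one way to fall back to a single thread follows; the helper name pick_threads is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: use the SLURM task count when present, else 1 thread.
pick_threads() {
    echo "${SLURM_NTASKS_PER_NODE:-1}"
}

threads=$(pick_threads)

# Printed rather than executed so the sketch runs without clustalo installed.
echo "clustalo -i input_reads.fasta -o output_msa.sto --outfmt=st --threads=$threads"
```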

If you replace the command above with:

$ clustalo -i input_reads.sto --dealign -v

Clustal Omega will read the input file in Stockholm format, de-align the sequences, and then re-align them, printing a progress report in the meantime (-v). Because no output format is specified, the output will be in the default fasta format.

Clustal Omega Output

By default, Clustal Omega produces one alignment file in the specified output format. Additional intermediate outputs can be generated with specific options, such as --distmat-out= (pairwise distance matrix output file) and --guidetree-out= (guide tree output file).
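A command that requests both intermediate outputs might look like the sketch below. The output file names are placeholders. The --full option is included on the assumption that some Clustal Omega versions write the distance matrix only when the full distance matrix is computed; check clustalo -h for the installed version.

```shell
#!/bin/bash
# Placeholder file names; adjust them to your own data.
# --full is an assumption: it may be required for --distmat-out to take effect.
cmd=(
    clustalo
    -i input_reads.fasta
    -o output_msa.fasta
    --full
    --distmat-out=distances.mat
    --guidetree-out=guide_tree.dnd
)

# Show the command that would be run.
printf '%s ' "${cmd[@]}"
echo

# Run it only when clustalo is available and the input file exists.
if command -v clustalo >/dev/null 2>&1 && [ -f input_reads.fasta ]; then
    "${cmd[@]}"
fi
```

Computing the full distance matrix can be slow for large inputs, so this is better suited to small and medium datasets.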

Useful Information

To test the performance of Clustal Omega, we used three DNA and protein input fasta files: data_1.fasta, data_2.fasta, and data_3.fasta. Statistics about the input files, and the time and memory used by Clustal Omega, are shown in the table below:

input file	total # of sequences	average sequence length	total size	used time	used memory	# of used CPUs
data_1.fasta	1,200	510.17	641 KB	~ 5 minutes	~ 65 MB	8
data_2.fasta	5,715	174.20	1,100 KB	~ 5 minutes	~ 140 MB	8
data_3.fasta	93,675	94.29	11,000 KB	~ 30 minutes	~ 2 GB	8
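To compare your own input with the files in the table, the sequence count and average sequence length can be computed with a short awk script. This is a minimal sketch, not the method used to produce the table; the function name fasta_stats is hypothetical, and it assumes a plain fasta file in which every header line starts with ">".

```shell
#!/bin/bash
# Hypothetical helper: print "<number of sequences><TAB><mean sequence length>"
# for a fasta file given as the first argument.
fasta_stats() {
    awk '
        /^>/ { n++; next }                    # header line: count one more sequence
        {
            gsub(/[ \t\r]/, "")               # drop whitespace inside residue lines
            total += length($0)               # add residues on this line
        }
        END {
            if (n > 0) printf "%d\t%.2f\n", n, total / n
            else       printf "0\t0.00\n"
        }
    ' "$1"
}
```

For example, fasta_stats data_1.fasta prints the two values for that file, separated by a tab.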