Clustal Omega is a general-purpose multiple sequence alignment (MSA) tool used mainly with protein sequences, but also with DNA and RNA sequences. It is a fast and scalable aligner that can align datasets of hundreds of thousands of sequences in reasonable time.
The general usage of Clustal Omega is:
$ clustalo -i input_file.fasta -o output_file.fasta [options]
where input_file.fasta is the input sequence file in fasta format, and output_file.fasta is the multiple sequence alignment output file in fasta format.
Clustal Omega accepts 3 types of sequence input files:
These input files must contain at least 2 sequences and must be in one of the following MSA file formats: a2m, fa[sta], clu[stal], msf, phy[lip], selex, st[ockholm], vie[nna]. Moreover, if not specified, the generated output file is in fasta format.
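Because Clustal Omega requires at least 2 sequences, a quick check before submitting a job can avoid a wasted queue slot. The following is a minimal sketch, assuming the input is in fasta format (each record starts with a ">" header line); the helper names count_fasta_records and has_enough_sequences are hypothetical and not part of Clustal Omega.

```shell
#!/bin/bash
# Hypothetical helper (not part of Clustal Omega): count fasta records
# by counting header lines, which start with ">".
count_fasta_records() {
    grep -c '^>' "$1"
}

# Hypothetical helper: succeed only if the file holds at least 2 sequences.
has_enough_sequences() {
    [ "$(count_fasta_records "$1")" -ge 2 ]
}

# Possible usage (assumes clustalo is on the PATH):
#   has_enough_sequences input_file.fasta && \
#       clustalo -i input_file.fasta -o output_file.fasta
```

This check only applies to fasta input; the other accepted formats mark records differently and would need their own test.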
More Clustal Omega options can be found by typing:
$ clustalo -h
Running Clustal Omega on Swan with input file input_reads.fasta, using 8 threads and 10 GB of memory, is shown below:
clustal_omega.submit
#!/bin/bash
#SBATCH --job-name=Clustal_Omega
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=10:00:00
#SBATCH --mem=10gb
#SBATCH --output=ClustalOmega.%J.out
#SBATCH --error=ClustalOmega.%J.err
module load clustal-omega/1.2
clustalo -i input_reads.fasta -o output_msa.sto --outfmt=st --threads=$SLURM_NTASKS_PER_NODE
The output file output_msa.sto contains the resulting multiple sequence alignment in Stockholm format (--outfmt=st).
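Moreover, if you replace the command above with:
$ clustalo -i input_reads.sto --dealign -v
the input sequences are dealigned first (--dealign), progress is reported verbosely (-v), and the output is in fasta format.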
The basic Clustal Omega output produces one alignment file in the specified output format. More intermediate outputs can be generated using specific Clustal Omega options, such as: --distmat-out=<file>
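As an illustration of requesting intermediate outputs, the sketch below wraps one clustalo call that also writes a distance matrix and a guide tree. The function name align_with_intermediates and the output file names are hypothetical. The --guidetree-out and --full options are not mentioned above; --full is included on the assumption that the pairwise distance matrix is only written when the full matrix is computed. Confirm both against clustalo -h for the installed version.

```shell
#!/bin/bash
# Hypothetical wrapper: align "$1" and keep intermediate outputs,
# all named after the prefix given in "$2".
align_with_intermediates() {
    local input="$1" prefix="$2"
    # --full           : compute the full distance matrix (assumed to be
    #                    required for --distmat-out; slower on large inputs)
    # --distmat-out    : pairwise distance matrix output file
    # --guidetree-out  : guide tree output file
    clustalo -i "$input" -o "$prefix.fasta" \
        --full \
        --distmat-out="$prefix.distmat" \
        --guidetree-out="$prefix.dnd"
}

# Possible usage:
#   align_with_intermediates input_reads.fasta run1
```

Computing the full distance matrix scales quadratically with the number of sequences, so this is better suited to small and medium inputs.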
To test the performance of Clustal Omega, we used three DNA and protein input fasta files: data_1.fasta, data_2.fasta, and data_3.fasta. Some statistics about the input files, and the time and memory resources used by Clustal Omega, are shown in the table below:
input file | total # of sequences | average sequence length | total size | used time | used memory | # of used CPUs |
---|---|---|---|---|---|---|
data_1.fasta | 1,200 | 510.17 | 641 KB | ~ 5 minutes | ~ 65 MB | 8 |
data_2.fasta | 5,715 | 174.20 | 1,100 KB | ~ 5 minutes | ~ 140 MB | 8 |
data_3.fasta | 93,675 | 94.29 | 11,000 KB | ~ 30 minutes | ~ 2 GB | 8 |
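Statistics like the sequence count and average sequence length in the table can be gathered for your own input with a short awk pass. This is a sketch of one way to do it, not necessarily how the numbers above were produced; the function name fasta_stats is hypothetical, and it assumes fasta input where every non-header line holds sequence residues.

```shell
#!/bin/bash
# Hypothetical helper: print "<number of sequences> <average length>"
# for a fasta file.
fasta_stats() {
    awk '
        /^>/ { n++; next }              # header line starts a new sequence
        {
            gsub(/[ \t\r]/, "")         # drop whitespace and carriage returns
            total += length($0)         # add the residues on this line
        }
        END {
            if (n > 0) printf "%d %.2f\n", n, total / n
            else print "0 0.00"
        }
    ' "$1"
}

# Possible usage:
#   fasta_stats data_1.fasta
```

Sequences wrapped over several lines are handled, since lengths are summed across all lines and divided by the number of headers.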