Running BLAST Alignment

BLAST provides the following basic search commands:

  • blastn: search nucleotide database using a nucleotide query
  • blastp: search protein database using a protein query
  • blastx: search protein database using a translated nucleotide query
  • tblastn: search translated nucleotide database using a protein query
  • tblastx: search translated nucleotide database using a translated nucleotide query
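
The list above can be summarized as a lookup from query type and database type to a command. The following is a minimal sketch of that lookup; the function name blast_program and the type labels nucl and prot are assumptions made for illustration and are not part of BLAST itself. tblastx is left out because it shares the nucleotide/nucleotide combination with blastn.

```shell
#!/bin/bash
# Hypothetical helper: map (query type, database type) to a BLAST command.
# Arguments: $1 = query type, $2 = database type; each is "nucl" or "prot".
blast_program() {
    case "$1:$2" in
        nucl:nucl) echo blastn ;;   # nucleotide query, nucleotide database
        prot:prot) echo blastp ;;   # protein query, protein database
        nucl:prot) echo blastx ;;   # translated nucleotide query, protein database
        prot:nucl) echo tblastn ;;  # protein query, translated nucleotide database
        *) echo "unknown combination: $1 $2" >&2; return 1 ;;
    esac
}
```

For example, blast_program nucl prot prints blastx.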

The basic usage of blastn is:

$ blastn -query input_reads.fasta -db input_reads_db -out blastn_output.alignments [options]
where input_reads.fasta is an input file of sequence data in FASTA format, input_reads_db is the generated BLAST database, and blastn_output.alignments is the output file where the alignments are stored.

Additional parameters can be found in the BLAST manual, or by typing:

$ blastn -help

These BLAST alignment commands are multi-threaded, so setting the BLAST option -num_threads is recommended.
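
One way to keep -num_threads in step with the CPUs requested from SLURM is to read the value from the job environment. The sketch below assumes the job was submitted with --ntasks-per-node, which is when SLURM normally sets SLURM_NTASKS_PER_NODE; the function name blast_threads is hypothetical, and the fallback of 1 thread is an assumption for runs outside a job.

```shell
#!/bin/bash
# Hypothetical helper: choose a value for -num_threads.
# Uses SLURM_NTASKS_PER_NODE when it is set, otherwise falls back to 1 thread.
blast_threads() {
    echo "${SLURM_NTASKS_PER_NODE:-1}"
}

# Example use inside a job script (not executed here):
# blastn -query input_reads.fasta -db input_reads_db \
#        -out blastn_output.alignments -num_threads "$(blast_threads)"
```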

HCC hosts multiple BLAST databases and indices on Swan. To use these resources, load the “biodata” module first. The $BLAST variable points to the directory that holds the following currently available databases:

  • 16SMicrobial
  • nr
  • nt
  • refseq_genomic
  • refseq_rna
  • swissprot

If you want to create and use a BLAST database that is not mentioned above, check Create Local BLAST Database. If you want a database to be added to the “biodata” module, please send a request to bcrf-support@unl.edu.

To access the older format of BLAST databases that work with BLAST+ 2.9 and lower, please use the variable BLAST_V4. The variable BLAST points to the directory with the new version 5 of the nucleotide and protein databases required for BLAST+ 2.10 and higher.
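
The version rule above can be written as a small check. This is a sketch only: the function name blast_db_var is hypothetical, and it assumes the version string has the form MAJOR.MINOR, optionally followed by further parts (for example 2.9.0 or 2.10.1+).

```shell
#!/bin/bash
# Hypothetical helper: print the name of the variable that matches a BLAST+ version.
# BLAST+ 2.10 and higher -> BLAST (version 5 databases)
# BLAST+ 2.9 and lower   -> BLAST_V4 (older database format)
blast_db_var() {
    local version="$1"
    local major="${version%%.*}"   # text before the first dot
    local rest="${version#*.}"     # text after the first dot
    local minor="${rest%%.*}"      # second component only
    if [ "$major" -gt 2 ] || { [ "$major" -eq 2 ] && [ "$minor" -ge 10 ]; }; then
        echo "BLAST"
    else
        echo "BLAST_V4"
    fi
}
```

In bash, the directory itself can then be read with indirect expansion, for example var=$(blast_db_var 2.9.0) followed by echo "${!var}", once the “biodata” module is loaded.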

A basic SLURM example of a nucleotide BLAST run against the non-redundant nt BLAST database with 8 CPUs is provided below. When running a BLAST alignment, it is recommended to first copy the query and database files to the /scratch/ directory of the worker node. The BLAST output is also written to this directory (/scratch/blastn_output.alignments). After BLAST finishes, the output file is copied from the worker node to your current work directory.

This example will first copy the database and your input file to faster local storage called “scratch”, assuming that the input file exists in your current directory. This can greatly improve performance!

blastn_alignment.submit
#!/bin/bash
#SBATCH --job-name=BlastN
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=168:00:00
#SBATCH --mem=20gb
#SBATCH --output=BlastN.%J.out
#SBATCH --error=BlastN.%J.err

module load blast/2.10
module load biodata/1.0

# Be sure to use a directory under $WORK for your job
cp $BLAST/nt.* /scratch/
cp input_reads.fasta /scratch/

blastn -query /scratch/input_reads.fasta -db /scratch/nt -out /scratch/blastn_output.alignments -num_threads $SLURM_NTASKS_PER_NODE

cp /scratch/blastn_output.alignments .
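
The script above assumes that input_reads.fasta exists in the current directory. The following is a minimal sketch of a check that could run before the cp steps, so the job stops early instead of spending time copying the database. The function name check_fasta is hypothetical, and the test is only a rough one: the file must be non-empty and start with the '>' character of a FASTA header.

```shell
#!/bin/bash
# Hypothetical helper: rough sanity check of a FASTA query file.
# Returns 0 if the file is non-empty and begins with '>', otherwise 1.
check_fasta() {
    local file="$1"
    if [ ! -s "$file" ]; then
        echo "missing or empty file: $file" >&2
        return 1
    fi
    case "$(head -c 1 "$file")" in
        ">") return 0 ;;
        *) echo "no leading '>' in: $file" >&2; return 1 ;;
    esac
}

# Example use near the top of the job script (not executed here):
# check_fasta input_reads.fasta || exit 1
```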

One important BLAST parameter is the e-value threshold (-evalue), which limits the reported hits to those with an e-value lower than the given cutoff. To report only hits with an e-value lower than 1e-10, add the -evalue option to the blastn command, for example:

$ blastn -query input_reads.fasta -db input_reads_db -out blastn_output.alignments -num_threads $SLURM_NTASKS_PER_NODE -evalue 1e-10

The default BLAST output is in pairwise format. However, the BLAST parameter -outfmt supports other output formats that are easier to parse.
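
When hits have already been written in tabular format (-outfmt 6), a stricter e-value cutoff can also be applied afterwards without rerunning BLAST. The sketch below assumes the default -outfmt 6 column order, in which the e-value is column 11, and a tab-separated file; the function name filter_by_evalue is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: keep only tabular hits with an e-value below a cutoff.
# Arguments: $1 = cutoff (for example 1e-10), $2 = file in default -outfmt 6 layout.
# Assumes column 11 holds the e-value (default column order).
filter_by_evalue() {
    local cutoff="$1"
    local file="$2"
    # Adding 0 forces a numeric comparison, so values such as 1e-50 are read as numbers.
    awk -F'\t' -v cutoff="$cutoff" '($11 + 0) < (cutoff + 0)' "$file"
}

# Example use (not executed here):
# filter_by_evalue 1e-10 blastx_output.alignments > filtered.alignments
```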

A basic SLURM example of a blastx run (translated nucleotide query against the non-redundant nr protein BLAST database) with tabular output format (-outfmt 6) and 8 CPUs is shown below. As before, the query and database files are copied to the /scratch/ directory. The BLAST output is also written to this directory (/scratch/blastx_output.alignments). After BLAST finishes, the output file is copied from the worker node to your current work directory.

blastx_alignment.submit
#!/bin/bash
#SBATCH --job-name=BlastX
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=168:00:00
#SBATCH --mem=20gb
#SBATCH --output=BlastX.%J.out
#SBATCH --error=BlastX.%J.err

module load blast/2.10
module load biodata/1.0

# Be sure to use a directory under $WORK for your job
cp $BLAST/nr.* /scratch/
cp input_reads.fasta /scratch/

blastx -query /scratch/input_reads.fasta -db /scratch/nr -outfmt 6 -out /scratch/blastx_output.alignments -num_threads $SLURM_NTASKS_PER_NODE

cp /scratch/blastx_output.alignments .
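
The tabular file produced by -outfmt 6 is easy to post-process with standard tools. As a sketch, the function below keeps the highest-scoring hit for each query. It assumes the default -outfmt 6 column order (query ID in column 1, bit score in column 12) and a tab-separated file; the function name best_hits is hypothetical, and ties are resolved by keeping the first hit seen.

```shell
#!/bin/bash
# Hypothetical helper: print one line per query, the hit with the highest bit score.
# Argument: $1 = file in default -outfmt 6 layout (column 1 = query, column 12 = bit score).
best_hits() {
    awk -F'\t' '
        # Store the line if the query is new or this hit scores higher than the stored one.
        !($1 in best) || ($12 + 0) > score[$1] {
            best[$1] = $0
            score[$1] = $12 + 0
        }
        END { for (q in best) print best[q] }
    ' "$1" | sort
}

# Example use (not executed here):
# best_hits blastx_output.alignments > blastx_best_hits.tsv
```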