Running BLAST Alignment

Basic BLAST has the following commands:

  • blastn: search nucleotide database using a nucleotide query
  • blastp: search protein database using a protein query
  • blastx: search protein database using a translated nucleotide query
  • tblastn: search translated nucleotide database using a protein query
  • tblastx: search translated nucleotide database using a translated nucleotide query
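
The choice among these commands depends only on the type of the query and the type of the database. As a rough sketch, the mapping can be written as a small shell function. The name pick_blast_program and its arguments are hypothetical and are not part of BLAST:

```shell
#!/bin/sh
# Hypothetical helper (not part of BLAST): map the query type and the
# database type to the matching BLAST command from the list above.
# Usage: pick_blast_program <query_type> <db_type> [translated]
#   query_type, db_type: "nucl" or "prot"
#   translated: pass "translated" to compare two nucleotide inputs
#               at the protein level (tblastx)
pick_blast_program() {
    case "$1:$2:${3:-direct}" in
        nucl:nucl:direct)     echo "blastn" ;;
        nucl:nucl:translated) echo "tblastx" ;;
        prot:prot:direct)     echo "blastp" ;;
        nucl:prot:direct)     echo "blastx" ;;
        prot:nucl:direct)     echo "tblastn" ;;
        *)
            echo "unsupported combination: $1 query, $2 database" >&2
            return 1
            ;;
    esac
}

# A nucleotide query against a protein database
pick_blast_program nucl prot    # prints: blastx
```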

The basic usage of blastn is:

$ blastn -query input_reads.fasta -db input_reads_db -out blastn_output.alignments [options]
where input_reads.fasta is an input file of sequence data in FASTA format, input_reads_db is the generated BLAST database, and blastn_output.alignments is the output file where the alignments are stored.

Additional parameters can be found in the BLAST manual, or by typing:

$ blastn -help

These BLAST alignment commands are multi-threaded, so setting the BLAST option -num_threads to the number of CPUs requested for the job is recommended.
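
One way to keep the thread count in sync with the job allocation is to read it from the SLURM environment. The sketch below assumes the job was submitted with --ntasks-per-node, which is what makes SLURM set SLURM_NTASKS_PER_NODE. The function name blast_threads is hypothetical, and the command is only printed, not executed:

```shell
#!/bin/sh
# Hypothetical helper: use the task count provided by SLURM inside a job,
# and fall back to a single thread when the variable is not set
# (for example, when testing outside of a job).
blast_threads() {
    echo "${SLURM_NTASKS_PER_NODE:-1}"
}

# Print the blastn command that would be run with this thread count
echo blastn -query input_reads.fasta -db input_reads_db \
    -out blastn_output.alignments -num_threads "$(blast_threads)"
```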

HCC hosts multiple BLAST databases and indices on Crane. To use these resources, the “biodata” module needs to be loaded first. The $BLAST variable then points to the directory that contains the following currently available databases:

  • 16SMicrobial
  • env_nt
  • est
  • est_human
  • est_mouse
  • est_others
  • gss
  • human_genomic
  • human_genomic_transcript
  • mouse_genomic_transcript
  • nr
  • nt
  • other_genomic
  • refseq_genomic
  • refseq_rna
  • sts
  • swissprot
  • tsa_nr
  • tsa_nt
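
Before copying a database, it can be useful to confirm that its files are present under $BLAST. The following is a minimal sketch that assumes each database is stored as files named <database>.* in that directory. The function name has_blast_db is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: succeed if at least one file named <db>.* exists
# in the given directory.
# Usage: has_blast_db <directory> <db>
has_blast_db() {
    dir=$1
    db=$2
    for f in "$dir"/"$db".*; do
        # If nothing matches, the pattern stays literal and -e fails
        [ -e "$f" ] && return 0
    done
    return 1
}

# $BLAST is only set after the biodata module has been loaded
if has_blast_db "${BLAST:-/nonexistent}" nt; then
    echo "nt database files found"
else
    echo "nt database files not found (is the biodata module loaded?)"
fi
```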

If you want to create and use a BLAST database that is not mentioned above, check Create Local BLAST Database.

A basic SLURM example of a nucleotide BLAST run against the non-redundant nt BLAST database using 8 CPUs is provided below. When running a BLAST alignment, it is recommended to first copy the query and database files to the /scratch/ directory of the worker node. The BLAST output is also written to this directory (/scratch/blastn_output.alignments). After BLAST finishes, the output file is copied from the worker node to your work directory.

Please note that the worker nodes cannot write to the /home/ directories, so you need to run your job from your /work/ directory. This example first copies the database to faster local storage called “scratch”, which can greatly improve performance.

blastn_alignment.submit
#!/bin/sh
#SBATCH --job-name=BlastN
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=168:00:00
#SBATCH --mem=20gb
#SBATCH --output=BlastN.%J.out
#SBATCH --error=BlastN.%J.err

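# Load BLAST and the biodata module that provides the $BLAST database path
module load blast/2.7
module load biodata/1.0

# Copy the nt database and the query file to local scratch on the worker node
cd $WORK/<project_folder>
cp $BLAST/nt.* /scratch/
cp input_reads.fasta /scratch/

# Run blastn with one thread per requested task
blastn -query /scratch/input_reads.fasta -db /scratch/nt -out /scratch/blastn_output.alignments -num_threads $SLURM_NTASKS_PER_NODE

# Copy the alignments back to the work directory
cp /scratch/blastn_output.alignments $WORK/<project_folder>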

One important BLAST parameter is the e-value threshold, which limits the hits returned to those with an e-value lower than the given value. To show only the hits with an e-value lower than 1e-10, add the -evalue option to the blastn command as follows:

$ blastn -query input_reads.fasta -db input_reads_db -out blastn_output.alignments -num_threads $SLURM_NTASKS_PER_NODE -evalue 1e-10

The default BLAST output is in pairwise format. However, the BLAST option -outfmt selects other output formats that are easier to parse, such as the tabular format (-outfmt 6).
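
To illustrate why the tabular format is easier to parse, the sketch below filters tabular hits by e-value with awk. It assumes the default -outfmt 6 column order, in which the e-value is column 11. The function names and the two sample rows are made up for illustration only and are not real BLAST results:

```shell
#!/bin/sh
# Default "-outfmt 6" columns (tab-separated):
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore

# Hypothetical helper: keep only hits whose e-value (column 11)
# is at most the given cutoff.
# Usage: filter_by_evalue <cutoff> < hits.tsv
filter_by_evalue() {
    awk -F '\t' -v max="$1" '($11 + 0) <= (max + 0)'
}

# Made-up sample rows, for illustration only
sample_hits() {
    printf 'read1\tsubjA\t98.5\t200\t3\t0\t1\t200\t501\t700\t2e-50\t350\n'
    printf 'read2\tsubjB\t85.0\t120\t18\t0\t5\t124\t40\t159\t3e-05\t95.1\n'
}

# Show query, subject, and e-value of the hits that pass the cutoff;
# only the read1 row has an e-value below 1e-10
sample_hits | filter_by_evalue 1e-10 | cut -f 1,2,11
```

On real output, the same filter could be applied to a results file, for example: filter_by_evalue 1e-10 < blastx_output.alignments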

A basic SLURM example of a translated nucleotide BLAST (blastx) run against the non-redundant nr protein database with tabular output format and 8 CPUs is shown below. As before, the query and database files are copied to the /scratch/ directory. The BLAST output is also written to this directory (/scratch/blastx_output.alignments). After BLAST finishes, the output file is copied from the worker node to your work directory.

Please note that the worker nodes cannot write to the /home/ directories, so you need to run your job from your /work/ directory. This example first copies the database to faster local storage called “scratch”, which can greatly improve performance.

blastx_alignment.submit
#!/bin/sh
#SBATCH --job-name=BlastX
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=168:00:00
#SBATCH --mem=20gb
#SBATCH --output=BlastX.%J.out
#SBATCH --error=BlastX.%J.err

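# Load BLAST and the biodata module that provides the $BLAST database path
module load blast/2.7
module load biodata/1.0

# Copy the nr database and the query file to local scratch on the worker node
cd $WORK/<project_folder>
cp $BLAST/nr.* /scratch/
cp input_reads.fasta /scratch/

# Run blastx with tabular output and one thread per requested task
blastx -query /scratch/input_reads.fasta -db /scratch/nr -outfmt 6 -out /scratch/blastx_output.alignments -num_threads $SLURM_NTASKS_PER_NODE

# Copy the alignments back to the work directory
cp /scratch/blastx_output.alignments $WORK/<project_folder>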