Running BWA Commands

BWA Index

The first step in using BWA is to build an index of the reference genome, which must be in fasta format. The basic usage of bwa index is:

$ bwa index [-a bwtsw|is] -p index_prefix input_reference.fasta

where input_reference.fasta is the reference genome in fasta format, and index_prefix is the prefix of the generated index files, set with the option -p. The option -a selects the index construction algorithm and can take two values: bwtsw (does not work for short genomes) and is (does not work for long genomes), so the value is chosen according to the length of the genome. In recent BWA 0.7 releases -a is optional; when it is omitted, bwa index picks the algorithm automatically.
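
The choice of -a can also be scripted. The following is a minimal sketch, not BWA's own selection logic: the helper name choose_bwa_algo, the 2 GB cutoff, and the use of the fasta file size as a stand-in for genome length are all assumptions made for illustration. The script only prints the resulting bwa index command.

```shell
#!/bin/bash
# Pick a value for "bwa index -a" from the size of the reference.
# ASSUMPTION: the 2 GiB cutoff and using the file size as a proxy for
# genome length are illustrative; check the BWA manual for your version.

choose_bwa_algo() {
    local size_bytes=$1
    local cutoff=$((2 * 1024 * 1024 * 1024))   # hypothetical threshold
    if [ "$size_bytes" -lt "$cutoff" ]; then
        echo "is"
    else
        echo "bwtsw"
    fi
}

reference=input_reference.fasta
if [ -f "$reference" ]; then
    # Arithmetic expansion strips any padding that wc may add.
    size=$(( $(wc -c < "$reference") ))
    echo "bwa index -a $(choose_bwa_algo "$size") -p index_prefix $reference"
fi
```

Printing the command first makes it easy to check the chosen algorithm before submitting a long indexing job.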

BWA Mem

The bwa mem algorithm is one of the three alignment algorithms provided by BWA. It performs local alignment and can produce multiple alignments for different parts of a query sequence. The basic usage of bwa mem is:

$ bwa mem [options] index_prefix [input_reads.fastq|input_reads_pair_1.fastq input_reads_pair_2.fastq]

where index_prefix is the index of the reference genome generated by bwa index, input_reads.fastq is the input file for single-end sequencing data, and input_reads_pair_1.fastq and input_reads_pair_2.fastq are the input files for paired-end sequencing data. Additional options for bwa mem can be found in the BWA manual.

A simple SLURM script for running bwa mem on Swan with paired-end fastq input data, index_prefix as the reference genome index, a SAM output file, and 8 CPUs is shown below:

#!/bin/bash
#SBATCH --job-name=Bwa_Mem
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=168:00:00
#SBATCH --mem=10gb
#SBATCH --output=BwaMem.%J.out
#SBATCH --error=BwaMem.%J.err

module load bwa/0.7

bwa mem -t $SLURM_NTASKS_PER_NODE index_prefix input_reads_pair_1.fastq input_reads_pair_2.fastq > bwa_mem_alignments.sam
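
SLURM sets SLURM_NTASKS_PER_NODE only when --ntasks-per-node is requested, so the variable is empty when the same command line is tried outside a job. The sketch below shows one possible way to guard against that; the helper name bwa_threads and the fallback value of 1 are assumptions, not part of BWA or SLURM.

```shell
#!/bin/bash
# Resolve the thread count for "bwa mem -t".
# ASSUMPTION: falling back to a single thread is an arbitrary choice for
# runs outside a SLURM job, where SLURM_NTASKS_PER_NODE is not set.
bwa_threads() {
    echo "${SLURM_NTASKS_PER_NODE:-1}"
}

# Print (rather than run) the command so the line can be checked first.
echo "bwa mem -t $(bwa_threads) index_prefix input_reads_pair_1.fastq input_reads_pair_2.fastq > bwa_mem_alignments.sam"
```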

BWA Bwasw

The bwa bwasw algorithm is another alignment algorithm provided by BWA. For an input file with single-end reads, it aligns the query sequences to the reference. For input files with paired-end reads, it performs paired-end alignment, which only works for Illumina reads.

An example of bwa bwasw for the single-end input file input_reads.fasta in fasta format, with the alignments stored in the output file bwa_bwasw_alignments.sam, is shown below:

$ bwa bwasw -t $SLURM_NTASKS_PER_NODE index_prefix input_reads.fasta > bwa_bwasw_alignments.sam

BWA Aln

The third BWA algorithm, bwa aln, aligns the input sequence data to the reference genome and writes the result to a .sai file. An example of running bwa aln with the single-end input file input_reads.fasta and 8 CPUs is shown below:

$ bwa aln -t $SLURM_NTASKS_PER_NODE index_prefix input_reads.fasta > bwa_aln_alignments.sai

BWA Samse and BWA Sampe

The command bwa samse uses the bwa_aln_alignments.sai output from bwa aln to generate a SAM file from the alignments for single-end reads.

$ bwa samse -f bwa_aln_alignments.sam index_prefix bwa_aln_alignments.sai input_reads.fasta

The command bwa sampe uses the two .sai outputs from bwa aln, one per read file of the pair, to generate a SAM file from the alignments for paired-end reads.

$ bwa sampe -f bwa_aln_alignments.sam index_prefix bwa_aln_alignments_pair_1.sai bwa_aln_alignments_pair_2.sai input_reads_pair_1.fasta input_reads_pair_2.fasta
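
For paired-end data the two steps are usually chained: bwa aln is run once per read file, and bwa sampe then combines the two .sai files. The following is a rough sketch of that chain. The helpers run and paired_end_aln, the DRY_RUN switch, and the way the .sai names are derived from the read file names are all hypothetical choices for this example; with DRY_RUN=1 (the default here) the commands are only printed.

```shell
#!/bin/bash
# Paired-end workflow: one "bwa aln" per read file, then "bwa sampe".
# ASSUMPTION: DRY_RUN, run and paired_end_aln are illustrative names,
# not part of BWA. With DRY_RUN=1 nothing is executed.
DRY_RUN=${DRY_RUN:-1}
threads=${SLURM_NTASKS_PER_NODE:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"      # show the command only
    else
        "$@"           # execute the command
    fi
}

paired_end_aln() {
    local prefix=$1
    local reads1=$2
    local reads2=$3
    local out_sam=$4
    # Derive the .sai names by replacing the read file extension.
    local sai1="${reads1%.*}.sai"
    local sai2="${reads2%.*}.sai"
    run bwa aln -t "$threads" -f "$sai1" "$prefix" "$reads1"
    run bwa aln -t "$threads" -f "$sai2" "$prefix" "$reads2"
    run bwa sampe -f "$out_sam" "$prefix" "$sai1" "$sai2" "$reads1" "$reads2"
}

paired_end_aln index_prefix input_reads_pair_1.fasta input_reads_pair_2.fasta bwa_aln_alignments.sam
```

The option -f of bwa aln is used instead of shell redirection so that each step stays a single command that the run wrapper can either print or execute.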

BWA Fastmap

The command bwa fastmap identifies and outputs super-maximal exact matches (SMEMs). The basic usage of bwa fastmap is:

$ bwa fastmap index_prefix input_reads.fasta > bwa_fastmap.matches

BWA Pemerge

The command bwa pemerge merges overlapping paired-end reads and can output either only the merged reads (-m) or only the unmerged reads (-u). An example of bwa pemerge on input_reads_pair_1.fastq and input_reads_pair_2.fastq with 8 CPUs, where the output file output_reads_merged.fastq contains only the merged reads, is shown below:

$ bwa pemerge -m -t $SLURM_NTASKS_PER_NODE input_reads_pair_1.fastq input_reads_pair_2.fastq > output_reads_merged.fastq

BWA Fa2pac

The command bwa fa2pac converts fasta to pac files. The general usage of bwa fa2pac is:

$ bwa fa2pac input_reads.fasta pac_prefix

BWA Pac2bwt and BWA Pac2bwtgen

The commands bwa pac2bwt and bwa pac2bwtgen convert pac to bwt files.

$ bwa pac2bwt input_reads.pac output_reads.bwt

$ bwa pac2bwtgen input_reads.pac output_reads.bwt

BWA Bwtupdate

The command bwa bwtupdate updates bwt files to the new format. The general usage of bwa bwtupdate is:

$ bwa bwtupdate input_reads.bwt

BWA Bwt2sa

The command bwa bwt2sa generates sa files from bwt and Occ files. The basic usage of bwa bwt2sa is:

$ bwa bwt2sa input_reads.bwt output_reads.sa

Useful Information

To test the scalability of BWA (bwa/0.7) on Swan, we used two paired-end input fastq files, large_1.fastq and large_2.fastq, and one single-end input fasta file, large.fasta. Statistics about the input files and the time and memory used by bwa mem with 4, 8, and 16 CPUs are shown in the table below:

input file	total # of sequences	total size (MB)	time, 4 CPUs	memory, 4 CPUs	time, 8 CPUs	memory, 8 CPUs	time, 16 CPUs	memory, 16 CPUs
large_1.fastq	10,174,715	3,376	~ 35 minutes	~ 12 GB	~ 18.5 minutes	~ 18 GB	~ 10 minutes	~ 19 GB
large_2.fastq	10,174,715	3,376	~ 35 minutes	~ 12 GB	~ 18.5 minutes	~ 18 GB	~ 10 minutes	~ 19 GB
large.fasta	592,593	836	~ 5.5 minutes	~ 3 GB	~ 3 minutes	~ 4 GB	~ 2 minutes	~ 6.2 GB
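
One way to read the timings in the table above is as speedup and parallel efficiency relative to the 4-CPU run. The sketch below is only illustrative: the helper name scaling is hypothetical, and the inputs are the approximate timings from the table, so the results are rough estimates.

```shell
#!/bin/bash
# Speedup and parallel efficiency relative to a baseline run, computed
# from approximate bwa mem timings given in minutes.
# ASSUMPTION: "scaling" is an illustrative helper, not a BWA or SLURM tool.
scaling() {
    # Arguments: baseline CPUs, baseline minutes, CPUs, minutes
    awk -v c0="$1" -v t0="$2" -v c="$3" -v t="$4" 'BEGIN {
        speedup = t0 / t
        efficiency = 100 * speedup / (c / c0)
        printf "%d CPUs: speedup %.2f, efficiency %.1f%%\n", c, speedup, efficiency
    }'
}

scaling 4 35 8 18.5    # paired-end fastq input, 8 CPUs
scaling 4 35 16 10     # paired-end fastq input, 16 CPUs
scaling 4 5.5 8 3      # single-end fasta input, 8 CPUs
scaling 4 5.5 16 2     # single-end fasta input, 16 CPUs
```

For the paired-end input, for example, going from 4 to 16 CPUs gives a speedup of 35 / 10 = 3.5, which is 87.5% of the ideal fourfold speedup.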