Scientific Supercomputing at the NIH

Burrows-Wheeler Alignment (BWA) Tool on Helix

BWA is a fast, lightweight tool that aligns short sequences to a sequence database, such as the human reference genome. By default, BWA finds an alignment within edit distance 2 of the query sequence, except that it disallows gaps close to the end of the query. It can also be tuned to find a fraction of longer gaps at the cost of speed and of more false alignments.

BWA excels in its speed. Mapping 2 million high-quality 35bp short reads against the human genome can be done in 20 minutes. Usually such speed is gained at the cost of huge memory use, disallowing gaps, and/or hard limits on the maximum read length and the maximum number of mismatches. BWA makes none of these trade-offs: it is still relatively lightweight (2.3GB of memory for human alignment), performs gapped alignment, and does not set a hard limit on read length or maximum mismatches.

Given a database file in FASTA format, BWA first builds a BWT index with the 'index' command. The alignments in suffix array (SA) coordinates are then generated with the 'aln' command. The resulting file contains ALL the alignments found by BWA. The 'samse' (single-end) or 'sampe' (paired-end) command then converts SA coordinates to chromosomal coordinates. For single-end reads, most of the computing time is spent on finding the SA coordinates (the 'aln' command). For paired-end reads, half of the computing time may be spent on pairing (the 'sampe' command) given 32bp reads. Using longer reads would reduce the fraction of time spent on pairing because each end in a pair would be mapped to fewer places.
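The paired-end variant of the workflow described above can be sketched as a dry run that only prints the commands in order. This is a minimal illustration, not a tested pipeline: the file names ref.fa, reads_1.fq, reads_2.fq and reads.sam are hypothetical, and the sketch assumes that 'aln' and 'sampe' write their results to standard output, so each is redirected to a file.

```bash
#!/bin/bash
# Dry-run sketch: print (do not execute) the BWA commands for one paired-end sample.
# All input and output file names passed in below are hypothetical placeholders.
BWA=/usr/local/bwa/bwa

print_pe_commands() {
    local ref=$1 fq1=$2 fq2=$3 out=$4
    # 1. Build the BWT index of the reference (needed once per reference).
    echo "$BWA index -a bwtsw $ref"
    # 2. Find SA coordinates for each end of the pair separately.
    echo "$BWA aln $ref $fq1 > ${fq1%.*}.sai"
    echo "$BWA aln $ref $fq2 > ${fq2%.*}.sai"
    # 3. Pair the ends and convert SA coordinates to chromosomal coordinates.
    echo "$BWA sampe $ref ${fq1%.*}.sai ${fq2%.*}.sai $fq1 $fq2 > $out"
}

print_pe_commands ref.fa reads_1.fq reads_2.fq reads.sam
```

Printing the commands first makes it easy to check paths before pasting them into a batch script.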

Programs Location

/usr/local/bwa/bwa

Version

Type '/usr/local/bwa/bwa' on the command line.

Sample Sessions On Biowulf

BWA sample files can be copied from:

/usr/local/bwa/sample

Submitting a single BWA batch job

1. Create a script file. The file will contain lines similar to those shown below the dotted line. Modify the paths before running.

.....................file /home/username/runBWA........................
#!/bin/tcsh
# This file is runBWA
#
#PBS -N BWA
#PBS -m be
#PBS -k oe
cd /home/user/bwa/run1
/usr/local/bwa/bwa index -a bwtsw tttF3.csfasta
/usr/local/bwa/bwa aln tttF3.csfasta ttt.fastq > ttt.sai
/usr/local/bwa/bwa samse tttF3.csfasta ttt.sai ttt.single.fastq > ttt.sam

2. Submit the script using the 'qsub' command, e.g.

qsub -l nodes=1:x86-64:m8192 /home/username/runBWA

Submitting a swarm of BWA jobs

Using the 'swarm' utility, one can submit many jobs to the cluster to run simultaneously.

Set up a swarm command file (e.g. /home/username/cmdfile). Here is a sample file:

cd /home/user/bwa/run1; /usr/local/bwa/bwa index -a bwtsw tttF3.csfasta; /usr/local/bwa/bwa aln tttF3.csfasta ttt.fastq > ttt.sai; /usr/local/bwa/bwa samse tttF3.csfasta ttt.sai ttt.single.fastq > ttt.sam
cd /home/user/bwa/run2; /usr/local/bwa/bwa index -a bwtsw tttF3.csfasta; /usr/local/bwa/bwa aln tttF3.csfasta ttt.fastq > ttt.sai; /usr/local/bwa/bwa samse tttF3.csfasta ttt.sai ttt.single.fastq > ttt.sam
cd /home/user/bwa/run3; /usr/local/bwa/bwa index -a bwtsw tttF3.csfasta; /usr/local/bwa/bwa aln tttF3.csfasta ttt.fastq > ttt.sai; /usr/local/bwa/bwa samse tttF3.csfasta ttt.sai ttt.single.fastq > ttt.sam

Each line of the command file above will be executed on one processor.
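When there are many run directories, the command file can be generated rather than typed by hand. The following is a minimal sketch, assuming every run directory holds input files with the same names as in the sample; the function name make_swarm_lines is hypothetical, and the output of 'aln' and 'samse' is redirected to files on the assumption that both write to standard output.

```bash
#!/bin/bash
# Sketch: print one swarm command line per run directory given as an argument.
BWA=/usr/local/bwa/bwa

make_swarm_lines() {
    local dir
    for dir in "$@"; do
        # Each printed line is one independent job: cd, index, align, convert.
        echo "cd $dir; $BWA index -a bwtsw tttF3.csfasta; $BWA aln tttF3.csfasta ttt.fastq > ttt.sai; $BWA samse tttF3.csfasta ttt.sai ttt.single.fastq > ttt.sam"
    done
}

make_swarm_lines /home/user/bwa/run1 /home/user/bwa/run2 /home/user/bwa/run3
```

Redirecting the output of this script to a file such as /home/username/cmdfile produces a command file with one job per line.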

Submit this swarm command file to the batch system with the command:

biowulf> % swarm -f cmdfile -l nodes=1:m8192:x86-64

The swarm utility will create the batch scripts and submit them to the batch system.

Note: if your jobs require more than 8 GB of memory each, request g72 instead of m8192 and set the '-n' value according to your memory requirement. For example, if each job requires 24 GB of memory:

biowulf> % swarm -f cmdfile -n 3 -l nodes=1:g72

g72 nodes have 72 GB of memory each.
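The '-n 3' in the example follows from dividing the node memory by the per-job memory. A minimal sketch of that arithmetic, assuming '-n' is the number of jobs swarm places on each node (as the example implies); the function name jobs_per_node is hypothetical:

```bash
#!/bin/bash
# Sketch: how many jobs fit on one node, given node and per-job memory in GB.
jobs_per_node() {
    local node_gb=$1 job_gb=$2
    # Integer division rounds down, so the jobs together never exceed node memory.
    echo $(( node_gb / job_gb ))
}

jobs_per_node 72 24   # prints 3
```

Rounding down is deliberate: a job needing 30 GB would fit only twice on a 72 GB node.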

Documentation

http://maq.sourceforge.net/bwa-man.shtml