Biowulf at the NIH
Burrows-Wheeler Alignment (BWA) Tool on Biowulf

Description

BWA is a fast, lightweight tool that aligns short sequences to a sequence database, such as the human reference genome. By default, BWA finds an alignment within edit distance 2 of the query sequence, except that it disallows gaps close to the ends of the query. It can also be tuned to find a fraction of longer gaps at the cost of speed and of more false alignments.

BWA excels in speed. Mapping 2 million high-quality 35bp short reads against the human genome can be done in 20 minutes. Aligners usually gain speed at the cost of high memory use, disallowed gaps, and/or hard limits on the maximum read length and the maximum number of mismatches. BWA does not make these trade-offs: it is still relatively lightweight (2.3GB of memory for human alignment), performs gapped alignment, and does not set a hard limit on read length or maximum mismatches.

Given a database file in FASTA format, BWA first builds a BWT index with the 'index' command. The alignments in suffix array (SA) coordinates are then generated with the 'aln' command. The resulting file contains ALL the alignments found by BWA. The 'samse' or 'sampe' command then converts SA coordinates to chromosomal coordinates. For single-end reads, most of the computing time is spent on finding the SA coordinates (the aln command). For paired-end reads, half of the computing time may be spent on pairing (the sampe command) given 32bp reads. Using longer reads would reduce the fraction of time spent on pairing because each end in a pair would be mapped to fewer places.
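
The index, aln and sampe steps described above can be sketched for a paired-end run. This is an illustrative dry run rather than a site-provided script: the function name bwa_pe_commands and all file names are hypothetical, and the function only prints the commands so they can be reviewed before being run.

```shell
#!/bin/bash
# Hypothetical helper: print the BWA paired-end workflow
# (index -> aln for each end -> sampe) without executing anything.
# Arguments: reference FASTA, read 1 FASTQ, read 2 FASTQ, output SAM.
bwa_pe_commands() {
    local ref=$1 r1=$2 r2=$3 out=$4
    # Step 1: build the BWT index for the reference.
    echo "bwa index -a bwtsw $ref"
    # Step 2: find SA coordinates for each end separately.
    echo "bwa aln $ref $r1 > ${r1%.*}.sai"
    echo "bwa aln $ref $r2 > ${r2%.*}.sai"
    # Step 3: pair the ends and convert to chromosomal coordinates.
    echo "bwa sampe $ref ${r1%.*}.sai ${r2%.*}.sai $r1 $r2 > $out"
}

bwa_pe_commands ref.fa reads_1.fastq reads_2.fastq aln.sam
```

For single-end reads, the two aln lines collapse into one and sampe is replaced by samse, as in the batch script further down this page.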

How to Use

Module

There are multiple versions of BWA available. An easy way of selecting the version is to use modules. To see the modules available, type

module avail bwa

To select a module, type

module load bwa/[ver]

where [ver] is the version of choice. This will set your $PATH variable.

Index Files

Pre-built BWA index files are available in

/fdb/igenomes/[organism]/[source]/[build]/Sequence/BWAIndex/genome.fa
  • [organism] is the specific organism of interest (Gallus_gallus, Rattus_norvegicus, etc.)
  • [source] is the source for the sequence (NCBI, Ensembl, UCSC)
  • [build] is the specific genome draft of interest (hg19, build37.2, GRCh37)
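
As a small illustration of how the three placeholders combine, the sketch below assembles the index prefix in a shell variable. The function name igenomes_bwa_index and the variable INDEX are hypothetical; whether a given organism/source/build combination exists under /fdb/igenomes has to be checked on the system.

```shell
#!/bin/bash
# Hypothetical helper: assemble the iGenomes BWA index prefix from
# its three components (organism, source, build).
igenomes_bwa_index() {
    local organism=$1 src=$2 build=$3
    echo "/fdb/igenomes/${organism}/${src}/${build}/Sequence/BWAIndex/genome.fa"
}

# Example: the UCSC hg19 human index.
INDEX=$(igenomes_bwa_index Homo_sapiens UCSC hg19)
echo "$INDEX"
```

The resulting prefix can then be given to bwa aln and bwa samse in place of a locally built index, which also makes the bwa index step unnecessary.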

Some users have noticed that newer versions of BWA do not work with index files built by previous versions in /fdb/bwa/indexes. Please use the index files under /fdb/igenomes described above instead.

Sample Sessions On Biowulf

BWA sample files can be copied from:

/usr/local/bwa/sample

Memory Requirements

As a rule of thumb, assume that bwa will require at least as much memory as the size of the .bwt index file. For example, the hg19 bwa index file is 2.9 GB:

$ ls -lLh /fdb/igenomes/Homo_sapiens/UCSC/hg19/Sequence/BWAIndex/genome.fa.bwt
-rwxr-xr-x 1 maoj staff 2.9G Mar 15  2012 /fdb/igenomes/Homo_sapiens/UCSC/hg19/Sequence/BWAIndex/genome.fa.bwt

If the UCSC/hg19 BWA index file is used, the bwa process will need at least 3 GB of memory.
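
The rule of thumb can be turned into a quick estimate by rounding the size of the .bwt file up to a whole number of gigabytes. The sketch below is only an illustration: the function names ceil_gb and bwt_mem_gb are hypothetical, and the result is a lower bound, not a measured memory footprint.

```shell
#!/bin/bash
# Hypothetical helper: round a byte count up to whole GiB.
ceil_gb() {
    local bytes=$1 gib=$((1024 * 1024 * 1024))
    echo $(( (bytes + gib - 1) / gib ))
}

# Hypothetical helper: suggest a minimum memory request (in GB) for a
# given .bwt file, following the rule of thumb above.
bwt_mem_gb() {
    local bwt=$1
    # Reading the file through a redirect follows symbolic links,
    # much like the -L option of ls.
    ceil_gb "$(wc -c < "$bwt")"
}

ceil_gb 3113851289   # about 2.9 GiB, prints 3
```

On Biowulf the second function could be called as bwt_mem_gb followed by the path to genome.fa.bwt shown in the listing above.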

Multithreading

BWA is a multithreaded application. That is, the bwa command can distribute its work across multiple CPUs on a single node. The number of threads BWA will use is controlled by the -t option. The total number of threads allocated to multiple bwa processes on the same node should not exceed the total number of CPUs on the node.
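
When several bwa processes share a node, one simple way to respect this limit is to divide the CPU count evenly among them. The sketch below assumes that policy; the function name threads_per_process is hypothetical, and the CPU count has to be supplied by the user for the node type requested.

```shell
#!/bin/bash
# Hypothetical helper: split a node's CPUs evenly among several bwa
# processes so the combined -t values do not exceed the CPU count.
# Arguments: CPUs on the node, number of bwa processes on that node.
threads_per_process() {
    local ncpus=$1 nprocs=$2
    local t=$(( ncpus / nprocs ))
    # bwa needs at least one thread; with more processes than CPUs the
    # node is oversubscribed and fewer processes should be run instead.
    if (( t < 1 )); then
        t=1
    fi
    echo "$t"
}

threads_per_process 16 4   # prints 4
```

The printed value is what would be passed to bwa aln with -t for each of the processes.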

Submitting a single BWA batch job

1. Create a script file. The file will contain lines similar to those below.

2. Make sure you use an appropriate number of threads (-t) for bwa processes. For example, g72 nodes have 16 CPUs, while g4 nodes have 2 CPUs. In the example below, bwa is directed to use four threads:

#!/bin/bash
# This file is runBWA
#
#PBS -N BWA
#PBS -m be
#PBS -k oe

cd /home/user/bwa/run1
module load bwa
bwa index -a bwtsw tttF3.csfasta
bwa aln -t 4 tttF3.csfasta ttt.fastq > ttt.sai
bwa samse tttF3.csfasta ttt.sai ttt.fastq > ttt.sam

3. Submit the script using the 'qsub' command, e.g.

qsub -l nodes=1:g8 /home/username/runBWA

Here a g8 node was requested, so -t 4 was used in the script because g8 nodes have 4 CPUs. Run freen on Biowulf to check the number of cores on each node type.

Submitting a swarm of BWA jobs

Using the 'swarm' utility, one can submit many jobs to the cluster to run concurrently.

Set up a swarm command file (e.g. /home/username/cmdfile). Here is a sample file:

cd /home/user/bwa/run1 ; bwa index -a bwtsw tttF3.csfasta ; \
  bwa aln -t 4 tttF3.csfasta ttt.fastq > ttt.sai ; \
  bwa samse tttF3.csfasta ttt.sai ttt.fastq > ttt.sam
cd /home/user/bwa/run2 ; bwa index -a bwtsw tttF3.csfasta ; \
  bwa aln -t 4 tttF3.csfasta ttt.fastq > ttt.sai ; \
  bwa samse tttF3.csfasta ttt.sai ttt.fastq > ttt.sam
...
cd /home/user/bwa/run15 ; bwa index -a bwtsw tttF3.csfasta ; \
  bwa aln -t 4 tttF3.csfasta ttt.fastq > ttt.sai ; \
  bwa samse tttF3.csfasta ttt.sai ttt.fastq > ttt.sam
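
Because the lines of such a file differ only in the run directory, the file can be generated with a loop instead of being typed by hand. The sketch below is one way to do that; the function name make_swarm_cmdfile is hypothetical, the directory layout and file names are taken from the sample above, and each run is written as a single line rather than with backslash continuations.

```shell
#!/bin/bash
# Hypothetical helper: write one swarm command line per run directory.
# Arguments: base directory, number of runs, threads per bwa process.
make_swarm_cmdfile() {
    local base=$1 nruns=$2 threads=$3 i
    for (( i = 1; i <= nruns; i++ )); do
        echo "cd $base/run$i ; bwa index -a bwtsw tttF3.csfasta ; bwa aln -t $threads tttF3.csfasta ttt.fastq > ttt.sai ; bwa samse tttF3.csfasta ttt.sai ttt.fastq > ttt.sam"
    done
}

# Generate the 15 runs of the sample file with four threads each.
make_swarm_cmdfile /home/user/bwa 15 4 > cmdfile
```

The thread count given here should be the same value that is passed to swarm with its -t option.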

The -f option is required for swarm. Because bwa is multithreaded, the -t option is used to direct swarm to allocate multiple CPUs per bwa process. Also, because full-genome alignments with bwa require substantial memory, the -g option can be used to tell swarm how many GB of memory to allocate per bwa process.

By default, swarm will execute each line on one CPU, using 1 GB of memory. In the above case, bwa requires four threads, so the swarm command line should be:

swarm -f cmdfile -t 4 --module bwa

If a larger BWA index file is used, for example hg19, then the amount of memory per bwa process must be increased using the -g option:

swarm -f cmdfile -t 4 -g 3 --module bwa

For more information regarding running swarm, see swarm.html

Documentation

To see a full listing of the options available for bwa, type bwa at the prompt.

$ bwa

Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.6.2-r126
Contact: Heng Li <lh3@sanger.ac.uk>

Usage:   bwa <command> [options]

Command: index         index sequences in the FASTA format
         aln           gapped/ungapped alignment
         samse         generate alignment (single ended)
         sampe         generate alignment (paired ended)
         bwasw         BWA-SW for long queries
         fastmap       identify super-maximal exact matches

         fa2pac        convert FASTA to PAC format
         pac2bwt       generate BWT from PAC
         pac2bwtgen    alternative algorithm for generating BWT
         bwtupdate     update .bwt to the new format
         bwt2sa        generate SA from BWT and Occ
         pac2cspac     convert PAC to color-space PAC
         stdsw         standard SW/NW alignment

http://maq.sourceforge.net/bwa-man.shtml