Bowtie is an ultrafast, memory-efficient short read aligner. It aligns short DNA sequences (reads) to the human genome at a rate of over 25 million 35-bp reads per hour. Bowtie indexes the genome with a Burrows-Wheeler index to keep its memory footprint small: typically about 2.2 GB for the human genome (2.9 GB for paired-end).
Bowtie was developed by Langmead et al. at the University of Maryland. [Bowtie reference]
Bowtie is multi-threaded, so a single Bowtie job can use all the cores on a node. It scales well: running on 8 cores takes roughly 1/8 the time of running on 1 core.
The environment variable(s) need to be set properly first. The easiest way to do this is by using the modules commands as in the example below.
[user@biowulf]$ module avail bowtie
------------- /usr/local/Modules/3.2.9/modulefiles -----------------------
bowtie/0.12.8 bowtie/2-2.0.0-beta7 bowtie/2-2.0.2(default)

[user@biowulf]$ module load bowtie
[user@biowulf]$ module list
Currently Loaded Modulefiles:
  1) bowtie/2-2.0.2

[user@biowulf]$ module unload bowtie
[user@biowulf]$ module load bowtie/0.12.8
[user@biowulf]$ module list
Currently Loaded Modulefiles:
  1) bowtie/0.12.8

[user@biowulf]$ module show bowtie
-------------------------------------------------------------------
/usr/local/Modules/3.2.9/modulefiles/bowtie/2-2.0.2:

module-whatis Sets up bowtie2-2.0.2
prepend-path PATH /usr/local/apps/bowtie/2-2.0.2
-------------------------------------------------------------------
iGenomes is available on helix/biowulf in /fdb/igenomes.
Illumina has provided the RNA-Seq user community with iGenomes, a set of genome sequence indexes (including Bowtie indexes) and GTF transcript annotation files. These files can be used with TopHat and Cufflinks to quickly perform expression analysis and gene discovery. The annotation files are augmented with the tss_id and p_id GTF attributes that Cufflinks needs to perform differential splicing, CDS output, and promoter usage analysis. Bowtie indexes can be found under the 'Sequence' subdirectory of each organism in iGenomes.
Create a script file along the lines of the one below.
#!/bin/bash
# This file is runBowtie
#
#PBS -N Bowtie
#PBS -m be
#PBS -k oe

cd /data/$USER/bowtie/run1
export BOWTIE2_INDEXES=/fdb/igenomes/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index
module load bowtie
bowtie2 -p $np -t -x genome -1 reads/ABCD_1000_1.fq -2 reads/ABCD_1000_2.fq -S out.sam
In this example, the input files are in /data/$USER/bowtie/run1/reads, and the arguments to bowtie are:
-p $np: the number of threads to run. The value of $np is passed to the script at submission time.
-t: print timing statistics
-x genome: the basename of the index for the genome to be searched, looked up in $BOWTIE2_INDEXES. Since $BOWTIE2_INDEXES has been set, the hg19 indexes in that directory (whose basename is 'genome') will be used. Note: use $BOWTIE_INDEXES instead of $BOWTIE2_INDEXES when running a bowtie version earlier than 2.0. The 'module load bowtie' above loads the default version, which is bowtie2.
-1 reads/ABCD_1000_1.fq -2 reads/ABCD_1000_2.fq: the FASTQ files containing the paired-end reads (mate 1 and mate 2)
-S out.sam: output file for the alignments, in SAM format
Note: the parameters for bowtie and bowtie2 are different. Please refer to the Bowtie documentation and the Bowtie2 documentation.
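Before submitting, it can help to confirm that the basename given to -x resolves to a complete Bowtie2 index. The sketch below assumes the standard small-index layout of six files (genome.1.bt2 through genome.rev.2.bt2); the function name check_bt2_index is hypothetical, and large indexes with the .bt2l suffix are not handled.

```shell
# check_bt2_index DIR BASENAME
# Returns 0 if all six small-index files exist, 1 otherwise.
# (Hypothetical helper; the suffixes are the standard Bowtie2 ones.)
check_bt2_index() {
    local dir="$1" base="$2" suffix missing=0
    for suffix in 1.bt2 2.bt2 3.bt2 4.bt2 rev.1.bt2 rev.2.bt2; do
        if [ ! -f "$dir/$base.$suffix" ]; then
            echo "missing: $dir/$base.$suffix" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

In the script above, a line such as check_bt2_index "$BOWTIE2_INDEXES" genome || exit 1 placed before the bowtie2 command would stop the job early if the index is incomplete.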
The output should look something like this:
Time loading reference: 00:00:10
Time loading forward index: 00:00:24
Time loading mirror index: 00:00:23
Seeded quality full-index search: 02:19:55
Seeded quality full-index search: 00:20:23
# reads processed: 29263228
# reads with at least one reported alignment: 23848839 (81.50%)
# reads that failed to align: 5414389 (18.50%)
Reported 23848839 paired-end alignments to 1 output stream(s)
Time searching: 00:21:10
Overall time: 00:21:10
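When many jobs are run, the alignment rate is often the first number to check. A minimal sketch, assuming the summary was captured to a log file with one statistic per line in the wording shown above; the function name alignment_rate is hypothetical, and bowtie2 words its summary differently, so the pattern would need adjusting for it.

```shell
# alignment_rate LOGFILE
# Prints the percentage from the
# "# reads with at least one reported alignment" line, without the % sign.
alignment_rate() {
    sed -n 's/^# reads with at least one reported alignment: [0-9]* (\([0-9.]*\)%)$/\1/p' "$1"
}
```

If the log contains no matching line, the function prints nothing, which a calling script can treat as a failed or incomplete run.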
Memory requirements: Bowtie uses approximately as much memory as the size of the bowtie indices. For the human genome, this is about 3.4 GB. You should run this job on a node with at least 4 GB of RAM.
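Since memory use tracks the size of the index, adding up the index files gives a rough figure for the memory to request. A sketch, assuming the Bowtie2 .bt2 file naming; index_size_bytes is a hypothetical helper, and the result is only an approximation of actual memory use.

```shell
# index_size_bytes DIR BASENAME
# Prints the combined size in bytes of all BASENAME.*.bt2 files in DIR.
index_size_bytes() {
    local dir="$1" base="$2" f size total=0
    for f in "$dir/$base".*.bt2; do
        [ -f "$f" ] || continue      # the glob matched nothing
        size=$(wc -c < "$f")
        total=$((total + size))
    done
    echo "$total"
}
```

For the hg19 index used above, this would be called as index_size_bytes "$BOWTIE2_INDEXES" genome.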
3. Submit the script using the 'qsub' command on Biowulf, e.g. to run 4 threads on an 8GB node, use
qsub -v np=4 -l nodes=1:g8 /data/$USER/runBowtie
or to run 16 threads on a 24 GB node, use
qsub -v np=16 -l nodes=1:g24:c16 /data/$USER/runBowtie
You can use 'freen' to identify the number of cores on each kind of node. The value of np should be set to the number of cores on the type of node to which you are submitting.
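The script receives np only through 'qsub -v'; if it is omitted, '-p $np' expands to a bare '-p' and the bowtie2 command line is malformed. A defensive sketch for the top of the script follows. The helper name resolve_threads is hypothetical, and falling back to a single thread is an assumption about the preferred behavior.

```shell
# resolve_threads VALUE
# Prints VALUE if it is a positive integer without leading zeros,
# otherwise prints 1.
resolve_threads() {
    case "$1" in
        ''|*[!0-9]*|0*) echo 1 ;;      # empty, non-numeric, or starts with 0
        *)              echo "$1" ;;
    esac
}

# Use the value passed with 'qsub -v np=...', or 1 if none was given.
np=$(resolve_threads "${np:-}")
```

With this in place, a submission that forgets '-v np=...' runs single-threaded instead of failing on the command line.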
Users will typically want to submit large numbers of Bowtie jobs. The easiest way to do this is via swarm.
Set up a swarm command file (e.g. /data/$USER/cmdfile). Here is a sample file:
bowtie2 -p 16 -t -x genome -1 reads1/ABCD_1000_1.fq -2 reads1/ABCD_1000_2.fq -S out1.sam
bowtie2 -p 16 -t -x genome -1 reads2/ABCD_1000_1.fq -2 reads2/ABCD_1000_2.fq -S out2.sam
bowtie2 -p 16 -t -x genome -1 reads3/ABCD_1000_1.fq -2 reads3/ABCD_1000_2.fq -S out3.sam
[...]
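For many samples, the command file can be generated rather than typed. A sketch, assuming each sample's two mate files sit together in a directory named reads1, reads2, and so on, with the file names used in the sample; the function name make_cmdfile is hypothetical.

```shell
# make_cmdfile COUNT THREADS
# Prints one bowtie2 command per sample directory reads1..readsCOUNT.
make_cmdfile() {
    local count="$1" threads="$2" i=1
    while [ "$i" -le "$count" ]; do
        echo "bowtie2 -p $threads -t -x genome -1 reads$i/ABCD_1000_1.fq -2 reads$i/ABCD_1000_2.fq -S out$i.sam"
        i=$((i + 1))
    done
}
```

Running make_cmdfile 3 16 > /data/$USER/cmdfile would write a three-line command file that uses 16 threads per command.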
The relevant swarm flags are:
-f: the name of the swarm command file above (required)
-t: number of processors per node to use for each command line in the swarm file (required if you are using more than 1 thread per bowtie process; this number should equal the value of the '-p' bowtie argument)
-g: GB of memory needed for each command line in the swarm file (required if a single bowtie process will use more than the default 1 GB of memory)
By default, each command line is executed on 1 processor core of a node and uses 1 GB of memory. If this is not what you want, specify the '-t' and '-g' flags when you submit the job on biowulf.
In the example above, '-p 16' tells bowtie2 to use 16 threads, so each command line needs 16 processors. When submitting to swarm, make sure swarm knows this by including the '-t 16' flag.
In this example, 'genome' is the basename of the bowtie2 indexes in the $BOWTIE2_INDEXES directory. For bowtie2 to find these index files, the $BOWTIE2_INDEXES variable must be set and passed to swarm when submitting the job (see below). The 'bowtie' module must also be loaded so that the 'bowtie2' program is on the path (see below).
biowulf> $ swarm -t 16 -f cmdfile -v BOWTIE2_INDEXES=/fdb/igenomes/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index --module bowtie
- -t: 16 threads will be used for each line of commands
- -f: swarm command file name
- -v: environment variable $BOWTIE2_INDEXES
- --module: bowtie module will be loaded
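Since the '-t' value given to swarm should equal the '-p' value on every command line, a quick consistency check before submission can catch mismatches. A sketch using awk; check_threads is a hypothetical helper, and it assumes '-p' is followed by its value as a separate word.

```shell
# check_threads CMDFILE THREADS
# Returns 0 if every non-blank line contains "-p THREADS", 1 otherwise.
# Mismatching lines are reported on stderr.
check_threads() {
    awk -v t="$2" '
        NF == 0 { next }                         # skip blank lines
        {
            found = ""
            for (i = 1; i < NF; i++)
                if ($i == "-p") found = $(i + 1)
            if (found != t) {
                printf "line %d: -p is \"%s\", expected %s\n", NR, found, t > "/dev/stderr"
                bad = 1
            }
        }
        END { exit bad }
    ' "$1"
}
```

For the submission above, check_threads cmdfile 16 should succeed silently before swarm is called with '-t 16'.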
If each command line also needs 10 GB of memory instead of the default 1 GB, tell swarm by including the '-g 10' flag:
biowulf> $ swarm -g 10 -t 16 -f cmdfile -v BOWTIE2_INDEXES=/fdb/igenomes/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index --module bowtie
For more information regarding running swarm, see swarm.html