Biowulf at the NIH
TopHat on Biowulf

TopHat is a program that aligns RNA-Seq reads to a genome in order to identify exon-exon splice junctions. It is built on the ultrafast short read mapping program Bowtie. TopHat was designed to work with reads produced by the Illumina Genome Analyzer, although users have been successful in using TopHat with reads from other technologies. Versions after 1.1.0 support Applied Biosystems' Colorspace format. The software is optimized for reads 75bp or longer. Currently, TopHat does not allow short (fewer than a few nucleotides) insertions and deletions in the alignments it reports. Support for insertions and deletions will eventually be added. Finally, mixing paired-end and single-end reads together is not supported.

TopHat is a collaborative effort between the University of Maryland Center for Bioinformatics and Computational Biology and the University of California, Berkeley Departments of Mathematics and Molecular and Cell Biology.

The TopHat and Bowtie executables need to be added to your path. The easiest way to do this is with the module commands, as in the example below.

Most versions of TopHat can be used with either Bowtie or Bowtie2. For example, loading the module 'tophat/2.05/bowtie2' sets up the environment for TopHat 2.0.5 and Bowtie2, while loading 'tophat/2.05/bowtie' sets up the environment for TopHat 2.0.5 and Bowtie.

[user@biowulf]$ module avail tophat

----------------- /usr/local/Modules/3.2.9/modulefiles -----------------------------
tophat/2.04/bowtie           tophat/2.05/bowtie           tophat/2.06/bowtie
tophat/2.04/bowtie2          tophat/2.05/bowtie2          tophat/2.06/bowtie2(default)

[user@biowulf]$ module load tophat       (loads the default version)

[user@biowulf]$ module list
Currently Loaded Modulefiles:
  1) tophat/2.06/bowtie2

[user@biowulf]$ module unload tophat

[user@biowulf]$ module load tophat/2.05/bowtie    (loads a specific version)

[user@biowulf]$ module list
Currently Loaded Modulefiles:
  1) tophat/2.05/bowtie

iGenomes is available on helix/biowulf in /fdb/igenomes.
Illumina has provided the RNA-Seq user community with iGenomes, a set of genome sequence indexes (including Bowtie indexes) and GTF transcript annotation files. These files can be used with TopHat and Cufflinks to quickly perform expression analysis and gene discovery. The annotation files are augmented with the tss_id and p_id GTF attributes that Cufflinks needs to perform differential splicing, CDS output, and promoter usage analysis. Bowtie, Bowtie2 and BWA indexes can be found under the 'Sequence' subdirectory of each organism in iGenomes.
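
As a rough sketch of how these files might be passed to TopHat, the snippet below builds a command line from the usual iGenomes directory layout. The organism, source, and build (Homo_sapiens/UCSC/hg19) are assumptions chosen for illustration, as are the read file names and the thread count; list /fdb/igenomes to see what is actually installed. The -G option supplies the GTF annotation to TopHat, and the Bowtie2 index only makes sense with a tophat module built for Bowtie2.

```shell
# Assumed (hypothetical) organism/source/build; check /fdb/igenomes for what exists.
IGENOMES=/fdb/igenomes
BUILD=Homo_sapiens/UCSC/hg19

# Usual iGenomes layout: index base name and gene annotation file.
INDEX="$IGENOMES/$BUILD/Sequence/Bowtie2Index/genome"
GTF="$IGENOMES/$BUILD/Annotation/Genes/genes.gtf"

# Print the command rather than run it, so the paths can be reviewed first.
CMD="tophat -p 8 -G $GTF $INDEX reads_1.fq reads_2.fq"
echo "$CMD"
```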

Submitting a single TopHat batch job

1. Create a script file, similar to the one below:

2. By default, TopHat runs a single thread. For efficiency and speed you will probably want to use all the available cores on the node, so set the value of the '-p' parameter equal to (never larger than) the number of cores on the node. You can see the number of cores on each type of node by typing 'freen'.

# This file is runTopHat
#PBS -N tophat
#PBS -m be
#PBS -k oe

module load tophat/2.06/bowtie2

cd /data/userID/tophat/run1
tophat -r 20 -p 24 test_ref reads_1.fq reads_2.fq

3. Submit the script using the 'qsub' command on Biowulf:

qsub -l nodes=1:g24:c24 /data/username/runTopHat

In this case, the job is being run on a g24 node (24 GB of memory). According to 'freen', the g24 nodes have either 24 cores or 16 cores. In the example above, the user has chosen to run on the 24-core nodes (c24), and has therefore specified '-p 24' in the batch script.
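
The rule that '-p' must never exceed the core count can also be checked inside the batch script itself. The following is a minimal sketch, assuming the GNU coreutils 'nproc' command is available on the node; the variable names are illustrative, and the tophat command is only printed, not run.

```shell
# Requested thread count (matches the '-p 24' used in the batch script).
requested=24

# Cores available to this process on the current node (GNU coreutils).
available=$(nproc)

# Use the smaller of the two so that -p never exceeds the core count.
if [ "$requested" -gt "$available" ]; then
    threads=$available
else
    threads=$requested
fi

echo "tophat -r 20 -p $threads test_ref reads_1.fq reads_2.fq"
```

If the job lands on a 16-core g24 node instead of a 24-core one, this would fall back to 16 threads rather than oversubscribing the node.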

Submitting a swarm of TopHat jobs

Using the 'swarm' utility, one can submit many jobs to the cluster to run concurrently.

Set up a swarm command file (e.g. /data/username/cmdfile). Here is a sample file. Note that each command must be on a single line; do not add line breaks within a command. Also note that each tophat job runs in its own subdirectory. This is required because the default output directory for all tophat jobs is identical (tophat_out/). Alternatively, one could redirect the output directory manually using the -o option.

cd /data/user/tophat/run1; tophat -p 8 -r 20 test_ref reads1.fq reads2.fq
cd /data/user/tophat/run2; tophat -p 8 -r 20 test_ref reads1.fq reads2.fq
cd /data/user/tophat/run3; tophat -p 8 -r 20 test_ref reads1.fq reads2.fq
cd /data/user/tophat/run4; tophat -p 8 -r 20 test_ref reads1.fq reads2.fq
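
For more than a handful of runs, a command file like the one above could be generated with a short loop instead of being typed by hand. This is only a sketch: the run directory names are taken from the sample, and the temporary file stands in for a real path such as /data/username/cmdfile.

```shell
# Temporary file for illustration; in practice use your own cmdfile path.
cmdfile=$(mktemp)

# One line per run directory, each command kept on a single line.
for run in run1 run2 run3 run4; do
    echo "cd /data/user/tophat/$run; tophat -p 8 -r 20 test_ref reads1.fq reads2.fq"
done > "$cmdfile"

# Show the generated swarm command file.
cat "$cmdfile"
```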

Swarm requires one flag, -f. Users will probably also want to specify -t, -g, and --module:

-f: the name of the swarm command file described above (required)
-t: number of processors per node to use for each command line in the swarm file (optional)
-g: GB of memory needed for each command line in the swarm file (optional)
--module: sets up the tophat environment variables for each swarm job

The tophat commands in the swarm file above have parameter -p 8, which means that each command will run 8 threads and therefore use 8 cores. You need to tell swarm that each command requires 8 cores. This is done with the -t 8 switch to swarm. In addition, each tophat command may require, say, 12 GB of memory. This is specified to swarm using the -g 12 switch. Thus, this swarm command file can be submitted with:

[user@biowulf]$ swarm -t 8 -g 12 --module tophat -f cmdfile

Users may need to run a few test jobs to determine how much memory is used. Set up a single tophat job, then submit it to a g24 node. The output from the job will list the memory used by that job.
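
Because the -t value given to swarm has to agree with the -p value inside the command file, it can help to derive one from the other. The sketch below reads the largest -p value from a small sample command file and prints a matching swarm command. The temporary file and variable names are illustrative, and the 12 GB figure is just the example value used above, not a measured requirement.

```shell
# Sample command file for illustration (two lines from the sample above).
cmdfile=$(mktemp)
cat > "$cmdfile" <<'EOF'
cd /data/user/tophat/run1; tophat -p 8 -r 20 test_ref reads1.fq reads2.fq
cd /data/user/tophat/run2; tophat -p 8 -r 20 test_ref reads1.fq reads2.fq
EOF

# Largest -p value found in the file; swarm's -t should cover it.
threads=$(grep -o -e '-p [0-9][0-9]*' "$cmdfile" | awk '{print $2}' | sort -n | tail -n 1)

# Memory per command in GB (example value; replace with a measured figure).
mem_gb=12

# Print the swarm command for review instead of submitting it.
echo "swarm -t $threads -g $mem_gb --module tophat -f $cmdfile"
```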

For more information on running swarm, see swarm.html.