TophatFusion on Biowulf

TopHat-Fusion is an enhanced version of TopHat with the ability to align reads across fusion points, which result from the breakage and re-joining of two different chromosomes, or from rearrangements within a chromosome.

The executables for TopHat-Fusion, Samtools, and Bowtie need to be added to your path. The easiest way to do this is with the module commands, as in the example below:

biowulf% module load tophatfusion

biowulf% module list
Currently Loaded Modulefiles:
  1) tophatfusion

The iGenomes collection is available on Helix/Biowulf in /fdb/igenomes.
Illumina has provided the RNA-Seq user community with a set of genome sequence indexes (including Bowtie indexes) as well as GTF transcript annotation files, collectively called iGenomes. These files can be used with TopHat and Cufflinks to quickly perform expression analysis and gene discovery. The annotation files are augmented with the tss_id and p_id GTF attributes that Cufflinks needs to perform differential splicing, CDS output, and promoter-usage analysis. Bowtie indexes can be found under the 'Sequence' subdirectory of each organism in iGenomes.
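
For example, the hg19 Bowtie index used in the batch script below lives under the following directory (the exact set of files may change as iGenomes is updated):

biowulf% ls /fdb/igenomes/Homo_sapiens/UCSC/hg19/Sequence/BowtieIndex/

When passing the index to tophat-fusion, use the basename of the index files (e.g. .../BowtieIndex/genome), not an individual file.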

Sample Sessions on Biowulf

Submitting a single TophatFusion batch job

Note: prebuilt indexes have been downloaded under /fdb/igenomes.

1. Create a script file similar to the one below.

2. By default, tophat-fusion runs a single thread and uses a single core. For speed and efficiency, you will probably want to use all the cores available on the node. You can see the number of cores on each type of node by typing 'freen'. The value of the -p parameter for tophat-fusion should be set to this number or lower (never higher, or you will overload the node).

#!/bin/bash
# This file is runTopHatFusion
#
#PBS -N tophatfusion
#PBS -m be
#PBS -k oe

module load tophatfusion

cd /data/userID/tophatfusion/run1
tophat-fusion -o tophat_MCF7 -p 16 --allow-indels --no-coverage-search -r 0 --mate-std-dev 80 --fusion-min-dist 100000 --fusion-anchor-length 13 \
/fdb/igenomes/Homo_sapiens/UCSC/hg19/Sequence/BowtieIndex/genome SRR064286_1.fastq SRR064286_2.fastq

3. Submit the script using the 'qsub' command on Biowulf.

        
qsub -l nodes=1:g24:c16 /data/username/runTopHatFusion

In this case, the user has chosen a g24:c16 node (24 GB RAM, 16 cores), so -p 16 has been set in the tophat-fusion command in the batch script above; tophat-fusion will run 16 threads on the 16-core node.
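
If you plan to use a different node type, check its core count with 'freen' before setting -p in the batch script (output is omitted here, since the available node types and free cores change over time):

biowulf% freen

Whatever node type is requested with qsub, the -p value in the script should be no larger than that node's core count.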

Submitting a swarm of TopHat-Fusion jobs

Using the 'swarm' utility, one can submit many jobs to the cluster to run concurrently.

Set up a swarm command file (e.g. /data/username/cmdfile). Here is a sample file. Please note that each command must be on a single line; do not add any line breaks within a command:

module load tophatfusion; cd /data/userID/tophat/run1; tophat-fusion -o tophat_MCF7 -p 8 ....
module load tophatfusion; cd /data/userID/tophat/run2; tophat-fusion -o tophat_MCF7 -p 8 ....
module load tophatfusion; cd /data/userID/tophat/run3; tophat-fusion -o tophat_MCF7 -p 8 ....
module load tophatfusion; cd /data/userID/tophat/run4; tophat-fusion -o tophat_MCF7 -p 8 ....
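
For illustration, a fully written-out line would look much like the single-job command earlier on this page, with -p 8 and the run directory adjusted; the sample names and index path below are simply carried over from that example:

module load tophatfusion; cd /data/userID/tophat/run1; tophat-fusion -o tophat_MCF7 -p 8 --allow-indels --no-coverage-search -r 0 --mate-std-dev 80 --fusion-min-dist 100000 --fusion-anchor-length 13 /fdb/igenomes/Homo_sapiens/UCSC/hg19/Sequence/BowtieIndex/genome SRR064286_1.fastq SRR064286_2.fastq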

Swarm requires one flag: -f, and users will probably want to specify two other flags: -t and -g.

-f: the swarm command file name above (required)
-t: number of processors per node to use for each line of commands in the swarm file above (optional)
-g: GB of memory needed for each line of commands in the swarm file above (optional)

The tophat-fusion commands above are set with -p 8 (8 threads), so swarm has to be told that each command above will require 8 cores. This is done with -t 8. In addition, if each tophat-fusion command requires, say, 6 GB of memory, the swarm command should have -g 6. So this swarm command file would be submitted with:

biowulf$ swarm -t 8 -g 6 -f cmdfile
Users may need to run a few test jobs to find out how much memory their tophat-fusion commands require. Set up a single TopHat-Fusion job and submit it to a g24 node; the output from the job will list the amount of memory actually used by the job.
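
As a rough sketch, assuming a hypothetical output file named tophatfusion.o123456 (the actual name depends on the job name and the job ID assigned by the batch system), the memory figure can be pulled out of the job's output file with something like:

biowulf$ grep -i mem tophatfusion.o123456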

For more information on running swarm, see swarm.html.

Documentation

http://tophat-fusion.sourceforge.net/tutorial.html