TopHat-Fusion is an enhanced version of TopHat with the ability to align reads across fusion points, which result from the breakage and re-joining of two different chromosomes, or from rearrangements within a chromosome.
The executables for TopHat-Fusion, Samtools and Bowtie need to be on your path. The easiest way to do this is with the module commands, as in the example below:
biowulf% module load tophatfusion
biowulf% module list
Currently Loaded Modulefiles:
  1) tophatfusion
The iGenomes collection is available on helix/biowulf in /fdb/igenomes.
Illumina has provided the RNA-Seq user community with iGenomes, a set of genome sequence indexes (including Bowtie indexes) and GTF transcript annotation files. These files can be used with TopHat and Cufflinks to quickly perform expression analysis and gene discovery. The annotation files are augmented with the tss_id and p_id GTF attributes that Cufflinks needs to perform differential splicing, CDS output, and promoter use analysis. Bowtie indexes can be found under the 'Sequence' subdirectory of each organism in iGenomes.
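To see which Bowtie indexes are installed, the directory tree can be scanned directly. The sketch below assumes the three-level organism/source/build layout seen in the hg19 path used in the batch script on this page (Homo_sapiens/UCSC/hg19/Sequence/BowtieIndex); the helper name list_bowtie_indexes is illustrative and not part of iGenomes or TopHat-Fusion.

```shell
#!/bin/bash
# list_bowtie_indexes ROOT
# Prints every ROOT/<organism>/<source>/<build>/Sequence/BowtieIndex directory.
# (Hypothetical helper; the three-level layout is an assumption based on the
# hg19 path shown on this page.)
list_bowtie_indexes() {
    local root=$1 dir
    for dir in "$root"/*/*/*/Sequence/BowtieIndex; do
        # When the glob matches nothing it stays literal, so test before printing.
        [ -d "$dir" ] && echo "$dir"
    done
    return 0
}

# Example: list_bowtie_indexes /fdb/igenomes
```

The function only reads directory names, so it is safe to run on a login node.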
Sample Sessions On Biowulf
Submitting a single TopHat-Fusion batch job
Note: prebuilt indexes have been downloaded to /fdb/igenomes.
1. Create a script file similar to the one below.
2. By default, tophat-fusion runs a single thread on a single core. For speed and efficiency, you will probably want to use all the cores available on the node. You can see the number of cores on each type of node by typing 'freen'. Set the -p parameter of tophat-fusion to this number or lower; a higher value will overload the node.
#!/bin/bash
# This file is runTopHatFusion
#
#PBS -N tophatfusion
#PBS -m be
#PBS -k oe

module load tophatfusion
cd /data/userID/tophatfusion/run1
tophat-fusion -o tophat_MCF7 -p 16 --allow-indels --no-coverage-search -r 0 --mate-std-dev 80 --fusion-min-dist 100000 --fusion-anchor-length 13 \
    /fdb/igenomes/Homo_sapiens/UCSC/hg19/Sequence/BowtieIndex/genome.fa SRR064286_1.fastq SRR064286_2.fastq
3. Submit the script using the 'qsub' command on Biowulf.
qsub -l nodes=1:g24:c16 /data/username/runTopHatFusion
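The advice about -p above (keep it at or below the node's core count) can also be enforced inside the batch script rather than by hand. This is a minimal sketch, assuming GNU coreutils nproc is available on the compute node; the helper name cap_threads is illustrative and not a TopHat-Fusion or Biowulf feature.

```shell
#!/bin/bash
# cap_threads REQUESTED AVAILABLE
# Prints the smaller of the two values, so -p never exceeds the core count.
# (Hypothetical helper, not part of TopHat-Fusion.)
cap_threads() {
    local requested=$1 available=$2
    if [ "$requested" -gt "$available" ]; then
        echo "$available"
    else
        echo "$requested"
    fi
}

# nproc reports the cores visible to this process; fall back to 1 if missing.
cores=$(nproc 2>/dev/null || echo 1)
threads=$(cap_threads 16 "$cores")
echo "would run: tophat-fusion -p $threads ..."
```

In a real script the echo would be replaced by the tophat-fusion command itself, with -p "$threads" in place of the fixed -p 16.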
Submitting a swarm of TopHat-Fusion jobs
Using the 'swarm' utility, one can submit many jobs to the cluster to run concurrently.
Set up a swarm command file (e.g. /data/username/cmdfile). Here is a sample file. Each command must be on a single line; do not insert line breaks within a command:
module load tophatfusion; cd /data/userID/tophat/run1; tophat-fusion -o tophat_MCF7 -p 8 ....
module load tophatfusion; cd /data/userID/tophat/run2; tophat-fusion -o tophat_MCF7 -p 8 ....
module load tophatfusion; cd /data/userID/tophat/run3; tophat-fusion -o tophat_MCF7 -p 8 ....
module load tophatfusion; cd /data/userID/tophat/run4; tophat-fusion -o tophat_MCF7 -p 8 ....
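With many run directories, a command file like the sample above can be generated rather than typed. The sketch below assumes the runs live in subdirectories named run1, run2, and so on under one parent directory; the helper name make_cmdfile is illustrative. The trailing "...." is copied from the sample and stands for the remaining tophat-fusion arguments, which still have to be filled in.

```shell
#!/bin/bash
# make_cmdfile PARENT_DIR
# Prints one swarm command line per run* subdirectory of PARENT_DIR.
# The trailing "...." is a placeholder for the remaining tophat-fusion
# arguments, which are elided in the sample file and must be supplied.
# (Hypothetical helper, not part of swarm.)
make_cmdfile() {
    local parent=$1 dir
    for dir in "$parent"/run*/; do
        [ -d "$dir" ] || continue     # skip if the glob matched nothing
        dir=${dir%/}                  # drop the trailing slash
        printf 'module load tophatfusion; cd %s; tophat-fusion -o tophat_MCF7 -p 8 ....\n' "$dir"
    done
}

# Example (paths are placeholders):
#   make_cmdfile /data/userID/tophat > /data/username/cmdfile
```

Each run directory yields exactly one output line, which matches swarm's one-command-per-line requirement.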
Swarm requires one flag: -f, and users will probably want to specify two other flags: -t and -g.
-f: the swarm command file name above (required)
-t: number of processors per node to use for each command line in the swarm file (optional)
-g: GB of memory needed for each command line in the swarm file (optional)
The tophat-fusion commands above are set with -p 8 (8 threads), so swarm has to be told that each command above will require 8 cores. This is done with -t 8. In addition, if each tophat-fusion command requires, say, 6 GB of memory, the swarm command should have -g 6. So this swarm command file would be submitted with:
biowulf$ swarm -t 8 -g 6 -f cmdfile
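Because -t must agree with the -p value inside the command file, it can help to derive one from the other before submitting. The following sketch assumes every command in the file sets its thread count with "-p N"; the helper names threads_in_cmdfile and suggest_swarm are illustrative, and the script only prints a swarm command rather than running it.

```shell
#!/bin/bash
# threads_in_cmdfile FILE
# Prints the distinct values that follow "-p" in FILE, one per line.
threads_in_cmdfile() {
    grep -o -e '-p [0-9][0-9]*' "$1" | awk '{print $2}' | sort -un
}

# suggest_swarm FILE GB
# Prints a swarm command whose -t matches the single -p value used in FILE.
# Returns non-zero if no -p value is found or the commands disagree.
# (Both helpers are hypothetical, not part of swarm.)
suggest_swarm() {
    local file=$1 gb=$2 values
    values=$(threads_in_cmdfile "$file")
    if [ -z "$values" ] || [ "$(printf '%s\n' "$values" | wc -l)" -ne 1 ]; then
        echo "error: expected exactly one -p value in $file" >&2
        return 1
    fi
    echo "swarm -t $values -g $gb -f $file"
}

# Example: suggest_swarm cmdfile 6
```

For the sample command file, where every line uses -p 8, the example call prints the same swarm command shown above.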
For more information regarding running swarm, see swarm.html