Biowulf at the NIH
RSS Feed
FusionSeq on Biowulf

A computational framework to identify fusion transcripts from paired-end RNA-Seq data. Fusionseq was developed by the Gerstein Lab.

FusionSeq resource file

Fusionseq requires a resource file (/home/$USER/.fusioneqrc) that sets the location for the bowtie indexes and so on. When you run 'module load fusionseq' to set up the environment for fusionseq, if this file does not exist the default template will be copied into /home/$USER/.fusionseq. Example below:

biowulf% module load fusionseq
/home/susanc/.fusionseqrc does not exist; creating it

biowulf% module list
Currently Loaded Modulefiles:
  1) fusionseq/0.7.0

This is what the resource file looks like:

# --------------------------------- This section is required ---------------------------------
# Location of the bowtie indexes of the human genome and the composite model
BOWTIE_INDEXES="/usr/local/bowtie-indexes" 
#define BOWTIE_INDEXES_TMP "/data/YourUserId/MoreIfNeededHere"
# the subdirectory of BOWTIE_INDEXES where the human genome is indexed by bowtie-build
BOWTIE_GENOME="hg18_nh"
# the subdirectory of BOWTIE_INDEXES where the composite model is indexed by bowtie-build
BOWTIE_COMPOSITE="hg18_knownGeneAnnotationTranscriptCompositeModel"

# Pointer to the program twoBitToFa part of the blat suite
BLAT_TWO_BIT_TO_FA="/usr/local/blat/x86_64/twoBitToFa" 
# Location and filename of  the reference genome in 2bit format (to be used by blat)
BLAT_DATA_DIR="/usr/local/bowtie-indexes/hg18_nh"
BLAT_TWO_BIT_DATA_FILENAME="hg18.2bit"

# Location and name of the transcript composite model sequence and interval files
TRANSCRIPT_COMPOSITE_MODEL_DIR="/usr/local/bowtie-indexes/FusionSeq_Data"
TRANSCRIPT_COMPOSITE_MODEL_FA_FILENAME="knownGeneAnnotationTranscriptCompositeModel.fa"
TRANSCRIPT_COMPOSITE_MODEL_FILENAME="knownGeneAnnotationTranscriptCompositeModel.txt" 

# location of the annotation files
ANNOTATION_DIR="/usr/local/bowtie-indexes/FusionSeq_Data" 
# conversion of knownGenes to gene symbols, description, etc. 
KNOWN_GENE_XREF_FILENAME="kgXref.txt" 
# conversion of knownGenes to TreeFam
KNOWN_GENE_TREE_FAM_FILENAME="knownToTreefam.txt" 

# Location and filename of the ribosomal library
RIBOSOMAL_DIR="/usr/local/bowtie-indexes/FusionSeq_Data"
RIBOSOMAL_FILENAME="ribosomal.2bit"

The environment variables in the upper required section have been set to point to the locations of indexes on Biowulf. If you have your own indexes, you may want to modify these variables.

Submitting a single FusionSeq batch job

1. Create a script file along the lines of the one below:

#!/bin/bash
# This file is fusionseqscript
#
#PBS -N fusionseq
#PBS -m be
#PBS -k oe

module load fusionseq/0.7.0 

cd /data/user/mydir

#sample FusionSeq commands
geneFusions data 4 < data.mrf | gfrClassify > data.1.gfr 2> data.1.gfr.log
(gfrMitochondrialFilter < data.1.gfr | gfrCountPairTypes | gfrExpressionConsistencyFilter | gfrAbnormalInsertSizeFilter 0.01 | gfrPCRFilter 4 4 |\
gfrProximityFilter 1000 | gfrAddInfo | gfrAnnotationConsistencyFilter ribosomal | gfrAnnotationConsistencyFilter pseudogenes | \
gfrLargeScaleHomologyFilter | gfrRibosomalFilter | gfrSmallScaleHomologyFilter) > data.gfr 2> data.log
gfrConfidenceValues data < data.gfr > data.confidence.gfr
gfr2bpJunctions data.confidence.gfr 40 200 minDASPER

3. Submit the script using the 'qsub' command on Biowulf.

qsub -l nodes=1 /data/username/fusionseqscript

If your fusionseq job requires more than the default 1 GB of memory, you can specify a node property to get a node with more memory. e.g.

qsub -l nodes=1:g24:c16 /data/username/fusionseqscript
will request a node with 24 GB of memory (g24) and 16 cores (c16). Type 'freen' to see available node types.

Submitting a swarm of fusionseq jobs

Using the 'swarm' utility, one can submit many jobs to the cluster to run concurrently.

Set up a swarm command file (eg /data/username/cmdfile). Here is a sample file:

cd /data/user/mydir; geneFusions data 4 < data1.mrf | gfrClassify > data.1.gfr 2> data.1.gfr.log
cd /data/user/mydir; geneFusions data 4 < data2.mrf | gfrClassify > data.2.gfr 2> data.2.gfr.log
cd /data/user/mydir; geneFusions data 4 < data3.mrf | gfrClassify > data.3.gfr 2> data.3.gfr.log
[...]

Submit this swarm with:

swarm -f cmdfile -module fusionseq/0.7.0

By default, each line of the commands above will be executed on '1' processor core of a node and uses 1GB of memory. If your fusionseq commands require more than 1 GB of memory, you must specify the memory required to swarm by using the -g # flag, where # is the number of Gigabytes of memory required. For example, if each line of the commands above require 8 GB of memory, you would submit the swarm with:

swarm -g 8 -f cmdfile -module fusionseq/0.7.0

For more information regarding running swarm, see swarm.html

Running an interactive FusionSeq job

Users may need to run jobs interactively sometimes for debugging purposes. Such jobs should not be run on the Biowulf login node. Instead allocate an interactive node as described below, and run the interactive job there.

biowulf% qsub -I -l nodes=1
qsub: waiting for job 2236960.biobos to start
qsub: job 2236960.biobos ready

[user@p4]$ cd /data/user/myruns
[user@p4]$ module load fusionseq/
[user@p4]$ cd /data/userID/fusionseq/run1
[user@p4]$ geneFusions data 4 < data.mrf | gfrClassify > data.1.gfr 2> data.1.gfr.log;\
[user@p4]$ (gfrMitochondrialFilter < data.1.gfr | gfrCountPairTypes | gfrExpressionConsistencyFilter) > data.gfr 2> data.log;\
[user@p4]$ gfrConfidenceValues data < data.gfr > data.confidence.gfr
[user@p4] exit
qsub: job 2236960.biobos completed
[user@biowulf ~]$

To request a specific type of interactive node (e.g. with more memory), specify the node property in the qsub command. For example, if you need a node with 24 GB of memory to run an interactive job:

biowulf% qsub -I -l nodes=1:g24:c16

Documentation

List of programs and usage

http://info.gersteinlab.org/FusionSeq

FusionSeq FAQ