A computational framework to identify fusion transcripts from paired-end RNA-Seq data. Fusionseq was developed by the Gerstein Lab.
Fusionseq requires a resource file (/home/$USER/.fusioneqrc) that sets the location for the bowtie indexes and so on. When you run 'module load fusionseq' to set up the environment for fusionseq, if this file does not exist the default template will be copied into /home/$USER/.fusionseq. Example below:
biowulf% module load fusionseq /home/susanc/.fusionseqrc does not exist; creating it biowulf% module list Currently Loaded Modulefiles: 1) fusionseq/0.7.0
This is what the resource file looks like:
# --------------------------------- This section is required --------------------------------- # Location of the bowtie indexes of the human genome and the composite model BOWTIE_INDEXES="/usr/local/bowtie-indexes" #define BOWTIE_INDEXES_TMP "/data/YourUserId/MoreIfNeededHere" # the subdirectory of BOWTIE_INDEXES where the human genome is indexed by bowtie-build BOWTIE_GENOME="hg18_nh" # the subdirectory of BOWTIE_INDEXES where the composite model is indexed by bowtie-build BOWTIE_COMPOSITE="hg18_knownGeneAnnotationTranscriptCompositeModel" # Pointer to the program twoBitToFa part of the blat suite BLAT_TWO_BIT_TO_FA="/usr/local/blat/x86_64/twoBitToFa" # Location and filename of the reference genome in 2bit format (to be used by blat) BLAT_DATA_DIR="/usr/local/bowtie-indexes/hg18_nh" BLAT_TWO_BIT_DATA_FILENAME="hg18.2bit" # Location and name of the transcript composite model sequence and interval files TRANSCRIPT_COMPOSITE_MODEL_DIR="/usr/local/bowtie-indexes/FusionSeq_Data" TRANSCRIPT_COMPOSITE_MODEL_FA_FILENAME="knownGeneAnnotationTranscriptCompositeModel.fa" TRANSCRIPT_COMPOSITE_MODEL_FILENAME="knownGeneAnnotationTranscriptCompositeModel.txt" # location of the annotation files ANNOTATION_DIR="/usr/local/bowtie-indexes/FusionSeq_Data" # conversion of knownGenes to gene symbols, description, etc. KNOWN_GENE_XREF_FILENAME="kgXref.txt" # conversion of knownGenes to TreeFam KNOWN_GENE_TREE_FAM_FILENAME="knownToTreefam.txt" # Location and filename of the ribosomal library RIBOSOMAL_DIR="/usr/local/bowtie-indexes/FusionSeq_Data" RIBOSOMAL_FILENAME="ribosomal.2bit"
The environment variables in the upper required section have been set to point to the locations of indexes on Biowulf. If you have your own indexes, you may want to modify these variables.
1. Create a script file along the lines of the one below:
#!/bin/bash # This file is fusionseqscript # #PBS -N fusionseq #PBS -m be #PBS -k oe module load fusionseq/0.7.0 cd /data/user/mydir #sample FusionSeq commands geneFusions data 4 < data.mrf | gfrClassify > data.1.gfr 2> data.1.gfr.log (gfrMitochondrialFilter < data.1.gfr | gfrCountPairTypes | gfrExpressionConsistencyFilter | gfrAbnormalInsertSizeFilter 0.01 | gfrPCRFilter 4 4 |\ gfrProximityFilter 1000 | gfrAddInfo | gfrAnnotationConsistencyFilter ribosomal | gfrAnnotationConsistencyFilter pseudogenes | \ gfrLargeScaleHomologyFilter | gfrRibosomalFilter | gfrSmallScaleHomologyFilter) > data.gfr 2> data.log gfrConfidenceValues data < data.gfr > data.confidence.gfr gfr2bpJunctions data.confidence.gfr 40 200 minDASPER
3. Submit the script using the 'qsub' command on Biowulf.
qsub -l nodes=1 /data/username/fusionseqscript
If your fusionseq job requires more than the default 1 GB of memory, you can specify a node property to get a node with more memory. e.g.
qsub -l nodes=1:g24:c16 /data/username/fusionseqscript
Using the 'swarm' utility, one can submit many jobs to the cluster to run concurrently.
Set up a swarm command file (eg /data/username/cmdfile). Here is a sample file:
cd /data/user/mydir; geneFusions data 4 < data1.mrf | gfrClassify > data.1.gfr 2> data.1.gfr.log cd /data/user/mydir; geneFusions data 4 < data2.mrf | gfrClassify > data.2.gfr 2> data.2.gfr.log cd /data/user/mydir; geneFusions data 4 < data3.mrf | gfrClassify > data.3.gfr 2> data.3.gfr.log [...]
Submit this swarm with:
swarm -f cmdfile -module fusionseq/0.7.0
By default, each line of the commands above will be executed on '1' processor core of a node and uses 1GB of memory. If your fusionseq commands require more than 1 GB of memory, you must specify the memory required to swarm by using the -g # flag, where # is the number of Gigabytes of memory required. For example, if each line of the commands above require 8 GB of memory, you would submit the swarm with:
swarm -g 8 -f cmdfile -module fusionseq/0.7.0
For more information regarding running swarm, see swarm.html
Users may need to run jobs interactively sometimes for debugging purposes. Such jobs should not be run on the Biowulf login node. Instead allocate an interactive node as described below, and run the interactive job there.
qsub: waiting for job 2236960.biobos to start
qsub: job 2236960.biobos ready
[user@p4]$ cd /data/user/myruns
[user@p4]$ module load fusionseq/
[user@p4]$ cd /data/userID/fusionseq/run1
[user@p4]$ geneFusions data 4 < data.mrf | gfrClassify > data.1.gfr 2> data.1.gfr.log;\
[user@p4]$ (gfrMitochondrialFilter < data.1.gfr | gfrCountPairTypes | gfrExpressionConsistencyFilter) > data.gfr 2> data.log;\
[user@p4]$ gfrConfidenceValues data < data.gfr > data.confidence.gfr
[user@p4] exit
qsub: job 2236960.biobos completed
[user@biowulf ~]$
To request a specific type of interactive node (e.g. with more memory), specify the node property in the qsub command. For example, if you need a node with 24 GB of memory to run an interactive job:
biowulf% qsub -I -l nodes=1:g24:c16
http://info.gersteinlab.org/FusionSeq


