Biowulf at the NIH
eXpress on Biowulf

eXpress is a streaming tool for quantifying the abundances of a set of target sequences from sampled subsequences. Example applications include transcript-level RNA-Seq quantification, allele-specific/haplotype expression analysis (from RNA-Seq), transcription factor binding quantification in ChIP-Seq, and analysis of metagenomic data. It is based on an online-EM algorithm [1] that results in space (memory) requirements proportional to the total size of the target sequences and time requirements proportional to the number of sampled fragments. Thus, in applications such as RNA-Seq, eXpress can accurately quantify much larger samples than other currently available tools, greatly reducing computing infrastructure requirements. eXpress can be used to build lightweight high-throughput sequencing processing pipelines when coupled with a streaming aligner (such as Bowtie), as output can be piped directly into eXpress, effectively eliminating the need to store read alignments in memory or on disk.

Your environment needs to be set up first. The easiest way to do this is with the modules command 'module load express', as in the example below.

biowulf% module load express

Sample Session On Biowulf

Before getting into the job submission examples below, here is a brief walkthrough of the example used in the eXpress documentation.

In the following sub-sections, you will run eXpress on a sample RNA-Seq dataset with simulated reads from UGT3A2 and the HOXC cluster, using human genome build hg18. Both the transcript sequences (transcripts.fasta) and the raw reads (reads_1.fastq, reads_2.fastq) can be found in the /usr/local/express/sample_data directory. For this example to work, you will need to have both bowtie and samtools in your $PATH (you can use the 'module' command); in general, any aligner will work, and the conversion to BAM is not necessary unless you have insufficient disk space to store the uncompressed SAM.

Before you begin, you must prepare your Bowtie index. Since you wish to allow many multi-mappings, it is useful to build the index with a small offrate (in this case 1). The smaller the offrate, the larger the index and the faster the mapping. If you have disk space to spare, always use an offrate of 1.

1. Build the index with the following commands.

$ mkdir /data/user/express
$ cd /data/user/express
$ cp -rp /usr/local/express/sample_data .
$ cd sample_data
$ module load bowtie/0.12.8   # this example uses bowtie, not bowtie2
$ module load samtools
$ bowtie-build --offrate 1 transcripts.fasta transcript

This command will populate your directory with several index files that allow Bowtie to more easily align reads to the transcripts.

You can now map the reads to the transcript sequences using the following Bowtie command, which outputs in SAM (-S), allows for unlimited multi-mappings (-a), a maximum insert distance of 800 bp between the paired ends (-X 800), and 3 mismatches (-v 3). The first three options (-a, -S, -X) are highly recommended for best results. You should also allow for many mismatches, since eXpress models sequencing errors. Furthermore, you will want to take advantage of multiple processors when mapping large files using the -p option. See the Bowtie manual for more details on the various parameters and options.

2. The SAM output from Bowtie is piped into SAMtools in order to compress it to BAM format. This conversion is optional, but will greatly reduce the size of the alignment file.

$ bowtie -aS -X 800 --offrate 1 -v 3 transcript -1 reads_1.fastq -2 reads_2.fastq | samtools view -Sb - > hits.bam

3. Once you have aligned your reads to the transcriptome and stored them in a SAM or BAM file, you can run eXpress in default mode with the command:

$ module load express
$ express transcripts.fasta hits.bam
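When eXpress finishes, it writes its abundance estimates to results.xprs in the working directory. Here is a minimal sketch for ranking targets by estimated counts; the helper name is this page's invention, and the column positions (2 = target_id, 7 = est_counts, 11 = fpkm) are assumptions taken from the eXpress manual's description of results.xprs, so verify them against your version's header.

```shell
# top_targets FILE: print target_id, est_counts, fpkm for the most
# abundant targets in an eXpress results.xprs file.
# Column positions (2, 7, 11) are assumptions from the eXpress manual.
top_targets() {
    # skip the header row, sort numerically by estimated counts
    # (descending), and keep target_id, est_counts, fpkm
    tail -n +2 "$1" | sort -t "$(printf '\t')" -k7,7nr | cut -f2,7,11 | head
}
```

For the sample data above, this would be run as 'top_targets results.xprs' from the directory where eXpress was invoked.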

4. If you do not wish to store an intermediate SAM/BAM file, you can pipe the Bowtie output directly into eXpress with the command:

$ module load bowtie/0.12.8 samtools express
$ bowtie -aS -X 800 --offrate 1 -v 3 transcript -1 reads_1.fastq -2 reads_2.fastq | express transcripts.fasta

Submitting a single eXpress batch job

Based on the example above, create a script file:

#!/bin/bash
# This file is runExpress
#
#PBS -N express
#PBS -m be
#PBS -k oe

module load bowtie/0.12.8  samtools  express
cd /data/user/express
cp -rp /usr/local/express/sample_data .
cd sample_data
bowtie -aS -X 800 --offrate 1 -v 3 transcript -1 reads_1.fastq -2 reads_2.fastq | express transcripts.fasta

Submit the script using the 'qsub' command on Biowulf. In this example, the job is submitted to a g8 node. You can also type 'freen' on the Biowulf head node to see the available node types and choose one based on your needs:

qsub -l nodes=1:g8 /data/username/runExpress

Submitting a swarm of eXpress jobs

Using the 'swarm' utility, you can submit many jobs to the cluster to run concurrently. Put the input files for each job in a separate directory.

Set up a swarm command file (e.g. /data/username/cmdfile). Here is a sample file:

module load bowtie/0.12.8 samtools express ; cd /data/user/express/run1; \
bowtie -aS -X 800 --offrate 1 -v 3 transcript -1 reads_1.fastq -2 reads_2.fastq | \
express transcripts.fasta
module load bowtie/0.12.8 samtools express ; cd /data/user/express/run2; \
bowtie -aS -X 800 --offrate 1 -v 3 transcript -1 reads_1.fastq -2 reads_2.fastq | \
express transcripts.fasta
.......
.......
module load bowtie/0.12.8 samtools express ; cd /data/user/express/run20; \
bowtie -aS -X 800 --offrate 1 -v 3 transcript -1 reads_1.fastq -2 reads_2.fastq | \
express transcripts.fasta

Submit this with

swarm -f cmdfile
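For larger swarms, the near-identical lines above can be generated with a short loop instead of typed by hand. Here is a sketch assuming the run1 ... run20 directory layout from the sample file; gen_cmdfile is a hypothetical helper, not part of swarm.

```shell
# Hypothetical helper: emit one swarm command line per run directory.
# The /data/user/express/runN layout matches the sample file above --
# adjust the paths to your own data.
gen_cmdfile() {
    i=1
    while [ "$i" -le "$1" ]; do
        printf 'module load bowtie/0.12.8 samtools express ; cd /data/user/express/run%d; bowtie -aS -X 800 --offrate 1 -v 3 transcript -1 reads_1.fastq -2 reads_2.fastq | express transcripts.fasta\n' "$i"
        i=$((i + 1))
    done
}
gen_cmdfile 20 > cmdfile
```

The resulting cmdfile can then be submitted with 'swarm -f cmdfile' as above.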

By default, each line of the command file above will be executed on 1 processor and may use a maximum of 1 GB of memory. Since bowtie can run multi-threaded (i.e. use more than 1 core on a node), you may want to run it in multi-threaded mode. For example, to have bowtie use 8 cores on a node, add the '-p 8' parameter to each bowtie command in the swarm command file above, and submit the swarm with

swarm -t 8 -f cmdfile
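If a command file already exists, the '-p 8' flag can be patched in with a one-line substitution. A sketch follows; the add_threads helper is hypothetical and assumes the 'bowtie -aS ...' form used in the sample command file above.

```shell
# Hypothetical helper: add_threads N FILE prints FILE with '-p N' added
# to each bowtie invocation (assumes the 'bowtie -aS ...' form used in
# the sample command file above).
add_threads() {
    sed "s/bowtie -aS/bowtie -p $1 -aS/" "$2"
}
```

For example, 'add_threads 8 cmdfile > cmdfile.p8' produces a command file suitable for submission with 'swarm -t 8 -f cmdfile.p8'.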

If each line of the command file above needs more than 1 GB of memory, you should submit the swarm with the -g # flag, where # represents the number of GB of memory needed. For example, if each command requires 3 GB of memory:

swarm -t 8 -g 3 -f cmdfile

For more information regarding running swarm, see swarm.html

Submit an Interactive eXpress Job
Interactive jobs should not be run on the Biowulf head node (login node). Instead, you should allocate an interactive node and run the commands on that node. Example:

$ qsub -I -l nodes=1:g8  (allocate a node with 8 GB of memory)

$ qsub -I -l nodes=1:g24:c16  (allocate a node with 24 GB of memory and 16 cores)


$ module load bowtie/0.12.8 samtools express  (set up the environment for the programs)
$ cd /data/user/express
$ cp -rp /usr/local/express/sample_data .
$ cd sample_data
$ bowtie -aS -X 800 --offrate 1 -v 3 transcript -1 reads_1.fastq -2 reads_2.fastq | express transcripts.fasta

Documentation

http://bio.math.berkeley.edu/eXpress/manual.html