Biowulf at the NIH
HTSeq on Biowulf

HTSeq is a Python package that provides infrastructure to process data from high-throughput sequencing assays. It is developed by Simon Anders at EMBL Heidelberg.

HTSeq needs the PATH and PYTHONPATH environment variables to be set correctly before it can be used. The easiest way to do this is with the module commands, as in the example below.

$ module avail htseq
---------------------- /usr/local/Modules/3.2.9/modulefiles --------------------------------
htseq/0.5.3p9(default)


$ module load htseq

$ module list
Currently Loaded Modulefiles:
1) htseq/0.5.3p9

$ module unload htseq

$ module show htseq
-------------------------------------------------------------------
/usr/local/Modules/3.2.9/modulefiles/htseq/0.5.3p9:

module-whatis    Sets up htseq 0.5.3p9
prepend-path     PYTHONPATH /usr/local/python-2.7/lib
prepend-path     PATH /usr/local/python-2.7/bin
-------------------------------------------------------------------
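
After loading the module, a short Python check can confirm that the interpreter on your PATH is able to import the package. This is only a minimal sketch: the helper names `module_available` and `report` are illustrative and are not part of HTSeq or the modules system.

```python
import os
import sys


def module_available(name):
    """Return True if `name` can be imported by this interpreter."""
    try:
        __import__(name)
    except ImportError:
        return False
    return True


def report(name="HTSeq"):
    """Print the running interpreter, PYTHONPATH, and whether `name` imports."""
    print("interpreter: %s" % sys.executable)
    print("PYTHONPATH:  %s" % os.environ.get("PYTHONPATH", "(not set)"))
    found = module_available(name)
    print("%s: %s" % (name, "found" if found else "NOT found"))
    return found
```

Calling `report()` after `module load htseq` should print "HTSeq: found"; if it does not, the module is probably not loaded in the current shell.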

Example files can be downloaded from http://www-huber.embl.de/users/anders/HTSeq/HTSeq_example_data.tgz.

For more detailed command examples, see http://www-huber.embl.de/users/anders/HTSeq/doc/tour.html#tour

Running a batch job

First, create two scripts for each sample you are going to process.

The first file is the Python script that you plan to use with HTSeq, along the following lines:

#!/usr/local/Python/2.7.2/bin/python
import HTSeq
fastq_file = HTSeq.FastqReader( "yeast_RNASeq_excerpt_sequence.txt", "solexa" )
# other HTSeq command1
# other HTSeq command2
# .....
# .....

Let's just call this Python script /data/$USER/htseq/run1/htseq1.py
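
To give an idea of what the "other HTSeq commands" might look like, here is a hedged sketch that averages the base-call quality at each read position, in the spirit of the HTSeq tour. It relies on the reads yielded by `HTSeq.FastqReader` having a `qual` attribute holding integer quality scores; the function names `mean_quality_by_position` and `main` are illustrative and not part of HTSeq.

```python
def mean_quality_by_position(reads):
    """Average the quality scores at each read position.

    `reads` is any iterable of objects with a `qual` sequence of integers,
    such as the reads yielded by HTSeq.FastqReader.
    """
    sums = []
    counts = []
    for read in reads:
        for pos, score in enumerate(read.qual):
            if pos == len(sums):
                # First read long enough to reach this position.
                sums.append(0)
                counts.append(0)
            sums[pos] += int(score)
            counts[pos] += 1
    return [s / float(c) for s, c in zip(sums, counts)]


def main(path="yeast_RNASeq_excerpt_sequence.txt"):
    # Imported here so the helper above can be used without HTSeq installed.
    import HTSeq

    fastq_file = HTSeq.FastqReader(path, "solexa")
    for pos, mean in enumerate(mean_quality_by_position(fastq_file)):
        print("%d\t%.2f" % (pos, mean))
```

Adding a call to `main()` at the end of the script would print one tab-separated line per read position.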

The second script will call this Python script and looks something like this:

#!/bin/bash
# This file is runHTSeq1
#PBS -N HTSeq
#PBS -m be
#PBS -k oe

module load htseq

cd /data/$USER/htseq/run1
/data/$USER/htseq/run1/htseq1.py

Once the two files are ready, submit the second file from biowulf:

biowulf> $ qsub -l nodes=1:g24:c16 /data/$USER/htseq/run1/runHTSeq1

This job was submitted to a g24 node. Use 'freen' to see other available node types.
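
Because every sample needs its own wrapper script, it can be convenient to generate the wrappers rather than write them by hand. The sketch below assumes that samples follow the run1/htseq1.py/runHTSeq1 naming pattern used above, numbered consecutively; that convention and the helper name `make_wrappers` are assumptions for illustration only.

```python
WRAPPER_TEMPLATE = """#!/bin/bash
# This file is {wrapper_name}
#PBS -N HTSeq
#PBS -m be
#PBS -k oe

module load htseq

cd {run_dir}
{run_dir}/{script_name}
"""


def make_wrappers(base_dir, n_runs):
    """Return {wrapper path: wrapper text} for run1 .. run<n_runs>.

    `base_dir` may contain $USER; it is left for the shell to expand.
    """
    wrappers = {}
    for i in range(1, n_runs + 1):
        run_dir = "%s/run%d" % (base_dir, i)
        wrapper_name = "runHTSeq%d" % i
        wrappers["%s/%s" % (run_dir, wrapper_name)] = WRAPPER_TEMPLATE.format(
            wrapper_name=wrapper_name,
            run_dir=run_dir,
            script_name="htseq%d.py" % i,
        )
    return wrappers
```

The function only builds the text; writing each entry to disk (after expanding $USER, for example with `os.path.expandvars`) and submitting it with qsub is left to the caller.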

Running an interactive job

Users may need to run jobs interactively sometimes. Such jobs should not be run on the Biowulf login node. Instead allocate an interactive node as described below, and run the interactive job there.


[user@biowulf] $ qsub -I -l nodes=1
qsub: waiting for job 2236960.biobos to start
qsub: job 2236960.biobos ready

[user@p4]$ module load htseq
        
[user@p4]$ cd /data/$USER/mydirectory

[user@p4]$ python
Python 2.7.2 (default, Aug 15 2011, 13:51:43) 
[GCC 4.1.2 20080704 (Red Hat 4.1.2-51)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import HTSeq
>>> fastq_file = HTSeq.FastqReader( "yeast_RNASeq_excerpt_sequence.txt", "solexa" )
>>> [...etc...]
>>> quit()

[user@p4]$ exit
qsub: job 2236960.biobos completed
[user@biowulf]$ 

Users may add a node property to the qsub command to request a specific type of interactive node. For example, if you need a node with 24 GB of memory to run a job interactively, do this:

[user@biowulf]$ qsub -I -l nodes=1:g24:c16

 

Running a swarm job

Using the 'swarm' utility, one can submit many jobs to the cluster to run concurrently.

Set up a swarm command file (eg /data/$USER/cmdfile). Here is a sample file:

cd /data/$USER/Dir1; htseq-count [options] sam_file gff_file
cd /data/$USER/Dir2; htseq-count [options] sam_file gff_file
[.....]
cd /data/$USER/Dir15; htseq-count [options] sam_file gff_file
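
A command file with this regular shape can also be generated instead of typed. The following is a minimal sketch that builds one line per directory; the helper names `swarm_line` and `swarm_file_text` are illustrative, and `sam_file`, `gff_file` and the options string are the same placeholders used in the sample above.

```python
def swarm_line(directory, sam_file, gff_file, options=""):
    """Build a single swarm command line for one sample directory."""
    parts = ["htseq-count"]
    if options:
        parts.append(options)
    parts.extend([sam_file, gff_file])
    return "cd %s; %s" % (directory, " ".join(parts))


def swarm_file_text(directories, sam_file, gff_file, options=""):
    """Build the full command file: one line per directory."""
    lines = [swarm_line(d, sam_file, gff_file, options) for d in directories]
    return "\n".join(lines) + "\n"


# Directories Dir1 .. Dir15, as in the sample command file.
directories = ["/data/$USER/Dir%d" % i for i in range(1, 16)]
command_file_text = swarm_file_text(directories, "sam_file", "gff_file")
```

Writing `command_file_text` to the command file then gives swarm one independent command per line.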

The '-f' and '--module' options for swarm are required.

By default, each line of the command file above will be executed on 1 processor core of a node and use 1 GB of memory. If this is not what you want, you will need to specify the '-g' flag when you submit the job on biowulf. For example, if each command above needs 10 GB of memory instead of the default 1 GB, tell swarm by including the '-g 10' flag:

biowulf> $ swarm -g 10 -f /data/$USER/cmdfile --module htseq

For more information on running swarm, see swarm.html.

 

Documentation

HTSeq documentation at embl.de