Biotoolbox on Biowulf

Biotoolbox is a collection of Perl scripts that utilize BioPerl modules for bioinformatics analysis. Tools are included for processing microarray data and next generation sequencing data, converting data file formats, querying datasets, and performing general high-level analysis of datasets.

This toolbox of programs relies on storing genome annotation, microarray, and next generation sequencing data in local BioPerl databases, allowing data retrieval relative to any annotated feature in the database. While referencing genomic annotation and features from a database is convenient, it is not required; simple BED-style input files are also supported for data collection.
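
For example, a command along the following lines collects read counts over the regions in a BED file rather than over database features. This is a minimal sketch modeled on the get_datasets.pl example in the batch script below; the --in option and exact flags may differ between Biotoolbox versions, so check 'get_datasets.pl --help' for the current usage.

get_datasets.pl --in my_regions.bed --data /path/to/my/data.bam --method sum --value count --out region_count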

The environment variables need to be set properly first. The easiest way to do this is with the module commands, as in the example below.

biowulf% module avail biotoolbox

---------- /usr/local/Modules/3.2.9/modulefiles --------------------
biotoolbox/1.8.0 biotoolbox/1.8.6 biotoolbox/1.9.4

biowulf% module load biotoolbox

biowulf% module list
Currently Loaded Modulefiles:
  1) biotoolbox/1.9.4

Submitting a Single Batch Job

1. Create a batch script along the lines of the one below:

#!/bin/bash
# This file is FileName
#
#PBS -N RunName
#PBS -m be
#PBS -k oe

module load biotoolbox

cd /data/user/somewhereWithInputfile

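# Sum read counts from data.bam over each gene in the hg19 annotation database,
# then make an RPM-normalized wig coverage track and convert the paired-end
# BAM alignments to BED format.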
get_datasets.pl --db hg19 --feature gene --data /path/to/my/data.bam --method sum --value count --out gene_count
bam2wig.pl --rpm --in data.bam
bam2gff_bed.pl --bed --pe --in data.bam

2. Submit the script using the 'qsub' command on Biowulf.

qsub -l nodes=1 /data/username/theScriptFileAbove

This will submit the job to a node with 1 GB of memory and at least 2 cores. If your Biotoolbox commands require more than 1 GB of memory, you can specify a node with more memory. For example, suppose you need 10 GB of memory. Use 'freen' to see the available node types; the smallest-memory nodes with at least 10 GB of memory are the g24 (24 GB RAM) nodes. Submit with:

qsub -l nodes=1:g24 /data/username/theScriptFileAbove

Submitting a Swarm of Jobs

Using the 'swarm' utility, one can submit many jobs to the cluster to run concurrently.

Set up a swarm command file (e.g. /data/username/cmdfile). Here is a sample file:

bam2gff_bed.pl --bed --pe --in data1.bam
bam2gff_bed.pl --bed --pe --in data2.bam
bam2gff_bed.pl --bed --pe --in data3.bam
[...]

Submit this swarm with:

swarm -f /data/username/cmdfile --module biotoolbox

By default, each command line in the file above is executed on one processor core of a node and uses 1 GB of memory. If each Biotoolbox command requires more than 1 GB of memory, you can specify the required memory with the '-g' flag. For example, if each command requires 10 GB of memory, submit with:

swarm -g 10 -f /data/username/cmdfile --module biotoolbox

For more information about running swarm, see swarm.html


Running an Interactive Job

Users may sometimes need to run jobs interactively. Such jobs should not be run on the Biowulf login node. Instead, allocate an interactive node as described below and run the interactive job there.

biowulf% qsub -I -l nodes=1
qsub: waiting for job 2236960.biobos to start
qsub: job 2236960.biobos ready

[user@p4]$ cd /data/user/myruns
[user@p4]$ module load biotoolbox
[user@p4]$ cd /data/user/somewhereWithInputfile
[user@p4]$ get_datasets.pl --db hg19 --feature gene --data /path/to/my/data.bam --method sum --value count --out gene_count
[user@p4]$ bam2wig.pl --rpm --in data.bam
[user@p4]$ bam2gff_bed.pl --bed --pe --in data.bam
[user@p4]$ exit
qsub: job 2236960.biobos completed
[user@biowulf ~]$

Users may add a node property to the qsub command to request a specific kind of interactive node. For example, if you need a node with 8 GB of memory to run a job interactively, do this:

biowulf% qsub -I -l nodes=1:g8

Documentation

http://code.google.com/p/biotoolbox/