Biowulf at the NIH
SRA-Toolkit on Biowulf

The NCBI SRA SDK generates loading and dumping tools with their respective libraries for building new and accessing existing runs.

NOTE: Most of the tools in the SRA-Toolkit require internet access to NCBI. Because of this, it is not possible to run these tools on the Biowulf cluster, which is not open to the internet.

There are multiple versions of the SRA-Toolkit available. An easy way of selecting the version is to use modules. To see the modules available, type

module avail sratoolkit

To select a module, type

module load sratoolkit/[ver]

where [ver] is the version of choice. This will set your $PATH variable, as well as $SRATOOLKITHOME, a variable that is required for initializing your local repository.
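
Before initializing a repository, it can help to confirm that the module really did set $SRATOOLKITHOME in the current shell. The following is a minimal sketch; the function name check_sra_env is hypothetical and is not part of the toolkit or the modules system.

```shell
# Hypothetical helper: confirm that 'module load sratoolkit/[ver]' set SRATOOLKITHOME.
check_sra_env() {
    if [ -z "${SRATOOLKITHOME:-}" ]; then
        # Variable is unset or empty: the module was probably not loaded in this shell.
        echo "SRATOOLKITHOME is not set; load a sratoolkit module first" >&2
        return 1
    fi
    echo "SRATOOLKITHOME=$SRATOOLKITHOME"
}
```

After loading a module, calling check_sra_env should print the variable's value. A non-zero return status suggests the module was not loaded in the current shell.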

Initializing a Local Repository

If you are using the SRA Toolkit for the first time, you will need to set up a local repository directory.

For versions 2.3 and above, you will need an X11 connection. Type:

java -jar $SRATOOLKITHOME/preview/sratoolkit.jar

and follow the directions for setting up a local repository.

For versions earlier than 2.3, type:

configuration-assistant.perl

and again follow the directions.
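
The version rule above (2.3 and later use the Java tool, earlier versions use the Perl script) can be sketched as a small shell function that prints which setup command applies. The function name sra_setup_command is hypothetical, and the sketch assumes version strings of the form major.minor[.patch]; it only prints the command and does not run it.

```shell
# Hypothetical helper: print the repository setup command for a given toolkit version.
sra_setup_command() {
    ver=$1
    # Treat a bare major version such as "2" as "2.0".
    case $ver in
        *.*) ;;
        *) ver="$ver.0" ;;
    esac
    major=${ver%%.*}     # text before the first dot
    rest=${ver#*.}       # text after the first dot
    minor=${rest%%.*}    # text up to the next dot, if any
    if [ "$major" -gt 2 ] || { [ "$major" -eq 2 ] && [ "$minor" -ge 3 ]; }; then
        # Single quotes keep $SRATOOLKITHOME literal in the printed command.
        echo 'java -jar $SRATOOLKITHOME/preview/sratoolkit.jar'
    else
        echo 'configuration-assistant.perl'
    fi
}
```

For example, sra_setup_command 2.1.9 prints the Perl script name, while sra_setup_command 2.3.3 prints the java command.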

Submitting a single batch job

1. Create a script file containing lines similar to those below. Modify the paths to match your own setup before running.

#!/bin/bash
# This file is YourOwnFileName
#
#PBS -N yourownfilename
#PBS -m be
#PBS -k oe

module load sratoolkit

cd /data/user/somewhereWithInputFile
fastq-dump some.csra
sam-dump some.csra > my_sam.sam
....
....

2. Submit the script with the 'qsub' command on Biowulf:

$ qsub -l nodes=1 /data/$USER/theScriptFileAbove

Submitting a swarm of jobs

Using the 'swarm' utility, one can submit many jobs to the cluster to run concurrently.

Set up a swarm command file (e.g., /data/username/cmdfile). Here is a sample file:

fastq-dump --aligned --table PRIMARY_ALIGNMENT -O /data/$USER/mydir some.csra
fastq-dump --aligned --table SEQUENCE -O /data/$USER/mydir2 some.csra
[....]

This swarm command file can be submitted with:

$ swarm -f cmdfile --module sratoolkit

Submitted this way, swarm uses the default of 1 GB of memory per process, where a process is a single line in the command file above.

For more information regarding running swarm, see swarm.html
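
When there are many input files, the command file can be generated rather than typed by hand. The sketch below writes one fastq-dump line per .csra file in a directory. The function name make_swarm_cmdfile and its arguments are hypothetical, and it assumes file names without spaces, because each line of the command file is later run as a shell command.

```shell
# Hypothetical helper: write one fastq-dump command per .csra file into a swarm command file.
make_swarm_cmdfile() {
    indir=$1     # directory holding the input .csra files
    outdir=$2    # directory passed to fastq-dump -O
    cmdfile=$3   # swarm command file to create
    : > "$cmdfile"                   # start from an empty file
    for f in "$indir"/*.csra; do
        [ -e "$f" ] || continue      # skip the unexpanded pattern when nothing matches
        printf 'fastq-dump -O %s %s\n' "$outdir" "$f" >> "$cmdfile"
    done
}
```

A call such as make_swarm_cmdfile /data/$USER/myruns /data/$USER/mydir cmdfile would produce a file that can then be passed to swarm with -f as shown above.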

Running an interactive job

Users may sometimes need to run jobs interactively. Such jobs should not be run on the Biowulf login node. Instead, allocate an interactive node as described below and run the interactive job there.

$ qsub -I -l nodes=1
qsub: waiting for job 2236960.biobos to start
qsub: job 2236960.biobos ready 
$ cd /data/$USER/myruns
$ module load sratoolkit
$ cd /data/$USER/run1
$ fastq-dump .... 
$ illumina-dump ....
$ exit
qsub: job 2236960.biobos completed
$

If you want a specific type of node (e.g., one with 8 GB of memory), you can specify that on the qsub command line, for example:

$ qsub -I -l nodes=1:g8
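
If the node type varies between sessions, the resource string can be assembled by a small function. This is only a sketch: the function name qsub_nodes_arg is hypothetical, and it simply builds the nodes=1 or nodes=1:TYPE value used in the commands above without checking that the node type exists on the cluster.

```shell
# Hypothetical helper: build the value for 'qsub -l', with an optional node type.
qsub_nodes_arg() {
    nodetype=${1:-}    # optional node property such as g8
    if [ -n "$nodetype" ]; then
        echo "nodes=1:$nodetype"
    else
        echo "nodes=1"
    fi
}
```

It could be used as qsub -I -l "$(qsub_nodes_arg g8)" for an 8 GB node, or qsub -I -l "$(qsub_nodes_arg)" for any node.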

Documentation

SRA toolkit documentation at NCBI