SRA-Toolkit on Biowulf

The NCBI SRA SDK provides loading and dumping tools, with their respective libraries, for building new runs and accessing existing ones.

NOTE: Most of the tools in the SRA-Toolkit require internet access to NCBI. Because of this, these tools cannot be used to download archives or dbGaP data on the Biowulf cluster, which is not open to the internet.

There are multiple versions of the SRA-Toolkit available. An easy way of selecting the version is to use modules. To see the modules available, type

module avail sratoolkit

To select a module, type

module load sratoolkit/[ver]

where [ver] is the version of choice. This will set your $PATH variable, as well as $SRATOOLKITHOME, a variable that is required for initializing your local repository.
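
For example, to load a particular version and confirm that the environment is set (the version number below is only an illustration; use one listed by 'module avail'):

$ module load sratoolkit/2.3.4
$ echo $SRATOOLKITHOME
$ which fastq-dump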

Initializing a Local Repository

If you are using the SRA Toolkit for the first time, you will need to set up a local repository directory.

For versions 2.3 and above, you will need an X11 connection. Type:

java -jar $SRATOOLKITHOME/bin/sratoolkit.jar

and follow the directions for setting up a local repository.

For versions earlier than 2.3, type:

configuration-assistant.perl

and again follow the directions.
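
For versions 2.3 and above, the graphical configuration tool needs an X11 connection; one way to obtain one is to log in with X11 forwarding enabled, e.g. (username shown is a placeholder, and Helix is just one example of a system to log in to):

$ ssh -X username@helix.nih.gov
$ module load sratoolkit
$ java -jar $SRATOOLKITHOME/bin/sratoolkit.jar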

Dealing with encryption keys

When working with dbGaP files, which are encrypted, you will need to either modify your local SRA Toolkit configuration file or set the VDB_PWFILE environment variable.

By default, your SRA Toolkit configuration file is located in your home directory:

~/.ncbi/user-settings.mkfg

If you know the path to your encryption key file, you can append the corresponding setting to your configuration file:

echo '/krypto/pwfile = /path/to/your/encryption/key/file' >> ~/.ncbi/user-settings.mkfg

OR you can add the following line to your ~/.bashrc file:

export VDB_PWFILE=/path/to/your/encryption/key/file
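
After editing ~/.bashrc, start a new shell or source the file, then confirm that the variable points at your key file (the path below is just the placeholder from above):

$ source ~/.bashrc
$ echo $VDB_PWFILE
/path/to/your/encryption/key/file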

Once one of these steps has been taken, you can work with dbGaP-encrypted files:

fastq-dump --split-3 --gzip SRR123456789.sra.ncbi_enc

If you have not configured your encryption file correctly, you may see an error like this:

2014-04-03T19:35:06 fastq-dump.2.3.4 err: file descriptor invalid while
constructing file within file system module - log failure:

Filling Your Local Repository

As noted above, THE BIOWULF CLUSTER DOES NOT HAVE INTERNET CONNECTIVITY. This means that SRA files must be prefetched to your local repository prior to any further work on the Biowulf cluster. To do so, first initialize your repository (see above), and then use prefetch to download the SRA file on either Helix or the Biowulf login node.

[helix]$ prefetch SRR390728
Maximum file size download limit is 20,971,520KB

2014-03-24T13:44:54 prefetch.2.3.4: 1) Downloading 'SRR390728'...
2014-03-24T13:44:54 prefetch.2.3.4:  Downloading via http...
2014-03-24T13:44:56 prefetch.2.3.4: 1) 'SRR390728' was downloaded successfully
2014-03-24T13:45:00 prefetch.2.3.4: 'SRR390728' has 0 unresolved dependencies
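
The downloaded run is placed in your local repository rather than in the current directory. Assuming you accepted the default repository location when initializing it, you can check that the file arrived with something like:

$ ls ~/ncbi/public/sra/
SRR390728.sra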

Submitting a single batch job

1. Create a script file. The file should contain lines similar to those below; modify the paths as needed before running.

#!/bin/bash
# This file is YourOwnFileName
#
#PBS -N yourownfilename
#PBS -m be
#PBS -k oe

module load sratoolkit

cd /data/user/somewhereWithInputFile
fastq-dump some.sra
sam-dump some.sra > my_sam.sam
....
....

2. Submit the script using the 'qsub' command on Biowulf:

$ qsub -l nodes=1 /data/$USER/theScriptFileAbove
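
Once the job has been submitted, its status can be checked with the standard batch-system command qstat, e.g.:

$ qstat -u $USER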

Submitting a swarm of jobs

Please Note:

The SRA Toolkit executables use random access to read input files. Because of this, users with data located on GPFS filesystems will see significant slowdowns in their jobs. It is best to first copy the input files to a local /scratch directory, work on the data in /scratch, and copy the results back at the end of the job.

Using the 'swarm' utility, one can submit many jobs to the cluster to run concurrently.

Set up a swarm command file (e.g. /data/username/cmdfile). Here is a sample file:
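
The sketch below follows the copy-to-/scratch pattern recommended above; the accession numbers, input paths, and output directory are placeholders for your own. Each line is one independent job:

cd /scratch; cp /data/username/sra/SRR390728.sra .; fastq-dump --split-3 --gzip SRR390728.sra; cp SRR390728*.fastq.gz /data/username/output/
cd /scratch; cp /data/username/sra/SRR390729.sra .; fastq-dump --split-3 --gzip SRR390729.sra; cp SRR390729*.fastq.gz /data/username/output/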

To ensure enough disk space, clearscratch should be run at the beginning of the swarm using the --prologue option. This swarm command file can be submitted with:

$ swarm -f cmdfile --module sratoolkit --prologue 'clearscratch' -g 4 

With this command, swarm will allocate 4 GB of memory per process.

For more information regarding running swarm, see swarm.html

Running an interactive job

Sometimes, when results are not predictable, it is necessary to run jobs interactively. Such jobs should not be run on the Biowulf login node. Instead, allocate an interactive node as described below and run the interactive job there.

$ qsub -I -l nodes=1
qsub: waiting for job 2236960.biobos to start
qsub: job 2236960.biobos ready 
$ cd /data/$USER/myruns
$ module load sratoolkit
$ cd /data/$USER/run1
$ fastq-dump .... 
$ illumina-dump ....
$ exit
qsub: job 2236960.biobos completed
$

If you want a specific type of node (e.g. one with 8 GB of memory), you can specify that on the qsub command line, e.g.

$ qsub -I -l nodes=1:g8

Documentation

Type the command followed by '-h' to see its usage information.
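
For example:

$ fastq-dump -h
$ sam-dump -h
$ prefetch -h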