The NCBI SRA Toolkit provides loading and dumping tools, with their respective libraries, for building new runs and accessing existing ones.
NOTE: Most of the tools in the SRA-Toolkit require internet access to NCBI. Because of this, it is not possible to use the tools to download archives or dbGaP data on the Biowulf cluster, which is not open to the internet.
There are multiple versions of the SRA-Toolkit available. An easy way of selecting the version is to use modules. To see the modules available, type
module avail sratoolkit
To select a module, type
module load sratoolkit/[ver]
where [ver] is the version of choice. This will set your $PATH variable, as well as $SRATOOLKITHOME, a variable that is required for initializing your local repository.
If you are using the SRA Toolkit for the first time, you will need to set up a local repository directory.
For versions 2.3 and above, you will need an X11 connection. Type:
java -jar $SRATOOLKITHOME/bin/sratoolkit.jar
and follow the directions for setting up a local repository.
For versions earlier than 2.3, type:
and again follow the directions.
Dealing with encryption keys
When working with dbGaP files, which are encrypted, you will need to either modify your local SRA Toolkit configuration file or set the VDB_PWFILE environment variable.
By default, your SRA Toolkit configuration file is located in your home directory:
~/.ncbi/user-settings.mkfg
If you know the path to your encryption file, you can append it to your configuration file:
echo '/krypto/pwfile = /path/to/your/encryption/key/file' >> ~/.ncbi/user-settings.mkfg
OR you can add the following line to your ~/.bashrc file:
export VDB_PWFILE=/path/to/your/encryption/key/file
Once one of these steps has been taken, you can work with dbGaP-encrypted files:
fastq-dump --split-3 --gzip SRR123456789.sra.ncbi_enc
If you have not configured your encryption file correctly, you may see an error like this:
2014-04-03T19:35:06 fastq-dump.2.3.4 err: file descriptor invalid while constructing file within file system module - log failure:
As noted above, THE BIOWULF CLUSTER DOES NOT HAVE INTERNET CONNECTIVITY. This means that SRA files must be prefetched to your local repository prior to any further work on the Biowulf cluster. To do so, first initialize your repository (see above), and then use prefetch to download the SRA file on either Helix or the Biowulf login node.
[helix]$ prefetch SRR390728
Maximum file size download limit is 20,971,520KB
2014-03-24T13:44:54 prefetch.2.3.4: 1) Downloading 'SRR390728'...
2014-03-24T13:44:54 prefetch.2.3.4:  Downloading via http...
2014-03-24T13:44:56 prefetch.2.3.4: 1) 'SRR390728' was downloaded successfully
2014-03-24T13:45:00 prefetch.2.3.4: 'SRR390728' has 0 unresolved dependencies
1. Create a script file containing lines similar to those below. Modify the program and data paths before running.
#!/bin/bash
# This file is YourOwnFileName
#
#PBS -N yourownfilename
#PBS -m be
#PBS -k oe

module load sratoolkit
cd /data/user/somewhereWithInputFile
fastq-dump some.sra
sam-dump some.sra > my_sam.sam
....
....
2. Submit the script using the 'qsub' command on Biowulf
$ qsub -l nodes=1 /data/$USER/theScriptFileAbove
The SRA Toolkit executables use random access to read input files. Because of this, users with data located on GPFS filesystems will see significant slowdowns in their jobs. It is best to first copy the input files to a local /scratch directory, work on the data in /scratch, and copy the results back at the end of the job.
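As a sketch, the copy-work-copy pattern described above might look like the batch-script fragment below (some.sra and the /data paths are placeholders; adjust them to your own data layout):

```shell
#!/bin/bash
# Hypothetical batch-script fragment illustrating the /scratch staging
# pattern; some.sra and the /data paths are placeholders.
module load sratoolkit

# Stage the input to the node-local /scratch directory so that the
# toolkit's random-access reads do not hit the GPFS filesystem.
cd /scratch
cp /data/$USER/myruns/some.sra .

# Work on the local copy.
fastq-dump --split-3 --gzip some.sra

# Copy the results back to permanent storage at the end of the job.
cp *.fastq.gz /data/$USER/myruns/
```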
Using the 'swarm' utility, one can submit many jobs to the cluster to run concurrently.
Set up a swarm command file (eg /data/username/cmdfile). Here is a sample file:
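A hypothetical command file, using the /scratch staging pattern and placeholder file names, might look like:

```shell
# Hypothetical swarm command file (e.g. /data/username/cmdfile).
# Each line is one independent job; all file names are placeholders.
cd /scratch; cp /data/username/run1.sra .; fastq-dump --split-3 run1.sra; cp run1*.fastq /data/username/
cd /scratch; cp /data/username/run2.sra .; fastq-dump --split-3 run2.sra; cp run2*.fastq /data/username/
```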
To ensure enough disk space, clearscratch should be run at the beginning of the swarm job using the --prologue option. This swarm command file can be submitted with:
$ swarm -f cmdfile --module sratoolkit --prologue 'clearscratch' -g 4
With this command, the -g 4 flag tells swarm to allocate 4 GB of memory per process.
For more information regarding running swarm, see swarm.html
Sometimes, when the results are not predictable, it is necessary to run jobs interactively. Such jobs should not be run on the Biowulf login node. Instead, allocate an interactive node as described below, and run the interactive job there.
$ qsub -I -l nodes=1
qsub: waiting for job 2236960.biobos to start
qsub: job 2236960.biobos ready
$ cd /data/$USER/myruns
$ module load sratoolkit
$ cd /data/$USER/run1
$ fastq-dump ....
# illumina-dump....
$ exit
qsub: job 2236960.biobos completed
$
If you want a specific type of node (e.g. one with 8 GB of memory), you can specify that on the qsub command line. For example:
$ qsub -I -l nodes=1:g8
To see the help and usage information for any of the SRA-Toolkit commands, type the command followed by '-h' (e.g. fastq-dump -h).