The NCBI SRA SDK generates loading and dumping tools with their respective libraries for building new and accessing existing runs.
NOTE: Most of the tools in the SRA-Toolkit require internet access to NCBI. Because of this, it is not possible to run these tools on the Biowulf cluster, which is not open to the internet.
There are multiple versions of the SRA-Toolkit available. An easy way of selecting the version is to use modules. To see the modules available, type
module avail sratoolkit
To select a module, type
module load sratoolkit/[ver]
where [ver] is the version of choice. This will set your $PATH variable, as well as $SRATOOLKITHOME, a variable that is required for initializing your local repository.
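To confirm that the module took effect, it can help to check the two things it is described as setting. The following is a minimal sketch only: check_sratoolkit_env is a hypothetical helper name, not something shipped with the toolkit or the module system, and it assumes fastq-dump is among the tools the module places on $PATH.

```shell
#!/bin/bash
# Sketch: verify that "module load sratoolkit/[ver]" took effect in this shell.
# check_sratoolkit_env is a hypothetical helper, not part of the SRA Toolkit.
check_sratoolkit_env() {
    # The module is expected to export SRATOOLKITHOME.
    if [ -z "${SRATOOLKITHOME:-}" ]; then
        echo "SRATOOLKITHOME is not set; load a sratoolkit module first" >&2
        return 1
    fi
    # The module is also expected to put the toolkit binaries on PATH.
    if ! command -v fastq-dump >/dev/null 2>&1; then
        echo "fastq-dump is not on PATH" >&2
        return 1
    fi
    echo "SRA Toolkit found under $SRATOOLKITHOME"
}
```

Calling check_sratoolkit_env right after module load gives an early, readable failure instead of a "command not found" error later in a job.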
If you are using the SRA Toolkit for the first time, you will need to set up a local repository directory.
For versions 2.3 and above, you will need an X11 connection. Type:
java -jar $SRATOOLKITHOME/preview/sratoolkit.jar
and follow the directions for setting up a local repository.
For versions earlier than 2.3, type:
configuration-assistant.perl
and again follow the directions.
1. Create a script file containing lines similar to those below. Change the directory and file names to match your own data before running.
#!/bin/bash
# This file is YourOwnFileName
#
#PBS -N yourownfilename
#PBS -m be
#PBS -k oe

module load sratoolkit
cd /data/user/somewhereWithInputFile
fastq-dump some.csra
sam-dump some.csra > my_sam.sam
....
....
2. Submit the script using the 'qsub' command on Biowulf
$ qsub -l nodes=1 /data/$USER/theScriptFileAbove
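Because a batch job runs unattended, a missing input file is easy to overlook until the job has already finished. As a rough sketch, the two dump commands from the script above could be wrapped with a basic check. Here dump_run is a hypothetical helper name and the paths in the example call are placeholders; only fastq-dump with its -O (output directory) option and sam-dump writing to standard output are taken from the toolkit itself.

```shell
#!/bin/bash
# Sketch of a batch-script body with basic error checking.
# dump_run is a hypothetical helper, not part of the SRA Toolkit.
dump_run() {
    local input="$1" outdir="$2"
    if [ ! -f "$input" ]; then
        echo "input file not found: $input" >&2
        return 1
    fi
    mkdir -p "$outdir" || return 1
    local base
    base="$(basename "$input" .csra)"
    # FASTQ files are written into $outdir; SAM output is redirected from stdout.
    fastq-dump -O "$outdir" "$input" || return 1
    sam-dump "$input" > "$outdir/$base.sam" || return 1
}

# Example call inside the script submitted with qsub (placeholder paths):
# dump_run /data/$USER/somewhereWithInputFile/some.csra /data/$USER/output
```

The function returns a non-zero status as soon as a step fails, so the script can stop instead of running sam-dump on an input that fastq-dump could not read.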
Using the 'swarm' utility, one can submit many jobs to the cluster to run concurrently.
Set up a swarm command file (e.g. /data/username/cmdfile). Here is a sample file:
fastq-dump --aligned --table PRIMARY_ALIGNMENT -O /data/$USER/mydir
fastq-dump --aligned --table SEQUENCE -O /data/$USER/mydir2
[....]
This swarm command file can be submitted with:
$ swarm -f cmdfile --module sratoolkit
With this command, swarm uses the default of 1 GB of memory per process, where each line of the command file above is one process.
For more information regarding running swarm, see swarm.html
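When there are many input files, a swarm command file like the sample above can be generated rather than typed by hand. The sketch below is only illustrative: make_swarm_cmdfile and its three arguments are hypothetical names, the paths in the example call are placeholders, and it writes a plain fastq-dump -O line per .csra file rather than the --aligned/--table variants shown in the sample.

```shell
#!/bin/bash
# Sketch: write one fastq-dump command per .csra file into a swarm command file.
# make_swarm_cmdfile and its arguments are hypothetical names for illustration.
make_swarm_cmdfile() {
    local input_dir="$1" out_dir="$2" cmdfile="$3"
    local f
    : > "$cmdfile"                  # create or truncate the command file
    for f in "$input_dir"/*.csra; do
        [ -e "$f" ] || continue     # skip the unexpanded pattern when nothing matches
        printf 'fastq-dump -O %s %s\n' "$out_dir" "$f" >> "$cmdfile"
    done
}

# Example (placeholder paths):
# make_swarm_cmdfile /data/$USER/myruns /data/$USER/mydir /data/$USER/cmdfile
```

Each generated line is an independent process as far as swarm is concerned, so the resulting file can be submitted with the same swarm -f command shown above.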
Users may sometimes need to run jobs interactively. Such jobs should not be run on the Biowulf login node. Instead, allocate an interactive node as described below and run the interactive job there.
$ qsub -I -l nodes=1
qsub: waiting for job 2236960.biobos to start
qsub: job 2236960.biobos ready
$ cd /data/$USER/myruns
$ module load sratoolkit
$ cd /data/$USER/run1
$ fastq-dump .... # illumina-dump....
$ exit
qsub: job 2236960.biobos completed
$
If you want a specific type of node (e.g. one with 8 GB of memory), you can specify that on the qsub command line. For example:
$ qsub -I -l nodes=1:g8
SRA toolkit documentation at NCBI


