High-throughput BLAST on Biowulf
[Current Blast database update status]
BLAST on Biowulf is intended for running a large number of sequence files,
such as hundreds or thousands of query sequences, against the Blast databases.
If you have just a few query sequences, you should use Blast on the
NCBI website or on Helix.
Please contact the Helix Systems staff staff@helix.nih.gov,
or 4-6248) if you have questions about your Blast jobs.
[Detailed instructions] [Benchmarks]
[Blast & Biowulf node memory] [Blast
documentation]
EasyBlast: An easy interface to Blast on Biowulf
The 'easyblast' program on Biowulf simplifies submission of large Blast jobs. You need to
put all your query sequences into a directory, and then type 'easyblast' at the
Biowulf prompt. You will be prompted for all required parameters. The script will
then decide what kind of nodes you need (based on the database you are BLASTing
against), and submit your job to as many nodes as are available (max 32).
Sample session (user input is in bold):
biobos% easyblast
EasyBlast: Blast for large numbers of sequences
Enter the directory which contains your input sequences: /data/username/blast/seqs
Enter the directory where you want your Blast output to go: /data/username/blast/out
** WARNING: There are already files in /data/username/blast/out which will be
overwritten by this job.
** Continue? (y/n) :y
BLAST programs:
blastn - nucleotide query sequence against nucleotide database
blastp - protein query sequence against protein database
blastx - nucleotide query translated in all 6 reading frames
against a protein database
tblastn - protein query sequence against a nucleotide database
translated in all 6 reading frames
tblastx - 6-frame translations of a nucleotide query sequence
against the 6-frame translations of a nucleotide database
blastpgp - PSI-BLAST protein query against protein database
megablast - EST-type query sequences against nucleotide database
Which program do you want to run: blastn
The following nucleotide databases are available:(
(or enter your own database with full pathname)
nt - NCBI nonredundant Genbank+EMBL+DDBJ+PDB (no EST, STS, GSS or HTG)
est_human - nonredundant Genbank+EMBL+DDBJ EST human sequences
est_mouse - nonredundant Genbank+EMBL+DDBJ EST mouse sequences
est_others - nonredundant Genbank+EMBL+DDBJ EST all other organisms
pdbnt - from the 3-dimensional structures
htgs - high throughput genome sequences
ecoli.nt - ecoli genomic sequences
mito.nt - mitochondrial sequences
yeast.nt - yeast (Saccharomyces cerevisiae) genomic sequences
drosoph.nt - drosophila sequences
hs_genome - human genome assembly (Build 36, Apr 2006)
hs_genome.rna - human genome RNA (Build 36, Apr 2006)
mouse_genome - mouse genome assembly (Build 36, Mar 2006)
mouse_genome.rna - mouse genome RNA (Build 36, Mar 2006)
mouse_masked - mouse genome, masked (Build 36, Mar 2006)
other_genomic - non-human genomic sequences
human.rna - RefSeq human RNA
mouse.rna - RefSeq mouse RNA
Database to run against: yeast
Want a summary file in the output directory? (y/n, default y) :
http://biowulf.nih.gov/apps/blast/#blast_params has a full list of available parameters.
Any additional Blast parameters (e.g. -v 10):
Submitting to 8 nodes with :m2048 memory. Job number is 85633.biobos
Monitor your job at http://biowulf.nih.gov/cgi-bin/usermonS?username
|
easyblast figures out the node memory required, sets up all temporary
files and directories, and submits the job for you.
If a summary has been requested, a file called 'summary' will appear in your output directory along with the
actual Blast outputs. For your convenience, this will contain the hits from
each Blast result so you can scroll through it easily. Sample
summary file.
To run against your own database, enter the db name with full path at the Database: prompt. For example:
Database to run against: /data/username/blast_db/my_db
[Blast database update status] [All
Blast parameters] [NCBI
Blast documentation] [Biowulf system monitors]
Easyblast will select the number and type of nodes for you. If you want more control, you can bypass Easyblast and use the
underlying scripts directly, via the instructions below.
-
Set up a directory structure for input/output and tmp files.
For management purposes it is best (but not necessary) to keep all the input,
output and temporary directories within a given subdirectory. Note that the /home/username directory is small, and intended for std.error/output files. Use the /data/username directory for your Blast sequences, tmp files, and output.
The tmp and output directories will be created if they do not already exist.
- Edit/create a file which tells Blast where the input/output files are,
and what Blast parameters to use.
This environment file can be in any directory, but it can help, organizationally, to
keep it in your main directory for this run, e.g. /data/username/sample1/env
Sample environment file:
setenv DB /fdb/blastdb/nr
setenv PROG blastn
setenv INDIR /data/username/sample1/input
setenv OUTDIR /data/username/sample1/output
setenv TMPDIR /data/username/sample1/tmp
setenv PARAMS "-a 2 -b 10 -v 10"
where
$DB -- Database to act on (e.g. /fdb/blastdb/nr). See Blast
db status for a list of available databases.
$PROG -- Blast program to run (blastn, blastp, blastx, tblastn, tblastx, blastpgp, megablast ) (see NCBI's
Blast tutorial for details.)
$INDIR -- directory containing all the input sequence files.
Each sequence file may contain single or multiple sequences. Don't put ALL your
sequences into a single file; this would force all the Blast jobs to be run sequentially on a single node and you would not get any advantages from using the cluster.
$OUTDIR -- directory which will receive all the individual
output files
$TMPDIR -- temporary directory used by the program.
$PARAMS-- Parameters for the Blast program, in quotes (e.g.
"-a 2 -b 10 -v 10" )
It is recommended that '-a 2' (use 2 processors per node) is always used.
(List of all available Blast parameters)
Note: Existing files in $TMPDIR and $OUTDIR may be overwritten by this job. Use a different TMPDIR and OUTDIR if you want to preserve them.
- Submit the job to the batch system
Submit the job to the batch system using the 'qsub' command. Example:
qsub -v np=8,read=/DIRNAME/env2 -l nodes=8:m4096 /usr/local/blast/bin/runblast
This job is using the environment file
/DIRNAME/env2, will use 2 processors per node (because of the -a 2 flag given as a Blast parameter above)
and has asked for 'm4096' nodes (see the section on node memory). The 'shnodes' command will list all the Biowulf nodes and their properties.
Another example:
qsub -v np=8,read=/DIRNAME/env1 -l nodes=8 /usr/local/blast/bin/runblast
This job is using the blast environment
file /DIRNAME/env1, and has asked for 8 nodes. No specific nodes have been
asked for (i.e. no 'm2048' specification on the nodes), so the
batch system will allocate the first nodes that are available. All Biowulf nodes have at least 1 GB memory, so the target database should be no larger than 1GB for this run to be efficient.
(see the section on node memory).
If the Blast
'-a 2' flag is specified (recommended), it will use 2 processors per node.
You can monitor any jobs you submit; see Monitoring
your jobs in the Biowulf User Guide.
- Important Note 2: It is possible for each query sequence file to contain
multiple sequences. However, there needs to be at least as many query sequence files
as nodes! Occasionally Blast may barf on a particular sequence, in which case it will
not continue on to other sequences in that file. See below for splitting multi-sequence files.
When analyzing a large number of sequences with blast it is imperative that the
blast database fit entirely within the memory of a given node... this makes a
vast difference in the performance of blast. The Biowulf cluster
consists of nodes with several different memory configurations. Please see the
hardware section of the User Guide for the current configuration of the cluster.
You need to submit your job to nodes that have
enough memory for your particular job.
If you are running against your own database, check the size of the .nsq or
.psq file and submit to the smallest possible node memory. For example,
biowulf% ls -l my_db.nsq
-rw-rw-r-- 1 username username 1521238769 Aug 31 2001 my_db.nsq
The database file is 1521238769 bytes, i.e. 1.5 GB. The nodes should have at
least 2GB of memory, so you can submit to the m2048 nodes. If you are running against one of the Biowulf-installed databases in /fdb/blastdb, check the database size as described above and submit to the appropriate nodes.
If you are using Easyblast, the Easyblast script will determine the database size and submit the job to the appropriate nodes.
Note: Use the "shnodes" command to see a list of
nodes with their properties and status (free, job-exclusive, offline, down)
More Examples
- DNA sequences against the human genomic database
Note that 'username' in the examples below should be replaced by your own
username!!
biowulf$ mkdir /data1/username/run1 make main directory for this run
biowulf$ cd /data1/username/run1 go this directory
biowulf$ mkdir input output tmp make subdirectories for this run
biowulf$ cd input go to the 'input' subdirectory
biowulf$ cp /home/username/*.seq . copy all the sequence files into this subdir
Create a Blast environment file in /data1/username/run1/env. The file contains
setenv DB /fdb/blastdb/human_genomic
setenv PROG blastn
setenv INDIR /data/username/run1/seqs
setenv OUTDIR /data/username/run1/out
setenv TMPDIR /data/username/run1/tmp
setenv PARAMS "-a 2 -b 10 -v 10"
Check the size of the database:
biowulf % ls -l /fdb/blastdb/human_genomic*.nsq
-rw-rw-r-- 1 helixapp staff 990423059 Apr 24 18:29 /fdb/blastdb/human_genomic.00.nsq
-rw-rw-r-- 1 helixapp staff 529527444 Apr 24 18:30 /fdb/blastdb/human_genomic.01.nsq
The database is ~1.5 GB. Submit the job with the command:
qsub -v np=8,read=/data/username/run1/env -l nodes=8:m2048 /usr/local/blast/bin/runblast
- Same data against the Drosophila protein database
Create a new environment file /data/username/run1/env2 which contains:
setenv DB /fdb/blastdb/drosoph.aa
setenv PROG blastx
setenv PARAMS "-a 2 -b 10 -v 10"
setenv INDIR /data/username/run1/seqs
setenv OUTDIR /data/username/run1/out2
setenv TMPDIR /data/username/run1/tmp2
Submit the job using:
qsub -v np=16,read=/data/username/run1/env2 -l nodes=16 /usr/local/blast/bin/runblast
The Drosophila protein database is not very large and will easily fit into the default
1GB node memory.
Thus you don't have to ask for any special node properties.
Splitting your query sequences
If your query sequences are all in one file, and you need to split them into multiple sequence files, there are a couple of utilities available:
- seqsplit: will split a multisequence fasta-format file into individual sequences. Usage:
seqsplit -f sequence_file
If the file sequence_file contains 2000 sequences, you will get 2000 individual files. Each file will be named according to the sequence name in the fasta entry.
- split_fasta: will split a multisequence fasta-format file into a desired number of files.
Usage:
Split_fasta: to split any large uncompressed fasta file
Usage: split_fasta [optional parameters] [dir]file.fas
-n # number of split files (default=2)
-o file root name of output file (default split#)
-c # chunks to write out (default 100 entries)
-d outdir output directory (default = input directory)
-z if input file is .Z or .gz compressed
Thus, if a file has 100 sequences, and you want to split it into 5 multisequence files, use
split_fasta -n 5 sequence_file
will produce 5 files, each containing 100/5=20 sequences. The files will be called split0, split1,...split4.
split_fasta -n 5 -o oligo sequence_file
will produce 5 files, each containing 20 sequences. The files will be called oligo0, oligo1 ..oligo5.
Some recent runs on our system to give you an idea of what sort of performance
to expect.
| Query |
Database |
Blast program v 2.2.13 |
Nodes |
Time |
1000 nucleotide
EST sequences |
nt
updated 29/Nov/2006
4,625,962 sequences
4.8 GB |
blastn |
16 nodes
2.6 GHz Opterons
8GB RAM Gb ethernet |
21 mins |
nr
updated 29/Nov/2006
4,196,452 sequences
1.5 GB |
blastx |
16 nodes
2.6 GHz Opterons
8GB RAM Gb ethernet |
21 mins |
est_human
updated 1/Dec/2006
7,895,565 sequences
1.2 GB |
blastn |
16 nodes
2.6 GHz Opterons
8 GB RAM Gb ethernet |
12 mins |
human genome
updated Apr 2006
25 sequences
770 Mb |
blastn |
16 nodes
2.6 GHz Opterons
8 GB RAM Gb ethernet |
6:01 hrs |
|
| 1000 protein sequences |
nr
updated 3/Dec/2006
4,201,456 sequences
1.5 GB |
blastp |
16 nodes
2.6 GHz Opterons
8 GB RAM Gb ethernet |
37 mins |
nt
updated 29/Nov/2006
4,625,962 sequences
4.8 GB |
tblastn |
16 nodes
2.6 GHz Opterons
8 GB RAM Gb ethernet |
3:18 hrs |
Note: EasyBlast will typically allot 32 nodes (i.e. 64 processors) for a job.
Large databases (e.g. nt nucleotide) may be allotted fewer nodes.
BLAST Databases
Local copies of the sequence databases used by blast can be found in the directory
/fdb/blastdb. These data are a (weekly-updated) mirror of the
ftp://ncbi.nlm.nih.gov/blast/db/ directory maintained by NCBI. Some
dbs were built from Fasta-format files, with the command:
formatdb -o T
Reading the Results
Scanning through a large set of Blast results can be time-consuming. The blast_summary
script may help. Go to your blast output directory and type:
/usr/local/blast/bin/blast_summary
and it will create a file in that directory called 'summary' which contains just
the Blast hits for each query sequence. Easyblast does this automatically for you.
Programs/Scripts/Files involved:
These scripts can be copied from /usr/local/blast/bin and modified if desired,
although this should not be necessary.
- runblast -- (aka easyrunblast) Batch(PBS) submission file which sets up the
mpi wrapper (multirun) for 2 perl scripts
- blastdist -- sorts input query sequences by size to balance the load on the nodes.
Makes lists of assigned sequences for each node in $TMPDIR
- mpiblast -- main execution of blast...processes all query sequences
assigned to a given node
- env -- file which contains appropriate variables and parameters
for blast execution
- blastall -- blastall and all its associated executables
and parameters are located in /usr/local/blast/ncbi/bin/.
- multirun -- MPI wrapper program which allows for different
behavior on different nodes
- cleanup - cleans up tmp files for easyblast, and builds summary if requested.
These are the parameters available for the blastall program. See also
(the
NCBI Blast documentation)
blastall 2.2.13 arguments:
-p Program Name [String]
-d Database [String]
default = nr
-i Query File [File In]
default = stdin
-e Expectation value (E) [Real]
default = 10.0
-m alignment view options:
0 = pairwise,
1 = query-anchored showing identities,
2 = query-anchored no identities,
3 = flat query-anchored, show identities,
4 = flat query-anchored, no identities,
5 = query-anchored no identities and blunt ends,
6 = flat query-anchored, no identities and blunt ends,
7 = XML Blast output,
8 = tabular,
9 tabular with comment lines
10 ASN, text
11 ASN, binary [Integer]
default = 0
-o BLAST report Output File [File Out] Optional
default = stdout
-F Filter query sequence (DUST with blastn, SEG with others) [String]
default = T
-G Cost to open a gap (-1 invokes default behavior) [Integer]
default = -1
-E Cost to extend a gap (-1 invokes default behavior) [Integer]
default = -1
-X X dropoff value for gapped alignment (in bits) (zero invokes default behavior)
blastn 30, megablast 20, tblastx 0, all others 15 [Integer]
default = 0
-I Show GI's in deflines [T/F]
default = F
-q Penalty for a nucleotide mismatch (blastn only) [Integer]
default = -3
-r Reward for a nucleotide match (blastn only) [Integer]
default = 1
-v Number of database sequences to show one-line descriptions for (V) [Integer]
default = 500
-b Number of database sequence to show alignments for (B) [Integer]
default = 250
-f Threshold for extending hits, default if zero
blastp 11, blastn 0, blastx 12, tblastn 13
tblastx 13, megablast 0 [Integer]
default = 0
-g Perform gapped alignment (not available with tblastx) [T/F]
default = T
-Q Query Genetic code to use [Integer]
default = 1
-D DB Genetic code (for tblast[nx] only) [Integer]
default = 1
-a Number of processors to use [Integer]
default = 1
-O SeqAlign file [File Out] Optional
-J Believe the query defline [T/F]
default = F
-M Matrix [String]
default = BLOSUM62
-W Word size, default if zero (blastn 11, megablast 28, all others 3) [Integer]
default = 0
-z Effective length of the database (use zero for the real size) [Real]
default = 0
-K Number of best hits from a region to keep (off by default, if used
a value of 100 is recommended) [Integer]
default = 0
-P 0 for multiple hit, 1 for single hit (does not apply to blastn) [Integer]
default = 0
-Y Effective length of the search space (use zero for the real size) [Real]
default = 0
-S Query strands to search against database (for blast[nx], and tblastx)
3 is both, 1 is top, 2 is bottom [Integer]
default = 3
-T Produce HTML output [T/F]
default = F
-l Restrict search of database to list of GI's [String] Optional
-U Use lower case filtering of FASTA sequence [T/F] Optional
-y X dropoff value for ungapped extensions in bits (0.0 invokes default behavior)
blastn 20, megablast 10, all others 7 [Real]
default = 0.0
-Z X dropoff value for final gapped alignment in bits (0.0 invokes default behavior)
blastn/megablast 50, tblastx 0, all others 25 [Integer]
default = 0
-R PSI-TBLASTN checkpoint file [File In] Optional
-n MegaBlast search [T/F]
default = F
-L Location on query sequence [String] Optional
-A Multiple Hits window size, default if zero (blastn/megablast 0, all others 40 [Integer]
default = 0
-w Frame shift penalty (OOF algorithm for blastx) [Integer]
default = 0
-t Length of the largest intron allowed in a translated nucleotide sequence when
linking multiple distinct alignments. (0 invokes default behavior; a negative value
disables linking.) [Integer]
default = 0
-B Number of concatenated queries, for blastn and tblastn [Integer] Optional
default = 0
-V Force use of old engine [T/F] Optional
default = T
-C Use composition-based statistics for tblastn:
D or d: default (equivalent to F)
0 or F or f: no composition-based statistics
blastpgp 2.2.13 arguments:
-d Database [String]
default = nr
-i Query File (not needed if restarting from scoremat) [File In] Optional
default = stdin
-A Multiple Hits window size [Integer]
default = 40
-f Threshold for extending hits [Integer]
default = 11
-e Expectation value (E) [Real]
default = 10.0
-m alignment view options:
0 = pairwise,
1 = query-anchored showing identities,
2 = query-anchored no identities,
3 = flat query-anchored, show identities,
4 = flat query-anchored, no identities,
5 = query-anchored no identities and blunt ends,
6 = flat query-anchored, no identities and blunt ends,
7 = XML Blast output,
8 = Tabular output,
9 = Tabular output with comments
10 = ASN, text
11 = ASN, binary [Integer]
default = 0
-o Output File for Alignment [File Out] Optional
default = stdout
-y Dropoff (X) for blast extensions in bits (default if zero) [Real]
default = 7.0
-P 0 for multiple hit, 1 for single hit [Integer]
default = 0
-F Filter query sequence with SEG [String]
default = F
-G Cost to open a gap [Integer]
default = 11
-E Cost to extend a gap [Integer]
default = 1
-X X dropoff value for gapped alignment (in bits) [Integer]
default = 15
-N Number of bits to trigger gapping [Real]
default = 22.0
-S Start of required region in query [Integer]
default = 1
-H End of required region in query (-1 indicates end of query) [Integer]
default = -1
-a Number of processors to use [Integer]
default = 1
-I Show GI's in deflines [T/F]
default = F
-h e-value threshold for inclusion in multipass model [Real]
default = 0.002
-c Constant in pseudocounts for multipass version [Integer]
default = 9
-j Maximum number of passes to use in multipass version [Integer]
default = 1
-J Believe the query defline [T/F]
default = F
-Z X dropoff value for final gapped alignment (in bits) [Integer]
default = 25
-O SeqAlign file ('Believe the query defline' must be TRUE) [File Out] Optional
-M Matrix [String]
default = BLOSUM62
-v Number of database sequences to show one-line descriptions for (V) [Integer]
default = 500
-b Number of database sequence to show alignments for (B) [Integer]
default = 250
-C Output File for PSI-BLAST Checkpointing [File Out] Optional
-R Input File for PSI-BLAST Restart [File In] Optional
-W Word size [Integer]
default = 3
-z Effective length of the database (use zero for the real size) [Real]
default = 0
-K Number of best hits from a region to keep [Integer]
default = 0
-s Compute locally optimal Smith-Waterman alignments [T/F]
default = F
-Y Effective length of the search space (use zero for the real size) [Real]
default = 0
-p program option for PHI-BLAST [String]
default = blastpgp
-k Hit File for PHI-BLAST [File In]
default = hit_file
-T Produce HTML output [T/F]
default = F
-Q Output File for PSI-BLAST Matrix in ASCII [File Out] Optional
-B Input Alignment File for PSI-BLAST Restart [File In] Optional
-l Restrict search of database to list of GI's [String] Optional
-U Use lower case filtering of FASTA sequence [T/F] Optional
default = F
-t Use composition based statistics
0 or F or f: no composition-based statistics
1 or T or t: Composition-based statistics as in NAR 29:2994--3005, 2001
2: Composition-based score adjustment as in Bioinformatics 21:902-911, 2005,
conditioned on sequence properties in round 1
3: Composition-based score adjustment as in Bioinformatics 21:902-911, 2005,
unconditionally in round 1
[String]
default = 1
-q ASN.1 Scoremat input of checkpoint data:
0: no scoremat input
1: Restart is from ASCII scoremat checkpoint file,
2: Restart is from binary scoremat checkpoint file [Integer] Optional
default = 0
-u ASN.1 Scoremat output of checkpoint data:
0: no scoremat output
1: Output is ASCII scoremat checkpoint file (requires -J),
2: Output is binary scoremat checkpoint file (requires -J) [Integer] Optional
default = 0
-L Cost to decline alignment (disabled when 0) [Integer]
default = 0
|