BLAST on Biowulf is intended for running a large number of query sequences, such as hundreds or thousands, against the Blast databases. If you have just a few query sequences, you should use Blast on the NCBI website or on Helix. All NCBI Blast programs (blastn, blastp, blastx, tblastn, tblastx, blastpgp, megablast, rpsblast) are available on Biowulf. Please contact the Helix Systems staff (staff@helix.nih.gov, or 4-6248) if you have questions about your Blast jobs.
Version:
The output from every Blast program includes the version number.
biowulf% easyblast
EasyBlast: Blast for large numbers of sequences
Enter the directory which contains your input sequences: /data/username/blast/seqs
Enter the directory where you want your Blast output to go: /data/username/blast/out
** WARNING: There are already files in /data/username/blast/out which will be
overwritten by this job.
** Continue? (y/n) :y
BLAST programs:
blastn - nucleotide query sequence against nucleotide database
blastp - protein query sequence against protein database
blastx - nucleotide query translated in all 6 reading frames
against a protein database
tblastn - protein query sequence against a nucleotide database
translated in all 6 reading frames
tblastx - 6-frame translations of a nucleotide query sequence
against the 6-frame translations of a nucleotide database
blastpgp - PSI-BLAST protein query against protein database
megablast - EST-type query sequences against nucleotide database
Which program do you want to run: blastn
The following nucleotide databases are available:
(or enter your own database with full pathname)
nt - NCBI nonredundant Genbank+EMBL+DDBJ+PDB (no EST, STS, GSS or HTG)
est_human - nonredundant Genbank+EMBL+DDBJ EST human sequences
est_mouse - nonredundant Genbank+EMBL+DDBJ EST mouse sequences
est_others - nonredundant Genbank+EMBL+DDBJ EST all other organisms
pdbnt - from the 3-dimensional structures
htgs - high throughput genome sequences
ecoli.nt - ecoli genomic sequences
mito.nt - mitochondrial sequences
yeast.nt - yeast (Saccharomyces cerevisiae) genomic sequences
drosoph.nt - drosophila sequences
hs_genome - human genome assembly (Build 36, Apr 2006)
hs_genome.rna - human genome RNA (Build 36, Apr 2006)
mouse_genome - mouse genome assembly (Build 36, Mar 2006)
mouse_genome.rna - mouse genome RNA (Build 36, Mar 2006)
mouse_masked - mouse genome, masked (Build 36, Mar 2006)
other_genomic - non-human genomic sequences
human.rna - RefSeq human RNA
mouse.rna - RefSeq mouse RNA
Database to run against: yeast
Want a summary file in the output directory? (y/n, default y) :
http://biowulf.nih.gov/apps/blast/#blast_params has a full list of
available parameters.
Any additional Blast parameters (e.g. -v 10):
Submitting to 8 nodes with :g4 memory. Job number is 85633.biobos
Monitor your job at http://biowulf.nih.gov/cgi-bin/usermonS?username
easyblast figures out the node memory required, sets up all temporary files and directories, and submits the job for you.
If a summary has been requested, a file called 'summary' will appear in your output directory along with the actual Blast outputs. For your convenience, this will contain the hits from each Blast result so you can scroll through it easily. Sample summary file.
To run against your own database, enter the db name with full path at the Database: prompt. For example:
Database to run against: /data/username/blast_db/my_db
Database files have suffixes like .nsq, .nin (nucleotide), .psq, .psi (protein) etc.
You should enter the full path and the database name without the suffix.
You can put multiple sequences into each of your input sequence files. However, there need to be at least as many query sequence files as nodes. Very occasionally Blast may fail on a particular sequence, in which case it will not continue on to the other sequences in that file. If your query sequences are all in one file and you need to split them into multiple sequence files, a couple of utilities are available:
- seqsplit: will split a multisequence fasta-format file into individual sequences. Usage:
seqsplit -f sequence_file
If the file sequence_file contains 2000 sequences, you will get 2000 individual files. Each file will be named according to the sequence name in the fasta entry.
- split_fasta: will split a multisequence fasta-format file into a desired number of files. Usage:
Split_fasta: to split any large uncompressed fasta file
Usage: split_fasta [optional parameters] [dir]file.fas
-n # number of split files (default=2)
-o file root name of output file (default split#)
-c # chunks to write out (default 100 entries)
-d outdir output directory (default = input directory)
-z if input file is .Z or .gz compressed
Thus, if a file has 100 sequences, and you want to split it into 5 multisequence files,
split_fasta -n 5 sequence_file
will produce 5 files, each containing 100/5=20 sequences. The files will be called split0, split1, ... split4.
split_fasta -n 5 -o oligo sequence_file
will produce 5 files, each containing 20 sequences. The files will be called oligo0, oligo1, ... oligo4.
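If split_fasta is not at hand, the same kind of split can be approximated with standard tools. The following is a minimal sketch, not the split_fasta implementation: the function name split_fasta_sketch is hypothetical, and it deals records out round-robin, which may differ from the order in which split_fasta assigns sequences to files.

```shell
# split_fasta_sketch FILE N [ROOT]   (hypothetical helper, not part of Biowulf)
# Distribute the records of a multisequence fasta file across N files named
# ROOT0 .. ROOT(N-1) in the current directory. ROOT defaults to "split".
# Records are assigned round-robin, so file sizes differ by at most one record.
split_fasta_sketch() {
    awk -v n="$2" -v root="${3:-split}" '
        /^>/      { out = root (count % n); count++ }  # header line: pick the next file
        out != "" { print > out }                      # copy header and sequence lines
    ' "$1"
}
```

For example, split_fasta_sketch sequence_file 5 oligo would write oligo0 through oligo4 in the current directory.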
| Query | Database | Blast program v 2.2.13 | Nodes | Time |
|---|---|---|---|---|
| 1000 nucleotide EST sequences | nt, updated 4/Dec/2008, 7,808,957 sequences, 6.2 GB | blastn | 16 nodes, 2.6 GHz dual-core Opterons, 8 GB RAM, Gb ethernet | 21 mins |
| 1000 nucleotide EST sequences | nr, updated 4/Dec/2008, 7,463,447 sequences, 2.4 GB | blastx | 16 nodes, 2.6 GHz dual-core Opterons, 8 GB RAM, Gb ethernet | 21 mins |
| 1000 nucleotide EST sequences | est_human, updated 4/Dec/2008, 8,163,883 sequences, 1.1 GB | blastn | 16 nodes, 2.6 GHz dual-core Opterons, 8 GB RAM, Gb ethernet | 5.5 mins |
| 1000 nucleotide EST sequences | human genome, updated Apr 2006, 25 sequences, 770 MB | blastn | 16 nodes, 2.6 GHz dual-core Opterons, 8 GB RAM, Gb ethernet | 40 mins |
| 1000 protein sequences | nr, updated 15/Dec/2008, 7,463,447 sequences, 2.4 GB | blastp | 16 nodes, 2.6 GHz dual-core Opterons, 8 GB RAM, Gb ethernet | 28 mins |
| 1000 protein sequences | nt, updated 4/Dec/2008, 7,808,957 sequences, 6.2 GB | tblastn | 16 nodes, 2.6 GHz dual-core Opterons, 8 GB RAM, Gb ethernet | 3:47 hrs |
formatdb -o T
Easyblast uses swarm. If you prefer to use swarm directly, set up a swarm command file along the following lines. You have several options for setting up the environment:
- Use the full path for blastall: /usr/local/blast/ncbi/bin/blastall
- OR Use 'module load blast' in each line in the swarm command file, as in the example below.
- OR add /usr/local/blast/ncbi/bin to your PATH in your .bashrc or .cshrc file
- OR add module load blast into your .bashrc or .cshrc file
# this file is called blastcmd
#
module load blast; blastall -a 4 -p blastn -d /fdb/blastdb/nt -i /data/user/myseqs/seq1.fas -o /data/user/blastout/seq1.out
module load blast; blastall -a 4 -p blastn -d /fdb/blastdb/nt -i /data/user/myseqs/seq2.fas -o /data/user/blastout/seq2.out
module load blast; blastall -a 4 -p blastn -d /fdb/blastdb/nt -i /data/user/myseqs/seq3.fas -o /data/user/blastout/seq3.out
module load blast; blastall -a 4 -p blastn -d /fdb/blastdb/nt -i /data/user/myseqs/seq4.fas -o /data/user/blastout/seq4.out
[...]
Determine the size of the database file (see the section on Blast and node memory). Let's assume the database is 8.3 GB. Round upwards to 9 GB. Swarm will be told that each command requires 9 GB with the '-g 9' flag.
The '-a 4' flag in the blastall commands above tells blastall to run with 4 threads. Therefore, swarm has to be told that each command will require 4 cores, with the '-t 4' flag.
Submit this swarm with
swarm -g 9 -t 4 -f blastcmd
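Writing one line per query file by hand gets tedious for hundreds of sequences. The sketch below generates lines in the same form as the blastcmd file above. The function name make_blastcmd is hypothetical, and it assumes the query files end in .fas and that blastn against /fdb/blastdb/nt with 4 threads is wanted unless other values are passed.

```shell
# make_blastcmd SEQ_DIR OUT_DIR [DB] [THREADS]   (hypothetical helper)
# Print one blastall command line per *.fas file found in SEQ_DIR.
# DB defaults to /fdb/blastdb/nt and THREADS to 4, as in the blastcmd example.
make_blastcmd() {
    seq_dir=$1
    out_dir=$2
    db=${3:-/fdb/blastdb/nt}
    threads=${4:-4}
    for query in "$seq_dir"/*.fas; do
        [ -e "$query" ] || continue            # the glob matched nothing
        name=$(basename "$query" .fas)         # seq1.fas -> seq1
        echo "module load blast; blastall -a $threads -p blastn -d $db -i $query -o $out_dir/$name.out"
    done
}
# Example: make_blastcmd /data/user/myseqs /data/user/blastout > blastcmd
```

The THREADS value used here should match the '-t' value given to swarm.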
Blast Database Update Status -- status of all Blast databases installed on the system.
When analyzing a large number of sequences with Blast, it is imperative that the Blast database fit entirely within the memory of a given node; this makes a vast difference in the performance of Blast. Thus, if you are running Blast via swarm, you need to check the size of the database.
biowulf% ls -lh my_db.nsq
-rw-rw-r-- 1 username username 1.5G Aug 31 2011 my_db.nsq
The database is 1.5 GB, so you would submit to swarm with '-g 2' (2 GB required for each Blast run).
For multi-part databases such as 'human_genomic', you need to add up the sizes of all the sections, e.g.
[user@biowulf ~]# ls -lh /fdb/blastdb/human_genomic*.nsq
-rw-rw-r-- 1 helixapp staff 945M Oct 1 22:11 /fdb/blastdb/human_genomic.00.nsq
-rw-rw-r-- 1 helixapp staff 957M Oct 1 22:12 /fdb/blastdb/human_genomic.01.nsq
-rw-rw-r-- 1 helixapp staff 274M Oct 1 22:12 /fdb/blastdb/human_genomic.02.nsq
The total database size for the human_genomic database is thus ~1 GB + ~1 GB + 0.27 GB = 2.3 GB. You would submit to swarm with '-g 3' (3 GB required for each Blast run).
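The add-and-round-up arithmetic can be scripted. This is a rough sketch under two assumptions: sequence files follow the naming seen in this section (name.nsq or name.NN.nsq, and likewise .psq for protein), and 1 GB is counted as 2^30 bytes, as 'ls -lh' does. The function names db_total_bytes and round_up_gb are hypothetical.

```shell
# db_total_bytes DB_PREFIX   (hypothetical helper)
# Sum the sizes in bytes of DB_PREFIX.nsq, DB_PREFIX.NN.nsq and the
# corresponding .psq files, where DB_PREFIX is the database path without suffix.
db_total_bytes() {
    total=0
    for f in "$1".nsq "$1".[0-9]*.nsq "$1".psq "$1".[0-9]*.psq; do
        [ -f "$f" ] || continue                # unmatched patterns stay literal; skip them
        total=$((total + $(wc -c < "$f")))
    done
    echo "$total"
}

# round_up_gb BYTES   (hypothetical helper)
# Print BYTES rounded up to whole gigabytes (2^30 bytes), with a minimum of 1.
round_up_gb() {
    gb=$(( ($1 + 1073741823) / 1073741824 ))
    if [ "$gb" -lt 1 ]; then gb=1; fi
    echo "$gb"
}
# Example: swarm -g "$(round_up_gb "$(db_total_bytes /fdb/blastdb/human_genomic)")" -t 4 -f blastcmd
```

The result is only the value for swarm's '-g' flag; it does not check whether nodes with that much memory are available.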
If you are using Easyblast, the Easyblast script will determine the database size and submit the job to the appropriate nodes.
Scanning through a large set of Blast results can be time-consuming. The blast_summary script may help. Go to your blast output directory and type:
/usr/local/blast/bin/blast_summary
and it will create a file in that directory called 'summary' which contains just the Blast hits for each query sequence. Easyblast does this automatically for you.
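As a complement to the summary, it can be useful to see which queries returned nothing at all. The sketch below is not part of blast_summary; the function name no_hit_outputs is hypothetical, and it assumes the outputs are in the default pairwise format (-m 0), where blastall reports an empty result with a "No hits found" line.

```shell
# no_hit_outputs OUT_DIR   (hypothetical helper)
# Print the names of the files in OUT_DIR that contain the text
# "No hits found". Errors for subdirectories or unreadable files are hidden.
no_hit_outputs() {
    grep -l "No hits found" "$1"/* 2>/dev/null
}
# Example: no_hit_outputs /data/username/blast/out
```

With tabular output (-m 8) an empty result is simply an empty file, so this check would not apply there.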
These are the parameters available for the blastall program. See also the Blast documentation at the NCBI website.
blastall 2.2.17 arguments:
-p Program Name [String]
-d Database [String]
default = nr
-i Query File [File In]
default = stdin
-e Expectation value (E) [Real]
default = 10.0
-m alignment view options:
0 = pairwise,
1 = query-anchored showing identities,
2 = query-anchored no identities,
3 = flat query-anchored, show identities,
4 = flat query-anchored, no identities,
5 = query-anchored no identities and blunt ends,
6 = flat query-anchored, no identities and blunt ends,
7 = XML Blast output,
8 = tabular,
9 tabular with comment lines
10 ASN, text
11 ASN, binary [Integer]
default = 0
range from 0 to 11
-o BLAST report Output File [File Out] Optional
default = stdout
-F Filter query sequence (DUST with blastn, SEG with others) [String]
default = T
-G Cost to open a gap (-1 invokes default behavior) [Integer]
default = -1
-E Cost to extend a gap (-1 invokes default behavior) [Integer]
default = -1
-X X dropoff value for gapped alignment (in bits) (zero invokes default
behavior) blastn 30, megablast 20, tblastx 0, all others 15 [Integer]
default = 0
-I Show GI's in deflines [T/F]
default = F
-q Penalty for a nucleotide mismatch (blastn only) [Integer]
default = -3
-r Reward for a nucleotide match (blastn only) [Integer]
default = 1
-v Number of database sequences to show one-line descriptions for (V)
[Integer]
default = 500
-b Number of database sequence to show alignments for (B) [Integer]
default = 250
-f Threshold for extending hits, default if zero
blastp 11, blastn 0, blastx 12, tblastn 13
tblastx 13, megablast 0 [Real]
default = 0
-g Perform gapped alignment (not available with tblastx) [T/F]
default = T
-Q Query Genetic code to use [Integer]
default = 1
-D DB Genetic code (for tblast[nx] only) [Integer]
default = 1
-a Number of processors to use [Integer]
default = 1
-O SeqAlign file [File Out] Optional
-J Believe the query defline [T/F]
default = F
-M Matrix [String]
default = BLOSUM62
-W Word size, default if zero (blastn 11, megablast 28, all others 3)
[Integer]
default = 0
-z Effective length of the database (use zero for the real size) [Real]
default = 0
-K Number of best hits from a region to keep (off by default, if used a value
of 100 is recommended) [Integer]
default = 0
-P 0 for multiple hit, 1 for single hit (does not apply to blastn) [Integer]
default = 0
-Y Effective length of the search space (use zero for the real size) [Real]
default = 0
-S Query strands to search against database (for blast[nx], and tblastx)
3 is both, 1 is top, 2 is bottom [Integer]
default = 3
-T Produce HTML output [T/F]
default = F
-l Restrict search of database to list of GI's [String] Optional
-U Use lower case filtering of FASTA sequence [T/F] Optional
-y X dropoff value for ungapped extensions in bits (0.0 invokes default
behavior) blastn 20, megablast 10, all others 7 [Real]
default = 0.0
-Z X dropoff value for final gapped alignment in bits (0.0 invokes default
behavior) blastn/megablast 50, tblastx 0, all others 25 [Integer]
default = 0
-R PSI-TBLASTN checkpoint file [File In] Optional
-n MegaBlast search [T/F]
default = F
-L Location on query sequence [String] Optional
-A Multiple Hits window size, default if zero (blastn/megablast 0, all others
40 [Integer]
default = 0
-w Frame shift penalty (OOF algorithm for blastx) [Integer]
default = 0
-t Length of the largest intron allowed in a translated nucleotide sequence
when linking multiple distinct alignments. (0 invokes default behavior; a
negative value disables linking.) [Integer]
default = 0
-B Number of concatenated queries, for blastn and tblastn [Integer] Optional
default = 0
-V Force use of the legacy BLAST engine [T/F] Optional
default = F
-C Use composition-based statistics for blastp or tblastn:
As first character:
D or d: default (equivalent to T)
0 or F or f: no composition-based statistics
1 or T or t: Composition-based statistics as in NAR 29:2994-3005, 2001
2: Composition-based score adjustment as in Bioinformatics 21:902-911,
2005, conditioned on sequence properties
3: Composition-based score adjustment as in Bioinformatics 21:902-911,
2005, unconditionally
For programs other than tblastn, must either be absent or be D, F or 0.
As second character, if first character is equivalent to 1, 2, or 3:
U or u: unified p-value combining alignment p-value and compositional
p-value in round 1 only [String]
default = D
-s Compute locally optimal Smith-Waterman alignments (This option is only
available for gapped tblastn.) [T/F]
default = F
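To show how several of the parameters above combine on one command line, here is a small sketch that prints (rather than runs) a blastn command with a stricter E-value cutoff, tabular output, at most 10 alignments per query and 4 threads. The function name tabular_blast_cmd and the chosen values are illustrative assumptions, not recommended settings.

```shell
# tabular_blast_cmd QUERY DB OUT   (hypothetical helper)
# Print a blastall command line that uses flags from the list above:
#   -e 1e-5  expectation value cutoff
#   -m 8     tabular output
#   -b 10    show alignments for at most 10 database sequences
#   -a 4     use 4 processors (pair this with 'swarm -t 4')
tabular_blast_cmd() {
    echo "blastall -p blastn -d $2 -i $1 -o $3 -e 1e-5 -m 8 -b 10 -a 4"
}
# Example: tabular_blast_cmd /data/user/myseqs/seq1.fas /fdb/blastdb/nt /data/user/blastout/seq1.out
```

The printed line can be pasted into a swarm command file, or the same flags can be entered at the Easyblast prompt for additional Blast parameters.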


