High-throughput WU-Blast on Biowulf

[Current WU-Blast database update status]

WU-Blast on Biowulf is intended for running a large number of query sequences (hundreds or thousands) against the WU-Blast databases. If you have just a few query sequences, you should use WU-Blast on a public server such as the EBI server, or on Helix. Please contact the Helix Systems staff (staff@helix.nih.gov, or 4-6248) if you have questions about your WU-Blast jobs.

WU-BLAST was developed by Warren Gish at Washington University in St. Louis (WU-Blast website). Specific customization via wrapper scripts for the NIH Biowulf cluster was done by Susan Chacko (Helix Staff, CIT) and Peter FitzGerald (Genome Analysis Unit, NCI).

[Detailed instructions] [Benchmarks] [WU-Blast parameters]
EasyWUBlast: An easy interface to WU-Blast on Biowulf

The 'easywublast' program on Biowulf simplifies submission of large WU-Blast jobs. Your query sequences can be in a single large file, or in separate sequence files in a directory. Type 'easywublast' at the Biowulf prompt, and you will be prompted for all required parameters. The script does some basic sanity checking, sets up your run, and submits it to the batch queue.

Sample session (user input is in bold):
easywublast sets up the temporary files and directories that are required, and submits the job for you. Most required parameters are self-evident. The "NCBI-Blast parameters" option uses parameter sets that approximate NCBI Gapped Blast 2.0 (More info). If you choose to 'save intermediate files', the unmerged outputs, the parameter file, and the temporary files will all be saved.

In the example above, the query sequences are in individual files in one directory. They can also be set up as multiple sequences per file (e.g. 100 sequences per file, 50 files in the directory), or in one large file. If they are all in a single file, enter the name of that file for the query sequences, e.g.

    Enter the file or directory which contains your input sequences: /data/user/my_drosoph.seq

To run against your own database, enter the database name with its full path at the Database: prompt. For example:

    Database to run against: /data/username/blast_db/my_db

This database should have been built with the WU-Blast xdformat program, which is available in /usr/local/wublast and described further in the WU-Blast documentation.
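If your queries are in one large FASTA file and you would rather supply a directory of smaller files (for example 100 sequences per file, as mentioned above), a splitter along the following lines could be used. This is a minimal sketch and not part of easywublast; the function names and the query_NNNN.fasta output naming are hypothetical choices made for illustration.

```python
from pathlib import Path
from typing import Iterator, List


def read_fasta_records(path: Path) -> Iterator[str]:
    """Yield each FASTA record (header line plus sequence lines) as one string."""
    record: List[str] = []
    with path.open() as handle:
        for line in handle:
            # A new header closes the record collected so far.
            if line.startswith(">") and record:
                yield "".join(record)
                record = []
            if line.strip():  # skip blank lines
                record.append(line if line.endswith("\n") else line + "\n")
    if record:
        yield "".join(record)


def _write_batch(batch: List[str], out_dir: Path, index: int) -> Path:
    # Hypothetical naming scheme: query_0001.fasta, query_0002.fasta, ...
    target = out_dir / f"query_{index:04d}.fasta"
    target.write_text("".join(batch))
    return target


def split_fasta(source: Path, out_dir: Path, per_file: int = 100) -> List[Path]:
    """Write the records of `source` into files of at most `per_file` records."""
    if per_file < 1:
        raise ValueError("per_file must be at least 1")
    out_dir.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    batch: List[str] = []
    for record in read_fasta_records(source):
        batch.append(record)
        if len(batch) == per_file:
            written.append(_write_batch(batch, out_dir, len(written) + 1))
            batch = []
    if batch:  # final, possibly shorter, batch
        written.append(_write_batch(batch, out_dir, len(written) + 1))
    return written
```

The resulting directory can then be entered at the input-sequences prompt in place of the single file.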
[WU-Blast database update status] [All WU-Blast parameters]
Detailed Information

WU-Blast on Biowulf works by dividing the database among the nodes, and running WU-Blast with all the query sequences against each piece of the database. At the end of the run, a merge program puts the pieces together. In contrast, NCBI Blast on Biowulf is parallelized by job, where individual query sequences are sent to different nodes. This means that the entire Blast database has to be read by each node, which can cause slowdowns for large databases and/or large numbers of nodes.

EasyWUBlast will set up the temporary directories and files, and allocate nodes for your job. If you want more control, you can bypass EasyWUBlast and use the underlying scripts directly, via the instructions below.
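To illustrate the idea of searching each piece of the database separately, the sketch below builds one command line per database slice using the -dbslice and -o options that appear in the blastn option summary on this page. It is an assumption that slices are numbered from 1, and this is not necessarily how the Biowulf wrapper scripts divide the database; the function name and the output naming are hypothetical. The merge step is not shown.

```python
import shlex
from typing import List


def slice_commands(program: str, database: str, query: str,
                   n_slices: int, out_prefix: str,
                   extra_options: str = "") -> List[str]:
    """Return one command line per database slice.

    Follows the usage printed by the program itself:
        BLASTN database queryfile [options]
    """
    if n_slices < 1:
        raise ValueError("n_slices must be at least 1")
    commands = []
    for m in range(1, n_slices + 1):  # assumption: slices are numbered 1..n
        argv = [program, database, query]
        argv += shlex.split(extra_options)
        # Each slice writes to its own output file: <out_prefix>.<m>
        argv += ["-dbslice", f"{m}/{n_slices}", "-o", f"{out_prefix}.{m}"]
        commands.append(shlex.join(argv))
    return commands


if __name__ == "__main__":
    for command in slice_commands("/usr/local/wublast/blastn",
                                  "/fdb/wublastdb/nt",
                                  "/data/user/sample/query.fasta",
                                  4, "/data/user/sample/tmp/part",
                                  "-B 10 -V 10 -hspmax 500"):
        print(command)
```

Every slice sees all the query sequences, so each node reads only its share of the database rather than the whole of it.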
WU-Blast environment variables

The WU-Blast environment variables are set as follows:

    WUBLASTMAT    - /usr/local/wublast/matrix
    WUBLASTFILTER - /usr/local/wublast/filter
    WUBLASTDB     - /fdb/wublastdb

If you would like to change these values, add them to your parameter file, e.g.

    setenv DB /fdb/wublastdb/nt
    setenv PROG blastn
    setenv INDIR /data/user/sample/query.fasta
    setenv OUTDIR /data/user/sample/out/
    setenv TMPDIR /data/user/sample/tmp
    setenv PARAMS "-B 10 -V 10 -hspmax 500"
    setenv WUBLASTDB /data/user/wublast/mydb
    setenv WUBLASTFILTER /data/user/wublast/filter
Debugging

By default, the temporary files (/data/user/wublast_par, /data/user/wublast_tmp/*, and the outputs from each node) will be deleted after the run. The only remaining file will be 'output.wublast', which contains the results from all nodes. Your query sequences, of course, will also remain.

If you wish to retain all the temporary files, add the environment variable SAVE_ALL to your parameter file, e.g.

    setenv DB /fdb/wublastdb/nt
    setenv PROG blastn
    setenv INDIR /data/user/sample/query.fasta
    setenv OUTDIR /data/user/sample/out/
    setenv TMPDIR /data/user/sample/tmp
    setenv PARAMS "-B 10 -V 10 -hspmax 500"
    setenv SAVE_ALL 1
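If you prepare many runs, the parameter files shown above can be generated rather than typed. The following is a minimal sketch, assuming the parameter file is simply a list of csh-style setenv lines as in the examples; the function name is hypothetical, and values containing double quotes or newlines are rejected rather than escaped.

```python
from typing import Dict


def render_parameter_file(settings: Dict[str, str]) -> str:
    """Render settings as csh-style `setenv NAME value` lines."""
    lines = []
    for name, value in settings.items():
        # Variable names: letters, digits and underscores only.
        if not name.replace("_", "").isalnum():
            raise ValueError(f"invalid variable name: {name!r}")
        value = str(value)
        if '"' in value or "\n" in value:
            raise ValueError(f"unsupported character in value for {name}")
        # Quote empty values and values containing whitespace,
        # as the PARAMS line in the examples does.
        if value == "" or any(ch.isspace() for ch in value):
            value = f'"{value}"'
        lines.append(f"setenv {name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_parameter_file({
        "DB": "/fdb/wublastdb/nt",
        "PROG": "blastn",
        "INDIR": "/data/user/sample/query.fasta",
        "OUTDIR": "/data/user/sample/out/",
        "TMPDIR": "/data/user/sample/tmp",
        "PARAMS": "-B 10 -V 10 -hspmax 500",
        "SAVE_ALL": "1",
    }), end="")
```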
Analyzing the output

For convenience, the output from all your query sequences will appear in one large file. If you need to pull out individual outputs from that file, use the wublast_extract program, e.g.

    biowulf% wublast_extract -f output.wublast -q AAU46754 > AAU46754.out
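To pull out the results for several queries at once, the wublast_extract command shown above can be wrapped in a short script. This is a sketch only: it assumes wublast_extract is on your PATH and accepts exactly the -f and -q options used in the example, and the function names and the <id>.out file naming are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import Iterable, List


def extract_argv(merged_output: str, query_id: str) -> List[str]:
    """Build the wublast_extract command line used in the example above."""
    return ["wublast_extract", "-f", merged_output, "-q", query_id]


def extract_many(merged_output: str, query_ids: Iterable[str],
                 out_dir: str) -> List[Path]:
    """Run wublast_extract once per query ID, writing <out_dir>/<id>.out."""
    target_dir = Path(out_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for query_id in query_ids:
        target = target_dir / f"{query_id}.out"
        with target.open("w") as handle:
            # check=True raises CalledProcessError if the tool exits non-zero.
            subprocess.run(extract_argv(merged_output, query_id),
                           stdout=handle, check=True)
        written.append(target)
    return written
```

For example, extract_many("output.wublast", ["AAU46754"], "extracted") would write extracted/AAU46754.out.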
WU-Blast Benchmarks

These are some recent runs on our system, to give you an idea of what sort of timescales to expect.
NCBI Blast parameters: WU-Blast is designed to be sensitive, so with default parameters it will typically take longer than an NCBI Blast run with default parameters. The wu-blastall command converts NCBI Blast parameters into their roughly equivalent WU-Blast parameters. Note that sensitivity is inversely related to speed. [More about WU-Blast and NCBI Blast parameters]
WU-Blast Databases

Local copies of the sequence databases used by WU-Blast can be found in the directory /fdb/wublastdb. These databases are updated weekly. They are built from Fasta-format files downloaded from the ftp://ncbi.nlm.nih.gov/blast/db/FASTA directory maintained by NCBI, with the command

    xdformat -I "description" -p[-n] file.fasta

where -p is used for protein databases and -n for nucleotide databases.
[WU-Blast Database Update Status] [More Examples]
Programs/Scripts/Files involved:

These scripts can be copied from /usr/local/wublast/nih and modified if desired, although this should not be necessary.
List of all available WU-Blast parameters

Typing the program name will give you a summary of most available parameters and will report the version of WU-Blast being used. See also the command-line options discussed at the WU-Blast site.
biowulf% /usr/local/wublast/blastn
BLASTN 2.0MP-WashU [26-Oct-2004] [linux24-i686-ILP32F64 2004-10-26T20:25:25]
Copyright (C) 1996-2004 Washington University, Saint Louis, Missouri USA.
All Rights Reserved.
Reference: Gish, W. (1996-2004) http://blast.wustl.edu
Notice: this program and its default parameter settings are optimized to find
nearly identical sequences rapidly. To identify weak protein similarities
encoded in nucleic acid, use BLASTX, TBLASTN or TBLASTX.
Usage:
BLASTN database queryfile [options]
Valid BLASTN options: E, S, E2, S2, W, T, X, M, N, Y, Z, L, K, H, V, B
(described at Wu-Blast command-line parameters)
-matrix <matrix-name> use the specified scoring matrix (default matrix is
computed from M=+5 N=-4); be sure to consider changing the
default gap penalties when using a non-default scoring
system
-Q <s> penalty score for a gap of length 1
-R <s> penalty score for extending a gap by each letter after the first
-kap use Karlin-Altschul statistics on individual alignment scores
-sump use Karlin-Altschul "Sum" statistics*
-poissonp use Poisson statistics to evaluate multiple HSPs
-top search only the top strand of the query
-bottom search only the bottom strand of the query
-filter <method> hard mask the query using the specified method (e.g.,
"seg", "xnu", "ccp", "dust" or "none")
-wordmask <method> soft mask the query using the specified method (see
-filter)
-maskextra <n> extend soft masking additional distance <n> into flanking
regions
-lcfilter hard mask lower case letters in the query sequence
-lcmask soft mask lower case letters in the query sequence
-echofilter display the query, after any/all masks have been applied
-hitdist <n> max. distance between word hits for 2-hit BLAST (default 0)
-wink <n> generate neighborhood words every <n>-th position (default 1)
-stats collect word-hit statistics (consumes marginally more cpu time)
-ctxfactor <f> base statistics on this number of independent contexts or
reading frames
-nogap turn off gapped alignment method, reporting only ungapped HSPs
-wstrict impose strict requirement for word hits in ungapped alignments
-gapall perform gapped alignment procedure on all ungapped HSPs*
-gapE <e> expectation threshold of sets of ungapped HSPs for subsequent
use in seeding gapped alignments (default gapall)
-gapE2 <e> expectation threshold for saving individual gapped alignments
-gapW <n> full band width for gapped alignment procedure
-gapX <s> drop-off score for gapped alignment procedure
-pingpong perform extra processing to help ensure a locally optimal
alignment (rarely useful)
-nosegs do not segment the query sequence on hyphen (-) characters
-olf <f> max. fractional length of overlap for HSP consistency
-golf <f> max. fractional length overlap for GSP consistency
-olmax <n> max. absolute length of overlap for HSP consistency (default
unlimited)
-golmax <n> max. absolute length of overlap for GSP consistency (default
unlimited)
-gapdecayrate <f> characteristic parameter of geometric weights (default
0.5)
-span2 discard HSPs spanned on both query and subject by a better HSP*
-span1 discard HSPs spanned on query, subject or both by a better HSP
-span do not discard HSPs spanned by other, better HSPs
-prune do not prune insignificant HSPs from the output lists
-consistency turn off HSP consistency rules for statistics
-links display consistent link information for each alignment
-topcomboN <n> report this number of consistent (colinear) groups of HSPs
-topcomboE <e> only show HSP combos within this factor of the best combo
-sumstatsmethod <n> specify an alternate use of Sum statistics
-hspsepqmax <n> max. separation allowed between HSPs along query
-hspsepsmax <n> max. separation allowed between HSPs along subject
-altscore "qc,sc,score" qc and sc may be letters or "all"; score may be
numeric, "min", "max", or "na" (not allowed)
-altscore "none" clears any previous altscore specifications
-hspmax <n> max. number of ungapped HSPs saved per subject sequence
(default 1000; 0 => unlimited)
-gspmax <n> max. number of gapped HSPs (GSPs) saved per subject sequence
(default 0; 0 => unlimited)
-spoutmax <n> max. number of segment pairs reported in the output per
subject sequence (default 0; 0 => unlimited)
-qoffset <i> adjust query sequence coordinate numbers by this amount
-soffset <i> adjust subject sequence coordinate numbers by this amount
-nwstart <n> start generating neighborhood words here in query (default 1)
-nwlen <n> generate neighborhood words over this distance from nwstart
in query
-qrecmin <n> starting multi-query file record number to search
-qrecmax <n> ending multi-query file record number to search
-dbrecmin <n> starting database record number to search
-dbrecmax <n> ending database record number to search
-ucdb search nucleotide sequence database in uncompressed form
-vdbdescmax <n> limit depth of recursion to <n> in describing virtual
databases (default 1)
-dbchunks <n> no. of logical chunks of the database to assign to threads
-dbslice <m>/<n> search slice <m> out of a database sliced <n> ways
-dbslice <a>-<b>/<n> search slices <a> through <b> (inclusive) out of a
database sliced <n> ways
-gi display gi identifiers, when available
-noseqs do not display sequence alignments -- abbreviated output
-qtype exit non-zero if query seems to be of wrong type
-qres exit non-zero if query contains an invalid residue code
Multiple sort options can be specified and are applied in the user-specified
order.
-sort_by_pvalue list subjects in decreasing P-value order*
-sort_by_count list subjects by the number of HSPs
-sort_by_highscore list subjects by highest HSP score
-sort_by_totalscore list subjects by the sum total of HSP scores
-sort_by_subjectlength list subjects with longer sequences first
-cpus <n> no. of processors to utilize on multi-processor systems
-mmio do not use memory-mapped I/O (usually slower)
-nonnegok make all non-negative expected scores a non-FATAL error
-novalidctxok make no valid contexts a non-FATAL error
-shortqueryok make queries shorter than the word length a non-FATAL error
-notes suppress informatory messages
-warnings suppress warning messages
-errors suppress non-fatal error messages (strongly discouraged)
-putenv "NAME=VALUE" set environment variable NAME to the specified VALUE
-endputenv ignore any subsequent putenv options on the command line
-getenv NAME display the value of the environment variable NAME
-endgetenv ignore any subsequent getenv options on the command line
-compat1.4 revert to BLAST version 1.4 behavior (with bug fixes)
-compat1.3 revert to BLAST version 1.3 behavior (with bug fixes)
-haltonfatal halt multi-query execution on occurrence of first FATAL
-globalexit append EXIT CODE 12 to output if any multi-query was fatal
-abortonerror abort (and possibly dump core) on a non-fatal error
-abortonfatal abort (and possibly dump core) on a fatal error
-progress <n> report progress of search at least this often (in seconds)
-o fname write output to file named "fname", instead of stdout
*Default program behavior
This document is available as http://biowulf.nih.gov/apps/wublast.html