FASTA on Biowulf is intended for running a large number of sequence
files, such as hundreds or thousands of query sequences, against the databases.
If you have just a few query sequences, you should use FASTA on Helix.
Please contact the Helix Systems staff staff@helix.nih.gov,
or 4-6248 if you have questions about your FASTA jobs.
The 'easyfasta' program on Biowulf simplifies submission of large Fasta
jobs. You need to put all your query sequences into a directory, and then
type 'easyfasta' at the Biowulf prompt. You will be prompted for all required
parameters. The script will then decide what kind of nodes you need (based
on the database you are searching against), and submit your job to as
many nodes as are available (max 32).
Easyfasta will select the number and type of nodes for you. If you want
more control, you can bypass Easyfasta and use the underlying scripts
directly, via the instructions below.
Submit the job to the batch system using the 'qsub' command.
If your query sequences are all in one file, and you need to split them
into multiple sequence files, there are a couple of utilities available:
Some recent runs on our system to give you an idea of what sort of performance
to expect.
All runs were 2 processors/node.
Fasta programs can accept databases in many different formats.
User can use their own database or ready-to-use databases on Helix. These
databases can be found in the directory /fdb/. User use
one of the two methods below to define the $DB variable in the environmental
file:
1. To use a FASTA-FORMATTED database in ONE single file, the environmental
file should look like this:
2. To use database composed of several files, or databases, or database
in NON-FASTA format:
When analyzing a large number of sequences with fasta, it is imperative
that the database fit entirely within the memory of a given node. This
makes a vast difference in the performance of fasta.
Add up the size of the files whose name is nt.**.nsq
If the result is 2,786,861,701 which is > 2 GB, then 4 Gb nodes are
required ( -l nodes=NodesNumberByUser:m4096 ). If the result is 789,890,000
which is 700 Mb then 1 Gb nodes are required ( -l nodes=NodesNumberByUser:m1024
). If the result is greater than 4 GB, then the dual core node which has
8 GB of memory should be requested ( -l nodes=NodesNumberByUser:dc ).
Please note, if dual core nodes are requested, please add -T 4 to the
PARAMS variable in the environmental file to make use of all 4 processors
in a dual core node.
At the present time the Biowulf cluster consists of nodes with different
memory configurations. Please see the hardware
section of the User Guide for the current configuration of the cluster.
1. DNA sequences against the nucleotide non-redundant database (nt)
2. Same data against the protein non-redundant database (nr)
-a (fasta3, ssearch3 only) show both sequences in
their entirety.
-A force Smith-Waterman alignments for fasta3 DNA
sequences. By default, only fasta3 protein sequence comparisons use
Smith-Waterman alignments.
-B Show normalized score as a z-score, rather than
a bit-score in the list of best scores.
-b # Number of sequence scores to be shown on output.
In the absence of this option, fasta (and tfasta and ssearch) display
all library sequences obtaining similarity scores with expectations
less than 10.0 if optimized score are used, or 2.0 if they are not.
The -b option can limit the display further, but it will not cause additional
sequences to be displayed.
-c # Threshold score for optimization (OPTCUT). Set
"-c 1" to optimize every sequence in a database.
-E # Limit the number of scores and alignments shown
based on the expected number of scores. Used to override the expectation
value of 10.0 used by default. When used with -Q, -E 2.0 will show all
library sequences with scores with an expectation value <= 2.0.
-d # Maximum number of alignments to be displayed.
Ignored if "-Q" is not used.
-f Penalty for the first residue in a gap (-12 by
default for proteins, -16 for DNA, -15 for FAST[XY]/TFAST[XY]).
-F # Limit the number of scores and alignments shown
based on the expected number of scores. "-E #" sets the highest
E()-value shown; "-F #" sets the lowest E()-value. Thus, "-F
0.0001" will not show any matches or alignments with E() < 0.0001.
This allows one to skip over close relationships in searches for more
distant relationships.
-g Penalty for additional residues in a gap (-2 by
default for proteins, -4 for DNA, -3 for FAST[XY]/TFAST[XY]).
-h Penalty for frameshift (fastx3/y3, tfastx3/y3 only).
-H Omit histogram.
-i Invert (reverse complement) the query sequence
if it is DNA. For tfasta3/x3/y3, search the reverse complement of the
library sequence only.
-j # Penalty for frameshift within a codon (fasty3/tfasty3
only).
-l file Location of library menu file (FASTLIBS).
-L Display more information about the library sequence
in the alignment.
-M low-high Range of amino acid sequence lengths to
be included in the search.
-m # Specify alignment type: 0, 1, 2, 3, 4, 5, 6,
9, 10
-m 0 -m 1 -m 2 -m 3 -m 4 MWRTCGPPYT MWRTCGPPYT MWRTCGPPYT MWRTCGPPYT
::..:: ::: xx X ..KS..Y... MWKSCGYPYT ---------- MWKSCGYPYT MWKSCGYPYT
-m 5 provides a combination of -m 4 and -m 0. -m 6 provides -m 5 plus
HTML formatting.
-m 9 provides coordinates and scores with the best score information.
A simple " -m 9 extends the normal best score information:
The best scores are: opt bits E(14548) XURTG4 glutathione transferase
(EC 2.5.1.18) 4 - ( 219) 1248 291.7 1.1e-79
to include the additional information (on the same line, separated
by a <tab>):
%_id %_gid sw alen an0 ax0 pn0 px0 an1 ax1 pn1 px1 gapq gapl fs 0.771
0.771 1248 218 1 218 1 218 1 218 1 219 0 0 0
-m 9c provides additional information: an encoded alignment string.
Thus:
10 20 30 40 50 60 70 GT8.7 NVRGLTHPIRMLLEYTDSSYDEKRYTMGDAPDFDRSQWLNEKFKL--GLDFPNLPYL-IDGSHKITQ
:.:: . :: :: . .::: : .: ::.: .: : ..:.. ::: :..: XURTG NARGRMECIRWLLAAAGVEFDEK---------FIQSPEDLEKLKKDGNLMFDQVPMVEIDG-MKLAQ
20 30 40 50 60
would be encoded:
=23+9=13-2=10-1=3+1=5
The alignment encoding is with repect to the alignment, not the sequences.
The coordinate of the alignment is given earlier in the " -m 9c"
line.
-m 10 -m 10 is a new, parseable format for use with other programs.
See the file "readme.v20u4" for a more complete description.
As of version "fa34t23b2", it has become possible to combine
independent "-m" options. Thus, one can use "-m 1 -m
6 -m 9".
-M low-high Include library sequences (proteins only) with lengths
between low and high.
-n Force the query sequence to be treated as a DNA
sequence. This is particularly useful for query sequences that contain
a large number of ambiguous residues, e.g. transcription factor binding
sites.
-O Send copy of results to "filename." Helpful
for environments without STDOUT (mostly for the Macintosh).
-o Turn off default optimization of all scores greater
than OPTCUT. Sort results by "initn" scores (reduces the accuracy
of statistical estimates).
-p Force query to be treated as protein sequence.
-Q,-q Quiet - does not prompt for any input. Writes
scores and alignments to the terminal or standard output file.
-r Specify match/mismatch scores for DNA comparisons.
The default is "+5/-4". "+3/-2" can perform better
in some cases.
-R file Save a results summary line for every sequence
in the sequence library. The summary line includes the sequence identifier,
superfamily number (if available) position in the library, and the similarity
scores calculated. This option can be used to evaluate the sensitivity
and selectivity of different search strategies (Pearson, 1995, Pearson,
1998).
-s file Specify the scoring matrix file. fasta3 uses
the same scoring matrices as Blast1.4/2.0. Several scoring matrix files
are included in the standard distribution. For protein sequences: codaa.mat
- based on minimum mutation matrix; idnaa.mat - identity matrix; pam250.mat
- the PAM250 matrix developed by Dayhoff et al. (Dayhoff et al., 1978);
pam120.mat - a PAM120 matrix. The default scoring matrix is BLOSUM50
("-s BL50"). Other matrices available from within the program
are: PAM250/"-s P250", PAM120/"-s P120", PAM40/"-s
P40", PAM20/"-s P20", MDM10 - MDM40/"-s M10 - M40"
(MDM are modern PAM matrices from Jones et al. (Jones et al., 1992),),
BLOSUM50, 62, and 80/"-s BL50", "-s BL62", "-s
BL80".
-S Treat lower-case characters in the query or library
sequences as "low-complexity" ("seg"-ed) residues.
Traditionally, the "seg" program (Wootton and Federhen, 1993)
is used to remove low complexity regions in DNA sequences by replacing
the residues with an "X". When the "-S" option is
used, the FASTA33 programs provide a potentially more informative approach.
With "-S", lower case characters in the query or database
sequences are treated as "X"'s during the initial scan, but
are treated as normal residues during the final alignment display. Since
statistical significance is calculated from the similarity score calculated
during the library search, when the lower case residues are "X"'s,
low complexity regions will not produce statistically significant matches.
However, if a significant alignment contains low complexity regions,
their alignmen is shown. With "-S", lower case characters
may be included in the alignment to indicate low complexity regions,
and the final alignment score may be higher than the score obtained
during the search.
The pseg program can be used to produce databases (or query sequences)
with lower case residues indicating low complexity regions using the
command:
pseg database.fasta -z 1 -q > database.lc_seg
(seg can also be used with some post processing, see readme.v33tx.)
-U Treat the query sequence an RNA sequence. In addition
to selecting a DNA/RNA alphabet, this option causes changes to the scoring
matrix so that 'G:A' , 'T:C' or 'U:C' are scored as 'G:G'.
-V str It is now possible to specify some annotation
characters that can be included (and will be ignored), in the query
sequence file. Thus, One might have a file with: "ACVS*ITRLFT?",
where "*" and "?" are used to indicate phosphorylation.
By giving the option -V '*?', those characters in the query will be
moved to an "annotation string", and alignments that include
the annotated residues will be highlighted with the appropriate character
above the sequence (on the number line).
-w # Line length (width) = number (<200)
-W # context length (default is 1/2 of line width
-w) for alignment, like fasta and ssearch, that provide additional sequence
context.
-x # Specify the penalty for a match to an 'X', independently
of the PAM matrix. Particularly useful for fastx3/fasty3, where termination
codons are encoded as 'X'.
-X Specifies offsets for the beginning of the query
and library sequence. For example, if you are comparing upstream regions
for two genes, and the first sequence contains 500 nt of upstream sequence
while the second contains 300 nt of upstream sequence, you might try:
fasta -X "-500 -300" seq1.nt seq2.nt
If the -X option is not used, FASTA assumes numbering starts with
1. (You should double check to be certain the negative numbering works
properly.)
-y Set the width of the band used for calculating
"optimized" scores. For proteins and ktup=2, the width is
16. For proteins with ktup=1, the width is 32 by default. For DNA the
width is 16.
-z -1,0,1,2,3,4,5 -z -1 turns off statistical calculations.
z 0 estimates the significance of the match from the mean and standard
deviation of the library scores, without correcting for library sequence
length. -z 1 (the default) uses a weighted regression of average score
vs library sequence length; -z 2 uses maximum likelihood estimates of
Lambda and K; -z 3 uses Altschul-Gish parameters (Altschul and Gish,
1996); -z 4 - 5 uses two variations on the -z 1 strategy. -z 1 and -z
2 are the best methods, in general.
-z 11,12,14,15 estimate the statistical parameters
from shuffled copies of each library sequence. This doubles the time
required for a search, but allows accurate statistics to be estimated
for libraries comprised of a single protein family.
-Z db_size set the apparent size of the database to
be used when calculating expectation E() values. If you searched a database
with 1,000 sequences, but would like to have the E()-values calculated
in the context of a 100,000 sequence database, use '-Z 100000'.
-1 sort output by init1 score (for compatibility with
FASTP - do not use).
-3 translate only three forward frames
For example:
fasta -w 80 -a seq1.aa seq.aa
would compare the sequence in seq1.aa to that in seq2.aa and display
the results with 80 residues on an output line, showing all of the residues
in both sequences. Be sure to enter the options before entering the
file names, or just enter the options on the command line, and the program
will prompt for the file names.
(November, 1997) In addition, it is now possible to provide the fasta
programs with the query sequence (fasta, fasty, ssearch, tfastx), or
two sequences (prss, lalign, plalign) from the unix "stdin"
stream. This makes it much easier to set up FASTA or PRSS WWW pages.
To specify that stdin be used, rather than a file, the file name should
be specified as '-' or '@' (the latter file name makes it possible to
specify a subset of the sequence). Thus:
cat query.aa | fasta -q @:25-75 s
would take residues 25-75 from query.aa and search the 's' library
(see the discussion of FASTLIBS).