biowulf_logo

Status
About
Hardware
Applications
Batch queues
Disk storage

MPI
Performance
New Users
User Guide
Documentation
Research
Photos


High-throughput FASTA on Biowulf

FASTA on Biowulf is intended for running a large number of sequence files, such as hundreds or thousands of query sequences, against the databases. If you have just a few query sequences, you should use FASTA on Helix. Please contact the Helix Systems staff staff@helix.nih.gov, or 4-6248 if you have questions about your FASTA jobs.

[Sample Session] [Instructions] [Splitting Sequences] [Benchmarks]

[Databases] [Memory] [More Examples] [Programs/Scripts] [Parameters]

EasyFasta: An easy interface to Fasta on Biowulf

The 'easyfasta' program on Biowulf simplifies submission of large Fasta jobs. You need to put all your query sequences into a directory, and then type 'easyfasta' at the Biowulf prompt. You will be prompted for all required parameters. The script will then decide what kind of nodes you need (based on the database you are searching against), and submit your job to as many nodes as are available (max 32).


Sample Session (user input is in bold):

<biowulf>75% easyfasta

-----------------------------------------------------------------------
EasyFasta: Fasta for large numbers of sequences
-----------------------------------------------------------------------

Enter the directory which contains your input sequences: /data/user/inputdir/1000seq

Enter the directory where you want your Fasta output to go: /data/user/outputdir

/data/maoj/easyfasta/out does not exist; creating it

Are your query sequences nucleotides (y/n)? y

Fasta programs:
fasta - nt seq against nt database; aa seq against aa database
fastx - nt seq against aa database, 6 for/rev reading frames, w/frameshifts
fasty - similar to fastx but slower, more sensitive
ssearch - nt seq against nt database or aa seq against aa database using Smith-Waterman algorithm

Which program do you want to run: fasta

The following nucleotide databases are available:
(Or enter your own database with full pathname)

nt Non-redundant GenBank+EMBL+DDBJ+PDB sequences(no EST,STS,GSS,HTG)
est_human Human sequences from the EST division of Genbank.
est_mouse Mouse sequences from the EST division of Genbank.
htgs High-throughput genome sequences
mito.nt Mitochondrial nucleotide sequences
ecoli.nt Ecoli genomic nucleotide sequences
pdb.nt PDBnucleotide sequences
yeast.nt Saccharomyces cerevisiae genomic nucleotide sequences
drosoph.nt Genbank drosophila sequences
ref.human.rna Refseq human nucleotide sequences
ref.mouse.rna Refseq mouse nucleotide sequences
ref.human.genomic Refseq Human (NC_######) chromosome records with gap adjusted
  concatenated NT contigs
ref.other.genomic RefSeq chromosome records (NC_######) for organisms other than human
hs_genome Build 36, hg18 (April 2006) from the International Human Genome Consortium
hs_genome.rna Build 36, hg18 (April 2006) from the International Human Genome Consortium
mouse_genome Build 34, mm6, May 2005 from the Mouse Genome Consortium
mouse_genome.rna Build 34, mm6, May 2005 from the Mouse Genome Consortium

Database to run against: nt
What is the ktup value (default 6)?

http://helix.nih.gov/apps/bioinfo/fasta3x.txt has a full list of available options.
Any additional Fasta options (e.g. -v 10): -v 10

Creating parameter file /data/maoj/fasta_tmp.15438/fasta_par.15438
Checking node situation....

94 are available for 1000 sequences

qsub -v np=20,read=/data/maoj/fasta_tmp.15438/fasta_par.15438 -l nodes=20:m2048
/usr/local/fasta/bin/easyrunfasta

Submitting to 20 nodes. Job number is 537317.biobos

Monitor your job at http://biowulf.nih.gov/cgi-bin/queuemon?537317.biobos

 
  • Easyfasta figures out the node memory required, sets up all temporary files and directories, and submits the job for you.
  • To run against your own database, enter the db name with full path at the Database: prompt. If your database is not in fasta format, please see Database section below. For example:
    Database to run against: /data/username/fasta_db/my_db


Detailed Instructions

Easyfasta will select the number and type of nodes for you. If you want more control, you can bypass Easyfasta and use the underlying scripts directly, via the instructions below.

1. Set up a directory structure for input/output and tmp files.

For management purposes it is best (but not necessary) to keep all the input, output and temporary directories within a given subdirectory. Note that the /home/username directory is small, and intended for std.error/output files. Use the /data/username directory for your Fasta sequences, tmp files, and output. Every Biowulf user has a directory in /data, and this directory is accessible from either Biowulf or any of the Helix SGIs machines. The tmp and output directories will be created if they do not already exist.

2. Edit/create an Environment file which tells Fasta where the input/output files are, and what Fasta parameters to use.

This environment file can be in any directory, but it can help, organizationally, to keep it in your main directory for this run, e.g. /data/username/sample1/env

Sample environment file:
 setenv DB @/fdb/fastadb/fastanam/nt.nam 
              setenv KTUP 6 setenv PROG fasta_t setenv INDIR /data/username/sample1/indir 
              setenv OUTDIR /data/username/sample1/outdir setenv TMPDIR /data/username/sample1/temp1 
              setenv PARAMS "-b 10 -d 10"

where

$DB -- Database to act on (e.g. /fdb/fastadb/nr or your own database). More info.
$PROG -- Fasta program to run (fasta_t, fastx_t, fastf_t, fasts_t, fasty_t, prss_t, ssearch_t, tfasta_t, tfastx_t, tfastf_t, tfasts_t, tfasty_t)
$INDIR -- directory containing all the input sequence files.
$OUTDIR -- directory which will receive all the individual output files
$TMPDIR -- temporary directory used by the program.
$PARAMS-- Parameters for the Fasta program, in quotes (e.g. "-b 10 -v 10" )
(List of all available Fasta parameters)

  • Each sequence file may contain single or multiple sequences. Each file will go to a single node, so don't put ALL your sequences into a single file.
  • Existing files in $TMPDIR and $OUTDIR may be overwritten by this job. Use a different TMPDIR and OUTDIR if you want to preserve them.
  • The programs ended with _t are multithreaded version of their corresponding programs. They are defaulted to 2 threads. Since each node in Biowulf has 2 processors, _t versions should be used.

3. Submit the job to the batch system

Submit the job to the batch system using the 'qsub' command.

Example 1:

qsub -v np=4,read=/data/username/easyfasta/env1 
-l nodes=4:p2800:m2048 /usr/local/fasta/bin/easyrunfasta

This job is using the environment file /data/username/easyfasta/env1, will use 2 processors per node (because of the -T 2 flag given as a default parameter in Fasta programs) and has asked for 'm2048' nodes (memory = 2048 Mb). The 'shnodes' command will list all the Biowulf nodes and their properties.

Example 2:

qsub -v np=4,read=/data/username/easyfasta/env1 
-l nodes=4 /usr/local/fasta/bin/easyrunfasta

This job is using the fasta environment file /data/username/easyfasta/env1, and has asked for 4 nodes. No specific nodes have been asked for (i.e. no 'm####' specification on the nodes), so the batch system will allocate the first nodes that are available. (The command 'shnodes' will list all the nodes on the system and their various properties).

  • You can monitor any jobs you submit; see Monitoring your jobs in the Biowulf User Guide.
  • You need to submit your job to nodes that have enough memory for your particular job. See the section 'Fasta and Biowulf Memory Size' further down this page. If you are running against your own database, check the size of the .nsq or .psq file and submit to the smallest possible node memory.

    For example,

  • biowulf% ls -l 
    my_db.nsq -rw-rw-r-- 1 username username 321238769 Aug 31 2001 my_db.nsq

    The database file is 321238769 bytes, i.e. 321 Mb. The nodes should have at least 321 Mb, so you can submit to any nodes that have at least 512 MB. (currently, all nodes on the Biowulf cluster)

  • It is possible for each query sequence file to contain multiple sequences. However, there needs to be at least as many query sequence files as nodes! Occasionally Fasta may barf on a particular sequence, in which case it will not continue on to other sequences in that file. See below for splitting multi-sequence files.

Splitting your query sequences

If your query sequences are all in one file, and you need to split them into multiple sequence files, there are a couple of utilities available:

  • seqsplit: will split a multisequence fasta-format file into individual sequences. Usage:
     seqsplit -f sequence_file 

    If the file sequence_file contains 2000 sequences, you will get 2000 individual files. Each file will be named according to the sequence name in the fasta entry.

  • split_fasta: will split a multisequence fasta-format file into a desired number of files. Usage:
    split_fasta [optional parameters] [dir]file.fas 

    -n # number of split files (default=2)
    -o file root name of output file (default split#)
    -c # chunks to write out (default 100 entries)
    -d outdir output directory (default = input directory)
    -z if input file is .Z or .gz compressed.

    Thus, if a file has 100 sequences, and you want to split it into 5 multisequence files, use split_fasta -n 5 sequence_file will produce 5 files, each containing 100/5=20 sequences. The files will be called split0, split1,...split4. split_fasta -n 5 -o oligo sequence_file will produce 5 files, each containing 20 sequences. The files will be called oligo0, oligo1 ..oligo5.


Fasta Benchmarks

Some recent runs on our system to give you an idea of what sort of performance to expect.

Query Database Fasta Program Nodes Time
1000 nucleotide
EST sequence
nt nucleotide
updated 13 Jul 2004
2,283,112 sequences
2.8 Gb
fasta 10
4 Gb memory
18 hrs 51 min
month nucleotide
updated 17 Jul 2004
52,474 sequences
90 Mb
fasta

10
p2800
1 Gb memory

40 min
est
updated 14 Jul 2004
21,925,146 sequences
3.9 Gb
fasta 10
4 Gb memory
50 hrs 47 min
human genome
updated 12 Dec 2003
25 sequences
707 Mb
fasta

10
P2800
1 Gb memory

4 hrs 44 min
1000 protein sequences nt nucleotide
updated 13 Jul 2004
2,283,112 sequences
2.8 Gb
tfastx 10
4 Gb memory
30 hrs 53 min
nr protein
updated 3 Feb 2004
1,934,002sequences
700 Mb
fasta

10
p2800
1 Gb memory

1 hrs 59 min

All runs were 2 processors/node.

  • EasyFasta will typically allot 32 nodes (i.e. 64 processors) for a job. Large databases (e.g. nt nucleotide) may be allotted fewer nodes, since there are fewer nodes with > 2 Gb memory and hence they are less likely to be available.

Fasta Databases

Fasta programs can accept databases in many different formats.

User can use their own database or ready-to-use databases on Helix. These databases can be found in the directory /fdb/. User use one of the two methods below to define the $DB variable in the environmental file:

1. To use a FASTA-FORMATTED database in ONE single file, the environmental file should look like this:

setenv DB /fdb/fastadb/pdb.nt.fas
setenv KTUP 6
setenv PROG fasta_t
setenv INDIR /data/username/sample1/indir
setenv OUTDIR /data/username/sample1/outdir
setenv TMPDIR /data/username/sample1/temp1
setenv PARAMS "-b 10 -d 10"

2. To use database composed of several files, or databases, or database in NON-FASTA format:

  • create a file (/data/user/dbfile) which look like this with the databases full path and format number (see below):
/fdb/blastdb/nt.00 12
/fdb/blastdb/nt.01 12
/fdb/blastdb/nt.02 12
/data/username/mydb.fas 0
/fdb/embossdb/estnew 3
  • The environmental file should look like this:

setenv DB @/data/user/dbfile
setenv KTUP 6
setenv PROG fasta_t
setenv INDIR /data/username/sample1/indir
setenv OUTDIR /data/username/sample1/outdir
setenv TMPDIR /data/username/sample1/temp1
setenv PARAMS "-b 10 -d 10"

  • Make sure the '@' sign is in front of the file which contains the full path of the databases and their format number.
    The format number can be determined using the following list:

    0    Pearson/FASTA (>SEQID - comment/sequence)
    1    Uncompressed Genbank (LOCUS/DEFINITION/ORIGIN)
    2    NBRF CODATA (ENTRY/SEQUENCE)
    3    EMBL/SWISS-PROT (ID/DE/SQ)
    4    Intelligenetics (;comment/SEQID/sequence)
    5    NBRF/PIR VMS (>P1;SEQID/comment/sequence)
    6    GCG (version 8.0) Unix Protein and DNA (compressed)
    11  NCBI Blast1.3.2 format (unix only)
    12  NCBI Blast2.0 format (unix only, fasta32t08 or later)

The Helix databases are updated frequently. See Blast Database Update Status. Also see 'Finding the libraries' in http://helix.nih.gov/apps/bioinfo/fasta3x.txt for more details.


Fasta and Biowulf Memory Size

When analyzing a large number of sequences with fasta, it is imperative that the database fit entirely within the memory of a given node. This makes a vast difference in the performance of fasta.

For example, in the environmental file above, the database line reads:

setenv DB @/fdb/fastadb/fastanam/nt.nam

In this nt.nam file, there are full path of 3 database files::

biowulf 75% more /fdb/fastadb/fastanam/nt.nam
/fdb/blastdb/nt.00 12
/fdb/blastdb/nt.01 12
/fdb/blastdb/nt.02 12

The size of the files IN the file /fdb/fastadb/fastanam/nt.nam needs to be sumed up in order to decide the memory of the nodes required. NOT the file /fdb/fastadb/fastanam/nt.nam itself. Do this:

Helix 100% ls -l /fdb/blastdb/nt.*

Add up the size of the files whose name is nt.**.nsq

If the result is 2,786,861,701 which is > 2 GB, then 4 Gb nodes are required ( -l nodes=NodesNumberByUser:m4096 ). If the result is 789,890,000 which is 700 Mb then 1 Gb nodes are required ( -l nodes=NodesNumberByUser:m1024 ). If the result is greater than 4 GB, then the dual core node which has 8 GB of memory should be requested ( -l nodes=NodesNumberByUser:dc ). Please note, if dual core nodes are requested, please add -T 4 to the PARAMS variable in the environmental file to make use of all 4 processors in a dual core node.

At the present time the Biowulf cluster consists of nodes with different memory configurations. Please see the hardware section of the User Guide for the current configuration of the cluster.

  • Use the "shnodes" command to see a list of nodes with their properties and status (free, job-exclusive, offline, down)
  • Fasta jobs should be submitted with the appropriate node designation, depending on the size of the target database. If no size is specified, the batch system will allocate nodes based on availability and system-specified order.

More Examples

1. DNA sequences against the nucleotide non-redundant database (nt)

  • Note that 'username' in the examples below should be replaced by your own username!!
biowulf$ mkdir /data/username/run1             make main directory for this run
biowulf$ cd /data/username/run1                go this directory
biowulf$ mkdir input output tmp                make subdirectories for this run
biowulf$ cd input                              go to the 'input' subdirectory
biowulf$ cp /home/username/*.seq .             copy all the sequence files into this subdir
  • Create a Fasta environment file in /data/username/run1/env. The file contains
setenv DB @/fdb/fastadb/fastanam/nt.nam
setenv PROG fasta_t
setenv INDIR  /data/username/run1/seqs
setenv OUTDIR /data/username/run1/out
setenv TMPDIR  /data/username/run1/tmp
setenv PARAMS "-H -b 10 -v 10"
  • Submit the job with the command:
qsub -v np=8,read=/data/username/run1/env 
-l nodes=8:m1024 /usr/local/fasta/bin/easyrunfasta

2. Same data against the protein non-redundant database (nr)

  • Create a new environment file /data/username/run1/env2 which contains:
setenv DB @/fdb/fastadb/fastanam/nr.nam
setenv PROG fasta_t
setenv PARAMS "-H -b 10 -v 10"
setenv INDIR  /data/username/run1/seqs
setenv OUTDIR /data/username/run1/out2
setenv TMPDIR  /data/username/run1/tmp2
  • Submit the job using:
qsub -v np=16,read=/data/username/run1/env2 
-l nodes=16 /usr/local/fasta/bin/easyrunfasta
  • The protein nr database is not very large and so requires only 512M of memory. Thus you don't have to ask for any special node properties.

Programs/Scripts/Files involved:

Fasta programs:

Program
Description
FASTA Compares a protein sequence to another protein sequence or to a protein database, or a DNA sequence to another DNA sequence or a DNA library.
SSEARCH Performs a rigorous Smith-Waterman alignment between a protein sequence and another protein sequence or a protein database, or with DNA sequence to another DNA sequence or a DNA library (very slow).
FASTX/FASTY Compares a DNA sequence to a protein sequence database, translating the DNA sequence in three forward (or reverse) frames and allowing frameshifts.
TFASTX/TFASTY Compares a protein sequence to a DNA sequence or DNA sequence library. The DNA sequence is translated in three forward and three reverse frames, and the protein query sequence is compared to each of the six derived protein sequences. The DNA sequence is translated from one end to the other; no attempt is made to edit out intervening sequences. Termination codons are translated into unknown ('X') amino acids.
FASTF/TFASTF Compares an ordered peptide mixture, as would be obtained by Edman degredation of a CNBr cleavage of a protein, against a protein (fastf) or DNA (tfastf) database. A different format is required to specify the ordered peptide mixture:
>mgstm1
MGCEN,
MIDYP,
MLLAY,
MLLGY
indicates m in the first position of all three peptides (as from CNBr), g, i, l (twice) in the second position (first cycle), c,d,l (twice) in the third position, etc. The commas (,) are required to indicate the number of fragments in the mixture, but there should be no comma after the last residue.
FASTS/TFASTS Compares set of short peptide fragments, as would be obtained from mass-spec. analysis of a protein, against a protein (fasts) or DNA (tfasts) database. A different format is required to specify the ordered peptide mixture:
>mgstm1
MILG,
MLLEYTD,
MGDAP
indicates three peptide fragments were found: MILG, MLLEYTD, and MGDAP. The commas (,) are required to indicate the number of fragments in the mixture, but there should be no comma after the last residue.

 

Fasta scripts:

These scripts can be copied from /usr/local/fasta/bin and modified if desired, although this should not be necessary.
  • easyrunfasta-- (aka runfasta) Batch(PBS) submission file which sets up the mpi wrapper (multirun) for 2 perl scripts
  • fastadist -- sorts input query sequences by size to balance the load on the nodes. Makes lists of assigned sequences for each node in $TMPDIR
  • mpifasta -- main execution of fasta...processes all query sequences assigned to a given node
  • env -- file which contains appropriate variables and parameters for fasta execution
  • fasta -- fasta and all its associated executables and parameters are located in /usr/local/fasta.
  • multirun -- MPI wrapper program which allows for different behavior on different nodes
  • cleanup - cleans up tmp files for easyfasta.

List of all available Fasta parameters

These are the parameters available for the fasta programs. See also (the Fasta documentation)

-a (fasta3, ssearch3 only) show both sequences in their entirety.

-A force Smith-Waterman alignments for fasta3 DNA sequences. By default, only fasta3 protein sequence comparisons use Smith-Waterman alignments.

-B Show normalized score as a z-score, rather than a bit-score in the list of best scores.

-b # Number of sequence scores to be shown on output. In the absence of this option, fasta (and tfasta and ssearch) display all library sequences obtaining similarity scores with expectations less than 10.0 if optimized score are used, or 2.0 if they are not. The -b option can limit the display further, but it will not cause additional sequences to be displayed.

-c # Threshold score for optimization (OPTCUT). Set "-c 1" to optimize every sequence in a database.

-E # Limit the number of scores and alignments shown based on the expected number of scores. Used to override the expectation value of 10.0 used by default. When used with -Q, -E 2.0 will show all library sequences with scores with an expectation value <= 2.0.

-d # Maximum number of alignments to be displayed. Ignored if "-Q" is not used.

-f Penalty for the first residue in a gap (-12 by default for proteins, -16 for DNA, -15 for FAST[XY]/TFAST[XY]).

-F # Limit the number of scores and alignments shown based on the expected number of scores. "-E #" sets the highest E()-value shown; "-F #" sets the lowest E()-value. Thus, "-F 0.0001" will not show any matches or alignments with E() < 0.0001. This allows one to skip over close relationships in searches for more distant relationships.

-g Penalty for additional residues in a gap (-2 by default for proteins, -4 for DNA, -3 for FAST[XY]/TFAST[XY]).

-h Penalty for frameshift (fastx3/y3, tfastx3/y3 only).

-H Omit histogram.

-i Invert (reverse complement) the query sequence if it is DNA. For tfasta3/x3/y3, search the reverse complement of the library sequence only.

-j # Penalty for frameshift within a codon (fasty3/tfasty3 only).

-l file Location of library menu file (FASTLIBS).

-L Display more information about the library sequence in the alignment.

-M low-high Range of amino acid sequence lengths to be included in the search.

-m # Specify alignment type: 0, 1, 2, 3, 4, 5, 6, 9, 10

-m 0 -m 1 -m 2 -m 3 -m 4 MWRTCGPPYT MWRTCGPPYT MWRTCGPPYT MWRTCGPPYT ::..:: ::: xx X ..KS..Y... MWKSCGYPYT ---------- MWKSCGYPYT MWKSCGYPYT

-m 5 provides a combination of -m 4 and -m 0. -m 6 provides -m 5 plus HTML formatting.

-m 9 provides coordinates and scores with the best score information. A simple " -m 9 extends the normal best score information:

The best scores are: opt bits E(14548) XURTG4 glutathione transferase (EC 2.5.1.18) 4 - ( 219) 1248 291.7 1.1e-79

to include the additional information (on the same line, separated by a <tab>):

%_id %_gid sw alen an0 ax0 pn0 px0 an1 ax1 pn1 px1 gapq gapl fs 0.771 0.771 1248 218 1 218 1 218 1 218 1 219 0 0 0

-m 9c provides additional information: an encoded alignment string. Thus:

10 20 30 40 50 60 70 GT8.7 NVRGLTHPIRMLLEYTDSSYDEKRYTMGDAPDFDRSQWLNEKFKL--GLDFPNLPYL-IDGSHKITQ :.:: . :: :: . .::: : .: ::.: .: : ..:.. ::: :..: XURTG NARGRMECIRWLLAAAGVEFDEK---------FIQSPEDLEKLKKDGNLMFDQVPMVEIDG-MKLAQ 20 30 40 50 60

would be encoded:

=23+9=13-2=10-1=3+1=5

The alignment encoding is with repect to the alignment, not the sequences. The coordinate of the alignment is given earlier in the " -m 9c" line.

-m 10 -m 10 is a new, parseable format for use with other programs. See the file "readme.v20u4" for a more complete description.

As of version "fa34t23b2", it has become possible to combine independent "-m" options. Thus, one can use "-m 1 -m 6 -m 9".

-M low-high Include library sequences (proteins only) with lengths between low and high.

-n Force the query sequence to be treated as a DNA sequence. This is particularly useful for query sequences that contain a large number of ambiguous residues, e.g. transcription factor binding sites.

-O Send copy of results to "filename." Helpful for environments without STDOUT (mostly for the Macintosh).

-o Turn off default optimization of all scores greater than OPTCUT. Sort results by "initn" scores (reduces the accuracy of statistical estimates).

-p Force query to be treated as protein sequence.

-Q,-q Quiet - does not prompt for any input. Writes scores and alignments to the terminal or standard output file.

-r Specify match/mismatch scores for DNA comparisons. The default is "+5/-4". "+3/-2" can perform better in some cases.

-R file Save a results summary line for every sequence in the sequence library. The summary line includes the sequence identifier, superfamily number (if available) position in the library, and the similarity scores calculated. This option can be used to evaluate the sensitivity and selectivity of different search strategies (Pearson, 1995, Pearson, 1998).

-s file Specify the scoring matrix file. fasta3 uses the same scoring matrices as Blast1.4/2.0. Several scoring matrix files are included in the standard distribution. For protein sequences: codaa.mat - based on minimum mutation matrix; idnaa.mat - identity matrix; pam250.mat - the PAM250 matrix developed by Dayhoff et al. (Dayhoff et al., 1978); pam120.mat - a PAM120 matrix. The default scoring matrix is BLOSUM50 ("-s BL50"). Other matrices available from within the program are: PAM250/"-s P250", PAM120/"-s P120", PAM40/"-s P40", PAM20/"-s P20", MDM10 - MDM40/"-s M10 - M40" (MDM are modern PAM matrices from Jones et al. (Jones et al., 1992),), BLOSUM50, 62, and 80/"-s BL50", "-s BL62", "-s BL80".

-S Treat lower-case characters in the query or library sequences as "low-complexity" ("seg"-ed) residues. Traditionally, the "seg" program (Wootton and Federhen, 1993) is used to remove low complexity regions in DNA sequences by replacing the residues with an "X". When the "-S" option is used, the FASTA33 programs provide a potentially more informative approach. With "-S", lower case characters in the query or database sequences are treated as "X"'s during the initial scan, but are treated as normal residues during the final alignment display. Since statistical significance is calculated from the similarity score calculated during the library search, when the lower case residues are "X"'s, low complexity regions will not produce statistically significant matches. However, if a significant alignment contains low complexity regions, their alignmen is shown. With "-S", lower case characters may be included in the alignment to indicate low complexity regions, and the final alignment score may be higher than the score obtained during the search.

The pseg program can be used to produce databases (or query sequences) with lower case residues indicating low complexity regions using the command:

pseg database.fasta -z 1 -q > database.lc_seg

(seg can also be used with some post processing, see readme.v33tx.)

-U Treat the query sequence an RNA sequence. In addition to selecting a DNA/RNA alphabet, this option causes changes to the scoring matrix so that 'G:A' , 'T:C' or 'U:C' are scored as 'G:G'.

-V str It is now possible to specify some annotation characters that can be included (and will be ignored), in the query sequence file. Thus, One might have a file with: "ACVS*ITRLFT?", where "*" and "?" are used to indicate phosphorylation. By giving the option -V '*?', those characters in the query will be moved to an "annotation string", and alignments that include the annotated residues will be highlighted with the appropriate character above the sequence (on the number line).

-w # Line length (width) = number (<200)

-W # context length (default is 1/2 of line width -w) for alignment, like fasta and ssearch, that provide additional sequence context.

-x # Specify the penalty for a match to an 'X', independently of the PAM matrix. Particularly useful for fastx3/fasty3, where termination codons are encoded as 'X'.

-X Specifies offsets for the beginning of the query and library sequence. For example, if you are comparing upstream regions for two genes, and the first sequence contains 500 nt of upstream sequence while the second contains 300 nt of upstream sequence, you might try:

fasta -X "-500 -300" seq1.nt seq2.nt

If the -X option is not used, FASTA assumes numbering starts with 1. (You should double check to be certain the negative numbering works properly.)

-y Set the width of the band used for calculating "optimized" scores. For proteins and ktup=2, the width is 16. For proteins with ktup=1, the width is 32 by default. For DNA the width is 16.

-z -1,0,1,2,3,4,5 -z -1 turns off statistical calculations. z 0 estimates the significance of the match from the mean and standard deviation of the library scores, without correcting for library sequence length. -z 1 (the default) uses a weighted regression of average score vs library sequence length; -z 2 uses maximum likelihood estimates of Lambda and K; -z 3 uses Altschul-Gish parameters (Altschul and Gish, 1996); -z 4 - 5 uses two variations on the -z 1 strategy. -z 1 and -z 2 are the best methods, in general.

-z 11,12,14,15 estimate the statistical parameters from shuffled copies of each library sequence. This doubles the time required for a search, but allows accurate statistics to be estimated for libraries comprised of a single protein family.

-Z db_size set the apparent size of the database to be used when calculating expectation E() values. If you searched a database with 1,000 sequences, but would like to have the E()-values calculated in the context of a 100,000 sequence database, use '-Z 100000'.

-1 sort output by init1 score (for compatibility with FASTP - do not use).

-3 translate only three forward frames

For example:

fasta -w 80 -a seq1.aa seq.aa

would compare the sequence in seq1.aa to that in seq2.aa and display the results with 80 residues on an output line, showing all of the residues in both sequences. Be sure to enter the options before entering the file names, or just enter the options on the command line, and the program will prompt for the file names.

(November, 1997) In addition, it is now possible to provide the fasta programs with the query sequence (fasta, fasty, ssearch, tfastx), or two sequences (prss, lalign, plalign) from the unix "stdin" stream. This makes it much easier to set up FASTA or PRSS WWW pages. To specify that stdin be used, rather than a file, the file name should be specified as '-' or '@' (the latter file name makes it possible to specify a subset of the sequence). Thus:

cat query.aa | fasta -q @:25-75 s

would take residues 25-75 from query.aa and search the 's' library (see the discussion of FASTLIBS).

 


This document is available as http://biowulf.nih.gov/apps/fasta.html
Biowulf home page | Helix Systems | NIH