biowulf_logo

Status
About
Hardware
Applications
Batch queues
Disk storage

MPI
Performance
New Users
User Guide
Documentation
Research
Photos


BLAT (not Blast!) on Biowulf

BLAT is a DNA/Protein Sequence Analysis program written by Jim Kent at UCSC. It is designed to quickly find sequences of 95% and greater similarity of length 40 bases or more. It may miss more divergent or shorter sequence alignments. It will find perfect sequence matches of 33 bases, and sometimes find them down to 22 bases. BLAT on proteins finds sequences of 80% and greater similarity of length 20 amino acids or more. In practice DNA BLAT works well on primates, and protein blat on land vertebrates. For more information see the BLAT web page or Jim Kent's web page.

** EasyBlat: The easy interface

The 'easyblat' script simplifies running large BLAT jobs. You need to put all your query sequences into a directory, and then type 'easyblat' at the Biowulf prompt. You will be prompted for all required parameters. The script will then decide what kind of node you need (based on the database you choose) and submit your job to as many nodes as are available (max 24).

Sample session: (user input is in bold):
biobos% easyblat

EasyBLAT: BLAT (not Blast!) for large numbers of sequences

Enter the directory which contains your input sequences: /data/username/blast/ssqs
** ERROR: Input sequence directory /data/username/blast/ssqs does not exist
Enter the directory which contains your input sequences: /data/username/blast/seqs

Enter the directory where you want your BLAT output to go: /data/username/blat/out
** WARNING: There are already files in /data/username/blat/out which will be overwritten by this job.
** Continue? (y/n) : y

The following databases are available:
  H - Human Genome (May 2004) assembly
  M - Mouse Genome (June 2004) assembly
  O - Other databases
  Enter H, M or O for a detailed list: h
      
Human Genome (Build 35, May 2004) assembly:
chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11
chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20,
chr21, chr22, chrX, chrY,
chr1-9, chr10-Y
Enter human section to run against: chr10-Y

http://biowulf.nih.gov/blat.html has a full list of available parameters.
Any additional BLAT parameters (e.g. -maxGap=3):

Submitting to 24 nodes with m1028 memory. Job number is 87930.biobos

Monitor your job at http://biowulf.nih.gov/cgi-bin/queuemon?87930.biobos

As you see above, easyblat does some simple error checking, such as checking whether your query sequences exist. It will figure out the node memory required, set up all temporary files and directories, and submit the job for you.

You can run against your own database (any fasta format file) by selecting 'other databases', and then entering the full pathname of the database you want to search. For example:
The following databases are available:
  H - Human Genome (May 2004) assembly 
  M - Mouse Genome (June 2004) assembly 
  O - Other databases
Enter H, M or O for a detailed list: O
Other databases, updated weekly:
    month - new or revised GenBank+EMBL+DDBJ+PDB in the last 30 days
    pat - from the patent division of Genbank
    pdb - from the PDB 3-dimensional structures
    drosoph - Drosophila sequences
    ecoli - E. Coli sequences
    mito - mitochondrial sequences
    yeast - Yeast sequences

If using your own database, enter the full pathname.
Enter db to run against: /data/user/my_db.fas

[All BLAT parameters] [Biowulf monitors]


Detailed procedure

If you want more control over the number of nodes, you can bypass EasyBlat and use the underlying scripts directly, following the instructions below:
  1. Set up a directory structure for input/output and tmp files.
    For management purposes it is best (but not necessary) to keep all the input, output and temporary directories within a given subdirectory. For management purposes it is best (but not necessary) to keep all the input, output and temporary directories within a given subdirectory. Note that the /home/username directory is small, and intended for std.error/output files. Use the /data/username directory for your Blast sequences, tmp files, and output. Every Biowulf user has a directory in /data, and this directory is accessible from either Biowulf or any of the Helix SGIs. The tmp and output directories will be created if they do not already exist.

  2. Edit/create a file which tells Blat where the input/output files are, and what Blat parameters to use.
    This environment file can be in any directory, but it can help, organizationally, to keep it in your main directory for this run, e.g. /data/username/sample1/env

    Sample environment file:

    setenv DB /fdb/blastdb/genome/mouse-feb2003/chr11-X.fa
    setenv INDIR /data/user/blast/seqs
    setenv OUTDIR /data/user/blat_out/
    setenv TMPDIR /data/user/blat_tmp/
    setenv PARAMS ""
    
    where:
    $DB -- target database. You can enter your own fasta-format target file here.
    $INDIR -- directory containing all the input sequence files (may contain single or multiple sequences)
    $OUTDIR -- directory which will receive all the individual output files. Note that existing files in this directory may be overwritten by the new run.
    $TMPDIR -- temporary directory used by the program
    $PARAMS -- Optional parameters, in quotes, to pass to BLAT (e.g. "-maxGap=3" ) List of all options

  3. Submit the job to the batch system.
    Submit the job to the batch system using the 'qsub' command. Example:
    
    qsub -v np=16,read=/data/username/sample1/env -l nodes=8:m1024 /usr/local/blat/nih/runblat
    
    This job has asked for 16 processors (np=16), is using the environment file /data/username/sample1/env, is using 2 processors per node (nodes=8, processors=16, has asked for m1024 nodes (1024 Mb memory)).

    Another example:

    qsub -v np=8,read=/data/username/sample2/env -l nodes=8 /usr/local/blat/nih/runblat
    eric This job has asked for 8 processors (np=8), is using the blast environment file /data/username/sample2/env, and has asked for 8 nodes. No specific nodes have been asked for (i.e. no 'm2048' or 'm1024' specification on the nodes), so the batch system will allocate the first nodes that are available. (The command 'shnodes' will list all the nodes on the system and their various properties).

    It is an excellent idea to monitor any jobs you submit, with the Biowulf job or user monitor as shown on the right. More information about the monitors is in the User Guide


More Examples:

  1. DNA sequences against chr 8 from the Aug 2003 human genome build. Chromosome 8 is 149 Mb. (You can see the size of any chromosome file by typing 'ls -l' in the directory'. Therefore, two copies of this file will easily fit in node memory on any node in the system, so no memory requirements need be specified.
    
    Command :
    
    qsub -v np=16,read=/home/user/env -l nodes=8 /usr/local/blat/nih/runblat
    
    File "env" contains -----------------------
    setenv DB /fdb/blastdb/genome/ucsc-aug01/chr8.fa
    setenv INDIR  /data1/user/blat/query100
    setenv OUTDIR /data1/user/blat/out100
    setenv TMPDIR  /data1/user/blat/tmp
    setenv PARAMS " "
    --------------------------------------------
    
  2. Same input data against entire human genome

    ls -l /fdb/blastdb/genome/ucsc-apr01/chr1.fa
    
    -rw-rw-r--    1 susanc   Seqdb    3131547699 Oct 10  2003 chr_all.fa
    
    The database is 3.1 Gb. There are no nodes in the system big enough to hold the entire database in memory. A run against this database will be I/O bound and very inefficient. Therefore, it is better to take two separate runs against chr1-9, and chr10-Y
    -rw-r--r--    1 susanc   Seqdb    1707475762 Mar 26 08:44 chr1-9.fa
    -rw-r--r--    1 susanc   Seqdb    1424055028 Mar 26 08:46 chr10-Y.fa
    
    These files are 1.7 Gb and 1.4 Gb respectively. They will fit in the memory of a 2Gb node (m2048). Command :
    qsub -v np=10,read=/home/user/env1 -l nodes=10:m2048 /usr/local/blat/nih/runblat
    
    where file "env1" contains
    setenv DB /fdb/blastdb/genome/chr1.fa
    setenv PARAMS "-minIdentity=95 -nohead"
    setenv INDIR  /data1/susanc/blat/query100
    setenv OUTDIR /data1/susanc/blat/out2
    setenv TMPDIR  /data1/susanc/blat/tmp2
    

BLAT Databases

The following files are available:
  • Human Genome, May 2004 build in /fdb/genome/human-aug2003
  • Mouse Genome, June 2004 build in /fdb/genome/mouse-oct2003
  • Other nucleotide databases in /fdb/fastadb, updated weekly: -rw-rw-r-- 1 susanc Seqdb 124327488 Feb 3 16:35 drosoph.nt.fas -rw-rw-r-- 1 susanc Seqdb 4763013 Feb 3 16:36 ecoli.nt.fas -rw-rw-r-- 1 susanc Seqdb 73609681 Mar 14 15:22 hs_rna.nt.fas -rw-rw-r-- 1 susanc Seqdb 3214215 Feb 3 16:36 mito.nt.fas -rw-rw-r-- 1 susanc Seqdb 290133972 Apr 11 15:03 month.nt.fas -rw-rw-r-- 1 susanc Seqdb 1345352143 Apr 11 15:36 pat.nt.fas -rw-rw-r-- 1 susanc Seqdb 790661 Apr 11 15:37 pdb.nt.fas -rw-rw-r-- 1 susanc Seqdb 12308631 Feb 3 16:52 yeast.nt.fas

  • Important Notes:

    • The success of this method depends on the entire database fitting into the node memory. If you are running 2 processes per node, 2 copies of the database should fit into the node memory, with an additional 50Mb or so required for overhead. Watch the job the first time it runs (using the Memory Monitor or logging on to the node and running 'top') to make sure you don't have memory issues.

    • The individual chromosome files are on the order of 70 - 300 Mb. It is most efficient to use 2 processors/node. If Biowulf is heavily loaded, your job may be waiting in the queue until enough of the appropriate nodes become available. In that case, you may be better off running with 1 processor/node on a smaller-memory node.
      
      qsub -v np=16,read=/home/user/env -l nodes=16:m512 /usr/local/blat/nih/runblat
      
      You can check the size of each chromosome by doing an 'ls -l' on the directory.

    • Some query sequences may take significantly longer than others. Thus it is not unusual to see one process continue to run while the other nodes have finished their jobs.

      Note: Use the "shnodes" command to see a list of nodes with their properties and status (free, job-exclusive, offline, down)


    Available options for BLAT

       -ooc=11.ooc tells the program to load over-occurring 11-mers from
                   and external file.  This will increase the speed
                   by a factor of 40 in many cases, but is not required
       output.psl is where to put the output.
       Subranges of nib and .2bit files may specified using the syntax:
          /path/file.nib:seqid:start-end
       or
          /path/file.2bit:seqid:start-end
       or
          /path/file.nib:start-end
       With the second form, a sequence id of file:start-end will be used.
    options:
       -t=type     Database type.  Type is one of:
                     dna - DNA sequence
                     prot - protein sequence
                     dnax - DNA sequence translated in six frames to protein
                   The default is dna
       -q=type     Query type.  Type is one of:
                     dna - DNA sequence
                     rna - RNA sequence
                     prot - protein sequence
                     dnax - DNA sequence translated in six frames to protein
                     rnax - DNA sequence translated in three frames to protein
                   The default is dna
       -prot       Synonymous with -d=prot -q=prot
       -ooc=N.ooc  Use overused tile file N.ooc.  N should correspond to 
                   the tileSize
       -tileSize=N sets the size of match that triggers an alignment.  
                   Usually between 8 and 12
                   Default is 11 for DNA and 5 for protein.
       -stepSize=N spacing between tiles. Default is tileSize.
       -oneOff=N   If set to 1 this allows one mismatch in tile and still
                   triggers an alignments.  Default is 0.
       -minMatch=N sets the number of tile matches.  Usually set from 2 to 4
                   Default is 2 for nucleotide, 1 for protein.
       -minScore=N sets minimum score.  This is the matches minus the 
                   mismatches minus some sort of gap penalty.  Default is 30
       -minIdentity=N Sets minimum sequence identity (in percent).  Default is
                   90 for nucleotide searches, 25 for protein or translated
                   protein searches.
       -maxGap=N   sets the size of maximum gap between tiles in a clump.  Usually
                   set from 0 to 3.  Default is 2. Only relevent for minMatch > 1.
       -noHead     suppress .psl header (so it's just a tab-separated file)
       -makeOoc=N.ooc Make overused tile file. Target needs to be complete genome.
       -repMatch=N sets the number of repetitions of a tile allowed before
                   it is marked as overused.  Typically this is 256 for tileSize
                   12, 1024 for tile size 11, 4096 for tile size 10.
                   Default is 1024.  Typically only comes into play with makeOoc
       -mask=type  Mask out repeats.  Alignments won't be started in masked region
                   but may extend through it in nucleotide searches.  Masked areas
                   are ignored entirely in protein or translated searches. Types are
                     lower - mask out lower cased sequence
                     upper - mask out upper cased sequence
                     out   - mask according to database.out RepeatMasker .out file
                     file.out - mask database according to RepeatMasker file.out
       -qMask=type Mask out repeats in query sequence.  Similar to -mask above but
                   for query rather than target sequence.
       -minRepDivergence=NN - minimum percent divergence of repeats to allow 
                   them to be unmasked.  Default is 15.  Only relevant for 
                   masking using RepeatMasker .out files.
       -dots=N     Output dot every N sequences to show program's progress
       -trimT      Trim leading poly-T
       -noTrimA    Don't trim trailing poly-A
       -trimHardA  Remove poly-A tail from qSize as well as alignments in 
                   psl output
       -fastMap    Run for fast DNA/DNA remapping - not allowing introns, 
                   requiring high %ID
       -out=type   Controls output file format.  Type is one of:
                       psl - Default.  Tab separated format, no sequence
                       pslx - Tab separated format with sequence
                       axt - blastz-associated axt format
                       maf - multiz-associated maf format
                       sim4 - similar to sim4 format
                       wublast - similar to wublast format
                       blast - similar to NCBI blast format
                       blast8- NCBI blast tabular format
                       blast9 - NCBI blast tabular format with comments
       -fine       For high quality mRNAs look harder for small initial and
                   terminal exons.  Not recommended for ESTs
       -maxIntron=N  Sets maximum intron size. Default is 750000
       -extendThroughN - Allows extension of alignment through large blocks of N's
    

    This document is available as http://biowulf.nih.gov/apps/blat.html
    Biowulf home page | Helix Systems | NIH