Biowulf at the NIH
Blat (not Blast!) on Biowulf

BLAT is a DNA/protein sequence analysis program written by Jim Kent at UCSC. On DNA, it is designed to quickly find sequences of 95% or greater similarity that are 40 bases or longer, so it may miss more divergent or shorter alignments. It will find perfect sequence matches of 33 bases, and sometimes finds them down to 22 bases. On proteins, BLAT finds sequences of 80% or greater similarity that are 20 amino acids or longer. In practice, DNA BLAT works well on primates, and protein BLAT on land vertebrates. For more information see the BLAT web page or Jim Kent's web page.

The 'easyblat' script simplifies running large BLAT jobs. You need to put all your query sequences into a directory, and then type 'easyblat' at the Biowulf prompt. You will be prompted for all required parameters. The script will then decide what kind of node you need (based on the database you choose) and submit your job to as many nodes as are available (max 24).

Sample session (user input follows each prompt):

biowulf% easyblat

EasyBLAT: BLAT (not Blast!) for large numbers of sequences
Enter the directory which contains your input sequences: /data/user/mydir/seqs

Enter the directory where you want your BLAT output to go: /data/user/mydir/out
** WARNING: There are already files in /data/user/mydir/out which will be overwritten by this job.
** Continue? (y/n): y

The following databases are available:
  H - Human Genome Feb 2009 assembly 
  M - Mouse Genome Jul 2007 assembly 
  O - Other databases
Enter H, M or O for a detailed list: H
Human Genome (Build 37, hg19, Feb 2009) assembly:
    chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11
    chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, 
    chr21, chr22, chrX, chrY, chr_all
Enter human section to run against: chr_all

http://biowulf.nih.gov/blat.html has a full list of available parameters.
Any additional BLAT parameters (e.g. -maxGap=3): -minScore=35 -trimT
Creating parameter file /data/user/blat_tmp.12971/blat_par.12971
Submitting: qsub -v np=128,read=/data/user/blat_tmp.12971/blat_par.12971 -l nodes=16:g24 -N EasyBlat /usr/local/blat/nih/easyrunblat
Submitting to 16 nodes. Job number is 2384446.biobos

Monitor your job at http://biowulf.nih.gov/cgi-bin/queuemon?2384446.biobos

As you see above, easyblat does some simple error checking, such as checking whether your query sequences exist. It will set up all temporary files and directories, and submit the job for you.

You can run against your own database (any FASTA-format file) by selecting 'Other databases' and then entering the full pathname of the database you want to search. For example:

The following databases are available:
  H - Human Genome Feb 2009 assembly
  M - Mouse Genome Jul 2007 assembly
  O - Other databases
Enter H, M or O for a detailed list: O
Other databases, updated weekly:
    pdb - from the PDB 3-dimensional structures
    drosoph - Drosophila sequences
    ecoli - E. Coli sequences
    mito - mitochondrial sequences
    yeast - Yeast sequences

If using your own database, enter the full pathname.
Enter db to run against: /data/user/my_db.fas

Running via swarm

Easyblat uses swarm. If you prefer to run swarm directly, set up a swarm command file along the following lines:

# this file is called blatcmd
# commands are 'blat  database_file  query_sequence  outputfile'
#
blat /fdb/genome/mm9/chr_all.fa  /data/user/myseqs/seq1.fas /data/user/blatout/seq1.out
blat /fdb/genome/mm9/chr_all.fa  /data/user/myseqs/seq2.fas /data/user/blatout/seq2.out
blat /fdb/genome/mm9/chr_all.fa  /data/user/myseqs/seq3.fas /data/user/blatout/seq3.out
blat /fdb/genome/mm9/chr_all.fa  /data/user/myseqs/seq4.fas /data/user/blatout/seq4.out
[...]
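
For a large number of query files, the command file can be generated rather than typed by hand. The following is a minimal sketch, assuming every query file in the directory ends in .fas and that paths contain no spaces; the function name make_blatcmd is hypothetical and is not part of easyblat or swarm.

```shell
#!/bin/bash
# Sketch: print one 'blat database query output' line per query file.
# make_blatcmd is a hypothetical helper name, not a Biowulf tool.
#
# Usage: make_blatcmd DATABASE QUERY_DIR OUTPUT_DIR > blatcmd
make_blatcmd() {
    local db=$1 query_dir=$2 out_dir=$3
    local query base
    for query in "$query_dir"/*.fas; do
        # If no .fas files exist the glob is left unexpanded; skip it.
        [ -e "$query" ] || continue
        # seq1.fas -> seq1, so the output file becomes seq1.out
        base=$(basename "$query" .fas)
        printf 'blat %s %s %s\n' "$db" "$query" "$out_dir/$base.out"
    done
}

# Example, using the paths from the command file above:
#   make_blatcmd /fdb/genome/mm9/chr_all.fa /data/user/myseqs /data/user/blatout > blatcmd
```

The function only writes text, so the resulting blatcmd file can be inspected before it is submitted.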

The memory required for each blat command is approximately the size of the database file. In this case, the file chr_all.fa is about 2.6 GB:

[user@biowulf ]$ ls -lh /fdb/genome/mm9/chr_all.fa
-rw-rw-r-- 1 helixapp staff 2.6G Mar 25  2008 /fdb/genome/mm9/chr_all.fa

Thus, we can estimate that each blat command requires 3 GB of memory. Submit this swarm job with:

swarm -g 3 -f blatcmd
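
The same estimate can be computed for any database file. This is a rough sketch that rounds the file size up to a whole number of gigabytes (counted as 2^30 bytes, as ls -lh does); the function name est_gb is hypothetical, and since the rule above is only approximate, you may want to allow extra headroom.

```shell
#!/bin/bash
# Sketch: estimate the per-command memory (in GB) for 'swarm -g'
# as the database file size rounded up to a whole GB.
# est_gb is a hypothetical helper name, not a Biowulf tool.
est_gb() {
    local bytes
    # wc -c reports the file size in bytes; fail if the file is unreadable.
    bytes=$(wc -c < "$1") || return 1
    # Integer ceiling division by 2^30 (1073741824 bytes).
    echo $(( (bytes + 1073741823) / 1073741824 ))
}

# Example, using the database from the command file above:
#   swarm -g "$(est_gb /fdb/genome/mm9/chr_all.fa)" -f blatcmd
```

For the 2.6 GB chr_all.fa file this rounds up to 3, matching the value used above.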

Important Notes

BLAT - The BLAST-Like Alignment Tool. W. James Kent, Genome Research 12(4): 656-664, April 2002.
BLAT Suite Program Specifications and User Guide, at the UCSC Genome website. All BLAT options are listed on this page.