BLAT (not Blast!) on Biowulf
BLAT is a DNA/Protein Sequence
Analysis program written by Jim Kent at UCSC. It is designed to quickly
find sequences of 95% and greater similarity of length 40 bases or more.
It may miss more divergent or shorter sequence alignments. It will find
perfect sequence matches of 33 bases, and sometimes find them down to
22 bases. BLAT on proteins finds sequences of 80% and greater similarity
of length 20 amino acids or more. In practice DNA BLAT works well on primates,
and protein blat on land vertebrates. For more information see the
BLAT web page or Jim Kent's
web page.
** EasyBlat: The easy interface
The 'easyblat' script
simplifies running large BLAT jobs. You need to put all your query sequences
into a directory, and then type 'easyblat' at the Biowulf prompt. You will
be prompted for all required parameters. The script will then decide what
kind of node you need (based on the database you choose) and submit your
job to as many nodes as are available (max 24).
Sample session: (user input is in bold):
biobos% easyblat
EasyBLAT: BLAT (not Blast!) for large numbers of sequences
Enter the directory which contains your input sequences: /data/username/blast/ssqs
** ERROR: Input sequence directory /data/username/blast/ssqs does not exist
Enter the directory which contains your input sequences: /data/username/blast/seqs
Enter the directory where you want your BLAT output to go: /data/username/blat/out
** WARNING: There are already files in /data/username/blat/out which will be overwritten by this job.
** Continue? (y/n) : y
The following databases are available:
H - Human Genome (May 2004) assembly
M - Mouse Genome (June 2004) assembly
O - Other databases
Enter H, M or O for a detailed list: h
Human Genome (Build 35, May 2004) assembly:
chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11
chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20,
chr21, chr22, chrX, chrY,
chr1-9, chr10-Y
Enter human section to run against: chr10-Y
http://biowulf.nih.gov/blat.html has a full list of available parameters.
Any additional BLAT parameters (e.g. -maxGap=3):
Submitting to 24 nodes with m1028 memory. Job number is 87930.biobos
Monitor your job at http://biowulf.nih.gov/cgi-bin/queuemon?87930.biobos
|
As you see
above, easyblat does some simple error checking, such as checking
whether your query sequences exist. It will figure out the node memory
required, set up all temporary files and directories, and submit the job
for you.
You can run against
your own database (any fasta format file) by selecting 'other databases',
and then entering the full pathname of the database you want to search. For example:
The following databases are available:
H - Human Genome (May 2004) assembly
M - Mouse Genome (June 2004) assembly
O - Other databases
Enter H, M or O for a detailed list: O
Other databases, updated weekly:
month - new or revised GenBank+EMBL+DDBJ+PDB in the last 30 days
pat - from the patent division of Genbank
pdb - from the PDB 3-dimensional structures
drosoph - Drosophila sequences
ecoli - E. Coli sequences
mito - mitochondrial sequences
yeast - Yeast sequences
If using your own database, enter the full pathname.
Enter db to run against: /data/user/my_db.fas
|
[All
BLAT parameters] [Biowulf monitors]
Detailed procedure
If you want more control over the number of nodes, you can bypass EasyBlat and use the underlying scripts directly, following the instructions below:
- Set up a directory
structure for input/output and tmp files.
For management purposes it is best (but not necessary) to keep all the
input, output and temporary directories within a given subdirectory.
For management purposes it is best (but not necessary) to keep all the input,
output and temporary directories within a given subdirectory. Note that the /home/username directory is small, and intended for std.error/output files. Use the /data/username directory for your Blast sequences, tmp files, and output. Every Biowulf user has a directory in /data, and this directory
is accessible from either Biowulf or any of the Helix SGIs.
The tmp and output directories will be created if they do not already exist.
- Edit/create
a file which tells Blat where the input/output files are, and what Blat
parameters to use.
This environment file can be in any directory, but it can help, organizationally, to
keep it in your main directory for this run, e.g. /data/username/sample1/env
Sample environment file:
setenv DB /fdb/blastdb/genome/mouse-feb2003/chr11-X.fa
setenv INDIR /data/user/blast/seqs
setenv OUTDIR /data/user/blat_out/
setenv TMPDIR /data/user/blat_tmp/
setenv PARAMS ""
where:
$DB -- target database. You can enter your own fasta-format target file here.
$INDIR -- directory containing all the input sequence files (may contain single or multiple sequences)
$OUTDIR -- directory which will receive all the individual output files. Note that existing files in this directory may be overwritten by the new run.
$TMPDIR -- temporary directory used by the program
$PARAMS -- Optional parameters, in quotes, to pass to BLAT
(e.g. "-maxGap=3" ) List of all options
- Submit the job
to the batch system.
Submit the job to the batch system using the 'qsub' command. Example:
qsub -v np=16,read=/data/username/sample1/env -l nodes=8:m1024 /usr/local/blat/nih/runblat
This job has asked for 16 processors (np=16), is using
the environment file /data/username/sample1/env, is using 2 processors
per node (nodes=8, processors=16, has asked for m1024 nodes (1024 Mb memory)).
Another example:
qsub -v np=8,read=/data/username/sample2/env -l nodes=8 /usr/local/blat/nih/runblat
This job has asked for 8 processors (np=8), is using the blast environment
file /data/username/sample2/env, and has asked for 8 nodes. No specific
nodes have been asked for (i.e. no 'm2048' or 'm1024' specification
on the nodes), so the batch system will allocate the first nodes that
are available. (The command 'shnodes' will list all the nodes on the
system and their various properties).
It is an excellent idea to monitor any jobs you submit, with the Biowulf job or user monitor as shown on the right. More information about the monitors is in the User Guide
More Examples:
- DNA sequences
against chr 8 from the Aug 2003 human genome build. Chromosome 8 is 149 Mb.
(You can see the size of any chromosome file by typing 'ls -l' in
the directory'. Therefore, two copies of this file will easily fit in node memory on any node
in the system, so no memory requirements need be specified.
Command :
qsub -v np=16,read=/home/user/env -l nodes=8 /usr/local/blat/nih/runblat
File "env" contains -----------------------
setenv DB /fdb/blastdb/genome/ucsc-aug01/chr8.fa
setenv INDIR /data1/user/blat/query100
setenv OUTDIR /data1/user/blat/out100
setenv TMPDIR /data1/user/blat/tmp
setenv PARAMS " "
--------------------------------------------
- Same input data
against entire human genome
ls -l /fdb/blastdb/genome/ucsc-apr01/chr1.fa
-rw-rw-r-- 1 susanc Seqdb 3131547699 Oct 10 2003 chr_all.fa
The database is 3.1 Gb. There are no nodes in the system big enough to hold the entire database
in memory. A run against this database will be I/O bound and very inefficient. Therefore, it is
better to take two separate runs against chr1-9, and chr10-Y
-rw-r--r-- 1 susanc Seqdb 1707475762 Mar 26 08:44 chr1-9.fa
-rw-r--r-- 1 susanc Seqdb 1424055028 Mar 26 08:46 chr10-Y.fa
These files are 1.7 Gb and 1.4 Gb respectively. They will fit in the memory of a 2Gb node (m2048).
Command :
qsub -v np=10,read=/home/user/env1 -l nodes=10:m2048 /usr/local/blat/nih/runblat
where file "env1" contains
setenv DB /fdb/blastdb/genome/chr1.fa
setenv PARAMS "-minIdentity=95 -nohead"
setenv INDIR /data1/susanc/blat/query100
setenv OUTDIR /data1/susanc/blat/out2
setenv TMPDIR /data1/susanc/blat/tmp2
BLAT Databases
The following files are available:
Human Genome, May 2004 build in /fdb/genome/human-aug2003
Mouse Genome, June 2004 build in /fdb/genome/mouse-oct2003
Other nucleotide databases in /fdb/fastadb, updated weekly:
-rw-rw-r-- 1 susanc Seqdb 124327488 Feb 3 16:35 drosoph.nt.fas
-rw-rw-r-- 1 susanc Seqdb 4763013 Feb 3 16:36 ecoli.nt.fas
-rw-rw-r-- 1 susanc Seqdb 73609681 Mar 14 15:22 hs_rna.nt.fas
-rw-rw-r-- 1 susanc Seqdb 3214215 Feb 3 16:36 mito.nt.fas
-rw-rw-r-- 1 susanc Seqdb 290133972 Apr 11 15:03 month.nt.fas
-rw-rw-r-- 1 susanc Seqdb 1345352143 Apr 11 15:36 pat.nt.fas
-rw-rw-r-- 1 susanc Seqdb 790661 Apr 11 15:37 pdb.nt.fas
-rw-rw-r-- 1 susanc Seqdb 12308631 Feb 3 16:52 yeast.nt.fas
Important Notes:
- The success of this method
depends on the entire database fitting into the node memory. If you
are running 2 processes per node, 2 copies of the database should fit
into the node memory, with an additional 50Mb or so required for overhead.
Watch the job the first time it runs (using the Memory
Monitor or logging on to the node and running 'top') to make sure
you don't have memory issues.
- The individual chromosome
files are on the order of 70 - 300 Mb. It is most efficient to use 2
processors/node. If Biowulf is heavily loaded, your job may be waiting
in the queue until enough of the appropriate nodes become available.
In that case, you may be better off running with 1 processor/node on
a smaller-memory node.
qsub -v np=16,read=/home/user/env -l nodes=16:m512 /usr/local/blat/nih/runblat
You can check the size of each chromosome by doing an 'ls -l' on the
directory.
- Some query sequences may
take significantly longer than others. Thus it is not unusual to see
one process continue to run while the other nodes have finished their
jobs.
Note:
Use the "shnodes" command to see a list of nodes with their
properties and status (free, job-exclusive, offline, down)
-ooc=11.ooc tells the program to load over-occurring 11-mers from
and external file. This will increase the speed
by a factor of 40 in many cases, but is not required
output.psl is where to put the output.
Subranges of nib and .2bit files may specified using the syntax:
/path/file.nib:seqid:start-end
or
/path/file.2bit:seqid:start-end
or
/path/file.nib:start-end
With the second form, a sequence id of file:start-end will be used.
options:
-t=type Database type. Type is one of:
dna - DNA sequence
prot - protein sequence
dnax - DNA sequence translated in six frames to protein
The default is dna
-q=type Query type. Type is one of:
dna - DNA sequence
rna - RNA sequence
prot - protein sequence
dnax - DNA sequence translated in six frames to protein
rnax - DNA sequence translated in three frames to protein
The default is dna
-prot Synonymous with -d=prot -q=prot
-ooc=N.ooc Use overused tile file N.ooc. N should correspond to
the tileSize
-tileSize=N sets the size of match that triggers an alignment.
Usually between 8 and 12
Default is 11 for DNA and 5 for protein.
-stepSize=N spacing between tiles. Default is tileSize.
-oneOff=N If set to 1 this allows one mismatch in tile and still
triggers an alignments. Default is 0.
-minMatch=N sets the number of tile matches. Usually set from 2 to 4
Default is 2 for nucleotide, 1 for protein.
-minScore=N sets minimum score. This is the matches minus the
mismatches minus some sort of gap penalty. Default is 30
-minIdentity=N Sets minimum sequence identity (in percent). Default is
90 for nucleotide searches, 25 for protein or translated
protein searches.
-maxGap=N sets the size of maximum gap between tiles in a clump. Usually
set from 0 to 3. Default is 2. Only relevent for minMatch > 1.
-noHead suppress .psl header (so it's just a tab-separated file)
-makeOoc=N.ooc Make overused tile file. Target needs to be complete genome.
-repMatch=N sets the number of repetitions of a tile allowed before
it is marked as overused. Typically this is 256 for tileSize
12, 1024 for tile size 11, 4096 for tile size 10.
Default is 1024. Typically only comes into play with makeOoc
-mask=type Mask out repeats. Alignments won't be started in masked region
but may extend through it in nucleotide searches. Masked areas
are ignored entirely in protein or translated searches. Types are
lower - mask out lower cased sequence
upper - mask out upper cased sequence
out - mask according to database.out RepeatMasker .out file
file.out - mask database according to RepeatMasker file.out
-qMask=type Mask out repeats in query sequence. Similar to -mask above but
for query rather than target sequence.
-minRepDivergence=NN - minimum percent divergence of repeats to allow
them to be unmasked. Default is 15. Only relevant for
masking using RepeatMasker .out files.
-dots=N Output dot every N sequences to show program's progress
-trimT Trim leading poly-T
-noTrimA Don't trim trailing poly-A
-trimHardA Remove poly-A tail from qSize as well as alignments in
psl output
-fastMap Run for fast DNA/DNA remapping - not allowing introns,
requiring high %ID
-out=type Controls output file format. Type is one of:
psl - Default. Tab separated format, no sequence
pslx - Tab separated format with sequence
axt - blastz-associated axt format
maf - multiz-associated maf format
sim4 - similar to sim4 format
wublast - similar to wublast format
blast - similar to NCBI blast format
blast8- NCBI blast tabular format
blast9 - NCBI blast tabular format with comments
-fine For high quality mRNAs look harder for small initial and
terminal exons. Not recommended for ESTs
-maxIntron=N Sets maximum intron size. Default is 750000
-extendThroughN - Allows extension of alignment through large blocks of N's
|