OMSSA on Biowulf
The Open Mass Spectrometry Search Algorithm [OMSSA] is an efficient search engine for identifying MS/MS peptide spectra by searching libraries of known protein sequences. OMSSA scores significant hits with a probability score developed using classical hypothesis testing, the same statistical method used in BLAST.
OMSSA was developed by researchers at the NCBI, National Institutes of Health. [OMSSA website]
Small numbers of OMSSA jobs should be run on the NCBI OMSSA server. OMSSA on Biowulf is intended for running a large number of OMSSA searches, or running OMSSA against a personal database.
How to run OMSSA on Biowulf
Use the swarm utility. Set up a swarm command file containing one line for each of your OMSSA runs. Here is a sample swarm command file:
------------------file sample.com--------------------
/usr/local/omssa/omssacl -d /fdb/blastdb/nr -f file1.dta -ox file1.xml
/usr/local/omssa/omssacl -d /fdb/blastdb/nr -f file2.dta -ox file2.xml
/usr/local/omssa/omssacl -d /fdb/blastdb/nr -f file3.dta -ox file3.xml
/usr/local/omssa/omssacl -d /fdb/blastdb/nr -f file4.dta -ox file4.xml
----------------end of file -------------------------
Submit this file with
swarm -f sample.com -n 1
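For more than a handful of dta files, the command file can be generated rather than typed. The following is a minimal sketch, assuming all dta files sit in one directory and have no spaces in their names; make_swarm_file is a hypothetical helper, not part of OMSSA or swarm.

```shell
#!/bin/bash
# Sketch: write one omssacl command per .dta file into a swarm command file.
# make_swarm_file is a hypothetical helper name, not an OMSSA or swarm tool.
make_swarm_file() {
    local dta_dir="$1"                  # directory holding the .dta files
    local outfile="$2"                  # swarm command file to create, e.g. sample.com
    local db="${3:-/fdb/blastdb/nr}"    # Blast protein database, as in the sample above
    local dta
    : > "$outfile"                      # start with an empty command file
    for dta in "$dta_dir"/*.dta; do
        [ -e "$dta" ] || continue       # skip the loop body if no .dta files matched
        # The XML output is named after the input file: fileN.dta -> fileN.xml
        printf '/usr/local/omssa/omssacl -d %s -f %s -ox %s.xml\n' \
            "$db" "$dta" "${dta%.dta}" >> "$outfile"
    done
}

# Example use (assumed layout): make_swarm_file /path/to/dta sample.com
# The resulting file is then submitted as shown above.
```

The function only writes text, so the generated file can be inspected before it is submitted.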
Note about multithreading: As of v2.1.0, OMSSA is multithreaded and will attempt
to use all available processors on a node. It is therefore critical to use the '-n 1' parameter
on the swarm command above, which sends only one OMSSA command to each node. Otherwise the
nodes will be overloaded and performance will suffer.
These OMSSA commands will produce XML output. You can write your own script to process the
XML data. The OMSSA package includes a sample parser: the command to use it is
perl /usr/local/omssa/readOMSSA.pl file1.xml
Thus, it is possible to set up an OMSSA search and parse the results in a single
swarm command. See this
Sample Swarm command
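A combined search-and-parse line might look like the output of the sketch below. It rests on two assumptions that this page does not confirm: that swarm accepts several commands on one line separated by a semicolon, and that readOMSSA.pl prints its results to standard output so they can be redirected to a file. search_and_parse_line is a hypothetical helper, and the .txt output name is illustrative.

```shell
#!/bin/bash
# Sketch: print one swarm line that runs the search and then the sample parser.
# Assumes ';' separates commands within a swarm line and that readOMSSA.pl
# writes to standard output; search_and_parse_line is a hypothetical helper.
search_and_parse_line() {
    local base="$1"    # file name without the .dta suffix, e.g. file1
    printf '%s; %s\n' \
        "/usr/local/omssa/omssacl -d /fdb/blastdb/nr -f ${base}.dta -ox ${base}.xml" \
        "perl /usr/local/omssa/readOMSSA.pl ${base}.xml > ${base}.txt"
}

# Example: append one combined line per input file to a swarm command file
# for n in 1 2 3 4; do search_and_parse_line "file$n"; done > sample.com
```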
Available databases
OMSSA searches Blast-format sequence databases. A large collection of Blast protein databases is available and updated on the Biowulf cluster, in /fdb/blastdb/.
Names, locations, and status of Blast databases. (OMSSA searches only protein databases.)
Bundling jobs
If you have over 1000 OMSSA searches to run, they should be bundled with the '-b' flag to swarm.
'-b 25' will send 25 of the commands to a single processor, and then submit two such bundles as a
single swarm job. This significantly decreases the number of individual jobs and therefore decreases the
overhead for such large numbers of small jobs. (More information about
swarm options)
Thus, to run OMSSA on 5000 dta files, you would set up a swarm command file with one line
per file as described above. This file would be submitted to the swarm program using:
swarm -b 50 -f sample.com
swarm will send 50 commands to a single processor, and 50x2 = 100 commands as a single batch job
to a node. The total number of jobs will be 5000 / 100 = 50 swarm jobs.
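The arithmetic above can be checked for other file counts and bundle sizes with a small sketch. It assumes, as described in this section, that two bundles are combined into each swarm job; swarm_job_count is a hypothetical helper, not a swarm option.

```shell
#!/bin/bash
# Sketch: estimate how many swarm jobs a bundled submission produces.
# Assumes two bundles per swarm job, as described in the text above.
swarm_job_count() {
    local ncommands="$1"                # number of lines in the swarm command file
    local bundle="$2"                   # value passed to swarm with -b
    local per_job=$(( bundle * 2 ))     # commands handled by one swarm job
    # Integer division rounded up, so a partial last job still counts.
    echo $(( (ncommands + per_job - 1) / per_job ))
}

swarm_job_count 5000 50    # prints 50, matching the example above
```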
Monitoring your jobs
As always, jobs can be monitored using the Biowulf
cluster monitors. Click on 'List status of running jobs only',
and then your username or job number on the resultant page to view
your own jobs only, as in the image on the right.
OMSSA Options
v2.1 (Jul 2007)
USAGE
NCBI_PROGRAM [-h] [-help] [-pm param] [-d blastdb] [-umm] [-f infile]
[-fx xmlinfile] [-fb dtainfile] [-fp pklinfile] [-fm pklinfile]
[-foms omsinfile] [-fomx omxinfile] [-fxml omxinfile] [-o textasnoutfile]
[-ob binaryasnoutfile] [-ox xmloutfile] [-oc csvfile] [-w] [-to pretol]
[-te protol] [-tom promass] [-tem premass] [-tez prozdep] [-ta autotol]
[-tex exact] [-i ions] [-cl cutlo] [-ch cuthi] [-ci cutinc]
[-cp precursorcull] [-v cleave] [-x taxid] [-w1 window1] [-w2 window2]
[-h1 hit1] [-h2 hit2] [-hl hitlist] [-ht tophitnum] [-hm minhit]
[-hs minspectra] [-he evalcut] [-mf fixedmod] [-mv variablemod] [-mnm]
[-mm maxmod] [-e enzyme] [-zh maxcharge] [-zl mincharge]
[-zoh maxprodcharge] [-zt chargethresh] [-z1 plusone] [-zc calcplusone]
[-zcc calccharge] [-pc pseudocount] [-sb1 searchb1] [-sct searchcterm]
[-sp productnum] [-scorr corrscore] [-scorp corrprob] [-no minno]
[-nox maxno] [-is subsetthresh] [-ir replacethresh] [-ii iterativethresh]
[-p prolineruleions] [-il] [-el] [-ml] [-mx modinputfile]
[-mux usermodinputfile] [-nt numthreads] [-ni] [-ns] [-os]
[-logfile File_Name] [-conffile File_Name] [-version] [-dryrun]
DESCRIPTION -- none
OPTIONAL ARGUMENTS
-h
Print USAGE and DESCRIPTION; ignore other arguments
-help
Print USAGE, DESCRIPTION and ARGUMENTS description; ignore other arguments
-pm
search parameter input in xml format (overrides command line)
Default = `'
-d
Blast sequence library to search. Do not include .p* filename suffixes.
Default = `nr'
-umm
use memory mapped sequence libraries
-f
single dta file to search
Default = `'
-fx
multiple xml-encapsulated dta files to search
Default = `'
-fb
multiple dta files separated by blank lines to search
Default = `'
-fp
pkl formatted file
Default = `'
-fm
mgf formatted file
Default = `'
-foms
omssa oms file
Default = `'
-fomx
omssa omx file
Default = `'
-fxml
omssa xml search request file
Default = `'
-o
filename for text asn.1 formatted search results
Default = `'
-ob
filename for binary asn.1 formatted search results
Default = `'
-ox
filename for xml formatted search results
Default = `'
-oc
filename for csv formatted search summary
Default = `'
-w
include spectra and search params in search results
-to
product ion m/z tolerance in Da
Default = `0.8'
-te
precursor ion m/z tolerance in Da
Default = `2.0'
-tom
product ion search type (0 = mono, 1 = avg, 2 = N15, 3 = exact)
Default = `0'
-tem
precursor ion search type (0 = mono, 1 = avg, 2 = N15, 3 = exact)
Default = `0'
-tez
charge dependency of precursor mass tolerance (0 = none, 1 = linear)
Default = `1'
-ta
automatic mass tolerance adjustment fraction
Default = `1.0'
-tex
threshold in Da above which the mass of neutron should be added in exact
mass search
Default = `1446.94'
-i
id numbers of ions to search (comma delimited, no spaces)
Default = `1,4'
-cl
low intensity cutoff as a fraction of max peak
Default = `0.0'
-ch
high intensity cutoff as a fraction of max peak
Default = `0.2'
-ci
intensity cutoff increment as a fraction of max peak
Default = `0.0005'
-cp
eliminate charge reduced precursors in spectra (0=no, 1=yes)
Default = `0'
-v
number of missed cleavages allowed
Default = `1'
-x
comma delimited list of taxids to search (0 = all)
Default = `0'
-w1
single charge window in Da
Default = `20'
-w2
double charge window in Da
Default = `14'
-h1
number of peaks allowed in single charge window
Default = `2'
-h2
number of peaks allowed in double charge window
Default = `2'
-hl
maximum number of hits retained per precursor charge state per spectrum
Default = `30'
-ht
number of m/z values corresponding to the most intense peaks that must
include one match to the theoretical peptide
Default = `6'
-hm
the minimum number of m/z matches a sequence library peptide must have for
the hit to the peptide to be recorded
Default = `2'
-hs
the minimum number of m/z values a spectrum must have to be searched
Default = `4'
-he
the maximum evalue allowed in the hit list
Default = `1'
-mf
comma delimited (no spaces) list of id numbers for fixed modifications
Default = `'
-mv
comma delimited (no spaces) list of id numbers for variable modifications
Default = `'
-mnm
n-term methionine should not be cleaved
-mm
the maximum number of mass ladders to generate per database peptide
Default = `128'
-e
id number of enzyme to use
Default = `0'
-zh
maximum precursor charge to search when not 1+
Default = `3'
-zl
minimum precursor charge to search when not 1+
Default = `1'
-zoh
maximum product charge to search
Default = `2'
-zt
minimum precursor charge to start considering multiply charged products
Default = `3'
-z1
fraction of peaks below precursor used to determine if spectrum is charge 1
Default = `0.95'
-zc
should charge plus one be determined algorithmically? (1=yes)
Default = `1'
-zcc
how should precursor charges be determined? (1=believe the input file,
2=use a range)
Default = `2'
-pc
minimum number of precursors that match a spectrum
Default = `1'
-sb1
should first forward (b1) product ions be in search (1=no)
Default = `1'
-sct
should c terminus ions be searched (1=no)
Default = `0'
-sp
max number of ions in each series being searched (0=all)
Default = `100'
-scorr
turn off correlation correction to score (1=off, 0=use correlation)
Default = `0'
-scorp
probability of consecutive ion (used in correlation correction)
Default = `0.5'
-no
minimum size of peptides for no-enzyme and semi-tryptic searches
Default = `4'
-nox
maximum size of peptides for no-enzyme and semi-tryptic searches (0=none)
Default = `40'
-is
evalue threshold to include a sequence in the iterative search, 0 = all
Default = `0.0'
-ir
evalue threshold to replace a hit, 0 = only if better
Default = `0.0'
-ii
evalue threshold to iteratively search a spectrum again, 0 = always
Default = `0.01'
-p
id numbers of ion series to apply no product ions at proline rule at (comma
delimited, no spaces)
Default = `'
-il
print a list of ions and their corresponding id number
-el
print a list of enzymes and their corresponding id number
-ml
print a list of modifications and their corresponding id number
-mx
file containing modification data
Default = `mods.xml'
-mux
file containing user modification data
Default = `usermods.xml'
-nt
number of search threads to use, 0=autodetect
Default = `0'
-ni
don't print informational messages
-ns
deprecated flag
-os
use omssa 1.0 scoring
-logfile
File to which the program log should be redirected
-conffile
Program's configuration (registry) data file
-version
Print version number; ignore other arguments
-dryrun
Dry run the application: do nothing, only test all preconditions
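As an illustration of how the options above combine with the sample commands earlier on this page, the following sketch assembles an omssacl command line with explicit precursor tolerance (-te), product ion tolerance (-to) and missed cleavages (-v). The fallback values are the defaults listed above; the values in the example call are illustrative only, not recommendations. omssa_command is a hypothetical helper.

```shell
#!/bin/bash
# Sketch: build an omssacl command line with a few of the options listed above.
# omssa_command is a hypothetical helper; it prints the command, it does not run it.
omssa_command() {
    local dta="$1"                    # input dta file
    local precursor_tol="${2:-2.0}"   # -te, precursor ion m/z tolerance in Da
    local product_tol="${3:-0.8}"     # -to, product ion m/z tolerance in Da
    local missed="${4:-1}"            # -v, number of missed cleavages allowed
    printf '/usr/local/omssa/omssacl -d /fdb/blastdb/nr -f %s -ox %s.xml -te %s -to %s -v %s\n' \
        "$dta" "${dta%.dta}" "$precursor_tol" "$product_tol" "$missed"
}

# Example with illustrative values:
omssa_command file1.dta 1.5 0.5 2
```

The id numbers expected by options such as -e, -mf, -mv and -i can be listed with the -el, -ml and -il flags described above.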