Biowulf at the NIH
Inspect on Biowulf

InsPecT is an MS/MS database search tool designed to address two crucial needs of the proteomics community: post-translational modification (PTM) identification and search speed.

InsPecT was developed by the Computational Mass Spectrometry group at UCSD.

InsPecT uses peptide sequence tags (PSTs) to filter the database. InsPecT has an internal tag generator, but can also accept tags generated by other tools (e.g. PepNovo, GutenTag). Because de novo sequencing is imperfect, multiple tags are produced for each spectrum to ensure that at least one tag is correct. These PSTs are extremely efficient filters, even in the context of up to a dozen possible modifications. Tag-based filtering can also be combined with the "two-pass" filtering pioneered by X!Tandem, in which a first search produces a list of candidate proteins (a mini-database) that is then searched in more detail.

Unanticipated modifications are common in proteomics. InsPecT implements the MS-Alignment algorithm for "blind" spectral search, with no bias toward anticipated modification types. This search has been applied to annotate heavily-modified proteins such as crystallins.
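As an illustration, a blind search is requested through options in the InsPecT input file. The file below is a hypothetical example: the spectrum and database file names are placeholders, and the `mods` and `blind` option names follow the UCSD InsPecT documentation as we understand it, so verify them against the documentation shipped with your version.

```
spectra,myspectra.mzXML
instrument,ESI-ION-TRAP
protease,Trypsin
DB,/data/user/mydbs/pdb.aa.trie
mod,57,X,fix
mods,1
blind,1
```

Here `mods,1` limits the search to one modification per peptide, and `blind,1` allows that modification to have any mass rather than one from a fixed list.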

The InsPecT distribution includes a script (PTMAnalysis.py) implementing the PTMFinder procedure for analysis of unrestrictive modification results. This procedure allows for the accurate scoring of PTMs, and for the calculation of a false discovery rate.

InsPecT is a single-threaded program. The advantage of running InsPecT on Biowulf is the ability to run large numbers of single-threaded batch jobs simultaneously, which is done most efficiently via swarm.

InsPecT databases

InsPecT requires its own database format. The major databases (NCBI nr, SwissProt/UniProt, etc.) are available and updated weekly on the Biowulf cluster in /fdb/fastadb. Copy the desired database into your /data area and reformat it into InsPecT format with InsPecT's PrepDB.py. Users can also create their own databases from any set of FASTA-formatted sequences.

Sample session (user input in bold):

[user@biowulf ]$ cp /fdb/fastadb/pdb.aa.fas .
[user@biowulf ]$ python /usr/local/Inspect/PrepDB.py FASTA ./pdb.aa.fas
(psyco not found - running in non-optimized mode)
Converted 38892 protein sequences (169114 lines) to .trie format.
Created database file './pdb.aa.trie'
[user@biowulf ]$ ls -l
total 12188
-rw-r--r-- 1 user user 23611550 Dec 22 15:20 pdb.aa.fas
-rw-r--r-- 1 user user 3577972 Dec 22 15:11 pdb.aa.index
-rw-r--r-- 1 user user 8880908 Dec 22 15:11 pdb.aa.trie

Running a swarm of InsPecT jobs

Each InsPecT run requires an input file. For large numbers of runs with the same parameter set, it is most convenient to generate the input files via a script. For example, if all the spectrum files *.mzXML are in the directory /data/user/mydir, and the database files are in /data/user/mydbs, the script might look like this:

#!/bin/csh
# This script is make_inp.csh.
# Make it executable with 'chmod +x make_inp.csh',
# then type './make_inp.csh' to run it.

cd /data/user/mydir
foreach file (*.mzXML)
    cat > ${file}.inp <<EOF
spectra,$file
instrument,ESI-ION-TRAP
protease,Trypsin
DB,/data/user/mydbs/pdb.aa.trie
mod,57,X,fix
EOF
end
For each .mzXML file in the directory, this script will produce a corresponding .inp file.
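If you prefer Python to csh, the same input files can be generated with a short script. This is a sketch using the same hypothetical directory and database paths as the example above:

```python
from pathlib import Path

# Hypothetical database path from the example above -- adjust to your own.
DB = "/data/user/mydbs/pdb.aa.trie"

def inspect_input(spectrum_name: str, db: str = DB) -> str:
    """Build the text of one InsPecT input (.inp) file."""
    return (
        f"spectra,{spectrum_name}\n"
        "instrument,ESI-ION-TRAP\n"
        "protease,Trypsin\n"
        f"DB,{db}\n"
        "mod,57,X,fix\n"
    )

# Write file.mzXML.inp next to each file.mzXML in the current directory.
for mz in sorted(Path(".").glob("*.mzXML")):
    Path(mz.name + ".inp").write_text(inspect_input(mz.name))
```

Run it from the directory containing the .mzXML files; it produces the same file.mzXML.inp files as the csh version.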

Create a swarm input file with one line for each input file.

# -- this file is swarm.cmd ---
cd /data/user/mydir; /usr/local/Inspect/inspect -r /usr/local/Inspect -i file1.inp -o file1.out
cd /data/user/mydir; /usr/local/Inspect/inspect -r /usr/local/Inspect -i file2.inp -o file2.out
cd /data/user/mydir; /usr/local/Inspect/inspect -r /usr/local/Inspect -i file3.inp -o file3.out
[....]
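Rather than typing these lines by hand, the swarm command file can itself be generated by a script. A minimal Python sketch, using the same hypothetical paths as above:

```python
from pathlib import Path

INSPECT = "/usr/local/Inspect/inspect"
WORKDIR = "/data/user/mydir"  # hypothetical directory from the example above

def swarm_line(inp_name: str, workdir: str = WORKDIR) -> str:
    """One swarm command: cd to the working directory, run InsPecT on one .inp file."""
    out_name = inp_name[:-len(".inp")] + ".out"
    return f"cd {workdir}; {INSPECT} -r /usr/local/Inspect -i {inp_name} -o {out_name}"

workdir = Path(WORKDIR)
if workdir.is_dir():  # guard so the sketch is safe to run anywhere
    lines = [swarm_line(p.name) for p in sorted(workdir.glob("*.inp"))]
    Path("swarm.cmd").write_text("\n".join(lines) + "\n")
```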

If each InsPecT process requires less than 1 GB of memory, submit this to the batch system with the command:

swarm -f swarm.cmd

If each InsPecT process requires more than 1 GB of memory, use

swarm -g # -f swarm.cmd

where '#' is the number of gigabytes of memory required by each InsPecT process.

More about swarm.

Documentation

InsPecT documentation at the UCSD site.