InterProScan on Biowulf

InterProScan (IPRScan) is a tool that combines different protein signature recognition methods into one resource. IPRScan not only wraps the sequence analysis applications, it also performs a considerable amount of data look-up from various databases and program outputs. InterPro (IPR) integrates the PROSITE, PRINTS, Pfam, ProDom, SMART, TIGRFAMs, PIR superfamily, SUPERFAMILY, Gene3D and PANTHER databases. The member databases in IPR use different scanning applications.

InterPro and InterProScan are developed and maintained at the EMBL-EBI.

IPRScan on Biowulf is designed for bulk sequence analysis, such as hundreds or thousands of sequences. For small-scale sequence analysis, please use the EBI IPRScan website, which handles 1 sequence at a time.

[Input sequences] [Options] [PBS script] [Submit job] [Output structure] [Cleanup] [Benchmarks] [Database]

Input sequences

  • Prepare the input sequence file by putting all the sequences of the same format in one file.
  • Current settings for IPRScan on Biowulf.
    • max amino acids for input sequence=3000
    • max nucleic acids for input sequence=1000
    • max length for the nucleotide input sequence=10000
    • min length for the protein input sequence=5
    • default minimum orf size for translation=50
    If you have a special need to modify these parameters, please contact staff@helix.nih.gov.
  • Most major sequence formats are accepted, including Fasta, Genbank, EMBL, GCG and Swissprot. IPRScan reformats input sequences using the 'seqret' program in EMBOSS. For a detailed list of acceptable sequence formats, see the 'Input sequence formats' section in http://emboss.sourceforge.net/apps/release/4.0/emboss/apps/seqret.html.
  • Do not mix nucleotide and amino acid sequences in one file.
  • Do not mix different formats of sequences in one file.
  • Sample input file:
    >RS16_ECOLI
    MVTIRLARHGAKKRPFYQVVVADSRNARNGRFIERVGFFNPIASEKEEGTRLDLDRIAHW
    VGQGATISDRVAALIKEVNKAA
    >Q9RHD9
    XPKLEEGVEGLVHVSEMDWTNKNIHPSKVVQVGDEVEVQVLDIDEERRRISLGIKQCKSN
    PWEDFSSQFNKGDRISGSIKSITDFGIFIGLDGGIDGLVHLSDISWNEVGEEAVRRFKKG
    DELETVILSVDPERERISLGIKQLEDDPFSNYASLHEKGSIVRGTVKEVDAKGAVISLGD
    DIEGILKASEISRDRVEDARNVLKEGEEVEAKIISIDRKSRVISLSVKSKDVDDEKDAMK
    ELRKQEVESAGPTTIGDLIRAQMENQG
    >Y902_MYCTU Q10560 PROBABLE SENSOR-LIKE HISTIDINE KINASE RV0902C (EC 2.7.3.-).
    MNILSRIFARTPSLRTRVVVATAIGAAIPVLIVGTVVWVGITNDRKERLDRRLDEAAGFA
    IPFVPRGLDEIPRSPNDQDALITVRRGNVIKSNSDITLPKLQDDYADTYVRGVRYRVRTV
    EIPGPEPTSVAVGATYDATVAETNNLHRRVLLICTFAIGAAAVFAWLLAAFAVRPFKQLA
    EQTRSIDAGDEAPRVEVHGASEAIEIAEAMRGMLQRIWNEQNRTKEALASARDFAAVSSH
    ELRTPLTAMRTNLEVLSTLDLPDDQRKEVLNDVIRTQSRIEATLSALERLAQGELSTSDD
    HVPVDITDLLDRAAHDAARIYPDLDVSLVPSPTCIIVGLPAGLRLAVDNAIANAVKHGGA
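
Before submitting a large file it can help to check it against the settings listed above. The following is a minimal sketch, assuming the input is already in Fasta format with one '>' header line per sequence. The function names count_fasta_records and list_short_records are illustrative and are not part of IPRScan.

```shell
#!/bin/bash
# Count the records in a Fasta file (one '>' header line per sequence).
count_fasta_records() {
    grep -c '^>' "$1"
}

# Print the ID and length of every record shorter than a minimum length.
# The default of 5 mirrors the protein minimum listed in the settings above.
list_short_records() {
    local file=$1 min=${2:-5}
    awk -v min="$min" '
        function report() { if (id != "" && len < min) print id, len }
        /^>/ { report(); id = substr($1, 2); len = 0; next }
        { gsub(/[ \t\r]/, ""); len += length($0) }
        END { report() }
    ' "$file"
}
```

For example, 'count_fasta_records mysequences.fa' prints the number of sequences, and 'list_short_records mysequences.fa' prints nothing when every sequence meets the minimum length.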

Iprscan Options

  • Required options for iprscan:
    • -cli
    • -i
  • Remember to include '-seqtype n' if the query sequences are nucleic acids.
  • A summary of options for iprscan can be seen by typing at the biowulf prompt:
    '/usr/local/iprscan/bin/iprscan -cli -h'
  • All options for iprscan:
    • -i <seqfile> Your sequence file (mandatory).
    • -o <output file> The file to write results to (optional). The default is STDOUT, which ends up in a file under /home/UserID (see the output structure section below).
    • -email <addr> Submitter email address (Not required for Biowulf batch system).
    • -appl <name> Application(s) to run (optional), default is all.
      Possible values (dependent on set-up):
      blastprodom
      fprintscan
      hmmpfam
      hmmpir
      hmmpanther
      hmmtigr
      hmmsmart
      superfamily
      gene3d
      scanregexp
      profilescan
      seg
      coils
    • -nocrc Skip the CRC64 check and rerun all searches, even for sequences whose results already exist in the database; this is usually unnecessary. The default is to run without this flag.
    • -altjobs Launch jobs alternatively, chunk after chunk. Default is off.
    • -seqtype <type> Sequence type: n for DNA/RNA, p for protein (default).
    • -trlen <n> Transcript length threshold (20-150).
    • -trtable <table> Codon table number.
    • -goterms Show GO terms if iprlookup option is also given.
    • -iprlookup Switch on the InterPro lookup for results.
    • -format <format> Output results format: raw, txt, html, xml (default), ebixml (EBI header on top of xml), gff.
    • -verbose Print messages during run
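
As an illustration of how these options combine, the sketch below assembles a command line from the flags listed above and prints it instead of running it. The helper name iprscan_cmd and the input path are hypothetical; only the flags themselves come from the option list.

```shell
#!/bin/bash
# Path to iprscan as given in the options summary above.
IPRSCAN=/usr/local/iprscan/bin/iprscan

# Print (do not run) an iprscan command line for a given input file.
# Usage: iprscan_cmd <seqfile> <p|n> [format]
iprscan_cmd() {
    local seqfile=$1 seqtype=${2:-p} format=${3:-raw}
    case $seqtype in
        p|n) ;;
        *) echo "seqtype must be p or n" >&2; return 1 ;;
    esac
    # -cli and -i are required; -goterms only has an effect with -iprlookup.
    echo "$IPRSCAN -cli -i $seqfile -seqtype $seqtype -o /dev/null" \
         "-format $format -iprlookup -goterms"
}

# Hypothetical nucleotide input file; prints the assembled command line.
iprscan_cmd /data/YourUserID/nucleotides.fa n
```

The printed line can be pasted into a batch script once it looks right.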

Prepare batch script

  • Sample script file. See the Biowulf user guide for more information about batch scripts.
    #!/bin/bash
    #
    #PBS -N YourScriptNameHere
    #PBS -m be
    #PBS -k oe

    /usr/local/iprscan/bin/iprscan -cli -i /data/maoj/iprscan/test248aa.seq -o /dev/null -format raw -goterms -iprlookup


Submit PBS job

  • Run the following command on Biowulf:
    qsub -l nodes=1 YourScriptNameWithFullPath
  • To check status of your job:
    qstat -u YourUserID
  • More info about monitoring your jobs.
  • Sequences in the input file are split into chunks of 100. For example, 250 sequences are split into 3 chunks.
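
The chunk count is a ceiling division of the number of sequences by the chunk size. A small sketch of that arithmetic, where chunk_count is an illustrative helper name and the chunk size of 100 is taken from the description above:

```shell
#!/bin/bash
# Number of chunks for a given sequence count, assuming 100 sequences per
# chunk (ceiling division using integer arithmetic).
chunk_count() {
    local nseq=$1 size=${2:-100}
    echo $(( (nseq + size - 1) / size ))
}

chunk_count 250   # prints 3, matching the example above
```

This gives the number of chunk_x directories to expect in the output of a run.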

Structure of output files after the job has started:

Each entry below lists a directory or file, an example, and notes.

  • Directory/File:
      /home/YourUserID/YourJobName.oJobNo
      /home/YourUserID/YourJobName.eJobNo
      /home/YourUserID/iprscan-xxx.oxxxxx
      /home/YourUserID/iprscan-xxx.exxxxx
    Example:
      /home/userID/growth.e1151902
      /home/userID/growth.o1151902
      /home/userID/iprscan-2008050.e1269981
      /home/userID/iprscan-2008050.o1269981
    Notes: YourJobName is the name specified in the batch script file, beside "#PBS -N". JobNo is the number that appears right after the job is submitted. If the -o option is not given, the summary result appears not only in the 'merged.raw' file (see below) but also in this 'xxx.oJobNo' file. To avoid duplicating the result, include '-o /dev/null' in the command, as in the sample script above.

  • Directory/File: /data/YourUserID/iprscan-yyyymmdd/
    Example: /data/userID/iprscan-20080301
    Notes: Automatically created.

  • Directory/File: /data/YourUserID/iprscan-yyyymmdd/iprscan-yyyymmdd-hhmmssxx/
    Example: /data/YourUserID/iprscan-20080301/iprscan-20080301-11414247/
    Notes: Subdirectories are created based on the time stamp at which the job is submitted, which separates multiple jobs submitted on the same day.

  • Directory/File: /data/UserID/pxxxxxxxx.res or .log or .fa
    Example:
      /data/UserID/p1063121154651720535.fa
      /data/UserID/p1063121154651720535.log
      /data/UserID/p1063121154651720535.res
    Notes: These files are temporary and are cleaned up automatically before the job finishes. Do not touch or remove these files while the job is running.

  • Directory/File: /data/YourUserID/iprscan-yyyymmdd/iprscan-yyyymmdd-hhmmssxx/iprscan-yyyymmdd-hhmmssxx.exitcode
    Example: /data/YourUserID/iprscan-20080301/iprscan-20080301-11414247/iprscan-20080301-11414247.exitcode
    Notes: The file content should be '0' if the job ran successfully. However, the .exitcode files under all chunks should also be checked to confirm. Go to each chunk_x directory and type 'more iprscan*.exitcode' to view the content of all the exitcode files at once.

  • Directory/File: /data/YourUserID/iprscan-yyyymmdd/iprscan-yyyymmdd-hhmmssxx/iprscan-yyyymmdd-hhmmssxx.input
    Example: /data/YourUserID/iprscan-20080301/iprscan-20080301-11414247/iprscan-20080301-11414247.input
    Notes: Input file with the sequences converted to fasta format.

  • Directory/File: /data/YourUserID/iprscan-yyyymmdd/iprscan-yyyymmdd-hhmmssxx/iprscan-yyyymmdd-hhmmssxx.input.inx
    Example: /data/YourUserID/iprscan-20080301/iprscan-20080301-11414247/iprscan-20080301-11414247.input.inx
    Notes: Binary format of the input sequences.

  • Directory/File: /data/YourUserID/iprscan-yyyymmdd/iprscan-yyyymmdd-hhmmssxx/iprscan-yyyymmdd-hhmmssxx.params
    Example: /data/YourUserID/iprscan-20080301/iprscan-20080301-11414247/iprscan-20080301-11414247.params
    Notes: Checksum summary of all the input sequences.

  • Directory/File: /data/YourUserID/iprscan-yyyymmdd/iprscan-yyyymmdd-hhmmssxx/iprscan-yyyymmdd-hhmmssxx.seqs
    Example: /data/YourUserID/iprscan-20080301/iprscan-20080301-11414247/iprscan-20080301-11414247.seqs
    Notes: Input file with the sequences in their original format.

  • Directory/File: /data/YourUserID/iprscan-yyyymmdd/iprscan-yyyymmdd-hhmmssxx/merged.raw
    In addition to merged.raw, it can also be xxx.html, xxx.xml or xxx.txt, depending on the format the user specified.
    Example: /data/YourUserID/iprscan-20080301/iprscan-20080301-11414247/merged.raw
    Notes: Output summary file of all chunks. The format can be raw, html, xml or txt.

  • Directory/File: /data/YourUserID/iprscan-yyyymmdd/iprscan-yyyymmdd-hhmmssxx/chunk_x/
    Example: /data/YourUserID/iprscan-20080301/iprscan-20080301-11414247/chunk_3
    Notes: One directory is created for each chunk of sequences. It contains the output files for each of the 13 applications.

  • Directory/File: /data/YourUserID/iprscan-yyyymmdd/iprscan-yyyymmdd-hhmmssxx/chunk_x/iprscan-yyyymmdd-hhmmssxx-APP-cnkX.OUTPUTFILE
    Example: /data/YourUserID/iprscan-20080301/iprscan-20080301-11414247/chunk_2/iprscan-20080310-13280403-profilescan-cnk2.exitcode
    Notes: Each application generates 4 output files: .output, .output.inx, .errors and .exitcode. Check the exitcode file of every application in every chunk. The content should be '0' in all exitcode files for a successful run.

  • Directory/File: /data/YourUserID/iprscan-yyyymmdd/iprscan-yyyymmdd-hhmmssxx/chunk_x/iprscan-yyyymmdd-hhmmssxx.nocrc
    Example: /data/YourUserID/iprscan-20080301/iprscan-20080301-11414247/chunk_2/iprscan-20080310-13280403.nocrc
    Notes: The .nocrc file under each chunk_x directory contains the query sequences that do not have a known crc64 according to the match.xml file. If the -nocrc flag is not given, applications are launched only against these sequences.

  • Directory/File: /data/YourUserID/iprscan-yyyymmdd/iprscan-yyyymmdd-hhmmssxx/chunk_x/iprscan-yyyymmdd-hhmmssxx.xml
    Example: /data/YourUserID/iprscan-20080301/iprscan-20080301-11414247/chunk_2/iprscan-20080310-13280403.xml
    Notes: The 'xxx.xml' file under each chunk_x directory contains the default output from the search for that chunk. Additional output formats can be obtained by changing the command option, for example from -format raw to -format html. To view an .html output file, type 'firefox YourFileName.html'.

  • Directory/File: /data/YourUserID/iprscan-yyyymmdd/iprscan-yyyymmdd-hhmmssxx/chunk_x/merged.raw
    Example: /data/YourUserID/iprscan-20080301/iprscan-20080301-11414247/chunk_2/merged.raw
    Notes: The merged.raw file under each chunk_x directory contains the output in raw format, converted from the .xml file.
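
Checking every exitcode file by hand becomes tedious for runs with many chunks. The sketch below walks a run directory and reports any .exitcode file whose content is not '0'. The function name check_exitcodes is illustrative, and the directory layout is assumed to be the one described above.

```shell
#!/bin/bash
# Report every .exitcode file under a run directory whose content is not '0'.
# Returns 0 if all exit codes are 0, 1 otherwise.
# A directory with no .exitcode files also passes, so check the path first.
check_exitcodes() {
    local rundir=$1 bad
    bad=$(find "$rundir" -type f -name '*.exitcode' | while IFS= read -r f; do
        code=$(tr -d '[:space:]' < "$f")
        [ "$code" = "0" ] || echo "$f: $code"
    done)
    if [ -n "$bad" ]; then
        echo "$bad"
        return 1
    fi
}
```

For example, 'check_exitcodes /data/YourUserID/iprscan-20080301/iprscan-20080301-11414247' prints nothing and returns 0 when the run-level file and all chunk-level files contain '0'.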

Cleanup

Output files from iprscan runs are grouped under two directories: /home/userID and /data/userID/iprscan-xxxxx. Because these files accumulate and can quickly fill up a user's space, users are strongly encouraged to clean up frequently.

The main output file(s), such as merged.raw, xxx.html, xxx.xml or xxx.txt, contain the summarized output from all chunks in each run. The other files can be deleted after checking the exitcode files in the chunk_x directories as described above. Sample cleanup commands:

% cd /data/user
% mv iprscan-yyyymmdd/iprscan..../merged.raw . # or *.html or *.txt or *.xml
% rm -r iprscan-yyyymmdd
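
A more defensive variant of these commands is sketched below. It is only an illustration: the function name cleanup_run is hypothetical, the merged file is assumed to be named merged.raw unless another pattern is passed, and the layout is assumed to be one run directory per submission inside the day's directory. It refuses to delete anything if an exitcode file is not '0' or if no merged file is found.

```shell
#!/bin/bash
# Keep the merged result of each run, then delete the day's directory.
# Usage: cleanup_run <day_dir> <dest_dir> [pattern]
cleanup_run() {
    local daydir=$1 dest=$2 pattern=${3:-merged.raw} failed f run moved=0
    # List .exitcode files that do not contain a line consisting of '0'.
    failed=$(find "$daydir" -type f -name '*.exitcode' -exec grep -Lx 0 {} +)
    if [ -n "$failed" ]; then
        echo "not cleaning up; non-zero exit codes in:" >&2
        echo "$failed" >&2
        return 1
    fi
    # Move the run-level merged file of each run, prefixed with the run name
    # so that several runs from the same day do not overwrite each other.
    for f in "$daydir"/*/$pattern; do
        [ -f "$f" ] || continue
        run=$(basename "$(dirname "$f")")
        mv "$f" "$dest/$run.$(basename "$f")" || return 1
        moved=$((moved + 1))
    done
    if [ "$moved" -eq 0 ]; then
        echo "no $pattern found under $daydir; nothing removed" >&2
        return 1
    fi
    rm -r "$daydir"
}
```

For example, 'cleanup_run /data/YourUserID/iprscan-20080301 /data/YourUserID' leaves a file named after the run, ending in .merged.raw, in /data/YourUserID and removes the rest.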

Benchmarks

Input files containing query sequences were submitted to the batch system using 'qsub -l nodes=1,mem=2048 scriptName', as in the example above. (The memory specification, m2048, is there to make sure that only nodes with more than 2048 MB of memory are assigned, since all data files are currently smaller than 2 GB.)

  • 3000 amino acids, 1 hour 32 minutes
  • 1000 amino acids, 57 minutes
  • 1000 nucleic acids, 2 hours 50 minutes

Database

  • The InterPro database is updated within a week whenever a newer version becomes available.
  • The current InterPro version is 16.1.
  • Database files can be accessed at /spin/db/iprdb/data.new

Please contact the Helix Systems staff if you have questions.

 

