Biowulf at the NIH
PolyPhen-2 on Biowulf

PolyPhen-2 (Polymorphism Phenotyping v2) is a software tool that predicts the possible impact of amino acid substitutions on the structure and function of human proteins using straightforward physical and evolutionary comparative considerations.

Running PolyPhen-2 jobs in parallel can be problematic: if a single dumpdir is shared and the same sequences are used in multiple jobs, one job will lock access to the dumpdir and the other jobs will stall.

To easily access the PolyPhen-2 executables, set your PATH using environment modules:

module load polyphen-2
pph_swarm.pl

PolyPhen-2 can be distributed on the Biowulf cluster as a swarm. To do so, use the command pph_swarm.pl:

[biowulf]$ module load polyphen-2
[biowulf]$ pph_swarm test.inp -d /data/user/polyphen2/scratch -o test.out
dumpdir = /data/user/polyphen2/scratch
1234568.biobos

pph_swarm will run three batch jobs. The first job will sort the input and split it into sets of mutations, one set per protein accession. The second job will run the pph2 commands in parallel via a swarm jobarray. The third job will collect all the output from the pph2 jobs and write it to a single file.
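
The first of these jobs can be pictured with a short shell sketch. This is not the actual pph_swarm code. It assumes a batch input file in which the first whitespace-separated field of each line is the protein accession and lines starting with # are comments; the function name split_by_accession and the per-accession file naming are hypothetical.

```shell
# Sketch only: group mutation queries by protein accession (field 1).
# split_by_accession and the <accession>.inp naming are hypothetical,
# not part of PolyPhen-2 or pph_swarm.
# The body runs in a subshell "( ... )" so its variables stay local.
split_by_accession() (
    infile=$1
    outdir=$2
    mkdir -p "$outdir" || exit 1
    # Drop comment lines, sort on the accession (stable, so the order of
    # queries within one protein is preserved), then append each query
    # to the file belonging to its protein.
    grep -v '^#' "$infile" | sort -s -k1,1 | while read -r acc rest; do
        [ -n "$acc" ] || continue          # skip blank lines
        printf '%s %s\n' "$acc" "$rest" >> "$outdir/$acc.inp"
    done
)
```

Each resulting file holds the queries for one protein, which is the unit of work that a single swarm subjob would then process.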

By default, the final output file is named after the input file with .pph_out appended. This can be changed using the -o option. Output and error files from the batch system runs will also be present. In the example above, the final output is therefore written to test.out.
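
The naming rule can be restated as a small shell helper. The function pph_output_path is hypothetical and exists only to illustrate the rule; it is not a command provided by the polyphen-2 module.

```shell
# Sketch only: mirror the documented naming rule for the final output.
# pph_output_path is a hypothetical helper, not a pph_swarm command.
pph_output_path() (
    infile=$1
    override=${2:-}
    if [ -n "$override" ]; then
        printf '%s\n' "$override"          # value given with -o
    else
        printf '%s\n' "$infile.pph_out"    # documented default
    fi
)

pph_output_path test.inp             # prints test.inp.pph_out
pph_output_path test.inp test.out    # prints test.out
```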

pph_swarm is a wrapper for the pph command, and accepts all the same options. However, the dumpdir (-d option to pph) is set to ~/polyphen2/scratch by default, and must be in a shared location (not /scratch!).
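
A simple guard before submitting can catch a dumpdir that points at /scratch. The sketch below is only an illustration: check_dumpdir is a hypothetical helper, and it inspects the path string only, so it cannot confirm that a directory is actually visible from every node.

```shell
# Sketch only: refuse a dumpdir under /scratch, then make sure it exists.
# check_dumpdir is a hypothetical helper, not part of pph_swarm.
check_dumpdir() (
    dumpdir=$1
    case "$dumpdir" in
        /scratch|/scratch/*)
            echo "dumpdir '$dumpdir' is under /scratch; use a shared location" >&2
            exit 1
            ;;
    esac
    # Create the directory (and any missing parents) if needed.
    mkdir -p "$dumpdir"
)
```

It could be called as check_dumpdir /data/user/polyphen2/scratch before running pph_swarm with the same path given to -d.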

pph_swarm automatically bundles the PolyPhen-2 runs, so there is no need to break the input into separate files. This is useful for exhaustive mutational analysis comprising more than 100,000 mutations.

Documentation
$ pph_swarm.pl
Usage:
    pph_swarm [options] infile

    where options are:

      -s seqfile  read query protein sequences from FASTA-formatted seqfile

      -b dbase    BLAST sequence database; default is 'nrdb/uniref100'

      -d dumpdir  directory for auxiliary output files; default is 'scratch'

      -r n-m[:s]  submit a range of queries from the input file; n=first,
                  m=last, s=stepsize (default s=1)

      -v level    debugging output, verbosity level=[1..3]

      -o output   set the output file path (default = infile.pph_out)

      --debug     don't actually run, just show what would have happened
  
      --noclean   don't remove intermediate files

      --gb-per-process int
                  request more than 1 GB per process (swarm option)

      --block
                  run pph_swarm in foreground as a single process (uses the
                  -W block=true attribute from PBS)

      --queue     use the specified queue for swarm

pph_swarm passes the -s, -b, -d, -r, and -v options to pph within the swarm.
The other options are only for pph_swarm.
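
To make the -r n-m[:s] syntax concrete, the following shell sketch lists the query numbers that a given range would select. It is an illustration only: expand_range is a hypothetical helper, and it assumes that both bounds are inclusive and that stepping starts at n.

```shell
# Sketch only: list the query numbers selected by a range spec n-m[:s].
# expand_range is hypothetical; it assumes inclusive bounds.
expand_range() (
    spec=$1
    step=1                                 # default s=1
    case "$spec" in
        *:*) step=${spec##*:}; spec=${spec%%:*} ;;
    esac
    first=${spec%%-*}
    last=${spec#*-}
    [ "$step" -ge 1 ] || exit 1            # guard against a non-positive step
    i=$first
    while [ "$i" -le "$last" ]; do
        printf '%s\n' "$i"
        i=$((i + step))
    done
)

expand_range 1-5       # prints 1 2 3 4 5, one per line
expand_range 2-10:3    # prints 2 5 8, one per line
```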