PolyPhen-2 (Polymorphism Phenotyping v2) is a software tool that predicts the possible impact of amino acid substitutions on the structure and function of human proteins using straightforward physical and evolutionary comparative considerations.
Running PolyPhen-2 jobs in parallel can be problematic: if several jobs share a single dumpdir and use the same sequences, one job locks access to the dumpdir while the other jobs stall.
To access the PolyPhen-2 executables, set your PATH using environment modules:
module load polyphen-2
PolyPhen-2 can be distributed on the Biowulf cluster as a swarm. To do so, use the command pph_swarm.pl:
[biowulf]$ module load polyphen-2
[biowulf]$ pph_swarm test.inp -d /data/user/polyphen2/scratch -o test.out
dumpdir = /data/user/polyphen2/scratch
1234568.biobos
pph_swarm runs three batch jobs. The first job sorts the input and splits it into sets of mutations, one set per protein accession. The second job runs the pph2 commands in parallel via a swarm jobarray. The third job gathers all the output from the pph2 jobs and writes it to a single file.
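The sort-and-split step of the first job can be illustrated with a short sketch. This is not the actual pph_swarm code. It assumes a whitespace-separated batch input whose first field is the protein accession, and the function name split_by_accession and the .inp suffix on the per-accession files are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only (not the pph_swarm implementation): split a batch input file
# into one file per protein accession.
# Assumption: each non-comment line starts with the accession, followed by
# whitespace-separated fields describing the substitution.
split_by_accession() {
    local infile=$1 outdir=$2
    mkdir -p "$outdir"
    # Sort on the first field so all lines for one accession are contiguous,
    # then write each group to its own file, closing the previous file
    # whenever the accession changes.
    LC_ALL=C sort -k1,1 "$infile" | awk -v dir="$outdir" '
        NF && $1 !~ /^#/ {
            f = dir "/" $1 ".inp"
            if (f != prev) { if (prev != "") close(prev); prev = f }
            print > f
        }'
}
```

Calling split_by_accession test.inp split_dir would leave one file per accession in split_dir, each of which could then be handed to a separate pph2 run.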
By default, the final output file is named after the input file with .pph_out appended. This can be changed using the -o option. Output and error files from the batch system runs will also be present. Thus, for the example above, the output files would be:
- test.out: consolidated output from pph2
- pph_swarm.o#######: stdout from input setup
- pph_swarm.e#######: stderr from input setup
- sw0#####.o: stdout from pph2 jobs
- sw0#####.e: stderr from pph2 jobs
- pph_roundup.o#######: stdout from final roundup
- pph_roundup.e#######: stderr from final roundup, should be empty
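After a run completes, it can be useful to see which stderr files are not empty. The following is a minimal sketch; the helper name nonempty_stderr is hypothetical, and the glob patterns simply mirror the file names listed above, so they may need adjusting if your batch system names its files differently.

```shell
#!/usr/bin/env bash
# Sketch: print the stderr files from a pph_swarm run that contain any data.
# Assumption: the files sit directly in the given directory and follow the
# naming shown in the list above.
nonempty_stderr() {
    local dir=${1:-.}
    find "$dir" -maxdepth 1 -type f \
        \( -name 'pph_swarm.e*' -o -name 'sw*.e' -o -name 'pph_roundup.e*' \) \
        -size +0c -print | sort
}
```

Running nonempty_stderr in the submission directory prints nothing when every stderr file is empty; any path it does print is worth inspecting.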
pph_swarm is a wrapper for the pph command, and accepts all the same options. However, the dumpdir (-d option to pph) is set to ~/polyphen2/scratch by default, and must be in a shared location (not /scratch!).
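A small guard can catch an unsuitable dumpdir before any jobs are submitted. This sketch only inspects the path string as given: it does not resolve symlinks or relative paths, and the helper name check_dumpdir is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: refuse a dumpdir under /scratch, otherwise make sure it exists.
# Assumption: the path is absolute and is not a symlink into /scratch.
check_dumpdir() {
    local dir=$1
    case "$dir" in
        /scratch|/scratch/*)
            echo "dumpdir '$dir' is under /scratch; use a shared location" >&2
            return 1
            ;;
    esac
    mkdir -p "$dir"
}
```

For example, check_dumpdir /data/user/polyphen2/scratch && pph_swarm test.inp -d /data/user/polyphen2/scratch -o test.out submits the run only if the directory passes the check.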
pph_swarm automatically bundles the PolyPhen-2 runs, so there is no need to break the input into separate files. This is useful for exhaustive mutational analyses comprising more than 100,000 mutations.
$ pph_swarm.pl
Usage:
  pph_swarm [options] infile
where options are:
  -s seqfile   read query protein sequences from FASTA-formatted seqfile
  -b dbase     BLAST sequence database; default is 'nrdb/uniref100'
  -d dumpdir   directory for auxiliary output files; default is 'scratch'
  -r n-m[:s]   submit a range of queries from the input file; n=first,
               m=last, s=stepsize (default s=1)
  -v level     debugging output, verbosity level=[1..3]
  -o output    set the output file path (default = infile.pph_out)
  --debug      don't actually run, just show what would have happened
  --noclean    don't remove intermediate files
pph_swarm passes the -s, -b, -d, -r, and -v options to pph within the swarm.
The other options are only for pph_swarm.


