Biowulf at the NIH
Simwalk on Biowulf

SimWalk2 is a statistical genetics computer application for haplotype, parametric linkage, non-parametric linkage (NPL), identity by descent (IBD) and mistyping analyses on any size of pedigree. SimWalk2 uses Markov chain Monte Carlo (MCMC) and simulated annealing algorithms to perform these multipoint analyses.

SimWalk2 was developed by Eric Sobel, Kenneth Lange, Daniel Weeks, Jeff O'Connell, and Goncalo Abecasis at UCLA. See the SimWalk2 documentation at UCLA.

Simwalk is also available on Helix. Users who need relatively few Simwalk runs should use Helix; running Simwalk on Biowulf is advantageous only if you need a large number of runs.

Running a swarm of Simwalk jobs

For each Simwalk job, you can set up a batch command file and submit it via qsub to the Biowulf batch system. Each job is then allotted a node. Because Simwalk is not parallelized, the job uses only 1 processor of that node, which is an inefficient use of the system. The preferred way to submit large numbers of Simwalk jobs is the swarm command.

  1. Set up the Simwalk jobs, each in a directory with the appropriate input files.
  2. Create a swarm command file as below, with one line for each Simwalk job.
    ------------------ file cmdfile -----------------------
    cd /data/user/simwalk/run1; simwalk2
    cd /data/user/simwalk/run2; simwalk2
    cd /data/user/simwalk/run3; simwalk2
    [...etc...]
    -------------------------------------------------------
    
  3. The swarm flag '-f' is required. Two other flags, '-t' and '-g', are the ones users most often need to specify when submitting a swarm job.

    -f: name of the swarm command file created above (required)
    -t: number of processors per node to use for each command line in the swarm file (optional)
    -g: memory in GB needed for each command line in the swarm file (optional)

    By default, each command line is executed on 1 processor core of a node and uses 1 GB of memory. If this is not what you want, specify the '-t' and '-g' flags when you submit the job on Biowulf.

    For example, if each command line needs 10 GB of memory instead of the default 1 GB, tell swarm by including the '-g 10' flag:

    [user@biowulf ~]$ swarm -g 10 -f cmdfile

    For more information on running swarm, see the swarm documentation (swarm.html).
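
When there are many run directories, the swarm command file from step 2 does not have to be typed by hand. The following is a minimal sketch of one way to generate it. It assumes every run lives in a subdirectory named run* under a common base directory and that the paths contain no spaces; the function name make_cmdfile is hypothetical and is not part of swarm or SimWalk2.

```shell
# make_cmdfile is a hypothetical helper, not part of swarm or SimWalk2.
# Usage: make_cmdfile BASE_DIR OUTPUT_FILE
make_cmdfile() {
    base="$1"
    out="$2"
    : > "$out"                          # create or truncate the command file
    for dir in "$base"/run*/; do
        [ -d "$dir" ] || continue       # glob matched nothing: skip
        # One line per job: change into the run directory, then start simwalk2
        printf 'cd %s; simwalk2\n' "${dir%/}" >> "$out"
    done
}

# Example call (paths taken from the listing above):
# make_cmdfile /data/user/simwalk cmdfile
```

The resulting file has the same form as the cmdfile listing above and can be passed to swarm with the '-f' flag.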

Running an interactive Simwalk job

Typically, Simwalk runs should be done as non-interactive batch jobs or via the swarm command. It may sometimes be useful to run interactively for debugging purposes.

This test run uses the files MAP.DAT, LOCUS.DAT, PEDIGREE.DAT and PEN.DAT from the Simwalk example set and performs the example sampling analysis (file BATCH-01.DAT). These files can be copied from /usr/local/src/simwalk/SimWalk289/Examples.
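
Copying the example files into a working directory might be scripted as follows. This is only a sketch: the function name stage_example is hypothetical, and it assumes the five files sit directly in the Examples directory, which is not confirmed here. How SimWalk2 selects its control file is not covered by this sketch; consult the SimWalk2 documentation for that.

```shell
# stage_example is a hypothetical helper for illustration only.
# Usage: stage_example SOURCE_DIR DEST_DIR
stage_example() {
    src_dir="$1"
    dest_dir="$2"
    mkdir -p "$dest_dir"
    # File names as listed in the text; their exact location is an assumption
    for f in MAP.DAT LOCUS.DAT PEDIGREE.DAT PEN.DAT BATCH-01.DAT; do
        if [ -f "$src_dir/$f" ]; then
            cp "$src_dir/$f" "$dest_dir/"
        else
            echo "missing input file: $src_dir/$f" >&2
            return 1                    # stop at the first missing file
        fi
    done
}

# Example call:
# stage_example /usr/local/src/simwalk/SimWalk289/Examples "$HOME/mydir"
```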

[user@biowulf ~]$ qsub -I -l nodes=1
qsub: waiting for job 521768.biobos to start
qsub: job 521768.biobos ready

[user@p554 ~]$ cd mydir
[user@p554 ~/mydir]$ simwalk2
                                                                               
       SimWalk2 version 2.89                                                   
                                                                               
       Type of data analysis:         Pedigree Sampling                        
                                                                               
       Locus data INPUT file:         LOCUS.DAT                                
       Pedigree data INPUT file:      PEDIGREE.DAT                             
       Map data INPUT file:           MAP.DAT                                  
                                                                               
                                                                               
       Individual OUTPUT files:       MODEL-01.mmm                             
       Copy of all screen output:     VIDEO-01.TXT                             
                                                                               
       Here 'mmm' is from the order within the input pedigree file,            
       e.g., '001' for the first pedigree, etc.                                
                                                                               
  Working on data initialization ...                                           
   WARNING. In the locus file: LOCUS.DAT                                       
            for the following loci, minor adjustments had to be made           
            to the allele frequencies to force them to sum to 1.0:             
            CACNL1A1  pY2/1     KCNA5     S93                                  
   Map data file 'MAP.DAT' completed initialization;                           
   Locus data file 'LOCUS.DAT' completed initialization;                       
   Pedigree #001 completed initialization;                                     
  All data completed initialization.                                           
                                                                               
                                                                               
  Working on pedigree analysis ...                                             
                                                                               
   Pedigree #001 ('20') working on simulated annealing ...                     
      (Found an initial consistent state.)                                     
          25% done ...                                                         
          50% done ...                                                         
          75% done ...                                                         
   Pedigree #001 ('20') completed simulated annealing.                         
   Pedigree #001 ('20') working on Markov chain Monte Carlo process ...        
          25% done ...                                                         
          50% done ...                                                         
          75% done ...                                                         
   Pedigree #001 ('20') completed Markov chain Monte Carlo process.            
   Pedigree #001 ('20') completed all analyses.                                
                                                                               
  All individual pedigrees completed analysis.                                 
                                                                               
  Please see the following output files.                                       
                                                                               
       Individual OUTPUT files:       MODEL-01.mmm                             
       Copy of all screen output:     VIDEO-01.TXT                             
                                                                               
       Here 'mmm' is from the order within the input pedigree file,            
       e.g., '001' for the first pedigree, etc.                                
                                                                               
  Program run completed!                                                       
[user@p554 ~/mydir]$ exit

[user@biowulf ~]$

Documentation

Simwalk documentation at UCLA.