Biowulf at the NIH
VarScan on Biowulf

VarScan is a platform-independent, technology-independent software tool for identifying SNPs and indels in massively parallel sequencing of individual and pooled samples. Given data for a single sample, VarScan identifies and filters germline variants based on read counts, base quality, and allele frequency. Given data for a tumor-normal pair, VarScan also determines the somatic status of each variant (Germline, Somatic, or LOH) by comparing read counts between samples.

 

Programs Location

/usr/local/VarScan

Jar files for all available versions of VarScan are located in this directory. /usr/local/VarScan/VarScan.jar is a link to the latest version.
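To see which versions are installed, and which version the VarScan.jar symlink currently resolves to, something like the following should work (a sketch; the exact jar file names will vary with the installed versions):

```shell
# List the installed VarScan jar files (one per version)
ls -l /usr/local/VarScan/

# Show which version the VarScan.jar symlink points to
readlink -f /usr/local/VarScan/VarScan.jar
```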

Submitting a single batch job

1. Create a script file. Here is a sample batch script:

#!/bin/bash
# This file is YourOwnFileName
#
#PBS -N VarScanjob
#PBS -m be
#PBS -k oe

# 'alias' is not expanded in non-interactive batch scripts by default,
# so define a shell function instead
VarScan() { java -Xmx2000m -jar /usr/local/VarScan/VarScan.jar "$@"; }

cd /data/user/mydir
VarScan pileup2snp mypileup.file --min-coverage
VarScan pileup2indel mypileup.file --min-coverage

Note that VarScan is set here to use 2000 MB = 2 GB of memory. This value can be adjusted to suit your own job.
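For example, only the -Xmx value needs to change to give the JVM a larger maximum heap (a sketch of generic JVM flags, not anything Biowulf-specific):

```shell
# Same invocation with a 4 GB maximum heap; -Xmx accepts
# megabyte (m) or gigabyte (g) suffixes
java -Xmx4g -jar /usr/local/VarScan/VarScan.jar pileup2snp mypileup.file --min-coverage
```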

2. Submit the script using the 'qsub' command on Biowulf, as in the example below. Note: users are advised to run benchmarks to determine what type of node is suitable for their jobs.

To submit to a node with 4 GB of memory (a little more than the 2 GB required by the VarScan job, for a safety margin):
[user@biowulf]$ qsub -l nodes=1:g4 /data/username/theScriptFileAbove
Use 'freen' to see available node types.

 

Submitting a swarm of jobs

Using the 'swarm' utility, one can submit many jobs to the cluster to run concurrently.

Set up a swarm command file (e.g. /data/username/cmdfile). Here is a sample file:

java -Xmx2000m -jar /usr/local/VarScan/VarScan.jar pileup2snp mypileup1.file --min-coverage
java -Xmx2000m -jar /usr/local/VarScan/VarScan.jar pileup2snp mypileup2.file --min-coverage
java -Xmx2000m -jar /usr/local/VarScan/VarScan.jar pileup2snp mypileup3.file --min-coverage
[...]     
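If you have many pileup files, a command file like the one above can be generated with a short shell loop rather than written by hand (a sketch; the mypileup*.file glob and the cmdfile output name are assumptions to adjust for your own data):

```shell
# Write one VarScan command per pileup file into the swarm command file
for f in mypileup*.file; do
    echo "java -Xmx2000m -jar /usr/local/VarScan/VarScan.jar pileup2snp $f --min-coverage"
done > cmdfile
```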

Submit this job with

swarm -f cmdfile

By default, each line of the command file above is executed on a single processor core of a node and may use up to 1 GB of memory. If each of your VarScan command lines requires more than 1 GB of memory, specify the required memory using swarm's '-g #' flag, where # is the number of gigabytes of memory required by a single command. For example, if each VarScan command in the swarm file above requires 10 GB of memory, submit the job with:

[user@biowulf]$ swarm -g 10 -f cmdfile

For more information about running swarm, see the swarm documentation (swarm.html).

 

Running an interactive job

Users sometimes need to run jobs interactively. Such jobs should not be run on the Biowulf login node; instead, allocate an interactive node as described below and run the interactive job there.

[user@biowulf] $ qsub -I -l nodes=1
qsub: waiting for job 2236960.biobos to start
qsub: job 2236960.biobos ready

[user@p4]$ cd /data/user/myruns
[user@p4]$ cd /data/userID/VarScan/run1
[user@p4]$ java -Xmx2000m -jar /usr/local/VarScan/VarScan.jar pileup2snp mypileup1.file --min-coverage
[user@p4]$ exit
qsub: job 2236960.biobos completed
[user@biowulf]$

Users may add a node property to the qsub command to request a specific type of interactive node. For example, if you need a node with 24 GB of memory to run a job interactively, do this:

[user@biowulf]$ qsub -I -l nodes=1:g24
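Once the interactive node is allocated, it may be worth confirming the node's memory before launching a large JVM. This is a sketch using a standard Linux kernel interface, not Biowulf-specific tooling:

```shell
# Total physical memory on this node, from the kernel's meminfo
grep MemTotal /proc/meminfo
```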

 

Documentation

http://varscan.sourceforge.net/using-varscan.html