Biowulf at the NIH
RSS Feed
GATK on Biowulf
GATK (Genome Analysis Tool Kit) is a structured software library that makes writing efficient analysis tools using next-generation sequencing data very easy, and second it's a suite of tools for working with human medical resequencing projects such as 1000 Genomes and The Cancer Genome Atlas. These tools include things like a depth of coverage analyzers, a quality score recalibrator, a SNP/indel caller and a local realigner.

GATK is developed and maintained at the Broad Institute. GATK website

The GATK Resource Bundle (a collection of standard files for working with human resequencing data with the GATK) is available on Helix/Biowulf in /fdb/GATK_resource_bundle

To add the GATK executables to your PATH and set up the environment, you need to load the GATK module using the command 'module load GATK'. This will load the latest version of GATK, set up the environment variable GATK_HOME, and also set up an alias for the full java command. (Use 'module display GATK' to see exactly what the module does)

If you need a particular version of GATK, use the module commands to find and load that version. e.g.

[user@biowulf]$ module avail GATK

------------- /usr/local/Modules/3.2.9/modulefiles ------------------
 GATK/1.5-11   GATK/1.6.13   GATK/2.0.36    GATK/2.1-11

[user@biowulf]$ module load  GATK/1.6.13

[user@biowulf]$ module list
Currently Loaded Modulefiles:
  1) GATK/1.6.13

The alias 'GATK' is set up to use 1 GB of memory. If you need more than that, set up your own GATK modulefile or use the full command

java -Xmx####g -Djava.io.tmpdir=/scratch -jar $GATK_HOME/GenomeAnalysisTK.jar

GATK writes some temporary files during its run. By default, these files are written to /tmp on the node, but this is not advisable because /tmp is used by the operating system, and because it is a small area. The tmp files should be redirected to /scratch, by adding

-Djava.io.tmpdir=/scratch
to the java command. The alias 'GATK' that is set up by the module command includes this parameter. If you are using the full java command, you need to add this parameter to your command, as in the examples below.

Note that GATK has a phone-home feature which will send some generic information about each GATK run to the developers at the Broad Institute. This feature will not work on the Biowulf computational nodes, which are on a private network. This will not affect your runs in most cases. If you're getting an error related to 'phone-home', add
-et NO_ET -K /usr/local/gatk/helix.key
to your GATK command line to disable the phone-home feature.

Running a single GATK job on Biowulf

Set up a batch script along the following lines:

#!/bin/bash
#PBS -m be
# this file is gatk.bat

module load GATK

cd /data/user/test-gatk

# use the default 1 GB memory
GATK -R exampleFASTA.fasta -I  exampleBAM.bam -T CountReads -nt 4

# use 8 gb of memory
java -Xmx8g -Djava.io.tmpdir=/scratch -jar $GATK_HOME/GenomeAnalysisTK.jar 
     -R exampleFASTA.fasta -I  exampleBAM.bam  -T CountReads -nt 4

Submit this job to the batch system with the command:

biowulf% qsub -l nodes=1:c4 gatk.bat
The '-nt 4' parameter tells GATK to use 4 threads. Thus, this job should be submitted to a node with 4 cores (c4).

Submitting a swarm of GATK jobs

The following example uses GATK's multiple sequence alignment tool which reads in one or more BAM files and locally realign reads such that the number of mismatching bases is minimized across all the reads. (Details at the GATK website). In this case, you should add 'module load GATK' to your .bashrc or .cshrc file, or add it to the beginning of each line in your swarm command file.

Set up a swarm command file along the following lines:

# this file is gatk-swarm

java -Xmx7g -Djava.io.tmpdir=/scratch -jar $GATK_HOME/GenomeAnalysisTK.jar -T RealignerTargetCreator -R ref.fasta -o output.intervals -I input.bam -B:snps,VCF SNP_calls.vcf -B:indels,VCF indel_calls.vcf -D dbsnp.rod
java -Xmx7g -Djava.io.tmpdir=/scratch -jar $GATK_HOME/GenomeAnalysisTK.jar -T RealignerTargetCreator -R ref.fasta -o output.intervals -I input.bam -B:snps,VCF SNP_calls.vcf -B:indels,VCF indel_calls.vcf -D dbsnp.rod
java -Xmx7g -Djava.io.tmpdir=/scratch -jar $GATK_HOME/GenomeAnalysisTK.jar -T RealignerTargetCreator -R ref.fasta -o output.intervals -I input.bam -B:snps,VCF SNP_calls.vcf -B:indels,VCF indel_calls.vcf -D dbsnp.rod
[...]

These command lines specify that the process will use 7 GB of memory (-Xmx7g). Thus, this swarm job should be submitted with the '-g 7' flag, to tell swarm that each process will require 7 GB of memory.

biowulf% swarm -g 7  -f gatk-swarm --module GATK/2.3-9

Running an interactive GATK job

It may be useful for debugging purposes to run GATK jobs interactively. Such jobs should not be run on the Biowulf login node. Instead allocate an interactive node as described below, and run the interactive job there.

biowulf% qsub -I -l nodes=1
qsub: waiting for job 2236960.biobos to start
qsub: job 2236960.biobos ready

[user@p4]$ module load GATK/2.2-16
[user@p4]$ cd /data/user/myruns
[user@p4]$ java -Xmx7g -Djava.io.tmpdir=/scratch -jar $GATK_HOME/GenomeAnalysisTK.jar -T RealignerTargetCreator -R ref.fasta -o output.intervals -I input.bam -B:snps,VCF SNP_calls.vcf -B:indels,VCF indel_calls.vcf -D dbsnp.rod
[...]
[user@p4] exit
qsub: job 2236960.biobos completed
[user@biowulf ~]$ 

Make sure to exit the job once you have finished your run.

LiftOverVCF -- converts a VCF file from one reference build to another

Information in this section is courtesy of Elisa Majounie (NIA)

This procedure is to use liftOverVCF to switch from, say, hg18 to hg19.

  1. Get databases from
    Chainfiles: ftp://ftp.broadinstitute.org/Liftover_Chain_Files/
    Fasta files: ftp://ftp.broadinstitute.org/bundle/1.2/

  2. Sort sample vcf file: It needs to be sorted in the same order as the Broad files.
    vcfsorter.pl genome.dict myvcf > mynewvcf.file 2>STDERR
    
    (vcfsorter.pl was written by Dr. German Gaston Leparc and is available here)

  3. liftOverVCF.pl and sortByRef.pl are available in /usr/local/GATK, and the executables will be in your path if you use 'module load GATK'. The paths in liftOverVCF.pl have been modified to be correct for the Biowulf cluster.

Genome STRiP

Genome STRiP (Genome STRucture In Populations) from the Broad Institute is a suite of tools for discovery and genotyping of structural variation using sequencing data. The methods used in Genome STRiP are designed to find shared variation using data from multiple individuals. Genome STRiP looks both across and within a set of sequenced genomes to detect variation.

Genome STRiP has been installed in /usr/local/gatk/svtoolkit. You can set up the environment for genomestrip by typing 'module load genomestrip'.

You can add this command into your batch script. Note that this sets the path so that the version of BWA packaged with GenomeSTRip will be run by default. If you want to use the latest version of BWA, use /usr/local/bin/bwa.

Documentation

Extensive documentation is available at the GATK website