Biowulf at the NIH
RSS Feed
GATK on Biowulf
GATK (Genome Analysis Tool Kit) is a structured software library that makes writing efficient analysis tools using next-generation sequencing data very easy, and second it's a suite of tools for working with human medical resequencing projects such as 1000 Genomes and The Cancer Genome Atlas. These tools include things like a depth of coverage analyzers, a quality score recalibrator, a SNP/indel caller and a local realigner.

GATK is developed and maintained at the Broad Institute. GATK website

The GATK Resource Bundle (a collection of standard files for working with human resequencing data with the GATK) is available on Helix/Biowulf in /fdb/GATK_resource_bundle

To add the GATK executables to your PATH and set up the environment, you need to load the GATK module using the command 'module load GATK'. This will load the latest version of GATK, set up the environment variable GATK_HOME, and also set up an alias for the full java command. (Use 'module display GATK' to see exactly what the module does)

If you need a particular version of GATK, use the module commands to find and load that version. e.g.

[user@biowulf]$ module avail GATK

------------- /usr/local/Modules/3.2.9/modulefiles ------------------
 GATK/1.5-11   GATK/1.6.13   GATK/2.0.36    GATK/2.1-11

[user@biowulf]$ module load  GATK/1.6.13

[user@biowulf]$ module list
Currently Loaded Modulefiles:
  1) GATK/1.6.13

The alias 'GATK' is set up to use 1 GB of memory. If you need more than that, set up your own GATK modulefile or use the full command

java -Xmx####g -Djava.io.tmpdir=/scratch -jar $GATK_HOME/GenomeAnalysisTK.jar
v

GATK writes some temporary files during its run. By default, these files are written to /tmp on the node, but this is not advisable because /tmp is used by the operating system, and because it is a small area. The tmp files should be redirected to /scratch, by adding

-Djava.io.tmpdir=/scratch
to the java command. The alias 'GATK' that is set up by the module command includes this parameter. If you are using the full java command, you need to add this parameter to your command, as in the examples below.

Note that GATK has a phone-home feature which will send some generic information about each GATK run to the developers at the Broad Institute. This feature will not work on the Biowulf computational nodes, which are on a private network. This will not affect your runs in most cases. If you're getting an error related to 'phone-home', add
-et NO_ET -K /usr/local/gatk/helix.key
to your GATK command line to disable the phone-home feature.

GATK and Open Files

GATK opens many files simultaneously. It can happen that you hit the open-file limit on Biowulf, and see errors like:

##### ERROR MESSAGE: Couldn't read file .... (Too many open files)
in the GATK output.

The most important change you can make is to reduce the multithreading. If your GATK command has any multithreading flags turned on (e.g. '-nt' and 'nct'), reduce the number of threads to 1.

The open-file limit is set to 4096 on the Biowulf nodes, but the soft limit is still 1024. To open more than 1024 files simultaneously, you will need to add the following into your batch script:

ulimit -n 4096 

Here are some other possibilities (thanks to Ryan Neff, NHGRI):

* Reduce the number of open files. When doing variant calling, first create gVCF files for each sample and then combine them together in batches of 100-200, and then do the final variant calls on the remaining 5-20 files (GATK's HaplotypeCaller and GenotypeGVCFs). If you are using the UG or a different variant caller, try first discovering variant calls on a per-sample basis (-gt mode DISCOVERY), combining all of those files together, and then genotyping each sample on it's own from the list of all variants (-gt mode GENOTYPE_GIVEN_ALLELES).

* Reduce the number of open file pointers. This can be done by increasing RAM, reducing the number of temporary files open at one time. Sometimes, in the GATK, setting --disableAutoIndexCreationAndLockingWhenReadingRods helps with performance and other I/O issues.

Relevant discussions from the GATK mailing list:

GATK Parallelization

The Haplotype Caller analysis does not support parallel execution with 'nt' or 'nct'. A better option is to parallelize across regions by running each chromosome in parallel. You can also parallelize across samples and then combine. See here http://www.appistry.com/wp-content/uploads/2014/08/AppNote_HC-Performance_final.pdf for a discussion of performance in Haplotype Caller.

Running a single GATK job on Biowulf

Set up a batch script along the following lines:

#!/bin/bash
#PBS -m be
# this file is gatk.bat

module load GATK

cd /data/user/test-gatk

# use the default 1 GB memory
GATK -R exampleFASTA.fasta -I  exampleBAM.bam -T CountReads -nt 4

# use 8 gb of memory
java -Xmx8g -Djava.io.tmpdir=/scratch -jar $GATK_HOME/GenomeAnalysisTK.jar 
     -R exampleFASTA.fasta -I  exampleBAM.bam  -T CountReads -nt 4

Submit this job to the batch system with the command:

biowulf% qsub -l nodes=1:c4 gatk.bat
The '-nt 4' parameter tells GATK to use 4 threads. Thus, this job should be submitted to a node with 4 cores (c4).

Submitting a swarm of GATK jobs

The following example uses GATK's multiple sequence alignment tool which reads in one or more BAM files and locally realign reads such that the number of mismatching bases is minimized across all the reads. (Details at the GATK website). In this case, you should add 'module load GATK' to your .bashrc or .cshrc file, or add it to the beginning of each line in your swarm command file.

Set up a swarm command file along the following lines:

# this file is gatk-swarm

java -Xmx7g -Djava.io.tmpdir=/scratch -jar $GATK_HOME/GenomeAnalysisTK.jar -T RealignerTargetCreator -R ref.fasta -o output.intervals -I input.bam -B:snps,VCF SNP_calls.vcf -B:indels,VCF indel_calls.vcf -D dbsnp.rod
java -Xmx7g -Djava.io.tmpdir=/scratch -jar $GATK_HOME/GenomeAnalysisTK.jar -T RealignerTargetCreator -R ref.fasta -o output.intervals -I input.bam -B:snps,VCF SNP_calls.vcf -B:indels,VCF indel_calls.vcf -D dbsnp.rod
java -Xmx7g -Djava.io.tmpdir=/scratch -jar $GATK_HOME/GenomeAnalysisTK.jar -T RealignerTargetCreator -R ref.fasta -o output.intervals -I input.bam -B:snps,VCF SNP_calls.vcf -B:indels,VCF indel_calls.vcf -D dbsnp.rod
[...]

These command lines specify that the process will use 7 GB of memory (-Xmx7g). Thus, this swarm job should be submitted with the '-g 7' flag, to tell swarm that each process will require 7 GB of memory.

biowulf% swarm -g 7  -f gatk-swarm --module GATK/2.3-9

Running an interactive GATK job

It may be useful for debugging purposes to run GATK jobs interactively. Such jobs should not be run on the Biowulf login node. Instead allocate an interactive node as described below, and run the interactive job there.

biowulf% qsub -I -l nodes=1
qsub: waiting for job 2236960.biobos to start
qsub: job 2236960.biobos ready

[user@p4]$ module load GATK/2.2-16
[user@p4]$ cd /data/user/myruns
[user@p4]$ java -Xmx7g -Djava.io.tmpdir=/scratch -jar $GATK_HOME/GenomeAnalysisTK.jar -T RealignerTargetCreator -R ref.fasta -o output.intervals -I input.bam -B:snps,VCF SNP_calls.vcf -B:indels,VCF indel_calls.vcf -D dbsnp.rod
[...]
[user@p4] exit
qsub: job 2236960.biobos completed
[user@biowulf ~]$ 

Make sure to exit the job once you have finished your run.

LiftOverVCF -- converts a VCF file from one reference build to another

Information in this section is courtesy of Elisa Majounie (NIA)

This procedure is to use liftOverVCF to switch from, say, hg18 to hg19.

  1. Get databases from
    Chainfiles: ftp://ftp.broadinstitute.org/Liftover_Chain_Files/
    Fasta files: ftp://ftp.broadinstitute.org/bundle/1.2/

  2. Sort sample vcf file: It needs to be sorted in the same order as the Broad files.
    vcfsorter.pl genome.dict myvcf > mynewvcf.file 2>STDERR
    
    (vcfsorter.pl was written by Dr. German Gaston Leparc and is available here)

  3. liftOverVCF.pl and sortByRef.pl are available in /usr/local/GATK, and the executables will be in your path if you use 'module load GATK'. The paths in liftOverVCF.pl have been modified to be correct for the Biowulf cluster.

Genome STRiP

Genome STRiP (Genome STRucture In Populations) from the Broad Institute is a suite of tools for discovery and genotyping of structural variation using sequencing data. The methods used in Genome STRiP are designed to find shared variation using data from multiple individuals. Genome STRiP looks both across and within a set of sequenced genomes to detect variation.

Genome STRiP has been installed in /usr/local/gatk/svtoolkit. You can set up the environment for genomestrip by typing 'module load genomestrip'.

You can add this command into your batch script. Note that this sets the path so that the version of BWA packaged with GenomeSTRip will be run by default. If you want to use the latest version of BWA, use /usr/local/bin/bwa.

Documentation

Extensive documentation is available at the GATK website