GATK is developed and maintained at the Broad Institute. GATK website
The GATK Resource Bundle (a collection of standard files for working with human resequencing data with the GATK) is available on Helix/Biowulf in /fdb/GATK_resource_bundle
To add the GATK executables to your PATH and set up the environment, you need to load the GATK module using the command 'module load GATK'. This will load the latest version of GATK, set up the environment module GATK_HOME, and also set up an alias for 'java -jar /usr/local/GATK/GenomeAnalysisTK.jar'.
If you need a particular version of GATK, use the module commands to find and load that version. e.g.
[user@biowulf]$ module avail GATK ------------- /usr/local/Modules/3.2.9/modulefiles ------------------ GATK/1.5-11 GATK/1.6.13 GATK/2.0.36 GATK/2.1-11 [user@biowulf]$ module load GATK/1.6.13 [user@biowulf]$ module list Currently Loaded Modulefiles: 1) GATK/1.6.13
The alias 'GATK' is set up to use 1 GB of memory. If you need more than that, set up your own GATK modulefile or use the full command
java -Xmx####m -jar $GATK_HOME/GenomeAnalysisTK.jar
Note that GATK has a phone-home feature which will send some generic information about each GATK run to the developers at the Broad Institute. This feature will not work on the Biowulf computational nodes, which are on a private network. This will not affect your runs in most cases. If you're getting an error related to 'phone-home', add
-et NO_ET -K /usr/local/gatk/helix.key
to your GATK command line to disable the phone-home feature.
Set up a batch script along the following lines:
#!/bin/bash #PBS -m be # this file is gatk.bat module load GATK cd /data/user/test-gatk # use the default 1 GB memory GATK -R exampleFASTA.fasta -I exampleBAM.bam -T CountReads -nt 4 # use 8 gb of memory java -Xmx8g -jar $GATK_HOME/GenomeAnalysisTK.jar -R exampleFASTA.fasta -I exampleBAM.bam \ -T CountReads -nt 4
Submit this job to the batch system with the command:
biowulf% qsub -l nodes=1:c4 gatk.bat
The following example uses GATK's multiple sequence alignment tool which reads in one or more BAM files and locally realign reads such that the number of mismatching bases is minimized across all the reads. (Details at the GATK website). In this case, you should add 'module load GATK' to your .bashrc or .cshrc file, or add it to the beginning of each line in your swarm command file.
Set up a swarm command file along the following lines:
# this file is gatk-swarm java -Xmx7g -jar $GATK_HOME/GenomeAnalysisTK.jar -T RealignerTargetCreator -R ref.fasta -o output.intervals -I input.bam -B:snps,VCF SNP_calls.vcf -B:indels,VCF indel_calls.vcf -D dbsnp.rod java -Xmx7g -jar $GATK_HOME/GenomeAnalysisTK.jar -T RealignerTargetCreator -R ref.fasta -o output.intervals -I input.bam -B:snps,VCF SNP_calls.vcf -B:indels,VCF indel_calls.vcf -D dbsnp.rod java -Xmx7g -jar $GATK_HOME/GenomeAnalysisTK.jar -T RealignerTargetCreator -R ref.fasta -o output.intervals -I input.bam -B:snps,VCF SNP_calls.vcf -B:indels,VCF indel_calls.vcf -D dbsnp.rod [...]
These command lines specify that the process will use 7 GB of memory (-Xmx7g). Thus, this swarm job should be submitted with the '-g 7' flag, to tell swarm that each process will require 7 GB of memory.
biowulf% swarm -g 7 -f gatk-swarm --module GATK/2.3-9
It may be useful for debugging purposes to run GATK jobs interactively. Such jobs should not be run on the Biowulf login node. Instead allocate an interactive node as described below, and run the interactive job there.
biowulf% qsub -I -l nodes=1 qsub: waiting for job 2236960.biobos to start qsub: job 2236960.biobos ready [user@p4]$ module load GATK/2.2-16 [user@p4]$ cd /data/user/myruns [user@p4]$ java -Xmx7g -jar $GATK_HOME/GenomeAnalysisTK.jar -T RealignerTargetCreator -R ref.fasta -o output.intervals -I input.bam -B:snps,VCF SNP_calls.vcf -B:indels,VCF indel_calls.vcf -D dbsnp.rod [...] [user@p4] exit qsub: job 2236960.biobos completed [user@biowulf ~]$
Make sure to exit the job once you have finished your run.
Information in this section is courtesy of Elisa Majounie (NIA)
This procedure is to use liftOverVCF to switch from, say, hg18 to hg19.
- Get databases from
Fasta files: ftp://ftp.broadinstitute.org/bundle/1.2/
Sort sample vcf file: It needs to be sorted in the same order as the Broad files.
(vcfsorter.pl was written by Dr. German Gaston Leparc and is available here)
vcfsorter.pl genome.dict myvcf > mynewvcf.file 2>STDERR
- liftOverVCF.pl and sortByRef.pl are available in /usr/local/GATK, and the executables will be in your path if you use 'module load GATK'. The paths in liftOverVCF.pl have been modified to be correct for the Biowulf cluster.
Genome STRiP (Genome STRucture In Populations) from the Broad Institute is a suite of tools for discovery and genotyping of structural variation using sequencing data. The methods used in Genome STRiP are designed to find shared variation using data from multiple individuals. Genome STRiP looks both across and within a set of sequenced genomes to detect variation.
Genome STRiP has been installed in /usr/local/gatk/svtoolkit. You can set up the environment for genomestrip by typing 'module load genomestrip'.
You can add this command into your batch script. Note that this sets the path so that the version of BWA packaged with GenomeSTRip will be run by default. If you want to use the latest version of BWA, use /usr/local/bin/bwa.
Extensive documentation is available at the GATK website