IMPUTE is a program for estimating ("imputing") unobserved genotypes in SNP association studies. The program is designed to work seamlessly with the output of the genotype calling program CHIAMO and the population genetic simulator HAPGEN, and it produces output that can be analyzed using the program SNPTEST. More information is available on the IMPUTE website at Oxford.
The associated programs snptest, gtool and qctool are also available in the /usr/local/impute directory. All these executables will become available in your path if you set up the environment with 'module load impute' (once per session).
Reference data from 1000 genomes for Impute2 is available in /fdb/impute2.
module load impute
Set up a swarm command file with one line for each Impute run. If running a swarm, it's best to add 'module load impute' to your .bashrc or .cshrc file. Example:
# this file is impute_swarm
cd /data/user/dir1; impute2 -ref_samp_out -m chr16.map -h chr16.haps -l chr16.legend -g gtypes -s refstrand1 -Ne 11418 -int 5000000 5500000 -buffer 250 -k 10 -iter 10 -burnin 3 -o out1 -i info1 -r summary1
cd /data/user/dir2; impute2 -ref_samp_out -m chr26.map -h chr26.haps -l chr26.legend -g gtypes -s refstrand2 -Ne 22428 -int 5000000 5500000 -buffer 250 -k 20 -iter 20 -burnin 3 -o out2 -i info2 -r summary2
cd /data/user/dir3; impute2 -ref_samp_out -m chr36.map -h chr36.haps -l chr36.legend -g gtypes -s refstrand3 -Ne 33438 -int 5000000 5500000 -buffer 250 -k 30 -iter 30 -burnin 3 -o out3 -i info3 -r summary3
[...]
If each Impute process requires less than 1 GB of memory, submit this to the batch system with the command:
swarm -f cmdfile
If each Impute process requires more than 1 GB of memory, use
swarm -g # -f cmdfile
where '#' is the number of gigabytes of memory required by each Impute process.
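The choice between the two submission commands can be sketched as a small helper that only prints the command rather than running it. The function name swarm_cmd is hypothetical, and treating a requirement of exactly 1 GB like the default case is an assumption, since the text above only covers "less than" and "more than" 1 GB.

```shell
#!/bin/bash
# Sketch: print the swarm command described above for a given memory need.
# swarm_cmd is a hypothetical helper, not part of swarm or Impute.
swarm_cmd() {
    local mem_gb=$1    # whole gigabytes needed by each Impute process
    local cmdfile=$2   # swarm command file, one Impute run per line
    if [ "$mem_gb" -le 1 ]; then
        # Assumption: 1 GB or less fits the default allocation.
        echo "swarm -f $cmdfile"
    else
        echo "swarm -g $mem_gb -f $cmdfile"
    fi
}

swarm_cmd 1 cmdfile   # prints: swarm -f cmdfile
swarm_cmd 4 cmdfile   # prints: swarm -g 4 -f cmdfile
```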
The example below uses the sample data from the Impute website. Users can copy the example files from /usr/local/impute/Examples, or download their own copies from the Impute website. The text files in the example directory contain command lines for sample Impute runs. User input follows the '[user@helix mydir]$' prompt below:
[user@helix mydir]$ cp -R /usr/local/impute/Example .
[user@helix mydir]$ cd Example
[user@helix mydir]$ impute2 -ref_samp_out -m ./chr16.map -h ./chr16.haps \
    -l ./chr16.legend -g ./chr16.reference.gtypes -s ./chr16.reference.strand \
    -Ne 11418 -int 5000000 5500000 -buffer 250 -k 10 -iter 10 -burnin 3 \
    -o ./Results/chr16.multi_panel.ref_gtypes.impute2 \
    -i ./Results/chr16.multi_panel.ref_gtypes.impute2.info \
    -r ./Results/chr16.multi_panel.ref_gtypes.impute2.summary

The seed for the random number generator is 1115038504.

Command-line input: impute2 -ref_samp_out -m ./chr16.map -h ./chr16.haps -l ./chr16.legend -g ./chr16.reference.gtypes -s ./chr16.reference.strand -Ne 11418 -int 5000000 5500000 -buffer 250 -k 10 -iter 10 -burnin 3 -o ./Results/chr16.multi_panel.ref_gtypes.impute2 -i ./Results/chr16.multi_panel.ref_gtypes.impute2.info -r ./Results/chr16.multi_panel.ref_gtypes.impute2.summary

======================
 IMPUTE version 2.0.3
======================

Copyright 2008 Bryan Howie, Peter Donnelly, and Jonathan Marchini
Please see the LICENCE file included with this program for conditions of use.

      haplotypes file : ./chr16.haps
          legend file : ./chr16.legend
   ref genotypes file : NULL
  ref gen strand file : NULL
       genotypes file : ./chr16.reference.gtypes
          strand file : ./chr16.reference.strand
             map file : ./chr16.map
   excluded SNPs file : NULL
   included SNPs file : NULL
      ref samp infile : NULL
          output file : ./Results/chr16.multi_panel.ref_gtypes.impute2
            info file : ./Results/chr16.multi_panel.ref_gtypes.impute2.info
         summary file : ./Results/chr16.multi_panel.ref_gtypes.impute2.summary
  imputation interval : [5000000,5500000]

reading genetic map...done

reading inference panel genotypes
 # inference panel individuals = 30
 # SNPs with genotypes read in = 430

reading haplotypes
 # haplotypes = 120
 # SNPs read in = 1202

No initial haplotype guesses file was provided for the inference panel genotypes; now phasing hets at random and imputing missing genotypes from allele freqs.
Summary :
 139 SNPs in left-hand buffer region
 356 SNPs in right-hand buffer region
 277 type 0 SNPs will be in output file (type 0 = SNP in reference haps file only)
 0 type 1 SNPs will be in output file (type 1 = SNP in reference gens file)
 0 type 2 SNPs will be in output file (type 2 = SNP in inference gens and all reference files)
 0 type 3 SNPs will be in output file (type 3 = SNP in inference gens file only)
 277 SNPs will be in output file in total
 1202 SNPs in total

-using strand file to orientate strand in inference genotype panel
--flipped strand at 237 genotyped SNPs in inference panel out of a total of 430
-aligning allele labels of haplotypes, reference genotypes, and inference genotypes
-removing non-aligned genotyped SNPs
--removing 0 genotyped SNPs out of a total of 430

setting weights...done
setting storage space...done
setting mutation matrices...done
setting switch rates...done

          haps in -h file : 120
     indiv in -g_ref file : 0
         indiv in -g file : 30
                 interval : [5000000, 5500000]
                   buffer : 250
                       Ne : 11418
              call thresh : 0.900
          MCMC iterations : 10
       burn-in iterations : 3
   states for phasing (k) : 10

MCMC iteration [1/10]
updating inf indiv [30/30] [dip]

MCMC iteration [2/10]
--- RESETTING MODEL PARAMETERS FOR INFORMED CONDITIONING ---
setting mutation matrices...done
setting switch rates...done
updating inf indiv [30/30] [dip]

MCMC iteration [3/10]
updating inf indiv [30/30] [dip]

MCMC iteration [4/10]
updating inf indiv [30/30] [dip] [hap 0] [hap 0]

MCMC iteration [5/10]
updating inf indiv [30/30] [dip] [hap 0] [hap 0]

MCMC iteration [6/10]
updating inf indiv [30/30] [dip] [hap 0] [hap 0]

MCMC iteration [7/10]
updating inf indiv [30/30] [dip] [hap 0] [hap 0]

MCMC iteration [8/10]
updating inf indiv [30/30] [dip] [hap 0] [hap 0]

MCMC iteration [9/10]
updating inf indiv [30/30] [dip] [hap 0] [hap 0]

MCMC iteration [10/10]
updating inf indiv [30/30] [dip] [hap 0] [hap 0]

dip sampling success rate: 0.949
hap sampling success rate: (no haploid sampling performed)
--------------------------------
 Imputation accuracy assessment
--------------------------------

This breakdown is based on an internal leave-one-out validation at SNPs with genotypes in the -g input file, using only the input genotypes with maximum call probabilities exceeding a threshold of 0.90. There are 5106 such genotypes in the current input file.

Accuracy assessment for imputation of type 0 SNPs (those with data in the haploid reference panel only).

The maximum imputed genotype calls are distributed as follows:

  Interval   #Genotypes  %Concordance    Interval    %Called  %Concordance
  [0.0-0.1]           0           0.0    [ >= 0.0]     100.0          96.7
  [0.1-0.2]           0           0.0    [ >= 0.1]     100.0          96.7
  [0.2-0.3]           0           0.0    [ >= 0.2]     100.0          96.7
  [0.3-0.4]           0           0.0    [ >= 0.3]     100.0          96.7
  [0.4-0.5]          10          40.0    [ >= 0.4]     100.0          96.7

[user@helix mydir]$ ls Results
chr16.multi_panel.ref_gtypes.impute2
chr16.multi_panel.ref_gtypes.impute2.info
chr16.multi_panel.ref_gtypes.impute2.summary
chr16.multi_panel.ref_gtypes.impute2_refsamp1.gz
chr16.multi_panel.ref_gtypes.impute2_refsamp10.gz
chr16.multi_panel.ref_gtypes.impute2_refsamp2.gz
chr16.multi_panel.ref_gtypes.impute2_refsamp3.gz
chr16.multi_panel.ref_gtypes.impute2_refsamp4.gz
chr16.multi_panel.ref_gtypes.impute2_refsamp5.gz
chr16.multi_panel.ref_gtypes.impute2_refsamp6.gz
chr16.multi_panel.ref_gtypes.impute2_refsamp7.gz
chr16.multi_panel.ref_gtypes.impute2_refsamp8.gz
chr16.multi_panel.ref_gtypes.impute2_refsamp9.gz
[user@helix mydir]$
Note about parallelization
In principle, it is possible to impute genotypes across an entire chromosome in a single run of IMPUTE2. However, we prefer to split each chromosome into smaller chunks for analysis, both because the program produces higher accuracy over short genomic regions and because imputing a chromosome in chunks is a good computational strategy: the chunks can be imputed in parallel on multiple computer processors, thereby decreasing the real computing time and limiting the amount of memory needed for each run.
We therefore recommend using the program on regions of ~5 Mb or shorter, and versions from v2.1.2 onward will throw an error if the analysis interval plus buffer region is longer than 7 Mb. People who have good reasons to impute a longer region in a single run can override this behavior with the -allow_large_regions flag.
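The chunking strategy described above can be combined with a swarm command file. The following is a minimal sketch, not a tested production script: it prints one impute2 command line per chunk, using only options that already appear in the swarm example on this page. The function name make_chunk_cmds, the directory, the input and output file names, and the 12 Mb length in the toy call are all hypothetical placeholders; substitute the real length of your chromosome or region.

```shell
#!/bin/bash
# Sketch only: print one impute2 command per chunk of a chromosome.
# The directory, file names, and chromosome length are hypothetical
# placeholders; the impute2 options are those used in the swarm example.

make_chunk_cmds() {
    local chr_len=$1   # chromosome (or region) length in base pairs
    local chunk=$2     # chunk size in base pairs, e.g. 5000000 for 5 Mb
    local start=1
    local end
    while [ "$start" -le "$chr_len" ]; do
        end=$(( start + chunk - 1 ))
        if [ "$end" -gt "$chr_len" ]; then
            end=$chr_len   # the last chunk may be shorter
        fi
        echo "cd /data/user/dir1; impute2 -m chr16.map -h chr16.haps -l chr16.legend -g gtypes -s refstrand1 -Ne 11418 -int $start $end -buffer 250 -k 10 -iter 10 -burnin 3 -o out.$start.$end -i info.$start.$end -r summary.$start.$end"
        start=$(( end + 1 ))
    done
}

# Toy example: a 12 Mb region in 5 Mb chunks gives three command lines.
make_chunk_cmds 12000000 5000000
```

Redirecting the output to a file (for example, make_chunk_cmds 12000000 5000000 > impute_swarm) gives a command file with one line per chunk that can be submitted with swarm as described earlier. Each chunk writes to its own output, info, and summary files, so the runs do not overwrite one another.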
See this informative snippet from the Impute website for more details about dealing with whole chromosomes.
Two additional executables, GTOOL and SNPTEST, are also available on Helix and Biowulf.
GTOOL is used to transform genotype data for SNPTEST and IMPUTE. Example files can be found in /usr/local/impute/gtool/example:
[helix]$ cp /usr/local/impute/gtool/example/* .
[helix]$ gtool -S --g example.gen --s example.sample --og out.gen --os out.sample --sample_id sample_id.txt
Number of input samples: 5
Samples...
Number of output samples: 3
Gen...
Number of input SNPs: 11
Number of output SNPs: 11
[helix]$
SNPTEST is used in concert with IMPUTE for single-SNP association analysis in GWAS. Example files can be found in /usr/local/impute/snptest/example:
[helix]$ cp /usr/local/impute/snptest/example/* .
[helix]$ snptest -cases cases.gen cases.sample -controls controls.gen controls.sample -o ex.out

SNPTEST v1.1.5
==============

Please refer to the LICENCE file included with this package for details of conditions of use.

Data Files :
-case files : cases.gen
-controls files : controls.gen
-case sample files : cases.sample
-controls sample files : controls.sample

Tests :

Data Summaries :
-number of SNPs = 100
-number of controls = 500
-number of cases = 500

Reading sample files :

Summary of covariates and phenotypes
# discrete variables : 2
cov_1 : type = 1
cov_2 : type = 2
# continuous variables : 2
cov_3 : type = 3
cov_4 : type = 3
# phenotypes : 2
pheno1 : type = P
pheno2 : type = P

Covariate ranges :
cov_1 : 0 1
cov_2 : 0 1 2 3 4 5
cov_3 : -3.27025 3.83104
cov_4 : -3.16989 3.17686

Phenotype ranges :
pheno1 : -1.07664 5.44337
pheno2 : -2.84885 3.69998

Exclusion list file : NONE
Exclusion list size = 0
Number of individuals removed based on the exclusion list = 0

Data with missing genotype data threshold and exclusion list applied :
-number of controls = 500
-number of cases = 500

Setting storage...done

Analyzing Data :
chunk [1/1]
reading data...controls...cases...done
run tests...
output results to file...done

finito
[helix]$