Imputation across genotyping arrays for genome-wide association studies: assessment of bias and a correction strategy.

Authors: Johnson, Eric O; Hancock, Dana B; Levy, Joshua L; Gaddis, Nathan C; Saccone, Nancy L; Bierut, Laura J; Page, Grier P
Year: 2013
Journal: Human genetics
PMID: 23334152
DOI: 10.1007/s00439-013-1266-7
PMCID: PMC3628082

Fig. 1

Genomic inflation factors (λgc; grey lines) and percentages of SNPs with spurious association (P < 1 × 10⁻⁶; black lines), by minor allele frequency (MAF), when combining studies genotyped on different Illumina BeadChip arrays (Human1M or HumanHap550 version 3). a–c European American subjects from SAGE were compared to PanScan subjects, and d–f African American subjects from SAGE were compared to iControl subjects. Three different SNP sets were assessed: a, d genotyped SNPs available on both arrays; b, e imputed SNPs based on the union of genotyped SNPs available on either array; and c, f imputed SNPs based on the intersection of genotyped SNPs available on both arrays. The number of SNPs with MAF >1 % and the overall λgc are shown in each plot.
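The two quantities plotted per MAF bin in the caption above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: it assumes 1-df association tests, takes λgc as the median observed chi-square statistic divided by the median of the 1-df chi-square distribution under the null, and uses the caption's P < 1 × 10⁻⁶ threshold for "spurious" SNPs. The function names, the `(maf, p)` input format, and the 1 % bin width are hypothetical choices made for the example.

```python
from statistics import NormalDist, median

# Median of a 1-df chi-square under the null, about 0.455 (assumes 1-df tests).
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2


def genomic_inflation(p_values):
    """Lambda_gc: median observed 1-df chi-square over its null median."""
    std_normal = NormalDist()
    # A two-sided p-value maps back to a 1-df chi-square as z(p/2)^2.
    chi2 = [std_normal.inv_cdf(p / 2.0) ** 2 for p in p_values]
    return median(chi2) / CHI2_1DF_MEDIAN


def summarize_by_maf_bin(records, bin_width=0.01, threshold=1e-6):
    """Group (maf, p) pairs into MAF bins.

    Returns {bin_index: (lambda_gc, fraction of SNPs with p < threshold)}.
    bin_index k covers MAF in [k * bin_width, (k + 1) * bin_width).
    """
    bins = {}
    for maf, p in records:
        bins.setdefault(int(maf / bin_width), []).append(p)
    return {
        k: (genomic_inflation(ps), sum(p < threshold for p in ps) / len(ps))
        for k, ps in bins.items()
    }
```

A λgc close to 1 in a bin means the test statistics there behave as expected under the null. Because the two studies being compared differ only in genotyping array, values well above 1 point to array-driven artifact rather than true association.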

Fig. 2

Genomic inflation factors (λgc; grey lines) and percentages of SNPs with spurious association (P < 1 × 10⁻⁶; black lines), by minor allele frequency (MAF), when combining studies genotyped on either the Illumina Human1M or Affymetrix 6.0 array. a–c European American and d–f African American subjects from SAGE (genotyped on Illumina 1M) were compared to subjects from the GAIN GWAS of Schizophrenia (genotyped on Affymetrix 6.0). Three different SNP sets were assessed: a, d genotyped SNPs available on both arrays; b, e imputed SNPs based on the union of genotyped SNPs available on either array; and c, f imputed SNPs based on the intersection of genotyped SNPs available on both arrays. The number of SNPs with MAF >1 % and the overall λgc are shown in each plot.

Fig. 3

Average R² values in SAGE control subjects (genotyped on Illumina’s Human1M) to indicate overall quality across all imputed SNPs, when imputation was based on all genotyped SNPs or the intersection of genotyped SNPs with Affymetrix 6.0 or varying Illumina arrays (Human1M, HumanOmni1-Quad, Human660W, HumanHap550 version 1, and HumanHap300-Duo version 2 BeadChip). Results are shown across minor allele frequency (MAF) intervals of 1 % for all imputed SNPs with MAF >1 % on chromosome 22: a ~34,000 SNPs in European Americans and b ~43,000 SNPs in African Americans.

Fig. 4

Expected statistical power by level of imputation accuracy (average R²) for differing numbers of public controls added to the baseline design of 2,000 cases and 2,000 controls (blue diamond and blue dashed line). Power was estimated for detection of a SNP effect size of 1 % explained variance in the phenotype. The baseline model provided 81 % power to detect this effect size at a genome-wide significance of P = 5 × 10⁻⁸.
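The relationship between sample size, imputation accuracy, and power described in this caption can be illustrated with a rough calculation. This is a sketch under stated assumptions, not the power method used in the paper: it assumes a 1-df test whose non-centrality parameter is the effective sample size times the explained variance, an effective case-control sample size of 4 / (1/cases + 1/controls), and a uniform scaling of that sample size by the average R². The function names `effective_n` and `power_1df` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def effective_n(n_cases, n_controls, r2=1.0):
    """Effective sample size of a case-control design, scaled by R^2.

    Assumption: imputation at accuracy r2 shrinks the effective sample
    size by a factor of r2, applied here to the whole sample.
    """
    return r2 * 4.0 / (1.0 / n_cases + 1.0 / n_controls)


def power_1df(n_eff, explained_variance=0.01, alpha=5e-8):
    """Power of a two-sided 1-df test with NCP = n_eff * explained_variance."""
    std_normal = NormalDist()
    mu = sqrt(n_eff * explained_variance)        # mean shift of the z statistic
    crit = -std_normal.inv_cdf(alpha / 2.0)      # two-sided critical z value
    # For 1 df the non-central chi-square tail is a sum of two normal tails.
    return std_normal.cdf(mu - crit) + std_normal.cdf(-mu - crit)


baseline = power_1df(effective_n(2000, 2000))
with_public_controls = power_1df(effective_n(2000, 4000, r2=0.9))
```

Under these assumptions the balanced 2,000/2,000 design with perfect accuracy comes out close to the 81 % baseline stated in the caption. Lowering `r2` reduces power, while adding controls raises it, which is the trade-off the figure explores.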

Fig. 5

Expected statistical power by imputation accuracy (average R²) for the baseline study design (2,000 cases and 2,000 controls: blue diamond and blue dashed line) and several alternatives that focus study recruitment and genotyping on increasing numbers of cases and rely on public controls, under the constraint of a maximum of 4,000 individuals recruited and genotyped. The baseline model provided 81 % power to detect the same effect size as in Fig. 4 (1 % explained variance) at a genome-wide significance of P = 5 × 10⁻⁸.

#	Section	Preview
0	Introduction	Centralized repositories for genome-wide association study (GWAS) data, such as the database of…
1	Introduction	cost-effective strategy to obtain the large number of control subjects needed for GWAS analyses,…
2	Introduction	Statistical imputation of untyped SNP genotypes based on reference haplotype panels can be used to…
3	Introduction	Sinnott and Kraft (2012) and Uh et al. (2012) recently have demonstrated that substantial false…
4	Introduction	In this study, we used data from GWAS repositories to estimate the magnitude of imputation-induced…
5	Introduction	all arrays for the samples to be combined and then imputed up to a common set of HapMap SNPs for…
6	Subjects and methods — Study subjects and genotyping arrays	Table 1 lists the sources of European American and African American study subjects, who were…
7	Subjects and methods — Quality control	Quality control (QC) procedures, mimicking standard procedures used for GWAS, were conducted in each…
8	Subjects and methods — Quality control	single subject having the highest call rate from each cluster. Since IBD estimates may be inflated…
9	Subjects and methods — Quality control	Subjects were further evaluated for population structure to identify ancestral outliers using HapMap…
10	Subjects and methods — Quality control	Additional subject exclusions were made in dbGaP studies to remove the original study cases [e.g.,…
11	Subjects and methods — Quality control	Combining subjects genotyped on Illumina versus Affymetrix arrays required an additional QC step to…
12	Subjects and methods — Reference haplotype panels	For genotype imputation in European Americans, we used the CEU reference haplotype panel from merged…
13	Subjects and methods — Imputation procedure	SNP imputation procedures use haplotype information on genotyped SNPs in the study population and…
14	Subjects and methods — Imputation procedure	Genotype imputations reported here were conducted using MaCH, unless otherwise stated (Li et al.…
15	Subjects and methods — Imputation procedure	The first imputation step in MaCH used a subset of 200 randomly selected haplotypes from study…
16	Subjects and methods — Statistical analyses	Imputation results were compared across subjects genotyped on different arrays by arbitrarily…
17	Subjects and methods — Statistical analyses	Three data sets were compared for each pair of studies: (1) genotyped SNPs shared on both arrays;…
18	Subjects and methods — Calculating statistical power for using public controls under cross array imputation scenarios	Adding publicly available controls to augment existing study controls or using such public…
19	Subjects and methods — Calculating statistical power for using public controls under cross array imputation scenarios	controls, and public controls. Under both scenarios, we began with a baseline model in which a study…

Citation	PMID	DOI	Status
Almeida, MA et al., BMC Genet, 2011, An empirical evaluation of imputation accuracy for association statistics reveals increased type-I error rates in genome-wide associations	21251252	10.1186/1471-2156-12-10	Cited
Altshuler, DM et al., Nature, 2010, Integrating common and rare genetic variation in diverse human populations	20811451	10.1038/nature09298	Cited
Amundadottir, L et al., Nat Genet, 2009, Genome-wide association study identifies variants in the ABO locus associated with susceptibility to pancreatic cancer	19648918	10.1038/ng.429	Cited
Beecham, GW et al., Ann Hum Genet, 2010, APOE is not associated with Alzheimer disease: a cautionary tale of genotype imputation	20529013	10.1111/j.1469-1809.2010.00573.x	Cited
Bierut, LJ et al., Proc Natl Acad Sci USA, 2010, A genome-wide association study of alcohol dependence	20202923	10.1073/pnas.0911109107	Primary
Durbin, RM et al., Nature, 2010, A map of human genome variation from population-scale sequencing	20981092	10.1038/nature09534	Cited
Fellay, J et al., Science, 2007, A whole-genome association study of major determinants for host control of HIV-1	17641165	10.1126/science.1143767	Cited
Hancock, DB et al., PLoS ONE, 2012, Assessment of genotype imputation performance using 1000 Genomes in African American studies	23226329	10.1371/journal.pone.0050610	Cited
Hartz, SM et al., Am J Epidemiol, 2011, Inclusion of African Americans in genetic studies: what is the barrier?	21633120	10.1093/aje/kwr084	Primary
Ho, LA et al., Hum Genet, 2010, Using public control genotype data to increase power and decrease cost of case-control genetic association studies	20821337	10.1007/s00439-010-0880-x	Cited
Howie, B et al., G3 (Bethesda), 2011, Genotype imputation with thousands of genomes	22384356	10.1534/g3.111.001198	Cited
Hunter, DJ et al., Nat Genet, 2007, A genome-wide association study identifies alleles in FGFR2 associated with risk of sporadic postmenopausal breast cancer	17529973	10.1038/ng2075	Cited
Li, Y et al., Genet Epidemiol, 2010, MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes	21058334	10.1002/gepi.20533	Cited
Manichaikul, A et al., Bioinformatics, 2010, Robust relationship inference in genome-wide association studies	20926424	10.1093/bioinformatics/btq559	Cited
Manolio, TA et al., Nat Genet, 2007, New models of collaboration in genome-wide association studies: the Genetic Association Information Network	17728769	10.1038/ng2127	Cited
Marchini, J et al., Nat Rev Genet, 2010, Genotype imputation for genome-wide association studies	20517342	10.1038/nrg2796	Cited
Mukherjee, S et al., Hum Hered, 2011, Including additional controls from public databases improves the power of a genome-wide association study	21849791	10.1159/000330149	Cited
Pasaniuc, B et al., Nat Genet, 2012, Extremely low-coverage sequencing and imputation increases power for genome-wide association studies	22610117	10.1038/ng.2283	Cited
Price, AL et al., Nat Genet, 2006, Principal components analysis corrects for stratification in genome-wide association studies	16862161	10.1038/ng1847	Cited
Pritchard, JK et al., Am J Hum Genet, 2001, Linkage disequilibrium in humans: models and data	11410837	10.1086/321275	Cited
Pritchard, JK et al., Genetics, 2000, Inference of population structure using multilocus genotype data	10835412	10.1093/genetics/155.2.945	Cited
Purcell, S et al., Am J Hum Genet, 2007, PLINK: a tool set for whole-genome association and population-based linkage analyses	17701901	10.1086/519795	Cited
Shriner, D et al., Genet Epidemiol, 2010, Practical considerations for imputation of untyped markers in admixed populations	19918757	10.1002/gepi.20457	Cited
Sinnott, JA et al., Hum Genet, 2012, Artifact due to differential error when cases and controls are imputed from different platforms	21735171	10.1007/s00439-011-1054-1	Cited
Southam, L et al., Eur J Hum Genet, 2011, The effect of genome-wide association scan quality control on imputation outcome for common variants	21267008	10.1038/ejhg.2010.242	Cited
Spencer, CC et al., PLoS Genet, 2009, Designing genome-wide association studies: sample size, power, imputation, and the choice of genotyping chip	19492015	10.1371/journal.pgen.1000477	Cited
Tiwari, HK et al., Stat Interface, 2011, Accurate and flexible power calculations on the spot: applications to genomic research	22022634	10.4310/sii.2011.v4.n3.a9	Cited
Uh, HW et al., Eur J Hum Genet, 2012, How to deal with the early GWAS data when imputing and combining different arrays is necessary	22189269	10.1038/ejhg.2011.231	Cited
Zheng, J et al., Genet Epidemiol, 2011, A comparison of approaches to account for uncertainty in analysis of imputed genotypes	21254217	10.1002/gepi.20552	Cited
Zhuang, JJ et al., Genet Epidemiol, 2010, Optimizing the power of genome-wide association studies by using publicly available reference samples to expand the control group	20088020	10.1002/gepi.20482	Cited

In this knowledge base

Title	Year	PMID
GAWMerge expands GWAS sample size and diversity by combining array-based genotyping and whole-genome sequencing.	2022	35953715
Association Between Substance Use Disorder and Polygenic Liability to Schizophrenia.	2017	28739213
KAT2B polymorphism identified for drug abuse in African Americans with regulatory links to drug abuse pathways in human prefrontal cortex.	2016	26202629
Cis-Expression Quantitative Trait Loci Mapping Reveals Replicable Associations with Heroin Addiction in OPRM1.	2015	25744370

External

Title	Authors	Year
Variants in the β-globin locus are associated with pneumonia in African American children.	Halligan NLN et al.	2025
Accuracy of haplotype estimation and whole genome imputation affects complex trait analyses in complex biobanks.	Appadurai V et al.	2023
Contribution of common and rare variants to Asian neovascular age-related macular degeneration subtypes.	Fan Q et al.	2023
Natural variation of respiration-related traits in plants.	Bulut M et al.	2023
Aquaculture Molecular Breeding Platform (AMBP): a comprehensive web server for genotype imputation and genetic analysis in aquaculture.	Zeng Q et al.	2022
GAWMerge expands GWAS sample size and diversity by combining array-based genotyping and whole-genome sequencing.	Mathur R et al.	2022
False positive findings during genome-wide association studies with imputation: influence of allele frequency and imputation accuracy.	Zhang Z et al.	2021
Fast and Scalable Private Genotype Imputation Using Machine Learning and Partially Homomorphic Encryption.	Sarkar E et al.	2021
Genome-wide association studies: assessing trait characteristics in model and crop plants.	Alseekh S et al.	2021
Improved analyses of GWAS summary statistics by reducing data heterogeneity and errors.	Chen W et al.	2021
Inclusion of genetic variants in an ensemble of gradient boosting decision trees does not improve the prediction of citalopram treatment response.	Shumake J et al.	2021
Ultrafast homomorphic encryption models enable secure outsourcing of genotype imputation.	Kim M et al.	2021
Genome-wide association and Mendelian randomisation analysis provide insights into the pathogenesis of heart failure.	Shah S et al.	2020
Ultra-Fast Homomorphic Encryption Models enable Secure Outsourcing of Genotype Imputation	Kim M et al.	2020
Low coverage whole genome sequencing enables accurate assessment of common variants and calculation of genome-wide polygenic scores.	Homburger JR et al.	2019
Genotype imputation performance of three reference panels using African ancestry individuals.	Vergara C et al.	2018
Investigating the genetic architecture of dementia with Lewy bodies: a two-stage genome-wide association study.	Guerreiro R et al.	2018
Use of polygenic risk scores of nicotine metabolism in predicting smoking behaviors.	Chen LS et al.	2018
A comprehensive survey of genetic variation in 20,691 subjects from four large cohorts.	Lindström S et al.	2017
Association Between Substance Use Disorder and Polygenic Liability to Schizophrenia.	Hartz SM et al.	2017
Failure to replicate thrombomodulin genetic variant predictors of venous thromboembolism in African Americans.	Folsom AR et al.	2017
Genome-wide association study identifies the SERPINB gene cluster as a susceptibility locus for food allergy.	Marenholz I et al.	2017
A genome-wide association study identifies variants in KCNIP4 associated with ACE inhibitor-induced cough.	Mosley JD et al.	2016
KAT2B polymorphism identified for drug abuse in African Americans with regulatory links to drug abuse pathways in human prefrontal cortex.	Johnson EO et al.	2016
Preservation Analysis of Macrophage Gene Coexpression Between Human and Mouse Identifies PARK2 as a Genetically Controlled Master Regulator of Oxidative Phosphorylation in Humans.	Codoni V et al.	2016
A multiancestry study identifies novel genetic associations with CHRNA5 methylation in human brain and risk of nicotine dependence.	Hancock DB et al.	2015
Cis-Expression Quantitative Trait Loci Mapping Reveals Replicable Associations with Heroin Addiction in OPRM1.	Hancock DB et al.	2015
Genome-wide meta-analysis reveals common splice site acceptor variant in CHRNA4 associated with nicotine dependence.	Hancock DB et al.	2015
Meta-analysis of 65,734 individuals identifies TSPAN15 and SLC44A2 as two susceptibility loci for venous thromboembolism.	Germain M et al.	2015
When Does Choice of Accuracy Measure Alter Imputation Accuracy Assessments?	Ramnarine S et al.	2015
A meta-analysis of genome-wide association studies identifies ORM1 as a novel gene controlling thrombin generation potential.	Rocanin-Arjo A et al.	2014
Genome-wide investigation of DNA methylation marks associated with FV Leiden mutation.	Aïssi D et al.	2014
Practical aspects of genome-wide association interaction analysis.	Gusareva ES et al.	2014
The relevance of checking population allele frequencies and Hardy-Weinberg Equilibrium in genetic association studies: the case of SLC6A4 5-HTTLPR polymorphism in a Chinese Han Irritable Bowel Syndrome association study.	Napolioni V	2014