Imputation across genotyping arrays for genome-wide association studies: assessment of bias and a correction strategy.
- Authors
- Johnson, Eric O; Hancock, Dana B; Levy, Joshua L; Gaddis, Nathan C; Saccone, Nancy L; Bierut, Laura J; Page, Grier P
- Year
- 2013
- Journal
- Human genetics
- PMID
- 23334152
- DOI
- 10.1007/s00439-013-1266-7
- PMCID
- PMC3628082
A great promise of publicly sharing genome-wide association data is the potential to create composite sets of controls. However, studies often use different genotyping arrays, and imputation to a common set of SNPs has shown substantial bias: a problem which has no broadly applicable solution. Based on the idea that using differing genotyped SNP sets as inputs creates differential imputation errors and thus bias in the composite set of controls, we examined the degree to which each of the following occurs: (1) imputation based on the union of genotyped SNPs (i.e., SNPs available on one or more arrays) results in bias, as evidenced by spurious associations (type 1 error) between imputed genotypes and arbitrarily assigned case/control status; (2) imputation based on the intersection of genotyped SNPs (i.e., SNPs available on all arrays) does not evidence such bias; and (3) imputation quality varies by the size of the intersection of genotyped SNP sets. Imputations were conducted in European Americans and African Americans with reference to HapMap phase II and III data. Imputation based on the union of genotyped SNPs across the Illumina 1M and 550v3 arrays showed spurious associations for 0.2 % of SNPs: ~2,000 false positives per million SNPs imputed. Biases remained problematic for very similar arrays (550v1 vs. 550v3) and were substantial for dissimilar arrays (Illumina 1M vs. Affymetrix 6.0). In all instances, imputing based on the intersection of genotyped SNPs (as few as 30 % of the total SNPs genotyped) eliminated such bias while still achieving good imputation quality.
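The difference between the two input strategies compared in the abstract can be illustrated with a minimal Python sketch. The SNP identifiers and array contents below are hypothetical placeholders, not data from the study; only the 0.2 % rate and the per-million scaling come from the abstract.

```python
# Hypothetical genotyped SNP sets for two arrays (illustrative IDs only).
array_a = {"rs1", "rs2", "rs3", "rs4", "rs5"}
array_b = {"rs2", "rs4", "rs6"}

# Union: SNPs genotyped on one or more arrays. Each study is imputed from a
# different input set, which is the situation the abstract links to bias.
union_inputs = array_a | array_b

# Intersection: SNPs genotyped on all arrays. Every study is imputed from the
# same input set.
intersection_inputs = array_a & array_b

# Share of all genotyped SNPs that survive the intersection.
intersection_share = len(intersection_inputs) / len(union_inputs)


def expected_false_positives(n_imputed, spurious_rate):
    """Expected count of spurious associations among imputed SNPs."""
    return n_imputed * spurious_rate


# 0.2 % of one million imputed SNPs is about 2,000 false positives.
false_positives = expected_false_positives(1_000_000, 0.002)
```

In practice the two sets would be read from each array's marker manifest after harmonizing identifiers and strand, a step this sketch leaves out.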
Genomic inflation factors (λgc; grey lines) and percentages of SNPs having spurious association (P < 1 × 10⁻⁶; black lines), by minor allele frequency (MAF), when combining studies genotyped on different Illumina BeadChip arrays (Human1M or HumanHap550 version 3). a–c European American subjects from SAGE were compared to PanScan subjects, and d–f African American subjects from SAGE were compared to iControl subjects. Three different SNP sets were assessed: a, d genotyped SNPs available on both arrays; b, e imputed SNPs based on the union of genotyped SNPs available on either array; and c, f imputed SNPs based on the intersection of genotyped SNPs available on both arrays. The number of SNPs with MAF >1 % and the overall λgc are shown in each plot
Genomic inflation factors (λgc; grey lines) and percentages of SNPs having spurious association (P < 1 × 10⁻⁶; black lines), by minor allele frequency (MAF), when combining studies genotyped on either the Illumina Human1M or Affymetrix 6.0 array. a–c European American and d–f African American subjects from SAGE (genotyped on Illumina 1M) were compared to subjects from the GAIN GWAS of Schizophrenia (genotyped on Affymetrix 6.0). Three different SNP sets were assessed: a, d genotyped SNPs available on both arrays; b, e imputed SNPs based on the union of genotyped SNPs available on either array; and c, f imputed SNPs based on the intersection of genotyped SNPs available on both arrays. The number of SNPs with MAF >1 % and the overall λgc are shown in each plot
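The two quantities plotted in these figures can be computed from a list of association P values. The following is a minimal sketch using the common median-based definition of λgc for 1-degree-of-freedom tests; it is an assumption that this matches the authors' exact procedure, and the function names and the evenly spaced null P values are illustrative only.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation(p_values):
    """Median-based lambda_gc: observed median chi-square over its null median.

    Each P value must lie in (0, 1]; a two-sided P value p corresponds to a
    1-df chi-square statistic z**2, where z is the normal quantile at p / 2.
    """
    nd = NormalDist()
    chi2 = [nd.inv_cdf(p / 2.0) ** 2 for p in p_values]
    return median(chi2) / CHI2_1DF_MEDIAN


def spurious_fraction(p_values, threshold=1e-6):
    """Fraction of SNPs whose P value falls below the threshold."""
    return sum(p < threshold for p in p_values) / len(p_values)


# Evenly spaced P values mimic a well-calibrated null: lambda_gc is close to 1
# and no SNP passes the 1e-6 threshold.
n = 9999
null_p = [i / (n + 1) for i in range(1, n + 1)]
lam = genomic_inflation(null_p)
frac = spurious_fraction(null_p)
```

Applying both functions within MAF bins rather than genome-wide would give per-bin curves of the kind the captions describe.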
Average R² values in SAGE control subjects (genotyped on Illumina’s Human1M) to indicate overall quality across all imputed SNPs, when imputation was based on all genotyped SNPs or the intersection of genotyped SNPs with Affymetrix 6.0 or varying Illumina arrays (Human1M, HumanOmni1-Quad, Human660W, HumanHap550 version 1, and HumanHap300-Duo version 2 BeadChip). Results are shown across minor allele frequency (MAF) intervals of 1 % for all imputed SNPs with MAF >1 % on chromosome 22: a ~34,000 SNPs in European Americans and b ~43,000 SNPs in African Americans
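Averaging imputation quality within 1 % MAF intervals, as this caption describes, amounts to a simple group-and-average step. The sketch below assumes each imputed SNP is available as a hypothetical `(maf, r2)` pair; the input values are made up for illustration and the function name is not from the source.

```python
from collections import defaultdict


def mean_r2_by_maf_bin(snps, min_maf=0.01):
    """Average R^2 per 1 % MAF interval, keeping only SNPs with MAF > min_maf.

    `snps` is an iterable of (maf, r2) pairs. The returned dict maps the lower
    edge of each interval, in percent, to the mean R^2 of SNPs in it. MAF
    values that sit exactly on an interval edge may shift by one bin because
    of floating-point rounding.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for maf, r2 in snps:
        if maf <= min_maf:
            continue  # the caption restricts results to MAF > 1 %
        bin_start = int(maf * 100)  # e.g. MAF 0.018 falls in the 1-2 % bin
        sums[bin_start] += r2
        counts[bin_start] += 1
    return {b: sums[b] / counts[b] for b in sorted(sums)}


# Illustrative values only: two SNPs in the 1-2 % bin, one in the 25-26 % bin,
# and one below the MAF filter.
example = [(0.015, 0.8), (0.018, 0.9), (0.255, 0.95), (0.005, 0.3)]
binned = mean_r2_by_maf_bin(example)
```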
Expected statistical power by level of imputation accuracy (average R²) for differing numbers of public controls added to the baseline design of 2,000 cases and 2,000 controls (blue diamond and blue dashed line). Power was estimated for detection of a SNP effect size of 1 % explained variance in the phenotype. The baseline model provided 81 % power to detect this effect size at a genome-wide significance of P = 5 × 10⁻⁸
Expected statistical power by imputation accuracy (average R²) for the baseline study design (2,000 cases and 2,000 controls: blue diamond and blue dashed line) and several alternatives focusing study recruitment and genotyping on increasing numbers of cases and relying on public controls under the constraint of maximal recruitment and genotyping of 4,000 individuals. The baseline model provided 81 % power to detect this effect size at a genome-wide significance of P = 5 × 10⁻⁸
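The trade-off shown in these power figures can be approximated with a rough normal-theory sketch. This is not the authors' power model: it assumes a 1-degree-of-freedom test whose noncentrality parameter is the effective sample size times the variance explained, with the effective sample size taken as 4 / (1/cases + 1/controls) and R² applied as a uniform multiplier. Under those assumptions the baseline design gives a power of about 0.81, close to the 81 % quoted in the captions.

```python
from statistics import NormalDist


def approx_power(n_cases, n_controls, var_explained=0.01, r2=1.0, alpha=5e-8):
    """Approximate power of a 1-df association test (normal approximation).

    Assumptions (not taken from the source): the noncentrality parameter is
    n_eff * r2 * var_explained, where n_eff = 4 / (1/n_cases + 1/n_controls)
    and r2 is the average imputation R^2 applied to the whole sample.
    """
    nd = NormalDist()
    n_eff = 4.0 / (1.0 / n_cases + 1.0 / n_controls)
    ncp = n_eff * r2 * var_explained
    z_crit = -nd.inv_cdf(alpha / 2.0)  # two-sided critical value
    root = ncp ** 0.5
    return nd.cdf(root - z_crit) + nd.cdf(-root - z_crit)


# Baseline design from the captions: 2,000 cases and 2,000 controls.
baseline = approx_power(2000, 2000)

# Adding public controls raises power, while lower imputation accuracy reduces it.
with_public = approx_power(2000, 6000, r2=0.9)
low_quality = approx_power(2000, 2000, r2=0.7)
```

Treating R² as a sample-size multiplier is a standard rule of thumb; a fuller model would apply it only to the imputed subjects and SNPs.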