A new statistic to evaluate imputation reliability.


Authors: Lin, Peng; Hartz, Sarah M; Zhang, Zhehao; Saccone, Scott F; Wang, Jia; Tischfield, Jay A; Edenberg, Howard J; Kramer, John R; Goate, Alison M; Bierut, Laura J; Rice, John P; COGA Collaborators; COGEND Collaborators; GENEVA
Year: 2010
Journal: PLoS ONE
PMID: 20300623
DOI: 10.1371/journal.pone.0009697
PMCID: PMC2837741

BACKGROUND: As the amount of data from genome wide association studies grows dramatically, many interesting scientific questions require imputation to combine or expand datasets. However, there are two situations for which imputation has been problematic: (1) polymorphisms with low minor allele frequency (MAF), and (2) datasets where subjects are genotyped on different platforms. Traditional measures of imputation cannot effectively address these problems.

METHODOLOGY/PRINCIPAL FINDINGS: We introduce a new statistic, the imputation quality score (IQS). In order to differentiate between well-imputed and poorly-imputed single nucleotide polymorphisms (SNPs), IQS adjusts the concordance between imputed and genotyped SNPs for chance. We first evaluated IQS in relation to minor allele frequency. Using a sample of subjects genotyped on the Illumina 1 M array, we extracted those SNPs that were also on the Illumina 550 K array and imputed them to the full set of the 1 M SNPs. As expected, the average IQS value drops dramatically with a decrease in minor allele frequency, indicating that IQS appropriately adjusts for minor allele frequency. We then evaluated whether IQS can filter poorly-imputed SNPs in situations where cases and controls are genotyped on different platforms. Randomly dividing the data into "cases" and "controls", we extracted the Illumina 550 K SNPs from the cases and imputed the remaining Illumina 1 M SNPs. The initial Q-Q plot for the test of association between cases and controls was grossly distorted (lambda = 1.15) and had 4016 false positives, reflecting imputation error. After filtering out SNPs with IQS<0.9, the Q-Q plot was acceptable and there were no longer false positives. We then evaluated the robustness of IQS computed independently on the two halves of the data. In both European Americans and African Americans the correlation was >0.99, demonstrating that a database of IQS values from common imputations could be used as an effective filter to combine data genotyped on different platforms.

CONCLUSIONS/SIGNIFICANCE: IQS effectively differentiates well-imputed and poorly-imputed SNPs. It is particularly useful for SNPs with low minor allele frequency and when datasets are genotyped on different platforms.
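The abstract describes IQS as concordance adjusted for chance, and the citation list includes Cohen's 1960 coefficient of agreement. The sketch below assumes that IQS has the form of Cohen's kappa computed on a 3x3 table of imputed versus directly genotyped calls. The function name and table layout are hypothetical, and the paper's Methods may define the table differently (for example, from posterior genotype probabilities rather than hard calls).

```python
from typing import Sequence, Tuple


def concordance_and_iqs(table: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Return (concordance, kappa-style IQS) for one SNP.

    table[i][j] is the number of subjects whose imputed genotype class is i
    and whose directly genotyped class is j (classes 0, 1, 2 = copies of
    one allele). This layout is an assumption made for illustration.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    if n == 0:
        raise ValueError("empty contingency table")

    # Observed agreement: proportion of subjects on the diagonal.
    p_obs = sum(table[i][i] for i in range(k)) / n

    # Chance agreement: product of the marginal proportions, summed over classes.
    row_marg = [sum(table[i]) / n for i in range(k)]
    col_marg = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    p_chance = sum(r * c for r, c in zip(row_marg, col_marg))

    # If every call falls in one class, chance agreement is 1 and the
    # ratio is undefined (zero denominator).
    if p_chance == 1.0:
        return p_obs, float("nan")

    return p_obs, (p_obs - p_chance) / (1.0 - p_chance)


# A rare SNP where imputation calls everyone homozygous for the major allele:
# 98 of 100 calls agree, yet the agreement is no better than chance.
accuracy, iqs = concordance_and_iqs([[98, 2, 0], [0, 0, 0], [0, 0, 0]])
print(accuracy, iqs)
```

In this toy table the concordance is 0.98 while the chance-corrected score is 0, which illustrates why plain accuracy can look high for low-MAF SNPs that are in fact poorly imputed.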

Figure 1

The means of IQS and imputation accuracy within each minor allele frequency interval. IQS adjusts for chance agreement. As the minor allele frequency approaches 0, the difference between IQS and imputation accuracy increases. The standard deviation is shown for every other point.

Figure 2

The Q-Q plots based on randomly dividing data into cases and controls. Samples were divided randomly into cases and controls. (A) All Illumina 1 M SNPs are directly genotyped, indicating there is no population stratification or other non-random factors in cases and controls. (B) Cases were genotyped on the Illumina 550 K array and the remaining Illumina 1 M SNPs were imputed. (C) An IQS filter (IQS>0.9) was applied, retaining 92% of the SNPs. (D) An imputation accuracy filter (>0.99) was applied, retaining 91% of the SNPs.
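The abstract summarizes the distortion of the Q-Q plot with a lambda value. As a minimal sketch, assuming lambda is the usual genomic inflation factor for 1-df association tests (median observed chi-square divided by the median of the chi-square distribution with 1 df), it can be computed from p-values with the standard library alone. The function name is hypothetical, and p-values are assumed to lie in (0, 1].

```python
from statistics import NormalDist, median

# Median of the chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation(p_values):
    """Genomic inflation factor from two-sided p-values of 1-df tests.

    Each p-value is converted back to a chi-square statistic via the
    standard normal quantile: chi2 = z**2 with z = Phi^{-1}(p / 2).
    """
    nd = NormalDist()
    chi2 = [nd.inv_cdf(p / 2.0) ** 2 for p in p_values]
    return median(chi2) / CHI2_1DF_MEDIAN


# Evenly spaced p-values mimic a null distribution, so lambda is close to 1.
null_p = [(i + 0.5) / 1001 for i in range(1001)]
print(round(genomic_inflation(null_p), 3))
```

A value near 1 corresponds to a Q-Q plot that follows the diagonal; values above 1 indicate an excess of small p-values across the whole distribution.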

Figure 3

Evaluation of the robustness of IQS. European American (A) and African American (B) datasets were split in half and Illumina 550 K SNPs were imputed to Illumina 1 M SNPs. IQS values for the two halves of the data were plotted against each other. SNPs with minor allele frequency less than 0.01 were excluded to avoid zero in the denominator.
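The split-half comparison in this caption can be sketched as a Pearson correlation between per-SNP IQS values from the two halves, after dropping SNPs below the MAF cutoff. The record layout `(maf, iqs_half1, iqs_half2)` and the function names are assumptions made for illustration. The abstract reports a correlation without naming the coefficient, so Pearson is an assumption too.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def split_half_correlation(records, maf_min=0.01):
    """Correlate IQS between two halves, excluding SNPs with MAF < maf_min.

    records: iterable of (maf, iqs_half1, iqs_half2) tuples, one per SNP.
    """
    kept = [(a, b) for maf, a, b in records if maf >= maf_min]
    return pearson([a for a, _ in kept], [b for _, b in kept])


# Toy data: the first SNP is too rare and is excluded before correlating.
toy = [(0.005, 0.10, 0.95), (0.05, 0.70, 0.72), (0.20, 0.90, 0.91), (0.40, 0.99, 0.98)]
print(split_half_correlation(toy))
```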

Figure 4

A database of IQS can be used to filter poorly-imputed SNPs. The set of hard-to-impute SNPs compiled from one dataset can be used to filter the imputed data in another dataset. (A) Cases were European Americans genotyped on the Illumina 550 K array and the remaining Illumina 1 M SNPs were imputed. Controls were European Americans genotyped on the Illumina 1 M array. The Q-Q plot is shown for the 790,965 available SNPs. (B) An IQS filter (IQS>0.9) was applied, retaining 92% of the SNPs. IQS was calculated from an independent dataset. (C) A similar Q-Q plot for African Americans. Cases were genotyped on the Illumina 550 K array and the remaining Illumina 1 M SNPs were imputed. Controls were genotyped on the Illumina 1 M array. The Q-Q plot is shown for the 836,993 available SNPs. (D) An IQS filter (IQS>0.9) was applied, retaining 78% of the SNPs. IQS was calculated from an independent dataset.
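The idea of reusing IQS values from an independent dataset can be sketched as a simple lookup. The sketch assumes the database is a mapping from SNP identifier to IQS; the function name and SNP identifiers are hypothetical. Dropping SNPs that are missing from the database is a conservative choice made here, not something the source specifies.

```python
def filter_with_iqs_database(imputed_snps, iqs_database, threshold=0.9):
    """Split imputed SNPs into kept and dropped lists using external IQS values.

    imputed_snps: iterable of SNP identifiers imputed in the target dataset.
    iqs_database: dict mapping SNP identifier -> IQS from an independent dataset.
    A SNP is kept only if its database IQS is strictly above the threshold;
    SNPs absent from the database are dropped (an assumption of this sketch).
    """
    kept, dropped = [], []
    for snp in imputed_snps:
        iqs = iqs_database.get(snp)
        if iqs is not None and iqs > threshold:
            kept.append(snp)
        else:
            dropped.append(snp)
    return kept, dropped


# Hypothetical identifiers; "snp_d" has no entry in the database.
database = {"snp_a": 0.97, "snp_b": 0.42, "snp_c": 0.91}
kept, dropped = filter_with_iqs_database(["snp_a", "snp_b", "snp_c", "snp_d"], database)
print(kept, dropped)
```

The retained fraction, `len(kept) / (len(kept) + len(dropped))`, corresponds to the retention percentages quoted in the caption.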

#	Section	Preview
20	Materials and Methods — Statistical estimates of imputation quality	score is a measure of genotype information content, which is related to the effective sample size…
21	Results	The Illumina 1 M array covers all of the SNPs on the Illumina 550 K array. We started with all SAGE…
22	Results	The imputation results are given in Table 2. The mean IQS is lower than the mean accuracy in both EA…
23	Results	A second notable result is that the quality of imputation in AA is markedly lower than in EA. This…
24	Results	The relationship between IQS and imputation accuracy with respect to minor allele frequency is seen…
25	Results	We then evaluated the effectiveness of IQS in the situation where cases and controls are genotyped…
26	Results	We tested genetic association of all the 1 M SNPs with the cases and controls. A Quantile-Quantile…
27	Results	SNPs with genotyped SNPs without other quality control is problematic (Fig. 2B). Therefore, the…
28	Results	A more practical way of evaluating this approach is to look at the false positive rate.…
29	Results	Although IQS can serve as an effective filter to minimize the use of poorly-imputed SNPs, the…
30	Results	The two common methods for filtering imputed data are to combine a minor allele frequency threshold…
31	Results	Filtering on MAF differences between the Hapmap and the study genotypes is another possible approach…
32	Results	A second method for using IQS without directly genotyping would be to develop a database of common…
33	Results	We further tested whether the set of hard-to-impute SNPs compiled from the first group can be used…
34	Results	In order to confirm these results in a different dataset, we replicated the study in European…
35	Discussion	There are two situations in which imputation is avoided[18]: (1) SNPs with low minor allele…
36	Discussion	It is important to note that the traditional genome inflation factor λ is not an ideal indicator of…
37	Discussion	We also would like to emphasize that we are dealing with the extreme situation when cases and…
38	Discussion	The reasons for the false positives are very complicated. Among the 4016 genome wide significant…
39	Discussion	Filtering by the difference between the reference and the estimated minor allele frequency can…

Name	Type
AA	cohort
AA sample	cohort
Affymetrix 5.0	drug
Affymetrix 5.0 array local	drug
Affymetrix 6.0	drug
Affymetrix array local	cohort
Affymetrix GeneChip Mapping 500 K Array Set local	drug
African	cohort
African American	cohort
cases	cohort
Center for Inherited Disease Research	cohort
CEPH	cohort
CEU	cohort
COGEND	cohort
Collaborative Study on the Genetics of Alcoholism (COGA)	cohort
common human diseases local	phenotype
controls	cohort
Database of IQS scores local	drug
EA	cohort
European ancestry	cohort
false positive rate	phenotype
false positive SNPs local	variant
Family study of cocaine dependence	cohort
First group local	cohort
genetic variants	cohort
GENEVA consortium	cohort
GENEVA project local	cohort
genome wide significant SNPs local	variant
HapMap	cohort
HapMap controls local	drug
HapMap Phase II CEU population local	cohort
HapMap Phase II release 22	cohort
Hard-to-impute SNPs local	variant
Illumina 1 M local	drug
Illumina 1 M array local	drug
Illumina 550 K local	drug
Illumina 550 K array local	drug
Illumina array local	cohort
Illumina Human 1 M array local	drug
Illumina HumanHap 550 K Array set local	drug
imputation accuracy	drug
imputation accuracy filter local	drug
Imputation efficiency local	phenotype
imputation reliability local	phenotype
Impute2	drug
Imputed SNP local	variant
imputed SNPs	variant
International Hapmap Project	cohort
IQS local	drug
IQS local	phenotype
IQS filter local	drug
Johns Hopkins University	cohort
minor allele frequency local	phenotype
National Institute of Mental Health Center for Collaborative Genetic Studies on Mental Disorders local	cohort
NCBI Build 36 dbSNP b126 local	cohort
NIMH GAIN samples local	cohort
other available SNPs local	variant
population stratification	phenotype
rare variant	cohort
SAGE	cohort
Second group local	cohort
SNP	cohort
SNP microarrays	drug
Study of Addiction: Genetics and Environment	cohort
Type I error local	phenotype
uncommon SNP local	variant
uncommon SNPs local	variant
Wellcome Trust local	cohort
Yoruba	cohort
YRI reference panel local	cohort

Citation	PMID	DOI	Status
Altshuler, D et al., Nat Genet, 2007, Guilt beyond a reasonable doubt.	17597768	10.1038/ng0707-813	Cited
Barrett, JC et al., Nat Genet, 2008, Genome-wide association defines more than 30 distinct susceptibility loci for Crohn's disease.	18587394	10.1038/NG.175	Cited
Browning, BL et al., Hum Genet, 2008, Haplotypic analysis of Wellcome Trust Case Control Consortium data.	18224336	10.1007/s00439-008-0472-1	Cited
Browning, SR et al., Am J Hum Genet, 2007, Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering.	17924348	10.1086/521987	Cited
Clayton, DG et al., Nat Genet, 2005, Population structure, differential bias and genomic control in a large-scale, case-control association study.	16228001	10.1038/ng1653	Cited
Cohen, J, Educ psychol Meas, 1960, A coefficient of agreement for nominal scales.	—	—	—
de Bakker, PI et al., Hum Mol Genet, 2008, Practical aspects of imputation-driven meta-analysis of genome-wide association studies.	18852200	10.1093/hmg/ddn288	Cited
Dupuis, J et al., Nat Genet, 2010, New genetic loci implicated in fasting glucose homeostasis and their impact on type 2 diabetes risk.	20081858	10.1038/ng.520	Cited
Frazer, KA et al., Nature, 2007, A second generation human haplotype map of over 3.1 million SNPs.	17943122	10.1038/nature06258	Cited
Hancock, DB et al., Nat Genet, 2009, Meta-analyses of genome-wide association studies identify multiple loci associated with pulmonary function.	20010835	10.1038/ng.500	Cited
Howie, BN et al., PLoS Genet, 2009, A flexible and accurate genotype imputation method for the next generation of genome-wide association studies.	19543373	10.1371/journal.pgen.1000529	Cited
Laurie, C et al., In preparation, Genotype data cleaning for whole-genome association studies.	—	—	—
Lettre, G et al., Nat Genet, 2008, Identification of ten loci associated with height highlights new biological pathways in human growth.	18391950	10.1038/ng.125	Cited
Manolio, TA et al., Nat Genet, 2007, New models of collaboration in genome-wide association studies: the Genetic Association Information Network.	17728769	10.1038/ng2127	Cited
Marchini, J et al., Nat Genet, 2007, A new multipoint method for genome-wide association studies by imputation of genotypes.	17572673	10.1038/ng2088	Cited
McMahon, FJ et al., Nat Genet, 2010, Meta-analysis of genome-wide association data identifies a risk locus for major mood disorders on 3p21.1.	20081856	10.1038/ng.523	Cited
Nature, 2005, A haplotype map of the human genome.	16255080	10.1038/nature04226	Cited
Nature, 2007, Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls.	17554300	10.1038/nature05911	Cited
Nothnagel, M et al., Hum Genet, 2009, A comprehensive evaluation of SNP genotype imputation.	19089453	10.1007/s00439-008-0606-5	Cited
Pfeufer, A et al., Nat Genet, 2010, Genome-wide association study of PR interval.	20062060	10.1038/ng.517	Cited
Price, AL et al., Nat Genet, 2006, Principal components analysis corrects for stratification in genome-wide association studies.	16862161	10.1038/ng1847	Cited
Repapi, E et al., Nat Genet, 2009, Genome-wide association study identifies five loci associated with lung function.	20010834	10.1038/ng.501	Cited
Saxena, R et al., Nat Genet, 2010, Genetic variation in GIPR influences the glucose and insulin responses to an oral glucose challenge.	20081857	10.1038/ng.521	Cited
Willer, CJ et al., Nat Genet, 2008, Newly identified loci that influence lipid concentrations and risk of coronary artery disease.	18193043	10.1038/ng.76	Cited
Zeggini, E et al., Nat Genet, 2008, Meta-analysis of genome-wide association data and large-scale replication identifies additional susceptibility loci for type 2 diabetes.	18372903	10.1038/ng.120	Cited
Zeggini, E et al., Science, 2007, Replication of genome-wide association signals in UK samples reveals risk loci for type 2 diabetes.	17463249	10.1126/science.1142364	Cited

No papers in this knowledge base cite this source.

External

Title	Authors	Journal	Year	Link
Association of genetic variation with age at diagnosis in type 1 diabetes.	Vollenbrock CE et al.	—	2026	→
Adjustment for Genotype Imputation Uncertainty Corrects for Inflated Type I Error in Family-Based Association Testing.	Day TRC et al.	—	2025	→
A low-coverage skim-sequencing and imputation pipeline for genomic selection.	Sthapit SR et al.	—	2025	→
A primer on sequencing and genotype imputation in cattle.	Rowan TN	—	2025	→
Assessing Genotype Imputation Methods for Low-Coverage Sequencing Data in Populations With Differing Relatedness and Inbreeding Levels.	Vi T et al.	—	2025	→
Benchmarking Imputed Low Coverage Genomes in a Human Population Genetics Context.	Purnomo GA et al.	—	2025	→
Evaluation of Low-Coverage Sequencing Strategies for Whole-Genome Imputation in Pacific Abalone Haliotis discus hannai.	Fei C et al.	—	2025	→
Genetic regulation of TERT splicing affects cancer risk by altering cellular longevity and replicative potential.	Florez-Vargas O et al.	—	2025	→
Genotype imputation from low-coverage WGS using haplotype reference panels in cultivated strawberry.	Koorevaar T et al.	—	2025	→
Imputation disparities driven by recent selection and their impact on disease risk estimation in East and Southeast Asian populations.	Li D et al.	—	2025	→
STICI: Split-Transformer with integrated convolutions for genotype imputation.	Mowlaei ME et al.	—	2025	→
A deep learning approach to prediction of blood group antigens from genomic data.	Moslemi C et al.	—	2024	→
Genotype imputation in human genomic studies.	Berdnikova AA et al.	—	2024	→
How local reference panels improve imputation in French populations.	Herzig AF et al.	—	2024	→
Imputation accuracy across global human populations.	Cahoon JL et al.	—	2024	→
Deep Learning Methods for Omics Data Imputation.	Huang L et al.	—	2023	→
Genetic prediction of 33 blood group phenotypes using an existing genotype dataset.	Moslemi C et al.	—	2023	→
A comparative analysis of current phasing and imputation software.	De Marino A et al.	—	2022	→
A data harmonization pipeline to leverage external controls and boost power in GWAS.	Chen D et al.	—	2022	→
An autoencoder-based deep learning method for genotype imputation.	Song M et al.	—	2022	→
A Pipeline for Phasing and Genotype Imputation on Mixed Human Data (Parents-Offspring Trios and Unrelated Subjects) by Reviewing Current Methods and Software.	Baldrighi GN et al.	—	2022	→
Best practices for analyzing imputed genotypes from low-pass sequencing in dogs.	Buckley RM et al.	—	2022	→
Genotype imputation and polygenic score estimation in northwestern Russian population.	Kolosov N et al.	—	2022	→
MagicalRsq: Machine-learning-based genotype imputation quality calibration.	Sun Q et al.	—	2022	→
Assessment of Imputation Quality: Comparison of Phasing and Imputation Algorithms in Real Data.	Stahl K et al.	—	2021	→
Investigating the accuracy of imputing autosomal variants in Nellore cattle using the ARS-UCD1.2 assembly of the bovine genome.	Hermisdorff IDC et al.	—	2020	→
Quality Control Measures and Validation in Gene Association Studies: Lessons for Acute Illness.	Cohen M et al.	—	2020	→
A multi-breed reference panel and additional rare variants maximize imputation accuracy in cattle.	Rowan TN et al.	—	2019	→
Comparison and assessment of family- and population-based genotype imputation methods in large pedigrees.	Ullah E et al.	—	2019	→
Evaluation of vitamin D biosynthesis and pathway target genes reveals UGT2A1/2 and EGFR polymorphisms associated with epithelial ovarian cancer in African American Women.	Grant DJ et al.	—	2019	→
Linkage disequilibrium and effective population size in Gir cattle selected for yearling weight.	Toro Ospina AM et al.	—	2019	→
Meta-Analysis of Genome-Wide Association Studies Identifies Three Loci Associated With Stiffness Index of the Calcaneus.	Lu HF et al.	—	2019	→
Revisit Population-based and Family-based Genotype Imputation.	Liu CT et al.	—	2019	→
The African Descent and Glaucoma Evaluation Study (ADAGES) III: Contribution of Genotype to Glaucoma Phenotype in African Americans: Study Design and Baseline Data.	Zangwill LM et al.	—	2019	→
Genome-Wide Association Study of Heavy Smoking and Daily/Nondaily Smoking in the Hispanic Community Health Study/Study of Latinos (HCHS/SOL).	Saccone NL et al.	—	2018	→
Genotype imputation performance of three reference panels using African ancestry individuals.	Vergara C et al.	—	2018	→
Imputation from SNP chip to sequence: a case study in a Chinese indigenous chicken population.	Ye S et al.	—	2018	→
Failure to replicate thrombomodulin genetic variant predictors of venous thromboembolism in African Americans.	Folsom AR et al.	—	2017	→
Imputation of missing genotypes within LD-blocks relying on the basic coalescent and beyond: consideration of population growth and structure.	Kabisch M et al.	—	2017	→
Inclusion of Population-specific Reference Panel from India to the 1000 Genomes Phase 3 Panel Improves Imputation Accuracy.	Ahmad M et al.	—	2017	→
Siccuracy: An R-package for executing genotype imputation strategy simulations with AlphaImpute	Edwards SM	—	2017	—
Empirical determination of breed-of-origin of alleles in three-breed cross pigs.	Sevillano CA et al.	—	2016	→
Family-based approaches: design, imputation, analysis, and beyond.	Wijsman EM	—	2016	→
Genome-wide association study of antidepressant response: involvement of the inorganic cation transmembrane transporter activity pathway.	Cocchi E et al.	—	2016	→
Imputing rare variants in families using a two-stage approach.	Lent S et al.	—	2016	→
Accuracy of imputation using the most common sires as reference population in layer chickens.	Heidaritabar M et al.	—	2015	→
Evaluating the ovarian cancer gonadotropin hypothesis: a candidate gene study.	Lee AW et al.	—	2015	→
First genome-wide association study in an Australian aboriginal population provides insights into genetic risk factors for body mass index and type 2 diabetes.	Anderson D et al.	—	2015	→
Tailored selection of study individuals to be sequenced in order to improve the accuracy of genotype imputation.	Peil B et al.	—	2015	→
When Does Choice of Accuracy Measure Alter Imputation Accuracy Assessments?	Ramnarine S et al.	—	2015	→
Evaluation of measures of correctness of genotype imputation in the context of genomic prediction: a review of livestock applications.	Calus MP et al.	—	2014	→
Genotypic discrepancies arising from imputation.	Hinrichs AL et al.	—	2014	→
Impact of pre-imputation SNP-filtering on genotype imputation results.	Roshyara NR et al.	—	2014	→
Imputation and quality control steps for combining multiple genome-wide datasets.	Verma SS et al.	—	2014	→
Imputation in families using a heuristic phasing approach.	Blackburn AN et al.	—	2014	→
Predicting HLA genotypes using unphased and flanking single-nucleotide polymorphisms in Han Chinese population.	Hsieh AR et al.	—	2014	→
Value of Mendelian laws of segregation in families: data quality control, imputation, and beyond.	Blue EM et al.	—	2014	→
A K(ATP) channel gene effect on sleep duration: from genome-wide association studies to function in Drosophila.	Allebrandt KV et al.	—	2013	→
Dosage transmission disequilibrium test (dTDT) for linkage and association detection.	Zhang Z et al.	—	2013	→
Genotype imputation accuracy in a F2 pig population using high density and low density SNP panels.	Gualdrón Duarte JL et al.	—	2013	→
Imputation-based genomic coverage assessments of current human genotyping arrays.	Nelson SC et al.	—	2013	→
MaCH-admix: genotype imputation for admixed populations.	Liu EY et al.	—	2013	→
Meta-analysis methods for genome-wide association studies and beyond.	Evangelou E et al.	—	2013	→
Assessment of genotype imputation performance using 1000 Genomes in African American studies.	Hancock DB et al.	—	2012	→
A ν-support vector regression based approach for predicting imputation quality.	Huang YH et al.	—	2012	→
Genotype imputation of Metabochip SNPs using a study-specific reference panel of ~4,000 haplotypes in African Americans from the Women's Health Initiative.	Liu EY et al.	—	2012	→
Imputation of genotypes with low-density chips and its effect on reliability of direct genomic values in Dutch Holstein cattle.	Mulder HA et al.	—	2012	→
Copy number variation accuracy in genome-wide association studies.	Lin P et al.	—	2011	→
Rare variant association analysis methods for complex traits.	Asimit J et al.	—	2010	→