Using ancestry matching to combine family-based and unrelated samples for genome-wide association studies.


Authors: Crossett, Andrew; Kent, Brian P; Klei, Lambertus; Ringquist, Steven; Trucco, Massimo; Roeder, Kathryn; Devlin, Bernie
Year: 2010
Journal: Statistics in medicine
PMID: 20862653
DOI: 10.1002/sim.4057
PMCID: PMC4629477

Figure 1

HapMap trios matched by ancestry to POPRES controls. The 30 offspring from the HapMap CEU trios serve as cases and the 2184 individuals of European ancestry from the POPRES data serve as controls. (a) The plot displays the top two principal components of ancestry for cases (red) and controls (black) obtained using SGA. Based on the distribution of points in the eigenmap, many available controls would not be good matches to the HapMap trios. Only those delineated in blue are considered further. Each case is matched to one or more controls that are genetically similar based on the eigenvectors. (b) Distance between each control and its closest case when matching within a random subset drawn from the full sample of controls, versus (c) the same distances when the controls are restricted to the sample delineated in blue.
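
The distances summarized in panels (b) and (c) can be illustrated with a minimal Python sketch. It assumes each individual is represented by its coordinates on the top two eigenvectors. The function names, the toy coordinates, and the cutoff `max_dist` are hypothetical; the caption describes the retained controls as a delineated region of the eigenmap, so the cutoff here is only a stand-in for that restriction.

```python
import math

def nearest_case_distance(controls, cases):
    """For each control, return the Euclidean distance to its closest case."""
    return [min(math.dist(c, k) for k in cases) for c in controls]

def restrict_controls(controls, cases, max_dist):
    """Keep only controls whose closest case lies within max_dist (hypothetical cutoff)."""
    dists = nearest_case_distance(controls, cases)
    return [c for c, d in zip(controls, dists) if d <= max_dist]

# Toy coordinates on the top two eigenvectors (made-up values, not from the paper).
cases = [(0.0, 0.0), (1.0, 0.0)]
controls = [(0.0, 1.0), (1.0, 0.5), (5.0, 5.0)]

distances = nearest_case_distance(controls, cases)
kept = restrict_controls(controls, cases, max_dist=1.5)
```

In this toy example the third control lies far from both cases and is dropped, which mirrors the idea that distant controls are poor ancestry matches.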

Figure 2

(a) African, (b) East Asian, and (c) European clusters identified by SGA. The 51 population samples within HGDP were analyzed to identify homogeneous clusters using SGA applied to continental samples. Analysis was performed separately for each continent using SpectralGEM. Population labels were ignored in the analysis. The display is organized to emphasize when a population or group of populations falls into a common cluster. Groups of populations that fall into a common cluster are often from a common region; see Supplementary Figures 1 and 2.

Figure 3

HGDP and POPRES eigenmap representations plotted for various ancestry bases. In each panel, the eigenvectors (labeled PC) are calculated using a portion of the data, called the base. The remaining samples are projected using the Nyström approximation. Each eigenmap shows only the top two principal components, with POPRES in turquoise and HGDP in black. (a) Base = HGDP, projected = POPRES; (b) Base = POPRES, projected = HGDP; (c) Base = HGDP + half of POPRES, projected = half of POPRES; and (d) Base = half of the balanced subset of countries including HGDP, projected = remaining half of the balanced subset.
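
The base-and-projection idea can be sketched in a few lines of Python. This is a simplified illustration that uses the plain PCA kernel on already centered data and only the leading eigenvector; it is not the SGA weight matrix used in the paper. The helper names, the power-iteration solver, and the toy data are all assumptions made for the example.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def leading_eigenpair(K, iters=200):
    """Power iteration for the largest eigenvalue and eigenvector of a symmetric matrix."""
    # Non-constant start: a kernel built from centered data maps the constant vector to zero.
    v = [float(i + 1) for i in range(len(K))]
    for _ in range(iters):
        w = [dot(row, v) for row in K]
        norm = math.sqrt(dot(w, w))
        v = [x / norm for x in w]
    lam = dot(v, [dot(row, v) for row in K])
    return lam, v

def nystrom_coordinate(x_new, base, lam, u):
    """Nystrom extension: place a new sample on the eigenvector estimated from the base."""
    k = [dot(x_new, x_i) for x_i in base]  # kernel values between the new sample and the base
    return dot(u, k) / math.sqrt(lam)

# Toy base sample: rows are individuals, columns are centered marker scores (made up).
base = [[-1.0, -1.0], [0.0, 0.0], [1.0, 1.0]]
K = [[dot(a, b) for b in base] for a in base]  # PCA kernel on the base sample
lam, u = leading_eigenpair(K)
base_coords = [math.sqrt(lam) * u_i for u_i in u]

# A sample outside the base is projected without recomputing the eigenvectors.
projected = nystrom_coordinate([0.5, 0.5], base, lam, u)
```

A useful sanity check on any such projection is that applying it to a member of the base reproduces that member's own base coordinate.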

Figure 4

Comparing ancestry of selected groups in HGDP versus POPRES for the top two principal components. SGA was performed using the balanced sample (Figure 3(d)). Individuals selected for comparison from POPRES and HGDP are highlighted using colors other than turquoise. (a) HGDP-French (black) versus POPRES-French (fuchsia); (b) HGDP-Orcadian (black) versus POPRES-British & Irish (fuchsia); (c) HGDP-Tuscan (black) and HGDP-N. Italian (blue) versus POPRES-Italian (fuchsia); and (d) HGDP-French Basque (black) versus POPRES-French (fuchsia), POPRES-Spanish & Portuguese (blue).

Figure 5

Type I error analysis at α = 0.05. The solid line represents the Type I error for the mCLR method and the dashed line represents the Type I error for the combined association analysis, with (a) Fst = 0.05, (b) Fst = 0.01, and (c) Fst = 0.001. Results are based on 5000 replications of 500 unrelated controls and 500 trios.
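
This record does not state how allele frequencies were generated for each Fst value. One standard choice, assumed here purely for illustration, is the Balding-Nichols beta model, in which subpopulation allele frequencies are drawn around an ancestral frequency p with variance Fst · p · (1 − p). The function name, the seed, and the parameter values below are hypothetical.

```python
import random
from statistics import fmean, pvariance

def subpopulation_frequencies(p, fst, n, rng):
    """Draw n subpopulation allele frequencies under the Balding-Nichols beta model.

    Assumption: this model is used for illustration only and is not confirmed
    as the paper's simulation scheme.
    """
    a = p * (1.0 - fst) / fst
    b = (1.0 - p) * (1.0 - fst) / fst
    return [rng.betavariate(a, b) for _ in range(n)]

rng = random.Random(2010)  # fixed seed so the sketch is reproducible
freqs = subpopulation_frequencies(p=0.3, fst=0.05, n=20000, rng=rng)

mean_freq = fmean(freqs)     # should be close to p
var_freq = pvariance(freqs)  # should be close to fst * p * (1 - p)
```

Smaller Fst values concentrate the draws around p, which corresponds to weaker population structure between the sampled groups.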

Figure 6

Power analysis at α = 0.05. (a) The mCLR method (solid line) versus the combined association analysis (dashed line). Results are based on 5000 replications of 500 unrelated controls and 500 trios. (b) Power of the mCLR method plotted against the theoretical ratio of controls to cases (R). Results are based on 10 000 replications under the assumption that ψ = 1.3, 1.4, or 1.5.

Figure 7

Association between HLA markers and Type 1 diabetes. –log10(p-values) are plotted versus individual SNPs in the HLA region of chromosome 6. (a) All controls matched; (b) 1:10 matching; (c) 1:5 matching; and (d) Trios only. The strongest association occurs for rs241427 (diamond) and next strongest for rs9273363 (triangle).

#	Section	Preview
0	Introduction	Collections of large samples, including case and control individuals as well as families containing…
1	Introduction	In addition to sample collections for specific diseases, genotype data from large samples of…
2	Introduction	The two most common sampling techniques for studies of association are the case–control design and…
3	Introduction	For the case–control design, a large panel of genetic markers can be used to estimate genetic…
4	Introduction	Family-based designs are robust to population stratification. For simplicity, we will only consider…
5	Introduction	The research problem we address here is how to use both case–control and family-based data in a…
6	Introduction	We propose a hybrid analytical approach that is robust to differences in sampling distribution…
7	Introduction	The success of our approach depends upon the quality of the eigenmap. In practice, the map can be…
8	Introduction	This sample emphasizes distinct populations, including isolated and geographically well-separated…
9	Methods — Data	The HGDP panel includes 1063 individuals from seven continental groups classified into 51…
10	Methods — Data	(108), north-west European (173), east-central European (75), south-eastern European (45), and other…
11	Methods — Matched analysis	Let G denote the minor allele count for a subject (0, 1, or 2) and D denote the disease outcome (1…
12	Methods — Matched analysis	The Euclidean distances between individuals in the eigenmap are representative of their ancestral…
13	Methods — Matched analysis	A traditional approach to family-based analysis of parents and a single affected offspring (trios)…
14	Methods — Eigenmaps	As a first step we estimate the genetic background of unrelated individuals (unrelated cases,…
15	Methods — Eigenmaps	Rather than using traditional PCA, we utilize a variant of this approach that arises from spectral…
16	Methods — Eigenmaps	To perform spectral graph analysis (SGA), we start with the PCA kernel, X X^t and create a weight…
17	Methods — Eigenmaps	The base sample, consisting of subjects i = 1, ..., n corresponding to the centered and scaled allele…
18	Methods — Combining trios, cases, and controls	As a first step we estimate the genetic background of unrelated individuals (cases, controls, and…
19	Methods — Combining trios, cases, and controls	For trios, pseudo-controls are automatically matched by ancestry with the corresponding proband, and…

Citation	PMID	DOI	Status
Balding, D et al., Genetica, 1995, A method for quantifying differentiation between populations at multi-allelic loci and its implications for investigating identity and paternity.	7607457	10.1007/BF01441146	Cited
Barrett, JC et al., Nature Genetics, 2008, Genome-wide association defines more than 30 distinct susceptibility loci for Crohn's disease.	18587394	10.1038/NG.175	Cited
Belkin, M et al., Advances in Neural Information Processing Systems, 2002, Laplacian eigenmaps and spectral techniques for embedding and clustering.	—	—	—
Bengio, Y et al., Neural Computation, 2004, Learning eigenfunctions links spectral embedding and kernel PCA.	15333211	10.1162/0899766041732396	Cited
Breslow, N, Annual Review of Public Health, 1982, Design and analysis of case-control studies.	6756431	10.1146/annurev.pu.03.050182.000333	Cited
Clayton, D, American Journal of Human Genetics, 1999, TDT for uncertain haplotypes.	10486336	10.1086/302577	Cited
Cordell, HJ et al., American Journal of Human Genetics, 2002, A unified stepwise regression procedure for evaluating the relative effects of polymorphisms within a gene using case/control or family data: application to HLA in type I diabetes.	11719900	10.1086/338007	Cited
Cordell, HJ, Genetic Epidemiology, 2004, Properties of case/pseudocontrol analysis for genetic association studies: effects of recombination, ascertainment, and multiple affected offspring.	15022206	10.1002/gepi.10306	Cited
Davies, JL et al., Nature, 1994, A genome-wide search for human type 1 diabetes susceptibility genes.	8072542	10.1038/371130a0	Cited
Devlin, B et al., Biometrics, 1999, Genomic control for association studies.	11315092	10.1111/j.0006-341x.1999.00997.x	Cited
Devlin, B et al., Nature Genetics, 2004, Genomic control to the extreme.	15514657	10.1038/ng1104-1129	Cited
Devlin, B et al., Theoretical Population Biology, 2001, Genomic control, a new approach to genetic-based association studies.	11855950	10.1006/tpbi.2001.1542	Cited
Epstein, MP et al., American Journal of Human Genetics, 2005, Genetic association analysis using data from triads and unrelated subjects.	15712104	10.1086/429225	Cited
Epstein, MP et al., American Journal of Human Genetics, 2007, A simple and improved correction for population stratification in case-control studies.	17436246	10.1086/516842	Cited
Falk, CT et al., Annals of Human Genetics, 1987, Haplotype relative risks: an easy reliable way to construct a proper control sample for risk calculations.	3500674	10.1111/j.1469-1809.1987.tb00875.x	Cited
Guan, W et al., Genetic Epidemiology, 2009, Genotype-based matching to correct for population stratification in large-scale case-control genetic association studies.	19170134	10.1002/gepi.20403	Cited
Hansen, BB, Journal of the American Statistical Association, 2004, Full matching in an observational study of coaching for the SAT.	—	—	—
Heath, SC et al., European Journal of Human Genetics, 2008, Investigation of the fine structure of European populations with applications to disease association studies.	19020537	10.1038/ejhg.2008.210	Cited
Hindorff, LA et al., Proceedings of the National Academy of Sciences, 2009, Potential etiologic and functional implications of genome-wide association loci for human diseases and traits.	19474294	10.1073/pnas.0903103106	Cited
Huber, P, Proceedings of the Fifth Berkeley Symposium in Mathematical Statistics and Probability, 1967, The behaviour of maximum likelihood estimates under non-standard conditions.	—	—	—
Knapp, M, American Journal of Human Genetics, 1999, The transmission/disequilibrium test and parental-genotype reconstruction: the reconstruction-combined transmission/disequilibrium test.	10053021	10.1086/302285	Cited
Koike, A et al., Journal of Human Genetics, 2009, Genome-wide association database developed in the Japanese integrated database project.	19629137	10.1038/jhg.2009.68	Cited
Lander, ES et al., Science, 1994, Genetic dissection of complex traits.	8091226	10.1126/science.8091226	Cited
Lange, C et al., Genetic Epidemiology, 2002, On a general class of conditional tests for family-based association studies in genetics: the asymptotic distribution, the conditional power, and optimality considerations.	12214309	10.1002/gepi.209	Cited
Lee, AB et al., Genetic Epidemiology, 2010, Discovering genetic ancestry using spectral graph theory.	19455578	10.1002/gepi.20434	Cited
Lee, WC, Genetic Epidemiology, 2004, Case-control association studies with matching and genomic controlling.	15185398	10.1002/gepi.20011	Cited
Li, JZ et al., Science, 2008, Worldwide human relationships inferred from genome-wide patterns of variation.	18292342	10.1126/science.1153717	Cited
Luca, D et al., American Journal of Human Genetics, 2008, On the use of general control samples for genome-wide association studies: genetic matching highlights causal variants.	18252225	10.1016/j.ajhg.2007.11.003	Cited
Manolio, TA et al., Journal of Clinical Investigation, 2008, A HapMap harvest of insights into the genetics of common diseases.	18451988	10.1172/JCI34772	Cited
Manolio, TA et al., Nature, 2009, Finding missing heritability of complex diseases.	19812666	10.1038/nature08494	Cited
McVean, G, PLoS Genetics, 2009, A genealogical interpretation of principal components analysis.	19834557	10.1371/journal.pgen.1000686	Cited
Nagelkerke, NJ et al., European Journal of Human Genetics, 2004, Combining the transmission disequilibrium test and case-control methodology using generalized logistic regression.	15340361	10.1038/sj.ejhg.5201255	Cited
Nelson, MR et al., American Journal of Human Genetics, 2008, The population reference sample, POPRES: a resource for population, disease, and pharmacological genetics research.	18760391	10.1016/j.ajhg.2008.08.005	Cited
Novembre, J et al., Nature, 2008, Genes mirror geography within Europe.	18758442	10.1038/nature07331	Cited
Patterson, NJ et al., PLoS Genetics, 2006, Population structure and eigenanalysis.	17194218	10.1371/journal.pgen.0020190	Cited
Price, AL et al., Nature Genetics, 2006, Principal components analysis corrects for stratification in genome-wide association studies.	16862161	10.1038/ng1847	Cited
Purcell, S et al., American Journal of Human Genetics, 2007, PLINK: a tool set for whole-genome association and population-based linkage analyses.	17701901	10.1086/519795	Cited
Rinaldo, A et al., Genetic Epidemiology, 2005, Characterization of multilocus linkage disequilibrium.	15637716	10.1002/gepi.20056	Cited
Schaid, DJ et al., American Journal of Human Genetics, 1993, Genotype relative risks: methods for design and analysis of candidate-gene association studies.	8213835	—	Cited
Schaid, DJ et al., American Journal of Human Genetics, 1994, Comparison of statistics for candidate-gene association studies using cases and parents.	8037216	—	Cited
Schaid, DJ, Genetic Epidemiology, 1999, Likelihoods and TDT for the case-parent design.	10096688	10.1002/(SICI)1098-2272(1999)16:3<250::AID-GEPI2>3.0.CO;2-T	Cited
Self, SG et al., Biometrics, 1991, On estimating hla/disease association with application to a study of aplastic anemia.	2049513	—	Cited
Spielman, RS et al., American Journal of Human Genetics, 1993, Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM).	8447318	—	Cited
Thornton, T et al., American Journal of Human Genetics, 2010, ROADTRIPS: case–control association testing with partially or completely unknown population and pedigree structure.	20137780	10.1016/j.ajhg.2010.01.001	Primary
White, H, Econometrica, 1982, Maximum likelihood estimation of misspecified models.	—	—	—
Zhu, X et al., American Journal of Human Genetics, 2008, A unified association analysis approach for family and unrelated samples correcting for stratification.	18252216	10.1016/j.ajhg.2007.10.009	Cited

In this knowledge base

Title	Year	PMID
Rare copy number variants in tourette syndrome disrupt genes in histaminergic pathways and overlap with autism.	2012	22169095
Multiple recurrent de novo CNVs, including duplications of the 7q11.23 Williams syndrome region, are strongly associated with autism.	2011	21658581
A genome-wide scan for common alleles affecting risk for autism.	2010	20663923

External

Title	Authors	Journal	Year	Link
Hereditary variants of unknown significance in African American women with breast cancer.	McDonald JT et al.	—	2022	→
The Genetic Architecture of Obsessive-Compulsive Disorder: Contribution of Liability to OCD From Alleles Across the Frequency Spectrum.	Mahjani B et al.	—	2022	→
Functional rare and low frequency variants in BLK and BANK1 contribute to human lupus.	Jiang SH et al.	—	2019	→
A method to exploit the structure of genetic ancestry space to enhance case-control studies	Bodea CA et al.	—	2016	—
A Method to Exploit the Structure of Genetic Ancestry Space to Enhance Case-Control Studies.	Bodea CA et al.	—	2016	→
A genome-wide association study of autism using the Simons Simplex Collection: Does reducing phenotypic heterogeneity in autism increase genetic homogeneity?	Chaste P et al.	—	2015	→
Extreme-phenotype genome-wide association study (XP-GWAS): a method for identifying trait-associated variants by sequencing pools of individuals selected from a diversity panel.	Yang J et al.	—	2015	→
Individual common variants exert weak effects on the risk for autism spectrum disorders.	Anney R et al.	—	2012	→
Rare copy number variants in tourette syndrome disrupt genes in histaminergic pathways and overlap with autism.	Fernandez TV et al.	—	2012	→
Do common variants play a role in risk for autism? Evidence and theoretical musings.	Devlin B et al.	—	2011	→
Identification of common variants influencing risk of the tauopathy progressive supranuclear palsy.	Höglinger GU et al.	—	2011	→
Multiple recurrent de novo CNVs, including duplications of the 7q11.23 Williams syndrome region, are strongly associated with autism.	Sanders SJ et al.	—	2011	→
A genome-wide scan for common alleles affecting risk for autism.	Anney R et al.	—	2010	→