Guidelines for Evaluating the Comparability of Down-Sampled GWAS Summary Statistics.
- Authors
- Williams, Camille M; Poore, Holly; Tanksley, Peter T; Kweon, Hyeokmoon; Courchesne-Krak, Natasia S; Londono-Correa, Diego; Mallard, Travis T; Barr, Peter; Koellinger, Philipp D; Waldman, Irwin D; Sanchez-Roige, Sandra; Harden, K Paige; Palmer, Abraham A; Dick, Danielle M; Karlsson Linnér, Richard
- Year
- 2023
- Journal
- Behavior Genetics
- PMID
- 37713023
- DOI
- 10.1007/s10519-023-10152-z
- PMCID
- PMC10584908
Proprietary genetic datasets are valuable for boosting the statistical power of genome-wide association studies (GWASs), but their use can restrict investigators from publicly sharing the resulting summary statistics. Although researchers can resort to sharing down-sampled versions that exclude restricted data, down-sampling reduces power and might change the genetic etiology of the phenotype being studied. These problems are further complicated when using multivariate GWAS methods, such as genomic structural equation modeling (Genomic SEM), that model genetic correlations across multiple traits. Here, we propose a systematic approach to assess the comparability of GWAS summary statistics that include versus exclude restricted data. Illustrating this approach with a multivariate GWAS of an externalizing factor, we assessed the impact of down-sampling on (1) the strength of the genetic signal in univariate GWASs, (2) the factor loadings and model fit in multivariate Genomic SEM, (3) the strength of the genetic signal at the factor level, (4) insights from gene-property analyses, (5) the pattern of genetic correlations with other traits, and (6) polygenic score analyses in independent samples. For the externalizing GWAS, although down-sampling resulted in a loss of genetic signal and fewer genome-wide significant loci, the factor loadings and model fit, gene-property analyses, genetic correlations, and polygenic score analyses were found to be robust. Given the importance of data sharing for the advancement of open science, we recommend that investigators who generate and share down-sampled summary statistics report these analyses as accompanying documentation to support other researchers' use of the summary statistics.
LD Score genetic correlations and heritability estimates for the seven indicator phenotypes of the single-factor models of EXT and EXT-minus-23andMe (see Step 1). The left panel displays the analysis of the original study with 23andMe data, the middle panel displays the down-sampled analysis excluding 23andMe data, and the right panel displays the difference in estimates computed by subtracting the values in the middle panel from those in the left panel. The lower and upper triangles display pairwise genetic correlation (rg) estimates and standard errors, respectively. The diagonals display the observed-scale heritability (h2; see Table 1 for standard errors). These results are also reported in Table S1. ADHD attention-deficit/hyperactivity disorder; ALCP problematic alcohol use; CANN lifetime cannabis use; FSEX age at first sexual intercourse (reverse coded); NSEX number of sexual partners; RISK risk tolerance; SMOK lifetime tobacco initiation
Path diagram of a single-factor model with seven indicator phenotypes, of which SMOK and CANN are down-sampled, as estimated with Genomic SEM. These results are also reported in Table S2. Neither the factor loadings nor residual variances were statistically different from the original estimates (a path diagram of the original estimates was therefore omitted). The same figure displaying the results of the original study is available here: https://www.nature.com/articles/s41593-021-00908-3/figures/1. EXT-minus-23andMe genetic externalizing factor; ADHD attention-deficit/hyperactivity disorder; ALCP problematic alcohol use; CANN lifetime cannabis use; FSEX age at first sexual intercourse (reverse coded); NSEX number of sexual partners; RISK risk tolerance; SMOK lifetime tobacco initiation; AIC Akaike Information Criterion; CFI comparative fit index; SRMR standardized root mean square residual
Scatterplot of genetic correlations (rg) and marginal density plots between EXT (y-axis) or EXT-minus-23andMe (x-axis) with 77 other phenotypes. Each point corresponds to the genetic correlation coefficient with its 95% confidence intervals (rg ± 1.96 × SE) estimated with bivariate LD Score regression. Table S5 reports the estimates, their standard errors, and confidence intervals. The Spearman rank correlation reported in the figure is rounded from r = 0.9995. No particular shape, such as a normal distribution, is expected for the marginal density because the figure displays an arbitrary selection of traits
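The two quantities named in this caption, the 95% confidence interval and the Spearman rank correlation between the two sets of genetic correlations, can be illustrated with a minimal sketch. The function names and the input values below are hypothetical and are not taken from Table S5; the sketch only shows the arithmetic, not the authors' pipeline.

```python
import math

def confidence_interval(rg, se, z=1.96):
    """Return (lower, upper) bounds of rg ± z × SE."""
    return rg - z * se, rg + z * se

def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical rg estimates with five traits (illustrative values only).
rg_full = [0.62, -0.31, 0.08, 0.45, -0.12]  # EXT
rg_down = [0.60, -0.33, 0.10, 0.47, -0.10]  # EXT-minus-23andMe
se_down = [0.03, 0.04, 0.05, 0.03, 0.04]

intervals = [confidence_interval(r, s) for r, s in zip(rg_down, se_down)]
rho = spearman(rg_full, rg_down)
```

A rank correlation close to 1 indicates that the ordering of traits by genetic correlation is preserved after down-sampling, even if individual estimates shift slightly.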
Comparison of the down-sampled polygenic score (PGS) analyses in Add Health (29 phenotypes) and the Collaborative Study on the Genetics of Alcoholism (COGA; 26 phenotypes). Panel A displays the standardized difference between the coefficient estimates (i.e., a Z-statistic) of the down-sampled PGS for EXT-minus-23andMe versus the PGS for EXT from the original study. Absolute values were evaluated so that a negative standardized difference refers to an attenuation towards zero in the down-sampled analysis. Panel B displays the same measure but as a histogram. Four coefficient estimates were significantly (at the 5% level) attenuated in the down-sampled analysis: lifetime smoking initiation (Add Health and COGA; P = 3.18 × 10⁻⁵ and 4.17 × 10⁻⁵, respectively), the phenotypic externalizing factor (Add Health; P = 0.046), and lifetime cannabis use (Add Health; P = 0.03). None of the coefficients were significantly larger in the down-sampled analysis. Panel C displays a scatter plot of the absolute value of the coefficient estimates divided by their respective standard errors (i.e., a Z-statistic). These results are also reported in Table S6
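One way to picture the standardized difference in Panel A is the sketch below. It assumes the two coefficient estimates are independent, which is a simplification: the two PGS are built from overlapping GWAS samples, so the estimates are correlated, and the authors' exact computation may differ. The function names and numbers are hypothetical.

```python
import math

def standardized_difference(beta_down, se_down, beta_full, se_full):
    """Z-statistic for |beta_down| - |beta_full|.

    Assumption (not from the source): the two estimates are treated as
    independent, so their sampling variances simply add.
    Negative values mean attenuation towards zero after down-sampling.
    """
    diff = abs(beta_down) - abs(beta_full)
    return diff / math.sqrt(se_down ** 2 + se_full ** 2)

def two_sided_p(z):
    """Two-sided P value for a Z-statistic under a standard normal."""
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical coefficients (illustrative values, not from Table S6).
z = standardized_difference(beta_down=0.30, se_down=0.03,
                            beta_full=0.40, se_full=0.04)
p = two_sided_p(z)
```

Taking absolute values before differencing means the sign of the Z-statistic reflects attenuation or inflation regardless of the direction of the original association.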