Chunk #17 — Pitfalls of the analysis — Pitfall 1: Validation and discovery sample overlap

Source: Pitfalls of predicting complex traits from SNPs.
Embedded: yes

Text

A less obvious mistake is to select the most significantly associated SNPs using the entire sample and then to use those SNPs to estimate SNP effects in a discovery set and test their prediction accuracy in a validation set [55]. In this case the variance explained by the SNPs in the validation sample is inflated. The bias arises because the initial selection step capitalizes on chance correlations between the SNPs and the phenotype in the entire sample, and those chance correlations are therefore also present in any sub-sample. A prediction equation based on these SNPs will appear to work in the validation sample but not in a genuinely independent sample. Cross-validation performed after the initial set of SNPs has been selected from the entire sample does not mitigate this bias. This pitfall of selecting SNPs from the combined discovery and validation samples occurred in a recent study reporting a genetic predictor of autism [56]. SNPs putatively associated with autism in multiple biological pathways were selected on the basis of p-values from a GWAS of the entire data set. Model selection was subsequently applied using cross-validation
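
The selection bias described above can be illustrated with a small simulation. The sketch below is not taken from the source or from the studies it cites. It assumes a purely null phenotype (no SNP has any true effect), independent SNPs with a fixed allele frequency of 0.3, and ordinary least squares as the prediction model. The function names (top_snps, cv_r2) and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np

def top_snps(X, y, k):
    """Indices of the k SNPs with the largest absolute correlation with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(-np.abs(r))[:k]

def cv_r2(X, y, k, folds, select_inside_fold):
    """Squared correlation between out-of-fold predictions and the phenotype."""
    n = len(y)
    pred = np.empty(n)
    # Flawed design: SNPs chosen once, using every individual (including test folds).
    global_idx = top_snps(X, y, k)
    for test in np.array_split(np.arange(n), folds):
        train = np.setdiff1d(np.arange(n), test)
        # Sound design: repeat the selection using the training individuals only.
        idx = top_snps(X[train], y[train], k) if select_inside_fold else global_idx
        A = np.column_stack([np.ones(len(train)), X[train][:, idx]])
        beta, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        B = np.column_stack([np.ones(len(test)), X[test][:, idx]])
        pred[test] = B @ beta
    return np.corrcoef(pred, y)[0, 1] ** 2

rng = np.random.default_rng(0)
n, m, k = 200, 5000, 20                  # individuals, SNPs, SNPs kept (assumed values)
X = rng.binomial(2, 0.3, size=(n, m)).astype(float)  # genotypes coded 0/1/2
y = rng.standard_normal(n)               # null phenotype: unrelated to every SNP

r2_overlap = cv_r2(X, y, k, folds=5, select_inside_fold=False)
r2_independent = cv_r2(X, y, k, folds=5, select_inside_fold=True)
print(f"selection on entire sample:   R^2 = {r2_overlap:.3f}")
print(f"selection inside each fold:   R^2 = {r2_independent:.3f}")
```

Because the phenotype is simulated independently of the genotypes, any apparent prediction accuracy is spurious. Under these assumptions, selecting SNPs from the entire sample before cross-validation should yield a clearly non-zero R^2, whereas repeating the selection inside each training fold should yield a value close to zero.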