Chunk #15 — Pitfalls of the analysis — Pitfall 1: Validation and discovery sample overlap

Source
Pitfalls of predicting complex traits from SNPs.

Text

When the number of SNPs in the predictor is large and the sample size is small, the discovery R² can be very high by chance and can grossly over-estimate the true variance explained by the predictor when it is applied in an independent sample. Likewise, if a set of SNPs is selected in a discovery sample but their effect sizes are re-estimated in the validation sample, the expected R² in the validation sample is ~1/Nv, where Nv is the validation sample size. Therefore, to estimate the R² of a prediction in a new sample, the prediction equation is estimated in the discovery sample and then tested in the validation sample without re-estimating the regression coefficients (Box 2). Applying an incorrect validation procedure results in over-estimation of the accuracy of the prediction (over-fitting). One example of over-fitting is testing the prediction in the discovery sample itself, i.e., using the same data both to estimate the effects of SNPs on the phenotype and to make predictions (refs 53, 54). We illustrate the overlap pitfall with examples in dairy cattle,
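
The contrast between discovery and validation R² described above can be illustrated with a small simulation. What follows is a minimal sketch under stated assumptions, not the authors' analysis: the phenotype is pure noise (no SNP has any true effect), SNPs are unlinked with allele frequency 0.5, and all SNP effects are fitted jointly by ordinary least squares. The function names and default sample sizes are hypothetical and chosen only for illustration.

```python
import numpy as np


def r_squared(y, y_hat):
    """Squared Pearson correlation between phenotype and predictor."""
    return float(np.corrcoef(y, y_hat)[0, 1] ** 2)


def simulate_overlap_pitfall(n_discovery=200, n_validation=200,
                             n_snps=100, seed=0):
    """Return (R2 in the discovery sample, R2 in an independent sample).

    Assumption: the phenotype is unrelated to every SNP, so any
    non-zero R2 arises by chance alone.
    """
    rng = np.random.default_rng(seed)

    # Genotypes coded as 0/1/2 allele counts (assumed frequency 0.5).
    x_disc = rng.binomial(2, 0.5, size=(n_discovery, n_snps)).astype(float)
    x_val = rng.binomial(2, 0.5, size=(n_validation, n_snps)).astype(float)

    # Null phenotypes: independent standard normal noise.
    y_disc = rng.standard_normal(n_discovery)
    y_val = rng.standard_normal(n_validation)

    # Centre genotypes using discovery-sample means only.
    snp_means = x_disc.mean(axis=0)
    x_disc_c = x_disc - snp_means
    x_val_c = x_val - snp_means

    # Estimate the prediction equation in the discovery sample.
    beta, *_ = np.linalg.lstsq(x_disc_c, y_disc - y_disc.mean(), rcond=None)

    # Pitfall: test the predictor on the data used to estimate it.
    r2_discovery = r_squared(y_disc, x_disc_c @ beta)

    # Correct procedure: apply the same coefficients, without
    # re-estimation, to the independent validation sample.
    r2_validation = r_squared(y_val, x_val_c @ beta)

    return r2_discovery, r2_validation


if __name__ == "__main__":
    r2_d, r2_v = simulate_overlap_pitfall()
    print(f"R2 in discovery sample (overlap):  {r2_d:.3f}")
    print(f"R2 in independent validation:      {r2_v:.3f}")
```

Under these assumptions the discovery R² is expected to be roughly n_snps / n_discovery even though the predictor has no real predictive value, whereas the R² obtained with fixed coefficients in the independent sample should stay close to zero. Changing `n_snps` or `n_discovery` shows how the inflation grows as the number of SNPs approaches the sample size.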