Chunk #15 — Pitfalls of the analysis — Pitfall 1: Validation and discovery sample overlap

Source: Pitfalls of predicting complex traits from SNPs.

Text

When the number of SNPs in the predictor is large and the sample size is small, the discovery R² can be very high by chance and can grossly over-estimate the variance that the predictor explains when applied in an independent sample. Moreover, if a set of SNPs is selected in a discovery sample but their effect sizes are re-estimated in the validation sample, the expected R² in the validation sample is ~1/N_v, where N_v is the validation sample size. Therefore, to estimate the R² of a prediction in a new sample, the prediction equation is estimated in the discovery sample and then tested in the validation sample without re-estimating the regression coefficients (Box 2). Applying an incorrect validation procedure results in over-estimation of the accuracy of the prediction (over-fitting). Over-fitting occurs, for example, when the prediction is tested in the discovery sample itself, i.e., when the same data are used both to estimate the effects of SNPs on the phenotype and to make predictions (refs 53, 54). We illustrate the overlap pitfall with examples in dairy cattle,