Chunk #14 — Pitfalls of the analysis — Pitfall 1: Validation and discovery sample overlap

Source: Pitfalls of predicting complex traits from SNPs.
Embedded: yes

Text

If the population correlation (R) between a phenotype and a single SNP is zero (that is, the SNP is not associated with the trait), the expected value of the squared correlation (R²) estimated from a sample of size N is 1/(N-1), or approximately 1/N when N is large. Hence a randomly chosen ‘candidate’ SNP that is not truly associated with the trait is expected to explain about 1/N of the phenotypic variance in any sample. Usually 1/N is small enough to ignore. However, a set of m uncorrelated SNPs that have nothing to do with the phenotype of interest would, when fitted together, be expected to explain about m/N of the variance, because their individual chance contributions add up. For example, 100 independent SNPs fitted together in a regression analysis in a discovery sample of N_d = 1000 would, on average, explain R² ≈ 10% of the phenotypic variance in that sample under the null hypothesis of no true association.
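
The m/N result can be checked with a small simulation. The sketch below is illustrative only and is not taken from the paper: it assumes independent SNPs coded 0/1/2 with an allele frequency of 0.5 and a standard normal phenotype drawn independently of the genotypes, and the function names `null_r2` and `mean_null_r2` are hypothetical. It fits all SNPs jointly by ordinary least squares and averages the in-sample R² over replicates.

```python
import numpy as np


def null_r2(n_samples, n_snps, rng):
    """In-sample R^2 from regressing a random phenotype on unassociated SNPs."""
    # Assumption: independent SNPs coded 0/1/2 with allele frequency 0.5.
    genotypes = rng.binomial(2, 0.5, size=(n_samples, n_snps)).astype(float)
    # The phenotype is drawn independently of the genotypes (null hypothesis).
    phenotype = rng.standard_normal(n_samples)

    # Fit all SNPs jointly by ordinary least squares, with an intercept.
    design = np.column_stack([np.ones(n_samples), genotypes])
    beta, *_ = np.linalg.lstsq(design, phenotype, rcond=None)

    residuals = phenotype - design @ beta
    ss_res = residuals @ residuals
    ss_tot = np.sum((phenotype - phenotype.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def mean_null_r2(n_samples=1000, n_snps=100, n_reps=200, seed=0):
    """Average null R^2 over n_reps simulated discovery samples."""
    rng = np.random.default_rng(seed)
    return float(np.mean([null_r2(n_samples, n_snps, rng) for _ in range(n_reps)]))


if __name__ == "__main__":
    # Compare the simulated mean with the theoretical value m / (N - 1).
    print(mean_null_r2(), 100 / 999)
```

With the defaults (N_d = 1000, m = 100), the simulated mean should land close to 100/999, which is roughly 10%, even though no SNP has any true effect. Setting `n_snps=1` gives a value near 1/(N-1) instead.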