STEP 3: Eliminate SNPs in high LD.
Correlated variants can represent non-independent association signals; if this is ignored, loci in high LD are overweighted in the PRS because a single signal is effectively counted multiple times. Indeed, failing to thin a PRS according to LD can reduce its precision (Wu et al., 2013). It is therefore recommended that variants be clumped so that the LD statistic, r², between retained variants is no greater than 0.10. When SNPs are correlated, the SNP retained in the target dataset should be chosen by its strength of association in the discovery GWAS (i.e., p-value-informed clumping). For example, if 80 SNPs form an LD block, the SNP selected to represent the block should be the one with the smallest p-value in the discovery GWAS. Without clumping, the polygenic interpretation of a PRS may be limited, especially at more stringent p-value thresholds, where an entire PRS could be driven by a series of correlated variants; but see Improvements in PRS estimation for alternatives. Studies also commonly exclude regions of complex linkage structure (e.g., the MHC region) or retain a single representative SNP across such regions.
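The p-value-informed clumping described above can be illustrated with a minimal sketch. It assumes a precomputed matrix of pairwise r² values from a reference panel and a vector of discovery-GWAS p-values; the function name `clump`, its parameters, and the toy data are hypothetical and chosen only for illustration. Dedicated tools typically also restrict comparisons to a physical distance window, which this sketch omits.

```python
import numpy as np

def clump(pvalues, r2, r2_threshold=0.10):
    """Greedy p-value-informed clumping (illustrative sketch only).

    pvalues      : discovery-GWAS p-values, one per SNP
    r2           : symmetric matrix of pairwise LD (r^2) between SNPs
    r2_threshold : maximum r^2 allowed between an index SNP and any SNP kept later
    Returns the sorted indices of the retained (index) SNPs.
    """
    pvalues = np.asarray(pvalues, dtype=float)
    r2 = np.asarray(r2, dtype=float)
    available = np.ones(len(pvalues), dtype=bool)
    kept = []
    # Visit SNPs from the strongest association (smallest p-value) to the weakest.
    for i in np.argsort(pvalues, kind="stable"):
        if not available[i]:
            continue  # already removed by a stronger SNP in LD with it
        kept.append(int(i))
        # Drop every SNP whose r^2 with the index SNP exceeds the threshold.
        available &= r2[i] <= r2_threshold
        available[i] = False
    return sorted(kept)

# Toy data (made up): SNPs 0-2 form one LD block, SNPs 3-4 are in moderate LD.
pvals = [1e-4, 1e-8, 1e-6, 0.03, 0.2]
r2_matrix = [
    [1.00, 0.80, 0.60, 0.01, 0.00],
    [0.80, 1.00, 0.70, 0.02, 0.00],
    [0.60, 0.70, 1.00, 0.00, 0.05],
    [0.01, 0.02, 0.00, 1.00, 0.30],
    [0.00, 0.00, 0.05, 0.30, 1.00],
]
kept = clump(pvals, r2_matrix)
print(kept)  # [1, 3]
```

In this toy example, SNP 1 represents the first block because it has the smallest p-value, and SNP 3 is retained over SNP 4 for the same reason.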