Chunk #47 — Methods — Summary Statistics — Using the UK Biobank split-sample PGI

Source: Resource profile and user guide of the Polygenic Index Repository.
Embedded: yes

Text

Splitting the UKB into thirds as described above increases the predictive power of the PGI within each third (relative to omitting the UKB from the GWAS sample). Researchers may wish to conduct analyses that simultaneously include individuals from different partitions of the data, or to meta-analyse results across partitions. Such analyses will produce unbiased estimates, but the standard errors will be incorrectly calibrated. To see why, consider a linear model $Y_i = X_i\beta + \varepsilon_i$, where $X_i$ is a vector of covariates that includes a PGI. Imagine that the data $(Y, X)$ include individuals from different partitions. As a result of the sample-splitting procedure above, $\mathrm{Cov}(X_i, \varepsilon_i) = 0$, which implies that the OLS estimator for $\beta$ will be unbiased. However, because some of the individuals in the data were used to generate the PGI for other individuals in the data, $\mathrm{Cov}(X_i, \varepsilon_j) \neq 0$ whenever individuals $i$ and $j$ are in different partitions. As a result,

\begin{align*}
\mathrm{Var}(\hat{\beta}) &= \mathrm{Var}\left[(X'X)^{-1}X'Y\right] \\
&= \mathrm{Var}\left[(X'X)^{-1}X'\varepsilon\right] \tag{5} \\
&\neq (X'X)^{-1}X'\,\mathrm{Var}(\varepsilon)\,X(X'X)^{-1}. \tag{6}
\end{align*}

Expression (6) is the standard general formula for the sampling variance of OLS estimates. It is not equal
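
As a rough numerical illustration of the argument in this section, the following Python sketch simulates a deliberately stylised data-generating process. It is not the Repository's PGI construction. The sketch assumes two partitions, A and B, in which each individual's regressor equals an independent signal plus a multiple (`leak`, a hypothetical parameter) of the error term of a "partner" individual in the other partition. This gives Cov(X_i, ε_i) = 0 for every individual but Cov(X_i, ε_j) ≠ 0 across partitions. The function names, the sample size, and the value of β are likewise assumptions chosen for illustration. The sketch pools both partitions, runs OLS, and compares the Monte Carlo standard deviation of the estimate with the average conventional standard error.

```python
import numpy as np


def simulate_once(rng, n=200, beta=0.5, leak=1.0):
    """One pooled OLS fit under a toy cross-partition leakage model (assumed DGP)."""
    # Error terms for partitions A and B, n individuals each.
    eps_a = rng.standard_normal(n)
    eps_b = rng.standard_normal(n)
    # Toy regressor: independent signal plus leakage from the OTHER
    # partition's errors, so Cov(x_i, eps_i) = 0 but Cov(x_i, eps_j) != 0
    # for paired individuals i and j in different partitions.
    x_a = rng.standard_normal(n) + leak * eps_b
    x_b = rng.standard_normal(n) + leak * eps_a
    x = np.concatenate([x_a, x_b])
    eps = np.concatenate([eps_a, eps_b])
    y = beta * x + eps

    # OLS without an intercept on the pooled sample.
    xtx = x @ x
    beta_hat = (x @ y) / xtx
    resid = y - beta_hat * x
    sigma2_hat = (resid @ resid) / (x.size - 1)
    # Conventional (homoskedastic, independent-errors) standard error.
    naive_se = np.sqrt(sigma2_hat / xtx)
    return beta_hat, naive_se


def run(reps=2000, seed=0, **kwargs):
    """Return (mean estimate, Monte Carlo SD of estimate, mean naive SE)."""
    rng = np.random.default_rng(seed)
    draws = np.array([simulate_once(rng, **kwargs) for _ in range(reps)])
    beta_hats, naive_ses = draws[:, 0], draws[:, 1]
    return beta_hats.mean(), beta_hats.std(ddof=1), naive_ses.mean()


mean_beta_hat, mc_sd, mean_naive_se = run()
ratio = mc_sd / mean_naive_se
print(f"mean beta_hat        = {mean_beta_hat:.4f}")
print(f"Monte Carlo SD       = {mc_sd:.4f}")
print(f"mean naive SE        = {mean_naive_se:.4f}")
print(f"SD / naive SE ratio  = {ratio:.3f}")
```

Under these assumptions the average estimate should sit close to the true β, while the ratio of the Monte Carlo standard deviation to the conventional standard error should be noticeably above 1. A back-of-the-envelope large-sample calculation for this toy model suggests a value near 1.2 when `leak=1.0`. Setting `leak=0.0` removes the cross-partition dependence, and the ratio should then return to roughly 1.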