Chunk #71 — Online Methods — Evaluation of in-sample imputation accuracy

Source: Fast and accurate long-range phasing in a UK Biobank cohort.

Text

In our in-sample imputation benchmarks, we used the same SNP and sample subsets described above, but we modified the genotype data by randomly masking 2% of all genotypes (increasing the missingness of each SNP by ≈0.02). We then phased the masked data, obtaining imputed genotypes at all masked SNPs in the phased output. For each SNP, we computed adjusted $R^2$ between actual and imputed masked genotype values according to the formula

\[ \text{adjusted } R^2 := R^2 - \frac{1 - R^2}{n - 2}, \tag{1} \]

where $R^2$ on the right is the usual coefficient of determination and $n$ is the number of data points. (This adjustment corrects for upward bias due to finite sample size; for simplicity, we always use “$R^2$” to refer to adjusted $R^2$ elsewhere in this manuscript.) We computed means and standard errors of $R^2$ over MAF strata, treating $R^2$ from different SNPs as approximately independent given that the ≈2% subset of masked individuals varied from SNP to SNP. To assess in-sample imputation accuracy on a subset of samples (e.g., the 120,000 British samples curated by UK Biobank for GWAS), we computed $R^2$ using only masked genotypes from samples in the subset.
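The evaluation described above can be illustrated with a minimal Python sketch. It is not the authors' code. The function names (`mask_genotypes`, `adjusted_r2`, `mean_and_se`), the 0/1/2 genotype coding, and the use of `None` for a masked call are assumptions made for illustration. The sketch also assumes that the coefficient of determination is the squared Pearson correlation between actual and imputed values, which holds for a simple linear regression of one on the other.

```python
import math
import random


def mask_genotypes(genotypes, rate=0.02, seed=0):
    """Return a copy with roughly `rate` of the entries set to None, plus the masked indices.

    Hypothetical helper: each genotype is masked independently with probability `rate`.
    """
    rng = random.Random(seed)
    masked_idx = [i for i in range(len(genotypes)) if rng.random() < rate]
    masked = list(genotypes)
    for i in masked_idx:
        masked[i] = None
    return masked, masked_idx


def adjusted_r2(actual, imputed):
    """Adjusted R^2 = R^2 - (1 - R^2) / (n - 2) for one SNP.

    Assumption: R^2 is the squared Pearson correlation of the two vectors.
    """
    n = len(actual)
    if n != len(imputed):
        raise ValueError("actual and imputed must have the same length")
    if n < 3:
        raise ValueError("need at least 3 data points so that n - 2 > 0")
    mean_a = sum(actual) / n
    mean_i = sum(imputed) / n
    sxx = sum((a - mean_a) ** 2 for a in actual)
    syy = sum((b - mean_i) ** 2 for b in imputed)
    sxy = sum((a - mean_a) * (b - mean_i) for a, b in zip(actual, imputed))
    if sxx == 0 or syy == 0:
        return float("nan")  # R^2 is undefined when either vector is constant
    r2 = (sxy * sxy) / (sxx * syy)
    return r2 - (1 - r2) / (n - 2)


def mean_and_se(values):
    """Mean and standard error of per-SNP values, treating SNPs as independent."""
    n = len(values)
    mean = sum(values) / n
    if n < 2:
        return mean, float("nan")
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(var / n)
```

To restrict the evaluation to a subset of samples, one would pass `adjusted_r2` only the masked genotypes belonging to samples in that subset. Per-SNP values would then be grouped by MAF stratum before calling `mean_and_se`.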