Chunk #72 — Online Methods — Evaluation of GWAS imputation accuracy

Source: Fast and accurate long-range phasing in a UK Biobank cohort.
Embedded: yes

Text

For computational efficiency, we performed all benchmarks of downstream imputation starting from a single data set, created as follows. First, we merged the 379 European-ancestry individuals from the 1000 Genomes Phase 1 integrated v3 release (see URLs) into the UK Biobank data set. Second, we entirely masked 700 random SNPs per chromosome, 100 in each of seven MAF bins (with MAF computed in the curated British samples). We phased all samples together using Eagle, and we phased a subset of N≈15,000 samples (all 1000 Genomes samples plus 10% of the UK Biobank samples) using SHAPEIT2. Finally, we used the Sanger Imputation Service to impute the N≈15,000 SHAPEIT2-phased samples and the same subset of Eagle-phased samples using both the UK10K panel (3,781 samples) and the Haplotype Reference Consortium (r1) panel (32,488 samples) with the PBWT imputation algorithm³⁷ (see URLs). We assessed imputation R² in N≈12,000 curated British samples at the masked and imputed SNPs, computing means and standard errors across MAF strata as before (treating R² from different SNPs as approximately independent given that each MAF bin contained <1 SNP per
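
The masking and per-bin summary described above can be illustrated with a minimal Python sketch. It is not the authors' code and rests on several assumptions: the seven MAF bin edges in `MAF_EDGES` are hypothetical (this passage does not give them), imputation R² is taken to be the squared Pearson correlation between masked true genotypes and imputed dosages, and the function names `maf_bin`, `choose_masked_snps`, `imputation_r2`, and `summarize_by_bin` are invented for illustration. The standard error treats per-SNP R² values as independent, as the text does.

```python
import bisect
import math
import random
import statistics

# Hypothetical edges for seven MAF bins; the real edges are not given in this passage.
MAF_EDGES = [0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.2, 0.5]


def maf_bin(maf):
    """Return the bin index (0..6) for a MAF value, or None if it is outside the edges."""
    if not (MAF_EDGES[0] <= maf <= MAF_EDGES[-1]):
        return None
    # Clamp so that maf == 0.5 falls in the last bin.
    return min(bisect.bisect_right(MAF_EDGES, maf) - 1, len(MAF_EDGES) - 2)


def choose_masked_snps(mafs, per_bin=100, seed=None):
    """Pick up to `per_bin` random SNP indices from each MAF bin on one chromosome."""
    rng = random.Random(seed)
    by_bin = {}
    for i, maf in enumerate(mafs):
        b = maf_bin(maf)
        if b is not None:
            by_bin.setdefault(b, []).append(i)
    chosen = []
    for idx in by_bin.values():
        chosen.extend(rng.sample(idx, min(per_bin, len(idx))))
    return sorted(chosen)


def imputation_r2(truth, dosage):
    """Squared Pearson correlation between true genotypes and imputed dosages at one SNP."""
    n = len(truth)
    mean_t = sum(truth) / n
    mean_d = sum(dosage) / n
    cov = sum((t - mean_t) * (d - mean_d) for t, d in zip(truth, dosage))
    var_t = sum((t - mean_t) ** 2 for t in truth)
    var_d = sum((d - mean_d) ** 2 for d in dosage)
    if var_t == 0 or var_d == 0:
        return None  # R² is undefined for a monomorphic SNP or a constant dosage
    return cov * cov / (var_t * var_d)


def summarize_by_bin(r2_values, mafs):
    """Mean and standard error of per-SNP R² within each MAF bin (SNPs treated as independent)."""
    grouped = {}
    for r2, maf in zip(r2_values, mafs):
        b = maf_bin(maf)
        if r2 is not None and b is not None:
            grouped.setdefault(b, []).append(r2)
    return {
        b: (statistics.fmean(v), statistics.stdev(v) / math.sqrt(len(v)))
        for b, v in sorted(grouped.items())
        if len(v) > 1
    }
```

In this sketch, `choose_masked_snps` would be called once per chromosome with MAFs computed in the reference sample set, and `summarize_by_bin` would be applied to the R² values of the masked SNPs after imputation.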