Chunk #49 — Methods — Imputation

Source: The UK Biobank resource with deep phenotyping and genomic data.
Embedded: yes

Text

To facilitate fast imputation of all 500,000 samples, we re-coded IMPUTE2 (ref. 23) to focus exclusively on the haploid imputation needed when samples have been pre-phased. This new version of the program is referred to as IMPUTE4 (see https://jmarchini.org/software/); it uses exactly the same hidden Markov model as IMPUTE2 and produces identical results to IMPUTE2 when run using all reference haplotypes as hidden states (data not shown). To reduce RAM usage and increase speed, we use compact data structures that store the indices of haplotypes carrying the non-reference allele at variant sites in the reference panel. Not only is this data structure compact, but at each stage of the forward-backward algorithm it also allows the calculations involving the emission part of the hidden Markov model to sum efficiently over only the subset of haplotypes that carry the non-reference allele. A further increase in speed is obtained by calculating the marginal copying probabilities only at those sites common to the target and reference datasets, and then linearly interpolating these probabilities at the intervening SNPs that need to be imputed.
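The two speed-ups described above can be illustrated with a minimal Python sketch. It is not taken from IMPUTE4. All function names are hypothetical, the emission model (a single mismatch probability `err`) is an assumed simplification, and the coordinate used for interpolation (`pos`) is an assumption, since the text does not say whether physical or genetic distance is used. The sketch shows (1) an emission-weighted sum over hidden states that touches only the haplotypes carrying the non-reference allele, and (2) linear interpolation of marginal copying probabilities between two flanking typed sites, followed by a sum over carriers at the untyped site.

```python
from typing import List, Sequence


def emission_sum_dense(alpha: Sequence[float], ref_alleles: Sequence[int],
                       observed: int, err: float) -> float:
    """Reference version: visit every reference haplotype (0/1 alleles)."""
    total = 0.0
    for a, allele in zip(alpha, ref_alleles):
        total += a * ((1.0 - err) if allele == observed else err)
    return total


def emission_sum_sparse(carrier_idx: Sequence[int], alpha: Sequence[float],
                        alpha_total: float, observed: int, err: float) -> float:
    """Same quantity, visiting only carriers of the non-reference allele.

    carrier_idx holds the indices of haplotypes carrying allele 1 at this
    site. alpha_total is the sum of alpha over all haplotypes, assumed to
    be tracked by the caller so it need not be recomputed at each site.
    """
    carrier_mass = sum(alpha[k] for k in carrier_idx)
    p_carrier = (1.0 - err) if observed == 1 else err
    p_other = err if observed == 1 else (1.0 - err)
    return p_carrier * carrier_mass + p_other * (alpha_total - carrier_mass)


def interpolate_copy_probs(pos: float, left_pos: float, right_pos: float,
                           left_probs: Sequence[float],
                           right_probs: Sequence[float]) -> List[float]:
    """Linearly interpolate copying probabilities at an untyped site.

    left_probs and right_probs are the marginal copying probabilities at
    the flanking sites shared by the target and reference datasets.
    """
    w = (pos - left_pos) / (right_pos - left_pos)
    return [(1.0 - w) * l + w * r for l, r in zip(left_probs, right_probs)]


def imputed_alt_prob(carrier_idx: Sequence[int],
                     copy_probs: Sequence[float]) -> float:
    """Probability of the non-reference allele: copying mass on carriers."""
    return sum(copy_probs[k] for k in carrier_idx)
```

In this sketch the per-site cost of the emission step scales with the number of carriers rather than with the size of the reference panel, which is where the saving comes from when most variants are rare.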