Chunk #24 — Online Methods — Eagle2 core algorithm for phasing a single target sample using a set of reference haplotypes — Step 3: Exploration of the diplotype space
Having prepared a HapHedge of conditioning haplotypes, Eagle2 performs phasing using a statistical model similar to the Li-Stephens haplotype copying model5,21 used by previous HMM-based methods. However, in contrast to previous methods, Eagle2 applies two new ideas to perform fast and accurate phase inference under this model. The first idea is a new way to efficiently compute haplotype probabilities under a copying model. Naïvely, such computations require exponential time because of the combinatorial explosion of possible recombination points. The standard approach to overcoming this barrier is to observe that within a KHMM-state HMM, recursion allows computation of all marginal probabilities (for all KHMM states at each of M positions) in O(MKHMM2) time. With Eagle2, we take a completely different recursive approach that computes the probability of a single haplotype in O(M) time—independent of the number of reference haplotypes K—after creation of the HapHedge in O(MK) time. The HapHedge essentially consolidates all reference haplotypes sharing a common prefix (starting at any given position) into a single atom of data, thus eliminating future computation that scales with K.