Chunk #21 — Methods — The phasing model for low coverage sequence data

Source: Integrating sequence and array data to create an improved 1000 Genomes Project haplotype reference panel.
Embedded: yes

Text

Given the segment representation described above, sampling a diplotype (pair of haplotypes) given a set of known haplotypes $H$ and a set of sequencing reads $R$ involves sampling from the posterior distribution $P(X_1, X_2 \mid H, R)$. We first assume that the reads $R$ for the individual being updated are conditionally independent of the haplotypes $H$ in the other individuals, given the pair of haplotypes $(X_1, X_2)$. We can then write

$$P(X_1, X_2 \mid H, R) \propto P(X_1, X_2, R, H) \tag{1}$$

$$\propto P(R \mid X_1, X_2)\, P(X_1, X_2 \mid H) \tag{2}$$

Line (2) follows from the chain rule, $P(X_1, X_2, R, H) = P(R \mid X_1, X_2, H)\, P(X_1, X_2 \mid H)\, P(H)$: the conditional independence assumption gives $P(R \mid X_1, X_2, H) = P(R \mid X_1, X_2)$, and $P(H)$ does not depend on $(X_1, X_2)$. This factorisation involves a model of the diplotype given the observed haplotypes, $P(X_1, X_2 \mid H)$, for which we use the previously described SHAPEIT2 model [8]. The term $P(R \mid X_1, X_2)$ is constructed from the genotype likelihoods.
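To make the factorisation concrete, the following is a minimal Python sketch that samples a diplotype with probability proportional to P(R | X1, X2) P(X1, X2 | H) over a small, fully enumerated set of candidate haplotypes. It is an illustration under stated assumptions and not the implementation used in the paper: the function names are hypothetical, the `prior` callable is only a stand-in for the SHAPEIT2 model, and the read term is assumed to be a product over biallelic sites of the genotype likelihood for the genotype implied by the two alleles.

```python
import itertools
import random


def read_likelihood(x1, x2, gls):
    """Toy P(R | X1, X2).

    ASSUMPTION: sites are independent given the diplotype, so the read term
    is the product over sites of the genotype likelihood for the genotype
    x1[s] + x2[s] (0, 1 or 2 copies of the alternate allele).
    """
    lik = 1.0
    for a, b, gl in zip(x1, x2, gls):
        lik *= gl[a + b]
    return lik


def diplotype_posterior(candidates, prior, gls):
    """Normalise P(R | X1, X2) * P(X1, X2 | H) over all ordered pairs.

    `prior` is a hypothetical placeholder for the haplotype model
    P(X1, X2 | H); here it is any callable returning a non-negative weight.
    """
    weights = {}
    for x1, x2 in itertools.product(candidates, repeat=2):
        weights[(x1, x2)] = read_likelihood(x1, x2, gls) * prior(x1, x2)
    total = sum(weights.values())
    if total <= 0.0:
        raise ValueError("all diplotypes have zero posterior weight")
    return {d: w / total for d, w in weights.items()}


def sample_diplotype(posterior, rng):
    """Draw one diplotype from the normalised posterior."""
    diplotypes = list(posterior)
    probs = [posterior[d] for d in diplotypes]
    return rng.choices(diplotypes, weights=probs, k=1)[0]


if __name__ == "__main__":
    # Three candidate haplotypes over two biallelic sites (toy data).
    candidates = [(0, 0), (1, 0), (1, 1)]
    # Genotype likelihoods per site, indexed by genotype 0, 1, 2 (toy data).
    gls = [(0.1, 0.8, 0.1), (0.7, 0.2, 0.1)]
    # Flat prior: a placeholder only, not the SHAPEIT2 model.
    posterior = diplotype_posterior(candidates, lambda x1, x2: 1.0, gls)
    print(sample_diplotype(posterior, random.Random(0)))
```

Enumerating every pair is only feasible for a toy candidate set; the sketch is meant to show how the two terms of the factorisation combine, not how the sampling is carried out at scale.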