Chunk #22 — Methods — The phasing model for low coverage sequence data

Source: Integrating sequence and array data to create an improved 1000 Genomes Project haplotype reference panel.
Embedded: yes

Text

Based on the segmentation of the chromosome into C segments, we employ a Markov model similar to the one introduced in the SHAPEIT2 method [8]. It can be written as:

$$P(X_1, X_2 \mid H, R) = P(X_{\{1\}1}, X_{\{1\}2} \mid H, R) \prod_{s=2}^{C} P(X_{\{s\}1}, X_{\{s\}2} \mid X_{\{s-1\}1}, X_{\{s-1\}2}, H, R) \tag{3}$$

The idea here is to first sample a diplotype for the first segment, s = 1, from $P(X_{\{1\}1}, X_{\{1\}2} \mid H, R)$, and then to sample a diplotype for each successive segment from $P(X_{\{s\}1}, X_{\{s\}2} \mid X_{\{s-1\}1}, X_{\{s-1\}2}, H, R)$. The scheme we use is described by the following steps:

1. A pair of haplotypes in the first segment with labels (i, j) is sampled with probability proportional to $P(X_{\{1\}1} = i, X_{\{1\}2} = j \mid H, R)$. Set s = 2.
2. While s ≤ C, a pair of haplotypes (d, f) for the s-th segment is sampled, given the previously sampled pair (i, j) for the (s-1)-th segment, with probability proportional to $P(X_{\{s\}1} = d, X_{\{s\}2} = f \mid X_{\{s-1\}1} = i, X_{\{s-1\}2} = j, H, R)$.
3. Set s = s + 1. If s = C + 1 then stop, else go to Step 2.
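The sampling scheme can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that every segment has the same number of haplotype labels and that two hypothetical callables, `init_weight` and `trans_weight`, return unnormalized values proportional to the initial and transition probabilities in equation (3). How those probabilities are computed from H and R is outside the scope of the sketch.

```python
import random
from itertools import product


def sample_diplotype_path(n_segments, n_labels, init_weight, trans_weight, rng=None):
    """Sample one pair of haplotype labels per segment, left to right.

    init_weight(i, j): hypothetical callable, proportional to the
        probability of the pair (i, j) in the first segment.
    trans_weight(s, i, j, d, f): hypothetical callable, proportional to
        the probability of the pair (d, f) in segment s given the pair
        (i, j) in segment s - 1 (s is 1-based, as in the text).
    Returns a list of n_segments (label, label) tuples.
    """
    rng = rng or random.Random()
    # All candidate label pairs; assumes the same label count in every segment.
    pairs = list(product(range(n_labels), repeat=2))

    # Step 1: sample the pair for the first segment.
    weights = [init_weight(i, j) for i, j in pairs]
    path = [rng.choices(pairs, weights=weights, k=1)[0]]

    # Steps 2 and 3: sample each later segment given the previous pair.
    s = 2
    while s <= n_segments:
        i, j = path[-1]
        weights = [trans_weight(s, i, j, d, f) for d, f in pairs]
        path.append(rng.choices(pairs, weights=weights, k=1)[0])
        s += 1
    return path
```

Because `random.choices` normalizes the weights internally, the callables only need to return values proportional to the target probabilities, which matches the "probability proportional to" wording of the steps. Enumerating all label pairs costs on the order of the squared label count per segment, which is acceptable for a sketch but may not reflect how an efficient implementation would organize the computation.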