Chunk #18 — Methods — The phasing model for low coverage sequence data

Source: Integrating sequence and array data to create an improved 1000 Genomes Project haplotype reference panel.
Embedded: yes

Text

We wish to estimate the haplotypes of N unrelated individuals with sequence data at L bi-allelic variants, which may be SNPs, indels or structural variants. Our new algorithm extends the SHAPEIT2 model and the MCMC method used to carry out inference under this model. We use a Gibbs sampling scheme in which each individual's haplotypes are sampled conditional upon that individual's sequence reads and the current haplotype estimates of all other individuals. It is therefore sufficient to consider a single iteration in which we update the haplotypes of the ith individual. We use R to denote the sequence data available for this individual and H to denote the current haplotype estimates of the other individuals used in the iteration. We define the genotype likelihood as the probability of observing the sequence data R at a particular site l given the unobserved genotype G_l: P(R | G_l), where G_l ∈ {0, 1, 2} counts the number of non-reference alleles in the genotype. These genotype likelihoods can be obtained using specialised software such as SAMtools [14], SNPtools [15] or GATK [16], which derive them directly from the BAM files containing the sequence reads.
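To make the genotype likelihood P(R | G_l) concrete, the following is a minimal illustrative sketch in Python. It is not the model used by SAMtools, SNPtools or GATK, which account for per-base and mapping qualities. It assumes a simple binomial model in which the reads at a site are summarised by counts of reference and non-reference alleles, and each read is miscalled independently with a fixed error rate. The function name `genotype_likelihoods` and the parameters `n_ref`, `n_alt` and `error_rate` are hypothetical and chosen only for this example.

```python
from math import comb


def genotype_likelihoods(n_ref, n_alt, error_rate=0.01):
    """Return [P(R|G=0), P(R|G=1), P(R|G=2)] for a single bi-allelic site.

    Assumption (not from the paper): reads are independent, and each read
    reports the wrong allele with probability `error_rate`.
    """
    n_reads = n_ref + n_alt
    likelihoods = []
    for g in (0, 1, 2):
        # Fraction of the individual's chromosomes carrying the
        # non-reference allele under genotype g.
        alt_fraction = g / 2
        # Probability that one read shows the non-reference allele:
        # a correct read from an alt chromosome, or a miscalled read
        # from a ref chromosome.
        p_alt = alt_fraction * (1 - error_rate) + (1 - alt_fraction) * error_rate
        # Binomial probability of the observed allele counts.
        likelihood = comb(n_reads, n_alt) * p_alt**n_alt * (1 - p_alt)**n_ref
        likelihoods.append(likelihood)
    return likelihoods


if __name__ == "__main__":
    # Two reference reads and two non-reference reads at one site.
    for g, lik in enumerate(genotype_likelihoods(n_ref=2, n_alt=2)):
        print(f"P(R | G_l = {g}) = {lik:.6g}")
```

With two reads supporting each allele, the heterozygous genotype (G_l = 1) receives the highest likelihood under this toy model, and a site with no reads gives equal likelihood to all three genotypes. The latter case is common in low coverage data and is why the haplotype estimates H of the other individuals are needed to resolve the genotype.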