Chunk #16 — Online Methods — Simulation of sequencing data based on 1000 Genomes Project dataset — Imputing genotypes from sequencing data

Source: Extremely low-coverage sequencing and imputation increases power for genome-wide association studies.
Embedded: yes

Text

Genotypes can be inferred from sequencing data in one of four ways: (1) inferring genotypes independently at each SNP in each individual, (2) making use of allele frequencies inferred from all sequenced individuals, (3) making use of linkage disequilibrium (LD) patterns inferred from sequenced individuals, or (4) making use of LD patterns inferred from sequenced individuals as well as reference panels of haplotypes (refs. 7,22,24,26). Here we focus on (3) and (4), using a two-step imputation approach (see Supplementary Note for details and results of other approaches). In the first step, we computed genotype likelihoods independently for each individual at all polymorphic loci identified in the 1000 Genomes Project dataset. We disregarded all observed alleles that did not match either the reference or alternate allele identified in the 1000 Genomes Project dataset and computed the likelihoods of 0, 1, or 2 copies of the 1000 Genomes Project “reference” allele at all SNPs identified in the phase 1 release of the 1000 Genomes Project. Reads that did not overlap any polymorphic sites were discarded. In the second step, the genotype likelihoods for all loci in all samples (with or
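
As an illustration of the first step described above, the following is a minimal sketch of a per-site genotype likelihood calculation for one individual. The text does not specify the error model, so the symmetric per-base error rate (`error_rate`), the function name `genotype_likelihoods`, and the independence of reads are all assumptions made for illustration, not the authors' implementation.

```python
def genotype_likelihoods(bases, ref, alt, error_rate=0.01):
    """Return [L0, L1, L2]: likelihood of the observed bases given
    0, 1, or 2 copies of the reference allele at one site.

    Assumptions (not taken from the source): reads are independent,
    and a base is misread as the other allele with probability
    `error_rate`.
    """
    likelihoods = [1.0, 1.0, 1.0]
    for base in bases:
        # Disregard observed alleles matching neither ref nor alt.
        if base not in (ref, alt):
            continue
        # Probability of observing this base if the read came from a
        # chromosome carrying the ref allele, or the alt allele.
        p_if_ref = 1.0 - error_rate if base == ref else error_rate
        p_if_alt = 1.0 - error_rate if base == alt else error_rate
        for copies in (0, 1, 2):
            # A read samples the ref-carrying chromosome with
            # probability copies / 2.
            frac_ref = copies / 2.0
            likelihoods[copies] *= frac_ref * p_if_ref + (1.0 - frac_ref) * p_if_alt
    return likelihoods
```

With no informative reads at a site, all three likelihoods stay equal to 1, so the site carries no information about the genotype. This is the usual situation at very low coverage, and it is why the second step, which combines likelihoods across loci and samples, is needed.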