Chunk #7 — Results

Source: Extremely low-coverage sequencing and imputation increases power for genome-wide association studies.
Embedded: yes

Text

We also evaluated empirical results at lower coverage (0.005x to 0.5x) by sub-sampling reads with the probability corresponding to each target coverage. Because of the large number of experiments, and because non-exome coverage is higher in the IHCS data than across the full set of 909 samples, we restricted this analysis to the 10 distinct 5Mb regions (50Mb in total) described above in the IHCS data set (84 samples). As coverage decreases, accuracy declines, analogous to our simulations based on the 1000 Genomes Project dataset, restricted to the same set of 6,070 SNPs from the array (Figure 3). At 0.5x coverage we observe a mean r² of 0.82, with a standard deviation of 0.03 and a standard error of 0.01 across the 10 regions. However, imputation accuracy in the IHCS sequencing data is lower than in simulations at every level of coverage (Figure 3). The discrepancy between simulations and real data could reflect increased similarity across haplotypes inferred from the 1000 Genomes Project phase 1 data, due to the genotype calling and phasing procedure from 4x sequencing data that aggregated information
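The sub-sampling step and the per-region summary statistics reported above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the representation of reads as a plain list, and the `original_coverage` parameter are assumptions made for illustration, since the text does not state the starting coverage or the tool used. The sketch keeps each read independently with probability target/original, and computes the standard error as the standard deviation divided by the square root of the number of regions.

```python
import math
import random
import statistics


def subsample_reads(reads, target_coverage, original_coverage, seed=0):
    """Keep each read independently with probability target/original.

    `original_coverage` is an assumed input; the source text does not
    give the starting coverage of the regions analysed.
    """
    keep_prob = target_coverage / original_coverage
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("target coverage must be in (0, original coverage]")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [read for read in reads if rng.random() < keep_prob]


def standard_error(sd, n):
    """Standard error of the mean from a standard deviation and sample size."""
    return sd / math.sqrt(n)


def summarize(values):
    """Return (mean, sample standard deviation, standard error) of the values."""
    sd = statistics.stdev(values)
    return statistics.mean(values), sd, standard_error(sd, len(values))
```

With the reported standard deviation of 0.03 across 10 regions, `standard_error(0.03, 10)` is about 0.0095, which agrees with the reported standard error of 0.01 after rounding.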