Chunk #53 — Results and discussion — Discovering uncategorized bias

Source: Characterizing and measuring bias in sequence data.
Embedded: yes

Text

Performance on this fraction is hidden from our monitoring methods by its dissimilarity to the current set of motifs. On the Illumina HiSeq 'Kapa' data set, these bases had a mean relative coverage of 0.037. They also suffered from high error rates: a mismatch rate of 0.020 (6.7 times the whole-genome average), a deletion rate of 0.11 (470 times the whole-genome average), and an insertion rate of 0.0021 (12 times the whole-genome average). The high deletion rate suggests that some of the undercoverage may have been due to short biological deletions in NA12878 relative to the reference sequence, but even if all the deletions originated in the sample, these regions would still be more than ten-fold undercovered. Their GC-content and homopolymer distributions did not differ appreciably from those of the overall genome (Figure 6). Clearly, these regions either were exceptionally resistant to the Illumina HiSeq technology or are places where the reference is inaccurate for NA12878 or for human samples generally. A list of the intervals' coordinates, GC content, and homopolymer N50 statistics is included in Additional file 6.
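The figures quoted above can be cross-checked with simple arithmetic. The following is a minimal sketch, not part of the original analysis: the function and variable names are hypothetical, and it assumes that "relative coverage" means coverage divided by the genome-wide mean, so that fold undercoverage is its reciprocal. It also back-calculates the whole-genome average error rates implied by the reported fold differences; because the reported values are rounded, the implied averages are approximate.

```python
# Values reported in the text for the Illumina HiSeq 'Kapa' data set.
relative_coverage = 0.037

# error type -> (rate in these regions, reported multiple of whole-genome average)
reported = {
    "mismatch": (0.020, 6.7),
    "deletion": (0.11, 470),
    "insertion": (0.0021, 12),
}


def fold_undercoverage(rel_cov):
    """Fold undercoverage, assuming relative coverage = coverage / genome mean."""
    return 1.0 / rel_cov


def implied_genome_average(rate, multiple):
    """Whole-genome average rate implied by a regional rate and its fold difference."""
    return rate / multiple


fold = fold_undercoverage(relative_coverage)
implied = {
    name: implied_genome_average(rate, multiple)
    for name, (rate, multiple) in reported.items()
}

print(f"fold undercoverage: {fold:.1f}")
for name, value in implied.items():
    print(f"implied whole-genome {name} rate: {value:.2e}")
```

Under these assumptions the raw undercoverage is roughly 27-fold, which leaves considerable margin above the ten-fold figure cited for the case in which all deletions are attributed to the sample.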