Chunk #11 — Results — High frequency polymorphisms between sequenced reads and reference genomes

Source: Shining a light on dark sequencing: characterising errors in Ion Torrent PGM data.
Embedded: yes

Text

even when the ‘putative’ indel was present across a large number of reads [3], [5]. If the indels in our datasets were bona fide polymorphisms, we would expect them to be observed across all datasets for the same species. Analysis of the 200 bp kits revealed that 87% of high-frequency indels (HFIs) across the B. amyloliquefaciens datasets and 82% across the S. tokodaii datasets were unique to a single run (Figures S2a and S2b). As the data were derived from the same DNA template, this strongly indicates that these indels are due to PGM-based error rather than genuine polymorphisms. While few HFI sites were shared amongst all 200 bp runs for the same species, the size of the intersection between pairs of runs suggests that the HFIs (or a subset of them) are not random (i.e. they may be more prevalent in or around a particular sequence motif) (Figures S2a and S2b). HFIs have been observed previously in PGM data, with some evidence to suggest that they were asymmetrically distributed between reads in the forward and reverse orientations [3]. We investigated whether any of the HFIs in our data were asymmetrically distributed across the forward