Chunk #11 — Results and discussion — Motif bias

Source: Characterizing and measuring bias in sequence data.

Typically, only a small fraction of a genome has 'low' relative coverage. For example, 198-fold mean coverage of the human genome with Illumina HiSeq 2000 version 2 chemistry left only 0.23% of bases undercovered by a factor of 10 or more (data set A2). At first glance, this portion of the genome appears minuscule, but if the data were unbiased, we would expect no bases at all to have such low coverage, which lies more than 12 standard deviations below the mean. Moreover, this small undercovered fraction included important loci. For example, this deep-coverage HiSeq data set contained no reads overlapping the transcription start sites of several genes associated with early development, transcriptional regulation, cell-cell adhesion, actin binding, neural development, and intracellular signaling (for an example, see Figure 1). Thus, understanding the specific nature of undercovered sequences is important. We approached this problem in two ways: by evaluating specific biologically important regions of the genome that are significantly undercovered, and by identifying specific sequence motifs that are systematically undercovered. Anecdotal results suggested that many transcription start sites or first exons
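
The "more than 12 standard deviations" figure quoted above can be checked with a short calculation. The sketch below assumes that unbiased per-base coverage follows a Poisson distribution whose mean equals the observed mean coverage; the text does not state which model was used, so this is an illustrative assumption. The genome length of 3.1e9 bases is likewise a rough assumed value, and the function names are hypothetical.

```python
import math


def poisson_cdf(k_max, lam):
    """P(X <= k_max) for X ~ Poisson(lam), with each term computed in log space."""
    return sum(
        math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
        for k in range(k_max + 1)
    )


def undercoverage_summary(mean_cov, fold, genome_len):
    """Summarize how unlikely `fold`-fold undercoverage is under a Poisson model.

    Assumption (not from the source): unbiased coverage ~ Poisson(mean_cov),
    so the standard deviation is sqrt(mean_cov).
    """
    threshold = mean_cov / fold              # coverage at the undercoverage cutoff
    sd = math.sqrt(mean_cov)                 # Poisson standard deviation
    z = (mean_cov - threshold) / sd          # distance below the mean, in SD units
    p = poisson_cdf(math.floor(threshold), mean_cov)  # chance a base falls at or below cutoff
    expected_bases = p * genome_len          # expected count of such bases genome-wide
    return z, p, expected_bases


# 198-fold mean coverage, 10-fold undercoverage, assumed ~3.1e9-base genome
z, p, expected = undercoverage_summary(198, 10, 3.1e9)
print(f"cutoff is {z:.2f} SD below the mean")
print(f"per-base probability: {p:.3e}")
print(f"expected undercovered bases: {expected:.3e}")
```

Under these assumptions the cutoff falls between 12 and 13 standard deviations below the mean, and the expected number of such bases across the genome is far below one, which is consistent with the statement that unbiased data should leave no bases this poorly covered.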