Chunk #0 — Background

Source: Characterizing and measuring bias in sequence data.
Embedded: yes
Text

Ideal whole-genome shotgun DNA sequencing would distribute reads uniformly across the genome, with no sequence-dependent variation in quality. All existing sequencing technologies fall short of this ideal and exhibit various types and degrees of bias. Sequencing bias degrades genomic data applications that rely on genome-wide coverage, including genome assembly and variation discovery. An undercovered region might cause a SNP in an important region to be missed, or cause an assembler to produce shorter contigs. For example, Figure 1 plots the coverage of the transcription start site and first exon of human gene NCS1, which encodes a neurotransmitter regulator [1], in whole-genome shotgun sequencing (data set A2). Despite 198-fold mean coverage of the genome, the first 72 bases of this exon are completely uncovered. This type of bias can reduce the effectiveness of biological and medical research. Recently published work on drug-resistant tuberculosis identified thousands of zero-coverage sites in an entire class of the bacterium's genes, despite sequencing to an average depth of 134× [2]. Closing gaps or dips in coverage by generating additional reads inflates sequencing costs and may have limited effectiveness. For these reasons, improving our knowledge of sequencing bias is essential to improving the utility of DNA sequencing data.
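The contrast drawn above, high mean coverage alongside sites with no coverage at all, can be made concrete with a small sketch. The code below is illustrative only and is not the paper's method: the function names `per_base_depth` and `zero_coverage_runs` are hypothetical, reads are assumed to be already aligned and given as half-open `[start, end)` intervals on a single region, and the toy reads are invented for the example.

```python
from typing import List, Tuple


def per_base_depth(region_length: int, reads: List[Tuple[int, int]]) -> List[int]:
    """Return the read depth at each position of a region.

    Reads are half-open [start, end) intervals in region coordinates
    (an assumption of this sketch); parts outside the region are clipped.
    """
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (region_length + 1)
    for start, end in reads:
        start = max(start, 0)
        end = min(end, region_length)
        if start < end:
            diff[start] += 1
            diff[end] -= 1

    # A running sum of the difference array gives the depth per position.
    depth = []
    running = 0
    for delta in diff[:region_length]:
        running += delta
        depth.append(running)
    return depth


def zero_coverage_runs(depth: List[int]) -> List[Tuple[int, int]]:
    """Return maximal runs of zero depth as half-open [start, end) intervals."""
    runs = []
    run_start = None
    for pos, d in enumerate(depth):
        if d == 0 and run_start is None:
            run_start = pos
        elif d != 0 and run_start is not None:
            runs.append((run_start, pos))
            run_start = None
    if run_start is not None:
        runs.append((run_start, len(depth)))
    return runs


# Toy data, invented for illustration (not taken from data set A2).
reads = [(3, 8), (5, 12), (6, 12), (4, 10)]
depth = per_base_depth(12, reads)
gaps = zero_coverage_runs(depth)
mean_depth = sum(depth) / len(depth)

print("depth:", depth)
print("mean depth:", mean_depth)
print("zero-coverage runs:", gaps)
```

In this toy region the mean depth is nonzero while the first three positions have no reads at all, which is the same pattern, at a much smaller scale, as the uncovered NCS1 bases described above. A mean therefore cannot stand in for a per-position check.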