Chunk #2 — Background

Source: Statistical modeling for sensitive detection of low-frequency single nucleotide variants.
Embedded: yes

Text

Low-frequency SNV identification using NGS technologies is difficult because of the relatively high rate of sequencing artifacts and errors, which is around 0.1–1% for most platforms. Furthermore, these error rates differ significantly across genomic contexts. For example, Illumina sequencing data are prone to mismatches, while Ion Torrent and Ion Proton data contain more homopolymer-related indels and, consequently, mismatches near homopolymer loci [11–13]. For somatic SNV identification in a paired tumor-normal design, some existing methods, for example Mutect [7] and Strelka [14], derive the sequencing error probability from base qualities and then apply an error likelihood ratio test to the tumor and normal samples at the same location. VarScan2, in contrast, applies Fisher’s exact test to the paired samples, treating non-reference read counts from the normal sample as the background error rate. The former approach fails to consider differential error rates across substitution types, while the latter uses information from only one location, so its estimate of the background error rate is inaccurate. For single-sample low-frequency SNV calling, UDT-Seq [15] tabulated the error rate based on substitution types, strand and location on