Statistical modeling for sensitive detection of low-frequency single nucleotide variants.
- Authors
- Hao, Yangyang; Zhang, Pengyue; Xuei, Xiaoling; Nakshatri, Harikrishna; Edenberg, Howard J; Li, Lang; Liu, Yunlong
- Year
- 2016
- Journal
- BMC Genomics
- PMID
- 27556804
- DOI
- 10.1186/s12864-016-2905-x
- PMCID
- PMC5001245
BACKGROUND: Sensitive detection of low-frequency single nucleotide variants (SNVs) is important in many applications. In cancer genetics research, tumor biopsies are a mixture of normal cells and tumor cells from various subpopulations because of tumor heterogeneity, so the frequencies of somatic variants from any one subpopulation tend to be low. Liquid biopsies, which monitor circulating tumor DNA in blood to detect metastatic potential, face the same challenge because circulating tumor DNA makes up only a small percentage of the DNA in blood. In population genetics research, pooled sequencing of a large number of individuals is cost-effective, but pooling dilutes the signal of variants from any single individual. Detection of low-frequency variants is difficult and can be confounded by sequencing artifacts. Existing methods are limited in sensitivity and mainly focus on frequencies around 2% to 5%; most fail to consider differential sequencing artifacts.

RESULTS: We aimed to push the frequency detection limit down close to the position-specific sequencing error rates by modeling the observed erroneous read counts with respect to genomic sequence contexts. Four distributions suitable for count data modeling (using generalized linear models) were extensively characterized in terms of their goodness-of-fit and their performance on real sequencing data benchmarks, which were specifically designed for testing detection of low-frequency variants; two sequencing technologies with significantly different chemistry mechanisms were used to explore systematic errors. We found that the zero-inflated negative binomial distribution generalized linear model is superior to the other models tested, and the advantage is most evident in the 0.5% to 1% range. This method is also generalizable to different sequencing technologies. Under the standard sequencing protocols and depth given in the testing benchmarks, the method achieved 95.3% recall and 79.9% precision on Ion Proton data, and 95.6% recall and 97.0% precision on Illumina MiSeq data, for SNVs with frequency ≥ 1%; the detection limit is around 0.5%.

CONCLUSIONS: Our method enables sensitive detection of low-frequency single nucleotide variants across different sequencing platforms and will facilitate research and clinical applications such as pooled sequencing, early cancer detection, prognostic assessment, metastatic monitoring, and identification of relapse or acquired resistance.
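The idea of testing an observed alternate-allele count against a zero-inflated negative binomial (ZINB) error model can be sketched as follows. This is a minimal illustration, not the authors' implementation: the expected error count `mu` would in practice come from a fitted generalized linear model over sequence-context covariates, and the dispersion `theta`, the zero-inflation probability `pi`, the depth, and the error rate used here are all hypothetical values chosen for the example.

```python
import math

def nb_log_pmf(k, mu, theta):
    """Log PMF of a negative binomial with mean mu and dispersion (size) theta."""
    return (math.lgamma(k + theta) - math.lgamma(theta) - math.lgamma(k + 1)
            + theta * math.log(theta / (theta + mu))
            + k * math.log(mu / (theta + mu)))

def zinb_pmf(k, mu, theta, pi):
    """PMF of a ZINB: a point mass pi at zero mixed with a negative binomial."""
    nb = math.exp(nb_log_pmf(k, mu, theta))
    return pi + (1.0 - pi) * nb if k == 0 else (1.0 - pi) * nb

def zinb_upper_tail(k, mu, theta, pi):
    """P(X >= k) under the ZINB error model, computed as 1 - CDF(k - 1).

    Adequate for illustration; for extremely small tails a direct
    summation of the upper tail would be numerically safer.
    """
    if k <= 0:
        return 1.0
    cdf = sum(zinb_pmf(j, mu, theta, pi) for j in range(k))
    return max(0.0, 1.0 - cdf)

# Hypothetical locus: 3000x depth and an assumed 0.1% background error rate.
depth = 3000
mu = depth * 0.001          # expected erroneous reads (assumed, not fitted)
theta, pi = 2.0, 0.1        # assumed dispersion and zero-inflation values

p_half_percent = zinb_upper_tail(15, mu, theta, pi)  # 15 reads = 0.5% frequency
p_one_percent = zinb_upper_tail(30, mu, theta, pi)   # 30 reads = 1% frequency
print(p_half_percent, p_one_percent)
```

A smaller upper-tail probability means the observed count is less plausible as sequencing error alone, which is why a 1% variant is easier to separate from background than a 0.5% variant at the same depth.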
SNV loci depth distribution by allele frequency for Ion Proton and Illumina MiSeq. The dashed lines show the 3000x depth.
Distplot on binomial, Poisson and negative binomial distributions. The y-axis is the distribution metameter calculated by the distplot method. The open points show the observed count metameters; the filled points show the confidence interval centers, and the dashed lines show the confidence interval for each point. A 95% confidence interval is used.
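For readers unfamiliar with the "distribution metameter" in the caption above, the Poisson case can be sketched as follows. This assumes the caption refers to Hoaglin-style diagnostic plots (as provided by, for example, the `distplot` function in R's vcd package); the function names below are hypothetical, and the confidence-interval centers shown in the figure are not reproduced. For Poisson data with rate λ, the metameter ln(k! · n_k / N) is linear in k with slope ln λ, so departures from a straight line suggest a poor fit.

```python
import math

def poissonness_metameter(freq_table):
    """Map each count value k to ln(k! * n_k / N).

    freq_table maps k -> n_k, the number of loci with exactly k erroneous reads.
    """
    n_total = sum(freq_table.values())
    return {k: math.lgamma(k + 1) + math.log(n_k / n_total)
            for k, n_k in sorted(freq_table.items()) if n_k > 0}

def least_squares_slope(xs, ys):
    """Ordinary least-squares slope of ys on xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Synthetic frequency table following an exact Poisson shape with rate 2.
lam = 2.0
table = {k: 1000 * math.exp(-lam) * lam ** k / math.factorial(k) for k in range(9)}
phi = poissonness_metameter(table)
slope = least_squares_slope(list(phi.keys()), list(phi.values()))
lam_hat = math.exp(slope)   # rate recovered from the slope of the plot
print(lam_hat)
```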
| Name | Type |
|---|---|
| 0.5% frequency SNV local | variant |
| 1000 Genomes Project | cohort |
| 1% frequency SNV local | variant |
| acquired resistance local | phenotype |
| benchmark datasets local | cohort |
| Binomial distribution local | drug |
| blood DNA local | drug |
| CAL_A local | cohort |
| CAL_A dataset local | cohort |
| CAL_B local | cohort |
| CAL_C local | cohort |
| CAL_D local | cohort |
| cancer | phenotype |
| candidate SNV local | variant |
| candidate SNVs local | variant |
| circulating tumor DNA local | drug |
| DNA | drug |
| DNA-Seq local | drug |
| duplex sequencing | drug |
| early detection local | phenotype |
| EdgeR local | drug |
| Fisher's exact test local | drug |
| homopolymer related errors local | phenotype |
| Illumina MiSeq local | cohort |
| Illumina MiSeq local | drug |
| Illumina MiSeq benchmark local | cohort |
| Illumina MiSeq dataset local | cohort |
| invariant loci local | variant |
| Ion Proton local | cohort |
| Ion Proton local | drug |
| Ion Proton dataset local | cohort |
| Ion Proton sequencing dataset local | cohort |
| Ion Proton test benchmark local | cohort |
| Ion Proton test benchmark dataset local | cohort |
| Ion Proton training benchmark local | cohort |
| Ion Proton training dataset local | cohort |
| low-frequency SNVs local | variant |
| low frequency tumor somatic SNV local | variant |
| metastasis | phenotype |
| metastatic monitoring local | phenotype |
| MiSeq local | cohort |
| MiSeq training data local | cohort |
| NA11993 local | cohort |
| NA12878 | cohort |
| NB local | drug |
| NB distribution local | drug |
| NB GLM local | drug |
| Negative binomial distribution local | drug |
| Poisson local | drug |
| Poisson distribution | drug |
| Poisson GLM local | drug |
| prognostic assessment local | phenotype |
| PSEM local | drug |
| PSEM approach local | drug |
| PSEM framework local | drug |
| PSEM model local | drug |
| relapse | phenotype |
| RNA-seq | drug |
| sequencing depth local | drug |
| sequencing depth evenness local | phenotype |
| single nucleotide variant | variant |
| SNV | variant |
| SNVs | variant |
| somatic SNV local | variant |
| SRP009487.1 local | cohort |
| testing benchmark local | cohort |
| training benchmark local | cohort |
| UDT-Seq local | drug |
| ultra-deep target enrichment assay local | drug |
| VarScan2 local | drug |
| Vuong's non-nested test local | drug |
| Vuong's test local | drug |
| Zero-inflated negative binomial local | drug |
| Zero-inflated Poisson local | drug |
| ZINB local | drug |
| ZINB GLM local | drug |
| ZIP local | drug |
No papers in this knowledge base cite this source.
External
| Title | Authors | Journal | Year | Link |
|---|---|---|---|---|
| GeneBits: ultra-sensitive tumour-informed ctDNA monitoring of treatment response and relapse in cancer patients. | Broche J et al. | - | 2025 | - |
| A system for detecting high impact-low frequency mutations in primary tumors and metastases. | Anjanappa M et al. | - | 2018 | - |
| RareVar: A Framework for Detecting Low-Frequency Single-Nucleotide Variants. | Hao Y et al. | - | 2017 | - |
| Intelligent biology and medicine in 2015: advancing interdisciplinary education, collaboration, and data science. | Huang K et al. | - | 2016 | - |