Chunk #9 — Results — Candidate distributions selection

Source: Statistical modeling for sensitive detection of low-frequency single nucleotide variants.
Embedded: yes

Text

To model error rate based on count data, 3 most common distribution choices are binomial, Poisson and negative binomial (NB) distributions. We applied a graphical exploratory plot – distplot [16–18] on the model response – number of reads containing non-reference bases – to get visual intuition about the overall fit of response data on different distributions. Intuitively, if an assumed distribution fits the data well, the data points should follow a straight line determined by the distribution metameters. As shown in Fig. 3, the obvious curve for binomial distribution plot suggests binomial distribution is not appropriate. The Poisson and NB plots show better agreement with the straight line although both curves deviate more from the straight line when the x-axis approaches 0. Tabulating the percentages of zero in the model responses show for Ion Proton training dataset, 85 % is 0 while 80 % for MiSeq training data. Thus zero-inflated models should be considered. In the modeling step, we included Poisson, NB and their zero-inflated counterparts (zero-inflated Poisson [19] or ZIP and zero-inflated negative binomial [20] or ZINB) as the candidate distributions under generalized linear model (GLM) framework.