Chunk #66 — Methods — Statistical filtering and analysis — Modeling flow values

Source: Shining a light on dark sequencing: characterising errors in Ion Torrent PGM data.
Embedded: yes

Text

Given RAM restrictions, a random subset of 18 million observations (flows) was sampled from all datasets as input to model fitting. True zero calls and over-calls of a zero were excluded from the model, as zero-flows were unlikely to be well approximated by a Gaussian. The flow-values were then modeled as normally distributed, using a variety of read attributes, including chip, kit, machine, flow position, well x-coordinate, well y-coordinate, nucleotide, position in cycle (PIC), and pyrimidine versus purine. As the flow-values for each homopolymer length did not share a constant variance, they were modeled using a double generalised linear model (DGLM), which simultaneously models the mean and the dispersion. In the DGLM used here, the mean was a Gaussian linear model and the dispersion was linear on a log scale. Only terms with an effect size greater than 0.001 were retained in the model. Although the PIC showed the strongest relationship with the flow error-rate, we considered replacing it with simpler terms, such as the nucleotide flowed or pyrimidine versus purine; however, this was detrimental to
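The DGLM structure described above (a Gaussian linear model for the mean, and a dispersion that is linear on a log scale) can be illustrated with a minimal sketch. This is not the authors' code: the function name `fit_dglm`, the design matrices `X` and `Z`, and the simple alternating scheme are assumptions made for illustration. The sketch uses a plain maximum-likelihood variant with no leverage (REML-style) adjustment, and assumes that the first column of `Z` is an intercept.

```python
import numpy as np

def fit_dglm(X, Z, y, n_iter=100, tol=1e-8):
    """Illustrative Gaussian DGLM fit (hypothetical helper, not the paper's code).

    Mean model:       E[y]   = X @ beta
    Dispersion model: Var[y] = exp(Z @ gamma)   (linear on a log scale)
    Assumes the first column of Z is an intercept.
    """
    X = np.asarray(X, dtype=float)
    Z = np.asarray(Z, dtype=float)
    y = np.asarray(y, dtype=float)

    # Start from ordinary least squares and a constant variance.
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    gamma = np.zeros(Z.shape[1])
    gamma[0] = np.log(np.mean((y - X @ beta) ** 2))

    for _ in range(n_iter):
        phi = np.exp(Z @ gamma)  # current per-observation variance

        # Mean sub-model: weighted least squares with weights 1 / phi.
        sw = 1.0 / np.sqrt(phi)
        beta_new = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]

        # Dispersion sub-model: one Fisher-scoring step of a log-link
        # gamma GLM fitted to the squared residuals d.
        d = (y - X @ beta_new) ** 2
        working = Z @ gamma + (d - phi) / phi
        gamma_new = np.linalg.lstsq(Z, working, rcond=None)[0]

        change = max(np.max(np.abs(beta_new - beta)),
                     np.max(np.abs(gamma_new - gamma)))
        beta, gamma = beta_new, gamma_new
        if change < tol:
            break

    return beta, gamma
```

In the setting of the text, the columns of `X` and `Z` would encode read attributes such as flow position or PIC, and `y` would hold the flow-values. The two sub-models are coupled: the fitted dispersion supplies the weights for the mean fit, and the squared residuals from the mean fit serve as the response for the dispersion fit.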