Chunk #72 — ONLINE METHODS — Normalization of Gene Expression and Adjustment for Covariates — Normalize observations and estimate confidence of sampling abundance by sequencing
The voom normalization (ref. 68) scales each sample’s read count for each gene by that sample’s total count across all genes, to account for variable sequencing depth across samples, yielding counts per million (CPM). It then takes the base-2 logarithm of the scaled counts so that each gene’s values are more nearly Gaussian. Still, because of the experimental steps involved in obtaining read counts (PCR, library preparation, sequencing, etc.), the read count for a particular gene is proportional to that gene’s underlying expression level only on average. It is therefore critical to model the statistical sampling of expression levels, particularly since larger log(CPM) values typically exhibit lower variance (an example of heteroscedasticity). To this end, voom estimates a confidence weight for each normalized observation. It does this by residualizing each gene on the covariates (known and surrogate, as applicable), fitting a mean-variance function across all genes, using the fitted function to estimate the variance of each individual observation, and setting that observation’s weight to the inverse of its estimated variance. The normalized observations, along with their weights, are carried forward into the next step.
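The weighting procedure described above can be sketched in a few lines of Python with NumPy. This is an illustrative approximation, not the limma implementation: the function name `voom_like_weights`, its parameters, and the floor on the predicted trend are hypothetical, and a low-degree polynomial stands in for the LOWESS curve that voom fits. The 0.5 and 1 offsets in the log-CPM transform and the use of the square root of the residual standard deviation as the trend response follow limma's voom.

```python
import numpy as np

def voom_like_weights(counts, design, trend_degree=2):
    """Sketch of voom-style precision weights (hypothetical helper).

    counts: genes x samples array of raw read counts
    design: samples x covariates matrix (including an intercept column)
    Returns (log_cpm, weights), both genes x samples.
    """
    counts = np.asarray(counts, dtype=float)
    design = np.asarray(design, dtype=float)
    lib_size = counts.sum(axis=0)  # total reads per sample

    # log2 counts per million; the offsets avoid taking the log of zero
    log_cpm = np.log2((counts + 0.5) / (lib_size + 1.0) * 1e6)

    # Residualize every gene on the covariates by ordinary least squares
    coef, *_ = np.linalg.lstsq(design, log_cpm.T, rcond=None)
    fitted = (design @ coef).T                      # genes x samples
    resid = log_cpm - fitted
    dof = design.shape[0] - np.linalg.matrix_rank(design)
    resid_sd = np.sqrt((resid ** 2).sum(axis=1) / dof)

    # Mean-variance trend across genes: sqrt(residual sd) against the
    # average log2 count. ASSUMPTION: a polynomial replaces voom's LOWESS.
    to_log_count = np.log2(lib_size + 1.0) - np.log2(1e6)   # per sample
    mean_log_count = log_cpm.mean(axis=1) + to_log_count.mean()
    trend = np.polyfit(mean_log_count, np.sqrt(resid_sd), deg=trend_degree)

    # Evaluate the trend at each observation's fitted log2 count, clamping
    # to the observed range so the polynomial is never extrapolated
    fitted_log_count = fitted + to_log_count
    clamped = np.clip(fitted_log_count, mean_log_count.min(), mean_log_count.max())
    pred_sqrt_sd = np.polyval(trend, clamped)
    # ASSUMPTION: arbitrary small floor guarding against non-positive predictions
    pred_sqrt_sd = np.maximum(pred_sqrt_sd, 1e-3)

    # weight = 1 / variance = 1 / (sqrt(sd))^4
    weights = 1.0 / pred_sqrt_sd ** 4
    return log_cpm, weights
```

Given a genes-by-samples count matrix and a design matrix whose columns are the intercept plus the known and surrogate covariates, `voom_like_weights(counts, design)` returns the log-CPM values and one weight per observation, which could then be passed to a weighted least-squares fit in the next step.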