Chunk #20 — Review — Technical challenges to employing reference sets

Source
Context and the human microbiome.

Text

Studies employing a reference set typically rely on the closed-reference approach to minimize compute, since only the input study needs to be evaluated and the evaluation can be done in an embarrassingly parallel fashion. Another benefit is that the closed-reference strategy is unlikely to produce OTUs composed of non-16S sequence, as the reference is expected to contain only 16S exemplars; furthermore, comprehensive references like Greengenes typically contain only near-full-length reads, which allows researchers to combine data derived from different variable regions. Of course, any annotation information about the reference, such as taxonomy or the phylogenetic relationships among the sequences it contains, can be attached to the input study data “for free.” Unfortunately, this strategy can only classify sequences that are reasonably similar to those in the reference database. Combining studies that are represented to different degrees in the reference (e.g., samples from different environments) can lead to statistically significant patterns in the data that are not driven by the underlying biology. As an example, imagine three samples A, B, and C where A is composed of Escherichia coli and both B and