Chunk #50 — Materials and Methods — Note on Genomic DNA Contamination in RNA-seq Datasets

Source: Pervasive transcription of the human genome produces thousands of previously unidentified long intergenic noncoding RNAs.
Embedded: yes

Text

Genomic DNA contamination is a potential source of false-positive expression signal in RNA-seq data and may contribute to the de novo assembly of erroneous transcripts. In principle, only reads spanning exon-exon junctions can be unequivocally identified as derived from RNA. De novo assembly of both nonspliced transcripts and spliced transcripts (apart from their exon-exon junction-spanning reads) may therefore suffer if significant genomic DNA contamination is present. Because our analysis used a wide range of new and pre-existing RNA-seq datasets with unknown levels of genomic DNA contamination, we took multiple steps to mitigate this risk. First, for all RNA-seq datasets, we compared the distribution of reads between protein-coding exons and other regions, with the expectation that read distributions should be similar among RNA-seq datasets generated from libraries of the same type (e.g., polyA+ selected). A dataset with an unusually high percentage of intronic and intergenic reads could contain significant genomic DNA contamination. Our analysis of the datasets used in this study revealed that, as expected, polyA+ specific RNA-seq datasets have a higher fraction of reads mapping to protein