Chunk #51 — Materials and Methods — Note on Genomic DNA Contamination in RNA-seq Datasets

Source: Pervasive transcription of the human genome produces thousands of previously unidentified long intergenic noncoding RNAs.
Embedded: yes

Text

unusually high percentage of intronic and intergenic reads could contain significant genomic DNA contamination. Our analysis of the datasets used in this study revealed that, as expected, polyA+ specific RNA-seq datasets have a higher fraction of reads mapping to protein-coding gene exons than rRNA-depleted or polyA− specific datasets. Furthermore, no obvious outlier datasets were found for any of the library types. The results of this analysis ensured that no datasets with high genomic DNA contamination were used in this study (Figure S2). Next, as described in Figure 2A and in the Methods, we applied both size and expression cutoffs to all lincRNAs. The size cutoff prevents errant single reads, arising either from genomic DNA contamination or from read-mapping artifacts, from being miscalled as lincRNAs, while the expression cutoff removes lincRNAs that are assembled from rare genomic DNA-derived reads. Together, these approaches served to minimize the contribution of genomic DNA to the lincRNA catalog.
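The two safeguards described above, a per-dataset check of the exonic read fraction and the size and expression cutoffs on candidate lincRNAs, can be sketched roughly as follows. This is an illustrative sketch only, not the pipeline used in the study. The names `Transcript`, `exonic_fraction`, and `filter_lincrnas` are hypothetical, and the threshold values are placeholders, since the actual cutoffs are specified in Figure 2A and the Methods rather than in this passage.

```python
from dataclasses import dataclass
from typing import Dict, List

# Placeholder thresholds for illustration only (assumptions, not the
# values used in the study; see Figure 2A and the Methods for those).
MIN_LENGTH_NT = 200
MIN_EXPRESSION = 1.0


@dataclass
class Transcript:
    """Hypothetical record for one assembled candidate transcript."""
    name: str
    length_nt: int
    expression: float  # expression units (e.g. FPKM) are an assumption


def exonic_fraction(read_counts: Dict[str, int]) -> float:
    """Return the fraction of mapped reads in protein-coding exons.

    `read_counts` maps a region class ("exonic", "intronic",
    "intergenic") to its read count. A dataset with an unusually low
    value for its library type would be flagged for possible genomic
    DNA contamination.
    """
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("dataset has no mapped reads")
    return read_counts.get("exonic", 0) / total


def filter_lincrnas(
    transcripts: List[Transcript],
    min_length: int = MIN_LENGTH_NT,
    min_expression: float = MIN_EXPRESSION,
) -> List[Transcript]:
    """Keep candidates that pass both the size and expression cutoffs."""
    return [
        t for t in transcripts
        if t.length_nt >= min_length and t.expression >= min_expression
    ]


# Made-up example data, not taken from the study.
candidates = [
    Transcript("single_read_artifact", 75, 5.0),   # fails the size cutoff
    Transcript("rare_gdna_assembly", 1500, 0.05),  # fails the expression cutoff
    Transcript("candidate_lincRNA", 1500, 3.2),    # passes both cutoffs
]
kept = filter_lincrnas(candidates)
```

With the placeholder thresholds, only `candidate_lincRNA` survives: the short transcript is removed by the size cutoff and the weakly expressed one by the expression cutoff, which mirrors the complementary roles the two filters play in the text.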