Chunk #16 — Results — Distinguishing promoters of protein-coding and lncRNA genes through an ensemble of decision trees model

Source: Promoter analysis reveals globally differential regulation of human long non-coding RNA and protein-coding genes.
Embedded: yes

Text

Several lines of evidence indicate that the transcriptional regulation of lncRNAs may differ substantially from that of protein-coding genes. To test this computationally, we fitted an integrative machine-learning model that uses information from all analyzed data types to distinguish the promoters of protein-coding genes from those of lncRNAs. The fitted ensemble model assigned promoters to the correct class (lncRNA or protein-coding) with more than 80% accuracy. Hence, across the majority of the genome sequence space, genetic and epigenetic information is sufficient to confidently separate these two classes of promoters (Table 1, Table S4). Interrogation of our fitted models revealed that the features contributing most to this predictive power are DNA k-mers and CSs. These were more discriminative than TFBSs, although most feature types, including TFBSs, had significant discriminative power (Figure 3).
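To make the general idea concrete, the following is a minimal sketch of classifying sequences from k-mer features with a bagged ensemble of one-level decision trees (stumps). It is an illustration under stated assumptions, not the model fitted in this study: the sequences are synthetic (GC-rich versus AT-rich stand-ins for two promoter classes), only dinucleotide frequencies are used, and all function names (kmer_features, fit_stump, fit_bagged_stumps, predict, random_seq) are hypothetical. Counting how often each k-mer is chosen by the stumps gives a crude analogue of interrogating a fitted ensemble for its most discriminative features.

```python
import random
from collections import Counter
from itertools import product

KMERS = ["".join(p) for p in product("ACGT", repeat=2)]  # 16 dinucleotides

def kmer_features(seq, k=2):
    """Normalized k-mer frequencies of a DNA sequence, in a fixed k-mer order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    n_windows = max(len(seq) - k + 1, 1)
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts[km] / n_windows for km in kmers]

def fit_stump(X, y):
    """Exhaustively pick the (feature, threshold, polarity) with lowest training error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for pol in (1, -1):
                err = sum((1 if pol * (row[j] - t) > 0 else 0) != lab
                          for row, lab in zip(X, y))
                if best is None or err < best[0]:
                    best = (err, j, t, pol)
    return best[1:]

def fit_bagged_stumps(X, y, n_trees=15, seed=0):
    """Fit each stump on a bootstrap resample of the training set (bagging)."""
    rng = random.Random(seed)
    n = len(X)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return stumps

def predict(stumps, row):
    """Majority vote over all stumps; returns class 1 or 0."""
    votes = sum(1 if pol * (row[j] - t) > 0 else 0 for j, t, pol in stumps)
    return 1 if 2 * votes > len(stumps) else 0

def random_seq(rng, gc, length=200):
    """Synthetic sequence with a given GC fraction (toy data, not real promoters)."""
    weights = [(1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2]  # A, C, G, T
    return "".join(rng.choices("ACGT", weights=weights, k=length))

# Assumption: class 1 is GC-rich and class 0 is AT-rich, purely for illustration.
rng = random.Random(42)
seqs = [random_seq(rng, 0.8) for _ in range(40)] + [random_seq(rng, 0.2) for _ in range(40)]
labels = [1] * 40 + [0] * 40
X = [kmer_features(s) for s in seqs]

order = list(range(len(X)))
rng.shuffle(order)
train, test = order[:60], order[60:]

stumps = fit_bagged_stumps([X[i] for i in train], [labels[i] for i in train])
accuracy = sum(predict(stumps, X[i]) == labels[i] for i in test) / len(test)
feature_usage = Counter(KMERS[j] for j, _, _ in stumps)  # which k-mers the stumps split on

print(f"held-out accuracy: {accuracy:.2f}")
print("most frequently selected k-mers:", feature_usage.most_common(3))
```

Stumps and synthetic data keep the sketch short and dependency-free. A realistic analysis would use deeper trees, real promoter annotations, the additional feature types named above (CSs and TFBSs), and cross-validation rather than a single train/test split.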