Chunk #48 — Materials and Methods — LincRNA Discovery — Identifying lincRNAs expressed significantly above other intergenic regions

Source: Pervasive transcription of the human genome produces thousands of previously unidentified long intergenic noncoding RNAs.
Embedded: yes

Text

For each RNA-seq dataset (Table S1), an empirical background distribution of expression values was generated from one million size-matched annotations shuffled randomly across intergenic sequence. Intergenic sequence was defined as all portions of the uniquely mappable genome excluding RefSeq NM, NR and XR genes; Ensembl v61 genes, including “lincRNAs” and “processed transcripts”; GENCODE v6 genes, including “lincRNAs” and “processed transcripts”; H-Invitational “noncoding” transcripts; alternative and extended 5′ and 3′ UTRs of known human genes from UTRdb; extended protein-coding gene structures derived from RNA-seq data (extended gene filter, described above); and published lincRNAs from Khalil et al. [18] and Cabili et al. [16]. To determine which putative lincRNAs (Dataset S2, FPKM > 1) were expressed significantly above background in at least one dataset, the probability of observing a transcript at any given expression level was estimated from the dataset-specific background distribution and adjusted for multiple testing using the Bonferroni correction, assuming one test per RNA-seq dataset. LincRNA annotations with a corrected P value ≤ 0.1 in at least one dataset are cataloged in Datasets S6 and S7.
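The significance test described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the toy background values, and the use of a +1 pseudocount in the empirical P value are all assumptions made for illustration, and the shuffling of size-matched annotations across intergenic sequence is taken as already done. The sketch takes, for each dataset, a sorted list of background FPKM values, estimates the probability of a background region reaching the candidate's expression level, multiplies by the number of datasets (Bonferroni, one test per dataset), and keeps the candidate if any corrected P value is ≤ 0.1.

```python
from bisect import bisect_left


def empirical_p(background_sorted, fpkm):
    """Empirical P value: fraction of background values >= fpkm.

    background_sorted must be sorted ascending. The +1 pseudocount
    (assumption, not stated in the source) keeps the estimate above zero.
    """
    n = len(background_sorted)
    n_ge = n - bisect_left(background_sorted, fpkm)
    return (n_ge + 1) / (n + 1)


def significant_in_any(fpkm_by_dataset, background_by_dataset, alpha=0.1):
    """True if the Bonferroni-corrected P value is <= alpha in any dataset.

    One test per RNA-seq dataset is assumed, so the correction factor is
    the number of datasets.
    """
    n_tests = len(background_by_dataset)
    for name, fpkm in fpkm_by_dataset.items():
        p_raw = empirical_p(background_by_dataset[name], fpkm)
        p_adj = min(1.0, p_raw * n_tests)
        if p_adj <= alpha:
            return True
    return False


# Toy background (hypothetical values): 100 shuffled regions per dataset
# instead of the one million used in the text.
toy_background = sorted([0.0] * 90 + [0.5] * 9 + [5.0])
toy_backgrounds = {"dataset_a": toy_background, "dataset_b": toy_background}

# Hypothetical candidate: highly expressed in dataset_a only.
candidate = {"dataset_a": 10.0, "dataset_b": 0.0}
is_significant = significant_in_any(candidate, toy_backgrounds)
```

With a background of one million shuffled regions, the smallest attainable uncorrected P value in this sketch is about 1e-6, so the number of shuffles sets the resolution of the test after correction.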