Chunk #43 — Materials and Methods — LincRNA Discovery — Transcripts overlapping extended protein coding gene structures were removed

Source: Pervasive transcription of the human genome produces thousands of previously unidentified long intergenic noncoding RNAs.
Embedded: yes

Text

RNA-seq reads may extend beyond the annotated 5′ and 3′ ends of protein coding gene structures. Such reads may represent extended UTRs or, in the case of spliced reads that map to the gene from distal sites, unannotated exons. To avoid cataloging transcripts in these regions as lincRNAs, we built a filter from the extended boundaries of protein coding genes, as defined by RNA-seq data. To do this, we performed de novo transcriptome assembly on all polyA+ RNA-seq libraries in Table S2 using Cufflinks v1.1.0 with RefSeq NM genes as the reference annotation (-g), upper quartile normalization (-N), and fragment bias correction (-b). We used RefSeq NM gene annotations as the reference for this assembly because they represent a limited, high confidence set of protein coding gene annotations. The resulting set of extended protein coding gene boundaries (Dataset S1) was used to remove any transcript that overlaps an extended protein coding gene by at least one base, regardless of strand.
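The overlap rule described above (remove a transcript if it shares at least one base with any extended protein coding gene, ignoring strand) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Interval` type, the function names, and the 0-based half-open coordinate convention (as in BED files) are all assumptions made for the example, and the toy coordinates are invented.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import NamedTuple


class Interval(NamedTuple):
    # Hypothetical record type; coordinates are assumed 0-based, half-open.
    chrom: str
    start: int   # first base, inclusive
    end: int     # one past the last base, exclusive
    strand: str  # carried along but deliberately ignored by the filter
    name: str


def build_index(genes):
    """Merge gene spans per chromosome into sorted, disjoint intervals."""
    by_chrom = defaultdict(list)
    for g in genes:
        by_chrom[g.chrom].append((g.start, g.end))
    index = {}
    for chrom, spans in by_chrom.items():
        spans.sort()
        merged = []
        for start, end in spans:
            if merged and start <= merged[-1][1]:
                # Overlapping or touching spans collapse into one.
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        index[chrom] = ([s for s, _ in merged], [e for _, e in merged])
    return index


def overlaps_any(t, index):
    """True if transcript t shares at least one base with any gene span."""
    if t.chrom not in index:
        return False
    starts, ends = index[t.chrom]
    # Last merged span that starts at or before the transcript's last base.
    i = bisect_right(starts, t.end - 1) - 1
    return i >= 0 and ends[i] > t.start


def filter_transcripts(transcripts, genes):
    """Keep only transcripts with no overlap, regardless of strand."""
    index = build_index(genes)
    return [t for t in transcripts if not overlaps_any(t, index)]


# Toy data (invented coordinates, for illustration only).
genes = [Interval("chr1", 100, 200, "+", "geneX_extended")]
transcripts = [
    Interval("chr1", 199, 300, "-", "A"),  # shares base 199, opposite strand
    Interval("chr1", 200, 300, "+", "B"),  # adjacent, no shared base
    Interval("chr2", 100, 200, "+", "C"),  # different chromosome
    Interval("chr1", 50, 100, "+", "D"),   # ends just before the gene
    Interval("chr1", 50, 101, "+", "E"),   # shares base 100
]
kept = filter_transcripts(transcripts, genes)
kept_names = [t.name for t in kept]
```

Merging the gene spans first keeps each lookup to a single binary search. Because strand is never consulted, an antisense transcript such as `A` is removed on the basis of a single shared base, which matches the rule stated in the text.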