Chunk #23 — VERTEBRATES — Long non-coding RNAs (lncRNAs)

Source: Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation.
Embedded: yes

Text

The RefSeq group continues to significantly expand the representation of non-coding structural RNAs and microRNAs, transcribed pseudogenes, and the largely uncharacterized lncRNAs. lncRNAs are generally defined as transcripts >200 nt in length that lack strong protein-coding potential (23). lncRNA RefSeq records are generated both by curation and by the eukaryotic genome annotation pipeline. NCBI currently maintains over 540 000 eukaryotic lncRNA RefSeq records, of which over 6700 have been curated and only a few hundred have been functionally characterized. Of these, many have been implicated in human disease, such as BACE1-AS, which may play a role in the pathophysiology of Alzheimer's disease, and HOTAIR, which has been associated with multiple cancers (24,25). The vast majority of lncRNAs have unknown functions, and the absence of long open reading frames makes it difficult to confirm that a transcript is complete. Furthermore, lncRNA submissions to INSDC are largely based on TSAs from short-read datasets that may include artifactual exon combinations. RefSeq curators take a conservative approach to representing lncRNA genes, only manually creating RefSeqs