Chunk #21 — VERTEBRATES — Incorporation of RNA-Seq and other data types in transcript-based curation

Source: Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation.
Embedded: yes

Text

A major goal of the RefSeq curation project is to represent high-quality, full-length transcript and protein reference sequences. As such, our curation criteria are based primarily on conventional transcript (mRNA and EST) and protein alignments and on published evidence. However, vertebrate transcriptome projects have become increasingly complex, with the majority of new transcript data now generated by short-read sequencing technology. Genome-wide studies of promoter-associated epigenetic marks also provide evidence of active promoters and/or active transcription. The RefSeq group has adjusted its curation practices to incorporate these new data types and enhance our manual annotation, particularly in cases where a gene or variant lacks abundant conventional transcript support. These RNA-Seq and epigenomic studies have generated enormous datasets that present challenges for gene annotation groups, for example through potential false positives and the lack of support for long-range exon combinations (15). RefSeq curators mitigate false positives by selectively incorporating only high-quality datasets into our genome annotation pipeline and into the manual annotation process. RefSeq curators visualize transcript alignments, variation data,