Chunk #6 — GENERATING THE REFSEQ DATASET

Source: Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation.

Text

RefSeq sequence records are generated by different methods depending on the sequence class and organism. Archaeal and bacterial genomes (see Prokaryotes section) are annotated using NCBI's prokaryotic genome annotation pipeline (http://www.ncbi.nlm.nih.gov/books/NBK174280/), while a small number of reference bacterial genomes are supported by collaboration and manual curation. RefSeq eukaryotic genomes are provided through two process flows. The majority of plant, animal, insect and arthropod genomes are annotated by the eukaryotic genome annotation pipeline. This pipeline generates annotation results from available transcript data (including RNA-Seq and transcriptome shotgun assembly (TSA) data), protein homology, ab initio prediction (largely when transcriptome data are unavailable), and available known (curated) RefSeq transcripts and proteins (see Table 1). Pipeline-generated annotations (model RefSeqs) are not always supported across the complete exon combination by a single evidence alignment, but may still have RNA-Seq support for individual exon pairs. The eukaryotic genomes that have been annotated by this pipeline are reported publicly with links to download the data by FTP, to view or perform a BLAST query against the annotated genome, or to access a detailed