Chunk #14 — GROWTH AND STATISTICS

Source: Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation.
Embedded: yes

Text

RefSeq FTP release 71 (July 2015) includes more than 77 million sequence records for more than 55 000 organisms. Table 2 summarizes the growth of the RefSeq dataset over the last year in terms of the number of organisms and sequence records represented in each RefSeq release FTP directory area. Bacterial genomes and proteins comprise the bulk of the RefSeq dataset (56% of the total accessions and 76% of the >52 million protein accessions). Significant increases in the number of organisms, proteins, and total records are seen for invertebrate, plant, and eukaryotic organisms, which is consistent with the increased number and throughput of genome sequencing projects. A significant factor in the continued high rate of growth of RefSeq data is the improvement of the genome pipelines that generate annotated RefSeq genomes. Most notably, this includes increased capacity in NCBI's prokaryotic genome annotation pipeline, re-development of the process flow that propagates annotation from eukaryotic GenBank genomes onto RefSeq genomes, and the incorporation of RNA-Seq evidence in NCBI's eukaryotic genome annotation pipeline, which affects the generation of model RefSeqs (XM_, XR_ and XP_ accessions, Table 1).
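As a rough illustration of how the model RefSeqs mentioned above could be picked out of a list of accessions, the following minimal Python sketch checks for the XM_, XR_ and XP_ prefixes named in the text. The function names and the sample accessions are hypothetical and chosen only for illustration; this is not part of any NCBI tool or pipeline.

```python
# Prefixes that the text identifies as model RefSeqs (XM_, XR_ and XP_).
MODEL_PREFIXES = ("XM_", "XR_", "XP_")


def is_model_refseq(accession: str) -> bool:
    """Return True if the accession starts with a model RefSeq prefix."""
    return accession.strip().upper().startswith(MODEL_PREFIXES)


def count_model_records(accessions) -> int:
    """Count the accessions in an iterable that carry a model RefSeq prefix."""
    return sum(1 for acc in accessions if is_model_refseq(acc))


# Hypothetical accessions, made up for illustration only.
sample = ["XM_000001.1", "NM_000002.3", "XP_000003.2", "XR_000004.1", "NP_000005.1"]
print(count_model_records(sample))
```

A prefix check of this kind only separates model records from the rest; it does not reproduce the per-directory counts summarized in Table 2.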