Chunk #7 — DATA ACQUISITION AND METHODS — UCSC Genes—the next generation of Known Genes

Source: The UCSC Genome Browser Database: 2008 update.

Text

In April 2007, UCSC released UCSC Genes (W.J. Kent, manuscript in preparation), an improved version of the existing Known Genes annotation (11), on the March 2006 (Build 36, hg18) human assembly. This annotation, which includes putative non-coding genes as well as protein-coding genes and 99.9% of RefSeq genes, is a moderately conservative prediction set based on data from RefSeq, GenBank and UniProt (12). Each entry requires the support of one GenBank RNA sequence and at least one additional line of evidence, with the exception of RefSeq RNAs, which require no additional evidence. Although some of the transcripts labeled as ‘non-coding’ in the set may actually code for protein, the evidence for the associated protein is typically weak. Compared to RefSeq, this gene set has about 10% more protein-coding genes, approximately five times as many putative non-coding genes, and about twice as many splice variants. As part of the migration to the UCSC Genes annotation, we now use our own UCSC Genes accession numbers as the primary key into the underlying knownGene table, rather than the GenBank mRNA accessions used