Chunk #28 — Comparing different publicly available data sets against the GENCODE 7 reference set

Source: GENCODE: the reference human genome annotation for The ENCODE Project.
Embedded: yes

Text

We compared the composition of annotation across the five major publicly available gene sets: UCSC, GENCODE, CCDS, RefSeq, and AceView. Both the number of protein-coding loci and the number of transcripts at those loci were investigated. The CCDS set has the lowest number of protein-coding loci and alternatively spliced transcripts because it is a high-quality, conservative gene set derived from the merge of the RefSeq and Ensembl/HAVANA gene sets (Pruitt et al. 2009). In CCDS, every splice site of every transcript must agree in both the RefSeq and Ensembl/HAVANA gene sets, and all transcripts must be full-length. While the number of protein-coding loci in RefSeq, GENCODE, and UCSC is comparable, AceView has ∼20,000 more coding loci. One likely source of this inflation is the predisposition of AceView to add a CDS to a transcript model and hence create novel coding loci from lncRNAs and pseudogenes (e.g., PTENP1). AceView predicts 31,057 single-exon loci, compared with 1724 in GENCODE, 3234 in RefSeq, and 4731 in UCSC genes. When the single-exon loci predicted by AceView are excluded from this analysis, the number of AceView gene loci is much closer to the number in the other gene sets (Fig. 7A).
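The single-exon comparison above can be illustrated with a minimal sketch. It is not the procedure used in the paper, which does not say how a single-exon locus was defined. The sketch assumes that a locus counts as single-exon when every one of its transcripts has exactly one exon. The function name `split_single_exon_loci`, the `(locus_id, transcript_id, exon_count)` record layout, and the toy records are all hypothetical and are not taken from any real annotation file.

```python
from collections import defaultdict


def split_single_exon_loci(transcripts):
    """Partition loci into single-exon and multi-exon sets.

    `transcripts` is an iterable of (locus_id, transcript_id, exon_count)
    tuples. ASSUMPTION: a locus is "single-exon" only if all of its
    transcripts have exactly one exon; the paper does not state its rule.
    """
    # Track the largest exon count seen among each locus's transcripts.
    max_exons = defaultdict(int)
    for locus_id, _transcript_id, exon_count in transcripts:
        max_exons[locus_id] = max(max_exons[locus_id], exon_count)

    single = {locus for locus, n in max_exons.items() if n == 1}
    multi = set(max_exons) - single
    return single, multi


# Toy records, invented purely for illustration.
toy = [
    ("locusA", "tA1", 1),  # only transcript has one exon -> single-exon locus
    ("locusB", "tB1", 4),  # has a multi-exon transcript -> multi-exon locus
    ("locusB", "tB2", 1),
    ("locusC", "tC1", 1),  # both transcripts single-exon -> single-exon locus
    ("locusC", "tC2", 1),
]

single, multi = split_single_exon_loci(toy)
print(f"single-exon loci: {len(single)}, remaining loci: {len(multi)}")
```

Applying the same partition to each gene set and comparing only the remaining loci corresponds to the kind of filtered comparison described for AceView. A different definition of single-exon locus would change the counts.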