Chunk #2 — ENCODE data production and initial analyses — Transcribed and protein-coding regions

Source: An integrated encyclopedia of DNA elements in the human genome.
Embedded: yes

Text

We used manual and automated annotation to produce a comprehensive catalogue of human protein-coding and non-coding RNAs as well as pseudogenes, referred to as the GENCODE reference gene set [14,15] (Supplementary Table U1). This includes 20,687 protein-coding genes (GENCODE annotation, V7), with on average 6.3 alternatively spliced transcripts (3.9 different protein-coding transcripts) per locus. In total, GENCODE-annotated exons of protein-coding genes cover 2.94% of the genome, or 1.22% when only protein-coding exons are counted. Protein-coding genes span 33.45% of the genome from the outermost start to stop codons, or 39.54% from promoter to poly(A) site. Analysis of mass spectrometry (MS) data from K562 and GM12878 cell lines yielded 57 confidently identified unique peptide sequences that are intergenic relative to the GENCODE annotation. Taken together with evidence of pervasive genome transcription [16], these data indicate that additional protein-coding genes remain to be found.
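
The genome-coverage percentages quoted above can be illustrated with a small sketch. It assumes that a base shared by several overlapping exons or transcripts is counted only once, so overlapping intervals are merged before summing their lengths. The function names, the half-open coordinate convention, and the toy intervals are hypothetical and chosen for illustration only; this is not the GENCODE pipeline, and the numbers are not GENCODE data.

```python
from typing import Iterable, List, Tuple

# Half-open interval [start, end) on a single chromosome (assumed convention).
Interval = Tuple[int, int]


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals so each base is counted once."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def coverage_percent(intervals: Iterable[Interval], genome_length: int) -> float:
    """Percentage of genome_length covered by the union of the intervals."""
    covered = sum(end - start for start, end in merge_intervals(intervals))
    return 100.0 * covered / genome_length


# Toy data (hypothetical): exons from overlapping transcripts of one locus,
# including one exon shared verbatim by two transcripts.
toy_exons = [(100, 200), (150, 260), (400, 450), (400, 450)]
toy_genome_length = 10_000
toy_percent = coverage_percent(toy_exons, toy_genome_length)
```

Here the four toy exons collapse to two merged intervals covering 210 bases, so the shared and overlapping bases do not inflate the percentage. On real annotation the same idea would be applied per chromosome, with the covered bases summed across chromosomes before dividing by the total genome length.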