Chunk #3 — GENCODE gene merge process

Source: GENCODE: the reference human genome annotation for The ENCODE Project.
Embedded: yes

Text

This process of combining the HAVANA and Ensembl annotation is complex. During the merge, all HAVANA and Ensembl transcript models are compared in two stages: first, transcripts on the same strand that have any overlapping coding exons are clustered together; then, within each cluster of transcripts, the exons are compared pairwise. The merge process, including the rules applied at each step, is summarized in the Supplemental Figures and Tables. Ensembl have developed a new module, HavanaAdder, to produce this GENCODE merged gene set. Before the HavanaAdder code is run, the HAVANA gene models are passed through the Ensembl health-checking system, which aims to identify any inconsistencies within the manually annotated gene set. Annotation flagged by this system is passed back to HAVANA for further inspection. In addition, the HAVANA transcript models are queried against external data sets such as the consensus coding sequence (CCDS) gene set (Pruitt et al. 2009) and Ensembl's alignments of all human cDNAs. If annotation described in these external data sets is missing from the manual set, it is recorded in the AnnoTrack system (see below; Kokocinski et al. 2010) so that annotators can inspect these loci.
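
The two comparison stages described above (clustering same-strand transcripts that share any overlapping coding exon, then comparing exons pairwise within a cluster) can be illustrated with a minimal Python sketch. This is not the HavanaAdder implementation, and it does not reproduce the merge rules given in the Supplemental material. The `Transcript` class, the function names, the 1-based inclusive coordinates, and the choice to report exons with identical coordinates are all assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Transcript:
    # Hypothetical minimal transcript model, not an Ensembl API object.
    id: str
    source: str          # assumed to be "HAVANA" or "Ensembl"
    strand: str          # "+" or "-"
    coding_exons: tuple  # tuple of (start, end), assumed 1-based inclusive


def exons_overlap(a, b):
    """True if two closed intervals (start, end) share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def share_coding_exon(t1, t2):
    """True if both transcripts are on the same strand and any coding exons overlap."""
    if t1.strand != t2.strand:
        return False
    return any(exons_overlap(e1, e2)
               for e1 in t1.coding_exons
               for e2 in t2.coding_exons)


def cluster_transcripts(transcripts):
    """Stage 1: group transcripts into connected components with union-find."""
    parent = list(range(len(transcripts)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Quadratic all-pairs check, chosen for clarity rather than speed.
    for i, j in combinations(range(len(transcripts)), 2):
        if share_coding_exon(transcripts[i], transcripts[j]):
            parent[find(i)] = find(j)

    clusters = {}
    for i, t in enumerate(transcripts):
        clusters.setdefault(find(i), []).append(t)
    return list(clusters.values())


def compare_exons(cluster):
    """Stage 2 (illustrative only): for each HAVANA/Ensembl pair in a cluster,
    list the coding exons whose coordinates are identical in both models."""
    results = {}
    for t1, t2 in combinations(cluster, 2):
        if t1.source == t2.source:
            continue  # only cross-source pairs are compared in this sketch
        havana, ensembl = (t1, t2) if t1.source == "HAVANA" else (t2, t1)
        shared = sorted(set(havana.coding_exons) & set(ensembl.coding_exons))
        results[(havana.id, ensembl.id)] = shared
    return results
```

For example, a HAVANA transcript with coding exons (100, 200) and (300, 400) and an Ensembl transcript on the same strand with coding exons (100, 200) and (300, 450) fall into one cluster, and `compare_exons` reports (100, 200) as the only exon with identical coordinates. A transcript on the opposite strand is never clustered with them, even if its coordinates overlap.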