Chunk #2 — Automatic annotation process

Source: GENCODE: the reference human genome annotation for The ENCODE Project.
Embedded: yes

Text

Protein-coding genes were annotated automatically using the Ensembl gene annotation pipeline (Flicek et al. 2012). Protein sequences from UniProt (Apweiler et al. 2012), restricted to “protein existence” levels 1 and 2, were used as input, along with RefSeq sequences. Untranslated regions (UTRs) were added using cDNA sequences from the European Nucleotide Archive (ENA) (Cochrane et al. 2011). Long intergenic noncoding RNA (lincRNA) genes were annotated using a combination of cDNA sequences and regulatory data from the Ensembl project. Short noncoding RNAs were annotated using the Ensembl ncRNA pipelines, with data from miRBase (Griffiths-Jones 2010) and Rfam (Gardner et al. 2011) as input.
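
The restriction of UniProt input to protein existence levels 1 and 2 (evidence at protein level and evidence at transcript level, respectively) can be illustrated with a minimal sketch. This is not the Ensembl pipeline's code. It assumes the input is a UniProt-style FASTA file whose headers carry a `PE=<level>` field. The function names `parse_fasta` and `filter_by_protein_existence` are hypothetical, as are the accessions in the usage note.

```python
import re
from typing import Iterable, Iterator, Tuple

# UniProt FASTA headers carry the protein existence level as "PE=<1-5>".
PE_PATTERN = re.compile(r"\bPE=(\d)\b")


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_protein_existence(
    records: Iterable[Tuple[str, str]],
    allowed_levels: frozenset = frozenset({1, 2}),
) -> Iterator[Tuple[str, str]]:
    """Keep records whose header declares an allowed PE level.

    Records without a PE field are dropped (an assumption of this sketch).
    """
    for header, sequence in records:
        match = PE_PATTERN.search(header)
        if match and int(match.group(1)) in allowed_levels:
            yield header, sequence
```

A possible usage, with invented records for illustration only:

```python
fasta_lines = [
    ">sp|X00001|EX1_HUMAN Example protein one OS=Homo sapiens OX=9606 GN=EX1 PE=1 SV=1",
    "MKT",
    "AAV",
    ">sp|X00002|EX2_HUMAN Example protein two OS=Homo sapiens OX=9606 PE=4 SV=1",
    "MGG",
]
kept = list(filter_by_protein_existence(parse_fasta(fasta_lines)))
# Only the PE=1 record remains; the PE=4 record is filtered out.
```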