Chunk #58 — Methods — PhyloCSF analysis

Source: GENCODE: the reference human genome annotation for The ENCODE Project.
Embedded: yes

Text

We used PhyloCSF (Lin et al. 2011) to identify potential novel coding genes in RNA-seq transcript models based on evolutionary signatures. For each transcript model generated from the Illumina HBM data using either Exonerate or Scripture, we generated a mammalian alignment by extracting the alignment of each exon from UCSC's vertebrate alignments (which include 33 placental mammals) and “stitching” the exon alignments together. We then ran PhyloCSF on each transcript alignment using the settings “-f 6 --orf StopStop3 --bls”, which cause the program to evaluate all ORFs in six frames and report the best-scoring one. The “--bls” setting causes the program to additionally report a branch length score (BLS), which measures the alignment coverage of the best-scoring region as the percentage of the neutral branch length of the 33 mammals that is actually present in the alignment (averaged across the individual nucleotide columns). We selected transcripts containing a region with a PhyloCSF score of at least 60 (corresponding to a 1,000,000:1 likelihood ratio in favor of PhyloCSF's coding model) and a BLS of at least 25% for manual examination by an annotator.
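The selection step can be illustrated with a minimal Python sketch. It assumes the PhyloCSF score is expressed in decibans, that is, 10·log10 of the likelihood ratio, which is consistent with a score of 60 corresponding to a ratio of 1,000,000:1. The `Candidate` record, its field names, and the function names are hypothetical and are not part of PhyloCSF or of the pipeline used here; only the two thresholds come from the text.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Thresholds stated in the text.
MIN_SCORE = 60.0  # PhyloCSF score (assumed to be in decibans)
MIN_BLS = 25.0    # branch length score, in percent


@dataclass
class Candidate:
    """Hypothetical record for the best-scoring region of one transcript."""
    transcript_id: str
    score: float  # PhyloCSF score of the best-scoring ORF
    bls: float    # branch length score of that region, in percent


def likelihood_ratio(score: float) -> float:
    """Convert a deciban score to a likelihood ratio (coding vs. non-coding)."""
    return 10.0 ** (score / 10.0)


def select_for_review(candidates: Iterable[Candidate]) -> List[Candidate]:
    """Keep candidates that meet both thresholds; these go to an annotator."""
    return [
        c for c in candidates
        if c.score >= MIN_SCORE and c.bls >= MIN_BLS
    ]
```

Under this assumption, `likelihood_ratio(60.0)` evaluates to roughly one million, matching the ratio quoted above. A candidate that clears the score threshold but has a BLS below 25% is excluded, because both conditions must hold.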