We used RefSeq transcripts for the human genome (version hg19) from the UCSC Genome Database (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz; downloaded 14 January 2013). Of the 44,140 transcripts, we retained only the 34,475 that were clearly protein-coding (i.e., having a RefSeq ID with the NM prefix) and located on chromosomes 1–22, X, or Y. To construct a non-redundant set with a single reference transcript per gene, we grouped into one locus all transcripts on the same strand whose entire genomic spans (exons and introns, in hg19 coordinates) overlapped by at least 1 bp, and we randomly selected one transcript per locus. These filtering steps yielded 18,789 non-redundant, protein-coding representative transcripts conforming to our one-transcript-per-gene data structure (Dataset S1a).
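
The filtering steps described above can be illustrated with a minimal Python sketch. It is not the code used to produce Dataset S1a, and it rests on several assumptions: chromosome names follow the UCSC style (`chr1`, ..., `chrX`, `chrY`); overlapping transcripts are merged transitively, so that a chain of pairwise overlaps forms one locus; and coordinates are treated as zero-based and half-open, as in UCSC tables. The function names, the `seed` parameter, and the `CANONICAL` set are hypothetical and introduced only for illustration.

```python
import random
from collections import defaultdict

# Assumption: canonical chromosomes use UCSC-style names (chr1..chr22, chrX, chrY).
CANONICAL = {f"chr{c}" for c in list(range(1, 23)) + ["X", "Y"]}


def parse_refgene_line(line):
    """Return (name, chrom, strand, txStart, txEnd) from one refGene.txt row."""
    fields = line.rstrip("\n").split("\t")
    # refGene.txt columns begin: bin, name, chrom, strand, txStart, txEnd, ...
    return fields[1], fields[2], fields[3], int(fields[4]), int(fields[5])


def select_representatives(transcripts, seed=0):
    """Pick one random NM_ transcript per same-strand overlapping locus.

    `transcripts` is an iterable of (name, chrom, strand, start, end) tuples.
    `seed` is a hypothetical parameter that makes the random choice repeatable.
    """
    rng = random.Random(seed)

    # Keep protein-coding (NM_) transcripts on canonical chromosomes,
    # grouped by chromosome and strand.
    by_strand = defaultdict(list)
    for name, chrom, strand, start, end in transcripts:
        if name.startswith("NM_") and chrom in CANONICAL:
            by_strand[(chrom, strand)].append((start, end, name))

    chosen = []
    for key in sorted(by_strand):
        locus, locus_end = [], 0
        # Sweep transcripts in order of start position.
        for start, end, name in sorted(by_strand[key]):
            # With half-open coordinates, start >= locus_end means the
            # transcript shares no base with the current locus, so close it.
            if locus and start >= locus_end:
                chosen.append(rng.choice(locus))
                locus, locus_end = [], 0
            locus.append(name)
            locus_end = max(locus_end, end)
        if locus:
            chosen.append(rng.choice(locus))
    return chosen
```

In this sketch, the rows of the decompressed refGene.txt file would be passed through `parse_refgene_line` and the resulting tuples handed to `select_representatives`. Because the selection is random, the chosen transcripts depend on the seed, and the exact locus count may differ from the published figure if the original grouping rule was not transitive.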