Chunk #3 — THE UNIPROT DATABASES — The UniProt Reference Clusters (UniRef)

Source: The Universal Protein Resource (UniProt) in 2010.
Embedded: yes

Text

UniRef provides clustered sets of all sequences from UniProtKB (including splice forms as separate entries) and selected records from UniParc to achieve complete coverage of sequence space at identity levels of 100%, 90% and 50% while hiding redundant sequences (11). The UniRef clusters are generated hierarchically: the UniRef100 database combines identical sequences and sub-fragments into a single UniRef entry, UniRef90 is built from UniRef100 clusters and UniRef50 is built from UniRef90 clusters. Each individual member sequence can belong to only one UniRef cluster at each identity level and can have only one parent or child cluster at another identity level. UniRef100, UniRef90 and UniRef50 yield database size reductions of ∼11%, 40% and 72%, respectively. Each cluster record contains source database, protein name and taxonomy information for each member sequence but is represented by a single selected representative protein sequence and name; the number of members and the lowest common taxonomy node are also included. UniRef100 is one of the most comprehensive non-redundant protein sequence datasets available. The reduced size of the UniRef90 and UniRef50 datasets provides faster sequence
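The hierarchical construction described above can be illustrated with a minimal sketch. It is not the UniRef production pipeline: the greedy longest-first assignment, the ungapped `positional_identity` measure, the function names and the toy sequences are all assumptions made for illustration. It only shows how each level is built from the representatives of the level below, so that every sequence ends up in exactly one cluster per identity level.

```python
def positional_identity(a: str, b: str) -> float:
    """Toy ungapped identity (assumption): matching positions / longer length."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(ids, seqs, belongs):
    """Assign each id to the first representative that accepts it.

    Sequences are visited longest first, so the longest member of a
    cluster becomes its representative (an illustrative choice).
    """
    clusters = {}
    for sid in sorted(ids, key=lambda i: (-len(seqs[i]), i)):
        for rep in clusters:
            if belongs(seqs[rep], seqs[sid]):
                clusters[rep].append(sid)
                break
        else:
            clusters[sid] = [sid]  # no representative matched: new cluster
    return clusters


def build_uniref_like(seqs):
    """Build three nested cluster levels from a dict of id -> sequence."""
    # 100% level: identical sequences and sub-fragments share one entry.
    u100 = greedy_cluster(seqs, seqs, lambda rep, s: s in rep)
    # 90% level: cluster only the representatives of the 100% level.
    u90 = greedy_cluster(
        u100, seqs, lambda rep, s: positional_identity(rep, s) >= 0.9
    )
    # 50% level: cluster only the representatives of the 90% level.
    u50 = greedy_cluster(
        u90, seqs, lambda rep, s: positional_identity(rep, s) >= 0.5
    )
    return u100, u90, u50


# Hypothetical toy sequences, not real UniProtKB entries.
toy = {
    "P1": "MKTAYIAKQRQISFVKSHFS",
    "P2": "MKTAYIAKQRQISFVKSHFS",  # identical to P1
    "P3": "AYIAKQRQISF",           # sub-fragment of P1
    "P4": "MKTAYIAKQRQISFVKSHFT",  # 19/20 positions match P1
    "P5": "MKTAYIAKQRQIGGGGGGGG",  # 12/20 positions match P1
    "P6": "WWWWWWWWWWWWWWWWWWWW",  # unrelated
}
u100, u90, u50 = build_uniref_like(toy)
for name, level in (("100", u100), ("90", u90), ("50", u50)):
    print(name, len(level), "clusters:", level)
```

In this toy run the six input sequences collapse to four, three and two clusters at the 100%, 90% and 50% levels. Because each level clusters only the representatives of the previous one, a member such as P4 has a single parent at every higher level, which mirrors the one-parent constraint stated in the text.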