ToppGene first generates a representative profile of the training genes using as many as 14 features and identifies the overrepresented terms among the training genes. This first step is performed with ToppFun (see previous section). The test set genes are then compared with this representative profile of the training set, that is, the overrepresented terms from the training genes for all categorical annotations and the average vector for the expression values (Figure 1). For each test gene, a similarity score to the training profile is derived for each of the 14 features, so that the gene is summarized by 14 similarity scores. In the case of a missing value (for instance, lack of one or more annotations for a test gene), the score is set to −1. Otherwise, it is a real value in [0, 1]. Different methods are used to measure similarity for categorical (e.g. GO annotations) and numeric (i.e. gene expression) annotations. While a fuzzy-based similarity measure is applied for categorical terms [see Popescu et al. (30) for additional details], for numeric annotation, i.e. the microarray expression values, the