Chunk #33 — Experimental Procedures — Hallmark generation methodology — Step 1: Identify groups of similar gene sets using consensus clustering

Source: The Molecular Signatures Database (MSigDB) hallmark gene set collection.
Embedded: yes

Text

We first clustered all the gene sets according to the overlap of their member genes, regardless of their annotations. We used consensus clustering (Monti et al., 2003) with bootstrap resampling to allow a more robust determination of cluster stability across multiple values of k, the number of clusters. To find the optimal number of clusters, we inspected the cophenetic coefficient as a function of k and searched for a peak value indicating the most stable partition (Brunet et al., 2004). We avoided solutions with high values of k, which produce higher values of the cophenetic coefficient but potentially overfit and leave only a small number of gene sets in each cluster. The extreme case of this behavior occurs when the number of clusters equals the number of items, at which point the fit becomes perfect.
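
The procedure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the Jaccard distance as the overlap measure, average-linkage hierarchical clustering as the base algorithm, the number of resamples, the handling of duplicate bootstrap draws, and the toy data are all assumptions made for illustration, and the function names (`consensus_matrix`, `cophenetic_coefficient`, `toy_membership`) are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, fcluster, linkage
from scipy.spatial.distance import pdist, squareform


def consensus_matrix(dist, k, n_resamples=50, seed=0):
    """Fraction of resamples in which each pair of items shares a cluster."""
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    together = np.zeros((n, n))  # times a pair was clustered together
    sampled = np.zeros((n, n))   # times a pair was drawn in the same resample
    for _ in range(n_resamples):
        # Bootstrap draw with replacement; duplicates are dropped (assumption).
        idx = np.unique(rng.integers(0, n, size=n))
        if idx.size < 2:
            continue
        sub = squareform(dist[np.ix_(idx, idx)], checks=False)
        labels = fcluster(linkage(sub, method="average"), t=k, criterion="maxclust")
        together[np.ix_(idx, idx)] += labels[:, None] == labels[None, :]
        sampled[np.ix_(idx, idx)] += 1
    cons = np.divide(together, sampled, out=np.zeros((n, n)), where=sampled > 0)
    np.fill_diagonal(cons, 1.0)
    return cons


def cophenetic_coefficient(cons):
    """Cophenetic correlation of the consensus matrix, treated as a similarity."""
    d = squareform(1.0 - cons, checks=False)  # condensed distances
    corr, _ = cophenet(linkage(d, method="average"), d)
    return float(corr)


def toy_membership(n_groups=3, sets_per_group=6, genes_per_group=20, seed=0):
    """Synthetic gene-set x gene boolean matrix with planted groups (toy data)."""
    rng = np.random.default_rng(seed)
    n_genes = n_groups * genes_per_group
    rows = []
    for g in range(n_groups):
        p = np.full(n_genes, 0.02)  # background inclusion probability
        p[g * genes_per_group:(g + 1) * genes_per_group] = 0.9  # group's core genes
        rows.append(rng.random((sets_per_group, n_genes)) < p)
    return np.vstack(rows)


membership = toy_membership()
# Pairwise distance between gene sets based on member-gene overlap.
dist = squareform(pdist(membership, metric="jaccard"))
coefs = {k: cophenetic_coefficient(consensus_matrix(dist, k)) for k in range(2, 7)}
for k, c in coefs.items():
    print(k, round(c, 3))
```

The printed values give the cophenetic coefficient as a function of k. In line with the text, the choice of k is left to inspection of this curve rather than an automatic maximum, since the coefficient can rise again at large k, where clusters hold only a few gene sets each.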