Chunk #34 — Experimental Procedures — Hallmark generation methodology — Step 1: Identify groups of similar gene sets using consensus clustering

Source: The Molecular Signatures Database (MSigDB) hallmark gene set collection.
Embedded: yes

Text

The input dataset to this procedure consisted of 8,380 gene sets from MSigDB v4.0 (collections C1 through C6), each containing between 5 and 1,994 genes (features). We included the C1 collection, which contains genes grouped by cytogenetic band, because these bands often indicate regions of similar chromatin structure or regions affected by oncogenic copy number alterations; either could result in co-regulation and may be important in development and cancer-related datasets. We used agglomerative hierarchical clustering with average linkage as implemented in the fastcluster R package (Müllner, 2013). As the clustering distance metric we used the Jaccard distance (Jaccard, 1902; Levandowsky and Winter, 1971). For two gene sets $S_1$ and $S_2$ the Jaccard distance is:

$$D_{12} = 1 - \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|} \tag{1}$$

where $|S_1 \cap S_2|$ is the number of elements in the intersection of $S_1$ and $S_2$, and $|S_1 \cup S_2|$ is the number of elements in the union of the sets.
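
To make the distance and the linkage rule concrete, the following is a minimal Python sketch, not the procedure's actual implementation (which used the fastcluster R package). It computes the Jaccard distance between gene sets and then runs a naive average-linkage agglomerative clustering. The function names `jaccard_distance` and `average_linkage`, the toy gene sets, and the convention that two empty sets have distance 0 are all assumptions made for illustration. The naive loop is far slower than fastcluster and would not be practical for 8,380 gene sets.

```python
from itertools import combinations


def jaccard_distance(s1, s2):
    """Return 1 - |s1 ∩ s2| / |s1 ∪ s2| for two sets."""
    union = s1 | s2
    if not union:
        return 0.0  # assumption: two empty sets are treated as identical
    return 1.0 - len(s1 & s2) / len(union)


def average_linkage(gene_sets):
    """Naive agglomerative clustering with average linkage.

    gene_sets: dict mapping a gene set name to a set of gene symbols.
    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    where each cluster is a tuple of sorted gene set names.
    """
    # Pairwise Jaccard distances between the original gene sets.
    dist = {
        frozenset((a, b)): jaccard_distance(gene_sets[a], gene_sets[b])
        for a, b in combinations(gene_sets, 2)
    }
    clusters = [(name,) for name in sorted(gene_sets)]
    merges = []

    def linkage(c1, c2):
        # Average linkage: mean distance over all cross-cluster pairs.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Merge the two clusters with the smallest average distance.
        c1, c2 = min(combinations(clusters, 2), key=lambda p: linkage(*p))
        merges.append((c1, c2, linkage(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)]
        clusters.append(tuple(sorted(c1 + c2)))
    return merges


# Hypothetical toy gene sets, not taken from MSigDB.
toy_sets = {
    "SET_A": {"TP53", "MDM2", "CDKN1A", "BAX"},
    "SET_B": {"TP53", "MDM2", "CDKN1A", "ATM"},
    "SET_C": {"MYC", "MAX", "CDK4"},
}
merges = average_linkage(toy_sets)
for cluster_a, cluster_b, d in merges:
    print(cluster_a, cluster_b, round(d, 3))
```

In this toy example SET_A and SET_B share three of five distinct genes, so they merge first; SET_C shares no genes with either and joins last at the maximum distance of 1.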