
Chunk #96 — Online Methods — 10. Clustering of DNaseI-accessible regulatory regions to identify modules of coordinated activity

Source
Integrative analysis of 111 reference human epigenomes.

Text

The resulting binary matrices are then clustered using a variation of a k-centroid clustering algorithm (ref. 110). Instead of Euclidean distance, we use a Jaccard-index-based distance, which allows highly cell-type-restricted regions to be clustered correctly. Computationally, we optimized the method both to handle the size of the data matrices and to exploit their sparsity, so that distances can be computed and updated efficiently for matrices with sizes on the order of 10^6 × 10^3. The algorithm was further modified to converge when fewer than 0.01% of cluster assignments change between iterations.
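
The procedure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions made for illustration only: each region is stored as a set of the samples in which it is accessible (a simple sparse representation), a centroid is taken to be the per-sample fraction of member regions that are accessible, and the distance is the weighted (min/max) generalization of the Jaccard distance between a binary row and such a real-valued centroid. The function names, the `tol` parameter (1e-4, i.e. 0.01%), and the random initialization are likewise hypothetical.

```python
import random
from collections import defaultdict


def jaccard_distance(row, centroid, centroid_sum):
    """Weighted Jaccard distance between a binary row (set of active columns)
    and a real-valued centroid (dict: column -> weight in [0, 1]).
    Only the nonzero entries of the row are visited, which exploits sparsity."""
    overlap = sum(centroid.get(j, 0.0) for j in row)  # sum of min(x, c)
    union = len(row) + centroid_sum - overlap         # sum of max(x, c)
    return 1.0 - overlap / union if union > 0 else 0.0


def update_centroids(rows, labels, previous):
    """Assumed centroid rule: per-column fraction of member rows that are active.
    A cluster that lost all members keeps its previous centroid."""
    k = len(previous)
    counts = [defaultdict(int) for _ in range(k)]
    sizes = [0] * k
    for row, label in zip(rows, labels):
        sizes[label] += 1
        for j in row:
            counts[label][j] += 1
    return [
        {j: n / sizes[c] for j, n in counts[c].items()} if sizes[c] else previous[c]
        for c in range(k)
    ]


def k_centroids_jaccard(rows, k, tol=1e-4, max_iter=100, seed=0):
    """Cluster binary rows; stop when fewer than `tol` (0.01%) of the
    cluster assignments change between two consecutive iterations."""
    rng = random.Random(seed)
    # Initialize centroids from k randomly chosen rows (an assumption).
    centroids = [{j: 1.0 for j in rows[i]} for i in rng.sample(range(len(rows)), k)]
    labels = [-1] * len(rows)
    for _ in range(max_iter):
        sums = [sum(c.values()) for c in centroids]
        new_labels = [
            min(range(k), key=lambda c: jaccard_distance(row, centroids[c], sums[c]))
            for row in rows
        ]
        changed = sum(a != b for a, b in zip(labels, new_labels))
        labels = new_labels
        if changed < tol * len(rows):
            break
        centroids = update_centroids(rows, labels, centroids)
    return labels


# Toy input: three regions active in samples 0-2, three in samples 10-12.
regions = [{0, 1, 2}, {0, 1}, {1, 2}, {10, 11, 12}, {10, 11}, {11, 12}]
labels = k_centroids_jaccard(regions, k=2)
```

On the toy input, the two groups of regions share no samples, so they end up in different clusters. The reason for preferring a Jaccard-type distance here is that it ignores positions where both vectors are zero: a region accessible in only a handful of samples is compared on those samples alone, whereas under Euclidean distance all very sparse rows look close to each other and to the all-zero vector.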