Chunk #4 — INTRODUCTION

Source: The Molecular Signatures Database (MSigDB) hallmark gene set collection.
Embedded: yes
Text

Redundancy can take different forms. In the simplest case, gene sets share a large proportion of their constituent genes. A more subtle form of redundancy occurs when gene sets overlap only partially but their annotations refer to the same or similar biological processes. In the latter case, the gene sets may actually represent partial transcriptional readouts of the same processes, and in both cases the sets may attain similar GSEA enrichment scores. As a consequence of this redundancy, gene set enrichment analysis can produce long lists of statistically significant results in which essentially the same biological process occurs multiple times. Moreover, many high-scoring but overlapping or redundant gene sets can dominate the top of a result set and effectively hide other potentially relevant hits further down the list. In this scenario, one can easily overlook important and relevant findings and thus fail to realize the full potential of GSEA. In addition, the overrepresentation of a biological process at the top of a gene set list can skew the tail of the observed distribution of enrichment scores, thereby increasing the significance of top-scoring gene sets that represent the same signal.
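
The first form of redundancy described above, gene sets sharing a large proportion of their genes, can be illustrated with a simple pairwise overlap measure. The following is a minimal sketch, not a method taken from the source: it assumes the Jaccard index as the overlap measure, an arbitrary cutoff of 0.5, and toy gene sets whose names (`SET_A`, `G1`, ...) are hypothetical placeholders. It does not capture the second, annotation-level form of redundancy.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index |A & B| / |A | B|; defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def redundant_pairs(gene_sets, threshold=0.5):
    """Return (name_a, name_b, score) for every pair whose Jaccard index
    meets the threshold, sorted from highest to lowest overlap.

    The threshold is an illustrative assumption, not a recommended value.
    """
    pairs = []
    for name_a, name_b in combinations(sorted(gene_sets), 2):
        score = jaccard(gene_sets[name_a], gene_sets[name_b])
        if score >= threshold:
            pairs.append((name_a, name_b, score))
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Toy data with hypothetical set names and gene identifiers.
toy_sets = {
    "SET_A": {"G1", "G2", "G3", "G4"},
    "SET_B": {"G2", "G3", "G4", "G5"},
    "SET_C": {"G6", "G7"},
}

if __name__ == "__main__":
    for name_a, name_b, score in redundant_pairs(toy_sets):
        print(f"{name_a} ~ {name_b}: Jaccard = {score:.2f}")
```

In this toy example, `SET_A` and `SET_B` share three of five distinct genes and are flagged as overlapping, while `SET_C` is disjoint from both. Two sets flagged this way would be expected to rise and fall together in an enrichment analysis, which is the crowding effect the paragraph describes.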