Chunk #6 — ENHANCEMENTS AND UPDATES — New gene set libraries — Differentially expressed genes after drug, gene, disease, ligand and pathogen perturbations extracted from GEO by the crowd
To ensure the quality of these crowd-generated gene set libraries, we performed both automatic and manual sanitization. We first programmatically re-processed every submitted entry, using the metadata supplied by the participants to recompute the differentially expressed gene sets with the Characteristic Direction method (31). Entries in which the selected samples did not belong to the stated study were automatically filtered out, as were entries with invalid gene symbols or mismatched organisms. All entries from curators for whom more than 10% of submissions were invalid were removed entirely. Entries that passed these filters were randomly sampled for manual inspection to confirm that the metadata were accurate (for example, that the genes reported as perturbed were in fact perturbed in the study) and that the control and perturbation samples were correctly selected. As a result, approximately 20% of the submitted entries were removed for each microtask.
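The automatic filtering steps described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `Entry` record, its field names, the `known_symbols` set, and the functions `is_valid`, `sanitize`, and `sample_for_review` are all hypothetical names introduced here for illustration. The recomputation of gene sets with the Characteristic Direction method is omitted. The sketch also assumes that a curator is dropped only when the invalid fraction strictly exceeds 10%.

```python
import random
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical record for one crowd-submitted entry."""
    curator: str
    study_samples: frozenset      # samples that belong to the cited study
    selected_samples: frozenset   # samples the curator selected
    gene_symbol: str              # e.g. the gene reported as perturbed
    study_organism: str
    submitted_organism: str


def is_valid(entry, known_symbols):
    """Apply the three automatic checks: sample membership, symbol, organism."""
    return (
        entry.selected_samples <= entry.study_samples
        and entry.gene_symbol in known_symbols
        and entry.study_organism == entry.submitted_organism
    )


def sanitize(entries, known_symbols, max_invalid_fraction=0.10):
    """Drop invalid entries, plus every entry from curators whose
    invalid fraction exceeds max_invalid_fraction (assumed strict)."""
    total = Counter(e.curator for e in entries)
    invalid = Counter(
        e.curator for e in entries if not is_valid(e, known_symbols)
    )
    trusted = {
        c for c in total if invalid[c] / total[c] <= max_invalid_fraction
    }
    return [
        e for e in entries
        if e.curator in trusted and is_valid(e, known_symbols)
    ]


def sample_for_review(entries, k, seed=0):
    """Draw up to k surviving entries at random for manual inspection."""
    rng = random.Random(seed)
    return rng.sample(entries, min(k, len(entries)))
```

In this sketch, the per-curator invalid fraction is computed before any entry is discarded, so a curator's unreliable submissions count against them even though those entries would also fail the per-entry checks on their own.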