Chunk #12 — SUMMARY STATISTICS IN THE GWAS CATALOG

Source: The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019.
Embedded: yes

Text

The size of summary statistics (SS) files and the types of queries users need to perform present significant challenges for query performance, requiring adjustments to the Catalog informatics infrastructure. We therefore identified a representative set of user queries, for example retrieving P-values and associated fields, P-value plus effect allele frequency, or beta-coefficient and standard error for combinations of variant, trait and study. The existing GWAS Catalog infrastructure uses a relational database for storage and, when tested, did not scale to support the necessary range of queries over billions of data points. We evaluated the performance of several alternatives, including a relational database with a simplified GWAS schema to optimize performance, Cassandra and MongoDB. We found that the best performance and query times were achieved using an HDF5 data library, and that queries over this data library scale to support anticipated data volumes over at least the next five years.
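
The shape of the representative queries can be illustrated with a small in-memory sketch. This is not the Catalog's HDF5 implementation, and it says nothing about performance at scale. It only shows the pattern described above: filter by any combination of variant, trait and study, then return a chosen subset of fields. The class name `Association`, the function `query`, the field names and all data rows are hypothetical and made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Association:
    """One summary-statistics row (hypothetical field names)."""
    variant: str
    trait: str
    study: str
    p_value: float
    effect_allele_frequency: float
    beta: float
    standard_error: float


def query(rows, fields, variant=None, trait=None, study=None):
    """Return the requested fields for rows matching every filter that is set.

    Filters left as None are ignored, so any combination of
    variant, trait and study can be used.
    """
    filters = {"variant": variant, "trait": trait, "study": study}
    active = {key: value for key, value in filters.items() if value is not None}
    return [
        tuple(getattr(row, name) for name in fields)
        for row in rows
        if all(getattr(row, key) == value for key, value in active.items())
    ]


# Made-up rows, for illustration only (not real Catalog data).
ROWS = [
    Association("rs0001", "trait_A", "study_1", 3e-9, 0.21, 0.12, 0.02),
    Association("rs0001", "trait_A", "study_2", 4e-4, 0.19, 0.08, 0.03),
    Association("rs0002", "trait_B", "study_1", 0.5, 0.45, -0.01, 0.02),
]

# P-value plus effect allele frequency for one variant in one study.
p_and_eaf = query(
    ROWS, ("p_value", "effect_allele_frequency"),
    variant="rs0001", study="study_1",
)

# Beta-coefficient and standard error for every row of one trait.
beta_and_se = query(ROWS, ("beta", "standard_error"), trait="trait_A")
```

A linear scan like this is only workable for toy data. The point of the evaluation described above is that, over billions of data points, the choice of storage layer determines whether such filtered lookups return in acceptable time.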