Chunk #16 — INTRODUCTION — A technical look into the Genetics Portal

Source: Open Targets Genetics: systematic identification of trait-associated genes using large-scale genetics and functional genomics.
Embedded: yes

Text

Retrieving, processing, analysing and presenting the large amount of biological data in the Genetics Portal poses several challenges. Most of the data currently in the Portal is public information that we downloaded from the respective resources and analysed on Google Cloud Platform (GCP). Other datasets, such as the UK Biobank LD reference panel, have more restrictive access conditions and were therefore analysed locally. Storing and processing hundreds of terabytes of raw data has also required technical solutions for large-scale data manipulation. Other challenges are algorithmic. For example, the full cross-trait and QTL colocalisation analysis currently takes about 4 weeks, conducting 2,035,470 successful comparisons on a compute cluster using 60 CPU cores. Since computing pairwise similarities is an O(N²) problem, we are looking into new methods that might alleviate the current constraints. Finally, given the speed at which population genomics data is currently generated, maintaining a set of reference datasets (e.g. LD panels, variant indexes) requires keeping the infrastructure up to date.
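
To make the scaling argument concrete, the following minimal Python sketch does two pieces of back-of-the-envelope arithmetic. It is illustrative only and is not the Portal's pipeline. The function names are hypothetical, and the cost estimate assumes that all 60 cores were fully busy for the whole 4 weeks and that only the 2,035,470 successful comparisons consumed time, so it should be read as a rough upper bound per comparison.

```python
from math import comb

SECONDS_PER_WEEK = 7 * 24 * 60 * 60


def pairwise_comparisons(n: int) -> int:
    """Number of unordered pairs among n signals: n * (n - 1) / 2."""
    return comb(n, 2)


def core_seconds_per_comparison(weeks: float, cores: int, comparisons: int) -> float:
    """Average core-seconds per comparison, assuming full core utilisation."""
    return weeks * SECONDS_PER_WEEK * cores / comparisons


# Figures quoted in the text: about 4 weeks, 60 CPU cores, 2,035,470 comparisons.
cost = core_seconds_per_comparison(weeks=4, cores=60, comparisons=2_035_470)
print(f"about {cost:.0f} core-seconds per comparison")

# Doubling the number of signals roughly quadruples the all-pairs workload.
for n in (1_000, 2_000, 4_000):
    print(f"{n:>5} signals: {pairwise_comparisons(n):>9,} pairs")
```

Under these assumptions each comparison costs on the order of a minute of CPU time, and the all-pairs count grows about fourfold each time the number of signals doubles, which is why adding more cores alone does not keep pace with data growth.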