Chunk #7 — IMPROVED DATA REPRESENTATION AND ANNOTATION

Source: The new NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog).
Embedded: yes

Text

To improve the quality and accuracy of the SNP and associated genomic data in the Catalog, we have redeveloped the variant mapping pipeline. The new pipeline accesses Ensembl's REST API (https://rest.ensembl.org/) (18), enabling three live checks within the curation system: validation of the SNP ID, validation of the reported gene ID, and confirmation that the SNP and the reported gene are on the same chromosome. Because errors are reported and corrected immediately, the data entering the Catalog are more accurate and the need for post-hoc curation is reduced. The new pipeline has also increased the proportion of variants that map to the genome from 92% to 96%, improving the completeness of genetic location, mapped gene and cytogenetic data. In future, the flexibility of this pipeline will allow integration of additional information from Ensembl to improve functional annotation, for example all genes within a specified genomic region from both the RefSeq (19) and GENCODE (20) gene sets. These future enhancements are supported by the redesigned database, whose model now captures the mapping of multiple genes to a single variant and the distance to each gene. In addition, the new pipeline is used to update the current dataset to the most recent genome build.
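The three live checks and the variant-to-gene distance described above can be illustrated with a minimal Python sketch. It is not the Catalog's actual pipeline code, which is not described at this level of detail. The function names (`fetch_json`, `check_association`, `distance_to_gene`, `validate_live`) are hypothetical. The sketch assumes the Ensembl REST endpoints `/variation/human/<rsID>` and `/lookup/symbol/homo_sapiens/<symbol>`, and assumes that their JSON responses carry a `mappings` list and `seq_region_name`, `start` and `end` fields.

```python
import json
import urllib.error
import urllib.request
from typing import List, Optional

ENSEMBL_REST = "https://rest.ensembl.org"


def fetch_json(path: str) -> Optional[dict]:
    """GET a JSON document from the Ensembl REST API; return None on an HTTP error."""
    request = urllib.request.Request(
        ENSEMBL_REST + path, headers={"Content-Type": "application/json"}
    )
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return json.load(response)
    except urllib.error.HTTPError:
        # Treated here as "identifier not recognised" (an assumption of this sketch).
        return None


def check_association(variant: Optional[dict], gene: Optional[dict]) -> List[str]:
    """Return curation errors for one SNP / reported-gene pair (empty list if none)."""
    errors = []
    if variant is None:
        errors.append("SNP ID not found")
    if gene is None:
        errors.append("reported gene not found")
    if variant is not None and gene is not None:
        # Assumed response layout: each mapping names the chromosome it lies on.
        snp_chromosomes = {m["seq_region_name"] for m in variant.get("mappings", [])}
        if gene["seq_region_name"] not in snp_chromosomes:
            errors.append("SNP and reported gene are on different chromosomes")
    return errors


def distance_to_gene(position: int, gene: dict) -> int:
    """Distance in base pairs from a variant position to a gene; 0 if inside the gene."""
    if gene["start"] <= position <= gene["end"]:
        return 0
    return min(abs(position - gene["start"]), abs(position - gene["end"]))


def validate_live(rsid: str, symbol: str) -> List[str]:
    """Run the checks against the live API (requires network access)."""
    variant = fetch_json("/variation/human/" + rsid)
    gene = fetch_json("/lookup/symbol/homo_sapiens/" + symbol)
    return check_association(variant, gene)
```

In this sketch the network call is kept separate from the checks, so `check_association` and `distance_to_gene` can be exercised on stored responses; calling `validate_live("rs7329174", "ELF1")`, for instance, would run the same checks against the live service.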