STEP 2: Establish commonality between your target dataset and the discovery GWAS.
PRS can tolerate differences in SNP content between the discovery and target samples, but it is preferable to begin with imputed data in both datasets to maximize overlap. First, SNPs in the target dataset that overlap with SNPs in the discovery GWAS are extracted. Second, the target data are aligned to the discovery dataset (i.e., individual genotypes are oriented to the same strand of DNA, and strand-ambiguous SNPs, A/T or G/C, are either excluded or closely evaluated). These steps are critical to ensure that effect sizes from the discovery GWAS are applied accurately to the target sample. Typically, sex chromosomes are also removed. Standard quality control filters should also be applied to the target dataset, including minor allele frequency cutoffs, Hardy-Weinberg equilibrium testing, exclusion of individuals and markers with high missingness, exclusion of cryptically related individuals, sex checks, removal of ancestry outliers, and imputation quality thresholds.
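The overlap and strand-alignment steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a replacement for established tooling: the function name `harmonize`, the dictionary layout, and the SNP IDs are all hypothetical, and the sketch takes the conservative option of excluding strand-ambiguous SNPs rather than evaluating them (for example, by comparing allele frequencies).

```python
# Illustrative sketch only: function name, data layout, and SNP IDs are
# hypothetical and do not come from the source text.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def harmonize(discovery, target):
    """Align target SNP alleles to the strand used by the discovery GWAS.

    discovery: dict mapping SNP ID -> (effect_allele, other_allele)
    target:    dict mapping SNP ID -> (allele1, allele2)

    Returns (kept, dropped). `kept` maps SNP ID -> the target alleles
    expressed on the discovery strand; `dropped` maps SNP ID -> reason.
    SNPs missing from the discovery GWAS are left out of both (step 1).
    """
    kept, dropped = {}, {}
    for snp, (a1, a2) in target.items():
        if snp not in discovery:
            continue  # no overlap with the discovery GWAS
        ea, oa = discovery[snp]
        if any(x not in COMPLEMENT for x in (a1, a2, ea, oa)):
            dropped[snp] = "non-ACGT allele"  # e.g. indels
        elif COMPLEMENT[ea] == oa:
            dropped[snp] = "strand-ambiguous"  # A/T or G/C
        elif {a1, a2} == {ea, oa}:
            kept[snp] = (a1, a2)  # already on the same strand
        elif {COMPLEMENT[a1], COMPLEMENT[a2]} == {ea, oa}:
            # Opposite strand: flip both alleles to their complements.
            kept[snp] = (COMPLEMENT[a1], COMPLEMENT[a2])
        else:
            dropped[snp] = "allele mismatch"
    return kept, dropped


# Toy example with made-up SNPs.
discovery = {"rs1": ("A", "G"), "rs2": ("A", "T"),
             "rs3": ("C", "T"), "rs4": ("A", "G")}
target = {"rs1": ("A", "G"), "rs2": ("T", "A"), "rs3": ("G", "A"),
          "rs4": ("A", "C"), "rs5": ("C", "T")}
kept, dropped = harmonize(discovery, target)
```

In this toy example, rs1 is kept unchanged, rs3 is flipped to the discovery strand, rs2 is excluded as strand-ambiguous, rs4 is excluded because its alleles cannot be reconciled, and rs5 is ignored because it is absent from the discovery GWAS. In practice, matching is usually done on chromosome and position as well as SNP ID, which this sketch omits for brevity.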