Chunk #9 — Methods — Genotype data processing and imputation

Source: Gene-based polygenic risk scores analysis of alcohol use disorder in African Americans.
Embedded: yes

Text

Detailed information about COGA, SAGE, and YalePenn data processing has been reported previously [34–36]. Briefly, all data were combined, and a common set of high-quality (minor allele frequency (MAF) > 10%, missing rate < 2%, Hardy-Weinberg equilibrium (HWE) P-value > 0.001) and independent (defined as R² < 0.5) variants (N = 24,135) was used to identify duplicate samples across the target datasets and to confirm the reported family structures using PLINK [37, 38]; family structures were updated as needed. The same set of common variants was also used to estimate the principal components (PCs) of population stratification using Eigenstrat [39], with 1000 Genomes data (Phase 3, version 5, NCBI GRCh37) as the reference panel. These PCs were then used to identify AA samples (first PC between -0.0043 and 0.0115 and second PC between -0.0035 and 0.0059). Because the target datasets were genotyped on different arrays, each was imputed separately to the 1000 Genomes reference, using SHAPEIT2 [40] for phasing followed by Minimac3 [41] for imputation. Before imputation, variants with A/T or C/G alleles, a missing rate > 5%, MAF < 3%, or an HWE P-value < 0.0001 were excluded. Imputed
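
As a rough illustration of the pre-imputation exclusions and the PC window for AA samples described above, the following Python sketch applies the stated thresholds to per-variant summary statistics and per-sample PC values. It is not the authors' pipeline: the `Variant` record, the function names, and the choice to treat the PC bounds as inclusive are assumptions made for this example, and the summary statistics are assumed to have been computed elsewhere.

```python
from dataclasses import dataclass

# Thresholds quoted from the text; everything else here is illustrative.
MAX_MISSING_RATE = 0.05      # exclude if missing rate > 5%
MIN_MAF = 0.03               # exclude if MAF < 3%
MIN_HWE_P = 1e-4             # exclude if HWE P-value < 0.0001
PC1_RANGE = (-0.0043, 0.0115)
PC2_RANGE = (-0.0035, 0.0059)

# Allele pairs that read the same on both strands (A/T and C/G).
STRAND_AMBIGUOUS = {frozenset("AT"), frozenset("CG")}


@dataclass(frozen=True)
class Variant:
    """Hypothetical per-variant summary record (not a real file format)."""
    vid: str
    allele1: str
    allele2: str
    maf: float
    missing_rate: float
    hwe_p: float


def is_strand_ambiguous(v: Variant) -> bool:
    """True for A/T or C/G variants, whose strand cannot be resolved from alleles."""
    return frozenset((v.allele1.upper(), v.allele2.upper())) in STRAND_AMBIGUOUS


def keep_for_imputation(v: Variant) -> bool:
    """Apply the four pre-imputation exclusion criteria listed in the text."""
    excluded = (
        is_strand_ambiguous(v)
        or v.missing_rate > MAX_MISSING_RATE
        or v.maf < MIN_MAF
        or v.hwe_p < MIN_HWE_P
    )
    return not excluded


def in_aa_pc_window(pc1: float, pc2: float) -> bool:
    """Check the PC1/PC2 window; inclusive bounds are an assumption."""
    return (PC1_RANGE[0] <= pc1 <= PC1_RANGE[1]
            and PC2_RANGE[0] <= pc2 <= PC2_RANGE[1])


if __name__ == "__main__":
    variants = [
        Variant("var1", "A", "G", maf=0.20, missing_rate=0.01, hwe_p=0.50),
        Variant("var2", "A", "T", maf=0.20, missing_rate=0.01, hwe_p=0.50),
        Variant("var3", "C", "T", maf=0.02, missing_rate=0.01, hwe_p=0.50),
    ]
    kept = [v.vid for v in variants if keep_for_imputation(v)]
    print("kept for imputation:", kept)
    print("sample in AA window:", in_aa_pc_window(0.001, 0.002))
```

The checks are kept as plain predicates so that each criterion can be audited against the thresholds in the text; in practice such filters are usually applied with a dedicated genetics toolkit rather than hand-written code.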