Chunk #42 — Methods — MVP dataset

Source: Multi-ancestry study of the genetics of problematic alcohol use in over 1 million individuals.
Embedded: yes

Text

MVP genotype data were processed by the MVP release 4 (R4) data team. A total of 729,324 samples were genotyped using an Affymetrix Axiom biobank array. Rigorous sample-level quality control (QC) removed samples with duplicates, call rates <98.5%, sex mismatches, >7 relatives or excess heterozygosity. After sample-level QC, the MVP R4 data contained 658,582 participants and 667,995 variants (pre-imputation). Pre-imputation QC removed variants that had high missingness (>1.5%), were monomorphic or had a Hardy–Weinberg equilibrium (HWE) P value of ≤1 × 10⁻⁶, leaving 590,511 variants for imputation. As in our previous work, we ran a principal component analysis (PCA) [61] on the R4 data together with the 1000 Genomes phase 3 reference panels [62]. The Euclidean distance between each MVP participant and the center of each of the five reference ancestral groups was calculated using the first ten principal components (PCs), and each participant was assigned to the nearest reference ancestry. A second round of PCA was then performed within each assigned ancestral group, and outliers with PC scores >6 standard deviations from the mean on any of the ten PCs were removed. This two-stage approach assigned 468,869 EUR, 122,024 AFR, 41,662 LA, 7,364 EAS and 536 SAS individuals for analysis.
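
The two-stage ancestry assignment described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the MVP pipeline's code: it assumes PC scores are already available as arrays with one row per individual, and the function names `assign_ancestry` and `outlier_mask`, the array layout and the use of the sample standard deviation are all assumptions made for the example. In the study the second step is applied to PCs recomputed within each assigned group; here the within-group PCs are simply passed in.

```python
import numpy as np


def assign_ancestry(sample_pcs, ref_pcs, ref_labels, n_pcs=10):
    """Stage 1: assign each sample to the nearest reference-group center.

    sample_pcs : (n_samples, >= n_pcs) PC scores of study participants
    ref_pcs    : (n_ref, >= n_pcs) PC scores of reference individuals
    ref_labels : length n_ref, ancestry label of each reference individual
    (hypothetical layout; all PCs come from one joint PCA)
    """
    sample_pcs = np.asarray(sample_pcs, dtype=float)[:, :n_pcs]
    ref_pcs = np.asarray(ref_pcs, dtype=float)[:, :n_pcs]
    ref_labels = np.asarray(ref_labels)

    # Center of each reference group = mean PC vector of its members.
    groups = np.unique(ref_labels)
    centers = np.stack([ref_pcs[ref_labels == g].mean(axis=0) for g in groups])

    # Euclidean distance from every sample to every center,
    # shape (n_samples, n_groups), via broadcasting.
    dists = np.linalg.norm(sample_pcs[:, None, :] - centers[None, :, :], axis=2)

    # Label of the closest center for each sample.
    return groups[dists.argmin(axis=1)]


def outlier_mask(group_pcs, n_sd=6.0, n_pcs=10):
    """Stage 2: flag individuals more than n_sd SDs from the mean on any PC.

    group_pcs : (n_individuals, >= n_pcs) PC scores from a PCA run
                within a single assigned ancestry group
    Returns a boolean array; True marks an outlier to remove.
    """
    pcs = np.asarray(group_pcs, dtype=float)[:, :n_pcs]
    mean = pcs.mean(axis=0)
    sd = pcs.std(axis=0, ddof=1)   # sample SD (an assumption of this sketch)
    sd[sd == 0] = 1.0              # constant PC: deviations are all zero
    z = np.abs(pcs - mean) / sd
    return (z > n_sd).any(axis=1)
```

In use, `assign_ancestry` would be called once on the joint PCA output, and `outlier_mask` once per assigned group, keeping the rows where the mask is `False`. For very large cohorts the full distance matrix can be computed in batches of samples to limit memory.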