Chunk #3 — Online Methods — Genotype and Phenotype Data

Source: Efficient multivariate linear mixed model algorithms for genome-wide association studies.
Embedded: yes

Text

The NFBC1966 data contains 5402 individuals with multiple metabolic traits measured and 364,590 SNPs typed. We selected four phenotypes (high-density lipoprotein, HDL; low-density lipoprotein, LDL; triglycerides, TG; C-reactive protein, CRP) among them, following previous studies3. We selected individuals and SNPs following previous studies11,32 with the software PLINK33. Specifically, we excluded individuals with missing phenotypes for any of these four phenotypes or having discrepancies between reported sex and sex determined from the X chromosome. We excluded SNPs with a minor allele frequency less than 1%, having missing values in more than 1% of the individuals, or with a Hardy-Weinberg equilibrium p value below 0.0001. This left us with 5,255 individuals and 319,111 SNPs. For each phenotype, we quantile transformed the phenotypic values to a standard normal distribution, regressed out sex, oral contraceptives and pregnancy status effects32, and quantile transformed the residuals to a standard normal distribution again. We replaced the missing genotypes for a given SNP with its mean genotype value. We used the product of centered and scaled genotype matrix as an estimate of relatedness11,17,34,35.