Heart disease and its most common manifestation, CAD, were a focus of our analysis. However, no single phecode was diagnostic of CAD. We therefore defined CAD in BioVU patients using a random forest (machine learning) classifier [29] that integrated data from across the EHR (Supplementary Methods; Supplementary Table 1; Supplementary Figs. 1–3). For cases, age was defined as the age at the first CAD-defining feature in the EHR; for controls, age was defined as the age at the last encounter. Data on CAD risk factors were also extracted from the EHR, either from structured data or via text-mining algorithms (Supplementary Methods). These risk factors were body mass index (BMI), hypertension, type 2 diabetes diagnosis, pre-medication blood levels of high- and low-density lipoprotein cholesterol (HDL and LDL) and of triglycerides [30] (lipid-altering medications are listed in Supplementary Table 2), smoking history, and socio-economic status.
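The age definition for cases and controls can be illustrated with a minimal Python sketch. It is not the study's code: the `PatientRecord` structure, its field names, and the `analysis_age` function are hypothetical and introduced only for illustration, and the case/control label is assumed to have already been assigned by the classifier.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PatientRecord:
    """Hypothetical, simplified view of one patient's EHR timeline."""
    is_case: bool                 # assumed: label already assigned by the CAD classifier
    encounter_ages: List[float]   # age (years) at each EHR encounter
    cad_feature_ages: List[float] = field(default_factory=list)  # age at each CAD-defining feature


def analysis_age(record: PatientRecord) -> Optional[float]:
    """Return the age used in the analysis, or None if it cannot be determined."""
    if record.is_case:
        # Cases: age at the first (earliest) CAD-defining feature in the EHR.
        return min(record.cad_feature_ages) if record.cad_feature_ages else None
    # Controls: age at the last (latest) encounter.
    return max(record.encounter_ages) if record.encounter_ages else None


# Illustrative records with made-up ages (not study data).
case = PatientRecord(
    is_case=True,
    encounter_ages=[48.0, 55.5, 61.2],
    cad_feature_ages=[57.1, 55.5],
)
control = PatientRecord(is_case=False, encounter_ages=[40.3, 62.7, 51.0])

print(analysis_age(case))     # 55.5
print(analysis_age(control))  # 62.7
```

Taking the minimum for cases and the maximum for controls makes the rule independent of the order in which EHR events are stored.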