Even when supplemented with external information, epidemiological studies of chronic disease endpoints are of limited informativeness for pathway analysis because the phenotype is dichotomous. The information content may be improved by obtaining biomarker data on some of the intermediate steps in the process. Ideally, biomarker specimens would be sampled longitudinally and before disease onset. Because this may be prohibitively expensive, the two-phase case-control design instead measures biomarkers on a subsample of individuals from a cohort or case-control study, selected on the basis of disease, exposure, and genotype information83. Nested case-control studies within biobanks overcome the problem of reverse causation by using stored specimens and exposure information obtained at enrollment. Mendelian randomization84,85 provides another way to avoid reverse causation by using genes (which are not subject to this problem) as instrumental variables86 for the biomarker–disease relationship. In a randomized trial of estrogen plus progestin, Dai et al.87 used a two-phase design to assess interactions of treatment with thrombosis biomarkers and found that estimates of the interaction effect that exploited the independence of treatment and biomarkers ensured by randomization were considerably more precise than those from the case-control study alone or from standard two-phase estimators that do not assume G-E independence.
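To make the instrumental-variable idea behind Mendelian randomization concrete, the following is a minimal simulation sketch, not an implementation from any of the cited studies. It assumes a single biallelic variant `G`, a biomarker `B`, an unmeasured confounder `u`, and, for simplicity, a continuous outcome `D` rather than a dichotomous disease status. All function names, effect sizes, and the allele frequency are hypothetical choices made for illustration. The sketch contrasts the naive regression slope of outcome on biomarker with the Wald ratio estimator, cov(G, D) / cov(G, B), which is valid here only because the simulated genotype affects the outcome solely through the biomarker.

```python
import random


def covariance(xs, ys):
    """Sample covariance of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)


def simulate(n=20000, true_effect=0.5, seed=1):
    """Simulate genotype G, biomarker B, and outcome D (all values hypothetical).

    The confounder u raises both B and D, so the naive B-D association is
    biased. G is independent of u and affects D only through B.
    """
    rng = random.Random(seed)
    G, B, D = [], [], []
    for _ in range(n):
        # Allele count (0, 1, or 2) with an assumed allele frequency of 0.3.
        g = sum(rng.random() < 0.3 for _ in range(2))
        u = rng.gauss(0, 1)                        # unmeasured confounder
        b = 0.8 * g + u + rng.gauss(0, 1)          # biomarker level
        d = true_effect * b + u + rng.gauss(0, 1)  # continuous outcome
        G.append(g)
        B.append(b)
        D.append(d)
    return G, B, D


def naive_slope(B, D):
    """Ordinary least-squares slope of D on B, ignoring confounding."""
    return covariance(B, D) / covariance(B, B)


def wald_ratio(G, B, D):
    """Instrumental-variable (Wald ratio) estimate of the effect of B on D."""
    return covariance(G, D) / covariance(G, B)


if __name__ == "__main__":
    G, B, D = simulate()
    print("naive slope:", naive_slope(B, D))
    print("Wald ratio :", wald_ratio(G, B, D))
```

Under these assumptions the naive slope is pulled upward by the confounder, whereas the Wald ratio should fall close to `true_effect`. With a dichotomous disease outcome, as in the studies discussed above, the same ratio would typically be formed on the log-odds scale and would require additional care.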