Defining a significance threshold posed a challenge because of the strong correlation between the 12 models and among the assessed phenotypes. To avoid false-positive claims, we defined two null distributions: an empirical null distribution using the synonymous collapsing model and an n-of-1 permutation-based null distribution. These approaches independently converged on a study-wide significance threshold of P ≤ 2 × 10⁻⁹ (Methods).
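The permutation idea can be illustrated with a minimal sketch. It is not the pipeline used in this study; the actual procedure is defined in the Methods. The sketch assumes a binary phenotype, a per-gene carrier/non-carrier collapsing, and a two-sided Fisher's exact test as the association test. The function names and the toy data are hypothetical. Case-control labels are shuffled once, every gene is retested against the shuffled labels, and the resulting P values form a null distribution against which a candidate threshold can be checked.

```python
import math
import random


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P value for the 2x2 table [[a, b], [c, d]].

    Rows are cases/controls, columns are carriers/non-carriers.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = math.comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x case carriers with margins fixed.
        return math.comb(row1, x) * math.comb(row2, col1 - x) / denom

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are no more likely than the observed one.
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= observed * (1 + 1e-9))
    return min(1.0, total)


def single_permutation_null(carrier_matrix, labels, rng):
    """Null P values from ONE shuffle of the case-control labels.

    carrier_matrix: one list of 0/1 carrier flags per gene (hypothetical layout).
    labels: 0/1 case status per sample.
    """
    shuffled = labels[:]
    rng.shuffle(shuffled)  # a single permutation, reused for every gene
    n_cases = sum(shuffled)
    n_controls = len(shuffled) - n_cases
    null_p = []
    for carriers in carrier_matrix:
        case_carriers = sum(c for c, y in zip(carriers, shuffled) if y == 1)
        control_carriers = sum(carriers) - case_carriers
        null_p.append(fisher_exact_two_sided(
            case_carriers, n_cases - case_carriers,
            control_carriers, n_controls - control_carriers))
    return null_p


# Toy data (assumed for illustration only): 200 samples, 40 cases, 50 genes.
rng = random.Random(0)
labels = [1] * 40 + [0] * 160
carrier_matrix = [[1 if rng.random() < 0.05 else 0 for _ in range(200)]
                  for _ in range(50)]

null_p = single_permutation_null(carrier_matrix, labels, rng)
threshold = 2e-9  # the study-wide threshold quoted in the text
null_hits = sum(p <= threshold for p in null_p)
print("smallest null P value:", min(null_p))
print("null P values at or below the threshold:", null_hits)
```

Because the labels are shuffled jointly for all genes, the correlation between tests is preserved in the null, which is the reason a permutation-based null is attractive when models and phenotypes are strongly correlated. A threshold is plausible under this sketch when few or no null P values fall at or below it.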