Table S1 shows the empirical significance thresholds for all methods in every simulation scenario. Thresholds were around 5% for MV-PLINK, MultiPhen, TATES and UV-PCA. Thresholds for PCHAT were slightly higher, at approximately 6%, indicating slight deflation of the type I error rate under the null. For MV-SNPTEST and MV-BIMBAM, the log10 BF significance thresholds ranged from -0.05 to 0.44. Thresholds for UV-MA depended strongly on the residual correlation between the traits: they were around 5% for scenarios with uncorrelated traits and 0.2-0.3% for scenarios with high residual correlation, indicating strong inflation of the type I error rate under the null in the latter scenarios. Thresholds for UV analysis were around 5%/3 = 1.7% for scenarios with no residual correlation and increased slightly with increasing residual correlation.
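As a rough illustration of how such thresholds can be read, the sketch below assumes that an empirical significance threshold is the 5% quantile of the p-values obtained from null simulations; this paragraph does not state the exact procedure, so that definition is an assumption. The function name `empirical_threshold`, the seed and the number of replicates are hypothetical, and the minimum of three independent uniform p-values is only an illustrative stand-in for a univariate analysis of three uncorrelated traits.

```python
import numpy as np


def empirical_threshold(null_pvalues, alpha=0.05):
    """Return the p-value cutoff that rejects a fraction alpha of null replicates."""
    return float(np.quantile(np.asarray(null_pvalues), alpha))


rng = np.random.default_rng(0)  # hypothetical seed, for reproducibility only
n_rep = 200_000                 # hypothetical number of null replicates

# A well-calibrated single test: null p-values are uniform on [0, 1].
calibrated = rng.uniform(size=n_rep)

# Smallest of three independent null p-values per replicate
# (assumed stand-in for testing three uncorrelated traits separately).
min_of_three = rng.uniform(size=(n_rep, 3)).min(axis=1)

thr_calibrated = empirical_threshold(calibrated)
thr_min_of_three = empirical_threshold(min_of_three)

print(f"calibrated test: {thr_calibrated:.4f}")    # should be close to 0.05
print(f"min of three:    {thr_min_of_three:.4f}")  # should be close to 0.05 / 3
```

Under this assumed definition, a threshold above 5% means the nominal 5% cutoff rejects fewer than 5% of null replicates (a conservative test), whereas a threshold far below 5% means the nominal cutoff rejects too many (an inflated type I error rate).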