As in Figure 1, our proposed framework always produced near-maximal accuracy. Also as before, the solid black curves typically reach their highest accuracy at small values of khap. Figure 2 shows that IMPUTE2 achieved higher accuracy than Beagle in every panel, except at the lowest khap settings. In some target panels the difference between methods was small; for example, IMPUTE2 was only slightly more accurate than Beagle in the CEU and TSI panels, which is consistent with previous results comparing these methods on a European dataset (Howie et al. 2009). [We note that Jostins et al. (2011) reached the opposite conclusion, namely that Beagle is more accurate than IMPUTE2 when imputing Europeans from diverse reference panels. We believe that their conclusion was driven by spurious IMPUTE2 results, as we explain in File S4.] On the other hand, IMPUTE2 was more accurate by a large margin in the African panels (YRI, LWK, and MKK). These trends cannot be attributed to our running Beagle with a stratified reference panel, a situation for which the method was not designed: