We evaluated 10 fine-mapping methods (Methods, Table 1). We assessed calibration via the proportion of false positives among SNPs with posterior causal probability (posterior inclusion probability; PIP) above a given threshold (e.g., PIP>0.95), aggregating the results across all simulations; we refer to this quantity as the false discovery rate (FDR). For each PIP threshold, we estimated the FDR as one minus the PIP threshold, which is more conservative than an exact estimate (Figure 1a–b, Supplementary Note, Supplementary Table 4). Only CAVIARBF2- and CAVIARBF2 had significantly inflated FDRs; fastPAINTOR and CAVIARBF1 showed suggestive evidence of inflation. We assessed power via the proportion of true causal SNPs with PIP above a given threshold, again aggregating the results across all simulations. PolyFun + FINEMAP was the most powerful method, identifying >5% more causal SNPs at PIP>0.95 than PolyFun + SuSiE and >20% more than FINEMAP; PolyFun + SuSiE was the second most powerful method, identifying >25% more causal SNPs at PIP>0.95 than SuSiE (Figure 1c–d, Supplementary Table 4). These results demonstrate the benefits of prioritizing SNPs using functional annotations.
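
To make the two metrics concrete, the following minimal Python sketch computes the empirical FDR and power at a single PIP threshold from simulated results pooled across simulations. It is an illustration under stated assumptions, not the analysis code used here: the function name `calibration_and_power`, its arguments, and the toy inputs are hypothetical. The sketch also reports one minus the mean PIP of the selected SNPs, which is one way to read the "exact estimate" mentioned above; that reading is an assumption. Because every selected SNP has PIP above the threshold, this quantity can never exceed one minus the threshold, which is why the latter is conservative.

```python
from typing import Dict, Sequence


def calibration_and_power(
    pips: Sequence[float],
    is_causal: Sequence[bool],
    threshold: float = 0.95,
) -> Dict[str, float]:
    """Summarize calibration and power at one PIP threshold.

    `pips` and `is_causal` are hypothetical inputs: one entry per SNP,
    pooled across all simulations, with the simulated truth known.
    """
    if len(pips) != len(is_causal):
        raise ValueError("pips and is_causal must have the same length")

    # SNPs called at this threshold (PIP strictly above it).
    selected = [(p, c) for p, c in zip(pips, is_causal) if p > threshold]
    n_selected = len(selected)
    n_true_pos = sum(1 for _, c in selected if c)
    n_causal = sum(1 for c in is_causal if c)

    # Empirical FDR: proportion of false positives among selected SNPs.
    empirical_fdr = (n_selected - n_true_pos) / n_selected if n_selected else 0.0
    # Power: proportion of true causal SNPs that were selected.
    power = n_true_pos / n_causal if n_causal else 0.0
    # Conservative estimate described in the text: one minus the threshold.
    conservative_fdr = 1.0 - threshold
    # Assumed "exact" estimate: one minus the mean PIP of selected SNPs.
    mean_pip_fdr = (
        1.0 - sum(p for p, _ in selected) / n_selected if n_selected else 0.0
    )

    return {
        "n_selected": float(n_selected),
        "empirical_fdr": empirical_fdr,
        "conservative_fdr": conservative_fdr,
        "mean_pip_fdr": mean_pip_fdr,
        "power": power,
    }


# Toy data for illustration only; these are not results from the study.
toy_pips = [0.99, 0.97, 0.96, 0.50, 0.10, 0.98]
toy_causal = [True, True, False, True, False, True]
summary = calibration_and_power(toy_pips, toy_causal, threshold=0.95)
print(summary)
```

In this framing, a method shows inflation at a given threshold when its empirical FDR exceeds the conservative estimate by more than sampling noise allows; testing that significance requires a procedure the sketch does not implement.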