Table 2 summarizes the overall performances for all three measures: mean performance, standard deviation, and confidence interval for the mean, computed using the adjusted bootstrap percentile (BCa) method [38]. Fig. 3 shows boxplots of the performances of Random Forest (RF) and Logistic Regression (LR) for the three performance measures, together with boxplots of the difference in performance (bottom row). Fig. 3 shows that RF performs better for the majority of datasets (69.0% of the datasets for acc, 72.3% for auc and 71.5% for brier). Furthermore, when LR outperforms RF, the difference is small. The differences in performance also tend to be larger for auc than for acc and brier.

Table 2 Performances of LR and RF (top: accuracy, middle: AUC, bottom: Brier score): mean performance μ, standard deviation σ and confidence interval for the mean (estimated via the bootstrap BCa method [38]) on the 243 datasets

Acc                    μ          σ        BCa confidence interval
Logistic regression    0.826      0.135    [0.808, 0.842]
Random forest          0.854      0.134    [0.837, 0.870]
Difference             0.029      0.067    [0.021, 0.038]

Auc                    μ          σ        BCa confidence interval
Logistic regression    0.826      0.149    [0.807, 0.844]
Random forest          0.867      0.147    [0.847, 0.884]
Difference             0.041      0.088    [0.031, 0.054]

Brier                  μ          σ        BCa confidence interval
Logistic regression    0.129      0.091    [0.117, 0.140]
Random forest          0.102      0.080    [0.092, 0.112]
Difference             -0.0269    0.054    [-0.034, -0.021]
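
As an illustration of how a BCa (bias-corrected and accelerated) bootstrap confidence interval for a mean, such as those reported in Table 2, can be obtained, the following is a minimal sketch using only the Python standard library. It is not the implementation used in the study: the function name `bca_ci`, the number of resamples `n_boot`, the seed, and the input `diffs` are all assumptions made for the example, and `diffs` holds synthetic values rather than the per-dataset differences behind Table 2.

```python
import random
from statistics import NormalDist, fmean


def bca_ci(data, stat=fmean, n_boot=2000, alpha=0.05, seed=0):
    """BCa bootstrap confidence interval for stat(data) (illustrative sketch)."""
    rng = random.Random(seed)
    data = list(data)
    n = len(data)
    theta_hat = stat(data)
    nd = NormalDist()

    # Bootstrap replicates: resample the datasets with replacement.
    boot = sorted(stat(rng.choices(data, k=n)) for _ in range(n_boot))

    # Bias correction z0: normal quantile of the share of replicates below
    # the observed statistic (clamped so that inv_cdf stays defined).
    prop = sum(b < theta_hat for b in boot) / n_boot
    prop = min(max(prop, 1.0 / (n_boot + 1)), n_boot / (n_boot + 1.0))
    z0 = nd.inv_cdf(prop)

    # Acceleration a: skewness-type term from leave-one-out (jackknife) values.
    jack = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    jbar = fmean(jack)
    num = sum((jbar - j) ** 3 for j in jack)
    den = 6.0 * sum((jbar - j) ** 2 for j in jack) ** 1.5
    a = num / den if den > 0 else 0.0

    def adjusted_level(q):
        # Map a nominal quantile level q to its BCa-adjusted level.
        z = nd.inv_cdf(q)
        return nd.cdf(z0 + (z0 + z) / (1.0 - a * (z0 + z)))

    def replicate_at(q):
        # Read the bootstrap replicate closest to quantile level q.
        idx = min(max(int(round(q * (n_boot - 1))), 0), n_boot - 1)
        return boot[idx]

    return (replicate_at(adjusted_level(alpha / 2)),
            replicate_at(adjusted_level(1 - alpha / 2)))


# Synthetic stand-in for per-dataset performance differences (NOT the
# study's data): 243 values drawn from a normal distribution.
_rng = random.Random(42)
diffs = [_rng.gauss(0.03, 0.07) for _ in range(243)]

lo, hi = bca_ci(diffs)
print(f"mean = {fmean(diffs):.3f}, 95% BCa CI = [{lo:.3f}, {hi:.3f}]")
```

Unlike the plain percentile interval, the BCa interval shifts its quantile levels through the bias-correction term `z0` and the acceleration term `a`, so it need not be symmetric around the sample mean.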