Chunk #45 — Results — Explaining differences: datasets’ meta-features — Subgroup analyses: meta-features

Source: Random forest versus logistic regression: a large-scale benchmark experiment.
Embedded: yes

Text

To further explore this issue over all 243 investigated datasets, we compute Spearman’s correlation coefficient between Δacc, the difference in accuracy between random forest and logistic regression, and various meta-features of the datasets. The results of Spearman’s correlation test are shown in Table 3. These analyses again point to the importance of the number p of features (and related meta-features), while the dataset size n is not significantly correlated with Δacc. The percentage Cmax of observations in the majority class, which was identified as influencing the relative performance of RF and LR in a previous study [39] conducted on a dataset from the field of political science, is also not significantly correlated with Δacc in our study. Note that our results are averaged over a large number of different datasets: they are not incompatible with the existence of an effect in some cases.

Table 3. Correlation between Δacc and dataset’s features

Meta-feature      Spearman’s ρ    p-value
n                 -0.0338         6.00·10^−1
p                  0.331          1.32·10^−7
p/n                0.254          6.39·10^−5
d                  0.258          4.55·10^−5
d/n                0.246          1.04·10^−4
p_numeric          0.254          6.09·10^−5
p_categorical     -0.076          2.37·10^−1
p_numeric,rate     0.240          1.54·10^−4
Cmax               0.00735        9.10·10^−1
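As an illustration of the kind of analysis summarised in Table 3, the following minimal sketch computes Spearman’s ρ between one meta-feature and Δacc using only the Python standard library. It is not the authors’ code. The function names and the six toy values of p and Δacc are hypothetical and chosen purely for illustration. The p-value here comes from a simple permutation test, which is an assumption of this sketch and may differ from the test used in the study.

```python
import random
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for Spearman's rho (illustrative only)."""
    rng = random.Random(seed)
    rx, ry = average_ranks(x), average_ranks(y)
    observed = abs(pearson(rx, ry))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(ry)  # break any association between x and y
        if abs(pearson(rx, ry)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical toy data: one entry per dataset (NOT values from the study).
p_features = [5, 10, 20, 40, 80, 160]
delta_acc = [-0.01, 0.00, 0.02, 0.01, 0.04, 0.05]

if __name__ == "__main__":
    rho = spearman_rho(p_features, delta_acc)
    p_val = permutation_p_value(p_features, delta_acc)
    print(f"rho = {rho:.3f}, permutation p-value = {p_val:.4f}")
```

In practice the same computation would be repeated for each meta-feature (n, p, p/n, d, and so on), giving one row of ρ and p-value per feature as in Table 3. Because ρ depends only on ranks, it captures monotone rather than strictly linear association.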