Chunk #28 — Methods — Inclusion criteria and subgroup analyses

Source: Random forest versus logistic regression: a large-scale benchmark experiment.
Embedded: yes
Text

Independent of the problem of fishing for significance, it is important that the criteria for inclusion in the benchmarking experiment are clearly stated, as recently discussed [11]. In our study, we consider simple dataset characteristics, also termed “meta-features”. They are presented in Table 1. Based on these dataset characteristics, we define subgroups and repeat the benchmark study within these subgroups, following the principle of subgroup analyses in clinical research. For example, one could analyse the results for “large” datasets (n>1000) and “small” datasets (n≤1000) separately. Moreover, we also examine the subgroup of datasets related to biosciences/medicine.

Table 1. Considered meta-features

Meta-feature              Description
n                         Number of observations
p                         Number of features
$\frac{p}{n}$             Dimensionality
d                         Number of features of the associated design matrix for LR
$\frac{d}{n}$             Dimensionality of the design matrix
$p_{numeric}$             Number of numeric features
$p_{categorical}$         Number of categorical features
$p_{numeric,rate}$        Proportion of numeric features
$C_{max}$                 Percentage of observations in the majority class
time                      Duration of running a 5-fold CV with a default random forest
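As an illustration only, most of the meta-features in Table 1 can be computed directly from a dataset, and the size-based subgroups follow from n. The sketch below is not the authors' code. The function names `meta_features` and `size_subgroup` are hypothetical, and it assumes that d counts the columns obtained by dummy-coding each categorical feature with one reference level dropped, which the text does not specify. The time meta-feature is omitted because it requires running the cross-validation itself.

```python
import pandas as pd


def meta_features(X: pd.DataFrame, y: pd.Series) -> dict:
    """Compute the simple meta-features of Table 1 (except time)."""
    n, p = X.shape
    # Numeric features are identified by dtype; everything else is
    # treated as categorical in this sketch.
    p_numeric = X.select_dtypes(include="number").shape[1]
    p_categorical = p - p_numeric
    # Assumption: d is the column count after dummy coding with one
    # reference level dropped per categorical feature.
    d = pd.get_dummies(X, drop_first=True).shape[1]
    # Share of the majority class, expressed as a percentage.
    c_max = 100.0 * y.value_counts(normalize=True).max()
    return {
        "n": n,
        "p": p,
        "p/n": p / n,
        "d": d,
        "d/n": d / n,
        "p_numeric": p_numeric,
        "p_categorical": p_categorical,
        "p_numeric_rate": p_numeric / p,
        "C_max": c_max,
    }


def size_subgroup(n: int, threshold: int = 1000) -> str:
    """Assign a dataset to the "large" (n > threshold) or "small" subgroup."""
    return "large" if n > threshold else "small"


# Toy dataset: two numeric features and one categorical feature with 3 levels.
X = pd.DataFrame(
    {
        "age": [23, 35, 41, 52],
        "income": [1.0, 2.5, 3.0, 4.2],
        "colour": pd.Categorical(["r", "g", "b", "r"]),
    }
)
y = pd.Series([0, 1, 1, 1])

mf = meta_features(X, y)
print(mf)
print(size_subgroup(mf["n"]))
```

Repeating the benchmark within a subgroup then amounts to filtering the list of datasets on one of these values before aggregating the results.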