Chunk #43 — Results — Explaining differences: datasets’ meta-features — Preliminary analysis

Source: Random forest versus logistic regression: a large-scale benchmark experiment.
Embedded: yes

Text

As a preliminary, let us illustrate this idea using a single (large) biomedical dataset, the OpenML dataset with ID = 310, which includes n₀ = 11183 observations and p₀ = 7 features. A total of N = 50 sub-datasets are extracted from this dataset by randomly picking a number n′ < n₀ of observations or a number p′ < p₀ of features. We successively set n′ to n′ = 5·10², 10³, 5·10³, 10⁴ and p′ to p′ = 1, 2, 3, 4, 5, 6. Figure 4 displays the boxplots of the accuracy of RF (white) and LR (dark) for varying n′ (top-left) and varying p′ (top-right). Each boxplot represents N = 50 data points. It can be seen from Fig. 4 that the accuracy increases with p′ for both LR and RF. This reflects the fact that relevant features may be missing from the considered random subsets of p′ features. Interestingly, the increase of accuracy with p′ is more pronounced for RF than for LR. This supports the commonly formulated assumption that RF copes better with large numbers of features. As a consequence, the difference between RF and LR (bottom-right) increases with p′ from negative values (LR better than RF)
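
The sub-sampling design described above can be sketched as follows. This is a minimal illustration and not the code used in the study: it only draws the index sets that define each sub-dataset and leaves out the fitting of RF and LR. It assumes that N = 50 sub-datasets are drawn for each value of n′ and each value of p′ (our reading of "each boxplot represents N = 50 data points") and that sampling is done without replacement. The names `draw_subsets` and `build_design` and the seed are hypothetical.

```python
import random

# Dataset dimensions and grid values as stated in the text.
N0, P0 = 11183, 7
N_SUB = 50                          # assumed: 50 sub-datasets per setting
N_PRIME = [500, 1000, 5000, 10000]  # 5*10^2, 10^3, 5*10^3, 10^4
P_PRIME = [1, 2, 3, 4, 5, 6]


def draw_subsets(total, size, repeats, rng):
    """Draw `repeats` random index subsets of length `size` from range(total).

    Sampling is without replacement (an assumption, not stated in the text).
    """
    if not 0 < size < total:
        raise ValueError("size must satisfy 0 < size < total")
    return [sorted(rng.sample(range(total), size)) for _ in range(repeats)]


def build_design(seed=0):
    """Return row subsets (varying n') and column subsets (varying p')."""
    rng = random.Random(seed)  # hypothetical seed, for reproducibility only
    rows = {n: draw_subsets(N0, n, N_SUB, rng) for n in N_PRIME}
    cols = {p: draw_subsets(P0, p, N_SUB, rng) for p in P_PRIME}
    return rows, cols


rows, cols = build_design()
```

Each entry of `rows[n]` is a list of observation indices and each entry of `cols[p]` a list of feature indices. In a full experiment, both learners would be evaluated on every such sub-dataset and the resulting accuracies, as well as their differences, would be summarized in boxplots.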