Chunk #65 — Discussion — Limitations

Source: Random forest versus logistic regression: a large-scale benchmark experiment.
Embedded: yes

Text

Fourthly, our main study was intentionally restricted to RF with default parameter values. The superiority of RF may be more pronounced if it is used together with an appropriate tuning strategy, as suggested by our additional analyses with TRF. Moreover, the version of RF considered in our study has been shown to be (sometimes strongly) biased in variable selection [14]. More precisely, variables of certain types (e.g., categorical variables with a large number of categories) are systematically preferred by the algorithm for inclusion in the trees, irrespective of their relevance for prediction. Variants of RF addressing this issue [13] may perform better, at least in some cases.
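
The selection bias mentioned above can be illustrated with a small simulation. The following is a minimal sketch, not the code used in the study: it assumes a CART-style exhaustive search for the split with the largest Gini impurity decrease, and all function and parameter names (`best_gini_decrease`, `share_many_category_wins`, `n_levels`, etc.) are hypothetical. The outcome is generated independently of both candidate variables, so neither is relevant for prediction; an unbiased criterion would be expected to pick each of them about equally often.

```python
import random


def gini(pos, n):
    """Gini impurity of a node with n observations, pos of them in class 1."""
    if n == 0:
        return 0.0
    p = pos / n
    return 2.0 * p * (1.0 - p)


def best_gini_decrease(x, y):
    """Largest impurity decrease over binary splits of a categorical feature.

    For a binary outcome, sorting the categories by their class-1 proportion
    and scanning the ordered cut points finds the optimal binary partition.
    """
    n, pos = len(y), sum(y)
    counts = {}
    for xi, yi in zip(x, y):
        c = counts.setdefault(xi, [0, 0])  # [n_in_category, n_class1]
        c[0] += 1
        c[1] += yi
    ordered = sorted(counts.values(), key=lambda c: c[1] / c[0])
    parent = gini(pos, n)
    best, n_left, pos_left = 0.0, 0, 0
    for n_c, pos_c in ordered[:-1]:  # the last cut would leave one side empty
        n_left += n_c
        pos_left += pos_c
        n_right, pos_right = n - n_left, pos - pos_left
        child = (n_left * gini(pos_left, n_left)
                 + n_right * gini(pos_right, n_right)) / n
        best = max(best, parent - child)
    return best


def share_many_category_wins(n_obs=200, n_levels=10, n_reps=500, seed=1):
    """Fraction of replications in which the many-category noise variable
    offers a strictly better split than the binary noise variable."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_reps):
        # Outcome and both features are drawn independently (pure noise).
        y = [rng.randint(0, 1) for _ in range(n_obs)]
        x_binary = [rng.randrange(2) for _ in range(n_obs)]
        x_many = [rng.randrange(n_levels) for _ in range(n_obs)]
        if best_gini_decrease(x_many, y) > best_gini_decrease(x_binary, y):
            wins += 1
    return wins / n_reps


if __name__ == "__main__":
    print(share_many_category_wins())
```

In this setup the variable with more categories offers more candidate splits, so its best split tends to look better by chance alone. The sketch covers only a single split on two categorical noise variables; it does not reproduce a full forest, nor the variants of RF cited as [13].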