Chunk #39 — Results — Missing values due to errors

Source: Random forest versus logistic regression: a large-scale benchmark experiment.
Embedded: yes

Text

Both LR and RF fail in the presence of categorical features with too many categories. More precisely, RF fails when more than 53 categories are detected in at least one of the features, while LR fails when levels that were not observed during the training phase occur in the test data. We could admittedly have prevented these errors through basic preprocessing of the data, such as removing or recoding the features that induce errors. However, we decide to simply remove the datasets resulting in NAs, because we do not want to address preprocessing steps, which would be a topic of their own and cannot be adequately treated in passing for such a large number of datasets. Since 22 datasets yield NAs, our study finally includes 265 - 22 = 243 datasets.
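
The two failure conditions described above can be illustrated with a minimal sketch. It is not the code used in the study: the function names (`rf_would_fail`, `lr_would_fail`, `keep_dataset`) and the dictionary-of-lists data layout are hypothetical, and counting RF categories over training and test data combined is an assumption made for illustration. Only the threshold of 53 categories and the two failure rules are taken from the text.

```python
# Category limit for RF as reported in the text.
MAX_RF_LEVELS = 53


def rf_would_fail(columns):
    """Return True if any categorical feature has more than 53 distinct levels.

    `columns` maps a feature name to a list of categorical values
    (hypothetical layout, chosen only for this sketch).
    """
    return any(len(set(values)) > MAX_RF_LEVELS for values in columns.values())


def lr_would_fail(train_columns, test_columns):
    """Return True if the test data contain a level never seen in training."""
    return any(
        not set(test_columns[name]) <= set(train_values)
        for name, train_values in train_columns.items()
    )


def keep_dataset(train_columns, test_columns):
    """Keep a dataset only if neither failure condition is triggered."""
    # Assumption: RF categories are counted over training and test data together.
    combined = {
        name: list(train_columns[name]) + list(test_columns[name])
        for name in train_columns
    }
    return not rf_would_fail(combined) and not lr_would_fail(train_columns, test_columns)


if __name__ == "__main__":
    train = {"color": ["red", "blue", "red"]}
    test = {"color": ["green"]}  # "green" never appears in the training data
    print(keep_dataset(train, test))  # False: LR would fail on the unseen level
    print(265 - 22)  # 243 datasets remain after exclusion
```

A screening step of this kind only flags problematic datasets. It does not recode or remove the offending features, which matches the decision above to exclude such datasets rather than preprocess them.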