Chunk #126 — Features and Pitfalls — Bias in Variable Selection and Variable Importance

Source: An introduction to recursive partitioning: rationale, application, and characteristics of classification and regression trees, bagging, and random forests.
Embedded: yes

Text

In the classical classification and regression tree algorithms CART and C4.5, variable selection is biased in favor of variables with certain characteristics, even if these variables are no more informative than their competitors. For example, variables with many categories, numeric variables, or, even more counterintuitively, variables with many missing values are artificially preferred (see, e.g., White and Liu 1994; Kim and Loh 2001; Strobl, Boulesteix, and Augustin 2007).
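
The preference for numeric variables can be illustrated with a small simulation. What follows is a minimal sketch under stated assumptions, not the procedure used in the cited studies: it assumes a binary response, the Gini criterion with an exhaustive search over all cutpoints (in the spirit of CART-style splitting), and two predictors that are both pure noise, one binary and one continuous. The function names `gini`, `best_gini_gain`, and `selection_frequency` are hypothetical and chosen for illustration only.

```python
import random


def gini(pos, n):
    """Gini impurity of a node holding `pos` positive cases out of `n`."""
    if n == 0:
        return 0.0
    p = pos / n
    return 2.0 * p * (1.0 - p)


def best_gini_gain(x, y):
    """Largest impurity reduction over all binary splits of the form x <= c."""
    n = len(y)
    total_pos = sum(y)
    parent = gini(total_pos, n)
    pairs = sorted(zip(x, y))
    best = 0.0
    left_n = 0
    left_pos = 0
    for i in range(n - 1):
        left_n += 1
        left_pos += pairs[i][1]
        if pairs[i][0] == pairs[i + 1][0]:
            continue  # tied values: no cutpoint lies between them
        right_n = n - left_n
        right_pos = total_pos - left_pos
        children = (left_n * gini(left_pos, left_n)
                    + right_n * gini(right_pos, right_n)) / n
        best = max(best, parent - children)
    return best


def selection_frequency(n_cases=100, n_reps=500, seed=1):
    """Share of replications in which the numeric noise predictor wins."""
    rng = random.Random(seed)
    numeric_wins = 0
    for _ in range(n_reps):
        # Response and both predictors are generated independently,
        # so neither predictor carries any information about y.
        y = [rng.randint(0, 1) for _ in range(n_cases)]
        x_binary = [rng.randint(0, 1) for _ in range(n_cases)]
        x_numeric = [rng.random() for _ in range(n_cases)]
        if best_gini_gain(x_numeric, y) > best_gini_gain(x_binary, y):
            numeric_wins += 1
    return numeric_wins / n_reps


if __name__ == "__main__":
    print(selection_frequency())
```

If selection were unbiased, each uninformative predictor would be chosen in roughly half of the replications. Under the assumptions above, the numeric predictor should instead win far more often, simply because it offers many more candidate cutpoints over which the criterion is maximized.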