Decision Tree, Random Forest, Gradient Boosted Trees
Every node in a decision tree is a condition on a single feature, designed to split the dataset into two subsets so that similar response values end up in the same subset.
The measure used to choose the (locally) optimal condition is called impurity.
- classification: Gini impurity or information gain/entropy
- regression: variance.
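
As a rough illustration of the impurity measures listed above, the sketch below computes Gini impurity, entropy, and variance, together with the weighted impurity decrease of one candidate split. The function names (`gini`, `entropy`, `impurity_decrease`) are made up for this note and are not taken from any library; both sides of the split are assumed to be non-empty.

```python
from collections import Counter
from math import log2
from statistics import pvariance  # population variance, used for regression


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy (in bits) of the class proportions."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def impurity_decrease(parent, left, right, impurity):
    """Parent impurity minus the size-weighted impurity of the two children.

    Assumes left and right are non-empty and together make up parent.
    """
    n = len(parent)
    weighted = (len(left) * impurity(left) + len(right) * impurity(right)) / n
    return impurity(parent) - weighted


labels = [0, 0, 1, 1]
# A split that separates the two classes perfectly removes all impurity.
print(impurity_decrease(labels, [0, 0], [1, 1], gini))     # 0.5
print(impurity_decrease(labels, [0, 0], [1, 1], entropy))  # 1.0

targets = [1.0, 3.0, 5.0, 7.0]
# For regression the same formula is applied with variance as the impurity.
print(impurity_decrease(targets, [1.0, 3.0], [5.0, 7.0], pvariance))  # 4.0
```

When entropy is used as the impurity, the decrease computed this way is the information gain of the split.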
- Feature importance: how much each feature decreases the weighted impurity in a tree.
- For a forest, the impurity decrease from each feature is averaged over the trees, and the features are ranked by this measure.
- Feature selection based on impurity reduction is biased towards variables with more categories.
- With correlated features, strong features can end up with low scores. If the dataset has two (or more) correlated features, then from the point of view of the model any of them can be used as the predictor, with no concrete preference for one over the others. But once one of them is used, the importance of the others is significantly reduced, since the impurity they could remove has already been removed by the first feature.
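
The correlated-feature effect can be reproduced with a toy example. The sketch below is a minimal, hypothetical single regression tree, not a forest and not any library's implementation. It uses variance as the impurity and credits each split's weighted impurity decrease to the feature that was split on. The names `best_split` and `tree_importances`, the tie-breaking rule (the earlier column wins), and the data are all assumptions made for illustration.

```python
from statistics import pvariance


def best_split(X, y, rows):
    """Return (decrease, feature, threshold) for the best variance-reducing split."""
    n = len(rows)
    parent = pvariance([y[i] for i in rows])
    best = (0.0, None, None)
    for j in range(len(X[0])):
        values = sorted({X[i][j] for i in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # midpoint between neighbouring distinct values
            left = [y[i] for i in rows if X[i][j] <= t]
            right = [y[i] for i in rows if X[i][j] > t]
            child = (len(left) * pvariance(left) + len(right) * pvariance(right)) / n
            decrease = parent - child
            if decrease > best[0] + 1e-12:  # strict: ties keep the earlier feature
                best = (decrease, j, t)
    return best


def tree_importances(X, y, max_depth=3):
    """Sum of node-weighted impurity decreases per feature for one greedy tree."""
    total = len(y)
    scores = [0.0] * len(X[0])

    def grow(rows, depth):
        if depth == 0 or len(rows) < 2:
            return
        decrease, j, t = best_split(X, y, rows)
        if j is None:  # no split reduces the variance
            return
        scores[j] += len(rows) / total * decrease
        grow([i for i in rows if X[i][j] <= t], depth - 1)
        grow([i for i in rows if X[i][j] > t], depth - 1)

    grow(list(range(total)), max_depth)
    return scores


# Column 0 drives the target, column 1 is an exact copy of it, column 2 is unrelated.
X = [[float(v), float(v), float(v % 2)] for v in range(8)]
y = [10.0 * row[0] for row in X]

print(tree_importances(X, y))  # column 0 takes all the credit, its copy gets 0

# Swapping the first two columns moves the credit to whichever comes first.
X_swapped = [[row[1], row[0], row[2]] for row in X]
print(tree_importances(X_swapped, y))
```

In this single deterministic tree the copy receives no credit at all. In a forest, where each tree sees a different sample and a random subset of features, the credit would more typically be shared between the correlated features, so each one looks weaker than it would on its own.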