Random Forest Algorithm
Master the ensemble learning technique that combines multiple decision trees
Random Forest is an ensemble learning method that constructs multiple decision trees during training and outputs the mode of the classes (classification) or mean prediction (regression) of the individual trees. Developed by Leo Breiman in 2001, it combines the concepts of bootstrap aggregating (bagging) and random feature selection to create a robust model with reduced variance.
Mathematical Foundation
For a Random Forest with B trees, the prediction is:
Classification: ŷ(x) = majority vote{ŷ_1(x), ŷ_2(x), ..., ŷ_B(x)}
Regression: ŷ(x) = (1/B) Σ_{b=1}^{B} ŷ_b(x)
where ŷ_b(x) is the prediction of the b-th tree for input x.
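To make the aggregation step concrete, here is a minimal sketch assuming the per-tree predictions for a single input are already available as NumPy arrays (the values and variable names are purely illustrative):

```python
import numpy as np

# Hypothetical per-tree predictions for a single input x (B = 5 trees).
tree_class_preds = np.array([1, 0, 1, 1, 0])           # classification: class labels
tree_reg_preds = np.array([3.2, 2.9, 3.5, 3.1, 3.0])   # regression: numeric outputs

# Classification: majority vote across the B trees.
values, counts = np.unique(tree_class_preds, return_counts=True)
y_hat_class = values[np.argmax(counts)]

# Regression: average of the B tree predictions.
y_hat_reg = tree_reg_preds.mean()

print(y_hat_class)  # 1
print(y_hat_reg)    # 3.14
```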
Algorithm Steps
- For b = 1 to B:
  - Draw a bootstrap sample Z* of size N from the training data
  - Grow a random-forest tree T_b on Z* by repeating, at each node:
    - Select m features at random from the p features
    - Pick the best split among the m features
    - Split the node into two daughter nodes
- Output the ensemble of trees {T_b}, b = 1, ..., B
- Make predictions by aggregating (voting or averaging) the predictions of all trees (a code sketch follows below)
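The following sketch mirrors the steps above, using scikit-learn's DecisionTreeClassifier as the base learner. The function and variable names are illustrative, not a reference implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_random_forest(X, y, B=100, m="sqrt", random_state=0):
    """Grow B trees, each on a bootstrap sample, with m candidate features per split.
    X and y are assumed to be NumPy arrays."""
    rng = np.random.RandomState(random_state)
    N = X.shape[0]
    trees = []
    for b in range(B):
        # Step 1: bootstrap sample Z* of size N (sampling with replacement).
        idx = rng.randint(0, N, size=N)
        # Steps 2-3: grow a tree; max_features=m restricts each split
        # to a random subset of m candidate features.
        tree = DecisionTreeClassifier(max_features=m,
                                      random_state=rng.randint(1 << 30))
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def predict_forest(trees, X):
    """Aggregate by majority vote over the ensemble {T_b} (integer class labels assumed)."""
    votes = np.stack([t.predict(X) for t in trees])   # shape (B, n_samples)
    return np.apply_along_axis(
        lambda col: np.bincount(col.astype(int)).argmax(), 0, votes)
```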
Key Features & Parameters
- n_estimators: Number of trees (typically 100-500)
- max_features: Features considered at each split (√p for classification, p/3 for regression)
- OOB Error: Out-of-bag estimate (≈ cross-validation error)
- Gini Impurity: Split criterion I_G(p) = 1 − Σ_i p_i²
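These parameters map directly onto scikit-learn's RandomForestClassifier. A minimal usage sketch, with a synthetic placeholder dataset and illustrative parameter values:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

clf = RandomForestClassifier(
    n_estimators=200,      # number of trees B
    max_features="sqrt",   # m = sqrt(p) features considered per split
    criterion="gini",      # Gini impurity as the split criterion
    oob_score=True,        # compute the out-of-bag error estimate
    n_jobs=-1,             # trees are independent, so build them in parallel
    random_state=42,
)
clf.fit(X, y)
print(f"OOB accuracy: {clf.oob_score_:.3f}")
```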
Advantages & Limitations
Advantages (+)
- Reduces overfitting compared to single decision trees
- Handles high dimensional spaces well
- Parallelizable - trees built independently
Limitations (-)
- Less interpretable than single decision trees
- Can be slow for real-time predictions
- May overfit noisy datasets
Performance vs Number of Trees
Random Forest performance typically improves as more trees are added, while variance decreases:
- Initial trees provide rapid accuracy gains
- Diminishing returns after ~100 trees
- Variance reduction continues with more trees
Note: Optimal tree count depends on dataset size and complexity. Use out-of-bag error or cross-validation to determine the best n_estimators.
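One practical way to choose n_estimators is to monitor the OOB error as trees are added. The sketch below uses scikit-learn's warm_start option so the forest grows incrementally; the dataset and the grid of tree counts are purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# warm_start=True reuses the already-grown trees and only adds new ones,
# so the forest is not rebuilt from scratch at each grid point.
forest = RandomForestClassifier(warm_start=True, oob_score=True,
                                max_features="sqrt", random_state=0)

for n in [25, 50, 100, 200, 400]:
    forest.set_params(n_estimators=n)
    forest.fit(X, y)
    print(f"n_estimators={n:>3}  OOB error={1 - forest.oob_score_:.4f}")
```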
Feature Importance Calculation
Random Forest computes feature importance through mean decrease in impurity (Gini importance):
Importance_j = (1/N_T) Σ_T Σ_{node ∈ S_j} ΔI(node)
where S_j is the set of nodes that split on feature j, ΔI(node) is the impurity reduction at that node, and N_T is the total number of trees.
Alternative methods include permutation importance and mean decrease in accuracy.
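Assuming a scikit-learn forest, both the built-in Gini (mean decrease in impurity) importance and permutation importance can be read off a fitted model; the dataset below is a placeholder:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

# Mean decrease in impurity (Gini importance), computed during training.
print("MDI importances:", forest.feature_importances_.round(3))

# Permutation importance: drop in test accuracy when one feature is shuffled.
perm = permutation_importance(forest, X_test, y_test, n_repeats=10,
                              random_state=0)
print("Permutation importances:", perm.importances_mean.round(3))
```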