Random Forest Algorithm

Master the ensemble learning technique that combines multiple decision trees

Random Forest is an ensemble learning method that constructs multiple decision trees during training and outputs the mode of the classes (classification) or mean prediction (regression) of the individual trees. Developed by Leo Breiman in 2001, it combines the concepts of bootstrap aggregating (bagging) and random feature selection to create a robust model with reduced variance.
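As a concrete starting point, the snippet below fits a forest with scikit-learn's RandomForestClassifier; the synthetic dataset and parameter values are illustrative choices, not part of the original description.

```python
# Minimal scikit-learn sketch: fit a forest and let its trees vote on test samples.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)  # 100 bagged trees
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
```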

Mathematical Foundation

For a Random Forest with B trees, the prediction is:

Classification: ŷ = majority vote(ŷ_1(x), ŷ_2(x), ..., ŷ_B(x))
Regression: ŷ = (1/B) Σ_{b=1}^{B} ŷ_b(x)

where ŷ_b(x) is the prediction of the b-th tree for input x.
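As a small worked illustration of these two aggregation rules, with made-up per-tree predictions for a single input:

```python
# Hypothetical per-tree predictions ŷ_b(x) for one input x and B = 3 trees
from collections import Counter

class_preds = [1, 0, 1]        # what three classification trees predict
reg_preds = [2.4, 3.1, 2.9]    # what three regression trees predict

# Classification: majority vote over the B predictions
y_hat_class = Counter(class_preds).most_common(1)[0][0]   # -> 1

# Regression: mean of the B predictions, (1/B) Σ ŷ_b(x)
y_hat_reg = sum(reg_preds) / len(reg_preds)               # -> 2.8
print(y_hat_class, y_hat_reg)
```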

Algorithm Steps

  1. For b = 1 to B:
    1. Draw a bootstrap sample Z* of size N from the training data
    2. Grow a random-forest tree T_b on Z* by repeating the following at each node:
      1. Select m features at random from the p available features
      2. Pick the best split among the m selected features
      3. Split the node into daughter nodes
  2. Output the ensemble of trees {T_b}_{b=1}^{B}
  3. Make predictions by aggregating (voting or averaging) the predictions of all trees, as sketched in the example below
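The loop above can be sketched in a few lines of Python. This is a simplified, from-scratch illustration rather than a production implementation: it assumes integer class labels, it delegates the per-node random feature selection of step 1b to scikit-learn's DecisionTreeClassifier via its max_features argument, and the helper names fit_forest and predict_forest are made up for this sketch.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, B=100, max_features="sqrt", random_state=0):
    """Hypothetical helper: grow B trees on bootstrap samples of (X, y)."""
    rng = np.random.default_rng(random_state)
    N = X.shape[0]
    trees = []
    for b in range(B):
        # Step 1a: bootstrap sample Z* of size N (drawn with replacement)
        idx = rng.integers(0, N, size=N)
        # Step 1b: grow tree T_b; max_features makes the tree consider only
        # m randomly chosen features at each node
        tree = DecisionTreeClassifier(max_features=max_features,
                                      random_state=int(rng.integers(1 << 31)))
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees  # Step 2: the ensemble {T_b}

def predict_forest(trees, X):
    """Hypothetical helper: step 3, majority vote across trees (integer labels)."""
    all_preds = np.stack([t.predict(X) for t in trees])  # shape (B, n_samples)
    vote = lambda column: np.bincount(column.astype(int)).argmax()
    return np.apply_along_axis(vote, 0, all_preds)
```

In practice one would use sklearn.ensemble.RandomForestClassifier, which implements this same loop directly and adds out-of-bag scoring and parallel tree construction.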

Key Features & Parameters

  • n_estimators: Number of trees (typically 100-500)
  • max_features: Features considered at each split (√p for classification, p/3 for regression)
  • OOB Error: Out-of-bag error estimate, computed on the samples each tree never saw in its bootstrap sample (≈ cross-validation error)
  • Gini Impurity: Split criterion I_G(p) = 1 − ∑_i p_i² (see the example after this list)
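These parameters map directly onto scikit-learn's RandomForestClassifier. The snippet below is a minimal sketch using illustrative values, plus a small helper showing the Gini impurity formula.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=25, random_state=0)

clf = RandomForestClassifier(
    n_estimators=300,      # number of trees
    max_features="sqrt",   # ≈ √p features considered at each split (classification)
    oob_score=True,        # estimate generalization error from out-of-bag samples
    n_jobs=-1,             # trees are independent, so build them in parallel
    random_state=0,
).fit(X, y)
print("OOB score:", clf.oob_score_)

def gini_impurity(labels):
    """I_G(p) = 1 - sum_i p_i^2 for the class proportions at a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print("Gini of [0, 0, 1, 1]:", gini_impurity(np.array([0, 0, 1, 1])))  # 0.5
```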

Advantages & Limitations

Advantages (+)

  • Reduces overfitting compared to single decision trees
  • Handles high-dimensional feature spaces well
  • Parallelizable: trees are built independently

Limitations (-)

  • Less interpretable than single decision trees
  • Can be slow for real-time predictions
  • May overfit noisy datasets

Performance vs Number of Trees

Random Forest performance typically improves as trees are added, while the ensemble's variance decreases:

  • Initial trees provide rapid accuracy gains
  • Diminishing returns after ~100 trees
  • Variance reduction continues with more trees

Note: Optimal tree count depends on dataset size and complexity. Use out-of-bag error or cross-validation to determine the best n_estimators.
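One common way to do this in scikit-learn is to grow the forest incrementally with warm_start and watch the out-of-bag error level off; the dataset and tree counts below are arbitrary example values.

```python
# Sketch: track out-of-bag error as trees are added, to choose n_estimators.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=25, random_state=0)

clf = RandomForestClassifier(oob_score=True, warm_start=True,
                             n_jobs=-1, random_state=0)
for n in [25, 50, 100, 200, 400]:
    clf.set_params(n_estimators=n)
    clf.fit(X, y)   # warm_start adds trees up to n, reusing the earlier ones
    print(f"{n:4d} trees  OOB error = {1 - clf.oob_score_:.4f}")
```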

Feature Importance Calculation

Random Forest computes feature importance through mean decrease in impurity (Gini importance):

Importance_j = (1/N_T) ∑_T ∑_{node ∈ S_j(T)} ΔI(node)
where S_j(T) = the nodes in tree T that split on feature j,
ΔI(node) = impurity reduction at that node, N_T = total number of trees

An alternative is permutation importance (mean decrease in accuracy): shuffle one feature's values and measure how much the model's performance drops.
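A minimal scikit-learn sketch comparing the two approaches on a synthetic dataset (all dataset and parameter choices here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Mean decrease in impurity (Gini importance), averaged over all trees
print("MDI importances:", clf.feature_importances_.round(3))

# Permutation importance: drop in test accuracy when one feature is shuffled
perm = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=0)
print("Permutation importances:", perm.importances_mean.round(3))
```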