Random Forest Algorithm

Master the ensemble learning technique that combines multiple decision trees

Random Forest is an ensemble learning method that constructs multiple decision trees during training and outputs the mode of the classes (classification) or mean prediction (regression) of the individual trees. Developed by Leo Breiman in 2001, it combines the concepts of bootstrap aggregating (bagging) and random feature selection to create a robust model with reduced variance.
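As a concrete starting point, the snippet below fits a forest with scikit-learn's RandomForestClassifier; the synthetic dataset and parameter values are illustrative choices, not part of the original description.

```python
# Minimal scikit-learn sketch: fit a forest and let its trees vote on test samples.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)  # 100 bagged trees
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
```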

Mathematical Foundation

For a Random Forest with B trees, the prediction is:

Classification: ŷ = majority vote(ŷ_1(x), ŷ_2(x), ..., ŷ_B(x))
Regression: ŷ = (1/B) Σ_{b=1}^{B} ŷ_b(x)

where ŷ_b(x) is the prediction of the b-th tree for input x.
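As a small worked illustration of these two aggregation rules, with made-up per-tree predictions for a single input:

```python
# Hypothetical per-tree predictions ŷ_b(x) for one input x and B = 3 trees
from collections import Counter

class_preds = [1, 0, 1]        # what three classification trees predict
reg_preds = [2.4, 3.1, 2.9]    # what three regression trees predict

# Classification: majority vote over the B predictions
y_hat_class = Counter(class_preds).most_common(1)[0][0]   # -> 1

# Regression: mean of the B predictions, (1/B) Σ ŷ_b(x)
y_hat_reg = sum(reg_preds) / len(reg_preds)               # -> 2.8
print(y_hat_class, y_hat_reg)
```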

Algorithm Steps

  1. For b = 1 to B:
    1. Draw a bootstrap sample Z* of size N from the training data
    2. Grow a random-forest tree T_b on Z* by repeating the following at each node:
      1. Select m features at random from the p available features
      2. Pick the best split among the m selected features
      3. Split the node into daughter nodes
  2. Output the ensemble of trees {T_b}_{b=1}^{B}
  3. Make predictions by aggregating (voting or averaging) the predictions of all trees, as sketched in the example below
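The loop above can be sketched in a few lines of Python. This is a simplified, from-scratch illustration rather than a production implementation: it assumes integer class labels, it delegates the per-node random feature selection of step 1b to scikit-learn's DecisionTreeClassifier via its max_features argument, and the helper names fit_forest and predict_forest are made up for this sketch.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, B=100, max_features="sqrt", random_state=0):
    """Hypothetical helper: grow B trees on bootstrap samples of (X, y)."""
    rng = np.random.default_rng(random_state)
    N = X.shape[0]
    trees = []
    for b in range(B):
        # Step 1a: bootstrap sample Z* of size N (drawn with replacement)
        idx = rng.integers(0, N, size=N)
        # Step 1b: grow tree T_b; max_features makes the tree consider only
        # m randomly chosen features at each node
        tree = DecisionTreeClassifier(max_features=max_features,
                                      random_state=int(rng.integers(1 << 31)))
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees  # Step 2: the ensemble {T_b}

def predict_forest(trees, X):
    """Hypothetical helper: step 3, majority vote across trees (integer labels)."""
    all_preds = np.stack([t.predict(X) for t in trees])  # shape (B, n_samples)
    vote = lambda column: np.bincount(column.astype(int)).argmax()
    return np.apply_along_axis(vote, 0, all_preds)
```

In practice one would use sklearn.ensemble.RandomForestClassifier, which implements this same loop directly and adds out-of-bag scoring and parallel tree construction.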

Key Features & Parameters

  • n_estimators: Number of trees (typically 100-500)
  • max_features: Features considered at each split (√p for classification, p/3 for regression)
  • OOB Error: Out-of-bag error estimate, computed on the samples each tree never saw in its bootstrap sample (≈ cross-validation error)
  • Gini Impurity: Split criterion I_G(p) = 1 − ∑_i p_i² (see the example after this list)
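These parameters map directly onto scikit-learn's RandomForestClassifier. The snippet below is a minimal sketch using illustrative values, plus a small helper showing the Gini impurity formula.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=25, random_state=0)

clf = RandomForestClassifier(
    n_estimators=300,      # number of trees
    max_features="sqrt",   # ≈ √p features considered at each split (classification)
    oob_score=True,        # estimate generalization error from out-of-bag samples
    n_jobs=-1,             # trees are independent, so build them in parallel
    random_state=0,
).fit(X, y)
print("OOB score:", clf.oob_score_)

def gini_impurity(labels):
    """I_G(p) = 1 - sum_i p_i^2 for the class proportions at a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print("Gini of [0, 0, 1, 1]:", gini_impurity(np.array([0, 0, 1, 1])))  # 0.5
```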

Advantages & Limitations

Advantages (+)

  • Reduces overfitting compared to single decision trees
  • Handles high-dimensional feature spaces well
  • Parallelizable: trees are built independently

Limitations (-)

  • Less interpretable than single decision trees
  • Can be slow for real-time predictions
  • May overfit noisy datasets

Performance vs Number of Trees

Random Forest performance typically improves as trees are added, while the ensemble's variance decreases:

  • Initial trees provide rapid accuracy gains
  • Diminishing returns after ~100 trees
  • Variance reduction continues with more trees

Note: Optimal tree count depends on dataset size and complexity. Use out-of-bag error or cross-validation to determine the best n_estimators.
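One common way to do this in scikit-learn is to grow the forest incrementally with warm_start and watch the out-of-bag error level off; the dataset and tree counts below are arbitrary example values.

```python
# Sketch: track out-of-bag error as trees are added, to choose n_estimators.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=25, random_state=0)

clf = RandomForestClassifier(oob_score=True, warm_start=True,
                             n_jobs=-1, random_state=0)
for n in [25, 50, 100, 200, 400]:
    clf.set_params(n_estimators=n)
    clf.fit(X, y)   # warm_start adds trees up to n, reusing the earlier ones
    print(f"{n:4d} trees  OOB error = {1 - clf.oob_score_:.4f}")
```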

Feature Importance Calculation

Random Forest computes feature importance through mean decrease in impurity (Gini importance):

Importance_j = (1/N_T) ∑_T ∑_{node ∈ S_j(T)} ΔI(node)
where S_j(T) = the nodes in tree T that split on feature j,
ΔI(node) = impurity reduction at that node, N_T = total number of trees

An alternative is permutation importance (mean decrease in accuracy): shuffle one feature's values and measure how much the model's performance drops.
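A minimal scikit-learn sketch comparing the two approaches on a synthetic dataset (all dataset and parameter choices here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Mean decrease in impurity (Gini importance), averaged over all trees
print("MDI importances:", clf.feature_importances_.round(3))

# Permutation importance: drop in test accuracy when one feature is shuffled
perm = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=0)
print("Permutation importances:", perm.importances_mean.round(3))
```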