Random Forest Algorithm

Updated May 2026
A random forest is an ensemble machine learning algorithm that builds hundreds or thousands of decision trees, each trained on a random sample of the data and restricted to a random subset of the features at every split, then combines their predictions through majority voting (classification) or averaging (regression). The randomness prevents individual trees from overfitting to the same patterns, producing a model that is usually more accurate and far more stable than any single decision tree. Random forests consistently rank among the best algorithms for structured data and require minimal tuning to work well.

How Random Forests Work

A random forest introduces randomness at two levels. First, each tree is trained on a different random sample of the training data, drawn with replacement (bootstrap sampling; training an ensemble on such samples and aggregating the results is called bagging). A dataset of 10,000 rows produces a 10,000-row training set for each tree, but roughly 37% of the original rows will be missing from each sample, their places taken by duplicates of other rows. This means each tree sees a different version of the data.
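The 37% figure can be checked with a short simulation. This is a minimal sketch using only the Python standard library; the function name and the seed are illustrative choices, not part of any library.

```python
import random

def missing_fraction(n_rows, seed=0):
    """Draw one bootstrap sample of row indices and return the
    fraction of original rows that never appear in it."""
    rng = random.Random(seed)
    # Sample n_rows indices with replacement, as bagging does for each tree.
    sample = [rng.randrange(n_rows) for _ in range(n_rows)]
    return 1 - len(set(sample)) / n_rows

n_rows = 10_000
simulated = missing_fraction(n_rows)
# Each row is skipped by a single draw with probability (1 - 1/n),
# so it is skipped by all n draws with probability (1 - 1/n)^n, close to 1/e.
expected = (1 - 1 / n_rows) ** n_rows

print(f"simulated missing fraction: {simulated:.3f}")
print(f"theoretical value:          {expected:.3f}")
```

The rows left out of a tree's sample are often called its out-of-bag rows.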

Second, at each split in each tree, only a random subset of features is considered as candidates. If the dataset has 50 features, each split might evaluate only 7 of them (the square root of 50 is a common default for classification). This forces the trees to use different features at different points, creating diversity among trees. Without this feature randomization, every tree would choose the same dominant feature at the root and the trees would be too similar to benefit from aggregation.
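As a rough illustration of the per-split feature draw, the sketch below picks the candidate features for one split under the square-root rule. The helper name is hypothetical, and a real implementation would repeat this draw at every split of every tree.

```python
import math
import random

def candidate_features(n_features, rng):
    """Return the feature indices one split is allowed to consider,
    using the square-root rule (rounded down, at least one feature)."""
    k = max(1, math.isqrt(n_features))
    # Sample without replacement: a feature is either a candidate or not.
    return sorted(rng.sample(range(n_features), k))

rng = random.Random(0)
first_split = candidate_features(50, rng)
second_split = candidate_features(50, rng)

print(len(first_split), first_split)
print(len(second_split), second_split)
```

Because each split gets a fresh draw, a dominant feature is unavailable at most splits, which forces the trees to find alternative structure.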

For classification, the forest predicts the class that receives the most votes across all trees. If 340 of 500 trees predict "fraud" and 160 predict "legitimate," the forest predicts "fraud," and the 68% vote share can be read as a rough confidence score (some implementations average each tree's class probabilities instead of counting hard votes). For regression, the forest averages the predictions of all trees. If tree predictions range from $340,000 to $420,000 for a house, the forest might output $378,500.
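The two aggregation rules can be sketched in a few lines. The function names and the five house-price predictions are made up for illustration; only the 340/160 vote split comes from the example above.

```python
from collections import Counter
from statistics import fmean

def aggregate_votes(votes):
    """Majority vote: return the winning label and its share of the votes.
    Ties go to whichever label was seen first."""
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)

def aggregate_regression(predictions):
    """Regression forests average the per-tree predictions."""
    return fmean(predictions)

votes = ["fraud"] * 340 + ["legitimate"] * 160
label, share = aggregate_votes(votes)
print(label, share)

# Hypothetical per-tree price predictions for one house.
tree_prices = [340_000, 365_000, 380_000, 387_500, 420_000]
price = aggregate_regression(tree_prices)
print(price)
```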

The mathematical insight behind this approach is the law of large numbers applied to model errors. Individual trees make errors, but those errors are partially random due to the bootstrap sampling and feature subsetting. When you average across hundreds of trees, the random errors largely cancel out, leaving the systematic signal. The cancellation is only partial, because trees trained on overlapping data remain correlated, which is why the forest works to make them as different as possible. This is the same principle that makes polling averages more accurate than individual polls.
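A small simulation shows the cancellation effect under an idealized assumption: every "tree" here is just the true value plus independent Gaussian noise. Real trees are correlated, so the reduction in practice is smaller than this sketch suggests. All names and constants are illustrative.

```python
import random
import statistics

def variance_of_average(n_trees, n_trials=2000, noise_sd=1.0, seed=0):
    """Estimate the variance of an average of n_trees noisy predictions.
    Assumes independent errors, which real trees only approximate."""
    rng = random.Random(seed)
    true_value = 10.0
    averages = []
    for _ in range(n_trials):
        preds = [true_value + rng.gauss(0, noise_sd) for _ in range(n_trees)]
        averages.append(statistics.fmean(preds))
    return statistics.pvariance(averages)

var_single = variance_of_average(1)
var_forest = variance_of_average(100)

print(f"variance with 1 tree:    {var_single:.4f}")
print(f"variance with 100 trees: {var_forest:.4f}")
```

With independent errors the variance falls roughly in proportion to the number of trees, so 100 trees give about one hundredth of the single-tree variance.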

Why Random Forests Resist Overfitting

A single decision tree overfits easily because it keeps splitting until it perfectly classifies the training data, memorizing noise and outliers. Random forests solve this through the wisdom of crowds. Each tree overfits to its particular random sample, but they overfit in different directions. One tree might have an outlier in its sample that creates a spurious split, but most other trees will not have that outlier, so the spurious pattern gets outvoted.

Formally, the ensemble reduces variance (sensitivity to specific training data) without substantially increasing bias (systematic error). A single deep tree has low bias but high variance: it fits the training data closely but changes dramatically with a different training sample. The forest maintains the low bias (each tree fits the data well) while averaging away the high variance (the random errors cancel out).
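One common way to make this precise is the variance of an average of correlated predictors. The symbols are introduced here for illustration and are not from the text above: B is the number of trees, T_b(x) is the prediction of tree b at a point x, sigma^2 is the variance of a single tree's prediction, and rho is the pairwise correlation between trees, assumed equal for every pair.

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
```

The second term shrinks as B grows, which is the averaging effect. The first term does not, which is why lowering the correlation between trees through bootstrap sampling and feature subsetting matters as much as adding more trees.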

This does not mean random forests cannot overfit at all. Adding more trees does not by itself cause overfitting, but if every tree is grown very deep on a small or noisy dataset, the forest can still memorize the data. The threshold is far higher than for a single tree, and practical defaults (100-500 trees, no max depth, at least a few samples per leaf) work well on most problems with little or no tuning.

Key Hyperparameters

n_estimators is the number of trees. More trees generally improve performance but with diminishing returns. Going from 10 to 100 trees is a large improvement. Going from 100 to 500 is modest. Going from 500 to 5000 is barely measurable but costs 10x more compute. 100-500 trees is the practical range for most problems.

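max_features controls how many features each split considers. For classification, the usual default is the square root of the total number of features. For regression, the classic recommendation is one-third of the features (the default in R's randomForest), although scikit-learn's RandomForestRegressor defaults to considering all features. Lower values increase diversity between trees (reducing variance) but make each individual tree weaker (increasing bias). The classification default works well in practice; for regression, check which default your library uses.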
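To get a feel for the two rules of thumb, the sketch below computes the number of candidate features per split for a few dataset widths. The helper names are hypothetical, and rounding down with a floor of one feature is an assumption about how a library would round.

```python
import math

def sqrt_rule(n_features):
    """Square-root rule, commonly used for classification."""
    return max(1, math.isqrt(n_features))

def third_rule(n_features):
    """One-third rule, the classic recommendation for regression."""
    return max(1, n_features // 3)

sizes = {n: (sqrt_rule(n), third_rule(n)) for n in (9, 50, 200)}
for n, (k_sqrt, k_third) in sizes.items():
    print(f"{n:>3} features: sqrt rule -> {k_sqrt:>2}, one-third rule -> {k_third:>2}")
```

The gap between the two rules widens quickly as the feature count grows, so the choice matters most on wide datasets.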

max_depth limits tree depth. Unlimited depth (the default) lets each tree fully fit its bootstrap sample. Limiting depth reduces individual tree accuracy but can reduce overall forest variance. For most datasets, the default works fine.

min_samples_leaf sets the minimum number of samples at a leaf node. Higher values prevent leaves from being too specific. Setting this to 5 or 10 provides mild regularization and speeds up training.

Practical Advantages

Works out of the box. Random forests with default hyperparameters produce competitive results on most structured data problems. This is rare among ML algorithms. SVMs need careful kernel and regularization tuning. Neural networks need architecture design, learning rate schedules, and extensive hyperparameter search. A random forest with 100 trees and default settings is often good enough to deploy.

Feature importance. Random forests provide a natural measure of feature importance based on how much each feature improves purity across all splits in all trees. This is valuable for understanding the data and for feature selection. If a dataset has 200 features and the forest reveals that 15 of them account for 90% of the predictive power, you can simplify subsequent models and improve interpretability.
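The purity-based importance described above can be sketched for a single tree as follows. The sketch assumes Gini impurity as the purity measure, which the text does not specify, and the feature names and class counts are invented for illustration. A forest would average these scores over all of its trees.

```python
from collections import defaultdict

def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)

def weighted_decrease(parent, left, right, n_total):
    """Impurity decrease of one split, weighted by the share of all
    training samples that reach the parent node."""
    n_parent = sum(parent)
    children = (sum(left) * gini(left) + sum(right) * gini(right)) / n_parent
    return (n_parent / n_total) * (gini(parent) - children)

# Hypothetical splits: (feature, parent counts, left counts, right counts).
splits = [
    ("amount", [50, 50], [40, 10], [10, 40]),
    ("country", [40, 10], [38, 2], [2, 8]),
]
n_total = 100

raw = defaultdict(float)
for feature, parent, left, right in splits:
    raw[feature] += weighted_decrease(parent, left, right, n_total)

# Normalize so the importances sum to 1.
total = sum(raw.values())
importance = {feature: value / total for feature, value in raw.items()}
print(importance)
```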

Handles missing data gracefully. Some implementations can work with missing values directly, for example by splitting on surrogate features. Even without built-in support, random forests tend to be robust to simple imputation strategies because the ensemble averages over many trees built from different subsets of rows and features.

Parallelizable. Each tree is trained independently, so training time scales close to linearly with available CPU cores, aside from scheduling and memory overhead. Training 500 trees on 16 cores takes about the same wall-clock time as training 31 trees on one core.
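The arithmetic behind that estimate can be written out as a back-of-the-envelope sketch. It assumes every tree takes the same time to train and ignores scheduling and memory overhead. The function name and the two-seconds-per-tree figure are hypothetical.

```python
import math

def ideal_wall_clock(n_trees, n_cores, seconds_per_tree):
    """Idealized training time: the busiest core determines the total."""
    trees_on_busiest_core = math.ceil(n_trees / n_cores)
    return trees_on_busiest_core * seconds_per_tree

serial = ideal_wall_clock(500, 1, 2.0)
parallel = ideal_wall_clock(500, 16, 2.0)
speedup = serial / parallel

print(f"1 core:   {serial:.0f} s")
print(f"16 cores: {parallel:.0f} s")
print(f"speedup:  {speedup:.1f}x")
```

Since 500 does not divide evenly by 16, the busiest core builds 32 trees rather than the 31.25 average, so the ideal speedup is slightly under 16x.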

Limitations

Less interpretable than single trees. A single decision tree can be printed on a page and followed by a human. A forest of 500 trees cannot. Feature importance scores provide partial interpretability, but you lose the ability to explain individual predictions through a simple flowchart. For applications requiring explainability, SHAP values or partial dependence plots can restore some interpretability.

Large model size. A forest of 500 deep trees on a large dataset can consume gigabytes of memory. For deployment on resource-constrained devices, model compression or switching to a simpler algorithm may be necessary.

Outperformed by gradient boosting on many problems. Gradient-boosted trees (XGBoost, LightGBM, CatBoost) often achieve slightly higher accuracy than random forests because they build trees sequentially, with each tree correcting the errors of the previous ones. The improvement is typically 1-3% on accuracy metrics, but in competitive settings, that margin matters.

Key Takeaway

Random forests combine hundreds of randomized decision trees to produce highly accurate, overfitting-resistant predictions. The algorithm works well with default settings, handles messy real-world data, and provides useful feature importance rankings. It is a common "first try" algorithm for structured data problems and remains one of the most reliable tools in the ML practitioner's toolkit.