Ensemble Methods Explained: Bagging, Boosting, and Stacking
Why Ensembles Work
The mathematical foundation is surprisingly simple. If you have a collection of models that are individually better than random chance and whose errors are at least partially independent, combining their predictions usually beats any single one of them. For averaged predictions under a convex loss such as squared error, the ensemble is never worse than the average individual model, and it is usually much better. This is the same principle that makes polling averages more reliable than individual polls and jury verdicts more reliable than individual judgments.
Consider three binary classifiers, each with 70% accuracy, making independent errors. The ensemble prediction is the majority vote, so the ensemble is wrong whenever at least two of the three models are wrong. The probability of all three being wrong is 0.3 * 0.3 * 0.3 = 2.7%. The probability of exactly two being wrong is 3 * 0.3 * 0.3 * 0.7 = 18.9%. Together, the probability that the majority is wrong is 21.6%, giving the ensemble 78.4% accuracy, a significant improvement over any individual model's 70%.
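The majority-vote arithmetic above can be checked, and extended to more models, with a short binomial calculation. This is a minimal sketch that assumes fully independent errors and an odd number of voters; the function name `majority_vote_accuracy` is made up for illustration.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p_correct: float) -> float:
    """Probability that a strict majority of independent models is correct.

    Assumes every model has the same accuracy and that errors are fully
    independent, which real ensembles only approximate.
    """
    if n_models % 2 == 0:
        raise ValueError("use an odd number of models to avoid tied votes")
    needed = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )


acc_single = majority_vote_accuracy(1, 0.7)
acc_three = majority_vote_accuracy(3, 0.7)
print(f"1 model: {acc_single:.3f}, 3 models: {acc_three:.3f}")
```

Calling the function with 5, 7, or more models shows the accuracy continuing to climb, but only as long as the independence assumption holds.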
The key assumption is error independence. If all three models make the same mistakes (they are correlated), combining them adds no value. This is why ensemble methods deliberately introduce diversity: different training data subsets, different feature subsets, different algorithms, or different hyperparameters. The more diverse the component models, the more their errors cancel out.
Bagging: Parallel Ensemble on Random Subsets
Bagging (bootstrap aggregating) trains multiple instances of the same algorithm on different random subsets of the training data, then combines their predictions. Each subset is created by sampling with replacement (bootstrap sampling), producing datasets the same size as the original but with some rows duplicated and roughly 37% of rows missing from each sample.
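The "roughly 37%" figure comes from the chance that a given row is never drawn in n draws with replacement, (1 - 1/n)^n, which approaches 1/e as n grows. A small sketch using only the standard library illustrates this; the helper name `bootstrap_indices`, the seed, and the row count are arbitrary choices for the example.

```python
import math
import random


def bootstrap_indices(n_rows: int, rng: random.Random) -> list[int]:
    """Draw n_rows row indices uniformly at random, with replacement."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


rng = random.Random(0)  # fixed seed so the run is repeatable
n_rows = 10_000
sample = bootstrap_indices(n_rows, rng)

# Fraction of original rows that never appear in this bootstrap sample.
oob_fraction = 1 - len(set(sample)) / n_rows
# Theoretical probability that a given row is missed by all n_rows draws.
expected_oob = (1 - 1 / n_rows) ** n_rows

print(f"observed: {oob_fraction:.3f}, expected: {expected_oob:.3f}, 1/e: {math.exp(-1):.3f}")
```

The rows left out of a sample are often called out-of-bag rows, and they can serve as a built-in validation set for the model trained on that sample.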
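The most famous bagging implementation is the random forest, which bags decision trees with the additional twist of random feature subsetting at each split. Bagging can wrap any base learner, but it helps most with high-variance, unstable ones such as deep decision trees and neural networks. Stable learners such as KNN or SVMs can be bagged too, but they usually gain less.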
Bagging primarily reduces variance. Each individual model overfits to its particular bootstrap sample, but they overfit in different directions because they see different data. Averaging across hundreds of models smooths out these individual overfitting patterns, leaving the consistent signal. Bias is largely unchanged because each individual model is still fit to a representative sample of the data.
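To make the train-on-bootstrap-samples-then-average loop concrete, here is a minimal sketch for one-dimensional regression. The base learner `fit_stump` (a depth-1 regression tree) and the wrapper `fit_bagged` are hypothetical helpers written for this example, not any library's API, and the toy data is invented.

```python
import random


def fit_stump(xs, ys):
    """Fit a depth-1 regression tree on a 1-D feature; returns a predict function."""
    mean_all = sum(ys) / len(ys)
    best = (float("inf"), None, mean_all, mean_all)
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((y - left_mean) ** 2 for y in left) + sum((y - right_mean) ** 2 for y in right)
        if sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    if threshold is None:  # the sample had a single distinct x value
        return lambda x: mean_all
    return lambda x: left_mean if x <= threshold else right_mean


def fit_bagged(fit, xs, ys, n_models, rng):
    """Train n_models base learners on bootstrap samples and average their predictions."""
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        models.append(fit([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(m(x) for m in models) / len(models)


# Toy data, invented for illustration: y = x^2 on 20 evenly spaced points.
xs = [i / 10 for i in range(20)]
ys = [x * x for x in xs]

bagged = fit_bagged(fit_stump, xs, ys, n_models=100, rng=random.Random(42))
bagged_at_one = bagged(1.0)
print(f"bagged prediction at x=1.0: {bagged_at_one:.3f}")
```

Each stump picks a slightly different split because it sees a different sample, so the averaged prediction is a smoother function of x than any single stump's two-level step.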
Bagging is embarrassingly parallel: every model is independent, so training scales close to linearly with available CPU or GPU cores. A random forest with 500 trees on 16 cores takes about the same wall-clock time as 31 trees on a single core (500 / 16 ≈ 31).
Boosting: Sequential Error Correction
Boosting builds models sequentially, with each new model focusing specifically on the examples that previous models got wrong. Instead of training on random subsets, boosting reweights the training data so that misclassified examples get higher weights, forcing the next model to pay more attention to them.
AdaBoost (Adaptive Boosting) was the first practical boosting algorithm. After each round, it increases the weights of misclassified examples and decreases the weights of correctly classified ones. Each model also receives a vote weight that grows with its weighted accuracy (in the classic formulation, half the log-odds of being correct), so better models have more influence in the final vote. AdaBoost typically uses simple decision stumps (depth-1 trees) as base models, combining many weak learners into a strong learner.
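One round of the reweighting can be sketched as follows. The text does not spell out the formulas, so this assumes the classic discrete AdaBoost update for binary labels: model weight alpha = 0.5 * ln((1 - err) / err), with example weights multiplied by exp(+alpha) when misclassified and exp(-alpha) when correct. The function name `adaboost_round` is made up for the example.

```python
import math


def adaboost_round(weights, is_correct):
    """One discrete-AdaBoost reweighting step (assumed classic formulation).

    weights    -- current example weights
    is_correct -- per-example booleans: did this round's weak learner get it right?
    Returns (alpha, new_weights), with new_weights normalized to sum to 1.
    """
    total = sum(weights)
    err = sum(w for w, ok in zip(weights, is_correct) if not ok) / total
    err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0) and division by zero
    alpha = 0.5 * math.log((1 - err) / err)  # the weak learner's vote weight
    updated = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, is_correct)]
    norm = sum(updated)
    return alpha, [w / norm for w in updated]


# Five equally weighted examples; the weak learner misclassifies only the last one.
weights = [0.2] * 5
alpha, new_weights = adaboost_round(weights, [True, True, True, True, False])
print(f"alpha = {alpha:.3f}, new weights = {[round(w, 3) for w in new_weights]}")
```

After the update, the single misclassified example carries half of the total weight, which is what forces the next weak learner to attend to it.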
Gradient Boosting generalizes the idea by fitting each new model to the residual errors of the combined model so far. For squared-error loss these residuals are the differences between actual values and current predictions; in general they are the negative gradient of the loss with respect to the predictions. Instead of reweighting examples, each new tree directly predicts the correction needed. This approach is more flexible because it works with any differentiable loss function.
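The fit-to-residuals loop is short enough to write out for the squared-error case. This is a sketch, not how XGBoost or LightGBM are implemented: `fit_stump` and `gradient_boost` are hypothetical helpers, the feature is one-dimensional, and the `learning_rate` shrinkage factor and toy data are choices made for the example.

```python
def fit_stump(xs, targets):
    """Fit a depth-1 regression tree on a 1-D feature (needs at least 2 distinct xs)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((t - left_mean) ** 2 for t in left) + sum((t - right_mean) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Squared-error gradient boosting: each stump is fit to the current residuals."""
    base = sum(ys) / len(ys)  # start from the constant mean prediction
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # negative gradient of 0.5 * (y - p)^2
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(predict, xs, ys):
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data, invented for illustration.
xs = [float(i) for i in range(10)]
ys = [x * x for x in xs]

mean_y = sum(ys) / len(ys)
baseline_mse = mse(lambda x: mean_y, xs, ys)
boosted_mse = mse(gradient_boost(xs, ys), xs, ys)
print(f"baseline MSE: {baseline_mse:.1f}, boosted MSE: {boosted_mse:.1f}")
```

Swapping the residual line for the negative gradient of another differentiable loss is what turns this into the general method.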
XGBoost, LightGBM, and CatBoost are optimized implementations of gradient boosting that dominate machine learning competitions on structured data. XGBoost introduced regularization and efficient tree construction. LightGBM uses histogram-based splitting for speed on large datasets. CatBoost handles categorical features natively without one-hot encoding. In Kaggle competitions on tabular data, gradient boosting wins more often than any other algorithm family.
Boosting primarily reduces bias, because each new model is specifically designed to correct the remaining errors. But it can also reduce variance through regularization parameters like learning rate (how much each new model contributes), max tree depth, and subsampling rate. The learning rate is critical: a small learning rate (0.01-0.1) with many trees generally outperforms a large learning rate with few trees, because the gradual approach avoids overshooting the optimal solution.
Stacking: Meta-Learning from Diverse Models
Stacking (stacked generalization) trains a meta-model to learn how to best combine the predictions of several diverse base models. Unlike bagging (same algorithm, different data) or boosting (same algorithm family, sequential), stacking uses different algorithms as base models and a separate algorithm to combine them.
The process has two levels. Level 0: Train several diverse base models (logistic regression, random forest, SVM, neural network) on the training data. Generate out-of-fold predictions for each model using cross-validation to avoid data leakage. Level 1: Use the out-of-fold predictions as features for a meta-model (often logistic regression or a simple linear model) that learns the optimal weighting of each base model's predictions.
Stacking is powerful because different algorithms capture different aspects of the data. A linear model captures linear trends. A tree-based model captures feature interactions. A KNN model captures local structure. The meta-model learns that the linear model is more reliable in one region of the feature space while the tree model is better in another.
Stacking is the most complex ensemble method to implement correctly, because the meta-model must be trained only on predictions the base models made for rows they did not see during training. That is why the out-of-fold predictions are generated with cross-validation. If you train base models on all training data and then train the meta-model on their predictions for that same data, the meta-model will overfit to the base models' memorized noise.
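The out-of-fold step and the leakage it prevents can be sketched in plain Python. Everything here is illustrative: `k_fold_indices`, `out_of_fold_predictions`, and the two toy base learners are hypothetical helpers, the folds are contiguous and unshuffled for simplicity, and fitting the level-1 meta-model itself is left out.

```python
def k_fold_indices(n_rows, n_folds):
    """Split row indices into n_folds contiguous validation folds (no shuffling)."""
    sizes = [n_rows // n_folds + (1 if i < n_rows % n_folds else 0) for i in range(n_folds)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def out_of_fold_predictions(fit, xs, ys, n_folds=5):
    """Predict every row with a model that never saw that row during training."""
    oof = [None] * len(xs)
    for val_idx in k_fold_indices(len(xs), n_folds):
        held_out = set(val_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = fit(train_x, train_y)
        for i in val_idx:
            oof[i] = model(xs[i])
    return oof


def fit_nearest_neighbor(train_x, train_y):
    """Toy base learner: 1-nearest-neighbor, which memorizes its training set."""
    def predict(x):
        nearest = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
        return train_y[nearest]
    return predict


def fit_mean(train_x, train_y):
    """Toy base learner: always predicts the training mean."""
    mean = sum(train_y) / len(train_y)
    return lambda x: mean


# Toy data, invented for illustration.
xs = [float(i) for i in range(10)]
ys = [x * x for x in xs]

# Leaky version: the model is scored on rows it memorized, so it looks perfect.
leaky_model = fit_nearest_neighbor(xs, ys)
in_sample = [leaky_model(x) for x in xs]

# Honest version: level-1 features built from out-of-fold predictions only.
oof_nn = out_of_fold_predictions(fit_nearest_neighbor, xs, ys)
oof_mean = out_of_fold_predictions(fit_mean, xs, ys)
meta_features = [list(row) for row in zip(oof_nn, oof_mean)]
print(meta_features[:3])
```

The in-sample nearest-neighbor predictions reproduce the targets exactly, so a meta-model trained on them would learn to trust that model completely. The out-of-fold columns show each base model's real behavior on unseen rows, and they are what the meta-model should be fit on.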
Choosing an Ensemble Strategy
Start with gradient boosting (XGBoost or LightGBM) for structured/tabular data. It is the single best algorithm family for most structured data problems, combining the benefits of boosting with efficient implementation and built-in regularization.
Use random forests when you want a model that works well with minimal tuning, need feature importance scores, or want fast parallel training. Random forests are harder to overfit than gradient boosting and serve as an excellent baseline.
Use stacking when you are in a competition or production system where the last 0.5-1% of accuracy matters and you have the engineering resources to maintain multiple models. In practice, most production systems use a single well-tuned gradient boosting model because the maintenance cost of stacking rarely justifies the marginal improvement.
Ensembles improve predictions by combining multiple models whose errors are partially independent. Bagging (random forests) reduces variance through parallel averaging. Boosting (XGBoost, LightGBM) reduces bias through sequential error correction. Stacking combines diverse algorithms through a meta-learner. Gradient boosting is the default choice for structured data: it wins tabular-data competitions more often than any other algorithm family, and it is what most production systems settle on.