Decision Trees Explained
How a Decision Tree Works
A decision tree is a tree-shaped structure where each internal node represents a test on one feature, each branch represents the outcome of that test, and each leaf node represents a prediction. The tree is built top-down by choosing the best feature to split on at each level.
Consider a dataset of loan applicants with features like income, credit score, employment length, and debt-to-income ratio, where the label is "approved" or "denied." The algorithm examines all features and all possible split points to find the one that best separates approved and denied applicants. Suppose the best first split is credit score > 680. All applicants with credit scores above 680 go down the right branch, the rest go left.
At each child node, the process repeats. The right branch (high credit score) might next split on debt-to-income ratio. The left branch (low credit score) might split on income. This continues recursively until one of several stopping conditions is met: the node contains only one class (all approved or all denied), the tree reaches a maximum depth, or the node has fewer samples than a minimum threshold.
For classification, each leaf node predicts the majority class among the training samples that reached it. If 47 of the 50 training samples at a leaf were "approved," the leaf predicts "approved" with 94% confidence. For regression, each leaf predicts the average value of the training samples, such as predicting an average house price of $385,000 based on the 120 houses that fell into that branch.
How the Algorithm Chooses Splits
The split selection is the heart of the algorithm. At each node, the tree evaluates every feature and every possible threshold to find the split that produces the most "pure" child nodes, meaning child nodes where the samples are as homogeneous as possible.
For classification trees, two metrics measure purity. Gini impurity measures the probability of a randomly chosen sample being misclassified if it were labeled according to the class distribution at that node. A node where 100% of samples are one class has Gini = 0 (perfect purity). A node with a 50/50 split has Gini = 0.5 (maximum impurity for binary classification). Information gain (based on entropy) measures how much the split reduces the uncertainty about the class label. Both metrics produce similar trees in practice.
For regression trees, the split criterion is typically variance reduction. The algorithm chooses the split that minimizes the weighted average variance of the target variable in the two child nodes. A good split sends high-value targets down one branch and low-value targets down the other, reducing the spread within each group.
The algorithm is greedy: it picks the best split at each step without considering whether a different choice might lead to a better overall tree. This makes the algorithm fast (polynomial time) but means it does not guarantee the globally optimal tree (which would require exponential time to find).
Strengths of Decision Trees
Interpretability is the standout advantage. You can visualize the entire decision process as a flowchart and explain every prediction to a non-technical audience. In regulated industries like banking, healthcare, and insurance, this transparency is legally required. A bank must be able to explain why a loan was denied, and a decision tree provides that explanation directly.
No feature scaling needed. Unlike algorithms such as SVM or KNN that depend on feature distances, decision trees only compare values within a single feature. Whether income is measured in dollars or thousands of dollars, the tree produces the same splits. This eliminates the preprocessing step of normalizing or standardizing features.
Handles both numerical and categorical features. Decision trees naturally handle mixed data types. For numerical features, splits are threshold comparisons (income > 50000). For categorical features, splits group categories (state is California, Oregon, or Washington vs all others). Many other algorithms require categorical features to be encoded as numbers first.
Captures nonlinear relationships. A decision tree can model complex, nonlinear interactions between features without explicit feature engineering. If high income combined with high debt is risky but high income with low debt is safe, the tree captures this through branching. Linear models would require manually creating an interaction term.
The Overfitting Problem
The major weakness of decision trees is overfitting. An unrestricted tree will keep splitting until every leaf is perfectly pure, which means it has memorized the training data, including its noise and outliers. A tree with 10,000 leaves on 10,000 training samples has not learned general patterns; it has built a lookup table.
Signs of overfitting include: very high training accuracy with much lower test accuracy, a tree with hundreds of leaf nodes on a small dataset, and leaf nodes with only one or two samples. The tree has learned that sample #4,721 belongs to class A because of its unique combination of features, not because that combination generally indicates class A.
Several techniques control overfitting. Max depth limits how many levels the tree can have. Min samples per leaf prevents the tree from creating tiny leaf nodes. Min samples per split requires a minimum number of samples before a node can be split. Pruning grows a full tree and then removes branches that do not improve performance on a validation set.
Even with these controls, a single decision tree is often outperformed by ensemble methods that combine many trees, specifically random forests and gradient-boosted trees. These approaches use the interpretable tree structure as a building block while eliminating the overfitting weakness through aggregation.
When to Use Decision Trees
Use a single decision tree when interpretability is the top priority and model complexity must be minimal. Medical screening tools, insurance risk assessments, and customer service decision guides often use simple decision trees because they can be printed on a single page and followed by a human.
For maximum prediction accuracy on tabular data, use random forests or gradient-boosted trees instead. They build on the same splitting logic but combine hundreds of trees to reduce variance and improve generalization. In Kaggle competitions on structured data, gradient-boosted trees (XGBoost, LightGBM, CatBoost) win far more often than any other algorithm family.
Decision trees split data through a series of feature-based questions, creating an interpretable flowchart of predictions. They handle mixed data types, require no feature scaling, and capture nonlinear relationships. Their main weakness is overfitting, which is why they serve as the building block for random forests and gradient boosting rather than being used alone in high-accuracy applications.