How to Evaluate Machine Learning Models
Model evaluation is where most beginners make their most consequential mistakes. A model that looks excellent on the training data may be completely useless in production. Understanding why this happens and how to evaluate honestly is the difference between building models that work and building models that only appear to work.
Step 1: Split Your Data Properly
The cardinal rule of evaluation: never test a model on data it trained on. A model that memorizes its training data will score perfectly on that data and fail on everything else. Split your data before any model building or feature engineering.
The standard three-way split is: 60-80% for training, 10-20% for validation (used during model development to compare approaches and tune hyperparameters), and 10-20% for testing (used only once, at the end, for the final performance estimate). The test set must remain untouched until you are ready to report final results.
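The three-way split described above can be sketched with nothing but the standard library. This is a minimal illustration, not a prescribed implementation: the function name `three_way_split`, the 70/15/15 default fractions, and the fixed seed are all assumptions chosen for the example.

```python
import random

def three_way_split(rows, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle rows once and cut them into train / validation / test lists.

    The fractions and seed are illustrative defaults, not recommendations.
    """
    if train_frac <= 0 or val_frac <= 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must be positive and leave room for a test set")
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split reproducible
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # whatever remains becomes the test set
    return train, val, test

train, val, test = three_way_split(range(1000))
print(len(train), len(val), len(test))
```

Splitting once, up front, and keeping the `test` list out of every later step is the point; the exact fractions matter much less.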
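For small datasets (roughly under 10,000 samples), use cross-validation instead of a fixed split. K-fold cross-validation divides the data into K equal parts (folds) and trains K models, each on K-1 of the folds, evaluating it on the one fold held out. With K = 5, each model trains on a different 80% of the data and is evaluated on the remaining 20%. The average score across folds is more reliable than a single split because every sample is used for evaluation exactly once and for training K-1 times.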
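A rough sketch of how K-fold index generation might look, using only the standard library. The name `k_fold_indices` and the round-robin way of dealing samples into folds are assumptions made for illustration; libraries such as scikit-learn provide their own tested equivalents.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs; every sample is validated exactly once."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, val_idx in enumerate(folds):
        # Training indices are everything outside the held-out fold.
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, val_idx

for train_idx, val_idx in k_fold_indices(n_samples=10, k=5):
    print(len(train_idx), len(val_idx))
```

Each pass would train a fresh model on `train_idx` and score it on `val_idx`; the per-fold scores are then averaged.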
For time series data, use time-based splitting: train on past data, test on future data. Random splitting would leak future information into the training set.
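As a small sketch of time-based splitting, the function below cuts a list of records at a date. The record layout (dictionaries with a `date` field), the `sales` values, and the cutoff are hypothetical and exist only to make the example runnable.

```python
from datetime import date

def time_based_split(records, cutoff, time_key="date"):
    """Train on records strictly before the cutoff, test on the cutoff and later."""
    ordered = sorted(records, key=lambda r: r[time_key])
    train = [r for r in ordered if r[time_key] < cutoff]
    test = [r for r in ordered if r[time_key] >= cutoff]
    return train, test

# Twelve hypothetical monthly observations for one year.
records = [{"date": date(2024, m, 1), "sales": 100 + m} for m in range(1, 13)]
train, test = time_based_split(records, cutoff=date(2024, 10, 1))
print(len(train), len(test))
```

Because nothing in `train` is dated on or after the cutoff, the model never sees information from the period it is tested on.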
Step 2: Choose Metrics That Match Your Problem
For classification:
Accuracy is the percentage of correct predictions. It is intuitive but misleading on imbalanced data. A model that always predicts "not fraud" achieves 99.9% accuracy on a dataset where 0.1% of transactions are fraud, while catching zero actual fraud.
Precision answers: of all the items the model labeled positive, how many were actually positive? High precision means few false positives. A spam filter with 99% precision puts very few legitimate emails in the spam folder.
Recall (sensitivity) answers: of all the actually positive items, how many did the model find? High recall means few false negatives. A cancer screening test with 99% recall misses very few actual cancers.
F1 score is the harmonic mean of precision and recall: 2 * (precision * recall) / (precision + recall). It weights the two equally, so it is high only when both are high, which makes it a common default metric for imbalanced binary classification.
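The definitions of accuracy, precision, recall, and F1 can be checked by hand with a short sketch. The function name `classification_metrics` is an assumption, as is the convention of returning 0.0 when a ratio is undefined (for example, precision when the model predicts no positives). The data reproduces the fraud scenario above at a scale of 1,000 transactions.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    # Assumed convention: undefined ratios are reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}

# 1,000 transactions, one of them fraud (0.1%); the model always says "not fraud".
y_true = [1] + [0] * 999
always_negative = [0] * 1000
print(classification_metrics(y_true, always_negative))
```

The accuracy comes out very high while recall and F1 are zero, which is exactly the gap between "looks good" and "catches fraud" that the accuracy paragraph warns about.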
AUC-ROC measures the model's ability to rank positive examples above negative examples, independent of any threshold choice. An AUC of 0.5 means random guessing; 1.0 means perfect ranking. It is a good choice when you care about ranking quality rather than a specific threshold decision.
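One way to see what AUC-ROC measures is to compute it directly as the fraction of (positive, negative) pairs in which the positive example receives the higher score, counting ties as half. The sketch below does that by brute force; the name `auc_roc` is an assumption, and the quadratic loop is meant for understanding rather than for large datasets.

```python
def auc_roc(y_true, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:            # O(len(pos) * len(neg)): fine for a demo, slow at scale
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

print(auc_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

Note that only the ordering of the scores matters here; no classification threshold appears anywhere in the computation.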
For regression:
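RMSE (root mean squared error) penalizes large errors more than small ones and is in the same units as the target. An RMSE of $25,000 for house prices means predictions are off by roughly $25K in a root-mean-square sense; because large errors are weighted more heavily, the typical error is often somewhat smaller.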
MAE (mean absolute error) is more robust to outliers than RMSE. It treats a $100K error as exactly 10x worse than a $10K error, while the squared error inside RMSE weights it 100x more heavily.
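R-squared measures the proportion of variance explained by the model. An R-squared of 0.85 means the model explains 85% of the target's variation. It provides a scale-independent measure of fit quality. On held-out data it can be negative, which means the model does worse than always predicting the mean.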
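The three regression metrics can be computed side by side in a few lines. This is a sketch under stated assumptions: the function name `regression_metrics` and the four house prices (in thousands of dollars) are made up, with one deliberately large miss to show how RMSE and MAE diverge.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, and R-squared for two equal-length lists of numbers."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R-squared is undefined when the target is constant")
    return {"mae": mae, "rmse": rmse, "r2": 1 - ss_res / ss_tot}

# Hypothetical prices in $K; the last prediction misses by 100.
y_true = [200, 300, 400, 500]
y_pred = [210, 290, 410, 400]
print(regression_metrics(y_true, y_pred))
```

Three of the four errors are only 10, yet the single error of 100 pulls RMSE well above MAE, which is the outlier sensitivity described above.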
Step 3: Compute Metrics on Held-Out Data
Train the model on the training set and compute metrics on the validation set. Compare the training and validation metrics. If training accuracy is 98% but validation accuracy is 72%, the model is overfitting. If both are low and close together, say around 65% on a task where better performance is achievable, the model is likely underfitting.
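The train-versus-validation comparison can be written down as a simple rule of thumb. Everything specific in this sketch is an assumption: the name `diagnose_fit`, the 5-point gap threshold, and the 80% "acceptable" level are placeholders that would need to be set per problem.

```python
def diagnose_fit(train_score, val_score, max_gap=0.05, min_acceptable=0.80):
    """Rough heuristic: a large gap suggests overfitting, low scores underfitting.

    max_gap and min_acceptable are hypothetical thresholds, not standard values.
    """
    gap = train_score - val_score
    if gap > max_gap:
        return "overfitting"
    if train_score < min_acceptable:
        return "underfitting"
    return "ok"

print(diagnose_fit(0.98, 0.72))
print(diagnose_fit(0.65, 0.65))
print(diagnose_fit(0.90, 0.88))
```

A check like this is only a prompt to look closer; what counts as a "low" score depends on how hard the task is and on what a simple baseline achieves.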
Use cross-validation for reliable estimates. Report the mean and standard deviation across folds. A model with F1 = 0.82 +/- 0.01 across 5 folds is consistent and trustworthy. A model with F1 = 0.82 +/- 0.15 is unstable and the high average may be driven by one lucky fold.
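Reporting mean and spread across folds takes only the standard `statistics` module. The two lists of fold scores below are invented to mirror the stable and unstable cases just described, and using the sample standard deviation (`statistics.stdev`) rather than the population version is a choice made for this sketch.

```python
import statistics

def summarize_folds(scores):
    """Return (mean, sample standard deviation) of per-fold scores."""
    return statistics.mean(scores), statistics.stdev(scores)

# Hypothetical F1 scores from two 5-fold runs with the same average.
stable = [0.81, 0.82, 0.83, 0.82, 0.82]
unstable = [0.97, 0.70, 0.95, 0.62, 0.86]

for name, scores in [("stable", stable), ("unstable", unstable)]:
    mean, std = summarize_folds(scores)
    print(f"{name}: F1 = {mean:.2f} +/- {std:.2f}")
```

Both runs share the same mean, so only the standard deviation reveals that the second model's performance depends heavily on which fold it was evaluated on.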
Compute the final metric on the test set only once, after all model selection and tuning is complete. If you repeatedly evaluate on the test set and adjust your approach based on the results, you are implicitly fitting to the test set and your final numbers will be optimistic.
Step 4: Analyze Error Patterns
For classification, the confusion matrix breaks down predictions into true positives, false positives, true negatives, and false negatives. This reveals specific failure modes. Perhaps the model never confuses cats with dogs but frequently confuses cats with raccoons. That specific confusion tells you what to improve: more cat-raccoon training examples, or features that distinguish them.
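For more than two classes, a confusion matrix is just a count of (actual, predicted) pairs, which a `Counter` can hold directly. The sketch below uses made-up labels that echo the cat and raccoon example; the helper names `confusion_counts` and `most_confused` are assumptions.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(y_true, y_pred))

def most_confused(y_true, y_pred, top=3):
    """Return the most frequent misclassified (actual, predicted) pairs."""
    counts = confusion_counts(y_true, y_pred)
    errors = [(pair, n) for pair, n in counts.items() if pair[0] != pair[1]]
    return sorted(errors, key=lambda item: item[1], reverse=True)[:top]

# Hypothetical labels: cats are often mistaken for raccoons, never for dogs.
y_true = ["cat"] * 6 + ["dog"] * 4 + ["raccoon"] * 4
y_pred = (["cat"] * 3 + ["raccoon"] * 3
          + ["dog"] * 4
          + ["raccoon"] * 3 + ["cat"])

for (actual, predicted), n in most_confused(y_true, y_pred):
    print(f"actual={actual} predicted={predicted}: {n}")
```

Sorting the off-diagonal cells by count points straight at the class pair that deserves more training examples or better features.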
For regression, residual plots reveal systematic problems. Plot residuals (actual minus predicted) on the vertical axis against predicted values on the horizontal axis. A random cloud centered on zero means the model's errors are random (good). A curve means the model is missing a nonlinear pattern. A funnel shape means errors grow with the predicted value (heteroscedasticity).
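When a plot is not convenient, a coarse numeric stand-in is to bin residuals by predicted value and compare the mean and spread per bin. This is a simplified sketch, not a substitute for looking at the plot: the name `residual_summary`, the three-bin default, and the synthetic data (errors that grow with the prediction) are all assumptions.

```python
import statistics

def residual_summary(y_true, y_pred, n_bins=3):
    """Bin residuals by predicted value; return (mean_pred, mean_resid, spread) per bin."""
    pairs = sorted((p, t - p) for t, p in zip(y_true, y_pred))  # residual = actual - predicted
    size = len(pairs) // n_bins
    if size < 2:
        raise ValueError("need at least two points per bin")
    summary = []
    for b in range(n_bins):
        start = b * size
        end = len(pairs) if b == n_bins - 1 else start + size  # last bin takes leftovers
        chunk = pairs[start:end]
        preds = [p for p, _ in chunk]
        resid = [r for _, r in chunk]
        summary.append((statistics.mean(preds), statistics.mean(resid),
                        statistics.pstdev(resid)))
    return summary

# Synthetic funnel: the error is +/-10% of the predicted value.
y_pred = [float(p) for p in range(1, 10)]
y_true = [p + (0.1 * p if i % 2 == 0 else -0.1 * p) for i, p in enumerate(y_pred)]
for mean_pred, mean_resid, spread in residual_summary(y_true, y_pred):
    print(f"pred~{mean_pred:.1f}  mean residual={mean_resid:+.2f}  spread={spread:.2f}")
```

A spread that rises from bin to bin corresponds to the funnel shape; a mean residual that drifts steadily away from zero across bins would correspond to the curve.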
Examine the worst predictions individually. What do the highest-error examples have in common? Maybe the model fails on a specific subgroup (young patients, rural areas, weekend transactions). These insights drive targeted improvements.
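Pulling out the worst predictions and averaging error by subgroup can be sketched as follows. The record fields (`area`, `actual`, `predicted`), the values, and both helper names are hypothetical and exist only for illustration.

```python
from collections import defaultdict

def worst_predictions(records, n=5):
    """Return the n records with the largest absolute error."""
    return sorted(records, key=lambda r: abs(r["actual"] - r["predicted"]),
                  reverse=True)[:n]

def mean_abs_error_by_group(records, group_key):
    """Average absolute error for each value of group_key."""
    totals = defaultdict(lambda: [0.0, 0])   # group -> [error sum, count]
    for r in records:
        bucket = totals[r[group_key]]
        bucket[0] += abs(r["actual"] - r["predicted"])
        bucket[1] += 1
    return {group: total / count for group, (total, count) in totals.items()}

# Hypothetical house-price predictions in $K.
records = [
    {"area": "urban", "actual": 300, "predicted": 310},
    {"area": "urban", "actual": 250, "predicted": 245},
    {"area": "rural", "actual": 150, "predicted": 210},
    {"area": "rural", "actual": 120, "predicted": 180},
    {"area": "urban", "actual": 400, "predicted": 395},
]
print(worst_predictions(records, n=2))
print(mean_abs_error_by_group(records, "area"))
```

In this toy data both of the worst predictions are rural and the rural average error is far higher, the kind of pattern that suggests collecting more data or adding features for that subgroup.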
Step 5: Compare Against Baselines
A model's metrics are meaningless in isolation. You need a baseline to judge whether the model adds value. Common baselines include: always predicting the majority class (for classification), always predicting the mean (for regression), using the most recent value as the prediction (for time series), and using a simple rule-based heuristic.
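The first three baselines listed above each fit in a line or two. The sketch below is illustrative only: the function names and all data are assumptions, and a rule-based heuristic is omitted because it depends entirely on the problem.

```python
from collections import Counter

def majority_class_baseline(y_train):
    """Return the most common training label, to be predicted for every input."""
    return Counter(y_train).most_common(1)[0][0]

def mean_baseline(y_train):
    """Return the training mean, to be predicted for every input."""
    return sum(y_train) / len(y_train)

def last_value_baseline(series):
    """Pair each value with the one before it: (prediction, actual)."""
    return list(zip(series[:-1], series[1:]))

# Hypothetical labels: the majority class alone yields high accuracy.
y_train = ["ok"] * 890 + ["fraud"] * 110
y_val = ["ok"] * 89 + ["fraud"] * 11
majority = majority_class_baseline(y_train)
baseline_accuracy = sum(1 for y in y_val if y == majority) / len(y_val)
print(majority, baseline_accuracy)

print(mean_baseline([200, 300, 400]))
print(last_value_baseline([10, 12, 11, 15]))
```

Whatever score the real model reaches should be read against numbers like `baseline_accuracy`, computed on the same held-out data.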
If your sophisticated neural network achieves 91% accuracy but the majority-class baseline achieves 89% accuracy, the model improves on the baseline by only 2 percentage points. That may not justify the complexity, compute cost, and maintenance burden. If logistic regression achieves 90% and the neural network achieves 91%, the simpler model is usually the better choice for production.
In summary: evaluate on held-out data, choose metrics that reflect the real-world cost of errors, analyze error patterns to find specific failure modes, and always compare against a simple baseline. The metric you choose determines which models you select and what you tune toward, so choosing the wrong metric can do more damage than choosing the wrong algorithm.