How to Use Cross-Validation
A single train-test split leaves the performance estimate at the mercy of chance. Maybe the test set happened to contain the easy examples, making the model look better than it is. Maybe a critical subgroup ended up entirely in the training set, so the model was never tested on it. Cross-validation does not eliminate this randomness, but it reduces its influence by averaging across multiple splits, and it shows how much the score varies from split to split.
Step 1: Choose the Right CV Strategy
K-fold cross-validation is the default. The data is divided into K parts (folds) of roughly equal size. The model trains on K-1 folds and evaluates on the remaining fold. This rotates K times so every fold serves as the test set exactly once. K=5 and K=10 are the standard choices. K=5 balances bias and variance well for most problems. K=10 trains each model on 90% of the data rather than 80%, so the estimate is slightly less pessimistic, but it requires twice as many training runs.
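The rotation described above can be sketched with scikit-learn's KFold splitter. This is a minimal illustration on a toy array; the data, the variable names, and the choice of shuffle=True with a fixed seed are assumptions made for the example, not part of the original text.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data (assumed for illustration): 10 samples, 2 features.
X = np.arange(20).reshape(10, 2)

kf = KFold(n_splits=5, shuffle=True, random_state=0)

test_counts = np.zeros(len(X), dtype=int)  # how often each sample is tested
fold_sizes = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    test_counts[test_idx] += 1
    fold_sizes.append((len(train_idx), len(test_idx)))
    print(f"fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Every sample lands in a test fold exactly once, and each model trains on the other K-1 folds.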
Stratified K-fold ensures each fold has approximately the same class distribution as the full dataset. For a dataset with 90% negative and 10% positive examples, every fold will also be roughly 90/10. This is essential for imbalanced classification because a random split might create a fold with zero positive examples, producing meaningless evaluation results.
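As a small sketch of the 90/10 case, StratifiedKFold can be run on a synthetic label vector to check how many positives each test fold receives. The labels and the placeholder feature matrix are made up for the example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels (assumed): 90 negatives, 10 positives.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # placeholder features; only y drives the stratification

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Count the positive examples that end up in each test fold.
positives_per_fold = [int(y[test_idx].sum()) for _, test_idx in skf.split(X, y)]
fold_lengths = [len(test_idx) for _, test_idx in skf.split(X, y)]
print(positives_per_fold)
```

Note that the splitter needs the labels, so y is passed to split alongside X.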
Leave-one-out (LOO) sets K equal to the number of data points: each fold contains exactly one test example. LOO gives nearly unbiased performance estimates because every model trains on N-1 points, but it is computationally expensive (N separate training runs), and each per-fold score rests on a single point, so the resulting estimate can have high variance. It is usually practical only for very small datasets (as a rough guide, under about 500 samples).
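A minimal sketch of LOO using scikit-learn's LeaveOneOut splitter on a toy array (the six-sample dataset is an assumption for illustration) makes the cost visible: the number of splits equals the number of samples.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

# Toy data (assumed): 6 samples, 1 feature.
X = np.arange(6).reshape(6, 1)

loo = LeaveOneOut()
n_runs = loo.get_n_splits(X)  # one training run per sample

# Size of each test fold; every one should hold a single example.
test_sizes = [len(test_idx) for _, test_idx in loo.split(X)]
print(n_runs, test_sizes)
```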
Time series split respects temporal ordering. The first fold trains on months 1-3 and tests on month 4. The second fold trains on months 1-4 and tests on month 5. The training set always precedes the test set in time, preventing future data from leaking into training. This is mandatory for any time-dependent data.
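The month-by-month example above can be reproduced with TimeSeriesSplit. This sketch assumes one observation per month and uses the test_size parameter (available in recent scikit-learn versions) to keep each test fold to a single month; real data would have many rows per period.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# One row per month (assumed for illustration), already sorted by time.
months = np.arange(1, 6)  # months 1..5
X = months.reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=2, test_size=1)

splits = []
for train_idx, test_idx in tscv.split(X):
    train_months = months[train_idx].tolist()
    test_months = months[test_idx].tolist()
    splits.append((train_months, test_months))
    print(f"train on months {train_months}, test on months {test_months}")
```

The splitter relies on row order, so the data must be sorted chronologically before splitting.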
Group K-fold ensures all data from the same group stays in the same fold. If your data has multiple observations per patient, per store, or per user, you want all observations from a given patient in either training or test, not split across both. Splitting within a group would allow the model to "recognize" the group rather than learn general patterns.
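A short sketch with GroupKFold shows the guarantee in practice. The patient IDs below are hypothetical; the check confirms that no patient appears on both sides of any split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical patient IDs, several observations per patient.
groups = np.array(["p1", "p1", "p2", "p2", "p2", "p3", "p4", "p4"])
X = np.zeros((len(groups), 1))  # placeholder features

gkf = GroupKFold(n_splits=2)

overlaps = []
for train_idx, test_idx in gkf.split(X, groups=groups):
    shared = set(groups[train_idx]) & set(groups[test_idx])
    overlaps.append(shared)
    print("test patients:", sorted(set(groups[test_idx])), "shared:", shared)
```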
Step 2: Set Up the Splits
Use your ML framework's built-in cross-validation tools. In scikit-learn, KFold, StratifiedKFold, TimeSeriesSplit, and GroupKFold are all available as splitter objects that generate train-test index arrays for each fold.
The critical rule: all preprocessing that learns from the data must be fitted inside each fold, not before splitting. If you standardize features using the mean and standard deviation of the entire dataset and then split into folds, the test fold's statistics have leaked into training through the global mean and standard deviation. Instead, compute the mean and standard deviation from the training fold only, and apply that same transform to both the training and test folds. Scikit-learn's Pipeline object handles this automatically when the whole pipeline, rather than the bare model, is passed to the cross-validation function.
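As a sketch of the leak-free setup, the scaler and the model can be bundled with make_pipeline and passed to cross_val_score, which refits the scaler on each training fold. The synthetic dataset and the choice of LogisticRegression are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumed for illustration).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# The scaler lives inside the pipeline, so it is fitted on the training
# portion of each fold and only applied to the held-out portion.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
print(scores)
```

Calling StandardScaler().fit_transform(X) on the full dataset before cross_val_score would be the leaky version of the same code.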
Feature engineering must also happen inside each fold. If you create a feature based on the target variable's distribution (like target encoding), computing it on the full dataset before splitting leaks the target into the features. Compute target-encoded values from the training fold only.
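One way to keep target encoding inside the fold is sketched below. The helper names fit_target_encoding and apply_target_encoding are hypothetical, the data is made up, and falling back to the training-fold mean for unseen categories is an assumption rather than the only reasonable choice.

```python
import numpy as np
from sklearn.model_selection import KFold

# Made-up categorical feature and binary target.
category = np.array(["a", "b", "a", "c", "b", "a", "c", "b", "a", "c"])
target = np.array([1, 0, 1, 0, 1, 0, 0, 1, 1, 0], dtype=float)

def fit_target_encoding(cats, y):
    """Mean target per category, computed from the given rows only."""
    return {c: y[cats == c].mean() for c in np.unique(cats)}

def apply_target_encoding(cats, mapping, fallback):
    """Look up each category; unseen categories get the fallback value."""
    return np.array([mapping.get(c, fallback) for c in cats])

kf = KFold(n_splits=5, shuffle=True, random_state=0)
encoded = np.full(len(target), np.nan)
for train_idx, test_idx in kf.split(category):
    # Learn the encoding from the training fold only.
    mapping = fit_target_encoding(category[train_idx], target[train_idx])
    fallback = target[train_idx].mean()
    # Apply it to the held-out rows, whose targets were never used.
    encoded[test_idx] = apply_target_encoding(category[test_idx], mapping, fallback)

print(encoded)
```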
Step 3: Train and Evaluate Across Folds
For each fold: fit the model on the training portion, make predictions on the test portion, and compute the chosen metric. Store the score for each fold.
In scikit-learn, cross_val_score handles this in one function call: it takes the model, data, labels, number of folds, and scoring metric, and returns an array of K scores. For more control, cross_validate returns additional information like training scores and fit times.
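A minimal sketch of cross_validate on synthetic data follows; the dataset, the model, and the F1 metric are assumptions chosen for illustration. The function returns a dictionary of arrays with one entry per fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic binary classification data (assumed).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

results = cross_validate(
    model, X, y,
    cv=5,
    scoring="f1",
    return_train_score=True,  # also report the score on each training fold
)

print("test F1 per fold: ", results["test_score"])
print("train F1 per fold:", results["train_score"])
print("fit time per fold:", results["fit_time"])
```

A large gap between the training and test scores is one quick signal of overfitting.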
If you are using cross-validation for hyperparameter tuning (finding the best settings), use nested cross-validation. The outer loop evaluates the final model. The inner loop, run inside each outer training fold, tunes the hyperparameters. Without nesting, the hyperparameters are selected using the same data that produces the reported score, so the estimate is optimistic. In scikit-learn, GridSearchCV handles the inner loop through its cv parameter, and passing the whole GridSearchCV object to an outer cross_val_score gives an honest evaluation.
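The nested setup can be sketched as follows. The synthetic data, the LogisticRegression model, the small grid over C, and the fold counts are all assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic data (assumed for illustration).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: tune C using only the data it is given.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=inner_cv,
    scoring="f1",
)

# Outer loop: each outer training fold runs its own search, and the
# tuned model is scored on an outer test fold it never saw.
nested_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="f1")
print(nested_scores.mean(), nested_scores.std())
```

The cost is multiplicative: here each of the 5 outer folds runs 3 inner folds for each of 3 candidate values, plus a refit.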
Step 4: Aggregate and Interpret Results
Compute the mean and standard deviation of scores across folds. The mean is your best estimate of the model's generalization performance. The standard deviation indicates stability: how much the model's performance varies with different training data.
A mean F1 of 0.84 with standard deviation 0.02 indicates a stable, reliable model. A mean F1 of 0.84 with standard deviation 0.12 indicates an unstable model whose performance depends heavily on which data it happens to see. The latter model needs investigation: maybe it performs poorly on a specific subgroup that appears in some folds but not others.
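The comparison above can be illustrated with two sets of hypothetical fold scores that share the same mean but differ in spread. The numbers are invented for the example, and the sketch uses the sample standard deviation from Python's statistics module; NumPy's default std uses the population formula and gives slightly smaller values.

```python
from statistics import mean, stdev

def summarize(scores):
    """Return (mean, sample standard deviation) of per-fold scores."""
    return mean(scores), stdev(scores)

# Hypothetical per-fold F1 scores, invented for illustration.
stable_scores = [0.82, 0.85, 0.84, 0.86, 0.83]
unstable_scores = [0.70, 0.95, 0.84, 0.99, 0.72]

stable_mean, stable_std = summarize(stable_scores)
unstable_mean, unstable_std = summarize(unstable_scores)

print(f"stable:   {stable_mean:.2f} +/- {stable_std:.2f}")
print(f"unstable: {unstable_mean:.2f} +/- {unstable_std:.2f}")
```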
When comparing models, look at both the mean performance and the spread. A model with mean 0.82 and std 0.01 may be preferable to a model with mean 0.84 and std 0.08, because the more stable model behaves more predictably in production, and a 0.02 difference in means is well within the second model's fold-to-fold variation.
Examine per-fold scores individually. If one fold has dramatically lower performance, investigate what is different about that fold's test data. It may reveal a subgroup where the model needs improvement or a data quality issue in a specific portion of the dataset.
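One simple way to spot such a fold is to compare each score against the median of all folds. The helper below is a hypothetical sketch: the scores are invented and the 0.10 margin is an arbitrary assumption that should be adapted to the metric and the problem.

```python
from statistics import median

def flag_weak_folds(scores, margin=0.10):
    """Return indices of folds scoring more than `margin` below the median."""
    threshold = median(scores) - margin
    return [i for i, score in enumerate(scores) if score < threshold]

# Hypothetical per-fold scores, invented for illustration.
fold_scores = [0.85, 0.84, 0.86, 0.61, 0.85]

weak = flag_weak_folds(fold_scores)
print("folds to investigate:", weak)
```

The flagged indices can then be mapped back to the test indices produced by the splitter to inspect the rows in that fold.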
Cross-validation gives you a reliable performance estimate by training and evaluating on multiple data subsets. Use stratified K-fold for classification, time series split for temporal data, and group K-fold when observations are grouped. Always preprocess inside each fold to prevent data leakage. Report both the mean score and standard deviation across folds.