Machine Learning with Python
The machine learning workflow in Python is consistent regardless of the specific algorithm or problem. Load data (pandas), explore it (describe, plot), prepare it (clean, encode, scale, split), train a model (fit), evaluate it (score, cross-validate), tune it (grid search), and package it (pipeline). scikit-learn enforces this consistency through its estimator API: every supervised algorithm, from logistic regression to gradient boosting, uses the same .fit(X, y), .predict(X), and .score(X, y) methods, and unsupervised estimators follow the same pattern with .fit(X). Once you learn the pattern with one algorithm, switching to another is often a one-line change.
Step 1: Prepare Your Data
Split data into training and test sets before any preprocessing. from sklearn.model_selection import train_test_split, then X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y). test_size=0.2 holds out 20% for testing. random_state ensures reproducibility. stratify=y ensures both sets have the same class distribution (essential for imbalanced datasets). Never use test data for any training decisions including preprocessing parameter fitting, feature selection, or hyperparameter tuning. This separation prevents data leakage that inflates performance estimates.
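A minimal sketch of the stratified split described above. The dataset is synthetic (generated with make_classification, which is an assumption for illustration and not part of the original workflow), with roughly 10% positives so that the effect of stratify=y is visible.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for a real dataset (about 90% / 10%).
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=42
)

# Hold out 20% for testing; stratify keeps the class ratio in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
print("positive rate (train):", y_train.mean())
print("positive rate (test): ", y_test.mean())
```

The two printed positive rates should be nearly identical; without stratify=y they can drift apart on small or imbalanced datasets.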
Handle missing values before modeling. SimpleImputer from sklearn.impute replaces missing values with the mean, median, mode, or a constant. from sklearn.impute import SimpleImputer, then imputer = SimpleImputer(strategy='median').fit(X_train) computes medians from training data only. X_train = imputer.transform(X_train) and X_test = imputer.transform(X_test) apply the same transformation to both sets. For more sophisticated imputation, KNNImputer uses nearest neighbors to estimate missing values based on similar complete observations.
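As a small illustration of fitting the imputer on training data only, the sketch below uses tiny hand-made arrays (hypothetical values chosen so the medians are easy to check by eye).

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data with missing entries marked as NaN.
X_train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [np.nan, 50.0]])
X_test = np.array([[np.nan, 20.0], [4.0, np.nan]])

# Medians are computed from the training rows only.
imputer = SimpleImputer(strategy="median").fit(X_train)
X_train_imp = imputer.transform(X_train)
X_test_imp = imputer.transform(X_test)

print(imputer.statistics_)  # per-column training medians: [ 2. 30.]
print(X_test_imp)           # test NaNs are filled with the training medians
```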
Encode categorical variables because most ML algorithms require numeric input. from sklearn.preprocessing import OneHotEncoder for nominal categories (no ordering): OneHotEncoder(sparse_output=False) converts each category to a binary column. OrdinalEncoder for ordered categories: OrdinalEncoder(categories=[['low', 'medium', 'high']]) preserves the order. pandas.get_dummies(df, columns=['category']) provides a quick one-hot encoding but does not handle unseen categories in test data, making the scikit-learn encoders preferred for production workflows.
Scale features when algorithms are sensitive to magnitude differences. StandardScaler standardizes each feature to zero mean and unit variance, appropriate for algorithms that rely on distances, regularization penalties, or gradient-based optimization (SVM, regularized logistic regression, neural networks). These algorithms do not require normally distributed features, but they do behave poorly when features sit on very different scales. Tree-based models are largely insensitive to feature scale. MinMaxScaler scales to [0, 1], appropriate when you need bounded features. RobustScaler uses the median and interquartile range, making it resistant to outliers. Always fit the scaler on training data only and transform both training and test data with the same parameters: scaler.fit(X_train), then X_train = scaler.transform(X_train), X_test = scaler.transform(X_test).
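A short sketch of the fit-on-train, transform-both pattern, using hypothetical toy arrays. Note that the test row is scaled with the training mean and standard deviation, so its scaled values are not themselves centered at zero.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales.
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, 400.0]])

scaler = StandardScaler().fit(X_train)   # learns mean and std from training data
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)      # reuses the training statistics

print(scaler.mean_)               # training means: [  2. 200.]
print(X_train_s.mean(axis=0))     # approximately zero for each feature
print(X_test_s)                   # second value is well above 0
```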
Step 2: Choose and Train a Model
For classification (predicting categories), start with LogisticRegression for a linear baseline. from sklearn.linear_model import LogisticRegression, then model = LogisticRegression(max_iter=1000).fit(X_train, y_train). If the data has nonlinear decision boundaries, try RandomForestClassifier (robust, handles mixed features, few parameters to tune) or GradientBoostingClassifier (often highest accuracy, more parameters to tune). For very large datasets, HistGradientBoostingClassifier is significantly faster because it bins continuous features into discrete values.
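To illustrate how little code changes between classifiers, here is a sketch that trains a linear baseline and a random forest on the same split. The data comes from make_moons, a synthetic dataset with a curved class boundary chosen here purely as an assumption for demonstration.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic two-class data with a nonlinear boundary.
X, y = make_moons(n_samples=500, noise=0.25, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Swapping algorithms only changes the constructor; fit/score stay the same.
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(random_state=42),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # accuracy on held-out data
    print(f"{name}: {scores[name]:.3f}")
```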
For regression (predicting continuous values), start with LinearRegression. from sklearn.linear_model import LinearRegression, then model = LinearRegression().fit(X_train, y_train). For nonlinear relationships, RandomForestRegressor and GradientBoostingRegressor capture interactions and nonlinearities automatically. Ridge and Lasso add regularization to linear regression: Ridge penalizes large coefficients (L2 regularization), preventing overfitting when features are correlated. Lasso penalizes the absolute value of coefficients (L1 regularization), driving some to exactly zero for automatic feature selection.
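The difference between the two penalties can be seen in the fitted coefficients. The sketch below assumes synthetic data from make_regression in which only 3 of 10 features carry signal; the alpha values are arbitrary illustrative choices, not tuned settings.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 10 features, only 3 of which influence the target.
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=10.0, random_state=42
)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks coefficients
lasso = Lasso(alpha=5.0).fit(X, y)   # L1 penalty: can zero coefficients out

print("ridge zero coefficients:", int(np.sum(ridge.coef_ == 0)))
print("lasso zero coefficients:", int(np.sum(lasso.coef_ == 0)))
print("lasso coefficients:", np.round(lasso.coef_, 2))
```

Ridge typically leaves every coefficient nonzero, while Lasso sets the coefficients of uninformative features to exactly zero.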
For clustering (finding groups without labels), KMeans is the starting point. from sklearn.cluster import KMeans, then model = KMeans(n_clusters=3, random_state=42).fit(X). The silhouette score (from sklearn.metrics import silhouette_score) measures clustering quality on a scale from -1 to 1: higher is better, and values above roughly 0.5 are commonly read as well-separated clusters. DBSCAN finds arbitrarily shaped clusters and does not require specifying the number of clusters in advance. For dimensionality reduction, PCA (principal component analysis) reduces the number of features while preserving as much variance as possible: PCA(n_components=2).fit_transform(X) projects high-dimensional data to 2D for visualization.
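A minimal clustering sketch, assuming synthetic data from make_blobs with three deliberately well-separated centers (the center coordinates are invented for illustration). n_init is set explicitly because its default has changed between scikit-learn releases.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated groups in 2D.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=42,
)

model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
score = silhouette_score(X, model.labels_)

print("clusters found:", len(set(model.labels_)))
print("silhouette score:", round(score, 3))
```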
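Prediction follows the same pattern across algorithms. model.predict(X_test) returns predicted classes (classification) or values (regression). model.predict_proba(X_test) returns class probabilities for classifiers that support it (useful for adjusting decision thresholds). model.decision_function(X_test) returns raw confidence scores for classifiers that define it, such as LogisticRegression and SVC. Not every classifier offers both: RandomForestClassifier has no decision_function, and LinearSVC has no predict_proba. For clustering, model.labels_ gives the cluster assignments of the training data. KMeans also provides model.predict(X_new) to assign new points to existing clusters, whereas DBSCAN has no predict method.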
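The sketch below shows how predicted probabilities can be used to move the decision threshold. The data is synthetic and the 0.3 threshold is an arbitrary illustrative value; in practice the threshold would be chosen from the costs of each error type.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

y_pred = model.predict(X_test)               # default decision rule
proba = model.predict_proba(X_test)[:, 1]    # probability of class 1

# Lowering the threshold flags more samples as positive (higher recall,
# usually at the cost of precision).
y_pred_low = (proba >= 0.3).astype(int)

print("positives at default threshold:", int(y_pred.sum()))
print("positives at threshold 0.3:    ", int(y_pred_low.sum()))
```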
Step 3: Evaluate Model Performance
Classification metrics quantify different aspects of prediction quality. from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report. accuracy_score(y_test, y_pred) is the fraction of correct predictions; it is informative mainly when classes are roughly balanced, because a model that always predicts the majority class can still score highly. For imbalanced datasets (rare events, medical diagnosis), precision (what fraction of positive predictions are correct) and recall (what fraction of actual positives are detected) are more informative. f1_score is the harmonic mean of precision and recall, so it is high only when both are high. classification_report(y_test, y_pred) returns a formatted text table of these metrics per class; wrap it in print() to display it.
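A small worked example makes the gap between accuracy and the other metrics concrete. The labels below are hypothetical: 8 negatives and 2 positives, with one false positive and one missed positive.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical labels: 2 actual positives out of 10 samples.
y_test = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

acc = accuracy_score(y_test, y_pred)     # 8 of 10 correct -> 0.8
prec = precision_score(y_test, y_pred)   # 1 of 2 predicted positives -> 0.5
rec = recall_score(y_test, y_pred)       # 1 of 2 actual positives -> 0.5
f1 = f1_score(y_test, y_pred)            # harmonic mean of 0.5 and 0.5 -> 0.5

print(acc, prec, rec, f1)
print(classification_report(y_test, y_pred))
```

Accuracy looks respectable at 0.8 even though the model finds only half of the positives.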
The confusion matrix reveals error patterns. from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay. cm = confusion_matrix(y_test, y_pred) returns a matrix where rows are actual classes and columns are predicted classes. Diagonal elements are correct predictions; off-diagonal elements show misclassification patterns. ConfusionMatrixDisplay(cm, display_labels=class_names).plot() creates a visual heatmap. For medical or safety-critical applications, examine false negatives (missed positives) and false positives separately because their costs are usually very different.
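As a sketch of reading the matrix, the example below uses hypothetical binary labels and unpacks the four cells. Plotting with ConfusionMatrixDisplay is left out so the snippet does not depend on matplotlib.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 8 actual negatives, 2 actual positives.
y_test = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

cm = confusion_matrix(y_test, y_pred)
print(cm)
# [[7 1]    row 0: actual negatives (7 correct, 1 false positive)
#  [1 1]]   row 1: actual positives (1 false negative, 1 correct)

# For binary problems, ravel() yields the cells in this order.
tn, fp, fn, tp = cm.ravel()
print("false negatives:", fn, "false positives:", fp)
```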
Regression metrics measure prediction error magnitude. from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score. mean_squared_error(y_test, y_pred) penalizes large errors more heavily. np.sqrt(mean_squared_error(y_test, y_pred)) gives RMSE in the same units as the target variable. mean_absolute_error(y_test, y_pred) is more robust to outliers. r2_score(y_test, y_pred) measures the fraction of variance explained (1.0 is perfect, 0.0 is no better than predicting the mean, negative means worse than the mean).
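The four metrics can be checked by hand on a tiny example. The values below are hypothetical; the errors are -1, 0, 1, and 3, so the single large error dominates MSE but not MAE.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical targets and predictions.
y_test = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 12.0])

mse = mean_squared_error(y_test, y_pred)    # (1 + 0 + 1 + 9) / 4 = 2.75
rmse = np.sqrt(mse)                         # about 1.66, in target units
mae = mean_absolute_error(y_test, y_pred)   # (1 + 0 + 1 + 3) / 4 = 1.25
r2 = r2_score(y_test, y_pred)               # 1 - 11/20, about 0.45

print(mse, rmse, mae, r2)
```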
ROC curves and AUC measure classification performance across all decision thresholds. from sklearn.metrics import roc_curve, roc_auc_score. fpr, tpr, thresholds = roc_curve(y_test, model.predict_proba(X_test)[:, 1]) computes the false positive rate and true positive rate at each threshold. roc_auc_score returns the area under the ROC curve: 0.5 is random guessing, 1.0 is perfect. AUC is threshold-independent and less misleading than accuracy under class imbalance, which makes it a common comparison metric for binary classifiers. When positives are very rare, however, ROC AUC can still look optimistic, and the precision-recall curve (summarized by average_precision_score) is often more informative.
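AUC can be interpreted as the probability that a randomly chosen positive is scored higher than a randomly chosen negative. The sketch below uses four hypothetical scores where three of the four positive-negative pairs are ranked correctly.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical labels and predicted scores for the positive class.
y_test = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, thresholds = roc_curve(y_test, y_score)
auc = roc_auc_score(y_test, y_score)

# Pairs (positive, negative): 0.35>0.1, 0.35<0.4, 0.8>0.1, 0.8>0.4 -> 3 of 4.
print("AUC:", auc)  # 0.75
print("FPR:", fpr)
print("TPR:", tpr)
```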
Step 4: Tune Hyperparameters
Cross-validation estimates generalization performance without wasting data on a separate validation set. from sklearn.model_selection import cross_val_score, then scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy') performs 5-fold cross-validation: it splits training data into 5 parts, trains on 4, evaluates on 1, and rotates 5 times. The mean and standard deviation of the 5 scores estimate how the model will perform on unseen data. For classifiers, an integer cv already uses stratified folds, which maintain class balance in each fold. Pass cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42) explicitly when you also want the data shuffled reproducibly before splitting.
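A minimal sketch of 5-fold cross-validation with an explicit, shuffled StratifiedKFold. The dataset is synthetic and stands in for the training portion of a real split.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for X_train, y_train.
X_train, y_train = make_classification(n_samples=300, random_state=42)

model = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# One accuracy value per fold; the model is refit from scratch each time.
scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy")

print("fold scores:", scores.round(3))
print(f"mean {scores.mean():.3f} +/- {scores.std():.3f}")
```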
GridSearchCV exhaustively searches a parameter grid. from sklearn.model_selection import GridSearchCV. Define a grid: param_grid = {'n_estimators': [100, 200, 500], 'max_depth': [3, 5, 10, None], 'min_samples_split': [2, 5, 10]}. This grid has 3 x 4 x 3 = 36 combinations, so 5-fold cross-validation trains 180 models. grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=5, scoring='f1', n_jobs=-1).fit(X_train, y_train). scoring='f1' applies to binary targets; for multiclass problems use an averaged variant such as 'f1_macro' or 'f1_weighted'. grid.best_params_ shows the best parameters, grid.best_score_ shows the best cross-validated score, and grid.best_estimator_ is the model refitted on all training data with the best parameters. n_jobs=-1 uses all CPU cores for parallel evaluation.
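The sketch below runs a deliberately small grid (4 combinations, 3 folds) on synthetic data so that it finishes quickly; the grid values are illustrative assumptions, not recommendations. Add n_jobs=-1 to parallelize on a multi-core machine.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train, y_train (binary target, so 'f1' is valid).
X_train, y_train = make_classification(n_samples=300, random_state=42)

param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="f1",
)
grid.fit(X_train, y_train)

print("best parameters:", grid.best_params_)
print("best cross-validated F1:", round(grid.best_score_, 3))

# best_estimator_ has already been refit on all of X_train.
best_model = grid.best_estimator_
```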
RandomizedSearchCV is faster for large parameter spaces. Instead of trying every combination, it samples a specified number of random combinations. from sklearn.model_selection import RandomizedSearchCV and from scipy import stats. param_distributions = {'n_estimators': [100, 200, 500, 1000], 'max_depth': [3, 5, 10, 20, None], 'min_samples_split': stats.randint(2, 20), 'min_samples_leaf': stats.randint(1, 10)}. Lists are sampled uniformly, and stats.randint(low, high) draws integers from low up to but not including high. Use n_iter=50 to try 50 random combinations. In practice, RandomizedSearchCV with 50 to 100 iterations often finds parameters that perform close to the grid search optimum while evaluating only a fraction of the combinations.
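A minimal sketch of a randomized search on synthetic data. n_iter is kept at 5 and the value ranges are trimmed so the example runs quickly; both are illustrative assumptions.

```python
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for X_train, y_train.
X_train, y_train = make_classification(n_samples=300, random_state=42)

param_distributions = {
    "n_estimators": [50, 100],
    "max_depth": [3, 5, 10, None],
    "min_samples_split": stats.randint(2, 20),  # integers 2..19
    "min_samples_leaf": stats.randint(1, 10),   # integers 1..9
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=5,          # number of sampled combinations
    cv=3,
    scoring="accuracy",
    random_state=42,   # makes the sampled combinations reproducible
)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("best score:", round(search.best_score_, 3))
```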
Learning curves diagnose underfitting and overfitting. from sklearn.model_selection import learning_curve. train_sizes, train_scores, val_scores = learning_curve(model, X_train, y_train, cv=5, train_sizes=np.linspace(0.1, 1.0, 10)). Plot training and validation scores against training set size. If both scores are low, the model underfits (too simple, try more features or a more complex model). If training score is high but validation score is much lower, the model overfits (too complex, try regularization, simpler model, or more data). If both scores converge at a high value, the model is well-fitted.
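The sketch below computes the numbers behind a learning curve on synthetic data and prints them instead of plotting, so it has no matplotlib dependency. Five training sizes are used here rather than ten to keep it short.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for X_train, y_train.
X_train, y_train = make_classification(n_samples=500, random_state=42)
model = LogisticRegression(max_iter=1000)

train_sizes, train_scores, val_scores = learning_curve(
    model, X_train, y_train, cv=5, train_sizes=np.linspace(0.1, 1.0, 5)
)

# Each row holds one training size; each column holds one CV fold.
for n, tr, va in zip(train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n:4d}  train={tr:.3f}  validation={va:.3f}  gap={tr - va:.3f}")
```

A gap that stays large as n grows suggests overfitting; two low, converged curves suggest underfitting.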
Step 5: Build Production Pipelines
Pipelines chain preprocessing and modeling into a single object that ensures consistent transformations. from sklearn.pipeline import Pipeline. pipe = Pipeline([('scaler', StandardScaler()), ('model', RandomForestClassifier())]). pipe.fit(X_train, y_train) scales the training data and fits the model. pipe.predict(X_test) scales the test data using the same parameters and predicts. This eliminates the risk of applying different transformations to training and test data, the most common source of data leakage bugs.
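As a sketch of what the pipeline does internally, the example below checks that the scaler inside the pipeline learned its statistics from the training data alone. It uses synthetic data and swaps in LogisticRegression as the final step (an assumption made here because it is a scale-sensitive model); any estimator could take its place.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)  # fits the scaler, then the model, on training data

# The scaler's mean equals the training mean; the test set was never seen.
learned_mean = pipe.named_steps["scaler"].mean_
matches_train = np.allclose(learned_mean, X_train.mean(axis=0))

test_accuracy = pipe.score(X_test, y_test)  # scales X_test, then scores
print(matches_train, round(test_accuracy, 3))
```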
ColumnTransformer applies different transformations to different columns. from sklearn.compose import ColumnTransformer. preprocessor = ColumnTransformer([('num', StandardScaler(), numeric_columns), ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_columns)]). Combine with a model: pipe = Pipeline([('prep', preprocessor), ('model', GradientBoostingClassifier())]). This handles mixed-type DataFrames (numeric and categorical columns) in a single pipeline that can be cross-validated, grid-searched, and serialized as one object.
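Here is a minimal sketch of the mixed-type pipeline on a hypothetical DataFrame (all column names and values are invented for illustration). The new row contains a city that never appeared in training, which handle_unknown='ignore' encodes as all zeros.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with numeric and categorical columns.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 62, 23, 36, 44],
    "income": [30_000, 45_000, 80_000, 72_000, 90_000, 28_000, 52_000, 61_000],
    "city": ["paris", "lyon", "paris", "nice", "lyon", "nice", "paris", "lyon"],
    "bought": [0, 0, 1, 1, 1, 0, 0, 1],
})
numeric_columns = ["age", "income"]
categorical_columns = ["city"]

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numeric_columns),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_columns),
])
pipe = Pipeline([
    ("prep", preprocessor),
    ("model", GradientBoostingClassifier(random_state=42)),
])
pipe.fit(df[numeric_columns + categorical_columns], df["bought"])

# A new row with an unseen category still goes through the same pipeline.
new_row = pd.DataFrame({"age": [40], "income": [58_000], "city": ["marseille"]})
pred = pipe.predict(new_row)
print(pred)
```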
Model persistence saves trained models for later use. import joblib, then joblib.dump(pipe, 'model.joblib') saves the entire pipeline (preprocessing + model) to a file. pipe = joblib.load('model.joblib') loads it back. The loaded pipeline can immediately make predictions on new data with the exact same preprocessing that was used during training. joblib files are pickle-based, so load them only from trusted sources and with the same scikit-learn version that created them; loading across versions is not supported and may fail or give different results. For sharing models across Python or library versions, use the ONNX format (via skl2onnx) or export to PMML.
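A short round-trip sketch: fit a pipeline on synthetic data, save it, load it, and confirm the restored object predicts identically. A temporary directory is used here only so the example leaves no files behind.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=42)
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
]).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.joblib")
    joblib.dump(pipe, path)          # saves preprocessing and model together
    restored = joblib.load(path)     # only load files you trust

same = np.array_equal(pipe.predict(X), restored.predict(X))
print("identical predictions after reload:", same)
```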
Feature importance reveals which variables drive predictions. For tree-based models (random forest, gradient boosting), model.feature_importances_ returns impurity-based importances that sum to 1; these are computed from the training data and can overstate high-cardinality features. For linear models, model.coef_ shows the weight of each feature, and the weights are comparable across features only when the features are on the same scale. from sklearn.inspection import permutation_importance computes model-agnostic importances by measuring how much performance drops when each feature is randomly shuffled, ideally on held-out data. Plot importances as a horizontal bar chart sorted by magnitude to identify the most influential predictors, which aids both model interpretation and feature selection for simpler models.
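The sketch below compares the two kinds of importance on synthetic data where, by construction (shuffle=False), only the first two of five features carry signal. The feature layout is an assumption of the generated data, not something a real dataset would guarantee.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Features 0 and 1 are informative; features 2 to 4 are noise.
X, y = make_classification(
    n_samples=500, n_features=5, n_informative=2, n_redundant=0,
    shuffle=False, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Impurity-based importances come from the fitted trees (training data).
impurity = model.feature_importances_

# Permutation importances measure the score drop on held-out data.
result = permutation_importance(
    model, X_test, y_test, n_repeats=5, random_state=42
)

for i in np.argsort(impurity)[::-1]:
    print(f"feature {i}: impurity={impurity[i]:.3f}  "
          f"permutation={result.importances_mean[i]:.3f}")
```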
The most important rule in machine learning is to never let information from the test set influence any training decision. Use pipelines and cross-validation to enforce this separation automatically.