How to Do Linear Regression

Updated May 2026
Linear regression predicts a continuous numerical value by fitting a straight line (or, with several features, a flat hyperplane) through the data that minimizes the sum of squared prediction errors. It is one of the most fundamental machine learning algorithms, a common starting point for regression problems, and a foundation for understanding more complex models. This guide walks through the complete process from data preparation to result interpretation.

Linear regression is both a statistical method and a machine learning algorithm. Statisticians use it to understand relationships between variables. ML practitioners use it to make predictions. The math is identical; the emphasis differs. This guide covers both perspectives because understanding the statistical foundations makes you a better ML practitioner.

Step 1: Prepare and Explore Your Data

Start by understanding what you are working with. Load the dataset, examine the shape (rows and columns), check data types, and look for missing values. For a house price prediction problem, your features might include square footage, number of bedrooms, lot size, year built, and neighborhood, with sale price as the target.
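
A minimal sketch of this first look, using pandas. The tiny DataFrame and its column names (sqft, bedrooms, year_built, price) are made up for illustration; in practice you would load your own file, for example with pd.read_csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a real house-price dataset.
df = pd.DataFrame({
    "sqft": [1400, 1800, 2100, np.nan, 2600],
    "bedrooms": [3, 3, 4, 2, 5],
    "year_built": [1995, 2004, 2010, 1987, 2018],
    "price": [240000, 310000, 355000, 180000, 450000],
})

n_rows, n_cols = df.shape               # number of rows and columns
dtypes = df.dtypes                      # data type of each column
missing_per_column = df.isna().sum()    # count of missing values per column

print(n_rows, n_cols)
print(dtypes)
print(missing_per_column)
```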

Create scatter plots of each feature against the target. Linear regression assumes a linear relationship, so if a scatter plot shows a clear curve (like an exponential or logarithmic shape), you will need to transform that feature before fitting. Plotting square footage against price should show a roughly linear upward trend. If price accelerates at high square footages, a log transform on price or a polynomial feature might be needed.

Check the correlation matrix. Features strongly correlated with the target are promising predictors. Features strongly correlated with each other (multicollinearity) can cause problems: the model cannot reliably separate their individual effects, which produces unstable coefficients that are hard to interpret.
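
One way to run this check is sketched below with pandas. The data is synthetic, the column names are hypothetical, and the 0.8 cutoff for flagging a feature pair is an arbitrary choice for illustration rather than a standard threshold.

```python
import numpy as np
import pandas as pd

# Synthetic data: "rooms" is built to be nearly redundant with "sqft".
rng = np.random.default_rng(0)
n = 200
sqft = rng.uniform(800, 3500, size=n)
rooms = sqft / 400 + rng.normal(0, 0.3, size=n)
age = rng.uniform(0, 60, size=n)
price = 150 * sqft - 500 * age + rng.normal(0, 20000, size=n)

df = pd.DataFrame({"sqft": sqft, "rooms": rooms, "age": age, "price": price})
corr = df.corr()

# Correlation of each feature with the target.
target_corr = corr["price"].drop("price")

# Feature pairs whose absolute correlation exceeds the illustrative cutoff.
features = [c for c in df.columns if c != "price"]
high_pairs = [
    (a, b, round(float(corr.loc[a, b]), 3))
    for i, a in enumerate(features)
    for b in features[i + 1:]
    if abs(corr.loc[a, b]) > 0.8
]

print(target_corr)
print(high_pairs)
```

When a pair is flagged like this, a common response is to keep only one of the two features or to combine them.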

Handle missing values before fitting. Options include dropping rows (if few are missing), imputation with the median or mean, or using a more sophisticated method like KNN imputation. Never fill missing values with zero unless zero is a meaningful value for that feature.
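
As a sketch of median imputation, the example below uses scikit-learn's SimpleImputer on a small made-up array. Fitting the imputer on the training rows only and reusing it on the test rows is a design choice here, so that test data does not influence the fill values.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix: columns are square footage and bedrooms.
X_train = np.array([
    [1400.0, 3.0],
    [np.nan, 4.0],
    [2000.0, np.nan],
    [2600.0, 5.0],
])
X_test = np.array([[np.nan, 2.0]])

imputer = SimpleImputer(strategy="median")
X_train_filled = imputer.fit_transform(X_train)  # learn medians from training rows
X_test_filled = imputer.transform(X_test)        # reuse the same medians on test rows

print(X_train_filled)
print(X_test_filled)
```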

Step 2: Check the Assumptions

Linear regression makes four key assumptions. Violating them does not necessarily ruin the model, but it affects the reliability of coefficient estimates and confidence intervals.

Linearity: The relationship between each feature and the target is approximately linear. Check this with scatter plots. If the relationship is curved, apply transformations (log, square root, polynomial features) to linearize it.

Independence: Each observation is independent of the others. This is violated in time series data (today's stock price depends on yesterday's) and spatial data (nearby houses have correlated prices). If observations are dependent, use specialized models like autoregressive models or spatial regression.

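Normality of residuals: The errors (differences between predictions and actual values) should be approximately normally distributed. This matters mainly for confidence intervals and p-values, especially with small samples; the predictions themselves do not depend on it. Check this with a histogram or Q-Q plot of the residuals after fitting. Clearly non-normal residuals can point to missed nonlinear patterns or outliers.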

Homoscedasticity: The variance of the residuals should be roughly constant across all levels of the predicted value. If residuals fan out (larger errors at higher predictions), the model is less reliable for high-value predictions. Log-transforming the target variable often fixes this.
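
The effect of a log transform on a curved relationship can be checked numerically. The sketch below uses an exactly exponential, made-up relationship and compares the linear correlation before and after taking the log of the target; real data will be noisier, so treat this as an illustration of the idea only.

```python
import numpy as np

x = np.linspace(1, 10, 50)
y = 2.0 * np.exp(0.4 * x)          # hypothetical curved (exponential) target

r_raw = np.corrcoef(x, y)[0, 1]            # linear correlation on the raw scale
r_log = np.corrcoef(x, np.log(y))[0, 1]    # after log-transforming the target

print(r_raw, r_log)
```

Because log(y) = log(2) + 0.4*x here, the transformed relationship is exactly linear, while the raw one is not.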

Step 3: Fit the Model

The model equation for simple linear regression (one feature) is: y = w*x + b, where y is the prediction, x is the feature, w is the weight (slope), and b is the bias (intercept). For multiple features: y = w1*x1 + w2*x2 + ... + wn*xn + b.

The fitting algorithm finds values of w and b that minimize the sum of squared residuals: the total squared vertical distance between each prediction and the actual value. This is called ordinary least squares (OLS), and it has a closed-form solution: w = (X^T X)^(-1) X^T y, where X is the feature matrix and y is the target vector. To estimate the intercept b with the same formula, add a column of ones to X; the weight on that column is b.
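
The closed-form solution can be sketched in NumPy as follows. The data is synthetic and noise-free so the known weights are recovered exactly; solving the normal equations with np.linalg.solve instead of forming the inverse explicitly is a numerical-stability choice, and np.linalg.lstsq is shown only as a cross-check.

```python
import numpy as np

# Synthetic, noise-free data: y = 3*x1 - 2*x2 + 5.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([3.0, -2.0]) + 5.0

# Prepend a column of ones so the intercept is estimated as the first weight.
X_design = np.column_stack([np.ones(len(X)), X])

# Normal equations: (X^T X) theta = X^T y
theta = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)

# Cross-check with NumPy's least-squares solver.
theta_lstsq, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print(theta)  # intercept first, then the two weights
```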

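In Python with scikit-learn, the code is three lines: create the model object, call fit(X_train, y_train), then call predict(X_test). Under the hood, scikit-learn's LinearRegression solves the least-squares problem directly with a linear algebra routine; it does not switch to gradient descent for larger datasets. Gradient-based fitting is available separately through SGDRegressor.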
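
Those three calls look roughly like this. The synthetic data and the 75/25 split are assumptions made so the example runs on its own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic, noise-free data: y = 3*x1 - 2*x2 + 5.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression()        # 1. create the model object
model.fit(X_train, y_train)       # 2. fit on the training data
y_pred = model.predict(X_test)    # 3. predict on unseen data

print(model.coef_, model.intercept_)
```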

For very large datasets (millions of rows), or for penalties such as lasso that have no closed-form solution, iterative methods like gradient descent are more practical than the closed-form solution. Gradient descent repeatedly adjusts the weights in the direction that reduces the loss function, converging toward the optimal values over many iterations.
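
A bare-bones version of gradient descent on the mean squared error is sketched below. The learning rate of 0.1 and the 1000 iterations are hand-picked for this synthetic, roughly unit-scale data; on real data the features usually need scaling and the learning rate needs tuning.

```python
import numpy as np

# Synthetic, noise-free data: y = 3*x1 - 2*x2 + 5.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0

w = np.zeros(X.shape[1])   # weights start at zero
b = 0.0                    # intercept starts at zero
learning_rate = 0.1
n = len(y)

for _ in range(1000):
    error = X @ w + b - y                  # prediction minus actual
    grad_w = (2.0 / n) * (X.T @ error)     # gradient of MSE with respect to w
    grad_b = 2.0 * error.mean()            # gradient of MSE with respect to b
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w, b)
```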

Step 4: Evaluate the Results

R-squared (R2) measures the proportion of variance in the target explained by the model. R2 = 0.85 means the model explains 85% of the variation in prices. R2 = 1.0 would be a perfect fit, which on real data is suspicious and often signals overfitting. R2 = 0.0 means the model does no better than simply predicting the average. R2 can be negative, typically on held-out data, meaning the model predicts worse than the average would.

RMSE (root mean squared error) gives the typical size of the prediction error in the same units as the target. RMSE = $25,000 for a house price model means predictions are typically off by roughly $25,000. Because errors are squared before averaging, RMSE penalizes large errors more heavily than small ones.

MAE (mean absolute error) is the average absolute prediction error. It is less sensitive to outliers than RMSE. MAE never exceeds RMSE, and if a few predictions are wildly off but most are close, MAE will be noticeably lower than RMSE.
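
The three metrics can be computed by hand in a few lines of NumPy. The four prices below (in thousands of dollars) are made up; scikit-learn's sklearn.metrics module offers r2_score, mean_squared_error, and mean_absolute_error for the same purpose.

```python
import numpy as np

# Hypothetical actual and predicted prices, in thousands of dollars.
y_true = np.array([200.0, 300.0, 400.0, 500.0])
y_pred = np.array([210.0, 290.0, 400.0, 540.0])

residuals = y_true - y_pred

mae = np.mean(np.abs(residuals))                  # mean absolute error
rmse = np.sqrt(np.mean(residuals ** 2))           # root mean squared error

ss_res = np.sum(residuals ** 2)                   # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation around the mean
r2 = 1.0 - ss_res / ss_tot

print(mae, rmse, r2)
```

Here the single large miss of 40 pulls RMSE (about 21.2) well above MAE (15), which illustrates RMSE's heavier penalty on large errors.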

Residual plots are the most informative diagnostic. Plot predicted values on the x-axis and residuals on the y-axis. A good model produces a random cloud of points centered at zero. Patterns in the residuals (curves, funnels, clusters) indicate model problems: nonlinearity, heteroscedasticity, or missing features.
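
A numeric companion to the residual plot is sketched below. It fits a line to synthetic data whose noise deliberately grows with x, then compares the residual spread in the lower and upper halves of the predictions. The median split is a crude, illustrative check and not a formal test for heteroscedasticity.

```python
import numpy as np

# Synthetic data with noise whose standard deviation grows with x (a funnel shape).
rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=500)
y = 3.0 * x + rng.normal(0, 0.5 * x)

slope, intercept = np.polyfit(x, y, 1)   # least-squares line
predicted = slope * x + intercept
residuals = y - predicted

# Compare residual spread below and above the median prediction.
cutoff = np.median(predicted)
lower_std = residuals[predicted <= cutoff].std()
upper_std = residuals[predicted > cutoff].std()

print(residuals.mean(), lower_std, upper_std)
```

The residuals average to zero by construction, so the mean alone says little; the larger spread in the upper half is what a funnel-shaped residual plot would show visually.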

Step 5: Interpret and Validate

Each coefficient has a direct interpretation: it is the expected change in the target for a one-unit increase in that feature, holding all other features constant. If the square footage coefficient is 150, each additional square foot is associated with a $150 increase in price, all else being equal.
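
This interpretation can be verified directly on a fitted model: raising one feature by one unit while holding the others fixed changes the prediction by exactly that feature's coefficient. The data below is synthetic, generated with a true effect of 150 per square foot as an assumption for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic house data with a known effect of 150 per square foot.
rng = np.random.default_rng(0)
sqft = rng.uniform(800, 3500, size=300)
age = rng.uniform(0, 60, size=300)
price = 150.0 * sqft - 500.0 * age + 50000.0 + rng.normal(0, 10000, size=300)

X = np.column_stack([sqft, age])
model = LinearRegression().fit(X, price)

# Two houses that differ only by one square foot.
base = np.array([[2000.0, 20.0]])
plus_one_sqft = np.array([[2001.0, 20.0]])
delta = model.predict(plus_one_sqft)[0] - model.predict(base)[0]

print(model.coef_[0], delta)
```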

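Check the statistical significance of each coefficient. The p-value is the probability of seeing an estimate at least this far from zero if the true coefficient were zero. Coefficients with p-values above a conventional threshold such as 0.05 are not reliably distinguishable from zero and are candidates for removal, although a high p-value alone does not prove the feature is irrelevant.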
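
scikit-learn's LinearRegression does not report p-values, so the sketch below computes them from the standard OLS formulas with NumPy and SciPy. It assumes the usual OLS conditions (independent, constant-variance, roughly normal errors), and the variable names are illustrative. Libraries such as statsmodels report the same quantities in their regression summaries.

```python
import numpy as np
from scipy import stats

# Synthetic data: x_signal truly affects y, x_noise does not.
rng = np.random.default_rng(0)
n = 200
x_signal = rng.normal(size=n)
x_noise = rng.normal(size=n)
y = 2.0 * x_signal + 1.0 + rng.normal(0, 1.0, size=n)

X = np.column_stack([np.ones(n), x_signal, x_noise])   # intercept column first
theta, *_ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ theta
dof = n - X.shape[1]                          # residual degrees of freedom
sigma2 = residuals @ residuals / dof          # estimated error variance
cov = sigma2 * np.linalg.inv(X.T @ X)         # covariance of the estimates
std_err = np.sqrt(np.diag(cov))

t_stats = theta / std_err
p_values = 2 * stats.t.sf(np.abs(t_stats), dof)   # two-sided p-values

print(p_values)  # order: intercept, x_signal, x_noise
```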

Validate on held-out data. The metrics on your training set are optimistic because the model was fit to that data. Calculate R2, RMSE, and MAE on a test set the model has never seen. If test performance is much worse than training performance, the model is overfitting. If both are poor, the model is underfitting.
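
A sketch of this comparison with scikit-learn follows. The synthetic data, the 75/25 split, and the scores dictionary are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: linear signal plus modest noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + rng.normal(0, 0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = LinearRegression().fit(X_train, y_train)

# Compute the same metrics on both splits and compare them side by side.
scores = {}
for name, X_part, y_part in [("train", X_train, y_train), ("test", X_test, y_test)]:
    pred = model.predict(X_part)
    scores[name] = {
        "r2": r2_score(y_part, pred),
        "rmse": float(np.sqrt(mean_squared_error(y_part, pred))),
        "mae": mean_absolute_error(y_part, pred),
    }

print(scores)
```

Similar train and test scores, as expected for this simple setup, suggest the model is not overfitting; a large gap would be the warning sign.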

Consider regularization if the model overfits. Ridge regression (L2 regularization) adds a penalty proportional to the squared magnitude of the weights, shrinking them toward zero. Lasso regression (L1 regularization) can shrink some weights all the way to zero, effectively performing feature selection. Elastic net combines both.
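
The difference between the two penalties can be seen on synthetic data where only the first two of five features matter. The alpha values below are arbitrary choices for illustration; in practice they are usually tuned, for example by cross-validation.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic data: only the first two of five features affect the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 1.0, size=200)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)   # L2 penalty: shrinks all weights
lasso = Lasso(alpha=0.5).fit(X, y)    # L1 penalty: can zero out weights

print("OLS:  ", np.round(ols.coef_, 3))
print("Ridge:", np.round(ridge.coef_, 3))
print("Lasso:", np.round(lasso.coef_, 3))
```

Ridge keeps all five weights but with a smaller overall magnitude than OLS, while lasso sets the three irrelevant weights to exactly zero and shrinks the two real ones.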

Key Takeaway

Linear regression finds the line of best fit by minimizing squared errors. The process involves data preparation, assumption checking, model fitting via OLS or gradient descent, evaluation with R2, RMSE, MAE, and residual plots, and validation on held-out data. Start regression problems with linear regression as a baseline before trying more complex approaches.