Logistic Regression Explained
Why It Is Called Regression
The name is misleading but historical. Logistic regression uses the same linear equation as linear regression (z = w1*x1 + w2*x2 + ... + wn*xn + b) but feeds the result through a sigmoid function that maps any real number to the range 0 to 1. The linear part is a regression, and the sigmoid converts it into a probability for classification. The name stuck even though the algorithm is used almost exclusively for classification.
The sigmoid function is: sigma(z) = 1 / (1 + e^(-z)). When z is very positive, the output approaches 1. When z is very negative, it approaches 0. When z = 0, the output is exactly 0.5. This S-shaped curve smoothly maps the entire real line into a probability.
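The formula above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `sigmoid` and the sample inputs are chosen here for demonstration. The branch on the sign of z is a common numerical precaution rather than part of the definition.

```python
import math

def sigmoid(z):
    """Return 1 / (1 + e^(-z)), mapping a real number into the range 0 to 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the algebraically equivalent form e^z / (1 + e^z)
    # so that math.exp never receives a large positive argument and overflows.
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Sample inputs (illustrative only) showing the S-shaped behaviour.
for z in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"sigmoid({z:+.1f}) = {sigmoid(z):.5f}")
```

The printed values rise from near 0 to near 1 and pass through 0.5 at z = 0, matching the description of the curve.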
An output of 0.73 means the model estimates a 73% probability that the input belongs to the positive class. To make a binary prediction, you apply a threshold, typically 0.5: anything above 0.5 is predicted positive, anything below is predicted negative. The threshold can be adjusted based on the relative cost of false positives and false negatives: lowering it catches more true positives at the price of more false alarms, and raising it does the reverse.
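The thresholding step might look like the sketch below. The function name `predict_label` is hypothetical, and treating a probability exactly equal to the threshold as positive is an assumption made here, since the text does not say how ties are handled.

```python
def predict_label(probability, threshold=0.5):
    """Turn a predicted probability into a binary label (1 = positive, 0 = negative)."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    # Assumption: a probability exactly at the threshold counts as positive.
    return 1 if probability >= threshold else 0

p = 0.73  # the example probability from the text

print(predict_label(p))                 # 1: positive at the default threshold of 0.5
print(predict_label(p, threshold=0.8))  # 0: negative once the bar is raised to 0.8
```

Raising the threshold to 0.8 flips the same 0.73 prediction to negative, which is the kind of adjustment you would make when false positives are costly.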
How Logistic Regression Learns
Unlike linear regression, which minimizes squared errors, logistic regression minimizes cross-entropy loss (also called log loss). For each training example, the loss is: -[y * log(p) + (1-y) * log(1-p)], where y is the true label (0 or 1) and p is the predicted probability.
This loss function has an intuitive interpretation. If the true label is 1 and the model predicts p = 0.99, the loss is tiny: -log(0.99) is about 0.01. If the true label is 1 and the model predicts p = 0.01, the loss is huge: -log(0.01) is about 4.6. The loss punishes confident wrong predictions far more severely than tentative wrong predictions, which encourages the model to reserve high-confidence predictions for cases it is genuinely sure about.
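The two numbers above can be checked with a short sketch of the per-example loss. The function name `log_loss` and the clipping constant `eps` are choices made for this illustration; clipping is a common guard against taking log(0), not part of the formula itself.

```python
import math

def log_loss(y, p, eps=1e-15):
    """Cross-entropy loss for one example: -[y*log(p) + (1-y)*log(1-p)]."""
    # Clip p away from exactly 0 and 1 so math.log never receives 0.
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

print(f"{log_loss(1, 0.99):.4f}")  # 0.0101: confident and correct
print(f"{log_loss(1, 0.60):.4f}")  # tentative and correct
print(f"{log_loss(1, 0.40):.4f}")  # tentative and wrong
print(f"{log_loss(1, 0.01):.4f}")  # 4.6052: confident and wrong
```

The loss grows slowly while the prediction is roughly right and climbs steeply as the model becomes confidently wrong.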
There is no closed-form solution for the weights that minimize cross-entropy loss, unlike linear regression's OLS solution. Instead, the model uses iterative optimization, typically gradient descent or a quasi-Newton method such as L-BFGS, which usually needs fewer iterations to converge. The weights are initialized to zero or small values and adjusted in the direction that reduces the total loss across all training examples, and the process stops once the updates become negligible.
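As a rough illustration of the iterative approach, here is a minimal batch gradient descent sketch in plain Python. It is not how production libraries implement training; the function names, the learning rate, the iteration count, and the toy data set (hours studied versus pass or fail) are all made up for this example.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def mean_log_loss(X, y, w, b):
    """Average cross-entropy loss over the data set."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
        p = min(max(p, 1e-15), 1.0 - 1e-15)
        total -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total / len(X)

def fit(X, y, learning_rate=0.1, n_iters=5000):
    """Fit weights and bias with plain batch gradient descent."""
    n_features = len(X[0])
    w = [0.0] * n_features  # start from zero weights
    b = 0.0
    for _ in range(n_iters):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            error = p - yi  # gradient of the log loss with respect to z
            for j in range(n_features):
                grad_w[j] += error * xi[j]
            grad_b += error
        # Step against the average gradient.
        w = [wj - learning_rate * gj / len(X) for wj, gj in zip(w, grad_w)]
        b -= learning_rate * grad_b / len(X)
    return w, b

# Made-up toy data: one feature (hours studied), label 1 = passed.
X = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]]
y = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit(X, y)
print("loss before training:", round(mean_log_loss(X, y, [0.0], 0.0), 4))
print("loss after training: ", round(mean_log_loss(X, y, w, b), 4))
```

The loss at zero weights is log(2), about 0.693, because every prediction starts at 0.5; training should bring it well below that.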
Interpreting the Coefficients
Each weight has a direct, meaningful interpretation. The coefficient for a feature represents the change in the log-odds of the positive class for a one-unit increase in that feature, holding all other features constant.
Log-odds is the natural logarithm of the odds: log(p / (1-p)). If the coefficient for "years of experience" is 0.3, then each additional year of experience increases the log-odds of being hired by 0.3. Exponentiating the coefficient gives an odds ratio: e^0.3 is about 1.35, meaning each additional year of experience multiplies the odds of being hired by roughly 1.35.
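The conversion can be worked through numerically. In the sketch below the coefficient 0.3 comes from the text, while the 40% baseline probability and the helper names are assumptions picked only to make the arithmetic concrete.

```python
import math

def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1.0 - p)

def prob_from_odds(o):
    """Probability corresponding to odds o."""
    return o / (1.0 + o)

coef = 0.3                   # coefficient for "years of experience" (from the text)
odds_ratio = math.exp(coef)  # about 1.35

p_before = 0.40              # hypothetical baseline probability of being hired
odds_after = odds(p_before) * odds_ratio
p_after = prob_from_odds(odds_after)

print(f"odds ratio:            {odds_ratio:.2f}")
print(f"odds before and after: {odds(p_before):.3f} -> {odds_after:.3f}")
print(f"probability after:     {p_after:.3f}")
```

Note that the odds are multiplied by a constant factor, but the change in probability depends on where you start: the same coefficient moves a 40% baseline by about seven percentage points and would move a 95% baseline far less.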
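Positive coefficients increase the probability of the positive class. Negative coefficients decrease it. The magnitude indicates the strength of the relationship, but magnitudes are only comparable across features that are measured on the same scale, which is one reason features are often standardized before fitting. On comparable scales, a coefficient of 2.1 has a much stronger effect than a coefficient of 0.2. Coefficients near zero mean the feature has little influence on the prediction.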
This interpretability is a major advantage in practice. When a bank uses logistic regression to approve loans, regulators can examine the model and verify that it uses legitimate factors (income, credit history) rather than prohibited ones (race, gender). The model is a transparent formula, not a black box.
Regularization
Like linear regression, logistic regression can overfit when there are many features relative to the number of training samples. Regularization adds a penalty to the loss function that discourages large weights.
L2 regularization (Ridge) adds a penalty proportional to the sum of squared weights. This shrinks all weights toward zero but never exactly to zero. The effect is that the model distributes importance more evenly across correlated features rather than assigning all the weight to one of them.
L1 regularization (Lasso) adds a penalty proportional to the sum of absolute weights. This can shrink some weights all the way to zero, effectively removing features from the model. L1 performs automatic feature selection, which is useful when you suspect many features are irrelevant.
The regularization strength is controlled by a parameter C (in scikit-learn's convention, where lower C means stronger regularization). The optimal C is found through cross-validation. For most problems, scikit-learn's default C=1.0 with L2 regularization is a reasonable starting point.
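To make the two penalties and the role of C concrete, here is a small sketch. It assumes an objective of the form C times the data loss plus the penalty, with a factor of one half on the L2 term; that scaling mirrors a common convention but should be treated as an assumption, since libraries differ in the exact normalization. The weights and the data loss value are invented for illustration.

```python
def l2_penalty(weights):
    """Half the sum of squared weights (Ridge-style penalty)."""
    return 0.5 * sum(w * w for w in weights)

def l1_penalty(weights):
    """Sum of absolute weights (Lasso-style penalty)."""
    return sum(abs(w) for w in weights)

def penalized_objective(data_loss, weights, C=1.0, penalty="l2"):
    """Assumed form: C * data_loss + penalty, so a smaller C weakens the data term."""
    if C <= 0:
        raise ValueError("C must be positive")
    if penalty == "l2":
        reg = l2_penalty(weights)
    elif penalty == "l1":
        reg = l1_penalty(weights)
    else:
        raise ValueError("penalty must be 'l1' or 'l2'")
    return C * data_loss + reg

weights = [2.0, -0.5, 0.0]  # hypothetical fitted weights
data_loss = 3.0             # hypothetical total cross-entropy loss

for C in (10.0, 1.0, 0.1):
    total = penalized_objective(data_loss, weights, C=C)
    share = l2_penalty(weights) / total
    print(f"C={C:>4}: penalty is {share:.0%} of the objective")
```

As C shrinks, the penalty makes up a larger share of what the optimizer is minimizing, which is why lower C means stronger regularization.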
Multi-Class Logistic Regression
Binary logistic regression naturally extends to multiple classes through two approaches. One-vs-rest (OVR) trains one binary classifier per class, each predicting "this class vs everything else." For 10 classes, you train 10 separate logistic regression models. At prediction time, each model outputs a probability, and the class with the highest probability wins.
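The prediction step of one-vs-rest can be sketched as follows. The three class names and their weights are hypothetical stand-ins for classifiers that would normally be trained separately; only the "highest probability wins" rule is taken from the text.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def ovr_predict(class_models, features):
    """Score the input with every binary model and return the winning class.

    class_models maps a class label to a (weights, bias) pair.
    """
    scores = {}
    for label, (weights, bias) in class_models.items():
        z = sum(w * x for w, x in zip(weights, features)) + bias
        scores[label] = sigmoid(z)  # probability of "this class vs everything else"
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical pre-trained binary models, one per class.
models = {
    "cat":  ([1.0, -1.0], 0.0),
    "dog":  ([-1.0, 1.0], 0.0),
    "bird": ([0.0, 0.0], -1.0),
}

label, scores = ovr_predict(models, [2.0, 0.5])
print(label)
print({k: round(v, 3) for k, v in scores.items()})
```

Because each binary model is fit independently, the per-class probabilities need not sum to 1; only their ranking is used to pick the winner.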
Multinomial logistic regression (also called softmax regression) generalizes the sigmoid function to multiple classes using the softmax function. Instead of one sigmoid output, the model produces a probability distribution across all classes, where probabilities sum to 1. This is mathematically cleaner and more principled than OVR, and recent versions of scikit-learn use it by default for multi-class problems when the default solver is used.
The softmax function is: P(class_k) = e^(z_k) / sum(e^(z_j) for all j). Each class has its own set of weights, and the softmax normalizes the raw scores into probabilities. This is exactly the output layer of most neural network classifiers, making logistic regression a one-layer neural network.
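A direct translation of the softmax formula is sketched below. Subtracting the maximum score before exponentiating is a standard numerical safeguard that leaves the result unchanged; the function name and the sample scores are chosen for illustration.

```python
import math

def softmax(scores):
    """Convert raw class scores z_k into probabilities that sum to 1."""
    m = max(scores)
    # Shifting by the maximum avoids overflow and does not change the ratios.
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # illustrative raw scores for three classes
print([round(p, 3) for p in probs])
print(round(sum(probs), 6))

# With two classes, softmax reduces to a sigmoid of the score difference.
z = 1.7
two_class = softmax([z, 0.0])[0]
print(abs(two_class - 1.0 / (1.0 + math.exp(-z))) < 1e-12)
```

The last check shows why binary logistic regression is the two-class special case of softmax regression.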
When to Use Logistic Regression
Use it when you need interpretable coefficients, reasonably well-calibrated probability estimates, fast training and prediction, or a baseline to compare against more complex models. Logistic regression typically trains quickly, often in seconds to minutes, even on datasets with millions of rows and hundreds of features. It is the standard starting point for binary classification.
Avoid it when the decision boundary is highly nonlinear. Logistic regression draws a linear boundary in feature space: a straight line in two dimensions, a hyperplane in higher dimensions. If the true boundary is curved, circular, or fragmented, logistic regression will underperform. Tree-based methods and neural networks handle nonlinear boundaries naturally.
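The classic XOR pattern gives a small, checkable illustration of this limitation. The sketch below computes the gradient of the cross-entropy loss at zero weights on the four XOR points; the data set and function names are chosen for this example. The gradient comes out as exactly zero, and because the loss is convex, that stationary point is the global minimum.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))  # inputs here are small, so no overflow guard

# XOR: the label is 1 when exactly one of the two inputs is 1.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 1, 1, 0]

def gradient(w, b):
    """Gradient of the total cross-entropy loss with respect to w and b."""
    grad_w = [0.0, 0.0]
    grad_b = 0.0
    for (x1, x2), yi in zip(X, y):
        error = sigmoid(w[0] * x1 + w[1] * x2 + b) - yi
        grad_w[0] += error * x1
        grad_w[1] += error * x2
        grad_b += error
    return grad_w, grad_b

grad_w, grad_b = gradient([0.0, 0.0], 0.0)
print(grad_w, grad_b)  # [0.0, 0.0] 0.0
```

In other words, the best logistic regression fit on raw XOR inputs predicts 0.5 for every point, no better than a coin flip, because no single straight line separates the two classes.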
In practice, try logistic regression first. If its performance is acceptable for the application, ship it, because few other algorithms offer the same combination of speed, interpretability, and calibrated probabilities. Only reach for complex models when logistic regression clearly cannot capture the patterns in the data.
Logistic regression predicts class probabilities by passing a linear combination of features through a sigmoid function. It is fast, interpretable, and produces well-calibrated probabilities. Use it as the starting point for every binary classification problem, and only move to more complex models when the linear decision boundary is insufficient.