What Is Gradient Descent?

Updated May 2026
Gradient descent is the optimization algorithm that trains virtually every neural network and many other machine learning models. It works by computing the gradient (the direction and steepness of the loss function's slope) with respect to each model parameter, then adjusting those parameters in the direction that reduces the loss. Repeated over thousands or millions of steps, gradient descent navigates a high-dimensional landscape to find parameter values that produce good predictions.

The Mountain Analogy

The classic intuition for gradient descent is walking downhill in fog. You are on a mountain, you cannot see the bottom, and your goal is to reach the lowest point. At each step, you feel the slope beneath your feet and take a step in the steepest downhill direction. If you repeat this process, you will eventually reach a valley.

The mountain is the loss landscape, a surface where each point corresponds to a specific combination of model parameters, and the height at that point is the loss (the model's error). The steepest downhill direction is the negative gradient. Each step adjusts the parameters to reduce the loss. The "fog" represents the fact that you cannot see the entire landscape; you can only measure the slope at your current position.

This analogy breaks down in one important way: the loss landscapes of large models have millions or billions of dimensions, not two or three. A model with 70 billion parameters navigates a 70-billion-dimensional surface. Human spatial intuition fails in these spaces, but the mathematics of gradient descent works identically regardless of dimensionality. The gradient always points in the direction of steepest ascent, and subtracting a small enough multiple of it moves the parameters downhill.

The Mathematics

Gradient descent updates each parameter according to a simple rule: new value equals old value minus the learning rate times the gradient. In mathematical notation:

w = w - lr * (dL/dw)

Here, w is a parameter (weight), lr is the learning rate (a small positive number, typically 0.0001 to 0.001), and dL/dw is the partial derivative of the loss L with respect to that parameter. The partial derivative tells you how much the loss would change if you increased the parameter by a tiny amount. If the derivative is positive, increasing the parameter increases the loss, so you decrease the parameter. If the derivative is negative, increasing the parameter decreases the loss, so you increase the parameter.
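As a minimal sketch of the update rule, the snippet below applies it to a toy one-parameter loss, L(w) = (w - 3)^2. The loss, the function names, and the learning rate of 0.1 are illustrative assumptions, not values taken from any real training setup.

```python
def loss(w):
    # Toy loss with its minimum at w = 3 (an assumption for illustration).
    return (w - 3.0) ** 2


def grad(w):
    # Analytic derivative dL/dw of the toy loss.
    return 2.0 * (w - 3.0)


def gradient_descent(w0, lr, steps):
    w = w0
    for _ in range(steps):
        # The update rule from the text: w = w - lr * (dL/dw)
        w = w - lr * grad(w)
    return w


w_final = gradient_descent(w0=0.0, lr=0.1, steps=100)
print(w_final, loss(w_final))
```

Starting from w = 0 the derivative is negative, so each update increases w, and the parameter moves toward 3 where the loss is lowest.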

The learning rate controls step size. A large learning rate means aggressive updates that can reach the bottom quickly but risk overshooting and bouncing around the minimum. A small learning rate means careful, stable updates that converge reliably but slowly. Finding the right learning rate is one of the most important hyperparameter decisions in training, and it often makes the difference between a training run that converges in hours and one that never converges at all.
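The effect of step size can be seen on a toy loss L(w) = w^2, whose derivative is 2w. The sketch below is only an illustration: the three learning rates are chosen to suit this tiny problem and are far larger than the values typical for neural networks.

```python
def final_weight(lr, steps=20, w0=1.0):
    # Run plain gradient descent on L(w) = w**2 and return the last w.
    w = w0
    for _ in range(steps):
        w = w - lr * 2.0 * w
    return w


for lr in (0.01, 0.4, 1.1):
    print(f"lr={lr}: w after 20 steps = {final_weight(lr):.6g}")
```

With the smallest rate the parameter has barely moved after 20 steps, with the middle rate it is essentially at the minimum, and with the largest rate each step overshoots by more than it corrects, so the iterate grows instead of shrinking.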

Batch, Stochastic, and Mini-Batch Variants

Batch gradient descent computes the gradient using the entire training dataset. This gives the exact gradient of the training loss, but it is impractically slow for large datasets. Computing the gradient over millions of examples before taking a single step makes each step very expensive, even though each step is very accurate.

Stochastic gradient descent (SGD) computes the gradient using a single training example. This makes each step extremely fast but very noisy, because one example is a poor estimate of the true gradient. The noise means the model zigzags toward the minimum rather than walking straight toward it, but it also helps the model escape shallow local minima and explore the loss landscape more broadly.

Mini-batch gradient descent is the practical compromise used by virtually all modern training. It computes the gradient over a small batch of examples, typically 16 to 256. This is much faster than full batch (because you process fewer examples per step) and much less noisy than single-example SGD (because averaging over a batch smooths out individual example noise). The batch size is a hyperparameter that balances computation cost, gradient quality, and GPU memory usage.

When people say "gradient descent" in the context of deep learning, they almost always mean mini-batch gradient descent. The term "SGD" is used loosely to refer to mini-batch as well, because the gradient from a mini-batch is still a stochastic (random) estimate of the true gradient.
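The three variants differ only in how many examples feed each gradient estimate. The sketch below fits a line to synthetic data with mean squared error. The data generator, the target line y = 2x + 1, and all hyperparameter values are assumptions made for illustration.

```python
import random


def make_data(n, seed=0):
    # Synthetic points around the line y = 2x + 1 with a little noise.
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = 2.0 * x + 1.0 + rng.gauss(0.0, 0.1)
        data.append((x, y))
    return data


def minibatch_sgd(data, batch_size, lr=0.1, epochs=50, seed=0):
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # new random batches every epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of the mean squared error, averaged over the batch.
            gw = sum(2.0 * (w * x + b - y) * x for x, y in batch) / len(batch)
            gb = sum(2.0 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= lr * gw
            b -= lr * gb
    return w, b


data = make_data(512)
w, b = minibatch_sgd(data, batch_size=32)
print(w, b)
```

Setting batch_size=1 gives single-example SGD, and batch_size=len(data) gives full-batch gradient descent, with one update per pass over the data.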

Momentum: Accelerating Convergence

Plain gradient descent has a well-known problem in valleys where the loss surface curves steeply in one direction but gently in another. The gradient oscillates back and forth across the steep direction while making slow progress along the gentle direction. This is inefficient and common in practice.

Momentum addresses this by adding a "velocity" term to the update. Instead of moving in the direction of the current gradient only, the update is a weighted combination of the current gradient and the previous update direction. If the gradient points consistently in one direction across multiple steps, momentum accumulates and the updates accelerate. If the gradient oscillates, the oscillations cancel out and the net movement follows the consistent direction.

The analogy is a ball rolling downhill. Without momentum, the ball reacts only to the slope at its current position. With momentum, the ball has inertia: it continues in its current direction and is gradually deflected by the slope. This inertia carries it through flat regions (where gradients are small and plain SGD stalls) and damps oscillation in narrow valleys.

Momentum is controlled by a coefficient, typically set to 0.9, meaning 90% of the previous velocity is retained at each step. This seemingly simple modification produces dramatically faster convergence on most loss landscapes.
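A minimal sketch of the velocity update, assuming one common formulation (velocity = beta * velocity + gradient, then parameter = parameter - lr * velocity). The elongated quadratic L(x, y) = 0.5 * (x^2 + 50 * y^2) stands in for a steep-and-gentle valley; it and the learning rate are illustrative choices.

```python
def loss(x, y):
    # Steep along y, gentle along x.
    return 0.5 * (x ** 2 + 50.0 * y ** 2)


def grad(x, y):
    return x, 50.0 * y


def descend(lr, beta, steps, start=(10.0, 1.0)):
    x, y = start
    vx, vy = 0.0, 0.0
    for _ in range(steps):
        gx, gy = grad(x, y)
        # Velocity keeps a fraction beta of its previous value.
        vx = beta * vx + gx
        vy = beta * vy + gy
        x -= lr * vx
        y -= lr * vy
    return loss(x, y)


plain = descend(lr=0.03, beta=0.0, steps=100)          # beta = 0 is plain gradient descent
with_momentum = descend(lr=0.03, beta=0.9, steps=100)
print(plain, with_momentum)
```

On this toy surface, the run with beta = 0.9 ends at a noticeably lower loss after the same number of steps, mostly because velocity builds up along the gentle x direction. Libraries differ in the exact form of the update, so treat this as one variant rather than a reference implementation.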

Adam and Adaptive Learning Rates

Adam (Adaptive Moment Estimation) is the most widely used optimizer in deep learning. It combines two ideas: momentum (tracking the running average of gradients) and adaptive learning rates (tracking the running average of squared gradients).

The running average of gradients (first moment) provides momentum, smoothing out noise and accelerating convergence. The running average of squared gradients (second moment) provides per-parameter learning rate adaptation. Parameters with consistently large gradients get smaller effective learning rates (to avoid overshooting), while parameters with small, sporadic gradients get larger effective learning rates (to make meaningful progress despite the noise).

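Adam also includes bias correction for the early steps of training, when the running averages have not yet accumulated enough history to be reliable. Because both averages start at zero, the uncorrected estimates are biased toward zero during the first steps. With the default settings the second-moment estimate is biased more strongly than the first, so without correction the early updates would tend to be too large rather than well scaled.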
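The bias is easy to see numerically. The sketch below feeds a constant, hypothetical gradient of 0.5 into both running averages for three steps, using the commonly cited defaults beta1 = 0.9 and beta2 = 0.999, and prints the raw and corrected estimates.

```python
beta1, beta2 = 0.9, 0.999
g = 0.5  # hypothetical constant gradient

m, v = 0.0, 0.0  # running averages start at zero
for t in range(1, 4):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    # Bias correction divides by (1 - beta**t).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    print(f"t={t}  m={m:.4f}  m_hat={m_hat:.4f}  v={v:.6f}  v_hat={v_hat:.6f}")
```

The raw averages start far below the true values of 0.5 and 0.25, while the corrected estimates recover them from the first step in this constant-gradient case.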

The default hyperparameters for Adam (learning rate 0.001, beta1 0.9, beta2 0.999) work well across a surprisingly wide range of tasks. This robustness is why Adam is the default recommendation for most deep learning projects. You can often get good results without tuning the optimizer at all, which frees your time for tuning other aspects of the model and data.
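Putting the pieces together, here is a plain-Python sketch of the Adam update for a small list of parameters. It is not a library implementation. The small constant eps (1e-8 here) is an assumption added to avoid division by zero; the text above does not mention it. The toy gradient function and the learning rate of 0.01 are also illustrative.

```python
import math


def adam_minimize(grad_fn, w0, lr=0.001, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=1000):
    w = list(w0)
    m = [0.0] * len(w)  # first moment: running average of gradients
    v = [0.0] * len(w)  # second moment: running average of squared gradients
    for t in range(1, steps + 1):
        g = grad_fn(w)
        for i in range(len(w)):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
            m_hat = m[i] / (1 - beta1 ** t)  # bias-corrected estimates
            v_hat = v[i] / (1 - beta2 ** t)
            # Per-parameter step: large v_hat means a smaller effective rate.
            w[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


def toy_grad(w):
    # Gradient of (w0 - 1)**2 + 100 * (w1 + 2)**2: two very different scales.
    return [2.0 * (w[0] - 1.0), 200.0 * (w[1] + 2.0)]


result = adam_minimize(toy_grad, [0.0, 0.0], lr=0.01, steps=5000)
print(result)
```

The two coordinates have gradients that differ by a factor of 100, yet both move at a similar pace early on, because each step is divided by the square root of that coordinate's own second-moment estimate.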

The Loss Landscape

Understanding why gradient descent works requires understanding the structure of the loss landscape it navigates.

In low dimensions, optimization is hard because local minima can trap the optimizer. A two-dimensional loss surface with many valleys means gradient descent might settle in a shallow valley and never find the deepest one. This was a major concern in early neural network research.

In high dimensions, the situation is fundamentally different. Research by Choromanska et al. (2015) and Dauphin et al. (2014) suggests that in high-dimensional loss landscapes, local minima tend to have loss values very close to the global minimum. The more concerning obstacles are saddle points, positions where the loss surface curves upward in some directions and downward in others. At a saddle point, the gradient is zero, and near one it is very small, so plain gradient descent slows to a crawl. The point is not a minimum, however, because moving in certain directions would decrease the loss. Momentum and adaptive methods are effective at escaping saddle points because accumulated velocity carries the optimizer past the flat region.
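A toy saddle makes this concrete. The sketch below uses f(x, y) = x^2 - y^2, which curves upward along x and downward along y, and counts how many steps it takes for y to move away from the saddle at the origin. The function, the escape threshold of 1.0, and the hyperparameters are illustrative assumptions. The function is also unbounded below, so this shows escape behavior only, not convergence.

```python
def steps_to_escape(y0, lr=0.01, beta=0.0, max_steps=5000):
    x, y = 1.0, y0
    vx, vy = 0.0, 0.0
    for step in range(1, max_steps + 1):
        gx, gy = 2.0 * x, -2.0 * y  # gradient of f(x, y) = x**2 - y**2
        vx = beta * vx + gx
        vy = beta * vy + gy
        x -= lr * vx
        y -= lr * vy
        if abs(y) > 1.0:
            return step  # left the neighborhood of the saddle
    return None  # never escaped within max_steps


print(steps_to_escape(0.0))             # starts exactly on the ridge
print(steps_to_escape(1e-6))            # tiny perturbation, plain descent
print(steps_to_escape(1e-6, beta=0.9))  # same perturbation, with momentum
```

Starting exactly on the ridge, the y-gradient is zero and plain descent slides into the saddle and stays there. With a tiny perturbation it does escape, and the run with momentum needs fewer steps because velocity accumulates along the downhill direction.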

Another widely reported observation is that flatter minima tend to generalize better than sharp minima. A sharp minimum is a narrow valley where small changes in parameters cause large changes in loss. A flat minimum is a broad valley where the loss is similarly low across a range of parameter values. The noise in mini-batch gradient descent appears to bias the optimizer toward flat minima, because the noise makes it hard to settle into sharp, narrow valleys. This is one proposed reason why SGD with mini-batches often generalizes better than full-batch gradient descent despite its noisier gradient estimates.

Learning Rate Schedules

Using a fixed learning rate throughout training is often suboptimal. In early training, a larger learning rate helps the model make rapid progress through the easy parts of the loss landscape. In later training, a smaller learning rate helps the model fine-tune its parameters near a good minimum without overshooting.

Step decay reduces the learning rate by a fixed factor (typically dividing by 10) at predetermined epoch milestones. This is simple but requires knowing in advance when to reduce the rate.

Cosine annealing gradually decreases the learning rate following a cosine curve from the initial value to near zero over the training run. This is smooth, requires no manual milestones, and has become the default schedule for many training recipes.

Warm-up starts training with a very small learning rate and gradually increases it to the target value over the first few hundred or thousand steps. This prevents large, unstable updates when the model's parameters are still randomly initialized and the gradients are unreliable. Warm-up is nearly universal in transformer training.

Cyclical learning rates oscillate the learning rate between a minimum and maximum value. The theory is that periodically increasing the learning rate helps the optimizer escape local minima and explore new regions of the loss landscape. In practice, cyclical schedules work well for some problems but are less commonly used than cosine annealing.
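The four schedules can each be written as a small function of the step number. These are sketches with hypothetical names and signatures, not the API of any particular framework. Pairing linear warm-up with cosine decay is one common combination, assumed here for illustration, and the cyclical schedule uses a simple triangular wave.

```python
import math


def step_decay(step, base_lr, milestones, factor=0.1):
    # Multiply by `factor` once for every milestone already passed.
    drops = sum(1 for m in milestones if step >= m)
    return base_lr * factor ** drops


def cosine_annealing(step, total_steps, base_lr, min_lr=0.0):
    # Follow half a cosine wave from base_lr down to min_lr.
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def warmup_then_cosine(step, total_steps, base_lr, warmup_steps):
    # Linear ramp up to base_lr, then cosine decay for the remaining steps.
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return cosine_annealing(step - warmup_steps, total_steps - warmup_steps, base_lr)


def triangular_cycle(step, min_lr, max_lr, half_cycle):
    # Rise linearly for half_cycle steps, then fall for half_cycle steps.
    position = step % (2 * half_cycle)
    distance = position if position < half_cycle else 2 * half_cycle - position
    return min_lr + (max_lr - min_lr) * distance / half_cycle


for s in (0, 30, 60, 99):
    print(s, step_decay(s, 0.1, [30, 60]), cosine_annealing(s, 100, 0.1))
```

In a training loop, the returned value would replace a fixed lr at each step or epoch.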

Key Takeaway

Gradient descent trains AI models by repeatedly computing the slope of the loss function and adjusting parameters downhill. Mini-batch gradient descent with momentum and adaptive learning rates (Adam) is the standard approach. The learning rate is usually the most critical hyperparameter, and learning rate schedules that decay over time, often after a short warm-up, usually improve results. Despite navigating loss landscapes with billions of dimensions, gradient descent works because high-dimensional landscapes appear to have favorable geometric properties that make good solutions accessible.