What Is the Vanishing Gradient Problem?

Updated May 2026
The vanishing gradient problem occurs when gradients become exponentially smaller as they propagate backward through the layers of a deep neural network. When gradients vanish, the early layers of the network receive error signals too small to produce meaningful weight updates, effectively preventing those layers from learning. This was a primary obstacle to training deep networks for decades and was largely overcome by a combination of ReLU activations, skip connections, careful initialization, and normalization layers.

The Detailed Answer

During backpropagation, gradients are computed by multiplying local gradients at each layer using the chain rule. If each local gradient is a number less than 1 (which happens naturally with sigmoid and tanh activations), the product of many such numbers shrinks exponentially. In a 20-layer network where each layer multiplies the gradient by 0.25, the gradient reaching the first layer is scaled by 0.25^19, which is approximately 3.6 × 10^-12. This number is so small that the weight updates it produces are essentially zero, meaning the first layer effectively stops learning.
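The arithmetic above is easy to check directly. The following is a minimal sketch that assumes, as the paragraph does, that every layer contributes the same local gradient; `gradient_scale` is a hypothetical helper name used only for illustration.

```python
def gradient_scale(local_gradient: float, num_layers: int) -> float:
    """Scale factor applied to a gradient by the time it reaches the first layer.

    Assumes every layer multiplies the gradient by the same local_gradient,
    so the first of num_layers layers sees num_layers - 1 multiplications.
    """
    return local_gradient ** (num_layers - 1)


if __name__ == "__main__":
    # Print the scale factor for a few depths with a local gradient of 0.25.
    for depth in (5, 10, 20, 50):
        print(f"{depth:>2} layers: {gradient_scale(0.25, depth):.3e}")
```

Real networks do not have identical local gradients at every layer, so this only illustrates the exponential trend, not the exact gradient in any particular model.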

The problem gets worse with depth. In a 50-layer network, the gradient at the first layer would be scaled by 0.25^49, approximately 3 × 10^-30. That value is still representable in 32-bit floating point, but it is far too small to change any weight, and in 16-bit floating point it underflows to exactly zero. This is why neural networks with more than a few layers were considered impractical before the early 2010s. Adding layers was supposed to increase capacity, but beyond a certain depth, additional layers actually degraded performance because the early layers could not learn.

Why do sigmoid and tanh cause vanishing gradients?
Sigmoid and tanh both saturate, meaning their outputs flatten for large positive and large negative inputs. The derivative of sigmoid peaks at only 0.25 (at the input value zero) and approaches zero for inputs far from zero. At every layer, the gradient is multiplied by this derivative. Since the largest possible value is 0.25, and units in the saturated region have derivatives that are smaller still, the gradient shrinks at every layer. Tanh has a higher maximum derivative of 1.0 at zero, but its derivative is below 1 everywhere else and it still saturates for large inputs, producing the same exponential decay.
What is the exploding gradient problem?
Exploding gradients are the opposite failure: gradients grow exponentially through the layers instead of shrinking. This happens when the weight matrices have singular values greater than 1 (loosely, eigenvalues with magnitude above 1), causing each layer to amplify the gradient. Exploding gradients produce enormous weight updates that destabilize training, often causing the loss to increase suddenly or the weights to become NaN (not a number). Gradient clipping, which caps the gradient magnitude at a threshold, is the standard fix. Exploding gradients are easier to detect and fix than vanishing gradients because they cause visible training failures rather than silent stagnation.
Is the vanishing gradient problem fully solved?
For practical purposes, yes. The combination of ReLU activations, residual connections, proper initialization, and normalization layers has made training networks with hundreds of layers routine. However, vanishing gradients can still appear in specific architectures (very deep RNNs without gating, networks with saturating activations, poorly initialized models) and in some advanced training scenarios (very long sequence backpropagation through time, certain types of physics-informed neural networks). Awareness of the problem remains important for diagnosing training failures.
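Two of the claims in the answers above can be checked with a few lines of code: the derivative values of sigmoid and tanh, and the effect of gradient clipping. This is a minimal sketch using only the standard library. The helper names are hypothetical, and `clip_by_norm` shows one common variant (rescaling by L2 norm), which is an assumption since the text does not say which form of clipping is meant.

```python
import math


def sigmoid(x: float) -> float:
    # Fine for moderate x; very negative x would overflow math.exp.
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x: float) -> float:
    s = sigmoid(x)
    return s * (1.0 - s)


def tanh_derivative(x: float) -> float:
    return 1.0 - math.tanh(x) ** 2


def clip_by_norm(gradient: list[float], max_norm: float) -> list[float]:
    """Rescale gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= max_norm:
        return list(gradient)
    scale = max_norm / norm
    return [g * scale for g in gradient]


if __name__ == "__main__":
    # Derivatives at the center and further into the saturated region.
    for x in (0.0, 2.0, 5.0):
        print(f"x={x}: sigmoid'={sigmoid_derivative(x):.4f}, "
              f"tanh'={tanh_derivative(x):.4f}")
    # A large gradient is scaled down; its direction is preserved.
    print(clip_by_norm([300.0, 400.0], max_norm=1.0))
```

Clipping by norm keeps the direction of the update and only limits its size, which is why it stabilizes training without changing what the gradient points toward.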

Why This Matters

The vanishing gradient problem held back neural network research for over two decades. From the late 1980s through the early 2010s, it was the reason neural networks could not go deep. Shallow networks (2 to 3 layers) worked but could not learn the hierarchical representations needed for complex tasks. The solutions to this problem (ReLU, residual connections, and normalization) are what enabled the deep learning revolution.

Solution 1: ReLU Activation

ReLU (Rectified Linear Unit) has a derivative of exactly 1 for positive inputs and 0 for negative inputs. Unlike sigmoid's maximum derivative of 0.25, ReLU passes gradients through unchanged when the neuron is active. In a 20-layer network with ReLU, the activation derivatives along a path of active neurons contribute a factor of 1^19 = 1, not 0.25^19. This simple change, using ReLU instead of sigmoid, was one of the key innovations that enabled AlexNet to train 8 layers effectively in 2012.

ReLU does not completely solve the problem because neurons with negative inputs have a gradient of zero (the "dying ReLU" problem). If a large gradient update pushes a neuron into the negative region for all inputs, it becomes permanently inactive. Leaky ReLU (which has a small positive slope for negative inputs) addresses this by ensuring every neuron always has a non-zero gradient.
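The difference between the two activations shows up directly in their derivatives. The sketch below is illustrative only: the function names are hypothetical, and the default slope of 0.01 is a commonly used value chosen here as an assumption, not something the text specifies.

```python
import math


def relu_grad(x: float) -> float:
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def leaky_relu_grad(x: float, negative_slope: float = 0.01) -> float:
    """Derivative of Leaky ReLU: 1 for positive inputs, a small slope otherwise."""
    return 1.0 if x > 0 else negative_slope


if __name__ == "__main__":
    # Along a path of 19 active neurons the activation derivatives multiply to 1.
    active_path = [0.7] * 19
    print(math.prod(relu_grad(x) for x in active_path))

    # A neuron with a negative pre-activation: ReLU blocks the gradient,
    # Leaky ReLU lets a small fraction through.
    dead_input = -3.0
    print(relu_grad(dead_input), leaky_relu_grad(dead_input))
```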

Solution 2: Residual Connections (Skip Connections)

Residual connections, introduced in ResNet (2015), are the most important architectural solution to vanishing gradients. A residual connection adds the input of a layer directly to its output: y = x + f(x), where f is the layer's transformation. During backpropagation, the gradient for this operation is: dy/dx = 1 + df/dx. The +1 term means the gradient can always flow directly through the skip connection, regardless of what happens in the layer's transformation.

In a 100-layer network with residual connections, the gradient from the loss can reach the first layer through a path that involves only addition operations, no multiplication by potentially small values. The transformation layers learn residuals (small corrections to the identity) rather than complete transformations, which is both easier to optimize and better for gradient flow.
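The effect of the +1 term can be seen by comparing the chain-rule product with and without skip connections. This is a scalar sketch under a simplifying assumption: each layer's transformation has a fixed, small derivative df/dx. The function names are hypothetical.

```python
import math


def plain_gradient(layer_derivatives: list[float]) -> float:
    """Chain rule through stacked layers y = f(x): product of df/dx."""
    return math.prod(layer_derivatives)


def residual_gradient(layer_derivatives: list[float]) -> float:
    """Chain rule through stacked blocks y = x + f(x): product of (1 + df/dx)."""
    return math.prod(1.0 + d for d in layer_derivatives)


if __name__ == "__main__":
    # 100 layers whose transformations each have a small derivative of 0.01.
    derivatives = [0.01] * 100
    print("without skip connections:", plain_gradient(derivatives))
    print("with skip connections:   ", residual_gradient(derivatives))
```

Even if every df/dx were exactly zero, the residual product would still be 1, which is the identity path the paragraph describes.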

Residual connections made it possible to train networks with 152 layers (ResNet-152), then 1,000+ layers in research settings. They are now standard in virtually every deep architecture, including transformers, where every attention block and feedforward block has a residual connection.

Solution 3: Proper Initialization

Xavier initialization (2010) and He initialization (2015) set initial weight values so that the variance of activations and gradients remains approximately constant across layers. If activations neither grow nor shrink as they pass through layers, gradients also neither grow nor shrink during backpropagation.

He initialization, designed for ReLU networks, samples weights from a distribution with variance 2/n, where n is the number of inputs to the layer. This specific variance compensates for the fact that ReLU zeros out roughly half of its inputs, which would otherwise cause the variance of the activations to halve at each layer.
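A small experiment makes the factor of 2 concrete. The sketch below pushes a random vector through a stack of dense ReLU layers using only the standard library, once with weight variance 2/n and once with variance 1/n. The layer width, depth, seed, and helper names are all assumptions chosen for illustration.

```python
import math
import random


def he_std(fan_in: int) -> float:
    """Standard deviation for He initialization: sqrt(2 / fan_in)."""
    return math.sqrt(2.0 / fan_in)


def mean_square_after(depth: int, width: int, std: float, seed: int = 0) -> float:
    """Mean squared activation after `depth` dense ReLU layers with N(0, std^2) weights."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        weights = [[rng.gauss(0.0, std) for _ in range(width)] for _ in range(width)]
        # Dense layer followed by ReLU.
        x = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]
    return sum(v * v for v in x) / width


if __name__ == "__main__":
    width, depth = 64, 10
    print("He, variance 2/n:   ", mean_square_after(depth, width, he_std(width)))
    print("naive, variance 1/n:", mean_square_after(depth, width, math.sqrt(1.0 / width)))
```

Because both runs use the same seed, the only difference is the weight scale, so the gap between the two printed values reflects the per-layer factor of 2 compounding over the depth.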

Initialization alone does not solve the vanishing gradient problem in very deep networks, but it provides a stable starting point that gives the other solutions (ReLU, residual connections, normalization) a foundation to work from. Poor initialization can prevent any of the other solutions from being effective.

Solution 4: Normalization Layers

Batch normalization (2015) normalizes the activations at each layer to have zero mean and unit variance (computed across the batch dimension). The original motivation was to keep the internal distribution of activations from shifting during training (a phenomenon called internal covariate shift), which was believed to slow training and contribute to gradient instability. Batch normalization also has a regularizing effect and allows higher learning rates, both of which improve training.

Layer normalization normalizes across the feature dimension for each individual example rather than across the batch. It is preferred for transformers and RNNs because it does not depend on batch size and works with variable-length sequences. Layer normalization is applied before the attention and feedforward layers in most modern transformer architectures (pre-norm configuration).

Both normalization methods help with vanishing gradients by keeping the scale of activations in a range where gradients are well-behaved. Without normalization, activations in deep networks can drift to extreme values where activation function derivatives are very small (for sigmoid/tanh) or where numerical precision is lost.
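The two methods differ only in which axis the statistics are computed over. The following is a simplified sketch: it omits the learned scale and shift parameters and the running statistics that real implementations use, and the function names and `eps` value are assumptions for illustration.

```python
import math


def normalize(values: list[float], eps: float = 1e-5) -> list[float]:
    """Shift and scale values to zero mean and (approximately) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def layer_norm(batch: list[list[float]]) -> list[list[float]]:
    """Normalize each example (row) across its features."""
    return [normalize(row) for row in batch]


def batch_norm(batch: list[list[float]]) -> list[list[float]]:
    """Normalize each feature (column) across the examples in the batch."""
    columns = [normalize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]


if __name__ == "__main__":
    # Two examples with three features each, on very different scales.
    batch = [[1.0, 200.0, 3.0],
             [2.0, 400.0, 9.0]]
    print("layer norm:", layer_norm(batch))
    print("batch norm:", batch_norm(batch))
```

Note that `layer_norm` gives the same result for an example regardless of what else is in the batch, which is the batch-size independence mentioned above.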

The Vanishing Gradient Problem in RNNs

RNNs face a particularly severe form of the problem because the "depth" of the network is the sequence length, which can be hundreds or thousands of steps. When gradients are propagated backward through time (BPTT), they must pass through the same weight matrix at every step. If this matrix's eigenvalues all have magnitude less than 1, gradients vanish exponentially with sequence length.
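Repeated multiplication by the same matrix can be illustrated with a 2x2 example. The sketch below makes a deliberate simplification: the recurrent matrix is a scaled rotation, so every eigenvalue has magnitude exactly `scale`, and the activation derivative is ignored. The function name and parameters are hypothetical.

```python
import math


def backprop_through_time(gradient: tuple[float, float], scale: float,
                          angle: float, steps: int) -> tuple[float, float]:
    """Multiply gradient by W = scale * rotation(angle) once per time step.

    Every eigenvalue of W has magnitude `scale`, so the gradient norm is
    multiplied by `scale` at each step. (BPTT uses the transpose of W, which
    is a rotation in the opposite direction with the same effect on the norm.)
    """
    c, s = scale * math.cos(angle), scale * math.sin(angle)
    gx, gy = gradient
    for _ in range(steps):
        gx, gy = c * gx - s * gy, s * gx + c * gy
    return gx, gy


if __name__ == "__main__":
    for scale in (0.9, 1.0, 1.1):
        g = backprop_through_time((1.0, 0.0), scale, angle=0.3, steps=100)
        print(f"scale={scale}: gradient norm after 100 steps = {math.hypot(*g):.3e}")
```

A scale slightly below 1 makes the gradient vanish over 100 steps and a scale slightly above 1 makes it explode, which is why plain RNNs are so sensitive to the recurrent weights.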

LSTMs solve this with a cell state that passes through the entire sequence with only element-wise operations (addition and multiplication by gate values close to 1), creating a gradient highway that avoids the repeated matrix multiplications that cause vanishing. Transformers solve it more radically by eliminating recurrence entirely, using attention to connect any two positions directly.
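The gradient highway through the cell state can be sketched in scalar form. This assumes the standard LSTM cell update c_t = f_t * c_{t-1} + i_t * g_t and, as a simplification, treats the gate values as constants rather than functions of the hidden state. The function name is hypothetical.

```python
import math


def cell_state_gradient(forget_gates: list[float]) -> float:
    """Gradient of the final cell state with respect to the initial one.

    For the update c_t = f_t * c_{t-1} + i_t * g_t with gates held constant,
    d c_t / d c_{t-1} = f_t, so the gradient along the cell-state path is the
    product of the forget-gate values.
    """
    return math.prod(forget_gates)


if __name__ == "__main__":
    # Forget gates close to 1 preserve the gradient over 100 steps.
    print(cell_state_gradient([0.99] * 100))
    # Forget gates far from 1 still let it decay.
    print(cell_state_gradient([0.5] * 100))
```

The second case shows that the highway only works when the network learns to keep its forget gates near 1 for information it needs to retain.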

Key Takeaway

The vanishing gradient problem occurs when gradients shrink exponentially as they propagate backward through deep networks, preventing early layers from learning. It is caused primarily by saturating activation functions (sigmoid, tanh) and poor initialization. The solutions (ReLU activations, residual connections, proper initialization such as He or Xavier, and normalization layers) are now standard components of nearly every deep architecture and have made training networks with hundreds of layers routine.