What Is the Vanishing Gradient Problem?
The Detailed Answer
During backpropagation, the chain rule computes the gradient at each layer as a product of the local gradients of every layer after it. If each local gradient has magnitude less than 1 (which happens naturally with sigmoid, whose derivative never exceeds 0.25, and with tanh once it saturates), the product of many such numbers shrinks exponentially. In a 20-layer network where each layer multiplies the gradient by 0.25, the gradient reaching the first layer is 0.25^19, which is approximately 3.6 × 10^-12. This number is so small that the weight updates it produces are essentially zero, meaning the first layer effectively stops learning.
The problem gets worse with depth. In a 50-layer network, the gradient at the first layer would be 0.25^49, roughly 3 × 10^-30. That value is still representable in 32-bit floating point, but it is far too small to change a weight in any meaningful way, and in 16-bit (half-precision) arithmetic it underflows to exactly zero. This is why neural networks with more than a few layers were considered impractical before the early 2010s. Adding layers was supposed to increase capacity, but beyond a certain depth, additional layers actually degraded performance because the early layers could not learn.
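The arithmetic above can be checked directly. The following is a minimal sketch, assuming the worst-case simplification used in the text (every layer contributes the same local gradient of 0.25 and weights are ignored); the helper names `gradient_after` and `survives` are illustrative, not from any library.

```python
import math
import struct

def sigmoid_derivative(x):
    """Derivative of the logistic sigmoid: s(x) * (1 - s(x))."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def gradient_after(layers, local_gradient=0.25):
    """Product of `layers` identical local gradients (a simplification)."""
    return local_gradient ** layers

def survives(value, fmt):
    """True if `value` is still non-zero after rounding to the given
    struct format ("<f" = 32-bit float, "<e" = 16-bit float)."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0] != 0.0

peak = sigmoid_derivative(0.0)   # the sigmoid derivative peaks at x = 0
g20 = gradient_after(19)         # first layer of a 20-layer network
g50 = gradient_after(49)         # first layer of a 50-layer network

print("peak sigmoid derivative:", peak)
print("20 layers:", g20)
print("50 layers:", g50)
print("50-layer value non-zero in float32:", survives(g50, "<f"))
print("50-layer value non-zero in float16:", survives(g50, "<e"))
```

Real networks also multiply by weight matrices at every layer, so actual gradient magnitudes can be larger or smaller than this idealized product.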
Why This Matters
The vanishing gradient problem held back neural network research for over two decades. From the late 1980s through the early 2010s, it was a central reason neural networks could not go deep. Shallow networks (2 to 3 layers) worked but could not learn the hierarchical representations needed for complex tasks. The solutions to this problem (ReLU, residual connections, careful initialization, and normalization) are what enabled the deep learning revolution.
Solution 1: ReLU Activation
ReLU (Rectified Linear Unit) has a derivative of exactly 1 for positive inputs and 0 for negative inputs. Unlike sigmoid, whose derivative peaks at 0.25, ReLU passes gradients through unchanged when the neuron is active. In a 20-layer network with ReLU, the activation functions along a path of active neurons contribute a factor of 1^19 = 1 to the first layer's gradient, not 0.25^19 (the weights still scale the gradient, but the activation no longer shrinks it). This simple change, using ReLU instead of sigmoid, was one of the key innovations that enabled AlexNet to train 8 layers effectively in 2012.
ReLU does not completely solve the problem because neurons with negative inputs have a gradient of zero (the "dying ReLU" problem). If a large gradient update pushes a neuron into the negative region for all inputs, it becomes permanently inactive. Leaky ReLU (which has a small positive slope for negative inputs) addresses this by ensuring every neuron always has a non-zero gradient.
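The two derivatives can be compared side by side. This is a minimal sketch; the negative slope of 0.01 is an assumed, commonly used value and is not specified in the text above.

```python
def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

def leaky_relu_grad(x, negative_slope=0.01):
    """Derivative of Leaky ReLU: 1 for positive inputs, a small
    positive slope otherwise (0.01 is an assumed default)."""
    return 1.0 if x > 0 else negative_slope

pre_activations = [-2.0, -0.5, 0.3, 1.7]
relu_grads = [relu_grad(x) for x in pre_activations]
leaky_grads = [leaky_relu_grad(x) for x in pre_activations]

print("ReLU gradients:      ", relu_grads)
print("Leaky ReLU gradients:", leaky_grads)
```

A neuron whose pre-activation is negative for every input receives a zero gradient under ReLU, so its weights never change; under Leaky ReLU it still receives a small gradient and can recover.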
Solution 2: Residual Connections (Skip Connections)
Residual connections, introduced in ResNet (2015), are arguably the most important architectural solution to vanishing gradients. A residual connection adds the input of a layer directly to its output: y = x + f(x), where f is the layer's transformation. During backpropagation, the gradient for this operation is: dy/dx = 1 + df/dx (for vector inputs, the 1 becomes the identity matrix). The +1 term means the gradient can always flow directly through the skip connection, regardless of what happens in the layer's transformation.
In a 100-layer network with residual connections, the gradient from the loss can reach the first layer through a path that involves only addition operations, no multiplication by potentially small values. The transformation layers learn residuals (small corrections to the identity) rather than complete transformations, which is both easier to optimize and better for gradient flow.
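The effect of the +1 term can be illustrated with scalars. This is a sketch under strong assumptions: each layer is treated as a scalar function whose derivative df/dx is the same small value (0.01, chosen only for illustration), and the function names are hypothetical.

```python
def plain_chain(local_grads):
    """Gradient through stacked layers y = f(x): product of df/dx."""
    g = 1.0
    for d in local_grads:
        g *= d
    return g

def residual_chain(local_grads):
    """Gradient through stacked blocks y = x + f(x): product of (1 + df/dx)."""
    g = 1.0
    for d in local_grads:
        g *= 1.0 + d
    return g

local_grads = [0.01] * 100   # 100 layers, each with a small df/dx

plain = plain_chain(local_grads)
residual = residual_chain(local_grads)

print("plain network gradient:   ", plain)
print("residual network gradient:", residual)
```

Expanding the product of (1 + df/dx) terms always yields a leading 1, which corresponds to the path that passes through every skip connection; the remaining terms are the paths through one or more transformations.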
Residual connections made it possible to train networks with 152 layers (ResNet-152), then 1,000+ layers in research settings. They are now standard in virtually every deep architecture, including transformers, where every attention block and feedforward block has a residual connection.
Solution 3: Proper Initialization
Xavier initialization (2010) and He initialization (2015) set initial weight values so that the variance of activations and gradients remains approximately constant across layers. If activations neither grow nor shrink as they pass through layers, gradients also neither grow nor shrink during backpropagation.
He initialization, designed for ReLU networks, samples weights from a zero-mean distribution with variance 2/n, where n is the number of inputs to the layer. This specific variance compensates for the fact that ReLU zeros out roughly half of its inputs, which would otherwise cause the variance of the activations to halve at each layer.
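The factor of 2 can be observed in a small simulation. This is a rough sketch, assuming fully connected layers with zero biases, Gaussian weights, and an arbitrary width of 128 and depth of 10; `relu_forward_ratio` is an illustrative helper, and the exact numbers depend on the random seed.

```python
import math
import random

def relu_forward_ratio(variance_scale, n=128, layers=10, seed=0):
    """Pass a random input through `layers` ReLU layers whose weights are
    drawn from N(0, variance_scale / n), and return the final mean-square
    activation divided by the input's mean square."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    start = sum(v * v for v in x) / n
    std = math.sqrt(variance_scale / n)
    for _ in range(layers):
        # Each output unit is ReLU(w . x) with a freshly sampled weight row.
        x = [max(0.0, sum(rng.gauss(0.0, std) * v for v in x))
             for _ in range(n)]
    return (sum(v * v for v in x) / n) / start

he_ratio = relu_forward_ratio(2.0)      # variance 2/n (He)
naive_ratio = relu_forward_ratio(1.0)   # variance 1/n, no ReLU correction

print("He init, signal kept after 10 layers:  ", he_ratio)
print("1/n init, signal kept after 10 layers: ", naive_ratio)
```

With variance 2/n the activation scale stays within the same order of magnitude across the stack, while with variance 1/n it is expected to shrink by about a factor of 2 per layer.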
Initialization alone does not solve the vanishing gradient problem in very deep networks, but it provides a stable starting point that gives the other solutions (ReLU, residual connections, normalization) a foundation to work from. Poor initialization can prevent any of the other solutions from being effective.
Solution 4: Normalization Layers
Batch normalization (2015) normalizes the activations at each layer to have zero mean and unit variance (computed across the batch dimension). This prevents the internal distribution of activations from shifting during training (a phenomenon called internal covariate shift), which was believed to slow training and contribute to gradient instability. Batch normalization also has a regularizing effect and allows higher learning rates, both of which improve training.
Layer normalization normalizes across the feature dimension for each individual example rather than across the batch. It is preferred for transformers and RNNs because it does not depend on batch size and works with variable-length sequences. Layer normalization is applied before the attention and feedforward layers in most modern transformer architectures (pre-norm configuration).
Both normalization methods help with vanishing gradients by keeping the scale of activations in a range where gradients are well-behaved. Without normalization, activations in deep networks can drift to extreme values where activation function derivatives are very small (for sigmoid/tanh) or where numerical precision is lost.
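The difference between batch normalization and layer normalization comes down to which axis the statistics are computed over. The sketch below assumes a batch stored as a list of rows (one row per example) and omits the learnable scale and shift parameters and the running statistics that real implementations keep; the epsilon of 1e-5 is an assumed value.

```python
import math

def normalize(values, eps=1e-5):
    """Shift and scale a list of numbers to zero mean and ~unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def batch_norm(batch):
    """Normalize each feature (column) across the examples in the batch."""
    columns = [normalize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]

def layer_norm(batch):
    """Normalize each example (row) across its own features."""
    return [normalize(row) for row in batch]

# Three examples with three features each; the second feature has a
# much larger scale than the others.
batch = [[1.0, 200.0, 3.0],
         [2.0, 400.0, 1.0],
         [3.0, 600.0, 2.0]]

bn = batch_norm(batch)
ln = layer_norm(batch)

print("batch norm:", bn)
print("layer norm:", ln)
```

Because `layer_norm` only looks at one row at a time, its output for an example does not change with batch size, which is the property that makes it convenient for variable-length sequences.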
The Vanishing Gradient Problem in RNNs
RNNs face a particularly severe form of the problem because the "depth" of the network is the sequence length, which can be hundreds or thousands of steps. When gradients are propagated backward through time (BPTT), they must pass through the same weight matrix at every step. If all of this matrix's eigenvalues have magnitude less than 1, repeated multiplication shrinks the gradient, and it vanishes exponentially with sequence length.
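The repeated multiplication can be sketched with a small example. It assumes a linear recurrence (the activation derivative is ignored, which would only shrink the gradient further) and a hypothetical 2x2 recurrent matrix chosen so that its eigenvalues are 0.7 and 0.3.

```python
import math

def matvec(m, v):
    """Multiply matrix `m` (list of rows) by vector `v`."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def gradient_norm_after(steps, w, grad):
    """Norm of `grad` after `steps` backward steps, each of which
    multiplies by the transpose of the recurrent matrix `w`."""
    w_t = [list(col) for col in zip(*w)]
    for _ in range(steps):
        grad = matvec(w_t, grad)
    return math.sqrt(sum(g * g for g in grad))

# Hypothetical recurrent weight matrix with eigenvalues 0.7 and 0.3.
W = [[0.5, 0.2],
     [0.2, 0.5]]
start = [1.0, 0.0]

norm_10 = gradient_norm_after(10, W, start)
norm_100 = gradient_norm_after(100, W, start)

print("gradient norm after 10 steps: ", norm_10)
print("gradient norm after 100 steps:", norm_100)
```

The norm decays roughly like 0.7^T, the largest eigenvalue raised to the number of time steps, so a sequence of a few hundred steps leaves almost no signal for the earliest positions.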
LSTMs solve this with a cell state that passes through the entire sequence with only element-wise operations (addition and multiplication by gate values close to 1), creating a gradient highway that avoids the repeated matrix multiplications that cause vanishing. Transformers solve it more radically by eliminating recurrence entirely, using attention to connect any two positions directly.
The vanishing gradient problem occurs when gradients shrink exponentially as they propagate backward through deep networks, preventing early layers from learning. It is caused primarily by saturating activation functions (sigmoid, tanh) and poor initialization. The solutions (ReLU activations, residual connections, proper initialization with He or Xavier schemes, and normalization layers) are now standard components of virtually every deep architecture and have made training networks with hundreds of layers routine.