Backpropagation Explained Simply

Updated May 2026
Backpropagation is the algorithm that computes how much each weight in a neural network contributed to the prediction error, enabling gradient descent to adjust those weights and improve the model. It works by applying the chain rule of calculus to propagate error signals backward from the output layer through every hidden layer to the input, computing a gradient for every single weight in the network in a single efficient pass.

The Problem Backpropagation Solves

A neural network might have billions of weights. After making a wrong prediction, the network needs to know: which weights caused the error, by how much, and in which direction should each weight change to reduce the error? Answering this for every weight is the gradient computation problem.

The brute-force approach would be to tweak each weight individually, re-run the entire forward pass, and measure how the loss changed. For a network with N weights, this requires N+1 forward passes per training step. A network with 1 billion weights would need 1 billion forward passes to compute one gradient, making training completely impractical.

Backpropagation solves this by computing all gradients simultaneously in a single backward pass that takes roughly the same computational effort as one forward pass. This efficiency is what makes training large neural networks possible. Without backpropagation, modern deep learning simply could not exist.

The Chain Rule: The Mathematical Foundation

Backpropagation is an application of the chain rule from calculus. The chain rule says that if you have nested functions, the derivative of the outer function with respect to the innermost variable is the product of derivatives at each step.

For example, if y = f(g(x)), then dy/dx = df/dg * dg/dx. The rate of change of y with respect to x equals the rate of change of f with respect to g multiplied by the rate of change of g with respect to x. This rule extends to any number of nested functions: if y = f(g(h(x))), then dy/dx = df/dg * dg/dh * dh/dx.

A neural network is exactly such a chain of nested functions. The output layer applies a function to the last hidden layer, which applied a function to the previous layer, and so on back to the input. The loss is a function of the output, the output is a function of the last hidden layer, the last hidden layer is a function of the previous layer, all the way back. The chain rule lets you compute the gradient of the loss with respect to any weight in the network by multiplying the local gradients along the path from the loss to that weight.

The Forward Pass: Setting Up for Backpropagation

Before backpropagation can run, the network must perform a forward pass, processing the input through every layer to produce a prediction and compute the loss. During this forward pass, the network saves the intermediate values at each layer (the activations and pre-activation values), because backpropagation needs these values to compute gradients.

This is why training a neural network uses significantly more memory than inference. During inference, you only need to store the activations of the current and next layer. During training, you need to store the activations of every layer simultaneously, because backpropagation will need them all. For a model like GPT-4, this memory overhead is one of the main factors determining how many examples can fit in a training batch.

The Backward Pass: Step by Step

Backpropagation starts at the output and works backward through the network, one layer at a time.

Step 1: Compute the loss gradient. The gradient of the loss with respect to the network's output is computed directly from the loss function. For cross-entropy loss with softmax output, this simplifies to the predicted probability minus the true label (a vector of mostly zeros with a 1 at the correct class). This gradient tells the output layer how to adjust its outputs to reduce the loss.

Step 2: Propagate to the output layer weights. The gradient of the loss with respect to each output layer weight is computed using the chain rule: it equals the loss gradient (from step 1) multiplied by the activation from the last hidden layer. This tells you how much each output weight contributed to the error.

Step 3: Propagate to the last hidden layer. The gradient of the loss with respect to the last hidden layer's activations is computed by multiplying the loss gradient by the output layer's weight matrix (transposed). This "passes the blame" backward from the output layer to the hidden layer, distributing it proportionally to the weights of each connection.

Step 4: Account for the activation function. The gradient flowing into a layer must be multiplied by the derivative of the activation function at the pre-activation value. For ReLU, this derivative is 1 for positive values and 0 for negative values. For sigmoid, the derivative is sigmoid(x) * (1 - sigmoid(x)), which approaches zero for large and small inputs (the vanishing gradient problem).

Step 5: Repeat for every layer. Steps 2 through 4 are repeated for each layer, working backward from the output to the input. At each layer, the algorithm computes the gradient for that layer's weights (for the optimizer to use) and the gradient for the previous layer's activations (to continue the backward pass). By the time the computation reaches the first layer, every weight in the network has a gradient.

Computational Efficiency

The key insight of backpropagation is that it computes gradients by reusing intermediate results. Each layer's gradient computation depends on the gradient from the layer above (which was already computed) and the activations saved during the forward pass. No computation is wasted, and no computation is repeated.

The total cost of the backward pass is roughly 2 to 3 times the cost of the forward pass. The forward pass computes N multiplications and additions per layer. The backward pass computes roughly 2N multiplications per layer (one for the weight gradients, one for passing the gradient to the previous layer). This means that one complete training step (forward + backward + weight update) costs roughly 3 to 4 times one inference step.

Modern deep learning frameworks (PyTorch, TensorFlow, JAX) implement backpropagation automatically through a technique called automatic differentiation. The programmer defines the forward computation, and the framework builds a computational graph that records every operation. When backward() is called, the framework traverses this graph in reverse, applying the chain rule at each operation. The programmer never needs to derive or implement gradients manually.

A Concrete Example

Consider a tiny network: 2 input neurons, 2 hidden neurons with ReLU, 1 output neuron with sigmoid, and mean squared error loss. Suppose the input is [1.0, 0.5], the target is 1.0, and the weights are initialized so the network predicts 0.3.

The loss is (1.0 - 0.3)^2 = 0.49. The gradient of the loss with respect to the output is 2 * (0.3 - 1.0) = -1.4. This says: "the output is too low, increase it."

Sigmoid's derivative at the pre-activation value scales this gradient. Multiply by the hidden layer activations to get the output layer weight gradients. Multiply by the output layer weights (transposed) to pass the gradient to the hidden layer. Apply ReLU's derivative (0 or 1). Multiply by the input values to get the hidden layer weight gradients.

The entire computation for this network involves roughly 30 arithmetic operations. Scale the same process to a network with 100 layers and 1 billion weights, and it involves roughly 3 billion operations per training step, which a modern GPU completes in milliseconds.

Historical Context

Backpropagation was not invented once but discovered independently by several researchers. The chain rule-based approach was described by Seppo Linnainmaa in 1970 for general automatic differentiation. Paul Werbos applied it to neural networks in his 1974 PhD thesis. But it was the 1986 paper by David Rumelhart, Geoffrey Hinton, and Ronald Williams, "Learning Representations by Back-Propagating Errors," that demonstrated backpropagation's practical effectiveness and popularized it within the AI community.

Before backpropagation was widely adopted, training multi-layer networks was considered impractical. The perceptron learning rule worked for single-layer networks but could not assign credit (blame for errors) to hidden layer neurons. Backpropagation solved this credit assignment problem by providing an efficient, exact gradient for every weight in the network, including hidden layer weights that have no direct connection to the output.

Backpropagation remains the foundation of neural network training in 2026. Every major advance, CNNs, RNNs, transformers, diffusion models, uses backpropagation for gradient computation. While alternative training methods are researched (forward-forward algorithm, equilibrium propagation, perturbation-based methods), none has matched backpropagation's combination of efficiency, generality, and scalability.

Key Takeaway

Backpropagation computes the gradient of the loss with respect to every weight in a neural network using the chain rule, working backward from the output layer to the input. It runs in time proportional to a single forward pass, making it efficient enough to train networks with billions of parameters. Backpropagation solved the credit assignment problem for hidden layers and remains the universal training algorithm for all modern neural network architectures.