What Are Activation Functions?

Updated May 2026
Activation functions are nonlinear mathematical functions applied to each neuron's output in a neural network. They transform the weighted sum of inputs into a value that gets passed to the next layer. Without activation functions, a neural network of any depth would behave as a single linear transformation and could only learn straight-line relationships. Activation functions give neural networks the ability to learn complex, curved decision boundaries, which is why they are essential to every architecture from simple classifiers to large language models.

Why Nonlinearity Is Essential

A linear function has the form f(x) = mx + b. If you stack multiple linear functions, the result is still linear. Layer 1 computes y1 = W1*x + b1. Layer 2 computes y2 = W2*y1 + b2 = W2*(W1*x + b1) + b2 = (W2*W1)*x + (W2*b1 + b2). This is just another linear function. No matter how many linear layers you stack, the final output is always a linear function of the input.
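The collapse can be checked numerically. The sketch below uses small made-up weights (W1, b1, W2, b2 are illustrative values, not from any real model) and compares two stacked linear layers against the single layer built from W2*W1 and W2*b1 + b2.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrices A and B (lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    """Element-wise vector addition."""
    return [a + b for a, b in zip(u, v)]

# Hypothetical weights and biases, chosen only for illustration.
W1, b1 = [[1.0, 2.0], [0.5, -1.0]], [0.5, -0.5]
W2, b2 = [[2.0, 0.0], [1.0, 3.0]], [1.0, 0.0]

def two_layers(x):
    y1 = add(matvec(W1, x), b1)      # layer 1, no activation
    return add(matvec(W2, y1), b2)   # layer 2, no activation

# The equivalent single layer: W = W2*W1, b = W2*b1 + b2.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)

def one_layer(x):
    return add(matvec(W, x), b)

x = [3.0, -2.0]
print(two_layers(x), one_layer(x))  # the two results agree
```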

Linear models can only draw straight-line decision boundaries. A linear classifier can separate cats from dogs only if there is some straight line (or hyperplane in higher dimensions) that perfectly divides the two classes in the feature space. For real data, this is almost never the case. Cats and dogs overlap in every simple feature; distinguishing them requires learning complex, nonlinear combinations of features.

Activation functions break the linearity. By applying a nonlinear function after each layer's weighted sum, the composition of layers produces a nonlinear function. A network with ReLU activations computes a piecewise linear function, and each added neuron can contribute another linear piece, so the network can follow a curve as closely as needed. With enough hidden neurons, a network with nonlinear activations can approximate any continuous function on a bounded domain to arbitrary precision (the universal approximation theorem).
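XOR is a classic case that no single straight line can separate. As a minimal sketch, the tiny network below uses two ReLU hidden units with hand-picked weights (an assumption for illustration, not weights found by training) and reproduces XOR exactly.

```python
def relu(x):
    return max(0.0, x)

def xor_net(x1, x2):
    """Two hidden ReLU units followed by a linear output."""
    h1 = relu(x1 + x2)          # fires when at least one input is on
    h2 = relu(x1 + x2 - 1.0)    # fires only when both inputs are on
    return h1 - 2.0 * h2        # subtract the "both on" case

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, xor_net(a, b))
```

Removing the two relu calls turns xor_net into a plain linear function of x1 and x2, which cannot produce this pattern.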

Sigmoid

The sigmoid function squashes its input to a value between 0 and 1 using the formula f(x) = 1 / (1 + e^(-x)). Large positive inputs produce values near 1, large negative inputs produce values near 0, and inputs near zero produce values near 0.5. The output can be interpreted as a probability, which is why sigmoid is used in the output layer of binary classifiers.
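A minimal sketch of the formula in plain Python follows. The two-branch form is a common trick (an implementation choice, not part of the definition) that avoids overflow in the exponential for large negative inputs.

```python
import math

def sigmoid(x):
    """Sigmoid f(x) = 1 / (1 + e^(-x)), computed without overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)              # x < 0, so z is in (0, 1)
    return z / (1.0 + z)         # algebraically the same value

for x in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(x, sigmoid(x))
```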

Sigmoid was the dominant activation function in neural networks from the 1980s through the early 2010s. Its biological plausibility (real neurons have firing rates bounded between zero and a maximum) and smooth gradient made it a natural choice.

However, sigmoid has a critical flaw: it saturates. For inputs far from zero in either direction, the gradient (the derivative of the function) approaches zero. Even at its peak, at x = 0, the derivative is only 0.25. When gradients are near zero, weights barely update during backpropagation. In deep networks, this causes the vanishing gradient problem, where layers far from the output stop learning because the error signal has been shrunk at each intervening layer. This problem severely limited the depth of practical networks for decades.
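The effect can be made concrete with a rough calculation. The sketch below computes the sigmoid derivative, s(x) * (1 - s(x)), and multiplies its best-case value across several layers. It deliberately ignores the weight matrices, which also scale the signal in a real network, so treat it as a simplified illustration rather than a full backpropagation.

```python
import math

def sigmoid(x):
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def sigmoid_grad(x):
    """Derivative of sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))    # the peak of the derivative
print(sigmoid_grad(10.0))   # deep in the saturated region

# Best case: every layer sits at the peak, so each one scales
# the error signal by at most 0.25 (weights ignored).
for depth in (1, 5, 10, 20):
    print(depth, sigmoid_grad(0.0) ** depth)
```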

Tanh

The hyperbolic tangent function (tanh) squashes inputs to values between -1 and +1. Its formula is f(x) = (e^x - e^(-x)) / (e^x + e^(-x)). Tanh is zero-centered, meaning its outputs are distributed around zero rather than being strictly positive like sigmoid's outputs.

Zero-centered outputs matter because they make gradient descent more efficient. When all activations are positive (as with sigmoid), the gradients for the weights feeding into any single neuron in the next layer all have the same sign, which constrains the update direction and causes zig-zagging during optimization. Tanh's centered outputs allow those gradients to be both positive and negative, providing more direct optimization paths.
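The sign constraint follows from the chain rule: the gradient for each incoming weight of a neuron is that neuron's error term multiplied by the corresponding incoming activation. A small sketch with made-up activation values and a made-up error term (delta) shows the difference.

```python
def weight_grads(incoming_activations, delta):
    """Gradient for each incoming weight of one neuron: delta * a_i."""
    return [delta * a for a in incoming_activations]

def signs(values):
    return [1 if v > 0 else -1 if v < 0 else 0 for v in values]

sigmoid_like = [0.2, 0.7, 0.9]    # all positive, as sigmoid outputs are
tanh_like = [-0.6, 0.4, 0.8]      # mixed signs, as tanh outputs can be
delta = -1.5                      # hypothetical error term for the neuron

print(signs(weight_grads(sigmoid_like, delta)))  # every sign is the same
print(signs(weight_grads(tanh_like, delta)))     # signs differ
```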

Tanh suffers from the same saturation problem as sigmoid, just with a different output range. For large positive or large negative inputs, the gradient approaches zero and the neuron stops learning. Tanh was often preferred over sigmoid for hidden layers from the late 1990s until ReLU's rise in 2012, but because of saturation it has been largely replaced in feedforward networks. It remains common in RNNs and LSTMs, where its bounded output helps control the magnitude of the hidden state.
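A short sketch using Python's built-in math.tanh illustrates both points: tanh is a rescaled and shifted sigmoid, tanh(x) = 2 * sigmoid(2x) - 1, and its derivative, 1 - tanh(x)^2, collapses toward zero away from the origin.

```python
import math

def sigmoid(x):
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    t = math.tanh(x)
    return 1.0 - t * t

for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    rescaled = 2.0 * sigmoid(2.0 * x) - 1.0   # matches math.tanh(x)
    print(x, math.tanh(x), rescaled, tanh_grad(x))
```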

ReLU (Rectified Linear Unit)

ReLU is one of the most widely used activation functions in modern neural networks. Its definition is almost embarrassingly simple: f(x) = max(0, x). If the input is positive, ReLU passes it through unchanged. If the input is negative, ReLU outputs zero.

This simplicity has enormous practical advantages. ReLU is computationally cheap, requiring only a comparison operation instead of the exponential calculations needed for sigmoid and tanh. Its gradient is either 1 (for positive inputs) or 0 (for negative inputs), which means it does not saturate for positive values and gradients can flow unchanged through many layers. This property was the key that unlocked training of deep networks in the early 2010s.
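As a minimal sketch, ReLU and its gradient fit in two one-line functions. The helper chain_grad is a hypothetical illustration that multiplies the activation gradients through a stack of ReLU layers while ignoring weights. Setting the gradient to 0 at exactly x = 0 is a common convention, assumed here.

```python
def relu(x):
    return x if x > 0 else 0.0

def relu_grad(x):
    """Gradient of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

def chain_grad(x, depth):
    """Product of ReLU gradients through `depth` layers (weights ignored)."""
    grad = 1.0
    for _ in range(depth):
        grad *= relu_grad(x)
        x = relu(x)
    return grad

print(chain_grad(3.0, 20))    # positive path: gradient passes through intact
print(chain_grad(-3.0, 20))   # negative input: gradient is blocked
```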

ReLU's main problem is the "dying ReLU" issue. If a neuron's inputs are consistently negative (which can happen during training if a large gradient update pushes the weights into a region where all inputs produce negative pre-activation values), the neuron outputs zero for every input and its gradient is zero. A dead neuron stops learning permanently because zero gradient means zero weight updates. In a large network, 10-20% of neurons can die during training, wasting capacity.
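The mechanism can be seen on a single neuron. In the sketch below, the weights, bias, and inputs are made-up values chosen so that every pre-activation is negative. The gradient for each weight is then zero on every example, so gradient descent has nothing to update.

```python
def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def neuron_weight_grads(w, b, x, upstream=1.0):
    """Gradient of the loss w.r.t. each weight of one ReLU neuron."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b   # pre-activation
    g = upstream * relu_grad(z)                    # zero whenever z <= 0
    return [g * xi for xi in x]

# Hypothetical data and a neuron pushed into the negative region.
inputs = [[0.2, 0.9], [1.0, 0.1], [0.5, 0.5]]
dead_w, dead_b = [-1.0, -2.0], -0.5

grads = [neuron_weight_grads(dead_w, dead_b, x) for x in inputs]
print(grads)  # every gradient is zero, so the weights never change
```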

Despite this limitation, ReLU remains the default recommendation for most feedforward and convolutional networks because its advantages (computational efficiency, no saturation for positive values, good empirical performance) outweigh its drawbacks. When dying neurons are a concern, variants like Leaky ReLU address the issue.

ReLU Variants

Leaky ReLU allows a small, non-zero gradient for negative inputs: f(x) = x if x > 0, and f(x) = 0.01x if x ≤ 0. This prevents neurons from dying because even neurons with negative inputs maintain a small gradient that allows them to recover during training. The slope for negative values is a hyperparameter that can be tuned, and 0.01 is a common default.

Parametric ReLU (PReLU) makes the negative slope a learnable parameter: f(x) = x if x > 0, and f(x) = ax if x ≤ 0, where a is learned during training. This lets the network learn a negative slope for each channel (or one shared across a layer) instead of fixing it by hand. PReLU was introduced by Kaiming He (of ResNet fame) and colleagues and slightly outperformed ReLU on ImageNet, though the improvement is modest.

ELU (Exponential Linear Unit) uses an exponential function for negative inputs: f(x) = x if x > 0, and f(x) = a(e^x - 1) if x ≤ 0, where a is a positive hyperparameter, commonly 1. ELU produces negative outputs that push the mean activation toward zero (like tanh) while maintaining the non-saturating behavior of ReLU for positive inputs. It is smoother than ReLU at zero, which helps optimization, but the exponential computation is slower.
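The three variants differ only in what they do with non-positive inputs, as the following sketch shows. The default values (slope 0.01, alpha 1.0) are common choices assumed here. In PReLU the slope a would be updated by the optimizer; in this sketch it is simply passed in.

```python
import math

def leaky_relu(x, slope=0.01):
    """Fixed small slope for non-positive inputs."""
    return x if x > 0 else slope * x

def prelu(x, a):
    """Same shape as Leaky ReLU, but `a` is a learned parameter."""
    return x if x > 0 else a * x

def elu(x, alpha=1.0):
    """Exponential curve for non-positive inputs, levelling off at -alpha."""
    return x if x > 0 else alpha * math.expm1(x)   # expm1(x) = e^x - 1

for x in (-3.0, -1.0, 0.0, 2.0):
    print(x, leaky_relu(x), prelu(x, a=0.25), elu(x))
```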

GELU and SiLU: Modern Activations

GELU (Gaussian Error Linear Unit) is the activation function used in BERT, the GPT family, and many other transformer architectures. Its formula is f(x) = x * Φ(x), where Φ(x) = P(X ≤ x) is the cumulative distribution function of the standard normal distribution. It is often computed with the approximation f(x) = 0.5x * (1 + tanh(sqrt(2/pi) * (x + 0.044715x^3))).
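Both forms are easy to compare directly. The sketch below writes the exact version with the error function, using the identity that the standard normal CDF equals 0.5 * (1 + erf(x / sqrt(2))), and places the tanh approximation next to it. The function names are illustrative.

```python
import math

def gelu_exact(x):
    """x times the standard normal CDF, written with erf."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """The tanh-based approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(x, gelu_exact(x), gelu_tanh(x))
```

Across this range the two versions print nearly identical values, which is why the approximation is acceptable where erf is expensive or unavailable.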

GELU smoothly transitions between passing positive inputs through and suppressing negative inputs, rather than making a hard switch at zero like ReLU. This smoothness can be beneficial for optimization, particularly in transformer models where the attention mechanism produces values that span a wide range. GELU has become the default activation for transformers, though the improvement over ReLU is typically small (0.1-0.5% on benchmarks).

SiLU (Sigmoid Linear Unit), also called Swish (specifically Swish with its scaling parameter set to 1), computes f(x) = x * sigmoid(x). Like GELU, it smoothly gates the input based on its value. SiLU has been shown to outperform ReLU on deep networks in several benchmark studies and is used in models like EfficientNet. The choice between GELU and SiLU often depends more on convention within a model family than on measurable performance differences.
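In code, SiLU is a one-line function on top of sigmoid. This is a minimal sketch in plain Python; the overflow-safe sigmoid is an implementation choice rather than part of the definition.

```python
import math

def sigmoid(x):
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def silu(x):
    """SiLU / Swish: the input scaled by its own sigmoid."""
    return x * sigmoid(x)

for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(x, silu(x))
```

Unlike ReLU, the output dips slightly below zero for negative inputs before returning toward zero, and it approaches x for large positive inputs.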

Softmax: The Output Activation

Softmax is unique among activation functions because it operates on a vector rather than a single value. Given a vector of raw scores (logits), softmax converts each score to a probability by exponentiating it and dividing by the sum of all exponentiated scores. The result is a probability distribution where all values are between 0 and 1 and sum to 1.
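The description translates into a few lines of code. Subtracting the largest logit before exponentiating is a standard numerical-stability trick (it does not change the result, because the shift cancels in the ratio) and is an addition to the plain definition above.

```python
import math

def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    m = max(logits)                            # shift to avoid overflow
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```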

Softmax is used almost exclusively in the output layer of classification networks and in the attention mechanism of transformers. It is not used in hidden layers because its normalization property (outputs sum to 1) constrains the representation in ways that limit expressiveness.

Choosing an Activation Function

For most practical purposes, the choice is straightforward. Use ReLU for hidden layers in CNNs and feedforward networks. Use GELU for transformers. Use sigmoid for binary classification outputs. Use softmax for multi-class classification outputs. Use tanh for the hidden-state and candidate updates in RNNs and LSTMs, and sigmoid for the LSTM gates themselves. Use linear (no activation) for regression outputs.

Trying Leaky ReLU or SiLU as drop-in replacements for ReLU is worth a quick experiment, especially if you observe dying neurons. But activation function choice is rarely the bottleneck in a model's performance. Learning rate, architecture, data quality, and regularization almost always have a larger impact than switching between modern activation functions.

Key Takeaway

Activation functions introduce the nonlinearity that allows neural networks to learn complex patterns. ReLU is the standard for most networks because it is fast, avoids saturation for positive inputs, and works well empirically. GELU has become the default for transformers. Sigmoid and softmax are used in output layers for producing probabilities. The choice of activation function matters less than other design decisions, but using the wrong one (sigmoid in deep hidden layers, for instance) can prevent the network from training at all.