Why Nonlinearity?
Without activation functions, a neural network — no matter how many layers — is just a single linear transformation. Stacking linear layers $y = W_2(W_1 x + b_1) + b_2$ simplifies to $y = W'x + b'$ — the depth adds no expressiveness. Activation functions break this linearity, allowing networks to learn curved decision boundaries, thresholds, and complex patterns.
To see why stacking collapses, expand the two-layer expression:

$$y = W_2(W_1 x + b_1) + b_2 = W_2 W_1 x + (W_2 b_1 + b_2) = W'x + b'$$
where $W' = W_2 W_1$ and $b' = W_2 b_1 + b_2$. This collapses to a single layer because matrix multiplication is associative and linear — multiplying two matrices just produces another matrix. No matter how many layers you stack, the result is always some $W'x + b'$, a single affine transformation. You could replace the entire network with one layer and get the same input–output mapping.
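This collapse is easy to verify numerically. The sketch below builds two random linear layers (arbitrary sizes chosen for illustration) and checks that the stacked mapping equals the single collapsed affine map:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with arbitrary illustrative sizes: 4 -> 3 -> 2
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=3)
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=2)

x = rng.normal(size=4)

# Stacked linear layers
y_stacked = W2 @ (W1 @ x + b1) + b2

# Collapsed single affine layer: W' = W2 W1, b' = W2 b1 + b2
W_prime = W2 @ W1
b_prime = W2 @ b1 + b2
y_single = W_prime @ x + b_prime

# The two mappings agree up to floating-point error
assert np.allclose(y_stacked, y_single)
```

The same check succeeds for any depth: composing affine maps always yields another affine map.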
An activation function $\sigma$ between layers breaks this collapse. Consider what happens when we insert one:

$$y = W_2 \, \sigma(W_1 x + b_1) + b_2$$
The function $\sigma(W_1 x + b_1)$ is not linear, so the composition $W_2 \, \sigma(W_1 x + b_1) + b_2$ cannot be reduced to a single matrix multiply. The nonlinearity creates a fundamentally richer function space — each layer can now bend and reshape its input in ways that no single linear layer can replicate. This is what gives deep networks their power: not the depth itself, but the nonlinear transformations between layers.
ReLU: The Default Choice
The Rectified Linear Unit (ReLU) is the simplest widely used activation function:

$$\text{ReLU}(x) = \max(0, x)$$
The behaviour is straightforward:
- When $x > 0$: the output is just $x$ (the identity). The gradient is exactly 1, so the signal passes through with no attenuation. This avoids the vanishing gradient problem that plagued earlier activations.
- When $x \leq 0$: the output is 0. The gradient is also 0 — the neuron is "killed" for this input. This is the nonlinearity: ReLU zeroes out all negative pre-activations, effectively selecting which neurons fire and which don't.
Why this form? ReLU is arguably the simplest possible nonlinearity. It requires just one comparison ($x > 0$?), making it extremely fast to compute. Its gradient is either 0 or 1 — never a small fraction — so positive signals propagate backward without shrinking. These properties made ReLU the activation that finally enabled training of deep networks (Nair & Hinton, 2010; Glorot et al., 2011).
There is a well-known failure mode, however: the dying ReLU problem. If a neuron's weights drift such that its pre-activation $W x + b$ is always negative for every input in the training set, then the neuron's output is permanently 0. Since the gradient is also 0 in this region, the weights never receive an update signal — the neuron is effectively dead. Once dead, it stays dead. This tends to happen more often with high learning rates or poor initialisation, and it can silently reduce a network's effective capacity.
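A minimal illustration of a dead neuron, using hypothetical weights chosen so the pre-activation is negative for the whole batch (the inputs are non-negative, as they would be after a previous ReLU layer):

```python
import numpy as np

# A single neuron whose weights make W x + b negative for every input below
W = np.array([-2.0, -1.0])
b = -5.0

# A batch of non-negative inputs (e.g. outputs of a previous ReLU layer)
X = np.abs(np.random.default_rng(1).normal(size=(100, 2)))

pre_activation = X @ W + b                     # negative for every sample
output = np.maximum(0, pre_activation)         # ReLU
grad_mask = (pre_activation > 0).astype(float)  # gradient of ReLU w.r.t. its input

print(output.max())     # 0.0 — the neuron never fires
print(grad_mask.sum())  # 0.0 — no sample sends any gradient back, so W and b never update
```

Because the gradient mask is zero for every sample, no optimiser step can move the weights back into a regime where the neuron fires.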
The plot below shows ReLU and its gradient. Notice the sharp corner at $x = 0$ and the flat zero region for all negative inputs:
import numpy as np
import json
import js  # Pyodide bridge: hands the plot data to the hosting page

x = np.linspace(-5, 5, 200)
relu = np.maximum(0, x)
relu_grad = (x > 0).astype(float)

plot_data = [{
    "title": "ReLU and its Gradient",
    "x_label": "x",
    "y_label": "f(x)",
    "x_data": x.tolist(),
    "lines": [
        {"label": "ReLU(x)", "data": relu.tolist(), "color": "#3b82f6"},
        {"label": "Gradient", "data": relu_grad.tolist(), "color": "#ef4444"}
    ]
}]
js.window.py_plot_data = json.dumps(plot_data)
Leaky ReLU and GELU
Leaky ReLU is a direct fix for the dying ReLU problem. Instead of outputting exactly 0 for negative inputs, it applies a small slope $\alpha$ (typically 0.01):

$$\text{LeakyReLU}(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{if } x \leq 0 \end{cases}$$
When $x \leq 0$, the gradient is $\alpha$ instead of 0, so neurons on the negative side can still receive gradient updates and recover. The tradeoff is that the output for negative inputs is no longer exactly zero, which means the network can't "ignore" irrelevant features as cleanly as ReLU does. In practice, this tends to be a minor concern — keeping neurons alive is usually worth a slightly noisier negative region.
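A small sketch of Leaky ReLU and its gradient, showing that negative inputs still carry a gradient of $\alpha$ rather than 0:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    # 1 on the positive side, alpha (not 0) on the negative side
    return np.where(x > 0, 1.0, alpha)

x = np.array([-3.0, -0.5, 0.5, 3.0])
print(leaky_relu(x))       # negative inputs are scaled by 0.01, positive pass through
print(leaky_relu_grad(x))  # 0.01 for negative inputs, 1.0 for positive
```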
GELU (Gaussian Error Linear Unit) (Hendrycks & Gimpel, 2016) is the activation used in GPT, BERT, and most modern transformers:

$$\text{GELU}(x) = x \cdot \Phi(x)$$
where $\Phi(x)$ is the cumulative distribution function (CDF) of the standard normal distribution — the probability that a standard normal random variable $Z \sim \mathcal{N}(0, 1)$ is less than $x$. This gives GELU an elegant probabilistic interpretation: it multiplies each input by the probability that the input is "large enough" to keep. The behaviour in different regimes:
- When $x \gg 0$: $\Phi(x) \approx 1$, so $\text{GELU}(x) \approx x$ — it behaves like the identity, just like ReLU.
- When $x \ll 0$: $\Phi(x) \approx 0$, so $\text{GELU}(x) \approx 0$ — it kills negative inputs, just like ReLU.
- Near $x = 0$: there is a smooth, continuous transition. Unlike ReLU's sharp corner at zero, GELU curves gently through the origin. This smooth gradient landscape tends to help optimisation, especially in deep networks where sharp gradient discontinuities can cause instability.
You can think of GELU as a smooth, probabilistic version of ReLU: "include $x$ with probability $\Phi(x)$". When the input is clearly positive, it's almost certainly included. When it's clearly negative, it's almost certainly zeroed out. In the ambiguous region near zero, the function smoothly interpolates.
In practice, evaluating the exact Gaussian CDF is expensive, so implementations use a tanh-based approximation:

$$\text{GELU}(x) \approx 0.5\,x\left(1 + \tanh\!\left[\sqrt{\tfrac{2}{\pi}}\left(x + 0.044715\,x^{3}\right)\right]\right)$$
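How good is the tanh approximation? A quick check against the exact CDF-based form (using Python's `math.erf`, since $\Phi(x) = \tfrac{1}{2}(1 + \mathrm{erf}(x/\sqrt{2}))$) shows the two agree closely over a typical input range:

```python
import math
import numpy as np

def gelu_exact(x):
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2))) is the standard normal CDF
    return np.array([0.5 * v * (1 + math.erf(v / math.sqrt(2))) for v in x])

def gelu_tanh(x):
    # The tanh-based approximation used by most implementations
    return 0.5 * x * (1 + np.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x**3)))

x = np.linspace(-5, 5, 1001)
max_err = np.max(np.abs(gelu_exact(x) - gelu_tanh(x)))
print(max_err)  # small — well under 0.01 across this range
```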
The plot below compares all three activations. Notice how GELU and Leaky ReLU both allow some signal through for negative inputs, unlike the hard cutoff of standard ReLU:
import numpy as np
import json
import js  # Pyodide bridge: hands the plot data to the hosting page
from math import sqrt, pi

x = np.linspace(-3, 3, 300)
relu = np.maximum(0, x)
leaky_relu = np.where(x > 0, x, 0.01 * x)
# GELU tanh approximation
gelu = 0.5 * x * (1 + np.tanh(sqrt(2 / pi) * (x + 0.044715 * x**3)))

plot_data = [{
    "title": "ReLU vs Leaky ReLU vs GELU",
    "x_label": "x",
    "y_label": "f(x)",
    "x_data": x.tolist(),
    "lines": [
        {"label": "ReLU", "data": relu.tolist(), "color": "#3b82f6"},
        {"label": "Leaky ReLU (α=0.01)", "data": leaky_relu.tolist(), "color": "#f59e0b"},
        {"label": "GELU", "data": gelu.tolist(), "color": "#10b981"}
    ]
}]
js.window.py_plot_data = json.dumps(plot_data)
Sigmoid and Tanh
Sigmoid and tanh are the classic activation functions — they dominated neural networks before ReLU and are still important in specific roles today, even if they have been largely replaced by ReLU-family activations in hidden layers.
Sigmoid squashes any real number into the range $(0, 1)$:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
Let's trace through the behaviour:
- When $x \gg 0$: $e^{-x} \to 0$, so the denominator approaches 1, and $\sigma(x) \to 1$.
- When $x \ll 0$: $e^{-x} \to \infty$, so the denominator blows up, and $\sigma(x) \to 0$.
- When $x = 0$: $e^{0} = 1$, so $\sigma(0) = \frac{1}{2}$ — the midpoint of the output range.
The output is always positive and bounded between 0 and 1, which is exactly why sigmoid is used when you need a probability: binary classification output layers, gates in LSTMs and GRUs, and attention mechanisms. However, using sigmoid in hidden layers of deep networks is problematic because of its gradient:

$$\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$$
This gradient reaches its maximum at $x = 0$, where $\sigma'(0) = 0.5 \times 0.5 = 0.25$. For large $|x|$, the gradient vanishes toward zero. This is the vanishing gradient problem: in a network with $n$ sigmoid layers, gradients are multiplied through the chain rule, and since each sigmoid contributes at most a factor of 0.25, the gradient reaching the early layers decays roughly as $0.25^n$. With 10 layers, that's a factor of about $10^{-6}$ — the early layers essentially stop learning.
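The $0.25^n$ decay can be checked directly. This is a toy best-case calculation, assuming every sigmoid layer sits exactly at $x = 0$ and contributes its maximum gradient of 0.25:

```python
# Best case: each sigmoid layer multiplies the backward signal by at most 0.25
for n in [2, 5, 10, 20]:
    print(n, 0.25 ** n)
# 10 layers: 0.25**10 ≈ 9.5e-07 — gradients reaching the first layer are about a millionth
```

In a real network the per-layer factors are usually well below 0.25 (most pre-activations are not exactly 0), so the actual decay is even faster.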
Tanh is closely related to sigmoid but outputs values in $(-1, 1)$:

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = 2\,\sigma(2x) - 1$$
The second form shows that tanh is just a rescaled and shifted sigmoid. The key differences:
- Zero-centered: tanh outputs range from $-1$ to $1$, centred around zero. This is better for hidden layers because zero-centered activations don't systematically bias the gradient direction. With sigmoid (always positive outputs), the gradients on the weights all share the same sign within a layer, forcing the weight updates to zigzag toward the optimum.
- Stronger gradient: the derivative is $\tanh'(x) = 1 - \tanh^2(x)$, which reaches a maximum of 1.0 at $x = 0$ (compared to sigmoid's 0.25). Tanh still suffers from vanishing gradients at large $|x|$, but the problem is considerably less severe: near the origin each layer multiplies the gradient by a factor close to 1 rather than at most 0.25, so the signal does not shrink exponentially with depth.
The plots below show both functions and their gradients side by side. Notice how tanh's gradient peak is four times higher than sigmoid's:
import numpy as np
import json
import js  # Pyodide bridge: hands the plot data to the hosting page

x = np.linspace(-5, 5, 200)
sigmoid = 1 / (1 + np.exp(-x))
sigmoid_grad = sigmoid * (1 - sigmoid)
tanh = np.tanh(x)
tanh_grad = 1 - tanh**2

plot_data = [
    {
        "title": "Sigmoid and Tanh",
        "x_label": "x",
        "y_label": "f(x)",
        "x_data": x.tolist(),
        "lines": [
            {"label": "Sigmoid", "data": sigmoid.tolist(), "color": "#8b5cf6"},
            {"label": "Tanh", "data": tanh.tolist(), "color": "#ec4899"}
        ]
    },
    {
        "title": "Gradients: Sigmoid vs Tanh",
        "x_label": "x",
        "y_label": "f'(x)",
        "x_data": x.tolist(),
        "lines": [
            {"label": "Sigmoid gradient (max 0.25)", "data": sigmoid_grad.tolist(), "color": "#8b5cf6"},
            {"label": "Tanh gradient (max 1.0)", "data": tanh_grad.tolist(), "color": "#ec4899"}
        ]
    }
]
js.window.py_plot_data = json.dumps(plot_data)
Swish / SiLU
Swish (also called SiLU — Sigmoid Linear Unit) was discovered through automated neural architecture search (Ramachandran et al., 2017):

$$\text{Swish}(x) = x \cdot \sigma(x)$$
The structure is similar to GELU — both multiply $x$ by a gating function — but Swish uses the sigmoid $\sigma(x)$ as the gate instead of the Gaussian CDF $\Phi(x)$. The behaviour follows the same pattern:
- When $x \gg 0$: $\sigma(x) \approx 1$, so $\text{Swish}(x) \approx x$ — the identity, like ReLU.
- When $x \ll 0$: $\sigma(x) \approx 0$, so $\text{Swish}(x) \approx 0$ — the input is suppressed, like ReLU.
- Near $x \approx -1.28$: Swish dips slightly below zero (reaching a minimum of about $-0.28$). This small negative region is a distinctive feature — unlike ReLU, which is strictly non-negative, Swish allows a small "undershoot" that appears to help optimisation by providing a richer gradient signal.
Swish is smooth and non-monotonic (because of that small dip below zero). The smooth gradient everywhere avoids the sharp discontinuity of ReLU at the origin, and the non-monotonicity provides a form of implicit regularisation — inputs near zero are treated differently from strongly positive or negative ones.
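The location and depth of Swish's dip can be found numerically — a quick grid-search sketch:

```python
import numpy as np

def swish(x):
    # x * sigmoid(x)
    return x / (1 + np.exp(-x))

x = np.linspace(-5, 5, 100001)
y = swish(x)
i = np.argmin(y)
print(x[i], y[i])  # minimum near x ≈ -1.278, value ≈ -0.278
```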
When to Use What
With so many activations available, here is a practical guide for choosing one:
- ReLU: the default for most hidden layers. Simple, fast, and it works well. Use it unless you have a specific reason not to.
- GELU: the standard in transformers (GPT, BERT, LLaMA). Its smooth gradients help with optimisation in very deep models, and it has become the de facto choice for attention-based architectures.
- Swish / SiLU: common in vision models (EfficientNet) and some LLM architectures (LLaMA uses SiLU in its feed-forward network). Very similar to GELU in practice.
- Sigmoid: output layer for binary classification (it produces a probability in $(0, 1)$). Also used as gates in recurrent architectures (LSTM, GRU) and in attention mechanisms.
- Tanh: output layer when you need values in $(-1, 1)$. Used in some normalisation schemes and older RNN architectures. Preferred over sigmoid for hidden layers when a saturating activation is needed, thanks to its zero-centered output.
- Leaky ReLU: useful when dying ReLU is a problem — typically in smaller networks, with aggressive learning rates, or with inputs that are predominantly negative.
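For reference, every activation discussed above fits in a few lines of NumPy. This is a sketch for experimentation; real frameworks ship optimised, numerically careful versions:

```python
import math
import numpy as np

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def swish(x):
    return x * sigmoid(x)

def gelu(x):
    # tanh approximation of x * Phi(x)
    return 0.5 * x * (1 + np.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x**3)))

x = np.array([-2.0, 0.0, 2.0])
print(relu(x))     # negative inputs zeroed
print(sigmoid(x))  # values in (0, 1), 0.5 at x = 0
print(gelu(x))     # ≈ 0 for x = -2, exactly 0 at x = 0, ≈ 2 for x = 2
```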
Quiz
Test your understanding of activation functions and their role in neural networks.
Why do neural networks need activation functions?
What is the 'dying ReLU' problem?
Why is sigmoid problematic for hidden layers in deep networks?
What makes GELU suitable for transformers?