Why Nonlinearity?
Without activation functions, a neural network — no matter how many layers — is just a single linear transformation. Stacking linear layers $y = W_2(W_1 x + b_1) + b_2$ simplifies to $y = W'x + b'$ — the depth adds no expressiveness. Activation functions break this linearity, allowing networks to learn curved decision boundaries, thresholds, and complex patterns.
To see why stacking collapses, expand the two-layer expression:

$$
y = W_2(W_1 x + b_1) + b_2 = W_2 W_1 x + (W_2 b_1 + b_2) = W'x + b'
$$
where $W' = W_2 W_1$ and $b' = W_2 b_1 + b_2$. This collapses to a single layer because matrix multiplication is associative and linear — multiplying two matrices just produces another matrix. No matter how many layers you stack, the result is always some $W'x + b'$, a single affine transformation. You could replace the entire network with one layer and get the same input–output mapping.
An activation function $\sigma$ between layers breaks this collapse. Consider what happens when we insert one:

$$
y = W_2 \, \sigma(W_1 x + b_1) + b_2
$$
The function $\sigma(W_1 x + b_1)$ is not linear, so the composition $W_2 \, \sigma(W_1 x + b_1) + b_2$ cannot be reduced to a single matrix multiply. The nonlinearity creates a fundamentally richer function space — each layer can now bend and reshape its input in ways that no single linear layer can replicate. This is what gives deep networks their power: not the depth itself, but the nonlinear transformations between layers.
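The collapse is easy to verify numerically. The sketch below (with arbitrary random weights, using NumPy) checks that two stacked linear layers equal a single affine map, and that inserting a ReLU breaks the equivalence:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Two stacked linear layers...
two_layer = W2 @ (W1 @ x + b1) + b2

# ...collapse to a single affine map W'x + b'
W_prime = W2 @ W1
b_prime = W2 @ b1 + b2
one_layer = W_prime @ x + b_prime
assert np.allclose(two_layer, one_layer)  # identical mappings

# Inserting a nonlinearity between the layers breaks the equivalence
# (almost surely, for random weights with some negative pre-activations)
nonlinear = W2 @ np.maximum(0, W1 @ x + b1) + b2
```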
ReLU: The Default Choice
The Rectified Linear Unit (ReLU) is the simplest widely used activation function:

$$
\text{ReLU}(x) = \max(0, x)
$$
The behaviour is straightforward:
- When $x > 0$: the output is just $x$ (the identity). The gradient is exactly 1, so the signal passes through with no attenuation. This avoids the vanishing gradient problem that plagued earlier activations.
- When $x \leq 0$: the output is 0. The gradient is also 0 — the neuron is "killed" for this input. This is the nonlinearity: ReLU zeroes out all negative pre-activations, effectively selecting which neurons fire and which don't.
Why this form? ReLU is arguably the simplest possible nonlinearity. It requires just one comparison ($x > 0$?), making it extremely fast to compute. Its gradient is either 0 or 1 — never a small fraction — so positive signals propagate backward without shrinking. These properties made ReLU the activation that finally enabled training of deep networks (Nair & Hinton, 2010; Glorot et al., 2011).
There is a well-known failure mode, however: the dying ReLU problem. If a neuron's weights drift such that its pre-activation $W x + b$ is always negative for every input in the training set, then the neuron's output is permanently 0. Since the gradient is also 0 in this region, the weights never receive an update signal — the neuron is effectively dead. Once dead, it stays dead. This tends to happen more often with high learning rates or poor initialisation, and it can silently reduce a network's effective capacity.
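A minimal sketch of a dead neuron (the bias of $-50$ is a contrived value chosen to force every pre-activation negative): since $\text{ReLU}'(z) = 0$ wherever $z \leq 0$, the chain rule zeroes out the weight gradient entirely.

```python
import numpy as np

# Toy setup: one ReLU neuron with pre-activation z = X @ w + b
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))   # a batch of 100 inputs
w = rng.normal(size=3)
b = -50.0                       # bias pushed far negative: z < 0 for every input

z = X @ w + b
a = np.maximum(0, z)            # ReLU output: all zeros

# Backprop: dL/dw = (dL/da) * ReLU'(z) * x, and ReLU'(z) = 0 wherever z <= 0,
# so the weight gradient is exactly zero -- no update signal, ever.
upstream = rng.normal(size=100)  # arbitrary dL/da from the layer above
grad_w = ((upstream * (z > 0).astype(float))[:, None] * X).sum(axis=0)

print(a.sum(), grad_w)  # both are exactly zero: the neuron is dead
```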
The plot below shows ReLU and its gradient together. Notice the sharp corner at $x = 0$ where ReLU switches from flat zero to the identity line. The gradient is discontinuous there — it's exactly 0 for all negative inputs and exactly 1 for all positive inputs, with a gap at $x = 0$ where it's undefined:
```python
import math, json, js

x = [i * 0.05 for i in range(-100, 101)]
relu = [max(0, xi) for xi in x]

# Gradient is a step function: 0 for x<0, 1 for x>0, undefined at x=0.
# Use None at x=0 to create a gap (discontinuity) in the line.
relu_grad = []
for xi in x:
    if xi < 0:
        relu_grad.append(0.0)
    elif xi > 0:
        relu_grad.append(1.0)
    else:
        relu_grad.append(None)  # gap at x=0

plot_data = [{
    "title": "ReLU and its Gradient",
    "x_label": "x",
    "y_label": "f(x)",
    "x_data": x,
    "lines": [
        {"label": "ReLU(x)", "data": relu, "color": "#3b82f6"},
        {"label": "Gradient", "data": relu_grad, "color": "#ef4444"}
    ]
}]
js.window.py_plot_data = json.dumps(plot_data)
```
Leaky ReLU and GELU
Leaky ReLU is a direct fix for the dying ReLU problem. Instead of outputting exactly 0 for negative inputs, it applies a small slope $\alpha$ (typically 0.01):

$$
\text{LeakyReLU}(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{if } x \leq 0 \end{cases}
$$
When $x \leq 0$, the gradient is $\alpha$ instead of 0, so neurons on the negative side can still receive gradient updates and recover. The tradeoff is that the output for negative inputs is no longer exactly zero, which means the network can't "ignore" irrelevant features as cleanly as ReLU does. In practice, this tends to be a minor concern — keeping neurons alive is usually worth a slightly noisier negative region.
GELU (Gaussian Error Linear Unit) (Hendrycks & Gimpel, 2016) is the activation used in GPT-2, BERT, and Vision Transformers (ViT). More recent LLMs like LLaMA and Mistral have switched to SwiGLU (a gated variant of SiLU/Swish, covered below), but GELU remains widespread:

$$
\text{GELU}(x) = x \cdot \Phi(x)
$$
where $\Phi(x)$ is the cumulative distribution function (CDF) of the standard normal distribution — the probability that a standard normal random variable $Z \sim \mathcal{N}(0, 1)$ is less than $x$. This gives GELU an elegant probabilistic interpretation: it multiplies each input by the probability that the input is "large enough" to keep. The behaviour in different regimes:
- When $x \gg 0$: $\Phi(x) \approx 1$, so $\text{GELU}(x) \approx x$ — it behaves like the identity, just like ReLU.
- When $x \ll 0$: $\Phi(x) \approx 0$, so $\text{GELU}(x) \approx 0$ — it kills negative inputs, just like ReLU.
- Near $x = 0$: there is a smooth, continuous transition. Unlike ReLU's sharp corner at zero, GELU curves gently through the origin. This smooth gradient landscape tends to help optimisation, especially in deep networks where sharp gradient discontinuities can cause instability.
You can think of GELU as a smooth, probabilistic version of ReLU: "include $x$ with probability $\Phi(x)$". When the input is clearly positive, it's almost certainly included. When it's clearly negative, it's almost certainly zeroed out. In the ambiguous region near zero, the function smoothly interpolates.
In practice, evaluating the exact Gaussian CDF is expensive, so implementations use a tanh-based approximation:

$$
\text{GELU}(x) \approx 0.5\, x \left(1 + \tanh\!\left[\sqrt{2/\pi}\left(x + 0.044715\, x^3\right)\right]\right)
$$
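How good is the tanh approximation? The exact CDF can be written with the error function as $\Phi(x) = \tfrac{1}{2}(1 + \text{erf}(x/\sqrt{2}))$, so we can compare the two directly (a quick check, not a formal error bound):

```python
import math

def gelu_exact(x):
    # GELU(x) = x * Phi(x), with Phi the standard normal CDF via erf
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # The tanh-based approximation used by most implementations
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

# Maximum absolute error over a fine grid on [-5, 5]
xs = [i * 0.01 for i in range(-500, 501)]
max_err = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)
print(f"max |exact - approx| on [-5, 5]: {max_err:.2e}")  # well under 1e-2
```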
The plot below compares all three activations. Notice how GELU and Leaky ReLU both allow some signal through for negative inputs, unlike the hard cutoff of standard ReLU:
```python
import numpy as np
import json
import js
from math import sqrt, pi

x = np.linspace(-3, 3, 300)
relu = np.maximum(0, x)
leaky_relu = np.where(x > 0, x, 0.01 * x)
# GELU tanh approximation
gelu = 0.5 * x * (1 + np.tanh(sqrt(2 / pi) * (x + 0.044715 * x**3)))

plot_data = [{
    "title": "ReLU vs Leaky ReLU vs GELU",
    "x_label": "x",
    "y_label": "f(x)",
    "x_data": x.tolist(),
    "lines": [
        {"label": "ReLU", "data": relu.tolist(), "color": "#3b82f6"},
        {"label": "Leaky ReLU (α=0.01)", "data": leaky_relu.tolist(), "color": "#f59e0b"},
        {"label": "GELU", "data": gelu.tolist(), "color": "#10b981"}
    ]
}]
js.window.py_plot_data = json.dumps(plot_data)
```
Sigmoid and Tanh
Sigmoid and tanh are the classic activation functions — they dominated neural networks before ReLU and are still important in specific roles today, even if they have been largely replaced by ReLU-family activations in hidden layers.
Sigmoid squashes any real number into the range $(0, 1)$:

$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$
What happens as we push different inputs through these functions?
- When $x \gg 0$: $e^{-x} \to 0$, so the denominator approaches 1, and $\sigma(x) \to 1$.
- When $x \ll 0$: $e^{-x} \to \infty$, so the denominator blows up, and $\sigma(x) \to 0$.
- When $x = 0$: $e^{0} = 1$, so $\sigma(0) = \frac{1}{2}$ — the midpoint of the output range.
The output is always positive and bounded between 0 and 1, which is exactly why sigmoid is used when you need a probability: binary classification output layers, gates in LSTMs and GRUs, and attention mechanisms. However, using sigmoid in hidden layers of deep networks is problematic because of its gradient:

$$
\sigma'(x) = \sigma(x)\,(1 - \sigma(x))
$$
This gradient reaches its maximum at $x = 0$, where $\sigma'(0) = 0.5 \times 0.5 = 0.25$. For large $|x|$, the gradient vanishes toward zero. This is the vanishing gradient problem: in a network with $n$ sigmoid layers, gradients are multiplied through the chain rule, and since each sigmoid contributes at most a factor of 0.25, the gradient reaching the early layers decays roughly as $0.25^n$. With 10 layers, that's a factor of about $10^{-6}$ — the early layers essentially stop learning.
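The decay is worth seeing in numbers. Taking the best-case per-layer factor (the gradient's peak at $x = 0$) and raising it to the depth:

```python
# Best-case per-layer gradient factor through a saturating activation,
# accumulated across depth via the chain rule
sigmoid_peak = 0.25   # max of sigma'(x), attained at x = 0
tanh_peak = 1.0       # max of tanh'(x), attained at x = 0

for n in (2, 5, 10, 20):
    print(f"{n:2d} layers: sigmoid factor <= {sigmoid_peak ** n:.2e}, "
          f"tanh factor <= {tanh_peak ** n:.1f}")
# At 10 sigmoid layers the upper bound is already ~9.54e-07
```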
Tanh is closely related to sigmoid but outputs values in $(-1, 1)$:

$$
\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = 2\,\sigma(2x) - 1
$$
The second form shows that tanh is just a rescaled and shifted sigmoid. The key differences:
- Zero-centred: tanh outputs range from $-1$ to $1$, centred around zero. This is better for hidden layers because zero-centred activations don't systematically bias the gradient direction. With sigmoid (whose outputs are always positive), the gradients on a layer's incoming weights all share the same sign for a given example, which forces the optimisation path to zigzag.
- Stronger gradient: the derivative is $\tanh'(x) = 1 - \tanh^2(x)$, which reaches a maximum of 1.0 at $x = 0$ (compared to sigmoid's 0.25). Tanh still suffers from vanishing gradients at large $|x|$, but the problem is considerably less severe: near the origin each layer contributes a factor close to 1, rather than at most 0.25, so the gradient product across layers shrinks far more slowly.
The first plot shows sigmoid and tanh side by side. The second shows their gradients — notice how tanh's gradient peaks at 1.0 while sigmoid's peaks at just 0.25, four times smaller:
```python
import math, json, js

x = [i * 0.05 for i in range(-100, 101)]
sigmoid = [1 / (1 + math.exp(-xi)) for xi in x]
tanh_vals = [math.tanh(xi) for xi in x]

plot_data = [{
    "title": "Sigmoid vs Tanh",
    "x_label": "x",
    "y_label": "f(x)",
    "x_data": x,
    "lines": [
        {"label": "Sigmoid", "data": sigmoid, "color": "#8b5cf6"},
        {"label": "Tanh", "data": tanh_vals, "color": "#ec4899"}
    ]
}]
js.window.py_plot_data = json.dumps(plot_data)
```
```python
import math, json, js

x = [i * 0.05 for i in range(-100, 101)]
sigmoid = [1 / (1 + math.exp(-xi)) for xi in x]
sigmoid_grad = [s * (1 - s) for s in sigmoid]
tanh_vals = [math.tanh(xi) for xi in x]
tanh_grad = [1 - t**2 for t in tanh_vals]

plot_data = [{
    "title": "Gradients: Sigmoid vs Tanh",
    "x_label": "x",
    "y_label": "f'(x)",
    "x_data": x,
    "lines": [
        {"label": "Sigmoid gradient (max 0.25)", "data": sigmoid_grad, "color": "#8b5cf6"},
        {"label": "Tanh gradient (max 1.0)", "data": tanh_grad, "color": "#ec4899"}
    ]
}]
js.window.py_plot_data = json.dumps(plot_data)
```
Swish / SiLU
Swish (also called SiLU — Sigmoid Linear Unit) was discovered through automated neural architecture search (Ramachandran et al., 2017):

$$
\text{Swish}(x) = x \cdot \sigma(x)
$$
The structure is similar to GELU — both multiply $x$ by a gating function — but Swish uses the sigmoid $\sigma(x)$ as the gate instead of the Gaussian CDF $\Phi(x)$. The behaviour follows the same pattern:
- When $x \gg 0$: $\sigma(x) \approx 1$, so $\text{Swish}(x) \approx x$ — the identity, like ReLU.
- When $x \ll 0$: $\sigma(x) \approx 0$, so $\text{Swish}(x) \approx 0$ — the input is suppressed, like ReLU.
- Near $x \approx -1.28$: Swish dips slightly below zero (reaching a minimum of about $-0.28$). This small negative region is a distinctive feature — unlike ReLU, which is strictly non-negative, Swish allows a small "undershoot" that appears to help optimisation by providing a richer gradient signal.
Swish is smooth and non-monotonic (because of that small dip below zero). The smooth gradient everywhere avoids the sharp discontinuity of ReLU at the origin, and the non-monotonicity provides a form of implicit regularisation — inputs near zero are treated differently from strongly positive or negative ones.
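The dip below zero is easy to locate numerically. A simple grid scan over the negative axis (a quick check, not a precise root-finding method) recovers the minimum quoted above:

```python
import math

def swish(x):
    # Swish(x) = x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

# Scan [-4, 0] on a fine grid to find where the undershoot bottoms out
xs = [i * 0.001 for i in range(-4000, 1)]
x_min = min(xs, key=swish)
print(f"minimum of about {swish(x_min):.3f} at x ≈ {x_min:.2f}")
# → minimum of about -0.278 at x ≈ -1.28
```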
When to Use What
With so many activations available, here is a practical guide for choosing one:
- ReLU: the default for most hidden layers. Simple, fast, and it works well. Use it unless you have a specific reason not to.
- GELU: the default in the original transformer era (GPT-2, BERT, ViT). Smooth gradients help with optimisation in very deep models.
- Swish / SiLU: the dominant activation in modern LLMs. LLaMA, Mistral, and Qwen use SiLU inside a gated FFN (SwiGLU). Also used in vision models like EfficientNet. Very similar to GELU in shape.
- Sigmoid: output layer for binary classification (it produces a probability in $(0, 1)$). Also used as gates in recurrent architectures (LSTM, GRU) and in attention mechanisms.
- Tanh: output layer when you need values in $(-1, 1)$. Used in some normalisation schemes and older RNN architectures. Preferred over sigmoid for hidden layers when a saturating activation is needed, thanks to its zero-centered output.
- Leaky ReLU: useful when dying ReLU is a problem — typically in smaller networks, with aggressive learning rates, or with inputs that are predominantly negative.
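The SwiGLU feed-forward block mentioned above replaces the FFN's single activation with a gated product. A minimal NumPy sketch of the idea (weight names and dimensions here are illustrative, not taken from any particular model):

```python
import numpy as np

def silu(x):
    # SiLU / Swish: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, W_gate, W_up, W_down):
    # Gated FFN in the style of LLaMA-family models:
    # down( silu(gate(x)) * up(x) ) -- the SiLU path gates the "up" path elementwise
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32                 # illustrative sizes
W_gate = rng.normal(size=(d_model, d_ff))
W_up = rng.normal(size=(d_model, d_ff))
W_down = rng.normal(size=(d_ff, d_model))

x = rng.normal(size=(4, d_model))     # a batch of 4 token vectors
y = swiglu_ffn(x, W_gate, W_up, W_down)
print(y.shape)                        # same shape as the input: (4, 8)
```

Note that the gate and up projections are two separate weight matrices applied to the same input; the nonlinearity acts only on the gate path.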
Quiz
Test your understanding of activation functions and their role in neural networks.
Why do neural networks need activation functions?
What is the 'dying ReLU' problem?
Why is sigmoid problematic for hidden layers in deep networks?
What makes GELU suitable for transformers?