Information and Surprise
Information theory starts with a deceptively simple question: how surprised are you when an event occurs? If it rains in London, you're barely surprised — it rains a lot. If it rains in the Sahara, you're very surprised — it almost never rains there. The more unlikely an event, the more "information" it carries when it happens. This isn't a vague metaphor — Claude Shannon formalised it into a precise mathematical quantity in 1948, and it became the foundation of everything from compression algorithms to the loss functions we use to train neural networks.
Shannon defined the information content (also called self-information or surprisal) of an event $x$ as:

$$I(x) = -\log_2 P(x)$$
Let's break this apart piece by piece.
$P(x)$ is the probability of event $x$ occurring. It must lie in the interval $(0, 1]$ — the event must be possible (probability greater than zero) and can be at most certain (probability equal to one). This is the only input to the information function: the probability of the thing that happened.
$-\log_2$ is the negative logarithm base 2, and the result is measured in "bits." Why negative? Because the logarithm of any number in $(0, 1]$ is non-positive ($\log_2(0.5) = -1$, $\log_2(1) = 0$), so the negation ensures information is always non-negative. Why base 2? Because it connects to binary encoding — one bit is the information gained from a single fair coin flip. You could use natural log (measuring in "nats") or $\log_{10}$ (measuring in "hartleys"), but bits are the most common unit in information theory.
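The unit conversion is easy to check numerically. This small sketch (plain NumPy, nothing page-specific) expresses the same surprisal in bits and in nats; the two differ only by a factor of $\ln 2$:

```python
import numpy as np

p = 0.25

# Information in bits (log base 2) and in nats (natural log)
info_bits = -np.log2(p)   # exactly 2 bits: two fair coin flips' worth
info_nats = -np.log(p)    # the same quantity in nats

# One nat equals 1/ln(2) ≈ 1.4427 bits
print(f"{info_bits:.4f} bits = {info_nats:.4f} nats")
print(f"nats converted to bits: {info_nats / np.log(2):.4f}")
```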
Now let's trace the boundary and extreme behaviour:
- When $P(x) = 1$ (certain event): $I(x) = -\log_2(1) = 0$ bits. No surprise at all — you already knew it would happen. No information gained.
- When $P(x) = 0.5$ (coin flip): $I(x) = -\log_2(0.5) = 1$ bit. Exactly one bit of information — the fundamental unit. Learning the outcome of a fair coin flip gives you precisely one bit.
- When $P(x) = 0.01$ (rare event): $I(x) = -\log_2(0.01) \approx 6.64$ bits. Very informative — a rare event carries a lot of surprise when it actually occurs.
- As $P(x) \to 0$ (increasingly impossible): $I(x) \to \infty$. The information content grows without bound. An event with probability one in a trillion carries about 40 bits of information.
The logarithmic scale is essential. If you learn two independent facts (like flipping two independent coins), the total information is the sum: $I(x_1) + I(x_2) = -\log P(x_1) - \log P(x_2) = -\log(P(x_1) P(x_2))$. The log turns multiplication of probabilities into addition of information — a deeply useful property.
import numpy as np
import json
import js

probs = [1.0, 0.5, 0.25, 0.1, 0.01, 0.001]
rows = []
for p in probs:
    info = -np.log2(p) + 0.0  # + 0.0 turns the -0.0 produced at p = 1.0 into 0.0
    rows.append([str(p), f"{info:.2f} bits"])

js.window.py_table_data = json.dumps({
    "headers": ["P(x)", "Information I(x)"],
    "rows": rows
})

print("A fair coin flip carries exactly 1 bit of information.")
print(f"A fair 6-sided die roll carries log₂(6) ≈ {np.log2(6):.2f} bits.")
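The additivity property from the logarithm discussion is also easy to verify numerically. A minimal check using two independent fair coin flips:

```python
import numpy as np

p1, p2 = 0.5, 0.5    # two independent fair coin flips
joint = p1 * p2      # probability of one specific pair of outcomes: 0.25

# Information of the joint event equals the sum of the individual informations
info_sum = -np.log2(p1) - np.log2(p2)
info_joint = -np.log2(joint)
print(info_sum, info_joint)  # both equal 2.0 bits
```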
Entropy: Average Surprise
Information tells us the surprise of a single event. But a probability distribution describes many possible events, each with its own probability. How much surprise should we expect on average? That's entropy — the expected information across all possible events:

$$H(P) = -\sum_x P(x) \log_2 P(x)$$
Every piece of this formula has a specific role:
$P(x)$ is the probability of event $x$ under distribution $P$. This tells us how often each event occurs.
$-\log_2 P(x)$ is the information (surprise) of event $x$ — the quantity we defined in the previous section. Rare events have high surprise, common events have low surprise.
$P(x) \times [-\log_2 P(x)]$ is the surprise of event $x$, weighted by how often it occurs. A very surprising event that almost never happens contributes little to the average. A moderately surprising event that happens frequently contributes a lot. This weighting is what makes entropy a meaningful average — it accounts for both how surprising each event is and how often you actually encounter it.
$\sum_x$ sums over all possible events in the distribution. The result is the average number of bits needed to encode a randomly drawn event from $P$.
Entropy measures the average uncertainty in a distribution. High entropy means high uncertainty — the distribution is spread out across many outcomes and you can't predict what will happen next. Low entropy means low uncertainty — the distribution is concentrated on a few outcomes and you can predict with confidence.
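To make the weighting concrete, here is a small sketch that computes each event's contribution $P(x) \times [-\log_2 P(x)]$ for a three-outcome distribution (the numbers are illustrative):

```python
import numpy as np

# A non-uniform distribution over three outcomes
P = np.array([0.5, 0.25, 0.25])

surprise = -np.log2(P)   # information of each event: 1, 2, 2 bits
contrib = P * surprise   # probability-weighted contributions
H = contrib.sum()        # entropy: the sum of the contributions

print(contrib)  # each event happens to contribute 0.5 bits here
print(H)        # 1.5 bits in total
```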
Boundary behaviour:
- Maximum entropy: the uniform distribution. When all $K$ events are equally likely, $P(x) = 1/K$ for all $x$, and $H = \log_2 K$ bits. A fair coin ($K = 2$) has $H = 1$ bit. A fair 6-sided die ($K = 6$) has $H = \log_2(6) \approx 2.58$ bits. The uniform distribution maximises entropy because every outcome is equally uncertain — you have no information that helps you predict which event will occur.
- Minimum entropy: $H = 0$ bits. This happens when all probability is concentrated on a single event ($P(x) = 1$ for one event, $P(x) = 0$ for all others). There is no uncertainty at all — you always know exactly what will happen. Zero entropy means zero surprise on average.
An important convention: we define $0 \log_2 0 = 0$, because $\lim_{p \to 0^+} p \log_2 p = 0$. Events that never happen contribute nothing to entropy. This makes the formula well-defined even when some probabilities are zero.
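The convention can be checked numerically: the product $p \log_2 p$ shrinks to zero as $p$ does, even though $\log_2 p$ diverges.

```python
import numpy as np

# p * log2(p) tends to 0 as p -> 0+, justifying the 0 log 0 = 0 convention
for p in [0.1, 0.01, 0.001, 1e-6, 1e-12]:
    print(f"p = {p:>8g}: p * log2(p) = {p * np.log2(p):.10f}")
```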
The following plot shows the entropy of a binary distribution (like a biased coin) as the probability $p$ of heads varies from 0 to 1:
import numpy as np
import json
import js

p = np.linspace(0.001, 0.999, 200)
entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

plot_data = [{
    "title": "Binary Entropy H(p)",
    "x_label": "p (probability of heads)",
    "y_label": "Entropy (bits)",
    "x_data": p.tolist(),
    "lines": [
        {"label": "H(p) = -p log₂(p) - (1-p) log₂(1-p)", "data": entropy.tolist(), "color": "#8b5cf6"}
    ]
}]
js.window.py_plot_data = json.dumps(plot_data)
The curve is symmetric and peaks at $p = 0.5$ (maximum uncertainty — a fair coin). It drops to 0 at both extremes ($p = 0$ and $p = 1$, where the outcome is certain). This shape — highest when you know least, lowest when you know most — is the signature of entropy.
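A few spot checks of those curve properties, using a plain helper function separate from the plotting code:

```python
import numpy as np

def binary_entropy(p):
    """H(p) for a biased coin; assumes 0 < p < 1."""
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

print(binary_entropy(0.5))                       # 1.0 bit: the peak, a fair coin
print(binary_entropy(0.3), binary_entropy(0.7))  # equal: the curve is symmetric
print(binary_entropy(0.99))                      # near 0: almost-certain outcome
```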
Cross-Entropy: Measuring Distribution Mismatch
Entropy measures the average surprise when you draw from a distribution and you know the true distribution. But what if you don't know the true distribution? What if reality follows one distribution $P$, but you're making predictions based on a different distribution $Q$? The average surprise you'll experience is the cross-entropy:

$$H(P, Q) = -\sum_x P(x) \log_2 Q(x)$$
Let's dissect each component:
$P(x)$ is the true distribution — what actually happens. Events are drawn from $P$. In machine learning, this is the ground truth: the real data distribution or the one-hot encoded labels.
$Q(x)$ is the predicted distribution — what the model thinks will happen. This is the model's output, typically the softmax probabilities over classes.
$-\log Q(x)$ is how surprised the model $Q$ is when event $x$ occurs. If $Q$ assigns high probability to $x$, the surprise is low. If $Q$ assigns low probability to $x$, the surprise is high. This is the information content under the model's belief.
The sum weights each surprise by how often $x$ actually occurs (under $P$). Cross-entropy answers: "if reality follows $P$ but I'm using $Q$ to make predictions, how surprised will I be on average?"
Key property: $H(P, Q) \geq H(P)$, always. The cross-entropy is at least as large as the true entropy. This makes intuitive sense: using the wrong distribution can only increase your average surprise, never decrease it. Equality holds only when $Q = P$ — the best prediction is the true distribution itself.
The difference $H(P, Q) - H(P)$ is the KL divergence (next section) — the "extra surprise" from using $Q$ instead of $P$. This decomposition is fundamental: cross-entropy = entropy + KL divergence.
This is exactly the cross-entropy loss from article 3! In classification, $P$ is the one-hot true label and $Q$ is the softmax output. Since $P$ is one-hot (only one $P(x) = 1$, rest are 0), the sum $-\sum_x P(x) \log Q(x)$ collapses to $-\log Q(c)$ where $c$ is the true class. All the zero terms vanish. The cross-entropy loss we've been using all along is a special case of this general formula.
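Here is a minimal illustration of that collapse, using a hypothetical three-class example (the numbers are made up):

```python
import numpy as np

# One-hot true label (class 1 is correct) and a softmax-style model output
P = np.array([0.0, 1.0, 0.0])
Q = np.array([0.2, 0.7, 0.1])

# General cross-entropy: the P(x) = 0 terms are dropped by the mask
mask = P > 0
H_PQ = -np.sum(P[mask] * np.log2(Q[mask]))

# Collapsed form: just -log Q at the true class
collapsed = -np.log2(Q[1])
print(H_PQ, collapsed)  # identical values
```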
KL Divergence: The Extra Cost of Being Wrong
KL divergence (Kullback-Leibler divergence) measures the extra bits of surprise incurred by using distribution $Q$ when the true distribution is $P$. It has two equivalent forms:

$$D_{\text{KL}}(P \| Q) = \sum_x P(x) \log_2 \frac{P(x)}{Q(x)} = H(P, Q) - H(P)$$
Let's unpack every component of this formula:
$P(x) \log \frac{P(x)}{Q(x)}$ : for each event $x$, how much more surprising is $Q$'s prediction than reality? The ratio $P(x)/Q(x)$ measures the discrepancy between the two distributions at event $x$. If $Q(x) = P(x)$, the ratio is 1 and $\log 1 = 0$ — no extra surprise. If $Q(x) < P(x)$ (the model underestimates the probability), the ratio exceeds 1 and the log is positive — extra surprise. If $Q(x) > P(x)$ (the model overestimates), the ratio is less than 1 and the log is negative — but this is weighted by $P(x)$, so overestimated rare events contribute little.
$H(P, Q) - H(P)$ : the equivalent form. Cross-entropy minus entropy equals the extra bits of surprise caused by using $Q$ instead of $P$. This decomposition is elegant: $H(P)$ is the irreducible minimum surprise (the entropy of reality), and $D_{\text{KL}}(P \| Q)$ is the additional penalty for having the wrong model.
$D_{\text{KL}} \geq 0$ always (this is known as Gibbs' inequality). The KL divergence is zero if and only if $P = Q$ everywhere. You cannot do better than the true distribution — any deviation adds extra surprise.
NOT symmetric: $D_{\text{KL}}(P \| Q) \neq D_{\text{KL}}(Q \| P)$ in general. KL divergence is not a true distance metric. This asymmetry is not a mathematical inconvenience — it encodes something meaningful.
Why "not symmetric" matters: $D_{\text{KL}}(P \| Q)$ penalises cases where $P(x) > 0$ but $Q(x) \approx 0$ — the model assigns near-zero probability to something that actually happens. This is catastrophic: the model is saying "this is essentially impossible" about something that occurs in reality, producing a $-\log Q(x) \to \infty$ surprise. But it doesn't penalise $Q(x) > 0$ when $P(x) = 0$ — the model wastes probability on impossible events, which is wasteful but not catastrophic (those terms vanish because they're multiplied by $P(x) = 0$). This asymmetry is why the direction matters: in training, we typically minimise $D_{\text{KL}}(P_{\text{true}} \| Q_{\text{model}})$, which forces the model to cover all events that actually occur.
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) in bits."""
    # Filter out zero entries in P (0 * log(0/q) = 0)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

def cross_entropy(p, q):
    mask = p > 0
    return -np.sum(p[mask] * np.log2(q[mask]))

def entropy(p):
    mask = p > 0
    return -np.sum(p[mask] * np.log2(p[mask]))

# True distribution: 70% cat, 20% dog, 10% bird
P = np.array([0.7, 0.2, 0.1])
labels = ["cat", "dog", "bird"]

# Good model prediction
Q_good = np.array([0.6, 0.3, 0.1])
# Bad model prediction (thinks bird is most likely)
Q_bad = np.array([0.1, 0.1, 0.8])

print(f"True distribution P: {dict(zip(labels, P))}")
print()

for name, Q in [("Good model", Q_good), ("Bad model", Q_bad)]:
    H_P = entropy(P)
    H_PQ = cross_entropy(P, Q)
    KL = kl_divergence(P, Q)
    print(f"{name} Q: {dict(zip(labels, Q))}")
    print(f"  Entropy H(P):         {H_P:.4f} bits (irreducible uncertainty)")
    print(f"  Cross-entropy H(P,Q): {H_PQ:.4f} bits (total surprise)")
    print(f"  KL divergence:        {KL:.4f} bits (extra surprise from using Q)")
    print(f"  Check: H(P,Q) - H(P) = {H_PQ - H_P:.4f} = KL ✓")
    print()

# Show asymmetry
print("Asymmetry of KL divergence:")
print(f"  D_KL(P || Q_good) = {kl_divergence(P, Q_good):.4f}")
print(f"  D_KL(Q_good || P) = {kl_divergence(Q_good, P):.4f}")
print("  They're different! KL is NOT symmetric.")
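The point about near-zero probabilities can also be made concrete. This sketch (illustrative numbers only) contrasts a model that nearly rules out a real event with one that merely wastes probability on an impossible one:

```python
import numpy as np

def kl_bits(p, q):
    """D_KL(P || Q) in bits; terms with P(x) = 0 contribute nothing."""
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

P = np.array([0.5, 0.5, 0.0])  # reality: the third event never occurs

# Q1 nearly rules out a real event: catastrophic, roughly 9 extra bits
Q1 = np.array([0.999, 1e-6, 0.000999])
print(kl_bits(P, Q1))

# Q2 wastes 30% of its mass on the impossible event: merely wasteful,
# about half an extra bit
Q2 = np.array([0.35, 0.35, 0.30])
print(kl_bits(P, Q2))
```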
KL Divergence in Deep Learning
KL divergence is not just a theoretical curiosity — it appears as an explicit term in the loss functions and training procedures of several major deep learning paradigms. Understanding where and why it shows up will help you read modern ML papers with confidence.
- Knowledge distillation: the student network minimises $D_{\text{KL}}(P_{\text{teacher}} \| P_{\text{student}})$ — matching the teacher's soft predictions. The temperature-scaled softmax from article 2 creates smoother teacher distributions that are easier to match, because the softened probabilities reveal inter-class similarities ("this 7 looks a bit like a 1") that hard labels throw away.
- Variational autoencoders (VAEs): the loss includes $D_{\text{KL}}(q(z|x) \| p(z))$, which pushes the encoder's learned latent distribution $q(z|x)$ toward a standard normal prior $p(z) = \mathcal{N}(0, I)$. This KL term is what makes the latent space smooth and interpolable — without it, the encoder could map inputs to wildly scattered points with no structure, making generation impossible.
- RLHF / DPO: a KL penalty $D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})$ prevents the fine-tuned policy $\pi_\theta$ from drifting too far from the reference model $\pi_{\text{ref}}$. Without this constraint, the model can overfit to the reward signal and produce degenerate outputs — gaming the reward function in ways that produce high reward but low-quality text. The KL term acts as a leash, keeping the model close to its well-behaved starting point. (Covered in detail in the RLHF track.)
- Information bottleneck: the principle of compressing representations while preserving task-relevant information is formalised as minimising mutual information (which involves KL terms). The idea is to find the most compact representation $Z$ of input $X$ that still predicts label $Y$ well. The trade-off between compression and prediction is governed by KL divergences between the joint and marginal distributions.
In every case, KL divergence serves the same conceptual role: it measures how far one distribution is from another and provides a differentiable penalty that the optimiser can push toward zero. The specific distributions differ (softmax outputs, latent encodings, policy distributions), but the mathematical machinery is identical.
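As one concrete instance, the distillation term can be sketched in a few lines. This is a simplified illustration with made-up logits, not a full recipe: practical implementations typically also scale the loss by $T^2$ and mix in a hard-label term, both omitted here.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax; higher T gives a smoother distribution."""
    z = np.asarray(z) / T
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    """D_KL(P || Q) in nats, assuming p and q are strictly positive."""
    return np.sum(p * np.log(p / q))

# Hypothetical logits for a 3-class problem (illustrative numbers only)
teacher_logits = np.array([4.0, 1.0, 0.5])
student_logits = np.array([3.0, 2.0, 0.1])

T = 4.0  # temperature softens both distributions before comparison
loss = kl(softmax(teacher_logits, T), softmax(student_logits, T))
print(f"distillation KL at T={T}: {loss:.4f} nats")
```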
Connecting It All Together
The four quantities we've covered form a clean hierarchy, each building on the last:
- Information: surprise of a single event. $I(x) = -\log_2 P(x)$.
- Entropy: average surprise across all events. $H(P) = -\sum_x P(x) \log_2 P(x)$.
- Cross-entropy: average surprise when using the wrong distribution. $H(P, Q) = -\sum_x P(x) \log_2 Q(x)$.
- KL divergence: the extra surprise caused by the wrong distribution. $D_{\text{KL}}(P \| Q) = H(P, Q) - H(P)$.
And now the crucial connection to loss functions: minimising cross-entropy $H(P, Q)$ with respect to $Q$ is equivalent to minimising KL divergence $D_{\text{KL}}(P \| Q)$. Why? Because $H(P, Q) = D_{\text{KL}}(P \| Q) + H(P)$, and $H(P)$ is a constant — the data distribution doesn't change during training. Taking the gradient of $H(P, Q)$ with respect to the model's parameters gives the same gradient as taking the gradient of $D_{\text{KL}}(P \| Q)$. The constant $H(P)$ disappears.
This is why "cross-entropy loss" and "minimise KL divergence" are used interchangeably in practice. They produce identical gradients and identical trained models. The only difference is a constant offset in the loss value — and since we care about the direction of optimisation (the gradients), not the absolute loss number, the distinction is irrelevant for training.
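The constant offset is easy to see numerically. For a fixed $P$, sweeping over several candidate $Q$s shows $H(P, Q)$ and $D_{\text{KL}}(P \| Q)$ moving in lockstep, always separated by $H(P)$:

```python
import numpy as np

def entropy(p):
    mask = p > 0
    return -np.sum(p[mask] * np.log2(p[mask]))

def cross_entropy(p, q):
    mask = p > 0
    return -np.sum(p[mask] * np.log2(q[mask]))

P = np.array([0.7, 0.2, 0.1])  # fixed "data" distribution

# For every candidate Q, H(P,Q) - D_KL(P||Q) is the same constant: H(P)
for Q in [np.array([0.7, 0.2, 0.1]),
          np.array([0.5, 0.3, 0.2]),
          np.array([0.2, 0.3, 0.5])]:
    H_PQ = cross_entropy(P, Q)
    KL = H_PQ - entropy(P)
    print(f"H(P,Q) = {H_PQ:.4f}, D_KL = {KL:.4f}, offset = {H_PQ - KL:.4f}")
```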
Every time you see a cross-entropy loss in a training loop, you're implicitly minimising the KL divergence between the true data distribution and the model's learned distribution. Every time you see a KL penalty in a VAE or RLHF objective, you're using the same mathematical machinery. Information theory provides the unifying language.
Quiz
Test your understanding of information theory, entropy, and KL divergence.
Why does a rare event carry more information than a common one?
What does maximum entropy mean for a probability distribution?
Why is KL divergence not a true distance metric?
Why is minimising cross-entropy H(P,Q) equivalent to minimising KL divergence D_KL(P||Q) during training?