From Scores to Probabilities

Neural networks produce raw scores called logits — unbounded real numbers that can be positive, negative, or zero. A classification network with four output classes might produce logits like $[2.0, 1.0, 0.1, -1.0]$. These numbers tell us something about relative preferences, but they are not probabilities: they don't sum to 1, and some are negative. To interpret them as a probability distribution (positive values that sum to 1), we need a function that normalises them. Softmax is that function.

$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

Let's break down every piece of this formula.

$z_i$ is the raw logit for class $i$. It can be any real number: positive, negative, or zero. This is the input to softmax — the raw score that the network assigns to class $i$ before any normalisation.

$e^{z_i}$ is the exponential of the logit. The exponential function makes all values strictly positive, since $e^x > 0$ for every real number $x$. Crucially, it also amplifies differences: larger logits become exponentially larger. A logit of 2 maps to $e^2 \approx 7.4$, while a logit of 4 maps to $e^4 \approx 54.6$ — a difference of 2 in logit space becomes a factor of roughly 7.4 in exponential space.

$\sum_{j=1}^{K} e^{z_j}$ is the normalisation constant. We sum the exponentials of all $K$ logits, then divide each individual exponential by this sum. This guarantees that all outputs sum to 1, giving us a valid probability distribution.

Output range: each $\text{softmax}(z_i) \in (0, 1)$ (strictly between 0 and 1, never exactly 0 or 1), and $\sum_{i=1}^{K} \text{softmax}(z_i) = 1$.

What happens at the extremes is instructive. If one logit is much larger than the rest (e.g., $z_1 = 10$, all others $\approx 0$), then $\text{softmax}(z_1) \approx 1$ and all other outputs are $\approx 0$. Softmax approaches a hard argmax — nearly all the probability mass lands on the winner. If all logits are equal ($z_i = c$ for all $i$), then every exponential is the same, so $\text{softmax}(z_i) = 1/K$ for all $i$ — a perfectly uniform distribution. And if you add a constant $c$ to all logits, $\text{softmax}(z_i + c) = \text{softmax}(z_i)$, because the $e^c$ factor cancels between numerator and denominator. Only relative differences between logits matter.

Let's see this step by step in code.

import numpy as np

def softmax(z):
    # Subtract max for numerical stability (doesn't change result)
    z_stable = z - np.max(z)
    exp_z = np.exp(z_stable)
    return exp_z / np.sum(exp_z)

# Example: 4-class classification logits
logits = np.array([2.0, 1.0, 0.1, -1.0])

print("Step-by-step softmax:")
print(f"  Logits:        {logits}")
print(f"  Subtract max:  {logits - np.max(logits)}")
print(f"  Exponentials:  {np.exp(logits - np.max(logits)).round(4)}")
print(f"  Sum of exp:    {np.exp(logits - np.max(logits)).sum():.4f}")
print(f"  Softmax:       {softmax(logits).round(4)}")
print(f"  Sum:           {softmax(logits).sum():.6f}")
print()

# All equal -> uniform
equal = np.array([1.0, 1.0, 1.0, 1.0])
print(f"  Equal logits {equal} -> softmax {softmax(equal).round(4)} (uniform)")

# One dominant -> near argmax
dominant = np.array([10.0, 0.0, 0.0, 0.0])
print(f"  Dominant {dominant} -> softmax {softmax(dominant).round(6)} (near argmax)")
💡 The 'subtract max' trick is essential for numerical stability. Without it, $e^{z_i}$ can overflow to infinity for large logits (e.g., $e^{1000}$). Since softmax only depends on relative differences, subtracting the max doesn't change the result but keeps all exponentials in a safe range — the largest exponential becomes $e^0 = 1$.
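The shift-invariance property from above is also easy to verify numerically; a quick sketch, reusing the same stable softmax:

```python
import numpy as np

def softmax(z):
    # Subtract max for numerical stability (doesn't change result)
    z_stable = z - np.max(z)
    exp_z = np.exp(z_stable)
    return exp_z / np.sum(exp_z)

logits = np.array([2.0, 1.0, 0.1, -1.0])

# Adding any constant c to all logits leaves the output unchanged,
# because the e^c factor cancels between numerator and denominator
for c in [0.0, 5.0, -100.0]:
    print(f"c = {c:>7}: {softmax(logits + c).round(4)}")
```

All three lines print the same distribution: only the differences between logits matter, not their absolute values.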

Why Exponentials?

A natural question arises: why use $e^{z_i}$ specifically? Why not square the logits ($z_i^2$), take absolute values ($|z_i|$), or use some other function to make things positive before normalising? There are three compelling reasons the exponential is the right choice.

First, positivity. We need all values to be positive to form a valid probability distribution. The exponential satisfies this: $e^x > 0$ for all $x$, including negative inputs. Squaring also makes things positive, but it fails on the next two criteria.

Second, monotonicity. The exponential is monotonically increasing — larger logits always produce larger exponentials, which means larger probabilities. This preserves the ranking: if the network thinks class A is more likely than class B (higher logit), class A gets a higher probability. Squaring would break this: $z^2$ maps $z = -5$ to 25 and $z = 2$ to 4, reversing the ranking entirely. A function like $|z|$ has the same problem.
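A quick numeric check makes the ranking failure concrete, using the example values $z = -5$ and $z = 2$ from above:

```python
import numpy as np

z = np.array([-5.0, 2.0])  # class B (logit 2.0) should beat class A (logit -5.0)

# Exponential normalisation preserves the ranking: B gets nearly all the mass
exp_probs = np.exp(z) / np.exp(z).sum()
print(f"exp-normalised:    {exp_probs.round(4)}")

# Squaring reverses it: (-5)^2 = 25 outweighs 2^2 = 4, so A wins
sq_probs = z**2 / (z**2).sum()
print(f"square-normalised: {sq_probs.round(4)}")
```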

Third, gradient properties. When softmax is combined with cross-entropy loss (the standard loss for classification, which we'll cover in the next article), the gradient simplifies beautifully to $p_i - y_i$ — the predicted probability minus the true label. This clean gradient comes specifically from the exponential family of distributions and makes optimisation stable and efficient. Other positive-making functions would produce far messier gradients.
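This claim is easy to verify numerically. Here is a minimal sketch that compares the analytic gradient $p_i - y_i$ against central finite differences (the cross-entropy formula is used here ahead of its full treatment in the next article):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(z, target):
    # -log of the probability assigned to the true class
    return -np.log(softmax(z)[target])

z = np.array([2.0, 1.0, 0.1, -1.0])
target = 0
y = np.eye(len(z))[target]  # one-hot true label

# Analytic gradient of softmax + cross-entropy: p - y
analytic = softmax(z) - y

# Numerical gradient via central differences
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(len(z)):
    z_plus, z_minus = z.copy(), z.copy()
    z_plus[i] += eps
    z_minus[i] -= eps
    numeric[i] = (cross_entropy(z_plus, target) - cross_entropy(z_minus, target)) / (2 * eps)

print(f"Analytic (p - y): {analytic.round(6)}")
print(f"Numerical:        {numeric.round(6)}")
```

The two gradients agree to several decimal places, confirming the $p_i - y_i$ simplification.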

Temperature: Controlling Sharpness

Sometimes we don't want the standard softmax distribution. We might want a sharper distribution (more confident, more deterministic) or a flatter one (more uncertain, more exploratory). Temperature scaling gives us this control by dividing the logits by a parameter $\tau$ (tau) before applying softmax.

$$\text{softmax}(z_i / \tau) = \frac{e^{z_i / \tau}}{\sum_{j=1}^{K} e^{z_j / \tau}}$$

The parameter $\tau$ is called the temperature, borrowing terminology from statistical mechanics, where temperature controls the randomness of particle states. Here's what different temperature values do.

$\tau = 1$: standard softmax. No change — dividing by 1 is a no-op.

$\tau \to 0^+$ (low temperature): dividing by a tiny positive number makes all logits huge in magnitude, amplifying the differences between them. The softmax approaches a hard argmax — nearly all probability mass concentrates on the single largest logit. The model becomes very confident and deterministic.

$\tau \to \infty$ (high temperature): dividing by a huge number makes all logits approach 0, erasing the differences between them. The softmax approaches a uniform distribution — the model becomes maximally uncertain and random, assigning equal probability to every class.

In summary: $\tau < 1$ (low temperature) produces sharper, more confident distributions. $\tau > 1$ (high temperature) produces flatter, more uncertain distributions. Temperature is a single knob that smoothly interpolates between "always pick the best option" and "pick uniformly at random."

The following plot makes this concrete. We take the same set of five logits and apply softmax at five different temperatures, showing how the probability distribution changes from nearly one-hot (low $\tau$) to nearly uniform (high $\tau$).

import numpy as np
import json
import js

logits = np.array([2.0, 1.0, 0.5, -0.5, -1.0])
classes = ["A", "B", "C", "D", "E"]

def softmax_temp(z, tau):
    z_t = z / tau
    z_t = z_t - np.max(z_t)
    exp_z = np.exp(z_t)
    return exp_z / np.sum(exp_z)

temperatures = [0.2, 0.5, 1.0, 2.0, 5.0]
colors = ["#ef4444", "#f59e0b", "#3b82f6", "#10b981", "#8b5cf6"]

lines = []
for tau, color in zip(temperatures, colors):
    probs = softmax_temp(logits, tau)
    lines.append({"label": f"\u03c4 = {tau}", "data": probs.tolist(), "color": color})

plot_data = [{
    "title": "Softmax with Different Temperatures",
    "x_label": "Class",
    "y_label": "Probability",
    "x_data": classes,
    "lines": lines
}]
js.window.py_plot_data = json.dumps(plot_data)

Temperature in Practice

Temperature is not just a theoretical curiosity — it is used everywhere in modern machine learning, often as one of the most important hyperparameters. Here are three major applications.

LLM sampling. When a large language model generates text, it produces logits over the entire vocabulary at each step. The temperature controls how the next token is sampled from the resulting distribution. High temperature (1.0-1.5) encourages creative, diverse text by spreading probability across many tokens. Low temperature (0.1-0.5) produces more focused, deterministic outputs by concentrating probability on the most likely tokens. Temperature = 0 is treated as a special case equivalent to greedy decoding: always pick the single most probable token, with no randomness at all.
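A toy sketch of temperature-controlled sampling; the five-word vocabulary and its logits are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical next-token logits over a tiny vocabulary
vocab = ["the", "cat", "sat", "flew", "pondered"]
logits = np.array([3.0, 2.5, 2.0, 0.5, 0.0])

def sample(logits, tau, n=1000):
    if tau == 0:  # greedy decoding: always pick the argmax
        return {vocab[int(np.argmax(logits))]: n}
    z = logits / tau
    p = np.exp(z - z.max()) / np.exp(z - z.max()).sum()
    draws = rng.choice(len(vocab), size=n, p=p)
    counts = np.bincount(draws, minlength=len(vocab))
    return dict(zip(vocab, counts.tolist()))

# Low tau concentrates draws on "the"; high tau spreads them out
for tau in [0, 0.3, 1.0, 2.0]:
    print(f"tau = {tau}: {sample(logits, tau)}")
```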

Knowledge distillation. When training a smaller "student" model to mimic a larger "teacher," Hinton et al. (2015) showed that using high temperature (typically $\tau = 4$ to $\tau = 20$) on both teacher and student softmax outputs is crucial. At $\tau = 1$, the teacher's predictions are often nearly one-hot — most of the probability sits on one class — so the student only learns "the answer is class 3." At high temperature, the distribution softens, revealing the relative rankings of all classes. The teacher might show that class 5 is more plausible than class 7, even though both have tiny probability at $\tau = 1$. This "dark knowledge" in the tail of the distribution contains rich information about class similarities that helps the student generalise better.
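To see the softening effect numerically, here is a small sketch with invented teacher logits, where class 0 is the correct answer but class 2 has a higher logit than class 3:

```python
import numpy as np

def softmax_temp(z, tau):
    z = z / tau
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical teacher logits: class 0 is the answer, but the teacher
# also "knows" that class 2 is more plausible than class 3
teacher_logits = np.array([9.0, 1.0, 3.0, -2.0])

for tau in [1.0, 4.0, 20.0]:
    p = softmax_temp(teacher_logits, tau)
    print(f"tau = {tau:>4}: {p.round(4)}")
```

At $\tau = 1$ the distribution is nearly one-hot; at higher temperatures the tail probabilities grow, making the class-2-over-class-3 preference visible to the student.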

Contrastive learning. In models like CLIP, a low temperature ($\tau \approx 0.07$) is used in the contrastive loss. The loss pushes matching image-text pairs to have high similarity and non-matching pairs to have low similarity. A low temperature makes this loss sharper, forcing the model to discriminate more aggressively between matching and non-matching pairs. The temperature is a learned parameter in CLIP, starting around 0.07 and adapted during training.
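A minimal sketch of the idea, using an invented similarity matrix (real CLIP computes similarities from learned embeddings and averages the loss over both the image-to-text and text-to-image directions):

```python
import numpy as np

# Hypothetical cosine-similarity matrix for 3 image-text pairs;
# the diagonal holds the matching pairs (deliberately highest per row)
sim = np.array([
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.2],
    [0.1, 0.4, 0.7],
])

def row_log_softmax(z):
    m = z.max(axis=1, keepdims=True)
    return z - m - np.log(np.exp(z - m).sum(axis=1, keepdims=True))

def contrastive_loss(sim, tau):
    # Image-to-text direction only, for brevity: each row is a softmax
    # over candidate texts, and the target is the matching (diagonal) one
    log_p = row_log_softmax(sim / tau)
    return -np.mean(np.diag(log_p))

for tau in [0.07, 0.5, 1.0]:
    print(f"tau = {tau}: loss = {contrastive_loss(sim, tau):.4f}")
```

When the matching pair already has the highest similarity in its row, a lower temperature sharpens the softmax and drives the loss toward zero; misranked pairs are punished correspondingly harder.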

Log-Softmax and Numerical Stability

In practice, we almost always need $\log(\text{softmax}(z_i))$ rather than $\text{softmax}(z_i)$ directly. The reason is that the standard training loss for classification — cross-entropy, covered in the next article — involves the logarithm of the predicted probability. Computing softmax first and then taking the log is numerically dangerous: softmax can produce values extremely close to 0 (e.g., $10^{-45}$), and $\log(0) = -\infty$. Even values that are merely very small can lose precision when stored in floating point.

The solution is to compute log-softmax directly, without ever materialising the intermediate softmax values.

$$\log \text{softmax}(z_i) = z_i - \log \sum_{j=1}^{K} e^{z_j}$$

This follows from taking the log of the softmax formula: $\log(e^{z_i} / \sum_j e^{z_j}) = z_i - \log \sum_j e^{z_j}$. The key term is $\log \sum_j e^{z_j}$, known as the log-sum-exp. It can be computed stably by factoring out the maximum logit $m = \max_j z_j$:

$$\log \sum_{j=1}^{K} e^{z_j} = m + \log \sum_{j=1}^{K} e^{z_j - m}$$

After subtracting $m$, all exponents are $\leq 0$, so no exponential overflows. The largest exponential is $e^0 = 1$, which is perfectly safe. This is why PyTorch provides F.log_softmax and F.cross_entropy (which fuses log-softmax with the loss computation internally) — they use this identity under the hood to avoid the dangerous intermediate step.

The following code demonstrates why this matters.

import numpy as np

logits = np.array([100.0, 101.0, 102.0], dtype=np.float32)  # large logits

# Naive: exp(102) overflows float32 to inf, so inf/inf produces nan
with np.errstate(over="ignore", invalid="ignore"):
    exp_z = np.exp(logits)
    naive_softmax = exp_z / np.sum(exp_z)
    naive_log_softmax = np.log(naive_softmax)
print(f"Naive log-softmax: {naive_log_softmax}")

# Stable: subtract max, then use log-sum-exp identity
m = np.max(logits)
log_sum_exp = m + np.log(np.sum(np.exp(logits - m)))
stable_log_softmax = logits - log_sum_exp
print(f"Stable log-softmax: {stable_log_softmax.round(4)}")
print(f"Sum of exp(log-softmax): {np.exp(stable_log_softmax).sum():.6f} (should be 1.0)")
💡 With logits of 100, 101, 102, the naive approach computes $e^{102} \approx 2 \times 10^{44}$, far beyond the float32 overflow limit of about $3.4 \times 10^{38}$. Float64 buys more headroom, but it overflows too once a logit exceeds roughly 709 (since $e^{709} \approx 8 \times 10^{307}$), which large models can produce. The log-sum-exp trick makes this bulletproof at any scale.

Quiz

Test your understanding of softmax and temperature.

Why does softmax use exponentials rather than simply normalising by the sum of logits?

What happens to the softmax output as temperature τ → 0?

Why is adding a constant to all logits before softmax safe?

Why does PyTorch provide F.log_softmax instead of just log(softmax(x))?