What Is a Loss Function?

A loss function quantifies how far the model's prediction is from the correct answer. It takes the model's output and the true target, and returns a single number — the loss — that measures the magnitude of the mistake. Training a neural network is, at its core, the process of minimising this function: adjusting weights so that predictions get closer and closer to truth. The choice of loss function shapes what "closer" means and directly affects what the model learns.

Think of the loss function as a compass for the optimiser. Without it, gradient descent has no direction — it wouldn't know whether a change to the weights made things better or worse. The loss provides that signal: a lower loss means better predictions, a higher loss means worse ones. Every gradient computation, every weight update, every epoch of training is driven by the single goal of pushing the loss downward.

Two key properties every loss function needs:

  • Differentiable (at least almost everywhere): we need gradients for backpropagation. A loss we can't differentiate is a loss we can't optimise with gradient descent. The entire training loop depends on computing $\frac{\partial \mathcal{L}}{\partial \theta}$ — the gradient of the loss with respect to every parameter. If the loss function has no well-defined derivative, this chain breaks.
  • Lower-bounded: typically by 0, so the optimiser knows when to stop (or at least when further improvement is marginal). A loss of 0 means perfect predictions — the model's output matches the target exactly. Without a lower bound, the optimiser could chase the loss toward $-\infty$ indefinitely, which would mean the loss doesn't actually measure error in a meaningful way.
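To see the first property in action, here is a minimal sketch (with made-up data and a one-parameter model of my own choosing) that computes $\frac{\partial \mathcal{L}}{\partial \theta}$ for an MSE loss analytically and checks it against a finite-difference estimate:

```python
import numpy as np

# Illustrative sketch: gradient of a loss w.r.t. a single weight,
# verified against a finite-difference estimate.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])   # targets generated by y = 2x

def mse_loss(w):
    # One-parameter linear model: y_hat = w * x
    return np.mean((y - w * x) ** 2)

w = 1.5
# Analytic gradient: dL/dw = mean(-2 * x * (y - w*x))
analytic = np.mean(-2 * x * (y - w * x))

# Central finite-difference approximation of the same derivative
eps = 1e-6
numeric = (mse_loss(w + eps) - mse_loss(w - eps)) / (2 * eps)

print(f"analytic dL/dw = {analytic:.6f}")
print(f"numeric  dL/dw = {numeric:.6f}")
```

The two numbers agree, which is exactly what "differentiable almost everywhere" buys us: a gradient signal the optimiser can trust at (nearly) every point.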

In the sections that follow, we'll examine the most important loss functions in deep learning: Mean Squared Error for regression, Cross-Entropy for classification, and Binary Cross-Entropy for binary problems. Each one encodes a different notion of "how wrong" the model is, and choosing the right one is one of the most consequential decisions in model design.

Mean Squared Error (MSE)

The most intuitive loss function is Mean Squared Error — the average squared distance between predictions and targets:

$$\mathcal{L}_{\text{MSE}} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$

Let's break down every component of this formula.

$y_i$ is the true (target) value for sample $i$. This is the ground truth — the number we want the model to predict. In a house price prediction task, $y_i$ might be the actual sale price of house $i$.

$\hat{y}_i$ is the model's predicted value for sample $i$. The hat notation ($\hat{\cdot}$) is a standard convention meaning "estimated" or "predicted." This is the output of the network's forward pass.

$(y_i - \hat{y}_i)^2$ is the squared error for a single sample. Squaring does two critical things. First, it makes all errors positive — a prediction that is too high by 5 and one that is too low by 5 both contribute an error of 25. Without squaring, positive and negative errors could cancel each other out, making the average error misleadingly small. Second, squaring penalises large errors disproportionately: an error of 10 contributes 100 to the loss, while an error of 1 contributes just 1. This means the model focuses on correcting its worst mistakes first.

$\frac{1}{N}$ averages the squared errors over the entire batch of $N$ samples. This normalisation is important because it makes the loss independent of batch size. Without it, a batch of 64 samples would have a loss 64 times larger than a batch of 1, making learning rate tuning dependent on batch size.
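Putting the pieces together, here is a quick numeric check of the formula on a tiny batch (the values are illustrative, not from any dataset):

```python
import numpy as np

# Worked MSE example on a batch of N = 4 samples
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

squared_errors = (y_true - y_pred) ** 2   # per-sample (y_i - y_hat_i)^2
mse = squared_errors.mean()               # the 1/N average over the batch

print(f"squared errors: {squared_errors}")
print(f"MSE: {mse}")
```

Note how the third sample (a perfect prediction) contributes 0, while the largest error dominates the average.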

Why squared, not absolute? A natural alternative is the Mean Absolute Error (MAE): $\frac{1}{N} \sum |y_i - \hat{y}_i|$. MAE works perfectly well as a loss, but it has a practical problem: the absolute value function $|x|$ is not differentiable at $x = 0$. The gradient is $-1$ for negative errors and $+1$ for positive errors, with an abrupt jump at zero. MSE, by contrast, is smooth everywhere. Its gradient with respect to the prediction is:

$$\frac{\partial}{\partial \hat{y}} (y - \hat{y})^2 = -2(y - \hat{y})$$

This gradient is proportional to the error itself — large errors produce large gradients, accelerating correction, while small errors produce small gradients, allowing fine-tuning. This self-scaling property makes MSE particularly well-behaved for optimisation.
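The self-scaling contrast with MAE is easy to see numerically. This sketch (with arbitrary error values) compares the two per-sample gradients with respect to the prediction:

```python
import numpy as np

# MSE's gradient scales with the error; MAE's gradient is a constant +/-1.
errors = np.array([-10.0, -1.0, -0.1, 0.1, 1.0, 10.0])  # y - y_hat

mse_grad = -2 * errors           # d/dy_hat of (y - y_hat)^2
mae_grad = -np.sign(errors)      # d/dy_hat of |y - y_hat| (undefined at 0)

for e, g_mse, g_mae in zip(errors, mse_grad, mae_grad):
    print(f"error {e:6.1f}:  MSE grad {g_mse:6.1f},  MAE grad {g_mae:4.1f}")
```

An error of 10 produces an MSE gradient of magnitude 20, while a tiny error of 0.1 produces a gentle 0.2; MAE pushes with the same magnitude-1 force in every case.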

Used for: regression tasks where the model predicts a continuous value — house prices, temperature forecasts, stock returns, object positions in an image, or any problem where the target is a real number rather than a category.

The plot below shows the MSE loss landscape for a single sample with true value $y = 3.0$. As the prediction $\hat{y}$ varies, the loss traces out a parabola — smooth, convex, and with a clear minimum at $\hat{y} = 3.0$. The gradient (red) passes through zero at the minimum and grows linearly with distance from the target:

import numpy as np
import json
import js

# True value: y = 3.0. Plot MSE as prediction varies.
y_true = 3.0
y_pred = np.linspace(-1, 7, 200)
mse = (y_true - y_pred) ** 2
gradient = -2 * (y_true - y_pred)

plot_data = [{
    "title": "MSE Loss Landscape (target = 3.0)",
    "x_label": "Prediction ŷ",
    "y_label": "Loss / Gradient",
    "x_data": y_pred.tolist(),
    "lines": [
        {"label": "MSE loss", "data": mse.tolist(), "color": "#3b82f6"},
        {"label": "Gradient", "data": gradient.tolist(), "color": "#ef4444"}
    ]
}]
js.window.py_plot_data = json.dumps(plot_data)

Cross-Entropy Loss

Cross-entropy is the standard loss function for classification. While MSE measures distance between numbers, cross-entropy measures how well a predicted probability distribution matches the true distribution. It answers a different question: not "how far off is the number?" but "how surprised should we be by the correct answer, given what the model predicted?"

For a single sample where the true class is $c$ and the model outputs predicted probabilities $p_1, p_2, \ldots, p_K$ (typically from a softmax layer), the cross-entropy loss is remarkably simple:

$$\mathcal{L}_{\text{CE}} = -\log p_c$$

That's it — just the negative log of the predicted probability for the correct class. Let's unpack why this works so well.

$p_c$ is the predicted probability for the correct class. This is the only term that matters — we don't care about the probabilities assigned to wrong classes. If the model predicts $[0.1, 0.7, 0.2]$ for a 3-class problem and the true class is 1, then $p_c = 0.7$. The loss depends entirely on how much probability the model placed on the right answer.

$-\log$ creates the penalty structure. When $p_c = 1$ (perfect prediction — the model is 100% confident in the correct class), $-\log(1) = 0$: no loss at all. When $p_c \to 0$ (the model assigns near-zero probability to the correct answer — it thinks the right answer is essentially impossible), $-\log(p_c) \to \infty$: the loss explodes toward infinity. This creates an extremely strong gradient when the model is confidently wrong, forcing rapid correction of catastrophic mistakes.

Why $-\log$? The mathematical justification comes from maximum likelihood estimation (MLE). If we treat the model as defining a probability distribution over classes, the likelihood of observing the true labels given the model's predictions is $\prod_i p_{c_i}$. Maximising this likelihood is equivalent to minimising the negative log-likelihood: $-\sum_i \log p_{c_i}$. This is exactly cross-entropy. So cross-entropy loss is not an arbitrary choice — it's the principled, information-theoretic way to train a probabilistic classifier.
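The likelihood/log-likelihood equivalence is worth verifying once by hand. With some illustrative per-sample probabilities for the correct class:

```python
import numpy as np

# Maximising the likelihood == minimising the negative log-likelihood.
# p_correct[i] = probability the model assigned to sample i's true class.
p_correct = np.array([0.7, 0.9, 0.4, 0.8])

likelihood = np.prod(p_correct)                   # prod_i p_{c_i}
neg_log_likelihood = -np.sum(np.log(p_correct))   # -sum_i log p_{c_i}

print(f"likelihood:        {likelihood:.6f}")
print(f"-log(likelihood):  {-np.log(likelihood):.6f}")
print(f"sum of CE terms:   {neg_log_likelihood:.6f}")
```

Because $\log$ turns products into sums (and is monotonic), the parameter setting that maximises the product is exactly the one that minimises the summed cross-entropy.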

The full batch form averages over all $N$ samples:

$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \log p_{c_i}$$

There is an equivalent formulation using one-hot encoded label vectors $\mathbf{y}_i$, where $y_{i,k} = 1$ if $k = c_i$ (the correct class) and $y_{i,k} = 0$ otherwise:

$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{i,k} \log p_{i,k}$$

These two forms are exactly equivalent. Since $y_{i,k}$ is 0 for every class except the correct one ($k \neq c_i$), the inner sum $\sum_{k=1}^{K} y_{i,k} \log p_{i,k}$ collapses to just $1 \cdot \log p_{i,c_i} = \log p_{c_i}$. All the zero terms vanish. The one-hot formulation is useful because it generalises naturally to soft labels (where the target distribution isn't a hard 0/1 vector), which appear in techniques like knowledge distillation and label smoothing.
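The collapse of the one-hot form to the index form is a one-liner to confirm. This sketch uses the same illustrative $[0.1, 0.7, 0.2]$ prediction as above, plus a made-up soft label to show the generalisation:

```python
import numpy as np

# The one-hot formulation collapses to -log p_c.
probs = np.array([0.1, 0.7, 0.2])   # softmax output over 3 classes
true_class = 1

# Index form: just the probability of the correct class
ce_index = -np.log(probs[true_class])

# One-hot form: the zero terms vanish from the inner sum
one_hot = np.array([0.0, 1.0, 0.0])
ce_onehot = -np.sum(one_hot * np.log(probs))

print(f"index form:    {ce_index:.4f}")
print(f"one-hot form:  {ce_onehot:.4f}")

# A soft label (e.g. from label smoothing) plugs into the same formula
soft_label = np.array([0.05, 0.9, 0.05])
ce_soft = -np.sum(soft_label * np.log(probs))
print(f"soft-label CE: {ce_soft:.4f}")
```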

The following plot shows the cross-entropy loss as a function of $p_c$ — the probability assigned to the correct class. Notice the steep cliff as $p_c$ approaches 0: the model is severely punished for being confidently wrong.

import numpy as np
import json
import js

# Cross-entropy as a function of predicted probability for the correct class
p_correct = np.linspace(0.01, 1.0, 200)
ce_loss = -np.log(p_correct)

plot_data = [{
    "title": "Cross-Entropy Loss vs Predicted Probability",
    "x_label": "p(correct class)",
    "y_label": "-log(p)",
    "x_data": p_correct.tolist(),
    "lines": [
        {"label": "Cross-entropy loss", "data": ce_loss.tolist(), "color": "#8b5cf6"}
    ]
}]
js.window.py_plot_data = json.dumps(plot_data)

Let's make this concrete with a worked example. We'll compute cross-entropy for a 3-class problem, first when the model is correct and then when it's confidently wrong:

import numpy as np

def cross_entropy(logits, true_class):
    # Stable computation via log-softmax
    shifted = logits - np.max(logits)
    log_sum_exp = np.log(np.sum(np.exp(shifted)))
    log_probs = shifted - log_sum_exp
    return -log_probs[true_class]

# 3-class problem. True class = 1
logits = np.array([2.0, 5.0, 1.0])  # model is fairly confident about class 1
true_class = 1

probs = np.exp(logits - np.max(logits)) / np.sum(np.exp(logits - np.max(logits)))
loss = cross_entropy(logits, true_class)

print(f"Logits:    {logits}")
print(f"Softmax:   {probs.round(4)}")
print(f"True class: {true_class} (p = {probs[true_class]:.4f})")
print(f"CE loss:   -log({probs[true_class]:.4f}) = {loss:.4f}")
print()

# What if model is wrong?
logits_wrong = np.array([5.0, 1.0, 2.0])  # confident about class 0, but truth is 1
probs_wrong = np.exp(logits_wrong - np.max(logits_wrong)) / np.sum(np.exp(logits_wrong - np.max(logits_wrong)))
loss_wrong = cross_entropy(logits_wrong, true_class)
print(f"Wrong prediction:")
print(f"Logits:    {logits_wrong}")
print(f"Softmax:   {probs_wrong.round(4)}")
print(f"True class: {true_class} (p = {probs_wrong[true_class]:.4f})")
print(f"CE loss:   -log({probs_wrong[true_class]:.4f}) = {loss_wrong:.4f}")
print(f"\nConfidently wrong → much higher loss ({loss_wrong:.2f} vs {loss:.2f})")
💡 Notice how the cross-entropy loss is computed using the log-softmax trick: subtract the max, compute log-sum-exp, then subtract. This avoids the numerical instability of computing softmax first and then taking the log — exactly the technique we covered in the softmax article.

Binary Cross-Entropy (BCE)

Binary Cross-Entropy is a special case of cross-entropy for two-class (binary) problems. Instead of a softmax over $K$ classes, the model outputs a single probability $p \in (0, 1)$ via the sigmoid function, representing the probability of the positive class (class 1). The probability of class 0 is simply $1 - p$.

$$\mathcal{L}_{\text{BCE}} = -\left[ y \log p + (1 - y) \log(1 - p) \right]$$

Let's dissect each piece of this formula.

$y \in \{0, 1\}$ is the true label. It's either 0 (negative class) or 1 (positive class). In a spam detection task, $y = 1$ means the email is spam, $y = 0$ means it's not.

$p \in (0, 1)$ is the predicted probability of class 1, produced by applying the sigmoid function to the model's raw output (logit). Sigmoid squashes any real number into the $(0, 1)$ range, giving us a valid probability.

The formula has two terms, and the label $y$ acts as a switch that selects which one is active:

  • When $y = 1$: the second term vanishes ($(1 - 1) \log(1 - p) = 0$), leaving just $-\log p$. This penalises low predicted probability $p$ — if the true label is positive but the model predicts $p = 0.01$, the loss is $-\log(0.01) = 4.6$, which is very high.
  • When $y = 0$: the first term vanishes ($0 \cdot \log p = 0$), leaving just $-\log(1 - p)$. This penalises high predicted probability $p$ — if the true label is negative but the model predicts $p = 0.99$, the loss is $-\log(1 - 0.99) = -\log(0.01) = 4.6$. Symmetrically harsh for confidently wrong predictions in either direction.

You can verify that BCE is a special case of general cross-entropy. In a 2-class problem with probabilities $[1 - p, p]$ and one-hot label $[1 - y, y]$, the cross-entropy $-\sum_k y_k \log p_k$ expands to $-(1 - y)\log(1 - p) - y \log p$, which is exactly the BCE formula.
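That equivalence is easy to confirm numerically. A minimal sketch (function names and probability values are my own, for illustration):

```python
import numpy as np

def bce(y, p):
    # Binary cross-entropy for a single sample
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def two_class_ce(y, p):
    # General cross-entropy over the 2-class distribution [1-p, p]
    probs = np.array([1 - p, p])
    onehot = np.array([1 - y, y])
    return -np.sum(onehot * np.log(probs))

# The two agree for every combination of label and prediction
for y in (0, 1):
    for p in (0.2, 0.9):
        print(f"y={y}, p={p}: BCE={bce(y, p):.4f}, 2-class CE={two_class_ce(y, p):.4f}")
```

Note also the switch behaviour from the bullet points above: with `y=1`, only `-log p` survives; with `y=0`, only `-log(1 - p)`.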

BCE appears throughout modern deep learning beyond simple binary classification. If you've read the VLM track, you'll recognise it in SigLIP, where each image-text pair is treated as an independent binary classification: "does this image match this text?" Each pair gets its own sigmoid probability and its own BCE loss. This allows SigLIP to avoid the global softmax normalisation of CLIP's contrastive loss, making it more scalable and efficient.

MSE vs Cross-Entropy: When to Use Which

A common beginner mistake is using MSE for classification. It works in a technical sense — you can compute $(p - y)^2$ and differentiate it — but cross-entropy is almost always the better choice for classification. Here's why.

1. Gradient strength for confident mistakes. Consider a model that predicts $p \approx 0$ when the true label is $y = 1$ — it's confidently wrong. With MSE, the gradient is $\frac{\partial}{\partial p}(p - 1)^2 = 2(p - 1) \approx -2$. That gradient is bounded: its magnitude never exceeds 2, no matter how wrong the model is. With cross-entropy, the loss is $-\log(p)$, which approaches infinity as $p \to 0$. The gradient $-1/p$ also grows without bound. This means cross-entropy creates an enormously strong learning signal for confidently wrong predictions, while MSE gives only a mild nudge. In practice, this translates to faster convergence and better final accuracy for classification tasks.
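A few illustrative values of $p$ make the contrast stark:

```python
import numpy as np

# Gradients w.r.t. p when the true label is y = 1, as p -> 0
p = np.array([0.5, 0.1, 0.01, 0.001])

mse_grad = 2 * (p - 1)    # d/dp of (p - 1)^2  -> bounded near -2
ce_grad = -1 / p          # d/dp of (-log p)   -> unbounded

for pi, gm, gc in zip(p, mse_grad, ce_grad):
    print(f"p = {pi:6.3f}:  MSE grad = {gm:7.3f},  CE grad = {gc:9.1f}")
```

At $p = 0.001$, MSE's correction signal has saturated near $-2$ while cross-entropy's is a thousand times larger.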

2. Probabilistic interpretation. Cross-entropy directly optimises the predicted probability distribution — it comes from maximum likelihood estimation and has a rigorous information-theoretic foundation. MSE treats probabilities as continuous values to regress toward, ignoring the fact that they must sum to 1 and live on a probability simplex. Using MSE for probabilities is like measuring the distance between two points on a sphere using a straight line through the interior — it works, but it doesn't respect the geometry of the space.

3. The softmax-cross-entropy gradient. When cross-entropy is combined with a softmax output layer (which is nearly always the case in classification), the gradient simplifies to $p_i - y_i$ — the predicted probability minus the true label. This is clean, numerically stable, and scales naturally. MSE combined with softmax produces a more complicated gradient that involves additional factors of $p_i(1 - p_i)$, which approach 0 when the model is confident, further suppressing the gradient precisely when it's needed most.
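The $p_i - y_i$ simplification can be checked numerically. This sketch (reusing the illustrative logits from the earlier worked example) compares the closed-form gradient against a finite-difference estimate of the stable cross-entropy:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def ce_from_logits(z, c):
    # Stable cross-entropy: logsumexp(shifted) - shifted[c]
    shifted = z - np.max(z)
    return np.log(np.sum(np.exp(shifted))) - shifted[c]

logits = np.array([2.0, 5.0, 1.0])
true_class = 1

# Closed form: gradient w.r.t. logits is p - y (y one-hot)
p = softmax(logits)
analytic = p.copy()
analytic[true_class] -= 1.0

# Finite-difference gradient for comparison
eps = 1e-6
numeric = np.zeros_like(logits)
for k in range(len(logits)):
    z_plus, z_minus = logits.copy(), logits.copy()
    z_plus[k] += eps
    z_minus[k] -= eps
    numeric[k] = (ce_from_logits(z_plus, true_class)
                  - ce_from_logits(z_minus, true_class)) / (2 * eps)

print(f"analytic p - y: {analytic.round(6)}")
print(f"numeric grad:   {numeric.round(6)}")
```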

The practical rule of thumb is straightforward:

  • Regression (continuous targets like prices, temperatures, positions) → MSE. Or Huber loss if you need robustness to outliers — Huber behaves like MSE for small errors and like MAE for large ones, capping the penalty for extreme outliers.
  • Multi-class classification (discrete categories like ImageNet classes, next-token prediction) → Cross-entropy with softmax.
  • Binary classification (two classes: spam/not spam, match/no match) → BCE with sigmoid.
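For completeness, here is a minimal sketch of the Huber loss mentioned in the regression bullet. The crossover point `delta` is a tunable choice (1.0 here is just a common default, not prescribed by the text):

```python
import numpy as np

# Huber loss: quadratic for |error| <= delta, linear beyond it.
def huber(error, delta=1.0):
    abs_err = np.abs(error)
    quadratic = 0.5 * error ** 2
    linear = delta * (abs_err - 0.5 * delta)
    return np.where(abs_err <= delta, quadratic, linear)

errors = np.array([0.5, 1.0, 5.0, 50.0])
print("Huber:    ", huber(errors))        # large errors grow linearly
print("MSE-like: ", 0.5 * errors ** 2)    # large errors grow quadratically
```

An outlier with error 50 contributes 49.5 under Huber instead of 1250 under the squared penalty, so a single bad sample can no longer dominate the batch gradient.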

Quiz

Test your understanding of loss functions and when to use each one.

Why does MSE use squared error rather than absolute error?

In cross-entropy loss, what happens when the model assigns very low probability to the correct class?

Why is cross-entropy preferred over MSE for classification tasks?

In Binary Cross-Entropy, when the true label y = 0, which term of the loss is active?