How Do You Know If Fine-tuning Worked?
You've spent hours curating data, choosing hyperparameters, and running a training job. The loss went down. The model generates text. But is it actually better than the base model you started from? The hardest part of fine-tuning isn't the training — it's knowing whether the result is actually an improvement.
Unlike classification, where accuracy gives you a single number that everyone agrees on, generative model quality is multi-dimensional. A fine-tuned model might follow your output format perfectly but hallucinate facts. It might be factually impeccable but ignore instructions. It might nail short prompts and fall apart on long ones. There is no single number that captures all of this.
In practice, evaluation is a stack, and each layer catches different failures:
- Perplexity: a sanity check — did the model learn the language distribution? If perplexity went up after fine-tuning, something is fundamentally wrong.
- Benchmarks: standardised tests — did the model retain (or improve) its general capabilities? MMLU, HellaSwag, HumanEval, and others test knowledge, reasoning, and coding.
- Task-specific metrics: the most important layer — does the model do what YOU need it to do on YOUR data? F1, exact match, BLEU, ROUGE, or custom metrics tied to your use case.
- Human / LLM judgment: the gold standard — do real evaluators (human or a strong judge model) prefer the fine-tuned model's outputs?
No single layer is sufficient on its own. A model can have great perplexity but fail at your task. It can ace benchmarks but produce outputs your users hate. It can please human judges on cherry-picked examples but break on edge cases. The goal is to build a complete picture by combining multiple evaluation signals, and this article walks through each one.
Perplexity: The Sanity Check
Perplexity measures how surprised the model is by a held-out evaluation set. Given a sequence of tokens, the model assigns a probability to each next token. If those probabilities are high (the model predicted well), perplexity is low. If the model is constantly guessing wrong, perplexity is high. Formally:

$$\text{PPL} = \exp\left(-\frac{1}{T} \sum_{t=1}^{T} \log P(x_t \mid x_{<t})\right)$$
Let's unpack every piece of this. $T$ is the total number of tokens in the evaluation sequence. $x_t$ is the actual token at position $t$, and $x_{<t}$ denotes all tokens before position $t$ (the context). $P(x_t \mid x_{<t})$ is the probability the model assigns to the correct next token given the preceding context. We take the $\log$ of each probability (which is non-positive, since $0 < P \leq 1$), sum them up, negate the sum, divide by $T$ to get the average, and finally exponentiate. The quantity inside the exponential, $-\frac{1}{T}\sum_t \log P(x_t \mid x_{<t})$, is just the cross-entropy loss — the same number you see during training. The $\exp$ converts it from log-space to a more interpretable scale.
What do the boundary values look like? If the model predicts every token perfectly — $P(x_t \mid x_{<t}) = 1$ for all $t$ — then every $\log P$ is $0$, the sum is $0$, and $\text{PPL} = \exp(0) = 1$. That's the theoretical minimum: a model that is never surprised. At the other extreme, if the model assigns equal probability to every token in the vocabulary $V$ (pure random guessing), then $P = 1/V$ for each token, $\log P = -\log V$, and $\text{PPL} = \exp(\log V) = V$. For a model with a vocabulary of 32,000 tokens (like LLaMA), that's a perplexity of 32,000. In practice, well-trained LLMs achieve perplexities of 5-15 on general text.
Why do we call perplexity a "sanity check" rather than a quality metric? Because it measures language modelling ability, not helpfulness, safety, or format adherence. A model with lower perplexity can still give worse answers — it might predict tokens better in general but fail at following instructions or refuse less often. Perplexity doesn't distinguish between a model that outputs beautiful prose about the wrong topic and one that gives a correct but clunky answer to the right question.
That said, perplexity is invaluable as a negative signal. If perplexity on a held-out set increased after fine-tuning, something went wrong: you may have overfit, corrupted the data, or used a learning rate that destabilised the weights. A perplexity increase doesn't tell you what broke, but it tells you that something did.
The code below computes perplexity from a list of log-probabilities on a toy example, showing the direct relationship between cross-entropy loss and perplexity:
import math, json, js
# Simulated log-probabilities for each token in a short sequence
# Each value is log P(x_t | x_<t) — the model's confidence in the correct next token
# More negative = less confident
log_probs_good_model = [-0.10, -0.22, -0.05, -0.51, -0.30, -0.15, -0.08, -0.42]
log_probs_bad_model = [-2.10, -1.80, -3.05, -2.51, -1.90, -2.15, -2.80, -1.42]
def compute_perplexity(log_probs):
    T = len(log_probs)
    avg_neg_log_prob = -sum(log_probs) / T  # cross-entropy
    ppl = math.exp(avg_neg_log_prob)
    return avg_neg_log_prob, ppl
ce_good, ppl_good = compute_perplexity(log_probs_good_model)
ce_bad, ppl_bad = compute_perplexity(log_probs_bad_model)
# Boundary cases
perfect_log_probs = [0.0, 0.0, 0.0, 0.0]
_, ppl_perfect = compute_perplexity(perfect_log_probs)
vocab_size = 32000
random_log_probs = [-math.log(vocab_size)] * 4
_, ppl_random = compute_perplexity(random_log_probs)
rows = [
    ["Good model", f"{ce_good:.4f}", f"{ppl_good:.2f}", f"~{ppl_good:.0f} equally likely options"],
    ["Bad model", f"{ce_bad:.4f}", f"{ppl_bad:.2f}", f"~{ppl_bad:.0f} equally likely options"],
    ["Perfect (P=1)", "0.0000", f"{ppl_perfect:.1f}", "no uncertainty at all"],
    ["Random (V=32k)", f"{math.log(vocab_size):.4f}", f"{ppl_random:.0f}", "picking from entire vocabulary"],
]
js.window.py_table_data = json.dumps({
    "headers": ["Model", "Cross-Entropy", "Perplexity", "Interpretation"],
    "rows": rows
})
print("Lower perplexity = better. PPL of 1 is perfect, PPL of V is random guessing.")
Benchmarks: Standardised Tests for LLMs
Perplexity tells you whether the model can predict tokens. Benchmarks tell you whether it can reason, know facts, write code, and avoid common pitfalls. The LLM community has built a collection of standardised tests that serve as shared yardsticks. After fine-tuning, you typically run the model on several benchmarks to check whether general capabilities were preserved (or improved).
The major benchmark suites you'll encounter:
- MMLU (Hendrycks et al., 2021): 57 subjects (from abstract algebra to virology), multiple-choice format. Tests breadth of knowledge and reasoning. This is the standard benchmark for general capability — virtually every model paper reports an MMLU score.
- HellaSwag (Zellers et al., 2019): sentence completion tasks that test common-sense reasoning. The model must choose the most plausible continuation of a scenario. Humans score ~95%; models that score well here demonstrate strong grounding in everyday physical and social reasoning.
- HumanEval (Chen et al., 2021): 164 programming problems where the model writes Python functions that must pass unit tests. The metric is pass@k — the probability that at least one of $k$ generated samples passes all tests. This is the standard benchmark for code generation.
- TruthfulQA (Lin et al., 2022): 817 questions designed to probe whether models generate common misconceptions and popular falsehoods (e.g., "What happens if you crack your knuckles?"). A model that has memorised the internet's most repeated myths will score poorly here. Crucial for detecting whether fine-tuning made the model more or less truthful.
- MT-Bench (Zheng et al., 2023): 80 multi-turn conversation prompts across 8 categories (writing, roleplay, reasoning, math, coding, extraction, STEM, humanities). A strong LLM (GPT-4) scores the responses on a 1-10 scale. This benchmark is particularly valuable for fine-tuned chat models because it tests multi-turn coherence, not just single-response quality.
- Open LLM Leaderboard by Hugging Face: aggregates scores from multiple benchmarks into a single leaderboard. Useful for comparing your fine-tuned model against publicly available models, though the specific benchmarks included have evolved over time.
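For pass@k specifically, naively generating exactly $k$ samples per problem gives a high-variance estimate. The HumanEval paper instead derives an unbiased estimator from $n \geq k$ samples of which $c$ pass: $\text{pass@}k = 1 - \binom{n-c}{k}/\binom{n}{k}$. A minimal sketch (the sample counts below are made up for illustration):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated per problem
    c: number of those samples that passed all unit tests
    k: budget — probability that at least one of k samples passes
    """
    if n - c < k:
        return 1.0  # too few failing samples to fill k draws without a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# 200 samples per problem, 10 of which passed:
print(f"pass@1  = {pass_at_k(200, 10, 1):.4f}")   # equals c/n = 0.0500
print(f"pass@10 = {pass_at_k(200, 10, 10):.4f}")  # much higher than pass@1
```

In a real evaluation you would average this quantity over all 164 problems.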
Benchmarks are invaluable for catching capability regressions — if your base model scored 63% on MMLU and your fine-tuned version scores 58%, you've lost general knowledge. But they come with serious limitations:
- Contamination: if benchmark questions leaked into the training data (a surprisingly common problem with web-scraped datasets), scores are artificially inflated. The model isn't reasoning — it's recalling.
- Overfitting to benchmarks: some training pipelines deliberately optimise for benchmark performance, producing models that ace MMLU but perform poorly on real-world tasks. A high benchmark score doesn't guarantee your fine-tuned model is better for your specific use case.
- Format sensitivity: many benchmarks are multiple-choice. A model fine-tuned for free-form generation might score worse on multiple-choice simply because it doesn't output answers in the expected format (e.g., it writes a paragraph instead of "A").
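Evaluation harnesses typically soften the format-sensitivity problem with an answer-extraction step before scoring: map the model's free-form text onto one of the expected choice letters. A heuristic sketch (the regex patterns and the A-D choice set are illustrative assumptions, not any particular harness's rules):

```python
import re

def extract_choice(response):
    """Heuristically map a free-form response to a multiple-choice letter A-D.

    Returns the letter, or None if no choice can be recovered
    (which a strict harness would simply score as wrong).
    """
    # 1. Bare letter, possibly wrapped in punctuation: "B", "(C).", "D:"
    m = re.match(r"^\(?([A-D])\)?[.:)]?\s*$", response.strip())
    if m:
        return m.group(1)
    # 2. Common phrasings: "The answer is B", "Answer: (D)"
    m = re.search(r"answer\s*(?:is|:)?\s*\(?([A-D])\)?", response, re.IGNORECASE)
    if m:
        return m.group(1).upper()
    return None

for resp in ["B", "(C).", "The answer is A because...", "Paris is the capital."]:
    print(repr(resp), "->", extract_choice(resp))  # B, C, A, None
```

A fine-tuned model that writes paragraphs instead of letters loses points under pattern 1 but may be rescued by pattern 2 — which is exactly why reported benchmark numbers depend on the harness.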
Task-Specific Evaluation
The most important evaluation question is simple: does the model do what you need it to do? Perplexity and benchmarks are proxies. Task-specific evaluation is direct measurement.
The right metric depends on what kind of task you fine-tuned for:
- Classification tasks: precision, recall, and F1 score. Precision asks "of all the items the model labelled positive, how many actually are?"; recall asks "of all the actually positive items, how many did the model catch?". F1 is their harmonic mean, balancing both.
- Extraction tasks: exact match (did the model extract exactly the right string?) and token-level F1 (how much overlap is there between predicted and gold tokens?). Exact match is strict — a single extra word scores 0. Token-level F1 gives partial credit.
- Generation tasks: automated metrics like BLEU (precision of n-gram overlap with reference) and ROUGE (recall of n-gram overlap). These are imperfect — they measure surface-level similarity, not semantic quality — but they're cheap to compute and useful for catching major regressions. Human preference or LLM-as-judge (covered next) remains the gold standard for generation quality.
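To make the classification metrics above concrete, here is a minimal precision/recall/F1 computation for a binary task (the "urgent ticket" framing is a made-up example):

```python
def binary_prf(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for a binary classification task."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of predicted positives, how many are right?
    recall = tp / (tp + fn) if tp + fn else 0.0     # of true positives, how many were caught?
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Model flags 4 of 10 tickets as "urgent"; 3 of those are truly urgent,
# and 5 tickets are urgent overall.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]
p, r, f1 = binary_prf(y_true, y_pred)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")  # precision=0.75 recall=0.60 f1=0.67
```

The harmonic mean punishes imbalance: a model with perfect precision but near-zero recall still gets a near-zero F1.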
Regardless of the metric, you need a good evaluation set. Building one is more important (and more often neglected) than choosing the right metric:
- Size: 100-500 examples that represent your production distribution. Too few and results are noisy; too many and annotation is expensive. You want enough for the confidence interval on your metric to be tight enough to distinguish between models.
- Coverage: include edge cases and known failure modes, not just the easy examples. If your model handles customer support tickets, include tickets with typos, ambiguous requests, multiple issues, and adversarial inputs.
- Separation: the eval set must be completely separate from the training data. This sounds obvious but is commonly violated, especially when people split a dataset and then later add more examples to the training set without checking for overlap.
- Version control: if you change the eval set, historical comparisons become meaningless. Version it like code. When you add or remove examples, record why, and re-run previous models on the new set if you need to compare.
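One lightweight way to enforce the version-control point above is to fingerprint the eval set's content and log the hash alongside every metric; if the hash changed, the comparison is invalid. A sketch (the helper name and the 12-hex-digit truncation are arbitrary choices):

```python
import hashlib
import json

def eval_set_fingerprint(examples):
    """Deterministic content hash of an eval set, for tracking versions.

    Any added, removed, or edited example changes the fingerprint; example
    order does not, since entries are sorted canonically first.
    """
    canonical = json.dumps(
        sorted(examples, key=lambda e: json.dumps(e, sort_keys=True)),
        sort_keys=True,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

v1 = [{"input": "What is 2+2?", "gold": "4"}]
v2 = v1 + [{"input": "Capital of France?", "gold": "Paris"}]
print("v1:", eval_set_fingerprint(v1))
print("v2:", eval_set_fingerprint(v2))  # differs: metrics on v1 and v2 are not comparable
```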
The simplest and most powerful evaluation technique is the A/B comparison: run the base model and the fine-tuned model on exactly the same inputs from your eval set, then compare outputs side by side. No metric can replace actually reading model outputs — you'll catch problems that no automated score would flag (wrong tone, subtle hallucinations, format deviations). Automated metrics tell you how much changed; reading outputs tells you what changed.
The code below implements a token-level F1 score computation — one of the most common metrics for extraction and question-answering tasks. The idea is to treat the predicted and gold answers as bags of tokens and compute precision, recall, and F1 over the token overlap:
from collections import Counter
import json, js
def token_f1(prediction, ground_truth):
    """Compute token-level precision, recall, and F1 between two strings."""
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    pred_counts = Counter(pred_tokens)
    gold_counts = Counter(gold_tokens)
    # Overlap: min count for each shared token
    overlap = sum((pred_counts & gold_counts).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred_tokens)  # of predicted tokens, how many are correct?
    recall = overlap / len(gold_tokens)     # of gold tokens, how many were predicted?
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
gold = "Paris is the capital of France"
examples = [
"the capital of France is Paris",
"Paris",
"the capital of France is Berlin",
"I don't know",
]
rows = []
for pred in examples:
    p, r, f1 = token_f1(pred, gold)
    rows.append([pred, f"{p:.2f}", f"{r:.2f}", f"{f1:.2f}"])
js.window.py_table_data = json.dumps({
    "headers": ["Prediction", "Precision", "Recall", "F1"],
    "rows": rows
})
print(f"Gold: '{gold}'")
print("All words correct but reordered => F1=1.00 (token-level ignores order)")
print("Only 'Paris' predicted => perfect precision but low recall => F1=0.29")
LLM-as-Judge
Automated metrics like F1 and BLEU measure surface-level overlap. Human evaluation captures quality but is slow and expensive. LLM-as-judge sits in between: use a strong model (GPT-4, Claude, or another capable LLM) to evaluate the outputs of the model you're testing. This approach was formalised by Zheng et al. (2023) in their paper "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena", which showed that GPT-4 judge ratings correlate strongly with human preferences — above 80% agreement, comparable to inter-human agreement.
The setup is straightforward: give the judge model a rubric (what to evaluate), the original input/prompt, and the model's output. Ask the judge to score the output on a numerical scale and provide a justification. A typical judge prompt looks like this:
You are an expert evaluator. Rate the following response on a scale of 1-10.
## Rubric
- Accuracy (1-10): Are the facts correct? No hallucinations?
- Completeness (1-10): Does the response address all parts of the question?
- Format (1-10): Does the response follow the requested output format?
- Clarity (1-10): Is the response well-written and easy to understand?
## User Prompt
{the original prompt sent to the model being evaluated}
## Model Response
{the output from the fine-tuned model}
## Your Evaluation
For each criterion, provide a score and a brief justification.
Then provide an overall score (1-10).
The advantages are compelling: LLM-as-judge is scalable (you can evaluate thousands of outputs overnight), consistent (the same rubric is applied uniformly, unlike human annotators who drift over time), and cheaper than human evaluation by orders of magnitude. For SFT evaluation in particular, where you need to compare multiple checkpoints or hyperparameter configurations, LLM-as-judge has become the de facto standard.
But the method has known biases you must account for:
- Position bias: in A/B comparisons ("which response is better: A or B?"), judges tend to prefer whichever response is presented first. Mitigation: run each comparison twice with positions swapped and average the scores.
- Self-preference: GPT-4 may systematically prefer GPT-4-style outputs (verbose, hedging, with caveats) over outputs from other models that are equally good but written differently. Mitigation: use multiple judge models, or at least be aware of this when interpreting scores.
- Rubric sensitivity: small changes in rubric wording can shift scores significantly. "Is the response accurate?" and "Does the response contain any factual errors?" sound equivalent but may produce different score distributions. Mitigation: test your rubric on a small sample before running full evaluation, and don't change the rubric mid-experiment.
- Verbosity bias: judges tend to prefer longer, more detailed responses even when a shorter response is equally correct. A concise, correct answer may score lower than a verbose, padded one. Mitigation: include "penalise unnecessary verbosity" in the rubric, or normalise by response length.
Best practices for LLM-as-judge: use multiple judges (different models or different rubric variations) and average their scores. Randomise the position of responses in A/B comparisons. Calibrate against a small set of human ratings to verify the judge agrees with humans on your specific task. And always spot-check the judge's justifications — if the reasoning is wrong, the score is unreliable even if it looks right.
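The position-swapping mitigation can be made mechanical. The sketch below assumes a `judge` callable that wraps an actual judge-model API call (not shown here) and returns which position it preferred; only a verdict that survives both orderings counts as a win:

```python
def debiased_ab_verdict(judge, prompt, response_1, response_2):
    """Run an A/B judgment twice with positions swapped to cancel position bias.

    `judge(prompt, first, second)` is a placeholder for a judge-model call
    returning "first", "second", or "tie".
    """
    v1 = judge(prompt, response_1, response_2)  # response_1 shown first
    v2 = judge(prompt, response_2, response_1)  # response_2 shown first
    if v1 == "first" and v2 == "second":
        return "response_1"
    if v1 == "second" and v2 == "first":
        return "response_2"
    return "tie"  # disagreement across orderings => likely position bias

# A toy judge that always prefers whichever response is shown first:
biased_judge = lambda prompt, a, b: "first"
print(debiased_ab_verdict(biased_judge, "q", "out_a", "out_b"))  # -> tie
```

A purely position-biased judge produces only ties under this scheme, which is exactly the behaviour you want: its preferences carry no information.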
Overfitting: The Silent Killer
If there's one failure mode that derails more fine-tuning projects than any other, it's overfitting. The model memorises the training examples instead of learning the underlying pattern, and it does so silently — training loss keeps decreasing, outputs on training examples look perfect, and everything seems fine until you test on new inputs.
Overfitting is especially dangerous with the small datasets typical of fine-tuning. Pre-training uses billions of tokens, so overfitting is rare. But SFT datasets are often 1,000-10,000 examples. With a 7-billion-parameter model and 5,000 training examples, the model has more than a million parameters per example — more than enough capacity to memorise every example verbatim without learning any generalisable pattern.
The symptoms of overfitting:
- Diverging loss curves: training loss keeps decreasing, but evaluation loss on a held-out set starts increasing. This is the classic signal — the model is fitting the training data more tightly while getting worse at generalising.
- Perfect training, poor generalisation: the model gives excellent outputs on training examples but struggles on new inputs, even similar ones. If you rephrase a training example slightly and the quality drops dramatically, the model memorised the example rather than learning the task.
- Verbatim regurgitation: the model starts producing exact phrases or entire sentences from the training data in response to unrelated prompts. This is the clearest sign of memorisation.
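The regurgitation symptom can be checked mechanically by measuring how many long n-grams in a model output appear verbatim in the training data; long shared n-grams almost never occur by chance. A sketch (the 8-token window is a heuristic assumption, not a standard threshold):

```python
def ngram_set(text, n=8):
    """All whitespace-tokenised n-grams of `text`, as strings."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def regurgitation_rate(output, training_texts, n=8):
    """Fraction of the output's n-grams that appear verbatim in training data."""
    out_ngrams = ngram_set(output, n)
    if not out_ngrams:
        return 0.0
    train_ngrams = set().union(*(ngram_set(t, n) for t in training_texts))
    return len(out_ngrams & train_ngrams) / len(out_ngrams)

train_docs = ["please reset your password by clicking the link in the email we sent you"]
output = "you can reset it: please reset your password by clicking the link in the email we sent you"
print(f"regurgitation rate: {regurgitation_rate(output, train_docs):.2f}")  # 0.64
```

A rate near zero on varied prompts is normal; a non-trivial rate on prompts unrelated to the training data is the memorisation red flag described above.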
Detection requires one non-negotiable practice: always reserve a held-out evaluation set. Set aside 10-20% of your data before training begins. Never touch it during training. After each epoch (or every N steps), compute the loss on this held-out set and compare it to the training loss. The moment eval loss starts climbing while training loss still falls, overfitting has begun.
Prevention comes down to five levers:
- Fewer epochs: 1-3 epochs is often optimal for SFT. Beyond 3, you're almost certainly overfitting on small datasets. Some practitioners find that a single pass through the data is sufficient.
- Lower learning rate: a smaller learning rate means smaller weight updates, which slows the model's ability to memorise individual examples. Typical SFT learning rates are $1 \times 10^{-5}$ to $5 \times 10^{-5}$, much lower than pre-training.
- Dropout: LoRA supports a dropout parameter (typically 0.05-0.1) that randomly zeroes out a fraction of the adapter activations during training, preventing the model from relying on any single pathway too heavily.
- More data: the most reliable antidote to overfitting is more training examples. If you can't collect more real data, data augmentation (paraphrasing existing examples, varying the format) can help, though it's no substitute for genuine diversity.
- Early stopping: monitor eval loss during training and save a checkpoint whenever it improves. When it hasn't improved for a set number of steps (the patience), stop training and use the best checkpoint. This is the most direct defence against overfitting.
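Early stopping with patience reduces to a few lines of bookkeeping. A minimal sketch (real trainers — e.g. the `EarlyStoppingCallback` in Hugging Face's transformers — add checkpoint saving and a minimum-improvement threshold):

```python
class EarlyStopper:
    """Track held-out eval loss; stop after `patience` evals without improvement."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_step = None
        self.bad_evals = 0

    def update(self, step, eval_loss):
        """Record one eval result; returns True when training should stop."""
        if eval_loss < self.best_loss:
            self.best_loss, self.best_step = eval_loss, step
            self.bad_evals = 0  # a real trainer would also save a checkpoint here
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience

stopper = EarlyStopper(patience=2)
for step, loss in enumerate([1.9, 1.4, 1.1, 1.15, 1.2, 1.3]):
    if stopper.update(step, loss):
        print(f"stop at step {step}; best was step {stopper.best_step} "
              f"(eval loss {stopper.best_loss})")
        break
```

In the toy run above, eval loss bottoms out at step 2, so training stops two evals later and the step-2 checkpoint is the one you'd keep.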
The conceptual tradeoff is a U-shaped curve. On one end, too little training: the model hasn't learned your format, style, or task — that's underfitting. On the other end, too much training: the model has memorised your examples and lost the ability to generalise — that's overfitting. The goal is to find the sweet spot in between: enough training to learn the pattern, then stop before memorisation takes over. The held-out eval loss tells you where you are on this curve.
The code below simulates training and evaluation loss curves, showing what overfitting looks like in practice. Notice how training loss keeps falling while eval loss bottoms out around epoch 3 and then climbs — the gap between the two curves is the overfitting signal:
import math, json, js
# Simulate train and eval loss over 10 epochs
# Train loss always decreases (model fits training data tighter)
# Eval loss decreases initially, then increases (overfitting)
epochs = list(range(1, 11))
# Train loss: starts at 2.5, decays smoothly toward ~0.3
train_loss = [2.5 * math.exp(-0.25 * e) + 0.3 for e in epochs]
# Eval loss: decreases for first 3 epochs, then increases
eval_loss = []
for e in epochs:
    if e <= 3:
        # Improving: model generalises better
        val = 2.6 * math.exp(-0.3 * e) + 0.8
    else:
        # Overfitting: eval loss climbs back up from its epoch-3 minimum (~1.86)
        val = 1.86 + 0.12 * (e - 3) ** 1.3
    eval_loss.append(val)
# Find the best epoch (lowest eval loss)
best_epoch = epochs[eval_loss.index(min(eval_loss))]
rows = []
for i, e in enumerate(epochs):
    gap = eval_loss[i] - train_loss[i]
    if e < best_epoch:
        status = "improving"
    elif e == best_epoch:
        status = "best checkpoint"
    else:
        status = "OVERFITTING"
    rows.append([str(e), f"{train_loss[i]:.4f}", f"{eval_loss[i]:.4f}", f"{gap:+.4f}", status])
js.window.py_table_data = json.dumps({
    "headers": ["Epoch", "Train Loss", "Eval Loss", "Gap", "Status"],
    "rows": rows
})
print(f"Best checkpoint: epoch {best_epoch} (eval loss = {min(eval_loss):.4f})")
print(f"Training to epoch 10: eval loss = {eval_loss[-1]:.4f} (worse by {eval_loss[-1] - min(eval_loss):.4f})")
print(f"The widening gap after epoch {best_epoch} IS the overfitting signal.")
Quiz
Test your understanding of fine-tuning evaluation methods.
A model has a perplexity of 1 on a held-out evaluation set. What does this mean?
After fine-tuning, your model's MMLU score dropped from 63% to 58%, but it performs much better on your specific task. What should you conclude?
During training, you observe that training loss is 0.15 and steadily decreasing, while evaluation loss is 1.8 and increasing. What is happening?
What is the main limitation of using LLM-as-judge for evaluation?