How Do You Know if Fine-tuning Worked?
You've spent hours curating data, choosing hyperparameters, and running a training job. The loss went down. The model generates text. But is it actually better than the base model you started from? The hardest part of fine-tuning isn't the training — it's knowing whether the result is actually an improvement.
Unlike classification, where accuracy gives you a single number that everyone agrees on, generative model quality is multi-dimensional. A fine-tuned model might follow your output format perfectly but hallucinate facts. It might be factually impeccable but ignore instructions. It might nail short prompts and fall apart on long ones. There is no single number that captures all of this.
In practice, evaluation is a stack, and each layer catches different failures:
- Perplexity: a sanity check — did the model learn the language distribution? If perplexity went up after fine-tuning, something is fundamentally wrong.
- Benchmarks: standardised tests — did the model retain (or improve) its general capabilities? MMLU, HellaSwag, HumanEval, and others test knowledge, reasoning, and coding.
- Task-specific metrics: the most important layer — does the model do what YOU need it to do on YOUR data? F1, exact match, BLEU, ROUGE, or custom metrics tied to your use case.
- Human / LLM judgment: the gold standard — do real evaluators (human or a strong judge model) prefer the fine-tuned model's outputs?
No single layer is sufficient on its own. A model can have great perplexity but fail at your task. It can ace benchmarks but produce outputs your users hate. It can please human judges on cherry-picked examples but break on edge cases. The goal is to build a complete picture by combining multiple evaluation signals, and this article walks through each one.
Perplexity: The Sanity Check
Perplexity measures how surprised the model is by a held-out evaluation set. Given a sequence of tokens, the model assigns a probability to each next token. If those probabilities are high (the model predicted well), perplexity is low. If the model is constantly guessing wrong, perplexity is high. Formally:

$$\text{PPL} = \exp\left(-\frac{1}{T} \sum_{t=1}^{T} \log P(x_t \mid x_{<t})\right)$$
Let's break down every piece of this. $T$ is the total number of tokens in the evaluation sequence. $x_t$ is the actual token at position $t$, and $x_{<t}$ denotes all tokens before position $t$ (the context). $P(x_t \mid x_{<t})$ is the probability the model assigns to the correct next token given the preceding context. We take the $\log$ of each probability (which is negative, since $0 < P \leq 1$), sum them up, negate the sum, divide by $T$ to get the average, and finally exponentiate. The inner sum $-\frac{1}{T}\sum_t \log P(x_t \mid x_{<t})$ is just the cross-entropy loss — the same number you see during training. The $\exp$ converts it from log-space to a more interpretable scale.
What do the boundary values look like? If the model predicts every token perfectly — $P(x_t \mid x_{<t}) = 1$ for all $t$ — then every $\log P$ is $0$, the sum is $0$, and $\text{PPL} = \exp(0) = 1$. That's the theoretical minimum: a model that is never surprised. At the other extreme, if the model assigns equal probability to every token in the vocabulary $V$ (pure random guessing), then $P = 1/V$ for each token, $\log P = -\log V$, and $\text{PPL} = \exp(\log V) = V$. For a model with a vocabulary of 32,000 tokens (like LLaMA), that's a perplexity of 32,000. In practice, well-trained LLMs achieve perplexities of 5-15 on general text.
Why do we call perplexity a "sanity check" rather than a quality metric? Because it measures language modelling ability, not helpfulness, safety, or format adherence. A model with lower perplexity can still give worse answers — it might predict tokens better in general but be worse at following instructions or at refusing appropriately. Perplexity doesn't distinguish between a model that outputs beautiful prose about the wrong topic and one that gives a correct but clunky answer to the right question.
That said, perplexity is invaluable as a negative signal. If perplexity on a held-out set increased after fine-tuning, something went wrong: you may have overfit, corrupted the data, or used a learning rate that destabilised the weights. A perplexity increase doesn't tell you what broke, but it tells you that something did.
The code below computes perplexity from a list of log-probabilities on a toy example, showing the direct relationship between cross-entropy loss and perplexity:
import math, json, js
# Simulated log-probabilities for each token in a short sequence
# Each value is log P(x_t | x_<t) — the model's confidence in the correct next token
# More negative = less confident
log_probs_good_model = [-0.10, -0.22, -0.05, -0.51, -0.30, -0.15, -0.08, -0.42]
log_probs_bad_model = [-2.10, -1.80, -3.05, -2.51, -1.90, -2.15, -2.80, -1.42]
def compute_perplexity(log_probs):
    T = len(log_probs)
    avg_neg_log_prob = -sum(log_probs) / T  # cross-entropy
    ppl = math.exp(avg_neg_log_prob)
    return avg_neg_log_prob, ppl
ce_good, ppl_good = compute_perplexity(log_probs_good_model)
ce_bad, ppl_bad = compute_perplexity(log_probs_bad_model)
# Boundary cases
perfect_log_probs = [0.0, 0.0, 0.0, 0.0]
_, ppl_perfect = compute_perplexity(perfect_log_probs)
vocab_size = 32000
random_log_probs = [-math.log(vocab_size)] * 4
_, ppl_random = compute_perplexity(random_log_probs)
rows = [
["Good model", f"{ce_good:.4f}", f"{ppl_good:.2f}", f"~{ppl_good:.0f} equally likely options"],
["Bad model", f"{ce_bad:.4f}", f"{ppl_bad:.2f}", f"~{ppl_bad:.0f} equally likely options"],
["Perfect (P=1)", "0.0000", f"{ppl_perfect:.1f}", "no uncertainty at all"],
["Random (V=32k)", f"{math.log(vocab_size):.4f}", f"{ppl_random:.0f}", "picking from entire vocabulary"],
]
js.window.py_table_data = json.dumps({
"headers": ["Model", "Cross-Entropy", "Perplexity", "Interpretation"],
"rows": rows
})
print("Lower perplexity = better. PPL of 1 is perfect, PPL of V is random guessing.")
Benchmarks: Standardised Tests for LLMs
Perplexity tells you whether the model can predict tokens. Benchmarks tell you whether it can reason, know facts, write code, and avoid common pitfalls. The LLM community has built a collection of standardised tests that serve as shared yardsticks. After fine-tuning, you typically run the model on several benchmarks to check whether general capabilities were preserved (or improved).
The major benchmark suites you'll encounter:
- MMLU (Hendrycks et al., 2021): 57 subjects (from abstract algebra to virology), multiple-choice format. Tests breadth of knowledge and reasoning. This is the standard benchmark for general capability — virtually every model paper reports an MMLU score.
- HellaSwag (Zellers et al., 2019): sentence completion tasks that test common-sense reasoning. The model must choose the most plausible continuation of a scenario. Humans score ~95%; models that score well here demonstrate strong grounding in everyday physical and social reasoning.
- HumanEval (Chen et al., 2021): 164 programming problems where the model writes Python functions that must pass unit tests. The metric is pass@k — the probability that at least one of $k$ generated samples passes all tests. This is the standard benchmark for code generation.
- TruthfulQA (Lin et al., 2022): 817 questions designed to probe whether models generate common misconceptions and popular falsehoods (e.g., "What happens if you crack your knuckles?"). A model that has memorised the internet's most repeated myths will score poorly here. Crucial for detecting whether fine-tuning made the model more or less truthful.
- MT-Bench (Zheng et al., 2023): 80 multi-turn conversation prompts across 8 categories (writing, roleplay, reasoning, math, coding, extraction, STEM, humanities). A strong LLM (GPT-4) scores the responses on a 1-10 scale. This benchmark is particularly valuable for fine-tuned chat models because it tests multi-turn coherence, not just single-response quality.
- Open LLM Leaderboard by HuggingFace: aggregates scores from multiple benchmarks into a single leaderboard. Useful for comparing your fine-tuned model against publicly available models, though the specific benchmarks included have evolved over time.
Benchmarks are invaluable for catching capability regressions — if your base model scored 63% on MMLU and your fine-tuned version scores 58%, you've lost general knowledge. But they come with serious limitations:
- Contamination: if benchmark questions leaked into the training data (a surprisingly common problem with web-scraped datasets), scores are artificially inflated. The model isn't reasoning — it's recalling.
- Overfitting to benchmarks: some training pipelines deliberately optimise for benchmark performance, producing models that ace MMLU but perform poorly on real-world tasks. A high benchmark score doesn't guarantee your fine-tuned model is better for your specific use case.
- Format sensitivity: many benchmarks are multiple-choice. A model fine-tuned for free-form generation might score worse on multiple-choice simply because it doesn't output answers in the expected format (e.g., it writes a paragraph instead of "A").
Task-Specific Evaluation
The most important evaluation question is simple: does the model do what you need it to do? Perplexity and benchmarks are proxies. Task-specific evaluation is direct measurement.
The right metric depends on what kind of task you fine-tuned for:
- Classification tasks: precision, recall, and F1 score. Precision asks "of all the items the model labelled positive, how many actually are?"; recall asks "of all the actually positive items, how many did the model catch?". F1 is their harmonic mean, balancing both.
- Extraction tasks: exact match (did the model extract exactly the right string?) and token-level F1 (how much overlap is there between predicted and gold tokens?). Exact match is strict — a single extra word scores 0. Token-level F1 gives partial credit.
- Generation tasks: automated metrics like BLEU (precision of n-gram overlap with reference) and ROUGE (recall of n-gram overlap). These are imperfect — they measure surface-level similarity, not semantic quality — but they're cheap to compute and useful for catching major regressions. Human preference or LLM-as-judge (covered next) remains the gold standard for generation quality.
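To make the n-gram overlap idea concrete, here is a toy ROUGE-N recall sketch — it only counts n-gram overlap between prediction and reference, with none of the stemming or bootstrapping that production implementations (e.g. the `rouge-score` package) provide:

```python
from collections import Counter

def rouge_n_recall(prediction: str, reference: str, n: int = 1) -> float:
    """Toy ROUGE-N: fraction of reference n-grams recovered by the prediction."""
    def ngrams(text: str, n: int) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref_ngrams = ngrams(reference, n)
    pred_ngrams = ngrams(prediction, n)
    total = sum(ref_ngrams.values())
    if total == 0:
        return 0.0
    # Clipped overlap: each reference n-gram can only be matched once per occurrence
    overlap = sum((ref_ngrams & pred_ngrams).values())
    return overlap / total

reference = "the cat sat on the mat"
print(rouge_n_recall("the cat sat on the mat", reference, n=2))  # 1.0 (identical)
print(rouge_n_recall("a cat on a mat", reference, n=1))          # 0.5 (3 of 6 unigrams)
```

Note how surface-level this is: a semantically perfect paraphrase with different wording scores poorly, which is exactly why these metrics are only useful for catching major regressions.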
Regardless of the metric, you need a good evaluation set. Building one is more important (and more often neglected) than choosing the right metric:
- Size: 100-500 examples that represent your production distribution. Too few and results are noisy; too many and annotation is expensive. You want enough for the confidence interval on your metric to be tight enough to distinguish between models.
- Coverage: include edge cases and known failure modes, not just the easy examples. If your model handles customer support tickets, include tickets with typos, ambiguous requests, multiple issues, and adversarial inputs.
- Separation: the eval set must be completely separate from the training data. This sounds obvious but is commonly violated, especially when people split a dataset and then later add more examples to the training set without checking for overlap.
- Version control: if you change the eval set, historical comparisons become meaningless. Version it like code. When you add or remove examples, record why, and re-run previous models on the new set if you need to compare.
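The separation requirement can be enforced mechanically rather than by good intentions. A minimal sketch that detects exact-duplicate leakage between train and eval sets after normalising case and whitespace (catching fuzzier near-duplicates would need something like MinHash or embedding similarity):

```python
def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so trivial edits don't hide duplicates."""
    return " ".join(text.lower().split())

def find_leakage(train_examples, eval_examples):
    """Return eval examples whose normalised text also appears in the train set."""
    train_set = {normalise(t) for t in train_examples}
    return [e for e in eval_examples if normalise(e) in train_set]

train = ["What is the capital of France?", "Translate 'hello' to Spanish."]
evals = ["what is  the capital of France?", "Summarise this article."]
leaked = find_leakage(train, evals)
print(f"{len(leaked)} leaked example(s): {leaked}")  # catches the case/spacing variant
```

Running a check like this every time either set changes is cheap insurance against silently inflated eval scores.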
The simplest and most powerful evaluation technique is the A/B comparison: run the base model and the fine-tuned model on exactly the same inputs from your eval set, then compare outputs side by side. No metric can replace actually reading model outputs — you'll catch problems that no automated score would flag (wrong tone, subtle hallucinations, format deviations). Automated metrics tell you how much changed; reading outputs tells you what changed.
The code below implements a token-level F1 score computation — one of the most common metrics for extraction and question-answering tasks. The idea is to treat the predicted and gold answers as bags of tokens and compute precision, recall, and F1 over the token overlap:
from collections import Counter
import json, js
def token_f1(prediction, ground_truth):
    """Compute token-level precision, recall, and F1 between two strings."""
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    pred_counts = Counter(pred_tokens)
    gold_counts = Counter(gold_tokens)
    # Overlap: min count for each shared token
    overlap = sum((pred_counts & gold_counts).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred_tokens)  # of predicted tokens, how many are correct?
    recall = overlap / len(gold_tokens)  # of gold tokens, how many were predicted?
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
gold = "Paris is the capital of France"
examples = [
"the capital of France is Paris",
"Paris",
"the capital of France is Berlin",
"I don't know",
]
rows = []
for pred in examples:
    p, r, f1 = token_f1(pred, gold)
    rows.append([pred, f"{p:.2f}", f"{r:.2f}", f"{f1:.2f}"])
js.window.py_table_data = json.dumps({
"headers": ["Prediction", "Precision", "Recall", "F1"],
"rows": rows
})
print(f"Gold: '{gold}'")
print("All words correct but reordered => F1=1.00 (token-level ignores order)")
print("Only 'Paris' predicted => perfect precision but low recall => F1=0.29")
LLM as Judge
Automated metrics like F1 and BLEU measure surface-level overlap. Human evaluation captures quality but is slow and expensive. LLM-as-judge sits in between: use a strong model (GPT-4, Claude, or another capable LLM) to evaluate the outputs of the model you're testing. This approach was formalised by Zheng et al. (2023) in the paper "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena", which showed that GPT-4 judge ratings correlate strongly with human preferences — above 80% agreement, comparable to inter-human agreement.
The setup is straightforward: give the judge model a rubric (what to evaluate), the original input/prompt, and the model's output. Ask the judge to score the output on a numerical scale and provide a justification. A typical judge prompt looks like this:
You are an expert evaluator. Rate the following response on a scale of 1-10.
## Rubric
- Accuracy (1-10): Are the facts correct? No hallucinations?
- Completeness (1-10): Does the response address all parts of the question?
- Format (1-10): Does the response follow the requested output format?
- Clarity (1-10): Is the response well-written and easy to understand?
## User Prompt
{the original prompt sent to the model being evaluated}
## Model Response
{the output from the fine-tuned model}
## Your Evaluation
For each criterion, provide a score and a brief justification.
Then provide an overall score (1-10).
The advantages are compelling: LLM-as-judge is scalable (you can evaluate thousands of outputs overnight), consistent (the same rubric is applied uniformly, unlike human annotators who drift over time), and cheaper than human evaluation by orders of magnitude. For SFT evaluation in particular, where you need to compare multiple checkpoints or hyperparameter configurations, LLM-as-judge has become the de facto standard.
But the method has known biases you must account for:
- Position bias: in A/B comparisons ("which response is better: A or B?"), judges tend to prefer whichever response is presented first. Mitigation: run each comparison twice with positions swapped and average the scores.
- Self-preference: GPT-4 may systematically prefer GPT-4-style outputs (verbose, hedging, with caveats) over outputs from other models that are equally good but written differently. Mitigation: use multiple judge models, or at least be aware of this when interpreting scores.
- Rubric sensitivity: small changes in rubric wording can shift scores significantly. "Is the response accurate?" and "Does the response contain any factual errors?" sound equivalent but may produce different score distributions. Mitigation: test your rubric on a small sample before running full evaluation, and don't change the rubric mid-experiment.
- Verbosity bias: judges tend to prefer longer, more detailed responses even when a shorter response is equally correct. A concise, correct answer may score lower than a verbose, padded one. Mitigation: include "penalise unnecessary verbosity" in the rubric, or normalise by response length.
Best practices for LLM-as-judge: use multiple judges (different models or different rubric variations) and average their scores. Randomise the position of responses in A/B comparisons. Calibrate against a small set of human ratings to verify the judge agrees with humans on your specific task. And always spot-check the judge's justifications — if the reasoning is wrong, the score is unreliable even if it looks right.
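The position-swap mitigation is simple enough to sketch in a few lines. The `judge` callable below is hypothetical — in practice it would wrap an API call to a strong judge model with a fixed rubric — but the swap-and-average bookkeeping around it is the actual technique:

```python
def compare_with_position_swap(judge, prompt, response_a, response_b):
    """Mitigate position bias: judge both orderings, then average per response.

    `judge(prompt, first, second)` is a hypothetical callable returning a
    (first_score, second_score) pair on a 1-10 scale.
    """
    a_first, b_second = judge(prompt, response_a, response_b)
    b_first, a_second = judge(prompt, response_b, response_a)
    score_a = (a_first + a_second) / 2
    score_b = (b_first + b_second) / 2
    return score_a, score_b

# A toy judge with a built-in position bias: +1 for whichever response comes first
def biased_judge(prompt, first, second):
    true_scores = {"good answer": 8.0, "weak answer": 5.0}
    return true_scores[first] + 1.0, true_scores[second]

a, b = compare_with_position_swap(biased_judge, "prompt", "good answer", "weak answer")
print(a, b)  # 8.5 5.5 — the symmetric first-position bonus cancels; the true 3.0 gap survives
```

With a single-ordering evaluation, the biased judge would have scored "weak answer" at 6.0 when placed first — swap-averaging is what makes the comparison fair.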
Overfitting: The Silent Killer
If there's one failure mode that derails more fine-tuning projects than any other, it's overfitting. The model memorises the training examples instead of learning the underlying pattern, and it does so silently — training loss keeps decreasing, outputs on training examples look perfect, and everything seems fine until you test on new inputs.
Overfitting is especially dangerous with the small datasets typical of fine-tuning. Pre-training uses billions of tokens, so overfitting is rare. But SFT datasets are often 1,000-10,000 examples. With a 7-billion-parameter model and 5,000 training examples, the model has more than a million parameters per example — more than enough capacity to memorise every example verbatim without learning any generalisable pattern.
The symptoms of overfitting:
- Diverging loss curves: training loss keeps decreasing, but evaluation loss on a held-out set starts increasing. This is the classic signal — the model is fitting the training data more tightly while getting worse at generalising.
- Perfect training, poor generalisation: the model gives excellent outputs on training examples but struggles on new inputs, even similar ones. If you rephrase a training example slightly and the quality drops dramatically, the model memorised the example rather than learning the task.
- Verbatim regurgitation: the model starts producing exact phrases or entire sentences from the training data in response to unrelated prompts. This is the clearest sign of memorisation.
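Verbatim regurgitation can be flagged automatically by checking whether model outputs share long word n-grams with the training data. The 8-word threshold below is an assumption — a common heuristic, but one you should tune for your domain:

```python
def ngram_set(text: str, n: int):
    """All n-grams of consecutive words in a text."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def flags_regurgitation(output: str, training_texts, n: int = 8) -> bool:
    """True if the output shares any n consecutive words with a training example."""
    out_ngrams = ngram_set(output, n)
    return any(out_ngrams & ngram_set(t, n) for t in training_texts)

train = ["the quick brown fox jumps over the lazy dog near the river bank"]
copied = "he said the quick brown fox jumps over the lazy dog today"
original = "a fast auburn fox leapt across a sleeping hound"
print(flags_regurgitation(copied, train))    # True — 9 consecutive words copied
print(flags_regurgitation(original, train))  # False — no long shared n-gram
```

For large training sets you would precompute the training n-gram set once (or use hashing) instead of rebuilding it per output, but the detection logic is the same.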
Detection requires one non-negotiable practice: always reserve a held-out evaluation set. Set aside 10-20% of your data before training begins. Never touch it during training. After each epoch (or every N steps), compute the loss on this held-out set and compare it to the training loss. The moment eval loss starts climbing while training loss still falls, overfitting has begun.
Prevention comes down to five levers:
- Fewer epochs: 1-3 epochs is often optimal for SFT. Beyond 3, you're almost certainly overfitting on small datasets. Some practitioners find that a single pass through the data is sufficient.
- Lower learning rate: a smaller learning rate means smaller weight updates, which slows the model's ability to memorise individual examples. Typical SFT learning rates are $1 \times 10^{-5}$ to $5 \times 10^{-5}$, much lower than pre-training.
- Dropout: LoRA supports a dropout parameter (typically 0.05-0.1) that randomly zeroes out a fraction of the adapter activations during training, preventing the model from relying on any single pathway too heavily.
- More data: the most reliable antidote to overfitting is more training examples. If you can't collect more real data, data augmentation (paraphrasing existing examples, varying the format) can help, though it's no substitute for genuine diversity.
- Early stopping: monitor eval loss during training and save a checkpoint whenever it improves. When it hasn't improved for a set number of steps (the patience), stop training and use the best checkpoint. This is the most direct defence against overfitting.
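The early-stopping logic is just a few lines of bookkeeping. A minimal sketch, using a patience of 2 evaluations (the step/loss values here are illustrative, not from a real run):

```python
def early_stopping_run(eval_losses, patience: int = 2):
    """Return (best_step, best_loss, stop_step) for a sequence of eval losses.

    Training stops once eval loss has failed to improve for `patience`
    consecutive evaluations; the best checkpoint is the one you keep.
    """
    best_loss = float("inf")
    best_step = 0
    waited = 0
    for step, loss in enumerate(eval_losses, start=1):
        if loss < best_loss:
            best_loss, best_step, waited = loss, step, 0  # save checkpoint here
        else:
            waited += 1
            if waited >= patience:
                return best_step, best_loss, step  # stop: patience exhausted
    return best_step, best_loss, len(eval_losses)

# Eval loss improves for 3 evaluations, then climbs: classic overfitting onset
losses = [1.40, 1.10, 0.95, 1.00, 1.08, 1.20]
best_step, best_loss, stop_step = early_stopping_run(losses, patience=2)
print(f"best checkpoint at eval #{best_step} (loss {best_loss}), stopped at eval #{stop_step}")
```

In a real training loop the same logic wraps your evaluation callback; most trainers (e.g. HuggingFace's) expose it as a built-in early-stopping option.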
The conceptual tradeoff is a U-shaped curve. On one end, too little training: the model hasn't learned your format, style, or task — that's underfitting. On the other end, too much training: the model has memorised your examples and lost the ability to generalise — that's overfitting. The goal is to find the sweet spot in between: enough training to learn the pattern, then stop before memorisation takes over. The held-out eval loss tells you where you are on this curve.
The code below simulates training and evaluation loss curves, showing what overfitting looks like in practice. Notice how training loss keeps falling while eval loss bottoms out around epoch 3 and then climbs — the gap between the two curves is the overfitting signal:
import math, json, js
# Simulate train and eval loss over 10 epochs
# Train loss always decreases (model fits training data tighter)
# Eval loss decreases initially, then increases (overfitting)
epochs = list(range(1, 11))
# Train loss: starts at 2.5, decays smoothly toward ~0.3
train_loss = [2.5 * math.exp(-0.25 * e) + 0.3 for e in epochs]
# Eval loss: decreases for first 3 epochs, then increases
eval_loss = []
for e in epochs:
    if e <= 3:
        # Improving: model generalises better
        val = 2.2 * math.exp(-0.9 * e) + 0.75
    else:
        # Overfitting: eval loss climbs back up from the epoch-3 minimum
        val = 0.90 + 0.12 * (e - 3) ** 1.3
    eval_loss.append(val)
# Find the best epoch (lowest eval loss)
best_epoch = epochs[eval_loss.index(min(eval_loss))]
rows = []
for i, e in enumerate(epochs):
    gap = eval_loss[i] - train_loss[i]
    if e < best_epoch:
        status = "improving"
    elif e == best_epoch:
        status = "best checkpoint"
    else:
        status = "OVERFITTING"
    rows.append([str(e), f"{train_loss[i]:.4f}", f"{eval_loss[i]:.4f}", f"{gap:+.4f}", status])
js.window.py_table_data = json.dumps({
"headers": ["Epoch", "Train Loss", "Eval Loss", "Gap", "Status"],
"rows": rows
})
print(f"Best checkpoint: epoch {best_epoch} (eval loss = {min(eval_loss):.4f})")
print(f"Training to epoch 10: eval loss = {eval_loss[-1]:.4f} (worse by {eval_loss[-1] - min(eval_loss):.4f})")
print(f"The widening gap after epoch {best_epoch} IS the overfitting signal.")
Quiz
Test your understanding of fine-tuning evaluation methods.
A model has a perplexity of 1 on a held-out evaluation set. What does this mean?
After fine-tuning, your model's MMLU score dropped from 63% to 58%, but it performs much better on your specific task. What should you conclude?
During training, you observe that training loss is 0.15 and steadily decreasing, while evaluation loss is 1.8 and increasing. What is happening?
What is the main limitation of using LLM-as-judge for evaluation?