From Theory to Training

We know how to fine-tune (full fine-tuning in article 2, LoRA in article 3, QLoRA in article 4, and the broader PEFT landscape in article 5). We know what to train on (instruction datasets in article 6 — formats, templates, masking strategies, and data sourcing). But how do we actually wire all of this together and run training end-to-end? What learning rate do we use? How many epochs? How do we avoid running out of GPU memory? What does a complete training script look like, and how do we know if something is going wrong?

This article answers those questions. We will walk through every hyperparameter that matters for supervised fine-tuning (SFT), explain the data-efficiency tricks that can cut your training time in half, build a complete training script from scratch, and learn how to monitor training so we catch problems before they waste hours of GPU time.

💡 Want to run QLoRA fine-tuning yourself? There's a runnable notebook at the end of this article — load a model in NF4, apply LoRA, train on an instruction dataset, and merge the result. All on a free T4 GPU.

The modern fine-tuning ecosystem revolves around four HuggingFace libraries that snap together like building blocks. transformers (Wolf et al., 2020) provides the base models and tokenizers. PEFT (Mangrulkar et al., 2022) adds parameter-efficient adapters like LoRA and QLoRA on top of those models. TRL (Transformer Reinforcement Learning) (von Werra et al., 2020) provides the SFTTrainer class that handles the training loop, data collation, packing, and chat template formatting. And datasets (Lhoest et al., 2021) handles loading, streaming, and preprocessing data from the HuggingFace Hub or local files.

The high-level pipeline is always the same, regardless of model size or adapter type:

# 1. Load base model and tokenizer
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B", ...)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# 2. Apply PEFT adapter (LoRA / QLoRA)
model = get_peft_model(model, lora_config)

# 3. Load and format instruction dataset
dataset = load_dataset("your-org/your-sft-data")

# 4. Configure SFTTrainer with hyperparameters
trainer = SFTTrainer(model, args=training_args, train_dataset=dataset, ...)

# 5. Train
trainer.train()

# 6. Save adapter weights (or merge into base model)
trainer.save_model("./sft-adapter")

Each of these steps has decisions to make and pitfalls to avoid. The rest of this article unpacks them one by one.

Which Hyperparameters Actually Matter?

Fine-tuning a pre-trained language model is not the same as training one from scratch. The model already sits in a good region of parameter space, and our job is to nudge it — not shove it. That means the hyperparameter choices are different from pre-training, and getting them wrong can either waste compute (too conservative) or destroy the model's pre-trained knowledge (too aggressive). What are the key knobs, and where should we set them?

Learning rate is the single most important hyperparameter. It controls how large each gradient update step is. For SFT on a pre-trained model, typical values range from 1e-5 to 2e-4. This is 10-100x smaller than pre-training learning rates (which can be 1e-3 or higher), because we're fine-tuning — the weights are already close to useful values, and large updates would knock them out of place. When using LoRA, the learning rate can sit at the higher end of this range (1e-4 to 2e-4) because only a small number of adapter parameters are being updated and the base model weights are frozen. For full fine-tuning, stay closer to the lower end (1e-5 to 5e-5).

📌 If your learning rate is too high, you will see loss spikes, NaN values, or rapid overfitting. If it is too low, training will be painfully slow and the model may barely change from the base. When in doubt, start with 2e-5 for full fine-tuning or 1e-4 for LoRA, and adjust based on the loss curve.

Effective batch size determines how many examples contribute to each gradient update. Larger batches produce smoother, more stable gradients (each update averages over more examples), but require more memory. In distributed or memory-constrained setups, we rarely use a single large batch. Instead, we combine three factors:

$$B_{\text{eff}} = b_{\text{micro}} \times G \times N_{\text{gpu}}$$

Where $b_{\text{micro}}$ is the micro-batch size (number of examples per GPU per forward pass — the largest batch that physically fits in GPU memory), $G$ is the number of gradient accumulation steps (how many micro-batches of gradients we sum up before performing a weight update), and $N_{\text{gpu}}$ is the number of GPUs performing data-parallel training. The micro-batch size is constrained by memory; gradient accumulation and GPU count let us scale the effective batch size without increasing per-GPU memory.

What happens at the boundaries? A very small $B_{\text{eff}}$ (say, 1 or 2) means each update is based on very few examples, so the gradient direction is noisy — the model zigzags through parameter space and may oscillate instead of converging. A very large $B_{\text{eff}}$ (say, 512 or 1024) produces smooth gradients, but each step is expensive and, for small datasets, you may take so few steps per epoch that the model doesn't get enough gradient updates to learn. For SFT, effective batch sizes in the range of 16 to 128 are common. A good starting point is $b_{\text{micro}} = 4$, $G = 4$, giving $B_{\text{eff}} = 16$ on a single GPU.
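The formula is simple enough to sanity-check in code. A minimal helper (the function name is ours, not from any library) that reports the effective batch size for a few plausible configurations:

```python
def effective_batch_size(b_micro: int, grad_accum_steps: int, n_gpu: int = 1) -> int:
    """B_eff = b_micro * G * N_gpu — examples contributing to one weight update."""
    return b_micro * grad_accum_steps * n_gpu

# A few plausible setups: single GPU, memory-constrained GPU, two-GPU data parallel
for b, g, n in [(4, 4, 1), (2, 8, 1), (4, 8, 2)]:
    print(f"micro={b}, accum={g}, gpus={n} -> effective batch {effective_batch_size(b, g, n)}")
```

Note that the first two configurations reach the same effective batch of 16; the second just takes twice as many forward-backward passes per optimizer step to get there.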

Number of epochs controls how many times the model sees the entire dataset. For SFT, 1 to 3 epochs is the standard range. With small datasets (under 10,000 examples), even 2-3 epochs can cause overfitting — the model memorises the training examples and loses the ability to generalise. With larger datasets (100,000+ examples), you may need only a single epoch. The LIMA paper (Zhou et al., 2023) we discussed in article 6 showed strong results with just 1,000 examples over 3 epochs, but they used very high-quality, diverse data. The rule of thumb: start with 1 epoch, try 2-3 only if the model is clearly undertrained (eval loss is still dropping, the model hasn't converged).

Maximum sequence length (often called max_seq_length ) determines the longest input the model will process during training. Any example longer than this is truncated; any shorter is padded (or packed, as we'll discuss next). This should be set to match or slightly exceed the longest examples in your dataset. Setting it too high wastes memory on padding tokens, while setting it too low silently chops off the ends of your training examples, which can corrupt your targets. Common values are 512, 1024, or 2048, depending on the task.
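A practical way to pick max_seq_length is to measure the token-length distribution of your dataset instead of guessing. A sketch using made-up lengths (in a real pipeline you would tokenize each example and record its token count):

```python
import statistics

# Hypothetical token lengths for a small instruction dataset
lengths = [120, 340, 95, 210, 1800, 160, 450, 75, 600, 280, 130, 900]

lengths_sorted = sorted(lengths)
p95 = lengths_sorted[int(0.95 * (len(lengths_sorted) - 1))]  # rough 95th percentile

print(f"mean:   {statistics.mean(lengths):.0f} tokens")
print(f"median: {statistics.median(lengths):.0f} tokens")
print(f"p95:    {p95} tokens")

# Fraction of examples that would be truncated at each candidate max_seq_length
for max_len in (512, 1024, 2048):
    truncated = sum(1 for n in lengths if n > max_len) / len(lengths)
    print(f"max_seq_length={max_len}: {truncated:.0%} of examples truncated")
```

A common heuristic is to set max_seq_length near the 95th percentile and accept truncating the long tail, rather than paying for padding on every batch to accommodate a handful of outliers.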

Warmup ratio specifies what fraction of total training steps should use a gradually increasing learning rate before the main schedule takes over. Typical values are 3% to 10% of total steps. Why does warmup help? At the very start of fine-tuning, the model encounters gradients from the new distribution (instruction data) that may be quite different from what the pre-trained weights expect. A full-sized learning rate applied to these early, noisy gradients can cause sudden large updates that destabilise the model — loss spikes, gradient explosions, or even NaN values. Warmup starts with a near-zero learning rate, lets the optimizer's moment estimates (in Adam, the running mean and variance of gradients) calibrate to the new data distribution, and then gradually ramps up to the target rate. By the time the full learning rate kicks in, the optimizer has a stable picture of the gradient landscape.

Weight decay is a regularisation technique that adds a penalty proportional to the magnitude of the weights at each update, gently pushing them toward zero. For SFT, typical values are 0.01 to 0.1. It acts as a brake against overfitting: without it, the model can develop very large weight values that perfectly fit the training data but generalise poorly. With LoRA, weight decay applies only to the adapter parameters (the base model is frozen), so its effect is more contained. A value of 0.01 is a safe default.
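To see the brake concretely: AdamW applies decay decoupled from the gradient, shrinking every weight by lr * weight_decay * w at each step. A toy update rule (plain gradient descent plus decoupled decay, for illustration only, not AdamW's full moment machinery):

```python
lr, weight_decay = 1e-4, 0.01

def decayed_update(w: float, grad: float) -> float:
    """One step: gradient descent plus decoupled weight decay (AdamW-style)."""
    w = w - lr * grad              # the usual gradient step
    w = w - lr * weight_decay * w  # decay: pull the weight toward zero
    return w

# Even with zero gradient, the weight shrinks slightly on every step
w = 1.0
for _ in range(1000):
    w = decayed_update(w, grad=0.0)
print(f"after 1000 zero-gradient steps: w = {w:.6f}")
```

The shrink factor per step is tiny (1 - lr * weight_decay), which is why weight decay acts as a gentle brake over thousands of steps rather than an abrupt constraint.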

Learning rate schedule determines how the learning rate changes over the course of training. After warmup, should it stay constant, drop linearly, or follow some curve? The dominant choice for fine-tuning (and pre-training) is cosine decay — the learning rate follows a smooth cosine curve from its peak value down to a minimum:

$$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{t}{T}\pi\right)\right)$$

Let's break down every symbol. $\eta_t$ is the learning rate at step $t$. $\eta_{\max}$ is the peak learning rate — the value reached at the end of warmup. $\eta_{\min}$ is the minimum learning rate at the end of training, typically set to $0.1 \times \eta_{\max}$ or simply $0$. $t$ is the current training step (after warmup). $T$ is the total number of training steps (excluding warmup). And $\pi$ is just the mathematical constant (~3.14159) that makes the cosine complete half a cycle over the training run.

The key insight is how the $\cos$ term drives the schedule. At step $t = 0$ (start of training after warmup), we compute $\cos(0) = 1$, so the parenthesised term becomes $(1 + 1) / 2 = 1$, giving $\eta_0 = \eta_{\min} + (\eta_{\max} - \eta_{\min}) = \eta_{\max}$. At step $t = T$ (end of training), we compute $\cos(\pi) = -1$, so the parenthesised term becomes $(1 + (-1)) / 2 = 0$, giving $\eta_T = \eta_{\min}$. In between, the cosine sweeps smoothly from 1 to -1, producing a curve that decreases slowly at first (the model is still learning, keep the rate high), then more rapidly in the middle, and finally slowly again as it approaches $\eta_{\min}$ (gentle landing). This shape is well-suited to fine-tuning: the model makes its biggest adjustments early when the gradient signal is strongest, and takes increasingly cautious steps as it converges.

💡 Why cosine over linear decay? Cosine keeps the learning rate higher for longer during the first half of training, then drops more steeply. Empirically, this produces slightly better results on most benchmarks than linear decay, which drops at a constant rate. The difference is modest, but cosine has become the de facto standard.

Let's make this concrete. The code below computes the cosine schedule at 11 evenly spaced checkpoints so we can see exactly how the learning rate evolves. We use $\eta_{\max} = 2 \times 10^{-4}$, $\eta_{\min} = 0$ (common for SFT), and $T = 1000$ total steps:

import math

eta_max = 2e-4   # peak learning rate (after warmup)
eta_min = 0.0    # minimum learning rate
T = 1000         # total training steps

def cosine_lr(t, T, eta_max, eta_min):
    """Cosine decay schedule: smoothly decays from eta_max to eta_min."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(t / T * math.pi))

# Compute and print the LR at 11 evenly-spaced steps
print(f"{'Step (t)':>8}  {'Progress':>8}  {'Learning Rate':>13}  {'% of Peak':>9}")
for i in range(11):
    t = int(i * T / 10)
    lr = cosine_lr(t, T, eta_max, eta_min)
    print(f"{t:>8}  {t / T * 100:>7.0f}%  {lr:>13.6f}  {lr / eta_max * 100:>8.1f}%")

print(f"Peak LR (eta_max): {eta_max}")
print(f"Min LR (eta_min):  {eta_min}")
print(f"Total steps (T):   {T}")
print()
print("Notice: LR stays above 50% of peak for the first ~33% of training,")
print("then drops steeply in the middle, and gently approaches 0 at the end.")
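In a real run, the warmup from earlier and this cosine decay are fused into one schedule: a linear ramp from 0 to $\eta_{\max}$ over the warmup steps, then cosine decay to $\eta_{\min}$. A sketch of that combined curve (approximating what transformers' get_cosine_schedule_with_warmup produces; the library's exact boundary behaviour may differ slightly):

```python
import math

def warmup_cosine_lr(t, warmup_steps, total_steps, eta_max, eta_min=0.0):
    """Linear warmup from 0 to eta_max, then cosine decay to eta_min."""
    if t < warmup_steps:
        return eta_max * t / warmup_steps
    progress = (t - warmup_steps) / (total_steps - warmup_steps)
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(progress * math.pi))

total, warmup, peak = 1000, 50, 2e-4   # 5% warmup
for t in (0, 25, 50, 500, 1000):
    print(f"step {t:4d}: lr = {warmup_cosine_lr(t, warmup, total, peak):.6f}")
```

The ramp reaches the peak exactly at the end of warmup (step 50 here), and the cosine then treats the remaining 950 steps as its full half-cycle.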

Packing vs Padding: How Do We Handle Variable-Length Examples?

Instruction datasets are messy. Some examples are 50 tokens (a short question with a one-word answer), others are 2,000 tokens (a complex reasoning chain). But GPUs are most efficient when processing fixed-size tensors — every sequence in a batch must be the same length. How do we reconcile variable-length data with fixed-size batches? There are two strategies, and the choice between them can easily make a 2x difference in training speed.

Padding is the naive approach. Take the longest sequence in a batch (or the configured max_seq_length ), and fill every shorter sequence with a special [PAD] token until they all match. This is simple, and every sequence is nicely isolated — no risk of one example's attention leaking into another. But it's wasteful. If your max_seq_length is 2048 and the average example is 200 tokens, then roughly 90% of every batch is padding. The GPU dutifully computes attention over pad tokens, computes gradients for pad positions, and produces outputs that are immediately masked out and thrown away. For short-example datasets (chat, QA, classification), padding waste can reach 60-90% of total compute.

Packing solves this by concatenating multiple examples into a single sequence, separated by the model's end-of-sequence (EOS) token, until the sequence is full. If the average example is 200 tokens and max_seq_length = 2048 , we can fit roughly 10 examples into one packed sequence. Every token position now contains a real training token — no padding waste, no wasted compute.

# Illustrate the difference between padding and packing
max_len = 2048
examples = [180, 95, 310, 150, 220, 60, 400, 130, 275, 200]  # token lengths

# Padding: each example becomes its own sequence of length max_len
padded_tokens = len(examples) * max_len
real_tokens_padded = sum(examples)
pad_tokens = padded_tokens - real_tokens_padded
pad_waste = pad_tokens / padded_tokens * 100

print("=== PADDING ===")
print(f"Examples: {len(examples)}")
print(f"Total positions: {padded_tokens:,} ({len(examples)} x {max_len:,})")
print(f"Real tokens:     {real_tokens_padded:,}")
print(f"Pad tokens:      {pad_tokens:,}")
print(f"Waste:           {pad_waste:.1f}%")
print()

# Packing: concatenate examples into sequences of max_len
packed_seqs = []
current_seq = 0
seq_count = 0
for length in examples:
    if current_seq + length > max_len:
        packed_seqs.append(current_seq)
        current_seq = length
        seq_count += 1
    else:
        current_seq += length
if current_seq > 0:
    packed_seqs.append(current_seq)
    seq_count += 1

packed_total = seq_count * max_len
packed_real = sum(examples)
packed_waste = (packed_total - packed_real) / packed_total * 100

print("=== PACKING ===")
print(f"Examples: {len(examples)} packed into {seq_count} sequences")
print(f"Total positions: {packed_total:,} ({seq_count} x {max_len:,})")
print(f"Real tokens:     {packed_real:,}")
print(f"Remaining waste: {packed_total - packed_real:,}")
print(f"Waste:           {packed_waste:.1f}%")
print()
print(f"Speedup: {padded_tokens / packed_total:.1f}x fewer total positions to process")

There is one important complication with packing: attention must not cross example boundaries. In a packed sequence containing examples A, B, and C concatenated together, example B should not attend to tokens from example A or C — each example must believe it is alone in the sequence. Without this isolation, the model would learn spurious dependencies between unrelated examples that happen to be packed next to each other.

The standard solution is a block-diagonal attention mask (also called a packing mask or sample mask). Instead of the usual causal mask (lower-triangular, allowing each position to attend to all previous positions), we create a block-diagonal mask where each block corresponds to one example. Position $i$ can attend to position $j$ only if both positions belong to the same example. TRL's SFTTrainer handles this automatically when you set packing=True — it packs examples, inserts EOS delimiters, and generates the correct attention masks.
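To make the mask concrete, here is a minimal construction from a list of packed example lengths (pure Python for clarity; real implementations build this as a boolean tensor):

```python
def block_causal_mask(lengths):
    """Boolean mask for a packed sequence: position i may attend to position j
    only if j <= i (causal) AND both positions belong to the same example."""
    block_id = []                         # which example each position belongs to
    for example_idx, n in enumerate(lengths):
        block_id.extend([example_idx] * n)
    size = len(block_id)
    return [[block_id[i] == block_id[j] and j <= i for j in range(size)]
            for i in range(size)]

# Three tiny examples of lengths 3, 2, 2 packed into one 7-token sequence
mask = block_causal_mask([3, 2, 2])
for row in mask:
    print("".join("#" if allowed else "." for allowed in row))
```

The printout shows three lower-triangular blocks along the diagonal: each example sees only its own earlier tokens, never its packed neighbours.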

💡 Packing is especially impactful for short-example datasets like chat (average ~150-300 tokens), QA (average ~100-200 tokens), and classification fine-tuning (average ~50-100 tokens). If your examples are already near max_seq_length (long document summarisation, for instance), packing provides little benefit because there is minimal padding to eliminate.

Gradient Accumulation and Gradient Checkpointing

What if we want a large effective batch size for stable gradients, but our GPU can only fit 2 examples at a time? And what if even those 2 examples cause near-OOM conditions because the model's intermediate activations eat all the memory? These are the two most common memory bottlenecks in fine-tuning, and they have different solutions that are often confused. Let's disentangle them.

Gradient accumulation addresses the batch-size problem. Instead of computing one forward and backward pass on a batch of 32 examples (which might not fit in memory), we compute 8 sequential forward-backward passes on micro-batches of 4 examples each. After each micro-batch, the gradients are accumulated (summed) into the same gradient buffers without updating the weights. After all 8 micro-batches, we perform a single optimizer step using the accumulated gradients. The result is mathematically equivalent to training with a batch of 32 — the gradient is the same sum — but we only ever hold 4 examples in memory at once.

The trade-off is straightforward: gradient accumulation gives us the gradient quality of a large batch with the memory cost of a small batch, but it takes $G$ times as many forward-backward passes per optimizer step, so each optimizer step takes $G$ times longer in wall-clock time. If $G = 8$, each training step is ~8x slower than a full-batch step would be (if it fit in memory). But since the alternative is "doesn't fit in memory at all," this is usually a good trade.
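The "mathematically equivalent" claim is easy to verify numerically. For a one-parameter least-squares model, the gradient of the mean loss over the full batch equals the average of the micro-batch gradients (trainers that average the loss per micro-batch divide by $G$ before stepping, which is what the division below models):

```python
# Model: y_hat = w * x, loss = mean((y_hat - y)^2), so dL/dw = mean(2 * (w*x - y) * x)
def grad_mean(w, xs, ys):
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 13.9, 16.0]

# Full-batch gradient (batch of 8)
full = grad_mean(w, xs, ys)

# Gradient accumulation: 4 micro-batches of 2, averaging the micro-batch gradients
G = 4
accum = 0.0
for i in range(G):
    micro_x, micro_y = xs[2 * i:2 * i + 2], ys[2 * i:2 * i + 2]
    accum += grad_mean(w, micro_x, micro_y) / G  # divide by G, as trainers do

print(f"full-batch gradient:  {full:.6f}")
print(f"accumulated gradient: {accum:.6f}")  # identical up to float rounding
```

Only the grouping of the sum changes, which is why the two gradients agree up to floating-point rounding.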

Gradient checkpointing (also called activation checkpointing ) addresses a completely different memory bottleneck: the intermediate activations stored during the forward pass. Normally, the forward pass computes and saves activations at every layer (attention outputs, FFN intermediates, layer norms, etc.), because the backward pass needs them to compute gradients. For a model like Llama-3.1-8B processing a 2048-token sequence, these activations can consume tens of gigabytes — often more than the model weights themselves.

With gradient checkpointing enabled, the forward pass discards most of these intermediate activations instead of storing them. During the backward pass, when a discarded activation is needed, the relevant portion of the forward pass is re-executed on the fly to recompute it. This trades compute for memory: the model uses roughly 60% less activation memory, but training is about 30% slower because of the recomputation overhead (Chen et al., 2016). For fine-tuning large models on consumer GPUs (24GB or 48GB VRAM), gradient checkpointing is almost always essential — it's the difference between training fitting in memory or not.

📌 Don't confuse these two: gradient accumulation saves memory by reducing the batch size per forward pass (solving the "batch too large" problem), while gradient checkpointing saves memory by discarding intermediate activations (solving the "model too large" problem). They are complementary — you can and often should use both simultaneously.

In HuggingFace's TrainingArguments , both are simple flags:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./output",

    # ── Gradient Accumulation ──────────────────────────────────
    per_device_train_batch_size=4,      # micro-batch: 4 examples per GPU
    gradient_accumulation_steps=8,       # accumulate 8 micro-batches
    # Effective batch = 4 * 8 = 32 per GPU (without needing 32-example memory)

    # ── Gradient Checkpointing ────────────────────────────────
    gradient_checkpointing=True,         # ~60% less activation memory, ~30% slower
    # Without this, a 7B model on a 24GB GPU will likely OOM on sequences > 1024

    # Other essentials
    learning_rate=1e-4,
    num_train_epochs=2,
    bf16=True,                           # bfloat16 mixed precision
    logging_steps=10,
    save_strategy="steps",
    save_steps=200,
)

A practical rule of thumb: on a single 24GB GPU (e.g. RTX 3090 / 4090) fine-tuning a 7B parameter model with QLoRA, you can typically fit a micro-batch of 2-4 with gradient checkpointing enabled and sequences of 1024-2048 tokens. Use gradient accumulation steps of 4-8 to reach an effective batch of 16-32. On a 48GB GPU (A6000 / A40), you can increase the micro-batch to 4-8 and may get away without gradient checkpointing for shorter sequences.
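A back-of-the-envelope estimate shows why these numbers work out. The figures below are rough approximations, not measurements — activation memory in particular varies with architecture, sequence length, and attention implementation:

```python
params = 7e9                                # 7B-parameter base model
weights_nf4_gb = params * 0.5 / 1e9         # NF4 quantised weights: ~0.5 bytes/param
lora_params = 40e6                          # ~40M adapter params (r=16, all projections)
adapter_gb = lora_params * 2 / 1e9          # BF16 adapter weights (2 bytes/param)
optimizer_gb = lora_params * 8 / 1e9        # Adam: two FP32 moments per trainable param
gradients_gb = lora_params * 2 / 1e9        # BF16 gradients, adapter only

fixed = weights_nf4_gb + adapter_gb + optimizer_gb + gradients_gb
print(f"weights (NF4):           {weights_nf4_gb:.1f} GB")
print(f"adapter + opt + grads:   {adapter_gb + optimizer_gb + gradients_gb:.2f} GB")
print(f"fixed total:             {fixed:.1f} GB")
# The remainder of a 24 GB card goes to activations and CUDA overhead, which is
# why micro-batch size and gradient checkpointing decide whether you OOM.
```

The fixed cost (weights plus adapter state) is only ~4 GB under these assumptions; nearly everything else is activation memory, which is exactly the part that gradient checkpointing and a smaller micro-batch control.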

The Complete Training Script

Now let's put everything together into a single, production-ready SFT training script. This script fine-tunes a Llama 3.1 8B model using QLoRA (4-bit quantisation with LoRA adapters), demonstrating every concept we've discussed: hyperparameter configuration, packing, gradient accumulation, gradient checkpointing, and proper model saving. We'll walk through each section after the code.

import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer

# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# 1. QUANTISATION CONFIG — load base model in 4-bit (QLoRA)
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                    # 4-bit NormalFloat quantisation
    bnb_4bit_quant_type="nf4",            # NF4 data type (optimal for Gaussians)
    bnb_4bit_compute_dtype=torch.bfloat16,# compute in BF16 for stability
    bnb_4bit_use_double_quant=True,       # quantise the quantisation constants too
)

# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# 2. LOAD BASE MODEL AND TOKENIZER
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
model_name = "meta-llama/Llama-3.1-8B"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,       # apply 4-bit quantisation
    device_map="auto",                    # spread layers across available GPUs
    attn_implementation="flash_attention_2",  # faster, memory-efficient attention
)
model = prepare_model_for_kbit_training(model)  # freeze quantised layers, enable grads

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token     # many models lack a pad token
tokenizer.padding_side = "right"              # pad on right for causal LM

# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# 3. LORA CONFIG — which layers to adapt and how
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank matrices
    lora_alpha=32,                        # scaling factor (effective LR ~ alpha/r)
    lora_dropout=0.05,                    # dropout on adapter activations
    bias="none",                          # don't train bias terms
    task_type="CAUSAL_LM",               # optimisation for causal language models
    target_modules=[                      # which weight matrices to adapt
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",       # FFN projections
    ],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Typical output: "trainable params: 41,943,040 || all params: 8,072,204,288 || 0.52%"

# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# 4. LOAD AND FORMAT DATASET
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

# SFTTrainer expects a "messages" column in chat format:
# [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]
# ultrachat_200k already has this format. If yours doesn't, map it:
#
# def format_example(example):
#     return {"messages": [
#         {"role": "user", "content": example["instruction"]},
#         {"role": "assistant", "content": example["output"]},
#     ]}
# dataset = dataset.map(format_example)

# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# 5. TRAINING ARGUMENTS — every hyperparameter in one place
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
training_args = TrainingArguments(
    output_dir="./llama3-sft-qlora",
    num_train_epochs=2,                   # 2 epochs (watch for overfitting)
    per_device_train_batch_size=4,        # micro-batch per GPU
    gradient_accumulation_steps=4,        # effective batch = 4 * 4 = 16
    gradient_checkpointing=True,          # save activation memory
    learning_rate=1e-4,                   # LoRA sweet spot
    lr_scheduler_type="cosine",           # cosine decay after warmup
    warmup_ratio=0.05,                    # 5% warmup
    weight_decay=0.01,                    # mild regularisation
    bf16=True,                            # bfloat16 mixed precision
    logging_steps=10,                     # log loss every 10 steps
    save_strategy="steps",
    save_steps=500,                       # checkpoint every 500 steps
    save_total_limit=3,                   # keep only last 3 checkpoints
    evaluation_strategy="steps",
    eval_steps=500,                       # evaluate every 500 steps
    max_grad_norm=1.0,                    # gradient clipping
    report_to="wandb",                    # log to Weights & Biases
    seed=42,
)

# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# 6. CONFIGURE AND RUN SFTTrainer
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    # evaluation_strategy="steps" needs a held-out split to compute eval loss on
    eval_dataset=load_dataset("HuggingFaceH4/ultrachat_200k", split="test_sft").select(range(500)),
    processing_class=tokenizer,
    packing=True,                         # pack multiple examples per sequence
    max_seq_length=2048,                  # maximum packed sequence length
)

trainer.train()

# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# 7. SAVE — adapter only (small) or merged (full model)
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# Option A: Save just the adapter (~80 MB)
trainer.save_model("./llama3-sft-qlora/adapter")

# Option B: Merge adapter into base model and save full weights
merged = model.merge_and_unload()
merged.save_pretrained("./llama3-sft-qlora/merged")
tokenizer.save_pretrained("./llama3-sft-qlora/merged")

Let's walk through the key sections. Section 1 (Quantisation Config) sets up 4-bit NormalFloat quantisation, which is the "Q" in QLoRA. The bnb_4bit_compute_dtype=torch.bfloat16 flag is critical — it means that even though the weights are stored in 4-bit, all matrix multiplications are performed in BFloat16 for numerical stability. The bnb_4bit_use_double_quant=True option applies a second round of quantisation to the quantisation constants themselves, saving an additional ~0.4 bits per parameter (a small but free saving).

Section 2 (Model Loading) loads the model with the quantisation config applied. The device_map="auto" flag tells Accelerate to automatically distribute model layers across available GPUs (or CPU RAM if GPU memory is insufficient). prepare_model_for_kbit_training() freezes the quantised layers and enables gradient computation for the adapter layers that will be attached next. Setting the pad token is essential — many models (including Llama) don't ship with a pad token defined, and the trainer will error without one.

Section 3 (LoRA Config) specifies which weight matrices in the model receive LoRA adapters. We target all the attention projections ($W_Q$, $W_K$, $W_V$, $W_O$) and the feed-forward projections ($W_{\text{gate}}$, $W_{\text{up}}$, $W_{\text{down}}$). Targeting more modules increases the trainable parameter count (from ~0.2% to ~0.5% of total) but captures more nuanced adaptations. The lora_alpha=32 with r=16 gives a scaling factor of $\alpha / r = 2$, which effectively doubles the adapter learning rate relative to the configured learning_rate — a common setting that works well in practice.

Section 5 (Training Arguments) is where all the hyperparameters from the previous section come together. Note the max_grad_norm=1.0 — this clips gradients to a maximum L2 norm of 1.0, which prevents any single batch with unusually large gradients from destabilising training. The save_total_limit=3 keeps only the 3 most recent checkpoints, which is important because even adapter checkpoints accumulate and can fill disk space over long runs.
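Clipping by global norm is worth seeing in miniature: if the gradient vector's L2 norm exceeds the cap, every component is scaled by the same factor so the norm equals the cap, preserving direction. A sketch (clip_by_global_norm is our name for illustration; torch.nn.utils.clip_grad_norm_ is the real API):

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale the whole gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return grads                      # healthy gradient: untouched
    scale = max_norm / norm
    return [g * scale for g in grads]     # spiky gradient: direction kept, norm capped

spiky = [3.0, 4.0]                        # norm = 5.0, well above the cap
clipped = clip_by_global_norm(spiky, 1.0)
print(clipped)
print(math.sqrt(sum(g * g for g in clipped)))
```

Because the scaling is uniform across all components, clipping changes only the step size of an anomalous update, never its direction.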

Section 7 (Saving) shows two options. Option A saves only the adapter weights (~80 MB for the config above). At inference time, you load the base model and then load the adapter on top with PeftModel.from_pretrained() . This is memory-efficient and lets you swap adapters easily. Option B merges the adapter weights back into the base model, producing a standalone model with the same architecture as the original. This is simpler for deployment (one model, no adapter loading code) but loses the ability to swap adapters, and the merged model is full-sized (~16 GB for an 8B model in BF16).

💡 The ultrachat_200k dataset used above is a high-quality, multi-turn chat dataset with ~200,000 conversations. It is one of the most popular choices for SFT because of its diversity and quality. For your own projects, any dataset in the HuggingFace chat format (a "messages" column with role/content pairs) will work with SFTTrainer.

How Do We Know if Training Is Working?

A training script can run for hours without errors and still produce a useless model. The loss goes down, the GPU utilisation looks good, but the model outputs gibberish — or worse, it outputs fluent text that is subtly wrong. How do we monitor training to catch problems early, before they waste our compute budget?

Training loss is the primary signal. It should decrease steadily over the course of training — quickly at first (the model is learning the new format), then more slowly as it converges. For SFT, training loss typically starts at 1.5-2.5 (depending on the model and dataset) and settles somewhere between 0.5-1.2 after a few hundred steps. Sudden spikes in training loss indicate a problem: the learning rate may be too high, a corrupted batch may have entered the pipeline, or numerical instability may be creeping in.

Evaluation loss (computed on a held-out validation set) is the overfitting detector. Early in training, eval loss should track training loss — both decreasing together. The moment eval loss starts increasing while training loss continues to decrease, the model is overfitting: it is memorising the training set rather than learning generalisable patterns. This divergence is your signal to stop training. For SFT, this often happens after 1-2 epochs on small datasets, which is why we recommended 1-3 epochs earlier.
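A crude way to automate this check: given logged loss values, flag the first evaluation at which eval loss has risen for two consecutive checkpoints while training loss kept falling (the rule and thresholds here are illustrative, not a standard early-stopping API):

```python
def divergence_point(train_losses, eval_losses):
    """Index of the first checkpoint where eval loss rose twice in a row
    while train loss kept falling; None if no divergence is seen."""
    for i in range(2, len(eval_losses)):
        eval_rising = eval_losses[i] > eval_losses[i - 1] > eval_losses[i - 2]
        train_falling = train_losses[i] < train_losses[i - 1]
        if eval_rising and train_falling:
            return i
    return None

# Synthetic curves: train loss keeps dropping, eval loss turns around mid-run
train = [2.1, 1.6, 1.2, 0.9, 0.7, 0.5, 0.4]
evals = [2.2, 1.7, 1.4, 1.3, 1.35, 1.45, 1.6]
print(f"overfitting suspected from eval checkpoint: {divergence_point(train, evals)}")
```

Requiring two consecutive rises filters out the normal step-to-step noise in eval loss; a single uptick is rarely meaningful on its own.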

Learning rate curve should show a clean warmup followed by a smooth cosine (or linear) decay. If your learning rate is constant when you expected cosine, or if warmup is missing, there is a configuration bug. Most logging tools (Weights & Biases, TensorBoard) plot this automatically. Verifying it takes seconds and can save hours of wasted training.

Gradient norm measures the magnitude of the gradient vector at each step. A healthy training run shows relatively stable gradient norms (they fluctuate but stay in a bounded range). If the gradient norm suddenly spikes to very large values (10x-100x the normal range), the model is experiencing exploding gradients — a sign that the learning rate is too high, the data has an anomalous batch, or the model has numerical instability. The max_grad_norm=1.0 clipping we set in the training arguments prevents single-step catastrophes, but persistent gradient spikes are still a warning sign.

💡 Weights & Biases (wandb.ai) is the most popular logging tool for fine-tuning experiments. Setting report_to="wandb" in TrainingArguments automatically logs loss, learning rate, gradient norm, GPU memory, throughput (tokens/second), and more — all visible in real-time dashboards. TensorBoard (report_to="tensorboard") is a free alternative that works locally.

Beyond these metrics, what are the most common failure modes, and how do we diagnose them?

  • Loss doesn't decrease: the model isn't learning. This is usually caused by a learning rate that is too low (the updates are too small to matter), or by a data formatting bug where the model isn't seeing the target tokens. A common version of this: the loss mask is applied incorrectly, so the model trains only on prompt tokens (which it can already predict well) and never sees the response tokens it should be learning. Check your chat template and label masking.
  • Loss drops to near-zero very quickly: the model is overfitting aggressively. If training loss drops below 0.1 within the first epoch, the model has likely memorised the training set. This happens with very small datasets (under 1,000 examples), too many epochs, or too high a learning rate. The fix: reduce epochs, reduce the learning rate, or add more diverse data.
  • Loss becomes NaN: numerical instability. This is often caused by FP16 mixed precision — certain operations (especially in attention and layer norms) can produce values outside FP16's representable range ($\pm 65504$). The fix: switch from fp16=True to bf16=True (BFloat16 has a much larger range: $\pm 3.4 \times 10^{38}$, the same as FP32). If NaN persists, reduce the learning rate. If the loss spikes right at the start, increase the warmup ratio.
  • Good loss, bad outputs: low loss does not always mean the model is useful. If the eval loss looks good but the model generates repetitive, degenerate, or off-topic text, the issue is often in the data: inconsistent formatting, contradictory examples, or a mismatch between the training template and the inference template. Always do qualitative evaluation — generate outputs from a set of test prompts and read them yourself.
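
To see FP16's range limit concretely, Python's stdlib struct module can pack half-precision floats (format code "e"); any finite value beyond $\pm 65504$ overflows. This is a sketch of the failure mode, not of training itself — in a real fp16 run the out-of-range value silently becomes inf, which then propagates to NaN in the loss:

```python
import struct

# FP16's largest finite value is 65504. Packing a larger finite float with
# the stdlib half-precision format ("e") raises OverflowError, which is the
# same range limit that turns large activations into inf under fp16=True.
def fits_in_fp16(value):
    try:
        struct.pack("e", value)
        return True
    except OverflowError:
        return False

print(fits_in_fp16(65504.0))  # True  -> exactly the FP16 maximum
print(fits_in_fp16(70000.0))  # False -> out of range for FP16
```

BF16 has no stdlib packing format to demonstrate the contrast, but its 8-bit exponent gives it the same $\pm 3.4 \times 10^{38}$ range as FP32, which is why switching to bf16=True resolves most of these NaN failures.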

A useful diagnostic practice is to log sample generations every N steps — have the trainer generate a response to a few fixed test prompts and log them alongside the loss curves. This gives you qualitative feedback in real time: you can literally read the model's outputs improving (or degrading) over the course of training. TRL's SFTTrainer doesn't do this automatically, but you can implement it with a custom TrainerCallback:

import torch
from transformers import TrainerCallback

class GenerationLogCallback(TrainerCallback):
    """Generates and logs sample outputs every N steps for qualitative monitoring."""

    def __init__(self, tokenizer, test_prompts, every_n_steps=200):
        self.tokenizer = tokenizer
        self.test_prompts = test_prompts
        self.every_n_steps = every_n_steps

    def on_step_end(self, args, state, control, model=None, **kwargs):
        if state.global_step % self.every_n_steps != 0:
            return

        model.eval()
        print(f"\n{'='*60}")
        print(f"Sample generations at step {state.global_step}")
        print(f"{'='*60}")

        for prompt in self.test_prompts:
            inputs = self.tokenizer(prompt, return_tensors="pt").to(model.device)
            with torch.no_grad():
                output = model.generate(
                    **inputs, max_new_tokens=200, temperature=0.7, do_sample=True
                )
            response = self.tokenizer.decode(output[0], skip_special_tokens=True)
            print(f"\nPrompt: {prompt}")
            print(f"Output: {response[len(prompt):]}")

        model.train()

# Usage: add to SFTTrainer
test_prompts = [
    "Explain gradient descent in simple terms.",
    "Write a Python function that reverses a linked list.",
    "What are the pros and cons of microservices?",
]
callback = GenerationLogCallback(tokenizer, test_prompts, every_n_steps=200)
trainer = SFTTrainer(..., callbacks=[callback])

In summary: watch the four metrics (training loss, eval loss, learning rate, gradient norm), set up early stopping or manual checkpoints, and always verify with qualitative generation tests. A well-monitored training run lets you catch problems within minutes, not hours.

Try It Yourself

This notebook implements the complete QLoRA fine-tuning pipeline from this article. It loads Qwen2.5-1.5B in NF4, applies LoRA, trains on 2,000 instruction examples, and merges the result — all on a free T4 GPU in about 15 minutes.

Open in Colab

The notebook tests the model before and after training so you can see the difference first-hand. We encourage you to try changing the LoRA rank, the dataset, or the number of training epochs to build intuition for how each hyperparameter affects the result.

Quiz

Test your understanding of the SFT training loop, its hyperparameters, and training diagnostics.

If you have per_device_train_batch_size=2, gradient_accumulation_steps=8, and 4 GPUs, what is the effective batch size?

What is the main advantage of packing over padding in SFT training?

Gradient checkpointing trades what for what?

During training, eval loss starts increasing while training loss continues to decrease. What does this indicate and what should you do?