From Theory to Training

We know what to fine-tune (full fine-tuning in article 2, LoRA in article 3, QLoRA in article 4, and the broader PEFT landscape in article 5). We know what to train on (instruction datasets in article 6 — formats, templates, masking strategies, and data sourcing). But how do we actually wire all of this together and run training end-to-end? What learning rate do we use? How many epochs? How do we avoid running out of GPU memory? What does a complete training script look like, and how do we know if something is going wrong?

This article answers those questions. We will walk through every hyperparameter that matters for supervised fine-tuning (SFT), explain the data-efficiency tricks that can cut your training time in half, build a complete training script from scratch, and learn how to monitor training so we catch problems before they waste hours of GPU time.

The modern fine-tuning ecosystem revolves around four HuggingFace libraries that snap together like building blocks. transformers (Wolf et al., 2020) provides the base models and tokenizers. PEFT (Mangrulkar et al., 2022) adds parameter-efficient adapters like LoRA and QLoRA on top of those models. TRL (Transformer Reinforcement Learning) (von Werra et al., 2020) provides the SFTTrainer class that handles the training loop, data collation, packing, and chat template formatting. And datasets (Lhoest et al., 2021) handles loading, streaming, and preprocessing data from the HuggingFace Hub or local files.

The high-level pipeline is always the same, regardless of model size or adapter type:

# 1. Load base model and tokenizer
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B", ...)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# 2. Apply PEFT adapter (LoRA / QLoRA)
model = get_peft_model(model, lora_config)

# 3. Load and format instruction dataset
dataset = load_dataset("your-org/your-sft-data")

# 4. Configure SFTTrainer with hyperparameters
trainer = SFTTrainer(model, args=training_args, train_dataset=dataset, ...)

# 5. Train
trainer.train()

# 6. Save adapter weights (or merge into base model)
trainer.save_model("./sft-adapter")

Each of these steps has decisions to make and pitfalls to avoid. The rest of this article unpacks them one by one.

What Hyperparameters Actually Matter?

Fine-tuning a pre-trained language model is not the same as training one from scratch. The model already sits in a good region of parameter space, and our job is to nudge it — not shove it. That means the hyperparameter choices are different from pre-training, and getting them wrong can either waste compute (too conservative) or destroy the model's pre-trained knowledge (too aggressive). What are the key knobs, and where should we set them?

Learning rate is the single most important hyperparameter. It controls how large each gradient update step is. For SFT on a pre-trained model, typical values range from 1e-5 to 2e-4. For full fine-tuning this is roughly 10-100x smaller than typical pre-training peak rates (which can reach 1e-3 for smaller models), because the weights are already close to useful values and large updates would knock them out of place. When using LoRA, the learning rate can sit at the higher end of this range (1e-4 to 2e-4) because only a small number of adapter parameters are being updated and the base model weights are frozen. For full fine-tuning, stay closer to the lower end (1e-5 to 5e-5).

📌 If your learning rate is too high, you will see loss spikes, NaN values, or rapid overfitting. If it is too low, training will be painfully slow and the model may barely change from the base. When in doubt, start with 2e-5 for full fine-tuning or 1e-4 for LoRA, and adjust based on the loss curve.

Effective batch size determines how many examples contribute to each gradient update. Larger batches produce smoother, more stable gradients (each update averages over more examples), but require more memory. In distributed or memory-constrained setups, we rarely use a single large batch. Instead, we combine three factors:

$$B_{\text{eff}} = b_{\text{micro}} \times G \times N_{\text{gpu}}$$

Where $b_{\text{micro}}$ is the micro-batch size (number of examples per GPU per forward pass — the largest batch that physically fits in GPU memory), $G$ is the number of gradient accumulation steps (how many micro-batches of gradients we sum up before performing a weight update), and $N_{\text{gpu}}$ is the number of GPUs performing data-parallel training. The micro-batch size is constrained by memory; gradient accumulation and GPU count let us scale the effective batch size without increasing per-GPU memory.

What happens at the boundaries? A very small $B_{\text{eff}}$ (say, 1 or 2) means each update is based on very few examples, so the gradient direction is noisy — the model zigzags through parameter space and may oscillate instead of converging. A very large $B_{\text{eff}}$ (say, 512 or 1024) produces smooth gradients but each step is expensive and, for small datasets, you may take so few steps per epoch that the model doesn't have enough gradient updates to learn. For SFT, effective batch sizes in the range of 16 to 128 are common. A good starting point is $b_{\text{micro}} = 4$, $G = 4$, giving $B_{\text{eff}} = 16$ on a single GPU.
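To see how these factors interact, the sketch below computes the effective batch size and the approximate number of optimizer steps per epoch for a few configurations. The dataset size of 10,000 examples and the helper names are illustrative assumptions, and real trainers may round the step count slightly differently.

```python
import math

def effective_batch(micro_batch, grad_accum_steps, num_gpus=1):
    """B_eff = b_micro * G * N_gpu."""
    return micro_batch * grad_accum_steps * num_gpus

def optimizer_steps_per_epoch(num_examples, b_eff):
    """Approximate number of weight updates in one pass over the data."""
    return math.ceil(num_examples / b_eff)

num_examples = 10_000  # hypothetical dataset size
for micro, accum, gpus in [(4, 4, 1), (4, 8, 1), (4, 8, 4), (8, 16, 8)]:
    b_eff = effective_batch(micro, accum, gpus)
    steps = optimizer_steps_per_epoch(num_examples, b_eff)
    print(f"b_micro={micro}, G={accum}, N_gpu={gpus} -> B_eff={b_eff}, steps/epoch={steps}")
```

With only 10,000 examples, the largest configuration leaves about ten updates per epoch, which is the "too few steps to learn" situation described above.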

Number of epochs controls how many times the model sees the entire dataset. For SFT, 1 to 3 epochs is the standard range. With small datasets (under 10,000 examples), even 2-3 epochs can cause overfitting — the model memorises the training examples and loses the ability to generalise. With larger datasets (100,000+ examples), you may need only a single epoch. The LIMA paper (Zhou et al., 2023) we discussed in article 6 showed strong results with just 1,000 examples over 3 epochs, but they used very high-quality, diverse data. The rule of thumb: start with 1 epoch, try 2-3 only if the model is clearly undertrained (eval loss is still dropping, the model hasn't converged).

Maximum sequence length (often called max_seq_length) determines the longest input the model will process during training. Any example longer than this is truncated; any shorter is padded (or packed, as we'll discuss next). This should be set to match or slightly exceed the longest examples in your dataset. Setting it too high wastes memory on padding tokens, while setting it too short silently chops off the ends of your training examples, which can corrupt your targets. Common values are 512, 1024, or 2048, depending on the task.
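One way to choose this value is to measure how much data each candidate length would cut off. The sketch below assumes you have already tokenised your dataset and collected per-example token counts; the counts shown and the helper name `truncation_report` are made up for illustration.

```python
def truncation_report(token_lengths, max_seq_length):
    """Return (fraction of examples truncated, fraction of tokens lost)."""
    truncated = sum(1 for n in token_lengths if n > max_seq_length)
    lost = sum(max(0, n - max_seq_length) for n in token_lengths)
    return truncated / len(token_lengths), lost / sum(token_lengths)

# Hypothetical token counts; in practice, measure these with your tokenizer
token_lengths = [120, 340, 95, 800, 1500, 260, 2300, 410, 150, 600]

for max_len in (512, 1024, 2048):
    frac_examples, frac_tokens = truncation_report(token_lengths, max_len)
    print(f"max_seq_length={max_len}: {frac_examples:.0%} of examples truncated, "
          f"{frac_tokens:.1%} of tokens lost")
```

If a small increase in length rescues most of the truncated examples, it is usually worth the extra memory; if only a handful of outliers are affected, truncating or dropping them may be the cheaper choice.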

Warmup ratio specifies what fraction of total training steps should use a gradually increasing learning rate before the main schedule takes over. Typical values are 3% to 10% of total steps. Why does warmup help? At the very start of fine-tuning, the model encounters gradients from the new distribution (instruction data) that may be quite different from what the pre-trained weights expect. A full-sized learning rate applied to these early, noisy gradients can cause sudden large updates that destabilise the model — loss spikes, gradient explosions, or even NaN values. Warmup starts with a near-zero learning rate, lets the optimizer's moment estimates (in Adam, the running mean and variance of gradients) calibrate to the new data distribution, and then gradually ramps up to the target rate. By the time the full learning rate kicks in, the optimizer has a stable picture of the gradient landscape.
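As a rough illustration, a linear warmup can be sketched in a few lines. The run length of 1,000 steps and the function name `warmup_lr` are assumptions for this example; library schedulers implement the same idea with their own step conventions.

```python
def warmup_lr(step, warmup_steps, peak_lr):
    """Linear warmup: ramp from 0 to peak_lr over warmup_steps, then hold at peak_lr."""
    if warmup_steps <= 0 or step >= warmup_steps:
        return peak_lr
    return peak_lr * step / warmup_steps

total_steps = 1000                               # hypothetical run length
warmup_ratio = 0.05
warmup_steps = int(total_steps * warmup_ratio)   # 50 steps
peak_lr = 1e-4

for step in (0, 10, 25, 49, 50, 100):
    print(f"step {step:>3}: lr = {warmup_lr(step, warmup_steps, peak_lr):.2e}")
```

After step 50 the decay schedule would take over from the constant value returned here.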

Weight decay is a regularisation technique that adds a penalty proportional to the magnitude of the weights at each update, gently pushing them toward zero. For SFT, typical values are 0.01 to 0.1 . It acts as a brake against overfitting: without it, the model can develop very large weight values that perfectly fit the training data but generalise poorly. With LoRA, weight decay applies only to the adapter parameters (the base model is frozen), so its effect is more contained. A value of 0.01 is a safe default.

Learning rate schedule determines how the learning rate changes over the course of training. After warmup, should it stay constant, drop linearly, or follow some curve? The dominant choice for fine-tuning (and pre-training) is cosine decay — the learning rate follows a smooth cosine curve from its peak value down to a minimum:

$$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{t}{T}\pi\right)\right)$$

Let's unpack every symbol. $\eta_t$ is the learning rate at step $t$. $\eta_{\max}$ is the peak learning rate — the value reached at the end of warmup. $\eta_{\min}$ is the minimum learning rate at the end of training, typically set to $0.1 \times \eta_{\max}$ or simply $0$. $t$ is the current training step (after warmup). $T$ is the total number of training steps (excluding warmup). And $\pi$ is just the mathematical constant (~3.14159) that makes cosine complete half a cycle over the training run.

The key insight is how the $\cos$ term drives the schedule. At step $t = 0$ (start of training after warmup), we compute $\cos(0) = 1$, so the parenthesised term becomes $(1 + 1) / 2 = 1$, giving $\eta_0 = \eta_{\min} + (\eta_{\max} - \eta_{\min}) = \eta_{\max}$. At step $t = T$ (end of training), we compute $\cos(\pi) = -1$, so the parenthesised term becomes $(1 + (-1)) / 2 = 0$, giving $\eta_T = \eta_{\min}$. In between, the cosine sweeps smoothly from 1 to -1, producing a curve that decreases slowly at first (the model is still learning, keep the rate high), then more rapidly in the middle, and finally slowly again as it approaches $\eta_{\min}$ (gentle landing). This shape is well-suited to fine-tuning: the model makes its biggest adjustments early when the gradient signal is strongest, and takes increasingly cautious steps as it converges.

💡 Why cosine over linear decay? Cosine keeps the learning rate higher for longer during the first half of training, then drops more steeply. Empirically, this produces slightly better results on most benchmarks compared to linear decay, which drops at a constant rate. The difference is modest, but cosine has become the de facto standard.
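The difference between the two shapes is easy to check numerically. This small sketch assumes both schedules decay to zero ($\eta_{\min} = 0$) and compares the fraction of the peak rate that remains at a few points in training; the function names are illustrative.

```python
import math

def cosine_fraction(progress):
    """Fraction of peak LR under cosine decay to zero; progress is t/T in [0, 1]."""
    return 0.5 * (1 + math.cos(math.pi * progress))

def linear_fraction(progress):
    """Fraction of peak LR under linear decay to zero."""
    return 1.0 - progress

for progress in (0.0, 0.25, 0.5, 0.75, 1.0):
    c, l = cosine_fraction(progress), linear_fraction(progress)
    print(f"{progress:>4.0%} of training: cosine {c:.1%} of peak, linear {l:.1%} of peak")
```

Cosine sits above linear throughout the first half, the two meet at the midpoint, and cosine sits below linear in the second half.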

Let's make this concrete. The code below computes the cosine schedule at 11 evenly spaced steps (0%, 10%, ..., 100% of training) so we can see exactly how the learning rate evolves. We use $\eta_{\max} = 2 \times 10^{-4}$, $\eta_{\min} = 0$ (common for SFT), and $T = 1000$ total steps:

import math

eta_max = 2e-4   # peak learning rate (after warmup)
eta_min = 0.0    # minimum learning rate
T = 1000         # total training steps

def cosine_lr(t, T, eta_max, eta_min):
    """Cosine decay schedule: smoothly decays from eta_max to eta_min."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(t / T * math.pi))

# Compute LR at 11 evenly-spaced steps
rows = []
for i in range(11):
    t = int(i * T / 10)
    lr = cosine_lr(t, T, eta_max, eta_min)
    pct = t / T * 100
    rows.append([
        str(t),
        f"{pct:.0f}%",
        f"{lr:.6f}",
        f"{lr / eta_max * 100:.1f}%"
    ])

# Print the schedule as a plain-text table
headers = ["Step (t)", "Progress", "Learning Rate", "% of Peak"]
print("  ".join(f"{h:>13}" for h in headers))
for row in rows:
    print("  ".join(f"{cell:>13}" for cell in row))

print()
print(f"Peak LR (eta_max): {eta_max}")
print(f"Min LR (eta_min):  {eta_min}")
print(f"Total steps (T):   {T}")
print()
print("Notice: LR stays above 75% of peak for the first ~33% of training,")
print("crosses 50% at the midpoint, and gently approaches 0 at the end.")

Packing vs Padding: How Do We Handle Variable-Length Examples?

Instruction datasets are messy. Some examples are 50 tokens (a short question with a one-word answer), others are 2,000 tokens (a complex reasoning chain). But GPUs are most efficient when processing fixed-size tensors — every sequence in a batch must be the same length. How do we reconcile variable-length data with fixed-size batches? There are two strategies, and the choice between them can easily make a 2x difference in training speed.

Padding is the naive approach. Take the longest sequence in a batch (or the configured max_seq_length), and fill every shorter sequence with a special [PAD] token until they all match. This is simple, and every sequence is nicely isolated, with no risk of one example's attention leaking into another. But it's wasteful. If your max_seq_length is 2048 and the average example is 200 tokens, then roughly 90% of every batch is padding. The GPU dutifully computes attention over pad tokens, computes gradients for pad positions, and produces outputs that are immediately masked out and thrown away. For short-example datasets (chat, QA, classification), padding waste can reach 60-90% of total compute.

Packing solves this by concatenating multiple examples into a single sequence, separated by the model's end-of-sequence (EOS) token, until the sequence is full. If the average example is 200 tokens and max_seq_length = 2048, we can fit roughly 10 examples into one packed sequence. Almost every token position now contains a real training token, so very little compute is wasted on padding.

# Illustrate the difference between padding and packing
max_len = 2048
examples = [180, 95, 310, 150, 220, 60, 400, 130, 275, 200]  # token lengths

# Padding: each example becomes its own sequence of length max_len
padded_tokens = len(examples) * max_len
real_tokens_padded = sum(examples)
pad_tokens = padded_tokens - real_tokens_padded
pad_waste = pad_tokens / padded_tokens * 100

print("=== PADDING ===")
print(f"Examples: {len(examples)}")
print(f"Total positions: {padded_tokens:,} ({len(examples)} x {max_len})")
print(f"Real tokens:     {real_tokens_padded:,}")
print(f"Pad tokens:      {pad_tokens:,}")
print(f"Waste:           {pad_waste:.1f}%")
print()

# Packing: concatenate examples into sequences of max_len
packed_seqs = []
current_seq = 0
seq_count = 0
for length in examples:
    if current_seq + length > max_len:
        packed_seqs.append(current_seq)
        current_seq = length
        seq_count += 1
    else:
        current_seq += length
if current_seq > 0:
    packed_seqs.append(current_seq)
    seq_count += 1

packed_total = seq_count * max_len
packed_real = sum(examples)
packed_waste = (packed_total - packed_real) / packed_total * 100

print("=== PACKING ===")
print(f"Examples: {len(examples)} packed into {seq_count} sequences")
print(f"Total positions: {packed_total:,} ({seq_count} x {max_len})")
print(f"Real tokens:     {packed_real:,}")
print(f"Remaining waste: {packed_total - packed_real:,}")
print(f"Waste:           {packed_waste:.1f}%")
print()
print(f"Speedup: {padded_tokens / packed_total:.1f}x fewer total positions to process")

There is one important complication with packing: attention must not cross example boundaries. In a packed sequence containing examples A, B, and C concatenated together, example B should not attend to tokens from example A or C — each example must believe it is alone in the sequence. Without this isolation, the model would learn spurious dependencies between unrelated examples that happen to be packed next to each other.

The standard solution is to use a block-diagonal attention mask (also called a packing mask or sample mask). Instead of the usual causal mask (lower-triangular, allowing each position to attend to all previous positions), we restrict the causal mask to blocks along the diagonal, one block per example. Position $i$ can attend to position $j$ only if $j \le i$ and both positions belong to the same example. TRL's SFTTrainer packs examples and inserts EOS delimiters when you set packing=True, but whether it also isolates attention between packed examples depends on the TRL version and attention implementation (older releases simply concatenated examples under an ordinary causal mask), so check the documentation for the version you have installed.
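The masking rule can be sketched in plain Python. This is an illustration of the rule only, not how any particular library builds its masks; the function name `packed_attention_mask` and the toy example lengths are assumptions.

```python
def packed_attention_mask(example_lengths):
    """Build a block-diagonal causal mask for one packed sequence.

    mask[i][j] is True when position i may attend to position j:
    both positions are in the same example and j <= i.
    """
    # example_ids[p] = index of the example that owns position p
    example_ids = [idx for idx, n in enumerate(example_lengths) for _ in range(n)]
    total = len(example_ids)
    return [
        [example_ids[i] == example_ids[j] and j <= i for j in range(total)]
        for i in range(total)
    ]

mask = packed_attention_mask([3, 2])  # two toy examples: 3 tokens and 2 tokens
for row in mask:
    print(" ".join("1" if allowed else "." for allowed in row))
```

The printed grid shows two lower-triangular blocks: the first token of the second example attends only to itself, exactly as it would if it started a fresh sequence.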

💡 Packing is especially impactful for short-example datasets like chat (average ~150-300 tokens), QA (average ~100-200 tokens), and classification fine-tuning (average ~50-100 tokens). If your examples are already near max_seq_length (long document summarisation, for instance), packing provides little benefit because there is minimal padding to eliminate.

Gradient Accumulation and Gradient Checkpointing

What if we want a large effective batch size for stable gradients, but our GPU can only fit 2 examples at a time? And what if even those 2 examples cause near-OOM conditions because the model's intermediate activations eat all the memory? These are the two most common memory bottlenecks in fine-tuning, and they have different solutions that are often confused. Let's disentangle them.

Gradient accumulation addresses the batch-size problem. Instead of computing one forward and backward pass on a batch of 32 examples (which might not fit in memory), we compute 8 sequential forward-backward passes on micro-batches of 4 examples each. After each micro-batch, the gradients are accumulated (summed) into the same gradient buffers without updating the weights. After all 8 micro-batches, we perform a single optimizer step using the accumulated gradients. Provided each micro-batch loss is scaled by $1/G$ (which the HuggingFace Trainer does for you), the result is mathematically equivalent to training with a batch of 32, but we only ever hold 4 examples in memory at once.

The tradeoff is straightforward: gradient accumulation gives us the gradient quality of a large batch with the memory cost of a small batch, but each optimizer step now requires $G$ sequential forward-backward passes. If $G = 8$, each optimizer step takes roughly 8x as long as a single micro-batch step. Total compute per example is unchanged, so throughput is close to what a true batch of 32 would achieve (if it fit in memory), minus whatever parallel efficiency the larger batch would have gained. Since the alternative is "doesn't fit in memory at all," this is usually a good trade.
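The equivalence is easy to verify on a toy problem. The sketch below uses a one-parameter linear model with mean squared error and made-up data; it assumes equally sized micro-batches, each scaled by $1/G$ before being summed.

```python
def mse_grad(w, batch):
    """Gradient of mean squared error for the 1-parameter model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

w = 0.5
data = [(1.0, 2.0), (2.0, 3.9), (3.0, 6.1), (4.0, 8.2),
        (5.0, 9.8), (6.0, 12.1), (7.0, 14.0), (8.0, 15.9)]

# Full batch: one gradient over all 8 examples
full_grad = mse_grad(w, data)

# Accumulation: 4 micro-batches of 2, each scaled by 1/G and summed
micro_batch_size = 2
G = len(data) // micro_batch_size
accumulated = 0.0
for k in range(G):
    micro = data[k * micro_batch_size:(k + 1) * micro_batch_size]
    accumulated += mse_grad(w, micro) / G   # no weight update yet

print(f"full-batch gradient:  {full_grad:.6f}")
print(f"accumulated gradient: {accumulated:.6f}")
```

The two values agree up to floating-point rounding. With losses averaged over tokens rather than examples, micro-batches of different lengths need more careful normalisation to keep this equivalence.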

Gradient checkpointing (also called activation checkpointing) addresses a completely different memory bottleneck: the intermediate activations stored during the forward pass. Normally, the forward pass computes and saves activations at every layer (attention outputs, FFN intermediates, layer norms, etc.), because the backward pass needs them to compute gradients. For a model like Llama-3.1-8B processing 2048-token sequences, these activations can consume tens of gigabytes at moderate batch sizes, often more than the model weights themselves.

With gradient checkpointing enabled, the forward pass discards most of these intermediate activations instead of storing them. During the backward pass, when a discarded activation is needed, the relevant portion of the forward pass is re-executed on the fly to recompute it. This trades compute for memory: the model uses roughly 60% less activation memory but training is about 30% slower because of the recomputation overhead (Chen et al., 2016). The exact figures depend on the model and on which activations are kept. For fine-tuning large models on consumer GPUs (24GB or 48GB VRAM), gradient checkpointing is almost always essential: it's the difference between the training fitting in memory or not.
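The store-and-recompute idea can be illustrated on a toy chain of scalar layers. This is only a sketch: the tanh layers, the segment size of 4, and the helper names are assumptions for illustration, and real frameworks (for example torch.utils.checkpoint) apply the same idea to tensors and modules.

```python
import math

def layer(w, x):
    return math.tanh(w * x)

def layer_grad(w, x):
    """d layer / d x, computed from the layer's input."""
    return w * (1 - math.tanh(w * x) ** 2)

def backward_full(weights, x):
    """Standard backprop: store the input of every layer."""
    inputs = []
    for w in weights:
        inputs.append(x)
        x = layer(w, x)
    grad = 1.0
    for w, a in zip(reversed(weights), reversed(inputs)):
        grad *= layer_grad(w, a)
    return grad, len(inputs)

def backward_checkpointed(weights, x, segment):
    """Store only every `segment`-th layer input; recompute the rest during backward."""
    checkpoints = {}
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints[i] = x
        x = layer(w, x)
    grad = 1.0
    peak_stored = len(checkpoints)
    for start in sorted(checkpoints, reverse=True):
        seg_weights = weights[start:start + segment]
        # Recompute this segment's inputs from its checkpoint
        a, seg_inputs = checkpoints[start], []
        for w in seg_weights:
            seg_inputs.append(a)
            a = layer(w, a)
        # -1 because the segment's first input is the checkpoint itself
        peak_stored = max(peak_stored, len(checkpoints) + len(seg_inputs) - 1)
        for w, a_in in zip(reversed(seg_weights), reversed(seg_inputs)):
            grad *= layer_grad(w, a_in)
    return grad, peak_stored

weights = [0.9, 1.1, 0.8, 1.2, 0.7, 1.3, 0.95, 1.05,
           0.85, 1.15, 0.75, 1.25, 0.9, 1.1, 0.8, 1.2]
g_full, stored_full = backward_full(weights, 0.3)
g_ckpt, stored_ckpt = backward_checkpointed(weights, 0.3, segment=4)
print(f"gradient (full):         {g_full:.8f}, activations stored: {stored_full}")
print(f"gradient (checkpointed): {g_ckpt:.8f}, peak activations stored: {stored_ckpt}")
```

Both versions return the same gradient, but the checkpointed one holds fewer activations at any moment and pays for it with one extra forward pass through each segment.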

📌 Don't confuse these two: gradient accumulation saves memory by reducing the batch size per forward pass (solving the "batch too large" problem), while gradient checkpointing saves memory by discarding intermediate activations and recomputing them later (solving the "activations too large" problem). Neither reduces the memory taken by the model weights themselves; that is what quantisation is for. They are complementary: you can and often should use both simultaneously.

In HuggingFace's TrainingArguments, both are simple flags:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./output",

    # ── Gradient Accumulation ──────────────────────────────────
    per_device_train_batch_size=4,      # micro-batch: 4 examples per GPU
    gradient_accumulation_steps=8,       # accumulate 8 micro-batches
    # Effective batch = 4 * 8 = 32 per GPU (without needing 32-example memory)

    # ── Gradient Checkpointing ────────────────────────────────
    gradient_checkpointing=True,         # ~60% less activation memory, ~30% slower
    # Without this, a 7B model on a 24GB GPU will likely OOM on sequences > 1024

    # Other essentials
    learning_rate=1e-4,
    num_train_epochs=2,
    bf16=True,                           # bfloat16 mixed precision
    logging_steps=10,
    save_strategy="steps",
    save_steps=200,
)

A practical rule of thumb: on a single 24GB GPU (e.g. RTX 3090 / 4090) fine-tuning a 7B parameter model with QLoRA, you can typically fit a micro-batch of 2-4 with gradient checkpointing enabled and sequences of 1024-2048 tokens. Use gradient accumulation steps of 4-8 to reach an effective batch of 16-32. On a 48GB GPU (A6000 / A40), you can increase the micro-batch to 4-8 and may get away without gradient checkpointing for shorter sequences.

The Complete Training Script

Now let's put everything together into a single, end-to-end SFT training script. This script fine-tunes a Llama 3.1 8B model using QLoRA (4-bit quantisation with LoRA adapters), demonstrating every concept we've discussed: hyperparameter configuration, packing, gradient accumulation, gradient checkpointing, and proper model saving. Exact argument names vary between library releases, so treat it as a template to check against the versions you have installed. We'll walk through each section after the code.

import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer

# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# 1. QUANTISATION CONFIG — load base model in 4-bit (QLoRA)
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                    # 4-bit NormalFloat quantisation
    bnb_4bit_quant_type="nf4",            # NF4 data type (optimal for Gaussians)
    bnb_4bit_compute_dtype=torch.bfloat16,# compute in BF16 for stability
    bnb_4bit_use_double_quant=True,       # quantise the quantisation constants too
)

# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# 2. LOAD BASE MODEL AND TOKENIZER
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
model_name = "meta-llama/Llama-3.1-8B"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,       # apply 4-bit quantisation
    device_map="auto",                    # spread layers across available GPUs
    attn_implementation="flash_attention_2",  # faster, memory-efficient attention
)
model = prepare_model_for_kbit_training(model)  # freeze base weights, prepare for k-bit training

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token     # many models lack a pad token
tokenizer.padding_side = "right"              # pad on right for causal LM

# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# 3. LORA CONFIG — which layers to adapt and how
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank matrices
    lora_alpha=32,                        # scaling: adapter update is multiplied by alpha/r
    lora_dropout=0.05,                    # dropout on adapter activations
    bias="none",                          # don't train bias terms
    task_type="CAUSAL_LM",                # task type: causal language modelling
    target_modules=[                      # which weight matrices to adapt
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",       # FFN projections
    ],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Typical output: "trainable params: 41,943,040 || all params: 8,072,204,288 || 0.52%"

# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# 4. LOAD AND FORMAT DATASET
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

# SFTTrainer expects a "messages" column in chat format:
# [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]
# ultrachat_200k already has this format. If yours doesn't, map it:
#
# def format_example(example):
#     return {"messages": [
#         {"role": "user", "content": example["instruction"]},
#         {"role": "assistant", "content": example["output"]},
#     ]}
# dataset = dataset.map(format_example)

# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# 5. TRAINING ARGUMENTS — every hyperparameter in one place
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
training_args = TrainingArguments(
    output_dir="./llama3-sft-qlora",
    num_train_epochs=2,                   # 2 epochs (watch for overfitting)
    per_device_train_batch_size=4,        # micro-batch per GPU
    gradient_accumulation_steps=4,        # effective batch = 4 * 4 = 16
    gradient_checkpointing=True,          # save activation memory
    learning_rate=1e-4,                   # LoRA sweet spot
    lr_scheduler_type="cosine",           # cosine decay after warmup
    warmup_ratio=0.05,                    # 5% warmup
    weight_decay=0.01,                    # mild regularisation
    bf16=True,                            # bfloat16 mixed precision
    logging_steps=10,                     # log loss every 10 steps
    save_strategy="steps",
    save_steps=500,                       # checkpoint every 500 steps
    save_total_limit=3,                   # keep only last 3 checkpoints
    eval_strategy="steps",                # called evaluation_strategy in older transformers releases
    eval_steps=500,                       # evaluate every 500 steps
    max_grad_norm=1.0,                    # gradient clipping
    report_to="wandb",                    # log to Weights & Biases
    seed=42,
)

# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# 6. CONFIGURE AND RUN SFTTrainer
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
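# Held-out split so the evaluation settings above have data to evaluate on
eval_dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="test_sft")

# Note: recent TRL releases expect packing and the maximum sequence length to be set
# on trl.SFTConfig instead of being passed here; check the version you have installed.
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    eval_dataset=eval_dataset,            # required when evaluation is enabled
    processing_class=tokenizer,
    packing=True,                         # pack multiple examples per sequence
    max_seq_length=2048,                  # maximum packed sequence length
)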

trainer.train()

# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# 7. SAVE — adapter only (small) or merged (full model)
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# Option A: Save just the adapter (~80 MB)
trainer.save_model("./llama3-sft-qlora/adapter")

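# Option B: Merge adapter into a 16-bit copy of the base model and save full weights.
# Merging into the 4-bit model used for training would bake quantisation error into
# the merged weights, so reload the base in BF16 first (needs ~16 GB of memory).
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
merged = PeftModel.from_pretrained(base, "./llama3-sft-qlora/adapter").merge_and_unload()
merged.save_pretrained("./llama3-sft-qlora/merged")
tokenizer.save_pretrained("./llama3-sft-qlora/merged")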

Let's walk through the key sections. Section 1 (Quantisation Config) sets up 4-bit NormalFloat quantisation, which is the "Q" in QLoRA. The bnb_4bit_compute_dtype=torch.bfloat16 flag is critical — it means that even though the weights are stored in 4-bit, all matrix multiplications are performed in BFloat16 for numerical stability. The bnb_4bit_use_double_quant=True option applies a second round of quantisation to the quantisation constants themselves, saving an additional ~0.4 bits per parameter (a small but free savings).

Section 2 (Model Loading) loads the model with the quantisation config applied. The device_map="auto" flag tells Accelerate to automatically distribute model layers across available GPUs (or CPU RAM if GPU memory is insufficient). prepare_model_for_kbit_training() freezes the base model's parameters and applies the housekeeping needed for stable k-bit training (such as upcasting the remaining non-quantised parameters and making inputs require gradients so that gradient checkpointing works); the trainable adapter layers are attached in the next step. Setting the pad token is essential: many models (including Llama) don't ship with a pad token defined, and padding-based batching fails without one.

Section 3 (LoRA Config) specifies which weight matrices in the model receive LoRA adapters. We target all the attention projections ($W_Q$, $W_K$, $W_V$, $W_O$) and the feed-forward projections ($W_{\text{gate}}$, $W_{\text{up}}$, $W_{\text{down}}$). Targeting more modules increases trainable parameters (from ~0.2% to ~0.5% of total) but captures more nuanced adaptations. The lora_alpha=32 with r=16 gives a scaling factor of $\alpha / r = 2$: the adapter's low-rank update is multiplied by 2 before it is added to the frozen layer's output, which acts much like a larger learning rate on the adapter. This is a common setting that works well in practice.
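The trainable-parameter count quoted in the script can be reproduced by hand. Each LoRA adapter on a weight matrix with `in_features` inputs and `out_features` outputs adds two matrices, $A$ of shape $r \times \text{in}$ and $B$ of shape $\text{out} \times r$. The sketch below assumes the published Llama 3.1 8B dimensions (hidden size 4096, 8 key-value heads of dimension 128, FFN size 14336, 32 layers); the variable and function names are illustrative.

```python
# Llama 3.1 8B shapes (assumed from the published model config)
hidden = 4096          # model dimension
kv_dim = 1024          # 8 KV heads x 128 head dim (grouped-query attention)
intermediate = 14336   # FFN inner dimension
num_layers = 32
r = 16                 # LoRA rank, as in the config above

# (in_features, out_features) for each targeted projection
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

def lora_params(in_features, out_features, rank):
    """A is (rank x in_features), B is (out_features x rank)."""
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, r) for i, o in shapes.values())
total = per_layer * num_layers
print(f"trainable LoRA params: {total:,}")
print(f"adapter size at 2 bytes/param: {total * 2 / 1e6:.0f} MB")
```

The total matches the 41,943,040 trainable parameters shown in the script's comment, and at 2 bytes per parameter that comes to roughly 84 MB.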

Section 5 (Training Arguments) is where all the hyperparameters from the previous section come together. Note the max_grad_norm=1.0 — this clips gradients to a maximum L2 norm of 1.0, which prevents any single batch with unusually large gradients from destabilising training. The save_total_limit=3 keeps only the 3 most recent checkpoints, which is important because even adapter checkpoints accumulate and can fill disk space over long runs.

Section 7 (Saving) shows two options. Option A saves only the adapter weights (~80 MB for the config above when stored in 16-bit precision, about twice that in FP32). At inference time, you load the base model and then load the adapter on top with PeftModel.from_pretrained(). This is memory-efficient and lets you swap adapters easily. Option B merges the adapter weights back into the base model, producing a standalone model with the same architecture as the original. Because the base model was quantised to 4-bit for training, the merge is best performed against a copy of the base reloaded in 16-bit precision; merging into the 4-bit weights directly bakes quantisation error into the result. A merged model is simpler for deployment (one model, no adapter loading code) but loses the ability to swap adapters, and it is full-sized (~16 GB for an 8B model in BF16).

💡 The ultrachat_200k dataset used above is a high-quality, multi-turn chat dataset with ~200,000 conversations. It is one of the most popular choices for SFT because of its diversity and quality. For your own projects, any dataset in the HuggingFace chat format (a "messages" column with role/content pairs) will work with SFTTrainer.

How Do We Know If Training Is Working?

A training script can run for hours without errors and still produce a useless model. The loss goes down, the GPU utilisation looks good, but the model outputs gibberish — or worse, it outputs fluent text that is subtly wrong. How do we monitor training to catch problems early, before they waste our compute budget?

Training loss is the primary signal. It should decrease steadily over the course of training — quickly at first (the model is learning the new format), then more slowly as it converges. For SFT, training loss typically starts at 1.5-2.5 (depending on the model and dataset) and settles somewhere between 0.5-1.2 after a few hundred steps. Sudden spikes in training loss indicate a problem: the learning rate may be too high, a corrupted batch may have entered the pipeline, or numerical instability may be creeping in.

Evaluation loss (computed on a held-out validation set) is the overfitting detector. Early in training, eval loss should track training loss — both decreasing together. The moment eval loss starts increasing while training loss continues to decrease, the model is overfitting: it is memorising the training set rather than learning generalisable patterns. This divergence is your signal to stop training. For SFT, this often happens after 1-2 epochs on small datasets, which is why we recommended 1-3 epochs earlier.
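The divergence check can be automated with a simple patience rule. The sketch below is a minimal illustration using a made-up eval log; the function name and the patience of 2 are assumptions. The transformers library also ships an EarlyStoppingCallback that serves a similar purpose inside the Trainer.

```python
def first_overfit_step(eval_steps, eval_losses, patience=2):
    """Return the step of the best eval loss once it has failed to improve
    for `patience` consecutive evaluations, or None if that never happens."""
    best_loss, best_step, bad_evals = float("inf"), None, 0
    for step, loss in zip(eval_steps, eval_losses):
        if loss < best_loss:
            best_loss, best_step, bad_evals = loss, step, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return best_step
    return None

# Hypothetical eval log: loss improves until step 1500, then rises
steps = [500, 1000, 1500, 2000, 2500, 3000]
eval_losses = [1.32, 1.18, 1.12, 1.14, 1.17, 1.21]
print(first_overfit_step(steps, eval_losses))
```

The returned step identifies the checkpoint worth keeping, which is one reason to save checkpoints on the same cadence as evaluation.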

Learning rate curve should show a clean warmup followed by a smooth cosine (or linear) decay. If your learning rate is constant when you expected cosine, or if warmup is missing, there is a configuration bug. Most logging tools (Weights & Biases, TensorBoard) plot this automatically. Verifying it takes seconds and can save hours of wasted training.

Gradient norm measures the magnitude of the gradient vector at each step. A healthy training run shows relatively stable gradient norms (they fluctuate but stay in a bounded range). If the gradient norm suddenly spikes to very large values (10x-100x the normal range), the model is experiencing exploding gradients — a sign that the learning rate is too high, the data has an anomalous batch, or the model has numerical instability. The max_grad_norm=1.0 clipping we set in the training arguments prevents single-step catastrophes, but persistent gradient spikes are still a warning sign.
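Clipping by global norm is simple to sketch: compute the L2 norm over all gradients together and, if it exceeds the threshold, scale every gradient by the same factor. The toy gradient vector and the function name below are assumptions; this follows the same idea as the clipping that max_grad_norm requests, applied to a flat list of numbers instead of tensors.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale all gradients so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm

grads = [3.0, -4.0, 12.0]          # toy gradient vector, L2 norm = 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(f"norm before: {norm_before:.2f}, after: {norm_after:.2f}")
```

Because every component is scaled by the same factor, the direction of the update is preserved and only its length changes. The norm logged before clipping is the value to watch for spikes.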

💡 Weights & Biases (wandb.ai) is the most popular logging tool for fine-tuning experiments. Setting report_to="wandb" in TrainingArguments automatically logs loss, learning rate, gradient norm, GPU memory, throughput (tokens/second), and more — all visible in real-time dashboards. TensorBoard (report_to="tensorboard") is a free alternative that works locally.

Beyond these metrics, what are the most common failure modes, and how do we diagnose them?

  • Loss doesn't decrease: the model isn't learning. This is usually caused by a learning rate that is too low (the updates are too small to matter), or by a data formatting bug where the model isn't seeing the target tokens. A common version of this: the loss mask is applied incorrectly so the model trains only on prompt tokens (which it can already predict well) and never sees the response tokens it should be learning. Check your chat template and label masking.
  • Loss drops to near-zero very quickly: the model is overfitting aggressively. If training loss drops below 0.1 within the first epoch, the model has likely memorised the training set. This happens with very small datasets (under 1,000 examples), too many epochs, or too high a learning rate. The fix: reduce epochs, reduce learning rate, or add more diverse data.
  • Loss becomes NaN: numerical instability. This is often caused by FP16 mixed precision: certain operations (especially in attention and layer norms) can produce values outside FP16's representable range ($\pm 65504$), which overflow to infinity and then propagate as NaN. The fix: switch from fp16=True to bf16=True on hardware that supports it (BFloat16 keeps FP32's 8-bit exponent, so its range is about $\pm 3.4 \times 10^{38}$, at the cost of fewer mantissa bits). If NaN persists, reduce the learning rate. If the loss spikes right at the start, increase the warmup ratio.
  • Good loss, bad outputs: low loss does not always mean the model is useful. If the eval loss looks good but the model generates repetitive, degenerate, or off-topic text, the issue is often in the data: inconsistent formatting, contradictory examples, or a mismatch between the training template and the inference template. Always do qualitative evaluation — generate outputs from a set of test prompts and read them yourself.
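For the first failure mode, a quick sanity check is to measure how many label positions in a collated batch actually contribute to the loss. The sketch below assumes the common convention that masked positions carry the label -100 (the default ignore index of PyTorch's cross-entropy loss); the function name supervised_fraction and the toy token IDs are made up for illustration.

```python
IGNORE_INDEX = -100  # label value that the loss skips

def supervised_fraction(labels, ignore_index=IGNORE_INDEX):
    """Fraction of label positions in a batch that are not masked out."""
    total = sum(len(seq) for seq in labels)
    supervised = sum(1 for seq in labels for tok in seq if tok != ignore_index)
    return supervised / total if total else 0.0

# Toy batch: prompt tokens masked, response tokens kept.
batch_labels = [
    [-100, -100, -100, 42, 43, 2],
    [-100, -100, 7, 8, 9, 2],
]
print(f"{supervised_fraction(batch_labels):.2f}")  # -> 0.58
```

A fraction of 0.0 means every token is masked and the model receives no training signal. A fraction of 1.0 when you intended to train on responses only means the prompt mask is not being applied. Decoding the unmasked positions of one example and reading them is a useful follow-up: they should spell out the response, not the prompt.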

A useful diagnostic practice is to log sample generations every N steps: have the trainer generate a response to a few fixed test prompts and print them alongside the loss curves. This gives you qualitative feedback in real time, so you can read the model's outputs improving (or degrading) over the course of training. TRL's SFTTrainer doesn't do this automatically, but you can implement it with a custom TrainerCallback:

import torch
from transformers import TrainerCallback

class GenerationLogCallback(TrainerCallback):
    """Generates and logs sample outputs every N steps for qualitative monitoring."""

    def __init__(self, tokenizer, test_prompts, every_n_steps=200):
        self.tokenizer = tokenizer
        self.test_prompts = test_prompts
        self.every_n_steps = every_n_steps

    def on_step_end(self, args, state, control, model=None, **kwargs):
        if model is None or state.global_step % self.every_n_steps != 0:
            return

        # Remember the current mode so we only switch back if we were training.
        was_training = model.training
        model.eval()  # disable dropout while sampling
        print(f"\n{'='*60}")
        print(f"Sample generations at step {state.global_step}")
        print(f"{'='*60}")

        for prompt in self.test_prompts:
            inputs = self.tokenizer(prompt, return_tensors="pt").to(model.device)
            prompt_len = inputs["input_ids"].shape[1]
            with torch.no_grad():
                output = model.generate(
                    **inputs, max_new_tokens=200, temperature=0.7, do_sample=True
                )
            # Decode only the newly generated tokens. Slicing the decoded string
            # by len(prompt) breaks whenever decoding does not reproduce the
            # prompt text character for character.
            response = self.tokenizer.decode(
                output[0][prompt_len:], skip_special_tokens=True
            )
            print(f"\nPrompt: {prompt}")
            print(f"Output: {response}")

        if was_training:
            model.train()

# Usage: add to SFTTrainer
test_prompts = [
    "Explain gradient descent in simple terms.",
    "Write a Python function that reverses a linked list.",
    "What are the pros and cons of microservices?",
]
callback = GenerationLogCallback(tokenizer, test_prompts, every_n_steps=200)
trainer = SFTTrainer(..., callbacks=[callback])

In summary: watch the four metrics (training loss, eval loss, learning rate, gradient norm), set up early stopping or manual checkpoints, and always verify with qualitative generation tests. A well-monitored training run lets you catch problems within minutes, not hours.

Quiz

Test your understanding of the SFT training loop, its hyperparameters, and training diagnostics.

If you have per_device_train_batch_size=2, gradient_accumulation_steps=8, and 4 GPUs, what is the effective batch size?

What is the main advantage of packing over padding in SFT training?

Gradient checkpointing trades what for what?

During training, eval loss starts increasing while training loss continues to decrease. What does this indicate and what should you do?