What Can a Pre-trained Model Actually Do?
A large language model fresh out of pre-training is impressive and useless at the same time. It has read billions of words of text — books, code, web pages, scientific papers — and distilled statistical patterns about language into its weights. It can complete sentences, mimic styles, and even solve math problems if the prompt is crafted just right. But ask it a straightforward question like "Summarise this contract in three bullet points" and you'll likely get something that looks more like a continuation of your sentence than an actual summary. Why?
Because pre-training optimises for one thing: predicting the next token. The model learns $P(x_t \mid x_{<t})$, the probability of the next token given everything before it. That objective teaches the model what language is — syntax, facts, reasoning patterns — but not what the user wants. The original GPT-3 paper (Brown et al., 2020) demonstrated this gap vividly. GPT-3 could do impressive few-shot tasks when given carefully constructed prompts with examples, but it couldn't reliably follow a simple instruction like "translate this to French" without a demonstration. It would sometimes repeat the instruction, continue the text as if it were a document, or veer off-topic entirely.
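To make the objective concrete, here is a toy sketch of next-token prediction: a softmax over invented logits for a tiny three-token vocabulary. The vocabulary and scores are made up for illustration and don't come from any real model.

```python
import math

# Toy next-token prediction: hypothetical logits over a tiny vocabulary.
# A real model produces one such distribution at every position.
vocab = ["Paris", "London", "banana"]
logits = [4.0, 2.0, -1.0]  # invented scores for "The capital of France is ..."

# Softmax turns raw scores into P(x_t | x_<t)
exps = [math.exp(z) for z in logits]
total = sum(exps)
probs = [e / total for e in exps]

for token, p in zip(vocab, probs):
    print(f"P({token!r} | context) = {p:.3f}")
```

Pre-training shapes billions of parameters so that these distributions match real text; nothing in the objective says anything about following instructions.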
This is the gap between capability and alignment. The model has the raw ability to generate fluent, knowledgeable text, but it doesn't know how to channel that ability toward your specific goal. Prompting can sometimes bridge this gap for simple tasks, but for anything requiring a consistent output format, a domain-specific reasoning style, or reliable instruction-following, we need something more. So how do we close this gap?
The Adaptation Spectrum
There isn't just one way to adapt a model. There's a whole spectrum of techniques, ranging from free and instant (but limited) to expensive and slow (but transformative). Before committing to fine-tuning, it's worth understanding where it sits relative to the alternatives — because the cheapest option that solves your problem is usually the right one.
- Prompting (zero-shot / few-shot): Write a good system prompt, maybe include a few examples, and hope the model gets it. No training, no GPU, no data collection. This works surprisingly well for generic tasks, but falls apart when you need consistent formatting, domain-specific jargon, or behaviour the model wasn't pre-trained for. You're limited by the context window and by your prompt-engineering skill.
- In-context learning (ICL): Pack examples directly into the prompt. "Here are 10 examples of how to classify support tickets — now classify this one." Better than zero-shot, but every example eats tokens, results are sensitive to example ordering, and you're renting the behaviour per-call rather than owning it. The model doesn't actually learn; it just pattern-matches within the context.
- Retrieval-Augmented Generation (RAG): Fetch relevant documents at query time and paste them into the context. Excellent for factual knowledge — the model grounds its answer in retrieved passages instead of hallucinating. But RAG can't change how the model reasons or what format it uses. If the model doesn't know how to generate valid JSON, no amount of retrieved documents will fix that.
- Fine-tuning: Update the model's weights on task-specific examples. This changes the model's behaviour — its default output format, reasoning style, tone, tool-use patterns. The model learns new habits from your data. Requires a training dataset and GPU compute, but the per-inference cost drops (no need for long prompts full of examples) and the behaviour becomes consistent.
- Pre-training from scratch: Train a new model from randomly initialised weights on a massive corpus. This is the nuclear option: months of compute on thousands of GPUs, billions of tokens of curated data. Almost never justified unless you need a fundamentally different architecture, a new language, or a domain so specialised (e.g. protein sequences) that existing models have no useful prior.
The table below makes the tradeoffs concrete. Notice how fine-tuning is the only approach that changes model behaviour without requiring full pre-training — that's the sweet spot we'll be exploring throughout this track.
import json, js
rows = [
    ["Prompting", "None", "No", "No", "No", "Instant"],
    ["ICL (few-shot)", "A few examples", "No", "No", "No", "Instant"],
    ["RAG", "A document corpus", "No", "No", "Yes (facts)", "Hours (indexing)"],
    ["Fine-tuning", "100–100k examples", "Yes", "Yes", "Partially", "Hours–days"],
    ["Pre-training", "Billions of tokens", "Yes", "Yes", "Yes", "Weeks–months"],
]
js.window.py_table_data = json.dumps({
    "headers": [
        "Approach",
        "Data needed",
        "GPU required?",
        "Changes behaviour?",
        "Adds knowledge?",
        "Setup time"
    ],
    "rows": rows
})
print("Key insight: fine-tuning sits in the sweet spot — it changes behaviour")
print("with modest data and compute, without the cost of full pre-training.")
So when prompting and RAG aren't enough — when you need the model to consistently adopt a new behaviour, format, or skill — fine-tuning is where you turn. But what does fine-tuning actually do under the hood?
What Does Fine-tuning Actually Change?
During pre-training, the model learns a general distribution over language: given a context, what token is likely to come next? This distribution is extremely broad — it covers everything from legal contracts to Reddit comments to Python code. The model is a generalist, and that's both its strength and its weakness.
Fine-tuning narrows this distribution. We show the model a curated set of (input, desired output) pairs, and we adjust its weights so that it assigns higher probability to the outputs we want. Formally, the model starts with pre-trained parameters $\theta_0$ and we update them to $\theta^*$ by minimising a loss over our task-specific dataset $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$, where each $x^{(i)}$ is an instruction or input and each $y^{(i)}$ is the desired response.
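As a minimal sketch of what such a dataset $\mathcal{D}$ looks like in code, here are two toy (input, output) pairs formatted into single training sequences $[x; y]$. The instruction/response template below is an invented illustration, not any model's actual chat format.

```python
# A toy task-specific dataset D = {(x_i, y_i)}: instruction/response pairs.
# The "### Instruction / ### Response" template is a made-up illustration.
dataset = [
    ("Translate to French: Hello", "Bonjour"),
    ("Summarise: The cat sat on the mat.", "A cat sat on a mat."),
]

def format_example(x, y):
    # Concatenate input and target into one training sequence [x; y].
    return f"### Instruction:\n{x}\n### Response:\n{y}"

for x, y in dataset:
    print(format_example(x, y))
    print("---")
```

Each formatted sequence is tokenised and fed to the model; the loss below is then computed over the response portion.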
The standard loss for Supervised Fine-Tuning (SFT) is the cross-entropy loss over the target tokens:

$$\mathcal{L}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log P_\theta(y_t \mid x, y_{<t})$$
Let's unpack every symbol in this formula so that nothing is left to guesswork:
- $\theta$ — the model's parameters (all the weight matrices). These are the variables we're optimising. We start from $\theta_0$ (the pre-trained weights) and nudge them toward $\theta^*$.
- $x$ — the input (instruction, prompt, question). This is the context the model conditions on.
- $y_t$ — the target token at position $t$ in the desired response. This is what we want the model to predict.
- $y_{<t}$ — all target tokens before position $t$. The model generates autoregressively, so predicting $y_t$ depends on both the input $x$ and all previously generated tokens $y_1, y_2, \ldots, y_{t-1}$.
- $T$ — the total number of tokens in the target response. Dividing by $T$ averages the loss across the sequence so that long responses don't dominate short ones in the same batch.
- $P_\theta(y_t \mid x, y_{<t})$ — the probability the model assigns to the correct token $y_t$, given the input and all preceding target tokens. This comes from the softmax over the vocabulary at position $t$.
- $-\log(\cdot)$ — the negative log transforms probabilities into a loss. When the model is confident and correct (probability near 1.0), the loss is near zero. When it's wrong (probability near 0), the loss is enormous.
Now let's see what happens at the boundary values — this is where the intuition lives. Suppose the model assigns probability 1.0 to the correct token: $-\log(1.0) = 0$, so there's zero loss. The model already knows this one perfectly. If the model assigns probability 0.5 (a coin flip between two tokens): $-\log(0.5) \approx 0.693$. Moderate penalty — the model needs to become more decisive. If the model assigns probability 0.01 (nearly certain it's a different token): $-\log(0.01) \approx 4.605$. Massive penalty. And at the extreme, if the model assigns probability approaching 0: $-\log(\epsilon) \rightarrow \infty$. The loss explodes, creating a very strong gradient signal to fix this.
import math, json, js
# Show the negative log loss at different probability values
probs = [1.0, 0.9, 0.5, 0.1, 0.01, 0.001]
rows = []
for p in probs:
    loss = -math.log(p)
    if p == 1.0:
        interpretation = "Perfect — zero loss"
    elif p >= 0.9:
        interpretation = "Confident and correct — small nudge"
    elif p >= 0.5:
        interpretation = "Coin flip — moderate penalty"
    elif p >= 0.1:
        interpretation = "Mostly wrong — strong gradient"
    elif p >= 0.01:
        interpretation = "Very wrong — large penalty"
    else:
        interpretation = "Catastrophically wrong — loss explodes"
    rows.append([f"{p}", f"{loss:.4f}", interpretation])
js.window.py_table_data = json.dumps({
    "headers": ["P(correct token)", "-log(P)", "Interpretation"],
    "rows": rows
})
print("The negative log function makes the loss asymmetric:")
print("getting the right answer barely reduces loss (0 is the floor),")
print("but getting it wrong creates explosive gradients.")
One crucial detail: during fine-tuning, we typically use a much lower learning rate than during pre-training — on the order of $1 \times 10^{-5}$ to $5 \times 10^{-5}$, compared to $1 \times 10^{-3}$ or higher during pre-training. Why? Because the pre-trained weights already encode enormously useful knowledge about language. We want to nudge the model toward our task, not overwrite what it already knows. A learning rate that's too high causes catastrophic forgetting — the model learns your task but loses its general language ability, producing fluent task-specific outputs that fall apart on anything slightly different.
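To see why the learning rate matters, here is a deliberately crude one-dimensional caricature: a single "weight" starts at its pre-trained value $\theta_0 = 0$ and is fine-tuned toward a new task whose loss $(\theta - 5)^2$ is minimised at $\theta = 5$. All numbers are invented; the point is only how far each learning rate drags the weight from where it started.

```python
# 1-D caricature of catastrophic forgetting (purely illustrative numbers).
# Pre-training left the weight at theta0 = 0.0; the fine-tuning loss
# (theta - 5)^2 is minimised at theta = 5.0.
def finetune(theta0, lr, steps=50):
    theta = theta0
    for _ in range(steps):
        grad = 2 * (theta - 5.0)  # gradient of the new task's loss
        theta -= lr * grad
    return theta

small = finetune(0.0, lr=0.001)  # gentle nudge: stays close to theta0
large = finetune(0.0, lr=0.5)    # snaps straight to the new minimum

print(f"small lr -> theta = {small:.3f} (old solution mostly preserved)")
print(f"large lr -> theta = {large:.3f} (starting point fully abandoned)")
```

In a real model the pre-trained weights encode general language ability, so drifting far from $\theta_0$ is exactly the catastrophic forgetting the paragraph above describes.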
There's also an important subtlety about what we compute the loss on. In SFT, we typically mask the loss on the input tokens. The model sees the full sequence $[x; y]$ (input concatenated with target), but we only compute the loss on the $y$ tokens. We don't want to penalise the model for not predicting the user's instruction — we only care that it produces the right response. This is why the formula sums over $t = 1$ to $T$ (the target length), not over the entire input-output sequence.
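Here is a sketch of that masking with made-up per-token probabilities. The token strings and probability values are invented; a real implementation reads $P_\theta$ off the model's softmax at each position.

```python
import math

# Sketch of SFT loss masking with invented per-token probabilities.
# The sequence is [x; y]; we only accumulate loss where mask == 1 (response tokens).
tokens    = ["Translate:", "Hello", "Bonjour", "!"]   # [x ; y]
mask      = [0,            0,       1,         1]     # 1 = target token y_t
p_correct = [0.20,         0.10,    0.70,      0.90]  # model's P(token | context)

target_losses = [-math.log(p) for p, m in zip(p_correct, mask) if m == 1]
loss = sum(target_losses) / len(target_losses)  # average over T target tokens only

print(f"loss over response tokens only: {loss:.4f}")
```

Note that the low probabilities on the instruction tokens contribute nothing: the model is never penalised for failing to predict the user's input.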
So fine-tuning takes a model that knows language in general and focuses it on producing the specific outputs we want. But how do we know when fine-tuning is the right tool for the job?
When Should You Fine-tune?
Not every problem needs fine-tuning, and reaching for it too early is one of the most common (and expensive) mistakes in applied ML. The key insight is that fine-tuning excels at changing behaviour, not injecting knowledge. Understanding this distinction will save you weeks of wasted compute.
Fine-tune for behaviour
If you need the model to consistently produce output in a specific format (JSON, XML, markdown tables), adopt a particular tone (formal legal language, friendly customer support, concise medical notes), use tools correctly (API calls, function invocations, code execution), or follow a multi-step reasoning pattern — these are all behavioural changes that fine-tuning handles well. You're teaching the model how to respond, not what to say.
Fine-tune for skills
Code generation in a niche framework, structured data extraction from messy documents, translation quality in a low-resource language pair, mathematical reasoning — these are skills that improve with fine-tuning. The model has some baseline ability from pre-training, but you can sharpen it dramatically with targeted examples. The LIMA paper (Zhou et al., 2023) showed that just 1,000 carefully curated examples can produce a model that rivals systems trained on 50x more data — quality matters far more than quantity.
Don't fine-tune for facts
If your problem is that the model doesn't know something — yesterday's stock prices, your company's internal policies, the latest research papers — don't fine-tune. Use RAG instead (see the RAG track for a full treatment). Fine-tuning bakes knowledge into weights, which means it goes stale, requires retraining to update, and is prone to hallucination when the model is uncertain about memorised facts. RAG keeps knowledge in an external corpus that can be swapped or updated without touching the model.
The decision framework
When faced with a new task, walk through these questions in order:
- Can prompting solve it? Try zero-shot and few-shot prompts first. If the model consistently gets it right with a good prompt, you're done — no training needed. This is the cheapest, fastest path.
- Is it a knowledge problem? If the model has the right behaviour but lacks specific facts, use RAG. Retrieve the relevant documents and let the model reason over them.
- Is it a behaviour, format, or skill problem? If the model knows the facts but doesn't present them correctly, doesn't follow your format, or lacks a specific skill — fine-tune. This is the sweet spot.
- Is the base model architecture fundamentally wrong? If you need a model for a completely different modality (protein folding, molecular generation) or language family with zero pre-training coverage — pre-train from scratch. This is rare and expensive.
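The four questions above can be sketched as a tiny helper. The flag names and return labels are my own illustrative choices, not a standard API.

```python
# The decision framework above as a tiny helper (illustrative names only).
def choose_adaptation(prompting_works, knowledge_gap, behaviour_gap, new_modality):
    if prompting_works:
        return "prompting"            # cheapest path that solves the problem
    if knowledge_gap:
        return "RAG"                  # missing facts, not missing behaviour
    if behaviour_gap:
        return "fine-tuning"          # format, tone, skill, tool use
    if new_modality:
        return "pre-training from scratch"  # rare and expensive
    return "re-examine the task"

# Example: model knows the facts but won't emit valid JSON -> behaviour problem.
print(choose_adaptation(False, False, True, False))  # fine-tuning
```

The ordering matters: each branch is only reached after the cheaper options above it have been ruled out, mirroring the "cheapest option that solves your problem" principle from earlier.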
The InstructGPT paper (Ouyang et al., 2022) was the landmark demonstration of this pipeline in practice. OpenAI took GPT-3 (a pure next-token predictor), fine-tuned it on ~13,000 instruction-following demonstrations (SFT), and then further aligned it with human preferences using RLHF. The result was a model that was dramatically better at following instructions — not because it knew more, but because it had learned a new behaviour: read the instruction, then respond helpfully. That behavioural shift is exactly what fine-tuning excels at.
Now that we know when to fine-tune, let's zoom out and see where it fits in the full model training pipeline.
The Three-Stage Pipeline
Modern language models don't go from random weights to helpful assistant in one step. The process has three distinct stages, each with its own objective, data, and scale. Understanding this pipeline is essential because fine-tuning (Stage 2) builds directly on what pre-training (Stage 1) gives you, and alignment (Stage 3) refines what fine-tuning produces.
Stage 1: Pre-training — Learning language
The model trains on billions (or trillions) of tokens from a broad corpus — web text, books, code, scientific papers. The objective is next-token prediction: $P(x_t \mid x_{<t})$. This is the most expensive stage by far (thousands of GPUs, weeks to months of training), and it produces a model with vast general knowledge but no ability to follow instructions. We covered the mechanics of pre-training in the Transformers track (article 10).
Stage 2: Supervised Fine-Tuning (SFT) — Learning to follow instructions
We take the pre-trained model and train it on a much smaller dataset of (instruction, response) pairs. Thousands to tens of thousands of examples, not billions. The model learns to read instructions and produce helpful, well-formatted responses. This is the stage that transforms a next-token predictor into something that feels like an assistant. The loss is the cross-entropy formula we saw above, computed only on the response tokens.
Stage 3: Alignment (RLHF / DPO) — Learning human preferences
SFT teaches the model to follow instructions, but not which of several valid responses a human would prefer. Alignment techniques like Reinforcement Learning from Human Feedback (RLHF) (Ouyang et al., 2022) and Direct Preference Optimisation (DPO) (Rafailov et al., 2023) use human preference data ("response A is better than response B") to further refine the model. This stage handles safety, helpfulness, conciseness, and other qualities that are hard to specify with (input, output) pairs alone. The RLHF track covers this stage in depth.
import json, js
rows = [
    ["1. Pre-training", "Next-token prediction",
     "Trillions of tokens (web, books, code)",
     "Weeks–months on 1000s of GPUs", "General language ability"],
    ["2. SFT", "Cross-entropy on responses",
     "1k–100k (instruction, response) pairs",
     "Hours–days on 1–8 GPUs", "Instruction following"],
    ["3. Alignment", "RLHF / DPO",
     "10k–100k preference pairs",
     "Hours–days on 1–8 GPUs", "Human-preferred behaviour"],
]
js.window.py_table_data = json.dumps({
    "headers": ["Stage", "Objective", "Data", "Compute", "What it teaches"],
    "rows": rows
})
print("Each stage builds on the previous one.")
print("Stage 2 (SFT) is where fine-tuning lives — and it's the focus of this track.")
What this track covers
This fine-tuning track focuses on Stage 2 — supervised fine-tuning and the parameter-efficient techniques that make it practical. Here's what's ahead:
- Article 2: Full fine-tuning mechanics — how gradient updates flow through the full model, what hyperparameters matter, and when full fine-tuning is the right choice.
- Article 3: LoRA (Low-Rank Adaptation) — the breakthrough technique that fine-tunes only a tiny fraction of the parameters by injecting trainable low-rank matrices, making fine-tuning accessible on consumer GPUs.
- Article 4: QLoRA and quantisation — combining 4-bit quantisation with LoRA to fine-tune models that wouldn't otherwise fit in memory.
- Article 5: Data preparation — how to collect, clean, and format training data. The "garbage in, garbage out" article.
- Articles 6–10: Advanced topics including adapter methods, prompt tuning, multi-task fine-tuning, evaluation strategies, and deploying fine-tuned models in production.
By the end of this track, you'll understand not just how to fine-tune, but when to fine-tune, how much data you need, which parameters to update, and how to evaluate whether it worked. The next article dives into the mechanics of full fine-tuning — how gradients flow, what the optimiser does, and why the learning rate schedule matters so much.
Quiz
Test your understanding of when and why to fine-tune a pre-trained model.
A pre-trained language model can generate fluent text but doesn't reliably follow instructions. What is the primary reason?
Your model produces correct answers but in plain text, and you need consistent JSON output. What is the most appropriate approach?
In the SFT loss $\mathcal{L} = -\frac{1}{T} \sum_{t=1}^{T} \log P_\theta(y_t \mid x, y_{<t})$, why is the loss computed only over the target tokens $y$ rather than over the full input-output sequence?
Your chatbot needs to answer questions about your company's frequently updated product catalog. Which approach is most appropriate?