How Does a Model Learn Language Without Labels?

In the previous article we built a decoder-only transformer and trained it to reverse short sequences. That toy task used a few thousand parameters and a synthetic dataset (nothing resembling real language). The leap from a toy model to something like GPT-4 or BERT uses the same architectural building blocks, but the training changes completely. The first stage, pre-training, exposes the model to enormous amounts of raw text with a self-supervised objective that requires no human labels. The model learns syntax, facts, reasoning patterns, and world knowledge purely by predicting text.

There are two dominant pre-training objectives, and they map directly to the two architectures we studied in article 8. Causal language modelling (CLM) is used for decoder-only models like GPT (Radford et al., 2018). The model reads tokens left to right and, at each position, predicts the next token. The loss is the average negative log-likelihood over all positions:

$$\mathcal{L}_{\text{CLM}} = -\frac{1}{T} \sum_{t=1}^{T} \log P(x_t \mid x_1, \ldots, x_{t-1})$$

This is exactly the loss we used in article 9 for the reversal task, just applied to natural language at scale. Every sentence in the training corpus is a free training example (no annotation needed). The causal mask ensures the model can't see future tokens, so each position provides a genuine prediction task.
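The CLM loss is simple enough to state directly in code. Below is a minimal pure-Python sketch; the per-position probabilities are hypothetical stand-ins for what a real model's softmax would assign to the observed next tokens:

```python
import math

def clm_loss(next_token_probs):
    """Average negative log-likelihood, one term per position.

    next_token_probs[t] is P(x_t | x_1..x_{t-1}), the probability the
    model assigns to the token that actually appears next in the corpus.
    """
    return -sum(math.log(p) for p in next_token_probs) / len(next_token_probs)

# Hypothetical probabilities a model might assign to three observed tokens.
probs = [0.5, 0.25, 0.8]
print(round(clm_loss(probs), 4))  # prints 0.7675
```

Note that a confident correct prediction (0.8) contributes little to the loss, while an uncertain one (0.25) contributes a lot; this is the gradient signal that drives learning.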

Masked language modelling (MLM) is used for encoder models like BERT (Devlin et al., 2019). Instead of predicting the next token, the model receives a sequence where roughly 15% of tokens have been replaced with a [MASK] token, and it predicts the original identity of each masked position. Because encoders have bidirectional attention (no causal mask), the model can use context from both sides of a masked position, which tends to produce richer representations for downstream classification tasks.

$$\mathcal{L}_{\text{MLM}} = -\frac{1}{|\mathcal{M}|} \sum_{t \in \mathcal{M}} \log P(x_t \mid x_{\setminus \mathcal{M}})$$

where $\mathcal{M}$ is the set of masked positions and $x_{\setminus \mathcal{M}}$ denotes the sequence with those positions replaced. The loss is computed only over masked positions, not the entire sequence, which means each training example provides fewer gradient signals than CLM (about 15% of tokens versus 100%). This is one reason encoder models typically need more training data or epochs to reach comparable saturation.

The conditioning on $x_{\setminus \mathcal{M}}$ (the unmasked tokens) is what gives MLM its bidirectional character: the model sees tokens on both sides of the gap when predicting what belongs in it. If we masked 100% of the tokens, the model would have no context at all and the loss would reduce to predicting tokens from nothing (essentially a unigram language model). If we masked 0%, there would be no training signal. The 15% masking rate is a compromise: enough context for the model to make informed predictions, enough masked positions to produce a useful gradient. Devlin et al. tried other rates and found 15% to work well empirically, though later work like SpanBERT showed that masking contiguous spans rather than random individual tokens can improve downstream performance.
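The masking procedure itself has one more wrinkle worth seeing in code: in BERT, a selected position is not always replaced with [MASK]. Devlin et al. replace 80% of selected positions with [MASK], 10% with a random token, and leave 10% unchanged, so that the model cannot rely on [MASK] being the only position it is ever asked about. A sketch (the tiny vocabulary and function name are illustrative):

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "sat", "on", "mat", "dog"]

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """BERT-style corruption: select ~mask_rate of positions; of those,
    80% become [MASK], 10% a random token, 10% are left unchanged.
    Returns (corrupted tokens, positions the loss is computed on)."""
    rng = random.Random(seed)
    corrupted, masked_positions = list(tokens), []
    for i in range(len(tokens)):
        if rng.random() < mask_rate:
            masked_positions.append(i)
            roll = rng.random()
            if roll < 0.8:
                corrupted[i] = MASK
            elif roll < 0.9:
                corrupted[i] = rng.choice(VOCAB)
            # else: keep the original token (the model still predicts it)
    return corrupted, masked_positions

tokens = ["the", "cat", "sat", "on", "the", "mat"]
corrupted, positions = mask_tokens(tokens)
```

The loss is then computed only at `positions`, exactly as in the $\mathcal{L}_{\text{MLM}}$ formula above.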

T5 (Raffel et al., 2020) introduced a third variant for encoder-decoder models: span corruption, where contiguous spans of tokens are replaced with a single sentinel token and the decoder generates the missing spans. This bridges the two approaches: the encoder sees corrupted bidirectional context, and the decoder generates the missing pieces autoregressively.
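To make span corruption concrete, here is a small illustrative helper (the `<extra_id_n>` sentinel naming follows T5's convention; the function itself is a sketch, not T5's actual preprocessing code):

```python
def span_corrupt(tokens, spans):
    """T5-style span corruption (sketch). Each (start, end) span is
    replaced in the encoder input by one sentinel token; the decoder
    target lists each sentinel followed by the tokens it replaced."""
    inp, tgt, cursor = [], [], 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inp += tokens[cursor:start] + [sentinel]
        tgt += [sentinel] + tokens[start:end]
        cursor = end
    inp += tokens[cursor:]
    return inp, tgt

inp, tgt = span_corrupt(["the", "cat", "sat", "on", "the", "mat"], [(2, 4)])
# inp == ["the", "cat", "<extra_id_0>", "the", "mat"]
# tgt == ["<extra_id_0>", "sat", "on"]
```

The encoder attends bidirectionally over `inp`, while the decoder generates `tgt` left to right, which is how the objective combines MLM-style context with CLM-style generation.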

💡 Both CLM and MLM are self-supervised because the labels come from the text itself. That property is what makes pre-training scalable: we don't need humans to annotate billions of tokens, just raw text.

What Data and Compute Does Pre-training Require?

Self-supervision means we can use any text as training data, which shifts the bottleneck from annotation to data collection and compute. Modern pre-training corpora are massive. Common Crawl contains petabytes of web-scraped data accumulated since 2008. The Pile (Gao et al., 2020) curated 825 GB of diverse English text from 22 sources including books, academic papers, GitHub code, and Stack Exchange. RedPajama (Together, 2023) replicated and extended the LLaMA training data recipe with over 1.2 trillion tokens from web, books, Wikipedia, code, and academic sources.

But more data alone is not enough, because we also need enough parameters to absorb it and enough compute to run the optimisation. These three quantities (data, parameters, compute) are tightly linked. Kaplan et al. (2020) first characterised this relationship, showing that loss follows smooth power laws as a function of model size, dataset size, and compute budget. However, their analysis suggested that models should be scaled up faster than data, leading to large models trained on relatively few tokens.

Hoffmann et al. (2022) revisited this question with Chinchilla and reached a different conclusion: for a fixed compute budget, the optimal approach is to scale model parameters and training tokens at roughly the same rate. Specifically, they found that the number of training tokens should be approximately 20 times the number of parameters. A 10-billion-parameter model, by this estimate, should see about 200 billion tokens to make the best use of the available compute.
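The arithmetic here fits in a few lines. The helper below applies the ~20 tokens-per-parameter rule from the text; the `6 * N * D` FLOPs estimate is a standard rule of thumb for transformer training cost (an assumption added here, not a claim from the Chinchilla paper itself):

```python
def chinchilla_tokens(n_params, ratio=20):
    """Approximate compute-optimal training tokens for a model with
    n_params parameters, using the ~20 tokens-per-parameter rule."""
    return ratio * n_params

def approx_train_flops(n_params, n_tokens):
    """Common heuristic: training cost ~ 6 * N * D FLOPs."""
    return 6 * n_params * n_tokens

n = 10e9                   # 10B-parameter model
d = chinchilla_tokens(n)   # ~200B tokens, matching the estimate in the text
```

Running the same estimate for Chinchilla itself (70B parameters) gives ~1.4 trillion tokens, which is what Hoffmann et al. actually trained on.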

The practical impact was immediate. Before Chinchilla, models like Gopher (280B parameters, 300B tokens) were arguably undertrained relative to their size. Chinchilla itself (70B parameters, 1.4 trillion tokens) outperformed Gopher despite being four times smaller, precisely because it was trained on proportionally more data. The lesson is that neither model size nor data size is a single lever; there's an optimal balance for any given compute budget, and that balance usually involves more tokens than earlier practice suggested.

💡 Chinchilla scaling laws are a guideline, not a hard rule. Later work (like LLaMA, which intentionally overtrained smaller models for better inference efficiency) showed that the optimal ratio depends on whether we're optimising for training cost or inference cost. A model that will serve millions of requests benefits from being smaller even if training it costs more per parameter.

After pre-training, we have a model that can predict the next token with remarkable accuracy. It has absorbed grammar, facts, and even some reasoning ability from the statistical structure of its training data. But if we prompt it with a question like "Explain quantum entanglement in simple terms," it is as likely to continue with another question, a Wikipedia-style paragraph, or a forum post as it is to produce a helpful answer. Pre-training teaches the model to mimic the distribution of text on the internet, not to follow instructions. That gap is what fine-tuning addresses.

How Does a Pre-trained Model Learn to Follow Instructions?

Supervised fine-tuning (SFT) takes a pre-trained language model and continues training it on a curated dataset of (instruction, response) pairs. The data might look like:

```python
# Example SFT training pair
instruction = "Explain why the sky is blue in two sentences."
response = "Sunlight contains all wavelengths of visible light. When it hits Earth's atmosphere, shorter blue wavelengths scatter more than longer red ones (Rayleigh scattering), so the sky appears blue from the ground."

# The model sees the concatenation as one sequence:
# [instruction tokens] [response tokens]
# and we compute the loss only on the response tokens.
```

The training objective is still next-token prediction (the same cross-entropy loss from pre-training) but applied only to the response tokens. The instruction tokens are fed to the model as context (they pass through the forward pass and influence attention), but we zero out their contribution to the loss. This teaches the model to generate helpful completions conditioned on instructions, without penalising it for not predicting the instruction itself.
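In code, this is usually done by marking instruction positions with a sentinel label that the loss skips. The sketch below uses -100, mirroring the `ignore_index` convention common in deep learning frameworks; the probabilities and token ids are hypothetical:

```python
import math

IGNORE = -100  # label for positions excluded from the loss (instruction tokens)

def sft_loss(token_probs, labels):
    """Next-token cross-entropy averaged over response tokens only.

    token_probs[t] is the probability the model assigns to the true
    token at position t; labels[t] is IGNORE for instruction positions.
    """
    terms = [-math.log(p) for p, y in zip(token_probs, labels) if y != IGNORE]
    return sum(terms) / len(terms)

# 3 instruction tokens (masked out of the loss) then 2 response tokens.
probs  = [0.9, 0.9, 0.9, 0.5, 0.25]
labels = [IGNORE, IGNORE, IGNORE, 7, 12]   # hypothetical token ids
loss = sft_loss(probs, labels)
```

Note that the instruction tokens still appear in `token_probs` (they went through the forward pass), but they contribute nothing to the loss, exactly as described above.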

The datasets for SFT are much smaller than pre-training corpora (typically tens of thousands to a few hundred thousand examples, rather than billions of tokens). InstructGPT (Ouyang et al., 2022) used roughly 13,000 demonstration examples for its SFT stage. Alpaca (Taori et al., 2023) showed that even 52,000 instruction-response pairs generated by OpenAI's text-davinci-003 could turn a base LLaMA model into a passable instruction follower. The reason such small datasets work is that SFT isn't teaching the model new knowledge (the knowledge is already in the pre-trained weights). SFT is teaching the model a new format : read the instruction, then produce a direct answer instead of continuing in whatever style the pre-training data happened to contain.

Formally, given an SFT example where the instruction has $T_I$ tokens and the response has $T_R$ tokens, the loss is:

$$\mathcal{L}_{\text{SFT}} = -\frac{1}{T_R} \sum_{t=T_I+1}^{T_I+T_R} \log P(x_t \mid x_1, \ldots, x_{t-1})$$

This is identical to the CLM loss but summed only over the response portion $[T_I+1, \ldots, T_I + T_R]$, which is sometimes called instruction masking or prompt masking. If we look at the edge case where $T_I = 0$ (no instruction, just a response), the SFT loss reduces to the standard CLM loss over the entire sequence, which is exactly pre-training. SFT is really just pre-training on a more targeted distribution.

Where Does SFT Break Down?

SFT produces models that follow instructions and generate coherent, well-formatted responses. It is the single biggest behavioural shift in the training pipeline (the difference between a base model that rambles and a chat model that answers questions). But it has structural limitations that become apparent once we look closely at how the loss treats individual tokens.

The cross-entropy loss treats every token in the target response equally. If the gold response is "The capital of France is Paris," the model is penalised the same amount for getting "The" wrong as for getting "Paris" wrong. But clearly these tokens don't carry equal information: "Paris" is the actual answer, while "The capital of France is" is formulaic phrasing that could reasonably take many forms. A loss that weighted answer-critical tokens more heavily would provide a better learning signal, but standard SFT has no mechanism for distinguishing which tokens matter.

There is a deeper issue. SFT forces one specific generation path. If a training example answers "What is 2+2?" with "The answer is 4," the model learns to produce exactly those tokens in that order. But "4" and "2+2 equals 4" and "Four" are all valid responses. SFT penalises the model for producing any of the alternatives, even correct ones, because cross-entropy loss measures divergence from the single reference sequence, not from the set of acceptable answers. The model is being trained to imitate, not to be correct.

Some recent work addresses the token-weighting problem directly. Token-weighted SFT approaches assign different loss weights to different tokens in the response based on how important they are. One method is to use a separate reward model to score each token's contribution to response quality, then upweight tokens that the reward model considers important. Selective reflection-tuning (Li et al., 2023) takes a related approach, using a teacher model to identify which tokens in a response are most critical and weighting the loss accordingly. These methods tend to improve performance on benchmarks compared to uniform-weight SFT, but they require knowing how to score individual tokens, which introduces its own complexity.
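The mechanics of token weighting are a one-line change to the loss. In the sketch below, the sequence, probabilities, and weights are all hypothetical; in the methods above, the weights would come from a reward model or teacher model rather than being hand-chosen:

```python
import math

def weighted_sft_loss(token_probs, weights):
    """Cross-entropy with per-token weights, normalised by total weight.
    Uniform weights recover the standard SFT loss."""
    total = sum(weights)
    return sum(w * -math.log(p) for p, w in zip(token_probs, weights)) / total

# "The capital of France is Paris" — hypothetical per-token probabilities.
# The model is least confident on the answer token ("Paris").
probs = [0.9, 0.8, 0.9, 0.9, 0.8, 0.4]

uniform = weighted_sft_loss(probs, [1.0] * 6)
# Upweight the answer token relative to the formulaic prefix.
answer_heavy = weighted_sft_loss(probs, [0.5, 0.5, 0.5, 0.5, 0.5, 3.0])
```

Because the answer token is both upweighted and poorly predicted, `answer_heavy` exceeds `uniform`, so the gradient concentrates on getting "Paris" right rather than polishing the boilerplate.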

We can also see the rigidity problem from a distributional perspective. The pre-trained model has a broad distribution over possible continuations for any given prefix. SFT narrows that distribution to match the training examples, which is desirable (we want the model to produce good answers) but also fragile (the model learns to produce these specific good answers rather than the general class of good answers). The model becomes a precise imitator of its training data, which works well when the training data is comprehensive but fails when the model encounters novel questions that require novel phrasing.

📌 SFT models can also be confidently wrong. Because they've been trained to produce fluent, well-formatted responses, they'll generate a confident-sounding answer even when the pre-trained knowledge is uncertain. SFT teaches the format but doesn't teach the model to calibrate its confidence.

What Comes After SFT?

Let's take stock of where we are. Pre-training gives us a model that understands language. SFT reshapes its behaviour so it follows instructions instead of mimicking internet text. Token-weighted SFT partially addresses the equal-weighting problem by telling the model which tokens matter more. But all of these approaches share a fundamental constraint: they operate at the token level, telling the model exactly which tokens to produce (or weighting them), one position at a time.

What if we took a completely different approach? Instead of specifying the correct output token by token, we could let the model generate a complete response however it wants, then score the entire response with a reward signal. A response that's correct, helpful, and well-reasoned gets a high reward; one that's wrong or unhelpful gets a low reward. The model receives no guidance on which tokens to change, so it has to figure that out on its own by exploring different generation strategies and observing which ones lead to higher rewards.

This is reinforcement learning from human feedback (RLHF) , introduced for language models by Ouyang et al. (2022) in the InstructGPT paper. The pipeline has three stages: (1) pre-train a base model, (2) fine-tune with SFT to get a model that follows instructions, and (3) further refine with RL using a reward model trained on human preference data. Each stage addresses the limitations of the previous one: pre-training provides knowledge, SFT provides format, and RL provides alignment with what humans actually want.

RL solves both of SFT's core problems in principle. It doesn't require specifying which tokens matter, because the reward is computed on the full output, so the model learns on its own which tokens are load-bearing. And it doesn't force a single generation path, because any response that achieves a high reward is reinforced, regardless of whether it matches a specific reference. If "4" and "Four" and "The answer is 4" all receive the same reward, the model learns that all three are valid.

Of course, RL introduces its own challenges: reward hacking (the model finds shortcuts that game the reward model), training instability (policy gradient methods are notoriously high-variance), and the need for a good reward model in the first place. These are the subjects of the reinforcement learning track, which picks up exactly where this article leaves off.

The trajectory from this track to the next one is a straight line. We've gone from "what is attention?" to "how do all the pieces fit together?" to "how do we train a model at scale?" to "how do we make it follow instructions?" to the open question: "how do we make it actually good at following instructions, not just imitative?" That question (and its answer through reinforcement learning) is where the next track begins.

💡 The full modern LLM pipeline is typically: pre-train (CLM on trillions of tokens) → SFT (instruction-response pairs) → RLHF or DPO (preference optimisation). Each stage is cheaper and uses less data than the one before it, but each produces a qualitative shift in the model's behaviour.

Quiz

Test your understanding of pre-training and supervised fine-tuning.

What is the key difference between CLM and MLM as pre-training objectives?

According to the Chinchilla scaling laws, what is the approximate optimal ratio of training tokens to model parameters?

Why does SFT work with relatively small datasets (tens of thousands of examples) compared to pre-training (trillions of tokens)?

What fundamental limitation of SFT does reinforcement learning address?