What's Wrong with Copying the Training Data?
Supervised fine-tuning (SFT) works by showing the model a prompt and a target response, then minimizing cross-entropy loss across every token in that target. For a target sequence $y = (y_1, y_2, \ldots, y_T)$ conditioned on prompt $x$, the loss is:

$$\mathcal{L}_{\text{SFT}}(\theta) = -\sum_{t=1}^{T} \log \pi_\theta(y_t \mid x, y_{<t})$$
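To make this concrete, here is a minimal, dependency-free sketch of the loss. The per-token probabilities are invented for illustration; a real implementation would read them off the model's softmax outputs.

```python
import math

def sft_loss(token_probs):
    """Standard SFT loss: mean negative log-probability over every
    target token. Each token contributes equally to the sum,
    regardless of how informative it is."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Hypothetical probabilities the model assigns to the six tokens of
# "The capital of France is Paris" (illustrative numbers only).
probs = [0.20, 0.60, 0.70, 0.80, 0.90, 0.05]  # "Paris" gets p = 0.05
print(round(sft_loss(probs), 3))  # → 0.967

# The loss is symmetric in the tokens: a low probability on the
# filler word "The" is penalized exactly as much as a low
# probability on the answer "Paris".
assert sft_loss([0.05, 0.90]) == sft_loss([0.90, 0.05])
```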
Every token in the target contributes equally to this sum. If the target response to "What is the capital of France?" is "The capital of France is Paris," then the model gets penalized just as much for assigning low probability to "The" as it does for assigning low probability to "Paris." But those tokens are not equally important — "Paris" is the actual answer, while "The capital of France is" is filler that could be phrased dozens of different ways.
This equal-weighting problem compounds with a deeper issue: SFT rewards only the exact sequence in the training data. Suppose there are three perfectly valid responses to that same question ("Paris," "It's Paris," and "The capital of France is Paris"). If the training example contains only the third, the model is penalized for producing either of the first two, even though they're correct. SFT is essentially imitation learning: the model learns to mimic the demonstrations it was given, not to produce good outputs in general.
For tasks where there is exactly one correct token sequence (say, copying a string), this is fine. But most interesting tasks (writing code, answering questions, summarizing documents) have a wide space of acceptable outputs, and SFT's rigid one-path objective systematically undervalues that diversity. We can see this empirically: SFT'd models often produce outputs that are stilted or formulaic, because they've learned to minimize divergence from the training distribution rather than to maximize response quality (Ouyang et al., 2022).
Can We At Least Weight Tokens Differently?
A natural first fix is to weight tokens by importance. Instead of summing log-probabilities uniformly, we multiply each term by a weight $w_t$ that reflects how much that token matters:

$$\mathcal{L}_{\text{weighted}}(\theta) = -\sum_{t=1}^{T} w_t \log \pi_\theta(y_t \mid x, y_{<t})$$
If we set $w_t = 1$ for every $t$, we recover standard SFT. If we set $w_t$ high for answer-bearing tokens ("Paris") and low for boilerplate ("The," "is"), the model focuses its capacity on getting the important parts right. Some recent work does exactly this: token-weighted SFT uses a separate reward model or heuristic to assign per-token importance scores, then uses those as weights during fine-tuning.
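As a sketch of the weighted objective (the weights and probabilities below are invented for illustration, not taken from any particular paper):

```python
import math

def weighted_sft_loss(token_probs, weights):
    """Token-weighted SFT: each token's negative log-probability is
    scaled by an importance weight w_t, normalized by the total
    weight. With all weights equal to 1 this reduces to standard
    (mean) SFT loss."""
    total_w = sum(weights)
    return -sum(w * math.log(p)
                for p, w in zip(token_probs, weights)) / total_w

# Hypothetical scores for "The capital of France is Paris":
# boilerplate tokens get weight 0.1, the answer token gets 1.0.
probs   = [0.90, 0.90, 0.90, 0.90, 0.90, 0.05]
weights = [0.10, 0.10, 0.10, 0.10, 0.10, 1.00]

# Upweighting "Paris" makes the model's failure on the answer token
# dominate the loss, instead of being averaged away by easy filler.
print(weighted_sft_loss(probs, weights) > weighted_sft_loss(probs, [1.0] * 6))
```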
This helps with the unequal-importance problem, but it introduces a new dependency: we need something to produce those weights. A rule-based heuristic (like upweighting named entities) is brittle and domain-specific, so in practice we often turn to a learned scoring model that can assess which tokens carry the meaning. That scoring model is itself a form of reward signal, and once we have one, a natural question arises: why limit ourselves to token-level weighting? Why not let the model generate freely, score the entire output, and optimize for that score directly?
Token-weighted SFT also doesn't solve the diversity problem. We still need a reference response to weight; we're just weighting its tokens differently. The model is still imitating a single demonstration; it's just imitating some parts harder than others.
What If We Just Scored the Whole Output?
Here is the conceptual leap. Instead of dictating the exact token sequence the model should produce, we let the model generate freely, then assign a scalar reward $r$ to the complete response. High reward means the response was good; low reward means it was bad. The model's job is to adjust its parameters so that future generations tend to get higher rewards.
This framing turns language generation into a reinforcement learning (RL) problem, and each RL concept maps directly to a language generation counterpart.
- Policy $\pi_\theta$: the language model itself. Given a prompt (the state), it samples tokens (actions) one at a time to produce a response.
- Action: generating a single token at each timestep. The action space is the full vocabulary.
- Trajectory: a complete generated response $(y_1, y_2, \ldots, y_T)$, analogous to an episode in game-playing RL.
- Reward $r$: a scalar score assigned to the full trajectory. In RLHF, this comes from a learned reward model; in other setups, it might come from a rule-based verifier, a compiler, or a unit-test suite.
- Return: the cumulative reward over the trajectory. Since we typically assign a single reward at the end (not per-token), the return for a response is just $r$.
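The mapping above can be sketched with a toy stand-in policy and a rule-based verifier. Everything here is illustrative: the vocabulary, the uniform sampling, and the containment check are placeholders for a real language model and a real reward source.

```python
import random

VOCAB = ["Paris", "It's", "The", "capital", "of", "France", "is", "<eos>"]

def policy_sample(rng, max_len=8):
    """A stand-in 'policy': samples tokens until <eos> to form a
    trajectory. A real policy would be the language model's learned
    conditional distribution over the vocabulary at each step."""
    trajectory = []
    for _ in range(max_len):
        token = rng.choice(VOCAB)  # one action = one token
        if token == "<eos>":
            break
        trajectory.append(token)
    return trajectory

def reward(trajectory):
    """A rule-based verifier assigning one scalar to the *full*
    trajectory: 1.0 if the response contains the correct answer,
    0.0 otherwise. No per-token feedback is given."""
    return 1.0 if "Paris" in trajectory else 0.0

rng = random.Random(0)
response = policy_sample(rng)
print(response, reward(response))
```

Note that `reward` only sees the finished trajectory: "Paris," "It's Paris," and "The capital of France is Paris" all score 1.0, which is exactly the flexibility SFT lacks.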
This buys us enormous flexibility. The model can produce "Paris," "It's Paris," or "The capital of France is Paris," and all score high if the reward model cares about correctness, not phrasing. We've replaced the rigid per-token supervision of SFT with a loose output-level signal that allows the model to find its own path.
But flexibility comes at a cost. In SFT, every token has a clear target, so the gradient tells the model exactly which direction to move. In RL, the model generates a 200-token response and receives a single scalar at the end. Which of those 200 tokens was responsible for the high (or low) reward? This is the credit assignment problem, and it's one of the central challenges in RL. The model must figure out, through many rounds of generation and scoring, which token-level decisions led to good outcomes, a much harder learning signal than SFT's per-token supervision. As a result, RL-based training tends to require more compute and more samples to converge, and it can be unstable if the reward signal is noisy or the optimization steps are too aggressive.
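The credit assignment problem is visible in the simplest policy-gradient objective (a REINFORCE-style surrogate loss, shown here as a sketch rather than a production implementation): the one trajectory-level reward multiplies the log-probability of every token identically.

```python
import math

def reinforce_pseudoloss(token_logprobs, reward):
    """REINFORCE-style surrogate loss for a single trajectory:
    -r * sum_t log pi(y_t). The single scalar reward scales every
    token's log-probability by the same amount, so each of the
    (possibly 200) tokens receives identical credit or blame,
    whether or not it actually caused the outcome."""
    return -reward * sum(token_logprobs)

# Two trajectories with the same per-token log-probs but different
# rewards: the gradient pushes *all* tokens of the rewarded one up.
logprobs = [math.log(0.5), math.log(0.5), math.log(0.5)]
print(reinforce_pseudoloss(logprobs, reward=1.0))  # contributes a gradient
print(reinforce_pseudoloss(logprobs, reward=0.0))  # contributes nothing
```

Compare this with SFT, where each token has its own target and therefore its own loss term: here the model must disentangle which tokens mattered purely from correlations across many sampled trajectories.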
Where Each Approach Sits on the Flexibility Spectrum
We can arrange the three training paradigms along a spectrum from rigid to flexible:
- SFT: rigid. Every token has an explicit target. Training is fast and stable because the gradient signal is dense (one loss term per token), but the model can only learn behaviors that appear verbatim in the training data.
- Token-weighted SFT: slightly less rigid. Still requires a reference sequence, but lets us emphasize the tokens that matter. Needs a token-level scoring signal (a reward model or heuristic).
- RL with output-level reward: flexible. The model generates freely and receives a single score. No reference sequence needed, but training is slower (sparse reward signal), noisier (high-variance gradients), and requires careful stabilization (which is where PPO comes in, as we'll see in the next article).
The practical pipeline reflects this spectrum: we start with SFT to get a competent model, then switch to RL to push it beyond imitation. The SFT phase handles the bulk of capability acquisition (learning to follow instructions, produce well-formed text, stay on topic), and the RL phase handles alignment — nudging the model's outputs toward what humans judge as helpful, honest, and harmless. This two-stage recipe is exactly how InstructGPT (Ouyang et al., 2022) was trained, and it remains the foundation of most RLHF systems today.
But to do the RL step, we need two things: a way to optimize a policy based on a reward signal (the subject of the next article on policy gradients and PPO), and a way to produce that reward signal in the first place (the subject of the third article on reward modeling). For now, the key takeaway is that moving from SFT to RL means moving from "copy this exact answer" to "produce any answer that scores well" (a shift that unlocks the model's full generative flexibility at the cost of a harder optimization problem).
Quiz
Test your understanding of SFT's limitations and the RL framing.
Why does standard SFT penalize valid alternative responses?
What problem does token-weighted SFT still fail to solve?
In the RL framing of language generation, what is the 'policy'?
Why is credit assignment harder in RL-based training than in SFT?