Why Can't SFT Alone Produce Aligned Models?

A model that has been pre-trained and then fine-tuned with SFT can follow instructions, answer questions, and produce fluent text. But following instructions is not the same as being aligned (the model can still generate outputs that are harmful, dishonest, or unhelpful), because SFT only teaches it to mimic the style and structure of the demonstrations, not to internalize the qualities that make a response good. If the training data contains a biased or subtly incorrect response, the model reproduces it faithfully. If there's a harmful response that the model can produce by interpolating between training examples, SFT provides no mechanism to penalize it unless that exact pattern was explicitly excluded from the data.

We established in the first article that RL offers a way forward: let the model generate freely and score the output. But scoring with RL requires a reward signal, and the reward we ultimately care about is human judgment (whether a person would consider the response helpful, honest, and harmless). We obviously can't have a human rating every response generated during training (a single PPO run might produce millions of responses), so we need a proxy: a model trained to predict what humans would prefer. That proxy is the reward model.

This idea of learning a reward function from human feedback and then optimizing a policy against it was formalized by Christiano et al. (2017) in the context of deep RL, and later scaled to instruction-following LLMs by Ouyang et al. (2022) in the InstructGPT paper. The three-step pipeline they introduced (SFT, reward modeling, PPO) remains the backbone of most alignment work today, even as newer methods like DPO have started to compress or replace parts of it.

💡 Alignment is often framed through the "HHH" criteria: helpful (answers the question, follows the instruction), honest (doesn't fabricate facts, expresses uncertainty when appropriate), and harmless (refuses dangerous requests, avoids bias). These are the qualities the reward model is trained to recognize.

How Do We Train a Model to Predict Human Preferences?

The reward model is trained on comparison data. Human annotators are shown a prompt and two candidate responses, and they indicate which response they prefer. This is significantly easier than asking annotators to write ideal responses from scratch (which is what SFT data collection requires) or to assign absolute quality scores on a numeric scale (which tends to be noisy and inconsistent across annotators). Binary preference judgments are faster, cheaper, and more reliable.
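Concretely, each annotation yields one record containing the prompt and the ranked pair. A minimal sketch of the data format (field names and texts are illustrative, not from any specific dataset):

```python
# One preference record: a prompt plus a ranked pair of responses.
# The "rejected" response here is not harmful, just a worse fit for
# the instruction -- annotators judge relative quality, not safety alone.
preference_example = {
    "prompt": "Explain photosynthesis to a ten-year-old.",
    "chosen": "Plants are like tiny factories that catch sunlight and use it "
              "to turn air and water into the food they need to grow.",
    "rejected": "Photosynthesis is the conversion of light energy into "
                "chemical energy via the Calvin cycle in chloroplasts.",
}
print(sorted(preference_example))  # ['chosen', 'prompt', 'rejected']
```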

Given a dataset of such comparisons (triples $(x, y_w, y_l)$ where $x$ is the prompt, $y_w$ is the preferred ("winning") response, and $y_l$ is the rejected ("losing") response), we train a reward model $r_\theta$ to assign scalar scores such that preferred responses score higher. The training objective comes from the Bradley-Terry model of pairwise preferences (Bradley & Terry, 1952):

$$P(y_w \succ y_l \mid x) = \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)$$

Here $\sigma$ is the sigmoid function $\sigma(z) = 1/(1 + e^{-z})$. When $r_\theta(x, y_w)$ is much larger than $r_\theta(x, y_l)$, the difference is a large positive number, $\sigma$ outputs a value close to $1$, and we're confident $y_w$ is preferred, which matches the label. When the two scores are close, $\sigma$ outputs something near $0.5$, representing genuine uncertainty about the preference. When $r_\theta(x, y_l)$ is higher (the model ranks them incorrectly), $\sigma$ outputs a value near $0$, and the loss penalizes this heavily.

The training loss is the negative log-likelihood of the observed preferences:

$$\mathcal{L}_{\text{RM}}(\theta) = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big) \right]$$

This is just binary cross-entropy where the "positive class" is always the preferred response. To see the edge cases: when the model already assigns a much higher score to $y_w$, $\sigma$ is near $1$, $\log(\sigma)$ is near $0$, and the loss contribution is small (correctly confident, little to learn). When the model gets the ranking wrong, $\sigma$ is near $0$, $\log(\sigma)$ is a large negative number, and the loss contribution is large (strong gradient signal to fix the ranking).
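These edge cases are easy to verify numerically. A pure-Python sketch (the score differences are illustrative):

```python
import math

def bt_loss(score_diff):
    """Bradley-Terry loss -log sigmoid(r_w - r_l) for one comparison."""
    p = 1.0 / (1.0 + math.exp(-score_diff))  # P(y_w preferred)
    return -math.log(p)

print(round(bt_loss(5.0), 4))   # correctly confident: ~0.0067 (little to learn)
print(round(bt_loss(0.0), 4))   # tied scores: log 2 ~ 0.6931
print(round(bt_loss(-5.0), 4))  # wrong ranking: ~5.0067 (strong gradient)
```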

In practice, the reward model is typically initialized from the SFT model itself (or from the same pre-trained checkpoint), with the language modeling head replaced by a scalar output head. This gives the reward model a strong starting representation of language (it already understands what coherent, well-structured text looks like) and only needs to learn the preference mapping on top. The following code sketch shows the training setup.

import torch
import torch.nn.functional as F

def reward_model_loss(reward_model, prompt, chosen, rejected):
    """
    Compute Bradley-Terry loss for a single preference pair.

    reward_model: maps (prompt, response) -> scalar reward
    chosen: the human-preferred response
    rejected: the human-rejected response
    """
    r_chosen = reward_model(prompt, chosen)      # scalar
    r_rejected = reward_model(prompt, rejected)  # scalar

    # Bradley-Terry: P(chosen > rejected) = sigmoid(r_chosen - r_rejected)
    # Loss = -log P(chosen > rejected)
    loss = -F.logsigmoid(r_chosen - r_rejected)

    # For monitoring: how often does the model rank correctly?
    accuracy = (r_chosen > r_rejected).float()

    return loss, accuracy

# Training loop (simplified)
for batch in preference_dataloader:
    losses, accs = [], []
    for prompt, chosen, rejected in batch:
        loss, acc = reward_model_loss(reward_model, prompt, chosen, rejected)
        losses.append(loss)
        accs.append(acc)

    batch_loss = torch.stack(losses).mean()
    batch_acc = torch.stack(accs).mean()  # fraction of pairs ranked correctly

    optimizer.zero_grad()
    batch_loss.backward()
    optimizer.step()

A well-trained reward model typically reaches 65–75% agreement with held-out human preferences (Ouyang et al., 2022). This may seem modest, but human annotators often agree with each other only about 73% of the time on the same comparisons, so the reward model is approaching the ceiling of inter-annotator agreement. Where the reward model tends to fail is on out-of-distribution inputs (prompts or response styles that were rare or absent in the preference data), which is one reason why the KL penalty in the next step is so important.

📌 The reward model only needs to get the ranking right, not the absolute scores. Adding a constant to both $r_\theta(x, y_w)$ and $r_\theta(x, y_l)$ doesn't change the sigmoid output or the loss. This means reward scores are only meaningful relative to each other, not in absolute terms (a reward of $3.2$ doesn't mean "good" unless we know what other responses to the same prompt scored).
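This shift invariance is a one-liner to confirm (scores here are illustrative):

```python
import math

def pref_prob(r_w, r_l):
    """P(y_w preferred) under Bradley-Terry: depends only on the difference."""
    return 1.0 / (1.0 + math.exp(-(r_w - r_l)))

# Shifting both scores by the same constant leaves the probability unchanged
print(round(pref_prob(3.2, 1.1), 4))      # ~0.8909
print(round(pref_prob(103.2, 101.1), 4))  # same value: only the gap matters
```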

How Does the Full RLHF Pipeline Fit Together?

With the reward model in hand, we can assemble the full RLHF pipeline as described in the InstructGPT paper (Ouyang et al., 2022). It has three stages, each building on the last:

Stage 1 is supervised fine-tuning. We collect a dataset of human-written demonstrations (prompt-response pairs where the responses are high quality) and fine-tune the pre-trained model on them with standard cross-entropy loss. This gives us a model $\pi^{\text{SFT}}$ that can follow instructions and produce coherent responses (the starting point for RL).

Stage 2 is reward model training. We collect comparison data (for each prompt, annotators rank two or more responses by preference) and train the reward model $r_\theta$ using the Bradley-Terry loss described above. The responses being compared are typically generated by $\pi^{\text{SFT}}$ itself (sometimes with different sampling temperatures to get diverse response pairs), so the reward model sees the kinds of outputs it will need to score during RL.

Stage 3 is RL optimization. We run PPO with the reward model as the environment's reward signal, but with a critical addition. The reward for a generated response $y$ given prompt $x$ is not just the raw reward model score; it includes a KL divergence penalty that penalizes the policy for straying too far from the SFT model:

$$R(x, y) = r_\theta(x, y) - \beta \cdot \text{KL}\big(\pi_\phi(\cdot \mid x) \;\|\; \pi^{\text{ref}}(\cdot \mid x)\big)$$

Here $\pi_\phi$ is the policy being optimized, $\pi^{\text{ref}}$ is the reference policy (typically the SFT model, frozen), and $\beta$ is a coefficient controlling the strength of the penalty. The KL divergence $\text{KL}(\pi_\phi \| \pi^{\text{ref}})$ measures how much the current policy's output distribution has drifted from the reference, computed per-token and summed over the response.

To understand why this penalty is essential, consider what happens without it. The reward model is an imperfect proxy for human preferences (it's a neural network trained on a finite dataset, so it has blind spots and exploitable patterns). Given unconstrained optimization, PPO will find outputs that score very high on the reward model but look nothing like natural language (degenerate strings that exploit reward model artifacts). This failure mode is called reward hacking (sometimes "reward overoptimization"), and it has been observed consistently in practice (Gao et al., 2023). The KL penalty prevents this by anchoring the policy to the reference: if the model's output distribution diverges too far from what the SFT model would produce, the penalty grows and offsets any reward gain.

We can examine the edge cases of $\beta$. When $\beta = 0$, there is no KL penalty and the policy is free to optimize reward without constraint, which leads to reward hacking, typically within a few hundred PPO steps. When $\beta \to \infty$, the KL penalty dominates and the policy never moves away from the reference; we recover the SFT model with no RL benefit. The practical sweet spot lies in between, usually found by sweeping $\beta$ on a validation set of human preferences. Ouyang et al. (2022) and subsequent work often start with $\beta$ around $0.01$–$0.2$ and adjust based on how quickly the KL divergence grows during training.

💡 In practice, the KL divergence is computed at the token level as $\sum_t \log \frac{\pi_\phi(y_t \mid x, y_{<t})}{\pi^{\text{ref}}(y_t \mid x, y_{<t})}$, which is just the sum of log-probability ratios. This requires a forward pass through both the policy and the reference model for each generated response (one reason RLHF training is expensive, since four models may be in memory simultaneously: policy, reference, reward model, and critic).
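A minimal sketch of this computation in pure Python (the log-probabilities are illustrative; in a real pipeline they come from forward passes through the policy and the frozen reference):

```python
def kl_penalized_reward(policy_logprobs, ref_logprobs, rm_score, beta=0.1):
    """
    R(x, y) = r_theta(x, y) - beta * KL, with the KL estimated as the
    summed per-token log-probability ratio between policy and reference.
    """
    kl = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * kl

# Policy slightly more confident than the reference on each generated token
policy_lp = [-1.0, -0.5, -2.0]  # log pi_phi(y_t | x, y_<t)
ref_lp = [-1.2, -0.9, -2.1]     # log pi_ref(y_t | x, y_<t)
print(kl_penalized_reward(policy_lp, ref_lp, rm_score=1.5))  # 1.5 - 0.1 * 0.7
```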

What Can Go Wrong, and What Comes Next?

RLHF is powerful but fragile. Several failure modes are well-documented in the literature and worth understanding before treating the pipeline as a solved problem.

Reward hacking remains a concern even with the KL penalty. Gao et al. (2023) showed that as the KL divergence from the reference model increases, the proxy reward (what the reward model predicts) continues to climb, but the true reward (measured by actual human ratings) peaks and then declines. The model learns to exploit features of the reward model that don't correspond to genuine quality (verbosity is a common one, since reward models often give higher scores to longer responses regardless of content). The KL penalty slows this process but doesn't eliminate it entirely, and choosing the right stopping point (or the right $\beta$) requires monitoring true human preference scores throughout training.

The reward model itself introduces a bottleneck. Human preference data is expensive to collect, and the resulting dataset is usually orders of magnitude smaller than the pre-training corpus. If the preference data doesn't cover certain topics or response styles, the reward model's scores in those regions are essentially random, and PPO will exploit that randomness. Distributional shift between the comparison data and the policy's outputs is a persistent challenge, because as PPO updates the policy, the kinds of responses it generates drift away from what the reward model was trained to evaluate.

The computational cost is substantial. Running RLHF requires four models in memory simultaneously during the PPO phase (policy, reference, reward model, critic), all of which may be large LLMs. This roughly quadruples the memory footprint compared to SFT, and the training loop involves generating full responses (slow autoregressive decoding), scoring them with the reward model, computing advantages, and running multiple PPO epochs (all per batch). This expense has motivated research into more efficient alternatives.
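As a rough back-of-the-envelope (assuming a hypothetical 7B-parameter model stored in bf16, i.e. 2 bytes per parameter, and counting weights only):

```python
params = 7e9         # hypothetical 7B-parameter model
bytes_per_param = 2  # bf16 weights
n_models = 4         # policy, reference, reward model, critic

weights_gb = params * bytes_per_param * n_models / 1e9
print(f"{weights_gb:.0f} GB of weights alone")  # optimizer state, gradients,
                                                # and activations add much more
```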

One such alternative is Direct Preference Optimization (DPO) (Rafailov et al., 2023), which sidesteps the reward model and PPO entirely by directly optimizing the policy on preference data. DPO shows that the optimal policy under the RLHF objective (reward model + KL penalty) can be expressed in closed form, and that this closed form leads to a simple classification-like loss on preference pairs (no RL loop, no reward model, no critic). Whether DPO matches the performance of full RLHF at scale is still actively debated (Xu et al., 2024), but it has become a popular alternative for teams without the infrastructure to run PPO at scale.
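The resulting loss has the same $-\log\sigma$ shape as the reward-model loss, but applied to log-probability ratios against the reference rather than to learned reward scores. A pure-Python sketch for one pair (the sequence-level log-probabilities are illustrative):

```python
import math

def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    """
    DPO loss for one preference pair. Each argument is the summed
    log-probability of the chosen (w) or rejected (l) response under
    the policy (pi_*) or the frozen reference (ref_*).
    """
    margin = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# Policy favors the chosen response more than the reference does -> low loss
print(round(dpo_loss(pi_w=-10.0, pi_l=-12.0, ref_w=-11.0, ref_l=-11.5), 4))
```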

Other directions include GRPO (Group Relative Policy Optimization) (Shao et al., 2024), which eliminates the critic model entirely by using the average reward of a group of sampled responses as the baseline, and Constitutional AI (Bai et al., 2022), which replaces some human labeling with AI-generated feedback (an LLM critiques and revises its own outputs according to a set of principles). The field is moving quickly, but the core insight of RLHF (learn a reward from human preferences, then optimize for it) remains the foundation that all these variations build on.
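GRPO's baseline idea can be sketched in a few lines (a simplification of Shao et al., 2024, which also normalizes by the group's standard deviation, as this does; reward values are illustrative):

```python
def group_relative_advantages(rewards, eps=1e-8):
    """
    GRPO-style advantages for a group of responses sampled from the same
    prompt: each reward is judged relative to the group mean (normalized
    by the group std), so no learned critic is needed as a baseline.
    """
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled responses to one prompt, scored by the reward model:
# above-average responses get positive advantage, below-average negative
print(group_relative_advantages([1.0, 2.0, 3.0, 2.0]))
```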

📌 Many teams find that the biggest quality gains come from the SFT and reward-model stages, not from PPO itself. High-quality demonstration data and well-calibrated preference data often matter more than the RL optimization details. This is sometimes summarized as "data quality dominates algorithm choice" (a useful corrective to the focus on RL methods).

Quiz

Test your understanding of RLHF and the alignment pipeline.

In the Bradley-Terry reward model, what does the sigmoid of the score difference represent?

What is the purpose of the KL penalty term in the RLHF reward function?

What happens when β (the KL penalty coefficient) is set to zero?

Why are binary preference comparisons used instead of absolute quality scores for reward model training?