Why Does RLHF Need So Many Moving Parts?

In the previous article we assembled the full RLHF pipeline: train a reward model on human preferences, then run PPO against that reward while keeping the policy close to a reference via a KL penalty. The result is impressive (InstructGPT demonstrated clear gains over SFT alone), but the engineering cost is substantial. We need to maintain three models simultaneously (the policy, the reward model, and the value network for PPO's baseline), generate samples on the fly, compute reward scores for each, estimate advantages, and perform clipped policy gradient updates. Each of these steps introduces hyperparameters, potential instabilities, and memory overhead.

A natural question emerges: do we really need the reward model as a separate artifact? We train it from human preference data, then use it only to score completions during PPO. If the reward model is just an intermediate step, perhaps we can cut it out entirely and optimize the policy directly from preference pairs. This is exactly what Direct Preference Optimization achieves.

Rafailov et al. (2023) observed that the RLHF objective has an analytical solution: given a reward function, the optimal policy under the KL-constrained objective takes a specific closed form. By substituting that closed form back into the Bradley-Terry preference model, we get a loss function that depends only on the policy and the preference data, with no reward model anywhere in sight.

Where Does the DPO Loss Come From?

To see how DPO eliminates the reward model, we need to trace through one key derivation. Recall from the RLHF article that the objective we maximize is the expected reward minus a KL penalty:

$$\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D}, \, y \sim \pi_\theta(\cdot|x)} \Big[ r(x, y) - \beta \, \text{KL}\big(\pi_\theta(\cdot|x) \,\|\, \pi_{\text{ref}}(\cdot|x)\big) \Big]$$

This optimization problem has a closed-form solution. The optimal policy $\pi^*$ that maximizes this objective satisfies:

$$\pi^*(y|x) = \frac{1}{Z(x)} \, \pi_{\text{ref}}(y|x) \, \exp\!\left(\frac{r(x,y)}{\beta}\right)$$

This formula is worth examining. The optimal policy is the reference policy reweighted by $\exp(r(x,y)/\beta)$, so responses with high reward get exponentially boosted while low-reward responses get suppressed. If $\beta$ is large, the exponent is small for all $r$ values and $\pi^*$ stays close to $\pi_{\text{ref}}$ (the KL constraint dominates). If $\beta$ is small, even modest reward differences produce large exponential ratios, so $\pi^*$ concentrates almost all probability on the highest-reward response (the reward dominates). The normalizing constant $Z(x)$ ensures the distribution sums to 1 over all possible responses.
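A toy calculation makes this reweighting concrete. All numbers below are invented for illustration: a reference distribution over three candidate responses and hypothetical reward scores for each.

```python
import math

def optimal_policy(ref_probs, rewards, beta):
    """pi*(y|x) = pi_ref(y|x) * exp(r(x, y) / beta) / Z(x)."""
    weights = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
    z = sum(weights)  # the normalizing constant Z(x)
    return [w / z for w in weights]

ref_probs = [0.5, 0.3, 0.2]   # reference policy over three responses
rewards = [1.0, 2.0, 0.5]     # hypothetical reward scores

# Small beta: the reward dominates, mass concentrates on the best response
print(optimal_policy(ref_probs, rewards, beta=0.1))

# Large beta: the KL anchor dominates, pi* stays close to pi_ref
print(optimal_policy(ref_probs, rewards, beta=10.0))
```

With $\beta = 0.1$, essentially all probability lands on the second (highest-reward) response; with $\beta = 10$, the distribution barely moves from the reference, with only a slight boost for the second response.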

We can rearrange this to express the reward in terms of the policy:

$$r(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)$$

The insight is that the reward is just $\beta$ times the log-ratio of the optimal policy to the reference, plus a prompt-dependent constant. When we plug this into the Bradley-Terry preference model ($P(y_w \succ y_l | x) = \sigma(r(x, y_w) - r(x, y_l))$), the $\beta \log Z(x)$ terms cancel because they depend only on $x$ and not on which response we are comparing. What remains is a loss that involves only log-probability ratios under the policy and the reference, with no explicit reward function at all.
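Writing the substitution out makes the cancellation explicit:

$$P(y_w \succ y_l \mid x) = \sigma\!\left( \beta \log \frac{\pi^*(y_w|x)}{\pi_{\text{ref}}(y_w|x)} + \beta \log Z(x) - \beta \log \frac{\pi^*(y_l|x)}{\pi_{\text{ref}}(y_l|x)} - \beta \log Z(x) \right) = \sigma\!\left( \beta \log \frac{\pi^*(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi^*(y_l|x)}{\pi_{\text{ref}}(y_l|x)} \right)$$

Replacing the unknown optimal policy $\pi^*$ with the parameterized $\pi_\theta$ and maximizing the log-likelihood of the observed preference labels under this expression is exactly a negative-log-sigmoid objective on the policy's log-ratios.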

After substituting and simplifying, this yields the DPO loss.

$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \left( \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)} \right) \right) \right]$$

This single equation replaces both the reward model training step and the PPO loop. We optimize it with standard gradient descent on batches of preference triplets $(x, y_w, y_l)$, where $y_w$ is the preferred response and $y_l$ is the rejected one.
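To see that plain gradient descent on this loss behaves as intended, here is a deliberately tiny, self-contained sketch (not the real training loop): a tabular policy over just two responses for one prompt. For a two-response softmax, the difference of log-ratios in the loss reduces to $d - d_{\text{ref}}$, where $d = \log \pi_\theta(y_w|x) - \log \pi_\theta(y_l|x)$ is the policy's log-probability margin; that simplification, and the hand-computed gradient, are introduced here for illustration only.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss_margin(d, d_ref, beta):
    """-log sigmoid(beta * (d - d_ref)) for a two-response tabular policy,
    where d is the policy's log-prob margin log pi(y_w) - log pi(y_l)."""
    return -math.log(sigmoid(beta * (d - d_ref)))

beta, lr = 0.1, 1.0
d_ref = 0.0          # reference is indifferent between the two responses
d = d_ref            # policy initialized at the reference

for _ in range(500):
    arg = beta * (d - d_ref)
    grad = -beta * sigmoid(-arg)   # d(loss)/dd, derived by hand
    d -= lr * grad

# After these updates, d > 0: the policy now favors y_w over y_l,
# and the loss is lower than at initialization.
```

Note how the gradient magnitude shrinks as $\sigma(-\beta d)$ decays: once the preference is satisfied, the loss stops pushing, which is the mechanism that keeps the policy from separating the pair without bound at any practical step count.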

What Does Each Piece of the Loss Do?

The DPO loss is compact, but each component carries specific meaning, and understanding them reveals why the method works as well as it does. Let us walk through the formula from the inside out.

Consider the log-ratio $\log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)}$ for the preferred ("winning") response. This measures how much more (or less) likely our current policy makes $y_w$ compared to the reference policy. If the ratio is positive, the policy has already shifted toward favoring this response relative to its starting point. If negative, the policy still assigns lower probability to it than the reference does. There is a matching term for the rejected ("losing") response $y_l$.

The difference between these two log-ratios captures what we actually care about: has the policy increased the relative likelihood of the preferred response more than the rejected one? When this difference is large and positive, the policy already reflects the preference well. When it is negative, the policy still favors the rejected response relative to the preferred one, and the loss will be large.

The parameter $\beta$ controls how tightly the policy stays anchored to the reference, mirroring its role in the closed-form solution above. To see why, consider the extremes. When $\beta$ is large, even a tiny difference in log-ratios produces a large argument to the sigmoid, so the loss saturates after only a small shift away from the reference: the preferences are satisfied cheaply and the KL anchor dominates. When $\beta$ is small, the sigmoid's argument stays in its near-linear regime, so the loss keeps rewarding ever-larger separations and the policy is free to drift far from the reference in pursuit of the preference labels. In practice, $\beta$ values between 0.1 and 0.5 tend to work well, though the best value depends on how noisy the preference labels are (noisier labels generally warrant a larger $\beta$, so the policy does not chase mislabeled pairs far from the reference).

The sigmoid $\sigma$ converts the scaled difference into a probability in $[0, 1]$, and the negative log turns it into a cross-entropy-style loss. When the policy strongly prefers the winning response (large positive argument to $\sigma$), $\sigma$ is close to 1 and $-\log \sigma$ is close to 0. When the policy incorrectly prefers the losing response (large negative argument), $\sigma$ is near 0 and $-\log \sigma$ becomes very large. This is the same log-sigmoid structure that appears in logistic regression and in the Bradley-Terry reward model loss from the previous article, which makes sense because DPO is effectively performing preference classification directly in policy space.
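Plugging a few values into $-\log \sigma(\cdot)$ shows this asymmetry directly (a throwaway numeric check, not part of any training code):

```python
import math

def neg_log_sigmoid(x):
    """-log(sigmoid(x)), rewritten as log(1 + exp(-x)) for stability."""
    return math.log1p(math.exp(-x))

# Correctly ordered pair with a comfortable margin: near-zero loss
print(neg_log_sigmoid(5.0))    # ~0.0067
# Indifferent policy: loss is log 2
print(neg_log_sigmoid(0.0))    # ~0.6931
# Wrongly ordered pair: loss grows roughly linearly in the margin
print(neg_log_sigmoid(-5.0))   # ~5.0067
```

The roughly linear growth on the negative side means badly misordered pairs receive a strong, non-vanishing gradient, while already-correct pairs contribute almost nothing.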

One particularly elegant consequence of the derivation is that DPO defines an implicit reward:

$$r(x, y) = \beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}$$

We never train a reward model, but we get one for free: the log-ratio of the learned policy to the reference, scaled by $\beta$. If we ever need to score a new response (for monitoring or debugging), we can compute this quantity directly from the policy.
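In code this is a one-liner. The helper below and all its numbers are hypothetical, assuming summed per-token log-probs for a response are already available:

```python
def implicit_reward(policy_logp, ref_logp, beta=0.1):
    """DPO's implicit reward: beta * (log pi_theta(y|x) - log pi_ref(y|x)),
    where each argument is the summed per-token log-prob of a response."""
    return beta * (policy_logp - ref_logp)

# Hypothetical log-probs after DPO training: the policy has raised the
# chosen response's likelihood relative to the reference and lowered the
# rejected one's, so the implicit rewards come out ordered accordingly.
r_chosen = implicit_reward(policy_logp=-42.0, ref_logp=-45.0)    # 0.1 * 3  =  0.3
r_rejected = implicit_reward(policy_logp=-51.0, ref_logp=-48.0)  # 0.1 * -3 = -0.3
```

Tracking the gap between these two quantities over a held-out preference set is a common way to monitor DPO training.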

💡 The reference policy $\pi_{\text{ref}}$ is typically the SFT model we start from, frozen throughout DPO training. It serves the same role as the KL anchor in RLHF: preventing the policy from drifting too far from coherent language.

How Simple Is DPO in Practice?

One of DPO's most appealing properties is that its implementation is remarkably short. We need preference data in the form of triplets $(x, y_w, y_l)$, a policy model $\pi_\theta$ (initialized from SFT), and a frozen copy of that same model as $\pi_{\text{ref}}$. The training loop computes log-probabilities for both responses under both models, assembles the loss, and backpropagates. The following code sketches the core computation.

import torch
import torch.nn.functional as F

def dpo_loss(policy_model, ref_model, input_ids_w, input_ids_l,
             attention_mask_w, attention_mask_l, beta=0.1):
    """
    Compute DPO loss for a batch of preference pairs.
    input_ids_w, input_ids_l: token ids for preferred / rejected responses
    """
    # Forward pass through both models (no grad for reference)
    with torch.no_grad():
        ref_logps_w = get_sequence_log_probs(ref_model, input_ids_w, attention_mask_w)
        ref_logps_l = get_sequence_log_probs(ref_model, input_ids_l, attention_mask_l)

    policy_logps_w = get_sequence_log_probs(policy_model, input_ids_w, attention_mask_w)
    policy_logps_l = get_sequence_log_probs(policy_model, input_ids_l, attention_mask_l)

    # Log-ratios: how much does the policy differ from the reference?
    log_ratio_w = policy_logps_w - ref_logps_w
    log_ratio_l = policy_logps_l - ref_logps_l

    # DPO loss: push the preferred response's log-ratio above the rejected one's
    logits = beta * (log_ratio_w - log_ratio_l)
    loss = -F.logsigmoid(logits).mean()

    return loss


def get_sequence_log_probs(model, input_ids, attention_mask):
    """Sum of per-token log-probs over the sequence.

    Note: strictly, only the response tokens should contribute; a full
    implementation would also zero out the prompt positions in the mask.
    """
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    logits = outputs.logits[:, :-1, :]  # shift: position t predicts token t+1
    labels = input_ids[:, 1:]

    log_probs = F.log_softmax(logits, dim=-1)
    token_log_probs = log_probs.gather(2, labels.unsqueeze(-1)).squeeze(-1)

    # Zero out padding tokens, then sum over the sequence dimension
    mask = attention_mask[:, 1:].to(token_log_probs.dtype)
    return (token_log_probs * mask).sum(dim=-1)

Compare this with the RLHF pipeline from the previous article. There is no sampling loop, no advantage estimation, no clipping logic, and no value network. The entire optimization is a supervised loss on static preference data, which means we can use standard training infrastructure (data loaders, gradient accumulation, distributed data parallel) without the complexity of an online RL loop.

This simplicity does come with a tradeoff. Because DPO trains on a fixed dataset of preference pairs, the policy never generates its own completions during training and never encounters its own mistakes. PPO, by contrast, generates responses on the fly and learns from the reward model's feedback on those responses, which allows it to explore and correct failure modes that the static dataset might not cover. In practice, DPO tends to work well when the preference dataset is large and diverse enough to cover the distribution of prompts the model will see at deployment. When the dataset is narrow or the deployment distribution shifts, online methods like PPO (or GRPO, which we will see in the next article) can adapt more robustly.

There is also a subtlety around reward hacking. With RLHF, the policy can learn to exploit quirks in the reward model (producing outputs that score highly but are not actually good). DPO avoids this failure mode entirely because there is no explicit reward model to exploit; the "reward" is implicit in the policy's own log-ratios. On the other hand, DPO can overfit to surface patterns in the preference data (for example, always preferring longer responses if the training data is biased toward length), so careful data curation remains important.
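One cheap diagnostic along these lines (a hypothetical helper written for this article, not from any particular library) is to measure how often the preferred response is simply the longer one:

```python
def length_win_rate(pairs):
    """Fraction of (chosen, rejected) text pairs where chosen is longer.
    Values far above 0.5 suggest the preference labels partly encode
    length, a surface bias DPO will happily learn."""
    longer = sum(1 for chosen, rejected in pairs if len(chosen) > len(rejected))
    return longer / len(pairs)

# Toy preference data, invented for illustration
pairs = [
    ("a detailed, thorough answer", "short"),
    ("another long-ish reply here", "ok"),
    ("brief", "a much more verbose rejected response"),
]
print(length_win_rate(pairs))  # 2 of 3 chosen responses are longer
```

If this rate is far from 0.5, options include rebalancing the dataset or adding an explicit length penalty, since the loss itself has no way to distinguish "preferred because better" from "preferred because longer".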

💡 Since its publication, DPO has become a widely adopted alignment method among open-source models and smaller teams because it requires roughly half the GPU memory of RLHF (no value network, no separate reward model held in memory during training) and converges with standard supervised training schedules.

Quiz

Test your understanding of Direct Preference Optimization.

What key insight allows DPO to eliminate the reward model from the RLHF pipeline?

In the DPO loss, what does $\log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)}$ represent?

What happens when $\beta$ is set very high in DPO?

What is a key limitation of DPO compared to PPO-based RLHF?