How Do We Increase the Probability of Good Actions?

We established in the previous article that RL-based training lets the model generate freely and receive a scalar reward for the full output. The question now is mechanical: given that reward signal, how do we actually update the model's weights? We need a gradient — a direction in parameter space that, when followed, makes high-reward outputs more likely and low-reward outputs less likely.

The simplest answer is the REINFORCE algorithm (Williams, 1992). The idea is clean: sample a trajectory (generate a full response), observe the reward, then nudge the model's parameters so that the actions taken during that trajectory become more or less likely in proportion to how good the reward was. Formally, we want to maximize the expected return $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$, where $\tau$ is a trajectory sampled from policy $\pi_\theta$ and $R(\tau)$ is the reward. The policy gradient theorem gives us the gradient of this objective:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot R_t \right]$$

Each piece of this formula plays a specific role. $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ is the direction in parameter space that would increase the probability of taking action $a_t$ in state $s_t$ (in language model terms, this is the direction that makes the model more likely to produce token $a_t$ given the prompt and all tokens generated so far). $R_t$ is the return from timestep $t$ onward, and it acts as a scaling factor: when $R_t$ is large and positive, we take a big step in the direction that increases this token's probability; when $R_t$ is negative, we step in the opposite direction, making this token less likely.

The expectation $\mathbb{E}_{\tau \sim \pi_\theta}$ means we average over many sampled trajectories. In practice we approximate this with a batch of generated responses, computing the gradient for each and averaging. The more trajectories we sample, the better our estimate of the true gradient, but each trajectory requires a full forward pass through the model, so there's a direct tradeoff between gradient quality and compute.

To see why this works, consider what happens with a single trajectory. If we generate a response and it receives high reward, every token in that response gets its probability pushed up. If it receives low reward, every token gets pushed down. Over many trajectories, tokens that consistently appear in high-reward responses will have their probabilities increased, while tokens that appear in low-reward responses will be suppressed. The algorithm discovers which token-level decisions lead to good outcomes purely from output-level feedback, without ever needing per-token supervision.
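To make this concrete, here is a minimal REINFORCE step on a toy problem: a five-token "vocabulary" with a made-up reward that pays $1$ only when token 3 is sampled. The policy, reward function, and learning rate are all illustrative choices, not from any particular implementation:

```python
import torch

torch.manual_seed(0)
logits = torch.zeros(5, requires_grad=True)   # policy parameters θ (toy "model")
optimizer = torch.optim.SGD([logits], lr=0.1)

def reward_fn(action):
    # Hypothetical reward: only "trajectories" ending in token 3 score
    return 1.0 if action == 3 else 0.0

for _ in range(500):
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()                    # sample a trajectory (one token here)
    reward = reward_fn(action.item())
    # REINFORCE: minimizing -log π(a) · R increases the log-probability
    # of sampled actions in proportion to their reward
    loss = -dist.log_prob(action) * reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(torch.softmax(logits, dim=0))   # probability mass concentrates on token 3
```

Even though the reward says nothing about *which* token was responsible, repeated sampling plus reward-weighted gradient steps shift the distribution toward the rewarded action.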

💡 In the language model setting, the state $s_t$ at timestep $t$ is the prompt concatenated with all tokens generated so far $(x, y_1, \ldots, y_{t-1})$, and the action $a_t$ is the next token $y_t$. The policy $\pi_\theta(a_t \mid s_t)$ is just the model's next-token distribution (the same softmax output we use during inference).

Why Does REINFORCE Struggle with Variance?

REINFORCE is correct in expectation (given infinite trajectories, it converges to the true gradient). But in practice we work with finite batches, and the variance of the gradient estimate can be enormous. Suppose we sample two responses to the same prompt: one scores $R = 8$ and another scores $R = 2$. Both are positive, so REINFORCE pushes up the probability of both responses (just more for the first). But neither response's gradient "knows" that $8$ is good and $2$ is mediocre; they only see absolute reward values, not relative ones.

The fix is to subtract a baseline $b$ from the reward, replacing $R_t$ with $R_t - b$. Subtracting a constant baseline doesn't change the expected gradient (because $\mathbb{E}[\nabla \log \pi \cdot b] = 0$ for any constant $b$) but it can dramatically reduce variance by centering the reward signal around zero. If $b$ is close to the average return, then above-average trajectories get positive weight (pushed up) and below-average ones get negative weight (pushed down).
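The claim that a constant baseline leaves the expected gradient unchanged can be checked numerically. The sketch below computes $\sum_a \pi(a)\, \nabla_\theta \log \pi(a)$ for a small categorical policy and confirms it is zero, which implies $\mathbb{E}[\nabla \log \pi \cdot b] = b \cdot 0 = 0$ for any constant $b$ (the logits here are arbitrary):

```python
import torch

logits = torch.tensor([0.5, -1.0, 2.0], requires_grad=True)
probs = torch.softmax(logits, dim=0)

# Expected score function: sum over actions of π(a) · ∇_θ log π(a)
expected_score = torch.zeros(3)
for a in range(3):
    (grad,) = torch.autograd.grad(torch.log(probs[a]), logits, retain_graph=True)
    expected_score += probs[a].detach() * grad

print(expected_score)   # ≈ zero in every coordinate
```

Because this expectation vanishes, shifting every reward by the same constant changes individual gradient samples (reducing their variance) without biasing their average.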

The most common baseline is the value function $V^\pi(s_t)$, which estimates the expected return from state $s_t$ under the current policy. Subtracting it gives us the advantage:

$$A_t = R_t - V^\pi(s_t)$$

The advantage $A_t$ answers a precise question: was the actual outcome better or worse than what we expected? If $A_t > 0$, the action taken was better than average for that state, and we should increase its probability. If $A_t < 0$, it was worse, and we should decrease it. This is strictly more informative than raw reward because it accounts for context — a reward of $5$ is great if we expected $2$, but disappointing if we expected $8$.
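As a tiny worked example of this centering, take the two responses from earlier that scored $8$ and $2$, using the batch mean as a simple stand-in baseline:

```python
import torch

rewards = torch.tensor([8.0, 2.0])   # two sampled responses to the same prompt

baseline = rewards.mean()            # simple baseline: batch-average return (5.0)
advantages = rewards - baseline      # tensor([ 3., -3.])

# With raw rewards, both responses would be pushed up (both weights positive).
# With advantages, the better-than-average response is reinforced and the
# worse-than-average one is actively suppressed.
print(advantages)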

With the advantage, the policy gradient becomes:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot A_t \right]$$

In practice, we train a separate critic network (often sharing a backbone with the policy) to predict $V^\pi(s_t)$. This critic is updated alongside the policy using standard regression on observed returns. The combination of a policy (the "actor") and a value function (the "critic") is called an actor-critic architecture, and it's the foundation of practically all modern policy gradient methods.

📌 Generalized Advantage Estimation (GAE) by Schulman et al. (2016) provides a more sophisticated way to compute $A_t$ by blending multi-step returns with a decay parameter $\lambda$. This further reduces variance at the cost of a small bias, and is the default advantage estimator in most PPO implementations.

How Does PPO Prevent Destructive Updates?

Even with a good advantage estimate, vanilla policy gradient methods tend to be unstable. A single batch with an unusually high-reward trajectory can produce a large gradient that overshoots, drastically changing the policy in a way that degrades performance. Once the policy has shifted too far, the value function estimates become stale, the advantage calculations become unreliable, and training can spiral. Neural network policies are particularly fragile here because a small change in weights can produce a large change in output distribution.

Proximal Policy Optimization (PPO) (Schulman et al., 2017) solves this with a clipped surrogate objective that prevents the policy from changing too much in a single update. Instead of directly using $\nabla \log \pi_\theta \cdot A_t$, PPO works with the probability ratio between the new and old policies:

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

This ratio $r_t$ measures how much the updated policy's probability for action $a_t$ has changed relative to the policy that originally generated the trajectory. If $r_t = 1$, the new policy assigns the same probability as the old one. If $r_t = 1.5$, the new policy is 50% more likely to take this action. If $r_t = 0.5$, it's half as likely. The PPO objective uses this ratio alongside the advantage:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min \left( r_t(\theta) \cdot A_t, \; \text{clip}(r_t(\theta), \, 1-\varepsilon, \, 1+\varepsilon) \cdot A_t \right) \right]$$

Let's walk through what the clipping does in each case.

When $A_t > 0$ (a good action we want to reinforce), the unclipped term $r_t \cdot A_t$ grows as we increase $r_t$, encouraging the optimizer to make this action ever more probable. Without clipping, a very high advantage could push $r_t$ to extreme values, concentrating all probability mass on this one action. The $\text{clip}(r_t, 1-\varepsilon, 1+\varepsilon)$ term caps $r_t$ at $1+\varepsilon$, so the clipped term saturates once the probability ratio exceeds $1+\varepsilon$. The $\min$ then takes whichever is lower, so once $r_t > 1+\varepsilon$, there is no further gradient pushing the ratio higher. The model can still be encouraged to take this action more, but not excessively so in a single update.

When $A_t < 0$ (a bad action we want to suppress), the unclipped term $r_t \cdot A_t$ becomes less negative as $r_t$ decreases (since we're multiplying a positive shrinking number by a negative advantage), which is exactly what optimization wants: reduce $r_t$ to minimize the negative contribution. But the clip stops $r_t$ from falling below $1-\varepsilon$, so the model cannot panic-flee from this action in a single step. Again, the $\min$ takes the more pessimistic (lower) value, ensuring the gradient vanishes once $r_t$ drops below $1-\varepsilon$.

The hyperparameter $\varepsilon$ controls how far the policy can move in one update and is typically set between $0.1$ and $0.2$. Smaller $\varepsilon$ means more conservative updates (more stable but slower learning); larger $\varepsilon$ allows bigger steps (faster but riskier). With $\varepsilon = 0.2$, the objective stops rewarding changes once an action's probability ratio has moved more than 20% from the old policy in the direction the advantage favors; the ratio can still drift slightly past that boundary within a step, but no gradient pushes it further.

To see why the $\min$ is needed, consider what happens without it. If we used only the clipped term, the gradient would vanish whenever the ratio leaves the clip range, even when it has moved in the *wrong* direction: say a noisy update made a high-advantage action less likely ($r_t < 1-\varepsilon$ while $A_t > 0$). The $\min$ keeps the unclipped term active in exactly those cases, so the policy can still correct course; the gradient is only killed once the ratio has already moved far enough in the direction the advantage favors. The result is a pessimistic lower bound on the true objective: when the ratio is inside $[1-\varepsilon, 1+\varepsilon]$, both terms are identical and the gradient flows normally; outside that range, the objective acts as a trust region around the old policy.
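Both clipping regimes can be verified directly by differentiating the per-token objective for a single action. This is a hedged sketch of the math above, not a full implementation:

```python
import torch

eps = 0.2

def ppo_term(ratio, advantage):
    # Per-token PPO objective: min(r·A, clip(r, 1-eps, 1+eps)·A)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantage
    return torch.min(unclipped, clipped)

# Good action (A > 0) whose ratio already exceeds 1 + eps:
r_high = torch.tensor(1.5, requires_grad=True)
ppo_term(r_high, torch.tensor(2.0)).backward()
print(r_high.grad)   # 0: no incentive to push the ratio higher

# Same action with the ratio still inside the trust region:
r_in = torch.tensor(1.0, requires_grad=True)
ppo_term(r_in, torch.tensor(2.0)).backward()
print(r_in.grad)     # equals A_t = 2: gradient flows normally

# Bad action (A < 0) whose ratio already fell below 1 - eps:
r_low = torch.tensor(0.5, requires_grad=True)
ppo_term(r_low, torch.tensor(-2.0)).backward()
print(r_low.grad)    # 0: no further push away from this action
```

Inside the trust region the gradient is simply $A_t$; outside it, in the favored direction, it is exactly zero.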

💡 PPO's clipped objective is a simpler alternative to TRPO (Trust Region Policy Optimization), which enforces a hard KL-divergence constraint between old and new policies. TRPO requires computing second-order derivatives (the Fisher information matrix), making it expensive. PPO achieves a similar effect with first-order optimization by using clipping as a soft constraint, which is why it became the de facto standard.

What Does the Training Loop Look Like?

With the clipped objective defined, we can sketch the PPO training loop end to end. The algorithm alternates between two phases: (1) collecting trajectories by letting the current policy generate responses, and (2) running several epochs of gradient updates on those trajectories using the clipped objective. The following pseudocode shows the structure.

# PPO training loop (simplified for language model fine-tuning)

for iteration in range(num_iterations):
    # ── Phase 1: Collect trajectories ──────────────────────────
    prompts = sample_batch(prompt_dataset, batch_size)

    with torch.no_grad():
        responses = policy.generate(prompts)                  # sample full responses
        old_log_probs = policy.log_probs(prompts, responses)  # π_old(a|s), frozen
        rewards = reward_model(prompts, responses)            # scalar per response
        values = critic(prompts, responses)                   # V(s) per token position

    advantages = compute_gae(rewards, values, gamma, lam)     # A_t via GAE
    returns = advantages + values                             # regression targets for the critic

    # ── Phase 2: PPO update (multiple epochs on same batch) ───
    for epoch in range(ppo_epochs):                           # typically 2-4 epochs
        new_log_probs = policy.log_probs(prompts, responses)

        # Probability ratio r_t = π_new / π_old
        ratio = torch.exp(new_log_probs - old_log_probs)

        # Clipped surrogate objective (negated: optimizers minimize)
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages
        policy_loss = -torch.min(unclipped, clipped).mean()

        # Value function loss (train the critic)
        value_loss = F.mse_loss(critic(prompts, responses), returns)

        # Combined loss
        loss = policy_loss + 0.5 * value_loss

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

A few things to notice in this loop. The old_log_probs are computed once with torch.no_grad() and frozen; they serve as the reference point for the probability ratio. The inner loop runs multiple gradient steps on the same batch of trajectories, which is the key efficiency gain of PPO over vanilla policy gradient (where each batch would be used for a single update and then discarded). The clipping ensures that these multiple passes don't move the policy too far from where the trajectories were collected, keeping the advantage estimates valid.

The compute_gae function computes Generalized Advantage Estimation (Schulman et al., 2016), which blends single-step and multi-step advantage estimates using a decay parameter $\lambda$. In practice, $\gamma$ (the discount factor) is usually close to $1.0$ for language model tasks since we care about the total reward of the response, and $\lambda$ is typically $0.95$.
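A compute_gae like the one used in the loop above might look as follows. This is a hedged sketch assuming per-token value estimates and a single scalar reward delivered at the final token (a common RLHF-style setup), not the exact function from any library:

```python
import torch

def compute_gae(rewards, values, gamma=1.0, lam=0.95):
    # rewards, values: 1-D tensors over the T token positions of one response
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # V(s_{t+1}) is taken as 0 past the end of the response
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]   # one-step TD error
        gae = delta + gamma * lam * gae                       # decayed blend of future deltas
        advantages[t] = gae
    return advantages

rewards = torch.tensor([0.0, 0.0, 1.0])   # scalar reward only at the final token
values = torch.tensor([0.2, 0.4, 0.7])    # critic's per-token estimates
print(compute_gae(rewards, values))       # one advantage per token position
```

The backward sweep is what lets a single end-of-response reward propagate credit to every earlier token, discounted by $\gamma\lambda$ per step.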

When this loop is applied to language models (as in RLHF), there's one additional ingredient that we haven't shown yet: a KL penalty that keeps the policy from drifting too far from the original SFT model. That penalty is critical for alignment and is the focus of the next article.

📌 In large-scale RLHF implementations (like those behind ChatGPT), the reward model, policy, critic, and reference model may each be a full-sized LLM. Running PPO at this scale requires careful distributed training strategies (often with the four models spread across different GPU groups). Libraries like TRL (Hugging Face) and OpenRLHF abstract much of this complexity.

Quiz

Test your understanding of policy gradients and PPO.

In the REINFORCE policy gradient, what role does the reward R_t play?

Why do we subtract a baseline from the reward to compute the advantage?

In PPO's clipped objective, what happens when A_t > 0 and the ratio r_t exceeds 1 + ε?

Why does PPO run multiple gradient epochs on the same batch of trajectories?