How Do We Increase the Probability of Good Actions?
We established in the previous article that RL-based training lets the model generate freely and receive a scalar reward for the full output. The question now is mechanical: given that reward signal, how do we actually update the model's weights? We need a gradient — a direction in parameter space that, when followed, makes high-reward outputs more likely and low-reward outputs less likely.
The simplest answer is the REINFORCE algorithm (Williams, 1992). The idea is clean: sample a trajectory (generate a full response), observe the reward, then nudge the model's parameters so that the actions taken during that trajectory become more or less likely in proportion to how good the reward was. Formally, we want to maximize the expected return $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$, where $\tau$ is a trajectory sampled from policy $\pi_\theta$ and $R(\tau)$ is the reward. The policy gradient theorem gives us the gradient of this objective:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R_t\right]$$
Each piece of this formula plays a specific role. $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ is the direction in parameter space that would increase the probability of taking action $a_t$ in state $s_t$ (in language model terms, this is the direction that makes the model more likely to produce token $a_t$ given the prompt and all tokens generated so far). $R_t$ is the return from timestep $t$ onward, and it acts as a scaling factor: when $R_t$ is large and positive, we take a big step in the direction that increases this token's probability; when $R_t$ is negative, we step in the opposite direction, making this token less likely.
The expectation $\mathbb{E}_{\tau \sim \pi_\theta}$ means we average over many sampled trajectories. In practice we approximate this with a batch of generated responses, computing the gradient for each and averaging. The more trajectories we sample, the better our estimate of the true gradient, but each trajectory requires a full forward pass through the model, so there's a direct tradeoff between gradient quality and compute.
To see why this works, consider what happens with a single trajectory. If we generate a response and it receives high reward, every token in that response gets its probability pushed up. If it receives low reward, every token gets pushed down. Over many trajectories, tokens that consistently appear in high-reward responses will have their probabilities increased, while tokens that appear in low-reward responses will be suppressed. The algorithm discovers which token-level decisions lead to good outcomes purely from output-level feedback, without ever needing per-token supervision.
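To make this concrete, here is a minimal sketch of REINFORCE on a toy one-step "language model" with a three-token vocabulary (the task, reward, and learning rate are all illustrative, not from any real system). For a softmax policy, $\nabla_\theta \log \pi_\theta(a)$ with respect to the logits is simply $\text{onehot}(a) - \pi$, so the whole update fits in a few lines:

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_step(logits, reward_fn, lr=0.1, rng=random):
    """One REINFORCE update for a single-step categorical policy.

    For a softmax policy, d log pi(a) / d logits = onehot(a) - pi,
    so the policy-gradient step is logits += lr * R * (onehot(a) - pi).
    """
    probs = softmax(logits)
    a = rng.choices(range(len(probs)), weights=probs)[0]  # sample an action
    R = reward_fn(a)                                      # observe scalar reward
    return [
        logit + lr * R * ((1.0 if i == a else 0.0) - p)
        for i, (logit, p) in enumerate(zip(logits, probs))
    ]

# Toy task: "token" 2 earns reward 1, everything else earns 0.
rng = random.Random(0)
logits = [0.0, 0.0, 0.0]
for _ in range(500):
    logits = reinforce_step(logits, lambda a: 1.0 if a == 2 else 0.0, rng=rng)
print(softmax(logits))  # probability mass concentrates on token 2
```

Note that the update only ever sees the scalar reward for the sampled action, yet repeated sampling is enough to shift probability toward the rewarded token.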
Why Does REINFORCE Struggle with Variance?
REINFORCE is correct in expectation (given infinite trajectories, it converges to the true gradient). But in practice we work with finite batches, and the variance of the gradient estimate can be enormous. Suppose we sample two responses to the same prompt: one scores $R = 8$ and another scores $R = 2$. Both are positive, so REINFORCE pushes up the probability of both responses (just more for the first). But neither response's gradient "knows" that $8$ is good and $2$ is mediocre; they only see absolute reward values, not relative ones.
The fix is to subtract a baseline $b$ from the reward, replacing $R_t$ with $R_t - b$. Subtracting a constant baseline doesn't change the expected gradient (because $\mathbb{E}[\nabla \log \pi \cdot b] = 0$ for any constant $b$) but it can dramatically reduce variance by centering the reward signal around zero. If $b$ is close to the average return, then above-average trajectories get positive weight (pushed up) and below-average ones get negative weight (pushed down).
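The variance reduction is easy to verify numerically. The sketch below (illustrative numbers, reusing the 8-versus-2 reward example from above) draws single-sample gradient estimates for a two-action softmax policy with and without a baseline; the mean stays the same while the variance shrinks dramatically (in this tiny two-outcome case it collapses entirely, since the baseline exactly centers the reward):

```python
import math
import random
import statistics

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

rng = random.Random(0)
probs = softmax([0.0, 0.0])   # uniform policy over two responses
rewards = [8.0, 2.0]          # both positive, as in the example above

def grad_samples(baseline, n=20000):
    """Single-sample estimates of d/d(logit 0) of the REINFORCE objective."""
    out = []
    for _ in range(n):
        a = rng.choices([0, 1], weights=probs)[0]
        score = (1.0 if a == 0 else 0.0) - probs[0]  # d log pi(a) / d logit 0
        out.append((rewards[a] - baseline) * score)
    return out

for b in (0.0, 5.0):  # 5.0 is the expected reward under this uniform policy
    g = grad_samples(b)
    print(f"baseline={b}: mean={statistics.mean(g):+.3f} "
          f"variance={statistics.variance(g):.3f}")
```

The true gradient is identical in both cases; only the noise around it changes, which is exactly why the baseline is "free" variance reduction.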
The most common baseline is the value function $V^\pi(s_t)$, which estimates the expected return from state $s_t$ under the current policy. Subtracting it gives us the advantage:

$$A_t = R_t - V^\pi(s_t)$$
The advantage $A_t$ answers a precise question: was the actual outcome better or worse than what we expected? If $A_t > 0$, the action taken was better than average for that state, and we should increase its probability. If $A_t < 0$, it was worse, and we should decrease it. This is strictly more informative than raw reward because it accounts for context — a reward of $5$ is great if we expected $2$, but disappointing if we expected $8$.
With the advantage, the policy gradient becomes:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A_t\right]$$
In practice, we train a separate critic network (often sharing a backbone with the policy) to predict $V^\pi(s_t)$. This critic is updated alongside the policy using standard regression on observed returns. The combination of a policy (the "actor") and a value function (the "critic") is called an actor-critic architecture, and it's the foundation of practically all modern policy gradient methods.
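As a sketch of what "sharing a backbone" means in code, an actor-critic module simply attaches two heads to one feature extractor (the dimensions and layer choices below are arbitrary placeholders, not from any real system):

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Toy actor-critic with a shared backbone: one feature extractor,
    a policy head (actor) producing action logits, and a value head
    (critic) producing a scalar V(s) estimate."""

    def __init__(self, obs_dim=16, n_actions=4, hidden=32):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.policy_head = nn.Linear(hidden, n_actions)  # actor: logits over actions
        self.value_head = nn.Linear(hidden, 1)           # critic: scalar V(s)

    def forward(self, obs):
        h = self.backbone(obs)                    # shared features
        return self.policy_head(h), self.value_head(h).squeeze(-1)

model = ActorCritic()
obs = torch.randn(8, 16)          # a batch of 8 toy "states"
logits, values = model(obs)
print(logits.shape, values.shape)  # torch.Size([8, 4]) torch.Size([8])
```

Sharing the backbone means one forward pass serves both heads, which is why large-scale RLHF setups often bolt a small value head onto the policy network rather than training a second full model.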
How Does PPO Prevent Destructive Updates?
Even with a good advantage estimate, vanilla policy gradient methods tend to be unstable. A single batch with an unusually high-reward trajectory can produce a large gradient that overshoots, drastically changing the policy in a way that degrades performance. Once the policy has shifted too far, the value function estimates become stale, the advantage calculations become unreliable, and training can spiral. Neural network policies are particularly fragile here because a small change in weights can produce a large change in output distribution.
Proximal Policy Optimization (PPO) (Schulman et al., 2017) solves this with a clipped surrogate objective that prevents the policy from changing too much in a single update. Instead of directly using $\nabla \log \pi_\theta \cdot A_t$, PPO works with the probability ratio between the new and old policies:

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}$$
This ratio $r_t$ measures how much the updated policy's probability for action $a_t$ has changed relative to the policy that originally generated the trajectory. If $r_t = 1$, the new policy assigns the same probability as the old one. If $r_t = 1.5$, the new policy is 50% more likely to take this action. If $r_t = 0.5$, it's half as likely. The PPO objective uses this ratio alongside the advantage:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t A_t,\ \text{clip}(r_t,\, 1-\varepsilon,\, 1+\varepsilon)\, A_t\right)\right]$$
Let's walk through what the clipping does in each case.
When $A_t > 0$ (a good action we want to reinforce), the unclipped term $r_t \cdot A_t$ grows as we increase $r_t$, encouraging the optimizer to make this action ever more probable. Without clipping, a very high advantage could push $r_t$ to extreme values, concentrating all probability mass on this one action. The $\text{clip}(r_t, 1-\varepsilon, 1+\varepsilon)$ term caps $r_t$ at $1+\varepsilon$, so the clipped term saturates once the probability ratio exceeds $1+\varepsilon$. The $\min$ then takes whichever is lower, so once $r_t > 1+\varepsilon$, there is no further gradient pushing the ratio higher. The model can still be encouraged to take this action more, but not excessively so in a single update.
When $A_t < 0$ (a bad action we want to suppress), the unclipped term $r_t \cdot A_t$ becomes less negative as $r_t$ decreases (since we're multiplying a positive shrinking number by a negative advantage), which is exactly what optimization wants: reduce $r_t$ to minimize the negative contribution. But the clip stops $r_t$ from falling below $1-\varepsilon$, so the model cannot panic-flee from this action in a single step. Again, the $\min$ takes the more pessimistic (lower) value, ensuring the gradient vanishes once $r_t$ drops below $1-\varepsilon$.
The hyperparameter $\varepsilon$ controls how far the policy can move in one update and is typically set between $0.1$ and $0.2$. Smaller $\varepsilon$ means more conservative updates (more stable but slower learning); larger $\varepsilon$ allows bigger steps (faster but riskier). With $\varepsilon = 0.2$, once an action's probability ratio has moved more than 20% away from the old policy in either direction, the objective stops supplying gradient to push it further.
To see why the $\min$ is needed, consider what happens without it. If we only had the clipped term, the objective would plateau outside the clip range but wouldn't actively prevent the ratio from moving further. The $\min$ ensures that the objective is always the more conservative of the two terms (a pessimistic lower bound). When the ratio is inside $[1-\varepsilon, 1+\varepsilon]$, both terms are identical and the gradient flows normally. When the ratio drifts outside that range in the direction the advantage favors, the gradient vanishes, creating an approximate trust region around the old policy.
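The case analysis above is easy to tabulate. This small helper (a direct transcription of the clipped surrogate, with illustrative ratio values) evaluates the objective for a single token and shows where it saturates:

```python
def ppo_surrogate(ratio, advantage, epsilon=0.2):
    """PPO's clipped surrogate for a single token (a value to be maximized)."""
    clipped_ratio = min(max(ratio, 1 - epsilon), 1 + epsilon)
    return min(ratio * advantage, clipped_ratio * advantage)

# A_t > 0: the objective saturates at (1 + eps) * A once the ratio exceeds 1 + eps,
# so there is no further reward for inflating this action's probability.
for r in (0.9, 1.0, 1.2, 1.5, 2.0):
    print(f"A=+1.0  r={r:.1f}  ->  {ppo_surrogate(r, 1.0):+.2f}")

# A_t < 0: the objective flattens at (1 - eps) * A once the ratio drops below
# 1 - eps, so there is no further reward for suppressing the action harder.
for r in (1.1, 1.0, 0.8, 0.5, 0.2):
    print(f"A=-1.0  r={r:.1f}  ->  {ppo_surrogate(r, -1.0):+.2f}")
```

Reading the printed values top to bottom shows the trust region directly: inside $[0.8, 1.2]$ the surrogate tracks $r_t \cdot A_t$ exactly, and outside it the value is constant, so its gradient with respect to the ratio is zero.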
What Does the Training Loop Look Like?
With the clipped objective defined, we can sketch the PPO training loop end to end. The algorithm alternates between two phases: (1) collecting trajectories by letting the current policy generate responses, and (2) running several epochs of gradient updates on those trajectories using the clipped objective. The following pseudocode shows the structure.
```python
# PPO training loop (simplified pseudocode for language model fine-tuning)
import torch
import torch.nn.functional as F

for iteration in range(num_iterations):
    # ── Phase 1: Collect trajectories ──────────────────────────
    prompts = sample_batch(prompt_dataset, batch_size)
    with torch.no_grad():
        responses = policy.generate(prompts)                  # sample full responses
        old_log_probs = policy.log_probs(prompts, responses)  # π_old(a|s)
        rewards = reward_model(prompts, responses)            # scalar per response
        values = critic(prompts, responses)                   # V(s) per token position
    advantages, returns = compute_gae(rewards, values, gamma, lam)  # A_t via GAE

    # ── Phase 2: PPO update (multiple epochs on same batch) ────
    for epoch in range(ppo_epochs):  # typically 2-4 epochs
        new_log_probs = policy.log_probs(prompts, responses)

        # Probability ratio r_t = π_new / π_old
        ratio = torch.exp(new_log_probs - old_log_probs)

        # Clipped surrogate objective
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages
        policy_loss = -torch.min(unclipped, clipped).mean()

        # Value function loss (train the critic)
        value_loss = F.mse_loss(critic(prompts, responses), returns)

        # Combined loss
        loss = policy_loss + 0.5 * value_loss

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```
A few things to notice in this loop. The `old_log_probs` are computed once under `torch.no_grad()` and frozen; they serve as the reference point for the probability ratio. The inner loop runs multiple gradient steps on the same batch of trajectories, which is the key efficiency gain of PPO over vanilla policy gradient (where each batch would be used for a single update and then discarded). The clipping ensures that these multiple passes don't move the policy too far from where the trajectories were collected, keeping the advantage estimates valid.
The `compute_gae` function computes Generalized Advantage Estimation (Schulman et al., 2016), which blends single-step and multi-step advantage estimates using a decay parameter $\lambda$. In practice, $\gamma$ (the discount factor) is usually close to $1.0$ for language model tasks since we care about the total reward of the response, and $\lambda$ is typically $0.95$.
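For reference, here is a minimal per-trajectory sketch of what a `compute_gae` helper might look like (a simplified, unbatched version; production implementations operate on padded tensors and handle attention masking):

```python
def compute_gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation for one trajectory.

    rewards[t] is the per-step reward (in RLHF, typically zero everywhere
    except the final token) and values[t] is the critic's V(s_t) estimate.
    Returns per-step advantages and the regression targets for the critic.
    """
    T = len(rewards)
    advantages = [0.0] * T
    gae = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0     # V = 0 past the end
        delta = rewards[t] + gamma * next_value - values[t]  # one-step TD error
        gae = delta + gamma * lam * gae       # discounted sum of future TD errors
        advantages[t] = gae
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns

# With gamma = lam = 1, A_t reduces to (total future reward) - V(s_t):
# reward 3 arrives at the last token, the critic predicted 1 everywhere.
adv, ret = compute_gae([0.0, 0.0, 3.0], [1.0, 1.0, 1.0], gamma=1.0, lam=1.0)
print(adv)  # [2.0, 2.0, 2.0]
print(ret)  # [3.0, 3.0, 3.0]
```

The backward loop is the whole trick: each step's advantage reuses the already-computed advantage of the next step, so the $(\gamma\lambda)$-weighted sum over all future TD errors costs only one pass.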
When this loop is applied to language models (as in RLHF), there's one additional ingredient that we haven't shown yet: a KL penalty that keeps the policy from drifting too far from the original SFT model. That penalty is critical for alignment and is the focus of the next article.
Quiz
Test your understanding of policy gradients and PPO.
In the REINFORCE policy gradient, what role does the reward $R_t$ play?
Why do we subtract a baseline from the reward to compute the advantage?
In PPO's clipped objective, what happens when $A_t > 0$ and the ratio $r_t$ exceeds $1 + \varepsilon$?
Why does PPO run multiple gradient epochs on the same batch of trajectories?