Can We Do Online RL Without a Critic?

We have now seen two ends of the alignment spectrum. PPO-based RLHF is powerful but complex: it requires a reward model, a value network (the critic), and an online sampling loop that generates completions, scores them, estimates advantages, and performs clipped updates. DPO eliminates all of that complexity by optimizing directly from preference data, but it gives up online generation entirely, training instead on a fixed dataset of preference pairs. A natural question follows: is there a middle ground that keeps online generation (so the model learns from its own outputs) but drops the expensive critic?

Shao et al. (2024) proposed Group Relative Policy Optimization (GRPO) as part of the DeepSeek series of models, and the core idea is disarmingly simple. Instead of training a separate value network to estimate "how good is this state?" as a baseline, we generate a group of responses to the same prompt, score all of them with a reward function, and use the group's own statistics (mean and standard deviation) as the baseline. Responses that score above the group average get reinforced; those below get suppressed. No critic needed.

This matters especially for large language models, where the critic (value network) is typically another model of comparable size to the policy. Maintaining it can roughly double the memory footprint during training. GRPO trades that memory cost for compute: generating $G$ responses per prompt requires more forward passes, but we avoid storing and updating an entire extra neural network.

How Does Group Normalization Replace the Value Network?

In PPO, the advantage $\hat{A}_t$ at each timestep tells the optimizer whether an action was better or worse than expected, where "expected" comes from a learned value function $V_\phi(s_t)$. Training $V_\phi$ to be accurate is itself a supervised learning problem that runs alongside the policy update, and getting it wrong introduces bias into the advantage estimates. GRPO sidesteps this entirely.

For each prompt $x$, we sample a group of $G$ complete responses $\{y_1, y_2, \ldots, y_G\}$ from the current policy $\pi_\theta$. We then score each one using a reward function $r(x, y_i)$ (this can be a trained reward model, a rule-based verifier, or any scalar scoring function). The advantage for response $i$ is computed by normalizing within the group:

$$\hat{A}_i = \frac{r_i - \text{mean}(r_1, \ldots, r_G)}{\text{std}(r_1, \ldots, r_G)}$$

This replaces the value network's role entirely. Instead of asking "was this response better than what we expected from this state?" (which requires a trained function approximator), we ask "was this response better than the other responses we just generated for the same prompt?" The group itself provides the context for what counts as good or bad.
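As a concrete sketch, the normalization can be written in a few lines of plain Python (the `group_advantages` helper and the small epsilon guard against zero variance are illustrative choices, not from the paper):

```python
from statistics import mean, stdev

def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group to get per-response advantages."""
    mu = mean(rewards)
    sigma = stdev(rewards)  # sample std; some implementations use population std
    return [(r - mu) / (sigma + eps) for r in rewards]

# A binary-reward group: correct responses land above the mean, incorrect below.
rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_advantages(rewards)
```

Note that the advantages always sum to (approximately) zero: reinforcing the above-average responses and suppressing the below-average ones are two sides of the same update.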

To see why this works, consider what happens at different group sizes. When $G$ is large (say 64), the group mean and standard deviation are stable estimates of the reward distribution for that prompt, so $\hat{A}_i$ reliably identifies which responses are above or below average. The normalization also handles the fact that different prompts may have very different reward scales: a math problem where most responses score 0 or 1 gets normalized separately from an open-ended writing prompt where scores spread across a continuous range.

When $G = 2$, something interesting happens. We have two responses, and after normalization one gets a positive advantage and the other gets a negative one. The update reinforces the better response and suppresses the worse one, which is conceptually similar to what DPO does with a preference pair (though the mechanism is different because GRPO still uses an explicit reward signal and policy gradient updates rather than a supervised loss on log-ratios). As $G$ increases, we get finer-grained information about the reward distribution for each prompt.

๐Ÿ’ก The group normalization also has a variance reduction effect similar to a baseline in REINFORCE. Subtracting the mean reward centers the advantages around zero, which tends to reduce the variance of the policy gradient estimate and stabilize training.

What Does the Full GRPO Objective Look Like?

With the group-normalized advantages in hand, GRPO uses a PPO-style clipped objective to update the policy. For each response $y_i$ in the group, let $\rho_i$ denote the importance sampling ratio between the current policy and the policy that generated the sample (the "old" policy from the previous iteration):

$$\rho_i = \frac{\pi_\theta(y_i | x)}{\pi_{\theta_{\text{old}}}(y_i | x)}$$

The full GRPO objective combines these normalized advantages with a PPO-style clipped update.

$$\mathcal{J}_{\text{GRPO}} = \mathbb{E}_{x \sim \mathcal{D},\; \{y_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)} \; \frac{1}{G} \sum_{i=1}^{G} \left[ \min\!\left( \rho_i \, \hat{A}_i, \; \text{clip}(\rho_i, 1-\varepsilon, 1+\varepsilon) \, \hat{A}_i \right) \right] - \beta \, \text{KL}(\pi_\theta \| \pi_{\text{ref}})$$

Several pieces of this should look familiar from the PPO article. The $\min(\rho_i \hat{A}_i, \text{clip}(\rho_i, 1{-}\varepsilon, 1{+}\varepsilon) \hat{A}_i)$ term is the same clipped surrogate objective that prevents destructively large policy updates. When $\rho_i$ stays within $[1{-}\varepsilon, 1{+}\varepsilon]$, the clipping has no effect and the gradient flows normally. When the policy tries to change too much (pushing $\rho_i$ outside this interval), the clipping caps the objective and the gradient vanishes, which acts as a trust region.
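To make the clipping behavior concrete, here is a scalar sketch of the clipped term (the `clipped_term` helper is illustrative; real implementations operate on tensors of token log-probabilities):

```python
def clipped_term(rho, A, eps=0.2):
    """PPO-style clipped surrogate for a single response with ratio rho."""
    unclipped = rho * A
    clipped = max(min(rho, 1 + eps), 1 - eps) * A  # clip rho into [1-eps, 1+eps]
    return min(unclipped, clipped)

# Positive advantage, ratio pushed too high: the clipped branch wins and
# caps the objective at (1 + eps) * A, so the gradient in rho vanishes.
print(clipped_term(1.5, 1.0))   # 1.2
# Negative advantage: min() keeps the more pessimistic of the two branches.
print(clipped_term(0.5, -1.0))  # -0.8
```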

The KL penalty $\beta \, \text{KL}(\pi_\theta \| \pi_{\text{ref}})$ serves the same purpose as in RLHF: it prevents the policy from drifting too far from the reference model (typically the SFT checkpoint). Without it, the policy could degenerate into producing a narrow set of high-reward outputs that no longer resemble coherent language. In practice, DeepSeek computes this KL term at the token level and averages it across the sequence, using the estimator $\text{KL} \approx \frac{\pi_{\text{ref}}(y|x)}{\pi_\theta(y|x)} - \log \frac{\pi_{\text{ref}}(y|x)}{\pi_\theta(y|x)} - 1$, which is always non-negative and has lower variance than the naive single-sample log-ratio estimate.
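This estimator (sometimes called the k3 estimator) is easy to sanity-check numerically; the standalone sketch below assumes per-sequence log-probabilities as inputs:

```python
import math

def kl_estimate(logp_ref, logp):
    """Single-sample estimator of KL(pi_theta || pi_ref): ratio - log(ratio) - 1,
    where ratio = pi_ref / pi_theta evaluated on a sample from pi_theta."""
    log_ratio = logp_ref - logp
    return math.exp(log_ratio) - log_ratio - 1

print(kl_estimate(-2.0, -2.0))  # 0.0 when the two policies agree
# Non-negative whichever direction the policies diverge:
print(kl_estimate(-1.0, -2.0) > 0, kl_estimate(-2.0, -1.0) > 0)
```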

Putting it all together, the GRPO training loop for each batch looks like this.

# Simplified GRPO training loop (pseudocode)

def grpo_step(policy, old_policy, ref_policy, prompts, reward_fn,
              optimizer, G=16, eps=0.2, beta=0.04):
    # Assumes `import torch` and that each policy exposes .generate / .log_prob
    all_losses = []

    for x in prompts:
        # 1. Sample a group of G responses from the current policy
        responses = [policy.generate(x) for _ in range(G)]

        # 2. Score each response
        rewards = torch.tensor([reward_fn(x, y) for y in responses])

        # 3. Group-normalize advantages (replaces the critic)
        advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

        # 4. Compute log-probs under current, old, and reference policies
        for y_i, A_i in zip(responses, advantages):
            log_pi = policy.log_prob(y_i, given=x)
            log_pi_old = old_policy.log_prob(y_i, given=x)  # frozen snapshot from before this update
            log_pi_ref = ref_policy.log_prob(y_i, given=x)

            # Importance sampling ratio
            rho = torch.exp(log_pi - log_pi_old)

            # Clipped surrogate (same as PPO)
            surr1 = rho * A_i
            surr2 = torch.clamp(rho, 1 - eps, 1 + eps) * A_i
            policy_loss = -torch.min(surr1, surr2)

            # KL penalty against reference (the estimator from the text)
            kl = torch.exp(log_pi_ref - log_pi) - (log_pi_ref - log_pi) - 1
            loss = policy_loss + beta * kl

            all_losses.append(loss)

    total_loss = torch.stack(all_losses).mean()
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()

Notice what is absent: there is no $V_\phi$ network, no advantage estimation via GAE (Generalized Advantage Estimation), and no separate value loss. The only neural network being updated is the policy itself. This reduces memory usage significantly (no critic model in GPU memory) and simplifies the training code, at the cost of needing $G$ forward passes per prompt to generate the group.
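A back-of-envelope sketch of that saving, with loudly assumed numbers (a hypothetical 7B-parameter critic and one common mixed-precision Adam memory layout; real setups vary with sharding and offload):

```python
def training_bytes_per_param(weight_bytes=2, master_bytes=4, adam_bytes=8):
    """Assumed layout: bf16 weights + fp32 master copy + fp32 Adam moments."""
    return weight_bytes + master_bytes + adam_bytes

critic_params = 7e9  # hypothetical critic the same size as the policy
critic_gb = critic_params * training_bytes_per_param() / 1e9
print(critic_gb)  # ~98 GB of training state that GRPO never allocates
```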

How Did DeepSeek Use GRPO for Reasoning?

The most prominent application of GRPO is DeepSeek-R1 (DeepSeek-AI, 2025), which used it to train a model with strong reasoning capabilities. The setup is particularly well-suited to reasoning tasks because the reward signal can often be verified automatically: for a math problem, we check whether the final answer is correct; for a coding task, we run the generated code against test cases. This binary or near-binary reward structure means we do not even need a learned reward model, just a verifier.
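A toy rule-based verifier along these lines might look as follows (the `math_reward` helper and its answer-extraction regex are illustrative, not DeepSeek's actual checker):

```python
import re

def math_reward(response: str, gold: str) -> float:
    """Binary reward: 1.0 if the last number in the response matches the
    gold answer, else 0.0. No learned reward model involved."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return 1.0 if numbers and numbers[-1] == gold else 0.0

print(math_reward("... so the answer is 42", "42"))  # 1.0
print(math_reward("I think it's 41", "42"))          # 0.0
```

Because the reward is computed from the response alone, this slots directly into the `reward_fn(x, y)` role in the training loop above.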

Strictly speaking, this describes DeepSeek-R1-Zero, the precursor experiment: it started from the base model (DeepSeek-V3) and applied GRPO directly, with no initial SFT stage on reasoning data (the full R1 pipeline later added a small cold-start SFT stage before RL). The group size was large enough that, for each prompt, some responses in the group would typically arrive at the correct answer and others would not. The group normalization then naturally assigned positive advantages to correct responses and negative advantages to incorrect ones, creating a clear learning signal without any human preference labels.

One of the most striking findings was that the model spontaneously developed chain-of-thought reasoning during GRPO training. Without being explicitly told to "think step by step," the model learned to produce longer, more structured reasoning traces because those traces tended to lead to correct answers (and therefore received positive advantages). It also learned self-correction behaviors, where the model would write something, realize it was wrong, backtrack, and try a different approach, all within a single generation. These behaviors emerged from the reward signal alone.

This highlights a broader point about online RL methods (both PPO and GRPO): because the model generates its own training data, it can discover strategies that no human annotator would have thought to demonstrate. DPO, training on a fixed dataset of human preferences, can only learn behaviors that are already represented in that dataset. GRPO's online generation opens the door to emergent behaviors, which is both its greatest strength (the model can surprise us with creative solutions) and a potential risk (it can also discover undesirable shortcuts if the reward function has gaps).

๐Ÿ’ก DeepSeek-R1 also revealed an "aha moment" during training where the model began allocating more tokens to harder problems and fewer to easier ones, suggesting that GRPO's group-relative signal can teach resource allocation strategies that would be difficult to specify explicitly.

Quiz

Test your understanding of GRPO and how it simplifies the RL pipeline.

What does GRPO use instead of a learned value network (critic) to compute advantages?

What happens to GRPO's group normalization when the group size G = 2?

Why is GRPO particularly well-suited to reasoning tasks like math and coding?

What is the main memory advantage of GRPO over PPO?