How Does TRL Turn These Ideas into Code?
Over the last five articles, we built up the alignment pipeline piece by piece: SFT to teach the model to follow instructions, reward modelling to capture human preferences, PPO to optimize against the reward, DPO to skip the reward model entirely, and GRPO to drop the critic. Each of these involved non-trivial implementation details (log-probability computations, clipped objectives, group normalization, KL penalties), and in a research or production setting, re-implementing them from scratch for every experiment would be slow and error-prone.
Hugging Face's TRL (Transformer Reinforcement Learning) library (von Werra et al., 2020) wraps all of these methods into a unified API that sits on top of the Transformers and PEFT ecosystems. The mapping between concepts and TRL classes is direct: each training method we studied has a corresponding Trainer class that handles the inner loop, and what took dozens of lines of custom code reduces to configuration. The following examples show the essential pattern for each method.
SFT is the starting point for every alignment pipeline, and TRL's SFTTrainer handles tokenization, packing multiple examples into a single sequence for efficiency, and the standard causal language modelling loss.
from trl import SFTTrainer, SFTConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
# dataset has a "text" column with instruction-response pairs
trainer = SFTTrainer(
model=model,
args=SFTConfig(output_dir="./sft-output", max_seq_length=2048),
train_dataset=sft_dataset,
processing_class=tokenizer,
)
trainer.train()
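The "text" column mentioned in the comment above can be built with a few lines of plain Python. This sketch uses a hypothetical hand-rolled template (real pipelines usually apply the tokenizer's chat template instead):

```python
# Hypothetical instruction-response pairs formatted into a single "text"
# field, matching the dataset layout assumed in the SFT example above.
pairs = [
    ("What is the capital of France?", "The capital of France is Paris."),
    ("Translate 'hello' to Spanish.", "'Hello' in Spanish is 'hola'."),
]

def format_example(instruction, response):
    # One training document per pair; SFTTrainer tokenizes these strings
    # and trains with the standard causal LM loss over the sequence.
    return f"### Instruction:\n{instruction}\n\n### Response:\n{response}"

sft_records = [{"text": format_example(i, r)} for i, r in pairs]
```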
Once we have an SFT model, we can train a reward model from preference data. TRL's RewardTrainer expects a dataset with columns for the chosen and rejected responses and handles the Bradley-Terry loss internally.
from trl import RewardTrainer, RewardConfig
from transformers import AutoModelForSequenceClassification
reward_model = AutoModelForSequenceClassification.from_pretrained(
"meta-llama/Llama-3.1-8B", num_labels=1
)
# dataset has "chosen" and "rejected" columns
reward_trainer = RewardTrainer(
model=reward_model,
args=RewardConfig(output_dir="./reward-output"),
train_dataset=preference_dataset,
processing_class=tokenizer,
)
reward_trainer.train()
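Under the hood, the Bradley-Terry loss that RewardTrainer optimizes is just the negative log-sigmoid of the score margin between the chosen and rejected responses. A minimal sketch in plain Python (scalar scores stand in for the reward model's outputs; the function name is illustrative):

```python
import math

def bradley_terry_loss(score_chosen: float, score_rejected: float) -> float:
    # P(chosen > rejected) = sigmoid(s_c - s_r) under the Bradley-Terry model;
    # the loss is the negative log-likelihood of the observed preference.
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A larger margin in favor of the chosen response gives a lower loss.
print(bradley_terry_loss(2.0, 0.0))  # ≈ 0.127
print(bradley_terry_loss(0.0, 2.0))  # ≈ 2.127
```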
With a reward model in hand, we can run PPO. TRL's PPOTrainer manages the generation loop, reward scoring, advantage estimation, and clipped updates, which is where most of the implementation complexity lives.
from trl import PPOTrainer, PPOConfig
from transformers import AutoModelForSequenceClassification
# Note: PPOTrainer's signature has changed across TRL versions; this sketch
# follows the newer API with separate policy, reference, value, and reward models.
policy = AutoModelForCausalLM.from_pretrained("./sft-output")
ref_policy = AutoModelForCausalLM.from_pretrained("./sft-output")
value_model = AutoModelForSequenceClassification.from_pretrained(
    "./sft-output", num_labels=1
)
ppo_config = PPOConfig(
    output_dir="./ppo-output",
    learning_rate=1e-6,
    kl_coef=0.05,  # beta for the KL penalty
)
ppo_trainer = PPOTrainer(
    args=ppo_config,
    model=policy,
    ref_model=ref_policy,  # frozen copy of the SFT model
    value_model=value_model,
    reward_model=reward_model,
    processing_class=tokenizer,
    train_dataset=prompt_dataset,
)
ppo_trainer.train()
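The clipped update at the heart of PPOTrainer can be sketched in a few lines. This is a per-token version in plain Python (the function name is illustrative; real implementations vectorize this over batches):

```python
import math

def ppo_clipped_objective(logp_new: float, logp_old: float,
                          advantage: float, eps: float = 0.2) -> float:
    # Probability ratio between the current policy and the rollout policy.
    ratio = math.exp(logp_new - logp_old)
    # Clipping removes the incentive to push the ratio outside [1-eps, 1+eps].
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    # Taking the min makes the bound pessimistic for both signs of the advantage.
    return min(ratio * advantage, clipped * advantage)
```

With a positive advantage and a large ratio, the objective saturates at (1 + eps) * advantage, which is exactly the $\varepsilon$ we analyzed in the clipped-objective formula.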
DPO is even simpler because there is no generation loop and no reward model. The DPOTrainer takes the same preference dataset format (chosen/rejected pairs) and optimizes the DPO loss directly.
from trl import DPOTrainer, DPOConfig
dpo_model = AutoModelForCausalLM.from_pretrained("./sft-output")
dpo_trainer = DPOTrainer(
model=dpo_model,
args=DPOConfig(output_dir="./dpo-output", beta=0.1),
train_dataset=preference_dataset,
processing_class=tokenizer,
)
dpo_trainer.train()
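The loss that DPOTrainer minimizes compares the policy's log-probability margins against the frozen reference model. A scalar sketch (sequence-level log-probabilities stand in for sums over tokens; the function name is illustrative):

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Implicit rewards are beta-scaled log-ratios against the reference model.
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # Negative log-sigmoid: the loss shrinks as the policy prefers the chosen
    # response more strongly than the reference does.
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```

This is the same $\beta$ passed to DPOConfig above: larger values penalize deviation from the reference more strongly.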
Each of these examples omits details that matter in production (LoRA configuration, dataset formatting, distributed training flags, gradient checkpointing), but the point is that the conceptual mapping is clean: every Trainer class corresponds to one of the methods we studied, and the hyperparameters ($\beta$, $\varepsilon$, learning rate, KL coefficient) are the same ones we analyzed in the formulas.
When Should We Use Which Method?
With so many alignment methods available, choosing between them can feel overwhelming. The decision usually depends on three factors: what data we have, how much compute we can afford, and whether the task benefits from online exploration. The following tendencies (not hard rules) can guide the choice.
SFT is always the first step, regardless of what comes after. It teaches the base model to follow instructions, produce structured outputs, and adopt the right tone. Skipping SFT and jumping directly to RL tends to produce unstable training because the policy starts too far from the target distribution for reward signals to be meaningful.
DPO is often the default choice for teams that have preference data but limited compute. Because it trains with a standard supervised loss on static data, it uses roughly half the GPU memory of RLHF (no value network, no online generation), tends to converge faster, and is easier to debug. It works best when the preference dataset is diverse enough to cover the deployment distribution and when we do not need the model to discover novel behaviors beyond what the data demonstrates.
PPO-based RLHF becomes worthwhile when we have a reliable reward model and enough compute for the online generation loop. Its key advantage is flexibility: because the policy generates its own completions and receives reward feedback, it can explore behaviors that no human annotator demonstrated. The reward model can also be reused across experiments or updated independently. The downside is engineering complexity (three or four models in GPU memory simultaneously) and the risk of reward hacking if the reward model has exploitable gaps.
GRPO occupies a middle ground: it keeps the online generation that makes PPO powerful but drops the critic, reducing memory usage and implementation complexity. It is particularly well-suited to tasks where reward can be verified automatically (math, code, structured output), because the verifier replaces the learned reward model and eliminates reward hacking. DeepSeek-R1's success on reasoning benchmarks demonstrated that GRPO can match or exceed PPO performance in this regime.
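The group normalization that lets GRPO drop the critic is a one-liner: each completion's advantage is its reward standardized within the group sampled for the same prompt. A sketch in plain Python (the function name is illustrative):

```python
def group_advantages(rewards, eps=1e-8):
    # Standardize rewards within one prompt's group of sampled completions;
    # the group mean plays the role that the learned value baseline plays in PPO.
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Completions scoring above the group mean get positive advantage.
print(group_advantages([1.0, 0.0, 0.0, 1.0]))
```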
Token-weighted SFT (from the first article in this track) is a lighter-weight option when we have per-token quality signals but do not want the overhead of full RL. It sits between SFT and DPO on the complexity spectrum and can be useful as a warm-start before DPO or as a standalone method when the quality signals are reliable.
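Token-weighted SFT amounts to scaling each token's cross-entropy contribution by its quality weight. A minimal sketch with per-token log-probabilities, assuming weights in [0, 1] (the function name is illustrative):

```python
def token_weighted_loss(token_logps, weights):
    # Standard SFT is the special case where every weight is 1.0;
    # down-weighting a token reduces its influence on the gradient.
    assert len(token_logps) == len(weights)
    total = sum(-lp * w for lp, w in zip(token_logps, weights))
    # Normalize by total weight so the loss scale is comparable to plain SFT.
    return total / max(sum(weights), 1e-8)
```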
What Can Go Wrong with Alignment?
Alignment methods are powerful, but they introduce failure modes that do not exist with SFT alone. Understanding these is important because they shape how practitioners design, monitor, and iterate on alignment pipelines.
Reward hacking is perhaps the most discussed failure mode. When a model is optimized against a learned reward model, it tends to find outputs that score highly according to the reward model but are not actually good by human standards. A classic example is the model learning to produce verbose, hedge-filled responses because the reward model was trained on data where longer responses were generally preferred. The model exploits the correlation between length and quality without actually improving quality. Gao et al. (2023) studied this systematically and found that reward model scores and true human preference diverge as optimization pressure increases: past a certain point, pushing harder against the reward model makes outputs worse rather than better. This is why the KL penalty is so important; it limits how far the policy can drift, which bounds the degree of reward hacking.
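In practice, the KL penalty is typically folded directly into the reward: the effective reward is the reward model score minus beta times the log-ratio between the policy and the reference model. A sketch (the function name is illustrative):

```python
def kl_penalized_reward(rm_score, policy_logps, ref_logps, beta=0.2):
    # Per-token KL estimate: log pi(t) - log pi_ref(t), summed over the response.
    kl = sum(p - r for p, r in zip(policy_logps, ref_logps))
    # The penalty shrinks the effective reward as the policy drifts from the
    # reference, bounding how far reward hacking can push the outputs.
    return rm_score - beta * kl
```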
The alignment tax refers to the observation that alignment can slightly reduce a model's raw capabilities on benchmarks that measure factual knowledge or reasoning. A model that has been RLHF-tuned to be helpful and harmless may score slightly lower on, say, MMLU than the same model after SFT alone. This happens because the KL penalty and preference optimization pull the model's distribution away from the one that maximizes next-token prediction accuracy. In practice the tax tends to be small (on the order of a few percentage points in reported benchmarks), and the gains in usability usually outweigh it, but it is worth monitoring.
Scalable oversight is a deeper challenge that grows with model capability. RLHF assumes that humans can reliably judge which of two outputs is better, but as models produce longer, more complex, or more specialized outputs, this assumption becomes strained. A human labeler may not be able to tell whether a 500-line program is correct, or whether a detailed medical explanation contains subtle errors. If the preference labels are noisy or systematically biased, the resulting alignment inherits those flaws.
Several research directions aim to address scalable oversight. Constitutional AI (Bai et al., 2022) replaces human labelers with an AI that critiques outputs against a set of written principles, reducing (though not eliminating) the bottleneck of human annotation. RLAIF (Reinforcement Learning from AI Feedback) extends this by using an LLM to generate preference labels for training the reward model. Debate and recursive reward modelling are more speculative proposals where models argue with each other or help humans evaluate complex outputs. None of these fully solves the problem, but they represent active directions toward making alignment work as models become more capable.
Where Is the Field Heading?
The field of alignment has shifted dramatically in a short time. In 2022, RLHF with PPO was the only proven method, and it required significant infrastructure (InstructGPT reportedly required a dedicated team and substantial compute). By 2023, DPO showed that we could get comparable results with a simple supervised loss. By 2024, GRPO demonstrated that online RL could work without a critic, and DeepSeek-R1 showed that emergent reasoning behaviors could arise from reward optimization alone. Each step simplified the pipeline while preserving (or improving) the outcomes.
What remains constant across all these methods are the foundational ideas we have built up throughout this track. Policy optimization (whether through REINFORCE, PPO clipping, or DPO's implicit gradient) provides the engine for improvement. Preference learning (the Bradley-Terry model, pairwise comparisons) translates human judgment into a training signal. KL regularization prevents the policy from drifting into degenerate regions. These three pillars appear in every method we have studied, just combined in different ways.
New methods continue to appear at a rapid pace. DPO variants like IPO (Azar et al., 2023) and KTO (Ethayarajh et al., 2024) refine the preference loss to handle noisy labels or work with unpaired data (where we have "good" and "bad" examples but not matched pairs). Online DPO methods generate new completions during training to combine DPO's simplicity with the exploration benefits of PPO. Process reward models score each step of a reasoning chain rather than just the final answer, giving finer-grained feedback for math and code tasks. Each of these builds on the foundations we have covered.
For practitioners, the practical advice is to start simple and add complexity only when needed. SFT gets us most of the way. DPO on a good preference dataset handles most alignment needs. Online RL (PPO or GRPO) is worth the investment when we need the model to discover new behaviors, when we have a reliable automated reward signal, or when the deployment distribution differs significantly from the training data. The foundations do not change, even as the methods evolve.
Quiz
Test your understanding of the alignment methods and practical considerations.
Why is SFT typically performed before any RL-based alignment method?
What is reward hacking in the context of RLHF?
Which alignment method is generally best suited for a small team with a good preference dataset but limited compute?
What problem does Constitutional AI (CAI) aim to address?
Which three foundational ideas appear across all alignment methods covered in this track?