PPO, DPO & GRPO: Reinforcement Learning Techniques for Training LLMs
1) Introduction
Traditional language model training has a fundamental limitation: it only teaches models to mimic patterns from existing data. While this produces grammatically correct text, it doesn’t create models that can truly think, reason, or align with human preferences.
Reinforcement Learning (RL) addresses this by optimizing models against feedback on their own outputs, so they learn to reason, adapt, and align with human preferences rather than only copy patterns. For example:
Article: “AI’s Impact on Employment”
Model Generation A: “AI will replace jobs.”
Model Generation B: “AI is transforming employment by automating routine tasks while creating new roles in AI development, data analysis, and human-AI collaboration, requiring proactive reskilling initiatives.”
Human Feedback: B > A (more nuanced, specific, actionable)
RL rewards models based on output quality and human preferences. This forces models to move beyond pattern matching and actually consider what makes a response good — analyzing context, weighing different approaches, and optimizing for human-defined criteria. Through trial and feedback, models discover that systematic reasoning (breaking down problems, considering multiple angles, structuring thoughts) consistently produces higher-quality outputs that get better rewards. Over time, this reasoning process becomes learned behavior, transforming models from text generators into thinking agents that can adapt and solve problems.
2) What is PPO (Proximal Policy Optimization)?
Imagine you’re mentoring a new journalist who writes article summaries. Every time they write a summary, you give them a score from 1–10 based on how accurate, comprehensive, and engaging it is. The journalist wants to improve their score, but you don’t want them to change their writing style too dramatically between summaries — sudden changes might make them worse, not better.
PPO[1] works exactly like this mentoring approach:
- Generate summaries: The AI writes summaries of AI-related articles
- Get feedback: A “reward model” (like our editor) scores each summary based on accuracy, completeness, and readability
- Improve gradually: The AI adjusts its summarization style to get higher scores, but only makes small, safe changes at a time
- Prevent dramatic changes: PPO includes a “safety mechanism” that stops the AI from changing too much too quickly
2.1) Technical Details
PPO is a policy gradient algorithm that optimizes the probability distribution over tokens (the “policy”) while maintaining stability through careful constraint mechanisms.
Following are the key components in PPO:
Policy Model π_θ(a|s):
- This is the main language model (e.g., GPT-style transformer) that we’re training
- Takes article text as input, outputs probability distribution over vocabulary for each position
- θ represents the neural network parameters (weights and biases)
- For summarization: given article text, predicts probability of each possible next word
Reward Model R(s,a):
- A separate neural network trained to score summary quality
- Typically a transformer with a scalar output head that takes the article and the full summary as input and outputs a single score
- Trained on human ratings: “This summary gets 8/10 for accuracy and completeness”
- Acts as a proxy for human judgment during RL training
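When the reward model is trained on comparison data rather than absolute ratings, one common choice is a Bradley-Terry style pairwise loss that pushes the score of the preferred summary above the score of the rejected one. The sketch below assumes that loss; the function name and the example scores are illustrative and not taken from any specific implementation.

```python
import math

def pairwise_reward_loss(score_preferred: float, score_rejected: float) -> float:
    """Bradley-Terry style loss: -log sigmoid(score_preferred - score_rejected)."""
    margin = score_preferred - score_rejected
    # log(1 + exp(-margin)) is the same quantity as -log(sigmoid(margin)).
    return math.log1p(math.exp(-margin))

# Hypothetical raw reward-model scores for two summaries of one article.
loss_correct_order = pairwise_reward_loss(2.0, 0.5)  # model agrees with the human label
loss_wrong_order = pairwise_reward_loss(0.5, 2.0)    # model disagrees with the human label
print(f"{loss_correct_order:.3f} {loss_wrong_order:.3f}")
```

The loss is small when the reward model already ranks the pair the way the annotator did and grows as the ordering becomes more wrong.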
Advantage Function A(s, a):
- Measures how much better a specific action is compared to the expected value
- Calculated as: A(s, a) = Q(s, a) - V(s), where Q is the action-value and V is the state-value
- Positive advantage = better than average choice, negative = worse than average
The PPO Objective Function:
L^CLIP(θ) = E[min(r_t(θ)A_t, clip(r_t(θ), 1-ε, 1+ε)A_t)]
Where:
r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t) is the probability ratio between the new and old policy
A_t is the advantage estimate
ε (typically 0.1-0.2) is the clipping parameter
Training Process:
- Supervised fine-tuning: Use demonstration data to fine-tune the pre-trained model as the base policy
- Train reward model: Use comparison data to train a classifier that scores summary quality, creating a “reward model” that mimics human preferences
- RL optimization: Use PPO algorithm to update policy based on reward model scores, with clipping constraints to prevent large changes that could destabilize training
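The clipped objective L^CLIP defined above can be sketched in a few lines of plain Python. This is a minimal illustration, not a training loop: the helper names clipped_surrogate and ppo_clip_objective and the batch values are made up for the example, and a real implementation would compute the ratios from token log-probabilities and differentiate through them.

```python
def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Per-sample PPO term: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

def ppo_clip_objective(ratios, advantages, eps=0.2):
    """Average of the per-sample terms; PPO maximizes this quantity."""
    terms = [clipped_surrogate(r, a, eps) for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)

# Hypothetical batch: probability ratios and advantage estimates.
ratios = [1.0, 1.5, 0.5, 1.5]
advantages = [1.0, 1.0, -1.0, -1.0]
objective = ppo_clip_objective(ratios, advantages, eps=0.2)
print(objective)
```

In the second sample the ratio 1.5 is clipped to 1.2, so pushing the probability further earns nothing extra. In the fourth sample the advantage is negative and the ratio has moved the wrong way, so the min keeps the unclipped, more pessimistic term.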
Let’s trace through a concrete example where our AI-Assistant is writing a summary:
Article Input: “AI researchers at Stanford have developed a new language model…”
Current Summary So Far: “Stanford researchers have created”
Step 1: Policy Model Output
Next token probabilities from π_θ:
- “a” → 0.35
- “an” → 0.28
- “new” → 0.25
- “advanced” → 0.12
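Generation draws the next token at random according to these probabilities. The sketch below, which uses only the four probabilities listed in Step 1, samples many tokens and checks that the empirical frequencies approach the stated distribution; the helper name sample_token and the fixed seed are choices made for this example.

```python
import random

# Next-token distribution from Step 1 (values copied from the example).
next_token_probs = {"a": 0.35, "an": 0.28, "new": 0.25, "advanced": 0.12}

def sample_token(probs: dict, rng: random.Random) -> str:
    """Draw one token according to its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the run is repeatable
draws = [sample_token(next_token_probs, rng) for _ in range(10_000)]
empirical = {t: draws.count(t) / len(draws) for t in next_token_probs}
print(empirical)
```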
Step 2: Generate Complete Summaries
Let’s say we sample and get two complete summaries:
- Summary A: “Stanford researchers have created a breakthrough AI system”
- Summary B: “Stanford researchers have created new technology”
Step 3: Reward Model Scoring
- Summary A: 8.2/10 (comprehensive, specific)
- Summary B: 5.1/10 (too vague, lacks detail)
Step 4: Advantage Calculation
The advantage function A(s,a) = Q(s,a) - V(s) measures how much better a specific action is compared to the expected value from that state.
Current state: “Stanford researchers have created”
V(state) = Expected total reward starting from this state = 6.8
(i.e., estimated by the value network: what is the expected final summary score if we continue from here?)
Q(state, “a”) = Expected total reward if we choose “a” then follow current policy = 8.2
(i.e., if we choose “a” at this position and then continue with the current policy π_θ, the expected final summary score is 8.2)
Q(state, “new”) = Expected total reward if we choose “new” then follow current policy = 5.1
A(state, "a") = Q(state, "a") - V(state) = 8.2 - 6.8 = +1.4
A(state, "new") = Q(state, "new") - V(state) = 5.1 - 6.8 = -1.7
For estimating Q-values in practice:
To estimate Q(state, "a"):
1. From current state "Stanford researchers have created"
2. Force next token to be "a" → "Stanford researchers have created a"
3. Continue sampling from current policy π_θ to complete summary
4. Feed each complete summary to the trained reward model
5. Average the reward model scores: Q(state, "a") = average of all completion scores
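The five steps above amount to a Monte Carlo estimate, which the following sketch illustrates. The policy and the reward model are replaced by hypothetical stand-ins (a sampler that cycles through three canned continuations and a lookup table of scores) so that the example runs on its own; only the averaging logic in estimate_q reflects the procedure.

```python
import itertools
from statistics import mean

# Hypothetical stand-ins: canned continuations with their reward-model scores.
CANNED = {
    " breakthrough AI system that can process both text and images with unprecedented accuracy.": 8.5,
    " new language model for research.": 7.8,
    " powerful tool for AI development.": 8.3,
}
_continuations = itertools.cycle(CANNED)

def sample_continuation(prefix: str) -> str:
    """Stand-in for sampling the rest of the summary from the current policy."""
    return next(_continuations)

def reward_model(summary: str) -> float:
    """Stand-in for the trained reward model: look up the canned score."""
    return next(score for tail, score in CANNED.items() if summary.endswith(tail))

def estimate_q(state: str, token: str, n_samples: int) -> float:
    """Force `token`, roll out to a full summary n_samples times, average the rewards."""
    prefix = f"{state} {token}"
    rewards = [reward_model(prefix + sample_continuation(prefix)) for _ in range(n_samples)]
    return mean(rewards)

state = "Stanford researchers have created"
q_a = estimate_q(state, "a", n_samples=3)
advantage_a = q_a - 6.8  # V(state) = 6.8 is the value estimate from Step 4
print(round(q_a, 2), round(advantage_a, 2))
```

Rolling out many completions per token is expensive, which is why practical PPO implementations rely on a learned value network to estimate advantages instead.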
Sample completions starting with "a":
- "Stanford researchers have created a breakthrough AI system that can
process both text and images with unprecedented accuracy."
→ Reward Model Score: 8.5
- "Stanford researchers have created a new language model for research."
→ Reward Model Score: 7.8
- "Stanford researchers have created a powerful tool for AI development."
→ Reward Model Score: 8.3
- ... (97 more samples)
- Average: Q(state, "a") = 8.2
The rewards (8.5, 7.8, 8.3) come from the reward model, a neural network that was trained earlier on human feedback data.
Step 5: Policy Update Calculation
Current situation:
- π_old("a") = 0.35 (current policy assigns 35% probability to token "a")
- Advantage A(state, "a") = +1.4 (choosing "a" is better than average)
PPO Update Process (simplified: a real implementation updates the network parameters θ by gradient ascent on the clipped objective, not the token probability directly):
1. Calculate gradient: ∇ = A(state, "a") * (1/π_old("a"))
∇ = 1.4 * (1/0.35) = 4.0
2. Apply gradient step: π_new("a") = π_old("a") + learning_rate * ∇
π_new("a") = 0.35 + 0.01 * 4.0 = 0.35 + 0.04 = 0.39
3. Calculate ratio: r_t = π_new("a") / π_old("a") = 0.39 / 0.35 ≈ 1.11
4. Apply clipping: ε = 0.1, so the ratio is clipped to [1-0.1, 1+0.1]
clipped_ratio = clip(1.11, 0.9, 1.1) = 1.1
5. Effective update: π_final("a") = π_old("a") * clipped_ratio = 0.35 * 1.1 = 0.385
What this means:
- Token “a” had a positive advantage (+1.4), so we want to increase its probability. The unclipped gradient step would raise it from 35% to 39%.
- Clipping limits the rewarded increase to 38.5% (a 10% relative change): beyond a ratio of 1+ε the clipped objective stops growing, so there is no incentive to push further. This prevents the policy from changing too dramatically in one update.
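The arithmetic of this update can be reproduced directly. The sketch below follows the walkthrough's simplification of stepping on the probability itself; the variable names are chosen for this example, and the surrogate line shows the clipped PPO term that results.

```python
def clip(x: float, low: float, high: float) -> float:
    return max(low, min(x, high))

pi_old = 0.35         # probability of token "a" under the old policy
advantage = 1.4       # A(state, "a")
learning_rate = 0.01
eps = 0.1

# Simplified gradient step on the probability itself, as in the walkthrough.
grad = advantage / pi_old
pi_new = pi_old + learning_rate * grad
ratio = pi_new / pi_old
clipped_ratio = clip(ratio, 1 - eps, 1 + eps)

# The PPO surrogate stops increasing once the ratio passes 1 + eps.
surrogate = min(ratio * advantage, clipped_ratio * advantage)
pi_effective = pi_old * clipped_ratio

print(f"grad={grad:.2f} pi_new={pi_new:.3f} ratio={ratio:.3f}")
print(f"clipped_ratio={clipped_ratio:.2f} surrogate={surrogate:.2f} pi_effective={pi_effective:.3f}")
```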
3) What is DPO (Direct Preference Optimization)?
Instead of giving numerical scores to individual summaries, you show your trainee pairs of summaries for the same article and say “Summary A is better than Summary B” or “Summary B is better than Summary A.” You collect hundreds of these comparisons across different articles about AI.
DPO[2] works like this simplified editing method:
- Show pairs: Present the AI with pairs of summaries — one preferred, one not preferred
- Learn directly: Instead of first learning to “score” summaries (like PPO does), the AI directly learns to increase the probability of generating preferred summaries
- Skip the middleman: No need for a separate “scoring system” (reward model)
Training Process:
- Collect preference data: Gather (article, preferred_summary, dispreferred_summary) triplets
- Initialize with SFT model: Start with a supervised fine-tuned model as π_ref
- Optimize directly: Update policy to increase log-probability of preferred summaries and decrease log-probability of dispreferred summaries
3.1) Technical Details
DPO eliminates the reward model by directly optimizing the policy on preference data using a mathematical re-parameterization of the RLHF objective.
Core Components Explained:
Reference Policy π_ref:
- The initial supervised fine-tuned model (frozen, not updated)
- Serves as anchor to prevent the model from drifting too far from natural language
- Usually the same model architecture, trained on human-written summaries
Trainable Policy π_θ:
- The model being optimized (same architecture as reference)
- Starts as copy of π_ref, then updated based on preferences
- Parameters θ are updated to satisfy human preferences
Preference Dataset:
- Triplets: (article, preferred_summary, not-preferred_summary)
- No numerical scores needed, just binary preferences
- Collected by showing human annotators pairs of summaries
The DPO Objective Function:
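L_DPO = -E[(x,y_w,y_l)~D][log σ(β log(π_θ(y_w|x) / π_ref(y_w|x)) - β log(π_θ(y_l|x) / π_ref(y_l|x)))]
Where:
(x,y_w,y_l)~D means "sample a preference triplet from dataset D"
x = input article (e.g., "OpenAI releases GPT-4...")
y_w is the preferred (winning) summary
y_l is the less preferred (losing) summary
σ is the sigmoid function
β is a temperature-like parameter controlling how strongly the preference is enforced (and how far π_θ may move from π_ref)
π_θ is the policy being optimized
π_ref is the frozen reference policy
Let’s trace through a concrete DPO training step. For readability, the numbers below use only the log-probabilities under π_θ and leave out the π_ref terms.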
Article: “OpenAI releases GPT-4, a large multimodal model…”
Data Preference Pair:
- Summary A (Preferred): “OpenAI has launched GPT-4, a multimodal AI system capable of processing both text and images, representing a significant advancement in language model capabilities with improved reasoning and reduced hallucinations.”
- Summary B (Not-preferred): “OpenAI made a new AI model called GPT-4.”
Step 1: Calculate Log Probabilities
For each summary, calculate the log probability under current policy π_θ. These are the probabilities of generating each token given the article context and previous tokens.
Article Context: "OpenAI releases GPT-4, a large multimodal model..."
Summary A tokens: ["OpenAI", "has", "launched", "GPT-4", ...]
Log probabilities for each token (given article + previous tokens):
log π_θ("OpenAI"|article_context) = -2.1
log π_θ("has"|article_context + "OpenAI") = -1.8
log π_θ("launched"|article_context + "OpenAI has") = -3.2
log π_θ("GPT-4"|article_context + "OpenAI has launched") = -2.8
...
Total log π_θ(Summary A|article) = -45.6
Summary B tokens: ["OpenAI", "made", "a", "new", ...]
Log probabilities:
log π_θ("OpenAI"|article_context) = -2.1
log π_θ("made"|article_context + "OpenAI") = -2.4
log π_θ("a"|article_context + "OpenAI made") = -1.2
log π_θ("new"|article_context + "OpenAI made a") = -1.8
...
Total log π_θ(Summary B|article) = -28.3
Summary A has a lower log-probability (-45.6) than Summary B (-28.3) under the current policy, even though humans prefer Summary A! This is exactly what DPO needs to fix.
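A sequence log-probability is just the sum of its per-token log-probabilities. The sketch below adds up the four token values shown for each summary; because the remaining tokens are elided in the walkthrough, these are partial sums and not the full totals of -45.6 and -28.3.

```python
import math

# Log-probabilities of the first four tokens of each summary, copied from the
# walkthrough. The remaining tokens are elided there, so these are partial sums.
summary_a_prefix = [("OpenAI", -2.1), ("has", -1.8), ("launched", -3.2), ("GPT-4", -2.8)]
summary_b_prefix = [("OpenAI", -2.1), ("made", -2.4), ("a", -1.2), ("new", -1.8)]

def sequence_log_prob(token_log_probs) -> float:
    """log p(y | x) = sum over t of log p(y_t | x, y_<t)."""
    return sum(lp for _, lp in token_log_probs)

partial_a = sequence_log_prob(summary_a_prefix)
partial_b = sequence_log_prob(summary_b_prefix)
print(f"A prefix: {partial_a:.1f} (prob {math.exp(partial_a):.2e})")
print(f"B prefix: {partial_b:.1f} (prob {math.exp(partial_b):.2e})")
```

Every extra token adds another negative term, so a long summary tends to have a lower total log-probability than a short one regardless of quality.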
Step 2: Apply DPO Loss Calculation
β = 0.5 (temperature parameter)
Preference logit = β × (log π_θ(Summary A) - log π_θ(Summary B))
= 0.5 × (-45.6 - (-28.3))
= 0.5 × (-17.3)
= -8.65
Sigmoid probability = σ(-8.65) ≈ 0.00018
DPO loss = -log σ(-8.65) ≈ 8.65
The negative preference logit (-8.65) indicates a problem: our current policy assigns much higher probability to the not-preferred Summary B than to the preferred Summary A.
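The Step 2 numbers can be checked with a few lines of Python. This sketch uses the same simplified logit as the walkthrough (policy log-probabilities only); the variable names are chosen for the example.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

beta = 0.5
log_p_preferred = -45.6      # log π_θ(Summary A | article)
log_p_not_preferred = -28.3  # log π_θ(Summary B | article)

preference_logit = beta * (log_p_preferred - log_p_not_preferred)
probability = sigmoid(preference_logit)
loss = -math.log(probability)
print(f"logit={preference_logit:.2f} probability={probability:.6f} loss={loss:.2f}")
```

When the logit is strongly negative the loss is approximately equal to its magnitude, which is why a badly mis-ordered pair dominates the batch loss.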
Step 3: Gradient Update
The loss encourages:
- Increasing log probability of preferred Summary A
- Decreasing log probability of not-preferred Summary B
- The β parameter controls how strongly to enforce the preference
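To see how the loss pushes the two log-probabilities in opposite directions, the sketch below evaluates the DPO loss in its published form, which includes the reference policy, together with its derivatives with respect to the policy log-probabilities. The function name is made up for this example, and the reference log-probabilities are hypothetical values, since the walkthrough does not give any.

```python
import math

def dpo_loss_and_grads(logp_w, logp_l, ref_logp_w, ref_logp_l, beta):
    """Return the DPO loss and its derivatives w.r.t. the policy log-probs.

    z = beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)], loss = -log sigmoid(z).
    """
    z = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    loss = math.log1p(math.exp(-z))     # equals -log(sigmoid(z))
    weight = 1.0 / (1.0 + math.exp(z))  # sigmoid(-z): large when the pair is mis-ordered
    grad_w = -beta * weight             # gradient descent raises logp_w
    grad_l = beta * weight              # gradient descent lowers logp_l
    return loss, grad_w, grad_l

# Policy values from the walkthrough; the reference values are hypothetical.
loss, grad_w, grad_l = dpo_loss_and_grads(
    -45.6, -28.3, ref_logp_w=-44.0, ref_logp_l=-29.0, beta=0.5
)
print(f"loss={loss:.3f} grad_w={grad_w:.3f} grad_l={grad_l:.3f}")

# When the policy still equals the reference, z = 0 and the loss is log 2.
start_loss, _, _ = dpo_loss_and_grads(-44.0, -29.0, -44.0, -29.0, beta=0.5)
print(f"start_loss={start_loss:.4f}")
```

The two gradients have equal magnitude and opposite sign, and that magnitude scales with β and with how badly the pair is currently mis-ordered.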
4) What is GRPO (Group Relative Policy Optimization)?
Instead of editing one summary at a time, you ask your trainee to write 5 different summaries of the same AI article. Then, you rank all 5 summaries from best to worst based on their accuracy, completeness, and readability. The trainee learns by understanding the relative quality within this group.
GRPO[3] works with this group-based learning:
- Generate multiple summaries: For each AI article, the AI writes several different summaries (e.g., 4–8 summaries)
- Rank within groups: Evaluate and rank these summaries relative to each other
- Learn from rankings: The AI learns to increase the probability of generating higher-ranked summaries and decrease the probability of lower-ranked ones
- Focus on relative quality: Instead of absolute scores, focus on “Summary A is better than Summary B for this specific article”
GRPO Objective Function:
L_GRPO = -E[∑∑ I(rank(y_i) > rank(y_j)) * log σ(β(log π_θ(y_i|x) - log π_θ(y_j|x)))]
Where:
I(rank(y_i) > rank(y_j)) is an indicator function (1 if y_i is ranked higher than y_j, 0 otherwise)
The sum is over all pairs within each group
β controls the strength of the relative preference
Training Process:
- Batch Sampling: Sample articles and generate k summaries per article
- Group Evaluation: Rank summaries within each group using reward model or human raters
- Pairwise Loss: Compute loss over all valid pairs within groups
- Policy Update: Update policy to satisfy the learned relative preferences
4.1) Technical Details
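GRPO optimizes the policy using relative preferences within a group of summaries generated for the same article, which makes better use of each evaluated group than scoring summaries in isolation. The pairwise ranking loss used in this walkthrough is a simplified view: the GRPO algorithm as published scores every sample with a reward, normalizes the rewards within the group to obtain advantages, and optimizes a PPO-style clipped objective without a separate value network.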
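In the published GRPO formulation, the group-relative signal is an advantage computed by subtracting the group's mean reward from each sample's reward and dividing by the group's standard deviation. The sketch below illustrates that normalization with hypothetical reward scores; the use of the population standard deviation and the small epsilon are assumptions of this example, and implementations differ on such details.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group: (r - mean) / std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical reward-model scores for four summaries of the same article.
rewards = [6.0, 9.0, 8.0, 3.0]
advantages = group_relative_advantages(rewards)
print([round(a, 2) for a in advantages])
```

Summaries that score above their group's average receive a positive advantage and the others a negative one, so no learned value network is needed as a baseline.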
Core Components Explained:
Group Generator:
- Samples multiple diverse summaries for each article
- Uses techniques like nucleus sampling, temperature sampling, or beam search variants
- Ensures diversity to get meaningful quality differences within groups
Group Ranker (Multiple Options):
- Option A: Human annotators, who manually rank summaries from best to worst
- Option B: Reward model, the same as in PPO but used for ranking instead of absolute scoring
- Option C: Automated evaluation with objective metrics (e.g., for code: compiler success, test pass rates); DeepSeek V3 [3] used this to improve code generation
- Provides relative ordering rather than absolute scores
Pairwise Comparisons:
- Converts rankings into pairwise preferences
- From ranking [A > B > C], extracts pairs: (A>B), (A>C), (B>C)
- Each pair becomes a training signal
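Turning a ranking into pairwise preferences is a small combinatorial step. A minimal sketch, assuming the ranking is given as a list ordered from best to worst (the helper name ranking_to_pairs is made up for this example):

```python
from itertools import combinations

def ranking_to_pairs(ranking):
    """Given items ordered best to worst, return every (better, worse) pair."""
    return list(combinations(ranking, 2))

pairs = ranking_to_pairs(["A", "B", "C"])
print(pairs)  # [('A', 'B'), ('A', 'C'), ('B', 'C')]
```

A group of k ranked summaries yields k(k-1)/2 pairs, so four summaries give six training pairs.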
Let’s trace through a complete GRPO training step:
Article: “Meta announces Llama 2, an open-source large language model…”
Step 1: Group Generation
Generate 4 diverse summaries using different sampling strategies:
Summary 1: "Meta has released Llama 2, an open-source language model."
Summary 2: "Meta unveiled Llama 2, a powerful open-source AI model designed
to compete with proprietary systems like GPT-4, featuring improved safety
measures and commercial licensing."
Summary 3: "Meta announced Llama 2, their latest language model that offers
open-source access with enhanced capabilities and safety features for
researchers and businesses."
Summary 4: "Meta made a new AI."
Step 2: Human/Model Ranking
Human annotators rank the summaries:
Ranking: 2 > 3 > 1 > 4
(Summary 2 is best, Summary 4 is worst)
Step 3: Extract Pairwise Comparisons
From the ranking, extract all valid pairs (one group of 4 summaries yields 6 training pairs).
Valid pairs (6 total):
- (2 > 3): Summary 2 better than Summary 3
- (2 > 1): Summary 2 better than Summary 1
- (2 > 4): Summary 2 better than Summary 4
- (3 > 1): Summary 3 better than Summary 1
- (3 > 4): Summary 3 better than Summary 4
- (1 > 4): Summary 1 better than Summary 4
Step 4: Calculate Log Probabilities
Log probabilities under current policy π_θ:
log π_θ(Summary 1) = -32.1
log π_θ(Summary 2) = -67.4
log π_θ(Summary 3) = -58.9
log π_θ(Summary 4) = -18.2
Step 5: Compute Pairwise Losses
For each pair, calculate the preference logit and loss:
β = 0.3
Pair (2 > 3):
preference_logit = 0.3 × (-67.4 - (-58.9)) = 0.3 × (-8.5) = -2.55
probability = σ(-2.55) = 0.072
loss = -log(0.072) = 2.63
Pair (2 > 1):
preference_logit = 0.3 × (-67.4 - (-32.1)) = 0.3 × (-35.3) = -10.59
probability = σ(-10.59) ≈ 0.000025
loss = -log σ(-10.59) ≈ 10.59
Pair (1 > 4):
preference_logit = 0.3 × (-32.1 - (-18.2)) = 0.3 × (-13.9) = -4.17
probability = σ(-4.17) ≈ 0.015
loss = -log σ(-4.17) ≈ 4.19
Of the pairs shown, the loss is highest for (2 > 1), because the current policy assigns the preferred Summary 2 a much lower log-probability than Summary 1.
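The same calculation can be run for all six pairs at once. The sketch below uses the log-probabilities from Step 4, the ranking from Step 2, and β = 0.3; the helper name pair_loss is chosen for this example.

```python
import math
from itertools import combinations

beta = 0.3
log_probs = {1: -32.1, 2: -67.4, 3: -58.9, 4: -18.2}  # from Step 4
ranking = [2, 3, 1, 4]                                 # best to worst, from Step 2

def pair_loss(better: int, worse: int) -> float:
    """-log sigmoid(beta * (log p(better) - log p(worse)))."""
    logit = beta * (log_probs[better] - log_probs[worse])
    return math.log1p(math.exp(-logit))  # same as -log(sigmoid(logit))

losses = {pair: pair_loss(*pair) for pair in combinations(ranking, 2)}
for (better, worse), value in losses.items():
    print(f"({better} > {worse}): loss = {value:.2f}")
total_loss = sum(losses.values())
print(f"total = {total_loss:.2f}")
```

Computing the loss as log(1 + exp(-logit)) avoids taking the logarithm of a very small rounded probability, which is where small discrepancies in hand calculations tend to come from.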
Step 6: Total Group Loss
Total GRPO loss = sum of all pairwise losses
= 2.63 + 10.59 + 14.76 + 8.04 + 12.21 + 4.19 (pairs 2>3, 2>1, 2>4, 3>1, 3>4, 1>4)
≈ 52.4
5) Comparison: Pros and Cons
PPO (Proximal Policy Optimization), introduced in 2017, is the most mature and widely used of the three approaches. It provides flexible reward design and interpretable feedback through explicit reward scores, making it suitable for complex multi-objective optimization. However, PPO requires a three-stage pipeline with separate reward model training, can be unstable to train, and demands significant computational resources because several models (policy, reward model, and value network) must be run together.
DPO (Direct Preference Optimization) offers remarkable simplicity by eliminating the separate reward model and directly optimizing on preference data, resulting in faster training and lower computational costs. The main limitations are reduced flexibility for complex reward structures, and dependency on high-quality preference datasets (exposure to human bias).
GRPO (Group Relative Policy Optimization) maximizes sample efficiency by extracting multiple pairwise comparisons from each group evaluation, making it particularly valuable when human evaluation time is expensive. It’s robust to reward model calibration issues through relative rankings and naturally handles varying problem difficulties by comparing within groups.
6) Conclusions
The evolution from PPO -> DPO -> GRPO represents a progression toward simpler, more stable, and more efficient methods for training models that align with human preferences. PPO established that RL could effectively improve summary quality beyond traditional supervised learning, but its complexity and instability motivated research into alternatives. DPO simplified the training pipeline while maintaining effectiveness, making preference-based summarization training more accessible to researchers and practitioners. GRPO advances the field further by improving sample efficiency and robustness through group-based learning, where relative quality comparisons are often more reliable than absolute ratings.
