Group Relative Policy Optimization (GRPO) became the dominant approach for training reasoning models after DeepSeek-R1 (arXiv:2501.12948) showed it could reach OpenAI o1-level math performance without a separate value model. But GRPO has a quiet flaw: it can't distinguish between a correct answer arrived at in six different ways and six copies of the same answer. They all get the same gradient signal.
A May 2025 paper—DRA-GRPO (arXiv:2505.09655, updated to v4 in March 2026)—gives this phenomenon a name, proves it happens by construction, and fixes it with a two-line reward adjustment. Effloow Lab reproduced the core algorithm in Python to understand exactly what's happening.
The Problem: Diversity-Quality Inconsistency
GRPO trains a language model by sampling a group of G completions for each prompt, scoring them with a reward function, then computing advantages as the normalized deviation from the group mean:
advantage(oᵢ) = (R(oᵢ) - mean(R)) / std(R)
For verifiable tasks like math, the reward function is binary: the final answer is right or wrong. This works well in aggregate, but it creates a structural problem at the group level.
Imagine a model generating 8 solutions to a math problem. Six use the same algebraic substitution strategy (Path A). One uses geometric reasoning (Path B) but makes an error, and one uses a direct numerical approach (Path C). Paths A and C reach the correct answer and receive R = 1.0; Path B receives R = 0.0. With standard GRPO normalization:
import statistics

rewards = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0]
# Path A ×6 (correct), Path B (wrong), Path C (correct)
mean_r = sum(rewards) / len(rewards)   # 0.875
std_r = statistics.stdev(rewards)      # 0.354
advantages = [(r - mean_r) / std_r for r in rewards]
# [0.354, 0.354, 0.354, 0.354, 0.354, 0.354, -2.475, 0.354]
Path A clones and Path C receive an identical advantage of about +0.354. The optimizer sees no gradient signal favoring Path C over the clones. After enough training steps, Path A's strategy dominates entirely; the model has collapsed into a single reasoning mode.
The DRA-GRPO paper (Theorem 4.1) formalizes this: under standard GRPO on a finite problem set where each problem has G correct traces, the policy provably collapses so that one reasoning path becomes dominant per problem. It's not a training instability—it's the algorithm working as designed.
Why This Matters in Practice
Diversity collapse has several downstream effects that compound over training:
Reduced generalization. A model that has learned only one correct strategy for a problem type often fails on algebraically equivalent problems phrased differently. Multiple learned strategies generalize better.
Lower pass@k. If the model knows only one path to the answer, sampling k times gives you k copies of that same path. Diverse strategy libraries translate directly into higher pass@k scores on hard benchmarks like AIME (a standard pass@k estimator is sketched below).
SFT-like regime. When GRPO collapses to a single correct trajectory per problem, the gradient update becomes mathematically equivalent to supervised fine-tuning on those examples. The reinforcement learning benefit—exploring the reward landscape—is lost.
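To make the pass@k point concrete: pass@k is usually estimated with the unbiased estimator from the HumanEval paper (Chen et al., 2021), which gives, for a problem where c of n sampled completions are correct, the probability that at least one of k draws is correct. A minimal sketch with illustrative numbers:

from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: P(at least one of k draws from n samples,
    c of which are correct, is correct)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(8, 2, 4))   # ≈ 0.786: two distinct working strategies among 8 samples
print(pass_at_k(8, 0, 4))   # 0.0: the one memorized strategy failed on this problem

A collapsed model tends to produce the second situation: if its single strategy fails on a problem, every one of the n samples fails with it.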
The Fix: Diversity-Aware Reward Adjustment (DRA)
DRA adjusts each completion's reward before the GRPO advantage computation, penalizing redundant completions and amplifying diverse ones. The adjustment formula is:
R̃(q, oᵢ) = R(q, oᵢ) / (1 + SMI({oᵢ}, C \ {oᵢ}))
Where SMI is the Submodular Mutual Information computed with a Graph-Cut function over semantic embeddings:
SMI_GC(oᵢ, C \ {oᵢ}) = Σⱼ≠ᵢ cosine_similarity(embed(oᵢ), embed(oⱼ))
In plain terms: if your completion is semantically similar to many other completions in the group, its reward gets divided by a large number; if it's semantically unique, it keeps most of its original reward. The idea is analogous to Inverse Propensity Scoring applied to diversity: over-represented strategies are down-weighted in proportion to how common they are, the way high-propensity observations are down-weighted in causal inference.
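For example, with a binary reward R = 1: a completion whose summed similarity to the rest of the group is 3.0 gets R̃ = 1 / (1 + 3.0) = 0.25, while a near-unique completion with summed similarity 0.2 keeps R̃ = 1 / 1.2 ≈ 0.83 (illustrative numbers).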
Effloow Lab PoC: Reproducing the Core Algorithm
Effloow Lab reproduced the DRA reward-adjustment kernel in Python (see data/lab-runs/dra-grpo-diversity-aware-reward-adjustment-reasoning-2026.md for full commands and output). The implementation doesn't require a GPU or the full training stack, just NumPy and synthetic embedding vectors:
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def compute_smi_graphcut(embeddings, i):
    """Graph-Cut SMI: sum of cosine sims between oᵢ and all other completions in the group."""
    return sum(
        cosine_similarity(embeddings[i], embeddings[j])
        for j in range(len(embeddings))
        if j != i
    )

def dra_adjust_rewards(rewards, embeddings):
    """Apply the DRA discount R̃ = R / (1 + SMI) to each completion's reward."""
    adjusted = []
    for i, r in enumerate(rewards):
        smi = compute_smi_graphcut(embeddings, i)
        adjusted.append(r / (1 + smi))
    return adjusted
With the 8-completion scenario from above (six near-identical Path A embeddings, one Path B, one structurally different Path C):
np.random.seed(42)
base_A = np.random.randn(16)
path_A_embeddings = [base_A + 0.05 * np.random.randn(16) for _ in range(6)]
path_B = np.random.randn(16) # wrong answer
path_C = np.random.randn(16) # correct, structurally different
embeddings = path_A_embeddings + [path_B, path_C]
rewards = [1.0] * 6 + [0.0, 1.0]
adjusted = dra_adjust_rewards(rewards, embeddings)
# [0.1032, 0.1041, 0.1028, 0.1039, 0.1035, 0.1033, 0.0000, 0.3761]
mean_adj = np.mean(adjusted)
std_adj = np.std(adjusted)
advantages_dra = [(a - mean_adj) / std_adj for a in adjusted]
# Path A clones: ≈ −0.21 each | Path B: −1.23 | Path C: +2.49
The gradient signal now strongly favors Path C. Path A clones have negative advantage—the optimizer is actively discouraged from continuing to produce them.
Comparing Standard GRPO vs DRA-GRPO
| Scenario | Path-A clones (×6) | Path-C (unique correct) | Diversity signal? |
|---|---|---|---|
| Standard GRPO | +0.354 each | +0.354 | None (tied with clones) |
| DRA-GRPO | ≈ −0.21 each | +2.49 | Strong (≈7× larger advantage) |
| Gradient direction | Reinforced under GRPO, suppressed under DRA | Tied with clones under GRPO, strongly reinforced under DRA | — |
The effect is substantial: the unique correct path's advantage is roughly 7 times larger than it would be under standard GRPO, while redundant clones flip from a small positive advantage to a negative one. At training scale, this is what prevents the collapse described in Theorem 4.1.
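To make "at training scale" concrete, here is a deliberately simplified toy (not the paper's setup; all names and constants are illustrative): a softmax policy over four discrete strategies, three correct and one wrong, updated with REINFORCE-style steps from GRPO-style group advantages, where DRA's semantic similarity is replaced by a 0/1 same-strategy indicator.

import numpy as np

def run(use_dra, steps=300, G=8, lr=0.5, seed=0):
    rng = np.random.default_rng(seed)
    # Four discrete "strategies": 0-2 solve the problem, 3 does not.
    # Strategy 0 starts with a head start, mimicking an early dominant path.
    theta = np.array([1.0, 0.0, 0.0, 0.0])
    for _ in range(steps):
        probs = np.exp(theta) / np.exp(theta).sum()
        group = rng.choice(4, size=G, p=probs)      # G sampled "completions"
        rewards = (group != 3).astype(float)        # binary verifier
        if use_dra:
            # 0/1 same-strategy similarity: Graph-Cut SMI is (count - 1),
            # so the DRA discount becomes R / (1 + (count - 1)) = R / count.
            counts = np.array([(group == s).sum() for s in group])
            rewards = rewards / counts
        adv = rewards - rewards.mean()
        if rewards.std() > 1e-8:
            adv = adv / rewards.std()               # GRPO-style normalization
        for s, a in zip(group, adv):
            grad = -probs.copy()
            grad[s] += 1.0                          # d log softmax(theta)[s] / d theta
            theta = theta + lr * a * grad / G       # REINFORCE-style step
    return np.round(np.exp(theta) / np.exp(theta).sum(), 3)

print("final strategy probs, standard GRPO:", run(use_dra=False))
print("final strategy probs, DRA-style:    ", run(use_dra=True))
# In runs like this, the standard update amplifies strategy 0's head start and
# never corrects it, while the DRA-style run pushes the three correct strategies
# back toward comparable probability (the wrong strategy dies in both).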
Benchmark Results from the Paper
The authors evaluated DRA-GRPO on five mathematical reasoning benchmarks using DeepSeek-R1-Distill-Qwen-1.5B as the base model, fine-tuned on 7,000 training samples at a total cloud cost of approximately $55:
- Average accuracy across all benchmarks: 58.2% (new state-of-the-art for this model/data budget)
- AIME 2024 improvement over baseline GRPO: +6.7 percentage points
- AMC 2023 improvement with DR.GRPO variant: +5.0 percentage points
- OlympiadBench accuracy: 53.8% (highest reported for Qwen-1.5B scale)
These results are significant for two reasons. First, the model is tiny—1.5B parameters. Second, 7,000 samples is a small dataset by modern standards. The gains come from better gradient signals, not more compute.
Integration with TRL GRPOTrainer
For practitioners using HuggingFace TRL, DRA fits between your reward function and the GRPO trainer. The key is to compute embeddings at rollout time and apply the adjustment before passing rewards to GRPOTrainer:
from trl import GRPOConfig, GRPOTrainer
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def dra_reward_fn(completions, **kwargs):
    """Wrap your base reward function with DRA adjustment.

    Note: for a faithful DRA implementation, apply the adjustment within each
    prompt's group of num_generations completions, not across the whole batch.
    """
    base_rewards = [math_reward(c) for c in completions]  # math_reward: your task-specific verifier
    embeddings = encoder.encode(completions)
    adjusted = dra_adjust_rewards(base_rewards, embeddings)
    return adjusted

training_args = GRPOConfig(
    output_dir="./dra-grpo-output",
    num_generations=8,            # group size G
    max_completion_length=512,
    learning_rate=3e-6,
    num_train_epochs=3,
    bf16=True,
)

trainer = GRPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    reward_funcs=[dra_reward_fn],
)
trainer.train()
The embedding computation adds one encoder forward pass per completion per group, roughly 10-15% overhead on top of rollout generation and reward scoring. The paper uses a lightweight encoder to keep this manageable.
When Should You Use DRA-GRPO?
DRA-GRPO addresses a specific failure mode. It's most relevant when:
Your reward function is binary or near-binary. If you're training on math, code correctness, or factual Q&A where answers are right or wrong, diversity collapse is likely. Continuous reward functions (e.g., from a reward model trained on human preferences) already encode some diversity signal, so the benefit is smaller.
Your dataset has problems with multiple valid solution strategies. Olympiad math problems, proof generation, and complex code refactoring all have multiple correct approaches. Simple arithmetic with one standard algorithm doesn't benefit as much.
You're training small models on small datasets. The paper's strongest results are at Qwen-1.5B scale with 7,000 samples. At larger scales with more data, standard GRPO has more opportunities to discover diversity organically.
You're seeing entropy collapse during training. If your model's output entropy drops sharply mid-training and performance plateaus, diversity collapse is a likely cause. DRA directly addresses the gradient-level root cause.
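A minimal way to monitor this, assuming you generate evaluation samples with HuggingFace transformers and keep the per-step scores (the helper name and usage are illustrative):

import torch

def mean_token_entropy(step_scores):
    """Average per-token entropy (in nats) over a batch of generation steps.

    `step_scores` is the tuple of per-step logit tensors returned by
    model.generate(..., output_scores=True, return_dict_in_generate=True).
    A sharp drop in this value across training checkpoints is a common
    symptom of diversity collapse.
    """
    entropies = []
    for logits in step_scores:                      # each: (batch, vocab_size)
        logprobs = torch.log_softmax(logits, dim=-1)
        ent = -(logprobs.exp() * logprobs).sum(dim=-1)
        entropies.append(ent.mean())
    return torch.stack(entropies).mean().item()

# Usage sketch: log this once per eval step and watch the trend.
# out = model.generate(**inputs, do_sample=True, max_new_tokens=256,
#                      output_scores=True, return_dict_in_generate=True)
# entropy = mean_token_entropy(out.scores)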
When DRA Is Less Necessary
If you're using DAPO (which includes a "Clip-Higher" mechanism for entropy control) or ProGRPO (which adds explicit entropy regularization), you already have partial mitigation. DRA and these approaches are complementary but don't stack perfectly—using both can over-correct toward excessive diversity at the cost of accuracy.
What the SMI Embedding Choice Gets Right
One subtle design decision in DRA is using sentence-level embeddings for similarity rather than token-level overlap. Token overlap (like BLEU score) would flag "12" and "twelve" as different strategies even though they're identical. Sentence embeddings capture semantic equivalence—two algebraically equivalent derivation paths that use different notation will have high cosine similarity and correctly receive the same discount.
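A quick sanity check of that claim with the same lightweight encoder used in the integration sketch above (the exact similarity value depends on the encoder, so treat this as illustrative):

from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")
pair = ["So the final answer is 12.", "So the final answer is twelve."]
emb = encoder.encode(pair, convert_to_tensor=True)
print(util.cos_sim(emb[0], emb[1]).item())
# Expected to be high: the two sentences mean the same thing even though
# token-level overlap treats "12" and "twelve" as different strings.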
The Graph-Cut SMI function also has a useful mathematical property: it's monotone submodular, meaning adding more similar completions to the group monotonically increases the discount applied to each one. This ensures that a group of 6 clones always receives a larger penalty than a group of 2 clones—the discount scales with redundancy, not just binary similarity.
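A quick numerical check of that scaling, reusing the dra_adjust_rewards helper from the PoC above with synthetic embeddings (exact values are illustrative):

import numpy as np

rng = np.random.default_rng(0)
base = rng.standard_normal(16)

for n_clones in (2, 6):
    # n near-identical clones of one strategy plus one unrelated completion
    embeddings = [base + 0.05 * rng.standard_normal(16) for _ in range(n_clones)]
    embeddings.append(rng.standard_normal(16))
    rewards = [1.0] * (n_clones + 1)
    adjusted = dra_adjust_rewards(rewards, embeddings)   # helper from the PoC above
    print(n_clones, "clones -> per-clone adjusted reward ≈", round(adjusted[0], 3))
# Each extra near-duplicate adds roughly 1.0 to a clone's Graph-Cut SMI sum,
# so the per-clone reward falls from about 1/2 with two clones toward about
# 1/6 with six.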
Running the Full DRA-GRPO Pipeline
The official code is available at github.com/xiwenc1/DRA-GRPO and uses the veRL distributed training framework. The minimal setup requires:
# Environment
pip install torch transformers trl sentence-transformers
# GPU required for training; PoC reward math only needs CPU
# Clone official implementation
git clone https://github.com/xiwenc1/DRA-GRPO
cd DRA-GRPO
pip install -r requirements.txt
# Training (requires GPU with ~16GB VRAM for 1.5B model)
python train_dra_grpo.py \
--model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
--dataset math_7k \
--num_generations 8 \
--encoder all-MiniLM-L6-v2
Effloow Lab reproduced the reward-adjustment kernel on CPU. Full training was not run—that requires GPU resources and roughly 6-8 hours for the paper's configuration. The lab run note (data/lab-runs/dra-grpo-diversity-aware-reward-adjustment-reasoning-2026.md) contains verified output from the PoC Python code.
FAQ
Q: Does DRA-GRPO work with non-math tasks?
The paper focuses on math reasoning, but the mechanism is task-agnostic. Any setting with verifiable binary rewards and multiple valid solution strategies should benefit. Code generation is the most natural next target—the DRA-GRPO GitHub repository lists code tasks as a planned extension.
Q: How large is the embedding overhead?
Using all-MiniLM-L6-v2 (22M parameters, 80MB), encoding 8 completions of 256 tokens takes roughly 15ms on CPU. At training scale with GPU batching, this is under 5% overhead. Larger encoders improve the semantic quality of the similarity signal but increase cost proportionally.
Q: What's the relationship between DRA-GRPO and DAPO?
DAPO (from ByteDance, 2025) addresses a related problem—entropy collapse—using a "Clip-Higher" trick in the PPO surrogate loss. DRA-GRPO operates earlier, at the reward computation stage. They target the same symptom (mode collapse) through different mechanisms, and both improve upon standard GRPO. The DRA-GRPO paper compares against DAPO directly and shows complementary gains on several benchmarks.
Q: Can I use DRA with SFT-initialized models, or only with R1-Zero-style training?
The paper tests on R1-Zero-like training (RL from base model, no supervised cold start). The reward adjustment itself is agnostic to initialization—you can use it with SFT-initialized models. The benefit may be smaller since SFT-initialization already captures some strategy diversity from the fine-tuning data.
Q: How does the SMI threshold affect training stability?
As long as the SMI term is non-negative, the 1 + SMI denominator never drops below 1, so adjusted rewards stay in [0, original_reward]: DRA can only reduce rewards, never amplify them above the base signal. With sentence-embedding cosine similarities this usually holds in practice, and a conservative implementation can clip negative similarities to zero to guarantee it (see the sketch below). Training stability is maintained essentially by construction.
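If you want that guarantee to hold even when an encoder occasionally produces negative cosine similarities, a conservative variant of the PoC helper (a sketch, not the paper's formulation) can clip them at zero; it reuses cosine_similarity from the PoC above:

def dra_adjust_rewards_clipped(rewards, embeddings):
    """Variant of the PoC helper that clips negative similarities to zero,
    guaranteeing the 1 + SMI denominator never drops below 1."""
    adjusted = []
    for i, r in enumerate(rewards):
        smi = sum(
            max(0.0, cosine_similarity(embeddings[i], embeddings[j]))  # clip at 0
            for j in range(len(embeddings))
            if j != i
        )
        adjusted.append(r / (1.0 + smi))
    return adjusted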
Bottom Line
DRA-GRPO is a principled, low-overhead fix for a mathematically provable failure mode in GRPO. If you're training reasoning models with binary rewards on problems that have multiple valid solution paths, the reward adjustment adds minimal complexity while delivering measurable diversity gains. The $55 / 7,000-sample result on a 1.5B model makes it one of the most compute-efficient improvements to GRPO published in 2025-2026.
The core idea translates cleanly to practice: compute semantic similarity within each generation group, discount redundant completions inversely to their prevalence, and let the gradient naturally favor diversity. Two lines of reward modification, with Theorem 4.1 as your theoretical guarantee that you're solving the right problem.
For practitioners ready to go further: the TRL GRPOTrainer documentation and the DRA-GRPO GitHub repository together give you everything needed to integrate this into an existing training pipeline.