[Experiment] Training R1-Zero-like models with Open R1
Context
There are several recent research papers which explore various aspects of R1-Zero-like training on open base models like Qwen2.5-7B and Llama-3.1-8B:
- DAPO: An Open-Source LLM Reinforcement Learning System at Scale
- SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild
- Understanding R1-Zero-Like Training: A Critical Perspective
- Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model
These papers focus on mathematical reasoning (which is easy to verify) and do not always agree on the key factors needed for R1-Zero-like training. Since TRL now scales to large models, this is a good time to train R1-Zero-like models with Open R1!
Main goal: reproduce / improve the performance of the DeepSeek-R1-Zero-Qwen-32B model that DeepSeek trained in the R1 tech report:
Although DeepSeek found that pure RL performed worse than simple SFT distillation, the DAPO paper shows that by tweaking the GRPO training process, one can actually surpass the distilled model (at least on math):
With that in mind, we will explore which subset of ideas in the above papers are sufficient to achieve comparable performance, starting first in math, then code and STEM.
We'll use this post and comments to track progress towards this goal - ideas and suggestions are more than welcome!
Setup
- Models: Qwen2.5-7B for ablations and Qwen2.5-32B for final runs
- Datasets: SynthLabsAI/Big-Math-RL-Verified and BytedTsinghua-SIA/DAPO-Math-17k for math. Code and other domains to be decided.
Links
- Code: I'll be running experiments from this draft PR of `open-r1`: https://github.com/huggingface/open-r1/pull/569
- Experiment logs: https://api.wandb.ai/links/huggingface/8eew2ipo
- Models and datasets: https://huggingface.co/collections/open-r1/open-r1-zero-67eba6a037505bbcb5157d07
Experiments to run
- Train a baseline using "standard" parameters on Big-Math-RL-Verified to compare relative performance & learning dynamics ✅
- Measure effect on convergence with μ=2,4 (default is 1 in TRL) ✅
- Disable KL term with β=0 ✅
- Clip higher with ε_low=0.2 and ε_high=0.28 (DAPO values) ✅
- Add soft overlong reward function from DAPO paper
- Add overlong filter (mask the loss of truncated completions)
- DAPO (default) vs Dr. GRPO loss ✅
Features to add to TRL
- Overlong filter could be exposed as an arg like `mask_truncated_completions` in `GRPOConfig`
- Add logging to measure average stopped length and clip ratio (SimpleRL-Zoo). Done: https://github.com/huggingface/trl/pull/3188
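As a rough illustration of what these two metrics capture, here is a minimal sketch (not the actual TRL implementation; the tensor layout and the `eos_token_id` / `max_completion_length` arguments are assumptions for the example):

```python
import torch

def completion_stats(completion_ids: torch.Tensor, eos_token_id: int, max_completion_length: int) -> dict:
    """Illustrative clip-ratio / stopped-length metrics for a batch of completions.

    Assumes `completion_ids` has shape (batch, max_completion_length) and that
    sequences which finished early contain at least one EOS token.
    """
    is_eos = completion_ids == eos_token_id
    terminated = is_eos.any(dim=1)  # True if the model stopped before hitting the length budget

    # Length of each completion: index of the first EOS, or the full budget if truncated.
    lengths = torch.full(
        (completion_ids.size(0),), max_completion_length, dtype=torch.long, device=completion_ids.device
    )
    lengths[terminated] = is_eos.int().argmax(dim=1)[terminated]

    clip_ratio = (~terminated).float().mean().item()  # fraction of completions truncated at max length
    mean_stopped_length = lengths[terminated].float().mean().item() if terminated.any() else float("nan")
    return {"clip_ratio": clip_ratio, "mean_stopped_length": mean_stopped_length}
```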
Features to add to Open R1
Logbook [1.4.2025]
Experiments
- Focused on training a baseline with `Qwen2.5-7B` and discovered a serious bug in the accuracy reward function of `open-r1` 🙀. First, the parser was failing on non-LaTeX ground truth answers like `"6"`, and second we were assigning a default reward of 1 when the ground truth could not be parsed. Fixed here: https://github.com/huggingface/open-r1/pull/566 (a sketch of the intended behaviour is shown after this list).
- I am running 3 baseline experiments to gauge stability on `SynthLabsAI/Big-Math-RL-Verified`:
  - v00.0X: train on everything
  - v01.0X: train on "medium" difficulty problems, inferred by computing percentiles on the distribution of Llama 8B solve rates
  - v02.0X: train on "hard" difficulty problems, inferred by computing percentiles on the distribution of Llama 8B solve rates
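For context, the fixed accuracy reward behaves roughly like the sketch below. This is not the actual `open-r1` implementation (which configures `math-verify` extraction in more detail), just the intended logic: reward 1 only when the parsed completion verifies against a parseable gold answer, and never a default reward of 1 for unparseable golds.

```python
from math_verify import parse, verify  # assumes the math-verify package's parse/verify API

def accuracy_reward(completion: str, ground_truth: str) -> float:
    """Illustrative accuracy reward: 1.0 iff the completion verifies against the gold answer."""
    gold = parse(ground_truth)
    if not gold:
        # Unparseable ground truth -> no free reward; ideally such rows are
        # filtered out of the dataset before training (see the 4.4.2025 logbook).
        return 0.0
    answer = parse(completion)
    return 1.0 if verify(gold, answer) else 0.0
```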
Overall training looks fairly stable, with accuracy rewards and completion lengths going up. The format reward is currently weighted with 0.2 and might need bumping up if the model cannot get enough signal to learn it. Note that I am using a chat template to define the DeepSeek-R1 prompt:
```
A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think>...</think> and <answer>...</answer> tags, respectively, i.e.,
<think>
reasoning process here
</think>
<answer>
answer here
</answer>.
User: Given that the positive real numbers a and b satisfy a + b = 1, find the maximum value of sqrt(a) + sqrt(b).
Assistant:
```
As many other papers have observed, `Qwen2.5-7B` is remarkably good at following instructions with little prompting and is able to emit the `\boxed{}` format fairly consistently without any reference to it in the prompt!
TRL / Open R1 updates
- @edbeeching has added the new completion metrics here: https://github.com/huggingface/trl/pull/3188
- @ShirinYamani has added the soft overlong reward function: https://github.com/huggingface/open-r1/pull/567
Next
- Preprocess the BigMath dataset to filter any answers that cannot be parsed / verified
- Rebase on `trl@main` and re-run baseline to measure stability.
- Gather downstream evals with pass@1 metric from `lighteval`: https://github.com/huggingface/lighteval/pull/647
Logbook [4.4.2025]
Experiments
- Following @edbeeching's suggestion to filter the `Big-Math-RL-Verified` dataset for answers that can be parsed by `math-verify`, I've created a processed version to use as a new basis for our experiments (a sketch of the filtering step is shown after this list): https://huggingface.co/datasets/open-r1/Big-Math-RL-Verified-Processed
- Ablations run for 0.1 epochs over the full training set (about 21k samples), with 32 unique prompts per batch, 16 completions per prompt and 8k max tokens.
- Report: https://api.wandb.ai/links/huggingface/qps1tmoj
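The filtering step itself is simple; a sketch of the idea is below (the `answer` column name and the default `math-verify` extraction config are assumptions, and the processed dataset linked above may apply extra cleaning):

```python
from datasets import load_dataset
from math_verify import parse

ds = load_dataset("SynthLabsAI/Big-Math-RL-Verified", split="train")

# Keep only rows whose ground-truth answer math-verify can actually parse,
# so the accuracy reward is never computed against an unusable gold answer.
ds_parseable = ds.filter(lambda example: len(parse(example["answer"])) > 0)
print(f"Kept {len(ds_parseable)} / {len(ds)} rows")
```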
tl;dr
- For β>0 it seems necessary to sync the reference model with the policy every N steps to avoid instabilities.
- Increasing μ leads to faster convergence but is less stable
- Setting β=0 is surprisingly more stable than β>0
- Clip higher and Dr GRPO loss do not have much effect on the rewards, but also do not induce any additional instability
- The format reward seems to be too hard for the model to learn, possibly because we enforce a specific new-line format.
- The new completion metrics like `clipped_ratio` are very handy for knowing when a run is going off the rails!
Baselines
While setting a baseline with the default settings, we found that vanilla GRPO is unstable and the completions explode midway through training:
In line with DAPO, we suspect this is caused by the truncated completions destabilising the training, and @ShirinYamani has opened a PR in TRL that we can use to test this hypothesis. Nevertheless, we found that replacing the reference model with the policy every 100 steps (about every 1/6th of training) mitigated the instability for now.
Note: set `sync_ref_model=True` and sync every 100 steps.
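In TRL this corresponds to something like the following (a sketch; `ref_model_sync_steps` is, to my understanding, the `GRPOConfig` argument that controls the sync interval):

```python
from trl import GRPOConfig

# Baseline stabilisation: keep the KL term, but periodically replace the
# reference model with the current policy.
training_args = GRPOConfig(
    output_dir="Qwen2.5-7B-GRPO-baseline",  # illustrative name
    sync_ref_model=True,                    # enable periodic reference-model syncing
    ref_model_sync_steps=100,               # ~ every 1/6th of training in our setup
)
```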
Effect from μ iterations
The GRPO algorithm has an inner optimisation loop where the policy is updated μ times on a given batch:
We explored the effect of setting μ=1,2,4 and, as shown below, larger values of μ converge much faster but are less stable:
The faster convergence is most visible in the early phases of training, where larger μ values reach the same reward as μ=1 in far fewer steps:
Note: if we can stabilise vanilla GRPO, we should revisit scaling μ as it has a clear computational advantage
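For reference, μ maps onto the `num_iterations` argument in TRL's `GRPOConfig` (to the best of my knowledge), so the ablation boils down to something like:

```python
from trl import GRPOConfig

# μ ablation: how many optimisation steps are taken on each generation batch.
for mu in (1, 2, 4):
    training_args = GRPOConfig(
        output_dir=f"Qwen2.5-7B-GRPO-mu{mu}",  # illustrative name
        num_iterations=mu,                     # μ in the GRPO objective; TRL default is 1
        sync_ref_model=True,                   # same stabilisation as the baseline
        ref_model_sync_steps=100,
    )
    # ... construct GRPOTrainer(model=..., reward_funcs=..., args=training_args, train_dataset=...) and train
```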
Effect from having no reference model
Somewhat surprisingly, setting β=0 seems to be more stable than including the reference model + syncing every 100 steps:
Disabling the KL term in the GRPO loss is what DAPO recommends (better exploration), but it is still surprising to see it is more stable (intuitively I would have expected the lack of a KL term to encourage more unbounded completions)
Note: explore the effect of increasing μ when β=0. Are the runs still stable?
Clip higher
The DAPO paper recommends using a larger ε on the upper bound of the trust region in the clipped loss. Using their value of ε=0.28 doesn't seem to have much impact on the rewards, but does increase the completion lengths somewhat:
Note: compare downstream evals to draw a proper conclusion here. Also consider different values of ε_high
Dr GRPO loss (`scale_rewards=False`)
The Dr GRPO paper recommends removing the reward scaling by σ. Compared to our baseline, this doesn't seem to have a large impact on the rewards, but does produce smaller grad norms and KL terms:
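Putting the β=0, clip-higher and Dr GRPO variants together, the corresponding TRL settings look roughly like this (the `epsilon`, `epsilon_high` and `scale_rewards` argument names are from my reading of `GRPOConfig` and should be double-checked against `trl@main`):

```python
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="Qwen2.5-7B-GRPO-dapo-style",  # illustrative name
    beta=0.0,             # drop the KL term entirely, so no reference model is needed
    epsilon=0.2,          # lower clip bound ε_low
    epsilon_high=0.28,    # DAPO-style asymmetric upper clip bound ε_high
    scale_rewards=False,  # Dr GRPO: do not divide advantages by the reward std σ
)
```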
Next steps
- Run downstream evals to compare the relation between rewards and the metrics we actually care about
- Benchmark @ShirinYamani's PR
- Explore relaxing the new-line structure of the format reward (or having a soft variant)
- Run μ ablation for β=0
- Integrate new pass@1 metric from `lighteval`