[Experiment] Training R1-Zero-like models with Open R1

by lewtun

Context

There are several recent research papers which explore various aspects of R1-Zero-like training on open base models like Qwen2.5-7B and Llama-3.1-8B, such as DAPO, Dr. GRPO, and SimpleRL-Zoo.

These papers focus on mathematical reasoning (which is easy to verify) and do not always agree on the key factors needed for R1-Zero-like training. Since TRL now scales to large models, it's time to train R1-Zero-like models with Open R1!

Main goal: reproduce or improve on the performance of the DeepSeek-R1-Zero-Qwen-32B model that DeepSeek trained in the R1 tech report:

[Screenshot: DeepSeek-R1-Zero-Qwen-32B results table from the R1 tech report]

Although DeepSeek found that pure RL performed worse than simple SFT distillation, the DAPO paper shows that by tweaking the GRPO training process, one can actually surpass the distilled model (at least on math):

[Screenshot: DAPO vs. distilled-model comparison from the DAPO paper]

With that in mind, we will explore which subset of the ideas in the papers above is sufficient to achieve comparable performance, starting with math, then moving on to code and STEM.

We'll use this post and comments to track progress towards this goal - ideas and suggestions are more than welcome!

Setup

Links

Experiments to run

  1. Train a baseline using "standard" parameters on Big-Math-RL-Verified to compare relative performance & learning dynamics ✅
  2. Measure the effect on convergence of μ=2,4 (the TRL default is μ=1) ✅
  3. Disable KL term with β=0
  4. Clip higher with ε_low=0.2 and ε_high=0.28 (DAPO values) ✅
  5. Add soft overlong reward function from DAPO paper
  6. Add overlong filter (mask the loss of truncated completions)
  7. DAPO (default) vs Dr. GRPO loss ✅
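
Most of these ablations map directly onto GRPOConfig arguments in recent versions of TRL. A minimal sketch of how they could be wired up (argument names should be double-checked against the installed trl version; the overlong filter is only a proposed arg, see the TRL wishlist below):

```python
from trl import GRPOConfig

# Sketch only: maps the ablation list above onto GRPOConfig knobs.
config = GRPOConfig(
    output_dir="Qwen2.5-7B-GRPO-big-math",
    num_iterations=2,     # mu: inner optimisation steps per sampled batch (exp. 2)
    beta=0.0,             # KL coefficient; 0.0 disables the KL term (exp. 3)
    epsilon=0.2,          # lower clipping bound, i.e. epsilon_low (exp. 4)
    epsilon_high=0.28,    # DAPO "clip higher" upper bound (exp. 4)
    scale_rewards=False,  # False = Dr. GRPO loss, no division by sigma (exp. 7)
)
```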

Features to add to TRL

  1. Overlong filter could be exposed as an arg like mask_truncated_completions in GRPOConfig
  2. Add logging to measure average stopped length and clip ratio (SimpleRL-Zoo). Done: https://github.com/huggingface/trl/pull/3188

Features to add to Open R1

  1. Add logging for pass@k accuracy (SimpleRL-Zero)
  2. Add reasoning behaviours callback with LLM APIs to track backtracking and other behaviours during training (SimpleRL-Zero)
    [Screenshot: example of tracked reasoning behaviours from SimpleRL-Zero]
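
For the pass@k logging in item 1, the standard unbiased estimator from the Codex paper could be used; a small sketch:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n sampled completions of which c are correct."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```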

Logbook [1.4.2025]

Experiments

  • Focused on training a baseline with Qwen2.5-7B and discovered a serious bug in the accuracy reward function of open-r1 🙀. First, the parser was failing on non-LaTeX ground-truth answers like "6", and second, we were assigning a default reward of 1 when the ground truth could not be parsed. Fixed here: https://github.com/huggingface/open-r1/pull/566

[W&B chart: accuracy reward curves]
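
For reference, a minimal sketch of the corrected reward logic (the real implementation lives in open-r1's rewards module and uses math_verify; details may differ):

```python
from math_verify import parse, verify

def accuracy_reward(completion: str, ground_truth: str) -> float:
    # Parse the gold answer first; math_verify handles both LaTeX and plain
    # expressions such as "6". If the gold answer cannot be parsed, return 0.0
    # (or skip the sample upstream) instead of the old default reward of 1.
    gold = parse(ground_truth)
    if not gold:
        return 0.0
    answer = parse(completion)
    return 1.0 if verify(gold, answer) else 0.0
```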

  • I am running 3 baseline experiments to gauge stability on SynthLabsAI/Big-Math-RL-Verified:
    • v00.0X: train on everything
    • v01.0X: train on "medium" difficulty problems, inferred by computing percentiles on the distribution of Llama-8B solve rates
    • v02.0X: train on "hard" difficulty problems, inferred by computing percentiles on the distribution of Llama-8B solve rates

[Screenshot: training curves for the three baseline runs]
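
A sketch of how the difficulty splits above could be derived; the solve-rate column name ("llama8b_solve_rate") and the percentile cut-offs here are assumptions about the dataset schema:

```python
import numpy as np
from datasets import load_dataset

ds = load_dataset("SynthLabsAI/Big-Math-RL-Verified", split="train")

# Lower solve rate = harder problem; None means no solve-rate estimate.
rates = np.array([r for r in ds["llama8b_solve_rate"] if r is not None])
lo, hi = np.percentile(rates, [25, 75])

def in_band(rate, low, high):
    return rate is not None and low <= rate < high

medium = ds.filter(lambda ex: in_band(ex["llama8b_solve_rate"], lo, hi))  # v01.0X
hard = ds.filter(lambda ex: in_band(ex["llama8b_solve_rate"], 0.0, lo))   # v02.0X
```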

Overall training looks fairly stable, with accuracy rewards and completion lengths going up. The format reward is currently weighted with 0.2 and might need bumping up if the model cannot get enough signal to learn it. Note that I am using a chat template to define the DeepSeek-R1 prompt:

A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think>...</think> and <answer>...</answer> tags, respectively, i.e., 
<think>
reasoning process here
</think>
<answer>
answer here
</answer>.

User: Given that the positive real numbers a and b satisfy a + b = 1, find the maximum value of sqrt(a) + sqrt(b).

Assistant: 

As many other papers have observed, Qwen2.5-7B is remarkably good at following instructions with little prompting and is able to emit the \boxed{} format fairly consistently without any reference to it in the prompt!
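
For completeness, a rough sketch of the strict format reward mentioned above (weighted at 0.2); the exact regex in open-r1 may differ, but it enforces the same new-line layout as the prompt:

```python
import re

# The completion must wrap its reasoning and answer in <think>/<answer> blocks,
# each tag on its own line, matching the layout shown in the prompt above.
FORMAT_RE = re.compile(r"^<think>\n.*?\n</think>\n<answer>\n.*?\n</answer>$", re.DOTALL)

def format_reward(completion: str) -> float:
    return 1.0 if FORMAT_RE.match(completion.strip()) else 0.0
```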

TRL / Open R1 updates

Next

  • Preprocess the BigMath dataset to filter out any answers that cannot be parsed / verified (sketched below)
  • Rebase on trl@main and re-run baseline to measure stability.
  • Gather downstream evals with pass@1 metric from lighteval: https://github.com/huggingface/lighteval/pull/647
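
A minimal sketch of the preprocessing step from the first bullet above, again using math_verify (the "answer" column name is an assumption about the Big-Math schema):

```python
from datasets import load_dataset
from math_verify import parse

# Keep only rows whose gold answer is parseable, so the accuracy reward is
# always well defined during training.
ds = load_dataset("SynthLabsAI/Big-Math-RL-Verified", split="train")
ds = ds.filter(lambda ex: len(parse(ex["answer"])) > 0)
```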

Logbook [4.4.2025]

Experiments

tl;dr

  • For β>0 it seems necessary to sync the reference model with the policy every N steps to avoid instabilities.
  • Increasing μ leads to faster convergence but is less stable
  • Setting β=0 is surprisingly more stable than β>0
  • Clip higher and Dr GRPO loss do not have much effect on the rewards, but also do not induce any additional instability
  • The format reward seems to be too hard for the model to learn, possibly because we enforce a specific new-line format.
  • The new completion metrics like clipped_ratio are very handy for knowing when a run is going off the rails!

Baselines

While setting a baseline with the default settings, we found that vanilla GRPO is unstable and the completions explode midway through training:

[Screenshot: completion lengths exploding midway through the vanilla GRPO baseline]

In line with DAPO, we suspect this is caused by truncated completions destabilising training, and @ShirinYamani has opened a PR in TRL with which we can test this hypothesis. In the meantime, we found that replacing the reference model with the policy every 100 steps (about every 1/6th of training) mitigates the instability for now.

Note: set sync_ref_model=True and sync every 100 steps.
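
In TRL terms, the mitigation is just two arguments on top of the baseline config; a minimal sketch:

```python
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="Qwen2.5-7B-GRPO-baseline",
    sync_ref_model=True,       # periodically refresh the reference model from the policy
    ref_model_sync_steps=100,  # roughly every 1/6th of this run
)
```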

Effect of μ iterations

The GRPO algorithm has an inner optimisation loop where the policy is updated μ times on a given batch:

[Screenshot: GRPO algorithm with the inner loop of μ policy updates]

We explored the effect of setting μ=1,2,4 and, as shown below, larger values of μ converge much faster but are less stable:

[Screenshot: training reward curves for μ=1,2,4]

The speed-up in convergence is most visible in the early phase of training, where larger μ values reach the same reward as μ=1 in far fewer steps:

[Screenshot: early-training reward curves for μ=1,2,4]

Note: if we can stabilise vanilla GRPO, we should revisit scaling μ as it has a clear computational advantage

Effect of having no reference model

Somewhat surprisingly, setting β=0 seems to be more stable than including the reference model + syncing every 100 steps:

[Screenshot: training curves for β=0 vs. β>0 with reference-model syncing every 100 steps]

Disabling the KL term in the GRPO loss is what DAPO recommends (better exploration), but it is still surprising to see it is more stable (intuitively I would have expected the lack of a KL term to encourage more unbounded completions)

Note: explore the effect of increasing μ when β=0. Are the runs still stable?

Clip higher

The DAPO paper recommends using a larger ε on the upper bound of the trust region in the clipped loss. Using their value of ε_high=0.28 doesn't seem to have much impact on the rewards, but does increase the completion lengths somewhat:

[Screenshot: reward and completion-length curves for ε_high=0.28 vs. the baseline]

Note: compare downstream evals to draw a proper conclusion here. Also consider different values of ε_high

Dr GRPO loss (scale_rewards=False)

The Dr GRPO paper recommends removing the reward scaling by σ. Compared to our baseline, this doesn't seem to have a large impact on the rewards, but does produce smaller grad norms and KL terms:

[Screenshot: reward, grad-norm and KL curves for scale_rewards=False vs. the baseline]
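
The change itself is just the advantage normalisation; a minimal sketch of what scale_rewards toggles for the G completions of one prompt (the 1e-4 floor is an assumption about the exact implementation):

```python
import torch

def group_advantages(rewards: torch.Tensor, scale_rewards: bool = True) -> torch.Tensor:
    # GRPO (scale_rewards=True) whitens rewards by the group std;
    # Dr. GRPO (scale_rewards=False) only centres them on the group mean.
    advantages = rewards - rewards.mean()
    if scale_rewards:
        advantages = advantages / (rewards.std() + 1e-4)
    return advantages
```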

Next steps

  • Run downstream evals to see how the training rewards relate to the benchmarks we actually care about
  • Benchmark @ShirinYamani's PR
  • Explore relaxing the new-line structure of the format reward (or having a soft variant)
  • Run μ ablation for β=0
  • Integrate new pass@1 metric from lighteval