[Experiment] Applying GRPO to DeepSeek-R1-Distill-Qwen-1.5B with LIMO
In the DeepSeek-R1 tech report there is a hidden gem of advice about applying RL to their SFT distilled models:
I've started running some experiments to test the impact that GRPO can have on the existing distilled models. To keep things simple, I'm using the DeepSeek-R1-Distill-Qwen-1.5B model and the small yet high-quality LIMO dataset so I can iterate faster. I'll be using this discussion to track my progress, but feel free to chime in if you have ideas on how to improve the training!
Links
- Weights and Biases report with training metrics: https://api.wandb.ai/links/huggingface/3l5cglav
- My branch in `open-r1`: https://github.com/huggingface/open-r1/tree/grpo-limo
- Leaderboard to track downstream evals (search with `limo` to get the models): https://huggingface.co/spaces/open-r1/open-r1-eval-leaderboard
Experimental setup
Baseline parameters from v00.00 run:
- LR = 1e-6
- Number of tokens: 4096
- Number of generations: 7
- 3 epochs
- Effective batch size: 56 (8 per device, grad acc steps 1, 7 H100s). Per-device batch size and grad acc steps were tuned for max tokens 8192 (4, 2) and 16384 (2, 4).
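For reference, here is a minimal sketch of how these baseline parameters might map onto TRL's `GRPOConfig` (argument names follow recent TRL releases and may differ in other versions; the output directory is a placeholder):

```python
from trl import GRPOConfig

# Rough sketch of the v00.00 baseline; not the exact config used in the runs.
training_args = GRPOConfig(
    output_dir="DeepSeek-R1-Distill-Qwen-1.5B-GRPO-LIMO",  # placeholder name
    learning_rate=1e-6,
    num_train_epochs=3,
    max_completion_length=4096,   # number of generated tokens per completion
    num_generations=7,            # completions sampled per prompt
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    beta=0.04,                    # KL penalty weight (TRL default)
    use_vllm=True,                # one GPU is reserved for vLLM generation
    bf16=True,
)
```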
Ablations over:
- Number of tokens generated (v00.0X runs): 4096, 8192, 16384
- Learning rate (v01.0X runs): 2e-6, 4e-6, 8e-6
- Number of generations (v02.0X runs): 14, 28, 56
- Optimizer (v03.0X runs): Paged Adam8Bit
Note: there is a bug in the format rewards (https://github.com/huggingface/open-r1/issues/237), so we should re-run the best params again once this is fixed.
Key takeaways so far
- It really works! Depending on the hyperparameters, I'm able to get a ~10 point boost on AIME24 and GPQA, and a ~3 point boost on MATH-500 (likely saturated).
- Generating more tokens gives larger rewards and better loss
- Larger learning rates give larger rewards, but produce a "bump and dip" around 100 steps. The KL is also much larger.
- Increasing the number of generations gives larger rewards, but also seems to induce more spikes in the loss/KL for some reason (maybe a bug in TRL?). The smoothest run appears to be N=14
- The accuracy reward is rather flat, perhaps suggesting we are not generating enough tokens to emit the required `\boxed{}` answer.
- Using 8-bit Paged AdamW doesn't seem to noticeably affect the training dynamics vs 32-bit AdamW (apart from the KL being somewhat larger). This is great for memory!
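As a side note for anyone reproducing the optimizer ablation: switching to the paged 8-bit optimizer is just a config change, assuming the bitsandbytes-backed optimizers exposed through the `optim` argument of transformers' training arguments (which `GRPOConfig` inherits):

```python
from trl import GRPOConfig

# Sketch: the optimizer ablation only swaps the optimizer; everything else matches the baseline.
training_args = GRPOConfig(
    output_dir="DeepSeek-R1-Distill-Qwen-1.5B-GRPO-LIMO-paged",  # placeholder name
    optim="paged_adamw_8bit",  # bitsandbytes paged 8-bit AdamW (default is 32-bit AdamW)
    # ... remaining baseline arguments as in the config sketch above
)
```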
More to come as I run more experiments 🤗
Do you have any insight into custom reward function design? That is where my versions of GRPO are failing.
I am currently using the full set of reward functions we have in open-r1: https://github.com/huggingface/open-r1/blob/main/src/open_r1/grpo.py
I think the current biggest driver of performance is the accuracy rewards and reasoning steps reward:
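For readers who don't want to open the repo, here is a simplified sketch of what those two rewards look like. This is not the actual open-r1 code: the real implementations use `math_verify` for robust LaTeX answer checking, and the regexes and signatures below are illustrative stand-ins.

```python
import re

def accuracy_reward(completions, solution, **kwargs):
    r"""1.0 if the final \boxed{...} answer matches the reference solution, else 0.0."""
    contents = [completion[0]["content"] for completion in completions]
    rewards = []
    for content, sol in zip(contents, solution):
        match = re.search(r"\\boxed\{(.+?)\}", content)
        answer = match.group(1).strip() if match else None
        rewards.append(1.0 if answer is not None and answer == sol.strip() else 0.0)
    return rewards

def reasoning_steps_reward(completions, **kwargs):
    """Reward explicit step-by-step structure, saturating at 1.0 after ~3 step markers."""
    pattern = r"(?:Step \d+:|^\d+\.|\n-|\bFirst,|\bSecond,|\bFinally,)"
    contents = [completion[0]["content"] for completion in completions]
    return [min(1.0, len(re.findall(pattern, c, re.MULTILINE)) / 3) for c in contents]
```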
Awesome! It would be nice to also report the batch size and gradient_accumulation_steps (I couldn't find these reported anywhere, maybe I missed it). I would also be curious to know how sensitive the performance is to the beta parameter (the weight of the KL term); it looks like it's set to 0.04 in all your experiments. If you look at the plots for the loss and the KL term, they look very similar (up to the factor beta=0.04), which seems to indicate that the KL term dominates the loss and maybe you don't give enough weight to the reward function.
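For context on that last point, here is the GRPO objective as written in the DeepSeekMath/DeepSeek-R1 papers (simplified by omitting the per-token average): the KL penalty enters the loss scaled by beta, so when the beta-weighted KL is large relative to the clipped advantage term, the loss curve will mostly track the KL curve.

$$
\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}\Big(\min\!\big(r_i(\theta)\,\hat{A}_i,\ \mathrm{clip}\big(r_i(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_i\big) - \beta\,\mathbb{D}_{\mathrm{KL}}\!\big[\pi_\theta\,\|\,\pi_{\mathrm{ref}}\big]\Big)\right],
\qquad r_i(\theta)=\frac{\pi_\theta(o_i\mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i\mid q)}
$$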
AIME 2025 questions have just been released; maybe you can use them as an additional evaluation set as well.
Great work. You don't get reasoning with such a small model, do you? And can this be trained on a GPU-poor setup (e.g. with an Unsloth model)?
I wonder why the effective batch size is 64. Shouldn't it be 56, since you use 1 GPU for vLLM?
Thanks for the wonderful effort on reproducing R1. I was just curious if you have tried using RL on top of one of the pretrained models (like Qwen2.5-1.5B) rather than the models distilled from R1.
I was hoping to understand whether we really need the distilled models for simpler datasets like MATH/GSM-8K. In other words, is the capability of these pretrained models (1-3B scale) sufficient for RL to improve on?
> I wonder why the effective batch size is 64. Shouldn't it be 56, since you use 1 GPU for vLLM?
Good catch, indeed it should be 56! Fixed now :)
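(For reference, that works out to 7 training GPUs × 8 samples per device × 1 gradient-accumulation step = 56, with the 8th H100 reserved for vLLM generation.)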
> Thanks for the wonderful effort on reproducing R1. I was just curious if you have tried using RL on top of one of the pretrained models (like Qwen2.5-1.5B) rather than the models distilled from R1.
> I was hoping to understand whether we really need the distilled models for simpler datasets like MATH/GSM-8K. In other words, is the capability of these pretrained models (1-3B scale) sufficient for RL to improve on?
We haven't done many experiments with the base models yet, as the DeepSeek-R1 tech report shows that distillation outperforms pure RL:
This suggests that the recipe for producing strong models is as follows:
- Create synthetic data from R1 on various domains of interest
- Run SFT with a smaller model
- Apply GRPO to squeeze out additional performance
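A rough end-to-end sketch of that recipe with TRL is below. The distilled-traces dataset ID is a placeholder, prompt templating and column renaming are omitted, and the trainer arguments follow recent TRL releases, so treat this as a sketch rather than the exact recipe.

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer, SFTConfig, SFTTrainer

# 1. SFT a small model on reasoning traces distilled from R1 (placeholder dataset ID)
sft_dataset = load_dataset("your-org/r1-distilled-traces", split="train")
sft_trainer = SFTTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    args=SFTConfig(output_dir="qwen2.5-1.5b-r1-sft"),
    train_dataset=sft_dataset,
)
sft_trainer.train()

# 2. GRPO on top of the SFT checkpoint to squeeze out additional performance
grpo_dataset = load_dataset("GAIR/LIMO", split="train")
grpo_trainer = GRPOTrainer(
    model="qwen2.5-1.5b-r1-sft",
    reward_funcs=[accuracy_reward, reasoning_steps_reward],  # e.g. the rewards sketched earlier
    args=GRPOConfig(output_dir="qwen2.5-1.5b-r1-sft-grpo", num_generations=7),
    train_dataset=grpo_dataset,
)
grpo_trainer.train()
```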