[Experiment] Applying GRPO to DeepSeek-R1-Distill-Qwen-1.5B with LIMO

by lewtun (Open R1 org)

In the DeepSeek-R1 tech report there is a hidden gem of advice about applying RL to their SFT distilled models:

[Screenshot: excerpt from the DeepSeek-R1 tech report on applying RL to the distilled models]

I've started running some experiments to test the impact that GRPO can have on the existing distilled models. To keep things simple, I'm using the DeepSeek-R1-Distill-Qwen-1.5B model and the small yet high-quality LIMO dataset so I can iterate faster. I'll be using this discussion to track my progress, but feel free to chime in if you have ideas on how to improve the training!

Links

Experimental setup

Baseline parameters from the v00.00 run (a config sketch follows this list):

  • LR = 1e-6
  • Number of tokens: 4096
  • Number of generations: 7
  • 3 epochs
  • Effective batch size: 56 (8 per device, grad acc steps 1, 7 H100s for training). For max tokens 8192 and 16384, the per-device batch size and grad acc steps are retuned to (4, 2) and (2, 4) respectively, keeping the effective batch size at 56
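To make the setup concrete, here is a minimal sketch of how this baseline maps onto TRL's GRPOTrainer. It is a sketch only: argument names follow the TRL GRPO API at the time of writing and may change across versions, the LIMO dataset id and column names are assumptions, and the reward function is a placeholder (the real runs use the open-r1 reward functions).

```python
# Minimal sketch of the v00.00 baseline with TRL's GRPOTrainer (illustrative only).
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# LIMO is assumed to expose "question"/"solution" columns; GRPOTrainer expects a "prompt" column.
dataset = load_dataset("GAIR/LIMO", split="train")
dataset = dataset.map(lambda x: {"prompt": [{"role": "user", "content": x["question"]}]})

def placeholder_reward(completions, **kwargs):
    # Placeholder: return one scalar reward per completion.
    return [0.0 for _ in completions]

training_args = GRPOConfig(
    output_dir="DeepSeek-R1-Distill-Qwen-1.5B-GRPO",
    learning_rate=1e-6,
    num_train_epochs=3,
    max_completion_length=4096,      # "number of tokens" above
    num_generations=7,               # completions sampled per prompt
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,   # 7 training GPUs -> effective batch size 56
    beta=0.04,                       # weight of the KL penalty
    use_vllm=True,                   # one GPU is reserved for vLLM generation
    bf16=True,
)

trainer = GRPOTrainer(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    reward_funcs=[placeholder_reward],
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```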

Ablations over:

  • Number of tokens generated (v00.0X runs): 4096, 8192, 16384
  • Learning rate (v01.0X runs): 2e-6, 4e-6, 8e-6
  • Number of generations (v02.0X runs): 14, 28, 56
  • Optimizer (v03.0X runs): Paged Adam8Bit

Note: there is a bug in the format rewards (https://github.com/huggingface/open-r1/issues/237), so we should re-run the best params once this is fixed.
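For readers unfamiliar with what that reward does: a format reward simply scores whether a completion follows the expected output template. Below is a simplified sketch of that kind of check, assuming the `<think>…</think><answer>…</answer>` structure used in open-r1; it is an illustration of the idea, not the exact code the bug report refers to.

```python
import re

def format_reward(completions, **kwargs):
    # Reward 1.0 if the completion follows <think>...</think><answer>...</answer>, else 0.0.
    # Simplified illustration; not the exact open-r1 implementation.
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    rewards = []
    for completion in completions:
        # Completions may be plain strings or chat-style message lists.
        text = completion[0]["content"] if isinstance(completion, list) else completion
        rewards.append(1.0 if re.match(pattern, text, flags=re.DOTALL) else 0.0)
    return rewards
```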

Key takeaways so far

  • It really works! Depending on the hyperparameters, I'm able to get a ~10-point boost on AIME24 and GPQA, and a ~3-point boost on MATH-500 (which is likely saturated).
  • Generating more tokens gives larger rewards and better loss
  • Larger learning rates give larger rewards, but produce a "bump and dip" around 100 steps. The KL is also much larger.
  • Increasing the number of generations gives larger rewards, but also seems to induce more spikes in the loss/KL for some reason (maybe a bug in TRL?). The smoothest run appears to be N=14
  • The accuracy reward is rather flat, perhaps suggesting we are not generating enough tokens for the model to emit the required \boxed{} answer (see the quick check sketched after this list).
  • Using 8-bit Paged AdamW doesn't seem to noticeably affect the training dynamics vs 32-bit AdamW (apart from KL being somewhat larger). This is great for memory!
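One way to sanity-check that truncation hypothesis is to sample completions offline at the training token budget and count how many actually emit a \boxed{} answer. A rough sketch with vLLM; this is a hypothetical helper (not part of open-r1), and the prompt list is a placeholder that should really be LIMO problems rendered with the model's chat template.

```python
import re
from vllm import LLM, SamplingParams

# Hypothetical sanity check: how often do completions produce a \boxed{...}
# answer within the training token budget?
llm = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
sampling_params = SamplingParams(temperature=0.7, max_tokens=4096)

# Placeholder prompts; replace with LIMO problems rendered through the chat template.
prompts = ["Solve the following problem and put the final answer in \\boxed{}: ..."]

outputs = llm.generate(prompts, sampling_params)
n_boxed = sum(bool(re.search(r"\\boxed\{", o.outputs[0].text)) for o in outputs)
print(f"{n_boxed}/{len(outputs)} completions contain a \\boxed{{}} answer")
```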

[Screenshot: training curves from the ablation runs]

More to come as I run more experiments 🤗

Do you have any insight into custom reward function design? That is where my versions of GRPO are failing.

Open R1 org

Do you have any insight into custom reward function design? That is where my versions of GRPO are failing.

I am currently using the full set of reward functions we have in open-r1: https://github.com/huggingface/open-r1/blob/main/src/open_r1/grpo.py

I think the current biggest drivers of performance are the accuracy reward and the reasoning-steps reward:

[Screenshot: reward curves]
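To make the reward interface concrete for the question above: a reward function receives the sampled completions plus the dataset columns and returns one score per completion. Below is a deliberately simplified accuracy-style reward that does an exact string match on the \boxed{} answer; the actual open-r1 version does a more robust mathematical comparison, and the `solution` column name is an assumption here.

```python
import re

def accuracy_reward(completions, solution, **kwargs):
    # Simplified illustration: exact string match on the \boxed{...} answer.
    # The real open-r1 accuracy reward checks mathematical equivalence instead.
    rewards = []
    for completion, gold in zip(completions, solution):
        text = completion[0]["content"] if isinstance(completion, list) else completion
        match = re.search(r"\\boxed\{([^{}]*)\}", text)
        rewards.append(1.0 if match and match.group(1).strip() == str(gold).strip() else 0.0)
    return rewards
```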


Awesome! It would be nice to also report the batch size and gradient_accumulation_steps (I couldn't find these reported anywhere, maybe I missed them). I would also be curious to know how sensitive the performance is to the beta parameter (the weight of the KL term); it looks like it's set to 0.04 in all your experiments. If you look at the plots for the loss and the KL term, they look very similar (up to the factor beta=0.04), which seems to indicate that the KL term dominates the loss and maybe you don't give enough weight to the reward function.
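For reference, a simplified (clip-free) form of the GRPO objective from the DeepSeekMath/DeepSeek-R1 papers shows where beta enters; when the advantages are small (e.g. a flat accuracy reward), the logged loss is roughly beta times the KL, which would explain the two curves tracking each other.

```latex
% Simplified per-prompt GRPO loss over G sampled completions o_1, ..., o_G
\mathcal{L}(\theta) \;=\; -\,\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}
\left(
  \frac{\pi_\theta(o_{i,t}\mid q, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q, o_{i,<t})}\,\hat{A}_{i,t}
  \;-\; \beta\,\mathbb{D}_{\mathrm{KL}}\!\left[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right]
\right)
```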

Open R1 org

Great questions @alucchi! I've added some notes on the batch size I'm using (effectively 64 for all runs). And good idea to scan over β: currently all runs have β=0.04, so I'm now checking the effect there.

The AIME 2025 questions have just been released; maybe you can use them as an additional evaluation set as well.

Great work. You don't get reasoning with such a small model, do you? And can this be trained GPU-poor (e.g. with an Unsloth model)?

I wonder why the effective batch size is 64. Shouldn't it be 56, since you use 1 GPU for vLLM?

Thanks for the wonderful effort on reproducing R1. I was just curious if you tried using RL on top of one of the pretrained models (like Qwen2.5-1.5B) rather than the models distilled from R1.
I was hoping to understand whether we really need the distilled models for simpler datasets like MATH/GSM-8K. In other words, is the capability of these pretrained models (1-3B scale) sufficient for RL to improve?

Open R1 org

I wonder why the effective batch size is 64. Shouldn't it be 56, since you use 1 GPU for vLLM?

Good catch, indeed it should be 56! Fixed now :)
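For anyone following along, the arithmetic behind the corrected number (one of the 8 GPUs is reserved for vLLM generation):

```latex
7 \text{ training GPUs} \times 8 \text{ samples per device} \times 1 \text{ grad.\ acc.\ step} = 56
```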

Open R1 org

Thanks for the wonderful effort on reproducing R1. I was just curious if you tried using RL on top of one of the pretrained models (like Qwen2.5-1.5B) rather than the models distilled from R1.
I was hoping to understand whether we really need the distilled models for simpler datasets like MATH/GSM-8K. In other words, is the capability of these pretrained models (1-3B scale) sufficient for RL to improve?

We haven't done too many experiments with the base models yet, as the DeepSeek-R1 tech report shows that distillation outperforms pure RL:

[Screenshot: results from the DeepSeek-R1 tech report comparing distillation with pure RL]

This suggests that the recipe for producing strong models is as follows:

  • Create synthetic data from R1 on various domains of interest
  • Run SFT on a smaller model with this data
  • Apply GRPO to squeeze out additional performance

@lewtun Nice writeup! What if you used the same LIMO prompts to distill from R1 instead of using these prompts for RL? It would be nice to separate the two potential explanations for the improvement here.

@lewtun Can the training dataset for your experiments be released? If it has been, I didn't find a link to it. Or perhaps I misunderstood and you are just passing the raw questions from AIME and GPQA to the model.

@lewtun Thanks for the detailed write-up and for doing all these experiments! You mentioned that you got a +10 point boost on AIME for the 1.5B model, but from the leaderboard it looks more like a spike than a consistent improvement. Or did I misunderstand something?
[Screenshot: evaluation results referenced above]
