[Experiment] Applying GRPO to DeepSeek-R1-Distill-Qwen-1.5B with LIMO
In the DeepSeek-R1 tech report there is a hidden gem of advice about applying RL to their SFT distilled models:
I've started running some experiments to test the impact that GRPO can have on the existing distilled models. To keep things simple, I'm using the DeepSeek-R1-Distill-Qwen-1.5B model and the small yet high-quality LIMO dataset so I can iterate faster. I'll be using this discussion to track my progress, but feel free to chime in if you have ideas on how to improve the training!
Links
- Weights and Biases report with training metrics: https://api.wandb.ai/links/huggingface/3l5cglav
- My branch in `open-r1`: https://github.com/huggingface/open-r1/tree/grpo-limo
- Leaderboard to track downstream evals (search with `limo` to get the models): https://huggingface.co/spaces/open-r1/open-r1-eval-leaderboard
Experimental setup
Baseline parameters from v00.00 run:
- LR = 1e-6
- Number of tokens: 4096
- Number of generations: 7
- 3 epochs
- Effective batch size: 56 (8 per device, grad acc steps 1, 7 H100s). Per-device batch size and grad acc steps were tuned for max tokens 8192 (4, 2) and 16384 (2, 4).
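For reference, here is a minimal sketch of how these baseline parameters might map onto TRL's `GRPOConfig` (argument names follow recent TRL releases and may differ in other versions; the output directory is a placeholder):

```python
from trl import GRPOConfig

# Rough sketch of the v00.00 baseline; not the exact config used in the runs.
training_args = GRPOConfig(
    output_dir="DeepSeek-R1-Distill-Qwen-1.5B-GRPO-LIMO",  # placeholder name
    learning_rate=1e-6,
    num_train_epochs=3,
    max_completion_length=4096,   # number of generated tokens per completion
    num_generations=7,            # completions sampled per prompt
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    beta=0.04,                    # KL penalty weight (TRL default)
    use_vllm=True,                # one GPU is reserved for vLLM generation
    bf16=True,
)
```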
Ablations over:
- Number of tokens generated (v00.0X runs): 4096, 8192, 16384
- Learning rate (v01.0X runs): 2e-6, 4e-6, 8e-6
- Number of generations (v02.0X runs): 14, 28, 56
- Optimizer (v03.0X runs): Paged Adam8Bit
Note: there is a bug in the format rewards (https://github.com/huggingface/open-r1/issues/237), so we should re-run the best params again once this is fixed.
Key takeaways so far
- It really works! Depending on the hyperparameters, I'm able to get a ~10 point boost on AIME24 and GPQA, and a ~3 point boost on MATH-500 (likely saturated).
- Generating more tokens gives larger rewards and better loss
- Larger learning rates give larger rewards, but produce a "bump and dip" around 100 steps. The KL is also much larger.
- Increasing the number of generations gives larger rewards, but also seems to induce more spikes in the loss/KL for some reason (maybe a bug in TRL?). The smoothest run appears to be N=14
- The accuracy reward is rather flat, perhaps suggesting we are not generating enough tokens to emit the required `\boxed{}` answer.
- Using 8-bit Paged AdamW doesn't seem to noticeably affect the training dynamics vs 32-bit AdamW (apart from the KL being somewhat larger). This is great for memory!
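As a side note for anyone reproducing the optimizer ablation: switching to the paged 8-bit optimizer is just a config change, assuming the bitsandbytes-backed optimizers exposed through the `optim` argument of transformers' training arguments (which `GRPOConfig` inherits):

```python
from trl import GRPOConfig

# Sketch: the optimizer ablation only swaps the optimizer; everything else matches the baseline.
training_args = GRPOConfig(
    output_dir="DeepSeek-R1-Distill-Qwen-1.5B-GRPO-LIMO-paged",  # placeholder name
    optim="paged_adamw_8bit",  # bitsandbytes paged 8-bit AdamW (default is 32-bit AdamW)
    # ... remaining baseline arguments as in the config sketch above
)
```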
More to come as I run more experiments 🤗
Do you have any insight into custom reward function design? That is where my versions of GRPO are failing.
I am currently using the full set of reward functions we have in open-r1: https://github.com/huggingface/open-r1/blob/main/src/open_r1/grpo.py
I think the current biggest driver of performance is the accuracy rewards and reasoning steps reward:
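For readers who don't want to open the repo, here is a simplified sketch of what those two rewards look like. This is not the actual open-r1 code: the real implementations use `math_verify` for robust LaTeX answer checking, and the regexes and signatures below are illustrative stand-ins.

```python
import re

def accuracy_reward(completions, solution, **kwargs):
    r"""1.0 if the final \boxed{...} answer matches the reference solution, else 0.0."""
    contents = [completion[0]["content"] for completion in completions]
    rewards = []
    for content, sol in zip(contents, solution):
        match = re.search(r"\\boxed\{(.+?)\}", content)
        answer = match.group(1).strip() if match else None
        rewards.append(1.0 if answer is not None and answer == sol.strip() else 0.0)
    return rewards

def reasoning_steps_reward(completions, **kwargs):
    """Reward explicit step-by-step structure, saturating at 1.0 after ~3 step markers."""
    pattern = r"(?:Step \d+:|^\d+\.|\n-|\bFirst,|\bSecond,|\bFinally,)"
    contents = [completion[0]["content"] for completion in completions]
    return [min(1.0, len(re.findall(pattern, c, re.MULTILINE)) / 3) for c in contents]
```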
Awesome! It would be nice to also report the batch size and gradient_accumulation_steps (I couldn't find these reported anywhere, maybe I missed it). I would also be curious to know how sensitive the performance is to the beta parameter (the weight of the KL term); it looks like it's set to 0.04 in all your experiments. If you look at the plots for the loss and the KL term, they look very similar (up to the factor beta=0.04), which seems to indicate that the KL term dominates the loss and maybe you don't give enough weight to the reward function.
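For context on that last point, here is the GRPO objective as written in the DeepSeekMath/DeepSeek-R1 papers (simplified by omitting the per-token average): the KL penalty enters the loss scaled by beta, so when the beta-weighted KL is large relative to the clipped advantage term, the loss curve will mostly track the KL curve.

$$
\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}\Big(\min\!\big(r_i(\theta)\,\hat{A}_i,\ \mathrm{clip}\big(r_i(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_i\big) - \beta\,\mathbb{D}_{\mathrm{KL}}\!\big[\pi_\theta\,\|\,\pi_{\mathrm{ref}}\big]\Big)\right],
\qquad r_i(\theta)=\frac{\pi_\theta(o_i\mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i\mid q)}
$$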
AIME 2025 questions have just been released; maybe you can use them as an additional evaluation set as well.
Great work. You don't get reasoning with such a small model, do you? And can this be trained on a GPU-poor setup (e.g. with an Unsloth model)?
I wonder why the effective batch size is 64. Shouldn't it be 56, since you use 1 GPU for vLLM?
Thanks for the wonderful effort on reproducing R1. I was just curious if you have tried using RL on top of one of the pretrained models (like Qwen2.5-1.5B) rather than the models distilled from R1.
I was hoping to understand whether we really need the distilled models for simpler datasets like MATH/GSM-8K. In other words, is the capability of these pretrained models (1-3B scale) sufficient for RL to improve on?
> I wonder why the effective batch size is 64. Shouldn't it be 56, since you use 1 GPU for vLLM?
Good catch, indeed it should be 56! Fixed now :)
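(For reference, that works out to 7 training GPUs × 8 samples per device × 1 gradient-accumulation step = 56, with the 8th H100 reserved for vLLM generation.)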
> Thanks for the wonderful effort on reproducing R1. I was just curious if you have tried using RL on top of one of the pretrained models (like Qwen2.5-1.5B) rather than the models distilled from R1.
> I was hoping to understand whether we really need the distilled models for simpler datasets like MATH/GSM-8K. In other words, is the capability of these pretrained models (1-3B scale) sufficient for RL to improve on?
We haven't done many experiments with the base models yet, as the DeepSeek-R1 tech report shows that distillation outperforms pure RL:
This suggests that the recipe for producing strong models is as follows:
- Create synthetic data from R1 on various domains of interest
- Run SFT with a smaller model
- Apply GRPO to squeeze out additional performance
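A rough end-to-end sketch of that recipe with TRL is below. The distilled-traces dataset ID is a placeholder, prompt templating and column renaming are omitted, and the trainer arguments follow recent TRL releases, so treat this as a sketch rather than the exact recipe.

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer, SFTConfig, SFTTrainer

# 1. SFT a small model on reasoning traces distilled from R1 (placeholder dataset ID)
sft_dataset = load_dataset("your-org/r1-distilled-traces", split="train")
sft_trainer = SFTTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    args=SFTConfig(output_dir="qwen2.5-1.5b-r1-sft"),
    train_dataset=sft_dataset,
)
sft_trainer.train()

# 2. GRPO on top of the SFT checkpoint to squeeze out additional performance
grpo_dataset = load_dataset("GAIR/LIMO", split="train")
grpo_trainer = GRPOTrainer(
    model="qwen2.5-1.5b-r1-sft",
    reward_funcs=[accuracy_reward, reasoning_steps_reward],  # e.g. the rewards sketched earlier
    args=GRPOConfig(output_dir="qwen2.5-1.5b-r1-sft-grpo", num_generations=7),
    train_dataset=grpo_dataset,
)
grpo_trainer.train()
```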