Jaward posted an update 4 days ago
The beauty of GRPO is that it doesn't care whether the rewards are rule-based or learned. The hack: let the data self-normalize. Trajectories in a batch compete against their group mean, so there's no value model and no extra params, just clean, efficient RL that cuts memory usage by ~50% while maintaining SOTA performance. Btw, it was introduced 9 months before R1: arxiv.org/pdf/2402.03300
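To make the "compete against their mean" part concrete, here's a minimal sketch of the group-relative advantage step as I understand it (my own illustration, not the DeepSeek implementation; the epsilon and exact normalization details vary across codebases):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: each rollout's reward is normalized against
    the mean/std of its own sampling group (one prompt, G completions).
    rewards: shape (G,), one scalar reward per sampled completion.
    """
    mean = rewards.mean()
    std = rewards.std()
    return (rewards - mean) / (std + eps)

# example: 4 rollouts for the same prompt with rule-based 0/1 correctness rewards
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
print(grpo_advantages(rewards))  # above-mean rollouts get positive advantage
```

No learned value model anywhere: the group statistics play the role of the baseline.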

Yeah, the fun part is that I can use any QA dataset with GRPO just by instructing the model to follow simple rules: place your answer in \boxed{} or ** ** tags. Then I extract it with a regex, and it simply works.
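Something along these lines (a rough sketch of the idea; the regex patterns and the 0/1 reward are placeholders, not my exact code):

```python
import re

def extract_answer(completion: str) -> str | None:
    # look for \boxed{...} first, then **...** as a fallback
    m = re.search(r"\\boxed\{([^}]*)\}", completion)
    if m is None:
        m = re.search(r"\*\*(.+?)\*\*", completion)
    return m.group(1).strip() if m else None

def reward_fn(completion: str, gold: str) -> float:
    # rule-based reward: 1 if the extracted answer matches the reference, else 0
    pred = extract_answer(completion)
    return 1.0 if pred is not None and pred == gold.strip() else 0.0

print(reward_fn("The answer is \\boxed{42}.", "42"))  # 1.0
```

Feed those scalar rewards into the group normalization above and you have a rule-based GRPO loop over any QA dataset.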