details pls

#1
by archit11 - opened

Can you give some details on what data it was trained on and for how many steps? I tried GRPO with SmolLM 350M on GSM8K, but the results were really bad, so I stopped after a few steps.

This was just a random dry run, so don't expect anything from it. It was trained on 'trl-lib/ultrafeedback_binarized'.
I am planning to experiment with GRPO and SmolLM this week, so let's see how that goes.

ubermenchh changed discussion status to closed
