Results on AI-MO/aimo-validation-aime
#1 · opened by tugstugi
I ran an evaluation on AI-MO/aimo-validation-aime, one-shot inference with vllm==v0.6.4.post1:
- Qwen/QwQ-32B-Preview -> 25 correct of 90
- Qwen/Qwen2.5-Math-7B-Instruct -> 13 correct of 90
- Qwen/Qwen2.5-14B-Instruct -> 11 correct of 90
- this model -> 6 correct of 90
So this model performs much worse than the original Qwen/Qwen2.5-14B-Instruct.

PS: the system prompt was: "You are a helpful and harmless assistant. You should think step-by-step and put the answer in \boxed{}."
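For context, a minimal sketch of how such a one-shot vLLM run with that system prompt and \boxed{} answer extraction could look. This is not the exact script used above; the dataset field names `problem`/`answer` and the string-based answer comparison are assumptions.

```python
import re
from datasets import load_dataset
from vllm import LLM, SamplingParams

SYSTEM_PROMPT = ("You are a helpful and harmless assistant. You should think "
                 "step-by-step and put the answer in \\boxed{}.")

# assumed split/field names for AI-MO/aimo-validation-aime
ds = load_dataset("AI-MO/aimo-validation-aime", split="train")
llm = LLM(model="Qwen/Qwen2.5-14B-Instruct")
tokenizer = llm.get_tokenizer()
params = SamplingParams(temperature=0.0, max_tokens=4096)

def extract_boxed(text):
    # take the last \boxed{...}; simple regex, no nested braces
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

prompts = [
    tokenizer.apply_chat_template(
        [{"role": "system", "content": SYSTEM_PROMPT},
         {"role": "user", "content": ex["problem"]}],
        tokenize=False, add_generation_prompt=True)
    for ex in ds
]

outputs = llm.generate(prompts, params)
# naive string comparison; real scoring may need answer normalization
correct = sum(
    extract_boxed(out.outputs[0].text) == str(ex["answer"]).strip()
    for out, ex in zip(outputs, ds)
)
print(f"{correct} correct of {len(ds)}")
```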
Interesting results! I'll do some tests on my end too. I don't think it was trained with that system prompt; maybe that's affecting the performance.
QwQ uses top_p=0.8. After rerunning with top_p=0.8, temperature=1.0, repetition_penalty=1.05, and max_seq_length=16K, the score is now 10 correct of 90. Of the 90 outputs, only 37 contain a \boxed{} element.
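For reference, those settings map onto vLLM's `SamplingParams` roughly as below; this is a sketch, and mapping "max_seq_length=16K" to the generation budget `max_tokens` is an assumption.

```python
from vllm import SamplingParams

params = SamplingParams(
    top_p=0.8,               # QwQ's recommended nucleus sampling
    temperature=1.0,
    repetition_penalty=1.05,
    max_tokens=16384,        # interpreted from "max_seq_length=16K"
)
```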
Thanks for the update! I'm also evaluating both this model and the original Qwen 2.5 14B on MATH 500.
tugstugi changed discussion status to closed