Results on AI-MO/aimo-validation-aime
#1 · opened by tugstugi
I ran an evaluation on AI-MO/aimo-validation-aime, one-shot inference with vllm==v0.6.4.post1:
- Qwen/QwQ-32B-Preview -> 25 correct of 90
- Qwen/Qwen2.5-Math-7B-Instruct -> 13 correct of 90
- Qwen/Qwen2.5-14B-Instruct -> 11 correct of 90
- this model -> 6 correct of 90
So this model performs much worse than the original Qwen/Qwen2.5-14B-Instruct.

PS: the system prompt was: "You are a helpful and harmless assistant. You should think step-by-step and put the answer in \boxed{}."
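For context, a minimal sketch of how such a one-shot vLLM run with that system prompt and \boxed{} answer extraction could look. This is not the exact script used above; the dataset field names `problem`/`answer` and the string-based answer comparison are assumptions.

```python
import re
from datasets import load_dataset
from vllm import LLM, SamplingParams

SYSTEM_PROMPT = ("You are a helpful and harmless assistant. You should think "
                 "step-by-step and put the answer in \\boxed{}.")

# assumed split/field names for AI-MO/aimo-validation-aime
ds = load_dataset("AI-MO/aimo-validation-aime", split="train")
llm = LLM(model="Qwen/Qwen2.5-14B-Instruct")
tokenizer = llm.get_tokenizer()
params = SamplingParams(temperature=0.0, max_tokens=4096)

def extract_boxed(text):
    # take the last \boxed{...}; simple regex, no nested braces
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

prompts = [
    tokenizer.apply_chat_template(
        [{"role": "system", "content": SYSTEM_PROMPT},
         {"role": "user", "content": ex["problem"]}],
        tokenize=False, add_generation_prompt=True)
    for ex in ds
]

outputs = llm.generate(prompts, params)
# naive string comparison; real scoring may need answer normalization
correct = sum(
    extract_boxed(out.outputs[0].text) == str(ex["answer"]).strip()
    for out, ex in zip(outputs, ds)
)
print(f"{correct} correct of {len(ds)}")
```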
Interesting results! I'll do some tests on my end too. I don't think it was trained with that system prompt; maybe that's affecting the performance.
QwQ uses top_p=0.8. After rerunning with top_p=0.8, temperature=1.0, repetition_penalty=1.05, and max_seq_length=16K, the score is now 10 correct of 90. Of the 90 outputs, only 37 contain a \boxed{} element.
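For reference, those settings map onto vLLM's `SamplingParams` roughly as below; this is a sketch, and mapping "max_seq_length=16K" to the generation budget `max_tokens` is an assumption.

```python
from vllm import SamplingParams

params = SamplingParams(
    top_p=0.8,               # QwQ's recommended nucleus sampling
    temperature=1.0,
    repetition_penalty=1.05,
    max_tokens=16384,        # interpreted from "max_seq_length=16K"
)
```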
Thanks for the update! I'm also evaluating both this model and the original Qwen 2.5 14B on MATH 500.
tugstugi changed discussion status to closed