Please submit this model to the Open LLM Leaderboard

#1
by grimjim - opened

The leaderboard is located here.
https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/

I performed a merge of two o1 models, including yours, and got an unusually high MATH benchmark score of 33.99%.
I posit that your model may be highly capable in mathematical reasoning despite the focus being on medical reasoning.

There's an issue with submitting it to the leaderboard; see here: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard/discussions/1055

Personal merge tests with this model showed very high BBH and MMLU-PRO benchmarks, so I'd expect Skywork has hidden math performance.

@grimjim The issue should be fixed on their end! Now we just need to upvote the model: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/vote

If some votes could also be thrown at my models too I'd appreciate it! 🌴

Although the result was diluted, the o1 merge above was able to uplift most benchmarks for another L3.1 8B when merged in. Every benchmark went up except IFEval.
https://huggingface.co/grimjim/SauerHuatuoSkywork-o1-Llama-3.1-8B

I have to wonder how much strength is hidden by a lack of compliance with benchmark formatting requirements; the unremarkable IFEval score may be a sign of untapped benchmark potential in HuatuoGPT-o1 8B.
