Please submit this model to the Open LLM Leaderboard
The leaderboard is here: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/
I performed a merge of two o1 models, including yours, and hit an unusually high MATH benchmark score of 33.99%.
I posit that your model may be highly capable at mathematical reasoning despite its focus on medical reasoning.
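For reference, a two-model merge like the one described above can be expressed as a mergekit SLERP config. This is only an illustrative sketch: the model IDs, layer ranges, and interpolation factor below are assumptions, not the actual recipe used for the merge.

```yaml
# Hypothetical mergekit SLERP config for a two-model o1 merge.
# Model IDs, layer_range, and t are placeholders, not the real recipe.
slices:
  - sources:
      - model: FreedomIntelligence/HuatuoGPT-o1-8B   # assumed medical o1 model
        layer_range: [0, 32]
      - model: Skywork/Skywork-o1-Open-Llama-3.1-8B  # assumed second o1 model
        layer_range: [0, 32]
merge_method: slerp
base_model: FreedomIntelligence/HuatuoGPT-o1-8B
parameters:
  t: 0.5   # 0.5 = equal blend of the two models
dtype: bfloat16
```

Saved as e.g. `merge.yaml`, this would be run with `mergekit-yaml merge.yaml ./output-model`.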
There's an issue with submitting it to the leaderboard, see here: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard/discussions/1055
Personal merge tests with this model showed very high BBH and MMLU-PRO scores, so I'd expect Skywork has hidden math performance.
@grimjim The issue should be fixed on their end! Now we just need to upvote the model: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/vote
If some votes could be thrown at my models too, I'd appreciate it!
Although the effect was diluted, merging the o1 merge above into another L3.1 8B uplifted most benchmarks; every bench went up except IFEval.
https://huggingface.co/grimjim/SauerHuatuoSkywork-o1-Llama-3.1-8B
I have to wonder how much strength is hidden behind non-compliance with benchmark formatting requirements; the unremarkable IFEval score may be a sign of untapped benchmark potential in HuatuoGPT-o1 8B.