Model trained on benchmarks
I have concerns about the Rhea-72b-v0.5 model's performance on the Open LLM Leaderboard. It appears that the model was trained on datasets that overlap with the evaluation benchmarks used by the leaderboard, such as MMLU, GSM8K, HellaSwag, and others. Training on these datasets can lead to data leakage, where the model is effectively tested on data it has seen during training. This can artificially inflate performance metrics and may not accurately reflect the model's true capabilities or its ability to generalize to new, unseen data.
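For anyone who wants to check this kind of overlap themselves, here is a rough sketch of an n-gram comparison between training text and benchmark test items. It is only an illustration: the function names (`ngrams`, `contamination_rate`) and the 8-token window are my own assumptions, not the leaderboard's actual contamination check, and exact n-gram matching will miss paraphrased or reformatted questions.

```python
import re


def ngrams(text, n=8):
    """Return the set of lowercase word n-grams in `text`."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(train_docs, test_items, n=8):
    """Fraction of test items sharing at least one n-gram with the training docs.

    Hypothetical helper: `train_docs` and `test_items` are plain lists of strings.
    Returns (rate, flagged_items).
    """
    train_grams = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)

    # A test item is flagged if any of its n-grams also appears in training text.
    flagged = [item for item in test_items if ngrams(item, n) & train_grams]
    rate = len(flagged) / len(test_items) if test_items else 0.0
    return rate, flagged
```

In practice you would pass in the text of the fine-tuning dataset and the benchmark's test split. A high rate suggests leakage, while a low rate does not rule it out.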
This may explain why the model performed well on the original (now archived) Leaderboard but performs much worse on the new Leaderboard. I propose that this model be flagged on the archived Leaderboard.
Perhaps this is what the model developer intended; I don't know. But it seems rather unfair for this model to be evaluated on the same leaderboard as models that have not seen the evaluation test questions.
Apologies: it seems that the model was trained on, e.g., the GSM8K training set, not the test set. Still, I am confused about why this model's performance is so much lower on the new LLM Leaderboard, something other commenters have also noted.