Black Box Benchmarks over Contamination Scanning
From the recent conversations that have been going on, it sounds like there is a good chance Yi is contaminated, along with loads of other models, and by extension all the merges and fine-tunes built on top of them.
https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard
This leaderboard, based on blind user preference, shows that despite being potentially contaminated, Yi-34B is still a really good model. So if you are planning to implement a way to get rid of the benchmark gaming that's going on, I feel that mandatory contamination checks would be a very flawed solution. It would also be very regressive, potentially invalidating hundreds of models.
I feel like having black box benchmarks (or another solution you can think of) where people can't train on or game the test data would be much better.
Most of the benchmarks on the leaderboard right now are susceptible to excessive fine-tuning. MMLU is the most trusted, but even it has its own problems; reportedly around 1-2% of its answers are simply wrong. So please consider making your own benchmarks, for example: Problem-Solving, Multi-Language Comprehension, Breadth of Knowledge, Math, and Safety (willingness to answer questions).
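To make the black box idea concrete, here is a minimal sketch (purely hypothetical, not an existing HF service) of a scoring flow where the hidden test set never leaves the evaluator's server and submitters only ever get back an aggregate number:

```python
# Hypothetical black-box evaluation flow: the hidden test set stays with the
# benchmark operator; submitters send predictions keyed by question id and
# receive only an aggregate score, never the questions or gold answers.
from typing import Dict

# Held-out answer key known only to the benchmark operator (toy placeholder).
_HIDDEN_ANSWERS: Dict[str, str] = {
    "q1": "B",
    "q2": "D",
    "q3": "A",
}

def score_submission(predictions: Dict[str, str]) -> float:
    """Return only the aggregate accuracy, never per-item feedback,
    so the test set cannot be reverse-engineered one item at a time."""
    graded = [
        predictions.get(qid, "").strip().upper() == answer
        for qid, answer in _HIDDEN_ANSWERS.items()
    ]
    return sum(graded) / len(graded)

# A submitter would send something like this and get back a single number.
print(score_submission({"q1": "B", "q2": "C", "q3": "A"}))  # -> 0.666...
```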
This is an often-discussed research topic on the validity of evaluation datasets (see https://github.com/lm-sys/llm-decontaminator/issues/1), and addressing it requires a real undertaking rather than simply reusing a well-known set of "trusted" benchmarks.
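For reference, here is a minimal sketch of the surface-level n-gram overlap scan that traditional decontamination relies on; note that the linked llm-decontaminator project goes further and uses an LLM to catch rephrased test samples, which a check like this would miss:

```python
# Illustrative n-gram overlap check between a training document and a
# benchmark test document. High overlap suggests the test item leaked into
# the training data. This is a simplified sketch, not the llm-decontaminator's
# actual method.
def ngrams(text: str, n: int = 8) -> set:
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(train_doc: str, test_doc: str, n: int = 8) -> float:
    """Fraction of the test document's n-grams that also appear in the
    training document."""
    test_grams = ngrams(test_doc, n)
    if not test_grams:
        return 0.0
    return len(test_grams & ngrams(train_doc, n)) / len(test_grams)

# Example: a training sample that copies a benchmark question verbatim.
train = "The quick brown fox jumps over the lazy dog near the river bank today"
test = "The quick brown fox jumps over the lazy dog near the river bank"
print(overlap_ratio(train, test, n=8))  # -> 1.0, i.e. likely contaminated
```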
In theory I like the idea of black box benchmarking. However, I'm certain the HF team is fully aware of the benchmarks, their limitations and the alternatives. And since they're the ones who would need to invest the time and resources to implement such a drastic change it's up to them to decide what's best.
I'm not going to list all the downsides of black box benchmarking since you can just ask GPT-4, but there are a lot of them. And I personally don't like the lack of independent score verification.
Lastly, it's been my experience that in most cases scores have roughly matched performance. And if you don't want to waste time testing potentially contaminated LLMs you can stick to legit LLMs, such as Intel's Neural, Berkeley's Starling, Microsoft's Orca 2, HuggingFaceH4's Zephyr and so on. Perhaps they can add a filter that only shows LLMs from trusted sources like companies and academic institutions.
> And if you don't want to waste time testing potentially contaminated LLMs you can stick to legit LLMs, such as Intel's Neural, Berkeley's Starling, Microsoft's Orca 2, HuggingFaceH4's Zephyr and so on. Perhaps they can add a filter that only shows LLMs from trusted sources like companies and academic institutions.
That's what I was petitioning against. I was saying that if HF is going to go the route of flagging loads of models for contamination or untrustworthiness, then they might inadvertently stop people from using what are potentially the best models.
@TNTOutburst This is a perfect example of a no-win situation. All options have major downsides. And I'm confident that black box benchmarking, despite its compelling advantages, is not the lesser of all evils.
@Phil337 Maybe. Nonetheless, HF definitely needs to make a large change soon. People already distrusted the leaderboard, and in the past month it has gotten so much worse, with the top of the leaderboard being filled with 7B and 10B models. It feels like every benchmark on the leaderboard is completely useless except for MMLU and Winogrande.
Hi! Interesting discussion!
@TNTOutburst Black box benchmarks are not a direction we'll take for the OpenLLMLeaderboard (I explained a bit about the vision we have here), but if you know of high-quality black box benchmarks and would like to host a leaderboard for them, I'll be happy to give you a hand.
@Phil337 I like the idea of adding an "institutional" category on the leaderboard. I'll add it to the todo list and check how feasible it would be.
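As a rough sketch of what such a filter could look like (the column names, the org allowlist, and the scores below are placeholder assumptions for illustration, not the leaderboard's actual schema or results):

```python
# Hypothetical "trusted sources" filter over leaderboard rows, keeping only
# models whose namespace (the part before the slash) is on an allowlist of
# companies and academic institutions. All values are illustrative.
import pandas as pd

TRUSTED_ORGS = {"Intel", "berkeley-nest", "microsoft", "HuggingFaceH4"}

rows = pd.DataFrame(
    [
        {"model": "Intel/neural-chat-7b", "avg_score": 61.0},
        {"model": "berkeley-nest/Starling-LM-7B-alpha", "avg_score": 67.0},
        {"model": "someuser/yet-another-merge-7b", "avg_score": 74.0},
    ]
)

rows["org"] = rows["model"].str.split("/").str[0]
institutional = rows[rows["org"].isin(TRUSTED_ORGS)]
print(institutional[["model", "avg_score"]])
```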