Add Qwen 2.5 7B & Tulu 3 8B results to OLLM benchmarks

#1
by Fizzarolli - opened

Sourced from Open LLM Leaderboard

Command R 7B remains competitive here, no need to worry about excluding them ;)

Fizzarolli changed pull request title from Add Qwen 2.5 7B results to Add Qwen 2.5 7B & Tulu 3 8B results
Fizzarolli changed pull request title from Add Qwen 2.5 7B & Tulu 3 8B results to Add Qwen 2.5 7B & Tulu 3 8B results to OLLM benchmarks

Isn't Tulu 3 a finetune?

So is Command R7B (albeit on its own base). Post-training is pretty impactful on a lot of these benchmarks, so the current SOTA open-data model is pretty noteworthy to have.

I see your point, but if we're going to include finetunes then what decides which ones should be listed?
Right now there are quite a few that outscore Tulu 3 in certain rows.

Usually "made by a legitimate company or foundation and got some popular attention, not some random huggingface user who may or may not have trained on test" (ie, AI2 is a legitimate research org and tulu got at least some attention)

Why include Qwen2.5 7b?

I don't want to attack them, but they're artificially boosting their scores on English tests like the English MMLU.

That is, Qwen is actively removing extremely popular English knowledge from their corpus in order to focus training on data that maximizes test scores, such as virology, which is one of the handful of domains covered by the MMLU.

They're even doing this with Chinese knowledge, but nowhere near to the same degree. For example, the Chinese SimpleQA scores of Qwen's models (general Chinese knowledge) are lower than, but in the ballpark of, their Chinese MMLU scores. However, they're still actively favoring Chinese information that maximizes the Chinese MMLU score compared to other Chinese models like DeepSeek.

But when it comes to English, what they're doing is cheating, plain and simple. They're achieving higher MMLU scores than the best and longest-trained English-only models (e.g. Llama 3.1 8b & Gemma 2 9b), yet they have FAR, FAR less general English knowledge (e.g. rock-bottom SimpleQA scores, and they score far lower on my popular general knowledge test).

In short, comparing the best English models like Llama 3.1 & Gemma 2 to bilingual foreign models like Qwen2.5 or EXAONE 3.5, which selectively train on English data that maximizes scores on tests like the MMLU despite having a tiny fraction of the total English knowledge, is at minimum grossly misleading. (Meta & Google could easily have done the same and gotten far higher MMLU scores than Qwen2.5 or EXAONE 3.5 by giving up core English knowledge.)

There's no way to prove this with non-circumstantial evidence (i.e., "the model doesn't know the thingymaflorp from the beepityboopity! I know that, so it must be bad!" does not count here) unless you have access to the raw pretraining and annealing datasets of these large corporations and can actually pull out examples of exactly how they benchmaxx. All criticism of this kind that doesn't, in the same breath, also accuse other nationalities' models of benchmark optimizing (there are especially obvious examples here that you even mentioned, like L3.1's 'burnt' post-trains, which a lot of people called inferior to even older models when actually trying to use them in multi-turn conversation) comes off as nationalistic irrespective of the technology at best, and xenophobic at worst... Not saying this is the point you're trying to make at the moment! Just... that's what all these similar arguments boil down to, from my point of view.

Furthermore, even if it did cheat, it's still a very popular series of models that it seems disingenuous to exclude. Other people exist who do like the models for general-purpose English use, regardless of what they did to thingymabobber knowledge!

I stand by what I wrote, and didn't just test a handful of random questions.

Qwen2.5 has far more Chinese training data in its corpus, not to mention more coding & math data, putting it at a huge disadvantage when competing with English-based LLMs like Llama 3.1 & Gemma 2 at a given parameter count. The ONLY possible way to compete with their MMLU scores at any given parameter count, let alone beat them, is to aggressively favor English data covered by tests like the MMLU.

Sure enough, I ran an extensive and diverse popular knowledge test, and the Qwen2.5 series not only scored MUCH lower per parameter count than other models like Llama 3 and Gemma 2, it also scored much lower than its predecessor Qwen2 (e.g. Qwen2.5 72b scored 68.4, versus 85.9 for Qwen2 72b).

Additionally, as I previously stated, there are plenty of other tests, including the Chinese & English SimpleQA & MMLU tests. And sure enough, the disparity between general knowledge and the small subset of knowledge covered by the MMLU is overwhelming, especially when it comes to English, and it's still more pronounced in Chinese than it is for other models like DeepSeek, especially at 7b parameters and under.

Additionally, every conceivable indicator clearly and redundantly shows that Qwen played strong favorites when it comes to training on test-boosting data. For example, Qwen2.5 72b's MMLU is 51.40, which is lower than Qwen2.5 32b's 51.9, and yet Qwen2.5 32b scored notably lower than Qwen2.5 72b on general knowledge tests, including my own. This isn't possible unless they selectively crammed proportionately more MMLU-adjacent data relative to general knowledge when creating Qwen2.5 32b versus 72b. Plus, the ratio between MMLU scores and general knowledge keeps increasing at progressively smaller sizes (e.g. Qwen2.5 7b). So not only are they selectively training more on data that overlaps tests like the MMLU, but they do so more aggressively the smaller the model is.
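To make the ratio argument concrete, here's a tiny illustrative sketch. The only numbers taken from this thread are the 72b figures quoted above (MMLU 51.40, popular-knowledge 68.4); the other entries are left empty because those scores weren't shared here, so treat this as an illustration of the claimed trend, not as data:

```python
# Illustrative sketch of the "MMLU vs. general knowledge" ratio comparison described above.
# Only the Qwen2.5-72B numbers come from this thread; the rest are left as None and are
# meant to be filled in with your own measurements.
scores = {
    # model: (mmlu, general_knowledge)
    "Qwen2.5-72B": (51.40, 68.4),
    "Qwen2.5-32B": (51.9, None),   # general-knowledge score not reported in the thread
    "Qwen2.5-7B": (None, None),    # fill in with your own measurements
}

for model, (mmlu, general) in scores.items():
    if mmlu is None or general is None:
        print(f"{model}: incomplete data, skipping")
        continue
    # The claim above is that this ratio grows as the models get smaller.
    print(f"{model}: MMLU / general-knowledge = {mmlu / general:.2f}")
```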

Lastly, not that it's the least bit relevant, but Qwen2.5 is only widely popular in the OS community because they're pandering to early adopters. That is, the vast majority of the OS community is primarily/obsessively concerned with coding, so Qwen is overwhelmingly favoring coding, math and test data (e.g. MMLU) while aggressively suppressing wildly popular knowledge. If you think that Qwen2.5 somehow found a way to perform extremely well in Chinese (e.g. a high Chinese MMLU), while also having a higher English MMLU than Gemma 2 and the overtrained Llama 3.1 series, while retaining just as much general English knowledge as Gemma 2 and Llama 3.1, then you have a rather pronounced misunderstanding of how all this works.

Cohere For AI org

Hey @Fizzarolli, thanks for your contribution!

alexrs changed pull request status to merged

yeah it doesn't make sense to add these models. strange how all the criticism was completely ignored and this got merged.
are you also going to add exaone by lg? internlm? granite by ibm? glm-4? etc...? at least be consistent then and add them all.

well, i can't speak for cohere, but i just stopped engaging with what came off to me as bad-faith complaints not based in any actual published or reproducible statistics

those make much less sense to add, as there's not really any context where they're "SOTA" or considered by the open research community to be "SOTA," so it'd be weird to add them
something like qwen 2.5 is a lot different in this regard, as it's

  1. pretty widely considered to be in the top tier of OSS models by a lot of people
  2. results in pretty wide ranging criticism when people don't include qwen 2.5's benchmarks in comparison, because of the impression they don't want to compare to the actual-best model

@Fizzarolli

"2. results in pretty wide ranging criticism when people don't include qwen 2.5's benchmarks in comparison, because of the impression they don't want to compare to the actual-best model"

This is precisely why I hate Qwen. You, like so many in the OS community, are confidently acting on the assumption that Qwen2.5 7b is the "actual-best model" in English, and even forcing their scores onto the model cards of other models.

Do you honestly think that, despite being smaller than the top English models Llama 3.1 8b and Gemma 2 9b, and despite primarily focusing on the Chinese language, they also somehow bested the best English models in English?

Anyways, I don't want to rehash everything, but since you're going out of your way to force Qwen's scores onto model cards please do your due diligence and take the time to at least look at the facts I presented. For example, take a look at Qwen2.5 32b & Qwen2.5 14b.

They both scored higher on the MMLU than Llama 3.1 70b, the top English model at 70b or smaller. Are you honestly going to argue that a primarily Chinese-language LLM got a genuine English MMLU score at 14b that was higher than the best primarily English-language LLM at 70b? And if you think I'm lying about the results of my general knowledge English test, then either run a public English general knowledge test like SimpleQA on all of them, or ask basic questions about the top 100 English movies, books, sports, TV shows, music... Qwen2.5 14b and 32b have surprisingly little English world knowledge, even compared to tiny little Llama 3.2 3b; they have orders of magnitude less than Llama 3.1 70b, and notably less than Qwen2.5 72b, despite Qwen2.5 32b having a slightly higher MMLU.
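For anyone who wants to try that kind of spot check themselves, here's a minimal sketch using the Hugging Face transformers pipeline. The model IDs and trivia questions are just examples I've chosen, not anything prescribed in this thread; swap in whichever checkpoints and questions you want to compare:

```python
# Minimal sketch of the kind of general-knowledge spot check described above.
# Model IDs and questions are illustrative choices, not an established benchmark.
from transformers import pipeline

questions = [
    "Who directed the film Jurassic Park?",
    "Who wrote the novel Pride and Prejudice?",
    "Which band released the album Abbey Road?",
]

model_ids = [
    "meta-llama/Llama-3.1-8B-Instruct",  # assumed checkpoint names; adjust to what you have access to
    "Qwen/Qwen2.5-7B-Instruct",
]

for model_id in model_ids:
    generator = pipeline("text-generation", model=model_id)
    for question in questions:
        # Greedy, short completions are enough for a quick factual probe.
        output = generator(question, max_new_tokens=32, do_sample=False)
        print(f"{model_id} | {question} -> {output[0]['generated_text']}")
```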

The Qwen team didn't just artificially boost MMLU scores beyond any reasonable doubt when making the Qwen2.5 series; they did it to such an absurd degree (Qwen2.5 14b & 32b beating Llama 70b despite having orders of magnitude less broad English knowledge) that I am honestly embarrassed for them, and for you for not picking up on something so overwhelmingly obvious. Qwen2.5's scores do not belong on any model comparison chart. Qwen2.5 was the last straw. They're dead to me and anyone with at least half a brain.
