SimpleQA?
Please consider posting this model's English SimpleQA score.
In the past, notable gains in STEM (MMLU), coding, and math scores have reliably been accompanied by notable regressions in general knowledge and abilities.
For example, Qwen2.5 72b regressed to only 10.1 on the English SimpleQA, vs >20 for Llama 3.3 70b, as its MMLU, math, and coding scores notably increased relative to Qwen2 72b.
SimpleQA isn't just about general knowledge. Its balanced set of mostly fringe-knowledge questions across all major domains, coupled with its non-multiple-choice design, shines a light on the damage caused by training too long on a disproportionately large amount of coding, math, and similar data.
This damage is largely overlooked by multiple choice tests because, just like with humans, even when the model can no longer fully recall the requested information it can usually still identify the answer when it's shown within a list of options. But since users almost never provide the information they're requesting, a model's multiple-choice test-taking skills aren't relevant. And no, RAG isn't a major fix for an LLM's broad ignorance, because in common use cases (e.g. story writing) said knowledge needs to be organically recalled.
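To make the recall-vs-recognition distinction concrete, here's a minimal sketch of probing the same fact both ways against an OpenAI-compatible chat endpoint (the model name is a placeholder, not a real model). A model can often pass the multiple-choice form while failing the open-ended one.

```python
# Minimal sketch: probing the same fact as open-ended recall vs. multiple choice.
# Assumes an OpenAI-compatible API; "some-model" is a placeholder model name.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

FACT_QUESTION = "Who played Betty in the 1995 movie Four Rooms?"

# 1) SimpleQA-style: the model must recall the answer on its own.
recall = client.chat.completions.create(
    model="some-model",
    messages=[{"role": "user", "content": FACT_QUESTION}],
)

# 2) MMLU-style: the answer is handed to the model inside a list of options,
#    so recognition is enough and recall failures stay hidden.
multiple_choice = client.chat.completions.create(
    model="some-model",
    messages=[{
        "role": "user",
        "content": FACT_QUESTION
        + "\nA) Madonna\nB) Kathy Griffin\nC) Marisa Tomei\nD) Jennifer Beals"
        + "\nAnswer with the letter only.",
    }],
)

print(recall.choices[0].message.content)
print(multiple_choice.choices[0].message.content)
```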
Again, I didn't test this model, so perhaps its general knowledge and abilities didn't regress from the last version of DeepSeek v3. However, all the red flags are there, including a boost in STEM (MMLU), coding, and math scores, plus the exclusive focus on those domains in the model card and other release notes, which in the past has always correlated with a notable drop in general knowledge and abilities (e.g. on my personal test and SimpleQA). So if the same didn't happen here it would be a surprising anomaly.
I've been playing around with this model a bit on LMSYS, and its weights do appear to be scrambled relative to the previous version of DS3. I added an example at the end of this post, followed by GPT4o's error-free response.
Additionally, it has a tendency to needlessly add irrelevant details, even when contraindicated by the prompt, increasing the hallucination rate even more, such as adding "Dr." to the names of people who aren't doctors, or needlessly mentioning who they're married to and getting it wrong.
It's pretty clear that all the extra training needed to selectively boost STEM (MMLU/GPQA), math, and coding scores resulted in a general scrambling of the weights. And as a reminder to all you information elitists out there, this model has 685 BILLION parameters.
Example: "What are the 6 main characters, and the actors who portrayed them, in the 1995 movie Four Rooms? Don't add details, just list them."
DeepSeek: Here are the 6 main characters and the actors who portrayed them in Four Rooms (1995):
Ted – Tim Roth
Angela – Sammi Davis
Sigfried – Madonna
Chester Rush – Quentin Tarantino
Norman – David Proval
Betty – Jennifer Beals
Correction: Angela - Jennifer Beals / Jezebel - Sammi Davis
Correction: Sigfried - David Proval / Elspeth - Madonna
Correction: Norman - Paul Calderon
Correction: Betty - Kathy Griffin
GPT4o:
Ted the Bellhop - Tim Roth
Betty - Madonna
Margaret - Marisa Tomei
Angela - Jennifer Beals
Siegfried - David Proval
Chester - Quentin Tarantino
PS - If broad knowledge isn't as important as STEM knowledge, then why did DeepSeek retain broad Chinese knowledge (e.g. a Chinese SimpleQA score that's higher than all others, including GPT4o's), while its broad English knowledge is far lower than GPT4o's (e.g. an English SimpleQA of only ~20) and way out of proportion with its English MMLU score, which is now even higher than GPT4o's? It's clear that DeepSeek genuinely cares about general Chinese performance, and not just Chinese test scores like the Chinese MMLU, which is admirable. But I find their selective overtraining on English tokens that boost test scores both dishonest and disrespectful (e.g. a higher English MMLU-Pro than GPT4o, but a far lower English SimpleQA), especially since they didn't do the same with Chinese data.
The current meta is overtraining on synthetic data for benchmaxxing.
@Shinku I'm afraid you're right. Mistral Small recently dropped from a SimpleQA of ~14 to 10.7 after v2409, accompanied by a notable boost in its code, math, and STEM scores.
And as previously mentioned, Qwen did this when they switched from v2 to 2.5, dropping to a SimpleQA score of just 10.1 despite having 72b parameters. The same goes for Phi4 14b, which dropped to only 3, versus 7 for Phi3. And now DeepSeek is doing the same.
And the sad thing is they aren't even a little more intelligent than before, even at coding and math, tripping over the simplest trick questions that wouldn't fool a 70 IQ human. They're simply training on so much coding and math data that they're statistically more likely to regurgitate a reasonable pre-packaged response that correlates with the prompt. They aren't thinking through over 100 lines of code to produce something that works on the first try. Even humans can't do that. They're just regurgitating nearest matches.
I fear the days of open source general purpose AI models are coming to an end. We're going to be left with nothing but coding and math agents masquerading as AI models.