This Mistral Small has FAR less knowledge than the last.

#5
by phil111 - opened

Mistral Small 2409 scored relatively high on broad English knowledge tests like English SimpleQA, and scored a respectable 82.6/100 on my broad English knowledge test.

However, v2501 is scoring notably worse.

This would be understandable if you decided to focus less on English, and more on the performance of multiple languages. However, your English MMLU scores went way up, so with v2501 you selectively trained on the tiny subset of popular English knowledge that overlaps the MMLU.

Qwen2.5 did the same thing earlier. That is, Qwen2.5 72b (and especially Qwen2.5 34b) has a high English MMLU score, yet a very low English SimpleQA score for its size (~8), and its score on my broad English test dropped from 85.9 to 68.4 (Q2 72b to Q2.5 72b). Mistral Small v2501 is seeing a similar drop from v2409.

Mistral AI_ org

Thanks for the feedback, can you post your knowledge test here or upload it somewhere so that we can test it?

@patrickvonplaten Generally speaking pop culture information (TV shows, movies, music, games, sports...) is more scrambled and weakly held in v2501 vs v2409, including at temp 0.

For example, "Which two actresses played the two ex-wives of Alan Harper in Two and a Half Men?"

2501 "Melanie Lynskey played Rose, the first ex-wife of Alan Harper.
Jenny McCarthy played Judy, the second ex-wife of Alan Harper."

2409 "The two actresses who played the ex-wives of Alan Harper in "Two and a Half Men" are:
Judith Harper (Alan's first wife) - Played by Marin Hinkle
Kandi (Alan's second wife) - Played by April Bowlby"

And when told to just list the main cast of the same show it made a basic error for a main character ("Judith Harper - Marcia Cross"). 2409 never made such errors.

And with progressively less popular shows the rate of errors (relative to v2409) progressively increased.

For example, when asked about the cast of the popular Canadian show Corner Gas it said the main character's name was "Brent Loney" versus Leroy, reliably mismatched the character and actor names ("Karen Loney - Gabrielle Miller"), and so on ("Hank Yule - Jim Cuddy" vs Hank Yarbo - Fred Ewanuick).

But one very notable improvement with the new Mistral Small is in instruction following. For example, when asked to end 8 sentences with the same word, v2501 ended all 8 with said word, while v2409 only ended 1 of 8 with the given word.

Lastly, unlike pop culture information, the MMLU STEM information is strongly held and is retrieved more fully and accurately in v2501 than in v2409.

An example: "In astronomy, what's the name of the hypothetical object that forms when a neutron star merges with a red supergiant star, potentially forming a black hole without a supernova explosion?"

Response (2501): "The hypothetical object you're referring to is called a Thorne–Żytkow object (TZO).", while 2409 simply identified it as a "collapsar"

I mean, those are very obscure questions about fictional characters in a TV show where all the characters have ex-wives and where actors share some roles and names. You can't expect small(ish) local models to be that accurate for something this incredibly specific.

Out of curiosity, I asked at Q5KL quantization, Tekken7 instruct, and temp 0, with a very generic assistant system prompt (you're a helpful assistant, don't hallucinate, don't guess information you're not sure about... that kind of stuff), and I got this instead:

In the television series "Two and a Half Men," the character Alan Harper has two ex-wives who are portrayed by different actresses. The first ex-wife, Judy, is played by Melanie Lynskey. The second ex-wife, Kandi, is played by April Bowlby. Both actresses contributed significantly to the character dynamics and comedic elements of the show.

Mistral team, please never prioritize knowledge of American sitcoms in a small model. I can not imagine a more useless application of resources. Great release. Thank you.

@phil111 thanks for providing this example - that's very interesting!

I think I can roughly reproduce your answer using vLLM. One thing that's important to understand with Mistral-Small-3 is that the chat completion template was changed to include a system prompt (and it was trained on following system prompts) => hence it's more sensitive to the system prompt (or the lack thereof).

To reproduce

You can spin up vLLM as follows:

vllm serve mistralai/Mistral-Small-24B-Instruct-2501 --tokenizer_mode mistral --config_format mistral --load_format mistral --tool-call-parser mistral --enable-auto-tool-choice

and then ping it.
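
To quickly check that the server is up before sending chat requests, you can hit the OpenAI-compatible models endpoint, e.g. (assuming vLLM's default port 8000 on localhost; adjust the host to your setup):

import requests

# Sanity check: list the models served by the vLLM OpenAI-compatible server
# (assumes the default port 8000 on localhost).
resp = requests.get("http://localhost:8000/v1/models")
print(resp.status_code, resp.json())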

General system prompt

Generally, make sure to always include a system prompt to get the best answers; I would recommend something like the system prompt below. Also, in general we recommend temperature=0.15. So for any general-purpose task you can use something like the following:

import requests
import json
import sys

url = "http://<your-node>/v1/chat/completions"
headers = {"Content-Type": "application/json", "Authorization": "Bearer token"}

model = "mistralai/Mistral-Small-24B-Instruct-2501"

system_prompt = """You are Mistral Small 3, a Large Language Model (LLM) created by Mistral AI, a French startup headquartered in Paris.
When you're not sure about some information, you say that you don't have the information and don't make up anything.
If the user's question is not clear, ambiguous, or does not provide enough context for you to accurately answer the question, you do not try to answer it right away and you rather ask the user to clarify their request (e.g. \"What are some good restaurants around me?\" => \"Where are you?\" or \"When is the next flight to Tokyo\" => \"Where do you travel from?\")"""

prompt = "Which two actresses played the two ex-wives of Alan Harper in Two and a Half Men?"

messages = [
    {
        "role": "system",
        "content": system_prompt,
    },
    {
        "role": "user",
        "content": prompt,
    },
]

data = {"model": model, "messages": messages, "temperature": 0.15}

response = requests.post(url, headers=headers, data=json.dumps(data))
print(response.json()["choices"][0]["message"]["content"])

Here the model gives me:

In the television series "Two and a Half Men," the character Alan Harper has two ex-wives. The actresses who played these roles are:

1. **Judith Harper** (Alan's first wife) was played by **Patricia Heaton**.
2. **Kandi Harper** (Alan's second wife) was played by **April Bowlby**.

These actresses portrayed the ex-wives in different seasons of the show.

which is still wrong (Patricia Heaton is not correct), but it's much closer I'd argue.

Tailored system prompt

Now, if you have a specific use case, such as "movie" knowledge, you can tailor the model much more for this by:

  • a) changing the system prompt
  • b) setting temp = 0.0

E.g. the following:

system_prompt = """You are Movie-GPT and are very knowledgeable about any movie and TV Show.

When you're sure about some information, first think step-by-step and give relevant context before answering.
When you're not sure about some information, you say that you don't have the information and don't make up anything."""

with temperature=0.0 gives the correct answer:

To determine which two actresses played the two ex-wives of Alan Harper in the TV show "Two and a Half Men," let's break it down step-by-step:

1. **Identify Alan Harper's Ex-Wives**: In "Two and a Half Men," Alan Harper has two ex-wives. The first is Judy, and the second is Kandi.

2. **Find the Actresses**: We need to identify the actresses who portrayed these characters.

   - **Judy Harper**: Judy is Alan's first ex-wife and the mother of his son, Jake. She is portrayed by **Marin Hinkle**.

   - **Kandi Harper**: Kandi is Alan's second ex-wife. She is portrayed by **April Bowlby**.

Therefore, the two actresses who played the two ex-wives of Alan Harper in "Two and a Half Men" are **Marin Hinkle** (Judy Harper) and **April Bowlby** (Kandi Harper).

=> Obviously we can't keep adapting the system prompt to a task until the model gives the correct solution, but I'm showing it here to demonstrate how the system prompt can influence the model (in some sense the knowledge is still "stored" somewhere).
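
For reference, reproducing this only requires swapping the system_prompt string and the temperature in the request script above (reusing the same url, headers, model, and prompt variables), roughly:

messages = [
    {
        "role": "system",
        "content": system_prompt,  # the Movie-GPT system prompt above
    },
    {
        "role": "user",
        "content": prompt,
    },
]

# Same request as before, but with the temperature pinned to 0.0
data = {"model": model, "messages": messages, "temperature": 0.0}

response = requests.post(url, headers=headers, data=json.dumps(data))
print(response.json()["choices"][0]["message"]["content"])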

Will give it a try now with mistral-small-2409 (would honestly be surprised if mistral-small-2409 confidently gives the correct answer here).

Update: Interesting, you're right @phil111 - mistral-small-2409 always confidently gives the correct answer here! That's interesting!
Does your test set only contain "Movie" knowledge or does it also include other topics / areas of "knowledge"? Would be very interesting to see if this also happens for other "niche" knowledge topics.

I am curious whether this holds as true for the base model, though we lack an open base model for 2409 to compare against. (That would help test whether this is due to forgetting in the instruction-tuning pipeline specifically.)

@SerialKicked But they aren't obscure questions. I only ask about the most popular of pop culture that countless millions know the answers to, and L3.1 70b scored ~90/100, while Sonnet 3.5 & GPT4o got a near-perfect score. It's a vastly easier test than the English SimpleQA.

More importantly, small models like L3.1 8b and Gemma 2 9b scored around 70/100, and the last Mistral Small scored 82.6, so models of this size already proved that they can know this information.

At the very least, far more people know and care about my pop culture questions than the esoteric and theoretical STEM questions this Mistral Small got right (e.g. the Thorne–Żytkow object, TZO).

There have only been dozens of shows as popular as Two and a Half Men, which ran for 12 seasons, so there's no excuse for an AI model, especially a 22b one, to hallucinate like mad at temp 0 when asked basic questions about humanity's most popular information (pop culture). And they didn't use to: Qwen2 72b scored 85.9/100, Mistral Small 2409 82.6, Mixtral 8x7b 86.7, and so on. It's only after boosting select test and task scores (e.g. MMLU and coding) that the general knowledge of LLMs started to tank relative to their previous releases. Please stop overfitting.

Please stop overfitting

I think this shows that some information about popular culture is difficult to capture; that is, this information is often not presented in a way that passes the corpus screening stage, and it is therefore filtered out at the more sophisticated corpus design stage.
This filtering reflects the staff's view of what counts as a "good corpus". While it improves model capabilities, it inevitably introduces bias and negative side effects. This phenomenon always exists, and it is exacerbated as training corpus design is given more weight.
This is completely different from overfitting.

To clarify, I take issue just with this particular question. I don't know the rest of your test (I just hope it's more diverse than that). It's about a show where many characters are divorced, remarried, and so on, and the actor and character names are similar and easy to mix up. Even the Wikipedia page (most likely where the model took its information from during training) is a barely legible mess of unsorted information. It's a miracle a human with no prior knowledge of the show could retrieve the information accurately, let alone a personal-use language model.

It's a lot harder question to answer than you seem to think it is. That's why, even if the model can respond correctly as demonstrated by patrick, any change in inference settings or prompting method changes the LLM's response dramatically in this case (while if you ask what's the capital of France, you can change the prompt and inference settings to your heart's content and still get the correct answer).

Point is, for this kind of specialized knowledge (and it is specialized: independently of how popular a show is, knowing who played whom and when is niche), using or making a dedicated knowledge-specific fine-tune sounds like the proper way to go about it.

In the very least, far more people know and care about my pop culture questions than the esoteric and theoretical STEM questions this Mistral Small got right (e.g. Thorne–Żytkow object TZO)

The average person is not running and finetuning open-source models locally, they use ChatGPT instead.

@jth01 Sure, when space is limited prioritizing high value information is reasonable. But above 7 billion parameters there's room for both academic (MMLU) and popular knowledge (pop culture), as proven by Llama 3.1 8b & Gemma 2 9b.

I believe the broad adoption of open source AI models won't happen if the community de-prioritizes, or even excludes, humanity's most popular information in favor of preferred areas of knowledge. Believe it or not, knowing the main cast of a very popular TV show that ran for 12 years is far more broadly valuable, regardless of the AI model's size, than knowing about esoteric theoretical astronomical objects like Thorne–Żytkow objects (TZOs).

Also, once you start picking and choosing what to include, very bad things reliably start to happen. For example, Chinese models start excluding very popular facts that the CCP opposes, and Western models like Phi refuse to include salty language, impolite jokes at the expense of others... which nearly every adult regularly shares.

All of humanity's most popular knowledge simply needs to be included and trained on equally, even if it means an MMLU score that's 10 points lower. The elitist attitude that basic information about extremely popular shows that ran for 12 years shouldn't be included in a 22-billion-parameter model so that it can do things like math a little better, yet still far too unreliably to be trusted, is insane to me.

phil111 changed discussion status to closed

@phil111 I'd consider reopening the discussion. You've got the attention of the Mistral team now. And it may be too late to fix this regression in this model, but not too late to fully understand the issue and let others report more findings. I think it actually matters here and could help inform future pretraining runs. Besides, despite custom system instructions the issue still isn't resolved, so your point still stands.

@nlpguy The issue seems to be the scrambling of general knowledge as models are continuously trained to perform better at tasks like coding and math. The final score of Mistral Small v2501 is 75.4, compared to 82.6 for v2409.

While a notable drop, that still leaves it tied for first (in general English knowledge) with the latest Command R 34b (75.1) in the 34b-and-under size range, followed by Gemma 2 27b at 71.0. And Chinese models like Yi 1.5 34b & Qwen2.5 34b, despite their high English MMLU scores, score far lower, at 55.5 or below (Yi 1.5).

Also, the drop from Qwen2 to Qwen2.5 (85.9 to 68.4), along with a SimpleQA score of only ~8/100 despite its ample 72b parameters, was a much larger, and unforgivable, regression in general English knowledge relative to English MMLU score.

Lastly, since Mistral asked, the regression applies across the board with pop culture, but kicks in later with movies vs TV shows. For example, the Fifth Element cast ("Luke Perry as Butcher" vs the previously correct Billy). And the least popular movie tested (Four Rooms) produces far more cast errors, including listing random actors like Johnny Depp (scoring 1.32/4 vs 2.68/4 on said question). And with regard to music, song-to-artist linking is less accurate, as is recalling the main lyrics to very popular songs like Madonna's Like a Virgin, which makes identifying a song you heard, or that's stuck in your head, far less reliable with the latest Mistral Small.

What bugs me with those tests (not yours specifically, it's more a general thing in the community), is that they're always done for a specific seed, at a specific inference setting, with a specific prompt (and sometimes expect a specific formatting of the response too). It feels more like cherry-picking than comprehensive testing.

@SerialKicked The open source AI community, in their rush to catch up with the large proprietary models, started to overwhelmingly prioritize the small subset of humanity's popular knowledge covered by the MMLU, then bragged about having comparable MMLU scores to models like Sonnet 3.5 & GPT4o.

However, most of said open source models are empty shells by comparison, having orders of magnitude less broad knowledge (e.g. far lower English SimpleQA scores than models like GPT4o despite comparable English MMLU scores). And this isn't just about having much greater ignorance. The use of a highly filtered corpus means there are far fewer paths to the desired information, so prompt phrasing, spelling/grammar errors, system prompts, and so on reduce the quality of the responses far more drastically than with models like GPT4o.

Anyways, that's the primary reason I run my test (to identify LLMs that overfit the English MMLU but are otherwise profoundly ignorant of popular English knowledge). I couldn't give a rat's ass about pop culture. I never watch sports and only saw one movie in the last decade, and I was dragged to it. The test uses a generic system prompt, temp 0, and simple unambiguous questions about only the top 10% most popular knowledge; it requires full recollection (versus just needing to pick the answer out of a lineup like with multiple choice), and partially correct answers are given fractional points to more accurately determine an LLM's horizon of knowledge (e.g. points for a first or last name, less if misspelled).
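
To illustrate the fractional-credit idea, here's a toy sketch (the function and weights are purely illustrative, not my actual rubric):

# Toy sketch of fractional credit for a single cast-recall answer;
# the scoring here is illustrative only, not the actual test rubric.
def score_name(expected: str, answer: str) -> float:
    expected_parts = expected.lower().split()
    answer_words = answer.lower().split()
    hits = sum(1 for part in expected_parts if part in answer_words)
    # e.g. 0.5 credit for getting only the first or last name right
    return hits / len(expected_parts)

print(score_name("Marin Hinkle", "Judith was played by Marin Hinkle"))    # 1.0
print(score_name("Marin Hinkle", "Judith was played by Marin Somebody"))  # 0.5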

Thankfully someone finally released a non-multiple choice general knowledge test (SimpleQA) to help keep MMLU overfitting in check. It was getting ridiculous, especially with the Chinese models like InternLM, Yi and Qwen.

@phil111
I have already had a discussion with you about SimpleQA. Unfortunately, you still regard SimpleQA scores and the ability to recite the relationships between movie characters as a criterion of ignorance, which means that you have completely missed the problem I raised about SimpleQA (https://huggingface.co/microsoft/phi-4/discussions/9#678028c513df6a1ad5f6c336).
Since you don't understand, I might as well be more direct: SimpleQA scores are just as stupid and biased an ability test as your movie-character questions. Your "overfitting criticism" of phi or mistral is nothing more than a shift from "phi team bias" and "mistral bias" to "phil111 bias", and a shift to "phil111 overfitting" or "wikipedia overfitting".

I understand your point of view. As I said, I was taking issue with the question itself, and maybe the slightly aggressive tone toward scientific knowledge (at least I perceived it as such), not the overall idea.

The thing with temp 0 is that you're still gonna get different responses depending on minor seed / sys.prompt (and even hardware) variations for things the LLM is supposed to "know" better. And if you use a quantized version, a bunch of paths (possibly toward the right answer) get squashed too.

Out of curiosity, did you occasionally try running your Q&A with slightly different inference settings or prompts to see if the results are roughly consistent? Not necessarily in terms of individual responses, but in terms of the overall success rate. I do get your point about the models having "fewer paths toward" the correct answer, but double-checking that your settings aren't just "unlucky" with a particular model might lead to a different conclusion. I'd be kinda curious about the result.

And, yes, I do agree that MMLU results (and so many other tests) are, at this point, just marketing, and possibly detrimental. tbh, nowadays I mostly check IFEval to sort out compliant models from the rest before I test them for my needs. I don't really use models as knowledge bases, so it matters less to me than a model's ability to do what my code asks it to.

@noneUsername Again, factual hallucinations (not mathematical or coding) remain the number one complaint the general population has with AI, and examples of Apple AI or Google AI producing results saying patently false things keep going viral on social media. It's very important that AI strives for more factual accuracy when there's no ambiguity, such as who sang this song, or played this character in a movie, and especially when it comes to things like not recharging phones in the microwave.

And I'll reiterate, it's not my arbitrary Phil'sTest. It primarily covers pop culture because none of the MMLU categories cover it, yet the large majority of people care more about it than most of the categories covered by the MMLU (e.g. Virology), which is why it's called pop(ular) culture. In short, it's the perfect way to determine whether or not an LLM overfit the English MMLU by de-prioritizing far more (pop)ular English knowledge.

@SerialKicked Yes, I obsessively played around with settings like temp, min p... and settled on 3 standard settings with the highest factual accuracy on my test (temp 0, 0.3, and 0.5), with temp 0 reliably having greater factual accuracy, fewer story contradictions, and so on. Plus when using LLMs online (e.g. LMsys) the results are nearly identical. I also tried varying system prompts, and they basically just trade subtle hallucinations and always end up with about the same overall score, and usually the longer the system prompt, the lower the score. My theory is that the more information you feed the LLM, the more you distract it from what you're requesting, resulting not only in reduced factual accuracy, but also in more errors like contradictions when doing complex tasks like writing stories.

I also take the time to randomly ask questions about other comparably popular movies, shows, music, games... that aren't on the test, and the rate of hallucinations is never surprising. It really isn't as arbitrary as you might think. By the end of the test an LLM's horizon of pop culture ignorance, and how fast it drops off, becomes perfectly clear.

Update: Sorry about posting again, but I confirmed on LMsys that the new Mistral Small hallucinates notably more than the old one, and it isn't a subtle difference.

In fact, despite being the full float version it hallucinates more on LMsys than on my local machine using Q4_K_M, which isn't surprising since LMsys uses a high temperature.

Anyways, this is just friendly constructive criticism. I'm a big fan of Mistral. But something you did with this Mistral Small, but not the last one, drastically scrambled the weights near the model's horizon of knowledge, resulting in an unusually sharp and odd increase in hallucinations.

For example, there's a pattern of getting last names wrong (e.g. Brent Lyle & Wanda Dingle vs Leroy & Dollard) despite tons of contextual information in the prompt, including the show's name, and having just output the correct first name, so the next last-name token should only rarely have close statistical competitors, yet even a temperature of 0 doesn't drastically reduce said errors.

LMSys response when asked for the main cast of Corner Gas:

Brent Lyle - Brendan McMurtry-Hayes
Wanda Dingle - Lori Marchand
Oscar Lyle - Eric Peterson
Karen Pelly - Janette MacKay
Davis Quinton - Jim Bryson
Emma Lyle - Megan Follows

@jth01 Yeah, I admit Corner Gas is a little fringe, but it ran for 6 seasons, and during every season it was the #1 primetime sitcom in Canada. Plus it reached well above average popularity in other countries like the US.

But the point is it's popular enough to warrant having smaller LLMs know the basics about it (the plot and main cast), and some ~8b models do. More importantly, the last Mistral Small got it nearly perfect (pasted below). And this isn't a cherry-picked example. As you approach the limits of the new Mistral Small's horizon of knowledge the hallucinations spike drastically compared to the previous version.

Brent Leroy - Brent Butt
Lacey Burrows - Gabrielle Miller
Hank Yarbo - Fred Ewanuick
Davis Quinton - Eric Peterson
Emma Leroy - Janet Wright
Oscar Leroy - Lorne Cardinal
