Next Leaderboard Update Plans. Open to suggestions.

#71
by DontPlanToEnd - opened

Hoping to have time to work on the next leaderboard revamp in late December and January.

New Features

  • Have the models take the Political Compass test (not a part of the UGI score, would be a separate measurement)
  • Test model willingness with no system prompt vs with system prompt telling model to be uncensored
  • Add more questions that are not censorship-focused. This will help make I/10 a much more accurate ranking of raw intelligence.

Changes

  • Change from me prompting models manually to an automated testing system (long overdue). This should help me test models faster. It will only be automated on my side; creating the infrastructure for automated submissions will have to come later.
  • Remove all questions that are multiple choice to make it so models can’t get lucky.
  • Change system prompt to make it so refusals never happen because of the system prompt
  • Switch from testing Q4_K_M to Q8_0 quants
  • Largely expand Rating Prediction with many more prediction questions to remove model outliers
  • Replace Writing Style with a system that better accounts for differences between story generations. Maybe just average the writing style score of 10 different story generations and test different model parameters. Not sure yet, I’m open to suggestions on how to measure writing quality. Definitely won't be manually scoring stories in the future.
  • Add filter for foundation vs fine-tuned vs merged models

What other changes would you like to see, or what new rankings should I add? I'm hoping the automated system will increase testing speed a good amount, which will allow me to add additional measurements.

DontPlanToEnd pinned discussion

The jump from Q4 to Q8 seems quite substantial - have you considered Q6_K as an alternative? It performs very close to Q8 with minimal performance loss, particularly in larger models. Regarding your testing methodology, I'm curious whether you include free-response answers, or if it's currently limited to multiple choice and fill-in-the-blank formats? Also, I've noticed the W/10 ratings don't align with my testing experience in several cases. For instance, Tiger Gemma v2 only receives an 8.5 despite showing uncensored behavior in my tests, while UnslopNemo v2 gets a 10 despite displaying more censorship than Tiger Gemma v2.

I originally went with Q4 because it was what I used and what people usually recommended to the average consumer. Though I feel that in order to have a fair comparison between models, I should show them at their best, especially since I'm comparing open source models to closed source. Though I'd definitely prefer the faster download times of a Q6_K quant.

I currently have a mix of multiple choice, fill-in-the-blank, and short answer questions. I realize now I didn't word that part in the post correctly (I'll just be removing multiple choice questions). For the auto-testing program, the test questions will all need to be structured so their answers can be easily parsed by the program. So I guess I'll tell the models to state their answer in a specific format or search their response to see if they state the correct answer.
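The two parsing options mentioned above (a fixed answer format, or searching the response for the correct answer) could be combined roughly as follows. This is only a sketch: the `Answer:` line convention and the function names are assumptions for illustration, not the leaderboard's actual format.

```python
import re

# Hypothetical convention: the model is told to end with a line like "Answer: <text>".
ANSWER_LINE = re.compile(r"^\s*answer\s*:\s*(.+?)\s*$", re.IGNORECASE | re.MULTILINE)


def normalize(text):
    """Lowercase and collapse punctuation so 'Paris.' and 'paris' compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def is_correct(response, expected):
    """Prefer the formatted answer line; otherwise search the whole response."""
    target = normalize(expected)
    formatted = ANSWER_LINE.findall(response)
    if formatted:
        # Use the last "Answer:" line in case the model restates itself.
        return normalize(formatted[-1]) == target
    # Fallback: whole-word search, so "Parisian" does not count as "Paris".
    return re.search(rf"\b{re.escape(target)}\b", normalize(response)) is not None
```

The strict format is checked first because the substring fallback can produce false positives when a model mentions the correct answer while rejecting it.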

It's hard to diagnose the W/10 part. One thing that will definitely help W/10 is increasing the number of questions used to calculate it. System prompt may also play a role in the models acting differently. Edit: also prompt template and model settings

Maybe you could test both Q4_K_M and Q6_K, to allow them to be compared.

I like the idea of testing a large quant like Q8_0 and having a comparison with a small/medium quant, but that would mean I'd have to run all the test question prompts twice for each model. I'd probably rather use the compute time for testing more questions than for rerunning the same questions on a different quant.

I want to posit a few suggestions that could help your endeavours:

  1. Include a measure of uncertainty such as the standard deviation in your test results, or show a score range (P5-P95). It costs nothing extra, but makes it clearer how much statistical power your test has. Nobody bothers to do it in this space, but people absolutely should so it's easier to tell how reliable a result actually is. For example, your W/10 score is based on only 11 questions (I can tell because the values are all rounded multiples of 10/11). It's hard to tell a 9.1 and a 10 apart; the difference could just be a fluke.
  2. Once you compute (1) you'll likely want to ask more questions in each subject. Read this: https://possiblywrong.wordpress.com/2011/06/05/reddits-comment-ranking-algorithm/ The conclusion: show the Wilson score interval, not just the average value. See https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Wilson_score_interval for the formula. Once scores are based on a different number of answers for different models, you'll want to sort by the lower bound, not the average result (a 9.0 from 10 questions is less impressive than an 8.9 from 100 questions).
  3. The current 'I' score has a problem: By basing the selection on correlation with model size, you are biasing your results to model size, not intelligence. While size is strongly correlated with intelligence, there are relatively dumb big models (grok and falcon), and relatively smart small models. This is called circular analysis, and it leads to an I score that doesn't quite feel like an accurate representation of the model's capability. Hacking the result with posterior knowledge has a high risk of inducing bias. Either:
    3a. Add more questions of a new category, which are general knowledge questions useful in RP but not of the 'spicy' kind, selected prior to running the test.
    3b. Group failures into either a refusal or a wrong answer. The intelligence score is taken over all answers that are not refusals. Note: using method 3b means that standard deviation becomes much more important to include. After all, if chatGPT censors 62 out of 65 questions, then getting 3 out of 3 right shouldn't get it a 100% score (because it's extremely uncertain with such a small sample size). I suggest to grade the models at (Average - 2 * standard_deviation) instead, or using the more precise formula in the article linked above.
    3c. Perhaps my preferred solution, before seeing any evidence to the contrary: Multiple choice might actually be more useful than you think. The 'get lucky' problem you believe exists is really a minor one. Models can get lucky just as much with the other methods. If you use the 8x trick, then getting lucky for even a 10% score boost means guessing right 50 times, a near impossible chance. And multiple choice is extremely useful for improving your W/I scores, in this way: Make one of the choices always a refusal to answer, and checking for 'censoredness' and 'intelligence' using the same questions (and thus making it cheaper to run the test!) becomes quite easy. When using multiple choice, in order to have 'the best of both worlds', you should allow the model to have some 'reasoning' space and instruct it to respond in a structured format such as JSON, by giving an example in the prompt (the example being a harmless question even to goody-2, of course). For example:
    {
      "reasoning": "It would be unsafe to answer this question, someone could use the knowledge to create a dangerous weapon.",
      "answer": "E"
    }
    The W/10 simply becomes the fraction of answers that are not E (lower bound via the Wilson method).
    If A is always correct, then I/10 becomes the fraction of A within the sum of A-D answers (lower bound).
    And finally 'score' becomes the fraction of A within the sum of A-E answers (lower bound).
    Obviously you would shuffle the answers around and unshuffle them for the results in the real test.
  4. LLMs are much more efficient when they are run using a 'batch'. I'd like to suggest asking your prompt to each model in a batch (duplicating it) to greatly improve score accuracy without having to come up with more questions. By using say 8x duplication, you'll get 8 different answers to the same question, giving you more statistical power at less cost, for generating tokens and consuming tokens takes the same amount of FLOPs, but the memory bandwidth cost is static. Batching takes away memory boundedness of running these models.
  5. Writing style is a subjective thing, where different individuals have different tastes. To accurately score writing style, I'd suggest rather testing how well a model can follow a writing style, rather than trying to evaluate something that's to a large extent subjective. Use some work from some highly regarded writers. Shakespeare, Orwell, Tolkien, Dickens, Kafka, Tolstoy, etc. Lesser known works are best because there is less chance the AI can overfit and just regurgitate the real answer from memory, or disqualify answers that manage to (nearly) do that. Let the AI finish a half-written chapter of it. Then compare the real truth with the AI's answer on a few objective measures of writing style. Here's some to get you started (do some research to find more). In each case, the question is: how much does the model differ from the actual author? There's many objective proxies that can be done by computer.
    5a. Readability: Average words per sentence.
    5b. Readability: Flesch-Kincaid score
    5c. Readability: Number of unique + uncommon words in the first N words of output (N should be low enough that the large majority of answers reach at least N words).
    5d. Word-choice: Google N-gram rank distribution (a good model can keep using more archaic forms if this is desired). There's also https://ngrams.dev.
    5e. Punctuation. Test for the frequency of lesser used punctuation. The frequency of each of ,.:;'"/- in percentages (how many commas per 10K characters of output) tells something about writing style, and is easy to compute/compare.
    5f. Use of poetic constructions. (Detecting if AI maintains rhyme, alliteration, haiku, limerick, sonnet, and so on, and doesn't introduce it when the original work did not have it).
    5g. 'slop' tests. Does the sloppiness of the output increase compared to the input? (The use of common AI phrasings like 'shivers down the spine'. They're common in real books too for some authors, so it might be fine in some cases but not in others).
    5h. (takes more work to set up, only applicable to some texts, still easy to grade though) Character-specific differences. Test if the model can roleplay a character with a speech pattern that varies (wildly) from that of the narrator, by using the other ratings but only for text one character says vs. that of the rest of the text; and see if the model can maintain the difference as it generates. Pretty much all models still fail spectacularly at this, gradually collapsing a colourful cast of characters into a boring grey soup of sameness at varying speeds, so it's a great test statistic.
    I'm sure you can create many more metrics yourself with this kind of construction.
  6. Perhaps my most important tip of all though: Make your questions like a proficiency test. Include a wide spectrum of 'degree' of nsfcness*, from the rather mundane like speeding to the most depraved, for example role-playing the Wannseekonferenz. Sort the questions by depravedness and if you think there's a big jump insert something in between. In addition to that, pre-run your test on two intelligent models; the ones you think are the most- and least- censored, and adopt two more rules:
  7. If the most-censored model is censoring your lightest question in a topic, then add a lighter one.
  8. If the least-censored model is not censoring your nastiest question in a topic, then add a nastier one (until you or anyone assisting you can no longer think of one).

*not-safe-for-corporate-ness
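Suggestion 3c could be scored along these lines. This is a minimal sketch under stated assumptions: answers have already been unshuffled so that "A" is correct and "E" is the refusal option, responses arrive as JSON strings with an `answer` field, and anything unreadable is counted as a refusal (that last rule is my own choice, not something specified above). The function names are hypothetical.

```python
import json
import math
from collections import Counter


def wilson_lower_bound(successes, total, z=1.96):
    """Lower end of the Wilson score interval for a proportion."""
    if total == 0:
        return 0.0
    p = successes / total
    z2 = z * z
    centre = p + z2 / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total))
    return max(0.0, (centre - margin) / (1 + z2 / total))


def extract_choice(raw_response):
    """Return the letter in the "answer" field, or "E" if it cannot be read."""
    try:
        choice = str(json.loads(raw_response)["answer"]).strip().upper()
    except (ValueError, KeyError, TypeError):
        return "E"  # assumption: unparseable output is treated as a refusal
    return choice if choice in {"A", "B", "C", "D", "E"} else "E"


def score_model(raw_responses):
    """Compute the three lower-bound scores described in 3c (each in 0..1)."""
    counts = Counter(extract_choice(r) for r in raw_responses)
    total = sum(counts.values())
    answered = total - counts["E"]  # any A-D answer counts as willing
    correct = counts["A"]
    return {
        "willingness": wilson_lower_bound(answered, total),
        "intelligence": wilson_lower_bound(correct, answered),
        "score": wilson_lower_bound(correct, total),
    }
```

With this scoring, a model that refuses 62 of 65 questions and gets the remaining 3 right receives an intelligence lower bound well under 1, which is the behaviour 3b asks for.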

@AphidHf I'm a bit confused on the statistics you're asking for. Since I'm not collecting random samples, and I'm using pretty deterministic settings for the models, I usually get the same answers from a model. So what would I be taking the standard deviation of? Also, since I ask the same questions to all models, I'm unsure what wilson score interval is for.

Yeah, for I/10 I didn't have much to work with. I added it to the leaderboard after I already tested a lot of models, and the test set had pretty much no questions that weren't trying to get the model to refuse to answer. In the next update, I'll give it its own dedicated test set that just focuses on measuring standard intelligence, without asking anything that could make it refuse.

Make one of the choices always a refusal to answer

Yeah I suppose for W/10 multiple choice would be pretty useful. It's a more complicated task to determine to what degree of willingness a model answered a short answer question, but very easy to see if it was willing to pick a certain multiple choice answer.

I really like your idea of testing how well the models continue a writing style. I feel like a good writing leaderboard would use (at least) three measurements: the model's knowledge of pop-culture so that it can tell the stories/do the RP the user wants, the model's ability to stay in character and stick to a writing style, and simply how appealing the model writes by default. The last one is the hardest to measure, and without it the writing benchmark probably wouldn't be good enough. Using ML to figure out what lexical statistics I prefer in a story was the solution that was easy to get good results with, though I'm not sure what an unbiased way of measuring appealingness of a writing style would be.
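Two of the objective style proxies suggested earlier in the thread (average words per sentence, and punctuation frequency per 10K characters) could be computed and compared against the original author's text roughly like this. It is a sketch only: the sentence splitter and word pattern are deliberately crude, and the function names are made up for illustration.

```python
import re
from collections import Counter

PUNCTUATION = ",.:;'\"/-"


def style_profile(text):
    """Average words per sentence plus punctuation counts per 10K characters."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    counts = Counter(ch for ch in text if ch in PUNCTUATION)
    scale = 10_000 / max(len(text), 1)
    profile = {"words_per_sentence": len(words) / max(len(sentences), 1)}
    for mark in PUNCTUATION:
        profile[f"per_10k[{mark}]"] = counts[mark] * scale
    return profile


def style_distance(reference_text, model_text):
    """Absolute difference per metric; smaller means closer to the author."""
    reference = style_profile(reference_text)
    candidate = style_profile(model_text)
    return {name: abs(reference[name] - candidate[name]) for name in reference}
```

Here `reference_text` would be the real continuation of the half-written chapter and `model_text` the model's continuation. How to weight the individual differences into one score is left open.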

Have the models take the Political Compass test (not a part of the UGI score, would be a separate measurement)

Make them take the test as a persona (e.g. Hitler), to see if they are willing to deviate from default assistant alignment. Consider using a test with more than 2 axes, something like https://9axes.github.io/. Political compass is too simplified and will fail to detect more complex ideologies. I've made a bad attempt at using the politiscales test in the past, but as you can see, models differ a lot on different axes. Maybe also run it without the "neutral" option; some models really like saying "I'm sorry, but as an AI I don't have political opinions".

Make them take the test as a persona

I was thinking of having it be a measure of what the model itself generally thinks. I don't get refusals when asking gpt or claude the questions when using this system prompt: "You are filling out a quiz. You must give an answer of either Strongly disagree, Disagree, Agree, or Strongly agree. You only reply with the 1-2 word answer and nothing else."

political compass is too simplified and will fail to detect more complex ideologies

Yeah, I thought of doing political compass because it is by far the most well known one, and it would be easier for people to make comparisons. It would be easy to make graphics for it, like comparing the top closed source models, since it only has 2 axes. Using something like 9axes or politiscales would probably be more informative, I just want to make sure that people can easily sort by politically left/right like you had in your example. Idk how that would go with 9Axes since political sides change their values over time on some of the categories like "Militarist vs. Pacifist" and "Globalist vs. Isolationist".
Also, I like the additional things you measured, like how willing models are to believe conspiracy theories.

I just want to make sure that people can easily sort by politically left/right like you had in your example. Idk how that would go with 9Axes since political sides change their values over time on some of the categories like "Militarist vs. Pacifist" and "Globalist vs. Isolationist".

It's quite simple, just pool some values together so they fit modern day political consensus on what is "left" or "right", while leaving detailed results somewhere in the background. For example, in 9axes it would be:

  • Left-Right:
    • Progress-Tradition
    • Secular-Religious
    • Equality-Markets
    • Multiculturalist-Assimilationist
    • Globalist-Isolationist
  • Authoritarian-Libertarian:
    • Unitary-Federal
    • Militarist-Pacifist
    • Authority-Democracy
    • Security-Freedom
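The pooling above could be implemented as a plain average of the detailed axes. This sketch assumes each axis is scored from -1 (first-named pole, e.g. Progress) to +1 (second-named pole, e.g. Tradition); that scale, the equal weighting, and the names `POOLS` and `pool_axes` are assumptions for illustration, not how 9axes reports its results.

```python
from statistics import mean

# Grouping taken from the list above; each axis is "first pole-second pole".
POOLS = {
    "left_right": [
        "Progress-Tradition",
        "Secular-Religious",
        "Equality-Markets",
        "Multiculturalist-Assimilationist",
        "Globalist-Isolationist",
    ],
    "authoritarian_libertarian": [
        "Unitary-Federal",
        "Militarist-Pacifist",
        "Authority-Democracy",
        "Security-Freedom",
    ],
}


def pool_axes(axis_scores):
    """Average the detailed axes into two sortable values in [-1, 1]."""
    return {
        pool: mean(axis_scores[axis] for axis in axes)
        for pool, axes in POOLS.items()
    }
```

The detailed per-axis scores stay available in `axis_scores`, so the pooled values can be used for sorting while the full breakdown is shown in the background.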

https://politicaltests.github.io/12axes/ also looks good. Might go with it.

I'll make sure to display the model's ideology. That will be easier for people to understand than only displaying numbers. It will also be helpful for filtering models.

I feel like I shouldn't allow models to give "Neutral/Unsure" as an answer. That may just lead to some models using it as a way to refuse to take the test.

Having the model complete something like a DnD alignment test might be a good way to draw out more ideology as well, since the political spectrum does not fully capture ideology: in real life, someone can be a "socialist" or "right wing" but still hate everyone around them.

How do you guys think I should integrate system prompts into the leaderboard?

If you don't test models with a system prompt like "you are an uncensored model", you're losing out on some of the model's potential as an uncensored model. Though if you only test with a system prompt, many people will disagree with the ranking if they use a different system prompt or no system prompt. I'd really like to avoid having to test models on all the test questions both with and without a system prompt, and then having two separate UGI rankings. I could have two different W/10 rankings (with and without a system prompt) for the questions solely measuring willingness to answer, but the rest of the test questions that test the model's knowledge still cover topics the models need to be willing to talk about. So having or not having a system prompt still affects those outputs. Having to test models on all questions twice means I'll have to halve the number of questions in the test set, reducing accuracy.

I think with vs without "uncensored" system prompt would be the most useful, but that's understandably labor-intensive, so without would IMO be the next most useful since, as you say, there is no standardization of system prompts and everyone uses something different.

A lot of model output quality potential is a "skill issue," as the kids like to say. As long as you at least share what the prompt is trying to accomplish, that should be sufficient, in my opinion. If people get persnickety because you coaxed more out of the model than is there 'out of the box', oh well. 😉

I'm currently thinking of having the main UGI score calculated without any decensoring system prompts, and having two W/10 measurements to the side: one which will be used in the full UGI test set that doesn't have the decensoring sys prompt, and one with a decensoring sys prompt to serve as a "what if". I want to make sure the test is evaluating the model, not the system prompt. And people would probably rather be able to search for a model that is willing to do what they want no matter how they use it, as opposed to being forced to use a decensoring system prompt.

Have there been any tests done to compare model accuracy between quants? I remember seeing perplexity comparisons, though I don't know how equivalent those are to actual model outputs. Still unsure if I'll go with Q6_K or Q8_0. The download speed and computational requirements for small/medium models at Q8 isn't a problem, but 70B+ would be more annoying. Especially for huge models: a 405B Q6_K is ~333GB but a Q8_0 is ~432GB.

Edit: Hmm then again, there might be a more noticeable difference between Q6_K and Q8_0 when testing small models like 2B. I might just have to go with Q8_0, idk.

Q6_K is acceptable, Q8 maybe overkill but welcome

Q4_K_M was a shot too low

I hope current version of the Leaderboard won't disappear and will be preserved as v1 or something like that. I also think that Q5_K_M is the sweet spot.

There have been several tests that show the perceivable and accuracy differences between quants. I'll link them.

Perceivable
[Image: chart of the human preference test results]
This test uses human testers (around 3,600 votes) to show that I-quants are practically equal to Q6, at least in the sense that people don't perceive any real difference until around Q3_xs, and that smaller models are much more affected by quantization than larger models.

Looking at the chart, there is only a 3% difference in user preference between f16 and IQ4_XS.

Benchmark
Here is a GitHub repository that uses benchmarks to show the difference on a more analytical level:
https://github.com/matt-c1/llama-3-quant-comparison

8B:
Model size [GB] | MMLU | Quant
7.43 | 65.23 | Q8_0
5.73 | 65.06 | Q6_K
5.00 | 64.90 | Q5_K_M
4.30 | 64.64 | Q4_K_M
3.53 | 62.89 | Q3_K_M
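To make the trade-off in the table easier to read, the size saving and MMLU drop of each quant relative to Q8_0 can be worked out as below. The numbers are copied from the table above; the function name is hypothetical.

```python
# (model size in GB, MMLU score, quant) for the 8B model, from the table above.
ROWS = [
    (7.43, 65.23, "Q8_0"),
    (5.73, 65.06, "Q6_K"),
    (5.00, 64.90, "Q5_K_M"),
    (4.30, 64.64, "Q4_K_M"),
    (3.53, 62.89, "Q3_K_M"),
]


def compare_to_baseline(rows, baseline="Q8_0"):
    """Percent of size saved and MMLU points lost versus the baseline quant."""
    base_size, base_mmlu = next((s, m) for s, m, q in rows if q == baseline)
    return {
        quant: {
            "size_saved_pct": 100 * (1 - size / base_size),
            "mmlu_drop": base_mmlu - mmlu,
        }
        for size, mmlu, quant in rows
    }


for quant, delta in compare_to_baseline(ROWS).items():
    print(f"{quant}: {delta['size_saved_pct']:.1f}% smaller, "
          f"{delta['mmlu_drop']:.2f} MMLU points lower")
```

On these figures Q6_K saves a bit under a quarter of the file size for a drop of 0.17 MMLU points, while the drop only becomes large at Q3_K_M.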

Thank you! Yeah seems Q6_K is the best balance between model size and performance.

If anyone has any ideas on things they wished models were tested more on, please comment them!

For example, it's annoying when you ask a model for website links, but they give you fake urls or sites that don't exist anymore.

You could test its ability to adhere to system prompts. You can assign it a specific persona and evaluate whether it maintains that character when answering questions. The questions don't need to have factually correct or incorrect responses - instead, they help us assess if the LLM consistently respects the assigned persona. For example:

System prompt: "You are a grumpy old professor who has been teaching for 40 years and is tired of students' excuses. While you know your subject matter perfectly, you have little patience for basic questions."

Q: A student emails asking for an extension because their computer crashed before they could save their essay. How do you respond?
A) "Of course, these things happen! Take all the time you need"
B) "Back in my day, we wrote everything by hand. Extension denied." (most consistent with system prompt)
C) "Please submit proper documentation from IT services"
D) "I understand completely, you can have an extra week"

System prompt: "You are a conspiracy theorist who believes that lizard people control all world governments and that the Earth is not only flat but also a simulation run by interdimensional beings."

Q: Why do birds migrate south for the winter?
A) "Due to seasonal food availability and temperature changes"
B) "They're actually surveillance drones returning to government facilities for maintenance and data upload. The 'migration patterns' match the locations of known lizard people underground bases!" (most consistent with persona)
C) "To find warmer climates for breeding"
D) "Following ancient instinctual patterns"

A model properly maintaining its assigned character would select option B, even though it might not be the most scientifically supported answer. This demonstrates the LLM's ability to prioritize persona consistency over factual optimization.

If you're using deterministic settings, then I suppose asking the same question multiple times is not relevant. But I'd like to turn the question around: Why would you be using deterministic settings? Normally it would be to make your test repeatable by others, but it's deliberately private so it's harder to game. Using multiple fixed seeds would work too and allow you to run things semi-deterministically still while also obtaining multiple answers. If you want to tell models apart better, you need more data, as you can show with stats.

Here's also my answer to this question: "what would I be taking the standard deviation of?" Imagine a thought experiment where a different DontPlanToEnd in a different reality made up the same test, but with new questions (in the same topic). What's the chance that a model that scored a 9.1 before scores a 10 now? How reliable are these numbers? If you rephrased (but not changed the meaning of) the questions, how similar are the results?

This is just another way of phrasing: What's the 'statistical power' of your measure?

That's where standard deviation or score intervals come in: Your test is a 'sample' of all the roleplaying output your model can give (which is basically infinite for a sizeable output). You're measuring a proportion over that sample (censored/uncensored). A typical confidence interval used for this test is the Wilson interval, because it's fairly accurate for most values (meaning, within 50-200% of the chosen alpha, and averaging 100% over all values in the 0-1 range).

Another thought experiment: You could ask 10000x more questions and get a 100x more precise result (way too expensive to do though), due to the way the law of large numbers works. All those 9.1s would very likely become different numbers from each other. By how much would these change? --> This is where score intervals come in handy. Here's an example for a test with 11 questions (weighted from 0-10):

image.png

And here's a table for all 65 (weighted from 0-100, with the typical score range for top models):

image.png

So as you can see, there's no way to tell a 10, 9.1 or 8.2 apart. You can say that a model with a 7.3 is statistically likely to be worse at willingness than one with a 10 with 95% confidence, but no more.

This drives home the importance of doing the statistics: that was, I imagine, much worse than your intuition about how reliable that willingness score is! A model that's about 80% likely to pass your questions is pretty likely to get 11/11 right by pure chance, after all. These models are in the end statistical machines that estimate the probability of each next token. Even if you set the RNG seed to a particular value, choosing a different value will get you new test results. And, kind of obviously your seed shouldn't be hugely impacting your rankings, or they'd be pretty arbitrary. You're showing two decimal places, where it's hard to say you can reliably (95% certain) even show the first. You need about 400 questions (!) in a test to reliably show the first decimal. 40,000 for the second. 4,000,000 for the third. It's a lot worse than your intuition might suggest.
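The question counts above can be reproduced with the usual normal-approximation sample size formula. This is a sketch under my own reading of "reliably show a decimal": a roughly 95% interval (z = 2) whose half-width is 0.05, 0.005 or 0.0005 on the 0-1 proportion scale, taking the worst case p = 0.5. The function name is hypothetical.

```python
import math


def questions_needed(half_width, z=2.0, p=0.5):
    """Questions required so that z * sqrt(p * (1 - p) / n) <= half_width."""
    n = (z * z) * p * (1 - p) / (half_width ** 2)
    # Round first so floating-point noise does not push ceil() up by one.
    return math.ceil(round(n, 6))


for half_width in (0.05, 0.005, 0.0005):
    print(half_width, questions_needed(half_width))
```

Each extra decimal costs 100x more questions because the interval width shrinks with the square root of the sample size.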

This is why I'm suggesting to use batches, multiple answers, etc. To use statistics to check how good your test is. To use multiple choice with JSON formatting and optionally 'thoughts', or as it's known in the AI space, 'CoT', so a simple program can check the answer. So you can, with the use of less time/hardware, have a model answer more questions. Because the real problem isn't 'having multiple choice', it's 'you don't have enough data'.

Here's how you would compute this interval in a program: https://www.econometrics.blog/post/the-wilson-confidence-interval-for-a-proportion/

A simpler way of approximating the Wilson score interval for z=2 is to add 2 successes and 2 failures to your results, then add or subtract 2 standard deviations (this shortcut is known as the Agresti-Coull interval).
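The shortcut just described fits in a few lines. This is a sketch of the approximation only, not of the exact Wilson formula from the linked article, and the function name is made up.

```python
import math


def plus_two_interval(successes, total):
    """Approximate ~95% interval: add 2 successes and 2 failures, then +/- 2 SD."""
    n = total + 4
    p = (successes + 2) / n
    sd = math.sqrt(p * (1 - p) / n)
    # Clamp to the valid range for a proportion.
    return max(0.0, p - 2 * sd), min(1.0, p + 2 * sd)


# An 11-question willingness test: a perfect score versus one refusal.
print(plus_two_interval(11, 11))
print(plus_two_interval(10, 11))
```

The two printed intervals overlap heavily, which is the point made above about not being able to tell a 10 from a 9.1 on 11 questions. The centre of this interval matches the centre of the Wilson interval for z=2, while its width is somewhat larger.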

DontPlanToEnd changed discussion status to closed
DontPlanToEnd unpinned discussion
