Link mainstream benchmarks to UGI scores
The problem with uncensoring models is that it often "lobotomizes" their ability to recall factual knowledge, something that has been studied with regard to abliteration: abliterated models lost points on benchmarks like GPQA, IFEval, etc. This adds to the problem of factory lobotomy of models that generate infinite slop, useless GPT-isms, and other undesirable output.
So my question is: would it make sense to link to the Open LLM Leaderboard average, or to add a separate test for the general knowledge of LLMs? I think we could display the average score for models that already have the Open LLM Leaderboard evals complete, without counting that score toward UGI(?). This would give users a better idea of how good a model is in the open field (a rough sketch of what this could look like follows the example below).
Example:
Currently the best sub-10B model on UGI is MT-Gen6fix-gemma-2-9B; on the Open LLM Leaderboard it sits at rank 2356 with an average score of ~20%. The best sub-10B model on the Open LLM Leaderboard is phi-4-unsloth-bnb-4bit at rank 250 with an average score of ~40%. It's probably garbage for anything creative, but it still puts the two models into context.
I could find some of the other UGI-leading models on the Open LLM Leaderboard, but not all of them. Qwen2.5-32B-Instruct-abliterated-v2 is one of the best on both leaderboards (UGI 38.69 vs. an Open LLM average of 46.89%, rank 13). But another model that is really close to it in UGI, Cydonia-22B-v1.2, reaches a significantly lower score on the open leaderboard (UGI 37.46 vs. 28.79%, rank 997).
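To make the idea concrete, here is a minimal sketch of how the Open LLM Leaderboard average could be joined onto the UGI table as a display-only column, so it adds context without affecting the UGI ranking. This is not based on the actual leaderboard code; the file names and column names are hypothetical placeholders.

```python
# Sketch: attach the Open LLM Leaderboard average to the UGI table as an
# informational column, without letting it change the UGI ranking.
# File and column names below are hypothetical placeholders.
import pandas as pd

ugi = pd.read_csv("ugi_scores.csv")                   # e.g. columns: model, UGI, NatInt
open_llm = pd.read_csv("open_llm_leaderboard.csv")    # e.g. columns: model, average, rank

# Left join so models without a completed Open LLM eval simply get a blank column.
merged = ugi.merge(
    open_llm[["model", "average"]].rename(columns={"average": "OpenLLM Avg"}),
    on="model",
    how="left",
)

# Ranking is still done purely by UGI; "OpenLLM Avg" is display-only context.
merged = merged.sort_values("UGI", ascending=False)
print(merged[["model", "UGI", "OpenLLM Avg"]].head(20))
```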
I'm hoping that NatInt can be used as a ranking of intelligence not focused on uncensoredness. It still needs some work, but it does kinda show what you're talking about with lobotomizing, like how pretty much all llama-3 8b and 70b finetunes and merges perform worse than their Instructs.
I think that makes sense; all models above NatInt 50 are closed source, Llama 405B, or Deepseek-v3. I'm not sure what it is benchmarking against, but if a model scores low on GPQA and high on UGI, it's probably safe to assume the lobotomy was completed in favor of RP. Same goes for NatInt. So I'm proposing we close the discussion.