You know Mixtral, Llama 2 70b, GPT3.5... Are All Much Better
I used Solar Instruct for hours over several days, and while it scored slightly higher than the Mistral 7b models in my testing, it didn't score nearly as high as the Mixtral models, the Llama 2 70b models, or GPT3.5.
Usually the results of my testing roughly align with HF average scores, but since they were way off this time, I looked into it.
It appears the discrepancy is primarily due to two things.
(1) Solar Instruct obsessively denies that things are true, including countless millions of things which are in fact true, resulting in an absurdly high 71.5 TruthfulQA score (much higher than even GPT4). When I removed TruthfulQA from the HF average score, the result was a much better representation of Solar Instruct's true performance.
(2) Solar Instruct gives unusually brief responses, even when the circumstances or the user's instructions call for more. Because of automated eval limitations, longer and more complex answers result in lower scores on numerous tests (more true answers are falsely identified as false), so this brevity works in Solar Instruct's favor.
All things considered, the true HF score of Solar Instruct is no higher than 68, and certainly nowhere near 74. A score of 74 would put it above GPT3.5, yet it isn't nearly as good (and that's not just my opinion). It's not even nearly as good as Llama 2 70b or Mixtral. It has far less knowledge and gets tripped up by much simpler questions than all three, yet has a higher score.
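For anyone who wants to repeat the "remove TruthfulQA from the average" check from point (1), here is a minimal Python sketch. The helper name `average_without`, the two models, and every number in it are made up purely for illustration; they are NOT real leaderboard scores, and the benchmark list is only an assumed subset of what the leaderboard averages. It just shows how one outlier benchmark can flip a ranking.

```python
from statistics import mean


def average_without(scores, excluded=()):
    """Mean of the benchmark scores, skipping any benchmark named in `excluded`."""
    kept = [value for name, value in scores.items() if name not in excluded]
    if not kept:
        raise ValueError("no benchmarks left to average")
    return mean(kept)


# Hypothetical models with made-up scores (NOT real leaderboard numbers).
model_a = {"ARC": 60.0, "HellaSwag": 80.0, "MMLU": 60.0, "TruthfulQA": 72.0}
model_b = {"ARC": 66.0, "HellaSwag": 84.0, "MMLU": 66.0, "TruthfulQA": 48.0}

for name, scores in (("model_a", model_a), ("model_b", model_b)):
    full = average_without(scores)
    trimmed = average_without(scores, excluded={"TruthfulQA"})
    print(f"{name}: {full:.1f} with TruthfulQA, {trimmed:.1f} without")
# model_a: 68.0 with TruthfulQA, 66.7 without
# model_b: 66.0 with TruthfulQA, 72.0 without
```

With these made-up numbers, model_a leads on the full average only because of its TruthfulQA score; once that benchmark is dropped, model_b is clearly ahead. To run the same check on real models, substitute the actual per-benchmark scores from the leaderboard.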
Thank you for the much-needed analysis. The point about the TruthfulQA score was especially illuminating.