Small models on MUSR benchmark
Nope, but since we provide all the details you could do a small analysis and see what's happening :)
@clefourrier how is a correct response determined by the eval?
It's the one with the best logprob among the possible choices. :)
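To make that concrete, here's a minimal sketch of what logprob-based multiple-choice scoring looks like (not the leaderboard's actual code, and the logprob values are just placeholders):

```python
def pick_answer(choice_logprobs: dict[str, float]) -> str:
    """Return the answer choice the model assigned the highest logprob to."""
    return max(choice_logprobs, key=choice_logprobs.get)

# Hypothetical per-choice logprobs for one MUSR question:
logprobs = {"A": -12.4, "B": -9.7, "C": -15.1}
predicted = pick_answer(logprobs)   # -> "B"
is_correct = predicted == "B"       # compared against the gold answer
```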
As expected, gpt2 is far less confident in its answers, with much lower logprobs than a much bigger model such as yi-34b. This would imply that small models are pretty much just guessing, and it just so happens that the correct answer ends up with the highest logprob? Maybe evals that use logprobs are unreliable when benchmarking smaller models because of this.
Calibration is indeed an interesting complement we would benefit from!
However, if this were truly random chance, models should be at 0 (since we normalise evals), so there could be something else at play here.
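For reference, the normalisation mentioned here rescales raw accuracy so that the random-chance baseline maps to 0 and a perfect score maps to 1 (a sketch of the idea, not the exact leaderboard code):

```python
def normalized_score(raw_acc: float, random_baseline: float) -> float:
    """Rescale accuracy so that random chance -> 0 and a perfect score -> 1."""
    return (raw_acc - random_baseline) / (1.0 - random_baseline)

# e.g. with 3 answer choices per question, random chance is ~1/3:
print(normalized_score(0.40, 1 / 3))  # ~0.10, i.e. barely above chance
```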
Agreed, a dirty fix I was thinking of would be to scale the scores by the average logprob of all answers across the entire eval. This would in theory make the scores much more practical.
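Something like this, as a rough sketch of that idea (one possible reading of it; the function name and the choice of exponentiating the logprobs into probabilities are just illustrative):

```python
import math

def scaled_score(correct_flags: list[bool], answer_logprobs: list[float]) -> float:
    """Multiply accuracy by the average probability mass (exp of the logprob)
    the model put on its chosen answers across the whole eval, so that
    low-confidence guessing is penalised even when it happens to be right."""
    accuracy = sum(correct_flags) / len(correct_flags)
    avg_confidence = sum(math.exp(lp) for lp in answer_logprobs) / len(answer_logprobs)
    return accuracy * avg_confidence

# Hypothetical example: 60% accuracy, but very diffuse (low-confidence) logprobs
print(scaled_score([True, True, True, False, False],
                   [-1.2, -1.5, -1.1, -1.4, -1.3]))
```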