Did you change the way you run evals today?

#178
by bleysg - opened

Hello, we saw our new model got evaluated yesterday:
https://huggingface.co/datasets/open-llm-leaderboard/results/blob/main/Open-Orca/OpenOrcaxOpenChat-Preview2-13B/results_2023-08-09T10%3A54%3A28.159442.json

Now, it appears it has been re-evaluated in the last few hours:
https://huggingface.co/datasets/open-llm-leaderboard/results/blob/main/Open-Orca/OpenOrcaxOpenChat-Preview2-13B/results_2023-08-09T19%3A53%3A44.921082.json

Everything in the config_general section of the output is exactly the same, but the results are dramatically different (and the overall score is much lower on the new eval, which the leaderboard appears to have been updated to use).

We noted that the original eval reported "total_evaluation_time_secondes": "6305.85427236557", whereas the new eval appears to have taken over 5 times longer to run.

The original eval was much closer to our internal evals which were run based on your published methods. Can you please explain what has changed today?

Open LLM Leaderboard org

Hi, I just checked the requests dataset, and your model has actually been submitted 3 times: once in float16, once in bfloat16, and once in 4bit (here). We ran the three evaluations, and I suspect the last one (4bit, which is much slower because of the quantization operations) overrode the other two in the results file.

However, it should say "4bit" there, so I'll check why it doesn't. Thank you very much for paying attention and pointing this out!

Thanks for that! We had a suspicion that might have been the case. None of those requests were submitted by our team. For reference, the model was trained in bfloat16, though at least the float16 results are similar.

Might I suggest something like the following to prevent evals with the wrong precision parameters from polluting the results (rough sketch below)...

  1. Look in config.json for the torch_dtype entry and use that (e.g. https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B/blob/main/config.json#L19 )
  2. Look for a specific named file with nothing but the intended precision for evals (e.g. filename: eval_dtype.cfg contents: bfloat16)

This would save the eval time spent running with the wrong parameters and prevent spurious results from being posted to the leaderboard (whether through ignorance, accident, or malice).
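
For concreteness, here is a rough sketch of how both options could be combined when resolving the precision for a submission. The `eval_dtype.cfg` filename is the hypothetical one from suggestion 2, and `resolve_eval_dtype` is just an illustrative helper, not leaderboard code:

```python
# Rough sketch (not leaderboard code): resolve the precision to evaluate a model in
# by trusting the repo itself rather than the submitter's form input.
import json
from huggingface_hub import hf_hub_download
from huggingface_hub.utils import EntryNotFoundError


def resolve_eval_dtype(repo_id: str, requested: str | None = None) -> str:
    # Suggestion 2: an explicit override file committed by the model authors
    # (hypothetical filename).
    try:
        path = hf_hub_download(repo_id, filename="eval_dtype.cfg")
        return open(path).read().strip()
    except EntryNotFoundError:
        pass
    # Suggestion 1: fall back to the torch_dtype declared in config.json.
    config = json.load(open(hf_hub_download(repo_id, filename="config.json")))
    declared = config.get("torch_dtype", "float16")
    if requested and requested != declared:
        # Flag (or reject) submissions whose requested precision disagrees with the repo.
        print(f"warning: submission asked for {requested} but {repo_id} declares {declared}")
    return declared


# Should print "bfloat16" for the model discussed in this thread.
print(resolve_eval_dtype("Open-Orca/OpenOrcaxOpenChat-Preview2-13B"))
```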

More weird stuff with the average:

[screenshot attached]
@clefourrier

Open LLM Leaderboard org

Hi @felixz ,
You'll notice, if you display the model sha, that this model appears 3 times because it was submitted with three different model shas.
I would be grateful if you could create a dedicated issue next time.
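
For reference, the shas can also be checked programmatically outside the UI; here is a quick sketch using huggingface_hub (the repo id is a placeholder, not the actual model):

```python
# Quick sketch: list the commit shas a repo has had, using huggingface_hub.
from huggingface_hub import HfApi

api = HfApi()
repo_id = "some-org/some-model"  # placeholder: put the model in question here

# Sha currently pointed to by the main revision:
print(api.model_info(repo_id).sha)

# Full commit history; each submission may pin a different one of these shas:
for commit in api.list_repo_commits(repo_id):
    print(commit.commit_id, commit.title)
```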

Open LLM Leaderboard org

@bleysg We are actually OK with people submitting bfloat16 models for evaluation in 4bit, especially for bigger models: not everybody has the consumer hardware to run a 70B model, and it's very useful for a lot of people to know what performance loss they get when quantizing models. That's why we added the option.
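
For anyone curious what running in 4bit implies, this is roughly the standard transformers + bitsandbytes loading path (a sketch only, not necessarily what the eval harness does internally):

```python
# Sketch of the usual transformers + bitsandbytes 4bit loading path
# (illustrative only, not the eval harness's exact code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Open-Orca/OpenOrcaxOpenChat-Preview2-13B"  # the model from this thread

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4bit at load time
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16, matching the training dtype
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # requires accelerate; spreads layers across available devices
)
```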

However, I updated the leaderboard so that models with different precisions now each get their own row, to avoid the problem you ran into earlier with your model, which is now back at the (almost) top of its category :)

Thanks! That works too :) I noticed that our model appears to be the only one aside from GPT2 with a 4bit result on the board currently. Is this perhaps due to a longstanding issue with 4bit evals getting miscategorized?

Open LLM Leaderboard org

It is highly possible, yes!
I'll have to do a full pass on matching info in the requests file with info in the results file.
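
A rough sketch of what such a matching pass could look like; the field names (`model`, `precision`, `model_name`, `model_dtype`) and file layout here are assumptions for illustration, not the actual schema of the requests/results datasets:

```python
# Rough sketch (assumed field names, not the real schema): cross-check that every
# result entry's precision matches a request that was actually submitted for that model.
import json
from pathlib import Path

requests = {}  # (model, precision) -> request metadata
for path in Path("requests").glob("**/*.json"):
    req = json.loads(path.read_text())
    requests[(req["model"], req["precision"])] = req

for path in Path("results").glob("**/*.json"):
    res = json.loads(path.read_text())
    cfg = res.get("config_general", {})
    key = (cfg.get("model_name"), cfg.get("model_dtype"))
    if key not in requests:
        print(f"mismatch or missing request for {key} ({path})")
```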

clefourrier changed discussion status to closed
