De-duplicating submissions and defaulting to test
[Adding this as a new thread, since the thread from yesterday was closed while I was posting this as a reply:]
Hi @clefourrier @gregmialz, Thanks for your awesome work on this dataset! I think it's really going to drive forward development in this space. Just wanted to say +1 for removing duplicates by default and returning the top-performing result for each model name (though all results could still be retained in the public results file). Also, it would be nice for the leaderboard to default to the test results instead of the validation results if possible. The validation dataset is at the top of many of our agent's Google searches, along with articles on HackerNews etc. that discuss the questions, so there's a significant risk of contamination (which, as you know, would show up as a large validation-test performance disparity).
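For illustration, here's a minimal pandas sketch of the kind of de-duplication I have in mind. The file name and column names (`model`, `score`) are just assumptions for the example, not the leaderboard's real schema:

```python
import pandas as pd

# Hypothetical results file and columns -- the real schema may differ.
results = pd.read_json("results.json")  # e.g. columns: model, organisation, score

# Keep every submission in the public results file, but display only the
# top-scoring entry per model name on the leaderboard itself.
leaderboard = (
    results.sort_values("score", ascending=False)
           .drop_duplicates(subset=["model"], keep="first")
)
```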
Also @hccngu Great work! Looking forward to hearing more about it!
Hi @chad-ml56 ,
Thanks a lot for your message!
We now have a filter that checks whether the model name and organisation have already been submitted - we don't want any duplicates at all. If an org wants to submit several versions of the same model, they should specify the version in the model name. Good idea to default to the test results!
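Conceptually, the check works something like this (a simplified sketch, not the actual code; the field names are illustrative):

```python
def is_duplicate(model: str, organisation: str, existing: list[dict]) -> bool:
    """Reject a submission if this organisation has already submitted
    a model under the same name. Orgs wanting to submit several versions
    of a model should encode the version in the name (e.g. "my-model-v2").
    """
    return any(
        entry["model"] == model and entry["organisation"] == organisation
        for entry in existing
    )
```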