Called FRIDAY, it's a mysterious new autonomous agent, which got quite good performances on both the public validation set *and* the private test set. It notably passed 10 points for the val and 5 points for the test set on our hardest questions (level 3): they require to take arbitrarily long sequences of actions, use any number of tools, and access the world in genera! ✨
The GAIA benchmark evaluates next-generation LLMs (LLMs with augmented capabilities due to added tooling, efficient prompting, access to search, etc) and was co authored by @gregmialz@ThomasNLG@ylecun@thomwolf and myself: gaia-benchmark/leaderboard