MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI Paper • 2311.16502 • Published Nov 27, 2023 • 35
BLINK: Multimodal Large Language Models Can See but Not Perceive Paper • 2404.12390 • Published Apr 18, 2024 • 24
RULER: What's the Real Context Size of Your Long-Context Language Models? Paper • 2404.06654 • Published Apr 9, 2024 • 34
CantTalkAboutThis: Aligning Language Models to Stay on Topic in Dialogues Paper • 2404.03820 • Published Apr 4, 2024 • 24
CodeEditorBench: Evaluating Code Editing Capability of Large Language Models Paper • 2404.03543 • Published Apr 4, 2024 • 15
Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and Human Ratings Paper • 2404.16820 • Published Apr 25, 2024 • 15
SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension Paper • 2404.16790 • Published Apr 25, 2024 • 7
On the Planning Abilities of Large Language Models -- A Critical Investigation Paper • 2305.15771 • Published May 25, 2023 • 1
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis Paper • 2405.21075 • Published May 31, 2024 • 20
Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning Paper • 2406.09170 • Published Jun 13, 2024 • 24
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding Paper • 2406.09411 • Published Jun 13, 2024 • 18
CS-Bench: A Comprehensive Benchmark for Large Language Models towards Computer Science Mastery Paper • 2406.08587 • Published Jun 12, 2024 • 15
MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs Paper • 2406.11833 • Published Jun 17, 2024 • 61
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation Paper • 2406.09961 • Published Jun 14, 2024 • 54
BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack Paper • 2406.10149 • Published Jun 14, 2024 • 48
MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains Paper • 2407.18961 • Published Jul 18, 2024 • 39
AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents Paper • 2407.18901 • Published Jul 26, 2024 • 32
WebArena: A Realistic Web Environment for Building Autonomous Agents Paper • 2307.13854 • Published Jul 25, 2023 • 23
GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI Paper • 2408.03361 • Published Aug 6, 2024 • 85
SWE-bench-java: A GitHub Issue Resolving Benchmark for Java Paper • 2408.14354 • Published Aug 26, 2024 • 40
AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments Paper • 2405.07960 • Published May 13, 2024 • 1
MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning Paper • 2310.16049 • Published Oct 24, 2023 • 4
MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines Paper • 2409.12959 • Published Sep 19, 2024 • 36
DSBench: How Far Are Data Science Agents to Becoming Data Science Experts? Paper • 2409.07703 • Published Sep 12, 2024 • 66
HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models Paper • 2409.16191 • Published Sep 24, 2024 • 41
OmniBench: Towards The Future of Universal Omni-Language Models Paper • 2409.15272 • Published Sep 23, 2024 • 26
Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models Paper • 2410.07985 • Published Oct 10, 2024 • 28
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments Paper • 2404.07972 • Published Apr 11, 2024 • 46