PHYSICS: Benchmarking Foundation Models on University-Level Physics Problem Solving Paper • 2503.21821 • Published 19 days ago • 17
MCTS-RAG: Enhancing Retrieval-Augmented Generation with Monte Carlo Tree Search Paper • 2503.20757 • Published 18 days ago • 9
MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning Paper • 2503.07459 • Published Mar 10 • 15
IFIR: A Comprehensive Benchmark for Evaluating Instruction-Following in Expert-Domain Information Retrieval Paper • 2503.04644 • Published Mar 6 • 20
The Stochastic Parrot on LLM's Shoulder: A Summative Assessment of Physical Concept Understanding Paper • 2502.08946 • Published Feb 13 • 193
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding Paper • 2501.12380 • Published Jan 21 • 86
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning Paper • 2501.12948 • Published Jan 22 • 380
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding Paper • 2501.13106 • Published Jan 22 • 91
HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation Paper • 2412.21199 • Published Dec 30, 2024 • 14
TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models Paper • 2410.23266 • Published Oct 30, 2024 • 20