MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding
Abstract
We introduce MedXpertQA, a highly challenging and comprehensive benchmark for evaluating expert-level medical knowledge and advanced reasoning. MedXpertQA contains 4,460 questions spanning 17 specialties and 11 body systems, divided into two subsets: Text for text evaluation and MM for multimodal evaluation. Notably, MM introduces expert-level exam questions with diverse images and rich clinical information, including patient records and examination results, setting it apart from traditional medical multimodal benchmarks built from simple QA pairs generated from image captions. MedXpertQA applies rigorous filtering and augmentation to address the insufficient difficulty of existing benchmarks like MedQA, and incorporates specialty board questions to improve clinical relevance and comprehensiveness. We perform data synthesis to mitigate data leakage risk and conduct multiple rounds of expert review to ensure accuracy and reliability. We evaluate 16 leading models on MedXpertQA. Moreover, medicine is deeply connected to real-world decision-making, providing a rich and representative setting for assessing reasoning abilities beyond mathematics and code. To this end, we develop a reasoning-oriented subset to facilitate the assessment of o1-like models.
Community
MedXpertQA is a highly challenging and comprehensive benchmark to evaluate expert-level medical knowledge and advanced reasoning. It features:
- Two subsets, Text for text evaluation and MM for multimodal evaluation, totaling 4,460 questions.
- Questions that effectively challenge state-of-the-art models, collected from expert-level sources and processed through filtering, question & option augmentation, and expert review. We present results on 16 leading models.
- High clinical relevance. MM introduces questions with diverse images and rich clinical information to multimodal medical benchmarking; Text incorporates specialty board questions for increased comprehensiveness.
- A reasoning-oriented subset enabling assessment of model reasoning abilities beyond mathematics and code.
Data files will be released shortly at: https://github.com/TsinghuaC3I/MedXpertQA
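Since the data files are not yet released, the following is only a minimal Python sketch of how one might score a model on a MedXpertQA-style multiple-choice subset once they are available. The JSONL layout, the file name, and the field names (`question`, `options`, `label`) are illustrative assumptions, not the confirmed release format from the repository.

```python
# Hypothetical evaluation sketch for a MedXpertQA-style JSONL file.
# Field names ("question", "options", "label") are assumptions, not the
# confirmed release format.
import json


def load_questions(path):
    """Load one question per JSONL line into a list of dicts (assumed format)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]


def accuracy(questions, predict):
    """Fraction of questions where predict(prompt) returns the gold answer letter."""
    correct = 0
    for q in questions:
        prompt = q["question"] + "\n" + "\n".join(
            f"{letter}. {text}" for letter, text in q["options"].items()
        )
        if predict(prompt) == q["label"]:
            correct += 1
    return correct / len(questions)


if __name__ == "__main__":
    qs = load_questions("medxpertqa_text.jsonl")   # hypothetical file name
    print(accuracy(qs, lambda prompt: "A"))        # trivial always-"A" baseline
```

The `predict` callable stands in for whatever model interface is used; swapping in an API call or a local model keeps the scoring loop unchanged.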
This is an automated message from the Librarian Bot. The following papers, recommended by the Semantic Scholar API, are similar to this paper:
- Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark (2025)
- ReasVQA: Advancing VideoQA with Imperfect Reasoning Process (2025)
- BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities (2024)
- MedG-KRP: Medical Graph Knowledge Representation Probing (2024)
- MEDEC: A Benchmark for Medical Error Detection and Correction in Clinical Notes (2024)
- FineMedLM-o1: Enhancing the Medical Reasoning Ability of LLM from Supervised Fine-Tuning to Test-Time Training (2025)
- LLM-MedQA: Enhancing Medical Question Answering through Case Studies in Large Language Models (2024)