|
--- |
|
title: README |
|
emoji: 📒
|
colorFrom: blue |
|
colorTo: gray |
|
sdk: gradio |
|
pinned: false |
|
--- |
|
<div align="center"> |
|
<!-- MMIE: The Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models -->
|
<img src='https://cdn-uploads.huggingface.co/production/uploads/65941852f0152a21fc860f79/dg_Ng0cCX2MRX-YQLgHYN.png' width=300px> |
|
|
|
<p align="center" style="font-size:28px"><b>MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models</b></p> |
|
<p align="center"> |
|
<a href="https://mmie-bench.github.io">[Project]</a>

<a href="https://arxiv.org/abs/2410.10139">[Paper]</a>

<a href="https://github.com/Lillianwei-h/MMIE">[Code]</a>

<a href="https://huggingface.co/datasets/MMIE/MMIE">[Dataset]</a>

<a href="https://huggingface.co/MMIE/MMIE-Score">[Evaluation Model]</a>

<a href="https://huggingface.co/spaces/MMIE/Leaderboard">[Leaderboard]</a>
|
</p> |
|
</div> |
|
|
|
We introduce **MMIE**, a robust, knowledge-intensive benchmark for evaluating interleaved multimodal comprehension and generation in large vision-language models (LVLMs). With **20K+ examples** spanning **12 fields** and **102 subfields**, **MMIE** sets a new standard for probing the depth of multimodal understanding.
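As a minimal sketch (assuming the dataset follows the standard Hugging Face `datasets` layout; the actual split names and columns may differ from those shown, so check the dataset card), the benchmark can be pulled directly from the Hub:

```python
# Sketch: download and inspect the MMIE benchmark from the Hugging Face Hub.
# Assumes the standard `datasets` loading path; actual split names and columns
# may differ -- see the dataset card at huggingface.co/datasets/MMIE/MMIE.
from datasets import load_dataset

dataset = load_dataset("MMIE/MMIE")   # downloads all available splits
print(dataset)                        # shows split names, sizes, and columns

first_split = next(iter(dataset.values()))
print(first_split[0])                 # one interleaved multimodal question
```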
|
|
|
<div align="center"> |
|
<img src='https://cdn-uploads.huggingface.co/production/uploads/65941852f0152a21fc860f79/Ks9yJtJh7fcyNJSUcKg-0.jpeg'> |
|
</div> |
|
|
|
### Key Features:
|
- **Comprehensive Dataset:** With **20,103 interleaved multimodal questions**, MMIE provides a rich foundation for evaluating models across diverse domains.

- **Ground Truth Reference:** Each query includes a reliable reference answer, so model outputs can be measured accurately.

- **Automated Scoring with MMIE-Score:** Our scoring model achieves **high correlation with human scores**, outperforming prior automated judges such as GPT-4o on interleaved multimodal tasks (see the download sketch after this list).

- **Bias Mitigation:** The scoring model is fine-tuned to reduce scoring bias, enabling more objective model evaluations.
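As a minimal sketch, the fine-tuned scoring model can be fetched from the Hub with `huggingface_hub`; running the scorer itself requires the evaluation scripts from the code repository, so this only downloads the checkpoint:

```python
# Sketch: fetch the MMIE-Score model weights from the Hugging Face Hub.
# Running the scorer requires the evaluation scripts from
# https://github.com/Lillianwei-h/MMIE; this step only downloads the checkpoint.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="MMIE/MMIE-Score")
print(f"MMIE-Score checkpoint downloaded to: {local_dir}")
```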
|
|
|
--- |
|
|
|
### Key Insights:
|
1. **In-depth Evaluation**: Covers **12 major fields** (mathematics, coding, literature, and more) and **102 subfields**, testing a comprehensive range of competencies.

2. **Challenging the Best**: Even top pipelines such as **GPT-4o + SDXL** peak at 65.47%, highlighting substantial room for improvement in LVLMs.

3. **Designed for Interleaved Tasks**: The benchmark evaluates interleaved text-and-image comprehension and generation in both **multiple-choice and open-ended** formats.
|
|
|
--- |
|
|
|
### Dataset Details
|
<div align="center"> |
|
<img src='https://cdn-uploads.huggingface.co/production/uploads/65941852f0152a21fc860f79/vrsgjTBcBYfZTdMQiJ1uC.png' width=500px> |
|
</div> |
|
|
|
MMIE is curated to evaluate models' abilities in interleaved multimodal comprehension and generation. Its diverse examples are categorized and distributed across fields as illustrated above, ensuring balanced coverage of interleaved input/output tasks across domains and supporting accurate, detailed model evaluations.
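As an illustration only, the per-field distribution shown above could be tallied roughly as follows; the `"field"` column name here is a hypothetical stand-in, so substitute the actual label column from the dataset card:

```python
# Sketch: tally examples per field to approximate the distribution shown above.
# The "field" column name is hypothetical; replace it with the actual column
# documented at huggingface.co/datasets/MMIE/MMIE.
from collections import Counter
from datasets import load_dataset

dataset = load_dataset("MMIE/MMIE")
counts = Counter()
for split in dataset.values():
    counts.update(split["field"])     # hypothetical column name
for field, n in counts.most_common():
    print(f"{field}: {n}")
```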
|
|