MMIE/MMIE-Score · Hugging Face

MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models


🌟 Overview

We present MMIE, a Massive Multimodal Interleaved understanding Evaluation benchmark designed for Large Vision-Language Models (LVLMs). MMIE includes an automated evaluation metric, powered by a fine-tuned InternVL-2-4B, that assesses interleaved comprehension and generation capabilities across diverse fields.

This automated evaluation metric provides a reliable, streamlined way to score LVLMs on multimodal reasoning tasks. It is tailored to interleaved inputs and outputs, and the underlying model is fine-tuned to reduce scoring bias and keep results consistent.

🎯 Key Features of the MMIE Evaluation Metric:

Automated Scoring System: Fine-tuned InternVL-2-4B is employed as the foundation of the scoring system, offering high performance and support for multi-image input.
Bias Mitigation: The model is fine-tuned to minimize biases and provide fair, objective scoring across all models tested.
Multimodal Focus: Tailored to handle interleaved multimodal inputs and outputs, ensuring models are judged on their ability to integrate and reason with both text and images.
Human-like Evaluation: Our metric shows high correlation with human annotations, surpassing alternative automated evaluators such as GPT-4o, especially in nuanced multimodal tasks.
Scalable and Consistent: The evaluation metric is built to handle large-scale datasets and produces consistent, reproducible scores, which makes it well suited to model benchmarking and comparison.

🔧 Metric Details

Pipeline

To ensure a comprehensive and unbiased evaluation of various LVLMs, we propose an automated evaluation metric powered by InternVL-2-4B. We selected this model for its strong performance in multimodal reasoning tasks and its support for multi-image inputs, and we fine-tuned it to mitigate potential biases and provide accurate, consistent scoring.

The evaluation pipeline leverages the internally fine-tuned LVLM to assess models based on key dimensions such as text quality, image quality, text-image coherence, and stylistic consistency. This ensures models are rigorously tested on their multimodal reasoning capabilities.
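As a rough illustration of the four dimensions listed above, the sketch below stores one score per dimension for a single model response. Only the dimension names come from this card. The `DimensionScores` class, the numeric score scale, and the unweighted mean in `overall()` are assumptions made for illustration; the card does not state how MMIE-Score combines the dimensions.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class DimensionScores:
    """Hypothetical container for per-dimension scores of one response."""

    text_quality: float
    image_quality: float
    text_image_coherence: float
    stylistic_consistency: float

    def overall(self) -> float:
        # ASSUMPTION: an unweighted mean. The actual aggregation used by
        # MMIE-Score is not described in this card.
        values = [getattr(self, f.name) for f in fields(self)]
        return sum(values) / len(values)


if __name__ == "__main__":
    scores = DimensionScores(
        text_quality=4.0,
        image_quality=5.0,
        text_image_coherence=3.0,
        stylistic_consistency=4.0,
    )
    print(scores.overall())
```

Keeping the dimensions as named fields rather than a bare list makes it harder to mix up their order when scores from many models are compared.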

Results

Note: In the results figure, higher values indicate better performance for Pearson correlation and Cosine Similarity, while lower values are better for MSE and MAE.

The MMIE evaluation metric achieves the highest correlation with human annotations across all evaluated aspects of multimodal comprehension and generation. It consistently outperforms GPT-4o and other standard evaluation metrics, which supports its reliability for large-scale model benchmarking.
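The four agreement measures named in the note (Pearson correlation, cosine similarity, MSE, and MAE) can be computed from paired lists of automated and human scores. The following is a minimal, standard-library sketch of those textbook definitions; the function names and the sample scores are hypothetical, and this is not the authors' evaluation script.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient; higher means closer agreement."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("Pearson correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


def cosine_similarity(x, y):
    """Cosine of the angle between the two score vectors; higher is better."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


def mse(x, y):
    """Mean squared error; lower is better."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def mae(x, y):
    """Mean absolute error; lower is better."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


def agreement_report(metric_scores, human_scores):
    """Collect all four measures for one automated metric."""
    if len(metric_scores) != len(human_scores) or len(metric_scores) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    return {
        "pearson": pearson(metric_scores, human_scores),
        "cosine": cosine_similarity(metric_scores, human_scores),
        "mse": mse(metric_scores, human_scores),
        "mae": mae(metric_scores, human_scores),
    }


if __name__ == "__main__":
    # Hypothetical scores for illustration only, not results from the paper.
    human = [1.0, 2.0, 3.0, 4.0, 5.0]
    automated = [2.0, 3.0, 4.0, 5.0, 6.0]
    print(agreement_report(automated, human))
```

Reporting correlation and error together is useful because they capture different failures: an automated score that is always one point too high still has a Pearson correlation of 1 with the human scores, yet its MSE and MAE are nonzero.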

Installation

To use our benchmark and evaluation metric, please refer to our GitHub repo.

🚩 Citation

If you find our benchmark useful in your research, please kindly consider citing us:

@article{xia2024mmie,
  title={MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models},
  author={Xia, Peng and Han, Siwei and Qiu, Shi and Zhou, Yiyang and Wang, Zhaoyang and Zheng, Wenhao and Chen, Zhaorun and Cui, Chenhang and Ding, Mingyu and Li, Linjie and Wang, Lijuan and Yao, Huaxiu},
  journal={arXiv preprint arXiv:2410.10139},
  year={2024}
}
