---
title: README
emoji: 🌍
colorFrom: indigo
colorTo: yellow
sdk: static
pinned: false
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/668e4eb2446c8736208e227a/ncWIn6EQLzMMh78vzYXe4.jpeg
---

# Abstract
Recent advances in Retrieval-Augmented Generation (RAG) have significantly improved response accuracy and relevance by incorporating external knowledge into Large Language Models (LLMs). However, existing RAG methods primarily focus on generating text-only answers, even in Multimodal Retrieval-Augmented Generation (MRAG) scenarios, where multimodal elements are retrieved to assist in generating text answers. To address this, we introduce the Multimodal Retrieval-Augmented Multimodal Generation (MRAMG) task, in which we aim to generate multimodal answers that combine both text and images, fully leveraging the multimodal data within a corpus. Despite growing attention to this challenging task, a notable lack of a comprehensive benchmark persists for effectively evaluating its performance. To bridge this gap, we provide MRAMG-Bench, a meticulously curated, human-annotated benchmark comprising 4,346 documents, 14,190 images, and 4,800 QA pairs, distributed across six distinct datasets and spanning three domains: Web, Academia, and Lifestyle. The datasets incorporate diverse difficulty levels and complex multi-image scenarios, providing a robust foundation for evaluating the MRAMG task. To facilitate rigorous evaluation, MRAMG-Bench incorporates a comprehensive suite of both statistical and LLM-based metrics, enabling a thorough analysis of the performance of generative models in the MRAMG task. Additionally, we propose an efficient and flexible multimodal answer generation framework that can leverage LLMs/MLLMs to generate multimodal responses.
# Evaluation Metric
In this section, we provide a detailed introduction to the evaluation metrics.
## Retrieval Evaluation
To evaluate the retrieval performance, we consider the following metrics:
- **Context Recall** uses LLMs to evaluate whether the retrieved documents contain all the relevant textual information required for answer generation.
- **Visual Recall** measures the percentage of retrieved images relative to the total number of images in the ground truth.
It is computed as:

\[
\text{Visual Recall} = \frac{\text{Retrieved Relevant Images}}{\text{Total Relevant Images in Ground Truth}}
\]

where "Retrieved Relevant Images" refers to the number of retrieved images that are present in the ground truth, and "Total Relevant Images in Ground Truth" refers to the total number of relevant images that should have been retrieved.
## Generation Evaluation
To evaluate the performance of multimodal answers, we consider the following metrics, which can be divided into two categories: statistical-based metrics (first six metrics) and LLM-based metrics (last four metrics).
We use the following _statistical-based metrics_:
- **Image Precision** measures the percentage of correct images in the multimodal answer relative to the total number of inserted images, assessing whether irrelevant images were introduced.
It is computed as:
![\[
\text{Image Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}
\]](https://cdn-uploads.huggingface.co/production/uploads/67571051d39ac252085797ca/Z27SuLVNOMqU_Br8-h69P.png)
where True Positives are the correctly inserted images, and False Positives are irrelevant images that were included.
- **Image Recall** measures the percentage of correct images in the multimodal answer relative to the total number of images in the ground truth, evaluating whether the answer effectively includes useful image information.
It is computed as:
![\[
\text{Image Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}
\]](https://cdn-uploads.huggingface.co/production/uploads/67571051d39ac252085797ca/dJCpcB_oeZDmjn0AAGbex.png)
where False Negatives are the images in the ground truth that were omitted in the generated multimodal answer.
- **Image F1 Score** is the harmonic mean of Precision and Recall, providing an overall evaluation of the image quality in the multimodal answer.
It is calculated as:
![\[
\text{Image F1 Score} = 2 \times \frac{\text{Image Precision} \times \text{Image Recall}}{\text{Image Precision} + \text{Image Recall}}
\]](https://cdn-uploads.huggingface.co/production/uploads/67571051d39ac252085797ca/9Y1NtxXcyUoLay9mln2GR.png)
- **Image Ordering Score** evaluates whether the order of images inserted into the multimodal answer matches the order of images in the ground truth.
Specifically, we compute the weighted edit distance between the two image sequences to reflect the difference in their order.
- **Data Format** (For Lifestyle Data):
- **Ground-truth**: A = a_1 -> a_2 -> ... -> a_n, where a_i represents the image at the i-th position in the order.
- **Answer**: B = b_1 -> b_2 -> ... -> b_m, where b_j is not necessarily in A, and m is not necessarily equal to n.
- **Scoring Formula**:
![\[
\text{Score} = \frac{|A \cap B|}{n} \times \left( 1 - \frac{1}{p} \times \min\left(\frac{\text{dist}(A, B)}{\operatorname{max}(n, m)}, p \right)\right)
\]](https://cdn-uploads.huggingface.co/production/uploads/67571051d39ac252085797ca/KF53yea-wyQe9Av2Zn0-q.png)
- |A ∩ B| / n: Ensures a score of 0 when no correct images are present.
- p: Normalization factor.
- **Details**:
- Here, dist(A, B) represents the weighted edit distance between string A and string B, i.e., the minimum total cost to transform string B into string A through the following three operations:
- **String Insertion**: If B is missing certain images, insert an image from A into a specific position in B. The operation cost is p_1.
- **String Deletion**: If B contains extra irrelevant images, delete them. The operation cost is p_2.
- **String Substitution**: If the positions of images in B do not match A, substitute the image in B with the corresponding image from A. The operation cost is p_3.
- The weights generally satisfy p_1 > p_2 > p_3, and p >= p_1 ensures the final score falls within the range \[0, 1\].
- Weighted edit distance can be computed using dynamic programming, with a time complexity of O(mn); a sketch of this computation is provided after this metric list.
- **Rouge-L** is a text generation evaluation metric based on the longest common subsequence, measuring the structural similarity between the answer and the ground truth.
- **BERTScore** is a text generation evaluation metric based on the pre-trained language model BERT, used to assess the semantic similarity between the text in the generated multimodal answer and the ground truth.
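As referenced above, the image-level statistical metrics can be computed directly from the ordered lists of image identifiers in the generated answer and in the ground truth. The Python sketch below is illustrative only: the operation costs p1 > p2 > p3 and the normalization factor p are placeholder values satisfying the constraints stated above, not the costs used in MRAMG-Bench.

```python
def image_prf(answer_imgs: list[str], gt_imgs: list[str]) -> tuple[float, float, float]:
    """Image Precision, Recall, and F1 from answer vs. ground-truth image lists."""
    answer_set, gt_set = set(answer_imgs), set(gt_imgs)
    tp = len(answer_set & gt_set)                        # correctly inserted images
    precision = tp / len(answer_set) if answer_set else 0.0
    recall = tp / len(gt_set) if gt_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def image_ordering_score(gt: list[str], ans: list[str],
                         p1: float = 1.0, p2: float = 0.8, p3: float = 0.6,
                         p: float = 1.0) -> float:
    """Image Ordering Score via weighted edit distance, computed with O(n*m) DP."""
    n, m = len(gt), len(ans)
    if n == 0 or m == 0:
        return 0.0
    # dp[i][j]: minimum cost to transform ans[:j] into gt[:i]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * p1                                # insert missing images
    for j in range(1, m + 1):
        dp[0][j] = j * p2                                # delete extra images
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if gt[i - 1] == ans[j - 1] else p3
            dp[i][j] = min(dp[i - 1][j] + p1,            # insertion
                           dp[i][j - 1] + p2,            # deletion
                           dp[i - 1][j - 1] + sub)       # substitution / match
    dist = dp[n][m]
    overlap = len(set(gt) & set(ans)) / n                # zero if no correct images
    return overlap * (1 - min(dist / max(n, m), p) / p)
```

Because the distance term is clamped by min(·, p) and p ≥ p1, the resulting score stays within [0, 1], matching the formula above.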
We use the following _LLM-based metrics_:
- **Image Relevance** evaluates the relevance of each inserted image to the query-answer pair, specifically assessing whether the content depicted by the image is meaningfully related to the content of the QA. This metric assigns a score to each image appearing in the answer, with scores ranging from 1 to 5.
- **Image Effectiveness** evaluates the effectiveness of the images inserted into the multimodal answer, assessing whether they align with the QA content and contribute to the understanding of the answer.
This metric also assigns a score to each image, with scores ranging from 1 to 5.
- **Image Position Score** is used to assess the appropriateness of the image placement in the multimodal answer.
It assigns a score of either 0 or 1 to each image, based on whether its position is deemed correct and suitable.
- **Comprehensive Score** reflects overall quality of the multimodal answer, evaluating whether the answer appropriately addresses the query and maintains overall coherence. It particularly considers whether the insertion of images enhances the answer, making it visually engaging and more expressive.
This metric assigns a score to the complete answer, with scores ranging from 1 to 5.
# Prompts
## Generation Prompts
### Answer Generation Prompt for LLM-Based Method
# Input
Query: {}
Context: {}
Image Caption: {}
# Task
Imagine you are an expert in handling multimodal input queries and producing coherent text-image responses. You will receive:
1. Query: The user query to be answered.
2. Contexts containing multiple images represented as placeholders.
- The input context follows the format: [context_1], [context_2], ...
- Each [text_context_x] represents a pure text passage, while an image placeholder marks the position of each image.
3. A set of image captions.
- Each caption is sequentially aligned in a one-to-one correspondence with its respective image placeholder.
Your task is to answer the query based solely on the content of the context and input image information. Firstly, you should select appropriate images from the provided context (if none are suitable, you may choose not to include any). Then generate a mixed media response to the query, combining text and the selected images.
# Requirements
Ensure that your answer does not include any additional information outside the context.
Image Insert: When inserting image placeholders, place them at the most appropriate point within the answer. Image placeholders should be embedded naturally in the answer to support and enhance understanding, such as when describing specific locations, historical events, or notable buildings.
# Output Format
Please output your answer in an interwoven text-image format, where you select images from the context and include them in the corresponding placeholder format.
# Output Example
Doing household chores is a daily task that helps maintain a clean home. In the kitchen, dishes are neatly washed and placed in the drying rack, ready to be put away once they dry. Similarly, in the living room, the sofa cushions are fluffed and arranged properly, creating a comfortable space for relaxation.
### Answer Generation Prompt for MLLM-Based Methods
# Input
Query: {}
Context: {}
Image Caption: {}
# Task
Imagine you are an expert in handling multimodal input queries and producing coherent text-image responses.
You will receive:
1. Query: The user query to be answered.
2. Contexts.
3. A set of images.
4. A set of image captions.
- Each caption is sequentially aligned in a one-to-one correspondence with its respective input image.
Your task is to answer the query based solely on the content of the context and input image information. Firstly, you should visually and textually understand the images based on the given images and image captions to select appropriate images from the input images (if none are suitable, you may choose not to include any). Next, based on the provided contexts and query, generate a multi-modal answer combining text and the selected images.
# Requirements
Ensure that your answer does not include any additional information outside the context. Please note, your answer should be presented in an interwoven text-image format, where you select images from the context and output them in the corresponding placeholder format. Please provide only the answer, without including any analysis.
Image Insert: When inserting image placeholders, place them at the most appropriate point within the answer. Image placeholders should be embedded naturally in the answer to support and enhance understanding, such as when describing specific locations, historical events, or notable buildings.
# Output Format
Please output the answer in an interwoven text-image format, where you select images from the context provided and output them in the corresponding placeholder format.
# Output Example
Doing household chores is a daily task that helps maintain a clean home. In the kitchen, dishes are neatly washed and placed in the drying rack, ready to be put away once they dry. Similarly, in the living room, the sofa cushions are fluffed and arranged properly, creating a comfortable space for relaxation.
### Answer Generation Prompt for Rule-Based Methods
# Task
Imagine you are a text QA expert, skilled in delivering contextually relevant answers. You will receive:
1. Query.
2. Contexts.
Your task is to answer the query based solely on the content of the context.
# Requirements
Ensure that your answer does not include any additional information outside the context. Please note that your answer should be in pure text format.
# Output Format
Provide the answer in pure text format. Do not include any information beyond what is contained in the context.
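For reference, the generation prompts above contain `{}` slots for the query, the context, and the image captions. The snippet below is one hedged way such a template might be filled before calling a model; the template constant, the serialization of contexts and captions, and the helper name are illustrative assumptions rather than the benchmark's released code.

```python
ANSWER_PROMPT_TEMPLATE = """# Input
Query: {query}
Context: {context}
Image Caption: {captions}
# Task
Imagine you are an expert in handling multimodal input queries and producing coherent text-image responses.
(remaining instructions as written in the LLM-based prompt above)
"""

def build_llm_prompt(query: str, contexts: list[str], captions: list[str]) -> str:
    """Fill the prompt slots; how contexts and captions are serialized here is an assumption."""
    context_block = " ".join(f"[context_{i + 1}] {text}" for i, text in enumerate(contexts))
    caption_block = "\n".join(f"Image {i + 1}: {cap}" for i, cap in enumerate(captions))
    return ANSWER_PROMPT_TEMPLATE.format(query=query, context=context_block, captions=caption_block)
```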
## Evaluation Prompts
### Answer Evaluation Prompt for Image Relevance
# Input
Query: {}
Answer: {}
Image Context: {}
Image Caption: {}
# Task
Imagine you are a multimodal QA evaluation expert. Your task is to evaluate the relevance of selected images within an answer to the given query. Specifically, the answer contains both text and images. You need to assess whether the selected images are relevant to the QA pair in terms of content. The evaluation results should be output in the form of reasons and scores.
# Answer Input Format
[text_1] [text_2] ...
Explanation:
Each [text_x] is a piece of pure text context, and each image placeholder represents an image. The images will be provided in the same order as the placeholders.
# Image Context Input Format
[context_above]
[context_bottom]
Explanation:
This format represents the contextual information surrounding the image within its original document. It provides supplementary information to assist in evaluating the image.
# Scoring Criteria of Relevance (Each Image)
When scoring, strictly adhere to the following standards, with a range of 1 to 5:
- 1 point: Completely unrelated: The image has no connection to the main content of the query and answer, and is irrelevant.
- 2 points: Weakly related: The image has a very tenuous connection to the main content of the query and answer.
- 3 points: Partially related: The image is somewhat connected to part of the content of the query and answer.
- 4 points: Mostly related: The image has a fairly clear connection to the main content of the query and answer.
- 5 points: Highly related: The image is highly relevant to the content of the query and answer.
Provide a brief reason for the evaluation along with a score from 1 to 5. Ensure you do not use any evaluation criteria beyond the query and answer.
# Output Format
Please output two lines for the results: the first line is your reasoning for the score, and the second line is the score. Strictly follow this format without any additional content.
# Output Example
Partially related, the image depicts the general structure of the gate but does not clearly show the number of pillars, making it only somewhat relevant to the QA.
3
### Answer Evaluation Prompt For Image Effectiveness
# Input
Query: {}
Answer: {}
Image Context: {}
Image Caption: {}
# Task
Imagine you are a multimodal QA evaluation expert. Your task is to evaluate the effectiveness of selected images within an answer to the given query. Specifically, the answer contains both text and images. You need to assess whether the selected images are effective for the QA pair in terms of content. The evaluation results should be output in the form of reasons and scores.
# Answer Input Format
[text_1] [text_2] ...
Explanation:
Each [text_x] is a piece of pure text context, and each image placeholder represents an image. The images will be provided in the same order as the placeholders.
# Image Context Input Format
[context_above]
[context_bottom]
Explanation:
This format represents the contextual information surrounding the image within its original document. It provides supplementary information to assist in evaluating the image.
# Scoring Criteria of Effectiveness (Each Image)
When scoring, strictly adhere to the following standards, with a range of 1 to 5:
- 1 point, Harmful: The images in the answer are harmful to answering the query, such as causing serious misunderstanding for the reader.
- 2 points, Irrelevant: The images in the answer are mostly unrelated to the query and the answer, with little to no connection overall.
- 3 points, Partially Effective: The images in the answer are somewhat effective in helping the reader understand the answer to the query.
- 4 points, Mostly Effective: The images in the answer are largely consistent with the answer to the query and effectively help the reader better understand the answer.
- 5 points, Highly Effective: The images in the answer provide crucial details for answering the query. They not only align with the answer but also offer highly effective supplementary information that aids in understanding the query-answer pair from a multimodal perspective.
Provide a brief reason for the evaluation along with a score from 1 to 5. Ensure you do not use any evaluation criteria beyond the query and answer.
# Output Format
Please output two lines for the results: the first line is your reasoning for the score, and the second line is the score. Strictly follow this format without any additional content.
# Output Example
Highly effective: The images in the answer, depicting the front entrance with three pillars, are highly effective in helping readers understand the query about how many pillars there are. They strongly support the response that states there are three pillars. All images provide crucial details that aid in the reader's comprehension.
5
### Answer Evaluation Prompt For Comprehensive Answer Quality Evaluation
# Input
Query: {}
Answer: {}
Image Context: {}
Image Caption: {}
# Task
Imagine you are a multimodal QA evaluation expert. Your task is to evaluate the overall quality of the answer. Specifically, the answer contains both text and images. The evaluation results should be output in the form of reasons and scores.
# Answer Input Format
[text_1] [text_2] ...
Explanation:
Each [text_x] is a piece of pure text context, and each image placeholder represents an image. The images will be provided in the same order as the placeholders.
# Image Context Input Format
[context_above]
[context_bottom]
Explanation:
This format represents the contextual information surrounding the image within its original document. It provides supplementary information to assist in evaluating the image.
# Evaluation Criteria of Overall Quality
Strictly follow the scoring criteria below to assign a score between 1 and 5:
- 1 point, Poor Quality: The answer fails to address the question, the structure is confusing or missing, and the images are irrelevant or not helpful.
- 2 points, Fair Quality: The answer partially addresses the question but lacks completeness. The structure is weak, and the text-image integration is weak or only partially helpful.
- 3 points, Average Quality: The answer addresses the question but lacks depth. The structure is clear but could be improved. The images are somewhat helpful but don’t fully enhance understanding.
- 4 points, Good Quality: The answer is clear and fairly comprehensive. The structure is logical and well-organized, and the images enhance the understanding of the text.
- 5 points, Excellent Quality: The answer is detailed and insightful. The structure is strong and cohesive, and the images complement the text perfectly, significantly enhancing comprehension.
Provide a brief reason for the evaluation along with a score from 1 to 5. Ensure you do not use any evaluation criteria beyond the query and answer.
# Output Format
Please output two lines for the results: the first line is your reasoning for the score, and the second line is the score. Strictly follow this format without any additional content.
# Output Example
The answer provides a complete and coherent description of the Irish bouzouki, and the images in the answer help reinforce the explanation of its appearance. The structure is logical and easy to follow, with all images appropriately enhancing the reader's understanding of the instrument.
5
### Answer Evaluation Prompt For Image Position
# Input
Query: {}
Answer: {}
Image Context: {}
Image Caption: {}
# Task
Imagine you are a multimodal problem-solving expert tasked with evaluating whether the position of each selected image within an answer to the given query is appropriate.
# Answer Input Format
[text_1] [text_2] ...
Explanation:
Each [text_x] is a segment of pure text context, and each image placeholder represents an image. The images will be presented in the same order as the placeholders.
# Image Context Input Format
[context_above]
[context_bottom]
Explanation:
This format represents the contextual information surrounding the image within its original document. It provides supplementary information to assist in evaluating the image.
# Evaluation Criteria of Image Position
Strictly follow the criteria below to assign a score of 0 or 1:
- 0 point, Inappropriate Position: The image is irrelevant to both the preceding and following context, or the position of the image does not enhance content understanding or visual appeal. The insertion of the image does not align with the logical progression of the text and fails to improve the reading experience or information transmission.
- 1 point, Appropriate Position: The image is contextually relevant to at least one of the surrounding contexts (preceding or following), and it enhances content understanding or visual effect. The position of the image aligns with the logical flow of the text and is inserted appropriately, improving the overall information delivery. If the description of the image is detailed, it further clarifies the connection between the image and the text, enhancing the overall expressive effect.
# Output Format
Provide a brief justification for the evaluation and a score of either 0 or 1. Ensure no evaluation criteria beyond the provided query and answer are used.
Please output two lines for each image: the first line is your reasoning for the score, and the second line is the score. Strictly follow this format without any additional content.
# Output Example
The image displays a distant aerial view of the site, but the surrounding context focuses on intricate design details of the main entrance. The image placement does not align with the described content and does not improve comprehension.
0
The image shows a close-up of one of the pillars, which is directly referenced in the following context about the structure's details. The image placement aligns with the description, enhancing understanding.
1
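All evaluation prompts above request a two-line output: a reasoning line followed by a numeric score. The sketch below is one minimal way such output might be parsed into a structured record; the function and its fallback behavior are illustrative assumptions, not the benchmark's released parser.

```python
from typing import Optional, Tuple

def parse_eval_output(raw: str, min_score: int = 1, max_score: int = 5) -> Tuple[str, Optional[int]]:
    """Split the evaluator's reply into (reason, score); score is None if missing or out of range."""
    lines = [line.strip() for line in raw.strip().splitlines() if line.strip()]
    if not lines:
        return "", None
    reason, score = lines[0], None
    if len(lines) >= 2:
        try:
            value = int(lines[-1])
            if min_score <= value <= max_score:
                score = value
        except ValueError:
            pass  # non-numeric score line; leave score as None
    return reason, score
```

For the Image Position prompt, which assigns 0 or 1 per image, the same parser can be reused with `min_score=0` and `max_score=1`.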
# Results
In this section, we give the full experiment results, wherein the metrics **Prec.**, **Rec.**, **F1**, **R.L.**, **B.S.**, **Rel.**, **Eff.**, **Comp.**, **Pos.**, and **Avg.** represent image precision, image recall, image F1 score, ROUGE-L, BERTScore, image relevance, image effectiveness, comprehensive score, image position score, and average score, respectively. The metric **Ord.** represents the image ordering score.
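The reported **Avg.** values are consistent with a simple arithmetic mean of the other metric columns in each row (for example, the Rule-Based GPT-4o row on MRAMG-Wit averages to 56.39). The helper below illustrates this reading, which is our interpretation of the tables rather than a documented formula.

```python
def average_score(metric_values: list[float]) -> float:
    """Arithmetic mean of a row's metric columns, rounded to two decimals."""
    return round(sum(metric_values) / len(metric_values), 2)

# Rule-Based GPT-4o on MRAMG-Wit (Prec., Rec., F1, R.L., B.S., Rel., Eff., Comp., Pos.):
print(average_score([49.50, 49.67, 49.56, 56.23, 92.27, 43.67, 39.50, 77.00, 50.08]))  # 56.39
```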
## Comprehensive performance results on MRAMG-Wit (Web Dataset).
| Framework | Model | Prec. | Rec. | F1 | R.L. | B.S. | Rel. | Eff. | Comp. | Pos. | Avg. |
|------------|------------------------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| Rule-Based | GPT-4o | 49.50 | 49.67 | 49.56 | 56.23 | 92.27 | 43.67 | 39.50 | 77.00 | 50.08 | 56.39 |
| | GPT-4o-mini | 42.83 | 42.83 | 42.83 | 48.55 | 89.52 | 38.30 | 34.83 | 76.90 | 43.33 | 51.10 |
| | Claude-3.5-Sonnet | 50.08 | 50.50 | 50.22 | 53.37 | 92.53 | 44.03 | 39.93 | 79.20 | 50.58 | 56.72 |
| | Gemini-1.5-Pro | 28.83 | 29.00 | 28.89 | 39.47 | 84.96 | 25.20 | 22.83 | 75.50 | 29.08 | 40.42 |
| | DeepSeek-V3 | 57.67 | 58.00 | 57.78 | 58.71 | 93.65 | 51.00 | 46.13 | 79.37 | 58.17 | 62.28 |
| | Qwen2-VL-7B-Instruct | 51.67 | 51.83 | 51.72 | 53.23 | 91.14 | 45.97 | 41.53 | 74.97 | 52.25 | 57.15 |
| | Qwen2-VL-72B-Instruct | 40.83 | 41.00 | 40.89 | 46.80 | 88.20 | 36.17 | 32.73 | 73.73 | 41.58 | 49.10 |
| | InternVL-2.5-8B | 37.25 | 37.33 | 37.28 | 42.09 | 86.57 | 32.43 | 29.20 | 72.10 | 37.42 | 45.74 |
| | InternVL-2.5-78B | 43.25 | 43.50 | 43.33 | 47.52 | 88.58 | 37.53 | 34.20 | 76.20 | 43.42 | 50.84 |
| | Llama-3.1-8B-Instruct | 24.07 | 25.50 | 24.46 | 26.50 | 80.51 | 21.97 | 20.47 | 59.40 | 25.92 | 34.31 |
| | Llama-3.3-70B-Instruct | 53.58 | 53.83 | 53.67 | 56.50 | 92.42 | 46.97 | 42.43 | 78.47 | 54.25 | 59.12 |
| MLLM-Based | GPT-4o | 83.50 | 84.00 | 83.67 | 54.84 | 93.32 | 74.67 | 68.13 | 81.50 | 84.33 | 78.66 |
| | GPT-4o-mini | 64.61 | 86.83 | 71.27 | 47.62 | 92.48 | 74.60 | 69.60 | 74.27 | 67.75 | 72.11 |
| | Claude-3.5-Sonnet | 93.83 | 96.17 | 94.61 | 40.00 | 91.73 | 86.07 | 79.03 | 82.20 | 95.67 | 84.37 |
| | Gemini-1.5-Pro | 94.11 | 96.17 | 94.78 | 50.84 | 91.56 | 83.67 | 75.40 | 78.80 | 95.14 | 84.50 |
| | Qwen2-VL-7B-Instruct | 22.92 | 34.67 | 25.90 | 35.14 | 83.90 | 29.07 | 26.90 | 57.40 | 27.36 | 38.14 |
| | Qwen2-VL-72B-Instruct | 60.92 | 65.17 | 62.19 | 49.95 | 92.34 | 57.53 | 53.20 | 78.37 | 62.62 | 64.70 |
| | InternVL-2.5-8B | 44.71 | 68.17 | 51.33 | 41.24 | 89.07 | 59.07 | 55.53 | 67.10 | 56.34 | 59.17 |
| | InternVL-2.5-78B | 77.15 | 82.17 | 78.75 | 44.01 | 91.63 | 72.87 | 66.67 | 80.13 | 80.71 | 74.90 |
| LLM-Based | GPT-4o | 73.75 | 73.83 | 73.78 | 52.80 | 93.02 | 66.13 | 60.03 | 82.70 | 74.42 | 72.27 |
| | GPT-4o-mini | 61.39 | 91.33 | 70.54 | 42.85 | 91.80 | 78.90 | 72.63 | 76.80 | 63.03 | 72.14 |
| | Claude-3.5-Sonnet | 91.53 | 94.83 | 92.61 | 44.24 | 92.58 | 84.60 | 77.57 | 82.37 | 92.11 | 83.60 |
| | Gemini-1.5-Pro | 96.08 | 96.67 | 96.28 | 53.93 | 92.45 | 84.73 | 77.40 | 80.20 | 96.42 | 86.02 |
| | DeepSeek-V3 | 93.81 | 96.83 | 94.78 | 43.64 | 92.48 | 86.43 | 79.23 | 82.10 | 94.75 | 84.89 |
| | Llama-3.1-8B-Instruct | 32.75 | 40.50 | 34.87 | 32.51 | 82.06 | 37.87 | 35.70 | 54.77 | 36.14 | 43.02 |
| | Llama-3.3-70B-Instruct | 86.58 | 96.00 | 89.09 | 44.83 | 92.87 | 81.93 | 75.33 | 78.90 | 88.15 | 81.52 |
## Comprehensive performance results on MRAMG-Wiki (Web Dataset).
| Framework | Model | Prec. | Rec. | F1 | R.L. | B.S. | Rel. | Eff. | Comp. | Pos. | Avg. |
|------------|------------------------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| Rule-Based | GPT-4o | 53.00 | 53.00 | 53.00 | 54.62 | 95.15 | 46.60 | 42.56 | 82.24 | 53.00 | 59.24 |
| | GPT-4o-mini | 49.60 | 49.60 | 49.60 | 53.39 | 94.87 | 42.52 | 39.12 | 82.04 | 49.60 | 56.70 |
| | Claude-3.5-Sonnet | 37.80 | 37.80 | 37.80 | 49.32 | 94.06 | 32.60 | 30.00 | 82.88 | 37.80 | 48.90 |
| | Gemini-1.5-Pro | 41.20 | 41.20 | 41.20 | 47.18 | 92.46 | 35.76 | 32.64 | 80.44 | 41.20 | 50.36 |
| | DeepSeek-V3 | 56.20 | 56.40 | 56.27 | 53.33 | 95.28 | 49.36 | 44.80 | 83.00 | 56.20 | 61.20 |
| | Qwen2-VL-7B-Instruct | 53.50 | 53.60 | 53.53 | 48.15 | 93.12 | 46.04 | 41.60 | 76.08 | 53.50 | 57.68 |
| | Qwen2-VL-72B-Instruct | 51.50 | 51.60 | 51.53 | 48.08 | 92.81 | 44.76 | 40.76 | 77.72 | 51.50 | 56.70 |
| | InternVL-2.5-8B | 50.00 | 50.20 | 50.07 | 48.06 | 93.32 | 43.64 | 40.08 | 78.20 | 50.20 | 55.97 |
| | InternVL-2.5-78B | 54.00 | 54.20 | 54.07 | 51.42 | 94.61 | 46.40 | 42.60 | 81.44 | 54.10 | 59.20 |
| | Llama-3.1-8B-Instruct | 21.60 | 21.80 | 21.67 | 27.74 | 84.65 | 18.76 | 17.28 | 59.96 | 22.20 | 32.85 |
| | Llama-3.3-70B-Instruct | 53.70 | 53.80 | 53.73 | 53.02 | 94.91 | 46.76 | 43.00 | 80.80 | 53.70 | 59.27 |
| MLLM-Based | GPT-4o | 71.30 | 71.60 | 71.40 | 53.34 | 95.70 | 63.32 | 58.28 | 83.32 | 71.40 | 71.07 |
| | GPT-4o-mini | 49.83 | 81.40 | 58.56 | 49.99 | 95.51 | 70.36 | 64.60 | 74.00 | 51.32 | 66.17 |
| | Claude-3.5-Sonnet | 91.90 | 94.20 | 92.67 | 44.42 | 94.41 | 83.68 | 76.00 | 82.36 | 92.50 | 83.57 |
| | Gemini-1.5-Pro | 92.10 | 93.80 | 92.67 | 50.05 | 94.34 | 82.08 | 74.60 | 79.76 | 92.20 | 83.51 |
| | Qwen2-VL-7B-Instruct | 24.22 | 31.60 | 26.28 | 32.02 | 87.45 | 26.76 | 25.20 | 56.24 | 26.63 | 37.38 |
| | Qwen2-VL-72B-Instruct | 53.64 | 59.60 | 55.29 | 47.93 | 94.63 | 54.28 | 49.32 | 79.92 | 54.19 | 60.98 |
| | InternVL-2.5-8B | 46.92 | 72.40 | 53.97 | 44.69 | 93.12 | 61.76 | 57.64 | 71.08 | 53.79 | 61.71 |
| | InternVL-2.5-78B | 67.12 | 72.40 | 68.66 | 45.43 | 94.85 | 64.32 | 58.96 | 81.24 | 69.33 | 69.15 |
| LLM-Based | GPT-4o | 81.40 | 81.60 | 81.47 | 51.53 | 95.66 | 72.28 | 65.76 | 83.72 | 81.40 | 77.20 |
| | GPT-4o-mini | 44.47 | 86.80 | 56.05 | 47.98 | 95.20 | 72.40 | 67.04 | 73.68 | 45.02 | 65.40 |
| | Claude-3.5-Sonnet | 93.70 | 94.80 | 94.07 | 45.78 | 94.63 | 83.68 | 76.60 | 82.48 | 93.80 | 84.39 |
| | Gemini-1.5-Pro | 95.90 | 96.00 | 95.93 | 50.92 | 94.84 | 81.76 | 75.20 | 79.72 | 96.10 | 85.15 |
| | DeepSeek-V3 | 90.03 | 95.80 | 91.90 | 45.71 | 95.18 | 84.40 | 77.00 | 82.16 | 90.13 | 83.59 |
| | Llama-3.1-8B-Instruct | 23.50 | 28.00 | 24.79 | 35.66 | 85.16 | 23.04 | 21.68 | 51.16 | 23.90 | 35.21 |
| | Llama-3.3-70B-Instruct | 70.61 | 94.40 | 76.35 | 47.86 | 95.47 | 78.16 | 71.84 | 76.96 | 71.46 | 75.90 |
## Comprehensive performance results on MRAMG-Web (Web Dataset).
| Framework | Model | Prec. | Rec. | F1 | R.L. | B.S. | Rel. | Eff. | Comp. | Pos. | Avg. |
|------------|------------------------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| Rule-Based | GPT-4o | 32.47 | 16.93 | 22.11 | 39.17 | 90.56 | 29.47 | 27.81 | 73.87 | 32.80 | 40.58 |
| | GPT-4o-mini | 26.89 | 14.27 | 18.46 | 34.88 | 89.84 | 24.53 | 23.44 | 72.72 | 27.40 | 36.94 |
| | Claude-3.5-Sonnet | 52.27 | 29.20 | 36.89 | 49.74 | 93.69 | 47.12 | 44.75 | 80.27 | 53.07 | 54.11 |
| | Gemini-1.5-Pro | 26.60 | 15.00 | 18.87 | 28.75 | 85.91 | 24.80 | 23.55 | 70.56 | 27.60 | 35.74 |
| | DeepSeek-V3 | 53.27 | 31.00 | 38.40 | 50.29 | 93.71 | 48.83 | 46.13 | 78.96 | 54.53 | 55.01 |
| | Qwen2-VL-7B-Instruct | 16.69 | 8.67 | 11.33 | 33.36 | 90.12 | 15.04 | 14.13 | 64.43 | 16.96 | 30.08 |
| | Qwen2-VL-72B-Instruct | 18.87 | 10.47 | 13.27 | 29.15 | 86.36 | 17.20 | 16.56 | 66.53 | 19.07 | 30.83 |
| | InternVL-2.5-8B | 12.80 | 6.67 | 8.71 | 23.42 | 84.23 | 12.03 | 11.47 | 62.56 | 13.20 | 26.12 |
| | InternVL-2.5-78B | 25.09 | 14.13 | 17.77 | 36.30 | 90.46 | 22.99 | 21.49 | 69.31 | 25.56 | 35.90 |
| | Llama-3.1-8B-Instruct | 25.20 | 15.73 | 18.61 | 24.90 | 83.01 | 25.41 | 23.95 | 56.56 | 28.97 | 33.59 |
| | Llama-3.3-70B-Instruct | 41.80 | 24.00 | 29.93 | 44.60 | 91.86 | 38.13 | 36.11 | 74.77 | 43.13 | 47.15 |
| MLLM-Based | GPT-4o | 89.78 | 83.80 | 85.47 | 52.09 | 95.14 | 94.27 | 90.08 | 91.25 | 93.74 | 86.18 |
| | GPT-4o-mini | 87.71 | 88.60 | 87.82 | 53.13 | 95.66 | 93.49 | 89.44 | 90.03 | 91.49 | 86.37 |
| | Claude-3.5-Sonnet | 88.50 | 91.33 | 89.45 | 50.48 | 94.89 | 95.68 | 92.88 | 93.20 | 92.96 | 87.71 |
| | Gemini-1.5-Pro | 83.51 | 83.73 | 82.91 | 37.06 | 91.10 | 94.05 | 90.05 | 90.43 | 87.01 | 82.21 |
| | Qwen2-VL-7B-Instruct | 30.85 | 31.73 | 29.53 | 37.55 | 90.45 | 36.83 | 34.56 | 67.01 | 34.95 | 43.72 |
| | Qwen2-VL-72B-Instruct | 62.64 | 57.60 | 58.82 | 42.56 | 91.67 | 67.25 | 64.11 | 82.59 | 65.44 | 65.85 |
| | InternVL-2.5-8B | 62.98 | 59.67 | 59.98 | 46.92 | 93.31 | 70.59 | 67.12 | 78.45 | 69.95 | 67.66 |
| | InternVL-2.5-78B | 79.78 | 74.13 | 75.77 | 52.47 | 95.29 | 81.65 | 78.48 | 89.20 | 83.28 | 78.89 |
| LLM-Based | GPT-4o | 86.18 | 78.73 | 81.15 | 54.87 | 95.96 | 86.21 | 82.37 | 89.52 | 87.02 | 82.45 |
| | GPT-4o-mini | 92.86 | 93.40 | 92.95 | 53.50 | 95.82 | 93.20 | 89.28 | 89.95 | 94.59 | 88.39 |
| | Claude-3.5-Sonnet | 92.40 | 92.47 | 92.16 | 54.27 | 95.51 | 94.48 | 91.07 | 91.68 | 94.23 | 88.70 |
| | Gemini-1.5-Pro | 90.16 | 90.13 | 89.82 | 45.64 | 93.38 | 94.13 | 90.16 | 90.75 | 91.38 | 86.17 |
| | DeepSeek-V3 | 94.52 | 94.27 | 94.20 | 56.25 | 96.10 | 94.27 | 90.11 | 90.80 | 95.93 | 89.61 |
| | Llama-3.1-8B-Instruct | 29.34 | 26.27 | 26.31 | 33.70 | 81.16 | 32.08 | 30.48 | 51.81 | 32.38 | 38.17 |
| | Llama-3.3-70B-Instruct | 66.83 | 95.80 | 75.47 | 47.98 | 94.79 | 92.03 | 88.03 | 88.93 | 69.34 | 79.91 |
## Comprehensive performance results on MRAMG-Arxiv (Academic Dataset).
| Framework | Model | Prec. | Rec. | F1 | R.L. | B.S. | Rel. | Eff. | Comp. | Pos. | Avg. |
|------------|------------------------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| Rule-Based | GPT-4o | 55.42 | 63.04 | 57.70 | 44.96 | 94.67 | 69.10 | 67.30 | 84.20 | 75.75 | 68.02 |
| | GPT-4o-mini | 51.71 | 59.29 | 53.80 | 44.21 | 94.36 | 67.50 | 64.80 | 85.20 | 73.75 | 66.07 |
| | Claude-3.5-Sonnet | 55.17 | 62.79 | 57.37 | 42.78 | 94.09 | 69.10 | 66.10 | 84.20 | 75.75 | 67.48 |
| | Gemini-1.5-Pro | 52.43 | 56.29 | 53.10 | 42.18 | 93.85 | 64.20 | 61.80 | 83.20 | 70.28 | 64.15 |
| | DeepSeek-V3 | 56.12 | 67.29 | 59.34 | 45.74 | 94.90 | 74.00 | 70.30 | 84.30 | 78.46 | 70.05 |
| | Qwen2-VL-7B-Instruct | 49.17 | 52.17 | 49.57 | 39.32 | 92.09 | 60.70 | 58.90 | 78.90 | 67.08 | 60.88 |
| | Qwen2-VL-72B-Instruct | 45.42 | 48.71 | 45.68 | 39.86 | 92.39 | 60.30 | 58.10 | 79.60 | 65.42 | 59.50 |
| | InternVL-2.5-8B | 39.20 | 48.29 | 41.51 | 40.36 | 91.42 | 61.60 | 59.20 | 76.70 | 65.17 | 58.16 |
| | InternVL-2.5-78B | 52.21 | 62.00 | 55.28 | 43.66 | 94.51 | 71.00 | 68.70 | 85.40 | 75.38 | 67.57 |
| | Llama-3.1-8B-Instruct | 21.50 | 23.08 | 21.90 | 26.61 | 85.92 | 26.20 | 25.00 | 58.70 | 29.00 | 35.32 |
| | Llama-3.3-70B-Instruct | 53.00 | 58.17 | 53.97 | 44.14 | 94.39 | 65.80 | 63.50 | 83.60 | 73.42 | 65.55 |
| MLLM-Based | GPT-4o | 60.39 | 74.29 | 64.23 | 44.25 | 95.15 | 89.40 | 86.20 | 87.50 | 90.39 | 76.87 |
| | GPT-4o-mini | 36.17 | 74.79 | 46.78 | 42.48 | 95.08 | 83.60 | 80.80 | 83.20 | 74.66 | 68.62 |
| | Claude-3.5-Sonnet | 47.12 | 83.50 | 57.68 | 40.60 | 94.65 | 89.30 | 86.70 | 87.60 | 86.38 | 74.84 |
| | Gemini-1.5-Pro | 58.13 | 80.25 | 64.74 | 41.84 | 94.30 | 85.10 | 82.40 | 85.90 | 83.61 | 75.14 |
| | Qwen2-VL-7B-Instruct | 1.63 | 4.00 | 2.18 | 33.01 | 84.62 | 5.20 | 5.10 | 49.80 | 4.46 | 21.11 |
| | Qwen2-VL-72B-Instruct | 31.99 | 44.87 | 35.22 | 40.54 | 93.53 | 57.90 | 56.60 | 84.20 | 55.16 | 55.56 |
| | InternVL-2.5-8B | 12.22 | 27.87 | 15.78 | 31.99 | 83.72 | 30.40 | 29.50 | 58.10 | 28.49 | 35.34 |
| | InternVL-2.5-78B | 36.62 | 55.00 | 41.77 | 37.99 | 94.47 | 68.10 | 66.20 | 84.80 | 64.11 | 61.01 |
| LLM-Based | GPT-4o | 65.28 | 76.54 | 68.54 | 44.13 | 95.23 | 86.00 | 82.70 | 88.90 | 84.84 | 76.91 |
| | GPT-4o-mini | 37.69 | 83.33 | 49.90 | 41.23 | 95.01 | 85.90 | 82.60 | 84.50 | 69.07 | 69.91 |
| | Claude-3.5-Sonnet | 62.17 | 88.00 | 70.16 | 41.04 | 94.37 | 90.90 | 88.60 | 89.60 | 88.17 | 79.22 |
| | Gemini-1.5-Pro | 59.85 | 78.63 | 65.22 | 42.41 | 94.32 | 84.60 | 82.20 | 87.60 | 80.15 | 75.00 |
| | DeepSeek-V3 | 46.57 | 81.13 | 56.69 | 39.48 | 94.70 | 90.30 | 86.40 | 87.50 | 70.01 | 72.53 |
| | Llama-3.1-8B-Instruct | 1.50 | 2.00 | 1.67 | 25.78 | 80.61 | 3.30 | 3.00 | 43.40 | 4.00 | 18.36 |
| | Llama-3.3-70B-Instruct | 38.78 | 84.88 | 48.56 | 37.83 | 95.01 | 85.50 | 81.80 | 83.40 | 64.59 | 68.93 |
## Comprehensive performance results on MRAMG-Recipe (Lifestyle Dataset).
| Framework | Model | Prec. | Rec. | F1 | R.L. | B.S. | Ord. | Rel. | Eff. | Comp. | Pos. | Avg. |
|------------|------------------------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| Rule-Based | GPT-4o | 48.79 | 66.11 | 52.76 | 51.80 | 92.10 | 45.30 | 77.80 | 74.64 | 79.19 | 78.04 | 66.65 |
| | GPT-4o-mini | 51.10 | 63.88 | 53.43 | 49.49 | 91.14 | 45.58 | 75.42 | 72.42 | 79.40 | 76.91 | 65.88 |
| | Claude-3.5-Sonnet | 52.15 | 62.95 | 53.48 | 47.13 | 92.08 | 44.94 | 75.36 | 72.53 | 79.84 | 77.21 | 65.77 |
| | Gemini-1.5-Pro | 50.61 | 51.46 | 47.23 | 40.71 | 87.97 | 39.61 | 71.09 | 68.31 | 78.40 | 73.08 | 60.85 |
| | DeepSeek-V3 | 26.13 | 59.00 | 33.36 | 50.51 | 92.48 | 22.96 | 74.58 | 71.92 | 73.36 | 64.49 | 56.88 |
| | Qwen2-VL-7B-Instruct | 45.55 | 63.81 | 48.46 | 50.79 | 91.85 | 41.36 | 77.99 | 74.92 | 78.06 | 78.36 | 65.11 |
| | Qwen2-VL-72B-Instruct | 31.20 | 50.10 | 34.40 | 46.33 | 89.91 | 24.99 | 73.41 | 70.50 | 72.61 | 71.18 | 56.46 |
| | InternVL-2.5-8B | 29.37 | 52.39 | 32.92 | 42.87 | 90.19 | 23.14 | 73.58 | 70.92 | 72.53 | 71.38 | 55.93 |
| | InternVL-2.5-78B | 20.90 | 70.86 | 29.26 | 51.20 | 92.37 | 17.82 | 75.20 | 72.77 | 74.43 | 54.05 | 55.89 |
| | Llama-3.1-8B-Instruct | 27.59 | 37.70 | 25.17 | 25.89 | 81.02 | 18.83 | 64.42 | 61.52 | 65.64 | 61.73 | 46.95 |
| | Llama-3.3-70B-Instruct | 29.56 | 51.38 | 34.55 | 51.56 | 93.19 | 24.57 | 74.31 | 71.50 | 72.64 | 69.57 | 57.28 |
| MLLM-Based | GPT-4o | 45.20 | 46.49 | 42.25 | 45.74 | 92.72 | 33.70 | 77.31 | 74.64 | 81.65 | 78.01 | 61.77 |
| | GPT-4o-mini | 30.31 | 50.26 | 33.86 | 40.16 | 91.81 | 22.67 | 77.97 | 75.49 | 77.52 | 71.23 | 57.13 |
| | Claude-3.5-Sonnet | 30.04 | 54.21 | 35.01 | 34.54 | 90.90 | 22.18 | 80.56 | 78.18 | 79.75 | 74.75 | 58.01 |
| | Gemini-1.5-Pro | 39.01 | 59.50 | 43.50 | 43.43 | 89.89 | 32.49 | 81.94 | 79.22 | 81.64 | 70.42 | 62.10 |
| | Qwen2-VL-7B-Instruct | 9.06 | 15.17 | 9.48 | 34.47 | 84.65 | 4.44 | 18.81 | 18.08 | 55.62 | 17.17 | 26.69 |
| | Qwen2-VL-72B-Instruct | 19.19 | 26.47 | 19.70 | 43.26 | 91.35 | 12.27 | 43.25 | 41.57 | 74.52 | 39.73 | 41.13 |
| | InternVL-2.5-8B | 23.01 | 39.81 | 23.89 | 33.22 | 89.42 | 15.34 | 67.19 | 64.96 | 74.45 | 63.44 | 49.47 |
| | InternVL-2.5-78B | 21.72 | 30.07 | 21.22 | 36.60 | 91.13 | 13.87 | 56.60 | 54.66 | 75.79 | 52.99 | 45.46 |
| LLM-Based | GPT-4o | 49.70 | 65.03 | 51.91 | 44.75 | 92.42 | 43.59 | 82.58 | 79.38 | 81.02 | 81.88 | 67.23 |
| | GPT-4o-mini | 45.59 | 39.32 | 39.61 | 47.56 | 92.91 | 32.04 | 51.78 | 49.82 | 83.47 | 54.86 | 53.70 |
| | Claude-3.5-Sonnet | 62.24 | 67.73 | 61.48 | 38.65 | 91.49 | 53.23 | 81.15 | 78.30 | 84.96 | 83.87 | 70.31 |
| | Gemini-1.5-Pro | 64.87 | 71.43 | 64.43 | 47.01 | 90.70 | 56.89 | 82.39 | 79.16 | 83.55 | 80.69 | 72.11 |
| | DeepSeek-V3 | 47.53 | 70.82 | 51.92 | 39.83 | 91.84 | 40.90 | 84.38 | 81.46 | 82.97 | 77.92 | 66.96 |
| | Llama-3.1-8B-Instruct | 11.56 | 12.69 | 10.89 | 24.61 | 75.21 | 6.70 | 17.71 | 17.04 | 41.86 | 18.32 | 23.66 |
| | Llama-3.3-70B-Instruct | 36.87 | 72.52 | 44.31 | 38.38 | 91.99 | 31.00 | 81.84 | 79.19 | 80.84 | 71.99 | 62.89 |
## Comprehensive performance results on MRAMG-Manual (Lifestyle Dataset).
| Framework | Model | Prec. | Rec. | F1 | R.L. | B.S. | Ord. | Rel. | Eff. | Comp. | Pos. | Avg. |
|------------|------------------------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| Rule-Based | GPT-4o | 36.45 | 47.97 | 38.32 | 50.82 | 91.51 | 32.10 | 75.79 | 73.44 | 79.08 | 71.66 | 59.71 |
| | GPT-4o-mini | 37.29 | 47.18 | 38.22 | 50.40 | 91.05 | 32.83 | 73.28 | 71.79 | 78.87 | 69.70 | 59.06 |
| | Claude-3.5-Sonnet | 39.27 | 50.38 | 41.13 | 48.17 | 91.69 | 32.82 | 73.18 | 71.23 | 77.44 | 71.43 | 59.67 |
| | Gemini-1.5-Pro | 40.54 | 45.17 | 39.84 | 46.69 | 90.40 | 33.01 | 73.13 | 70.67 | 76.56 | 73.51 | 58.95 |
| | DeepSeek-V3 | 32.75 | 48.83 | 36.28 | 51.57 | 92.05 | 31.29 | 76.92 | 75.08 | 79.23 | 69.42 | 59.34 |
| | Qwen2-VL-7B-Instruct | 33.18 | 43.48 | 34.32 | 46.68 | 89.32 | 27.61 | 71.79 | 69.90 | 75.54 | 71.14 | 56.30 |
| | Qwen2-VL-72B-Instruct | 35.58 | 44.42 | 35.38 | 46.30 | 89.73 | 28.86 | 70.72 | 68.31 | 74.82 | 69.46 | 56.36 |
| | InternVL-2.5-8B | 29.53 | 45.93 | 32.06 | 42.30 | 89.64 | 24.17 | 72.10 | 69.23 | 74.15 | 70.72 | 54.98 |
| | InternVL-2.5-78B | 32.96 | 48.63 | 36.00 | 48.26 | 91.10 | 29.72 | 75.74 | 73.44 | 78.10 | 71.66 | 58.56 |
| | Llama-3.1-8B-Instruct | 32.07 | 27.50 | 26.58 | 30.90 | 82.93 | 15.10 | 50.87 | 49.44 | 62.00 | 55.84 | 43.32 |
| | Llama-3.3-70B-Instruct | 34.53 | 44.35 | 35.60 | 49.50 | 91.22 | 30.26 | 73.13 | 71.03 | 75.74 | 69.26 | 57.46 |
| MLLM-Based | GPT-4o | 35.07 | 33.78 | 32.44 | 44.68 | 91.16 | 24.50 | 75.49 | 73.28 | 79.59 | 73.38 | 56.34 |
| | GPT-4o-mini | 23.43 | 32.24 | 25.16 | 43.60 | 91.05 | 17.33 | 72.92 | 71.13 | 75.23 | 62.22 | 51.43 |
| | Claude-3.5-Sonnet | 25.17 | 39.24 | 28.47 | 40.32 | 91.02 | 19.94 | 80.51 | 78.10 | 80.41 | 75.12 | 55.83 |
| | Gemini-1.5-Pro | 36.01 | 44.68 | 37.14 | 48.87 | 90.99 | 28.76 | 76.62 | 74.62 | 79.79 | 66.32 | 58.38 |
| | Qwen2-VL-7B-Instruct | 13.32 | 15.05 | 13.48 | 41.07 | 86.02 | 3.09 | 13.38 | 12.82 | 57.74 | 10.46 | 26.65 |
| | Qwen2-VL-72B-Instruct | 22.13 | 24.92 | 21.62 | 44.36 | 90.34 | 12.95 | 49.08 | 47.13 | 73.44 | 41.23 | 42.72 |
| | InternVL-2.5-8B | 17.23 | 26.63 | 18.65 | 39.71 | 89.33 | 9.34 | 47.38 | 46.26 | 71.23 | 39.90 | 40.57 |
| | InternVL-2.5-78B | 19.70 | 23.19 | 19.37 | 42.90 | 91.01 | 11.36 | 55.95 | 55.18 | 73.28 | 45.90 | 43.78 |
| LLM-Based | GPT-4o | 34.02 | 46.48 | 36.78 | 45.99 | 91.46 | 35.80 | 77.59 | 75.64 | 78.05 | 71.65 | 59.35 |
| | GPT-4o-mini | 36.94 | 31.87 | 32.64 | 45.77 | 91.35 | 25.46 | 55.33 | 54.05 | 81.79 | 55.56 | 51.08 |
| | Claude-3.5-Sonnet | 45.21 | 44.59 | 43.20 | 42.68 | 91.64 | 40.39 | 75.08 | 72.67 | 82.62 | 74.73 | 61.28 |
| | Gemini-1.5-Pro | 46.23 | 49.69 | 45.43 | 50.21 | 91.58 | 39.87 | 76.62 | 74.36 | 80.36 | 73.40 | 62.77 |
| | DeepSeek-V3 | 34.71 | 47.89 | 37.82 | 43.80 | 91.38 | 36.81 | 81.08 | 78.77 | 80.67 | 71.65 | 60.46 |
| | Llama-3.1-8B-Instruct | 12.65 | 13.12 | 12.38 | 22.27 | 76.31 | 3.03 | 10.56 | 10.46 | 35.59 | 10.06 | 20.64 |
| | Llama-3.3-70B-Instruct | 25.74 | 50.15 | 31.26 | 39.80 | 91.31 | 28.03 | 76.72 | 74.36 | 75.95 | 62.56 | 55.59 |
## Comprehensive performance results on MRAMG-Bench.
Columns prefixed with **Web**, **Acad.**, and **Life** report results on the Web, Academic, and Lifestyle data, respectively.

| Framework | Model | Web Prec. | Web Rec. | Web F1 | Web R.L. | Web B.S. | Web Rel. | Web Eff. | Web Comp. | Web Pos. | Web Avg. | Acad. Prec. | Acad. Rec. | Acad. F1 | Acad. R.L. | Acad. B.S. | Acad. Rel. | Acad. Eff. | Acad. Comp. | Acad. Pos. | Acad. Avg. | Life Prec. | Life Rec. | Life F1 | Life R.L. | Life B.S. | Life Ord. | Life Rel. | Life Eff. | Life Comp. | Life Pos. | Life Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Rule-Based | GPT-4o | 43.54 | 37.30 | 39.36 | 48.88 | 92.35 | 38.70 | 35.59 | 77.15 | 43.86 | 50.75 | 55.42 | 63.04 | 57.70 | 44.96 | 94.67 | 69.10 | 67.30 | 84.20 | 75.75 | 68.02 | 47.04 | 63.54 | 50.71 | 51.66 | 92.01 | 43.54 | 77.51 | 74.47 | 79.17 | 77.13 | 65.68 |
| | GPT-4o-mini | 38.20 | 33.08 | 34.78 | 44.32 | 91.10 | 33.86 | 31.37 | 76.59 | 38.57 | 46.87 | 51.71 | 59.29 | 53.80 | 44.21 | 94.36 | 67.50 | 64.80 | 85.20 | 73.75 | 66.07 | 49.14 | 61.52 | 51.27 | 49.62 | 91.13 | 43.87 | 75.11 | 72.33 | 79.32 | 75.89 | 64.92 |
| | Claude-3.5-Sonnet | 47.65 | 38.43 | 41.46 | 50.81 | 93.42 | 42.19 | 39.20 | 80.63 | 48.14 | 53.55 | 55.17 | 62.79 | 57.37 | 42.78 | 94.09 | 69.10 | 66.10 | 84.20 | 75.75 | 67.48 | 50.32 | 61.17 | 51.73 | 47.27 | 92.02 | 43.32 | 75.05 | 72.34 | 79.50 | 76.39 | 64.91 |
| | Gemini-1.5-Pro | 31.27 | 26.62 | 28.15 | 37.21 | 87.37 | 27.89 | 25.77 | 74.83 | 31.76 | 41.21 | 52.43 | 56.29 | 53.10 | 42.18 | 93.85 | 64.20 | 61.80 | 83.20 | 70.28 | 64.15 | 49.18 | 50.57 | 46.18 | 41.56 | 88.32 | 38.73 | 71.38 | 68.64 | 78.14 | 73.14 | 60.58 |
| | DeepSeek-V3 | 55.49 | 46.62 | 49.51 | 53.84 | 94.12 | 49.68 | 45.77 | 80.18 | 56.16 | 59.04 | 56.12 | 67.29 | 59.34 | 45.74 | 94.90 | 74.00 | 70.30 | 84.30 | 78.46 | 70.05 | 27.07 | 57.56 | 33.77 | 50.66 | 92.42 | 24.07 | 74.92 | 72.36 | 74.19 | 65.19 | 57.22 |
| | Qwen2-VL-7B-Instruct | 37.98 | 34.81 | 35.84 | 43.80 | 91.26 | 33.45 | 30.44 | 70.99 | 38.28 | 46.32 | 49.17 | 52.17 | 49.57 | 39.32 | 92.09 | 60.70 | 58.90 | 78.90 | 67.08 | 60.88 | 43.79 | 60.93 | 46.46 | 50.21 | 91.49 | 39.52 | 77.11 | 74.20 | 77.70 | 77.33 | 63.87 |
| | Qwen2-VL-72B-Instruct | 34.81 | 31.49 | 32.57 | 39.99 | 88.70 | 30.80 | 28.35 | 71.89 | 35.14 | 43.75 | 45.42 | 48.71 | 45.68 | 39.86 | 92.39 | 60.30 | 58.10 | 79.60 | 65.42 | 59.50 | 31.82 | 49.30 | 34.54 | 46.33 | 89.88 | 25.51 | 73.03 | 70.19 | 72.92 | 70.94 | 56.45 |
| | InternVL-2.5-8B | 30.78 | 28.38 | 29.15 | 36.13 | 87.45 | 27.19 | 24.95 | 69.88 | 31.05 | 40.55 | 39.20 | 48.29 | 41.51 | 40.36 | 91.42 | 61.60 | 59.20 | 76.70 | 65.17 | 58.16 | 29.39 | 51.47 | 32.80 | 42.79 | 90.11 | 23.28 | 73.37 | 70.68 | 72.76 | 71.29 | 55.79 |
| | InternVL-2.5-78B | 38.79 | 34.49 | 35.87 | 44.03 | 90.97 | 34.03 | 31.32 | 74.82 | 39.06 | 47.04 | 52.21 | 62.00 | 55.28 | 43.66 | 94.51 | 71.00 | 68.70 | 85.40 | 75.38 | 67.57 | 22.61 | 67.71 | 30.21 | 50.79 | 92.19 | 19.41 | 75.28 | 72.87 | 74.95 | 56.55 | 56.26 |
| | Llama-3.1-8B-Instruct | 23.86 | 20.54 | 21.34 | 26.19 | 82.64 | 22.50 | 21.02 | 58.40 | 26.15 | 33.63 | 21.50 | 23.08 | 21.90 | 26.61 | 85.92 | 26.20 | 25.00 | 58.70 | 29.00 | 35.32 | 28.23 | 36.25 | 25.37 | 26.60 | 81.29 | 18.33 | 62.49 | 59.80 | 65.12 | 60.90 | 46.44 |
| | Llama-3.3-70B-Instruct | 48.84 | 41.73 | 44.06 | 50.73 | 92.87 | 43.33 | 40.02 | 77.60 | 49.59 | 54.31 | 53.00 | 58.17 | 53.97 | 44.14 | 94.39 | 65.80 | 63.50 | 83.60 | 73.42 | 65.55 | 30.27 | 50.38 | 34.70 | 51.27 | 92.91 | 25.33 | 74.14 | 71.43 | 73.08 | 69.53 | 57.30 |
| MLLM-Based | GPT-4o | 82.75 | 80.57 | 81.08 | 53.32 | 94.70 | 79.55 | 74.37 | 85.95 | 84.65 | 79.66 | 60.39 | 74.29 | 64.23 | 44.25 | 95.15 | 89.40 | 86.20 | 87.50 | 90.39 | 76.87 | 43.77 | 44.68 | 40.86 | 45.59 | 92.50 | 32.47 | 77.05 | 74.45 | 81.36 | 77.35 | 61.01 |
| | GPT-4o-mini | 69.98 | 86.08 | 74.55 | 50.49 | 94.59 | 81.11 | 76.29 | 80.58 | 72.93 | 76.29 | 36.17 | 74.79 | 46.78 | 42.48 | 95.08 | 83.60 | 80.80 | 83.20 | 74.66 | 68.62 | 29.34 | 47.71 | 32.62 | 40.65 | 91.71 | 21.95 | 77.26 | 74.87 | 77.19 | 69.95 | 56.32 |
| | Claude-3.5-Sonnet | 91.15 | 93.68 | 91.99 | 45.44 | 93.73 | 89.32 | 83.83 | 86.70 | 93.71 | 85.51 | 47.12 | 83.50 | 57.68 | 40.60 | 94.65 | 89.30 | 86.70 | 87.60 | 86.38 | 74.84 | 29.35 | 52.09 | 34.08 | 35.36 | 90.92 | 21.88 | 80.55 | 78.17 | 79.85 | 74.80 | 57.71 |
| | Gemini-1.5-Pro | 89.27 | 90.49 | 89.39 | 45.04 | 92.13 | 87.45 | 81.12 | 83.77 | 91.05 | 83.30 | 58.13 | 80.25 | 64.74 | 41.84 | 94.30 | 85.10 | 82.40 | 85.90 | 83.61 | 75.14 | 38.59 | 57.40 | 42.60 | 44.20 | 90.05 | 31.99 | 81.19 | 78.57 | 81.37 | 69.84 | 61.58 |
| | Qwen2-VL-7B-Instruct | 26.48 | 32.65 | 27.47 | 35.27 | 87.51 | 31.59 | 29.55 | 60.98 | 30.24 | 40.19 | 1.63 | 4.00 | 2.18 | 33.01 | 84.62 | 5.20 | 5.10 | 49.80 | 4.46 | 21.11 | 9.66 | 15.15 | 10.05 | 35.41 | 84.84 | 4.26 | 18.04 | 17.34 | 55.92 | 16.22 | 26.69 |
| | Qwen2-VL-72B-Instruct | 59.65 | 60.59 | 58.96 | 46.41 | 92.69 | 60.59 | 56.57 | 80.50 | 61.49 | 64.16 | 31.99 | 44.87 | 35.22 | 40.54 | 93.53 | 57.90 | 56.60 | 84.20 | 55.16 | 55.56 | 19.61 | 26.25 | 19.98 | 43.42 | 91.21 | 12.36 | 44.07 | 42.36 | 74.36 | 39.94 | 41.36 |
| | InternVL-2.5-8B | 52.71 | 65.86 | 55.55 | 44.48 | 91.89 | 64.46 | 60.80 | 72.78 | 61.17 | 63.30 | 12.22 | 27.87 | 15.78 | 31.99 | 83.72 | 30.40 | 29.50 | 58.10 | 28.49 | 35.34 | 22.19 | 37.94 | 23.14 | 34.14 | 89.41 | 14.54 | 64.39 | 62.31 | 73.99 | 60.11 | 48.22 |
| | InternVL-2.5-78B | 75.50 | 76.27 | 74.82 | 47.82 | 93.98 | 74.12 | 69.37 | 84.11 | 78.68 | 74.96 | 36.62 | 55.00 | 41.77 | 37.99 | 94.47 | 68.10 | 66.20 | 84.80 | 64.11 | 61.01 | 21.43 | 29.09 | 20.96 | 37.49 | 91.11 | 13.53 | 56.51 | 54.73 | 75.43 | 51.99 | 45.23 |
| LLM-Based | GPT-4o | 80.86 | 77.92 | 78.85 | 53.30 | 94.92 | 75.94 | 70.64 | 85.74 | 81.41 | 77.73 | 65.28 | 76.54 | 68.54 | 44.13 | 95.23 | 86.00 | 82.70 | 88.90 | 84.84 | 76.91 | 47.48 | 62.40 | 49.76 | 44.92 | 92.29 | 42.55 | 81.88 | 78.85 | 80.60 | 80.43 | 66.11 |
| | GPT-4o-mini | 69.58 | 90.95 | 75.71 | 48.56 | 94.35 | 82.94 | 77.87 | 81.29 | 70.96 | 76.91 | 37.69 | 83.33 | 49.90 | 41.23 | 95.01 | 85.90 | 82.60 | 84.50 | 69.07 | 69.91 | 44.36 | 38.27 | 38.62 | 47.30 | 92.69 | 31.16 | 52.28 | 50.42 | 83.24 | 54.96 | 53.33 |
| | Claude-3.5-Sonnet | 92.47 | 93.86 | 92.82 | 48.72 | 94.32 | 88.36 | 82.78 | 86.17 | 93.43 | 85.88 | 62.17 | 88.00 | 70.16 | 41.04 | 94.37 | 90.90 | 88.60 | 89.60 | 88.17 | 79.22 | 59.83 | 64.45 | 58.89 | 39.22 | 91.51 | 51.51 | 80.29 | 77.50 | 84.63 | 82.58 | 69.04 |
| | Gemini-1.5-Pro | 93.63 | 93.84 | 93.57 | 49.75 | 93.48 | 87.74 | 81.98 | 84.35 | 94.29 | 85.85 | 59.85 | 78.63 | 65.22 | 42.41 | 94.32 | 84.60 | 82.20 | 87.60 | 80.15 | 75.00 | 62.23 | 68.34 | 61.74 | 47.47 | 90.83 | 54.62 | 81.57 | 78.48 | 83.10 | 79.66 | 70.80 |
| | DeepSeek-V3 | 93.08 | 95.51 | 93.76 | 49.31 | 94.68 | 89.06 | 83.04 | 85.64 | 93.98 | 86.45 | 46.57 | 81.13 | 56.69 | 39.48 | 94.70 | 90.30 | 86.40 | 87.50 | 70.01 | 72.53 | 45.71 | 67.56 | 49.92 | 40.39 | 91.78 | 40.35 | 83.91 | 81.08 | 82.64 | 77.03 | 66.04 |
| | Llama-3.1-8B-Instruct | 28.87 | 31.35 | 28.68 | 33.84 | 82.53 | 31.51 | 29.79 | 52.59 | 31.31 | 38.94 | 1.50 | 2.00 | 1.67 | 25.78 | 80.61 | 3.30 | 3.00 | 43.40 | 4.00 | 18.36 | 11.71 | 12.75 | 11.10 | 24.28 | 75.36 | 6.21 | 16.70 | 16.11 | 40.97 | 17.15 | 23.23 |
| | Llama-3.3-70B-Instruct | 74.26 | 95.49 | 80.12 | 46.93 | 94.35 | 85.01 | 79.54 | 82.44 | 76.01 | 79.35 | 38.78 | 84.88 | 48.56 | 37.83 | 95.01 | 85.50 | 81.80 | 83.40 | 64.59 | 68.93 | 35.29 | 69.34 | 42.46 | 38.58 | 91.89 | 30.60 | 81.11 | 78.51 | 80.15 | 70.66 | 61.86 |