metadata

license: mit
pipeline_tag: image-text-to-text
library_name: transformers
base_model:
  - OpenGVLab/InternVL2_5-26B-MPO
base_model_relation: quantized
datasets:
  - OpenGVLab/MMPR-v1.1
language:
  - multilingual
tags:
  - internvl
  - custom_code

InternVL2_5-26B-MPO-AWQ

[📂 GitHub] [📜 InternVL 1.0] [📜 InternVL 1.5] [📜 InternVL 2.5] [📜 InternVL2.5-MPO]

[🆕 Blog] [🗨️ Chat Demo] [🤗 HF Demo] [🚀 Quick Start] [📖 Documents]

Introduction

We introduce InternVL2.5-MPO, an advanced multimodal large language model (MLLM) series that outperforms its InternVL2.5 counterparts on average across the benchmarks reported below. This series builds upon InternVL2.5 and Mixed Preference Optimization.

InternVL 2.5 Family

In the following table, we provide an overview of the InternVL2.5-MPO series.

Model Name	Vision Part	Language Part	HF Link
InternVL2_5-1B-MPO	InternViT-300M-448px-V2_5	Qwen2.5-0.5B-Instruct	🤗 link
InternVL2_5-2B-MPO	InternViT-300M-448px-V2_5	internlm2_5-1_8b-chat	🤗 link
InternVL2_5-4B-MPO	InternViT-300M-448px-V2_5	Qwen2.5-3B-Instruct	🤗 link
InternVL2_5-8B-MPO	InternViT-300M-448px-V2_5	internlm2_5-7b-chat	🤗 link
InternVL2_5-26B-MPO	InternViT-6B-448px-V2_5	internlm2_5-20b-chat	🤗 link
InternVL2_5-38B-MPO	InternViT-6B-448px-V2_5	Qwen2.5-32B-Instruct	🤗 link
InternVL2_5-78B-MPO	InternViT-6B-448px-V2_5	Qwen2.5-72B-Instruct	🤗 link

Model Architecture

As shown in the following figure, InternVL2.5-MPO retains the same model architecture as InternVL 2.5 and its predecessors, InternVL 1.5 and 2.0, following the "ViT-MLP-LLM" paradigm. In this new version, we integrate a newly incrementally pre-trained InternViT with various pre-trained LLMs, including InternLM 2.5 and Qwen 2.5, using a randomly initialized MLP projector.

As in the previous version, we apply a pixel unshuffle operation, reducing the number of visual tokens to one-quarter of the original. We also adopt a dynamic resolution strategy similar to that of InternVL 1.5, dividing images into tiles of 448×448 pixels. The key difference, starting from InternVL 2.0, is that we additionally support multi-image and video data.
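The token reduction described above can be illustrated with a small, dependency-free sketch. It is not the model's actual implementation: the patch size of 14 pixels and the channel ordering inside each merged cell are assumptions made for illustration, and `pixel_unshuffle` and `tokens_per_tile` are hypothetical helper names.

```python
def pixel_unshuffle(grid, r=2):
    """Fold each r x r block of feature vectors into one longer vector.

    grid: H x W nested list, each cell a list of C channel values.
    Returns an (H // r) x (W // r) grid whose cells hold C * r * r values,
    so the number of spatial positions (tokens) drops by a factor of r * r.
    """
    h, w = len(grid), len(grid[0])
    assert h % r == 0 and w % r == 0, "grid sides must be divisible by r"
    out = []
    for i in range(0, h, r):
        row = []
        for j in range(0, w, r):
            merged = []
            for di in range(r):
                for dj in range(r):
                    merged.extend(grid[i + di][j + dj])
            row.append(merged)
        out.append(row)
    return out


def tokens_per_tile(tile_size=448, patch_size=14, r=2):
    """Visual tokens for one tile before and after pixel unshuffle.

    patch_size=14 is an assumption; the text above only fixes the tile size.
    """
    side = tile_size // patch_size
    return side * side, (side // r) ** 2


# A 4 x 4 grid of single-channel features becomes a 2 x 2 grid of 4-channel features.
demo = [[[row * 4 + col] for col in range(4)] for row in range(4)]
folded = pixel_unshuffle(demo)
print(len(folded) * len(folded[0]), "tokens after,", 16, "tokens before")
print(tokens_per_tile())
```

Under these assumptions a 448×448 tile yields 1024 patch tokens before the operation and 256 after, which matches the one-quarter reduction stated above.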

Key Designs

Multi-Modal Preference Dataset

MMPR is a large-scale and high-quality multimodal reasoning preference dataset. This dataset includes about 3 million samples.

To construct this dataset, we propose an efficient data construction pipeline. Specifically, we categorize the multimodal data into samples with clear ground truths and samples without clear ground truths.

For samples with clear ground truths: the model is prompted to first provide its reasoning process and then give the final answer in a format such as Final Answer: ***. Responses matching the ground-truth answer constitute the positive set $\mathcal{Y}_p$ , while those that do not match make up the negative set $\mathcal{Y}_n$ . Responses that fail to provide a clear final answer are also merged into $\mathcal{Y}_n$ . Given these responses labeled as positive or negative, we build the preference pairs by selecting a chosen response $y_{c}$ from $\mathcal{Y}_p$ and a rejected response $y_{r}$ from $\mathcal{Y}_n$ .
For samples without clear ground truths: we propose a simple yet effective method: Dropout Next-Token Prediction (Dropout NTP). Specifically, we use the responses generated by InternVL2-8B as chosen answers. Given the chosen answer, we truncate it by half and then prompt InternVL2-8B to complete the remaining portion of the truncated answer without access to the image input. This generated completion serves as the rejected answer for the paired sample. It is worth noting that while the responses generated by InternVL2-8B may not be perfect, the completions generated without the image input will introduce more hallucinations than those generated with the image input. Therefore, the partial order relationship between the chosen and rejected responses holds true.
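The two cases above can be sketched in a few lines of Python. This is a simplified illustration, not the open-sourced pipeline: the answer matching is a plain case-insensitive string comparison, the rejected response is sampled at random, truncation is done on whitespace-separated words, and all function names are hypothetical.

```python
import random
import re

FINAL_ANSWER = re.compile(r"Final Answer:\s*(.+)", re.IGNORECASE)


def extract_final_answer(response):
    """Return the text after the last 'Final Answer:' marker, or None."""
    matches = FINAL_ANSWER.findall(response)
    return matches[-1].strip() if matches else None


def split_responses(responses, ground_truth):
    """Partition responses into the positive set Y_p and negative set Y_n."""
    positives, negatives = [], []
    target = ground_truth.strip().lower()
    for resp in responses:
        answer = extract_final_answer(resp)
        # Responses without a clear final answer also go to the negative set.
        if answer is not None and answer.lower() == target:
            positives.append(resp)
        else:
            negatives.append(resp)
    return positives, negatives


def build_pairs(responses, ground_truth, rng=None):
    """Pair every positive response with one randomly drawn negative response."""
    rng = rng or random.Random(0)
    positives, negatives = split_responses(responses, ground_truth)
    if not positives or not negatives:
        return []
    return [(chosen, rng.choice(negatives)) for chosen in positives]


def truncate_half(answer):
    """Dropout NTP prefix: keep the first half of the chosen answer (by words).

    The model would then complete this prefix without the image, and the
    completed text would serve as the rejected answer.
    """
    words = answer.split()
    return " ".join(words[: len(words) // 2])


samples = ["2 + 2 is 4. Final Answer: 4", "Probably 5. Final Answer: 5", "Not sure."]
print(build_pairs(samples, "4"))
print(truncate_half("a cat sits on the red mat"))
```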

The data construction pipeline is open-sourced; see our documentation for more details.

Mixed Preference Optimization

The key insight behind MPO is that an effective PO process should enable the model to learn the relative preference between pairs of responses, the absolute quality of individual responses, and the process for generating preferred responses. We define the training objective as a combination of preference loss $\mathcal{L}_{\text{p}}$ , quality loss $\mathcal{L}_{\text{q}}$ , and generation loss $\mathcal{L}_{\text{g}}$ , referred to as Mixed Preference Optimization:

$\mathcal{L}=w_{p}\cdot\mathcal{L}_{\text{p}} + w_{q}\cdot\mathcal{L}_{\text{q}} + w_{g}\cdot\mathcal{L}_{\text{g}},$

where $w_{*}$ represents the weight assigned to each loss component. In this work, we empirically compare different variants of preference loss. Based on the experimental results, we use DPO as our preference loss and BCO as our quality loss.

Specifically, DPO serves as the preference loss, enabling the model to learn the relative preference between chosen and rejected responses. This algorithm optimizes the following loss function:

$\mathcal{L}_{\text{p}}=-\log \sigma\left(\beta \log \frac{\pi_\theta\left(y_c \mid x\right)}{\pi_0\left(y_c \mid x\right)}-\beta \log \frac{\pi_\theta\left(y_r \mid x\right)}{\pi_0\left(y_r \mid x\right)}\right),$

where $\beta$ is the KL penalty coefficient, and $x$ , $y_{c}$ , and $y_{r}$ are the user query, chosen response, and rejected response, respectively. The policy model $\pi_\theta$ is initialized from the reference model $\pi_0$ .

Additionally, the BCO loss is employed as the quality loss, which helps the model to understand the absolute quality of individual responses. The loss function is defined as:

$\mathcal{L}_{\text{q}}=\mathcal{L}_{\text{q}}^+ + \mathcal{L}_{\text{q}}^-,$

where $\mathcal{L}_{\text{q}}^{+}$ and $\mathcal{L}_{\text{q}}^{-}$ represent the loss for chosen and rejected responses, respectively. Each response type's loss is calculated independently, requiring the model to differentiate the absolute quality of individual responses. The loss terms are given by:

$\mathcal{L}_{\text{q}}^+=-\log \sigma\left(\beta \log \frac{\pi_\theta\left(y_c \mid x\right)}{\pi_0\left(y_c \mid x\right)} - \delta\right),$

$\mathcal{L}_{\text{q}}^-=-\log \sigma\left(-\left(\beta \log \frac{\pi_\theta\left(y_r \mid x\right)}{\pi_0\left(y_r \mid x\right)} - \delta\right) \right),$

where $\delta$ represents the reward shift, calculated as the moving average of previous rewards to stabilize training.
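One way to maintain such a reward shift is an exponential moving average over batch-mean rewards. The text above does not specify the exact averaging scheme, so the class below, its name `RewardShift`, and the `momentum` parameter are assumptions made for illustration only.

```python
class RewardShift:
    """Tracks delta as an exponential moving average of batch-mean rewards."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum  # hypothetical smoothing factor
        self.value = 0.0
        self.initialized = False

    def update(self, batch_rewards):
        """Fold one batch of rewards into the average and return the new delta."""
        batch_mean = sum(batch_rewards) / len(batch_rewards)
        if not self.initialized:
            # The first batch seeds the average directly.
            self.value = batch_mean
            self.initialized = True
        else:
            self.value = self.momentum * self.value + (1 - self.momentum) * batch_mean
        return self.value


shift = RewardShift(momentum=0.5)
print(shift.update([1.0, 3.0]))
print(shift.update([4.0]))
```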

Finally, the SFT loss is used as the generation loss to help the model learn the generation process of preferred responses. The loss function is defined as:

$\mathcal{L}_{\text{g}}=-\frac{\log\pi_\theta\left(y_c \mid x\right)}{\left| y_c \right|}.$
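To make the combined objective concrete, the following sketch evaluates the three terms for a single preference pair from sequence-level log-probabilities. It is a plain-Python illustration rather than the training code: the function name `mpo_loss` is hypothetical, and the default values of `beta`, `delta`, and the weights are placeholders, not the settings used for this model.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def mpo_loss(logp_c, logp_r, ref_logp_c, ref_logp_r, len_c,
             beta=0.1, delta=0.0, w_p=1.0, w_q=1.0, w_g=1.0):
    """Return (total, preference, quality, generation) losses for one pair.

    logp_* are summed token log-probabilities under the policy model,
    ref_logp_* the same quantities under the reference model pi_0,
    and len_c is the token length of the chosen response.
    """
    # Implicit rewards: beta * log(pi_theta / pi_0) for each response.
    reward_c = beta * (logp_c - ref_logp_c)
    reward_r = beta * (logp_r - ref_logp_r)

    loss_p = -log_sigmoid(reward_c - reward_r)        # DPO preference loss
    loss_q = (-log_sigmoid(reward_c - delta)          # BCO term, chosen response
              - log_sigmoid(-(reward_r - delta)))     # BCO term, rejected response
    loss_g = -logp_c / len_c                          # length-normalized SFT loss

    total = w_p * loss_p + w_q * loss_q + w_g * loss_g
    return total, loss_p, loss_q, loss_g


# When the policy equals the reference, both implicit rewards are zero.
print(mpo_loss(logp_c=-10.0, logp_r=-12.0, ref_logp_c=-10.0, ref_logp_r=-12.0, len_c=5))
```

With identical policy and reference models and `delta=0`, the preference loss is $\log 2$ and the quality loss is $2\log 2$, which is a quick sanity check for an implementation.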

Evaluation on Multimodal Capability

To comprehensively compare InternVL's performance before and after MPO, we employ the benchmarks from the OpenCompass Leaderboard, including both well-established classic datasets and newly introduced ones. These benchmarks span a wide range of categories, aiming to provide a thorough and balanced assessment of InternVL’s capabilities across various multimodal tasks. We provide the evaluation results in the table below.

Model	Avg.	MMBench v1.1	MMStar	MMMU	MathVista	HallusionBench	AI2D	OCRBench	MMVet
InternVL2-5-1B	54.9	66.5	51.3	41.2	47.1	39.4	69.0	77.4	47.2
InternVL2-5-1B-MPO	56.4	67.2	49.7	40.8	53.0	40.0	69.4	83.6	47.2
InternVL2-5-2B	59.9	70.9	54.3	43.2	51.1	42.3	74.9	80.2	62.6
InternVL2-5-2B-MPO	62.0	71.6	55.0	45.0	56.4	43.0	75.3	84.2	65.4
InternVL2-5-4B	65.1	78.2	58.7	51.8	60.8	46.6	81.4	82.0	61.5
InternVL2-5-4B-MPO	67.6	78.6	60.2	51.6	65.3	47.8	82.0	88.0	67.1
InternVL2-5-8B	68.9	82.5	63.2	56.2	64.5	49.0	84.6	82.1	62.8
InternVL2-5-8B-MPO	70.4	82.4	65.7	54.9	68.9	51.4	84.5	88.3	66.9
InternVL2-5-26B	71.6	84.6	66.5	60.7	68.0	55.8	86.2	85.4	65.4
InternVL2-5-26B-MPO	72.7	84.2	67.2	57.7	72.8	55.3	86.2	91.2	67.1
InternVL2-5-38B	73.5	85.4	68.5	64.6	72.4	57.9	87.6	84.1	67.2
InternVL2-5-38B-MPO	75.5	85.6	69.8	64.1	73.8	61.5	88.1	88.5	72.5
InternVL2-5-78B	75.2	87.5	69.5	70.0	70.6	57.4	89.1	85.3	71.8
InternVL2-5-78B-MPO	76.6	87.3	73.1	68.3	73.8	58.7	89.3	91.2	71.4

Deployment

LMDeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs & VLMs.

pip install "lmdeploy>=0.6.4"

LMDeploy abstracts the complex inference process of multi-modal Vision-Language Models (VLM) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference pipeline.

A 'Hello, world' Example

from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

model = 'OpenGVLab/InternVL2_5-26B-MPO-AWQ'
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
response = pipe(('describe this image', image))
print(response.text)

If an ImportError occurs while running this example, please install the required dependency packages as prompted.

Multi-images Inference

When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images will lead to a higher number of input tokens, and as a result, the size of the context window typically needs to be increased.

from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image
from lmdeploy.vl.constants import IMAGE_TOKEN

model = 'OpenGVLab/InternVL2_5-26B-MPO-AWQ'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))

image_urls = [
    'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
    'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg'
]

images = [load_image(img_url) for img_url in image_urls]
# Numbering images improves multi-image conversations
response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
print(response.text)

Batch Prompts Inference

Conducting inference with batch prompts is quite straightforward; just place them within a list structure:

from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

model = 'OpenGVLab/InternVL2_5-26B-MPO-AWQ'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))

image_urls = [
    "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg",
    "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg"
]
prompts = [('describe this image', load_image(img_url)) for img_url in image_urls]
response = pipe(prompts)
print(response)

Multi-turn Conversation

There are two ways to conduct multi-turn conversations with the pipeline. One is to construct messages in the OpenAI format and pass them to the pipeline as in the examples above; the other is to use the pipeline.chat interface, shown here:

from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig
from lmdeploy.vl import load_image

model = 'OpenGVLab/InternVL2_5-26B-MPO-AWQ'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))

image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg')
gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.8)
sess = pipe.chat(('describe this image', image), gen_config=gen_config)
print(sess.response.text)
sess = pipe.chat('What is the woman doing?', session=sess, gen_config=gen_config)
print(sess.response.text)
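For the other approach, OpenAI-format messages, a minimal sketch of how the conversation history might be assembled is given below. The helpers `user_turn` and `assistant_turn` are hypothetical names introduced here, and the calls to the pipeline are left as comments because they require the model to be loaded; treat the exact call pattern as an assumption to verify against the LMDeploy documentation.

```python
def user_turn(text, image_url=None):
    """Build one OpenAI-style user message, optionally with an image."""
    content = [{'type': 'text', 'text': text}]
    if image_url is not None:
        content.append({'type': 'image_url', 'image_url': {'url': image_url}})
    return {'role': 'user', 'content': content}


def assistant_turn(text):
    """Build one OpenAI-style assistant message from a previous reply."""
    return {'role': 'assistant', 'content': text}


image_url = 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg'
messages = [user_turn('describe this image', image_url)]

# With a loaded pipeline `pipe`, the conversation could proceed like this:
# response = pipe(messages)
# messages.append(assistant_turn(response.text))
# messages.append(user_turn('What is the woman doing?'))
# response = pipe(messages)
print(messages)
```

Keeping the full history in `messages` means each call resends every earlier turn, so the session length may need to grow with the conversation.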

Service

LMDeploy's api_server enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup:

lmdeploy serve api_server OpenGVLab/InternVL2_5-26B-MPO-AWQ --server-port 23333

To use the OpenAI-style interface, you need to install OpenAI:

pip install openai

Then, use the code below to make the API call:

from openai import OpenAI

client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
    model=model_name,
    messages=[{
        'role':
        'user',
        'content': [{
            'type': 'text',
            'text': 'describe this image',
        }, {
            'type': 'image_url',
            'image_url': {
                'url':
                'https://modelscope.oss-cn-beijing.aliyuncs.com/resource/tiger.jpeg',
            },
        }],
    }],
    temperature=0.8,
    top_p=0.8)
print(response)

License

This project is released under the MIT License. This project uses the pre-trained Qwen2.5-3B-Instruct as a component, which is licensed under the Apache License 2.0.

Citation

If you find this project useful in your research, please consider citing:

@article{wang2024mpo,
  title={Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization},
  author={Wang, Weiyun and Chen, Zhe and Wang, Wenhai and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Zhu, Jinguo and Zhu, Xizhou and Lu, Lewei and Qiao, Yu and Dai, Jifeng},
  journal={arXiv preprint arXiv:2411.10442},
  year={2024}
}
@article{chen2024expanding,
  title={Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling},
  author={Chen, Zhe and Wang, Weiyun and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Cui, Erfei and Zhu, Jinguo and Ye, Shenglong and Tian, Hao and Liu, Zhaoyang and others},
  journal={arXiv preprint arXiv:2412.05271},
  year={2024}
}
@article{chen2024far,
  title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
  journal={arXiv preprint arXiv:2404.16821},
  year={2024}
}
@inproceedings{chen2024internvl,
  title={Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks},
  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and others},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={24185--24198},
  year={2024}
}