license: mit
pipeline_tag: image-text-to-text
library_name: transformers
base_model:
  - OpenGVLab/InternVL2_5-38B-MPO
base_model_relation: quantized
datasets:
  - OpenGVLab/MMPR-v1.1
language:
  - multilingual
tags:
  - internvl
  - custom_code

InternVL2_5-38B-MPO-AWQ

[📂 GitHub] [📜 InternVL 1.0] [📜 InternVL 1.5] [📜 InternVL 2.5] [📜 InternVL2.5-MPO]

[🆕 Blog] [🗨️ Chat Demo] [🤗 HF Demo] [🚀 Quick Start] [📖 Documents]

Introduction

We introduce InternVL2.5-MPO, an advanced multimodal large language model (MLLM) series that improves on the average benchmark score of its InternVL2.5 counterpart at every model size in the evaluation table below. The series builds upon InternVL2.5 and Mixed Preference Optimization (MPO).

InternVL 2.5 Family

In the following table, we provide an overview of the InternVL2.5-MPO series.

Model Architecture

As shown in the following figure, InternVL2.5-MPO retains the same model architecture as InternVL 2.5 and its predecessors, InternVL 1.5 and 2.0, following the "ViT-MLP-LLM" paradigm. In this new version, we integrate a new, incrementally pre-trained InternViT with various pre-trained LLMs, including InternLM 2.5 and Qwen 2.5, using a randomly initialized MLP projector.

[Figure: InternVL2.5-MPO model architecture (ViT-MLP-LLM)]

As in the previous version, we apply a pixel unshuffle operation, reducing the number of visual tokens to one-quarter of the original. We also adopt a dynamic resolution strategy similar to that of InternVL 1.5, dividing images into tiles of 448×448 pixels. The key difference, starting from InternVL 2.0, is the added support for multi-image and video data.
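To make the token arithmetic concrete, here is a minimal sketch. It assumes a ViT patch size of 14 and a 2×2 pixel unshuffle; neither value is stated above, so treat both as assumptions. The function name is hypothetical, and any extra thumbnail tile is ignored.

```python
def visual_tokens(num_tiles: int, tile_size: int = 448,
                  patch_size: int = 14, unshuffle: int = 2) -> int:
    """Approximate visual token count for `num_tiles` image tiles.

    patch_size=14 and unshuffle=2 are assumptions, not values from the text.
    """
    patches_per_side = tile_size // patch_size              # 448 // 14 = 32
    patches_per_tile = patches_per_side ** 2                # 32 * 32 = 1024
    # Pixel unshuffle folds each unshuffle x unshuffle block of patches
    # into one token, so a 2x2 block keeps one-quarter of the tokens.
    tokens_per_tile = patches_per_tile // (unshuffle ** 2)  # 1024 // 4 = 256
    return num_tiles * tokens_per_tile


print(visual_tokens(1))  # 256
print(visual_tokens(6))  # 1536
```

Under these assumptions, more tiles mean proportionally more visual tokens, which is why multi-image inputs tend to need a larger context window.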

Key Designs

Multi-Modal Preference Dataset

MMPR is a large-scale and high-quality multimodal reasoning preference dataset. This dataset includes about 3 million samples.

To construct this dataset, we propose an efficient data construction pipeline. Specifically, we categorize the multimodal data into samples with clear ground truths and samples without clear ground truths.

  • For samples with clear ground truths: the model is prompted to first provide the reasoning process and then give the final answer in a format like `Final Answer: ***`. Responses matching the ground truth answer constitute the positive set $\mathcal{Y}_p$, while those that do not match make up the negative set $\mathcal{Y}_n$. Additionally, responses that fail to provide a clear final answer are also merged into $\mathcal{Y}_n$. Given these responses labeled as positive or negative, we build the preference pairs by selecting a chosen response $y_c$ from $\mathcal{Y}_p$ and a rejected response $y_r$ from $\mathcal{Y}_n$.

  • For samples without clear ground truths: we propose a simple yet effective method: Dropout Next-Token Prediction (Dropout NTP). Specifically, we use the responses generated by InternVL2-8B as chosen answers. Given the chosen answer, we truncate it by half and then prompt InternVL2-8B to complete the remaining portion of the truncated answer without access to the image input. This generated completion serves as the rejected answer for the paired sample. It is worth noting that while the responses generated by InternVL2-8B may not be perfect, the completions generated without the image input will introduce more hallucinations than those generated with the image input. Therefore, the partial order relationship between the chosen and rejected responses holds true.

The data construction pipeline is open-sourced; see our documentation for more details.
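As a rough illustration of the two cases described above (not the released pipeline), the sketch below splits sampled responses into positive and negative sets by their `Final Answer:` line, draws chosen/rejected pairs, and shows the "truncate by half" step of Dropout NTP. All function names are hypothetical, and the exact-match comparison and whitespace-based truncation are simplifying assumptions.

```python
import random
import re
from typing import List, Optional, Tuple

FINAL_ANSWER = re.compile(r"Final Answer:\s*(.+)", re.IGNORECASE)


def extract_answer(response: str) -> Optional[str]:
    """Return the text after 'Final Answer:', or None if it is missing."""
    match = FINAL_ANSWER.search(response)
    return match.group(1).strip() if match else None


def split_responses(responses: List[str],
                    ground_truth: str) -> Tuple[List[str], List[str]]:
    """Split responses into the positive and negative sets."""
    positives, negatives = [], []
    for response in responses:
        answer = extract_answer(response)
        # Assumption: case-insensitive exact match against the ground truth.
        if answer is not None and answer.lower() == ground_truth.strip().lower():
            positives.append(response)
        else:
            # Wrong answers and responses without a clear final answer.
            negatives.append(response)
    return positives, negatives


def build_pairs(positives: List[str], negatives: List[str],
                num_pairs: int, seed: int = 0) -> List[Tuple[str, str]]:
    """Draw (chosen, rejected) pairs; empty if either set is empty."""
    if not positives or not negatives:
        return []
    rng = random.Random(seed)
    return [(rng.choice(positives), rng.choice(negatives))
            for _ in range(num_pairs)]


def truncate_half(chosen: str) -> str:
    """Keep the first half of a chosen answer (split on whitespace here)."""
    words = chosen.split()
    return " ".join(words[: len(words) // 2])
```

For Dropout NTP, the output of `truncate_half` would be handed back to the model without the image, and the resulting completion would serve as the rejected answer.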

Mixed Preference Optimization

The key insight behind MPO is that an effective PO process should enable the model to learn the relative preference between pairs of responses, the absolute quality of individual responses, and the process for generating preferred responses. We define the training objective as a combination of preference loss $\mathcal{L}_{\text{p}}$, quality loss $\mathcal{L}_{\text{q}}$, and generation loss $\mathcal{L}_{\text{g}}$, referred to as Mixed Preference Optimization:

$$
\mathcal{L}=w_{p}\cdot\mathcal{L}_{\text{p}} + w_{q}\cdot\mathcal{L}_{\text{q}} + w_{g}\cdot\mathcal{L}_{\text{g}},
$$

where $w_{*}$ represents the weight assigned to each loss component. In this work, we empirically compare different variants of preference loss. Based on the experimental results, we use DPO as our preference loss and BCO as our quality loss.

Specifically, the DPO serves as the preference loss to enable the model to learn the relative preference between chosen and rejected responses. This algorithm optimizes the following loss function:

$$
\mathcal{L}_{\text{p}}=-\log \sigma\left(\beta \log \frac{\pi_\theta\left(y_c \mid x\right)}{\pi_0\left(y_c \mid x\right)}-\beta \log \frac{\pi_\theta\left(y_r \mid x\right)}{\pi_0\left(y_r \mid x\right)}\right),
$$

where $\beta$ is the KL penalty coefficient, and $x$, $y_c$, and $y_r$ are user query, chosen response, and rejected response, respectively. The policy model $\pi_\theta$ is initialized from model $\pi_0$.
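The preference loss can be evaluated numerically from sequence-level log-probabilities. The sketch below is a plain-Python illustration, not the training code; the function names and the default `beta=0.1` are assumptions chosen for the example.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_logp_c: float, ref_logp_c: float,
             policy_logp_r: float, ref_logp_r: float,
             beta: float = 0.1) -> float:
    """Preference loss for one (chosen, rejected) pair.

    Each argument is a summed log-probability log pi(y | x);
    beta=0.1 is an illustrative value, not one taken from the text.
    """
    chosen_reward = beta * (policy_logp_c - ref_logp_c)
    rejected_reward = beta * (policy_logp_r - ref_logp_r)
    return -log_sigmoid(chosen_reward - rejected_reward)


# When the policy equals the reference model, both log-ratios are zero
# and the loss equals -log(sigmoid(0)) = log(2).
print(dpo_loss(-10.0, -10.0, -12.0, -12.0))
```

The loss falls below log(2) once the policy assigns relatively more probability to the chosen response than the reference model does.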

Additionally, the BCO loss is employed as the quality loss, which helps the model to understand the absolute quality of individual responses. The loss function is defined as:

$$
\mathcal{L}_{\text{q}}=\mathcal{L}_{\text{q}}^+ + \mathcal{L}_{\text{q}}^-,
$$

where $\mathcal{L}_{\text{q}}^{+}$ and $\mathcal{L}_{\text{q}}^{-}$ represent the loss for chosen and rejected responses, respectively. Each response type's loss is calculated independently, requiring the model to differentiate the absolute quality of individual responses. The loss terms are given by:

$$
\mathcal{L}_{\text{q}}^+=-\log \sigma\left(\beta \log \frac{\pi_\theta\left(y_c \mid x\right)}{\pi_0\left(y_c \mid x\right)} - \delta\right),
$$

$$
\mathcal{L}_{\text{q}}^-=-\log \sigma\left(-\left(\beta \log \frac{\pi_\theta\left(y_r \mid x\right)}{\pi_0\left(y_r \mid x\right)} - \delta\right) \right),
$$

where $\delta$ represents the reward shift, calculated as the moving average of previous rewards to stabilize training.
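A small sketch of the quality loss and its reward shift follows. The text only says the shift is a moving average of previous rewards; the exponential form and the `momentum` parameter used here are assumptions, as are the names and the default `beta=0.1`.

```python
import math
from typing import Sequence


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


class RewardShift:
    """Tracks the reward shift as an exponential moving average (assumed form)."""

    def __init__(self, momentum: float = 0.9) -> None:
        self.momentum = momentum
        self.value = 0.0
        self._initialized = False

    def update(self, rewards: Sequence[float]) -> float:
        batch_mean = sum(rewards) / len(rewards)
        if not self._initialized:
            # The first batch sets the starting point of the average.
            self.value = batch_mean
            self._initialized = True
        else:
            self.value = (self.momentum * self.value
                          + (1.0 - self.momentum) * batch_mean)
        return self.value


def bco_loss(policy_logp_c: float, ref_logp_c: float,
             policy_logp_r: float, ref_logp_r: float,
             delta: float, beta: float = 0.1) -> float:
    """Quality loss: independent terms for the chosen and rejected response."""
    reward_c = beta * (policy_logp_c - ref_logp_c)
    reward_r = beta * (policy_logp_r - ref_logp_r)
    loss_pos = -log_sigmoid(reward_c - delta)      # push chosen reward above delta
    loss_neg = -log_sigmoid(-(reward_r - delta))   # push rejected reward below delta
    return loss_pos + loss_neg
```

Because each term compares a single response against the shift rather than against its paired response, the loss stays informative even when both responses in a pair are good or both are bad.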

Finally, the SFT loss is used as the generation loss to help the model learn the generation process of preferred responses. The loss function is defined as:

$$
\mathcal{L}_{\text{g}}=-\frac{\log\pi_\theta\left(y_c \mid x\right)}{\left| y_c \right|}.
$$
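The generation loss is the length-normalized negative log-likelihood of the chosen response, and the full objective is a weighted sum of the three terms. The sketch below illustrates both with plain Python; the function names are hypothetical and the default weights of 1.0 are placeholders, not the values used in training.

```python
from typing import Sequence


def generation_loss(token_logps: Sequence[float]) -> float:
    """Negative log-likelihood of the chosen response, divided by its length.

    `token_logps` holds one log-probability per token of y_c; their sum is
    log pi(y_c | x) under the usual autoregressive factorization.
    """
    return -sum(token_logps) / len(token_logps)


def mpo_loss(preference: float, quality: float, generation: float,
             w_p: float = 1.0, w_q: float = 1.0, w_g: float = 1.0) -> float:
    """Weighted sum of preference, quality, and generation losses.

    The default weights are placeholders for illustration only.
    """
    return w_p * preference + w_q * quality + w_g * generation


g = generation_loss([-0.5, -1.5])   # two tokens, mean NLL of 1.0
print(g)
print(mpo_loss(0.5, 1.0, g, w_p=0.8, w_q=0.2, w_g=1.0))
```

Dividing by the response length keeps long chosen responses from dominating the generation term.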

Evaluation on Multimodal Capability

To comprehensively compare InternVL's performance before and after MPO, we employ the benchmarks from the OpenCompass Leaderboard, including both well-established classic datasets and newly introduced ones. These benchmarks span a wide range of categories, aiming to provide a thorough and balanced assessment of InternVL's capabilities across various multimodal tasks. The evaluation results are provided in the table below.

| Model | Avg. | MMBench v1.1 | MMStar | MMMU | MathVista | HallusionBench | AI2D | OCRBench | MMVet |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| InternVL2-5-1B | 54.9 | 66.5 | 51.3 | 41.2 | 47.1 | 39.4 | 69.0 | 77.4 | 47.2 |
| InternVL2-5-1B-MPO | 56.4 | 67.2 | 49.7 | 40.8 | 53.0 | 40.0 | 69.4 | 83.6 | 47.2 |
| InternVL2-5-2B | 59.9 | 70.9 | 54.3 | 43.2 | 51.1 | 42.3 | 74.9 | 80.2 | 62.6 |
| InternVL2-5-2B-MPO | 62.0 | 71.6 | 55.0 | 45.0 | 56.4 | 43.0 | 75.3 | 84.2 | 65.4 |
| InternVL2-5-4B | 65.1 | 78.2 | 58.7 | 51.8 | 60.8 | 46.6 | 81.4 | 82.0 | 61.5 |
| InternVL2-5-4B-MPO | 67.6 | 78.6 | 60.2 | 51.6 | 65.3 | 47.8 | 82.0 | 88.0 | 67.1 |
| InternVL2-5-8B | 68.9 | 82.5 | 63.2 | 56.2 | 64.5 | 49.0 | 84.6 | 82.1 | 62.8 |
| InternVL2-5-8B-MPO | 70.4 | 82.4 | 65.7 | 54.9 | 68.9 | 51.4 | 84.5 | 88.3 | 66.9 |
| InternVL2-5-26B | 71.6 | 84.6 | 66.5 | 60.7 | 68.0 | 55.8 | 86.2 | 85.4 | 65.4 |
| InternVL2-5-26B-MPO | 72.7 | 84.2 | 67.2 | 57.7 | 72.8 | 55.3 | 86.2 | 91.2 | 67.1 |
| InternVL2-5-38B | 73.5 | 85.4 | 68.5 | 64.6 | 72.4 | 57.9 | 87.6 | 84.1 | 67.2 |
| InternVL2-5-38B-MPO | 75.5 | 85.6 | 69.8 | 64.1 | 73.8 | 61.5 | 88.1 | 88.5 | 72.5 |
| InternVL2-5-78B | 75.2 | 87.5 | 69.5 | 70.0 | 70.6 | 57.4 | 89.1 | 85.3 | 71.8 |
| InternVL2-5-78B-MPO | 76.6 | 87.3 | 73.1 | 68.3 | 73.8 | 58.7 | 89.3 | 91.2 | 71.4 |

Deployment

LMDeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs & VLMs.

pip install "lmdeploy>=0.6.4"

LMDeploy abstracts the complex inference process of multi-modal Vision-Language Models (VLM) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference pipeline.

A 'Hello, world' Example

from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

model = 'OpenGVLab/InternVL2_5-38B-MPO-AWQ'
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192, tp=2))
response = pipe(('describe this image', image))
print(response.text)

If an ImportError occurs while running this example, install the missing dependency packages as prompted.

Multi-images Inference

When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images will lead to a higher number of input tokens, and as a result, the size of the context window typically needs to be increased.

from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image
from lmdeploy.vl.constants import IMAGE_TOKEN

model = 'OpenGVLab/InternVL2_5-38B-MPO-AWQ'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192, tp=2))

image_urls=[
    'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
    'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg'
]

images = [load_image(img_url) for img_url in image_urls]
# Numbering images improves multi-image conversations
response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
print(response.text)

Batch Prompts Inference

Conducting inference with batch prompts is quite straightforward; just place them within a list structure:

from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

model = 'OpenGVLab/InternVL2_5-38B-MPO-AWQ'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192, tp=2))

image_urls=[
    "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg",
    "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg"
]
prompts = [('describe this image', load_image(img_url)) for img_url in image_urls]
response = pipe(prompts)
print(response)

Multi-turn Conversation

There are two ways to conduct multi-turn conversations with the pipeline. One is to construct messages in the OpenAI format and use the method introduced above; the other is to use the pipeline.chat interface.

from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig
from lmdeploy.vl import load_image

model = 'OpenGVLab/InternVL2_5-38B-MPO-AWQ'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192, tp=2))

image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg')
gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.8)
sess = pipe.chat(('describe this image', image), gen_config=gen_config)
print(sess.response.text)
sess = pipe.chat('What is the woman doing?', session=sess, gen_config=gen_config)
print(sess.response.text)
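The example above uses the pipeline.chat interface. For the other route, OpenAI-format messages, the following is a minimal sketch of how the conversation history could be assembled. The helper names are hypothetical, and it assumes that calling the pipeline with a list of message dicts returns a single response object with a `.text` attribute; check this against the LMDeploy version you have installed.

```python
from typing import Any, Dict, List, Optional

Message = Dict[str, Any]


def user_turn(text: str, image_url: Optional[str] = None) -> Message:
    """Build one OpenAI-style user message, optionally with an image URL."""
    content: List[Dict[str, Any]] = [{'type': 'text', 'text': text}]
    if image_url is not None:
        content.append({'type': 'image_url', 'image_url': {'url': image_url}})
    return {'role': 'user', 'content': content}


def assistant_turn(text: str) -> Message:
    """Build one OpenAI-style assistant message."""
    return {'role': 'assistant', 'content': text}


def ask(pipe: Any, history: List[Message], text: str,
        image_url: Optional[str] = None) -> str:
    """Append a user turn, query the pipeline, and record the reply.

    Assumption: `pipe(history)` accepts OpenAI-style messages and returns
    an object exposing the generated text as `.text`.
    """
    history.append(user_turn(text, image_url))
    response = pipe(history)
    history.append(assistant_turn(response.text))
    return response.text
```

With the `pipe` object created above, a conversation would start from `history = []`, followed by `ask(pipe, history, 'describe this image', image_url=...)` and then `ask(pipe, history, 'What is the woman doing?')`.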

Service

LMDeploy's api_server enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup:

lmdeploy serve api_server OpenGVLab/InternVL2_5-38B-MPO-AWQ --server-port 23333 --tp 2

To use the OpenAI-style interface, you need to install OpenAI:

pip install openai

Then, use the code below to make the API call:

from openai import OpenAI

client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
    model=model_name,
    messages=[{
        'role':
        'user',
        'content': [{
            'type': 'text',
            'text': 'describe this image',
        }, {
            'type': 'image_url',
            'image_url': {
                'url':
                'https://modelscope.oss-cn-beijing.aliyuncs.com/resource/tiger.jpeg',
            },
        }],
    }],
    temperature=0.8,
    top_p=0.8)
print(response)

License

This project is released under the MIT License. This project uses the pre-trained Qwen2.5-32B-Instruct as a component, which is licensed under the Apache License 2.0.

Citation

If you find this project useful in your research, please consider citing:

@article{wang2024mpo,
  title={Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization},
  author={Wang, Weiyun and Chen, Zhe and Wang, Wenhai and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Zhu, Jinguo and Zhu, Xizhou and Lu, Lewei and Qiao, Yu and Dai, Jifeng},
  journal={arXiv preprint arXiv:2411.10442},
  year={2024}
}
@article{chen2024expanding,
  title={Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling},
  author={Chen, Zhe and Wang, Weiyun and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Cui, Erfei and Zhu, Jinguo and Ye, Shenglong and Tian, Hao and Liu, Zhaoyang and others},
  journal={arXiv preprint arXiv:2412.05271},
  year={2024}
}
@article{chen2024far,
  title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
  journal={arXiv preprint arXiv:2404.16821},
  year={2024}
}
@inproceedings{chen2024internvl,
  title={Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks},
  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and others},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={24185--24198},
  year={2024}
}