---
license: mit
pipeline_tag: image-text-to-text
library_name: transformers
base_model:
- OpenGVLab/InternVL2_5-1B
base_model_relation: finetune
datasets:
- OpenGVLab/MMPR-v1.1
language:
- multilingual
tags:
- internvl
- custom_code
---

# InternVL2_5-1B-MPO

[\[📂 GitHub\]](https://github.com/OpenGVLab/InternVL) [\[📜 InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[📜 InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[📜 InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[📜 InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442)

[\[🆕 Blog\]](https://internvl.github.io/blog/) [\[🗨️ Chat Demo\]](https://internvl.opengvlab.com/) [\[🤗 HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[🚀 Quick Start\]](#quick-start) [\[📖 Documents\]](https://internvl.readthedocs.io/en/latest/)

<div align="center">
<img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/64006c09330a45b03605bba3/zJsd2hqd3EevgXo6fNgC-.png">
</div>

## Introduction

We introduce InternVL2.5-MPO, an advanced multimodal large language model (MLLM) series that demonstrates superior overall performance. This series builds upon InternVL2.5 and Mixed Preference Optimization.

![image/png](https://internvl.github.io/blog/2024-12-20-InternVL-2.5-MPO/images/overview_performance.png)

## InternVL 2.5 Family

In the following table, we provide an overview of the InternVL2.5-MPO series.

| Model Name | Vision Part | Language Part | HF Link |
| :---: | :---: | :---: | :---: |
| InternVL2_5-1B-MPO | [InternViT-300M-448px-V2_5](https://huggingface.co/OpenGVLab/InternViT-300M-448px-V2_5) | [Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) | [🤗 link](https://huggingface.co/OpenGVLab/InternVL2_5-1B-MPO) |
| InternVL2_5-2B-MPO | [InternViT-300M-448px-V2_5](https://huggingface.co/OpenGVLab/InternViT-300M-448px-V2_5) | [internlm2_5-1_8b-chat](https://huggingface.co/internlm/internlm2_5-1_8b-chat) | [🤗 link](https://huggingface.co/OpenGVLab/InternVL2_5-2B-MPO) |
| InternVL2_5-4B-MPO | [InternViT-300M-448px-V2_5](https://huggingface.co/OpenGVLab/InternViT-300M-448px-V2_5) | [Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct) | [🤗 link](https://huggingface.co/OpenGVLab/InternVL2_5-4B-MPO) |
| InternVL2_5-8B-MPO | [InternViT-300M-448px-V2_5](https://huggingface.co/OpenGVLab/InternViT-300M-448px-V2_5) | [internlm2_5-7b-chat](https://huggingface.co/internlm/internlm2_5-7b-chat) | [🤗 link](https://huggingface.co/OpenGVLab/InternVL2_5-8B-MPO) |
| InternVL2_5-26B-MPO | [InternViT-6B-448px-V2_5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V2_5) | [internlm2_5-20b-chat](https://huggingface.co/internlm/internlm2_5-20b-chat) | [🤗 link](https://huggingface.co/OpenGVLab/InternVL2_5-26B-MPO) |
| InternVL2_5-38B-MPO | [InternViT-6B-448px-V2_5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V2_5) | [Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct) | [🤗 link](https://huggingface.co/OpenGVLab/InternVL2_5-38B-MPO) |
| InternVL2_5-78B-MPO | [InternViT-6B-448px-V2_5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V2_5) | [Qwen2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) | [🤗 link](https://huggingface.co/OpenGVLab/InternVL2_5-78B-MPO) |

## Model Architecture

As shown in the following figure, [InternVL2.5-MPO](https://internvl.github.io/blog/2024-12-20-InternVL-2.5-MPO/) retains the same model architecture as [InternVL 2.5](https://internvl.github.io/blog/2024-12-05-InternVL-2.5/) and its predecessors, InternVL 1.5 and 2.0, following the "ViT-MLP-LLM" paradigm. In this new version, we integrate a newly incrementally pre-trained InternViT with various pre-trained LLMs, including InternLM 2.5 and Qwen 2.5, using a randomly initialized MLP projector.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/BiiyXN6NOk0p-3rl3ueyL.png)

As in the previous version, we applied a pixel unshuffle operation, reducing the number of visual tokens to one-quarter of the original. In addition, we adopted a dynamic resolution strategy similar to that of InternVL 1.5, dividing images into tiles of 448×448 pixels. The key difference, starting from InternVL 2.0, is that we additionally introduced support for multi-image and video data.

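For intuition, the following is a minimal sketch of how a pixel-unshuffle step cuts the visual token count by a factor of four. It assumes a 448×448 tile encoded into a 32×32 token grid and uses `torch.nn.functional.pixel_unshuffle`; the shapes and the exact placement of this step inside InternVL's modeling code are simplified for illustration.

```python
import torch
import torch.nn.functional as F

# Assume one 448x448 tile is encoded into a 32x32 grid of visual tokens (1024 tokens).
batch, grid, hidden = 1, 32, 1024
vit_tokens = torch.randn(batch, grid * grid, hidden)                 # (1, 1024, hidden)

x = vit_tokens.view(batch, grid, grid, hidden).permute(0, 3, 1, 2)   # (1, hidden, 32, 32)
x = F.pixel_unshuffle(x, downscale_factor=2)                         # (1, hidden*4, 16, 16)
x = x.flatten(2).transpose(1, 2)                                     # (1, 256, hidden*4)

print(x.shape)  # torch.Size([1, 256, 4096]): 1024 tokens reduced to 256
```

The widened channel dimension of the remaining tokens is then mapped into the LLM embedding space by the MLP projector.
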
## Key Designs

### Multi-Modal Preference Dataset

MMPR is a large-scale and high-quality multimodal reasoning preference dataset. This dataset includes about 3 million samples.

![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/619507e7b74b6c591f794340/mmXL47UPDFwYOWdn9Z6j5.jpeg)
![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/619507e7b74b6c591f794340/6fnvI_wCd9JXAs6vYthaG.jpeg)

To construct this dataset, we propose an efficient data construction pipeline. Specifically, we categorize the multimodal data into **samples with clear ground truths** and **samples without clear ground truths**.

- **For samples with clear ground truths:**
  the model is prompted to first provide the reasoning process and then give the final answer in a format like `Final Answer: ***`.
  Responses matching the ground truth answer constitute the positive set \\(\mathcal{Y}_p\\), while those that do not match make up the negative set \\(\mathcal{Y}_n\\). Additionally, responses that fail to provide a clear final answer are also merged into \\(\mathcal{Y}_n\\).
  Given these responses labeled as positive or negative, we build the preference pairs by selecting a chosen response \\(y_c\\) from \\(\mathcal{Y}_p\\) and a rejected response \\(y_r\\) from \\(\mathcal{Y}_n\\).

- **For samples without clear ground truths:**
  we propose a simple yet effective method: Dropout Next-Token Prediction (Dropout NTP).
  Specifically, we use the responses generated by InternVL2-8B as chosen answers.
  Given a chosen answer, we truncate it by half and then prompt InternVL2-8B to complete the remaining portion of the truncated answer without access to the image input.
  This generated completion serves as the rejected answer for the paired sample.
  While the responses generated by InternVL2-8B may not be perfect, the completions generated without the image input introduce more hallucinations than those generated with the image input, so the partial order between the chosen and rejected responses holds.

The data construction pipeline is open-sourced; see our [document](https://internvl.readthedocs.io/en/latest/internvl2.0/preference_optimization.html#generate-additional-preference-data) for more details. A simplified sketch of both cases is given below.

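As a rough illustration of the two cases above, the following sketch builds preference pairs from sampled responses. The helper names (`extract_final_answer`, `complete_without_image`) and the pairing strategy are simplified assumptions for illustration, not the released pipeline.

```python
import random
from dataclasses import dataclass

@dataclass
class PreferencePair:
    query: str
    chosen: str
    rejected: str

def extract_final_answer(response: str):
    """Parse a trailing 'Final Answer: ...' line; return None if it is missing."""
    for line in reversed(response.strip().splitlines()):
        if line.startswith('Final Answer:'):
            return line[len('Final Answer:'):].strip()
    return None

def pairs_with_ground_truth(query, responses, ground_truth):
    """Samples with clear ground truths: split sampled responses into Y_p / Y_n."""
    positives = [r for r in responses if extract_final_answer(r) == ground_truth]
    # Non-matching answers and responses without a clear final answer both go to Y_n.
    negatives = [r for r in responses if extract_final_answer(r) != ground_truth]
    if not positives:
        return []
    return [PreferencePair(query, random.choice(positives), y_r) for y_r in negatives]

def pair_with_dropout_ntp(query, chosen, complete_without_image):
    """Samples without clear ground truths (Dropout NTP): the rejected answer is the
    chosen answer truncated by half and completed *without* the image input.
    `complete_without_image(query, prefix)` is a hypothetical callable wrapping the model."""
    truncated = chosen[:len(chosen) // 2]
    rejected = truncated + complete_without_image(query, truncated)
    return PreferencePair(query, chosen, rejected)
```
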
### Mixed Preference Optimization

The key insight behind MPO is that *an effective PO process should enable the model to learn the relative preference between pairs of responses, the absolute quality of individual responses, and the process for generating preferred responses.* We define the training objective as a combination of preference loss \\(\mathcal{L}_{\text{p}}\\), quality loss \\(\mathcal{L}_{\text{q}}\\), and generation loss \\(\mathcal{L}_{\text{g}}\\), referred to as Mixed Preference Optimization:

$$
\mathcal{L}=w_{p}\cdot\mathcal{L}_{\text{p}} + w_{q}\cdot\mathcal{L}_{\text{q}} + w_{g}\cdot\mathcal{L}_{\text{g}},
$$

where \\(w_{*}\\) represents the weight assigned to each loss component. In this work, we empirically compare different variants of preference loss. Based on the experimental results, we use DPO as our preference loss and BCO as our quality loss.

Specifically, DPO serves as the preference loss, enabling the model to learn the relative preference between chosen and rejected responses. This algorithm optimizes the following loss function:

$$
\mathcal{L}_{\text{p}}=-\log \sigma\left(\beta \log \frac{\pi_\theta\left(y_c \mid x\right)}{\pi_0\left(y_c \mid x\right)}-\beta \log \frac{\pi_\theta\left(y_r \mid x\right)}{\pi_0\left(y_r \mid x\right)}\right),
$$

where \\(\beta\\) is the KL penalty coefficient, and \\(x\\), \\(y_c\\), and \\(y_r\\) are the user query, chosen response, and rejected response, respectively. The policy model \\(\pi_\theta\\) is initialized from the reference model \\(\pi_0\\).

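A minimal sketch of this preference loss in PyTorch, assuming the sequence-level log-probabilities of \\(y_c\\) and \\(y_r\\) under the policy and the frozen reference model have already been computed (the tensor names and the default `beta` are illustrative, not the training configuration used here):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Preference loss L_p: -log sigmoid of the margin between the two log-ratios."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp        # log pi_theta/pi_0 for y_c
    rejected_logratio = policy_rejected_logp - ref_rejected_logp  # log pi_theta/pi_0 for y_r
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```
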
Additionally, the BCO loss is employed as the quality loss, which helps the model understand the absolute quality of individual responses. The loss function is defined as:

$$
\mathcal{L}_{\text{q}}=\mathcal{L}_{\text{q}}^+ + \mathcal{L}_{\text{q}}^-,
$$

where \\(\mathcal{L}_{\text{q}}^{+}\\) and \\(\mathcal{L}_{\text{q}}^{-}\\) represent the loss for chosen and rejected responses, respectively. Each response type's loss is calculated independently, requiring the model to differentiate the absolute quality of individual responses. The loss terms are given by:

$$
\mathcal{L}_{\text{q}}^+=-\log \sigma\left(\beta \log \frac{\pi_\theta\left(y_c \mid x\right)}{\pi_0\left(y_c \mid x\right)} - \delta\right),
$$

$$
\mathcal{L}_{\text{q}}^-=-\log \sigma\left(-\left(\beta \log \frac{\pi_\theta\left(y_r \mid x\right)}{\pi_0\left(y_r \mid x\right)} - \delta\right) \right),
$$

where \\(\delta\\) represents the reward shift, calculated as the moving average of previous rewards to stabilize training.

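Continuing the sketch above, the quality loss can be written as follows; how `delta` is maintained here is an assumption shown only for illustration:

```python
import torch
import torch.nn.functional as F

def bco_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp,
             delta: float, beta: float = 0.1):
    """Quality loss L_q = L_q^+ + L_q^-: each response is scored in isolation
    against the reward shift delta."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    loss_pos = -F.logsigmoid(chosen_reward - delta).mean()
    loss_neg = -F.logsigmoid(-(rejected_reward - delta)).mean()
    return loss_pos + loss_neg

# One simple way to track delta (an exponential moving average of observed rewards):
# delta = 0.9 * delta + 0.1 * torch.cat([chosen_reward, rejected_reward]).mean().item()
```
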
Finally, the SFT loss is used as the generation loss to help the model learn the generation process of preferred responses. The loss function is defined as:

$$
\mathcal{L}_{\text{g}}=-\frac{\log\pi_\theta\left(y_c \mid x\right)}{\left| y_c \right|}.
$$

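Putting the pieces together with the generation loss, the full objective is a weighted sum. The sketch below reuses the `dpo_loss` and `bco_loss` functions from above; the loss weights and the layout of `logps` are placeholders, not the values used for training:

```python
def sft_loss(policy_chosen_logp, chosen_token_counts):
    """Generation loss L_g: length-normalized negative log-likelihood of y_c."""
    return -(policy_chosen_logp / chosen_token_counts).mean()

def mpo_loss(logps, delta, beta=0.1, w_p=1.0, w_q=1.0, w_g=1.0):
    """L = w_p * L_p + w_q * L_q + w_g * L_g, assembled from the sketches above.
    `logps` is assumed to hold sequence log-probs and chosen-response token counts."""
    l_p = dpo_loss(logps['policy_chosen'], logps['policy_rejected'],
                   logps['ref_chosen'], logps['ref_rejected'], beta)
    l_q = bco_loss(logps['policy_chosen'], logps['policy_rejected'],
                   logps['ref_chosen'], logps['ref_rejected'], delta, beta)
    l_g = sft_loss(logps['policy_chosen'], logps['chosen_lengths'])
    return w_p * l_p + w_q * l_q + w_g * l_g
```
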
## Evaluation on Multimodal Capability

To comprehensively compare InternVL's performance before and after MPO, we employ the benchmarks from the OpenCompass Leaderboard, including both well-established classic datasets and newly introduced ones. These benchmarks span a wide range of categories, aiming to provide a thorough and balanced assessment of InternVL's capabilities across various multimodal tasks. The evaluation results are provided in the table below.

| Model | Avg. | MMBench v1.1 | MMStar | MMMU | MathVista | HallusionBench | AI2D | OCRBench | MMVet |
| ------------------- | ---- | ------------ | ------ | ---- | --------- | -------------- | ---- | -------- | ----- |
| InternVL2-5-1B | 54.9 | 66.5 | 51.3 | 41.2 | 47.1 | 39.4 | 69.0 | 77.4 | 47.2 |
| InternVL2-5-1B-MPO | 56.4 | 67.2 | 49.7 | 40.8 | 53.0 | 40.0 | 69.4 | 83.6 | 47.2 |
| InternVL2-5-2B | 59.9 | 70.9 | 54.3 | 43.2 | 51.1 | 42.3 | 74.9 | 80.2 | 62.6 |
| InternVL2-5-2B-MPO | 62.0 | 71.6 | 55.0 | 45.0 | 56.4 | 43.0 | 75.3 | 84.2 | 65.4 |
| InternVL2-5-4B | 65.1 | 78.2 | 58.7 | 51.8 | 60.8 | 46.6 | 81.4 | 82.0 | 61.5 |
| InternVL2-5-4B-MPO | 67.6 | 78.6 | 60.2 | 51.6 | 65.3 | 47.8 | 82.0 | 88.0 | 67.1 |
| InternVL2-5-8B | 68.9 | 82.5 | 63.2 | 56.2 | 64.5 | 49.0 | 84.6 | 82.1 | 62.8 |
| InternVL2-5-8B-MPO | 70.4 | 82.4 | 65.7 | 54.9 | 68.9 | 51.4 | 84.5 | 88.3 | 66.9 |
| InternVL2-5-26B | 71.6 | 84.6 | 66.5 | 60.7 | 68.0 | 55.8 | 86.2 | 85.4 | 65.4 |
| InternVL2-5-26B-MPO | 72.7 | 84.2 | 67.2 | 57.7 | 72.8 | 55.3 | 86.2 | 91.2 | 67.1 |
| InternVL2-5-38B | 73.5 | 85.4 | 68.5 | 64.6 | 72.4 | 57.9 | 87.6 | 84.1 | 67.2 |
| InternVL2-5-38B-MPO | 75.5 | 85.6 | 69.8 | 64.1 | 73.8 | 61.5 | 88.1 | 88.5 | 72.5 |
| InternVL2-5-78B | 75.2 | 87.5 | 69.5 | 70.0 | 70.6 | 57.4 | 89.1 | 85.3 | 71.8 |
| InternVL2-5-78B-MPO | 76.6 | 87.3 | 73.1 | 68.3 | 73.8 | 58.7 | 89.3 | 91.2 | 71.4 |

## Quick Start

We provide example code to run `InternVL2_5-1B-MPO` using `transformers`.

> Please use transformers>=4.37.2 to ensure the model works correctly.

### Model Loading

#### 16-bit (bf16 / fp16)

```python
import torch
from transformers import AutoTokenizer, AutoModel
path = "OpenGVLab/InternVL2_5-1B-MPO"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval().cuda()
```

#### BNB 8-bit Quantization

```python
import torch
from transformers import AutoTokenizer, AutoModel
path = "OpenGVLab/InternVL2_5-1B-MPO"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    load_in_8bit=True,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval()
```

#### Multiple GPUs

The reason for writing the code this way is to avoid errors that occur during multi-GPU inference due to tensors not being on the same device. By ensuring that the first and last layers of the large language model (LLM) are on the same device, we prevent such errors.

```python
import math
import torch
from transformers import AutoTokenizer, AutoModel

def split_model(model_name):
    device_map = {}
    world_size = torch.cuda.device_count()
    num_layers = {
        'InternVL2_5-1B': 24, 'InternVL2_5-2B': 24, 'InternVL2_5-4B': 36, 'InternVL2_5-8B': 32,
        'InternVL2_5-26B': 48, 'InternVL2_5-38B': 64, 'InternVL2_5-78B': 80}[model_name]
    # Since the first GPU will be used for ViT, treat it as half a GPU.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.tok_embeddings'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.output'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.model.rotary_emb'] = 0
    device_map['language_model.lm_head'] = 0
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0

    return device_map

path = "OpenGVLab/InternVL2_5-1B-MPO"
device_map = split_model('InternVL2_5-1B')
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
    device_map=device_map).eval()
```

### Inference with Transformers

```python
import numpy as np
import torch
import torchvision.transforms as T
from decord import VideoReader, cpu
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # generate candidate tile grids (i x j) within [min_num, max_num] tiles
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

# If you want to load a model using multiple GPUs, please refer to the `Multiple GPUs` section.
path = 'OpenGVLab/InternVL2_5-1B-MPO'
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# set the max number of tiles in `max_num`
pixel_values = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=1024, do_sample=True)

# pure-text conversation
question = 'Hello, who are you?'
response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Can you tell me a story?'
response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# single-image single-round conversation
question = '<image>\nPlease describe the image shortly.'
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(f'User: {question}\nAssistant: {response}')

# single-image multi-round conversation
question = '<image>\nPlease describe the image in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Please write a poem according to the image.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# multi-image multi-round conversation, combined images
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

question = '<image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'What are the similarities and differences between these two images.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# multi-image multi-round conversation, separate images
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]

question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'What are the similarities and differences between these two images.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# batch inference, single image per sample
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
responses = model.batch_chat(tokenizer, pixel_values,
                             num_patches_list=num_patches_list,
                             questions=questions,
                             generation_config=generation_config)
for question, response in zip(questions, responses):
    print(f'User: {question}\nAssistant: {response}')

# video multi-round conversation
def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
    if bound:
        start, end = bound[0], bound[1]
    else:
        start, end = -100000, 100000
    start_idx = max(first_idx, round(start * fps))
    end_idx = min(round(end * fps), max_frame)
    seg_size = float(end_idx - start_idx) / num_segments
    frame_indices = np.array([
        int(start_idx + (seg_size / 2) + np.round(seg_size * idx))
        for idx in range(num_segments)
    ])
    return frame_indices

def load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=32):
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    max_frame = len(vr) - 1
    fps = float(vr.get_avg_fps())

    pixel_values_list, num_patches_list = [], []
    transform = build_transform(input_size=input_size)
    frame_indices = get_index(bound, fps, max_frame, first_idx=0, num_segments=num_segments)
    for frame_index in frame_indices:
        img = Image.fromarray(vr[frame_index].asnumpy()).convert('RGB')
        img = dynamic_preprocess(img, image_size=input_size, use_thumbnail=True, max_num=max_num)
        pixel_values = [transform(tile) for tile in img]
        pixel_values = torch.stack(pixel_values)
        num_patches_list.append(pixel_values.shape[0])
        pixel_values_list.append(pixel_values)
    pixel_values = torch.cat(pixel_values_list)
    return pixel_values, num_patches_list

video_path = './examples/red-panda.mp4'
pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=1)
pixel_values = pixel_values.to(torch.bfloat16).cuda()
video_prefix = ''.join([f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))])
question = video_prefix + 'What is the red panda doing?'
# Frame1: <image>\nFrame2: <image>\n...\nFrame8: <image>\n{question}
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Describe this video in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')
```

#### Streaming Output

Besides this method, you can also use the following code to get streamed output.

```python
from transformers import TextIteratorStreamer
from threading import Thread

# Initialize the streamer
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True, timeout=10)
# Define the generation configuration
generation_config = dict(max_new_tokens=1024, do_sample=False, streamer=streamer)
# Start the model chat in a separate thread
thread = Thread(target=model.chat, kwargs=dict(
    tokenizer=tokenizer, pixel_values=pixel_values, question=question,
    history=None, return_history=False, generation_config=generation_config,
))
thread.start()

# Initialize an empty string to store the generated text
generated_text = ''
# Loop through the streamer to get the new text as it is generated
for new_text in streamer:
    if new_text == model.conv_template.sep:
        break
    generated_text += new_text
    print(new_text, end='', flush=True)  # Print each new chunk of generated text on the same line
```

## Finetune

Many repositories now support fine-tuning of the InternVL series models, including [InternVL](https://github.com/OpenGVLab/InternVL), [SWIFT](https://github.com/modelscope/ms-swift), [XTuner](https://github.com/InternLM/xtuner), and others. Please refer to their documentation for more details on fine-tuning.

## Deployment

### LMDeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs & VLMs.

```sh
pip install lmdeploy>=0.6.4
```

LMDeploy abstracts the complex inference process of multi-modal Vision-Language Models (VLM) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference pipeline.

#### A 'Hello, world' Example

```python
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

model = 'OpenGVLab/InternVL2_5-1B-MPO'
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
response = pipe(('describe this image', image))
print(response.text)
```

If an `ImportError` occurs while running this example, please install the required dependency packages as prompted.

#### Multi-images Inference

When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images will lead to a higher number of input tokens, and as a result, the size of the context window typically needs to be increased.

```python
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image
from lmdeploy.vl.constants import IMAGE_TOKEN

model = 'OpenGVLab/InternVL2_5-1B-MPO'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))

image_urls = [
    'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
    'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg'
]

images = [load_image(img_url) for img_url in image_urls]
# Numbering images improves multi-image conversations
response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
print(response.text)
```

#### Batch Prompts Inference

Conducting inference with batch prompts is quite straightforward; just place them within a list structure:

```python
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

model = 'OpenGVLab/InternVL2_5-1B-MPO'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))

image_urls = [
    "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg",
    "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg"
]
prompts = [('describe this image', load_image(img_url)) for img_url in image_urls]
response = pipe(prompts)
print(response)
```

#### Multi-turn Conversation

There are two ways to run multi-turn conversations with the pipeline. One is to construct messages in the OpenAI format and use the method introduced above; the other is to use the `pipeline.chat` interface.

```python
from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig
from lmdeploy.vl import load_image

model = 'OpenGVLab/InternVL2_5-1B-MPO'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))

image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg')
gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.8)
sess = pipe.chat(('describe this image', image), gen_config=gen_config)
print(sess.response.text)
sess = pipe.chat('What is the woman doing?', session=sess, gen_config=gen_config)
print(sess.response.text)
```

#### Service

LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup:

```shell
lmdeploy serve api_server OpenGVLab/InternVL2_5-1B-MPO --server-port 23333
```

To use the OpenAI-style interface, you need to install the OpenAI client:

```shell
pip install openai
```

Then, use the code below to make the API call:

```python
from openai import OpenAI

client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
    model=model_name,
    messages=[{
        'role': 'user',
        'content': [{
            'type': 'text',
            'text': 'describe this image',
        }, {
            'type': 'image_url',
            'image_url': {
                'url': 'https://modelscope.oss-cn-beijing.aliyuncs.com/resource/tiger.jpeg',
            },
        }],
    }],
    temperature=0.8,
    top_p=0.8)
print(response)
```

## License

This project is released under the MIT License. This project uses the pre-trained Qwen2.5-0.5B-Instruct as a component, which is licensed under the Apache License 2.0.

## Citation

If you find this project useful in your research, please consider citing:

```BibTeX
@article{wang2024mpo,
  title={Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization},
  author={Wang, Weiyun and Chen, Zhe and Wang, Wenhai and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Zhu, Jinguo and Zhu, Xizhou and Lu, Lewei and Qiao, Yu and Dai, Jifeng},
  journal={arXiv preprint arXiv:2411.10442},
  year={2024}
}
@article{chen2024expanding,
  title={Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling},
  author={Chen, Zhe and Wang, Weiyun and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Cui, Erfei and Zhu, Jinguo and Ye, Shenglong and Tian, Hao and Liu, Zhaoyang and others},
  journal={arXiv preprint arXiv:2412.05271},
  year={2024}
}
@article{chen2024far,
  title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
  journal={arXiv preprint arXiv:2404.16821},
  year={2024}
}
@inproceedings{chen2024internvl,
  title={Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks},
  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and others},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={24185--24198},
  year={2024}
}
```