czczup committed
Commit e1f89bc
1 parent: cf79149

Upload folder using huggingface_hub

Files changed (2):
  1. README.md +23 -39
  2. modeling_intern_vit.py +6 -12
README.md CHANGED
@@ -62,6 +62,8 @@ InternVL 2.0 is a multimodal large language model series, featuring models of va
  | MathVista<sub>testmini</sub> | 63.8 | 67.7 | 63.7 | 65.5 |
  | OpenCompass<sub>avg</sub> | 69.9 | 67.9 | 69.7 | 71.0 |

+ - For more details and evaluation reproduction, please refer to our [Evaluation Guide](https://internvl.readthedocs.io/en/latest/internvl2.0/evaluation.html).
+
  - We simultaneously use InternVL and VLMEvalKit repositories for model evaluation. Specifically, the results reported for DocVQA, ChartQA, InfoVQA, TextVQA, MME, AI2D, MMBench, CCBench, MMVet, and SEED-Image were tested using the InternVL repository. OCRBench, RealWorldQA, HallBench, and MathVista were evaluated using the VLMEvalKit.

  - For MMMU, we report both the original scores (left side: evaluated using the InternVL codebase for InternVL series models, and sourced from technical reports or webpages for other models) and the VLMEvalKit scores (right side: collected from the OpenCompass leaderboard).
@@ -321,7 +323,7 @@ tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast

  # set the max number of tiles in `max_num`
  pixel_values = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
- generation_config = dict(max_new_tokens=1024, do_sample=False)
+ generation_config = dict(max_new_tokens=1024, do_sample=True)

  # pure-text conversation (纯文本对话)
  question = 'Hello, who are you?'
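For context on this change: with `do_sample=True` the quick-start generations become stochastic, whereas `do_sample=False` kept them greedy and reproducible. A minimal sketch of how this `generation_config` feeds the chat API used in the quick start, assuming a single-process load (the 76B model normally requires the multi-GPU `device_map` setup described elsewhere in the README):

```python
# Sketch only: abbreviated single-process loading; the real quick start splits the
# 76B model across GPUs. `model.chat(...)` follows the usage shown in this README.
import torch
from transformers import AutoModel, AutoTokenizer

path = 'OpenGVLab/InternVL2-Llama3-76B'
model = AutoModel.from_pretrained(path, torch_dtype=torch.bfloat16,
                                  low_cpu_mem_usage=True, trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# do_sample=True -> sampled (non-deterministic) outputs; set False for greedy decoding.
generation_config = dict(max_new_tokens=1024, do_sample=True)

question = 'Hello, who are you?'
response, history = model.chat(tokenizer, None, question, generation_config,
                               history=None, return_history=True)
print(response)
```

Switching the default to sampling trades reproducibility for more varied answers; pass `do_sample=False` to recover the previous behavior.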
@@ -473,30 +475,28 @@ for new_text in streamer:

  ## Finetune

- SWIFT from ModelScope community has supported the fine-tuning (Image/Video) of InternVL, please check [this link](https://github.com/modelscope/swift/blob/main/docs/source_en/Multi-Modal/internvl-best-practice.md) for more details.
+ Many repositories now support fine-tuning of the InternVL series models, including [InternVL](https://github.com/OpenGVLab/InternVL), [SWIFT](https://github.com/modelscope/ms-swift), [XTuner](https://github.com/InternLM/xtuner), and others. Please refer to their documentation for more details on fine-tuning.

  ## Deployment

  ### LMDeploy

- #### Service
-
- To deploy InternVL2 as an API, please configure the chat template config first. Create the following JSON file `chat_template.json`.
+ LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the MMRazor and MMDeploy teams.

- ```json
- {
- "model_name":"internvl-internlm2",
- "meta_instruction":"我是书生·万象,英文名是InternVL,是由上海人工智能实验室、清华大学及多家合作单位联合开发的多模态大语言模型。",
- "stop_words":["<|im_start|>", "<|im_end|>"]
- }
+ ```sh
+ pip install lmdeploy==0.5.3
  ```

+ LMDeploy abstracts the complex inference process of multi-modal Vision-Language Models (VLM) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference pipeline.
+
+ #### Service
+
  LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup:

  > **⚠️ Warning**: Please make sure to install Flash Attention; otherwise, using `--tp` will cause errors.

  ```shell
- CUDA_VISIBLE_DEVICES=0,1,2,3 lmdeploy serve api_server OpenGVLab/InternVL2-Llama3-76B --backend turbomind --server-port 23333 --chat-template chat_template.json --tp 4
+ CUDA_VISIBLE_DEVICES=0,1,2,3 lmdeploy serve api_server OpenGVLab/InternVL2-Llama3-76B --backend turbomind --server-port 23333 --tp 4
  ```

  To use the OpenAI-style interface, you need to install OpenAI:
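The pipeline mentioned in the lines added above is LMDeploy's offline inference interface for VLMs. A minimal sketch, assuming lmdeploy 0.5.x with the turbomind backend, four GPUs (matching `--tp 4` above), an illustrative `session_len`, and a placeholder image URL; exact argument names may differ between lmdeploy versions:

```python
# Sketch of LMDeploy's VLM pipeline for offline inference (no api_server needed).
# Assumes lmdeploy==0.5.x and CUDA GPUs; tp=4 mirrors the service command above,
# session_len is an illustrative value, and the image URL is just a placeholder.
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

pipe = pipeline('OpenGVLab/InternVL2-Llama3-76B',
                backend_config=TurbomindEngineConfig(tp=4, session_len=8192))

image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response.text)
```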
@@ -533,14 +533,6 @@ response = client.chat.completions.create(
  print(response)
  ```

- ### vLLM
-
- TODO
-
- ### Ollama
-
- TODO
-
  ## License

  This project is released under the MIT license, while Llama3 is licensed under the Llama 3 Community License.
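For reference, the `client.chat.completions.create(...)` call whose closing `print(response)` appears above targets the OpenAI-compatible server started earlier. A hedged sketch, assuming `pip install openai`, the server listening on port 23333 as launched above, and a placeholder API key:

```python
# Sketch of an OpenAI-style client call against the api_server launched above.
# Assumes `pip install openai` and the server on port 23333; the API key is a placeholder.
from openai import OpenAI

client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id  # the served model registers itself by name
response = client.chat.completions.create(
    model=model_name,
    messages=[{'role': 'user', 'content': 'Describe InternVL2 in one sentence.'}],
    temperature=0.8,
    top_p=0.8)
print(response)
```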
@@ -613,6 +605,8 @@ InternVL 2.0 是一个多模态大语言模型系列,包含各种规模的模
  | MathVista<sub>testmini</sub> | 63.8 | 67.7 | 63.7 | 65.5 |
  | OpenCompass<sub>avg</sub> | 69.9 | 67.9 | 69.7 | 71.0 |

+ - 关于更多的细节以及评测复现,请看我们的[评测指南](https://internvl.readthedocs.io/en/latest/internvl2.0/evaluation.html)。
+
  - 我们同时使用 InternVL 和 VLMEvalKit 仓库进行模型评估。具体来说,DocVQA、ChartQA、InfoVQA、TextVQA、MME、AI2D、MMBench、CCBench、MMVet 和 SEED-Image 的结果是使用 InternVL 仓库测试的。OCRBench、RealWorldQA、HallBench 和 MathVista 是使用 VLMEvalKit 进行评估的。

  - 对于MMMU,我们报告了原始分数(左侧:InternVL系列模型使用InternVL代码库评测,其他模型的分数来自其技术报告或网页)和VLMEvalKit分数(右侧:从OpenCompass排行榜收集)。
@@ -671,30 +665,28 @@ InternVL 2.0 是一个多模态大语言模型系列,包含各种规模的模

  ## 微调

- 来自ModelScope社区的SWIFT已经支持对InternVL进行微调(图像/视频),详情请查看[此链接](https://github.com/modelscope/swift/blob/main/docs/source_en/Multi-Modal/internvl-best-practice.md)
+ 许多仓库现在都支持 InternVL 系列模型的微调,包括 [InternVL](https://github.com/OpenGVLab/InternVL)、[SWIFT](https://github.com/modelscope/ms-swift)、[XTuner](https://github.com/InternLM/xtuner) 等。请参阅它们的文档以获取更多微调细节。

  ## 部署

  ### LMDeploy

- #### API部署
-
- 为了将InternVL2部署成API,请先配置聊天模板配置文件。创建如下的 JSON 文件 `chat_template.json`。
+ LMDeploy 是由 MMRazor 和 MMDeploy 团队开发的用于压缩、部署和服务大语言模型(LLM)的工具包。

- ```json
- {
- "model_name":"internvl-internlm2",
- "meta_instruction":"我是书生·万象,英文名是InternVL,是由上海人工智能实验室、清华大学及多家合作单位联合开发的多模态大语言模型。",
- "stop_words":["<|im_start|>", "<|im_end|>"]
- }
+ ```sh
+ pip install lmdeploy==0.5.3
  ```

+ LMDeploy 将多模态视觉-语言模型(VLM)的复杂推理过程抽象为一个易于使用的管道,类似于大语言模型(LLM)的推理管道。
+
+ #### API部署
+
  LMDeploy 的 `api_server` 使模型能够通过一个命令轻松打包成服务。提供的 RESTful API 与 OpenAI 的接口兼容。以下是服务启动的示例:

  > **⚠️ 注意**: 请务必安装Flash Attention; 否则,使用`--tp`将存在异常。

  ```shell
- CUDA_VISIBLE_DEVICES=0,1,2,3 lmdeploy serve api_server OpenGVLab/InternVL2-Llama3-76B --backend turbomind --server-port 23333 --chat-template chat_template.json --tp 4
+ CUDA_VISIBLE_DEVICES=0,1,2,3 lmdeploy serve api_server OpenGVLab/InternVL2-Llama3-76B --backend turbomind --server-port 23333 --tp 4
  ```

  为了使用OpenAI风格的API接口,您需要安装OpenAI:
@@ -731,14 +723,6 @@ response = client.chat.completions.create(
  print(response)
  ```

- ### vLLM
-
- TODO
-
- ### Ollama
-
- TODO
-
  ## 开源许可证

  该项目采用 MIT 许可证发布,而 Llama3 则采用 Llama 3 Community License 许可证。
 
modeling_intern_vit.py CHANGED
@@ -20,18 +20,12 @@ from transformers.utils import logging
  from .configuration_intern_vit import InternVisionConfig

  try:
-     try: # v1
-         from flash_attn.flash_attn_interface import \
-             flash_attn_unpadded_qkvpacked_func
-     except: # v2
-         from flash_attn.flash_attn_interface import \
-             flash_attn_varlen_qkvpacked_func as flash_attn_unpadded_qkvpacked_func
-
      from flash_attn.bert_padding import pad_input, unpad_input
-
+     from flash_attn.flash_attn_interface import \
+         flash_attn_varlen_qkvpacked_func
      has_flash_attn = True
  except:
-     print('FlashAttention is not installed.')
+     print('FlashAttention2 is not installed.')
      has_flash_attn = False

  logger = logging.get_logger(__name__)
@@ -74,7 +68,7 @@ class FlashAttention(nn.Module):
  max_s = seqlen
  cu_seqlens = torch.arange(0, (batch_size + 1) * seqlen, step=seqlen, dtype=torch.int32,
                            device=qkv.device)
- output = flash_attn_unpadded_qkvpacked_func(
+ output = flash_attn_varlen_qkvpacked_func(
      qkv, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
      softmax_scale=self.softmax_scale, causal=causal
  )
@@ -84,7 +78,7 @@ class FlashAttention(nn.Module):
  x = rearrange(qkv, 'b s three h d -> b s (three h d)')
  x_unpad, indices, cu_seqlens, max_s = unpad_input(x, key_padding_mask)
  x_unpad = rearrange(x_unpad, 'nnz (three h d) -> nnz three h d', three=3, h=nheads)
- output_unpad = flash_attn_unpadded_qkvpacked_func(
+ output_unpad = flash_attn_varlen_qkvpacked_func(
      x_unpad, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
      softmax_scale=self.softmax_scale, causal=causal
  )
@@ -93,7 +87,7 @@ class FlashAttention(nn.Module):
      'b s (h d) -> b s h d', h=nheads)
  else:
      assert max_s is not None
-     output = flash_attn_unpadded_qkvpacked_func(
+     output = flash_attn_varlen_qkvpacked_func(
          qkv, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
          softmax_scale=self.softmax_scale, causal=causal
      )
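The changes above adopt the FlashAttention v2 name `flash_attn_varlen_qkvpacked_func`; the v1 alias `flash_attn_unpadded_qkvpacked_func` no longer exists in flash-attn 2.x. A minimal smoke-test sketch, assuming flash-attn ≥ 2.x and a CUDA device, with shapes following the packed-QKV layout used in this module:

```python
# Smoke test for the FlashAttention v2 varlen packed-QKV call adopted above.
# Assumes flash-attn >= 2.x, a CUDA device, and fp16/bf16 inputs.
import torch
from flash_attn.flash_attn_interface import flash_attn_varlen_qkvpacked_func

batch_size, seqlen, nheads, headdim = 2, 8, 4, 64
# Packed QKV: (total_tokens, 3, nheads, headdim), with all sequences concatenated.
qkv = torch.randn(batch_size * seqlen, 3, nheads, headdim,
                  dtype=torch.bfloat16, device='cuda')
# Cumulative sequence lengths, e.g. [0, 8, 16] for two length-8 sequences.
cu_seqlens = torch.arange(0, (batch_size + 1) * seqlen, step=seqlen,
                          dtype=torch.int32, device='cuda')
out = flash_attn_varlen_qkvpacked_func(qkv, cu_seqlens, seqlen, 0.0,
                                       softmax_scale=None, causal=False)
print(out.shape)  # torch.Size([16, 4, 64])
```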
 