fix compatibility issue for transformers 4.46+

Files changed:
- README.md (+15 -8)
- configuration_internvl_chat.py (+2 -2)
- modeling_intern_vit.py (+1 -0)
README.md

@@ -5,6 +5,7 @@ library_name: transformers
 base_model:
 - OpenGVLab/InternViT-6B-448px-V1-5
 - internlm/internlm2-chat-20b
+new_version: OpenGVLab/InternVL2_5-26B
 base_model_relation: merge
 language:
 - multilingual
@@ -19,7 +20,7 @@ tags:
 
 # InternVL-Chat-V1-5
 
-[\[GitHub\]](https://github.com/OpenGVLab/InternVL) [\[Blog\]](https://internvl.github.io/blog/) [\[InternVL 1.0
+[\[GitHub\]](https://github.com/OpenGVLab/InternVL) [\[Blog\]](https://internvl.github.io/blog/) [\[InternVL 1.0\]](https://arxiv.org/abs/2312.14238) [\[InternVL 1.5\]](https://arxiv.org/abs/2404.16821) [\[Mini-InternVL\]](https://arxiv.org/abs/2410.16261)
 
 [\[Chat Demo\]](https://internvl.opengvlab.com/) [\[HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[Quick Start\]](#quick-start) [\[中文解读\]](https://zhuanlan.zhihu.com/p/706547971) [\[Documents\]](https://internvl.readthedocs.io/en/latest/)
 
@@ -86,7 +87,7 @@ We provide an example code to run InternVL-Chat-V1-5 using `transformers`.
 
 We also welcome you to experience the InternVL2 series models in our [online demo](https://internvl.opengvlab.com/).
 
-> Please use transformers
+> Please use transformers>=4.37.2 to ensure the model works normally.
 
 ### Model Loading
 
@@ -386,7 +387,7 @@ response, history = model.chat(tokenizer, pixel_values, question, generation_con
 print(f'User: {question}\nAssistant: {response}')
 ```
 
-#### Streaming
+#### Streaming Output
 
 Besides this method, you can also use the following code to get streamed output.
 
@@ -426,12 +427,12 @@ Many repositories now support fine-tuning of the InternVL series models, includi
 LMDeploy is a toolkit for compressing, deploying, and serving LLM, developed by the MMRazor and MMDeploy teams.
 
 ```sh
-pip install lmdeploy
+pip install lmdeploy>=0.5.3
 ```
 
 LMDeploy abstracts the complex inference process of multi-modal Vision-Language Models (VLM) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference pipeline.
 
-#### A 'Hello, world'
+#### A 'Hello, world' Example
 
 ```python
 from lmdeploy import pipeline, TurbomindEngineConfig
@@ -446,7 +447,7 @@ print(response.text)
 
 If `ImportError` occurs while executing this case, please install the required dependency packages as prompted.
 
-#### Multi-images
+#### Multi-images Inference
 
 When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images will lead to a higher number of input tokens, and as a result, the size of the context window typically needs to be increased.
 
@@ -471,7 +472,7 @@ response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe thes
 print(response.text)
 ```
 
-#### Batch
+#### Batch Prompts Inference
 
 Conducting inference with batch prompts is quite straightforward; just place them within a list structure:
 
@@ -491,7 +492,7 @@ response = pipe(prompts)
 print(response)
 ```
 
-#### Multi-turn
+#### Multi-turn Conversation
 
 There are two ways to do the multi-turn conversations with the pipeline. One is to construct messages according to the format of OpenAI and use above introduced method, the other is to use the `pipeline.chat` interface.
 
@@ -561,6 +562,12 @@ This project is released under the MIT license, while InternLM2 is licensed unde
 If you find this project useful in your research, please consider citing:
 
 ```BibTeX
+@article{gao2024mini,
+  title={Mini-internvl: A flexible-transfer pocket multimodal model with 5\% parameters and 90\% performance},
+  author={Gao, Zhangwei and Chen, Zhe and Cui, Erfei and Ren, Yiming and Wang, Weiyun and Zhu, Jinguo and Tian, Hao and Ye, Shenglong and He, Junjun and Zhu, Xizhou and others},
+  journal={arXiv preprint arXiv:2410.16261},
+  year={2024}
+}
 @article{chen2023internvl,
   title={InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks},
   author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and Li, Bin and Luo, Ping and Lu, Tong and Qiao, Yu and Dai, Jifeng},
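For reference, the "Multi-turn Conversation" section renamed above builds on LMDeploy's `pipeline.chat` interface. Below is a minimal sketch of that usage pattern, assuming the LMDeploy VLM pipeline API (`pipeline`, `TurbomindEngineConfig`, `GenerationConfig`, `load_image`, and a `chat` call that returns a session object carrying the reply and history); the image URL and sampling parameters are illustrative, not taken from this diff.

```python
from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig
from lmdeploy.vl import load_image

# Build a VLM pipeline for this repository's model (illustrative settings).
pipe = pipeline('OpenGVLab/InternVL-Chat-V1-5',
                backend_config=TurbomindEngineConfig(session_len=8192))
gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.8)

# Any reachable image works here; this URL is just an example.
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')

# First turn: pass (text, image); the returned session keeps the conversation state.
sess = pipe.chat(('describe this image', image), gen_config=gen_config)
print(sess.response.text)

# Follow-up turn: reuse the session so the model sees the previous exchange.
sess = pipe.chat('What is the animal doing?', session=sess, gen_config=gen_config)
print(sess.response.text)
```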
configuration_internvl_chat.py

@@ -39,11 +39,11 @@ class InternVLChatConfig(PretrainedConfig):
         super().__init__(**kwargs)
 
         if vision_config is None:
-            vision_config = {}
+            vision_config = {'architectures': ['InternVisionModel']}
             logger.info('vision_config is None. Initializing the InternVisionConfig with default values.')
 
         if llm_config is None:
-            llm_config = {}
+            llm_config = {'architectures': ['InternLM2ForCausalLM']}
             logger.info('llm_config is None. Initializing the LlamaConfig config with default values (`LlamaConfig`).')
 
         self.vision_config = InternVisionConfig(**vision_config)
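The two changed defaults above are the substance of the transformers 4.46+ fix: the nested config dicts now declare their `architectures` up front, presumably so that code paths which rebuild the sub-configs from defaults (something newer transformers releases exercise more often) still know which model classes they describe. A hedged way to check the patched behaviour, assuming the repository loads with `trust_remote_code=True`; the printed values are expectations, not guaranteed output.

```python
from transformers import AutoConfig

# Load this repository's remote-code config.
config = AutoConfig.from_pretrained('OpenGVLab/InternVL-Chat-V1-5',
                                    trust_remote_code=True)
print(config.vision_config.architectures)  # expected: ['InternVisionModel']
print(config.llm_config.architectures)     # expected: ['InternLM2ForCausalLM']

# Rebuilding the config class with no explicit sub-configs exercises the
# patched branch: the defaults are no longer bare empty dicts.
fresh = type(config)()
print(fresh.llm_config.architectures)      # expected: ['InternLM2ForCausalLM']
```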
modeling_intern_vit.py

@@ -3,6 +3,7 @@
 # Copyright (c) 2024 OpenGVLab
 # Licensed under The MIT License [see LICENSE for details]
 # --------------------------------------------------------
+
 from typing import Optional, Tuple, Union
 
 import torch