Update README.md
README.md CHANGED
pipeline_tag: visual-question-answering
---
# InternVL-Chat-V1-2
<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/k0tma4PhPFrwJvpS_gVQf.webp" alt="Image Description" width="300" height="300">
</p>
[\[🆕 Blog\]](https://internvl.github.io/blog/) [\[📜 InternVL 1.0 Paper\]](https://arxiv.org/abs/2312.14238) [\[📜 InternVL 1.5 Report\]](https://arxiv.org/abs/2404.16821) [\[🗨️ Chat Demo\]](https://internvl.opengvlab.com/)

For better training reproducibility, we follow a minimalist design and data-efficient strategy:

- Learnable Component: ViT + MLP + LLM
- Data: a simplified, fully open-source dataset containing approximately 1.2 million samples.
## Released Models
| Model | Vision Foundation Model | Release Date | Note |
| :----------------------------------------------------------: | :----------------------------------------------------------: | :----------: | :----------------------------------- |
| InternVL-Chat-V1-5 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5)) | InternViT-6B-448px-V1-5 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5)) | 2024.04.18 | supports 4K images and very strong OCR; approaches the performance of GPT-4V and Gemini Pro on benchmarks such as MMMU, DocVQA, ChartQA, and MathVista (🔥 new) |
| InternVL-Chat-V1-2-Plus (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2-Plus)) | InternViT-6B-448px-V1-2 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) | 2024.02.21 | more SFT data and stronger performance |
| InternVL-Chat-V1-2 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2)) | InternViT-6B-448px-V1-2 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) | 2024.02.11 | scales the LLM up to 34B |
| InternVL-Chat-V1-1 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-1)) | InternViT-6B-448px-V1-0 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-0)) | 2024.01.24 | supports Chinese and stronger OCR |
## Performance
\* Proprietary Model
- In most benchmarks, InternVL-Chat-V1-2 achieves better performance than LLaVA-NeXT-34B.
- Update (2024-04-21): We have fixed a bug in the evaluation code, and the TextVQA result has been corrected to 72.5.
## Training Details
### Data Preparation

Inspired by LLaVA-NeXT, we adopted a data-efficient SFT strategy to train InternVL-Chat-V1-2.

For more details about data preparation, please see [here](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat#prepare-training-datasets).
### Training (Supervised Finetuning)
We provide [slurm scripts](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat/shell/hermes2_yi34b/internvl_chat_v1_2_hermes2_yi34b_448_finetune.sh) for multi-node multi-GPU training. You can use either 32 or 64 GPUs to train this model. If you use 64 GPUs, training will take approximately 18 hours.

The hyperparameters used for finetuning are listed in the following table.

| Hyperparameter | Trainable Parameters | Global Batch Size | Learning Rate | Epochs | Max Length | Weight Decay |
| ------------------ | ---------------- | ----------------- | ------------- | ------ | ---------- | ------------ |
| InternVL-Chat-V1-2 | 40B (full model) | 512 | 1e-5 | 1 | 2048 | 0.05 |
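
For orientation only, here is a hypothetical sketch of how these hyperparameters could be mapped onto `transformers`' `TrainingArguments`. The per-device batch size, gradient accumulation, and precision settings below are assumptions (chosen so that 64 GPUs reach the 512 global batch size); the linked slurm scripts remain the authoritative configuration.

```python
from transformers import TrainingArguments

# Hypothetical mapping of the finetuning hyperparameters above.
# Assumption: 64 GPUs x 8 samples per GPU x 1 accumulation step = 512 global batch size.
training_args = TrainingArguments(
    output_dir="./internvl_chat_v1_2_finetune",  # placeholder output path
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    learning_rate=1e-5,
    num_train_epochs=1,
    weight_decay=0.05,
    bf16=True,             # assumption: bfloat16 mixed precision, matching the inference dtype
    logging_steps=10,
    save_strategy="epoch",
)

# The 2048 max length is enforced when tokenizing the SFT data, e.g.:
# tokenizer.model_max_length = 2048
```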
## Model Usage
We provide example code to run InternVL-Chat-V1-2 using `transformers`.
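
Below is a condensed sketch of the usage pattern. It assumes the custom `chat()` helper that the repository exposes through `trust_remote_code=True`, a local example image path, and a GPU with enough memory for the bfloat16 weights; see the full example in this repository for the exact interface.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer, CLIPImageProcessor

path = "OpenGVLab/InternVL-Chat-V1-2"

# Load the model, tokenizer, and image processor.
# trust_remote_code=True pulls in the custom InternVL modeling code from the repo.
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
image_processor = CLIPImageProcessor.from_pretrained(path)

# Preprocess an example image at the model's 448x448 input resolution.
image = Image.open("./examples/image.jpg").convert("RGB").resize((448, 448))
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

# Single-turn visual question answering through the model's chat() helper.
generation_config = dict(num_beams=1, max_new_tokens=512, do_sample=False)
question = "Please describe the image in detail."
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(response)
```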
## License
This project is released under the MIT license. Parts of this project contain code and models (e.g., LLaMA2) from other sources, which are subject to their respective licenses.
Llama 2 is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.
## Acknowledgement
InternVL is built with reference to the code of the following projects: [OpenAI CLIP](https://github.com/openai/CLIP), [Open CLIP](https://github.com/mlfoundations/open_clip), [CLIP Benchmark](https://github.com/LAION-AI/CLIP_benchmark), [EVA](https://github.com/baaivision/EVA/tree/master), [InternImage](https://github.com/OpenGVLab/InternImage), [ViT-Adapter](https://github.com/czczup/ViT-Adapter), [MMSegmentation](https://github.com/open-mmlab/mmsegmentation), [Transformers](https://github.com/huggingface/transformers), [DINOv2](https://github.com/facebookresearch/dinov2), [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2), [Qwen-VL](https://github.com/QwenLM/Qwen-VL/tree/master/eval_mm), and [LLaVA-1.5](https://github.com/haotian-liu/LLaVA). Thanks for their awesome work!
## Contributors
Developed by: Zhe Chen, Weiyun Wang, Wenhai Wang, Erfei Cui, Zhangwei Gao, Xizhou Zhu, Lewei Lu, Tong Lu, Yu Qiao, Jifeng Dai