Update README.md
README.md CHANGED
pipeline_tag: visual-question-answering
---
# InternVL-Chat-V1-2
<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/k0tma4PhPFrwJvpS_gVQf.webp" alt="Image Description" width="300" height="300">
</p>
[\[🆕 Blog\]](https://internvl.github.io/blog/) [\[📜 InternVL 1.0 Paper\]](https://arxiv.org/abs/2312.14238) [\[📜 InternVL 1.5 Report\]](https://arxiv.org/abs/2404.16821) [\[🗨️ Chat Demo\]](https://internvl.opengvlab.com/)

For better training reproducibility, we follow a minimalist design and data-efficient strategy:

- Learnable Component: ViT + MLP + LLM
- Data: a simplified, fully open-source dataset containing approximately 1.2 million samples.
## Released Models
| Model | Vision Foundation Model | Release Date | Note |
| :----------------------------------------------------------: | :----------------------------------------------------------: | :----------: | :----------------------------------- |
| InternVL-Chat-V1-5 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5)) | InternViT-6B-448px-V1-5 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5)) | 2024.04.18 | supports 4K images and very strong OCR; approaches the performance of GPT-4V and Gemini Pro on benchmarks such as MMMU, DocVQA, ChartQA, and MathVista (🔥 new) |
| InternVL-Chat-V1-2-Plus (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2-Plus)) | InternViT-6B-448px-V1-2 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) | 2024.02.21 | more SFT data and stronger performance |
| InternVL-Chat-V1-2 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2)) | InternViT-6B-448px-V1-2 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) | 2024.02.11 | scales the LLM up to 34B |
| InternVL-Chat-V1-1 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-1)) | InternViT-6B-448px-V1-0 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-0)) | 2024.01.24 | supports Chinese and stronger OCR |
## Performance
\* Proprietary Model
- In most benchmarks, InternVL-Chat-V1-2 achieves better performance than LLaVA-NeXT-34B.
- Update (2024-04-21): We have fixed a bug in the evaluation code, and the TextVQA result has been corrected to 72.5.
## Training Details
### Data Preparation

Inspired by LLaVA-NeXT, we adopted a data-efficient SFT strategy to train InternVL-Chat-V1-2.

For more details about data preparation, please see [here](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat#prepare-training-datasets).
### Training (Supervised Finetuning)
We provide [slurm scripts](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat/shell/hermes2_yi34b/internvl_chat_v1_2_hermes2_yi34b_448_finetune.sh) for multi-node multi-GPU training. You can use either 32 or 64 GPUs to train this model. If you use 64 GPUs, training will take approximately 18 hours.

The hyperparameters used for finetuning are listed in the following table.

| Hyperparameter | Trainable Parameters | Global Batch Size | Learning Rate | Epochs | Max Length | Weight Decay |
| ------------------ | ---------------- | ----------------- | ------------- | ------ | ---------- | ------------ |
| InternVL-Chat-V1-2 | 40B (full model) | 512 | 1e-5 | 1 | 2048 | 0.05 |
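
For orientation only, here is a hypothetical sketch of how these hyperparameters could be mapped onto `transformers`' `TrainingArguments`. The per-device batch size, gradient accumulation, and precision settings below are assumptions (chosen so that 64 GPUs reach the 512 global batch size); the linked slurm scripts remain the authoritative configuration.

```python
from transformers import TrainingArguments

# Hypothetical mapping of the finetuning hyperparameters above.
# Assumption: 64 GPUs x 8 samples per GPU x 1 accumulation step = 512 global batch size.
training_args = TrainingArguments(
    output_dir="./internvl_chat_v1_2_finetune",  # placeholder output path
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    learning_rate=1e-5,
    num_train_epochs=1,
    weight_decay=0.05,
    bf16=True,             # assumption: bfloat16 mixed precision, matching the inference dtype
    logging_steps=10,
    save_strategy="epoch",
)

# The 2048 max length is enforced when tokenizing the SFT data, e.g.:
# tokenizer.model_max_length = 2048
```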
## Model Usage
We provide example code to run InternVL-Chat-V1-2 using `transformers`.
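
Below is a condensed sketch of the usage pattern. It assumes the custom `chat()` helper that the repository exposes through `trust_remote_code=True`, a local example image path, and a GPU with enough memory for the bfloat16 weights; see the full example in this repository for the exact interface.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer, CLIPImageProcessor

path = "OpenGVLab/InternVL-Chat-V1-2"

# Load the model, tokenizer, and image processor.
# trust_remote_code=True pulls in the custom InternVL modeling code from the repo.
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
image_processor = CLIPImageProcessor.from_pretrained(path)

# Preprocess an example image at the model's 448x448 input resolution.
image = Image.open("./examples/image.jpg").convert("RGB").resize((448, 448))
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

# Single-turn visual question answering through the model's chat() helper.
generation_config = dict(num_beams=1, max_new_tokens=512, do_sample=False)
question = "Please describe the image in detail."
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(response)
```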
## License
This project is released under the MIT license. Parts of this project contain code and models (e.g., LLaMA2) from other sources, which are subject to their respective licenses.
Llama 2 is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.
## Acknowledgement
InternVL is built with reference to the code of the following projects: [OpenAI CLIP](https://github.com/openai/CLIP), [Open CLIP](https://github.com/mlfoundations/open_clip), [CLIP Benchmark](https://github.com/LAION-AI/CLIP_benchmark), [EVA](https://github.com/baaivision/EVA/tree/master), [InternImage](https://github.com/OpenGVLab/InternImage), [ViT-Adapter](https://github.com/czczup/ViT-Adapter), [MMSegmentation](https://github.com/open-mmlab/mmsegmentation), [Transformers](https://github.com/huggingface/transformers), [DINOv2](https://github.com/facebookresearch/dinov2), [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2), [Qwen-VL](https://github.com/QwenLM/Qwen-VL/tree/master/eval_mm), and [LLaVA-1.5](https://github.com/haotian-liu/LLaVA). Thanks for their awesome work!
## Contributors
Developed by: Zhe Chen, Weiyun Wang, Wenhai Wang, Erfei Cui, Zhangwei Gao, Xizhou Zhu, Lewei Lu, Tong Lu, Yu Qiao, Jifeng Dai