---
license: apache-2.0
datasets:
- FreedomIntelligence/ALLaVA-4V
language:
- en
pipeline_tag: text-generation
---
# ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model
# This is a fork of the official repo
<p align="center">
⚡ALLaVA is a project that provides a large-scale GPT4V-synthesized dataset for training LVLMs.⚡
</p>
<!-- <p align="center">
![Python 3.10](https://img.shields.io/badge/Python-3.10-lightblue) ![Pytorch 1.13.0](https://img.shields.io/badge/PyTorch-2.1.1-lightblue) ![transformers](https://img.shields.io/badge/transformers-4.37.0-lightblue)
</p> -->
<p align="center">
📃 <a href="https://arxiv.org/abs/2402.11684" target="_blank">Paper</a> • 🌐 <a href="https://allava.freedomai.cn/#/" target="_blank">Demo</a> • 👨🏻‍💻 <a href="https://github.com/FreedomIntelligence/ALLaVA" target="_blank">Github</a>
</p>
<p align="center">
🤗 <a href="https://huggingface.co/datasets/FreedomIntelligence/ALLaVA-4V" target="_blank">ALLaVA-4V Dataset</a>
</p>
<p align="center">
🤗 <a href="https://huggingface.co/FreedomIntelligence/ALLaVA-3B-Longer" target="_blank">ALLaVA-3B-Longer</a> • 🤗 <a href="https://huggingface.co/FreedomIntelligence/ALLaVA-3B" target="_blank">ALLaVA-3B</a>
</p>
<!-- <p align="center">
📃 <a href="https://arxiv.org/abs/2402.11684" target="_blank">Paper</a> • 🌐 <a href="https://allava.freedomai.cn/#/" target="_blank">Demo</a> • 🤗 <a href="https://huggingface.co/datasets/FreedomIntelligence/ALLaVA-4V" target="_blank">ALLaVA-4V Dataset</a> • 🤗 <a href="https://huggingface.co/FreedomIntelligence/ALLaVA-3B-Longer" target="_blank">ALLaVA-3B-Longer</a> • 🤗 <a href="https://huggingface.co/FreedomIntelligence/ALLaVA-3B" target="_blank">ALLaVA-3B</a>
<br> <a href="https://github.com/FreedomIntelligence/CMB/blob/main/README_zh.md"> 中文</a> | <a href="https://github.com/FreedomIntelligence/CMB/blob/main/README.md"> English</a>
</p> -->
## Benchmark Result
Our models [**ALLaVA-3B-Longer**](https://huggingface.co/FreedomIntelligence/ALLaVA-3B-Longer) and [**ALLaVA-3B**](https://huggingface.co/FreedomIntelligence/ALLaVA-3B) achieve competitive results on 12 benchmarks. Bold numbers denote SOTA performance among 3B-scale models.
| Model | Backbone | Vicuna-80 | MMB | SEEDBench-v1 (img) | MM-Vet | MMMU (val) | MME | TextVQA | GQA | EMT (CIFAR10) | MLLM-Bench | TouchStone | LLaVA (In-the-Wild) |
|-------|----------|-----------|-----|-------------|--------|----------|-----|------|-----|---------|----|----|--------|
| Qwen-VL-Chat | Qwen-7B | - | 60.6 | 65.4 | - | 35.9 | 1487.5 | 61.5 | 57.5 | - | 6.2 | 711.6 | - |
| LLaVA-v1.5-7B | Vicuna-7B | - | 64.3 | - | 31.1 | - | 1510.7 | 58.2 | 62.0 | - | - | - | 65.4 |
| LLaVA-v1.5-13B | Vicuna-13B | 22.50 | 67.7 | 68.2 | 35.4 | 36.4 | 1531.3 | 61.3 | 63.3 | 85.0 | 7.4 | 637.7 | 70.7 |
| ShareGPT4V-7B | Vicuna-7B | - | 68.8 | 69.7 | 37.6 | - | 1943.8 | 60.4 | 63.3 | - | - | - | 72.6 |
| TinyGPT-V | Phi2-2.7B | - | - | - | - | - | - | - | 33.6 | - | - | - | - |
| MobileVLM | MobileLLaMA-2.7B | - | 59.6 | - | - | - | 1288.9 | 47.5 | - | - | - | - | - |
| LLaVA-Phi | Phi2-2.7B | - | 59.8 | - | 28.9 | - | 1335.1 | 48.6 | - | - | - | - | - |
| **ALLaVA-3B** | Phi2-2.7B | 48.8 | 64.0 | 65.2 | 32.2 | **35.3** | **1623.2** | 49.5 | 48.8 | **90.2** | 6.7 | 632.0 | 69.4 |
| **ALLaVA-3B-Longer** | Phi2-2.7B | **52.5** | **64.6** | **65.6** | **35.5** | 33.2 | 1564.6 | **50.3** | **50.0** | 85.9 | **8.8** | **636.5** | **71.7** |
Details of each benchmark are given in Table 4 of our [technical report](https://arxiv.org/pdf/2402.11684.pdf).
## 🏭 Inference
### Load from 🤗 (Recommended)
See the [example script](https://github.com/FreedomIntelligence/ALLaVA/blob/main/allava/serve/huggingface_inference.py).
### CLI
See [here](https://github.com/FreedomIntelligence/ALLaVA/tree/main?tab=readme-ov-file#cli) for a CLI code snippet.
## 🏋️‍♂️ Training
### Data
<div align=center>
<img src="training_datasets_by_stage.jpg" width = "640" alt="training_datasets" align=center />
</div>
As shown in the table above, ALLaVA-3B uses 1M samples for pretraining (PT) and 1.5M samples for fine-tuning (FT).
ALLaVA-3B-Longer trains for one more epoch in the FT stage (i.e., 3M samples in total).
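
The sample counts above can be checked with a little arithmetic. The sketch below assumes that "1M" and "1.5M" are rounded per-epoch sample counts and that ALLaVA-3B-Longer makes exactly two passes over the FT data; the constant and function names are illustrative and are not taken from the training code.

```python
# Rounded sample counts quoted above (assumed to be per-epoch sizes).
PT_SAMPLES = 1_000_000   # pretraining (PT)
FT_SAMPLES = 1_500_000   # fine-tuning (FT), one epoch


def samples_seen(ft_epochs: int) -> dict:
    """Return the number of samples seen in each stage for a given number of FT epochs."""
    return {"PT": PT_SAMPLES, "FT": FT_SAMPLES * ft_epochs}


allava_3b = samples_seen(ft_epochs=1)
allava_3b_longer = samples_seen(ft_epochs=2)  # one more FT epoch

print(allava_3b)         # {'PT': 1000000, 'FT': 1500000}
print(allava_3b_longer)  # {'PT': 1000000, 'FT': 3000000}
```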
### Code
The training code is largely based on [LLaVA-v1.5](https://github.com/haotian-liu/LLaVA).
We wholeheartedly express our gratitude for their invaluable contributions to open-sourcing LVLMs.
### Cost
We train our models on 8×A800 GPUs.
[ALLaVA-3B-Longer](https://huggingface.co/FreedomIntelligence/ALLaVA-3B-Longer) takes 8.3h for PT and 21.3h for FT.
[ALLaVA-3B](https://huggingface.co/FreedomIntelligence/ALLaVA-3B) takes 8.3h for PT and 10.6h for FT.
These two models share the same PT procedure.
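
For comparing against other setups, the wall-clock times above can be converted to GPU-hours. This is a rough sketch that assumes all 8 GPUs are occupied for the full duration of both stages; the names `WALL_CLOCK_HOURS` and `gpu_hours` are made up for illustration.

```python
NUM_GPUS = 8  # 8 x A800, as stated above

# Wall-clock hours per stage, copied from the text above.
WALL_CLOCK_HOURS = {
    "ALLaVA-3B": {"PT": 8.3, "FT": 10.6},
    "ALLaVA-3B-Longer": {"PT": 8.3, "FT": 21.3},
}


def gpu_hours(model: str) -> float:
    """Total GPU-hours: wall-clock hours summed over stages, times the GPU count."""
    return round(sum(WALL_CLOCK_HOURS[model].values()) * NUM_GPUS, 1)


for name in WALL_CLOCK_HOURS:
    print(f"{name}: {gpu_hours(name)} GPU-hours")
# ALLaVA-3B: 151.2 GPU-hours
# ALLaVA-3B-Longer: 236.8 GPU-hours
```

Because the two models share the PT run, training both costs less than the sum of the two totals: the 8.3h PT stage only needs to be paid for once.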
### Hyperparameters
| Global Batch Size| ZeRO Stage| Optimizer | Max LR| Min LR | Scheduler | Max length | Weight decay |
| ---: | ---: |--:| ---: | ---: | ---: | ---: | ---: |
| 256 (PT) / 128 (FT) | 1| AdamW | 2e-5 | 2e-6 | CosineAnnealingWarmRestarts | 2048 | 0 |
The LM backbone and the projector are trainable, while the vision encoder is kept frozen.
**The trainability of each module is the same in both stages.**
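
To make the hyperparameter table more concrete, here is a minimal pure-Python sketch of the learning-rate curve. It uses the same closed form as PyTorch's `CosineAnnealingWarmRestarts` with `T_mult=1`, but the restart period is an assumption: the table does not state it, so the sketch sets it to one epoch of each stage. Warmup, if any, is ignored, and the step counts are derived from the rounded 1M / 1.5M sample sizes.

```python
import math

MAX_LR, MIN_LR = 2e-5, 2e-6  # "Max LR" and "Min LR" from the table above


def steps_per_epoch(num_samples: int, global_batch_size: int) -> int:
    """Optimizer steps needed to see every sample once."""
    return math.ceil(num_samples / global_batch_size)


def cosine_lr(step: int, period: int) -> float:
    """Cosine annealing from MAX_LR down to MIN_LR, restarting every `period` steps."""
    t_cur = step % period
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1 + math.cos(math.pi * t_cur / period))


# Assumed: rounded sample counts, and a restart period of one epoch per stage.
pt_steps = steps_per_epoch(1_000_000, 256)   # global batch size 256 (PT)
ft_steps = steps_per_epoch(1_500_000, 128)   # global batch size 128 (FT)

print(pt_steps, ft_steps)                    # 3907 11719
print(cosine_lr(0, ft_steps))                # starts at MAX_LR
print(cosine_lr(ft_steps - 1, ft_steps))     # ends just above MIN_LR
```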
## 📚 ALLaVA-4V Data
The majority of the training data is [ALLaVA-4V](https://huggingface.co/datasets/FreedomIntelligence/ALLaVA-4V). See [here](https://github.com/FreedomIntelligence/ALLaVA/tree/main?tab=readme-ov-file#data-preparation) to prepare it for training.
## 🙌 Contributors
- Project Leader: [Guiming Hardy Chen](https://g-h-chen.github.io/)
- Data: Shunian Chen, [Junying Chen](https://jymchen.github.io/), Xiangbo Wu
- Evaluation: [Ruifei Zhang](https://scholar.google.com/citations?user=W4zOhmEAAAAJ&hl=zh-CN)
- Deployment: Xiangbo Wu, Zhiyi Zhang
- Advising: [Zhihong Chen](https://zhjohnchan.github.io/), [Benyou Wang](https://wabyking.github.io/old.html)
- Others: Jianquan Li, [Xiang Wan](https://scholar.google.com/citations?user=e3_kWigAAAAJ&hl=zh-CN)
## 📝 Citation
If you find our data useful, please consider citing our work! We are FreedomIntelligence from the [Shenzhen Research Institute of Big Data](http://sribd.cn/en) and [The Chinese University of Hong Kong, Shenzhen](https://sds.cuhk.edu.cn/en).
```bibtex
@article{chen2024allava,
title={ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model},
author={Chen, Guiming Hardy and Chen, Shunian and Zhang, Ruifei and Chen, Junying and Wu, Xiangbo and Zhang, Zhiyi and Chen, Zhihong and Li, Jianquan and Wan, Xiang and Wang, Benyou},
journal={arXiv preprint arXiv:2402.11684},
year={2024}
}
```