---
license: apache-2.0
datasets:
- AIDC-AI/Parrot-dataset
library_name: transformers
tags:
- MLLM
pipeline_tag: image-text-to-text
language:
- en
---

# Parrot-7B

## Introduction
Welcome to Parrot, a novel method that uses textual guidance to drive visual token alignment at the language level.
Parrot conditions the visual tokens on diverse language inputs and uses a Mixture-of-Experts (MoE) module to promote the alignment of multilingual tokens.
Moreover, because the field currently lacks benchmarks for evaluating multilingual capabilities, we collect and release the Massive Multilingual Multimodal Benchmark (MMMB), which covers 6 languages, 15 categories, and 12,000 questions.
For a comprehensive introduction, please refer to the [Parrot Paper](https://arxiv.org/abs/2406.02539) and [Parrot GitHub](https://github.com/AIDC-AI/Parrot).

## Model
Parrot is a multilingual multimodal large language model. We provide our fully finetuned models below:

| Model | Base LLM | Vision Encoder | Stage | Download |
| --- | --- | :---: | :---: | :---: |
| Parrot-7B | Qwen-1.5-7B-Chat | CLIP-ViT-Large-patch14-336 | SFT | [ckpt](https://huggingface.co/AIDC-AI/Parrot-7B) |
| Parrot-14B | Qwen-1.5-14B-Chat | CLIP-ViT-Large-patch14-336 | SFT | [ckpt](https://huggingface.co/AIDC-AI/Parrot-14B) |

<div align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/6076587d310e510df1db14bc/FAfbL6IqE7ZJdcx4_qQlF.png" width="600px" />
</div>


## Performance
<div align="center">
    <img src="https://cdn-uploads.huggingface.co/production/uploads/6076587d310e510df1db14bc/ZmTOkUZk5_UC1t0ExSmjM.png" width="400px" />
</div>
<div align="center">
    <img src="https://cdn-uploads.huggingface.co/production/uploads/6076587d310e510df1db14bc/Njdnvzcx7BsH7HkK-ylRo.png" width="100%" />
</div>


## Quick Start
We provide a quick start demo in [Parrot GitHub](https://github.com/AIDC-AI/Parrot), which can be used as a template for running inference with Parrot.

1. Download the [Parrot checkpoint](https://huggingface.co/AIDC-AI/Parrot-7B) and the [CLIP checkpoint](https://huggingface.co/openai/clip-vit-large-patch14-336).
2. Replace the paths in `runner.py` with the locations of the downloaded checkpoints.
3. Run `runner.py` with Python.
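
For step 1, a minimal sketch of fetching both checkpoints with the `huggingface_hub` library is shown below. The repository IDs come from the links above. The `checkpoints/` directory layout and the helper names `plan_local_dirs` and `download_checkpoints` are assumptions made for illustration and are not part of the Parrot repository.

```python
from pathlib import Path

# Repository IDs taken from the checkpoint links in step 1.
CHECKPOINTS = {
    "parrot": "AIDC-AI/Parrot-7B",
    "clip": "openai/clip-vit-large-patch14-336",
}


def plan_local_dirs(root):
    """Map each checkpoint name to a folder under `root` (hypothetical layout)."""
    root = Path(root)
    return {
        name: root / repo_id.split("/")[-1]
        for name, repo_id in CHECKPOINTS.items()
    }


def download_checkpoints(root="checkpoints"):
    """Download both checkpoints and return their local paths."""
    # Imported lazily so the path-planning helper works even when
    # huggingface_hub is not installed (pip install huggingface_hub).
    from huggingface_hub import snapshot_download

    paths = {}
    for name, local_dir in plan_local_dirs(root).items():
        # snapshot_download fetches the whole repository and returns
        # the path of the local folder that holds it.
        paths[name] = snapshot_download(
            repo_id=CHECKPOINTS[name],
            local_dir=str(local_dir),
        )
    return paths
```

Calling `download_checkpoints()` should return the two local folders, which are the values to substitute for the paths in `runner.py` in step 2. Both downloads are large, so make sure enough disk space is available before running it.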


## Citation
If you find Parrot useful, please cite the paper:

```bibtex
@article{sun2024parrot,
  title={Parrot: Multilingual Visual Instruction Tuning},
  author={Sun, Hai-Long and Zhou, Da-Wei and Li, Yang and Lu, Shiyin and Yi, Chao and Chen, Qing-Guo and Xu, Zhao and Luo, Weihua and Zhang, Kaifu and Zhan, De-Chuan and others},
  journal={arXiv preprint arXiv:2406.02539},
  year={2024}
}
```

## License
The project is licensed under the Apache License, Version 2.0, and is restricted to uses that comply with the license agreements of Qwen and CLIP.

## Disclaimer
We used compliance-checking algorithms during training to ensure, to the best of our ability, that the trained model is compliant. Due to the complexity of the data and the diversity of language model usage scenarios, we cannot guarantee that the model is completely free of copyright issues or improper content. If you believe that anything infringes on your rights or that the model generates improper content, please contact us, and we will promptly address the matter.