license: apache-2.0
datasets:
  - AIDC-AI/Parrot-dataset
library_name: transformers
tags:
  - MLLM
pipeline_tag: image-text-to-text
language:
  - en

Parrot-7B

Introduction

Welcome to Parrot, a novel method that uses textual guidance to drive visual token alignment at the language level. Parrot conditions the visual tokens on diverse language inputs and uses a Mixture-of-Experts (MoE) module to promote the alignment of multilingual tokens. Because the field currently lacks benchmarks for evaluating multilingual capabilities, we also collect and release the Massive Multilingual Multimodal Benchmark (MMMB), which covers 6 languages, 15 categories, and 12,000 questions. For a comprehensive introduction, please refer to the Parrot Paper and Parrot GitHub.
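
To make the idea of text-conditioned MoE alignment more concrete, here is a toy NumPy sketch. It is not Parrot's actual implementation: the mean-pooled visual query, the single linear map per expert, the residual connection, and every name in the code (`text_guided_moe`, `router_weights`, and so on) are assumptions chosen for illustration. It only shows the general pattern of letting the text decide how experts are mixed when transforming visual tokens.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def text_guided_moe(visual_tokens, text_tokens, expert_weights, router_weights):
    """Toy text-conditioned MoE over visual tokens (illustrative only).

    visual_tokens:  (num_visual, dim)
    text_tokens:    (num_text, dim)
    expert_weights: (num_experts, dim, dim), one linear map per expert
    router_weights: (dim, num_experts)
    """
    dim = visual_tokens.shape[-1]

    # Cross-attention: a pooled visual query attends over the text tokens,
    # producing a text summary that reflects the prompt's language.
    query = visual_tokens.mean(axis=0)                        # (dim,)
    attn = softmax(text_tokens @ query / np.sqrt(dim))        # (num_text,)
    text_context = attn @ text_tokens                         # (dim,)

    # Router: turn the text summary into a distribution over experts.
    gate = softmax(text_context @ router_weights)             # (num_experts,)

    # Every expert transforms every visual token.
    expert_out = np.einsum("vd,edk->evk", visual_tokens, expert_weights)

    # Mix the expert outputs with the gate and add them back residually.
    mixed = np.einsum("e,evk->vk", gate, expert_out)          # (num_visual, dim)
    return visual_tokens + mixed, gate


rng = np.random.default_rng(0)
visual = rng.normal(size=(4, 8))          # 4 visual tokens, dim 8
text = rng.normal(size=(5, 8))            # 5 text tokens, dim 8
experts = rng.normal(size=(3, 8, 8)) * 0.1
router = rng.normal(size=(8, 3))

aligned, gate = text_guided_moe(visual, text, experts, router)
print(aligned.shape)  # same shape as the input visual tokens
print(gate)           # expert mixing weights, summing to 1
```

In this sketch, prompts in different languages would yield different `text_context` vectors and therefore different expert mixtures, which is the intuition behind conditioning visual tokens on the language input.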

Model

Parrot is a multilingual multimodal large language model. We provide our fully finetuned models below:

| Model | Base LLM | Vision Encoder | Stage | Download |
| --- | --- | --- | --- | --- |
| Parrot-7B | Qwen-1.5-7B-Chat | CLIP-ViT-Large-patch14-336 | SFT | ckpt |
| Parrot-14B | Qwen-1.5-14B-Chat | CLIP-ViT-Large-patch14-336 | SFT | ckpt |

Performance

Quick Start

We provide a quick start demo in the Parrot GitHub repository, which can be used as a template for running inference with Parrot.

  1. Before running the demo, make sure you have downloaded both the Parrot checkpoint and the CLIP checkpoint.
  2. Replace the checkpoint paths in runner.py with the paths to your local copies.
  3. Run runner.py with Python.
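
The download step above could be scripted with `huggingface_hub`, roughly as sketched below. The repository IDs `AIDC-AI/Parrot-7B` and `openai/clip-vit-large-patch14-336` and the local `checkpoints` folder are assumptions made for this example; check the download links in the model table and the paths expected by runner.py before relying on them.

```python
from pathlib import Path

# Assumed Hub repository IDs; verify them against the model table's links.
PARROT_REPO = "AIDC-AI/Parrot-7B"
CLIP_REPO = "openai/clip-vit-large-patch14-336"


def local_dir_for(repo_id, root="checkpoints"):
    """Map a Hub repo ID such as 'org/name' to the local folder root/name."""
    return Path(root) / repo_id.split("/")[-1]


def download_checkpoints(root="checkpoints"):
    """Download both checkpoints and return a {repo_id: local_path} mapping."""
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    paths = {}
    for repo_id in (PARROT_REPO, CLIP_REPO):
        target = local_dir_for(repo_id, root)
        snapshot_download(repo_id=repo_id, local_dir=target)
        paths[repo_id] = target
    return paths


# Requires network access and enough disk space for both checkpoints:
# paths = download_checkpoints()
# for repo_id, path in paths.items():
#     print(f"{repo_id} -> {path}")
```

The returned paths are the values you would then substitute into runner.py in step 2.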

Citation

If you find Parrot useful, please cite the paper:

@article{sun2024parrot,
  title={Parrot: Multilingual Visual Instruction Tuning},
  author={Sun, Hai-Long and Zhou, Da-Wei and Li, Yang and Lu, Shiyin and Yi, Chao and Chen, Qing-Guo and Xu, Zhao and Luo, Weihua and Zhang, Kaifu and Zhan, De-Chuan and others},
  journal={arXiv preprint arXiv:2406.02539},
  year={2024}
}

License

The project is licensed under the Apache License, Version 2.0, and is restricted to uses that comply with the license agreements of Qwen and CLIP.

Disclaimer

We used compliance-checking algorithms during training to ensure, to the best of our ability, that the trained model is compliant. Due to the complexity of the data and the diversity of language model usage scenarios, we cannot guarantee that the model is completely free of copyright issues or improper content. If you believe anything infringes on your rights or generates improper content, please contact us, and we will promptly address the matter.