---
base_model:
- Qwen/Qwen2.5-1.5B-Instruct
- google/siglip-so400m-patch14-384
datasets:
- weizhiwang/Open-Qwen2VL-Data
- MAmmoTH-VL/MAmmoTH-VL-Instruct-12M
language:
- en
license: cc
pipeline_tag: image-text-to-text
---

# Model Card for Open-Qwen2VL

Open-Qwen2VL is a multimodal model that takes images and text as input and produces text as output. This model is described in the paper [Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources](https://huggingface.co/papers/2504.00595). The code is available at [https://github.com/Victorwz/Open-Qwen2VL](https://github.com/Victorwz/Open-Qwen2VL).

## Updates

- [4/1/2025] The codebase, model, data, and paper are released.

<!-- ## Model Details -->

## How to Use

First, install Open-Qwen2VL:

```
pip install git+https://github.com/Victorwz/Open-Qwen2VL.git#subdirectory=prismatic-vlms
```

You can then load the model and run inference as follows:

```python
import requests
import torch
from PIL import Image
from prismatic import load

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

# Load a pretrained VLM (either local path, or ID to auto-download from the HF Hub)
vlm = load("Open-Qwen2VL")
vlm.to(device, dtype=torch.bfloat16)

# Download an image and specify a prompt
image_url = "https://huggingface.co/adept/fuyu-8b/resolve/main/bus.png"
pil_image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
# Apply the vision backbone's transform and add a leading dimension
image = [vlm.vision_backbone.image_transform(pil_image).unsqueeze(0)]
# The prompt is the image placeholder, a newline, then the instruction
user_prompt = '<image>' + '\n' + "Describe the image."

# Generate!
generated_text = vlm.generate_batch(
    image,
    [user_prompt],
    do_sample=False,
    max_new_tokens=512,
    min_length=1,
)
print(generated_text[0])
```

The generated caption looks like:

```
The image depicts a blue and orange bus parked on the side of a street. ...
```

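To caption several images with the same question, the single-image call above can be wrapped in a small helper. The following is a minimal sketch, not part of the Open-Qwen2VL codebase: `build_prompt` and `describe_images` are hypothetical names introduced here, and the loop assumes only the calling pattern shown in the example above (one preprocessed image and one prompt per `generate_batch` call). Whether `generate_batch` accepts several images in a single call is not covered by this card, so the sketch does not rely on it.

```python
from typing import Any, Iterable, List

IMAGE_TOKEN = "<image>"  # image placeholder used in the example prompt


def build_prompt(question: str) -> str:
    """Hypothetical helper: image placeholder, newline, then the question."""
    return IMAGE_TOKEN + "\n" + question.strip()


def describe_images(
    vlm: Any,
    pil_images: Iterable[Any],
    question: str,
    max_new_tokens: int = 512,
) -> List[str]:
    """Hypothetical helper: one generate_batch call per image.

    `vlm` is assumed to be the object returned by `prismatic.load`, and
    `pil_images` an iterable of PIL images.
    """
    prompt = build_prompt(question)
    outputs: List[str] = []
    for pil_image in pil_images:
        # Same preprocessing as the single-image example:
        # transform, then add a leading dimension.
        pixel_values = vlm.vision_backbone.image_transform(
            pil_image.convert("RGB")
        ).unsqueeze(0)
        texts = vlm.generate_batch(
            [pixel_values],
            [prompt],
            do_sample=False,
            max_new_tokens=max_new_tokens,
            min_length=1,
        )
        outputs.append(texts[0])
    return outputs
```

With the `vlm` loaded as above, `describe_images(vlm, [Image.open("a.png"), Image.open("b.png")], "Describe the image.")` would return one generated string per image; the file names here are placeholders.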
## Citation

```bibtex
@article{Open-Qwen2VL,
  title={Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources},
  author={Wang, Weizhi and Tian, Yu and Yang, Linjie and Wang, Heng and Yan, Xifeng},
  journal={arXiv preprint arXiv:2504.00595},
  year={2025}
}
...