Improve model card and add metadata
#1
by nielsr (HF Staff) - opened
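This PR moves the two base checkpoints under a `base_model:` key at the top of the YAML front matter, adds `license: cc` and `library_name: transformers`, inserts a short model description linking the paper and code, and trims the long example caption and the commented-out LLaVA-Llama-3 material; the full diff of README.md follows. Once merged, the updated metadata can be read back with `huggingface_hub` as a quick sanity check. A minimal sketch, assuming the card lives at `weizhiwang/Open-Qwen2VL` (the exact repo id is not stated in the diff):

```python
from huggingface_hub import ModelCard

# Assumption: the model repo this card belongs to -- substitute the real repo id.
card = ModelCard.load("weizhiwang/Open-Qwen2VL")

# The metadata fields touched by this PR should now be present in the card data.
meta = card.data.to_dict()
print(meta.get("base_model"))    # the two base checkpoints listed in the diff
print(meta.get("license"))       # "cc"
print(meta.get("library_name"))  # "transformers"
print(meta.get("pipeline_tag"))  # "image-text-to-text" (unchanged by this PR)
```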
README.md CHANGED

@@ -1,20 +1,20 @@
 ---
-
+base_model:
+- Qwen/Qwen2.5-1.5B-Instruct
+- google/siglip-so400m-patch14-384
 datasets:
 - weizhiwang/Open-Qwen2VL-Data
 - MAmmoTH-VL/MAmmoTH-VL-Instruct-12M
 language:
 - en
-
-- Qwen/Qwen2.5-1.5B-Instruct
-- google/siglip-so400m-patch14-384
+license: cc
 pipeline_tag: image-text-to-text
+library_name: transformers
 ---
 
 # Model Card for Open-Qwen2VL
 
-
-
+Open-Qwen2VL is a multimodal model that takes images and text as input and produces text as output. This model is described in the paper [Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources](https://huggingface.co/papers/2504.00595). The code is available at [https://github.com/Victorwz/Open-Qwen2VL](https://github.com/Victorwz/Open-Qwen2VL).
 
 <!-- Please follow my reproduced implementation [LLaVA-Unified](https://github.com/Victorwz/LLaVA-Unified) for more details on fine-tuning LLaVA model with Llama-3 as the foundatiaon LLM. -->
 
@@ -48,7 +48,8 @@ vlm.to(device, dtype=torch.bfloat16)
 image_url = "https://huggingface.co/adept/fuyu-8b/resolve/main/bus.png"
 # image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
 image = [vlm.vision_backbone.image_transform(Image.open(requests.get(image_url, stream=True).raw).convert("RGB")).unsqueeze(0)]
-user_prompt = '<image>' + '
+user_prompt = '<image>' + '
+' + "Describe the image."
 
 # Generate!
 generated_text = vlm.generate_batch(
@@ -62,36 +63,14 @@ print(generated_text[0])
 ```
 The image caption results look like:
 ```
-The image depicts a blue and orange bus parked on the side of a street.
-
-The bus is parked on a grassy area adjacent to a sidewalk. The grass is well-maintained, and there are a few trees planted along the sidewalk, providing some shade. The sidewalk itself is made of concrete and appears to be clean and well-kept.
-
-In the background, there are residential buildings, which are typical of a suburban area. These buildings have pitched roofs and are constructed with a mix of brick and plaster. The architecture suggests a typical British suburban neighborhood. The sky above is partly cloudy, with patches of blue sky visible, indicating fair weather.
-
-The bus is parked in a designated bus stop area, as indicated by the presence of a bus stop sign and a bus stop shelter, although the shelter is not visible in the image. The bus's license plate is visible and reads "Y600 HJX." The overall scene suggests a typical day in a suburban area where public transportation is readily available for residents.
-
-The image does not contain any people, vehicles other than the bus, or any other notable objects. The focus is primarily on the bus and its surroundings. The bus appears to be in good condition, indicating that it is well-maintained and likely in regular use. The presence of the bus stop and the well-kept environment suggest that the area is well-organized and that public transportation is an integral part of the community's daily life.
+The image depicts a blue and orange bus parked on the side of a street. ...
 ```
 
-<!-- # Fine-Tune LLaVA-Llama-3 on Your Visual Instruction Data
-Please refer to our [LLaVA-Unified](https://github.com/Victorwz/LLaVA-Unified) git repo for fine-tuning data preparation and scripts. The data loading function and fastchat conversation template are changed due to a different tokenizer.
-
-## Benchmark Results
-
-
-| Model | MMMU Val |
-| :-------------------- | :---------------: |
-| LLaVA-v1.5-7B | 35.3 |
-| LLaVA-Llama-3-8B | 36.7 |
-
-Please refer to `eval_outputs/LLaVA-Llama-3-8B_mmmu_val.json` for reproduce the benchmark performance on MMMU validation set. -->
+<!-- # Fine-Tune LLaVA-Llama-3 on Your Visual Instruction Data ... -->
 
 ## Citation
 <!--
 ```bibtex
 @misc{wang2024llavallama3,
-
-
-      year={2024}
-}
-``` -->
+...
+``` -->
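For context, hunk 2 above edits the model card's usage snippet, which the diff only shows in fragments. The sketch below reconstructs that flow under stated assumptions: the loader import, the checkpoint name, and the exact `generate_batch` arguments are guesses (only `vlm.to(...)`, the image transform, the prompt string, and the `generate_batch`/`print` calls are visible in the diff), so the merged model card and the Open-Qwen2VL repository remain the authoritative reference.

```python
import requests
import torch
from PIL import Image

# Assumption: the Open-Qwen2VL codebase exposes a prismatic-style loader that
# returns a VLM with .vision_backbone, .to(), and .generate_batch(); the import
# path and checkpoint name below are placeholders, not the repo's real API.
from open_qwen2vl import load  # hypothetical import

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
vlm = load("Open-Qwen2VL")            # hypothetical checkpoint identifier
vlm.to(device, dtype=torch.bfloat16)  # shown as hunk context in the diff

# Image preprocessing as in the card: download, convert to RGB, apply the
# vision backbone's transform, and wrap the tensor in a one-element list.
image_url = "https://huggingface.co/adept/fuyu-8b/resolve/main/bus.png"
pil_image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
image = [vlm.vision_backbone.image_transform(pil_image).unsqueeze(0)]

# Prompt format after the PR: the <image> placeholder, a newline, then the request.
user_prompt = "<image>" + "\n" + "Describe the image."

# Generate! (arguments beyond the image and prompt are not visible in the diff,
# so none are guessed here)
generated_text = vlm.generate_batch(image, [user_prompt])
print(generated_text[0])
```

Wrapping the transformed image tensor in a one-element list mirrors the batch-of-one convention in the card's own snippet.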