Commit aba3cde
Parent(s): 9620ec6

Update README.md

README.md CHANGED
Below is the model card of the LLaVa-NeXT-Video 7B model, which is copied from the original LLaVA model card that you can find [here](https://huggingface.co/liuhaotian/llava-v1.5-13b).

Also check out the Google Colab demo to run Llava on a free-tier Google Colab instance: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1CZggLHrjxMReG-FNOmqSOdi4z7NPq6SO?usp=sharing)

Disclaimer: The team releasing LLaVa-NeXT-Video did not write a model card for this model, so this model card has been written by the Hugging Face team.

## Model details

**Model type:**
<br>
LLaVA-Next-Video is an open-source chatbot trained by fine-tuning an LLM on multimodal instruction-following data. The model is built on top of LLaVa-NeXT by tuning on a mix of video and image data. The videos were sampled uniformly to be 32 frames per clip.
<br>
Base LLM: lmsys/vicuna-7b-v1.5

<img src="http://drive.google.com/uc?export=view&id=1fVg-r5MU3NoHlTpD7_lYPEBWH9R8na_4">
+
|
26 |
+
|
27 |
**Model date:**
|
28 |
<br>
|
29 |
LLaVA-Next-Video-7B was trained in April 2024.
|
|
|
33 |
https://github.com/LLaVA-VL/LLaVA-NeXT
|

## Training dataset

### Image
- 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
- 158K GPT-generated multimodal instruction-following data.
- 500K academic-task-oriented VQA data mixture.
- 50K GPT-4V data mixture.
- 40K ShareGPT data.

### Video
- 100K VideoChatGPT-Instruct.

## Evaluation dataset
A collection of 4 benchmarks, including 3 academic VQA benchmarks and 1 captioning benchmark.

## How to use the model

First, make sure you have `transformers >= 4.42.0` installed.
The model supports multi-visual and multi-prompt generation, meaning that you can pass multiple images and/or videos in your prompt. Make sure also to follow the correct prompt template (`USER: xxx\nASSISTANT:`) and to add the token `<image>` or `<video>` at the location where you want to query images or videos.
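For example, the prompt strings for a single video query and a single image query look like the ones used in the snippets below; the mixed image-plus-video prompt is only an illustrative assumption showing that both tokens can appear in one turn:

```python
# Illustrative prompt strings; the question wording in the mixed example is an assumption.
video_prompt = "USER: <video>\nWhy is this video funny? ASSISTANT:"
image_prompt = "USER: <image>\nWhat are these?\nASSISTANT:"
# Several visual tokens can be placed in a single prompt when passing multiple images/videos:
mixed_prompt = "USER: <image>\nHow does this image relate to the following clip? <video>\nASSISTANT:"
```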
Below is an example script to run generation in `float16` precision on a GPU device:

```python
import av
import numpy as np
import torch
from huggingface_hub import hf_hub_download
from transformers import LlavaNextVideoProcessor, LlavaNextVideoForConditionalGeneration

model_id = "llava-hf/LLaVA-NeXT-Video-7B-hf"

model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
).to(0)  # half precision on the first GPU; further loading options are elided in this excerpt

video_path = hf_hub_download(repo_id="raushan-testing-hf/videos-test", filename="sample_demo_1.mp4", repo_type="dataset")
container = av.open(video_path)

# sample uniformly 8 frames from the video, can sample more for longer videos
total_frames = container.streams.video[0].frames
indices = np.arange(0, total_frames, total_frames / 8).astype(int)
clip = read_video_pyav(container, indices)  # PyAV decoding helper, sketched below
```
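The `read_video_pyav` helper, the processor, and the generation call are not among the lines shown in this excerpt. Below is a minimal sketch of how that part of the script could look, continuing from the code above; the helper body and the `max_new_tokens` / `do_sample` values are assumptions rather than the card's exact code, and in the full script the helper would be defined before it is first called:

```python
def read_video_pyav(container, indices):
    # decode only the frames whose indices were selected above and stack them as RGB arrays
    frames = []
    container.seek(0)
    for i, frame in enumerate(container.decode(video=0)):
        if i > indices[-1]:
            break
        if i >= indices[0] and i in indices:
            frames.append(frame)
    return np.stack([f.to_ndarray(format="rgb24") for f in frames])

processor = LlavaNextVideoProcessor.from_pretrained(model_id)

prompt = "USER: <video>\nWhy is this video funny? ASSISTANT:"
inputs = processor(text=prompt, videos=clip, return_tensors="pt").to(0, torch.float16)

output = model.generate(**inputs, max_new_tokens=100, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))
```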

To generate from images, use the code below after loading the model as shown above:

```python
import requests
from PIL import Image

prompt = "USER: <image>\nWhat are these?\nASSISTANT:"
image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"

raw_image = Image.open(requests.get(image_file, stream=True).raw)
inputs_image = processor(prompt, images=raw_image, return_tensors='pt').to(0, torch.float16)
```
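The generation and decoding steps for the image inputs fall outside the lines shown in this excerpt; they follow the same pattern as the video example (the `max_new_tokens` and `do_sample` values are assumptions):

```python
# Continue from `inputs_image` above: generate and decode the answer.
output = model.generate(**inputs_image, max_new_tokens=100, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))
```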

## License
Llama 2 is licensed under the LLAMA 2 Community License,
Copyright (c) Meta Platforms, Inc. All Rights Reserved.

## Intended use
**Primary intended uses:**
<br>
The primary use of LLaVA is research on large multimodal models and chatbots.

**Primary intended users:**
<br>
The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

## Citation
If you find our paper and code useful in your research:

```BibTeX
@misc{zhang2024llavanextvideo,
    title={LLaVA-NeXT: A Strong Zero-shot Video Understanding Model},
    url={https://llava-vl.github.io/blog/2024-04-30-llava-next-video/},
    author={Zhang, Yuanhan and Li, Bo and Liu, Haotian and Lee, Yong Jae and Gui, Liangke and Fu, Di and Feng, Jiashi and Liu, Ziwei and Li, Chunyuan},
    month={April},
    year={2024}
}
```

```BibTeX
@misc{liu2024llavanext,
    title={LLaVA-NeXT: Improved reasoning, OCR, and world knowledge},
    url={https://llava-vl.github.io/blog/2024-01-30-llava-next/},
    author={Liu, Haotian and Li, Chunyuan and Li, Yuheng and Li, Bo and Zhang, Yuanhan and Shen, Sheng and Lee, Yong Jae},
    month={January},
    year={2024}
}
```