zR committed
Commit 45515e3
Parent(s): 1443e84
.mdl
DELETED
Binary file (60 Bytes)

.msc
DELETED
Binary file (1.53 kB)

.mv
DELETED
@@ -1 +0,0 @@
-Revision:master,CreatedAt:1719926951

README.md
ADDED
@@ -0,0 +1,86 @@
---
license: other
license_name: cogvlm2
license_link: https://huggingface.co/THUDM/cogvlm2-video-llama3-chat/blob/main/LICENSE

language:
- en
pipeline_tag: text-generation
tags:
- chat
- cogvlm2
- cogvlm--video

inference: false
---
# CogVLM2-Video

[Chinese README](README_zh.md)

CogVLM2-Video achieves state-of-the-art performance on multiple video question answering tasks. The following diagram
shows the performance of CogVLM2-Video on
[MVBench](https://github.com/OpenGVLab/Ask-Anything), [VideoChatGPT-Bench](https://github.com/mbzuai-oryx/Video-ChatGPT),
and the Zero-shot VideoQA datasets (MSVD-QA, MSRVTT-QA, ActivityNet-QA). VCG-* refers to VideoChatGPT-Bench, ZS-*
refers to the Zero-shot VideoQA datasets, and MV-* refers to the main categories of MVBench.

![Quantitative Evaluation](https://github.com/THUDM/CogVLM2/tree/main/resources/cogvlm2_video_bench.jpeg)

## Detailed performance

Performance on VideoChatGPT-Bench and the Zero-shot VideoQA datasets:

| Models                | VCG-AVG  | VCG-CI   | VCG-DO   | VCG-CU   | VCG-TU   | VCG-CO   | ZS-AVG    |
|-----------------------|----------|----------|----------|----------|----------|----------|-----------|
| IG-VLM GPT4V          | 3.17     | 3.40     | 2.80     | 3.61     | 2.89     | 3.13     | 65.70     |
| ST-LLM                | 3.15     | 3.23     | 3.05     | 3.74     | 2.93     | 2.81     | 62.90     |
| ShareGPT4Video        | N/A      | N/A      | N/A      | N/A      | N/A      | N/A      | 46.50     |
| VideoGPT+             | 3.28     | 3.27     | 3.18     | 3.74     | 2.83     | **3.39** | 61.20     |
| VideoChat2_HD_mistral | 3.10     | 3.40     | 2.91     | 3.72     | 2.65     | 2.84     | 57.70     |
| PLLaVA-34B            | 3.32     | **3.60** | 3.20     | **3.90** | 2.67     | 3.25     | **68.10** |
| CogVLM2-Video         | **3.41** | 3.49     | **3.46** | 3.87     | **2.98** | 3.23     | 66.60     |

Performance on the MVBench dataset:

| Model                 | AVG      | AA       | AC       | AL       | AP       | AS       | CO       | CI       | EN    | ER       | FA       | FP       | MA       | MC       | MD       | OE       | OI       | OS   | ST       | SC   | UA       |
|-----------------------|----------|----------|----------|----------|----------|----------|----------|----------|-------|----------|----------|----------|----------|----------|----------|----------|----------|------|----------|------|----------|
| IG-VLM GPT4V          | 43.7     | 72.0     | 39.0     | 40.5     | **63.5** | 55.5     | 52.0     | 11.0     | 31.0  | 59.0     | 46.5     | 47.5     | 22.5     | 12.0     | 12.0     | 18.5     | 59.0     | 29.5 | 83.5     | 45.0 | 73.5     |
| ST-LLM                | 54.9     | 84.0     | 36.5     | 31.0     | 53.5     | 66.0     | 46.5     | 58.5     | 34.5  | 41.5     | 44.0     | 44.5     | 78.5     | 56.5     | 42.5     | 80.5     | 73.5     | 38.5 | 86.5     | 43.0 | 58.5     |
| ShareGPT4Video        | 51.2     | 79.5     | 35.5     | 41.5     | 39.5     | 49.5     | 46.5     | 51.5     | 28.5  | 39.0     | 40.0     | 25.5     | 75.0     | 62.5     | 50.5     | 82.5     | 54.5     | 32.5 | 84.5     | 51.0 | 54.5     |
| VideoGPT+             | 58.7     | 83.0     | 39.5     | 34.0     | 60.0     | **69.0** | 50.0     | 60.0     | 29.5  | 44.0     | 48.5     | 53.0     | 90.5     | 71.0     | 44.0     | **85.5** | 75.5     | 36.0 | 89.5     | 45.0 | 66.5     |
| VideoChat2_HD_mistral | 62.3     | 79.5     | **60.0** | **87.5** | 50.0     | 68.5     | **93.5** | 71.5     | 36.5  | 45.0     | 49.5     | **87.0** | 40.0     | **76.0** | **92.0** | 53.0     | 62.0     | 45.5 | 36.0     | 44.0 | 69.5     |
| PLLaVA-34B            | 58.1     | 82.0     | 40.5     | 49.5     | 53.0     | 67.5     | 66.5     | 59.0     | 39.5  | **63.5** | 47.0     | 50.0     | 70.0     | 43.0     | 37.5     | 68.5     | 67.5     | 36.5 | **91.0** | 51.5 | **79.0** |
| CogVLM2-Video         | **62.3** | **85.5** | 41.5     | 31.5     | 65.5     | 79.5     | 58.5     | **77.0** | 28.5  | 42.5     | **54.0** | 57.0     | **91.5** | 73.0     | 48.0     | **91.0** | **78.0** | 36.0 | **91.5** | 47.0 | 68.5     |

## Evaluation details

We follow prior work to evaluate the performance of our model. For each benchmark, we craft a task-specific prompt:

```python
# In each case, `prompt` initially holds the raw benchmark question;
# the template wraps it with task-specific instructions.

# For MVBench
prompt = f"Carefully watch the video and pay attention to the cause and sequence of events, the detail and movement of objects, and the action and pose of persons. Based on your observations, select the best option that accurately addresses the question.\n " + f"{prompt.replace('Short Answer.', '')}\n" + "Short Answer:"
# For VideoChatGPT-Bench
prompt = f"Carefully watch the video and pay attention to the cause and sequence of events, the detail and movement of objects, and the action and pose of persons. Based on your observations, comprehensively answer the following question. Your answer should be long and cover all the related aspects\n " + f"{prompt.replace('Short Answer.', '')}\n" + "Answer:"
# For Zero-shot VideoQA
prompt = f"The input consists of a sequence of key frames from a video. Answer the question comprehensively including all the possible verbs and nouns that can discribe the events, followed by significant events, characters, or objects that appear throughout the frames.\n " + f"{prompt.replace('Short Answer.', '')}\n" + "Answer:"
```
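
For illustration only, the MVBench template above can also be wrapped in a small helper; the function and constant names below are hypothetical (ours, not part of the released evaluation code):

```python
# Hypothetical helper illustrating how the MVBench template wraps a raw benchmark question.
MVBENCH_INSTRUCTION = (
    "Carefully watch the video and pay attention to the cause and sequence of events, "
    "the detail and movement of objects, and the action and pose of persons. "
    "Based on your observations, select the best option that accurately addresses the question.\n "
)

def build_mvbench_prompt(question: str) -> str:
    # Remove the benchmark's own "Short Answer." suffix, then append the answer cue.
    return MVBENCH_INSTRUCTION + f"{question.replace('Short Answer.', '')}\n" + "Short Answer:"

# Example:
# build_mvbench_prompt("What does the person do after picking up the cup? Short Answer.")
```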

For evaluation code, please refer to
the [evaluation script](https://github.com/magic-research/PLLaVA/blob/main/README.md) in PLLaVA.

## Using This Model

This repository hosts the `chat` version of the model, which supports single-round chat.

You can quickly install the Python package dependencies and run model inference by following
our [GitHub repository](https://github.com/THUDM/CogVLM2/tree/main/video_demo).
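
If you prefer to load the checkpoint directly, it follows the standard `transformers` remote-code path. The snippet below is only a minimal loading sketch (the dtype and device choices are assumptions); the full video preprocessing and generation loop lives in the `video_demo` code linked above:

```python
# Minimal loading sketch; see the official video_demo for the complete
# video preprocessing and generation pipeline (not reproduced here).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "THUDM/cogvlm2-video-llama3-chat"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,   # assumption: a GPU with bf16 support
    trust_remote_code=True,
).eval().to("cuda")
```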

## License

This model is released under the CogVLM2 [LICENSE](LICENSE). For models built with Meta Llama 3, please also adhere to
the [LLAMA3_LICENSE](LLAMA3_LICENSE).

## Training details

Please refer to our technical report for the training recipe and hyperparameters.

README_zh.md
ADDED
@@ -0,0 +1,65 @@
# CogVLM2-Video

CogVLM2-Video achieves state-of-the-art performance on multiple video question answering tasks. The following diagram
shows the performance of CogVLM2-Video on [MVBench](https://github.com/OpenGVLab/Ask-Anything), [VideoChatGPT-Bench](https://github.com/mbzuai-oryx/Video-ChatGPT),
and the Zero-shot VideoQA datasets (MSVD-QA, MSRVTT-QA, ActivityNet-QA).

![Quantitative Evaluation](https://github.com/THUDM/CogVLM2/tree/main/resources/cogvlm2_video_bench.jpeg)

VCG refers to VideoChatGPT-Bench, ZS refers to the Zero-shot VideoQA datasets, and MV-* refers to the main categories of MVBench.

## Evaluation results

The detailed benchmark results are as follows:

| Models                | VCG-AVG  | VCG-CI   | VCG-DO   | VCG-CU   | VCG-TU   | VCG-CO   | ZS-AVG    |
|-----------------------|----------|----------|----------|----------|----------|----------|-----------|
| IG-VLM GPT4V          | 3.17     | 3.40     | 2.80     | 3.61     | 2.89     | 3.13     | 65.70     |
| ST-LLM                | 3.15     | 3.23     | 3.05     | 3.74     | 2.93     | 2.81     | 62.90     |
| ShareGPT4Video        | N/A      | N/A      | N/A      | N/A      | N/A      | N/A      | 46.50     |
| VideoGPT+             | 3.28     | 3.27     | 3.18     | 3.74     | 2.83     | **3.39** | 61.20     |
| VideoChat2_HD_mistral | 3.10     | 3.40     | 2.91     | 3.72     | 2.65     | 2.84     | 57.70     |
| PLLaVA-34B            | 3.32     | **3.60** | 3.20     | **3.90** | 2.67     | 3.25     | **68.10** |
| CogVLM2-Video         | **3.41** | 3.49     | **3.46** | 3.87     | **2.98** | 3.23     | 66.60     |

Performance of CogVLM2-Video on the MVBench dataset:

| Model                 | AVG      | AA       | AC       | AL       | AP       | AS       | CO       | CI       | EN    | ER       | FA       | FP       | MA       | MC       | MD       | OE       | OI       | OS   | ST       | SC   | UA       |
|-----------------------|----------|----------|----------|----------|----------|----------|----------|----------|-------|----------|----------|----------|----------|----------|----------|----------|----------|------|----------|------|----------|
| IG-VLM GPT4V          | 43.7     | 72.0     | 39.0     | 40.5     | **63.5** | 55.5     | 52.0     | 11.0     | 31.0  | 59.0     | 46.5     | 47.5     | 22.5     | 12.0     | 12.0     | 18.5     | 59.0     | 29.5 | 83.5     | 45.0 | 73.5     |
| ST-LLM                | 54.9     | 84.0     | 36.5     | 31.0     | 53.5     | 66.0     | 46.5     | 58.5     | 34.5  | 41.5     | 44.0     | 44.5     | 78.5     | 56.5     | 42.5     | 80.5     | 73.5     | 38.5 | 86.5     | 43.0 | 58.5     |
| ShareGPT4Video        | 51.2     | 79.5     | 35.5     | 41.5     | 39.5     | 49.5     | 46.5     | 51.5     | 28.5  | 39.0     | 40.0     | 25.5     | 75.0     | 62.5     | 50.5     | 82.5     | 54.5     | 32.5 | 84.5     | 51.0 | 54.5     |
| VideoGPT+             | 58.7     | 83.0     | 39.5     | 34.0     | 60.0     | **69.0** | 50.0     | 60.0     | 29.5  | 44.0     | 48.5     | 53.0     | 90.5     | 71.0     | 44.0     | **85.5** | 75.5     | 36.0 | 89.5     | 45.0 | 66.5     |
| VideoChat2_HD_mistral | 62.3     | 79.5     | **60.0** | **87.5** | 50.0     | 68.5     | **93.5** | 71.5     | 36.5  | 45.0     | 49.5     | **87.0** | 40.0     | **76.0** | **92.0** | 53.0     | 62.0     | 45.5 | 36.0     | 44.0 | 69.5     |
| PLLaVA-34B            | 58.1     | 82.0     | 40.5     | 49.5     | 53.0     | 67.5     | 66.5     | 59.0     | 39.5  | **63.5** | 47.0     | 50.0     | 70.0     | 43.0     | 37.5     | 68.5     | 67.5     | 36.5 | **91.0** | 51.5 | **79.0** |
| CogVLM2-Video         | **62.3** | **85.5** | 41.5     | 31.5     | 65.5     | 79.5     | 58.5     | **77.0** | 28.5  | 42.5     | **54.0** | 57.0     | **91.5** | 73.0     | 48.0     | **91.0** | **78.0** | 36.0 | **91.5** | 47.0 | 68.5     |

## Evaluation and reproduction

We follow prior work to evaluate the performance of our model. For each benchmark, we craft a task-specific prompt:

```python
# For MVBench
prompt = f"Carefully watch the video and pay attention to the cause and sequence of events, the detail and movement of objects, and the action and pose of persons. Based on your observations, select the best option that accurately addresses the question.\n " + f"{prompt.replace('Short Answer.', '')}\n" + "Short Answer:"
# For VideoChatGPT-Bench
prompt = f"Carefully watch the video and pay attention to the cause and sequence of events, the detail and movement of objects, and the action and pose of persons. Based on your observations, comprehensively answer the following question. Your answer should be long and cover all the related aspects\n " + f"{prompt.replace('Short Answer.', '')}\n" + "Answer:"
# For Zero-shot VideoQA
prompt = f"The input consists of a sequence of key frames from a video. Answer the question comprehensively including all the possible verbs and nouns that can discribe the events, followed by significant events, characters, or objects that appear throughout the frames.\n " + f"{prompt.replace('Short Answer.', '')}\n" + "Answer:"
```

For evaluation code, please refer to the [evaluation script](https://github.com/magic-research/PLLaVA/blob/main/README.md) in PLLaVA.

## Quick start

This repository hosts the `chat` version of the model, which supports single-round chat.

You can quickly install the Python package dependencies and run model inference by following our [GitHub repository](https://github.com/THUDM/CogVLM2/tree/main/video_demo).

## License

This model is released under the CogVLM2 [LICENSE](LICENSE). For models built with Meta Llama 3, please also adhere to
the [LLAMA3_LICENSE](LLAMA3_LICENSE).

## Citation

Our technical report will be released soon; stay tuned.