File size: 6,079 Bytes
c11cba3
6cc79fe
4ff8fc3
6cc79fe
4ff8fc3
c11cba3
6cc79fe
c11cba3
 
6cc79fe
c11cba3
 
6cc79fe
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
4ff8fc3
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
---
title: Ask-Anything:ChatGPT with Video Understanding
emoji: 👀
colorFrom: green
colorTo: blue
sdk: gradio
python_version: 3.8.16
app_file: app.py
pinned: false
license: mit
---

# 🦜 VideoChat [[paper](https://arxiv.org/abs/2305.06355)]

![images](assert/framework.png)
In this study, we initiate an exploration into video understanding by introducing VideoChat, an **end-to-end chat-centric video understanding system**. It integrates video foundation models and large language models via a learnable neural interface, excelling in **spatiotemporal reasoning, event localization, and causal relationship inference**. To instructively tune this system, we propose a **video-centric instruction dataset**, composed of thousands of videos matched with detailed descriptions and conversations. This dataset emphasizes **spatiotemporal reasoning and causal relationships**, providing a valuable asset for training chat-centric video understanding systems. Preliminary qualitative experiments reveal our system’s potential across a broad spectrum of video applications and set the standard for future research.


# :fire: Updates
- **2023/05/11**: Release the 🦜**VideoChat V1**, which can **handle both image and video understanding!**
    - [Model](https://drive.google.com/file/d/1BqmWHWCZBPkhTNWDAq0IfGpbkKLz9C0V/view?usp=share_link) and [Data](https://github.com/OpenGVLab/InternVideo/blob/main/Data/instruction_data.md).
    - 🧑‍💻 *Online demo is Preparing*.
    - 🧑‍🔧 *Tuning scripts are cleaning*.

# :hourglass_flowing_sand: Schedule

- [x] Small-scale video instuction data and tuning
- [x] Instruction tuning on BLIP+UniFormerV2+Vicuna
- [ ] Large-scale and complex video instuction data
- [ ] Instruction tuning on strong video foundation model
- [ ] User-friendly interactions with longer videos
- [ ] ...

# :speech_balloon: Example

<div align="center">
<b>
  <font size="4">Comparison with ChatGPT, MiniGPT-4, LLaVA and mPLUG-Owl. </font>
  <br>
  <font size="4" color="red">Our VideoChat can handle both image and video understanding well!</font>
</b>
</div>
<div align="center">
<img src="assert/comparison.png" width="90%">
</div>

<div align="center">
  <font size="4">
	<a href="https://pjlab-gvm-data.oss-cn-shanghai.aliyuncs.com/papers/media/jesse_dance.mp4">[Video]</a> <b>Why the video is funny?</b>
  </font>
</div>
<div align="center">
<img src="assert/humor.png" width="50%">
</div>

<div align="center">
  <font size="4">
	<a href="https://pjlab-gvm-data.oss-cn-shanghai.aliyuncs.com/papers/media/jp_dance.mp4">[Video]</a> <b>Spatial perception</b>
  </font>
</div>
<div align="center">
<img src="assert/spatial.png" width="50%">
</div>

<div align="center">
  <font size="4">
	<a href="https://pjlab-gvm-data.oss-cn-shanghai.aliyuncs.com/papers/media/car_accident.mp4">[Video]</a> <b>Temporal perception</b>
  </font>
</div>
<div align="center">
<img src="assert/temporal.png" width="50%">
</div>

<div align="center">
  <font size="4">
	<a href="https://pjlab-gvm-data.oss-cn-shanghai.aliyuncs.com/papers/media/idol_dancing.mp4">[Video]</a> <b>Multi-turn conversation</b>
  </font>
</div>
<div align="center">
<img src="assert/multi_turn.png" width="50%">
</div>

<div align="center">
  <font size="4">
	<b>Image understanding</b>
  </font>
</div>
<div align="center">
<img src="assert/image.png" width="100%">
</div>

# :running: Usage

- Prepare the envirment.
    ```shell
    pip install -r requirements.txt
    ```
    
- Download [BLIP2](https://huggingface.co/docs/transformers/main/model_doc/blip-2) model:
    - ViT: `wget https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/eva_vit_g.pth`
    - QFormer: `wget https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/blip2_pretrained_flant5xxl.pth`
    - Change the `vit_model_path` and `q_former_model_path` in [config.json](./configs/config.json).
    
- Download [StabelVicuna](https://huggingface.co/CarperAI/stable-vicuna-13b-delta) model:
    - LLAMA: Download it from the [original repo](https://github.com/facebookresearch/llama) or [hugging face](https://huggingface.co/decapoda-research/llama-13b-hf).
    - If you download LLAMA from the original repo, please process it via the following command:
    ```shell
    # convert_llama_weights_to_hf is copied from transformers
    python src/transformers/models/llama/convert_llama_weights_to_hf.py \
      --input_dir /path/to/downloaded/llama/weights \
      --model_size 7B --output_dir /output/path
    ```
    - Download [StableVicuna-13b-deelta](https://huggingface.co/CarperAI/stable-vicuna-13b-delta) and process it:
    ```shell
    # fastchat v0.1.10
    python3 apply_delta.py \
      --base /path/to/model_weights/llama-13b \
      --target stable-vicuna-13b \
      --delta CarperAI/stable-vicuna-13b-delta
    ```
    - Change the `llama_model_path` in [config.json](./configs/config.json).
    
- Download [VideoChat](https://drive.google.com/file/d/1BqmWHWCZBPkhTNWDAq0IfGpbkKLz9C0V/view?usp=share_link) model:
  
    - Change the `videochat_model_path` in [config.json](./configs/config.json).
    
- Running demo with Gradio:
    ```shell
    python demo.py
    ```
    
- Another demo on Jupyter Notebook can found in [demo.ipynb](demo.ipynb)


# :page_facing_up: Citation

If you find this project useful in your research, please consider cite:
```BibTeX
@article{2023videochat,
  title={VideoChat: Chat-Centric Video Understanding},
  author={KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao},
  journal={arXiv preprint arXiv:2305.06355},
  year={2023}
}
```

# :thumbsup: Acknowledgement

Thanks to the open source of the following projects:

[InternVideo](https://github.com/OpenGVLab/InternVideo), [UniFormerV2](https://github.com/OpenGVLab/UniFormerV2), [MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4), [LLaVA](https://github.com/haotian-liu/LLaVA), [BLIP2](https://huggingface.co/docs/transformers/main/model_doc/blip-2), [StableLM](https://github.com/Stability-AI/StableLM).