---
license: mit
library_name: peft
base_model: lmms-lab/llava-onevision-qwen2-7b-ov
datasets:
- wangyueqian/MMDuetIT
language:
- en
tags:
- llava-onevision
- llava
- multimodal
- online video understanding
- video understanding
pipeline_tag: video-text-to-text
---

# MMDuet

This is the model checkpoint of **MMDuet**, a VideoLLM that you can interact with in real time while the video plays.

## Model Details

**MMDuet** is released as a PEFT adapter checkpoint (MIT license) trained on top of the base model [lmms-lab/llava-onevision-qwen2-7b-ov](https://huggingface.co/lmms-lab/llava-onevision-qwen2-7b-ov) using the [MMDuetIT](https://huggingface.co/datasets/wangyueqian/MMDuetIT) dataset. It adopts the video-text duet interaction format described in the paper linked below, which lets the model decide *when* to speak as the video plays instead of responding only after the full video has been processed.

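Below is a minimal sketch of how a PEFT adapter checkpoint like this one can be downloaded and attached to its base model. The Hub repo id `wangyueqian/MMDuet` and the `load_llava_onevision_base()` helper are assumptions for illustration only; the supported inference pipeline is the one in the MMDuet GitHub repository listed under Related Resources.

```python
# Minimal sketch (not the authors' official loading code).
# Assumptions: the Hub repo id "wangyueqian/MMDuet" hosts this adapter, and
# load_llava_onevision_base() stands in for loading
# lmms-lab/llava-onevision-qwen2-7b-ov with the code from the MMDuet GitHub repo.
from huggingface_hub import snapshot_download
from peft import PeftModel


def load_llava_onevision_base():
    """Placeholder: load the base VideoLLM per the MMDuet GitHub instructions."""
    raise NotImplementedError("See https://github.com/yellow-binary-tree/MMDuet")


# Download the adapter weights from the Hugging Face Hub.
adapter_dir = snapshot_download(repo_id="wangyueqian/MMDuet")

# Attach the adapter on top of the (frozen) base model with PEFT.
base_model = load_llava_onevision_base()
model = PeftModel.from_pretrained(base_model, adapter_dir)
model.eval()
```
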
## Related Resources
- **Paper:** [VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format](https://arxiv.org/abs/2411.17991)
- **Github:** [MMDuet](https://github.com/yellow-binary-tree/MMDuet)
- **Video Demo:** [On YouTube](https://www.youtube.com/watch?v=n1OybwhQvtk) and [On Bilibili](https://www.bilibili.com/video/BV1nwzGYBEPE)
- **Data:** [MMDuetIT](https://huggingface.co/datasets/wangyueqian/MMDuetIT)


## Citation
If you use this work in your research, please consider citing:
```bibtex
@misc{wang2024mmduet,
      title={VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format}, 
      author={Yueqian Wang and Xiaojun Meng and Yuxuan Wang and Jianxin Liang and Jiansheng Wei and Huishuai Zhang and Dongyan Zhao},
      year={2024},
      eprint={2411.17991},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.17991}, 
}
```