---
license: mit
library_name: peft
base_model: lmms-lab/llava-onevision-qwen2-7b-ov
datasets:
- wangyueqian/MMDuetIT
language:
- en
tags:
- llava-onevision
- llava
- multimodal
- online video understanding
- video understanding
pipeline_tag: video-text-to-text
---
# MMDuet

This is the model checkpoint of **MMDuet**, a VideoLLM that you can interact with in real time while the video plays.
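The metadata above declares this checkpoint as a PEFT adapter on top of `lmms-lab/llava-onevision-qwen2-7b-ov`. The official real-time inference pipeline is provided in the MMDuet GitHub repository linked below; the snippet here is only a minimal, unofficial sketch that inspects the adapter configuration, assuming the Hub repo id `wangyueqian/MMDuet` and an `adapter_config.json` at the repository root.

```python
# Minimal sketch (not the official usage): inspect the PEFT adapter config of this
# checkpoint. Assumptions: the Hub repo id is "wangyueqian/MMDuet" and the adapter
# config sits at the repository root. For real-time video inference, use the
# MMDuet GitHub repository linked below.
from peft import PeftConfig

config = PeftConfig.from_pretrained("wangyueqian/MMDuet")
print(config.base_model_name_or_path)  # expected: lmms-lab/llava-onevision-qwen2-7b-ov
print(config.peft_type)                # adapter type, e.g. PeftType.LORA
```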
## Model Details

### Related Resources
- **Paper:** [VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format](https://arxiv.org/abs/2411.17991)
- **GitHub:** [MMDuet](https://github.com/yellow-binary-tree/MMDuet)
- **Video Demo:** [YouTube](https://www.youtube.com/watch?v=n1OybwhQvtk) and [Bilibili](https://www.bilibili.com/video/BV1nwzGYBEPE)
- **Data:** [MMDuetIT](https://huggingface.co/datasets/wangyueqian/MMDuetIT)
## Citation

If you use this work in your research, please consider citing:
```bibtex
@misc{wang2024mmduet,
      title={VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format},
      author={Yueqian Wang and Xiaojun Meng and Yuxuan Wang and Jianxin Liang and Jiansheng Wei and Huishuai Zhang and Dongyan Zhao},
      year={2024},
      eprint={2411.17991},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.17991},
}
```