---
license: mit
library_name: peft
base_model: lmms-lab/llava-onevision-qwen2-7b-ov
datasets:
- wangyueqian/MMDuetIT
language:
- en
tags:
- llava-onevision
- llava
- multimodal
- online video understanding
- video understanding
pipeline_tag: video-text-to-text
---

# MMDuet

This is the model checkpoint of MMDuet, a VideoLLM that you can interact with in real time while the video plays.
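
As a rough orientation only, a loading sketch might look like the following. This is not the authors' official pipeline: it assumes the `llava` package from the LLaVA-NeXT codebase is installed, and that this checkpoint (the repo id `wangyueqian/MMDuet` below is an assumption) attaches as a PEFT adapter on the base model listed in this card's metadata. The actual real-time duet interaction logic lives in the inference scripts of the GitHub repo linked below.

```python
# Minimal loading sketch, NOT the official MMDuet pipeline. Assumes the
# `llava` package (LLaVA-NeXT codebase) and `peft` are installed; the
# adapter repo id "wangyueqian/MMDuet" is an assumption.
from llava.model.builder import load_pretrained_model
from peft import PeftModel

# Load the LLaVA-OneVision base model named in this card's metadata.
tokenizer, model, image_processor, max_length = load_pretrained_model(
    "lmms-lab/llava-onevision-qwen2-7b-ov",  # base_model from the metadata
    None,
    "llava_qwen",
    device_map="auto",
)

# Attach the MMDuet adapter weights on top of the base model.
model = PeftModel.from_pretrained(model, "wangyueqian/MMDuet")
model.eval()

# Note: deciding *when* to respond as video frames stream in (the video-text
# duet interaction) is implemented in the MMDuet GitHub repo's inference
# scripts, not by this snippet.
```
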

## Model Details

### Related Resources

- Paper: [VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format](https://arxiv.org/abs/2411.17991)
- GitHub: MMDuet
- Video Demo: on YouTube and on Bilibili
- Data: [MMDuetIT](https://huggingface.co/datasets/wangyueqian/MMDuetIT)

## Citation

If you use this work in your research, please consider citing:

```bibtex
@misc{wang2024mmduet,
      title={VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format},
      author={Yueqian Wang and Xiaojun Meng and Yuxuan Wang and Jianxin Liang and Jiansheng Wei and Huishuai Zhang and Dongyan Zhao},
      year={2024},
      eprint={2411.17991},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.17991},
}
```