---
license: mit
library_name: peft
base_model: lmms-lab/llava-onevision-qwen2-7b-ov
datasets:
- wangyueqian/MMDuetIT
language:
- en
tags:
- llava-onevision
- llava
- multimodal
- online video understanding
- video understanding
pipeline_tag: video-text-to-text
---

# MMDuet

This is the model checkpoint of **MMDuet**, a VideoLLM you can interact with in real time while the video plays.

## Model Details

MMDuet is released as a PEFT adapter on top of the base model [lmms-lab/llava-onevision-qwen2-7b-ov](https://huggingface.co/lmms-lab/llava-onevision-qwen2-7b-ov) and is trained on the [MMDuetIT](https://huggingface.co/datasets/wangyueqian/MMDuetIT) dataset.

## Related Resources

- **Paper:** [VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format](https://arxiv.org/abs/2411.17991)
- **GitHub:** [MMDuet](https://github.com/yellow-binary-tree/MMDuet)
- **Video Demo:** [on YouTube](https://www.youtube.com/watch?v=n1OybwhQvtk) and [on Bilibili](https://www.bilibili.com/video/BV1nwzGYBEPE)
- **Data:** [MMDuetIT](https://huggingface.co/datasets/wangyueqian/MMDuetIT)

## Citation

If you use this work in your research, please consider citing:

```bibtex
@misc{wang2024mmduet,
      title={VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format},
      author={Yueqian Wang and Xiaojun Meng and Yuxuan Wang and Jianxin Liang and Jiansheng Wei and Huishuai Zhang and Dongyan Zhao},
      year={2024},
      eprint={2411.17991},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.17991},
}
```
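
## Downloading the Checkpoint

Since this checkpoint is a PEFT adapter on `lmms-lab/llava-onevision-qwen2-7b-ov`, the real-time (video-text duet) inference pipeline itself lives in the [MMDuet GitHub repository](https://github.com/yellow-binary-tree/MMDuet). The snippet below is only a minimal sketch for fetching the adapter weights locally before running the official scripts; the repo id `wangyueqian/MMDuet` is an assumption based on the author's namespace and is not stated in this card.

```python
# Minimal sketch: fetch the MMDuet adapter weights for use with the
# official inference scripts from the GitHub repository.
# Assumption: the checkpoint is hosted at "wangyueqian/MMDuet" (not stated in this card).
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="wangyueqian/MMDuet")
print(f"MMDuet adapter checkpoint downloaded to: {local_dir}")
```

For the actual interactive inference (the model deciding when to speak while the video plays), refer to the demo and evaluation scripts in the GitHub repository.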