---
license: mit
library_name: peft
base_model: lmms-lab/llava-onevision-qwen2-7b-ov
datasets:
- wangyueqian/MMDuetIT
language:
- en
tags:
- llava-onevision
- llava
- multimodal
- online video understanding
- video understanding
pipeline_tag: video-text-to-text
---
# MMDuet

This is the model checkpoint of **MMDuet**, a VideoLLM that you can interact with in real time while the video plays.
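The metadata above declares this checkpoint as a PEFT adapter on top of `lmms-lab/llava-onevision-qwen2-7b-ov`. The official real-time inference pipeline is provided in the MMDuet GitHub repository linked below; the snippet here is only a minimal, unofficial sketch that inspects the adapter configuration, assuming the Hub repo id `wangyueqian/MMDuet` and an `adapter_config.json` at the repository root.

```python
# Minimal sketch (not the official usage): inspect the PEFT adapter config of this
# checkpoint. Assumptions: the Hub repo id is "wangyueqian/MMDuet" and the adapter
# config sits at the repository root. For real-time video inference, use the
# MMDuet GitHub repository linked below.
from peft import PeftConfig

config = PeftConfig.from_pretrained("wangyueqian/MMDuet")
print(config.base_model_name_or_path)  # expected: lmms-lab/llava-onevision-qwen2-7b-ov
print(config.peft_type)                # adapter type, e.g. PeftType.LORA
```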
## Model Details

### Related Resources
- **Paper:** [VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format](https://arxiv.org/abs/2411.17991)
- **GitHub:** [MMDuet](https://github.com/yellow-binary-tree/MMDuet)
- **Video Demo:** [YouTube](https://www.youtube.com/watch?v=n1OybwhQvtk) and [Bilibili](https://www.bilibili.com/video/BV1nwzGYBEPE)
- **Data:** [MMDuetIT](https://huggingface.co/datasets/wangyueqian/MMDuetIT)
## Citation

If you use this work in your research, please consider citing:
```bibtex
@misc{wang2024mmduet,
      title={VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format},
      author={Yueqian Wang and Xiaojun Meng and Yuxuan Wang and Jianxin Liang and Jiansheng Wei and Huishuai Zhang and Dongyan Zhao},
      year={2024},
      eprint={2411.17991},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.17991},
}
```