OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts Paper • 2503.22952 • Published 27 days ago • 18
From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence Generation up to 100K Tokens Paper • 2502.18890 • Published Feb 26 • 30
VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions Paper • 2305.18756 • Published May 30, 2023
Collaborative Reasoning on Multi-Modal Semantic Graphs for Video-Grounded Dialogue Generation Paper • 2210.12460 • Published Oct 22, 2022
LSTP: Language-guided Spatial-Temporal Prompt Learning for Long-form Video-Text Understanding Paper • 2402.16050 • Published Feb 25, 2024 • 1
Shuo Wen Jie Zi: Rethinking Dictionaries and Glyphs for Chinese Language Pre-training Paper • 2305.18760 • Published May 30, 2023
LongViTU: Instruction Tuning for Long-Form Video Understanding Paper • 2501.05037 • Published Jan 9 • 1
Large Language Models are In-Context Semantic Reasoners rather than Symbolic Reasoners Paper • 2305.14825 • Published May 24, 2023 • 1
MindDial: Belief Dynamics Tracking with Theory-of-Mind Modeling for Situated Neural Dialogue Generation Paper • 2306.15253 • Published Jun 27, 2023
HawkEye: Training Video-Text LLMs for Grounding Text in Videos Paper • 2403.10228 • Published Mar 15, 2024
RAM: Towards an Ever-Improving Memory System by Learning from Communications Paper • 2404.12045 • Published Apr 18, 2024 • 2