Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs
Abstract
We address the task of video chaptering, i.e., partitioning a long video timeline into semantic units and generating corresponding chapter titles. While relatively underexplored, automatic chaptering has the potential to enable efficient navigation and content retrieval in long-form videos. In this paper, we achieve strong chaptering performance on hour-long videos by efficiently addressing the problem in the text domain with our 'Chapter-Llama' framework. Specifically, we leverage a pretrained large language model (LLM) with a large context window, and feed as input (i) speech transcripts and (ii) captions describing video frames, along with their respective timestamps. Since exhaustively captioning all frames is inefficient, we propose a lightweight speech-guided frame selection strategy based on the content of the speech transcript, and experimentally demonstrate its advantages. We train the LLM to output timestamps for the chapter boundaries, as well as free-form chapter titles. This simple yet powerful approach scales to processing hour-long videos in a single forward pass. Our results demonstrate substantial improvements (e.g., 45.3 vs 26.7 F1 score) over the state of the art on the recent VidChapters-7M benchmark. To promote further research, we release our code and models on our project page.
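To make the text-domain formulation concrete, the sketch below shows one plausible way to assemble the interleaved, timestamped input described in the abstract (ASR segments plus captions for speech-selected frames) and to parse chapter boundaries from the LLM's text output. This is an illustrative sketch, not the authors' released code; all function and variable names are hypothetical.

```python
# Hypothetical sketch of the Chapter-Llama-style text input/output format:
# interleave timestamped ASR and frame captions, then parse "HH:MM:SS Title"
# lines from the model's generation. Names and formats are assumptions.

def fmt(seconds):
    """Format a time in seconds as HH:MM:SS."""
    h, rem = divmod(int(seconds), 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d}"

def build_prompt(transcript, captions):
    """Interleave ASR segments and frame captions by timestamp.

    transcript: list of (start_sec, text) pairs from speech recognition
    captions:   list of (frame_sec, text) pairs from a frame captioner,
                run only on frames chosen by speech-guided selection
    """
    events = [(t, "ASR", txt) for t, txt in transcript]
    events += [(t, "Caption", txt) for t, txt in captions]
    events.sort(key=lambda e: e[0])
    return "\n".join(f"{fmt(t)} {kind}: {txt}" for t, kind, txt in events)

def parse_chapters(llm_output):
    """Parse output lines of the form 'HH:MM:SS Title' into (seconds, title)."""
    chapters = []
    for line in llm_output.strip().splitlines():
        ts, _, title = line.partition(" ")
        h, m, s = (int(x) for x in ts.split(":"))
        chapters.append((h * 3600 + m * 60 + s, title))
    return chapters

prompt = build_prompt(
    transcript=[(3, "welcome to the tutorial"), (95, "now let's install it")],
    captions=[(0, "a person at a desk"), (90, "a terminal window")],
)
chapters = parse_chapters("00:00:00 Introduction\n00:01:30 Installation")
```

In this framing, a whole hour of video becomes a single text sequence, so chaptering reduces to one forward pass of an LLM with a sufficiently large context window.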
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- LongCaptioning: Unlocking the Power of Long Video Caption Generation in Large Multimodal Models (2025)
- M-LLM Based Video Frame Selection for Efficient Video Understanding (2025)
- BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding (2025)
- FALCONEye: Finding Answers and Localizing Content in ONE-hour-long videos with multi-modal LLMs (2025)
- Large-scale Pre-training for Grounded Video Caption Generation (2025)
- Fine-Grained Video Captioning through Scene Graph Consolidation (2025)
- Everything Can Be Described in Words: A Simple Unified Multi-Modal Framework with Semantic and Temporal Alignment (2025)