	---
	license: apache-2.0
	language:
	- en
	base_model:
	- mistralai/Mistral-7B-Instruct-v0.2
	tags:
	- video temporal grounding
	- dense video caption
	- video highlight detection
	---

	<h2 align="center"> <a href="https://arxiv.org/abs/2410.05643">TRACE: Temporal Grounding Video LLM via Causal Event Modeling</a></h2>
	<h5 align="center"> If our project helps you, please give us a star ⭐ on <a href="https://github.com/gyxxyg/TRACE">GitHub</a> and cite our paper!</h5>

	## 📰 News

	- [2024.10.19] 🔥 We release [trace-retrieval](https://huggingface.co/Yongxin-Guo/trace-retrieval), which forces the predicted timestamps to align with the input frame timestamps. Results show that trace-retrieval achieves better performance on dense video captioning tasks.
	- [2024.10.10] 🔥 Our [code](https://github.com/gyxxyg/TRACE) and [paper](https://arxiv.org/abs/2410.05643) are released!
	- [2024.10.10] 🔥 Our checkpoints are available now!

	## Overview

	In this work:
	- We model videos as a series of events and propose a causal event modeling framework to capture the inherent structure of videos.
	- We present TRACE, a novel task-interleaved video LLM tailored to implement the causal event modeling framework through the sequential encoding and decoding of timestamps, salient scores, and textual captions.
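To make the event view concrete, here is a minimal Python sketch of a video represented as a list of events, each flattened in the order timestamps, salient score, caption. The `Event` class, the `interleave` helper, the string formats, and the example captions are all hypothetical illustrations, not the tokenization or data format used by TRACE.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Event:
    """One video event: a time span, a salient score, and a caption."""
    start: float   # seconds
    end: float     # seconds
    score: float   # salient score (the scale is an assumption of this sketch)
    caption: str


def interleave(events: List[Event]) -> List[Tuple[str, str]]:
    """Flatten events into one sequence ordered as time -> score -> text.

    Events are sorted by start time, so each event only follows events
    that begin earlier, mirroring an autoregressive (causal) ordering.
    """
    sequence: List[Tuple[str, str]] = []
    for ev in sorted(events, key=lambda e: e.start):
        sequence.append(("time", f"{ev.start:.1f}-{ev.end:.1f}"))
        sequence.append(("score", f"{ev.score:.1f}"))
        sequence.append(("text", ev.caption))
    return sequence


# Made-up events, deliberately given out of order.
events = [
    Event(12.0, 20.5, 3.0, "add the chopped onions"),
    Event(0.0, 11.5, 2.0, "heat oil in a pan"),
]
for kind, value in interleave(events):
    print(kind, value)
```

Each event contributes three consecutive entries, so a decoder that emits this sequence left to right produces the timestamps of an event before its score and caption.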

	## Model Zoo

	| Checkpoint | Description | URL |
	| ----------- | ----------- | ----------- |
	| Initialization | Weights initialized from VideoLLaMA2 | [trace-init](https://huggingface.co/Yongxin-Guo/trace-init) |
	| Stage-1 | Checkpoint after stage-1 training | [trace-stage1](https://huggingface.co/Yongxin-Guo/trace-stage1) |
	| Stage-2 | Checkpoint after stage-2 training | [trace](https://huggingface.co/Yongxin-Guo/trace) |
	| FT-Charades | Fine-tuned on the Charades-STA dataset | [trace-ft-charades](https://huggingface.co/Yongxin-Guo/trace-ft-charades) |
	| FT-Youcook2 | Fine-tuned on the Youcook2 dataset | [trace-ft-youcook2](https://huggingface.co/Yongxin-Guo/trace-ft-youcook2) |
	| FT-QVHighlights | Fine-tuned on the QVHighlights dataset | [trace-ft-qvhighlights](https://huggingface.co/Yongxin-Guo/trace-ft-qvhighlights) |
	| TRACE-retrieval | Forces the predicted timestamps to align with the input frame timestamps | [trace-retrieval](https://huggingface.co/Yongxin-Guo/trace-retrieval) |
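The checkpoints above are ordinary Hugging Face Hub repositories, so one way to fetch them is `snapshot_download` from the `huggingface_hub` package. The sketch below assumes that package is installed; the `CHECKPOINTS` keys and the `download_checkpoint` helper are hypothetical names for this example. It only downloads the files; loading them into a model is done with the code in the GitHub repository.

```python
from typing import Optional

# Repository IDs copied from the Model Zoo table; the short keys are
# hypothetical labels chosen for this sketch.
CHECKPOINTS = {
    "init": "Yongxin-Guo/trace-init",
    "stage1": "Yongxin-Guo/trace-stage1",
    "stage2": "Yongxin-Guo/trace",
    "ft-charades": "Yongxin-Guo/trace-ft-charades",
    "ft-youcook2": "Yongxin-Guo/trace-ft-youcook2",
    "ft-qvhighlights": "Yongxin-Guo/trace-ft-qvhighlights",
    "retrieval": "Yongxin-Guo/trace-retrieval",
}


def download_checkpoint(name: str, local_dir: Optional[str] = None) -> str:
    """Download one checkpoint repository and return its local path."""
    if name not in CHECKPOINTS:
        raise KeyError(f"unknown checkpoint {name!r}; choose from {sorted(CHECKPOINTS)}")
    # Imported lazily so the mapping above is usable without the package.
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=CHECKPOINTS[name], local_dir=local_dir)


# Listing the repositories needs no network access.
for key, repo_id in CHECKPOINTS.items():
    print(f"{key:16s} https://huggingface.co/{repo_id}")
```

Calling `download_checkpoint("stage2")` would fetch the main TRACE checkpoint into the local Hub cache and return the directory path.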

	#### Results

	| Youcook2 (Zero-Shot) | CIDEr | METEOR | SODA_c | F1 |
	| --- | --- | --- | --- | --- |
	| TRACE | 8.1 | 2.8 | 2.2 | 22.4 |
	| TRACE-retrieval | 8.3 | 2.9 | 2.3 | 24.1 |

	| Charades-STA (Zero-Shot) | R@1 (IoU=0.3) | R@1 (IoU=0.5) | R@1 (IoU=0.7) | mIoU |
	| --- | --- | --- | --- | --- |
	| TRACE | 58.6 | 40.3 | 19.4 | 38.7 |
	| TRACE-retrieval | 57.9 | 37.4 | 17.3 | 37.4 |

	| QVHighlights (Zero-Shot) | mAP | Hit@1 |
	| --- | --- | --- |
	| TRACE | 26.8 | 42.7 |
	| TRACE-retrieval | 27.9 | 44.3 |


	| ActivityNet-DVC | CIDEr | METEOR | SODA_c | F1 |
	| --- | --- | --- | --- | --- |
	| TRACE | 25.9 | 6.0 | 6.4 | 39.3 |
	| TRACE-retrieval | 25.7 | 5.9 | 6.5 | 40.1 |

	| ActivityNet-MR | R@1 (IoU=0.3) | R@1 (IoU=0.5) | R@1 (IoU=0.7) | mIoU |
	| --- | --- | --- | --- | --- |
	| TRACE | 54.0 | 37.7 | 24.0 | 39.0 |
	| TRACE-retrieval | 54.4 | 39.8 | 24.9 | 40.2 |
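In the Charades-STA and ActivityNet-MR tables, the 0.3, 0.5, and 0.7 columns are read here as recall at temporal IoU thresholds, the usual convention for moment retrieval. The following sketch shows how such numbers are typically computed, assuming one predicted span per query and a hit when IoU is at or above the threshold. The function names and the example spans are made up for illustration; this is not the evaluation script behind the reported results.

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (start, end) in seconds


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection over union of two time spans."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_iou(preds: List[Span], gts: List[Span], threshold: float) -> float:
    """Percentage of queries whose prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)


def mean_iou(preds: List[Span], gts: List[Span]) -> float:
    """Average IoU over all queries, as a percentage."""
    return 100.0 * sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(gts)


# Made-up predictions and ground truths for three queries.
preds = [(0.0, 10.0), (5.0, 15.0), (20.0, 30.0)]
gts = [(0.0, 10.0), (10.0, 20.0), (40.0, 50.0)]

for t in (0.3, 0.5, 0.7):
    print(f"R@1 (IoU={t}): {recall_at_iou(preds, gts, t):.1f}")
print(f"mIoU: {mean_iou(preds, gts):.1f}")
```

The three example queries have IoUs of 1.0, 1/3, and 0, so raising the threshold from 0.3 to 0.5 drops the second query from the hit count.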
