Yongxin-Guo/trace · Hugging Face

TRACE: Temporal Grounding Video LLM via Causal Event Modeling

If our project helps you, please give us a star ⭐ on GitHub and cite our paper!

📰 News

[2024.11.01] 🔥 We are excited to announce the release of trace-uni, which is trained with additional general video understanding data from a subset of LLaVA-Video-178k. Our results indicate that trace-uni outperforms trace on both video temporal grounding (VTG) tasks and general video understanding tasks.
[2024.10.19] 🔥 We release trace-retrieval, which forces the predicted timestamps to align with the input frame timestamps. Results show that trace-retrieval achieves better performance on dense video captioning tasks.
[2024.10.10] 🔥 Our code and paper are released!
[2024.10.10] 🔥 Our checkpoints are available now!
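
The trace-retrieval entry above describes predicted timestamps being forced to align with the input frame timestamps. The model card does not say how this is enforced, so the following is only a minimal post-hoc sketch of the idea, assuming alignment means "use the nearest sampled frame time". The function name `snap_to_frames` and its arguments are hypothetical and are not part of the TRACE codebase.

```python
from bisect import bisect_left


def snap_to_frames(predicted, frame_times):
    """Snap each predicted timestamp to the nearest sampled frame timestamp.

    Hypothetical helper for illustration; not the TRACE implementation.
    """
    if not frame_times:
        raise ValueError("frame_times must not be empty")
    frames = sorted(frame_times)
    snapped = []
    for t in predicted:
        # Index of the first frame time that is >= t.
        i = bisect_left(frames, t)
        # Only the neighbours on either side of t can be the nearest frame.
        candidates = frames[max(i - 1, 0): i + 1]
        snapped.append(min(candidates, key=lambda f: abs(f - t)))
    return snapped


if __name__ == "__main__":
    # Frames sampled every 2 seconds; predictions fall between them.
    print(snap_to_frames([2.9, 7.5, -1.0], [0.0, 2.0, 4.0, 6.0]))
```

With frames sampled every 2 seconds, a prediction of 2.9 s maps to the frame at 2.0 s, and predictions outside the sampled range are clamped to the first or last frame.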

Overview

In this work

We model each video as a series of events and propose a causal event modeling framework to capture the video's inherent structure.
We present TRACE, a novel task-interleaved video LLM tailored to implement the causal event modeling framework through the sequential encoding and decoding of timestamps, salient scores, and textual captions.
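
To make the event view concrete, here is a small sketch of how a video could be represented as an ordered list of events, each carrying timestamps, a salient score, and a caption, and then flattened into an interleaved sequence. The `Event` class, the `interleave` function, and the `("time" | "score" | "text", value)` item format are assumptions made for illustration; the actual tokenization used by TRACE is not described in this model card.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """One video event: a time span, a salient score, and a caption."""
    start: float
    end: float
    score: float
    caption: str


def interleave(events):
    """Flatten events into a time -> score -> text sequence, ordered by start.

    In a causal setup, each item would be decoded conditioned on all items
    before it; this function only illustrates the ordering.
    """
    sequence = []
    for event in sorted(events, key=lambda e: e.start):
        sequence.append(("time", (event.start, event.end)))
        sequence.append(("score", event.score))
        sequence.append(("text", event.caption))
    return sequence


if __name__ == "__main__":
    demo = [
        Event(12.0, 20.5, 3.0, "add the onions to the pan"),
        Event(0.0, 11.5, 4.0, "chop the onions"),
    ]
    for item in interleave(demo):
        print(item)
```

Sorting by start time keeps the sequence in the order the events occur in the video, so every event is preceded by the events that happened before it.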

Model Zoo

Checkpoints	Description	URL
Initialization	Weights initialized from VideoLLaMA2	trace-init
Stage-1	Model checkpoints trained after stage-1	trace-stage1
Stage-2	Model checkpoints trained after stage-2	trace
FT-Charades	Fine-tuned on Charades-STA dataset	trace-ft-charades
FT-Youcook2	Fine-tuned on Youcook2 dataset	trace-ft-youcook2
FT-QVHighlights	Fine-tuned on QVHighlights dataset	trace-ft-qvhighlights
TRACE-retrieval	Forces the predicted timestamps to align with the input frame timestamps	trace-retrieval
TRACE-uni	Trained with additional general video understanding data from a subset of LLaVA-Video-178k	trace-uni
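
As a convenience, the checkpoints in the table can be fetched with `huggingface_hub.snapshot_download`. The sketch below assumes that every checkpoint lives under the same `Yongxin-Guo` namespace as `Yongxin-Guo/trace`; only that one repository id is confirmed by this page, so treat the others as assumptions and verify them before use. The helper names `repo_id` and `download_checkpoint` are hypothetical.

```python
# Checkpoint names as listed in the URL column of the Model Zoo table.
MODEL_ZOO = (
    "trace-init",
    "trace-stage1",
    "trace",
    "trace-ft-charades",
    "trace-ft-youcook2",
    "trace-ft-qvhighlights",
    "trace-retrieval",
    "trace-uni",
)


def repo_id(name, namespace="Yongxin-Guo"):
    """Build a Hub repository id; the shared namespace is an assumption."""
    if name not in MODEL_ZOO:
        raise KeyError(f"unknown checkpoint: {name}")
    return f"{namespace}/{name}"


def download_checkpoint(name, local_dir=None):
    """Download all files of a checkpoint and return the local folder path."""
    # Imported lazily so the helpers above work without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id(name), local_dir=local_dir)


if __name__ == "__main__":
    print(repo_id("trace"))
```

Calling `download_checkpoint("trace")` requires network access and the `huggingface_hub` package; loading the weights afterwards depends on the project's own code, which is not covered here.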

Results

Youcook2 (Zero-Shot)	CIDEr	METEOR	SODA_c	F1
TRACE	8.1	2.8	2.2	22.4
TRACE-retrieval	8.3	2.9	2.3	24.1
TRACE-uni	8.6	2.9	2.3	22.4

Charades-STA (Zero-Shot)	R@1 (IoU=0.3)	R@1 (IoU=0.5)	R@1 (IoU=0.7)	mIoU
TRACE	58.6	40.3	19.4	38.7
TRACE-retrieval	57.9	37.4	17.3	37.4
TRACE-uni	63.7	43.7	21.0	41.5
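
For readers unfamiliar with the moment retrieval metrics, the sketch below shows how temporal IoU, recall at an IoU threshold, and mean IoU are commonly computed. It assumes the 0.3, 0.5, and 0.7 columns report R@1 at those IoU thresholds, which is the usual convention for Charades-STA, and that scores are percentages. The function names are illustrative and this is not the project's evaluation script.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) segments in seconds."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_iou(preds, gts, threshold):
    """Percentage of queries whose top-1 prediction reaches the IoU threshold."""
    if len(preds) != len(gts) or not gts:
        raise ValueError("need one prediction per non-empty ground truth list")
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)


def mean_iou(preds, gts):
    """Average IoU over all queries, as a percentage."""
    if len(preds) != len(gts) or not gts:
        raise ValueError("need one prediction per non-empty ground truth list")
    total = sum(temporal_iou(p, g) for p, g in zip(preds, gts))
    return 100.0 * total / len(gts)


if __name__ == "__main__":
    predictions = [(0.0, 10.0), (2.0, 8.0)]
    ground_truth = [(5.0, 15.0), (2.0, 8.0)]
    for threshold in (0.3, 0.5, 0.7):
        print(threshold, recall_at_iou(predictions, ground_truth, threshold))
    print("mIoU", mean_iou(predictions, ground_truth))
```

In the toy example the first prediction overlaps its target with IoU 1/3 and the second matches exactly, so recall is 100 at threshold 0.3 and 50 at thresholds 0.5 and 0.7.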

QVHighlights (Zero-Shot)	mAP	Hit@1
TRACE	26.8	42.7
TRACE-retrieval	27.9	44.3
TRACE-uni	27.5	43.9

ActivityNet-DVC	CIDEr	METEOR	SODA_c	F1
TRACE	25.9	6.0	6.4	39.3
TRACE-retrieval	25.7	5.9	6.5	40.1
TRACE-uni	29.2	6.9	6.4	40.4

ActivityNet-MR	R@1 (IoU=0.3)	R@1 (IoU=0.5)	R@1 (IoU=0.7)	mIoU
TRACE	54.0	37.7	24.0	39.0
TRACE-retrieval	54.4	39.8	24.9	40.2
TRACE-uni	53.2	38.2	24.7	39.4

MVBench	Avg	AS	AP	AA	FA	UA	OE	OI	OS	MD	AL	ST	AC	MC	MA	SC	FP	CO	EN	ER	CI
TRACE	48.1	61.2	56.5	72.5	46.5	61.0	48.0	69.5	40.0	22.0	31.0	86.5	37.5	37.0	51.0	45.0	40.5	39.0	31.0	43.5	44.5
TRACE-uni	53.8	68.1	58.5	72.5	41.5	73.5	55.1	71.5	40.5	25.0	53.0	88.5	63.5	38.5	51.0	52.5	49.0	59.5	33.5	49.5	32.5

VideoMME (w/o subtitle)	Short	Medium	Long	Avg
TRACE	49.5	42.5	39.3	43.8
TRACE-uni	58.2	48.1	42.3	49.6

Bibliography

If you find this repository helpful for your project, please consider citing:

@misc{guo2024tracetemporalgroundingvideo,
      title={TRACE: Temporal Grounding Video LLM via Causal Event Modeling}, 
      author={Yongxin Guo and Jingyu Liu and Mingda Li and Xiaoying Tang and Qingbin Liu and Xi Chen},
      year={2024},
      eprint={2410.05643},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2410.05643}, 
}
