---
license: apache-2.0
task_categories:
- summarization
language:
- en
tags:
- cross-modal-video-summarization
- video-summarization
- video-captioning
pretty_name: VideoXum
size_categories:
- 10K<n<100K
---

# VTSUM-BLIP Model Card

## Model details

**Model type:**
VTSUM-BLIP is an end-to-end cross-modal video summarization model.

**Model description:**
- VTSUM-BLIP + Temporal Transformer (TT): vtsum_tt.pth
- VTSUM-BLIP + Temporal Transformer (TT) + Context Aggregation (CA): vtsum_tt_ca.pth
- VT-CLIP for the VT-CLIPScore metric: vt_clip.pth
- BLIP w/ ViT-B and CapFilt-L ([Download](https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_capfilt_large.pth)): model_base_capfilt_large.pth

**The file structure of the model zoo is as follows:**

```
outputs
├── blip
│   └── model_base_capfilt_large.pth
├── vt_clipscore
│   └── vt_clip.pth
├── vtsum_tt
│   └── vtsum_tt.pth
└── vtsum_tt_ca
    └── vtsum_tt_ca.pth
```
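Before running inference or evaluation, it can help to confirm that the downloaded checkpoints match the layout above. The sketch below is illustrative only (the helper name `missing_checkpoints` and the `outputs` root are assumptions, not part of the official codebase):

```python
from pathlib import Path

# Expected checkpoint layout of the model zoo, taken from the tree above.
# "outputs" is assumed to be the download root.
CHECKPOINTS = {
    "blip": "model_base_capfilt_large.pth",
    "vt_clipscore": "vt_clip.pth",
    "vtsum_tt": "vtsum_tt.pth",
    "vtsum_tt_ca": "vtsum_tt_ca.pth",
}

def missing_checkpoints(root="outputs"):
    """Return the paths of expected checkpoints not yet present on disk."""
    base = Path(root)
    return [str(base / sub / name)
            for sub, name in CHECKPOINTS.items()
            if not (base / sub / name).is_file()]
```

Calling `missing_checkpoints()` after downloading should return an empty list; any paths it returns still need to be fetched.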

**Paper or resources for more information:**
https://videoxum.github.io/

## Training dataset
- VideoXum *training* set: 8K long videos with 80K pairs of aligned video and text summaries.

## Evaluation dataset
- VideoXum *val* set: 2K long videos with 20K pairs of aligned video and text summaries.
- VideoXum *test* set: 4K long videos with 40K pairs of aligned video and text summaries.