---
license: apache-2.0
task_categories:
- summarization
language:
- en
tags:
- cross-modal-video-summarization
- video-summarization
- video-captioning
pretty_name: VideoXum
size_categories:
- 10K<n<100K
---

# VTSUM-BLIP Model Card

## Model details

**Model type:**
VTSUM-BLIP is an end-to-end cross-modal video summarization model.

**Model description:**
- VTSUM-BLIP + Temporal Transformer (TT): vtsum_tt.pth
- VTSUM-BLIP + Temporal Transformer (TT) + Context Aggregation (CA): vtsum_tt_ca.pth
- VT-CLIP for the VT-CLIPScore metric: vt_clip.pth
- BLIP w/ ViT-B and CapFilt-L ([Download](https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_capfilt_large.pth)): model_base_capfilt_large.pth

**The file structure of the model zoo is as follows:**

```
outputs
├── blip
│   └── model_base_capfilt_large.pth
├── vt_clipscore
│   └── vt_clip.pth
├── vtsum_tt
│   └── vtsum_tt.pth
└── vtsum_tt_ca
    └── vtsum_tt_ca.pth
```
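Before running inference or evaluation, it can help to confirm that the downloaded checkpoints match the layout above. The sketch below is illustrative only (the helper name `missing_checkpoints` and the `outputs` root are assumptions, not part of the official codebase):

```python
from pathlib import Path

# Expected checkpoint layout of the model zoo, taken from the tree above.
# "outputs" is assumed to be the download root.
CHECKPOINTS = {
    "blip": "model_base_capfilt_large.pth",
    "vt_clipscore": "vt_clip.pth",
    "vtsum_tt": "vtsum_tt.pth",
    "vtsum_tt_ca": "vtsum_tt_ca.pth",
}

def missing_checkpoints(root="outputs"):
    """Return the paths of expected checkpoints not yet present on disk."""
    base = Path(root)
    return [str(base / sub / name)
            for sub, name in CHECKPOINTS.items()
            if not (base / sub / name).is_file()]
```

Calling `missing_checkpoints()` after downloading should return an empty list; any paths it returns still need to be fetched.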

**Paper or resources for more information:**
https://videoxum.github.io/

## Training dataset
- VideoXum *training* set: 8K long videos with 80K pairs of aligned video and text summaries.

## Evaluation dataset
- VideoXum *val* set: 2K long videos with 20K pairs of aligned video and text summaries.
- VideoXum *test* set: 4K long videos with 40K pairs of aligned video and text summaries.