---
license: apache-2.0
task_categories:
- summarization
language:
- en
tags:
- cross-modal-video-summarization
- video-summarization
- video-captioning
pretty_name: VideoXum
size_categories:
- 10K<n<100K
---
# VTSUM-BLIP Model Card
## Model details
**Model type:**
VTSUM-BLIP is an end-to-end cross-modal video summarization model.
**Model description:**
- VTSUM-BLIP + Temporal Transformer (TT): vtsum_tt.pth
- VTSUM-BLIP + Temporal Transformer (TT) + Context Aggregation (CA): vtsum_tt_ca.pth
- VT-CLIP for VT-CLIPScore metric: vt_clip.pth
- BLIP w/ ViT-B and CapFilt-L ([Download](https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_capfilt_large.pth)): model_base_capfilt_large.pth
**The file structure of the model zoo looks like:**
```
outputs
├── blip
│   └── model_base_capfilt_large.pth
├── vt_clipscore
│   └── vt_clip.pth
├── vtsum_tt
│   └── vtsum_tt.pth
└── vtsum_tt_ca
    └── vtsum_tt_ca.pth
```
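As a convenience, the expected layout above can be checked programmatically before training or evaluation. The sketch below is a minimal helper, assuming the checkpoints are placed under an `outputs` directory relative to some root; the function name and root argument are hypothetical, not part of the released code.

```python
from pathlib import Path

# Expected checkpoint layout, taken from the tree above.
# The root location (default: current directory) is an assumption.
CHECKPOINTS = {
    "blip": "outputs/blip/model_base_capfilt_large.pth",
    "vt_clip": "outputs/vt_clipscore/vt_clip.pth",
    "vtsum_tt": "outputs/vtsum_tt/vtsum_tt.pth",
    "vtsum_tt_ca": "outputs/vtsum_tt_ca/vtsum_tt_ca.pth",
}

def missing_checkpoints(root="."):
    """Return the names of checkpoints not yet present under `root`."""
    return sorted(
        name
        for name, rel_path in CHECKPOINTS.items()
        if not (Path(root) / rel_path).is_file()
    )
```

Running `missing_checkpoints()` after downloading the weights should return an empty list; any names it returns still need to be fetched.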
**Paper or resources for more information:**
https://videoxum.github.io/
## Training dataset
- VideoXum *training* set: 8K long videos with 80K pairs of aligned video and text summaries.
## Evaluation dataset
- VideoXum *val* set: 2K long videos with 20K pairs of aligned video and text summaries.
- VideoXum *test* set: 4K long videos with 40K pairs of aligned video and text summaries.