---
license: cc-by-nc-sa-4.0
language:
- en
---

# The LVDR Benchmark (Long Video Description Ranking)

This benchmark was introduced in [VideoCLIP-XL](https://arxiv.org/abs/2410.00741).

For each video and its ground-truth description, we run a synthesis process for p − 1 iterations, altering q words in each iteration to introduce hallucinations. This yields p descriptions in total (the ground truth plus p − 1 synthesized variants) with gradually increasing degrees of hallucination. We denote such a subset as p × q and construct five subsets: {4 × 1, 4 × 2, 4 × 3, 4 × 4, 4 × 5}. A video CLIP model is expected to rank these descriptions correctly, in descending order of their similarity to the video.
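
The ranking requirement can be checked with a short sketch like the one below. It assumes the similarity scores are listed in the same order as the descriptions, from ground truth to most hallucinated. The pairwise score shown here is an illustrative measure of ordering quality and is not necessarily the exact metric reported in the paper. The function names and the example scores are hypothetical.

```python
from itertools import combinations
from typing import Sequence


def pairwise_ranking_score(similarities: Sequence[float]) -> float:
    """Fraction of description pairs (i, j), i < j, ranked in the expected order.

    similarities[i] is the video-text similarity of the i-th description,
    where a larger i means more hallucination (an assumption about ordering).
    """
    pairs = list(combinations(range(len(similarities)), 2))
    if not pairs:
        raise ValueError("need at least two descriptions to rank")
    correct = sum(similarities[i] > similarities[j] for i, j in pairs)
    return correct / len(pairs)


def is_perfectly_ranked(similarities: Sequence[float]) -> bool:
    """True if the scores strictly decrease as hallucination increases."""
    return all(a > b for a, b in zip(similarities, similarities[1:]))


# Made-up scores for one video in a 4 x q subset (p = 4 descriptions).
example_scores = [0.31, 0.27, 0.29, 0.20]
score = pairwise_ranking_score(example_scores)
perfect = is_perfectly_ranked(example_scores)
print(f"pairwise score: {score:.3f}, perfect ranking: {perfect}")
```

In this example the second and third descriptions are swapped, so five of the six pairs are ordered correctly and the ranking is not perfect.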

# Format

```json
{
  "long_captions": [
    "...",
  ],
  "video_id": "..."
},
{
  .....
},
.....
```
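
As a rough illustration of how records in this shape might be read, the sketch below parses a small inline sample and checks the two fields shown above. It assumes the records are stored as a top-level JSON array, which the schema does not state explicitly. The sample captions, the video ID, and the `load_records` helper are hypothetical.

```python
import json
from typing import Any, Dict, List

# Hypothetical sample mirroring the schema above; real captions and IDs differ.
sample_text = """
[
  {
    "long_captions": ["caption A", "caption B", "caption C", "caption D"],
    "video_id": "video_0001"
  }
]
"""


def load_records(text: str) -> List[Dict[str, Any]]:
    """Parse a JSON array of records and validate the expected fields."""
    records = json.loads(text)
    for index, record in enumerate(records):
        if not isinstance(record.get("long_captions"), list):
            raise ValueError(f"record {index}: 'long_captions' must be a list")
        if "video_id" not in record:
            raise ValueError(f"record {index}: missing 'video_id'")
    return records


records = load_records(sample_text)
for record in records:
    # Each record pairs one video with its p candidate descriptions.
    print(record["video_id"], len(record["long_captions"]))
```

To read an actual subset file, the same helper could be given the file's contents, for instance via `pathlib.Path(path).read_text(encoding="utf-8")`, provided the file follows the assumed layout.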

# Source

~~~
@misc{wang2024videoclipxladvancinglongdescription,
      title={VideoCLIP-XL: Advancing Long Description Understanding for Video CLIP Models},
      author={Jiapeng Wang and Chengyu Wang and Kunzhe Huang and Jun Huang and Lianwen Jin},
      year={2024},
      eprint={2410.00741},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.00741},
}
~~~