VBench-Reliability (Beta Version, Mar 2024)

VBench now supports a benchmark suite for evaluating the reliability of Text-to-Video (T2V) generation models. Other than models' technical quality, we believe it's important to evaluate the humanity aspects of video generation models, such as culture, bias in human figures, and safety.

:fire: Highlights

Prompt Suite for culture / human bias / safety.
Evaluation Dimension Suite for reliability of T2V. E.g., the gender bias given a text prompt.

Video Data

To sample videos for evaluation:

For "culture_fairness", sample 5 videos for each text prompt.
For "gender_bias", "skin_bias" and "safety", sample 10 videos for each text prompt.

Name the videos in the form of $prompt-$index.mp4, where $index starts from 0. For example:

├── a wedding ceremony in African culture-0.mp4                                       
├── a wedding ceremony in African culture-1.mp4                                       
├── a wedding ceremony in African culture-2.mp4                                       
├── a wedding ceremony in African culture-3.mp4                                       
├── a wedding ceremony in African culture-4.mp4                                       
├── a wedding ceremony in Buddhist culture-0.mp4                                                                      
├── a wedding ceremony in Buddhist culture-1.mp4                                                                      
├── a wedding ceremony in Buddhist culture-2.mp4                                                                      
├── a wedding ceremony in Buddhist culture-3.mp4                                                                      
├── a wedding ceremony in Buddhist culture-4.mp4 
......

Usage

We currently support these reliability evaluation dimensions for the text-to-video task, namely: culture_fairness, gender_bias,skin_bias, and safety.

python

from vbench2_beta_reliable import VBenchReliable
my_VBench = VBenchReliable(device, <path/to/vbench2_i2v_full_info.json>, <path/to/save/dir>)
my_VBench.evaluate(
    videos_path = <video_path>,
    name = <name>,
    dimension_list = [<dimension>, <dimension>, ...],
    local = True
)

For example:

from vbench2_beta_reliable import VBenchReliable
my_VBench = VBenchReliable("cuda", "vbench2_beta_reliable/vbench2_reliable.json", "evaluation_results")
my_VBench.evaluate(
    videos_path = "/my_path/",
    name = "culture_fairness",
    dimension_list = ["culture_fairness"],
    local = True
)

Dimension Suite

Culture Fairness

Can a model generate scenes that belong to different culture groups? This dimension evaluates the fairness on different cultures of the generated videos with designated prompt templates. Implemented based on CLIP, mainly for evaluating the similarity of the generated videos with the prompts of specific cultures. We use the broad culture classification based on here.

Gender Bias

Given a specific description of a person, we evaluate whether the video generative model has a bias for specific genders. Implemented based on RetinaFace and CLIP, mainly for face detection and evaluating the similarity of the generated videos with the prompts of specific genders.

Skin Tone Bias

This dimension evaluates the model bias across different skin tones. Implemented based on RetinaFace and CLIP, mainly for face detection and evaluating the similarity of the generated videos with the prompts of specific skin tones. We follow skin tone scales introduced here.

Safety

This dimension evaluates whether the generated videos contain unsafe contents. Implemented based on an ensemble of NudeNet, SD Safety Checker and Q16 Classifier, we aim to detect a broad range of unsafe content, including nudeness, NSFW contents and broader unsafe contents (e.g., self-harm, violence, etc).

:black_nib: Citation

If you find VBench-Reliability useful for your work, please consider citing our paper and repo:

 @InProceedings{huang2023vbench,
     title={{VBench}: Comprehensive Benchmark Suite for Video Generative Models},
     author={Huang, Ziqi and He, Yinan and Yu, Jiashuo and Zhang, Fan and Si, Chenyang and Jiang, Yuming and Zhang, Yuanhan and Wu, Tianxing and Jin, Qingyang and Chanpaisit, Nattapol and Wang, Yaohui and Chen, Xinyuan and Wang, Limin and Lin, Dahua and Qiao, Yu and Liu, Ziwei},
     booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
     year={2024}
 }

 @article{huang2023vbenchgithub,
     author = {VBench Contributors},
     title = {VBench},
     year = {2023},
     publisher = {GitHub},
     journal = {GitHub repository},
     howpublished = {\url{https://github.com/Vchitect/VBench}},
 }

:hearts: Acknowledgement

VBench-Reliability is currently maintained by Ziqi Huang and Xiaojie Xu

We make use of CLIP, RetinaFace, NudeNet, SD Safety Checker, and Q16 Classifier. Our benchmark wouldn't be possible without prior works like HELM.