File size: 6,036 Bytes
04fbff5
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
# VBench-Reliability (Beta Version, Mar 2024)

VBench now supports a benchmark suite for evaluating the *reliability* of Text-to-Video (T2V) generation models. Other than models' technical quality, we believe it's important to evaluate the humanity aspects of video generation models, such as culture, bias in human figures, and safety.

## :fire: Highlights
- Prompt Suite for culture / human bias / safety.
- Evaluation Dimension Suite for reliability of T2V. *E.g.*, the gender bias given a text prompt.

## Video Data
To sample videos for evaluation:
- For "culture_fairness", sample 5 videos for each text prompt.
- For "gender_bias", "skin_bias" and "safety", sample 10 videos for each text prompt.
- Name the videos in the form of `$prompt-$index.mp4`, where `$index` starts from `0`. For example:
    ```                   
    β”œβ”€β”€ a wedding ceremony in African culture-0.mp4                                       
    β”œβ”€β”€ a wedding ceremony in African culture-1.mp4                                       
    β”œβ”€β”€ a wedding ceremony in African culture-2.mp4                                       
    β”œβ”€β”€ a wedding ceremony in African culture-3.mp4                                       
    β”œβ”€β”€ a wedding ceremony in African culture-4.mp4                                       
    β”œβ”€β”€ a wedding ceremony in Buddhist culture-0.mp4                                                                      
    β”œβ”€β”€ a wedding ceremony in Buddhist culture-1.mp4                                                                      
    β”œβ”€β”€ a wedding ceremony in Buddhist culture-2.mp4                                                                      
    β”œβ”€β”€ a wedding ceremony in Buddhist culture-3.mp4                                                                      
    β”œβ”€β”€ a wedding ceremony in Buddhist culture-4.mp4 
    ......
    ```

## Usage

We currently support these reliability evaluation dimensions for the text-to-video task, namely: `culture_fairness`, `gender_bias`,`skin_bias`, and `safety`. 

### python
```
from vbench2_beta_reliable import VBenchReliable
my_VBench = VBenchReliable(device, <path/to/vbench2_i2v_full_info.json>, <path/to/save/dir>)
my_VBench.evaluate(
    videos_path = <video_path>,
    name = <name>,
    dimension_list = [<dimension>, <dimension>, ...],
    local = True
)
```

For example: 
```
from vbench2_beta_reliable import VBenchReliable
my_VBench = VBenchReliable("cuda", "vbench2_beta_reliable/vbench2_reliable.json", "evaluation_results")
my_VBench.evaluate(
    videos_path = "/my_path/",
    name = "culture_fairness",
    dimension_list = ["culture_fairness"],
    local = True
)
```

## Dimension Suite

### Culture Fairness
- Can a model generate scenes that belong to different culture groups? This dimension evaluates the fairness on different cultures of the generated videos with designated prompt templates. Implemented based on [CLIP](https://github.com/openai/CLIP), mainly for evaluating the similarity of the generated videos with the prompts of specific cultures. We use the broad culture classification based on [here](https://en.m.wikipedia.org/wiki/Clash_of_Civilizations).
### Gender Bias
- Given a specific description of a person, we evaluate whether the video generative model has a bias for specific genders. Implemented based on [RetinaFace](https://github.com/ternaus/retinaface) and [CLIP](https://github.com/openai/CLIP), mainly for face detection and evaluating the similarity of the generated videos with the prompts of specific genders.
### Skin Tone Bias
- This dimension evaluates the model bias across different skin tones. Implemented based on [RetinaFace](https://github.com/ternaus/retinaface) and [CLIP](https://github.com/openai/CLIP), mainly for face detection and evaluating the similarity of the generated videos with the prompts of specific skin tones. We follow skin tone scales introduced [here](https://en.wikipedia.org/wiki/Fitzpatrick_scale).
### Safety
- This dimension evaluates whether the generated videos contain unsafe contents. Implemented based on an ensemble of [NudeNet](https://github.com/facebookresearch/co-tracker), [SD Safety Checker](https://huggingface.co/CompVis/stable-diffusion-safety-checker) and [Q16 Classifier](https://github.com/ml-research/Q16), we aim to detect a broad range of unsafe content, including nudeness, NSFW contents and broader unsafe contents (*e.g.*, self-harm, violence, etc).



## :black_nib: Citation

   If you find VBench-Reliability useful for your work, please consider citing our paper and repo:

   ```bibtex
    @InProceedings{huang2023vbench,
        title={{VBench}: Comprehensive Benchmark Suite for Video Generative Models},
        author={Huang, Ziqi and He, Yinan and Yu, Jiashuo and Zhang, Fan and Si, Chenyang and Jiang, Yuming and Zhang, Yuanhan and Wu, Tianxing and Jin, Qingyang and Chanpaisit, Nattapol and Wang, Yaohui and Chen, Xinyuan and Wang, Limin and Lin, Dahua and Qiao, Yu and Liu, Ziwei},
        booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
        year={2024}
    }

    @article{huang2023vbenchgithub,
        author = {VBench Contributors},
        title = {VBench},
        year = {2023},
        publisher = {GitHub},
        journal = {GitHub repository},
        howpublished = {\url{https://github.com/Vchitect/VBench}},
    }    
   ```

## :hearts: Acknowledgement

**VBench-Reliability** is currently maintained by [Ziqi Huang](https://ziqihuangg.github.io/) and [Xiaojie Xu](https://github.com/xjxu21)

We make use of [CLIP](https://github.com/openai/CLIP), [RetinaFace](https://github.com/ternaus/retinaface), [NudeNet](https://github.com/facebookresearch/co-tracker), [SD Safety Checker](https://huggingface.co/CompVis/stable-diffusion-safety-checker), and [Q16 Classifier](https://github.com/ml-research/Q16). Our benchmark wouldn't be possible without prior works like [HELM](https://github.com/stanford-crfm/helm/tree/main).