---
license: apache-2.0
datasets:
- TIGER-Lab/VideoEval
language:
- en
metrics:
- accuracy
library_name: transformers
pipeline_tag: visual-question-answering
---
|
|
|
|
|
[Paper] | [Website](https://tiger-ai-lab.github.io/MantisScore/) | [Github](https://github.com/TIGER-AI-Lab/MantisScore) | [Datasets](https://huggingface.co/datasets/TIGER-Lab/VideoEval) | [Model](https://huggingface.co/TIGER-Lab/MantisScore) | [Demo](https://huggingface.co/spaces/Mantis-VL/MantisScore)
|
|
|
|
|
![MantisScore](https://tiger-ai-lab.github.io/MantisScore/static/images/teaser.png)
|
|
|
## Introduction

- MantisScore is a video quality evaluation model built on [Mantis-8B-Idefics2](https://huggingface.co/TIGER-Lab/Mantis-8B-Idefics2) as the base model and trained on [VideoEval](https://huggingface.co/datasets/TIGER-Lab/VideoEval), a large video evaluation dataset with multi-aspect human scores.

- MantisScore reaches a Spearman correlation above 75 with human ratings on VideoEval-test, surpassing all MLLM-prompting methods and feature-based metrics.

- MantisScore also beats the best baselines on three other benchmarks, EvalCrafter, GenAI-Bench and VBench, showing high alignment with human evaluations.
|
|
|
## Performance

### Evaluation Results

We test our video evaluation model MantisScore on VideoEval-test, EvalCrafter, GenAI-Bench and VBench.
For the first two benchmarks, we use the Spearman correlation between the model's output and human ratings,
averaged over all evaluation aspects, as the indicator.
For GenAI-Bench and VBench, which contain human preference data over two or more videos,
we use the model's output to predict preferences and report pairwise accuracy as the performance indicator
(a minimal sketch of both metrics follows the table below).
|
| metric            | Final Sum Score | VideoEval-test | EvalCrafter | GenAI-Bench | VBench |
|-------------------|:---------------:|:--------------:|:-----------:|:-----------:|:------:|
| MantisScore (reg) | 278.3           | 75.7           | 51.1        | 78.5        | 73.0   |
| MantisScore (gen) | 222.4           | 77.1           | 27.6        | 59.0        | 58.7   |
| Gemini-1.5-Pro    | 158.8           | 22.1           | 22.9        | 60.9        | 52.9   |
| Gemini-1.5-Flash  | 157.5           | 20.8           | 17.3        | 67.1        | 52.3   |
| GPT-4o            | 155.4           | 23.1           | 28.7        | 52.0        | 51.7   |
| CLIP-sim          | 126.8           | 8.9            | 36.2        | 34.2        | 47.4   |
| DINO-sim          | 121.3           | 7.5            | 32.1        | 38.5        | 43.3   |
| SSIM-sim          | 118.0           | 13.4           | 26.9        | 34.1        | 43.5   |
| CLIP-Score        | 114.4           | -7.2           | 21.7        | 45.0        | 54.9   |
| LLaVA-1.5-7B      | 108.3           | 8.5            | 10.5        | 49.9        | 39.4   |
| LLaVA-1.6-7B      | 93.3            | -3.1           | 13.2        | 44.5        | 38.7   |
| X-CLIP-Score      | 92.9            | -1.9           | 13.3        | 41.4        | 40.1   |
| PIQE              | 78.3            | -10.1          | -1.2        | 34.5        | 55.1   |
| BRISQUE           | 75.9            | -20.3          | 3.9         | 38.5        | 53.7   |
| Idefics2          | 73.0            | 6.5            | 0.3         | 34.6        | 31.7   |
| SSIM-dyn          | 42.5            | -5.5           | -17.0       | 28.4        | 36.5   |
| MSE-dyn           | 36.7            | -12.9          | -26.4       | 31.4        | 44.5   |
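
As referenced above, here is a minimal sketch of how the two indicators can be computed. It is not the official evaluation script; `model_scores` and `human_scores` are hypothetical stand-ins for per-video model outputs and human ratings on one aspect.

```python
# Minimal sketch of the two indicators (not the official evaluation script).
# `model_scores` and `human_scores` are hypothetical per-video values for one aspect.
from itertools import combinations

import numpy as np
from scipy.stats import spearmanr

model_scores = np.array([2.3, 3.1, 1.8, 2.9])  # hypothetical model outputs
human_scores = np.array([2.0, 3.5, 1.5, 3.0])  # hypothetical human ratings

# Spearman correlation (used for VideoEval-test and EvalCrafter)
rho, _ = spearmanr(model_scores, human_scores)
print(f"Spearman correlation: {rho:.3f}")

# Pairwise accuracy (used for GenAI-Bench and VBench): for every pair of videos,
# check whether the model ranks them the same way the human preference does.
def pairwise_accuracy(model_scores, human_scores):
    correct, total = 0, 0
    for i, j in combinations(range(len(model_scores)), 2):
        if human_scores[i] == human_scores[j]:
            continue  # skip human ties
        total += 1
        if (model_scores[i] - model_scores[j]) * (human_scores[i] - human_scores[j]) > 0:
            correct += 1
    return correct / total if total else 0.0

print(f"Pairwise accuracy: {pairwise_accuracy(model_scores, human_scores):.3f}")
```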
|
|
|
|
|
## Usage

### Installation

```bash
pip install git+https://github.com/TIGER-AI-Lab/MantisScore.git
```
|
|
|
### Inference
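
The inference example below assumes a `processor` and `model` have already been loaded. The following is a minimal loading sketch; the `Idefics2ForSequenceClassification` import path from the Mantis package installed above is an assumption, so check the [Github](https://github.com/TIGER-AI-Lab/MantisScore) repo if it differs.

```python
import torch
from transformers import AutoProcessor
# assumption: the regression checkpoint uses the Idefics2 sequence-classification
# head shipped with the Mantis package installed above; adjust the import if needed
from mantis.models.idefics2 import Idefics2ForSequenceClassification

model_name = "TIGER-Lab/MantisScore"
processor = AutoProcessor.from_pretrained(model_name)
model = Idefics2ForSequenceClassification.from_pretrained(
    model_name, torch_dtype=torch.bfloat16
).eval()
model.to("cuda" if torch.cuda.is_available() else "cpu")
```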
|
```python
import av
import numpy as np
import torch
from PIL import Image

def _read_video_pyav(
    container,  # an av container opened from the video file
    indices,    # frame indices to keep
):
    """Decode the frames at `indices` from the PyAV container as RGB arrays."""
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])

MAX_NUM_FRAMES = 16
ROUND_DIGIT = 4
REGRESSION_QUERY_PROMPT = """
Suppose you are an expert in judging and evaluating the quality of AI-generated videos,
please watch the following frames of a given video and see the text prompt for generating the video,
then give scores from 5 different dimensions:
(1) visual quality: the quality of the video in terms of clearness, resolution, brightness, and color
(2) temporal consistency, both the consistency of objects or humans and the smoothness of motion or movements
(3) dynamic degree, the degree of dynamic changes
(4) text-to-video alignment, the alignment between the text prompt and the video content
(5) factual consistency, the consistency of the video content with the common-sense and factual knowledge

for each dimension, output a float number from 1.0 to 4.0,
the higher the number is, the better the video performs in that sub-score,
the lowest 1.0 means Bad, the highest 4.0 means Perfect/Real (the video is like a real video)
Here is an output example:
visual quality: 3.2
temporal consistency: 2.7
dynamic degree: 4.0
text-to-video alignment: 2.3
factual consistency: 1.8

For this video, the text prompt is "{text_prompt}",
all the frames of video are as follows:
"""

video_path = "examples/video1.mp4"
video_prompt = "text prompt used to generate examples/video1.mp4"  # placeholder: put the real generation prompt here

# sample at most MAX_NUM_FRAMES frames uniformly from the video
container = av.open(video_path)
total_frames = container.streams.video[0].frames
if total_frames > MAX_NUM_FRAMES:
    indices = np.arange(0, total_frames, total_frames / MAX_NUM_FRAMES).astype(int)
else:
    indices = np.arange(total_frames)

frames = [Image.fromarray(x) for x in _read_video_pyav(container, indices)]
eval_prompt = REGRESSION_QUERY_PROMPT.format(text_prompt=video_prompt)
# make sure there is one <image> token per sampled frame
num_image_token = eval_prompt.count("<image>")
if num_image_token < len(frames):
    eval_prompt += "<image> " * (len(frames) - num_image_token)

flatten_images = []
for x in [frames]:
    if isinstance(x, list):
        flatten_images.extend(x)
    else:
        flatten_images.append(x)
flatten_images = [Image.open(x) if isinstance(x, str) else x for x in flatten_images]

# `processor` and `model` are the objects loaded in the snippet above
inputs = processor(text=eval_prompt, images=flatten_images, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)

# one regression logit per evaluation aspect
logits = outputs.logits
num_aspects = logits.shape[-1]

aspect_scores = []
for i in range(num_aspects):
    aspect_scores.append(round(logits[0, i].item(), ROUND_DIGIT))
print(aspect_scores)

"""
# model output on visual quality, temporal consistency, dynamic degree,
# text-to-video alignment, factual consistency, respectively
[2.2969, 2.4375, 2.8281, 2.5, 2.4688]
"""
```
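
The five scores follow the order of the dimensions in the prompt above; a small follow-up to label them:

```python
aspect_names = [
    "visual quality",
    "temporal consistency",
    "dynamic degree",
    "text-to-video alignment",
    "factual consistency",
]
print(dict(zip(aspect_names, aspect_scores)))
```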
|
|
|
### Training

See [MantisScore/training](https://github.com/TIGER-AI-Lab/MantisScore/training) for details.

### Evaluation

See [MantisScore/benchmark](https://github.com/TIGER-AI-Lab/MantisScore/benchmark) for details.
|
|
|
## Citation
|
|