|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- TIGER-Lab/VideoFeedback |
|
language: |
|
- en |
|
metrics: |
|
- accuracy/spcc |
|
library_name: transformers |
|
pipeline_tag: visual-question-answering |
|
--- |
|
|
|
|
|
[📃Paper] | [🌐Website](https://tiger-ai-lab.github.io/MantisScore/) | [💻Github](https://github.com/TIGER-AI-Lab/MantisScore) | [🛢️Datasets](https://huggingface.co/datasets/TIGER-Lab/VideoFeedback) | [🤗Model](https://huggingface.co/TIGER-Lab/MantisScore) | [🤗Model-variant](https://huggingface.co/TIGER-Lab/MantisScore-anno-only) | [🤗Demo](https://huggingface.co/spaces/Mantis-VL/MantisScore) |
|
|
|
|
|
![MantisScore](https://tiger-ai-lab.github.io/MantisScore/static/images/teaser.png) |
|
|
|
## Introduction |
|
- MantisScore is a video quality evaluation model. It uses [Mantis-8B-Idefics2](https://huggingface.co/TIGER-Lab/Mantis-8B-Idefics2) as its base model and is trained on [VideoFeedback](https://huggingface.co/datasets/TIGER-Lab/VideoFeedback), a large video evaluation dataset with multi-aspect human scores.
|
|
|
- MantisScore reaches a 75+ Spearman correlation with human ratings on VideoFeedback-test, surpassing all MLLM-prompting methods and feature-based metrics.
|
|
|
- MantisScore also beats the best baselines on the three other benchmarks, EvalCrafter, GenAI-Bench, and VBench, showing high alignment with human evaluations.
|
|
|
## Evaluation Results |
|
|
|
We evaluate MantisScore on VideoFeedback-test, EvalCrafter, GenAI-Bench, and VBench.
For the first two benchmarks, we report the Spearman correlation between the model's output and human ratings, averaged over all evaluation aspects.
For GenAI-Bench and VBench, which contain human preference data over two or more videos, we use the model's output to predict preferences and report pairwise accuracy as the performance indicator.
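To make these two protocols concrete, here is a minimal sketch of how such indicators can be computed; it is illustrative only (not the benchmark code from the MantisScore repository), and the score arrays and preference labels are hypothetical.

```python
# Illustrative sketch of the two indicators; array names and shapes are assumptions.
import numpy as np
from scipy.stats import spearmanr

def avg_spearman(model_scores: np.ndarray, human_scores: np.ndarray) -> float:
    """Spearman correlation per evaluation aspect, averaged over aspects.

    Both arrays have shape (num_videos, num_aspects).
    """
    rhos = []
    for aspect in range(model_scores.shape[1]):
        rho, _ = spearmanr(model_scores[:, aspect], human_scores[:, aspect])
        rhos.append(rho)
    return float(np.mean(rhos))

def pairwise_accuracy(scores_a: np.ndarray, scores_b: np.ndarray,
                      human_prefers_a: np.ndarray) -> float:
    """Fraction of video pairs where the model's score ordering matches the human preference."""
    model_prefers_a = scores_a > scores_b
    return float(np.mean(model_prefers_a == human_prefers_a))
```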
|
|
|
For the VideoFeedback-test set, we use [MantisScore](https://huggingface.co/TIGER-Lab/MantisScore), trained on the full VideoFeedback dataset; for the other three benchmarks, we use the [MantisScore-anno-only](https://huggingface.co/TIGER-Lab/MantisScore-anno-only) variant, trained on VideoFeedback with real videos excluded.
|
|
|
The evaluation results are shown below: |
|
|
|
|
|
| metric | Final Avg Score | VideoFeedback-test | EvalCrafter | GenAI-Bench | VBench |
|:-----------------:|:--------------:|:--------------:|:-----------:|:-----------:|:----------:|
| MantisScore (reg) | **69.6** | 75.7 | **51.1** | **78.5** | **73.0** |
| MantisScore (gen) | 55.6 | **77.1** | 27.6 | 59.0 | 58.7 |
| Gemini-1.5-Pro | <u>39.7</u> | 22.1 | 22.9 | 60.9 | 52.9 |
| Gemini-1.5-Flash | 39.4 | 20.8 | 17.3 | <u>67.1</u> | 52.3 |
| GPT-4o | 38.9 | <u>23.1</u> | 28.7 | 52.0 | 51.7 |
| CLIP-sim | 31.7 | 8.9 | <u>36.2</u> | 34.2 | 47.4 |
| DINO-sim | 30.3 | 7.5 | 32.1 | 38.5 | 43.3 |
| SSIM-sim | 29.5 | 13.4 | 26.9 | 34.1 | 43.5 |
| CLIP-Score | 28.6 | -7.2 | 21.7 | 45.0 | 54.9 |
| LLaVA-1.5-7B | 27.1 | 8.5 | 10.5 | 49.9 | 39.4 |
| LLaVA-1.6-7B | 23.3 | -3.1 | 13.2 | 44.5 | 38.7 |
| X-CLIP-Score | 23.2 | -1.9 | 13.3 | 41.4 | 40.1 |
| PIQE | 19.6 | -10.1 | -1.2 | 34.5 | <u>55.1</u> |
| BRISQUE | 19.0 | -20.3 | 3.9 | 38.5 | 53.7 |
| Idefics2 | 18.3 | 6.5 | 0.3 | 34.6 | 31.7 |
| MSE-dyn | 10.6 | -5.5 | -17.0 | 28.4 | 36.5 |
| SSIM-dyn | 9.2 | -12.9 | -26.4 | 31.4 | 44.5 |
|
|
|
|
The best result within the MantisScore series is in bold, and the best among the baselines is underlined.
|
<!-- "-" means the answer of MLLM is meaningless or in wrong format. --> |
|
|
|
## Usage |
|
### Installation |
|
``` |
|
git clone https://github.com/TIGER-AI-Lab/MantisScore.git |
|
``` |
|
|
|
### Inference |
|
``` |
|
cd MantisScore/examples |
|
``` |
|
|
|
```python |
|
import av |
|
import numpy as np |
|
from typing import List |
|
from PIL import Image |
|
import torch |
|
from transformers import AutoProcessor |
|
from models.idefics2 import Idefics2ForSequenceClassification |
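# Idefics2ForSequenceClassification is defined in the MantisScore repository (models/idefics2.py),
# which is why this example is run from MantisScore/examples rather than an arbitrary directory.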
|
|
|
def _read_video_pyav(
    container,
    indices: List[int],
):
    # decode only the sampled frame indices and stack them as an RGB array
|
frames = [] |
|
container.seek(0) |
|
start_index = indices[0] |
|
end_index = indices[-1] |
|
for i, frame in enumerate(container.decode(video=0)): |
|
if i > end_index: |
|
break |
|
if i >= start_index and i in indices: |
|
frames.append(frame) |
|
return np.stack([x.to_ndarray(format="rgb24") for x in frames]) |
|
|
|
MAX_NUM_FRAMES=16 |
|
ROUND_DIGIT=4 |
|
REGRESSION_QUERY_PROMPT = """ |
|
Suppose you are an expert in judging and evaluating the quality of AI-generated videos, |
|
please watch the following frames of a given video and see the text prompt for generating the video, |
|
then give scores from 5 different dimensions: |
|
(1) visual quality: the quality of the video in terms of clearness, resolution, brightness, and color |
|
(2) temporal consistency, both the consistency of objects or humans and the smoothness of motion or movements |
|
(3) dynamic degree, the degree of dynamic changes |
|
(4) text-to-video alignment, the alignment between the text prompt and the video content |
|
(5) factual consistency, the consistency of the video content with the common-sense and factual knowledge |
|
|
|
for each dimension, output a float number from 1.0 to 4.0, |
|
the higher the number is, the better the video performs in that sub-score, |
|
the lowest 1.0 means Bad, the highest 4.0 means Perfect/Real (the video is like a real video) |
|
Here is an output example: |
|
visual quality: 3.2 |
|
temporal consistency: 2.7 |
|
dynamic degree: 4.0 |
|
text-to-video alignment: 2.3 |
|
factual consistency: 1.8 |
|
|
|
For this video, the text prompt is "{text_prompt}", |
|
all the frames of video are as follows: |
|
""" |
|
|
|
model_name="TIGER-Lab/MantisScore" |
|
video_path="video1.mp4" |
|
video_prompt="Near the Elephant Gate village, they approach the haunted house at night. Rajiv feels anxious, but Bhavesh encourages him. As they reach the house, a mysterious sound in the air adds to the suspense." |
|
|
|
processor = AutoProcessor.from_pretrained(model_name,torch_dtype=torch.bfloat16) |
|
model = Idefics2ForSequenceClassification.from_pretrained(model_name,torch_dtype=torch.bfloat16).eval() |
|
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") |
|
model.to(device) |
|
|
|
# uniformly sample at most MAX_NUM_FRAMES frames from the video
|
container = av.open(video_path) |
|
total_frames = container.streams.video[0].frames |
|
if total_frames > MAX_NUM_FRAMES: |
|
indices = np.arange(0, total_frames, total_frames / MAX_NUM_FRAMES).astype(int) |
|
else: |
|
indices = np.arange(total_frames) |
|
|
|
frames = [Image.fromarray(x) for x in _read_video_pyav(container, indices)] |
|
eval_prompt = REGRESSION_QUERY_PROMPT.format(text_prompt=video_prompt) |
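# the processor expects one "<image>" placeholder per frame; pad the prompt if it contains fewer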
|
num_image_token = eval_prompt.count("<image>") |
|
if num_image_token < len(frames): |
|
eval_prompt += "<image> " * (len(frames) - num_image_token) |
|
|
|
flatten_images = [] |
|
for x in [frames]: |
|
if isinstance(x, list): |
|
flatten_images.extend(x) |
|
else: |
|
flatten_images.append(x) |
|
flatten_images = [Image.open(x) if isinstance(x, str) else x for x in flatten_images] |
|
inputs = processor(text=eval_prompt, images=flatten_images, return_tensors="pt") |
|
inputs = {k: v.to(model.device) for k, v in inputs.items()} |
|
|
|
with torch.no_grad(): |
|
outputs = model(**inputs) |
|
|
|
logits = outputs.logits |
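# the regression variant returns one logit per evaluation aspect, used directly as that aspect's score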
|
num_aspects = logits.shape[-1] |
|
|
|
aspect_scores = [] |
|
for i in range(num_aspects): |
|
aspect_scores.append(round(logits[0, i].item(),ROUND_DIGIT)) |
|
print(aspect_scores) |
|
|
|
""" |
|
# model output on visual quality, temporal consistency, dynamic degree, |
|
# text-to-video alignment, factual consistency, respectively |
|
[2.2969, 2.4375, 2.8281, 2.5, 2.4688] |
|
""" |
|
|
|
``` |
|
|
|
### Training |
|
See [MantisScore/training](https://github.com/TIGER-AI-Lab/MantisScore/tree/main/training) for details.
|
|
|
### Evaluation |
|
See [MantisScore/benchmark](https://github.com/TIGER-AI-Lab/MantisScore/tree/main/benchmark) for details.
|
|
|
## Citation |
|
|