---
title: README
emoji: 🌖
colorFrom: green
colorTo: pink
sdk: static
pinned: false
---

# 👩‍⚖️ [**MJ-Bench**: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?](https://mj-bench.github.io/)

- Project page: https://mj-bench.github.io/
- Code repository: https://github.com/MJ-Bench/MJ-Bench

While text-to-image models like DALLE-3 and Stable Diffusion are rapidly proliferating, they often encounter challenges such as hallucination, bias, and unsafe, low-quality output. To effectively address these issues, it is crucial to align these models with desired behaviors based on feedback from a multimodal judge. Despite their significance, current multimodal judges are often inadequately evaluated with respect to their capabilities and limitations, potentially leading to misalignment and unsafe fine-tuning outcomes.

To address this issue, we introduce MJ-Bench, a novel benchmark which incorporates a comprehensive preference dataset to evaluate multimodal judges in providing feedback for image generation models across four key perspectives: **alignment**, **safety**, **image quality**, and **bias**. 

![Dataset Overview](https://raw.githubusercontent.com/MJ-Bench/MJ-Bench.github.io/main/static/images/dataset_overview.png)
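
Below is a minimal sketch of loading the preference data with the Hugging Face `datasets` library. The config name `"alignment"` is an assumption for illustration only; check the [dataset card](https://huggingface.co/datasets/MJ-Bench/MJ-Bench) for the actual configs, splits, and columns.

```python
# pip install datasets
from datasets import load_dataset

# Load one perspective of the MJ-Bench preference data.
# The config name "alignment" is an assumption; the dataset card lists
# the actual configs (e.g. one per evaluated perspective) and splits.
ds = load_dataset("MJ-Bench/MJ-Bench", "alignment")

print(ds)               # available splits and their sizes
print(ds.column_names)  # schema per split (prompt plus preferred/rejected images, expected)
```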

Specifically, we evaluate a wide variety of multimodal judges, including:

- 6 smaller-sized CLIP-based scoring models (see the scoring sketch below)
- 11 open-source VLMs (e.g., the LLaVA family)
- 4 closed-source VLMs (e.g., GPT-4o, Claude 3)

![Radar Plot](https://raw.githubusercontent.com/MJ-Bench/MJ-Bench.github.io/main/static/images/radar_plot.png)
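
As a concrete illustration of what a CLIP-based scoring judge does, the sketch below scores two candidate images for one prompt and prefers the higher-scoring one. It is only a stand-in under stated assumptions: it uses the generic `openai/clip-vit-base-patch32` checkpoint from `transformers`, whereas the judges actually evaluated in MJ-Bench use their own checkpoints and preprocessing, and the image file names here are hypothetical.

```python
# pip install torch transformers pillow
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Generic CLIP checkpoint, used purely for illustration; the CLIP-based
# judges evaluated in MJ-Bench ship their own fine-tuned checkpoints.
model_id = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

prompt = "a photo of an astronaut riding a horse"
# Hypothetical candidate images produced by a text-to-image model.
images = [Image.open("candidate_0.png"), Image.open("candidate_1.png")]

inputs = processor(text=[prompt], images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# logits_per_image holds the (scaled) image-text similarity for each image;
# treat it as a preference score and pick the higher-scoring candidate.
scores = out.logits_per_image.squeeze(-1)
print("scores:", scores.tolist())
print("preferred image:", int(scores.argmax()))
```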


🔥🔥 We are actively updating the [leaderboard](https://mj-bench.github.io/), and you are welcome to submit the evaluation results of your multimodal judge on [our dataset](https://huggingface.co/datasets/MJ-Bench/MJ-Bench) to the [Hugging Face leaderboard](https://huggingface.co/spaces/MJ-Bench/MJ-Bench-Leaderboard).