# This .py file stores constants for the MM-UPD leaderboard.
MODEL_INFO = ["Model Type", "Model", "Language Model", "Question Type"]
MODEL_SIZE = ["<10B", ">=10B", "-"]
QUESTION_TYPE = ["Base", "Option", "Instruction"]
LEADERBOARD_VERSION = ["Version1"]
TASK_AAD_INFO = ["Overall Dual Acc.", "Overall Standard Acc.", "Overall UPD Acc.", "action_recognition", "attribute_comparison", "attribute_recognition", "celebrity_recognition", "function_reasoning", "future_prediction", "identity_reasoning", "image_emotion", "image_scene", "image_style", "image_topic", "nature_relation", "object_localization", "ocr", "physical_property_reasoning", "physical_relation", "social_relation", "structuralized_imagetext_understanding"]
TASK_IASD_INFO = ["Overall Dual Acc.", "Overall Standard Acc.", "Overall UPD Acc.", "action_recognition", "attribute_comparison", "attribute_recognition", "celebrity_recognition", "function_reasoning", "future_prediction", "identity_reasoning", "image_emotion", "image_scene", "image_style", "image_topic", "nature_relation", "object_localization", "ocr", "physical_property_reasoning", "physical_relation", "social_relation", "structuralized_imagetext_understanding"]
TASK_IVQD_INFO = ["Overall Dual Acc.", "Overall Standard Acc.", "Overall UPD Acc.", "action_recognition", "attribute_comparison", "attribute_recognition", "celebrity_recognition", "function_reasoning", "image_scene", "nature_relation", "object_localization", "ocr", "physical_property_reasoning", "physical_relation", "social_relation"]
AVG_INFO = ["Overall Dual Acc."]
DATA_AAD_TITILE_TYPE = ["markdown"] * len(MODEL_INFO) + ["number"] * len(TASK_AAD_INFO)
DATA_IASD_TITILE_TYPE = ["markdown"] * len(MODEL_INFO) + ["number"] * len(TASK_IASD_INFO)
DATA_IVQD_TITILE_TYPE = ["markdown"] * len(MODEL_INFO) + ["number"] * len(TASK_IVQD_INFO)
CSV_AAD_RESULT_PATH = "./download_from_dataset/results/result_aad.csv"
CSV_IASD_RESULT_PATH = "./download_from_dataset/results/result_iasd.csv"
CSV_IVQD_RESULT_PATH = "./download_from_dataset/results/result_ivqd.csv"
CSV_QUEUE_DIR = "./download_from_dataset/queue/"
COLUMN_AAD_NAMES = MODEL_INFO + TASK_AAD_INFO
COLUMN_IASD_NAMES = MODEL_INFO + TASK_IASD_INFO
COLUMN_IVQD_NAMES = MODEL_INFO + TASK_IVQD_INFO
LEADERBORAD_VERSION = ["MM-AAD", "MM-IASD", "MM-IVQD"]
LEADERBORAD_INTRODUCTION = """
# UPD Leaderboard
### *"Which VLM is reliable?"* 🏆 Welcome to the leaderboard of the **UPD**! *Unsolvable Problem Detection: Evaluating Trustworthiness of Vision Language Models* (**arXiv 2024**) [![Code](https://img.shields.io/github/stars/AtsuMiyai/UPD.svg?style=social&label=Official)](https://github.com/AtsuMiyai/UPD)
<div style="display: flex; flex-wrap: wrap; align-items: center; gap: 10px;">
<a href='https://arxiv.org/abs/2403.20331'><img src='https://img.shields.io/badge/cs.CV-Paper-b31b1b?logo=arxiv&logoColor=red'></a>
</div>
MM-UPD Bench: A Comprehensive Benchmark for Evaluating the Trustworthiness of Vision Language Models (VLMs) in the Context of Unsolvable Problem Detection (UPD)
Our MM-UPD Bench encompasses three benchmarks: MM-AAD, MM-IASD, and MM-IVQD.
Through these benchmarks, we aim to provide a comprehensive evaluation of VLMs across multiple scenarios.
For more detailed information, please refer to the `About` section.
Please follow the instructions in [UPD](https://github.com/AtsuMiyai/UPD) to upload the generated JSON file here. After clicking the `Submit Eval` button, click the `Refresh` button.
We also welcome PRs to [UPD](https://github.com/AtsuMiyai/UPD) that add your VLMs.
"""
SUBMIT_INTRODUCTION = """# Submit to the MM-UPD Benchmark
1. Obtain the dual result JSON file from our [GitHub repository](https://github.com/AtsuMiyai/UPD/tree/main/scripts/inference).
2. If you want to update a model's performance by uploading new results, please make sure 'Revision Model Name' matches the model name shown in the leaderboard. For example, to update LLaVA-1.5-13B's performance, fill in 'LLaVA-1.5-13B' in 'Revision Model Name'.
3. Please provide the correct link to your model's repository for each submission.
4. After clicking 'Submit Eval', you can click 'Refresh' to obtain the latest results in the leaderboard.
Note: An example of a submitted JSON file is available here: [llava1.5_13b_result_dual_detail_submission.json](https://drive.google.com/file/d/1ILYlxcKC_a5Jrm7kyyqeHo0vo3WjkA1V/view?usp=sharing).
Please check that (i) the JSON file contains predictions for all data and (ii) the entries for all options, "hit_upd", "hit_standard", and "hit" exist.
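As a quick, unofficial pre-check before uploading, here is a minimal sketch that assumes the file is a list of per-question records (the linked example file and the official scripts are authoritative):
```python
import json

# Load your generated submission file (file name follows the submit example below).
with open("results.json") as f:
    records = json.load(f)

# Every record should carry the dual-evaluation fields mentioned above.
required_keys = {"hit_upd", "hit_standard", "hit"}
for i, record in enumerate(records):
    missing = required_keys - set(record)
    if missing:
        print(f"record {i}: missing keys {sorted(missing)}")
print(f"checked {len(records)} records")
```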
## Submit Example
If you want to upload LLaVA-1.5-13B's result to the leaderboard, you need to:
1. Select VLM in 'Model Type'.
2. Fill in 'LLaVA-1.5-13B' in 'Model Name' if this is your first time submitting your result (you can leave 'Revision Model Name' blank).
3. Fill in 'LLaVA-1.5-13B' in 'Revision Model Name' if you want to update your result (you can leave 'Model Name' blank).
4. Fill in 'https://github.com/haotian-liu/LLaVA' in 'Model Link'.
5. Fill in '13B' in 'Model size'.
6. Select 'Vicuna-1.5-13B' in 'LLM Type'.
7. Fill in 'LLM model' if you select 'Others' for 'LLM Type'.
8. Select 'AAD', 'IASD', or 'IVQD' in 'UPD_Type'.
9. Select 'Base', 'Option', or 'Instruction' in 'Question Type'.
10. Upload results.json.
11. Click the 'Submit Eval' button.
12. Click 'Refresh' to see the updated leaderboard.
### If you have any questions or deletion requests, please contact [[email protected]]([email protected]).
### ⚠️ Please do not submit any malicious files (e.g., files you manually edited).
"""
LEADERBORAD_INFO = """
## What is MM-UPD Bench?
MM-UPD Bench: A Comprehensive Benchmark for Evaluating the Trustworthiness of Vision Language Models (VLMs) in the Context of Unsolvable Problem Detection (UPD)
Our MM-UPD Bench encompasses three benchmarks: MM-AAD, MM-IASD, and MM-IVQD.
1\. **MM-AAD:** Benchmark for Absent Answer Detection (AAD).
MM-AAD Bench is a dataset where the correct answer option for each question is removed.
MM-AAD tests the model's capability to recognize when the correct answer is absent from the provided choices.
2\. **MM-IASD:** Benchmark for Incompatible Answer Set Detection (IASD).
MM-IASD Bench is a dataset where the answer set is completely incompatible with the context specified by the question and the image.
MM-IASD tests the model's capability to recognize when the answer set is incompatible with the context.
3\. **MM-IVQD:** Benchmark for Incompatible Visual Question Detection (IVQD).
MM-IVQD Bench is a dataset where the question is incompatible with the image.
MM-IVQD evaluates the VLMs' capability to discern when a question and image are irrelevant or inappropriate.
We carefully decompose each benchmark into various abilities to reveal each model's strengths and weaknesses.
## Evaluation Scenarios
We evaluate the performance of VLMs on MM-UPD Bench using the following settings:
1. **Base:** In the Base setting, we do not provide any instruction to withhold answers.
2. **Option:** In the Option setting, we provide an additional option (e.g., None of the above) to withhold answers.
3. **Instruction:** In the Instruction setting, we provide an additional instruction (e.g., If all the options are incorrect, answer F. None of the above.) to withhold answers.
## Evaluation Metrics
We evaluate the performance of VLMs on MM-UPD Bench using the following metrics:
1. **Dual accuracy:** The accuracy on standard-UPD pairs, where we count success only if the model is correct on both the standard and UPD questions.
2. **Standard accuracy:** The accuracy on standard questions.
3. **UPD (AAD/IASD/IVQD) accuracy:** The accuracy on AAD/IASD/IVQD questions.
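For illustration, a minimal sketch of how Dual accuracy follows from the per-pair correctness flags (assuming one standard/UPD correctness pair per question; the official evaluation scripts in the UPD repository are authoritative):
```python
# Each pair holds (hit_standard, hit_upd): correctness on the standard question
# and on its UPD counterpart. A pair counts as a success only when both are True.
def dual_accuracy(pairs):
    pairs = list(pairs)
    return sum(hs and hu for hs, hu in pairs) / len(pairs)

# Example: correct on both questions for 1 of 3 pairs -> Dual accuracy = 1/3.
print(dual_accuracy([(True, True), (True, False), (False, True)]))
```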
""" | |
CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = r"""@article{miyai2024unsolvable,
title={{Unsolvable Problem Detection}: Evaluating Trustworthiness of Vision Language Models},
author={Miyai, Atsuyuki and Yang, Jingkang and Zhang, Jingyang and Ming, Yifei and Yu, Qing and Irie, Go and Li, Yixuan and Li, Hai and Liu, Ziwei and Aizawa, Kiyoharu},
journal={arXiv preprint arXiv:2403.20331},
year={2024}
}""" | |