AtsuMiyai committed · 65d061e
Parent(s): de6b903

update explanations on MM-UPD Bench

Files changed: constants.py (+10 -8)
constants.py CHANGED

@@ -77,7 +77,6 @@ SUBMIT_INTRODUCTION = """# Submit on MM-UPD Benchmark Introduction
 """
 
 
-
 LEADERBORAD_INFO = """
 ## What is MM-UPD Bench?
 MM-UPD Bench: A Comprehensive Benchmark for Evaluating the Trustworthiness of Vision Language Models (VLMs) in the Context of Unsolvable Problem Detection (UPD)
@@ -96,24 +95,27 @@ MM-IASD tests the model's capability to recognize when the answer set is incompa
 MM-IVQD Bench is a dataset where the question is incompatible with the image.
 MM-IVQD evaluates the VLMs' capability to discern when a question and image are irrelevant or inappropriate.
 
-##
-We
+We carefully decompose each benchmark into various abilities to reveal individual models' strengths and weaknesses.
+
 
-
+## Evaluation Scenario
+We evaluate the performance of VLMs on MM-UPD Bench using the following settings:
+1. **Base:** In the Base setting, we do not provide any instruction to withhold answers.
 
-
+2. **Option:** In the Option setting, we provide an additional option (e.g., None of the above) to withhold answers.
 
-
+3. **Instruction:** In the Instruction setting, we provide an additional instruction (e.g., If all the options are incorrect, answer F. None of the above.) to withhold answers.
 
 
+
+## Evaluation Metrics
 We evaluate the performance of VLMs on MM-UPD Bench using the following metrics:
 1. **Dual accuracy:** The accuracy on standard-UPD pairs, where we count
 success only if the model is correct on both the standard and UPD questions.
 
 2. **Standard accuracy:** The accuracy on standard questions.
 
-3. **UPD (AAD/IASD/IVQD) accuracy:** The accuracy of AAD/IASD/IVQD questions.
-
+3. **UPD (AAD/IASD/IVQD) accuracy:** The accuracy of AAD/IASD/IVQD questions.
 
 """
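The three evaluation settings described in the committed text (Base, Option, Instruction) could be sketched as prompt variants along the following lines. This is a minimal illustrative sketch, not the benchmark's actual prompt templates: the `build_prompt` helper, its option lettering, and the exact wording are assumptions.

```python
# Hypothetical sketch of how the Base / Option / Instruction settings
# might modify a multiple-choice prompt. Not the official MM-UPD code.

def build_prompt(question: str, options: list[str], setting: str) -> str:
    letters = "ABCDEFGH"
    opts = list(options)
    if setting == "option":
        # Option setting: append an extra choice that lets the model
        # withhold an answer.
        opts.append("None of the above")
    lines = [question]
    lines += [f"{letters[i]}. {o}" for i, o in enumerate(opts)]
    if setting == "instruction":
        # Instruction setting: add an explicit instruction to withhold,
        # pointing at the next unused option letter.
        next_letter = letters[len(opts)]
        lines.append(
            f"If all the options are incorrect, "
            f"answer {next_letter}. None of the above."
        )
    # Base setting: no extra option and no withholding instruction.
    return "\n".join(lines)

print(build_prompt("What animal is in the image?", ["Cat", "Dog"], "option"))
```

In the Base setting the model must withhold spontaneously; the other two settings make withholding progressively more explicit, which is why the three are reported separately.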
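The three metrics described above can be sketched as simple ratios over paired per-question results. This is a minimal sketch, not the official evaluation code; the boolean-list result format is an assumption for illustration.

```python
# Illustrative computation of the three MM-UPD metrics, assuming each
# index i pairs a standard question with its UPD counterpart and the
# lists record whether the model answered each correctly (made-up data).
standard_correct = [True, True, False, True]
upd_correct      = [True, False, False, True]

pairs = list(zip(standard_correct, upd_correct))
# Dual accuracy: a pair counts as a success only if the model is
# correct on BOTH the standard and the UPD question.
dual_acc = sum(s and u for s, u in pairs) / len(pairs)
# Standard / UPD accuracy: plain per-question-type accuracy.
standard_acc = sum(standard_correct) / len(standard_correct)
upd_acc = sum(upd_correct) / len(upd_correct)

print(dual_acc, standard_acc, upd_acc)  # 0.5 0.75 0.5
```

Note that dual accuracy is never higher than either individual accuracy, which is why it is the strictest of the three headline numbers.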