AtsuMiyai committed · 65d061e
Parent(s): de6b903

update explanations on MM-UPD Bench

Files changed: constants.py (+10 -8)
constants.py CHANGED

@@ -77,7 +77,6 @@ SUBMIT_INTRODUCTION = """# Submit on MM-UPD Benchmark Introduction
 """
 
 
-
 LEADERBORAD_INFO = """
 ## What is MM-UPD Bench?
 MM-UPD Bench: A Comprehensive Benchmark for Evaluating the Trustworthiness of Vision Language Models (VLMs) in the Context of Unsolvable Problem Detection (UPD)
@@ -96,24 +95,27 @@ MM-IASD tests the model's capability to recognize when the answer set is incompa
 MM-IVQD Bench is a dataset where the question is incompatible with the image.
 MM-IVQD evaluates the VLMs' capability to discern when a question and image are irrelevant or inappropriate.
 
-##
-We
+We carefully decompose each benchmark into various abilities to reveal individual models' strengths and weaknesses.
+
 
-
+## Evaluation Scenario
+We evaluate the performance of VLMs on MM-UPD Bench using the following settings:
+1. **Base:** In the Base setting, we do not provide any instruction to withhold answers.
 
-
+2. **Option:** In the Option setting, we provide an additional option (e.g., None of the above) to withhold answers.
 
-
+3. **Instruction:** In the Instruction setting, we provide an additional instruction (e.g., If all the options are incorrect, answer F. None of the above.) to withhold answers.
 
 
+
+## Evaluation Metrics
 We evaluate the performance of VLMs on MM-UPD Bench using the following metrics:
 1. **Dual accuracy:** The accuracy on standard-UPD pairs, where we count
 success only if the model is correct on both the standard and UPD questions.
 
 2. **Standard accuracy:** The accuracy on standard questions.
 
-3. **UPD (AAD/IASD/IVQD) accuracy:** The accuracy of AAD/IASD/IVQD questions.
-
+3. **UPD (AAD/IASD/IVQD) accuracy:** The accuracy of AAD/IASD/IVQD questions.
 
 """
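The three evaluation settings described in the committed text (Base, Option, Instruction) could be sketched as prompt variants along the following lines. This is a minimal illustrative sketch, not the benchmark's actual prompt templates: the `build_prompt` helper, its option lettering, and the exact wording are assumptions.

```python
# Hypothetical sketch of how the Base / Option / Instruction settings
# might modify a multiple-choice prompt. Not the official MM-UPD code.

def build_prompt(question: str, options: list[str], setting: str) -> str:
    letters = "ABCDEFGH"
    opts = list(options)
    if setting == "option":
        # Option setting: append an extra choice that lets the model
        # withhold an answer.
        opts.append("None of the above")
    lines = [question]
    lines += [f"{letters[i]}. {o}" for i, o in enumerate(opts)]
    if setting == "instruction":
        # Instruction setting: add an explicit instruction to withhold,
        # pointing at the next unused option letter.
        next_letter = letters[len(opts)]
        lines.append(
            f"If all the options are incorrect, "
            f"answer {next_letter}. None of the above."
        )
    # Base setting: no extra option and no withholding instruction.
    return "\n".join(lines)

print(build_prompt("What animal is in the image?", ["Cat", "Dog"], "option"))
```

In the Base setting the model must withhold spontaneously; the other two settings make withholding progressively more explicit, which is why the three are reported separately.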
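The three metrics described above can be sketched as simple ratios over paired per-question results. This is a minimal sketch, not the official evaluation code; the boolean-list result format is an assumption for illustration.

```python
# Illustrative computation of the three MM-UPD metrics, assuming each
# index i pairs a standard question with its UPD counterpart and the
# lists record whether the model answered each correctly (made-up data).
standard_correct = [True, True, False, True]
upd_correct      = [True, False, False, True]

pairs = list(zip(standard_correct, upd_correct))
# Dual accuracy: a pair counts as a success only if the model is
# correct on BOTH the standard and the UPD question.
dual_acc = sum(s and u for s, u in pairs) / len(pairs)
# Standard / UPD accuracy: plain per-question-type accuracy.
standard_acc = sum(standard_correct) / len(standard_correct)
upd_acc = sum(upd_correct) / len(upd_correct)

print(dual_acc, standard_acc, upd_acc)  # 0.5 0.75 0.5
```

Note that dual accuracy is never higher than either individual accuracy, which is why it is the strictest of the three headline numbers.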