AtsuMiyai committed
Commit 65d061e (1 parent: de6b903)

update explanations on MM-UPD Bench

Files changed (1): constants.py (+10 -8)
constants.py CHANGED
@@ -77,7 +77,6 @@ SUBMIT_INTRODUCTION = """# Submit on MM-UPD Benchmark Introduction
 """
 
 
-
 LEADERBORAD_INFO = """
 ## What is MM-UPD Bench?
 MM-UPD Bench: A Comprehensive Benchmark for Evaluating the Trustworthiness of Vision Language Models (VLMs) in the Context of Unsolvable Problem Detection (UPD)
@@ -96,24 +95,27 @@ MM-IASD tests the model's capability to recognize when the answer set is incompa
 MM-IVQD Bench is a dataset where the question is incompatible with the image.
 MM-IVQD evaluates the VLMs' capability to discern when a question and image are irrelevant or inappropriate.
 
+We carefully decompose each benchmark into various abilities to reveal each model's strengths and weaknesses.
+
 
-## Characteristics of MM-UPD Bench
-We design MM-UPD Bench to provide a comprehensive evaluation of VLMs across multiple senarios.
+## Evaluation Scenario
+We evaluate the performance of VLMs on MM-UPD Bench under the following settings:
+1. **Base:** In the Base setting, we do not provide any instruction to withhold answers.
 
-1\. **Multiple Senario Evaluation:** We carefully design prompts choices and examine the three senario: (i) Base (w/o instruction), (ii) Option (w/ additional option), (iii) Instruction (w/ additional instruction).
+2. **Option:** In the Option setting, we provide an additional option (e.g., "None of the above") to withhold answers.
 
-2\. **Ability-Wise Evaluation:** We carefully decompose each benchmark into various abilities to reveal individual model's strengths and weaknesses.
+3. **Instruction:** In the Instruction setting, we provide an additional instruction (e.g., "If all the options are incorrect, answer F. None of the above.") to withhold answers.
 
 
-## About Evaluation Metrics
+
+## Evaluation Metrics
 We evaluate the performance of VLMs on MM-UPD Bench using the following metrics:
 1. **Dual accuracy:** The accuracy on standard-UPD pairs, where we count
 success only if the model is correct on both the standard and UPD questions.
 
 2. **Standard accuracy:** The accuracy on standard questions.
 
-3. **UPD (AAD/IASD/IVQD) accuracy:** The accuracy of AAD/IASD/IVQD questions.
-
+3. **UPD (AAD/IASD/IVQD) accuracy:** The accuracy on AAD/IASD/IVQD questions.
 
 """
 
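As context for the "Evaluation Scenario" text added above, here is a minimal Python sketch of how the three settings might be rendered as multiple-choice prompts. The helper name `build_prompt`, the option lettering, and the exact prompt wording are illustrative assumptions, not code from this repository.

```python
# Illustrative sketch only: how the Base / Option / Instruction settings
# described in the diff could be rendered as prompts. Names and wording
# are assumptions, not MM-UPD's actual implementation.

def build_prompt(question: str, options: list[str], setting: str = "base") -> str:
    """Render one multiple-choice query under a given evaluation setting."""
    letters = "ABCDEFGH"
    options = list(options)
    if setting == "option":
        # Option setting: append an explicit escape option.
        options.append("None of the above")
    lines = [question]
    lines += [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]
    if setting == "instruction":
        # Instruction setting: tell the model how to withhold an answer,
        # using the first unused option letter (e.g., "F" after five options).
        escape = letters[len(options)]
        lines.append(f"If all the options are incorrect, answer {escape}. None of the above.")
    return "\n".join(lines)


if __name__ == "__main__":
    q = "What is shown in the image?"
    opts = ["A cat", "A dog", "A horse"]
    for setting in ("base", "option", "instruction"):
        print(f"--- {setting} ---")
        print(build_prompt(q, opts, setting))
```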
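Likewise, a minimal sketch of the three metrics defined above, assuming each standard question is paired with its UPD (AAD/IASD/IVQD) variant; the pair schema and field names are hypothetical, not the leaderboard's actual format.

```python
# Illustrative sketch only: dual / standard / UPD accuracy over
# standard-UPD pairs. The dict schema is an assumption for this example.

def score(pairs: list[dict]) -> dict:
    """Each pair records whether the model answered the standard question
    and its UPD counterpart correctly."""
    n = len(pairs)
    standard = sum(p["standard_correct"] for p in pairs) / n
    upd = sum(p["upd_correct"] for p in pairs) / n
    # Dual accuracy: credit a pair only when BOTH answers are correct.
    dual = sum(p["standard_correct"] and p["upd_correct"] for p in pairs) / n
    return {"dual_acc": dual, "standard_acc": standard, "upd_acc": upd}


if __name__ == "__main__":
    demo = [
        {"standard_correct": True,  "upd_correct": True},   # counts toward dual
        {"standard_correct": True,  "upd_correct": False},  # standard only
        {"standard_correct": False, "upd_correct": True},   # UPD only
    ]
    # dual_acc = 1/3, standard_acc = 2/3, upd_acc = 2/3
    print(score(demo))
```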