AtsuMiyai committed
Commit e8e50f1
Parent(s): 03f13cc

update explanations on MM-UPD Bench

Files changed (2)
  1. app.py +11 -15
  2. constants.py +29 -3
app.py CHANGED
```diff
@@ -316,15 +316,6 @@ with block:
     with gr.Tabs(elem_classes="tab-buttons") as tabs:
         # table mmupd bench
         with gr.TabItem("🏅 MM-AAD Benchmark", elem_id="mmaad-benchmark-tab-table", id=1):
-            with gr.Row():
-                with gr.Accordion("Citation", open=False):
-                    citation_button = gr.Textbox(
-                        value=CITATION_BUTTON_TEXT,
-                        label=CITATION_BUTTON_LABEL,
-                        elem_id="citation-button",
-                        show_copy_button=True,
-                    )
-
             # selection for column part:
             checkbox_aad_group = gr.CheckboxGroup(
                 choices=TASK_AAD_INFO,
@@ -411,8 +402,6 @@ with block:
             question_type.change(fn=on_filter_model_size_method_change, inputs=[model_size, question_type, checkbox_aad_group], outputs=data_component_aad)
             checkbox_aad_group.change(fn=on_filter_model_size_method_change, inputs=[model_size, question_type, checkbox_aad_group], outputs=data_component_aad)
 
-
-        with gr.TabItem("🏅 MM-IASD Benchmark", elem_id="mmiasd-benchmark-tab-table", id=2):
             with gr.Row():
                 with gr.Accordion("Citation", open=False):
                     citation_button = gr.Textbox(
@@ -422,6 +411,7 @@ with block:
                         show_copy_button=True,
                     )
 
+        with gr.TabItem("🏅 MM-IASD Benchmark", elem_id="mmiasd-benchmark-tab-table", id=2):
             checkbox_iasd_group = gr.CheckboxGroup(
                 choices=TASK_IASD_INFO,
                 value=AVG_INFO,
@@ -505,8 +495,6 @@ with block:
             question_type.change(fn=on_filter_model_size_method_change, inputs=[model_size, question_type, checkbox_iasd_group], outputs=data_component_iasd)
             checkbox_iasd_group.change(fn=on_filter_model_size_method_change, inputs=[model_size, question_type, checkbox_iasd_group], outputs=data_component_iasd)
 
-        # Table 3
-        with gr.TabItem("🏅 MM-IVQD Benchmark", elem_id="mmiasd-benchmark-tab-table", id=3):
             with gr.Row():
                 with gr.Accordion("Citation", open=False):
                     citation_button = gr.Textbox(
@@ -516,6 +504,9 @@ with block:
                         show_copy_button=True,
                     )
 
+        # Table 3
+        with gr.TabItem("🏅 MM-IVQD Benchmark", elem_id="mmiasd-benchmark-tab-table", id=3):
+            with gr.Row():
             # selection for column part:
             checkbox_ivqd_group = gr.CheckboxGroup(
                 choices=TASK_IVQD_INFO,
@@ -599,6 +590,13 @@ with block:
             question_type.change(fn=on_filter_model_size_method_change, inputs=[model_size, question_type, checkbox_ivqd_group], outputs=data_component_ivqd)
             checkbox_ivqd_group.change(fn=on_filter_model_size_method_change, inputs=[model_size, question_type, checkbox_ivqd_group], outputs=data_component_ivqd)
 
+            with gr.Accordion("Citation", open=False):
+                citation_button = gr.Textbox(
+                    value=CITATION_BUTTON_TEXT,
+                    label=CITATION_BUTTON_LABEL,
+                    elem_id="citation-button",
+                    show_copy_button=True,
+                )
 
         # table 4
         with gr.TabItem("📝 About", elem_id="mmupd-benchmark-tab-table", id=4):
@@ -606,8 +604,6 @@ with block:
 
         # table 5
         with gr.TabItem("🚀 Submit here! ", elem_id="mmupd-benchmark-tab-table", id=5):
-            gr.Markdown(LEADERBORAD_INTRODUCTION, elem_classes="markdown-text")
-
             with gr.Row():
                 gr.Markdown(SUBMIT_INTRODUCTION, elem_classes="markdown-text")
 
```
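In effect, the app.py hunks move each benchmark tab's Citation accordion from the top of the tab to the bottom, after the filter controls, and drop the duplicated leaderboard introduction from the Submit tab. Below is a minimal, self-contained sketch of the per-tab layout this converges on; it is a reduction for illustration, not the full app, and the `CITATION_BUTTON_*` values and checkbox choices are placeholders standing in for the real constants.

```python
# Minimal sketch of the post-commit tab layout (assumption: reduced from
# app.py; placeholder constants, no data tables or event wiring).
import gradio as gr

CITATION_BUTTON_LABEL = "Copy the BibTeX below to cite these results"  # placeholder
CITATION_BUTTON_TEXT = "@article{...}"  # placeholder

with gr.Blocks() as block:
    with gr.Tabs(elem_classes="tab-buttons") as tabs:
        with gr.TabItem("🏅 MM-AAD Benchmark", elem_id="mmaad-benchmark-tab-table", id=1):
            # Filter controls come first ...
            checkbox_aad_group = gr.CheckboxGroup(
                choices=["Overall", "Attribute", "Object"],  # placeholder abilities
                value=["Overall"],
                label="Evaluation dimension",
            )
            # ... and the Citation accordion is now the last element in the tab.
            with gr.Row():
                with gr.Accordion("Citation", open=False):
                    citation_button = gr.Textbox(
                        value=CITATION_BUTTON_TEXT,
                        label=CITATION_BUTTON_LABEL,
                        elem_id="citation-button",
                        show_copy_button=True,
                    )

if __name__ == "__main__":
    block.launch()
```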
constants.py CHANGED
```diff
@@ -35,11 +35,37 @@ LEADERBORAD_INTRODUCTION = """
     <a href='https://arxiv.org/abs/2403.20331'><img src='https://img.shields.io/badge/cs.CV-Paper-b31b1b?logo=arxiv&logoColor=red'></a>
 </div>
 
+## About MM-UPD Bench
+### What is MM-UPD Bench?
+MM-UPD Bench is a comprehensive benchmark for evaluating the trustworthiness of Vision Language Models (VLMs) in the context of Unsolvable Problem Detection (UPD).
+Our MM-UPD Bench encompasses three benchmarks: MM-AAD, MM-IASD, and MM-IVQD.
+- **MM-AAD:** Benchmark for Absent Answer Detection (AAD). MM-AAD Bench is a dataset where the correct answer
+option for each question is removed. MM-AAD tests the model's capability
+to recognize when the correct answer is absent from the provided choices.
+
+- **MM-IASD:** Benchmark for Incompatible Answer Set Detection (IASD). MM-IASD Bench is a dataset where the answer set
+is completely incompatible with the context specified by the question and the image.
+MM-IASD tests the model's capability to recognize when the answer set is incompatible with the context.
+- **MM-IVQD:** Benchmark for Incompatible Visual Question Detection (IVQD). MM-IVQD Bench is a dataset where the question is incompatible with the image.
+MM-IVQD evaluates the VLMs' capability to discern when a question and image are irrelevant or
+inappropriate.
+
+### Characteristics of MM-UPD Bench
+We design MM-UPD Bench to provide a comprehensive evaluation of VLMs across multiple scenarios.
 - **Multiple Scenario Evaluation:** We carefully design prompts and choices and examine three scenarios: (i) base (no instruction), (ii) option (add an additional option), (iii) instruction (add an instruction).
-- **Ability-wise Evaluation:** We carefully decompose each benchmark into more than 10 abilities to reveal each model's strengths and weaknesses.
-- **Valuable Insights:** MM-UPD Bench provides multi-perspective insights on trustworthiness and reliability for the community.
+- **Ability-Wise Evaluation:** We carefully decompose each benchmark into more than 10 abilities to reveal each model's strengths and weaknesses.
+- **Valuable Insights:** MM-UPD Bench provides multi-perspective insights on trustworthiness and reliability for the community.
 
-Please follow the instructions in [UPD](https://github.com/AtsuMiyai/UPD) to upload the generated `result_dual.json` file here. After clicking the `Submit Eval` button, click the `Refresh` button.
+
+## About Evaluation Metrics
+We evaluate the performance of VLMs on MM-UPD Bench using the following metrics:
+- **Dual accuracy:** The accuracy on standard-UPD pairs, where we count
+success only if the model is correct on both the standard and UPD questions.
+- **Standard accuracy:** The accuracy on standard questions.
+- **UPD (AAD/IASD/IVQD) accuracy:** The accuracy on AAD/IASD/IVQD questions.
+
+
+Please follow the instructions in [UPD](https://github.com/AtsuMiyai/UPD) to upload the generated JSON file here. After clicking the `Submit Eval` button, click the `Refresh` button.
 """
 
 
```
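The three evaluation scenarios named in the new constants.py text (base, option, instruction) differ only in how the multiple-choice prompt is assembled. The sketch below illustrates that assembly; the "None of the above" option text and the instruction wording are assumptions made for this example, not copied from the UPD codebase, so consult the linked repository for the actual templates.

```python
# Hypothetical prompt assembly for the three scenarios described above.
# The escape-option text and instruction wording are assumptions.
def build_prompt(question: str, options: list[str], scenario: str) -> str:
    letters = "ABCDEFGH"
    opts = list(options)
    if scenario == "option":
        # (ii) option: add an additional escape option to the choice list
        opts.append("None of the above")
    lines = [question] + [f"{letters[i]}. {opt}" for i, opt in enumerate(opts)]
    if scenario == "instruction":
        # (iii) instruction: append an instruction permitting withheld answers
        lines.append("If none of the options are correct, answer 'None of the above'.")
    return "\n".join(lines)  # (i) base: question and options only

print(build_prompt("What is shown in the image?", ["cat", "dog"], "option"))
```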
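The metric definitions added to constants.py are straightforward to operationalize. The sketch below illustrates only the arithmetic, over a hypothetical list of per-pair results; it is not the official UPD evaluator, and the field names are invented for the example.

```python
# Illustration of the three accuracies described above (hypothetical
# per-pair records; not the official UPD evaluation code).
def compute_accuracies(pairs):
    """pairs: list of dicts with booleans 'standard_correct' and 'upd_correct'."""
    n = len(pairs)
    standard_acc = sum(p["standard_correct"] for p in pairs) / n
    upd_acc = sum(p["upd_correct"] for p in pairs) / n
    # Dual accuracy counts a pair as a success only if the model answers
    # BOTH the standard question and its UPD counterpart correctly.
    dual_acc = sum(p["standard_correct"] and p["upd_correct"] for p in pairs) / n
    return {"standard_acc": standard_acc, "upd_acc": upd_acc, "dual_acc": dual_acc}

print(compute_accuracies([
    {"standard_correct": True,  "upd_correct": True},   # counts toward dual
    {"standard_correct": True,  "upd_correct": False},  # standard only
    {"standard_correct": False, "upd_correct": True},   # UPD only
    {"standard_correct": True,  "upd_correct": True},
]))
# -> {'standard_acc': 0.75, 'upd_acc': 0.75, 'dual_acc': 0.5}
```

Dual accuracy is the strictest of the three, which is why it is the headline number: a model cannot score well by always answering (high standard accuracy) or by always abstaining (high UPD accuracy) alone.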