lixuejing committed
Commit f1caf55 · Parent(s): 5c4d8e2
update

src/about.py CHANGED (+10 -10)
@@ -61,14 +61,14 @@ FlagEval-VLM Leaderboard is a Visual Large Language Leaderboard, and we hope to
 We evaluate models on 9 key benchmarks using the https://github.com/flageval-baai/FlagEvalMM , FlagEvalMM is an open-source evaluation framework designed to comprehensively assess multimodal models. It provides a standardized way to evaluate models that work with multiple modalities (text, images, video) across various tasks and metrics.

 - <a href="https://github.com/vis-nlp/ChartQA" target="_blank"> ChartQA </a> - a large-scale benchmark covering 9.6K manually written questions and 23.1K questions generated from manually written chart summaries.
-- Blink- a benchmark containing 14 visual perception tasks that can be solved by humans “within a blink”.
-- CMMU- a benchmark for Chinese multi-modal multi-type question understanding and reasoning
-- CMMMU-a new Chinese Massive Multi-discipline Multimodal Understanding benchmark designed to evaluate LMMs on tasks demanding college-level subject knowledge and deliberate reasoning in a Chinese context.
-- MMMU -a massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI.
-- MMMU_Pro
-- OCRBench- a comprehensive evaluation benchmark designed to assess the OCR capabilities of Large Multimodal Models.
-- MathVision- a meticulously curated collection of 3,040 high-quality mathematical problems with visual contexts sourced from real math competitions.
-- CII-Bench-a new benchmark measuring the higher-order perceptual, reasoning and comprehension abilities of MLLMs when presented with complex Chinese implication images.
+- <a href="https://huggingface.co/datasets/BLINK-Benchmark/BLINK"> Blink </a> - a benchmark containing 14 visual perception tasks that can be solved by humans “within a blink”.
+- <a href="https://github.com/flageval-baai/CMMU"> CMMU </a> - a benchmark for Chinese multi-modal multi-type question understanding and reasoning
+- <a href="https://cmmmu-benchmark.github.io/"> CMMMU </a> - a new Chinese Massive Multi-discipline Multimodal Understanding benchmark designed to evaluate LMMs on tasks demanding college-level subject knowledge and deliberate reasoning in a Chinese context.
+- <a href="https://mmmu-benchmark.github.io/"> MMMU </a> - a massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI.
+- <a href="https://huggingface.co/datasets/MMMU/MMMU_Pro"> MMMU_Pro (standard & vision) </a> - a more robust multi-discipline multimodal understanding benchmark.
+- <a href="https://github.com/Yuliang-Liu/MultimodalOCR"> OCRBench </a> - a comprehensive evaluation benchmark designed to assess the OCR capabilities of Large Multimodal Models.
+- <a href="https://mathvision-cuhk.github.io/"> MathVision </a> - a meticulously curated collection of 3,040 high-quality mathematical problems with visual contexts sourced from real math competitions.
+- <a href="https://cii-bench.github.io/"> CII-Bench </a> - a new benchmark measuring the higher-order perceptual, reasoning and comprehension abilities of MLLMs when presented with complex Chinese implication images.

 For all these evaluations, a higher score is a better score.
 Accuracy will be used as the evaluation metric, and it will primarily be calculated according to the methodology outlined in the original paper.
@@ -81,13 +81,13 @@ You can find:
 ## Reproducibility

 An example of llava with vllm as backend:
-flagevalmm --tasks tasks/mmmu/mmmu_val.py \
+`flagevalmm --tasks tasks/mmmu/mmmu_val.py \
 --exec model_zoo/vlm/api_model/model_adapter.py \
 --model llava-hf/llava-onevision-qwen2-7b-ov-chat-hf \
 --num-workers 8 \
 --output-dir ./results/llava-onevision-qwen2-7b-ov-chat-hf \
 --backend vllm \
---extra-args "--limit-mm-per-prompt image=10 --max-model-len 32768"
+--extra-args "--limit-mm-per-prompt image=10 --max-model-len 32768"`

 ## Icons
 - 🟢 : pretrained model: new, base models, trained on a given corpora
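The text touched by this commit states that accuracy is the metric for all nine benchmarks, with higher scores being better. As a minimal illustration of what exact-match accuracy means here, the sketch below computes it over aligned prediction/reference lists; this is not FlagEvalMM's actual implementation, and the function name, normalization, and example data are assumptions for clarity only.

```python
# Illustrative sketch only: exact-match accuracy as described in src/about.py.
# NOT FlagEvalMM's code; names and normalization are assumptions.
from typing import Iterable


def exact_match_accuracy(predictions: Iterable[str], references: Iterable[str]) -> float:
    """Fraction of predictions matching their reference after trivial
    case/whitespace normalization. Higher is better, per the leaderboard text."""
    preds = [p.strip().lower() for p in predictions]
    refs = [r.strip().lower() for r in references]
    if not refs or len(preds) != len(refs):
        raise ValueError("predictions and references must be non-empty and aligned")
    correct = sum(p == r for p, r in zip(preds, refs))
    return correct / len(refs)


# Example: 2 of 3 answers match, so accuracy is ~0.667.
print(exact_match_accuracy(["A", "b ", "C"], ["A", "B", "D"]))
```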