lixuejing committed
Commit f1caf55 · Parent(s): 5c4d8e2
update

src/about.py CHANGED (+10 -10)
@@ -61,14 +61,14 @@ FlagEval-VLM Leaderboard is a Visual Large Language Leaderboard, and we hope to
 We evaluate models on 9 key benchmarks using the https://github.com/flageval-baai/FlagEvalMM , FlagEvalMM is an open-source evaluation framework designed to comprehensively assess multimodal models. It provides a standardized way to evaluate models that work with multiple modalities (text, images, video) across various tasks and metrics.

 - <a href="https://github.com/vis-nlp/ChartQA" target="_blank"> ChartQA </a> - a large-scale benchmark covering 9.6K manually written questions and 23.1K questions generated from manually written chart summaries.
-- Blink- a benchmark containing 14 visual perception tasks that can be solved by humans “within a blink”.
-- CMMU- a benchmark for Chinese multi-modal multi-type question understanding and reasoning
-- CMMMU-a new Chinese Massive Multi-discipline Multimodal Understanding benchmark designed to evaluate LMMs on tasks demanding college-level subject knowledge and deliberate reasoning in a Chinese context.
-- MMMU -a massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI.
-- MMMU_Pro
-- OCRBench- a comprehensive evaluation benchmark designed to assess the OCR capabilities of Large Multimodal Models.
-- MathVision- a meticulously curated collection of 3,040 high-quality mathematical problems with visual contexts sourced from real math competitions.
-- CII-Bench-a new benchmark measuring the higher-order perceptual, reasoning and comprehension abilities of MLLMs when presented with complex Chinese implication images.
+- <a href="https://huggingface.co/datasets/BLINK-Benchmark/BLINK"> Blink </a> - a benchmark containing 14 visual perception tasks that can be solved by humans “within a blink”.
+- <a href="https://github.com/flageval-baai/CMMU"> CMMU </a> - a benchmark for Chinese multi-modal multi-type question understanding and reasoning
+- <a href="https://cmmmu-benchmark.github.io/"> CMMMU </a> - a new Chinese Massive Multi-discipline Multimodal Understanding benchmark designed to evaluate LMMs on tasks demanding college-level subject knowledge and deliberate reasoning in a Chinese context.
+- <a href="https://mmmu-benchmark.github.io/"> MMMU </a> - a massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI.
+- <a href="https://huggingface.co/datasets/MMMU/MMMU_Pro"> MMMU_Pro (standard & vision) </a> - a more robust multi-discipline multimodal understanding benchmark.
+- <a href="https://github.com/Yuliang-Liu/MultimodalOCR"> OCRBench </a> - a comprehensive evaluation benchmark designed to assess the OCR capabilities of Large Multimodal Models.
+- <a href="https://mathvision-cuhk.github.io/"> MathVision </a> - a meticulously curated collection of 3,040 high-quality mathematical problems with visual contexts sourced from real math competitions.
+- <a href="https://cii-bench.github.io/"> CII-Bench </a> - a new benchmark measuring the higher-order perceptual, reasoning and comprehension abilities of MLLMs when presented with complex Chinese implication images.

 For all these evaluations, a higher score is a better score.
 Accuracy will be used as the evaluation metric, and it will primarily be calculated according to the methodology outlined in the original paper.
@@ -81,13 +81,13 @@ You can find:
 ## Reproducibility

 An example of llava with vllm as backend:
-flagevalmm --tasks tasks/mmmu/mmmu_val.py \
+`flagevalmm --tasks tasks/mmmu/mmmu_val.py \
 --exec model_zoo/vlm/api_model/model_adapter.py \
 --model llava-hf/llava-onevision-qwen2-7b-ov-chat-hf \
 --num-workers 8 \
 --output-dir ./results/llava-onevision-qwen2-7b-ov-chat-hf \
 --backend vllm \
---extra-args "--limit-mm-per-prompt image=10 --max-model-len 32768"
+--extra-args "--limit-mm-per-prompt image=10 --max-model-len 32768"`

 ## Icons
 - 🟢 : pretrained model: new, base models, trained on a given corpora
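The text touched by this commit states that accuracy is the metric for all nine benchmarks, with higher scores being better. As a minimal illustration of what exact-match accuracy means here, the sketch below computes it over aligned prediction/reference lists; this is not FlagEvalMM's actual implementation, and the function name, normalization, and example data are assumptions for clarity only.

```python
# Illustrative sketch only: exact-match accuracy as described in src/about.py.
# NOT FlagEvalMM's code; names and normalization are assumptions.
from typing import Iterable


def exact_match_accuracy(predictions: Iterable[str], references: Iterable[str]) -> float:
    """Fraction of predictions matching their reference after trivial
    case/whitespace normalization. Higher is better, per the leaderboard text."""
    preds = [p.strip().lower() for p in predictions]
    refs = [r.strip().lower() for r in references]
    if not refs or len(preds) != len(refs):
        raise ValueError("predictions and references must be non-empty and aligned")
    correct = sum(p == r for p, r in zip(preds, refs))
    return correct / len(refs)


# Example: 2 of 3 answers match, so accuracy is ~0.667.
print(exact_match_accuracy(["A", "b ", "C"], ["A", "B", "D"]))
```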