lixuejing committed
Commit f1caf55 · 1 Parent(s): 5c4d8e2
Files changed (1):
  1. src/about.py +10 -10
src/about.py CHANGED
@@ -61,14 +61,14 @@ FlagEval-VLM Leaderboard is a Visual Large Language Leaderboard, and we hope to
  We evaluate models on 9 key benchmarks using https://github.com/flageval-baai/FlagEvalMM. FlagEvalMM is an open-source evaluation framework designed to comprehensively assess multimodal models. It provides a standardized way to evaluate models that work with multiple modalities (text, images, video) across various tasks and metrics.
 
  - <a href="https://github.com/vis-nlp/ChartQA" target="_blank"> ChartQA </a> - a large-scale benchmark covering 9.6K manually written questions and 23.1K questions generated from manually written chart summaries.
- - Blink- a benchmark containing 14 visual perception tasks that can be solved by humans “within a blink”.
- - CMMU- a benchmark for Chinese multi-modal multi-type question understanding and reasoning
- - CMMMU-a new Chinese Massive Multi-discipline Multimodal Understanding benchmark designed to evaluate LMMs on tasks demanding college-level subject knowledge and deliberate reasoning in a Chinese context.
- - MMMU -a massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI.
- - MMMU_Prostandard & vision- a more robust multi-discipline multimodal understanding benchmark.
- - OCRBench- a comprehensive evaluation benchmark designed to assess the OCR capabilities of Large Multimodal Models.
- - MathVision- a meticulously curated collection of 3,040 high-quality mathematical problems with visual contexts sourced from real math competitions.
- - CII-Bench-a new benchmark measuring the higher-order perceptual, reasoning and comprehension abilities of MLLMs when presented with complex Chinese implication images.
+ - <a href="https://huggingface.co/datasets/BLINK-Benchmark/BLINK"> Blink </a> - a benchmark containing 14 visual perception tasks that can be solved by humans “within a blink”.
+ - <a href="https://github.com/flageval-baai/CMMU"> CMMU </a> - a benchmark for Chinese multi-modal multi-type question understanding and reasoning.
+ - <a href="https://cmmmu-benchmark.github.io/"> CMMMU </a> - a new Chinese Massive Multi-discipline Multimodal Understanding benchmark designed to evaluate LMMs on tasks demanding college-level subject knowledge and deliberate reasoning in a Chinese context.
+ - <a href="https://mmmu-benchmark.github.io/"> MMMU </a> - a massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI.
+ - <a href="https://huggingface.co/datasets/MMMU/MMMU_Pro"> MMMU_Pro (standard & vision) </a> - a more robust multi-discipline multimodal understanding benchmark.
+ - <a href="https://github.com/Yuliang-Liu/MultimodalOCR"> OCRBench </a> - a comprehensive evaluation benchmark designed to assess the OCR capabilities of Large Multimodal Models.
+ - <a href="https://mathvision-cuhk.github.io/"> MathVision </a> - a meticulously curated collection of 3,040 high-quality mathematical problems with visual contexts sourced from real math competitions.
+ - <a href="https://cii-bench.github.io/"> CII-Bench </a> - a new benchmark measuring the higher-order perceptual, reasoning and comprehension abilities of MLLMs when presented with complex Chinese implication images.
 
  For all these evaluations, a higher score is a better score.
  Accuracy will be used as the evaluation metric, calculated primarily according to the methodology outlined in each benchmark's original paper.
@@ -81,13 +81,13 @@ You can find:
  ## Reproducibility
 
  An example of running LLaVA with vLLM as the backend:
- flagevalmm --tasks tasks/mmmu/mmmu_val.py \
+ `flagevalmm --tasks tasks/mmmu/mmmu_val.py \
  --exec model_zoo/vlm/api_model/model_adapter.py \
  --model llava-hf/llava-onevision-qwen2-7b-ov-chat-hf \
  --num-workers 8 \
  --output-dir ./results/llava-onevision-qwen2-7b-ov-chat-hf \
  --backend vllm \
- --extra-args "--limit-mm-per-prompt image=10 --max-model-len 32768"
+ --extra-args "--limit-mm-per-prompt image=10 --max-model-len 32768"`
 
  ## Icons
  - 🟢 : pretrained model: new, base models, trained on a given corpus
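A note on the metric mentioned in the first hunk: the sketch below illustrates the plain exact-match accuracy (correct answers divided by total questions) that the leaderboard scores represent. It is illustrative only and is not FlagEvalMM's actual scoring code; per-benchmark details such as answer extraction and matching rules follow each benchmark's original paper, and the helper name `exact_match_accuracy` is hypothetical.

```python
# Illustrative only - not FlagEvalMM's implementation. It shows the basic
# "accuracy = correct / total" computation behind the reported scores.
from typing import Sequence


def exact_match_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Return the fraction of predictions that exactly match their references."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)


# Toy usage: 2 of 3 answers match, so the score is ~0.667.
print(exact_match_accuracy(["A", "C", "B"], ["A", "B", "B"]))
```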
 
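To script the reproduction command shown in the second hunk (for example, to sweep over several models), a minimal sketch is given below. It assumes `flagevalmm` is installed and on PATH and that vLLM is available; the flags are copied verbatim from the documented example, and nothing beyond that command is assumed about the FlagEvalMM CLI.

```python
# Minimal sketch: launch the documented reproduction command via Python.
# Assumes `flagevalmm` is on PATH and vLLM is installed; the flags are
# taken verbatim from the example in the diff above.
import subprocess

cmd = [
    "flagevalmm",
    "--tasks", "tasks/mmmu/mmmu_val.py",
    "--exec", "model_zoo/vlm/api_model/model_adapter.py",
    "--model", "llava-hf/llava-onevision-qwen2-7b-ov-chat-hf",
    "--num-workers", "8",
    "--output-dir", "./results/llava-onevision-qwen2-7b-ov-chat-hf",
    "--backend", "vllm",
    "--extra-args", "--limit-mm-per-prompt image=10 --max-model-len 32768",
]

# check=True raises CalledProcessError if the evaluation exits non-zero.
subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell-quoting issues with the nested `--extra-args` string.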