---
license: apache-2.0
datasets:
  - liuhaotian/LLaVA-Pretrain
  - lmms-lab/LLaVA-NeXT-Data
base_model:
  - Qwen/Qwen2.5-7B-Instruct
---

[Paper] [GitHub]

## Model

We used MLCD as the vision encoder in LLaVA-NeXT.
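
For quick qualitative testing, the sketch below runs single-image inference with the LLaVA-NeXT codebase (linked above), using the same `qwen_1_5` conversation template as the evaluation command further down. The image path and question are placeholders, and helper signatures may vary slightly across repo versions.

```python
# Minimal single-image inference sketch (assumes the LLaVA-NeXT package from
# the GitHub repository above is installed; API details may differ by version).
import torch
from PIL import Image

from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.conversation import conv_templates
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.model.builder import load_pretrained_model

model_path = "DeepGlint-AI/llava-mlcd-qwen2.5-7b"
tokenizer, model, image_processor, _ = load_pretrained_model(
    model_path, None, get_model_name_from_path(model_path), device_map="auto"
)
model.eval()

# "example.jpg" and the question are placeholders.
image = Image.open("example.jpg").convert("RGB")
image_tensor = process_images([image], image_processor, model.config)
image_tensor = [t.to(model.device, dtype=torch.float16) for t in image_tensor]

# Build the prompt with the qwen_1_5 conversation template.
conv = conv_templates["qwen_1_5"].copy()
conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\nDescribe this image.")
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids = (
    tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
    .unsqueeze(0)
    .to(model.device)
)

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensor,
        image_sizes=[image.size],
        do_sample=False,
        max_new_tokens=256,
    )
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip())
```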

## Data

Our model was trained on publicly available data from the LLaVA-Pretrain and LLaVA-NeXT-Data datasets.
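
Both datasets are hosted on the Hugging Face Hub. The snippet below is a minimal sketch for downloading them with `huggingface_hub`; the `local_dir` paths are arbitrary placeholders.

```python
# Sketch: download the two public training datasets from the Hugging Face Hub.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="liuhaotian/LLaVA-Pretrain",
    repo_type="dataset",
    local_dir="data/LLaVA-Pretrain",  # placeholder path
)
snapshot_download(
    repo_id="lmms-lab/LLaVA-NeXT-Data",
    repo_type="dataset",
    local_dir="data/LLaVA-NeXT-Data",  # placeholder path
)
```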

## How to eval

```bash
pip install lmms-eval==0.2.0
```

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -m accelerate.commands.launch \
  --main_process_port=12581 \
  --num_processes=8 \
  -m lmms_eval \
  --model llava \
  --model_args pretrained=DeepGlint-AI/llava-mlcd-qwen2.5-7b,conv_template=qwen_1_5 \
  --tasks mmbench,mme,mmmu,ocrbench,scienceqa,scienceqa_img,seedbench,gqa,pope,textvqa_val,ai2d,chartqa,docvqa_val,infovqa_val,mmstar \
  --batch_size 1 \
  --log_samples \
  --log_samples_suffix mlcd_llava_qwen2_7b \
  --output_path ./log
```
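
Note that the `llava` model type in lmms-eval expects the LLaVA-NeXT codebase (linked above) to be installed in the same environment. With `--log_samples` enabled, per-sample outputs and aggregate scores are written under the `--output_path` directory (`./log` here).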

## Performance and Limitations

In our experiments, we replaced the CLIP vision encoder in LLaVA-NeXT with MLCD to demonstrate how MLCD performs within multimodal large language models (MLLMs); Qwen2.5-7B served as the language model. As the table below shows, the MLCD-based model matches or outperforms the CLIP baseline on most benchmarks, validating the effectiveness of MLCD within MLLMs.

| Vision Tower | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) |
| :-- | :-- | :-- |
| LLM | Qwen2.5-7B | Qwen2.5-7B |
| AI2D | 76.98 | 73.15 |
| ScienceQA_img | 78.09 | 76.35 |
| GQA | 64.17 | 63.31 |
| InfoVQA_val | 43.48 | 38.88 |
| MMBench_cn_dev | 74.83 | 72.51 |
| MMBench_en_dev | 76.37 | 74.57 |
| MME (cognition) | 432 | 384 |
| MME (perception) | 1598 | 1512 |
| SeedBench | 68.20 | 66.80 |
| SeedBench_img | 73.75 | 72.72 |
| MMStar | 50.98 | 48.98 |
| MMMU | 44.30 | 44.20 |
| OCRBench | 531.00 | 525.00 |
| ChartQA | 67.84 | 66.52 |
| DocVQA_val | 76.46 | 75.21 |
| POPE | 88.69 | 88.83 |
| TextVQA_val | 61.69 | 62.47 |

## Limitations

Models trained on larger datasets generally perform better across a wider range of tasks. We are currently training such models and will release them soon.

## Acknowledgments

We would like to express our gratitude to Yumeng Wang for his significant contributions to the experimental validation in MLLMs.