---
license: apache-2.0
datasets:
- laion/laion400m
- kakaobrain/coyo-700m
pipeline_tag: feature-extraction
tags:
- Vision
- LLaVA
---
[[Paper]](https://arxiv.org/abs/2407.17331) [[GitHub]](https://github.com/deepglint/unicom)
## Model
We use the same Vision Transformer architecture as CLIP's [ViT-L/14@336px](https://huggingface.co/openai/clip-vit-large-patch14-336).
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6478679d7b370854241b2ad8/8n_jBobanaLNAQjM5eZeg.png)
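Because MLCD shares CLIP's ViT-L/14@336px architecture, the checkpoint should load through the standard `transformers` CLIP vision classes. The snippet below is a minimal sketch under that assumption; the repo id is a hypothetical placeholder for this model card's Hub path, not a verified identifier.

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

# Hypothetical repo id -- substitute the Hub path of this model card.
MODEL_ID = "deepglint/mlcd-vit-large-patch14-336"

processor = CLIPImageProcessor.from_pretrained(MODEL_ID)
model = CLIPVisionModel.from_pretrained(MODEL_ID)
model.eval()

image = Image.open("example.jpg")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Token-level features: (1, 577, 1024) = CLS token + 24x24 patch tokens.
patch_tokens = outputs.last_hidden_state
# Pooled image embedding for retrieval or linear probing: (1, 1024).
pooled = outputs.pooler_output
```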
## Data
Our model was trained on publicly available image-caption pairs from the [LAION-400M](https://arxiv.org/abs/2111.02114) and [COYO-700M](https://github.com/kakaobrain/coyo-dataset) datasets.
## Performance and Limitations
### A. MLLM Evaluation Results
In our experiments, we replaced the CLIP vision tower in [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT) with the MLCD model to measure its performance within Multimodal Large Language Models (MLLMs). For the language model, we used [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B). As the table below shows, the MLCD-based model outperforms the CLIP baseline on most benchmarks, validating the effectiveness of the MLCD model within MLLMs.
| Vision Tower | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) |
|:----------------|:-------------|:-------------|
| LLM | Qwen2.5-7B | Qwen2.5-7B |
| AI2D | **76.98** | 73.15 |
| ScienceQA_img | **78.09** | 76.35 |
| GQA | **64.17** | 63.31 |
| InfoVQA_val | **43.48** | 38.88 |
| MMBench_cn_dev | **74.83** | 72.51 |
| MMBench_en_dev | **76.37** | 74.57 |
| MME(cognition) | **432** | 384 |
| MME(perception) | **1598** | 1512 |
| SeedBench | **68.20** | 66.80 |
| SeedBench_img | **73.75** | 72.72 |
| MMStar | **50.98** | 48.98 |
| MMMU | **44.30** | 44.20 |
| OCRBench | **531.00** | 525.00 |
| ChartQA | **67.84** | 66.52 |
| DocVQA_val | **76.46** | 75.21 |
| POPE | 88.69 | **88.83** |
| TextVQA_val | 61.69 | **62.47** |
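For context, the sketch below shows how a LLaVA-style MLLM consumes a vision tower like MLCD: patch features are mapped into the LLM's embedding space by a small MLP connector. The dimensions (1024 for ViT-L/14, 3584 for Qwen2.5-7B's hidden size) and the 2-layer GELU projector mirror common LLaVA-NeXT configurations; treat this as an illustrative assumption, not the exact training code.

```python
import torch
import torch.nn as nn

# Vision tower output dim (ViT-L/14) and Qwen2.5-7B hidden size.
vision_dim, llm_dim = 1024, 3584

# LLaVA-style "mlp2x_gelu" connector: two linear layers with a GELU between.
projector = nn.Sequential(
    nn.Linear(vision_dim, llm_dim),
    nn.GELU(),
    nn.Linear(llm_dim, llm_dim),
)

# 576 patch tokens for a 336x336 image with 14x14 patches (CLS token dropped).
patch_features = torch.randn(1, 576, vision_dim)
visual_tokens = projector(patch_features)  # (1, 576, 3584)
# These visual tokens are then interleaved with the text embeddings fed to the LLM.
```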
### B. Linear Probe Evaluation Results
This table presents linear probe results comparing the CLIP and MLCD models on the ViT_L_14_336px architecture across various datasets. A linear probe freezes the pre-trained model's weights and trains a linear classifier on top of its features, measuring how well the learned representations transfer to different tasks.
| Dataset | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) |
|:---------------|:----------------------|:----------------------|
| **AVG** | **87.15** | 85.35 |
| Food101 | **96.21** | 95.90 |
| CIFAR-10 | **99.36** | 97.90 |
| CIFAR-100 | **93.69** | 87.40 |
| Birdsnap | **88.18** | 79.90 |
| SUN397 | **87.96** | 82.20 |
| Stanford Cars | **95.16** | 91.50 |
| FGVC Aircraft | **86.38** | 71.60 |
| Describable Textures Dataset | **86.70** | 83.00 |
| Oxford-IIIT Pets | **96.27** | 95.10 |
| Caltech-101 | **97.92** | 96.00 |
| Flowers102 | **99.58** | 99.20 |
| MNIST | 98.67 | **99.20** |
| STL-10 | 99.28 | **99.70** |
| EuroSAT | **99.06** | 98.10 |
| RESISC45 | **95.48** | 94.90 |
| GTSRB | 92.32 | **92.40** |
| KITTI | **75.39** | 69.20 |
| Country211 | 38.12 | **46.40** |
| PatchCamelyon | **88.00** | 85.60 |
| UCF101 | **92.86** | 92.00 |
| Kinetics-700 | **73.35** | 73.00 |
| CLEVR | **64.40** | 60.30 |
| Hateful Memes | 72.00 | **77.30** |
| SST-2 | 76.33 | **80.50** |
| ImageNet | **86.10** | 85.40 |
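As a concrete illustration of the protocol, the sketch below fits an L2-regularized logistic regression on frozen features, in the spirit of the CLIP linear probe setup. The synthetic features, class count, and regularization strength are stand-in assumptions; in practice the features are the pooled outputs of the frozen vision tower computed once per split.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-ins for pooled features from the frozen vision tower: in practice these
# are the (N, 1024) pooler outputs precomputed over the train and test splits.
train_x, train_y = rng.normal(size=(1000, 1024)), rng.integers(0, 10, 1000)
test_x, test_y = rng.normal(size=(200, 1024)), rng.integers(0, 10, 200)

# L2-regularized logistic regression on frozen features; the regularization
# strength C is normally swept on a validation split (assumed value here).
clf = LogisticRegression(C=0.316, max_iter=1000)
clf.fit(train_x, train_y)

print(f"linear probe top-1: {100 * clf.score(test_x, test_y):.2f}%")
```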
### C. Limitations
Vision towers with higher input resolution generally perform better on OCR-heavy tasks. We are currently training higher-resolution models and will release them soon.
## Acknowledgments
We would like to express our gratitude to [Xie Yin](https://huggingface.co/Yin-Xie) and [Yumeng Wang](https://huggingface.co/devymex) for their significant contributions to the experimental validation in MLLMs.