---
license: apache-2.0
datasets:
- laion/laion400m
- kakaobrain/coyo-700m
pipeline_tag: feature-extraction
tags:
- Vision
- LLaVA
---




[[Paper]](https://arxiv.org/abs/2407.17331) [[GitHub]](https://github.com/deepglint/unicom)  
## Model
We use the same Vision Transformer architecture as CLIP: [ViT-L/14@336px](https://huggingface.co/openai/clip-vit-large-patch14-336).

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6478679d7b370854241b2ad8/8n_jBobanaLNAQjM5eZeg.png)
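
For reference, here is a minimal feature-extraction sketch using `transformers`. The repository id is a placeholder, and loading through the standard CLIP vision classes (`CLIPImageProcessor` / `CLIPVisionModel`) is an assumption; adjust both to match the released checkpoint.

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

MODEL_ID = "deepglint/mlcd-vit-large-patch14-336"  # placeholder id, adjust to the released checkpoint

# Assumes the checkpoint is compatible with the standard CLIP vision classes.
processor = CLIPImageProcessor.from_pretrained(MODEL_ID)
model = CLIPVisionModel.from_pretrained(MODEL_ID)
model.eval()

image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Patch tokens: (1, 577, 1024) for a 336px input with 14px patches (576 patches + CLS).
patch_features = outputs.last_hidden_state
# Pooled image embedding: (1, 1024).
pooled_features = outputs.pooler_output
print(patch_features.shape, pooled_features.shape)
```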


## Data
Our model was trained on publicly available image-caption data from the [LAION-400M](https://arxiv.org/abs/2111.02114) and [COYO-700M](https://github.com/kakaobrain/coyo-dataset) datasets.

## Performance and Limitations

### A. MLLMs Evaluation Results
In our experiments, we replaced the CLIP vision tower in [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT) with the MLCD model to demonstrate its performance in Multimodal Large Language Models (MLLMs), using [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B) as the language model. The results below show consistent gains on most benchmarks, validating the effectiveness of MLCD within MLLMs; a schematic sketch of how the vision tower feeds the language model follows the table.

| Vision Tower    | MLCD (ViT-L/14@336px) | CLIP (ViT-L/14@336px) |
|:----------------|:-------------|:-------------|
| LLM             | Qwen2.5-7B   |   Qwen2.5-7B |
| AI2D            | **76.98**    | 73.15        |
| ScienceQA_img   | **78.09**    | 76.35        |
| GQA             | **64.17**    | 63.31        |
| InfoVQA_val     | **43.48**    | 38.88        |
| MMBench_cn_dev  | **74.83**    | 72.51        |
| MMBench_en_dev  | **76.37**    | 74.57        |
| MME(cognition)  | **432**      | 384          |
| MME(perception) | **1598**     | 1512         |
| SeedBench       | **68.20**    | 66.80        |
| SeedBench_img   | **73.75**    | 72.72        |
| MMStar          | **50.98**    | 48.98        |
| MMMU            | **44.30**    | 44.20        |
| OCRBench        | **531.00**   | 525.00       |
| ChartQA         | **67.84**    | 66.52        |
| DocVQA_val      | **76.46**    | 75.21        |
| POPE            | 88.69        | **88.83**    |
| TextVQA_val     | 61.69        | **62.47**    |
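
To make the drop-in replacement concrete, the following schematic sketch (not LLaVA-NeXT code) shows how either vision tower's patch tokens are mapped into the language model's embedding space by a small MLP projector. The hidden sizes (1024 for ViT-L/14, 3584 for Qwen2.5-7B) and the two-layer projector are assumptions about a LLaVA-style setup, not values taken from the actual training configuration.

```python
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """Two-layer MLP connector in the style used by LLaVA-like MLLMs (assumed here)."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 3584):
        # 1024 = ViT-L hidden size; 3584 = Qwen2.5-7B hidden size (assumed).
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        return self.proj(patch_tokens)

# Both MLCD and CLIP ViT-L/14@336px emit a (batch, 577, 1024) token sequence,
# so swapping the vision tower changes the weights, not the interface.
projector = VisionToLLMProjector()
visual_tokens = projector(torch.randn(1, 577, 1024))
print(visual_tokens.shape)  # torch.Size([1, 577, 3584])
```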




### B. Linear Probe Evaluation Results
This table presents linear-probe evaluation results comparing the MLCD and CLIP models, both using the ViT-L/14@336px architecture, across a range of datasets. A linear probe freezes the pre-trained encoder's weights and trains a linear classifier on top of its features, measuring how well the learned representations transfer to each task; a minimal probe sketch follows the table.

| Dataset        | MLCD (ViT-L/14@336px) | CLIP (ViT-L/14@336px) |
|:---------------|:----------------------|:----------------------|
| **AVG**        | **87.15**             | 85.35                 |
| Food101        | **96.21**             | 95.90                 |
| CIFAR-10       | **99.36**             | 97.90                 |
| CIFAR-100      | **93.69**             | 87.40                 |
| Birdsnap       | **88.18**             | 79.90                 |
| SUN397         | **87.96**             | 82.20                 |
| Stanford Cars  | **95.16**             | 91.50                 |
| FGVC Aircraft  | **86.38**             | 71.60                 |
| Describable Textures Dataset | **86.70** | 83.00                 |
| Oxford-IIIT Pets | **96.27**          | 95.10                 |
| Caltech-101    | **97.92**             | 96.00                 |
| Flowers102     | **99.58**             | 99.20                 |
| MNIST          | 98.67                 | **99.20**             |
| STL-10         | 99.28                 | **99.70**             |
| EuroSAT        | **99.06**             | 98.10                 |
| RESISC45       | **95.48**             | 94.90                 |
| GTSRB          | 92.32                 | **92.40**             |
| KITTI          | **75.39**             | 69.20                 |
| Country211     | 38.12                 | **46.40**             |
| PatchCamelyon  | **88.00**             | 85.60                 |
| UCF101         | **92.86**             | 92.00                 |
| Kinetics-700   | **73.35**             | 73.00                 |
| CLEVR          | **64.40**             | 60.30                 |
| Hateful Memes  | 72.00                 | **77.30**             |
| SST-2          | 76.33                 | **80.50**             |
| ImageNet       | **86.10**             | 85.40                 |
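
As a reference, here is a minimal linear-probe sketch on top of the frozen encoder, reusing the `model` and `processor` objects from the sketch in the Model section. Dataset loading is omitted and the variable names (`train_images`, `train_labels`, ...) are illustrative; logistic regression on frozen features is a common choice for this kind of probe, not necessarily the exact protocol behind the numbers above.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

@torch.no_grad()
def extract_features(model, processor, images, batch_size=32):
    """Run the frozen encoder over a list of PIL images and return pooled features."""
    feats = []
    for i in range(0, len(images), batch_size):
        batch = processor(images=images[i:i + batch_size], return_tensors="pt")
        feats.append(model(**batch).pooler_output.cpu().numpy())
    return np.concatenate(feats, axis=0)

# train_images / test_images: lists of PIL images; train_labels / test_labels: integer labels.
# X_train = extract_features(model, processor, train_images)
# X_test = extract_features(model, processor, test_images)
# clf = LogisticRegression(max_iter=1000)  # linear classifier on frozen features
# clf.fit(X_train, train_labels)
# print("linear-probe accuracy:", clf.score(X_test, test_labels))
```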


### C. Limitations

Higher-resolution models tend to produce better OCR-related results. We are currently training such models and will release them soon.


## Acknowledgments

We would like to express our gratitude to [Xie Yin](https://huggingface.co/Yin-Xie) and [Yumeng Wang](https://huggingface.co/devymex) for their significant contributions to the experimental validation of MLCD in MLLMs.