---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2.5-1.5B-Instruct
- laion/CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-soup
pipeline_tag: question-answering
metrics:
- accuracy
library_name: transformers
---

[Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions](https://arxiv.org/abs/2412.08737)

# Model Card for Euclid-convnext-xxlarge (Version on 12/05/2024)

A multimodal large language model specifically trained for strong low-level geometric perception.

## Model Details

### Model Description

Euclid is trained on 1.6M synthetic geometry images with high-fidelity question-answer pairs using a curriculum learning approach.

It combines a ConvNeXt visual encoder with a Qwen-2.5 language model, connected through a 2-layer MLP multimodal connector.

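The connector itself is small. The sketch below shows what a LLaVA-style 2-layer MLP connector typically looks like in PyTorch; the layer widths (3072 for ConvNeXt-XXLarge features, 1536 for the Qwen2.5-1.5B hidden size) and the GELU activation are illustrative assumptions, not the exact Euclid configuration.

```python
# Illustrative sketch of a 2-layer MLP multimodal connector (LLaVA-style),
# NOT the official Euclid implementation. Dimensions are assumptions.
import torch
import torch.nn as nn

class MultimodalConnector(nn.Module):
    def __init__(self, vision_dim: int = 3072, text_dim: int = 1536):
        super().__init__()
        # Two linear layers with a GELU in between.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # vision_features: (batch, num_patches, vision_dim) from the visual encoder
        # returns:         (batch, num_patches, text_dim) tokens fed to the LLM
        return self.proj(vision_features)

# Example: project dummy encoder features into the language-model embedding space.
features = torch.randn(1, 256, 3072)
visual_tokens = MultimodalConnector()(features)
print(visual_tokens.shape)  # torch.Size([1, 256, 1536])
```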

### Model Sources

- **Repository:** https://github.com/euclid-multimodal/Euclid
- **Paper:** https://arxiv.org/abs/2412.08737
- **Demo:** https://euclid-multimodal.github.io/

## Uses

The model is trained for precise low-level geometric perception tasks and can perform:
- Point-on-line detection
- Point-on-circle detection
- Angle classification
- Length comparison
- Geometric annotation understanding

Please refer to our [repo](https://github.com/euclid-multimodal/Euclid) for full input format.

### Limitations and Applications

Our model is not designed to handle:
- Comprehensive image understanding tasks
- Advanced cognitive reasoning beyond geometric analysis

However, the model demonstrates strength in low-level visual perception. 

This capability makes it potentially valuable as a base model for specialized downstream fine-tuning, including:

- Robotic vision and automation systems
- Medical imaging and diagnostic support
- Industrial quality assurance and inspection
- Geometric education and visualization tools

### Example Usage

Clone our Euclid [repo](https://github.com/euclid-multimodal/Euclid) first, set up the environment, then run:

```bash
pip install -U "huggingface_hub[cli]"
huggingface-cli download --cache-dir $MODEL_PATH EuclidAI/Euclid-convnext-xxlarge
python euclid/eval/run_euclid_geo.py --model_path $MODEL_PATH --device cuda
```
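
If you prefer to fetch the weights from Python rather than the CLI, a minimal equivalent using `huggingface_hub` is sketched below; the `cache_dir` value is a placeholder, and the resulting path plays the same role as `$MODEL_PATH` above.

```python
# Download the checkpoint programmatically instead of via huggingface-cli.
from huggingface_hub import snapshot_download

model_path = snapshot_download(
    repo_id="EuclidAI/Euclid-convnext-xxlarge",
    cache_dir="./euclid_checkpoints",  # example placeholder, same role as $MODEL_PATH
)
print(model_path)
```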



## Evaluation Results

Performance on Geoperception benchmark tasks:

| Model | POL | POC | ALC | LHC | PEP | PRA | EQL | Overall |
|-------|-----|-----|-----|-----|-----|-----|-----|----------|
| Random Baseline | 0.43 | 2.63 | 59.92 | 51.36 | 0.25 | 0.00 | 0.02 | 16.37 |
| Pixtral-12B | 22.85 | 53.21 | 47.33 | 51.43 | 22.53 | 37.11 | **58.45** | 41.84 |
| Gemini-1.5-Pro | 24.42 | **69.80** | 57.96 | 79.05 | **39.60** | **77.59** | 52.27 | 57.24 |
| EUCLID-ConvNeXt-Large | 80.54 | 57.76 | 86.37 | 88.24 | 42.23 | 64.94 | 34.45 | 64.93 |
| EUCLID-ConvNeXt-XXLarge | **82.98** | 61.45 | **90.56** | **90.82** | **46.96** | 70.52 | 31.94 | **67.89** |


## Citation

If you find Euclid useful for your research and applications, please cite using this BibTeX:
```bibtex
@article{zhang2024euclid,
  title={Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions},
  author={Zhang, Jiarui and Liu, Ollie and Yu, Tianyu and Hu, Jinyi and Neiswanger, Willie},
  journal={arXiv preprint arXiv:2412.08737},
  year={2024}
}
```