---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2.5-1.5B-Instruct
- laion/CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-soup
pipeline_tag: question-answering
metrics:
- accuracy
library_name: transformers
---
# Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions

## Model Card for Euclid-convnext-xxlarge (Version of 12/05/2024)

A multimodal large language model trained specifically for strong low-level geometric perception.
## Model Details

### Model Description
Euclid is trained on 1.6M synthetic geometry images with high-fidelity question-answer pairs using a curriculum learning approach.
It combines a ConvNeXt visual encoder with a Qwen-2.5 language model, connected through a 2-layer MLP multimodal connector.
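As a rough illustration of this design, below is a minimal sketch of a 2-layer MLP connector that maps visual encoder features into the language model's embedding space. The dimensions and module names are illustrative assumptions, not the released configuration; see the repository for the actual implementation.

```python
import torch
import torch.nn as nn


class TwoLayerConnector(nn.Module):
    """Sketch of a 2-layer MLP multimodal connector.

    Dimensions are illustrative assumptions (vision_dim for the
    ConvNeXt features, lm_dim for the Qwen-2.5 embedding space),
    not the released configuration.
    """

    def __init__(self, vision_dim: int = 3072, lm_dim: int = 1536):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),  # project encoder features into LM space
            nn.GELU(),                      # nonlinearity between the two layers
            nn.Linear(lm_dim, lm_dim),      # refine within the LM embedding space
        )

    def forward(self, visual_features: torch.Tensor) -> torch.Tensor:
        # visual_features: (batch, num_patches, vision_dim)
        return self.mlp(visual_features)    # (batch, num_patches, lm_dim)
```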
### Model Sources
- Repository: https://github.com/euclid-multimodal/Euclid
- Paper: https://arxiv.org/abs/2412.08737
- Demo: https://euclid-multimodal.github.io/
## Uses

The model is trained for precise low-level geometric perception tasks and is able to perform:
- Point-on-line detection
- Point-on-circle detection
- Angle classification
- Length comparison
- Geometric annotation understanding
Please refer to our repo for the full input format; a hypothetical illustration of these task prompts is sketched below.
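For a rough sense of what these tasks look like, the snippet below lists hypothetical example questions for each task type. These are illustrative assumptions only; the exact prompt templates and input format are defined in the Euclid repository and may differ.

```python
# Hypothetical task prompts, for illustration only; the actual input
# format and prompt templates are defined in the Euclid repository.
example_questions = {
    "point_on_line": "Is point D on line AB?",
    "point_on_circle": "Does point E lie on circle O?",
    "angle_classification": "Is angle ABC acute, right, or obtuse?",
    "length_comparison": "Which is longer, segment AB or segment CD?",
    "annotation_understanding": "What is the label of the highlighted segment?",
}
```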
## Limitations and Applications
Our model is not designed to handle:
- Comprehensive image understanding tasks
- Advanced cognitive reasoning beyond geometric analysis
However, the model demonstrates strength in low-level visual perception.
This capability makes it potentially valuable as a base model for specialized downstream fine-tuning, including:
- Robotic vision and automation systems
- Medical imaging and diagnostic support
- Industrial quality assurance and inspection
- Geometric education and visualization tools
## Example Usage

Clone our Euclid repo first, set up the environment, then run:

```bash
pip install -U "huggingface_hub[cli]"
huggingface-cli download --cache-dir $MODEL_PATH EuclidAI/Euclid-convnext-xxlarge
python euclid/eval/run_euclid_geo.py --model_path $MODEL_PATH --device cuda
```
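If you prefer to fetch the weights from Python instead of the CLI, `huggingface_hub`'s `snapshot_download` is an equivalent step; the cache directory below is a placeholder standing in for `$MODEL_PATH`.

```python
from huggingface_hub import snapshot_download

# Programmatic equivalent of the huggingface-cli download step above;
# replace the cache directory with your own $MODEL_PATH.
model_path = snapshot_download(
    repo_id="EuclidAI/Euclid-convnext-xxlarge",
    cache_dir="./euclid_model",  # placeholder path
)
print(model_path)  # local directory containing the downloaded weights
```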
## Evaluation Results
Performance on Geoperception benchmark tasks:
| Model | POL | POC | ALC | LHC | PEP | PRA | EQL | Overall |
|---|---|---|---|---|---|---|---|---|
| Random Baseline | 0.43 | 2.63 | 59.92 | 51.36 | 0.25 | 0.00 | 0.02 | 16.37 |
| Pixtral-12B | 22.85 | 53.21 | 47.33 | 51.43 | 22.53 | 37.11 | 58.45 | 41.84 |
| Gemini-1.5-Pro | 24.42 | 69.80 | 57.96 | 79.05 | 39.60 | 77.59 | 52.27 | 57.24 |
| EUCLID-ConvNeXt-Large | 80.54 | 57.76 | 86.37 | 88.24 | 42.23 | 64.94 | 34.45 | 64.93 |
| EUCLID-ConvNeXt-XXLarge | 82.98 | 61.45 | 90.56 | 90.82 | 46.96 | 70.52 | 31.94 | 67.89 |
## Citation
If you find Euclid useful for your research and applications, please cite using this BibTeX:
```bibtex
@article{zhang2024euclid,
  title={Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions},
  author={Zhang, Jiarui and Liu, Ollie and Yu, Tianyu and Hu, Jinyi and Neiswanger, Willie},
  journal={arXiv preprint arXiv:2412.08737},
  year={2024}
}
```