SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
Abstract
We introduce SigLIP 2, a family of new multilingual vision-language encoders that build on the success of the original SigLIP. In this second iteration, we extend the original image-text training objective with several prior, independently developed techniques into a unified recipe -- this includes captioning-based pretraining, self-supervised losses (self-distillation, masked prediction) and online data curation. With these changes, SigLIP 2 models outperform their SigLIP counterparts at all model scales in core capabilities, including zero-shot classification, image-text retrieval, and transfer performance when extracting visual representations for Vision-Language Models (VLMs). Furthermore, the new training recipe leads to significant improvements on localization and dense prediction tasks. We also train variants which support multiple resolutions and preserve the input's native aspect ratio. Finally, we train on a more diverse data-mixture that includes de-biasing techniques, leading to much better multilingual understanding and improved fairness. To allow users to trade off inference cost with performance, we release model checkpoints at four sizes: ViT-B (86M), L (303M), So400m (400M), and g (1B).
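For reference, below is a minimal sketch of the original sigmoid image-text objective that SigLIP 2 extends with the additional losses listed above. It is written in plain PyTorch; the function name `siglip_loss` and the variable names (`log_t`, `bias`) are illustrative and not taken from the released code or checkpoints.

```python
import torch
import torch.nn.functional as F

def siglip_loss(img_emb, txt_emb, log_t, bias):
    """Pairwise sigmoid loss over a batch of image/text embeddings.

    img_emb, txt_emb: (n, d) tensors; log_t and bias are learnable scalars
    (illustrative names, not from the released implementation).
    """
    # L2-normalize and compute all pairwise similarities, scaled and shifted.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() * log_t.exp() + bias  # (n, n)
    # Matched pairs (the diagonal) get label +1, every other pair gets -1.
    labels = 2.0 * torch.eye(logits.shape[0], device=logits.device) - 1.0
    # Log-sigmoid loss over all n^2 pairs, averaged over images.
    return -F.logsigmoid(labels * logits).sum(dim=-1).mean()
```

SigLIP 2 keeps this pairwise objective and, per the abstract, layers captioning-based pretraining, self-distillation and masked-prediction losses, and online data curation on top of it.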
Community
It is truly magnificent! Thank you for sharing this work.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- FLAVARS: A Multimodal Foundational Language and Vision Alignment Model for Remote Sensing (2025)
- Generate, Transduct, Adapt: Iterative Transduction with VLMs (2025)
- Efficient Few-Shot Continual Learning in Vision-Language Models (2025)
- MoVE-KD: Knowledge Distillation for VLMs with Mixture of Visual Encoders (2025)
- Centurio: On Drivers of Multilingual Ability of Large Vision-Language Models (2025)
- Unifying Specialized Visual Encoders for Video Language Models (2025)
- SCOT: Self-Supervised Contrastive Pretraining For Zero-Shot Compositional Retrieval (2025)
Not compared against DINOv2-reg or other SSL backbones for vision tasks...
Models citing this paper: 79
Datasets citing this paper: 0