Abstract
Large Vision-Language Models (VLMs) deliver exceptional performance but require significant computational resources, limiting their deployment on mobile and edge devices. Smaller VLMs typically mirror design choices of larger models, such as extensive image tokenization, leading to inefficient GPU memory usage and constrained practicality for on-device applications. We introduce SmolVLM, a series of compact multimodal models specifically engineered for resource-efficient inference. We systematically explore architectural configurations, tokenization strategies, and data curation optimized for low computational overhead. Through this, we identify key design choices that yield substantial performance gains on image and video tasks with minimal memory footprints. Our smallest model, SmolVLM-256M, uses less than 1GB GPU memory during inference and outperforms the 300-times larger Idefics-80B model, despite an 18-month development gap. Our largest model, at 2.2B parameters, rivals state-of-the-art VLMs consuming twice the GPU memory. SmolVLM models extend beyond static images, demonstrating robust video comprehension capabilities. Our results emphasize that strategic architectural optimizations, aggressive yet efficient tokenization, and carefully curated training data significantly enhance multimodal performance, facilitating practical, energy-efficient deployments at significantly smaller scales.
Community
We are happy to introduce our tech report for SmolVLM :)
Video collection: https://huggingface.co/collections/HuggingFaceTB/smolvlm2-smallest-video-lm-ever-67ab6b5e84bf8aaa60cb17c7
Tiny Image collection: https://huggingface.co/collections/HuggingFaceTB/smolvlm-256m-and-500m-6791fafc5bb0ab8acc960fb0
Codebase: https://github.com/huggingface/smollm
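The sub-1GB inference claim for SmolVLM-256M is straightforward to check locally. Below is a minimal sketch using the standard transformers AutoProcessor / AutoModelForVision2Seq chat-template API; the checkpoint name HuggingFaceTB/SmolVLM-256M-Instruct, the local image path, and the prompt are illustrative assumptions, not details fixed by the report.

```python
# Minimal sketch: run SmolVLM-256M and report peak GPU memory.
# Assumes the checkpoint "HuggingFaceTB/SmolVLM-256M-Instruct" and a CUDA device.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "HuggingFaceTB/SmolVLM-256M-Instruct"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).to("cuda")

image = Image.open("example.jpg")  # any local test image
messages = [
    {"role": "user",
     "content": [{"type": "image"},
                 {"type": "text", "text": "Describe this image."}]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to("cuda")

torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128)

print(processor.batch_decode(out, skip_special_tokens=True)[0])
print(f"Peak GPU memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```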
Hello!
Based on your findings in Section 3.4 that "excessive CoT data harms compact model performance", would you expect the same effect when CoT reasoning is learned through RL in compact VLMs? Or is this relationship not obviously transferable and would require separate testing?
We did not try RL-style post-training, so it is hard to draw a definitive conclusion. However, larger LLMs (e.g., S1) can use SFT on CoT data and succeed in 'distilling' such reasoning from it.
My intuition is that these small models lack the emergent capacity to learn to reason, which is why CoT distillation was not helpful.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Mobile-VideoGPT: Fast and Accurate Video Understanding Language Model (2025)
- Shakti-VLMs: Scalable Vision-Language Models for Enterprise AI (2025)
- SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding (2025)
- Small Vision-Language Models: A Survey on Compact Architectures and Techniques (2025)
- FCoT-VL: Advancing Text-oriented Large Vision-Language Models with Efficient Visual Token Compression (2025)
- Multimodal Mamba: Decoder-only Multimodal State Space Model via Quadratic to Linear Distillation (2025)
- Semantic-Clipping: Efficient Vision-Language Modeling with Semantic-Guided Visual Selection (2025)
If you want recommendations for any paper on Hugging Face, check out this Space.
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend