MVL-SIB: A Massively Multilingual Vision-Language Benchmark for Cross-Modal Topical Matching
Abstract
Existing multilingual vision-language (VL) benchmarks often cover only a handful of languages. Consequently, evaluations of large vision-language models (LVLMs) predominantly target high-resource languages, underscoring the need for evaluation data covering low-resource languages. To address this limitation, we introduce MVL-SIB, a massively multilingual vision-language benchmark that evaluates both cross-modal and text-only topical matching across 205 languages -- over 100 more than the most multilingual existing VL benchmarks encompass. We then benchmark a range of open-weight LVLMs together with GPT-4o(-mini) on MVL-SIB. Our results reveal that LVLMs struggle with cross-modal topic matching in lower-resource languages, performing no better than chance on languages such as N'Koo. Our analysis further reveals that VL support in LVLMs declines disproportionately relative to textual support for lower-resource languages, as evidenced by the comparison of cross-modal and text-only topical matching performance. We further observe that open-weight LVLMs do not benefit from representing a topic with more than one image, suggesting that these models are not yet fully effective at handling multi-image tasks. By correlating performance on MVL-SIB with other multilingual VL benchmarks, we highlight that MVL-SIB serves as a comprehensive probe of multilingual VL understanding in LVLMs.
Community
MVL-SIB is a multilingual dataset that provides image-sentence pairs spanning 205 languages and 7 topical categories (entertainment, geography, health, politics, science, sports, travel). It was constructed by extending SIB-200: for each topic, 10 permissively licensed images were manually collected to distinctly represent that category. MVL-SIB supports both text-only and cross-modal evaluation tasks. Our results reveal that LVLMs struggle with cross-modal topic matching in lower-resource languages, performing no better than chance on languages such as N'Koo. Our analysis further shows that VL support in LVLMs declines disproportionately relative to textual support for lower-resource languages, as evidenced by the comparison of cross-modal and text-only topical matching performance. We further observe that open-weight LVLMs do not benefit from representing a topic with more than one image, suggesting that these models are not yet fully effective at handling multi-image tasks. By correlating performance on MVL-SIB with other multilingual VL benchmarks, we find that MVL-SIB serves as a comprehensive probe of multilingual VL understanding in LVLMs.
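To make the task setup concrete, below is a minimal Python sketch of how a cross-modal (images-to-sentence) topical matching instance could be assembled and scored. The dataset identifier, configuration name, and field names (`images`, `sentences`, `gold_sentence`) are illustrative assumptions, not the released API; consult the dataset card for the actual schema.

```python
import random

from datasets import load_dataset

# Hypothetical dataset identifier and configuration name; MVL-SIB covers the
# 205 FLORES-200-style language codes of SIB-200, but the actual repo and
# config names may differ from what is shown here.
LANG = "nqo_Nkoo"
ds = load_dataset("WueNLP/mvl-sib", f"img2sent.{LANG}", split="test")

def build_instance(example, seed=0):
    """Turn one record into a multiple-choice prompt: given the topic's
    reference images, pick the sentence written about that topic.
    Field names ("images", "sentences", "gold_sentence") are assumptions."""
    rng = random.Random(seed)
    images = example["images"]               # images depicting the gold topic
    candidates = list(example["sentences"])  # gold sentence + distractors from other topics
    rng.shuffle(candidates)
    gold_index = candidates.index(example["gold_sentence"])
    prompt = (
        "Which of the following sentences matches the topic shown in the image(s)?\n"
        + "\n".join(f"{i + 1}. {s}" for i, s in enumerate(candidates))
    )
    return images, prompt, gold_index

# Accuracy is then the fraction of instances where the model's chosen option
# index equals gold_index; with four candidates, chance level is 25%.
```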
This is an automated message from the Librarian Bot. I found the following papers similar to this paper; they were recommended by the Semantic Scholar API:
- BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models (2025)
- Evaluation of Multilingual Image Captioning: How far can we get with CLIP models? (2025)
- LayAlign: Enhancing Multilingual Reasoning in Large Language Models via Layer-Wise Adaptive Fusion and Alignment Strategy (2025)
- Balanced Multi-Factor In-Context Learning for Multilingual Large Language Models (2025)
- When LLMs Struggle: Reference-less Translation Evaluation for Low-resource Languages (2025)
- LUSIFER: Language Universal Space Integration for Enhanced Multilingual Embeddings with Large Language Models (2025)
- Examining Multilingual Embedding Models Cross-Lingually Through LLM-Generated Adversarial Examples (2025)