Introducing EuroBERT: A High-Performance Multilingual Encoder Model

A New Chapter for Multilingual NLP
In recent years, large language models have dominated natural language processing (NLP), with many advances focusing on generative models. However, bidirectional encoder models remain essential for tasks such as retrieval, classification, and regression. With this in mind, we introduce EuroBERT, a new family of multilingual encoder models designed to push the boundaries of performance in European and widely spoken global languages.
EuroBERT is optimized for a wide range of applications and introduces several innovations in model architecture, training methodology, and dataset curation. By leveraging insights from modern generative models, it offers state-of-the-art performance while retaining the efficiency and robustness of encoder-based architectures.
What Makes EuroBERT Special?
EuroBERT improves upon traditional multilingual encoder models like XLM-RoBERTa and mGTE in several key ways (a short usage sketch follows this list):
- Extensive Multilingual Training: Trained on a 5 trillion-token dataset spanning 15 languages, ensuring broad linguistic coverage.
- Advanced Architecture: Incorporates grouped query attention, rotary position embeddings, and root mean square normalization for better efficiency and performance.
- Longer Context Support: Natively supports sequences up to 8,192 tokens, making it ideal for document-level tasks.
- Specialized Knowledge: Includes datasets for mathematics and programming languages to enhance retrieval and reasoning capabilities.
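To make these points concrete, here is a minimal sketch of loading a released checkpoint with the Hugging Face `transformers` library and extracting contextual embeddings. The model identifier `EuroBERT/EuroBERT-210m` and the use of `trust_remote_code=True` are assumptions based on the Hub organization linked below; check the model card for the exact usage.

```python
# Minimal sketch: loading a EuroBERT checkpoint and encoding text.
# Assumes the 210M checkpoint is published as "EuroBERT/EuroBERT-210m" on the
# Hugging Face Hub and ships custom modeling code (hence trust_remote_code).
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "EuroBERT/EuroBERT-210m"  # assumed Hub identifier; see the model card
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

# Sequences up to 8,192 tokens are supported natively.
text = "EuroBERT is a multilingual encoder for European and global languages."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=8192)

with torch.no_grad():
    outputs = model(**inputs)

# Token-level contextual embeddings: (batch, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)
```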
Training Methodology
EuroBERT follows a two-phase training pipeline:
- Pretraining: The model learns language structures from a massive corpus using a masked language modeling (MLM) objective, leveraging high-quality multilingual data.
- Annealing Phase: The data mix and objective are adjusted for optimal downstream performance, notably by lowering the masking ratio and rebalancing the data distribution (a sketch of the masking step follows this list).
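To illustrate the masking step, here is a small, self-contained sketch of a generic BERT-style corruption routine with a configurable masking ratio. It is not the EuroBERT training code, and the ratio values shown are placeholders rather than the ones reported in the paper.

```python
# Illustrative sketch only (not the EuroBERT training pipeline): a generic
# masked-language-modeling corruption step with a configurable masking ratio,
# which can be lowered between the pretraining and annealing phases.
import torch

def mask_tokens(input_ids: torch.Tensor, mask_token_id: int,
                mask_ratio: float = 0.3) -> tuple[torch.Tensor, torch.Tensor]:
    """Randomly replace a fraction of tokens with [MASK]; unmasked labels are -100."""
    labels = input_ids.clone()
    # Sample which positions to mask (Bernoulli with probability mask_ratio).
    mask = torch.bernoulli(torch.full(input_ids.shape, mask_ratio)).bool()
    labels[~mask] = -100              # only masked positions contribute to the loss
    corrupted = input_ids.clone()
    corrupted[mask] = mask_token_id   # replace selected tokens with [MASK]
    return corrupted, labels

# Example: use a higher ratio during pretraining and a lower one during annealing
# (placeholder values; the exact ratios are reported in the paper).
ids = torch.randint(5, 1000, (2, 16))
pretrain_inputs, pretrain_labels = mask_tokens(ids, mask_token_id=4, mask_ratio=0.5)
anneal_inputs, anneal_labels = mask_tokens(ids, mask_token_id=4, mask_ratio=0.1)
```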
This two-phase approach keeps EuroBERT adaptable across a broad set of NLP tasks while maintaining strong generalization. For readers interested in the finer details, our study includes extensive ablations on the impact of various training choices, covering data quality filtering, masking ratios, sentence length variations, and multilingual data balance; the full paper reports these experiments and insights.
Performance Highlights
EuroBERT achieves state-of-the-art results on a diverse set of multilingual NLP tasks. Key benchmarks include:
- Multilingual Retrieval (MIRACL, Wikipedia, CC-News): Outperforms existing models in ranking and document search tasks.
- Classification (XNLI, PAWS-X, Amazon Reviews): Demonstrates competitive accuracy on natural language inference, paraphrase identification, and sentiment analysis.
- Regression (SeaHorse, WMT, SummEval): Excels in evaluation-style regression tasks such as translation quality estimation and summary quality scoring.
- Code and Math Understanding: Shows strong results in code search (CodeSearchNet) and mathematical reasoning (MathShepherd).
EuroBERT for Long-Context NLP
One of the standout features of EuroBERT is its ability to handle long-context tasks effectively. With support for sequences up to 8,192 tokens, it is particularly well-suited for document retrieval, summarization, and question answering over extended text.
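As a rough illustration, the sketch below embeds a query and a long document with EuroBERT and scores them with cosine similarity. Masked mean pooling and the Hub identifier are assumptions made for this example; in practice, competitive retrieval quality usually requires task-specific fine-tuning on top of the pretrained encoder.

```python
# Illustrative sketch: scoring a query against a long document with EuroBERT
# embeddings. Mean pooling and the Hub identifier are assumptions; fine-tuning
# is normally needed for competitive retrieval quality.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_id = "EuroBERT/EuroBERT-210m"  # assumed Hub identifier
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).eval()

def embed(text: str) -> torch.Tensor:
    # Long documents fit natively thanks to the 8,192-token context window.
    batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=8192)
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state        # (1, seq, hidden)
    mask = batch["attention_mask"].unsqueeze(-1)          # (1, seq, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # masked mean pooling

query = "European climate policy milestones"
document = "A long report on EU climate legislation ..."  # up to ~8k tokens
score = F.cosine_similarity(embed(query), embed(document))
print(float(score))
```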
Open Access and Availability
To foster research and real-world applications, we are open-sourcing the entire EuroBERT family, including:
- Model checkpoints (210M, 610M, and 2.1B parameters)
- Intermediate training snapshots for reproducibility
- Training framework and dataset composition
📝 The paper: https://arxiv.org/abs/2503.05500
👀 The model: https://huggingface.co/EuroBERT
💻 The training code (AMD + NVIDIA - coming soon, follow the repo 😉): https://github.com/Nicolas-BZRD/EuroBERT
Conclusion and Future Work
EuroBERT represents a major step forward in multilingual encoder models, setting new benchmarks across multiple tasks. As we continue refining multilingual NLP, we invite the community to explore, experiment, and build upon our work.
We look forward to seeing how EuroBERT is used in research and industry applications. If you have questions or feedback, feel free to contact us!
Contributors
This project was made possible through the collaboration between the MICS laboratory at CentraleSupélec, Diabolocom, Artefact, and Unbabel, as well as the technological support of AMD and CINES. We also acknowledge the support of the French government through the France 2030 program, as part of the ArGiMi project, and of the DataIA Institute, whose contributions facilitated the completion of this work.
Finally, we thank the entire EuroBERT team without whom this would not have been possible: Nicolas Boizard, Hippolyte Gisserot-Boukhlef, Duarte M. Alves, André Martins, Ayoub Hammal, Caio Corro, Celine Hudelot, Emmanuel Malherbe, Etienne Malaboeuf, Fanny Jourdan, Gabriel Hautreux, João Alves, Kevin El-Haddad, Manuel Faysse, Maxime Peyrard, Nuno Miguel Guerreiro, Ricardo Rei, Pierre Colombo
Diabolocom, Artefact, MICS, CentraleSupélec, Université Paris-Saclay, Instituto Superior Técnico & Universidade de Lisboa (Lisbon ELLIS Unit), Instituto de Telecomunicações, Unbabel, Université Paris-Saclay, CNRS, LISN, INSA Rennes, IRISA, CINES, IRT Saint Exupéry, Illuin Technology, Université Grenoble Alpes, Grenoble INP, LIG, Equall, ISIA Lab
Citation
@misc{boizard2025eurobertscalingmultilingualencoders,
title={EuroBERT: Scaling Multilingual Encoders for European Languages},
author={Nicolas Boizard and Hippolyte Gisserot-Boukhlef and Duarte M. Alves and André Martins and Ayoub Hammal and Caio Corro and Céline Hudelot and Emmanuel Malherbe and Etienne Malaboeuf and Fanny Jourdan and Gabriel Hautreux and João Alves and Kevin El-Haddad and Manuel Faysse and Maxime Peyrard and Nuno M. Guerreiro and Patrick Fernandes and Ricardo Rei and Pierre Colombo},
year={2025},
eprint={2503.05500},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2503.05500},
}