Introducing EuroBERT: A High-Performance Multilingual Encoder Model

Community Article · Published March 10, 2025

A New Chapter for Multilingual NLP

In recent years, large language models have dominated natural language processing (NLP), with many advances focusing on generative models. However, bidirectional encoder models remain essential for tasks such as retrieval, classification, and regression. With this in mind, we introduce EuroBERT, a new family of multilingual encoder models designed to push the boundaries of performance in European and widely spoken global languages.

EuroBERT is optimized for a wide range of applications and introduces several innovations in model architecture, training methodology, and dataset curation. By leveraging insights from modern generative models, it offers state-of-the-art performance while retaining the efficiency and robustness of encoder-based architectures.

What Makes EuroBERT Special?

EuroBERT improves upon traditional multilingual encoder models such as XLM-RoBERTa and mGTE in several key ways (a short usage sketch follows the list):

  • Extensive Multilingual Training: Trained on a 5 trillion-token dataset spanning 15 languages, ensuring broad linguistic coverage.
  • Advanced Architecture: Incorporates grouped query attention, rotary position embeddings, and root mean square normalization for better efficiency and performance.
  • Longer Context Support: Natively supports sequences up to 8,192 tokens, making it ideal for document-level tasks.
  • Specialized Knowledge: Includes datasets for mathematics and programming languages to enhance retrieval and reasoning capabilities.
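
To make this concrete, here is a minimal usage sketch with the Hugging Face transformers library. It assumes the checkpoints are published under the EuroBERT organization on the Hub (e.g. EuroBERT/EuroBERT-210m) and that the custom modeling code is loaded with trust_remote_code=True; see the model links later in this post for the official, up-to-date snippet.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Checkpoint name assumed from the EuroBERT organization on the Hub.
model_id = "EuroBERT/EuroBERT-210m"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True)
model.eval()

# Fill-in-the-blank with the masked language modeling head.
text = f"The capital of France is {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Pick the most likely token at the masked position.
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_ids = logits[0, mask_positions].argmax(dim=-1)
print(tokenizer.decode(predicted_ids))  # ideally "Paris"
```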

Training Methodology

EuroBERT follows a two-phase training pipeline:

  1. Pretraining: The model learns language structures from a massive corpus using a masked language modeling (MLM) objective, leveraging high-quality multilingual data.
  2. Annealing Phase: The data mix and training recipe are adjusted for optimal downstream performance, including a lower masking ratio and a rebalanced data distribution.

By applying this approach, EuroBERT ensures high adaptability across multiple NLP tasks while maintaining strong generalization. For those interested in the finer details, our study also includes extensive ablations on the impact of various training choices, such as data quality filtering, masking ratios, sentence length variations, and multilingual data balance. More details on these experiments and insights can be found in the full paper.
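
As a concrete illustration of how the masking ratio can change between the two phases, the sketch below builds two masking collators with transformers' DataCollatorForLanguageModeling. The checkpoint name is assumed, and the ratios (high during pretraining, lower during annealing) are illustrative values; the exact numbers used for EuroBERT are reported in the paper.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Hypothetical checkpoint name, used here only for its tokenizer.
tokenizer = AutoTokenizer.from_pretrained("EuroBERT/EuroBERT-210m")

# Phase 1 (pretraining): a relatively high masking ratio (illustrative value).
pretraining_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.5
)

# Phase 2 (annealing): a lower masking ratio on a rebalanced data mix (illustrative value).
annealing_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.1
)

example = [tokenizer(
    "EuroBERT is a multilingual encoder model for European languages.",
    return_special_tokens_mask=True,
)]

for name, collator in [("pretraining", pretraining_collator), ("annealing", annealing_collator)]:
    batch = collator(example)
    # Labels are -100 everywhere except at the positions selected for prediction.
    frac = (batch["labels"][0] != -100).float().mean().item()
    print(f"{name}: {frac:.0%} of tokens selected for prediction")
```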

Performance Highlights

EuroBERT achieves state-of-the-art results on a diverse set of multilingual NLP tasks. Key benchmarks include the following (a fine-tuning sketch follows the list):

  • Multilingual Retrieval (MIRACL, Wikipedia, CC-News): Outperforms existing models in ranking and document search tasks.
  • Classification (XNLI, PAWS-X, Amazon Reviews): Demonstrates competitive accuracy on natural language inference and sentiment analysis.
  • Regression (SeaHorse, WMT, SummEval): Excels at predicting quality scores for summaries and translations.
  • Code and Math Understanding: Shows strong results in code search (CodeSearchNet) and mathematical reasoning (MathShepherd).
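
For readers who want to run this kind of evaluation themselves, here is a hedged fine-tuning sketch for an XNLI-style sentence-pair classification task using the standard transformers Trainer. The checkpoint name, dataset choice, and hyperparameters are illustrative, not the exact setup from the paper.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_id = "EuroBERT/EuroBERT-210m"  # hypothetical checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, num_labels=3, trust_remote_code=True
)

# XNLI premise/hypothesis pairs (English subset) as an illustrative classification task.
dataset = load_dataset("xnli", "en")

def tokenize(batch):
    return tokenizer(batch["premise"], batch["hypothesis"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="eurobert-xnli",
        per_device_train_batch_size=16,
        num_train_epochs=1,
        learning_rate=2e-5,
    ),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,
)
trainer.train()
```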

EuroBERT for Long-Context NLP

One of the standout features of EuroBERT is its ability to handle long-context tasks effectively. With support for sequences up to 8,192 tokens, it is particularly well-suited for document retrieval, summarization, and question answering over extended text.
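
As an illustration, the sketch below embeds long documents with mean pooling over the last hidden state and ranks them against a query by cosine similarity, tokenizing up to the 8,192-token window. The checkpoint name and pooling strategy are assumptions rather than the paper's evaluation protocol.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "EuroBERT/EuroBERT-210m"  # hypothetical checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
model.eval()

def embed(texts):
    """Mean-pool the last hidden state over non-padding tokens."""
    inputs = tokenizer(
        texts, padding=True, truncation=True, max_length=8192, return_tensors="pt"
    )
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state      # (batch, seq, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)        # (batch, seq, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (batch, dim)

# Placeholder documents; in practice these could be thousands of tokens long.
documents = [
    "First long document about contract terms and termination clauses...",
    "Second long document about model training infrastructure...",
]
doc_embeddings = embed(documents)
query_embedding = embed(["What does the contract say about early termination?"])

scores = torch.nn.functional.cosine_similarity(query_embedding, doc_embeddings)
print(scores)  # higher score = more relevant document
```

Note that, as in the paper's evaluations, the encoder would typically be fine-tuned on a retrieval objective before being used this way; raw masked-language-model embeddings are only a starting point.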

Open Access and Availability

To foster research and real-world applications, we are open-sourcing the entire EuroBERT family, including:

  • Model checkpoints (210M, 610M, and 2.1B parameters)
  • Intermediate training snapshots for reproducibility
  • Training framework and dataset composition

📝 The paper: https://arxiv.org/abs/2503.05500

👀 The model: https://huggingface.co/EuroBERT

💻 The training code (AMD + NVIDIA, coming soon; follow the repo 😉): https://github.com/Nicolas-BZRD/EuroBERT

Conclusion and Future Work

EuroBERT represents a major step forward in multilingual encoder models, setting new benchmarks across multiple tasks. As we continue refining multilingual NLP, we invite the community to explore, experiment, and build upon our work.

We look forward to seeing how EuroBERT is used in research and industry applications. If you have questions or feedback, feel free to contact us!

Contributors

This project was made possible through a collaboration between the MICS laboratory at CentraleSupélec, Diabolocom, Artefact, and Unbabel, with the technological support of AMD and CINES. We also acknowledge the support of the French government through the France 2030 program, as part of the ArGiMi project, and of the DataIA Institute, whose contributions facilitated the completion of this work.

Finally, we thank the entire EuroBERT team, without whom this would not have been possible: Nicolas Boizard, Hippolyte Gisserot-Boukhlef, Duarte M. Alves, André Martins, Ayoub Hammal, Caio Corro, Celine Hudelot, Emmanuel Malherbe, Etienne Malaboeuf, Fanny Jourdan, Gabriel Hautreux, João Alves, Kevin El-Haddad, Manuel Faysse, Maxime Peyrard, Nuno Miguel Guerreiro, Ricardo Rei, Pierre Colombo

Diabolocom, Artefact, MICS, CentraleSupélec, Université Paris-Saclay, Instituto Superior Técnico & Universidade de Lisboa (Lisbon ELLIS Unit), Instituto de Telecomunicações, Unbabel, Université Paris-Saclay, CNRS, LISN, INSA Rennes, IRISA, CINES, IRT Saint Exupéry, Illuin Technology, Université Grenoble Alpes, Grenoble INP, LIG, Equall, ISIA Lab

Citation

@misc{boizard2025eurobertscalingmultilingualencoders,
      title={EuroBERT: Scaling Multilingual Encoders for European Languages}, 
      author={Nicolas Boizard and Hippolyte Gisserot-Boukhlef and Duarte M. Alves and André Martins and Ayoub Hammal and Caio Corro and Céline Hudelot and Emmanuel Malherbe and Etienne Malaboeuf and Fanny Jourdan and Gabriel Hautreux and João Alves and Kevin El-Haddad and Manuel Faysse and Maxime Peyrard and Nuno M. Guerreiro and Patrick Fernandes and Ricardo Rei and Pierre Colombo},
      year={2025},
      eprint={2503.05500},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.05500}, 
}

Community

Why are there so few languages involved in the training of these models? You argue that this data mix was selected "to create a corpus of European and most widely spoken languages, representing a broad range of alphabets and cultures."
But what is the relevance in other alphabets when, for example, you do not include any Nordic languages with large and high-quality datasets?

Prefixing it "Euro" seems odd in this context. You have selected a tiny fraction of languages - so name it accordingly :-)
It would also make sense to refer to EuroEval https://euroeval.com/leaderboards/
