Introducing EuroBERT: A High-Performance Multilingual Encoder Model

Community Article · Published March 10, 2025

A New Chapter for Multilingual NLP

In recent years, large language models have dominated natural language processing (NLP), with many advances focusing on generative models. However, bidirectional encoder models remain essential for tasks such as retrieval, classification, and regression. With this in mind, we introduce EuroBERT, a new family of multilingual encoder models designed to push the boundaries of performance in European and widely spoken global languages.

EuroBERT is optimized for a wide range of applications and introduces several innovations in model architecture, training methodology, and dataset curation. By leveraging insights from modern generative models, it offers state-of-the-art performance while retaining the efficiency and robustness of encoder-based architectures.

What Makes EuroBERT Special?

EuroBERT improves upon traditional multilingual encoder models such as XLM-RoBERTa and mGTE in several key ways (a short usage sketch follows the list):

  • Extensive Multilingual Training: Trained on a 5 trillion-token dataset spanning 15 languages, ensuring broad linguistic coverage.
  • Advanced Architecture: Incorporates grouped query attention, rotary position embeddings, and root mean square normalization for better efficiency and performance.
  • Longer Context Support: Natively supports sequences up to 8,192 tokens, making it ideal for document-level tasks.
  • Specialized Knowledge: Includes datasets for mathematics and programming languages to enhance retrieval and reasoning capabilities.
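
To make this concrete, here is a minimal usage sketch with the Hugging Face transformers library. It assumes the checkpoints are published under the EuroBERT organization on the Hub (e.g. EuroBERT/EuroBERT-210m) and that the custom modeling code is loaded with trust_remote_code=True; see the model links later in this post for the official, up-to-date snippet.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Checkpoint name assumed from the EuroBERT organization on the Hub.
model_id = "EuroBERT/EuroBERT-210m"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True)
model.eval()

# Fill-in-the-blank with the masked language modeling head.
text = f"The capital of France is {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Pick the most likely token at the masked position.
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_ids = logits[0, mask_positions].argmax(dim=-1)
print(tokenizer.decode(predicted_ids))  # ideally "Paris"
```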

Training Methodology

EuroBERT follows a two-phase training pipeline:

  1. Pretraining: The model learns language structures from a massive corpus using a masked language modeling (MLM) objective, leveraging high-quality multilingual data.
  2. Annealing Phase: The data mix and training recipe are adjusted for optimal downstream performance, including a lower masking ratio and a rebalanced data distribution.

By applying this approach, EuroBERT ensures high adaptability across multiple NLP tasks while maintaining strong generalization. For those interested in the finer details, our study also includes extensive ablations on the impact of various training choices, such as data quality filtering, masking ratios, sentence length variations, and multilingual data balance. More details on these experiments and insights can be found in the full paper.
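
As a concrete illustration of how the masking ratio can change between the two phases, the sketch below builds two masking collators with transformers' DataCollatorForLanguageModeling. The checkpoint name is assumed, and the ratios (high during pretraining, lower during annealing) are illustrative values; the exact numbers used for EuroBERT are reported in the paper.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Hypothetical checkpoint name, used here only for its tokenizer.
tokenizer = AutoTokenizer.from_pretrained("EuroBERT/EuroBERT-210m")

# Phase 1 (pretraining): a relatively high masking ratio (illustrative value).
pretraining_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.5
)

# Phase 2 (annealing): a lower masking ratio on a rebalanced data mix (illustrative value).
annealing_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.1
)

example = [tokenizer(
    "EuroBERT is a multilingual encoder model for European languages.",
    return_special_tokens_mask=True,
)]

for name, collator in [("pretraining", pretraining_collator), ("annealing", annealing_collator)]:
    batch = collator(example)
    # Labels are -100 everywhere except at the positions selected for prediction.
    frac = (batch["labels"][0] != -100).float().mean().item()
    print(f"{name}: {frac:.0%} of tokens selected for prediction")
```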

Performance Highlights

EuroBERT achieves state-of-the-art results on a diverse set of multilingual NLP tasks. Key benchmarks include the following (a fine-tuning sketch follows the list):

  • Multilingual Retrieval (MIRACL, Wikipedia, CC-News): Outperforms existing models in ranking and document search tasks.
  • Classification (XNLI, PAWS-X, Amazon Reviews): Demonstrates competitive accuracy on natural language inference and sentiment analysis.
  • Regression (SeaHorse, WMT, SummEval): Excels at predicting quality scores for summaries and translations.
  • Code and Math Understanding: Shows strong results in code search (CodeSearchNet) and mathematical reasoning (MathShepherd).
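
For readers who want to run this kind of evaluation themselves, here is a hedged fine-tuning sketch for an XNLI-style sentence-pair classification task using the standard transformers Trainer. The checkpoint name, dataset choice, and hyperparameters are illustrative, not the exact setup from the paper.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_id = "EuroBERT/EuroBERT-210m"  # hypothetical checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, num_labels=3, trust_remote_code=True
)

# XNLI premise/hypothesis pairs (English subset) as an illustrative classification task.
dataset = load_dataset("xnli", "en")

def tokenize(batch):
    return tokenizer(batch["premise"], batch["hypothesis"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="eurobert-xnli",
        per_device_train_batch_size=16,
        num_train_epochs=1,
        learning_rate=2e-5,
    ),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,
)
trainer.train()
```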

EuroBERT for Long-Context NLP

One of the standout features of EuroBERT is its ability to handle long-context tasks effectively. With support for sequences up to 8,192 tokens, it is particularly well-suited for document retrieval, summarization, and question answering over extended text.
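
As an illustration, the sketch below embeds long documents with mean pooling over the last hidden state and ranks them against a query by cosine similarity, tokenizing up to the 8,192-token window. The checkpoint name and pooling strategy are assumptions rather than the paper's evaluation protocol.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "EuroBERT/EuroBERT-210m"  # hypothetical checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
model.eval()

def embed(texts):
    """Mean-pool the last hidden state over non-padding tokens."""
    inputs = tokenizer(
        texts, padding=True, truncation=True, max_length=8192, return_tensors="pt"
    )
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state      # (batch, seq, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)        # (batch, seq, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (batch, dim)

# Placeholder documents; in practice these could be thousands of tokens long.
documents = [
    "First long document about contract terms and termination clauses...",
    "Second long document about model training infrastructure...",
]
doc_embeddings = embed(documents)
query_embedding = embed(["What does the contract say about early termination?"])

scores = torch.nn.functional.cosine_similarity(query_embedding, doc_embeddings)
print(scores)  # higher score = more relevant document
```

Note that, as in the paper's evaluations, the encoder would typically be fine-tuned on a retrieval objective before being used this way; raw masked-language-model embeddings are only a starting point.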

Open Access and Availability

To foster research and real-world applications, we are open-sourcing the entire EuroBERT family, including:

  • Model checkpoints (210M, 610M, and 2.1B parameters)
  • Intermediate training snapshots for reproducibility
  • Training framework and dataset composition

📝 The paper: https://arxiv.org/abs/2503.05500

👀 The model: https://huggingface.co/EuroBERT

💻 The training code (AMD + NVIDIA, coming soon; follow the repo 😉): https://github.com/Nicolas-BZRD/EuroBERT

Conclusion and Future Work

EuroBERT represents a major step forward in multilingual encoder models, setting new benchmarks across multiple tasks. As we continue refining multilingual NLP, we invite the community to explore, experiment, and build upon our work.

We look forward to seeing how EuroBERT is used in research and industry applications. If you have questions or feedback, feel free to contact us!

Contributors

This project was made possible through a collaboration between the MICS laboratory at CentraleSupélec, Diabolocom, Artefact, and Unbabel, with the technological support of AMD and CINES. We also acknowledge the support of the French government through the France 2030 program, as part of the ArGiMi project, and of the DataIA Institute, whose contributions facilitated the completion of this work.

Finally, we thank the entire EuroBERT team, without whom this would not have been possible: Nicolas Boizard, Hippolyte Gisserot-Boukhlef, Duarte M. Alves, André Martins, Ayoub Hammal, Caio Corro, Celine Hudelot, Emmanuel Malherbe, Etienne Malaboeuf, Fanny Jourdan, Gabriel Hautreux, João Alves, Kevin El-Haddad, Manuel Faysse, Maxime Peyrard, Nuno Miguel Guerreiro, Ricardo Rei, Pierre Colombo

Diabolocom, Artefact, MICS, CentraleSupélec, Université Paris-Saclay, Instituto Superior Técnico & Universidade de Lisboa (Lisbon ELLIS Unit), Instituto de Telecomunicações, Unbabel, Université Paris-Saclay, CNRS, LISN, INSA Rennes, IRISA, CINES, IRT Saint Exupéry, Illuin Technology, Université Grenoble Alpes, Grenoble INP, LIG, Equall, ISIA Lab

Citation

@misc{boizard2025eurobertscalingmultilingualencoders,
      title={EuroBERT: Scaling Multilingual Encoders for European Languages}, 
      author={Nicolas Boizard and Hippolyte Gisserot-Boukhlef and Duarte M. Alves and André Martins and Ayoub Hammal and Caio Corro and Céline Hudelot and Emmanuel Malherbe and Etienne Malaboeuf and Fanny Jourdan and Gabriel Hautreux and João Alves and Kevin El-Haddad and Manuel Faysse and Maxime Peyrard and Nuno M. Guerreiro and Patrick Fernandes and Ricardo Rei and Pierre Colombo},
      year={2025},
      eprint={2503.05500},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.05500}, 
}

Community

Why are there so few languages involved in the training of these models? You argue that this data mix was selected "to create a corpus of European and most widely spoken languages, representing a broad range of alphabets and cultures."
But what is the relevance in other alphabets when, for example, you do not include any Nordic languages with large and high-quality datasets?

Prefixing it "Euro" seems odd in this context. You have selected a tiny fraction of languages - so name it accordingly :-)
It would also make sense to refer to EuroEval https://euroeval.com/leaderboards/
