mRoBERTa Model Card

mRoBERTa is a multilingual foundational model based on the RoBERTa architecture, pretrained from scratch on 35 European languages and code. The pretraining corpus consists of 12.8TB of high-quality data, significantly larger than the 2.5TB of multilingual data used to train previous state-of-the-art encoder-only foundational models such as XLM-RoBERTa-base and XLM-RoBERTa-large.

Technical Description

Technical details of the mRoBERTa model.

| Description | Value |
|---|---|
| Model Parameters | 283M |
| Tokenizer Type | SPM |
| Vocabulary size | 256,000 |
| Precision | bfloat16 |
| Context length | 512 |
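
The 283M parameter count is dominated by the 256,000-entry embedding table. A rough back-of-the-envelope check, assuming a standard RoBERTa-base backbone (12 layers, hidden size 768, FFN size 3072 — architectural values assumed here, not stated in the card):

```python
# Rough parameter estimate for a RoBERTa-base-style encoder with a 256k vocabulary.
# Backbone sizes (12 layers, hidden 768, FFN 3072) are assumptions, not from the card.
vocab, hidden, ffn, layers, max_pos = 256_000, 768, 3072, 12, 514

embeddings = vocab * hidden + max_pos * hidden + hidden  # token + position + type
per_layer = (
    4 * (hidden * hidden + hidden)   # Q, K, V, O projections (+ biases)
    + (hidden * ffn + ffn)           # FFN up-projection
    + (ffn * hidden + hidden)        # FFN down-projection
    + 4 * hidden                     # two LayerNorms (scale + shift)
)
total = embeddings + layers * per_layer + 2 * hidden  # + embedding LayerNorm

print(f"~{total / 1e6:.0f}M parameters")  # ~282M, close to the reported 283M
```

Most of the gap to xlm-roberta-base (279M, 250K vocabulary) comes from the larger embedding table rather than the transformer stack itself.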

Training Hyperparameters

| Hyperparameter | Value |
|---|---|
| Pretraining Objective | Masked Language Modeling |
| Learning Rate | 7E-05 |
| Learning Rate Scheduler | Cosine |
| Warmup | 10k |
| Optimizer | AdamW |
| Optimizer Hyperparameters | AdamW (β1=0.9, β2=0.98, ε=1e-06) |
| Optimizer Decay | 1E-02 |
| Global Batch Size | 8192 |
| Dropout | 1E-01 |
| Attention Dropout | 1E-01 |
| Activation Function | GeLU |
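
The schedule above (linear warmup to the 7e-5 peak over 10k steps, then cosine decay) can be sketched as follows; the total step count is a hypothetical value, since the card does not state the schedule length:

```python
import math

def learning_rate(step, peak_lr=7e-5, warmup_steps=10_000, total_steps=500_000):
    """Linear warmup to peak_lr, then cosine decay to zero.

    peak_lr and warmup_steps match the table above; total_steps is assumed.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))

print(learning_rate(0))       # 0.0
print(learning_rate(10_000))  # 7e-05 (peak, end of warmup)
```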

Data

Pretraining Corpus

The training corpus consists of 35 European languages and 92 programming languages, amounting to a total of 12.8TB of high-quality data.

This highly multilingual corpus is predominantly composed of data from Colossal OSCAR, which contributes a significant 66.06% of the total tokens. Following this, Starcoder provides 11.91%, and Spanish Crawling adds 3.34%. The next largest sources are French PD at 3.12% and Proof Pile at 1.98%. Other notable contributions include Macocu, Pile of Law, and Eurlex, each contributing around 1.5% to 1.3%. These major sources collectively form the bulk of the corpus, ensuring a rich and diverse dataset for training the language model. The remaining 10% comes from smaller sources in various languages.
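
The listed shares can be sanity-checked against the "remaining 10%" figure. The three sources quoted as "around 1.5% to 1.3%" are approximated here as 1.5, 1.4 and 1.3 — assumed values, not exact figures from the corpus report:

```python
# Sanity check on the corpus composition listed above.
shares = {
    "Colossal OSCAR": 66.06,
    "Starcoder": 11.91,
    "Spanish Crawling": 3.34,
    "French PD": 3.12,
    "Proof Pile": 1.98,
    "Macocu": 1.5,       # approximate
    "Pile of Law": 1.4,  # approximate
    "Eurlex": 1.3,       # approximate
}
remainder = 100 - sum(shares.values())
print(f"smaller sources: ~{remainder:.1f}%")  # roughly the 10% quoted above
```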

The final pretraining corpus distribution, split by language, is shown in the following figure: [figure: pretraining language distribution]

Further details about the pretraining corpus can be found here; it is the same corpus used for the Salamandra foundational model.

Multilingual Evaluation and Performance

Evaluation uses multilingual benchmarks to assess the multilingual capabilities of the models.

The following multilingual benchmarks have been considered:

| Benchmark | Description | Languages | Source |
|---|---|---|---|
| XTREME | Benchmark for evaluating the cross-lingual generalization ability of pretrained multilingual models | bg, ca, de, el, en, es, et, eu, fi, fr, hu, it, lt, nl, pl, pt, ro, ru, uk | LINK |
| CLUB | Human-annotated Catalan benchmark | ca | LINK |
| Basque Custom Benchmark | A set of NER, POS and TC evaluation tasks to assess performance in Basque | eu | LINK |
| Galician Custom Benchmark | NER and POS evaluation tasks to assess performance in Galician | gl | LINK LINK |

The following base foundational models have been considered:

| Multilingual Foundational Model | Number of Parameters | Vocab Size | Description |
|---|---|---|---|
| BERTa | 126M | 52K | BERTa is a Catalan-specific language model pretrained with Catalan-only data. |
| BERTinho | 109M | 30K | BERTinho is a monolingual BERT model for the Galician language. |
| mBERT | 178M | 120K | Multilingual BERT model pretrained on the top 104 languages with the largest Wikipedias. |
| mRoBERTa | 283M | 256K | RoBERTa base model pretrained with 35 European languages and a larger vocabulary size. |
| roberta-base-bne | 125M | 50K | RoBERTa base model pretrained with 570GB of data from web crawls performed by the National Library of Spain from 2009 to 2019. |
| RoBERTa-ca | 125M | 50K | RoBERTa-ca is a Catalan-specific language model obtained by vocabulary adaptation from mRoBERTa. |
| xlm-roberta-base | 279M | 250K | Foundational RoBERTa model pretrained with CommonCrawl data covering 100 languages. |
| xlm-roberta-large | 561M | 250K | Foundational RoBERTa model pretrained with CommonCrawl data covering 100 languages. |

Results

This section presents results across the multilingual benchmarks described above.

XTREME Benchmark

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark evaluates the cross-lingual generalization ability of pretrained multilingual models. It includes nine tasks that collectively require reasoning about different levels of syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage of existing tasks, availability of training data, and overlap with the languages present during pretraining of the models.

🔵 Sentence Classification

🔵 XNLI

Metric used: Accuracy.

| lang | mBERT (178M) | xlm-roberta-base (279M) | xlm-roberta-large (561M) | mRoBERTa (283M) |
|---|---|---|---|---|
| bg | 69.34 | 78.26 | 82.10 | 77.56 |
| de | 71.54 | 76.75 | 81.62 | 77.01 |
| el | 66.51 | 76.37 | 81.46 | 76.35 |
| en | 82.20 | 84.45 | 87.98 | 85.69 |
| es | 74.81 | 78.18 | 83.65 | 79.66 |
| fr | 74.25 | 78.24 | 82.71 | 79.16 |
| ru | 68.56 | 76.21 | 79.10 | 74.73 |

🔵 PAWS-X

Metric used: Accuracy.

| lang | mBERT (178M) | xlm-roberta-base (279M) | xlm-roberta-large (561M) | mRoBERTa (283M) |
|---|---|---|---|---|
| de | 85.65 | 86.95 | 85.05 | 87.35 |
| en | 93.50 | 93.90 | 91.45 | 94.75 |
| es | 87.75 | 89.30 | 87.65 | 88.60 |
| fr | 86.60 | 88.55 | 87.30 | 89.20 |

🟣 Structured Prediction: POS

🟣 POS (UDPOS)

Metric used: F1.

| lang | mBERT (178M) | xlm-roberta-base (279M) | xlm-roberta-large (561M) | mRoBERTa (283M) |
|---|---|---|---|---|
| bg | 85.14 | 88.62 | 89.06 | 88.19 |
| de | 85.71 | 88.41 | 88.65 | 88.58 |
| el | 80.92 | 87.12 | 86.55 | 87.03 |
| en | 95.43 | 95.79 | 96.07 | 95.85 |
| es | 85.85 | 88.10 | 89.31 | 87.45 |
| et | 79.68 | 86.22 | 87.36 | 86.25 |
| eu | 60.18 | 68.83 | 71.85 | 69.22 |
| fi | 79.72 | 85.90 | 86.54 | 84.23 |
| fr | 81.20 | 86.34 | 88.24 | 87.00 |
| hu | 78.39 | 83.05 | 83.84 | 82.96 |
| it | 87.86 | 88.91 | 90.01 | 89.11 |
| lt | 78.59 | 83.86 | 84.91 | 81.12 |
| nl | 88.59 | 89.16 | 89.70 | 89.31 |
| pl | 80.34 | 84.61 | 85.77 | 84.23 |
| pt | 85.77 | 87.53 | 88.56 | 87.18 |
| ro | 76.51 | 83.99 | 86.47 | 82.74 |
| ru | 85.36 | 88.75 | 89.83 | 89.09 |
| uk | 80.63 | 84.79 | 85.84 | 85.19 |

🟣 NER (PANX)

Metric used: F1.

| lang | mBERT (178M) | xlm-roberta-base (279M) | xlm-roberta-large (561M) | mRoBERTa (283M) |
|---|---|---|---|---|
| bg | 78.38 | 76.52 | 81.97 | 78.66 |
| de | 78.89 | 73.92 | 78.59 | 78.17 |
| el | 74.09 | 73.07 | 75.49 | 74.81 |
| en | 84.69 | 82.70 | 84.50 | 83.56 |
| es | 72.32 | 72.83 | 73.46 | 78.30 |
| et | 77.55 | 72.56 | 78.37 | 73.92 |
| eu | 66.52 | 58.34 | 60.01 | 56.74 |
| fi | 78.11 | 74.98 | 78.46 | 76.42 |
| fr | 79.45 | 77.00 | 80.16 | 76.94 |
| hu | 77.39 | 75.48 | 80.10 | 73.31 |
| it | 81.33 | 76.68 | 80.60 | 80.04 |
| lt | 75.48 | 73.76 | 76.41 | 72.71 |
| nl | 82.40 | 79.80 | 82.92 | 81.42 |
| pl | 80.57 | 77.15 | 80.55 | 80.26 |
| pt | 79.66 | 76.60 | 80.97 | 76.13 |
| ro | 74.73 | 71.79 | 81.42 | 66.85 |
| ru | 65.42 | 63.93 | 70.68 | 67.53 |
| uk | 71.71 | 66.78 | 74.12 | 71.69 |

⚪️ Sentence Retrieval

⚪️ BUCC2018

Metric used: F1.

| lang | mBERT (178M) | xlm-roberta-base (279M) | xlm-roberta-large (561M) | mRoBERTa (283M) |
|---|---|---|---|---|
| de | 63.26 | 66.83 | 75.23 | 86.09 |
| fr | 62.62 | 65.79 | 69.29 | 79.21 |
| ru | 54.97 | 70.12 | 75.57 | 82.93 |

⚪️ Tatoeba

Metric used: Accuracy.

| lang | mBERT (178M) | xlm-roberta-base (279M) | xlm-roberta-large (561M) | mRoBERTa (283M) |
|---|---|---|---|---|
| bg | 48.80 | 66.90 | 71.60 | 77.60 |
| ca | 59.80 | 57.30 | 62.20 | 80.20 |
| de | 75.40 | 88.40 | 88.80 | 95.60 |
| el | 29.80 | 51.60 | 61.80 | 72.30 |
| es | 64.10 | 71.00 | 75.70 | 89.70 |
| et | 28.10 | 44.20 | 52.20 | 61.80 |
| eu | 25.50 | 26.10 | 35.80 | 53.40 |
| fi | 39.00 | 63.90 | 71.60 | 63.90 |
| fr | 64.30 | 72.50 | 73.70 | 81.30 |
| hu | 36.90 | 58.70 | 65.40 | 62.40 |
| it | 57.30 | 64.70 | 68.30 | 80.30 |
| lt | 31.10 | 54.80 | 59.60 | 49.30 |
| nl | 63.70 | 76.80 | 80.80 | 86.60 |
| pl | 50.10 | 65.20 | 75.90 | 79.00 |
| pt | 68.40 | 76.60 | 82.20 | 88.80 |
| ro | 51.50 | 68.80 | 75.70 | 69.00 |
| ru | 59.40 | 69.80 | 74.10 | 81.60 |
| uk | 52.60 | 57.30 | 69.10 | 77.50 |
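
BUCC2018 and Tatoeba score models on bitext retrieval: each source sentence is matched to its translation by nearest-neighbour search over sentence embeddings (typically mean-pooled encoder states). A minimal cosine-similarity retrieval sketch, using made-up toy vectors in place of real encoder outputs:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def retrieve(src_embs, tgt_embs):
    """For each source embedding, return the index of the nearest target embedding."""
    return [max(range(len(tgt_embs)), key=lambda j: cosine(e, tgt_embs[j]))
            for e in src_embs]

# Toy 2-d embeddings standing in for sentence representations (illustrative only).
src = [[1.0, 0.1], [0.1, 1.0]]
tgt = [[0.2, 0.9], [0.9, 0.2]]
print(retrieve(src, tgt))  # [1, 0]: each source sentence matches its translation
```

Retrieval accuracy (Tatoeba) is then the fraction of sources whose retrieved index is the gold one; BUCC additionally thresholds the similarity and reports F1.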

⚫ Question Answering

⚫ XQUAD

Metric used: F1.

| lang | mBERT (178M) | xlm-roberta-base (279M) | xlm-roberta-large (561M) | mRoBERTa (283M) |
|---|---|---|---|---|
| de | 73.55 | 74.81 | 80.15 | 73.92 |
| el | 63.74 | 73.34 | 80.86 | 73.56 |
| en | 84.84 | 84.22 | 88.13 | 82.70 |
| es | 75.06 | 76.44 | 82.21 | 77.07 |
| ru | 72.02 | 74.73 | 80.11 | 72.85 |

⚫ MLQA

Metric used: F1.

| lang | mBERT (178M) | xlm-roberta-base (279M) | xlm-roberta-large (561M) | mRoBERTa (283M) |
|---|---|---|---|---|
| de | 57.68 | 62.20 | 68.78 | 63.25 |
| en | 80.16 | 80.27 | 83.52 | 79.81 |
| es | 64.90 | 66.97 | 72.93 | 68.14 |

⚫ TyDiQA

Metric used: F1.

| lang | mBERT (178M) | xlm-roberta-base (279M) | xlm-roberta-large (561M) | mRoBERTa (283M) |
|---|---|---|---|---|
| en | 68.26 | 59.57 | 71.33 | 61.50 |
| fi | 55.70 | 51.91 | 70.62 | 52.32 |
| ru | 53.71 | 50.75 | 64.48 | 50.66 |
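
The question-answering tasks above report F1 computed as bag-of-tokens overlap between the predicted and gold answer spans (SQuAD-style). A minimal sketch, omitting the answer normalization (lowercasing, article and punctuation stripping) applied by the official evaluation scripts:

```python
from collections import Counter

def qa_f1(prediction, gold):
    """SQuAD-style F1: token-overlap harmonic mean of precision and recall."""
    pred_tokens, gold_tokens = prediction.split(), gold.split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(qa_f1("the Eiffel Tower", "Eiffel Tower"))  # 0.8
```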

CLUB Benchmark

The Catalan Language Understanding Benchmark consists of 6 tasks: Named Entity Recognition (NER), Part-of-Speech Tagging (POS), Semantic Textual Similarity (STS), Text Classification (TC), Textual Entailment (TE), and Question Answering (QA). This benchmark evaluates the model's capabilities in the Catalan language.

This comparison also includes RoBERTa-ca, a model derived from mRoBERTa by applying vocabulary adaptation and performing continual pre-training on a 95GB Catalan-only corpus. For further details, visit here.

| task | roberta-base-bne (125M) | berta (126M) | mBERT (178M) | xlm-roberta-base (279M) | xlm-roberta-large (561M) | roberta-ca (125M) | mRoBERTa (283M) |
|---|---|---|---|---|---|---|---|
| ner (F1) | 87.59 | 89.47 | 85.89 | 87.50 | 89.47 | 89.70 | 88.33 |
| pos (F1) | 98.64 | 98.89 | 98.78 | 98.91 | 99.03 | 99.00 | 98.98 |
| sts (Pearson) | 74.27 | 81.39 | 77.05 | 75.11 | 83.49 | 82.99 | 79.52 |
| tc (Acc.) | 73.86 | 73.16 | 72.00 | 73.05 | 74.10 | 72.81 | 72.41 |
| te (Acc.) | 72.27 | 80.11 | 75.86 | 78.27 | 86.63 | 82.14 | 82.38 |
| viquiquad (F1) | 82.56 | 86.74 | 87.42 | 86.81 | 90.35 | 87.31 | 87.86 |
| xquad (F1) | 60.56 | 67.38 | 67.72 | 68.56 | 76.08 | 70.53 | 69.40 |
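
The STS task is scored with Pearson correlation between predicted and gold similarity scores. A self-contained sketch of the metric, with illustrative toy scores:

```python
import math

def pearson(xs, ys):
    """Pearson correlation between predicted and gold similarity scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy predicted vs. gold scores (illustrative values only).
print(round(pearson([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8]), 3))  # 0.991
```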

Galician Benchmark

To evaluate performance in Galician, the models are tested on two tasks highlighted in Bertinho's paper:

| task | bertinho (109M) | roberta-base-bne (125M) | mBERT (178M) | xlm-roberta-base (279M) | xlm-roberta-large (561M) | mRoBERTa (283M) |
|---|---|---|---|---|---|---|
| ner-dataset-SLI NERC (F1) | 86.27 | 86.80 | 86.22 | 85.99 | 88.10 | 87.75 |
| pos-dataset-UD_GL_CTG (F1) | 97.58 | 97.27 | 97.57 | 97.77 | 97.95 | 97.75 |

BasqueGLUE Benchmark

To assess model performance in Basque, the BasqueGLUE benchmark is used as the baseline. BasqueGLUE was built from previously existing datasets, following criteria similar to those used for the construction of GLUE and SuperGLUE. Some of the tasks have been slightly adapted so that all models can be assessed uniformly (e.g., FMTODeu_slot is originally described as a "Slot filling" task, but it is evaluated as NERC, since it follows the BIO annotation scheme).

| task | roberta-base-bne (125M) | mBERT (178M) | xlm-roberta-base (279M) | xlm-roberta-large (561M) | mRoBERTa (283M) |
|---|---|---|---|---|---|
| NERC - NERCid (F1) | 71.53 | 79.98 | 81.74 | 83.96 | 80.86 |
| NERC - NERCood (F1) | 61.47 | 76.95 | 76.23 | 80.25 | 76.97 |
| NERC - FMTODeu_slot (F1) | 72.70 | 73.65 | 73.80 | 77.09 | 77.32 |
| Sentiment Analysis - BEC2016eu (Acc.) | 67.13 | 67.05 | 69.89 | 67.90 | 69.20 |
| Topic Classification - BHTCv2 (Acc.) | 66.72 | 70.17 | 72.01 | 75.78 | 72.55 |
| Intent Classification - FMTODeu_intent (Acc.) | 78.38 | 78.01 | 82.15 | 83.35 | 83.07 |
| Stance Detection - VaxxStance (Acc.) | 58.01 | 66.67 | 61.22 | 66.03 | 65.71 |
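
Under the BIO annotation scheme mentioned above, per-token predictions are converted into labelled spans before computing span-level F1. A minimal sketch of the span-extraction step:

```python
def bio_to_spans(tags):
    """Convert a BIO tag sequence into (label, start, end) spans, end exclusive."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # trailing sentinel flushes the last span
        if tag == "O" or tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != label
        ):
            if label is not None:
                spans.append((label, start, i))
            start, label = (i, tag[2:]) if tag != "O" else (None, None)
    return spans

tags = ["B-PER", "I-PER", "O", "B-LOC"]
print(bio_to_spans(tags))  # [('PER', 0, 2), ('LOC', 3, 4)]
```

Predicted and gold span sets are then compared, and F1 is computed over exact span matches.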

Additional information

Author

The Language Technologies Lab from Barcelona Supercomputing Center.

Contact

For further information, please send an email to [email protected].

Copyright

Copyright(c) 2025 by Language Technologies Lab, Barcelona Supercomputing Center.

Funding

This work has been promoted and financed by the Ministerio para la Transformación Digital y de la Función Pública and Plan de Recuperación, Transformación y Resiliencia - Funded by EU – NextGenerationEU within the framework of the project Modelos del Lenguaje.

Acknowledgements

This project has benefited from the contributions of numerous teams and institutions through data contributions.

In Catalonia, many institutions have been involved in the project. Our thanks to Òmnium Cultural, Parlament de Catalunya, Institut d'Estudis Aranesos, Racó Català, Vilaweb, ACN, Nació Digital, El món and Aquí Berguedà.

At national level, we are especially grateful to our ILENIA project partners: CENID, HiTZ and CiTIUS for their participation. We also extend our genuine gratitude to the Spanish Senate and Congress, Fundación Dialnet, Fundación Elcano and the ‘Instituto Universitario de Sistemas Inteligentes y Aplicaciones Numéricas en Ingeniería (SIANI)’ of the University of Las Palmas de Gran Canaria.

At the international level, we thank the Welsh government, DFKI, Occiglot project, especially Malte Ostendorff, and The Common Crawl Foundation, especially Pedro Ortiz, for their collaboration.

Their valuable efforts have been instrumental in the development of this work.

Disclaimer

Be aware that the model may contain biases or other unintended distortions. When third parties deploy systems or provide services based on this model, or use the model themselves, they bear the responsibility for mitigating any associated risks and ensuring compliance with applicable regulations, including those governing the use of Artificial Intelligence.

The Barcelona Supercomputing Center, as the owner and creator of the model, shall not be held liable for any outcomes resulting from third-party use.

License

Apache License, Version 2.0
