---
language: 
- ro
---

# RoBERT-small


## Pretrained BERT model for Romanian 

Pretrained model on the Romanian language using the masked language modeling (MLM) and next sentence prediction (NSP) objectives.
It was introduced in this [paper](https://www.aclweb.org/anthology/2020.coling-main.581/). Three BERT models were released: **RoBERT-small**, RoBERT-base, and RoBERT-large, all uncased.

| Model          | Weights | Layers (L) | Hidden size (H) | Attention heads (A) | MLM accuracy | NSP accuracy |
|----------------|:-------:|:----------:|:---------------:|:-------------------:|:------------:|:------------:|
| *RoBERT-small* | *19M*     | *12*   | *256*  | *8*    | *0.5363*       | *0.9687*       |
| RoBERT-base    | 114M      | 12     | 768    | 12     | 0.6511         | 0.9802         |
| RoBERT-large   | 341M      | 24     | 1024   | 24     | 0.6929         | 0.9843         |




All models are available:

* [RoBERT-small](https://huggingface.co/readerbench/RoBERT-small)
* [RoBERT-base](https://huggingface.co/readerbench/RoBERT-base)
* [RoBERT-large](https://huggingface.co/readerbench/RoBERT-large)



### How to use

```python
# TensorFlow
from transformers import AutoTokenizer, TFAutoModel
tokenizer = AutoTokenizer.from_pretrained("readerbench/RoBERT-small")
model = TFAutoModel.from_pretrained("readerbench/RoBERT-small")
inputs = tokenizer("exemplu de propoziție", return_tensors="tf")  # "example sentence"
outputs = model(inputs)

# PyTorch
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("readerbench/RoBERT-small")
model = AutoModel.from_pretrained("readerbench/RoBERT-small")
inputs = tokenizer("exemplu de propoziție", return_tensors="pt")
outputs = model(**inputs)
```
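
Since the model was pretrained with an MLM objective, a quick sanity check is to predict a masked token. The sketch below is illustrative (the example sentence is not from the paper, and it assumes the MLM head weights ship with the released checkpoint):

```python
# Minimal fill-mask sanity check of the MLM pretraining objective.
# Assumes the MLM head weights are included in the checkpoint.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="readerbench/RoBERT-small")

# The model is uncased, so the example is lowercased.
masked = f"bucurești este {fill_mask.tokenizer.mask_token} româniei."
for prediction in fill_mask(masked):
    print(prediction["token_str"], prediction["score"])
```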


## Training data

The model was trained on the following compilation of corpora; note that the statistics below are computed after the cleaning process.

| Corpus    | Words     | Sentences | Size (GB)|
|-----------|:---------:|:---------:|:--------:|
| Oscar     | 1.78B     | 87M       | 10.8     |
| RoTex     | 240M      | 14M       | 1.5      |
| RoWiki    | 50M       | 2M        | 0.3      |
| **Total** | **2.07B** | **103M**  | **12.6** |


## Downstream performance

### Sentiment analysis

We report the macro-averaged F1 score (in %) on the dev and test sets.

| Model            | Dev      | Test     |
|------------------|:--------:|:--------:|
| multilingual-BERT| 68.96    | 69.57    |
| XLM-R-base       | 71.26    | 71.71    |
| BERT-base-ro     | 70.49    | 71.02    |
| *RoBERT-small*   | *66.32*  | *66.37*  |
| RoBERT-base      | 70.89    | 71.61    |
| RoBERT-large     | **72.48**| **72.11**|
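
For reference, a sentiment classifier like the ones above can be built by fine-tuning RoBERT-small with the standard sequence-classification head. The sketch below is a hedged illustration: the toy texts, label count, and hyperparameters are assumptions, not the paper's setup.

```python
# Hedged fine-tuning sketch: standard sequence-classification head on top of
# RoBERT-small. Toy data and hyperparameters; not the paper's configuration.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("readerbench/RoBERT-small")
model = AutoModelForSequenceClassification.from_pretrained(
    "readerbench/RoBERT-small",
    num_labels=2,  # binary sentiment assumed for illustration
)
model.train()

# Toy examples: "a wonderful film", "a disappointing experience"
texts = ["un film minunat", "o experiență dezamăgitoare"]
labels = torch.tensor([1, 0])  # 1 = positive, 0 = negative (assumed)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=labels)
outputs.loss.backward()
optimizer.step()  # one illustrative gradient step
```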

### Moldavian vs. Romanian Dialect and Cross-dialect Topic identification

We report results on the [VarDial 2019](https://sites.google.com/view/vardial2019/campaign) Moldavian vs. Romanian dialect identification and cross-dialect topic identification tasks, as macro-averaged F1 scores (in %).

| Model             | Dialect Classification | MD to RO | RO to MD |
|-------------------|:----------------------:|:--------:|:--------:|
| 2-CNN + SVM       | 93.40                  | 65.09    | 75.21    |
| Char+Word SVM     | 96.20                  | 69.08    | 81.93    |
| BiGRU             | 93.30                  | **70.10**| 80.30    |
| multilingual-BERT | 95.34                  | 68.76    | 78.24    |
| XLM-R-base        | 96.28                  | 69.93    | 82.28    |
| BERT-base-ro      | 96.20                  | 69.93    | 78.79    |
| *RoBERT-small*    | *95.67*                | *69.01*  | *80.40*  |
| RoBERT-base       | 97.39                  | 68.30    | 81.09    |
| RoBERT-large      | **97.78**              | 69.91    | **83.65**|

### Diacritics Restoration

The challenge can be found [here](https://diacritics-challenge.speed.pub.ro/). We report results on the official test set, as accuracies (in %).

| Model                       | Word level | Char level |
|-----------------------------|:----------:|:----------:|
| BiLSTM                      | 99.42      | -          |
| CharCNN                     | 98.40      | 99.65      |
| CharCNN + multilingual-BERT | 99.72      | 99.94      |
| CharCNN + XLM-R-base        | 99.76      | **99.95**  |
| CharCNN + BERT-base-ro      | **99.79**  | **99.95**  |
| *CharCNN + RoBERT-small*    | *99.73*    | *99.94*    |
| CharCNN + RoBERT-base       | 99.78      | **99.95**  |
| CharCNN + RoBERT-large      | 99.76      | **99.95**  |


## BibTeX entry and citation info

```bibtex
@inproceedings{masala2020robert,
  title={RoBERT--A Romanian BERT Model},
  author={Masala, Mihai and Ruseti, Stefan and Dascalu, Mihai},
  booktitle={Proceedings of the 28th International Conference on Computational Linguistics},
  pages={6626--6637},
  year={2020}
}
```