---
language: tr
license: mit
---
🇹🇷 RoBERTaTurk
## Model description
This is a Turkish RoBERTa base model pretrained on Turkish Wikipedia, the Turkish OSCAR corpus, and text from news websites.
The final training corpus is 38 GB in size and contains 329,720,508 sentences.
Thanks to Turkcell, we were able to train the model for 2.5M steps on a machine with an Intel(R) Xeon(R) Gold 6230R CPU @ 2.10GHz, 256 GB of RAM, and 2 x GV100GL [Tesla V100 PCIe 32GB] GPUs.
## Usage
Load the tokenizer and model with the transformers library:
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("burakaytan/roberta-base-turkish-uncased")
model = AutoModelForMaskedLM.from_pretrained("burakaytan/roberta-base-turkish-uncased")
```
## Fill Mask Usage
```python
from transformers import pipeline
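# Build a fill-mask pipeline backed by the Turkish RoBERTa checkpoint.
fill_mask = pipeline(
    "fill-mask",
    model="burakaytan/roberta-base-turkish-uncased",
    tokenizer="burakaytan/roberta-base-turkish-uncased"
)
# <mask> marks the position to predict; the call returns the top candidates.
fill_mask("iki ülke arasında <mask> başladı")
# Example output: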
[{'sequence': 'iki ülke arasında savaş başladı',
'score': 0.3013845384120941,
'token': 1359,
'token_str': ' savaş'},
{'sequence': 'iki ülke arasında müzakereler başladı',
'score': 0.1058429479598999,
'token': 30439,
'token_str': ' müzakereler'},
{'sequence': 'iki ülke arasında görüşmeler başladı',
'score': 0.07718811184167862,
'token': 4916,
'token_str': ' görüşmeler'},
{'sequence': 'iki ülke arasında kriz başladı',
'score': 0.07174749672412872,
'token': 3908,
'token_str': ' kriz'},
{'sequence': 'iki ülke arasında çatışmalar başladı',
'score': 0.05678590387105942,
'token': 19346,
'token_str': ' çatışmalar'}]
```
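The pipeline returns a list of dictionaries, so the predictions are easy to post-process. The following is a minimal sketch that works on the sample output shown above and needs no model download. The helper name `summarize_predictions` and the `min_score` threshold are illustrative assumptions, not part of the transformers API. It strips the leading space that the tokenizer attaches to `token_str` and keeps only candidates at or above a chosen score.

```python
from typing import Dict, List, Tuple

# Sample predictions copied from the fill-mask output above.
results: List[Dict] = [
    {"sequence": "iki ülke arasında savaş başladı",
     "score": 0.3013845384120941, "token": 1359, "token_str": " savaş"},
    {"sequence": "iki ülke arasında müzakereler başladı",
     "score": 0.1058429479598999, "token": 30439, "token_str": " müzakereler"},
    {"sequence": "iki ülke arasında görüşmeler başladı",
     "score": 0.07718811184167862, "token": 4916, "token_str": " görüşmeler"},
    {"sequence": "iki ülke arasında kriz başladı",
     "score": 0.07174749672412872, "token": 3908, "token_str": " kriz"},
    {"sequence": "iki ülke arasında çatışmalar başladı",
     "score": 0.05678590387105942, "token": 19346, "token_str": " çatışmalar"},
]


def summarize_predictions(
    results: List[Dict], min_score: float = 0.0
) -> List[Tuple[str, float]]:
    """Hypothetical helper: return (word, score) pairs, best first.

    Candidates scoring below `min_score` are dropped, and the leading
    space in `token_str` is stripped.
    """
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [
        (r["token_str"].strip(), r["score"])
        for r in ranked
        if r["score"] >= min_score
    ]


# Print the candidates that clear an (arbitrarily chosen) 0.07 threshold.
for word, score in summarize_predictions(results, min_score=0.07):
    print(f"{word:<15}{score:.3f}")
```

In practice, `results` would be the return value of the `fill_mask(...)` call rather than a hard-coded list.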
## Citation and Related Information
To cite this model:
```bibtex
@inproceedings{aytan2022comparison,
title={Comparison of Transformer-Based Models Trained in Turkish and Different Languages on Turkish Natural Language Processing Problems},
author={Aytan, Burak and Sakar, C Okan},
booktitle={2022 30th Signal Processing and Communications Applications Conference (SIU)},
pages={1--4},
year={2022},
organization={IEEE}
}
```