|
---
license: mit
---
|
# 🇹🇷 RoBERTaTurk
|
|
|
## Model description
|
This is a Turkish RoBERTa base model pretrained on Turkish Wikipedia, Turkish OSCAR, and some news websites.
|
|
|
The final training corpus is 38 GB in size and contains 329,720,508 sentences.
|
|
|
Thanks to Turkcell, we were able to train the model for 2.5M steps on a machine with an Intel(R) Xeon(R) Gold 6230R CPU @ 2.10GHz, 256 GB RAM, and 2 x GV100GL [Tesla V100 PCIe 32GB] GPUs.
|
|
|
## Usage
|
Load the model and tokenizer with the transformers library:
|
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("burakaytan/roberta-base-turkish-uncased")
model = AutoModelForMaskedLM.from_pretrained("burakaytan/roberta-base-turkish-uncased")
```
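
Since this is a masked language model, a quick way to try it out is the `fill-mask` pipeline. The snippet below is a minimal sketch; the Turkish example sentence is only illustrative and not from the training data.

```python
from transformers import pipeline

# Build a fill-mask pipeline on top of the pretrained checkpoint.
fill_mask = pipeline(
    "fill-mask",
    model="burakaytan/roberta-base-turkish-uncased",
    tokenizer="burakaytan/roberta-base-turkish-uncased",
)

# Illustrative sentence: "bugün hava çok <mask>." ("the weather is very <mask> today."),
# lowercased because the model is uncased. RoBERTa uses <mask> as its mask token.
for prediction in fill_mask("bugün hava çok <mask>."):
    print(prediction["token_str"], prediction["score"])
```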
|
|
|
|
|
## Citation and Related Information
|
|
|
To cite this model:
|
```bibtex
@INPROCEEDINGS{999,
  author={Aytan, Burak and Sakar, C. Okan},
  booktitle={2022 30th Signal Processing and Communications Applications Conference (SIU)},
  title={Comparison of Transformer-Based Models Trained in Turkish and Different Languages on Turkish Natural Language Processing Problems},
  year={2022},
  volume={},
  number={},
  pages={},
  doi={}}
```