metadata
license: mit
🇹🇷 RoBERTaTurk
Model description
This is a Turkish RoBERTa base model pretrained on Turkish Wikipedia, Turkish OSCAR, and some news websites.
The final training corpus has a size of 38 GB and 329.720.508 sentences.
Thanks to Turkcell we could train the model on Intel(R) Xeon(R) Gold 6230R CPU @ 2.10GHz 256GB RAM 2 x GV100GL [Tesla V100 PCIe 32GB] GPU for 2.5M steps.
Usage
Load transformers library with:
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("burakaytan/roberta-base-turkish-uncased")
model = AutoModelForMaskedLM.from_pretrained("burakaytan/roberta-base-turkish-uncased")
Citation and Related Information
To cite this model:
@INPROCEEDINGS{999,
author={Aytan, Burak and Sakar, C. Okan},
booktitle={2022 30th Signal Processing and Communications Applications Conference (SIU)},
title={Comparison of Transformer-Based Models Trained in Turkish and Different Languages on Turkish Natural Language Processing Problems},
year={2022},
volume={},
number={},
pages={},
doi={}}