RoBERTa Turkish medium WordPiece 28k (uncased)

Pretrained model for the Turkish language, trained with a masked language modeling (MLM) objective. The model is uncased. The pretraining corpus is the Turkish split of OSCAR, further filtered and cleaned.

The model architecture is similar to bert-medium (8 layers, 8 attention heads, and a hidden size of 512). The tokenization algorithm is WordPiece, and the vocabulary size is 28.6k.
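
As a rough illustration of how WordPiece splits a word into subword pieces, the sketch below implements greedy longest-match-first lookup in plain Python. It is a minimal sketch, not the tokenizer shipped with this model: the function names and the five-entry `toy_vocab` are hypothetical, the `##` continuation prefix follows the common BERT convention and is assumed here, and plain `str.lower()` stands in for the uncased normalization.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # non-initial pieces carry the prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word maps to unknown
        pieces.append(match)
        start = end
    return pieces


def tokenize(text, vocab):
    """Lowercase, split on whitespace, and apply WordPiece to each word."""
    # Simplification: str.lower() does not handle Turkish-specific casing (I/ı, İ/i).
    tokens = []
    for word in text.lower().split():
        tokens.extend(wordpiece_tokenize(word, vocab))
    return tokens


# Hypothetical toy vocabulary, NOT the model's real 28.6k-entry vocabulary.
toy_vocab = {"ev", "##ler", "##de", "kitap", "##lar"}

print(tokenize("Evlerde kitaplar", toy_vocab))
```

With this toy vocabulary, a word that cannot be covered by any sequence of pieces (for example `araba`) is replaced by a single `[UNK]` token.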

Details and performance comparisons can be found in this paper: https://arxiv.org/abs/2204.08832

The following code can be used for model loading and tokenization. The example maximum length (514) can be changed:

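    from transformers import AutoModel, AutoModelForSequenceClassification, PreTrainedTokenizerFast

    # Load the pretrained encoder; replace [model_path] with the model path.
    model = AutoModel.from_pretrained([model_path])
    # For sequence classification, load a model with a classification head instead:
    # model = AutoModelForSequenceClassification.from_pretrained([model_path], num_labels=[num_classes])

    # Load the tokenizer from its tokenizer file; replace [file_path] with that file's path.
    tokenizer = PreTrainedTokenizerFast(tokenizer_file=[file_path])
    # Register the special tokens used by the model.
    tokenizer.mask_token = "[MASK]"
    tokenizer.cls_token = "[CLS]"
    tokenizer.sep_token = "[SEP]"
    tokenizer.pad_token = "[PAD]"
    tokenizer.unk_token = "[UNK]"
    tokenizer.bos_token = "[CLS]"
    tokenizer.eos_token = "[SEP]"
    # Example maximum sequence length; can be changed.
    tokenizer.model_max_length = 514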
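
To make the roles of the special tokens and the maximum length more concrete, the following sketch wraps a list of subword pieces in `[CLS]` and `[SEP]`, truncates it to a maximum length, and pads the remainder with `[PAD]`. It is an illustration in plain Python under stated assumptions: `build_input` is a hypothetical helper, and the template that the loaded tokenizer actually applies is determined by the tokenizer file, not by this code.

```python
def build_input(pieces, max_length=514, cls="[CLS]", sep="[SEP]", pad="[PAD]"):
    """Wrap pieces in [CLS] ... [SEP], truncate to max_length, then pad."""
    room = max_length - 2  # reserve two slots for [CLS] and [SEP]
    tokens = [cls] + pieces[:room] + [sep]
    attention_mask = [1] * len(tokens)  # 1 marks real tokens
    padding = max_length - len(tokens)
    tokens += [pad] * padding
    attention_mask += [0] * padding  # 0 marks padding positions
    return tokens, attention_mask


# Short example with a small max_length so the padding is visible.
tokens, mask = build_input(["ev", "##ler"], max_length=6)
print(tokens)
print(mask)
```

The attention mask lets a model ignore the padded positions; sequences longer than `max_length - 2` pieces are cut off before `[SEP]` is appended.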

BibTeX entry and citation info

@misc{https://doi.org/10.48550/arxiv.2204.08832,
  doi = {10.48550/ARXIV.2204.08832},
  url = {https://arxiv.org/abs/2204.08832},
  author = {Toraman, Cagri and Yilmaz, Eyup Halit and Şahinuç, Furkan and Ozcelik, Oguzhan},
  keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences},
  title = {Impact of Tokenization on Language Models: An Analysis for Turkish},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution Non Commercial Share Alike 4.0 International}
}
