Kaz-RoBERTa (base-sized model)

Model description

Kaz-RoBERTa is a transformers model pretrained on a large corpus of Kazakh data in a self-supervised fashion. More precisely, it was pretrained with the masked language modeling (MLM) objective.

Usage

You can use this model directly with a pipeline for masked language modeling:

>>> from transformers import pipeline
>>> pipe = pipeline('fill-mask', model='kz-transformers/kaz-roberta-conversational')
>>> pipe("Мәтел тура, ауыспалы, астарлы <mask> қолданылады")
# Out:
# [{'score': 0.8131822347640991,
#   'token': 18749,
#   'token_str': ' мағынада',
#   'sequence': 'Мәтел тура, ауыспалы, астарлы мағынада қолданылады'},
#  ...]
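
The pipeline returns a list of candidate dictionaries with the keys shown in the output above. As a minimal sketch of post-processing that list, the helper below picks the highest-scoring candidate and strips the leading space that the tokenizer keeps in token_str. The name best_fill is hypothetical and not part of transformers; the sample list contains only the first candidate from the output above.

```python
def best_fill(predictions):
    """Return (word, score) for the highest-scoring fill-mask candidate."""
    best = max(predictions, key=lambda p: p["score"])
    # token_str keeps the leading space produced by the tokenizer; strip it.
    return best["token_str"].strip(), best["score"]


# First candidate copied from the pipeline output above.
predictions = [
    {
        "score": 0.8131822347640991,
        "token": 18749,
        "token_str": " мағынада",
        "sequence": "Мәтел тура, ауыспалы, астарлы мағынада қолданылады",
    },
]

word, score = best_fill(predictions)
print(word, round(score, 3))
```

In practice the list would be the direct return value of pipe(...), so the helper can be called as best_fill(pipe(text)).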

Training data

The Kaz-RoBERTa model was pretrained on the union of two datasets:

  • MDBKD (Multi-Domain Bilingual Kazakh Dataset): a Kazakh-language dataset containing 24,883,808 unique texts from multiple domains.
  • Conversational data: preprocessed dialogs between the Customer Support Team and clients of Beeline KZ (Veon Group).

Together, these datasets amount to 25 GB of text.

Training procedure

Preprocessing

The texts are tokenized using a byte-level version of Byte-Pair Encoding (BPE) with a vocabulary size of 52,000. The model takes inputs of 512 contiguous tokens, which may span multiple documents. The beginning of a new document is marked with <s> and the end with </s>.
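
The packing step described above can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' preprocessing code: the function name pack_documents is hypothetical, the IDs 0 and 2 for <s> and </s> are placeholders (the real values come from the model's tokenizer), and dropping the incomplete final block is one possible choice that the description does not specify.

```python
BLOCK_SIZE = 512  # sequence length stated in this model card


def pack_documents(docs, bos_id, eos_id, block_size=BLOCK_SIZE):
    """Wrap each tokenized document in <s> ... </s>, concatenate all of them,
    and cut the resulting stream into full blocks of block_size tokens."""
    stream = []
    for doc in docs:
        stream.append(bos_id)
        stream.extend(doc)
        stream.append(eos_id)
    # Assumption: the trailing remainder shorter than block_size is dropped.
    n_full = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_full)]


# Two toy "documents" of 300 and 400 made-up token IDs.
docs = [list(range(10, 310)), list(range(10, 410))]
blocks = pack_documents(docs, bos_id=0, eos_id=2)  # placeholder special-token IDs

print(len(blocks), len(blocks[0]))
# The first block ends document 1 and starts document 2 in the middle.
print(blocks[0][301], blocks[0][302])
```

Because blocks are cut from one continuous stream, a single 512-token input can contain the end of one document and the start of the next, which is what "may span multiple documents" refers to.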

Pretraining

The model was trained on 2 V100 GPUs for 500K steps with a batch size of 128 and a sequence length of 512. The MLM masking probability was 15%, with num_attention_heads=12 and num_hidden_layers=6.
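
To make the 15% masking probability concrete, here is a simplified sketch of how MLM training examples can be built. It is not the authors' training code. The function names and the mask ID 4 are hypothetical, and every selected token is replaced with the mask token, whereas common implementations also leave some selected tokens unchanged or swap in random ones. The label value -100 follows the usual PyTorch convention for positions ignored by the loss.

```python
import random

MLM_PROBABILITY = 0.15  # masking probability stated in this model card


def select_mask_positions(token_ids, special_ids, rng, mlm_probability=MLM_PROBABILITY):
    """Pick each non-special position independently with probability mlm_probability."""
    return [
        i for i, tok in enumerate(token_ids)
        if tok not in special_ids and rng.random() < mlm_probability
    ]


def apply_mask(token_ids, positions, mask_id):
    """Return (masked inputs, labels); labels are -100 wherever nothing was masked."""
    masked = list(token_ids)
    labels = [-100] * len(token_ids)
    for i in positions:
        labels[i] = token_ids[i]  # the model must predict the original token
        masked[i] = mask_id
    return masked, labels


rng = random.Random(0)
# Toy sequence: placeholder <s>=0, 10,000 made-up token IDs, placeholder </s>=2.
token_ids = [0] + list(range(10, 10010)) + [2]
positions = select_mask_positions(token_ids, special_ids={0, 2}, rng=rng)
masked, labels = apply_mask(token_ids, positions, mask_id=4)

print(f"masked fraction: {len(positions) / 10000:.3f}")
```

On a long sequence the masked fraction comes out close to 0.15, and the special tokens are never selected.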

Contributions

Thanks to @BeksultanSagyndyk and @SanzharMrz for adding this model. Point of contact: Sanzhar Murzakhmetov, Beksultan Sagyndyk.
