Kaz-RoBERTa is a transformers model pretrained on a large corpus of Kazakh data in a self-supervised fashion. More precisely, it was pretrained with the masked language modeling (MLM) objective.

## Usage

You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> pipe = pipeline('fill-mask', model='kz-transformers/kaz-roberta-conversational')
>>> pipe("Мәтел тура, ауыспалы, астарлы <mask> қолданылады")
# (roughly: "A saying is used in a literal, figurative, or veiled <mask> [sense]")
# Out:
# [{'score': 0.8131822347640991,
#   'token': 18749,
#   'token_str': ' мағынада',
#   'sequence': 'Мәтел тура, ауыспалы, астарлы мағынада қолданылады'},
#  ...]
```
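
Beyond the fill-mask pipeline, the checkpoint can also serve as a plain encoder. The following is a minimal sketch, not from the original card, that extracts contextual embeddings with the generic `AutoTokenizer`/`AutoModel` classes; the example sentence is arbitrary:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and the base encoder (AutoModel drops the MLM head).
tokenizer = AutoTokenizer.from_pretrained("kz-transformers/kaz-roberta-conversational")
model = AutoModel.from_pretrained("kz-transformers/kaz-roberta-conversational")

# Arbitrary Kazakh sentence: "Kazakhstan is located in Central Asia."
inputs = tokenizer("Қазақстан Орталық Азияда орналасқан.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One vector per token; shape is (1, sequence_length, hidden_size).
print(outputs.last_hidden_state.shape)
```
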
## Training data

The Kaz-RoBERTa model was pretrained on the combination of two datasets:
- [MDBKD](https://huggingface.co/datasets/kz-transformers/multidomain-kazakh-dataset), the Multi-Domain Bilingual Kazakh Dataset: a Kazakh-language corpus of 24,883,808 unique texts from multiple domains.
- [Conversational data](https://beeline.kz/): preprocessed dialogs between the customer support team and clients of Beeline KZ (Veon Group).
Together, these datasets amount to 25 GB of text.
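
The MDBKD portion is hosted on the Hugging Face Hub, so it can be pulled directly with the `datasets` library. A small sketch, assuming the default configuration exposes a `text` column (the exact schema is not described in this card):

```python
from datasets import load_dataset

# Stream the corpus to avoid downloading all ~25 GB at once.
mdbkd = load_dataset(
    "kz-transformers/multidomain-kazakh-dataset",
    split="train",
    streaming=True,
)

# Peek at the first record; the "text" field name is an assumption.
first = next(iter(mdbkd))
print(first["text"][:200])
```
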
## Training procedure
The model was trained on 2 V100 GPUs for 500K steps with a batch size of 128 and a sequence length of 512.
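
The original training script is not part of this card. As a rough illustration only, a setup matching the stated recipe (MLM objective, 512-token sequences, an effective batch of 128 across 2 GPUs, 500K steps) could be expressed with the `transformers` `Trainer`; the masking probability, learning rate, and data handling below are assumptions:

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    RobertaForMaskedLM,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("kz-transformers/kaz-roberta-conversational")
model = RobertaForMaskedLM.from_pretrained("kz-transformers/kaz-roberta-conversational")

# Tokenize the corpus to the stated 512-token sequence length
# (the "text" column name is an assumption about the dataset schema).
raw = load_dataset("kz-transformers/multidomain-kazakh-dataset", split="train")
dataset = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=raw.column_names,
)

# Dynamic masking as in RoBERTa; 15% is the usual default, not stated in the card.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="kaz-roberta-mlm",
    max_steps=500_000,               # 500K steps, as stated in the card
    per_device_train_batch_size=64,  # 64 x 2 V100 GPUs = batch size 128
    learning_rate=1e-4,              # assumed; not given in the card
)

Trainer(model=model, args=args, data_collator=collator, train_dataset=dataset).train()
```
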
### Contributions
Thanks to [@BeksultanSagyndyk](https://github.com/BeksultanSagyndyk) and [@SanzharMrz](https://github.com/SanzharMrz) for adding this model.

**Point of Contact:** [Sanzhar Murzakhmetov](mailto:[email protected]), [Beksultan Sagyndyk](mailto:[email protected])
---
|