Kaz-RoBERTa is a transformers model pretrained on a large corpus of Kazakh data in a self-supervised fashion. More precisely, it was pretrained with the masked language modeling (MLM) objective.

## Usage

You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> pipe = pipeline('fill-mask', model='kz-transformers/kaz-roberta-conversational')
>>> pipe("Мәтел тура, ауыспалы, астарлы <mask> қолданылады")
# (roughly: "A saying is used in a literal, figurative, or veiled <mask> [sense]")
# Out:
# [{'score': 0.8131822347640991,
#   'token': 18749,
#   'token_str': ' мағынада',
#   'sequence': 'Мәтел тура, ауыспалы, астарлы мағынада қолданылады'},
#  ...]
```
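
Beyond the fill-mask pipeline, the checkpoint can also serve as a plain encoder. The following is a minimal sketch, not from the original card, that extracts contextual embeddings with the generic `AutoTokenizer`/`AutoModel` classes; the example sentence is arbitrary:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and the base encoder (AutoModel drops the MLM head).
tokenizer = AutoTokenizer.from_pretrained("kz-transformers/kaz-roberta-conversational")
model = AutoModel.from_pretrained("kz-transformers/kaz-roberta-conversational")

# Arbitrary Kazakh sentence: "Kazakhstan is located in Central Asia."
inputs = tokenizer("Қазақстан Орталық Азияда орналасқан.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One vector per token; shape is (1, sequence_length, hidden_size).
print(outputs.last_hidden_state.shape)
```
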
## Training data

The Kaz-RoBERTa model was pretrained on the combination of two datasets:
- [MDBKD](https://huggingface.co/datasets/kz-transformers/multidomain-kazakh-dataset), the Multi-Domain Bilingual Kazakh Dataset: a Kazakh-language corpus of 24,883,808 unique texts from multiple domains.
- [Conversational data](https://beeline.kz/): preprocessed dialogs between the customer support team and clients of Beeline KZ (Veon Group).
Together, these datasets amount to 25 GB of text.
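
The MDBKD portion is hosted on the Hugging Face Hub, so it can be pulled directly with the `datasets` library. A small sketch, assuming the default configuration exposes a `text` column (the exact schema is not described in this card):

```python
from datasets import load_dataset

# Stream the corpus to avoid downloading all ~25 GB at once.
mdbkd = load_dataset(
    "kz-transformers/multidomain-kazakh-dataset",
    split="train",
    streaming=True,
)

# Peek at the first record; the "text" field name is an assumption.
first = next(iter(mdbkd))
print(first["text"][:200])
```
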
## Training procedure
The model was trained on 2 V100 GPUs for 500K steps with a batch size of 128 and a sequence length of 512.
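
The original training script is not part of this card. As a rough illustration only, a setup matching the stated recipe (MLM objective, 512-token sequences, an effective batch of 128 across 2 GPUs, 500K steps) could be expressed with the `transformers` `Trainer`; the masking probability, learning rate, and data handling below are assumptions:

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    RobertaForMaskedLM,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("kz-transformers/kaz-roberta-conversational")
model = RobertaForMaskedLM.from_pretrained("kz-transformers/kaz-roberta-conversational")

# Tokenize the corpus to the stated 512-token sequence length
# (the "text" column name is an assumption about the dataset schema).
raw = load_dataset("kz-transformers/multidomain-kazakh-dataset", split="train")
dataset = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=raw.column_names,
)

# Dynamic masking as in RoBERTa; 15% is the usual default, not stated in the card.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="kaz-roberta-mlm",
    max_steps=500_000,               # 500K steps, as stated in the card
    per_device_train_batch_size=64,  # 64 x 2 V100 GPUs = batch size 128
    learning_rate=1e-4,              # assumed; not given in the card
)

Trainer(model=model, args=args, data_collator=collator, train_dataset=dataset).train()
```
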
### Contributions
Thanks to [@BeksultanSagyndyk](https://github.com/BeksultanSagyndyk) and [@SanzharMrz](https://github.com/SanzharMrz) for adding this model.

**Point of Contact:** [Sanzhar Murzakhmetov](mailto:[email protected]), [Beksultan Sagyndyk](mailto:[email protected])
---
|