Update README.md
README.md CHANGED
@@ -73,8 +73,12 @@ binary models. For the tokenizers, this is the template their file name follows:
 
 In this repository, all the models are based on the Jomleh dataset (`jomleh`). And the only
 tokenizer used is SentencePiece (`sp`). Finally, the list of vocabulary sizes used is composed
-of 2000, 4000, 8000, 16000, 32000, and 57218 tokens.
-
+of 2000, 4000, 8000, 16000, 32000, and 57218 tokens. Due to hardware limitations I
+faced, I could only use 4 of Jomleh's 60 text files to train the tokenizer, namely
+files 10, 11, 12, and 13. Also, 57218 was the largest vocabulary size that
+SentencePiece allowed.
+
+Here's an example of the tokenizer files you can find in this repository:
 
 ```
 jomleh-sp-32000.model
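
For anyone trying to reproduce the training setup described in the diff, here is a minimal sketch using the `sentencepiece` Python package. The input file names are hypothetical (the diff doesn't give the corpus's actual file names), and SentencePiece rejects a `vocab_size` larger than the training corpus can support, which is presumably why 57218 was the ceiling here.

```python
# Minimal sketch of the tokenizer training described above, using the
# sentencepiece Python package. The input file names are hypothetical;
# the README only says files 10-13 of 60 were used.
import sentencepiece as spm

input_files = [f"jomleh_{i}.txt" for i in (10, 11, 12, 13)]  # hypothetical names

spm.SentencePieceTrainer.train(
    input=",".join(input_files),     # comma-separated list of training files
    model_prefix="jomleh-sp-57218",  # produces jomleh-sp-57218.model and .vocab
    vocab_size=57218,                # largest value accepted here; larger values
                                     # fail with a "Vocabulary size too high" error
)
```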
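
And a sketch of loading one of the published tokenizer files, such as `jomleh-sp-32000.model`, and encoding text with it; the sample sentence is an arbitrary Farsi placeholder, not from the dataset.

```python
# Minimal sketch of loading a published tokenizer file and encoding text.
# The sample sentence is an arbitrary placeholder.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="jomleh-sp-32000.model")

sentence = "این یک جمله است."            # "This is a sentence."
print(sp.encode(sentence, out_type=str))  # subword pieces
print(sp.encode(sentence, out_type=int))  # token ids
```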