mehran committed
Commit 6d91bc8
1 Parent(s): 9bbbdd9

Update README.md

Files changed (1):
  README.md +6 -2

README.md CHANGED
@@ -73,8 +73,12 @@ binary models. For the tokenizers, this is the template their file name follows:
 
 In this repository, all the models are based on the Jomleh dataset (`jomleh`). And the only
 tokenizer used is SentencePiece (`sp`). Finally, the list of vocabulary sizes used is composed
-of 2000, 4000, 8000, 16000, 32000, and 57218 tokens. Here's an example of the tokenizer
-files you can find:
+of 2000, 4000, 8000, 16000, 32000, and 57218 tokens. Due to hardware limitations I faced,
+I could only use 4 of Jomleh's 60 text files to train the tokenizer, namely files 10, 11,
+12, and 13. Also, 57218 was the largest vocabulary size that SentencePiece allowed to be
+set.
+
+Here's an example of the tokenizer files you can find in this repository:
 
 ```
 jomleh-sp-32000.model