Update README.md
README.md CHANGED
@@ -73,8 +73,12 @@ binary models. For the tokenizers, this is the template their file name follows:
 
 In this repository, all the models are based on the Jomleh dataset (`jomleh`). And the only
 tokenizer used is SentencePiece (`sp`). Finally, the list of vocabulary sizes used is composed
-of 2000, 4000, 8000, 16000, 32000, and 57218 tokens.
-
+of 2000, 4000, 8000, 16000, 32000, and 57218 tokens. Due to hardware limitations I
+faced, I could only use 4 of Jomleh's 60 text files to train the tokenizer, namely
+files 10, 11, 12, and 13. Also, 57218 was the largest vocabulary size that
+SentencePiece allowed.
+
+Here's an example of the tokenizer files you can find in this repository:
 
 ```
 jomleh-sp-32000.model
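
For anyone trying to reproduce the training setup described in the diff, here is a minimal sketch using the `sentencepiece` Python package. The input file names are hypothetical (the diff doesn't give the corpus's actual file names), and SentencePiece rejects a `vocab_size` larger than the training corpus can support, which is presumably why 57218 was the ceiling here.

```python
# Minimal sketch of the tokenizer training described above, using the
# sentencepiece Python package. The input file names are hypothetical;
# the README only says files 10-13 of 60 were used.
import sentencepiece as spm

input_files = [f"jomleh_{i}.txt" for i in (10, 11, 12, 13)]  # hypothetical names

spm.SentencePieceTrainer.train(
    input=",".join(input_files),     # comma-separated list of training files
    model_prefix="jomleh-sp-57218",  # produces jomleh-sp-57218.model and .vocab
    vocab_size=57218,                # largest value accepted here; larger values
                                     # fail with a "Vocabulary size too high" error
)
```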
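
And a sketch of loading one of the published tokenizer files, such as `jomleh-sp-32000.model`, and encoding text with it; the sample sentence is an arbitrary Farsi placeholder, not from the dataset.

```python
# Minimal sketch of loading a published tokenizer file and encoding text.
# The sample sentence is an arbitrary placeholder.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="jomleh-sp-32000.model")

sentence = "این یک جمله است."            # "This is a sentence."
print(sp.encode(sentence, out_type=str))  # subword pieces
print(sp.encode(sentence, out_type=int))  # token ids
```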