mehran committed on
Commit
9bbbdd9
1 Parent(s): c45e4f5

Update README.md

Files changed (1)
  1. README.md +56 -20
README.md CHANGED
@@ -14,24 +14,25 @@ tags:
 
  # KenLM models for Farsi
 
- This repository contains trained KenLM models for Farsi (Persian) language trained on the Jomleh
- dataset. Among all the use cases for the language models like KenLM, the models provided here are
- very useful for ASR (automatic speech recognition) task. They can be used along with CTC to select
- the more likely sequence of tokens extracted from spectogram.
+ This repository contains KenLM models trained on the Jomleh dataset for the Farsi (Persian)
+ language. Among the various use cases for KenLM language models, the models provided here are
+ particularly useful for automatic speech recognition (ASR) tasks. They can be used in conjunction
+ with CTC to select the most likely sequence of tokens extracted from a spectrogram.
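
Below is a minimal sketch of how one of these models could be used to rescore candidate token
sequences (for example, the n-best output of a CTC decoder). It assumes the `kenlm` Python package
is installed and one of the binary files has been downloaded locally; the file name and the
candidate strings are only illustrative.

```
# Rescore CTC candidates with a KenLM model and keep the most likely one.
# Assumes the `kenlm` package and a locally downloaded .probing file.
import kenlm

model = kenlm.Model("jomleh-sp-32000-o5-prune01111.probing")

# Candidate token sequences (space-separated SentencePiece pieces) produced
# by a CTC decoder; the language model picks the highest-scoring one.
candidates = [
    "▁سلام ▁دنیا",
    "▁سلام ▁دنی ا",
]
best = max(candidates, key=lambda s: model.score(s, bos=True, eos=True))
print(best)
```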
 
- The models in this repository are KenLM arpa files turned into binary. KenLM supports two types of
- binary formats: probing and trie. The models provided here are of the probing format. KenLM claims
- that they are faster but with bigger memory footprint.
+ The models in this repository are KenLM arpa files that have been converted to binary format.
+ KenLM supports two binary formats: probing and trie. The models provided here are in the probing
+ format, which KenLM claims is faster but has a larger memory footprint.
 
- There are a total 36 different KenLM models that you can find here. Unless you are doing some
- research, you won't be needing all of them. If that's the case, I suggest downloading the ones you
- need and not the whole repository. As the total size of files is larger than half a TB.
+ There are a total of 36 different KenLM models available in this repository. Unless you are
+ conducting research, you will not need all of them. In that case, it is recommended that you
+ download only the models you require rather than the entire repository since the total file size
+ is over half a terabyte.
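
If you only need one or two of the files, a sketch like the following could be used to fetch a
single file from this repository instead of cloning all of it. It assumes the `huggingface_hub`
package; the `repo_id` below is a placeholder that has to be replaced with this repository's
actual id on the Hub.

```
# Download a single KenLM binary instead of the whole (500+ GB) repository.
# The repo_id is a placeholder, not the real repository id.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="<user>/<this-repository>",
    filename="files/jomleh-sp-32000-o5-prune01111.probing",
)
print(path)
```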
 
  # Sample code how to use the models
 
  Unfortunately, I could not find an easy way to integrate the Python code that loads the models
- using Huggingface library. These are the steps that you have to take if you want to use any of the
- models provided here:
+ using the Hugging Face library. These are the steps that you have to take when you want to use any
+ of the models provided here:
 
  1. Install KenLM package:
 
@@ -61,13 +62,48 @@ files/jomleh-sp-32000-o5-prune01111.probing
  ```
  ```
 
- # What are the different models provided here
+ # What are the different files you can find in this repository?
 
- There a total of 36 models in this repository and while all of the are trained on Jomleh daatset,
- which is a Farsi dataset, there differences among them. Namely:
+ The files you can find in this repository are either SentencePiece tokenizer models or KenLM
+ binary models. For the tokenizers, this is the template their file names follow:
 
- 1. Different vocabulary sizes: For research purposes, I trained on 6 different vocabulary sizes.
- Of course, the vocabulary size is a hyperparameter for the tokenizer (SentencePiece here) but
- once you have a new tokenizer, it will result in a new model. The different vocabulary sizes used
- here are: 2000, 4000, 8000, 16000, 32000, and 57218 tokens. For most use cases, ethier 32000 or
- 57218 token vocabulary size should be the best option.
+ ```
+ <dataset-name>-<tokenizer-type>-<vocabulary-size>.<model|vocab>
+ ```
+
+ In this repository, all the models are based on the Jomleh dataset (`jomleh`), and the only
+ tokenizer used is SentencePiece (`sp`). Finally, the list of vocabulary sizes used is composed
+ of 2000, 4000, 8000, 16000, 32000, and 57218 tokens. Here's an example of the tokenizer
+ files you can find:
+
+ ```
+ jomleh-sp-32000.model
+ ```
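
To illustrate how such a tokenizer file could be used, the sketch below loads it with the
`sentencepiece` package and produces the space-separated tokens that the KenLM models expect as
input; the sample sentence is only illustrative.

```
# Tokenize a Farsi sentence into the space-separated pieces the KenLM models expect.
# Assumes the `sentencepiece` package and a locally downloaded tokenizer file.
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("jomleh-sp-32000.model")

tokens = sp.encode_as_pieces("سلام دنیا")
print(" ".join(tokens))
```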
+
+ Moving on to the KenLM binary models, their file names follow this template:
+
+ ```
+ <dataset-name>-<tokenizer-type>-<vocabulary-size>-o<n-gram>-prune<pruning>.probing
+ ```
+
+ Just like with the tokenizers, the only available options for dataset and tokenizer type are
+ `jomleh` and `sp`. The same applies to vocabulary sizes. Two n-gram orders were trained:
+ 3-gram and 5-gram. Additionally, there are three different pruning options available for each
+ configuration. To interpret the pruning numbers, add a space between each pair of digits. For
+ example, `011` means `0 1 1` was set during training of the KenLM model.
+
+ Here is a complete example: To train the binary model named `jomleh-sp-32000-o5-prune01111.probing`,
+ the tokenizer `jomleh-sp-32000.model` was used to encode (tokenize) 95% of the Jomleh dataset,
+ resulting in a large text file holding space-separated tokens. Then, the file was fed into the
+ `lmplz` program with the following input arguments:
+
+ ```
+ lmplz -o 5 -T /tmp --vocab_estimate 32000 -S 80% --discount_fallback --prune "0 1 1 1 1" < encoded.txt > jomleh-sp-32000-o5-prune01111.arpa
+ ```
+
+ This command will produce the raw arpa file, which can then be converted into binary format
+ using the `build_binary` program, as shown below:
+
+ ```
+ build_binary -T /tmp -S 80% probing jomleh-sp-32000-o5-prune01111.arpa jomleh-sp-32000-o5-prune01111.probing
+ ```
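
To check that a binary built this way behaves as expected, the two kinds of files can be combined
along the following lines. This is only a sketch, not the repository's own sample code: it assumes
the `kenlm` and `sentencepiece` packages and locally downloaded files, and the sentence is only
illustrative.

```
# Tokenize with SentencePiece, then score the token sequence with the KenLM binary.
import kenlm
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("jomleh-sp-32000.model")
lm = kenlm.Model("jomleh-sp-32000-o5-prune01111.probing")

sentence = " ".join(sp.encode_as_pieces("سلام دنیا"))
print(lm.score(sentence, bos=True, eos=True))  # log10 probability of the sequence
print(lm.perplexity(sentence))
```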