# KenLM models for Farsi

This repository contains KenLM models trained on the Jomleh dataset for the Farsi (Persian) language. Among the various use cases for KenLM language models, the models provided here are particularly useful for automatic speech recognition (ASR) tasks. They can be used in conjunction with CTC to select the most likely sequence of tokens extracted from a spectrogram.

The models in this repository are KenLM ARPA files that have been converted to binary format. KenLM supports two binary formats: probing and trie. The models provided here are in the probing format, which KenLM claims is faster but has a larger memory footprint.

There are a total of 36 different KenLM models available in this repository. Unless you are conducting research, you will not need all of them; it is recommended that you download only the models you require rather than the entire repository, since the total file size is over half a terabyte.
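
The total of 36 is consistent with the naming scheme described later in this README: six vocabulary sizes, two n-gram orders, and three pruning settings per combination. That factorization is my own reading of the scheme rather than something stated explicitly, but it can be sanity-checked:

```python
# Sanity check: 6 vocabulary sizes x 2 n-gram orders x 3 pruning settings.
# (The factorization is inferred from the file-naming scheme, not stated in the repo.)
vocab_sizes = [2000, 4000, 8000, 16000, 32000, 57218]
ngram_orders = [3, 5]
prunings_per_combo = 3
total = len(vocab_sizes) * len(ngram_orders) * prunings_per_combo
print(total)  # -> 36
```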

# Sample code showing how to use the models

Unfortunately, I could not find an easy way to integrate the Python code that loads the models using the Hugging Face library. These are the steps that you have to take when you want to use any of the models provided here:

1. Install the KenLM package:
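
The body of the sample-code section is truncated in this view. As a stand-in, here is a minimal sketch of the usual loading pattern, assuming the `kenlm` and `sentencepiece` pip packages and locally downloaded model files; the helper name `score_sentence` is mine, not from this repository:

```python
import importlib.util

def score_sentence(sentence, lm, sp):
    """Tokenize with SentencePiece, then return KenLM's log10 score."""
    tokens = " ".join(sp.encode(sentence, out_type=str))
    return lm.score(tokens, bos=True, eos=True)

# Running against the real models requires `pip install kenlm sentencepiece`
# and the model files downloaded from this repository.
if importlib.util.find_spec("kenlm") and importlib.util.find_spec("sentencepiece"):
    import kenlm
    import sentencepiece as spm
    try:
        lm = kenlm.Model("jomleh-sp-32000-o5-prune01111.probing")
        sp = spm.SentencePieceProcessor(model_file="jomleh-sp-32000.model")
        print(score_sentence("این یک جمله است", lm, sp))
    except Exception:
        pass  # model files not present locally
```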

# What are the different files you can find in this repository?

The files in this repository are either SentencePiece tokenizer models or KenLM binary models. For the tokenizers, this is the template their file names follow:
|
70 |
+
```
|
71 |
+
<dataset-name>-<tokenizer-type>-<vocabulary-size>.<model|vocab>
|
72 |
+
```
|
73 |
+
|
74 |
+
In this repository, all the models are based on the Jomleh dataset (`jomleh`). And the only
|
75 |
+
tokenizer used is SentencePiece (`sp`). Finally, the list of vocabulary sizes used is composed
|
76 |
+
of 2000, 4000, 8000, 16000, 32000, and 57218 tokens. Here's an example of the tokenizer
|
77 |
+
files you can find:
|
78 |
+
|
79 |
+
```
|
80 |
+
jomleh-sp-32000.model
|
81 |
+
```
|
82 |
+
|

Moving on to the KenLM binary models, their file names follow this template:

```
<dataset-name>-<tokenizer-type>-<vocabulary-size>-o<n-gram>-prune<pruning>.<model|vocab>
```

Just like with the tokenizers, the only available options for dataset and tokenizer type are `jomleh` and `sp`, and the same vocabulary sizes apply. Two n-gram orders were trained: 3-grams and 5-grams. Additionally, there are three different pruning options available for each configuration. To interpret the pruning numbers, add a space between each pair of digits; for example, `011` means `0 1 1` was set during training of the KenLM model.
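
As an illustration of the template above, the components of a model file name can be pulled apart programmatically. The regex and variable names below are my own, not part of the repository:

```python
import re

# Parse <dataset>-<tokenizer>-<vocab>-o<order>-prune<pruning>.<extension>
PATTERN = re.compile(
    r"(?P<dataset>\w+)-(?P<tokenizer>\w+)-(?P<vocab>\d+)"
    r"-o(?P<order>\d+)-prune(?P<pruning>\d+)\.(?P<ext>\w+)"
)

name = "jomleh-sp-32000-o5-prune01111.probing"
m = PATTERN.match(name)
print(m.group("vocab"), m.group("order"))  # 32000 5
# One pruning digit per n-gram order: "01111" -> "0 1 1 1 1"
print(" ".join(m.group("pruning")))  # 0 1 1 1 1
```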

Here is a complete example: to train the binary model named `jomleh-sp-32000-o5-prune01111.probing`, the tokenizer `jomleh-sp-32000.model` was used to encode (tokenize) the 95% split of the Jomleh dataset, resulting in a large text file of space-separated tokens. This file was then fed into the `lmplz` program with the following arguments:

```
lmplz -o 5 -T /tmp --vocab_estimate 32000 -S 80% --discount_fallback --prune "0 1 1 1 1" < encoded.txt > jomleh-sp-32000-o5-prune01111.arpa
```

This command produces the raw ARPA file, which can then be converted into binary format using the `build_binary` program, as shown below:

```
build_binary -T /tmp -S 80% probing jomleh-sp-32000-o5-prune01111.arpa jomleh-sp-32000-o5-prune01111.probing
```