mehran committed on
Commit
9bbbdd9
1 Parent(s): c45e4f5

Update README.md

Files changed (1)
  1. README.md +56 -20
README.md CHANGED
@@ -14,24 +14,25 @@ tags:
 
  # KenLM models for Farsi
 
- This repository contains trained KenLM models for Farsi (Persian) language trained on the Jomleh
- dataset. Among all the use cases for the language models like KenLM, the models provided here are
- very useful for ASR (automatic speech recognition) task. They can be used along with CTC to select
- the more likely sequence of tokens extracted from spectogram.
+ This repository contains KenLM models trained on the Jomleh dataset for the Farsi (Persian)
+ language. Among the various use cases for KenLM language models, the models provided here are
+ particularly useful for automatic speech recognition (ASR) tasks. They can be used in conjunction
+ with CTC to select the most likely sequence of tokens extracted from a spectrogram.
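
Below is a minimal sketch of how one of these models could be used to rescore candidate token
sequences (for example, the n-best output of a CTC decoder). It assumes the `kenlm` Python package
is installed and one of the binary files has been downloaded locally; the file name and the
candidate strings are only illustrative.

```
# Rescore CTC candidates with a KenLM model and keep the most likely one.
# Assumes the `kenlm` package and a locally downloaded .probing file.
import kenlm

model = kenlm.Model("jomleh-sp-32000-o5-prune01111.probing")

# Candidate token sequences (space-separated SentencePiece pieces) produced
# by a CTC decoder; the language model picks the highest-scoring one.
candidates = [
    "▁سلام ▁دنیا",
    "▁سلام ▁دنی ا",
]
best = max(candidates, key=lambda s: model.score(s, bos=True, eos=True))
print(best)
```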
 
- The models in this repository are KenLM arpa files turned into binary. KenLM supports two types of
- binary formats: probing and trie. The models provided here are of the probing format. KenLM claims
- that they are faster but with bigger memory footprint.
+ The models in this repository are KenLM arpa files that have been converted to binary format.
+ KenLM supports two binary formats: probing and trie. The models provided here are in the probing
+ format, which KenLM claims is faster but has a larger memory footprint.
 
- There are a total 36 different KenLM models that you can find here. Unless you are doing some
- research, you won't be needing all of them. If that's the case, I suggest downloading the ones you
- need and not the whole repository. As the total size of files is larger than half a TB.
+ There are a total of 36 different KenLM models available in this repository. Unless you are
+ conducting research, you will not need all of them. In that case, it is recommended that you
+ download only the models you require rather than the entire repository since the total file size
+ is over half a terabyte.
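
If you only need one or two of the files, a sketch like the following could be used to fetch a
single file from this repository instead of cloning all of it. It assumes the `huggingface_hub`
package; the `repo_id` below is a placeholder that has to be replaced with this repository's
actual id on the Hub.

```
# Download a single KenLM binary instead of the whole (500+ GB) repository.
# The repo_id is a placeholder, not the real repository id.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="<user>/<this-repository>",
    filename="files/jomleh-sp-32000-o5-prune01111.probing",
)
print(path)
```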
 
  # Sample code how to use the models
 
  Unfortunately, I could not find an easy way to integrate the Python code that loads the models
- using Huggingface library. These are the steps that you have to take if you want to use any of the
- models provided here:
+ using the Hugging Face library. These are the steps that you have to take when you want to use any
+ of the models provided here:
 
  1. Install KenLM package:
 
@@ -61,13 +62,48 @@ files/jomleh-sp-32000-o5-prune01111.probing
  ```
  ```
 
- # What are the different models provided here
+ # What are the different files you can find in this repository?
 
- There a total of 36 models in this repository and while all of the are trained on Jomleh daatset,
- which is a Farsi dataset, there differences among them. Namely:
+ The files you can find in this repository are either SentencePiece tokenizer models or KenLM
+ binary models. For the tokenizers, this is the template their file names follow:
 
- 1. Different vocabulary sizes: For research purposes, I trained on 6 different vocabulary sizes.
- Of course, the vocabulary size is a hyperparameter for the tokenizer (SentencePiece here) but
- once you have a new tokenizer, it will result in a new model. The different vocabulary sizes used
- here are: 2000, 4000, 8000, 16000, 32000, and 57218 tokens. For most use cases, ethier 32000 or
- 57218 token vocabulary size should be the best option.
+ ```
+ <dataset-name>-<tokenizer-type>-<vocabulary-size>.<model|vocab>
+ ```
+
+ In this repository, all the models are based on the Jomleh dataset (`jomleh`), and the only
+ tokenizer used is SentencePiece (`sp`). Finally, the list of vocabulary sizes used is composed
+ of 2000, 4000, 8000, 16000, 32000, and 57218 tokens. Here's an example of the tokenizer
+ files you can find:
+
+ ```
+ jomleh-sp-32000.model
+ ```
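
To illustrate how such a tokenizer file could be used, the sketch below loads it with the
`sentencepiece` package and produces the space-separated tokens that the KenLM models expect as
input; the sample sentence is only illustrative.

```
# Tokenize a Farsi sentence into the space-separated pieces the KenLM models expect.
# Assumes the `sentencepiece` package and a locally downloaded tokenizer file.
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("jomleh-sp-32000.model")

tokens = sp.encode_as_pieces("سلام دنیا")
print(" ".join(tokens))
```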
+
+ Moving on to the KenLM binary models, their file names follow this template:
+
+ ```
+ <dataset-name>-<tokenizer-type>-<vocabulary-size>-o<n-gram>-prune<pruning>.probing
+ ```
+
+ Just like with the tokenizers, the only available options for dataset and tokenizer type are
+ `jomleh` and `sp`. The same applies to vocabulary sizes. Two n-gram orders were trained:
+ 3-gram and 5-gram. Additionally, there are three different pruning options available for each
+ configuration. To interpret the pruning numbers, add a space between each pair of digits. For
+ example, `011` means `0 1 1` was set during training of the KenLM model.
+
+ Here is a complete example: To train the binary model named `jomleh-sp-32000-o5-prune01111.probing`,
+ the tokenizer `jomleh-sp-32000.model` was used to encode (tokenize) 95% of the Jomleh dataset,
+ resulting in a large text file holding space-separated tokens. Then, the file was fed into the
+ `lmplz` program with the following input arguments:
+
+ ```
+ lmplz -o 5 -T /tmp --vocab_estimate 32000 -S 80% --discount_fallback --prune "0 1 1 1 1" < encoded.txt > jomleh-sp-32000-o5-prune01111.arpa
+ ```
+
+ This command will produce the raw arpa file, which can then be converted into binary format
+ using the `build_binary` program, as shown below:
+
+ ```
+ build_binary -T /tmp -S 80% probing jomleh-sp-32000-o5-prune01111.arpa jomleh-sp-32000-o5-prune01111.probing
+ ```
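
To check that a binary built this way behaves as expected, the two kinds of files can be combined
along the following lines. This is only a sketch, not the repository's own sample code: it assumes
the `kenlm` and `sentencepiece` packages and locally downloaded files, and the sentence is only
illustrative.

```
# Tokenize with SentencePiece, then score the token sequence with the KenLM binary.
import kenlm
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("jomleh-sp-32000.model")
lm = kenlm.Model("jomleh-sp-32000-o5-prune01111.probing")

sentence = " ".join(sp.encode_as_pieces("سلام دنیا"))
print(lm.score(sentence, bos=True, eos=True))  # log10 probability of the sequence
print(lm.perplexity(sentence))
```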