---
language:
- en
tags:
- kenlm
license: apache-2.0
---

# KenLM (arpa) models for English based on Wikipedia

This repository contains KenLM models (n=5) for English, based on the [English portion of Wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia/viewer/20231101.en), sentence-segmented (one sentence per line). Models are provided for tokens, part-of-speech tags, dependency labels, and lemmas, as processed with spaCy `en_core_web_sm`:

- wiki_en_token.arpa[.bin]: token
- wiki_en_pos.arpa[.bin]: part-of-speech tag
- wiki_en_dep.arpa[.bin]: dependency label
- wiki_en_lemma.arpa[.bin]: lemma

Both plain `.arpa` files and the more efficient KenLM binary files (`.arpa.bin`) are provided. You probably want to use the binary versions, which load considerably faster.

## Usage from within Python

Make sure to install dependencies:

```shell
pip install huggingface_hub
pip install https://github.com/kpu/kenlm/archive/master.zip

# If you want to use spaCy preprocessing
pip install spacy
python -m spacy download en_core_web_sm
```

We can then use the `huggingface_hub` library to download and cache the model file that we want, and use it directly with KenLM.

```python
import kenlm
from huggingface_hub import hf_hub_download

model_file = hf_hub_download(repo_id="BramVanroy/kenlm_wikipedia_en", filename="wiki_en_token.arpa.bin")
model = kenlm.Model(model_file)

text = "I love eating cookies !"  # pre-tokenized
model.perplexity(text)
# 1790.5033832700467
```

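Besides perplexity, the KenLM Python bindings also expose sentence-level and per-token log probabilities. A minimal sketch, reusing the token model from above:

```python
import kenlm
from huggingface_hub import hf_hub_download

model_file = hf_hub_download(repo_id="BramVanroy/kenlm_wikipedia_en", filename="wiki_en_token.arpa.bin")
model = kenlm.Model(model_file)

text = "I love eating cookies !"  # pre-tokenized
# Sentence-level log10 probability (BOS/EOS are added by default)
print(model.score(text))

# Per-token breakdown: (log10 probability, n-gram order used, out-of-vocabulary flag)
for log_prob, ngram_length, oov in model.full_scores(text):
    print(log_prob, ngram_length, oov)
```
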
It is recommended to use spaCy as a preprocessor so that you automatically use the same tag sets and tokenization that were used when creating the LMs.

```python
import kenlm
import spacy
from huggingface_hub import hf_hub_download

model_file = hf_hub_download(repo_id="BramVanroy/kenlm_wikipedia_en", filename="wiki_en_pos.arpa.bin")  # pos file
model = kenlm.Model(model_file)

nlp = spacy.load("en_core_web_sm")

text = "I love eating cookies!"
pos_sequence = " ".join([token.pos_ for token in nlp(text)])
# 'PRON VERB ADV NOUN PUNCT'
model.perplexity(pos_sequence)
# 6.190638021041525
```

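The same idea applies to the token model: instead of tokenizing by hand, you can let spaCy split the text and join the tokens with spaces before scoring. A small sketch, assuming the `wiki_en_token.arpa.bin` file listed above:

```python
import kenlm
import spacy
from huggingface_hub import hf_hub_download

model_file = hf_hub_download(repo_id="BramVanroy/kenlm_wikipedia_en", filename="wiki_en_token.arpa.bin")
model = kenlm.Model(model_file)

nlp = spacy.load("en_core_web_sm")

text = "I love eating cookies!"
# Let spaCy tokenize so that punctuation is split off in the same way as in the training data
token_sequence = " ".join([token.text for token in nlp(text)])
# 'I love eating cookies !'
model.perplexity(token_sequence)
```
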
## Reproduction

Example:

```sh
bin/lmplz -o 5 -S 75% -T ../data/tmp/ < ../data/wikipedia/en/wiki_en_processed_lemma_dedup.txt > ../data/wikipedia/en/models/wiki_en_lemma.arpa
bin/build_binary ../data/wikipedia/en/models/wiki_en_lemma.arpa ../data/wikipedia/en/models/wiki_en_lemma.arpa.bin
```

For the class-based LMs (POS and DEP), the `--discount_fallback` option was used and the parsed data was not deduplicated (for the token and lemma models, the data was deduplicated at the sentence level).

For the token and lemma models, n-grams were pruned to save on model size by adding `--prune 0 1 1 1 2` to the `lmplz` command.
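
As an illustration, the corresponding `lmplz` invocations could look roughly as follows; the input file names are assumptions modelled on the lemma example above:

```sh
# Class-based LM (e.g. POS): no sentence-level deduplication, with the discount fallback
bin/lmplz -o 5 -S 75% -T ../data/tmp/ --discount_fallback < ../data/wikipedia/en/wiki_en_processed_pos.txt > ../data/wikipedia/en/models/wiki_en_pos.arpa

# Token model: deduplicated input, with n-gram pruning
bin/lmplz -o 5 -S 75% -T ../data/tmp/ --prune 0 1 1 1 2 < ../data/wikipedia/en/wiki_en_processed_token_dedup.txt > ../data/wikipedia/en/models/wiki_en_token.arpa
```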