KenLM (arpa) models for English based on Wikipedia

This repository contains 5-gram KenLM models for English, trained on the English portion of Wikipedia after sentence segmentation (one sentence per line). Models are provided for tokens, part-of-speech tags, dependency labels, and lemmas, as produced by spaCy's en_core_web_sm pipeline:

wiki_en_token.arpa[.bin]: token
wiki_en_pos.arpa[.bin]: part-of-speech tag
wiki_en_dep.arpa[.bin]: dependency label
wiki_en_lemma.arpa[.bin]: lemma

Both the regular .arpa files and the more efficient KenLM binary files (.arpa.bin) are provided. The binary versions load faster, so they are usually the ones you want.

Usage from within Python

Make sure to install dependencies:

pip install huggingface_hub
pip install https://github.com/kpu/kenlm/archive/master.zip

# If you want to use spaCy preprocessing
pip install spacy
python -m spacy download en_core_web_sm

The huggingface_hub library can then download and cache the model file you want, which KenLM can load directly.

import kenlm
from huggingface_hub import hf_hub_download

model_file = hf_hub_download(repo_id="BramVanroy/kenlm_wikipedia_en", filename="wiki_en_token.arpa.bin")
model = kenlm.Model(model_file)

text = "I love eating cookies !"  # pre-tokenized
model.perplexity(text)
# 557.3027766772162
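For reference, the perplexity reported by the KenLM Python wrapper is derived from the base-10 log probability that `model.score` returns, normalized by the number of whitespace-separated tokens plus one for the end-of-sentence marker. The sketch below spells out that relationship. The helper names `perplexity_from_score` and `sentence_perplexity` are illustrative and not part of KenLM.

```python
def perplexity_from_score(log10_prob: float, n_tokens: int) -> float:
    """Convert a sentence-level log10 probability into perplexity.

    n_tokens is the number of whitespace-separated tokens; one extra
    position is counted for the end-of-sentence marker </s>.
    """
    return 10.0 ** (-log10_prob / (n_tokens + 1))


def sentence_perplexity(model, text: str) -> float:
    """Score `text` with any object exposing a kenlm-style `score` method."""
    # model.score(text) gives the total log10 probability of the sentence
    # (with <s> and </s> included by default in kenlm).
    return perplexity_from_score(model.score(text), len(text.split()))
```

With the token model loaded as above, `sentence_perplexity(model, text)` should agree with `model.perplexity(text)`. Working with the raw score is handy when you want to aggregate log probabilities over several sentences before normalizing.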

It is recommended to use spaCy as a preprocessor, so that your input has the same tokenization and tagsets as the data the LMs were built from.

import kenlm
import spacy
from huggingface_hub import hf_hub_download

model_file = hf_hub_download(repo_id="BramVanroy/kenlm_wikipedia_en", filename="wiki_en_pos.arpa.bin")  # pos file
model = kenlm.Model(model_file)

nlp = spacy.load("en_core_web_sm")

text = "I love eating cookies!" 
pos_sequence = " ".join([token.pos_ for token in nlp(text)])
# 'PRON VERB VERB NOUN PUNCT'
model.perplexity(pos_sequence)
# 6.9449849329974365
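The same pattern extends to the other model files. The sketch below maps each file suffix to a spaCy token attribute and builds the space-separated input string. Only the `pos` to `pos_` mapping is confirmed by the example above; using `text`, `dep_`, and `lemma_` for the other layers is an assumption about how the training data was produced, and `LAYER_ATTRS` and `layer_sequence` are hypothetical names.

```python
# Model file suffix -> spaCy token attribute.
# Only "pos" -> "pos_" is confirmed by the README example; the rest are assumptions.
LAYER_ATTRS = {
    "token": "text",
    "pos": "pos_",
    "dep": "dep_",
    "lemma": "lemma_",
}


def layer_sequence(doc, layer: str) -> str:
    """Return the space-separated sequence for one annotation layer.

    `doc` is any iterable of token-like objects, e.g. a spaCy Doc.
    """
    if layer not in LAYER_ATTRS:
        raise ValueError(f"Unknown layer {layer!r}; expected one of {sorted(LAYER_ATTRS)}")
    attr = LAYER_ATTRS[layer]
    return " ".join(getattr(token, attr) for token in doc)
```

For instance, `layer_sequence(nlp(text), "dep")` would give a string that could be passed to a model loaded from wiki_en_dep.arpa.bin.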

Reproduction

Example (lemma model):

bin/lmplz -o 5 -S 75% -T ../data/tmp/ < ../data/wikipedia/en/wiki_en_processed_lemma_dedup.txt > ../data/wikipedia/en/models/wiki_en_lemma.arpa
bin/build_binary ../data/wikipedia/en/models/wiki_en_lemma.arpa ../data/wikipedia/en/models/wiki_en_lemma.arpa.bin

For the class-based LMs (POS and DEP), the --discount_fallback option was used and the parsed data was not deduplicated. For the token and lemma models, the data was deduplicated at the sentence level.
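The exact deduplication procedure is not described here. As a minimal sketch, assuming exact-match deduplication of one-sentence-per-line input that keeps the first occurrence, it could look like the following (`dedup_sentences` is a hypothetical helper, not the script that was actually used):

```python
from typing import Iterable, Iterator


def dedup_sentences(lines: Iterable[str]) -> Iterator[str]:
    """Yield each distinct non-empty sentence once, in first-seen order."""
    seen = set()
    for line in lines:
        sentence = line.rstrip("\n")
        # Skip blank lines and sentences that were already emitted.
        if sentence and sentence not in seen:
            seen.add(sentence)
            yield sentence
```

This keeps every distinct sentence in memory, so for a corpus the size of Wikipedia one might store hashes of the sentences instead.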

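For the token and lemma models, n-grams were pruned to reduce model size by adding --prune 0 1 1 1 2 to the lmplz command. Each value is a count threshold for one n-gram order: unigrams are left untouched, 2- to 4-grams occurring at most once are dropped, and 5-grams occurring at most twice are dropped.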
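The per-layer differences described above can be summarized in a small helper that assembles the lmplz argument list. This is a sketch under the assumption that all four models otherwise share the options of the lemma example; `lmplz_args` and `CLASS_BASED_LAYERS` are hypothetical names, and the input and output files are left to the caller.

```python
from typing import List

CLASS_BASED_LAYERS = {"pos", "dep"}


def lmplz_args(layer: str, tmp_dir: str = "../data/tmp/") -> List[str]:
    """Build the lmplz argument list for one layer ("token", "pos", "dep", "lemma")."""
    args = ["bin/lmplz", "-o", "5", "-S", "75%", "-T", tmp_dir]
    if layer in CLASS_BASED_LAYERS:
        # POS and DEP models were built with --discount_fallback and no pruning.
        args.append("--discount_fallback")
    else:
        # Token and lemma models were pruned with per-order count thresholds.
        args += ["--prune", "0", "1", "1", "1", "2"]
    return args
```

Since lmplz reads the corpus from standard input and writes the ARPA file to standard output, the list can be passed to `subprocess.run(..., stdin=corpus_file, stdout=arpa_file, check=True)` with the two files opened by the caller.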