.. Copyright (C) 2001-2023 NLTK Project
.. For license information, see LICENSE.TXT

.. -*- coding: utf-8 -*-

Regression Tests
================

Issue 167
---------
https://github.com/nltk/nltk/issues/167

>>> from nltk.corpus import brown
>>> from nltk.lm.preprocessing import padded_everygram_pipeline
>>> ngram_order = 3
>>> train_data, vocab_data = padded_everygram_pipeline(
...     ngram_order,
...     brown.sents(categories="news")
... )
>>> from nltk.lm import WittenBellInterpolated
>>> lm = WittenBellInterpolated(ngram_order)
>>> lm.fit(train_data, vocab_data)

A sentence containing an unseen word should result in infinite entropy, because
Witten-Bell is ultimately based on MLE, which cannot handle unseen ngrams.
Crucially, it should not raise any exceptions for unseen words.

>>> from nltk.util import ngrams
>>> sent = ngrams("This is a sentence with the word aaddvark".split(), 3)
>>> lm.entropy(sent)
inf

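To see why one unseen ngram is enough, note that entropy is a negative average of
per-ngram log2 scores, and MLE assigns an unseen ngram probability zero, i.e. a log
score of negative infinity. The toy computation below (plain Python, not the model's
own code) only illustrates the effect of a single zero probability on that average.

>>> import math
>>> probs = [0.5, 0.25, 0.0]  # pretend the last ngram was never seen in training
>>> log_scores = [math.log2(p) if p > 0 else float("-inf") for p in probs]
>>> -sum(log_scores) / len(log_scores)
inf
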
If we remove all unseen ngrams from the sentence, we get a finite value for the
entropy.

>>> sent = ngrams("This is a sentence".split(), 3)
>>> round(lm.entropy(sent), 14)
10.23701322869105

Issue 367
---------
https://github.com/nltk/nltk/issues/367

Reproducing Dan Blanchard's example:
https://github.com/nltk/nltk/issues/367#issuecomment-14646110

>>> from nltk.lm import Lidstone, Vocabulary
>>> word_seq = list('aaaababaaccbacb')
>>> ngram_order = 2
>>> from nltk.util import everygrams
>>> train_data = [everygrams(word_seq, max_len=ngram_order)]
>>> V = Vocabulary(['a', 'b', 'c', ''])
>>> lm = Lidstone(0.2, ngram_order, vocabulary=V)
>>> lm.fit(train_data)

To make the doctest output deterministic, we sort the vocabulary keys.

>>> V_keys = sorted(V)
>>> round(sum(lm.score(w, ("b",)) for w in V_keys), 6)
1.0
>>> round(sum(lm.score(w, ("a",)) for w in V_keys), 6)
1.0
>>> [lm.score(w, ("b",)) for w in V_keys]
[0.05, 0.05, 0.8, 0.05, 0.05]
>>> [round(lm.score(w, ("a",)), 4) for w in V_keys]
[0.0222, 0.0222, 0.4667, 0.2444, 0.2444]

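These values can be checked by hand: in the bigram counts, "a" follows "b" three
times and the context "b" is followed by some word three times in total, so with
gamma = 0.2 and five vocabulary items (the four listed ones plus the "UNK" label)
the Lidstone estimates work out as below. This is plain arithmetic mirroring the
scores above, not a separate API call.

>>> round((3 + 0.2) / (3 + 5 * 0.2), 6)  # "a" after "b": seen 3 times
0.8
>>> round((0 + 0.2) / (3 + 5 * 0.2), 6)  # any word not seen after "b"
0.05
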
Reproducing @afourney's comment:
https://github.com/nltk/nltk/issues/367#issuecomment-15686289

>>> sent = ['foo', 'foo', 'foo', 'foo', 'bar', 'baz']
>>> ngram_order = 3
>>> from nltk.lm.preprocessing import padded_everygram_pipeline
>>> train_data, vocab_data = padded_everygram_pipeline(ngram_order, [sent])
>>> from nltk.lm import Lidstone
>>> lm = Lidstone(0.2, ngram_order)
>>> lm.fit(train_data, vocab_data)

The vocabulary includes the "UNK" symbol as well as two padding symbols.

>>> len(lm.vocab)
6

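Assuming the default padding symbols ("<s>", "</s>") used by padded_everygram_pipeline
and the default "<UNK>" label, the six items should be the three words plus those
three special symbols.

>>> sorted(lm.vocab)
['</s>', '<UNK>', '<s>', 'bar', 'baz', 'foo']
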
>>> word = "foo"
>>> context = ("bar", "baz")

The raw counts.

>>> lm.context_counts(context)[word]
0
>>> lm.context_counts(context).N()
1

Counts with Lidstone smoothing.

>>> lm.context_counts(context)[word] + lm.gamma
0.2
>>> lm.context_counts(context).N() + len(lm.vocab) * lm.gamma
2.2

Without any backoff, just using Lidstone smoothing, P("foo" | "bar", "baz") should be
0.2 / 2.2 ~= 0.090909.

>>> round(lm.score(word, context), 6)
0.090909

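This is just the general Lidstone estimate, (count + gamma) / (total + vocabulary
size * gamma), and it can be recomputed directly from the quantities shown above.

>>> numerator = lm.context_counts(context)[word] + lm.gamma
>>> denominator = lm.context_counts(context).N() + len(lm.vocab) * lm.gamma
>>> round(numerator / denominator, 6)
0.090909
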
Issue 380
---------
https://github.com/nltk/nltk/issues/380

Reproducing a setup akin to this comment:
https://github.com/nltk/nltk/issues/380#issue-12879030

For speed, we take only the first 100 sentences of the Reuters corpus; this
shouldn't affect the test.

>>> from nltk.corpus import reuters
>>> sents = reuters.sents()[:100]
>>> ngram_order = 3
>>> from nltk.lm.preprocessing import padded_everygram_pipeline
>>> train_data, vocab_data = padded_everygram_pipeline(ngram_order, sents)
>>> from nltk.lm import Lidstone
>>> lm = Lidstone(0.2, ngram_order)
>>> lm.fit(train_data, vocab_data)
>>> lm.score("said", ("",)) < 1
True

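As with the sums in the Issue 367 example above, the smoothed conditional
distribution should be properly normalised: summing the scores for this context over
the whole vocabulary should give (approximately) one.

>>> round(sum(lm.score(w, ("",)) for w in lm.vocab), 6)
1.0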