.. Copyright (C) 2001-2023 NLTK Project
.. For license information, see LICENSE.TXT
.. -*- coding: utf-8 -*-
Regression Tests
================
Issue 167
---------
https://github.com/nltk/nltk/issues/167
>>> from nltk.corpus import brown
>>> from nltk.lm.preprocessing import padded_everygram_pipeline
>>> ngram_order = 3
>>> train_data, vocab_data = padded_everygram_pipeline(
... ngram_order,
... brown.sents(categories="news")
... )
>>> from nltk.lm import WittenBellInterpolated
>>> lm = WittenBellInterpolated(ngram_order)
>>> lm.fit(train_data, vocab_data)
A sentence containing an unseen word should result in infinite entropy, because
Witten-Bell is ultimately based on MLE, which cannot handle unseen ngrams.
Crucially, it should not raise any exceptions for unseen words.
>>> from nltk.util import ngrams
>>> sent = ngrams("This is a sentence with the word aaddvark".split(), 3)
>>> lm.entropy(sent)
inf
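
The infinity comes from the per-ngram log scores: any trigram whose smoothed
probability is zero contributes a log score of negative infinity, so the
average blows up. An illustrative (non-doctest) sketch, assuming entropy is
the average negative base-2 log score over the ngrams::

    trigrams = list(ngrams("This is a sentence with the word aaddvark".split(), 3))
    # logscore returns -inf for any trigram the model assigns probability 0,
    # so the mean below comes out as inf.
    logscores = [lm.logscore(w[-1], w[:-1]) for w in trigrams]
    manual_entropy = -1 * sum(logscores) / len(logscores)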
If we remove all unseen ngrams from the sentence, we get a finite value for
the entropy.
>>> sent = ngrams("This is a sentence".split(), 3)
>>> round(lm.entropy(sent), 14)
10.23701322869105
Issue 367
---------
https://github.com/nltk/nltk/issues/367
Reproducing Dan Blanchard's example:
https://github.com/nltk/nltk/issues/367#issuecomment-14646110
>>> from nltk.lm import Lidstone, Vocabulary
>>> word_seq = list('aaaababaaccbacb')
>>> ngram_order = 2
>>> from nltk.util import everygrams
>>> train_data = [everygrams(word_seq, max_len=ngram_order)]
>>> V = Vocabulary(['a', 'b', 'c', ''])
>>> lm = Lidstone(0.2, ngram_order, vocabulary=V)
>>> lm.fit(train_data)
For the doctest to work, we have to sort the vocabulary keys.
>>> V_keys = sorted(V)
>>> round(sum(lm.score(w, ("b",)) for w in V_keys), 6)
1.0
>>> round(sum(lm.score(w, ("a",)) for w in V_keys), 6)
1.0
>>> [lm.score(w, ("b",)) for w in V_keys]
[0.05, 0.05, 0.8, 0.05, 0.05]
>>> [round(lm.score(w, ("a",)), 4) for w in V_keys]
[0.0222, 0.0222, 0.4667, 0.2444, 0.2444]
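
The 0.8 above follows from the Lidstone estimate
(count + gamma) / (N + gamma * |V|): in the training sequence the bigram
"b a" occurs 3 times, the context "b" is followed by some word 3 times in
total, gamma is 0.2 and |V| is 5, giving
(3 + 0.2) / (3 + 0.2 * 5) = 3.2 / 4.0 = 0.8, while every unseen continuation
gets (0 + 0.2) / 4.0 = 0.05.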
Here we reproduce @afourney's comment:
https://github.com/nltk/nltk/issues/367#issuecomment-15686289
>>> sent = ['foo', 'foo', 'foo', 'foo', 'bar', 'baz']
>>> ngram_order = 3
>>> from nltk.lm.preprocessing import padded_everygram_pipeline
>>> train_data, vocab_data = padded_everygram_pipeline(ngram_order, [sent])
>>> from nltk.lm import Lidstone
>>> lm = Lidstone(0.2, ngram_order)
>>> lm.fit(train_data, vocab_data)
The vocabulary includes the "UNK" symbol as well as two padding symbols.
>>> len(lm.vocab)
6
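
Assuming the default symbols used by ``padded_everygram_pipeline`` and
``Vocabulary`` ("<s>" and "</s>" for padding, "<UNK>" for unknown words), the
six entries should look like this (non-doctest sketch)::

    sorted(lm.vocab)
    # ['</s>', '<UNK>', '<s>', 'bar', 'baz', 'foo']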
>>> word = "foo"
>>> context = ("bar", "baz")
The raw counts.
>>> lm.context_counts(context)[word]
0
>>> lm.context_counts(context).N()
1
Counts with Lidstone smoothing.
>>> lm.context_counts(context)[word] + lm.gamma
0.2
>>> lm.context_counts(context).N() + len(lm.vocab) * lm.gamma
2.2
Without any backoff, just using Lidstone smoothing, P("foo" | "bar", "baz") should be:
0.2 / 2.2 ~= 0.090909
>>> round(lm.score(word, context), 6)
0.090909
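
The score is simply the smoothed count divided by the smoothed total; a
non-doctest sketch tying the pieces together::

    smoothed_score = (lm.context_counts(context)[word] + lm.gamma) / (
        lm.context_counts(context).N() + len(lm.vocab) * lm.gamma
    )
    # 0.2 / 2.2, which should match lm.score(word, context) above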
Issue 380
---------
https://github.com/nltk/nltk/issues/380
Reproducing a setup akin to this comment:
https://github.com/nltk/nltk/issues/380#issue-12879030
For speed, we take only the first 100 sentences of the Reuters corpus; this
shouldn't affect the test.
>>> from nltk.corpus import reuters
>>> sents = reuters.sents()[:100]
>>> ngram_order = 3
>>> from nltk.lm.preprocessing import padded_everygram_pipeline
>>> train_data, vocab_data = padded_everygram_pipeline(ngram_order, sents)
>>> from nltk.lm import Lidstone
>>> lm = Lidstone(0.2, ngram_order)
>>> lm.fit(train_data, vocab_data)
>>> lm.score("said", ("",)) < 1
True
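
As a sanity check in the spirit of the issue, the scores for a fixed context
should still form a probability distribution; an illustrative (non-doctest)
sketch, assuming iterating over ``lm.vocab`` yields all vocabulary items::

    total = sum(lm.score(w, ("",)) for w in lm.vocab)
    # total should be very close to 1.0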