.. Copyright (C) 2001-2023 NLTK Project
.. For license information, see LICENSE.TXT

.. -*- coding: utf-8 -*-


Regression Tests
================


Issue 167
---------
https://github.com/nltk/nltk/issues/167

    >>> from nltk.corpus import brown
    >>> from nltk.lm.preprocessing import padded_everygram_pipeline
    >>> ngram_order = 3
    >>> train_data, vocab_data = padded_everygram_pipeline(
    ...     ngram_order,
    ...     brown.sents(categories="news")
    ... )

    >>> from nltk.lm import WittenBellInterpolated
    >>> lm = WittenBellInterpolated(ngram_order)
    >>> lm.fit(train_data, vocab_data)




A sentence containing an unseen word should result in infinite entropy, because
Witten-Bell smoothing is ultimately based on MLE, which cannot handle unseen
ngrams. Crucially, it shouldn't raise any exceptions for unseen words.

    >>> from nltk.util import ngrams
    >>> sent = ngrams("This is a sentence with the word aaddvark".split(), 3)
    >>> lm.entropy(sent)
    inf
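
Why infinite? Entropy here is the negated average of the per-ngram log scores,
so a single zero-probability ngram contributes a log score of -inf and drags
the whole average to infinity. A quick arithmetic sketch (the log scores below
are made up for illustration, not taken from the model):

    >>> fake_logscores = [-5.0, -7.5, float("-inf")]
    >>> -sum(fake_logscores) / len(fake_logscores)
    inf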

If we remove all unseen ngrams from the sentence, we'll get a non-infinite value
for the entropy.

    >>> sent = ngrams("This is a sentence".split(), 3)
    >>> round(lm.entropy(sent), 14)
    10.23701322869105
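
This finite value should simply be the negated mean of the log scores of the
two trigrams in "This is a sentence". We can check that reading directly,
rounding more coarsely so the check doesn't depend on the last few digits:

    >>> test_ngrams = list(ngrams("This is a sentence".split(), 3))
    >>> logscores = [lm.logscore(w3, (w1, w2)) for w1, w2, w3 in test_ngrams]
    >>> round(-sum(logscores) / len(logscores), 4)
    10.237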


Issue 367
---------
https://github.com/nltk/nltk/issues/367

Reproducing Dan Blanchard's example:
https://github.com/nltk/nltk/issues/367#issuecomment-14646110

    >>> from nltk.lm import Lidstone, Vocabulary
    >>> word_seq = list('aaaababaaccbacb')
    >>> ngram_order = 2
    >>> from nltk.util import everygrams
    >>> train_data = [everygrams(word_seq, max_len=ngram_order)]
    >>> V = Vocabulary(['a', 'b', 'c', ''])
    >>> lm = Lidstone(0.2, ngram_order, vocabulary=V)
    >>> lm.fit(train_data)

For the doctest to work, we have to sort the vocabulary keys.

    >>> V_keys = sorted(V)
    >>> round(sum(lm.score(w, ("b",)) for w in V_keys), 6)
    1.0
    >>> round(sum(lm.score(w, ("a",)) for w in V_keys), 6)
    1.0

    >>> [lm.score(w, ("b",)) for w in V_keys]
    [0.05, 0.05, 0.8, 0.05, 0.05]
    >>> [round(lm.score(w, ("a",)), 4) for w in V_keys]
    [0.0222, 0.0222, 0.4667, 0.2444, 0.2444]
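
To see where, for example, the 0.8 for P("a" | "b") comes from: in the training
sequence the context "b" starts a bigram 3 times and is always followed by "a",
and the vocabulary has 5 entries (the three letters, the empty string and the
"UNK" label). With gamma = 0.2 the Lidstone estimate is therefore
(3 + 0.2) / (3 + 0.2 * 5):

    >>> lm.counts[["b"]]["a"]
    3
    >>> round((3 + 0.2) / (3 + 0.2 * 5), 2)
    0.8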


Here we reproduce @afourney's comment:
https://github.com/nltk/nltk/issues/367#issuecomment-15686289

    >>> sent = ['foo', 'foo', 'foo', 'foo', 'bar', 'baz']
    >>> ngram_order = 3
    >>> from nltk.lm.preprocessing import padded_everygram_pipeline
    >>> train_data, vocab_data = padded_everygram_pipeline(ngram_order, [sent])
    >>> from nltk.lm import Lidstone
    >>> lm = Lidstone(0.2, ngram_order)
    >>> lm.fit(train_data, vocab_data)

The vocabulary includes the "UNK" symbol as well as two padding symbols.

    >>> len(lm.vocab)
    6
    >>> word = "foo"
    >>> context = ("bar", "baz")
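
Assuming NLTK's default padding symbols ("<s>" and "</s>") and the default
"<UNK>" label, the six vocabulary entries should be:

    >>> sorted(lm.vocab)
    ['</s>', '<UNK>', '<s>', 'bar', 'baz', 'foo']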

The raw counts.

    >>> lm.context_counts(context)[word]
    0
    >>> lm.context_counts(context).N()
    1

Counts with Lidstone smoothing.

    >>> lm.context_counts(context)[word] + lm.gamma
    0.2
    >>> lm.context_counts(context).N() + len(lm.vocab) * lm.gamma
    2.2

Without any backoff, just using Lidstone smoothing, P("foo" | "bar", "baz") should be:
0.2 / 2.2 ~= 0.090909

    >>> round(lm.score(word, context), 6)
    0.090909
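
Equivalently, the score is the ratio of the two smoothed quantities computed
above:

    >>> numerator = lm.context_counts(context)[word] + lm.gamma
    >>> denominator = lm.context_counts(context).N() + len(lm.vocab) * lm.gamma
    >>> round(numerator / denominator, 6)
    0.090909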


Issue 380
---------
https://github.com/nltk/nltk/issues/380

Reproducing setup akin to this comment:
https://github.com/nltk/nltk/issues/380#issue-12879030

For speed, take only the first 100 sentences of reuters. This shouldn't affect the test.

    >>> from nltk.corpus import reuters
    >>> sents = reuters.sents()[:100]
    >>> ngram_order = 3
    >>> from nltk.lm.preprocessing import padded_everygram_pipeline
    >>> train_data, vocab_data = padded_everygram_pipeline(ngram_order, sents)

    >>> from nltk.lm import Lidstone
    >>> lm = Lidstone(0.2, ngram_order)
    >>> lm.fit(train_data, vocab_data)
    >>> lm.score("said", ("",)) < 1
    True
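
A slightly stronger sanity check in the same spirit: with Lidstone smoothing
the conditional scores should form a proper probability distribution over the
vocabulary (cf. the checks for Issue 367 above), so no single score can exceed 1.

    >>> round(sum(lm.score(w, ("",)) for w in lm.vocab), 6)
    1.0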