.. Copyright (C) 2001-2023 NLTK Project
.. For license information, see LICENSE.TXT
=============
Classifiers
=============

>>> from nltk.test.classify_fixt import setup_module
>>> setup_module()

Classifiers label tokens with category labels (or *class labels*).
Typically, labels are represented with strings (such as ``"health"``
or ``"sports"``).  In NLTK, classifiers are defined using classes that
implement the `ClassifierI` interface, which supports the following
operations (a minimal implementation sketch follows the list):

- self.classify(featureset)
- self.classify_many(featuresets)
- self.labels()
- self.prob_classify(featureset)
- self.prob_classify_many(featuresets)
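
Only ``labels()`` and one of ``classify()`` / ``prob_classify()`` need to
be implemented by a subclass; the ``*_many`` variants fall back to
looping over the single-featureset methods.  A minimal sketch
(``MajorityLabel`` is a hypothetical toy class, not part of NLTK):

>>> from nltk.classify.api import ClassifierI
>>> class MajorityLabel(ClassifierI):  # hypothetical toy subclass
...     """Ignore the features and always answer 'x'."""
...     def labels(self):
...         return ['x', 'y']
...     def classify(self, featureset):
...         return 'x'
>>> MajorityLabel().classify_many([dict(a=1), dict(a=0)])
['x', 'x']
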
NLTK defines several classifier classes:

- `ConditionalExponentialClassifier`
- `DecisionTreeClassifier`
- `MaxentClassifier`
- `NaiveBayesClassifier`
- `WekaClassifier`

Classifiers are typically created by training them on a training
corpus.

Regression Tests
~~~~~~~~~~~~~~~~

We define a very simple training corpus with 3 binary features: ['a',
'b', 'c'], and two labels: ['x', 'y']. We use a simple feature set so
that the correct answers can be calculated analytically (although we
haven't done this yet for all tests; one such check is worked through
after the Naive Bayes tests below).

>>> import nltk
>>> train = [
... (dict(a=1,b=1,c=1), 'y'),
... (dict(a=1,b=1,c=1), 'x'),
... (dict(a=1,b=1,c=0), 'y'),
... (dict(a=0,b=1,c=1), 'x'),
... (dict(a=0,b=1,c=1), 'y'),
... (dict(a=0,b=0,c=1), 'y'),
... (dict(a=0,b=1,c=0), 'x'),
... (dict(a=0,b=0,c=0), 'x'),
... (dict(a=0,b=1,c=1), 'y'),
... (dict(a=None,b=1,c=0), 'x'),
... ]
>>> test = [
... (dict(a=1,b=0,c=1)), # unseen
... (dict(a=1,b=0,c=0)), # unseen
... (dict(a=0,b=1,c=1)), # seen 3 times, labels=y,y,x
... (dict(a=0,b=1,c=0)), # seen 1 time, label=x
... ]

Test the Naive Bayes classifier:

>>> classifier = nltk.classify.NaiveBayesClassifier.train(train)
>>> sorted(classifier.labels())
['x', 'y']
>>> classifier.classify_many(test)
['y', 'x', 'y', 'x']
>>> for pdist in classifier.prob_classify_many(test):
...     print('%.4f %.4f' % (pdist.prob('x'), pdist.prob('y')))
0.2500 0.7500
0.5833 0.4167
0.3571 0.6429
0.7000 0.3000
>>> classifier.show_most_informative_features()
Most Informative Features
                       c = 0                   x : y      =      2.3 : 1.0
                       c = 1                   y : x      =      1.8 : 1.0
                       a = 1                   y : x      =      1.7 : 1.0
                       a = 0                   x : y      =      1.0 : 1.0
                       b = 0                   x : y      =      1.0 : 1.0
                       b = 1                   x : y      =      1.0 : 1.0
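
The answers can indeed be checked analytically.  ``NaiveBayesClassifier``
smooths its counts with the expected likelihood estimate by default
(add 0.5 to each count); assuming that default, consider the last test
case, ``{'a': 0, 'b': 1, 'c': 0}``.  Both labels occur 5 times, so the
priors cancel; feature 'a' takes 3 values (0, 1, None) while 'b' and 'c'
take 2, giving P(fval|label) = (count + 0.5) / (5 + 0.5 * bins):

>>> from fractions import Fraction
>>> # ELE-smoothed counts (NLTK's default estimator), computed by hand:
>>> p_x = Fraction(7, 13) * Fraction(9, 12) * Fraction(7, 12)  # a=0, b=1, c=0 given 'x'
>>> p_y = Fraction(7, 13) * Fraction(9, 12) * Fraction(3, 12)  # a=0, b=1, c=0 given 'y'
>>> print('%.4f %.4f' % (p_x / (p_x + p_y), p_y / (p_x + p_y)))
0.7000 0.3000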

Test the Decision Tree classifier (without None):

>>> classifier = nltk.classify.DecisionTreeClassifier.train(
... train[:-1], entropy_cutoff=0,
... support_cutoff=0)
>>> sorted(classifier.labels())
['x', 'y']
>>> print(classifier)
c=0? .................................................. x
  a=0? ................................................ x
  a=1? ................................................ y
c=1? .................................................. y
<BLANKLINE>
>>> classifier.classify_many(test)
['y', 'y', 'y', 'x']
>>> for pdist in classifier.prob_classify_many(test):
...     print('%.4f %.4f' % (pdist.prob('x'), pdist.prob('y')))
Traceback (most recent call last):
. . .
NotImplementedError

Test the Decision Tree classifier (with None):

>>> classifier = nltk.classify.DecisionTreeClassifier.train(
... train, entropy_cutoff=0,
... support_cutoff=0)
>>> sorted(classifier.labels())
['x', 'y']
>>> print(classifier)
c=0? .................................................. x
  a=0? ................................................ x
  a=1? ................................................ y
  a=None? ............................................. x
c=1? .................................................. y
<BLANKLINE>
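
``DecisionTreeClassifier`` has no probability model, which is why
``prob_classify_many`` raised ``NotImplementedError`` above.  As an
extra sanity check (an addition here, not one of the original tests),
the tree can be scored against its own training data with
``nltk.classify.accuracy``; reading the tree above, only the two
contradictory items with c=1 and label 'x' come out wrong:

>>> round(nltk.classify.accuracy(classifier, train), 4)  # extra check
0.8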

Test SklearnClassifier, which requires the scikit-learn package.

>>> from nltk.classify import SklearnClassifier
>>> from sklearn.naive_bayes import BernoulliNB
>>> from sklearn.svm import SVC
>>> train_data = [({"a": 4, "b": 1, "c": 0}, "ham"),
... ({"a": 5, "b": 2, "c": 1}, "ham"),
... ({"a": 0, "b": 3, "c": 4}, "spam"),
... ({"a": 5, "b": 1, "c": 1}, "ham"),
... ({"a": 1, "b": 4, "c": 3}, "spam")]
>>> classif = SklearnClassifier(BernoulliNB()).train(train_data)
>>> test_data = [{"a": 3, "b": 2, "c": 1},
... {"a": 0, "b": 3, "c": 7}]
>>> classif.classify_many(test_data)
['ham', 'spam']
>>> classif = SklearnClassifier(SVC(), sparse=False).train(train_data)
>>> classif.classify_many(test_data)
['ham', 'spam']
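
``SklearnClassifier`` can also wrap a full scikit-learn ``Pipeline``, so
preprocessing, feature selection and classification travel together.  A
sketch along those lines (the pipeline below is illustrative, not part
of the original tests), combining chi-squared feature selection with
Bernoulli Naive Bayes:

>>> # illustrative pipeline, not from the original test suite
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.feature_selection import SelectKBest, chi2
>>> pipeline = Pipeline([('chi2', SelectKBest(chi2, k=2)),
...                      ('nb', BernoulliNB())])
>>> classif = SklearnClassifier(pipeline).train(train_data)
>>> classif.classify_many(test_data)
['ham', 'spam']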

Test the Maximum Entropy classifier training algorithms; they should all
generate the same results.

>>> def print_maxent_test_header():
...     print(' '*11+''.join(['      test[%s]  ' % i
...                           for i in range(len(test))]))
...     print(' '*11+'     p(x)  p(y)'*len(test))
...     print('-'*(11+15*len(test)))

>>> def test_maxent(algorithm):
...     print('%11s' % algorithm, end=' ')
...     try:
...         classifier = nltk.classify.MaxentClassifier.train(
...             train, algorithm, trace=0, max_iter=1000)
...     except Exception as e:
...         print('Error: %r' % e)
...         return
...
...     for featureset in test:
...         pdist = classifier.prob_classify(featureset)
...         print('%8.2f%6.2f' % (pdist.prob('x'), pdist.prob('y')), end=' ')
...     print()

>>> print_maxent_test_header(); test_maxent('GIS'); test_maxent('IIS')
                 test[0]        test[1]        test[2]        test[3]
                p(x)  p(y)     p(x)  p(y)     p(x)  p(y)     p(x)  p(y)
-----------------------------------------------------------------------
        GIS     0.16  0.84     0.46  0.54     0.41  0.59     0.76  0.24
        IIS     0.16  0.84     0.46  0.54     0.41  0.59     0.76  0.24

>>> test_maxent('MEGAM'); test_maxent('TADM') # doctest: +SKIP
      MEGAM     0.16  0.84     0.46  0.54     0.41  0.59     0.76  0.24
       TADM     0.16  0.84     0.46  0.54     0.41  0.59     0.76  0.24

Regression tests for TypedMaxentFeatureEncoding
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

>>> from nltk.classify import maxent
>>> train = [
... ({'a': 1, 'b': 1, 'c': 1}, 'y'),
... ({'a': 5, 'b': 5, 'c': 5}, 'x'),
... ({'a': 0.9, 'b': 0.9, 'c': 0.9}, 'y'),
... ({'a': 5.5, 'b': 5.4, 'c': 5.3}, 'x'),
... ({'a': 0.8, 'b': 1.2, 'c': 1}, 'y'),
... ({'a': 5.1, 'b': 4.9, 'c': 5.2}, 'x')
... ]
>>> test = [
... {'a': 1, 'b': 0.8, 'c': 1.2},
... {'a': 5.2, 'b': 5.1, 'c': 5}
... ]
>>> encoding = maxent.TypedMaxentFeatureEncoding.train(
... train, count_cutoff=3, alwayson_features=True)
>>> classifier = maxent.MaxentClassifier.train(
... train, bernoulli=False, encoding=encoding, trace=0)
>>> classifier.classify_many(test)
['y', 'x']
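
``TypedMaxentFeatureEncoding`` differs from the default encoding in that
integer and float feature values are used directly as numeric feature
weights instead of being treated as nominal symbols.  A brief extra
check (illustrative, not one of the original tests; the extra data
point is hypothetical):

>>> sorted(encoding.labels())
['x', 'y']
>>> classifier.classify({'a': 5.3, 'b': 5.2, 'c': 5.1})  # hypothetical point near the 'x' cluster
'x'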