.. Copyright (C) 2001-2023 NLTK Project
.. For license information, see LICENSE.TXT
=========================================
Loading Resources From the Data Package
=========================================
>>> import nltk.data
Overview
~~~~~~~~
The `nltk.data` module contains functions that can be used to load
NLTK resource files, such as corpora, grammars, and saved processing
objects.
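Resources are looked up along the directories listed in ``nltk.data.path``, which can be inspected (and extended) directly. A minimal check, not part of the original examples, assuming the standard attribute:
>>> isinstance(nltk.data.path, list)
True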
Loading Data Files
~~~~~~~~~~~~~~~~~~
Resources are loaded using the function `nltk.data.load()`, which
takes as its first argument a URL specifying what file should be
loaded. The ``nltk:`` protocol loads files from the NLTK data
distribution:
>>> tokenizer = nltk.data.load('nltk:tokenizers/punkt/english.pickle')
>>> tokenizer.tokenize('Hello. This is a test. It works!')
['Hello.', 'This is a test.', 'It works!']
It is important to note that there should be no space following the
colon (':') in the URL; 'nltk: tokenizers/punkt/english.pickle' will
not work!
The ``nltk:`` protocol is used by default if no protocol is specified:
>>> nltk.data.load('tokenizers/punkt/english.pickle')
<nltk.tokenize.punkt.PunktSentenceTokenizer object at ...>
But it is also possible to load resources from ``http:``, ``ftp:``,
and ``file:`` URLs:
>>> # Load a grammar from the NLTK webpage.
>>> cfg = nltk.data.load('https://raw.githubusercontent.com/nltk/nltk/develop/nltk/test/toy.cfg')
>>> print(cfg) # doctest: +ELLIPSIS
Grammar with 14 productions (start state = S)
S -> NP VP
PP -> P NP
...
P -> 'on'
P -> 'in'
>>> # Load a grammar using an absolute path.
>>> url = 'file:%s' % nltk.data.find('grammars/sample_grammars/toy.cfg')
>>> url.replace('\\', '/')
'file:...toy.cfg'
>>> print(nltk.data.load(url))
Grammar with 14 productions (start state = S)
S -> NP VP
PP -> P NP
...
P -> 'on'
P -> 'in'
The second argument to the `nltk.data.load()` function specifies the
file format, which determines how the file's contents are processed
before they are returned by ``load()``. The formats that are
currently supported by the data module are described by the dictionary
`nltk.data.FORMATS`:
>>> for format, descr in sorted(nltk.data.FORMATS.items()):
... print('{0:<7} {1:}'.format(format, descr))
cfg A context free grammar.
fcfg A feature CFG.
fol A list of first order logic expressions, parsed with
nltk.sem.logic.Expression.fromstring.
json A serialized python object, stored using the json module.
logic A list of first order logic expressions, parsed with
nltk.sem.logic.LogicParser. Requires an additional logic_parser
parameter
pcfg A probabilistic CFG.
pickle A serialized python object, stored using the pickle
module.
raw The raw (byte string) contents of a file.
text The raw (unicode string) contents of a file.
val A semantic valuation, parsed by
nltk.sem.Valuation.fromstring.
yaml A serialized python object, stored using the yaml module.
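An explicit format name can also be passed as the second argument (or as the ``format`` keyword); for example, loading the tokenizer from above with ``format='pickle'`` gives the same result as the default auto-detection (a small illustration added here, mirroring the earlier example):
>>> nltk.data.load('tokenizers/punkt/english.pickle', format='pickle')
<nltk.tokenize.punkt.PunktSentenceTokenizer object at ...>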
`nltk.data.load()` will raise a ValueError if a bad format name is
specified:
>>> nltk.data.load('grammars/sample_grammars/toy.cfg', 'bar')
Traceback (most recent call last):
...
ValueError: Unknown format type!
By default, the ``"auto"`` format is used, which chooses a format
based on the filename's extension. The mapping from file extensions
to format names is specified by `nltk.data.AUTO_FORMATS`:
>>> for ext, format in sorted(nltk.data.AUTO_FORMATS.items()):
... print('.%-7s -> %s' % (ext, format))
.cfg -> cfg
.fcfg -> fcfg
.fol -> fol
.json -> json
.logic -> logic
.pcfg -> pcfg
.pickle -> pickle
.text -> text
.txt -> text
.val -> val
.yaml -> yaml
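Conceptually, the auto-detection just looks up the filename's extension in this table; a rough sketch of that lookup (illustrative only, not the actual implementation inside ``load()``):
>>> import os.path
>>> ext = os.path.splitext('grammars/sample_grammars/toy.cfg')[1].lstrip('.')
>>> nltk.data.AUTO_FORMATS[ext]
'cfg'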
If `nltk.data.load()` is unable to determine the format based on the
filename's extension, it will raise a ValueError:
>>> nltk.data.load('foo.bar')
Traceback (most recent call last):
...
ValueError: Could not determine format for foo.bar based on its file
extension; use the "format" argument to specify the format explicitly.
Note that by explicitly specifying the ``format`` argument, you can
override the load method's default processing behavior. For example,
to get the unprocessed contents of any file as a string, simply use
``format="text"`` (or ``format="raw"`` to get the raw bytes):
>>> s = nltk.data.load('grammars/sample_grammars/toy.cfg', 'text')
>>> print(s)
S -> NP VP
PP -> P NP
NP -> Det N | NP PP
VP -> V NP | VP PP
...
Making Local Copies
~~~~~~~~~~~~~~~~~~~
.. This will not be visible in the html output: create a tempdir to
   play in.
>>> import tempfile, os
>>> tempdir = tempfile.mkdtemp()
>>> old_dir = os.path.abspath('.')
>>> os.chdir(tempdir)
The function `nltk.data.retrieve()` copies a given resource to a local
file. This can be useful, for example, if you want to edit one of the
sample grammars.
>>> nltk.data.retrieve('grammars/sample_grammars/toy.cfg')
Retrieving 'nltk:grammars/sample_grammars/toy.cfg', saving to 'toy.cfg'
>>> # Simulate editing the grammar.
>>> with open('toy.cfg') as inp:
... s = inp.read().replace('NP', 'DP')
>>> with open('toy.cfg', 'w') as out:
... _bytes_written = out.write(s)
>>> # Load the edited grammar, & display it.
>>> cfg = nltk.data.load('file:///' + os.path.abspath('toy.cfg'))
>>> print(cfg)
Grammar with 14 productions (start state = S)
S -> DP VP
PP -> P DP
...
P -> 'on'
P -> 'in'
The second argument to `nltk.data.retrieve()` specifies the filename
for the new copy of the file. By default, the source file's filename
is used.
>>> nltk.data.retrieve('grammars/sample_grammars/toy.cfg', 'mytoy.cfg')
Retrieving 'nltk:grammars/sample_grammars/toy.cfg', saving to 'mytoy.cfg'
>>> os.path.isfile('./mytoy.cfg')
True
>>> nltk.data.retrieve('grammars/sample_grammars/np.fcfg')
Retrieving 'nltk:grammars/sample_grammars/np.fcfg', saving to 'np.fcfg'
>>> os.path.isfile('./np.fcfg')
True
If a file with the specified (or default) filename already exists in
the current directory, then `nltk.data.retrieve()` will raise a
ValueError exception. It will *not* overwrite the file:
>>> os.path.isfile('./toy.cfg')
True
>>> nltk.data.retrieve('grammars/sample_grammars/toy.cfg')
Traceback (most recent call last):
...
ValueError: File '...toy.cfg' already exists!
.. This will not be visible in the html output: clean up the tempdir.
>>> os.chdir(old_dir)
>>> for f in os.listdir(tempdir):
... os.remove(os.path.join(tempdir, f))
>>> os.rmdir(tempdir)
Finding Files in the NLTK Data Package
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The `nltk.data.find()` function searches the NLTK data package for a
given file, and returns a pointer to that file. This pointer can
either be a `FileSystemPathPointer` (whose `path` attribute gives the
absolute path of the file); or a `ZipFilePathPointer`, specifying a
zipfile and the name of an entry within that zipfile. Both pointer
types define the `open()` method, which can be used to read the string
contents of the file.
>>> path = nltk.data.find('corpora/abc/rural.txt')
>>> str(path)
'...rural.txt'
>>> print(path.open().read(60).decode())
PM denies knowledge of AWB kickbacks
The Prime Minister has
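The returned pointer is one of the two types described above, depending on whether the resource lives in a plain directory or inside a zipfile (a quick check, assuming these are the only pointer types returned here):
>>> from nltk.data import FileSystemPathPointer, ZipFilePathPointer
>>> isinstance(path, (FileSystemPathPointer, ZipFilePathPointer))
True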
Alternatively, the `nltk.data.load()` function can be used with the
keyword argument ``format="raw"``:
>>> s = nltk.data.load('corpora/abc/rural.txt', format='raw')[:60]
>>> print(s.decode())
PM denies knowledge of AWB kickbacks
The Prime Minister has
Alternatively, you can use the keyword argument ``format="text"``:
>>> s = nltk.data.load('corpora/abc/rural.txt', format='text')[:60]
>>> print(s)
PM denies knowledge of AWB kickbacks
The Prime Minister has
Resource Caching
~~~~~~~~~~~~~~~~
NLTK maintains a cache of resources that have been loaded. If you
load a resource that is already stored in the cache, then the cached
copy will be returned. This behavior can be seen in the trace output
generated when ``verbose=True``:
>>> feat0 = nltk.data.load('grammars/book_grammars/feat0.fcfg', verbose=True)
<<Loading nltk:grammars/book_grammars/feat0.fcfg>>
>>> feat0 = nltk.data.load('grammars/book_grammars/feat0.fcfg', verbose=True)
<<Using cached copy of nltk:grammars/book_grammars/feat0.fcfg>>
If you wish to load a resource from its source, bypassing the cache,
use the ``cache=False`` argument to `nltk.data.load()`. This can be
useful, for example, if the resource is loaded from a local file, and
you are actively editing that file:
>>> feat0 = nltk.data.load('grammars/book_grammars/feat0.fcfg',cache=False,verbose=True)
<<Loading nltk:grammars/book_grammars/feat0.fcfg>>
The cache *no longer* uses weak references. A resource will not be
automatically expunged from the cache when no more objects are using
it. In the following example, when we clear the variable ``feat0``,
the reference count for the feature grammar object drops to zero.
However, the object remains cached:
>>> del feat0
>>> feat0 = nltk.data.load('grammars/book_grammars/feat0.fcfg',
... verbose=True)
<<Using cached copy of nltk:grammars/book_grammars/feat0.fcfg>>
You can clear the entire contents of the cache using
`nltk.data.clear_cache()`:
>>> nltk.data.clear_cache()
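After the cache has been cleared, the next load goes back to the source, which the verbose trace should make visible again:
>>> feat0 = nltk.data.load('grammars/book_grammars/feat0.fcfg', verbose=True)
<<Loading nltk:grammars/book_grammars/feat0.fcfg>>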
Retrieving other Data Sources
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>> formulas = nltk.data.load('grammars/book_grammars/background.fol')
>>> for f in formulas: print(str(f))
all x.(boxerdog(x) -> dog(x))
all x.(boxer(x) -> person(x))
all x.-(dog(x) & person(x))
all x.(married(x) <-> exists y.marry(x,y))
all x.(bark(x) -> dog(x))
all x y.(marry(x,y) -> (person(x) & person(y)))
-(Vincent = Mia)
-(Vincent = Fido)
-(Mia = Fido)
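As the ``fol`` format description above indicates, each entry in the loaded list is a parsed logic expression rather than a plain string (a quick sanity check, assuming `nltk.sem.logic.Expression` is the common base class):
>>> from nltk.sem.logic import Expression
>>> all(isinstance(f, Expression) for f in formulas)
True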
Regression Tests
~~~~~~~~~~~~~~~~
Create a temp dir for tests that write files:
>>> import tempfile, os
>>> tempdir = tempfile.mkdtemp()
>>> old_dir = os.path.abspath('.')
>>> os.chdir(tempdir)
The `retrieve()` function accepts all URL types:
>>> urls = ['https://raw.githubusercontent.com/nltk/nltk/develop/nltk/test/toy.cfg',
... 'file:%s' % nltk.data.find('grammars/sample_grammars/toy.cfg'),
... 'nltk:grammars/sample_grammars/toy.cfg',
... 'grammars/sample_grammars/toy.cfg']
>>> for i, url in enumerate(urls):
... nltk.data.retrieve(url, 'toy-%d.cfg' % i)
Retrieving 'https://raw.githubusercontent.com/nltk/nltk/develop/nltk/test/toy.cfg', saving to 'toy-0.cfg'
Retrieving 'file:...toy.cfg', saving to 'toy-1.cfg'
Retrieving 'nltk:grammars/sample_grammars/toy.cfg', saving to 'toy-2.cfg'
Retrieving 'nltk:grammars/sample_grammars/toy.cfg', saving to 'toy-3.cfg'
Clean up the temp dir:
>>> os.chdir(old_dir)
>>> for f in os.listdir(tempdir):
... os.remove(os.path.join(tempdir, f))
>>> os.rmdir(tempdir)
Lazy Loader
-----------
A lazy loader is a wrapper object that defers loading a resource until
it is accessed or used in any way. This is mainly intended for
internal use by NLTK's corpus readers.
>>> # Create a lazy loader for toy.cfg.
>>> ll = nltk.data.LazyLoader('grammars/sample_grammars/toy.cfg')
>>> # Show that it's not loaded yet:
>>> object.__repr__(ll)
'<nltk.data.LazyLoader object at ...>'
>>> # printing it is enough to cause it to be loaded:
>>> print(ll)
<Grammar with 14 productions>
>>> # Show that it's now been loaded:
>>> object.__repr__(ll)
'<nltk.grammar.CFG object at ...>'
>>> # Test that accessing an attribute also loads it:
>>> ll = nltk.data.LazyLoader('grammars/sample_grammars/toy.cfg')
>>> ll.start()
S
>>> object.__repr__(ll)
'<nltk.grammar.CFG object at ...>'
Buffered Gzip Reading and Writing
---------------------------------
Write performance to gzip-compressed files is extremely poor when the files become large.
File creation can become a bottleneck in those cases.
Read performance from large gzipped pickle files was improved in data.py by
buffering the reads. A similar fix can be applied to writes by buffering
the writes to a StringIO object first.
This is mainly intended for internal use. The test below simply checks that
reading and writing work as intended; it does not measure how much improvement
buffering provides.
>>> from io import StringIO
>>> test = nltk.data.BufferedGzipFile('testbuf.gz', 'wb', size=2**10)
>>> ans = []
>>> for i in range(10000):
... ans.append(str(i).encode('ascii'))
... test.write(str(i).encode('ascii'))
>>> test.close()
>>> test = nltk.data.BufferedGzipFile('testbuf.gz', 'rb')
>>> test.read() == b''.join(ans)
True
>>> test.close()
>>> import os
>>> os.unlink('testbuf.gz')
JSON Encoding and Decoding
--------------------------
JSON serialization is used instead of pickle for some classes.
>>> from nltk import jsontags
>>> from nltk.jsontags import JSONTaggedEncoder, JSONTaggedDecoder, register_tag
>>> @jsontags.register_tag
... class JSONSerializable:
... json_tag = 'JSONSerializable'
...
... def __init__(self, n):
... self.n = n
...
... def encode_json_obj(self):
... return self.n
...
... @classmethod
... def decode_json_obj(cls, obj):
... n = obj
... return cls(n)
...
>>> JSONTaggedEncoder().encode(JSONSerializable(1))
'{"!JSONSerializable": 1}'
>>> JSONTaggedDecoder().decode('{"!JSONSerializable": 1}').n
1
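Since `JSONTaggedEncoder` and `JSONTaggedDecoder` are ordinary ``json`` encoder/decoder subclasses, they can also be handed to `json.dumps()` and `json.loads()` via the ``cls`` argument; a small sketch, assuming that usage is supported:
>>> import json
>>> json.dumps(JSONSerializable(2), cls=JSONTaggedEncoder)
'{"!JSONSerializable": 2}'
>>> json.loads('{"!JSONSerializable": 3}', cls=JSONTaggedDecoder).n
3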