.. Copyright (C) 2001-2023 NLTK Project | |
.. For license information, see LICENSE.TXT | |
========================================= | |
Loading Resources From the Data Package | |
========================================= | |
>>> import nltk.data | |
Overview | |
~~~~~~~~ | |
The `nltk.data` module contains functions that can be used to load | |
NLTK resource files, such as corpora, grammars, and saved processing | |
objects. | |
Loading Data Files | |
~~~~~~~~~~~~~~~~~~ | |
Resources are loaded using the function `nltk.data.load()`, which | |
takes as its first argument a URL specifying what file should be | |
loaded. The ``nltk:`` protocol loads files from the NLTK data | |
distribution: | |
>>> tokenizer = nltk.data.load('nltk:tokenizers/punkt/english.pickle') | |
>>> tokenizer.tokenize('Hello. This is a test. It works!') | |
['Hello.', 'This is a test.', 'It works!'] | |
It is important to note that there should be no space following the | |
colon (':') in the URL; 'nltk: tokenizers/punkt/english.pickle' will | |
not work! | |
The ``nltk:`` protocol is used by default if no protocol is specified: | |
>>> nltk.data.load('tokenizers/punkt/english.pickle') | |
<nltk.tokenize.punkt.PunktSentenceTokenizer object at ...> | |
Resources can also be loaded from ``http:``, ``https:``, ``ftp:``, and
``file:`` URLs:
>>> # Load a grammar from the NLTK repository on GitHub.
>>> cfg = nltk.data.load('https://raw.githubusercontent.com/nltk/nltk/develop/nltk/test/toy.cfg') | |
>>> print(cfg) # doctest: +ELLIPSIS | |
Grammar with 14 productions (start state = S) | |
S -> NP VP | |
PP -> P NP | |
... | |
P -> 'on' | |
P -> 'in' | |
>>> # Load a grammar using an absolute path. | |
>>> url = 'file:%s' % nltk.data.find('grammars/sample_grammars/toy.cfg') | |
>>> url.replace('\\', '/') | |
'file:...toy.cfg' | |
>>> print(nltk.data.load(url)) | |
Grammar with 14 productions (start state = S) | |
S -> NP VP | |
PP -> P NP | |
... | |
P -> 'on' | |
P -> 'in' | |
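The protocol handling described above can be illustrated with a small standalone sketch. The helper ``split_protocol`` and the tuple ``KNOWN_PROTOCOLS`` are hypothetical and are not NLTK's implementation; the sketch only assumes that a URL without a recognised protocol prefix is treated as an ``nltk:`` resource, and it shows why a space after the colon ends up inside the path.

```python
# Hypothetical helper, for illustration only; not part of nltk.data.
KNOWN_PROTOCOLS = ("nltk", "file", "http", "https", "ftp")


def split_protocol(url, default="nltk"):
    """Return (protocol, path), using `default` when no known prefix is present."""
    protocol, sep, path = url.partition(":")
    if sep and protocol in KNOWN_PROTOCOLS:
        return protocol, path
    return default, url


# No protocol given: the default protocol is assumed.
print(split_protocol("tokenizers/punkt/english.pickle"))
# Explicit protocol: same result as above.
print(split_protocol("nltk:tokenizers/punkt/english.pickle"))
# A space after the colon becomes part of the path, so the lookup would fail.
print(split_protocol("nltk: tokenizers/punkt/english.pickle"))
```

In this sketch the third call yields a path that starts with a space, which no longer names a real resource.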
The second argument to the `nltk.data.load()` function specifies the | |
file format, which determines how the file's contents are processed | |
before they are returned by ``load()``. The formats that are | |
currently supported by the data module are described by the dictionary | |
`nltk.data.FORMATS`: | |
>>> for format, descr in sorted(nltk.data.FORMATS.items()):
...     print('{0:<7} {1:}'.format(format, descr))
cfg     A context free grammar.
fcfg    A feature CFG.
fol     A list of first order logic expressions, parsed with
nltk.sem.logic.Expression.fromstring.
json    A serialized python object, stored using the json module.
logic   A list of first order logic expressions, parsed with
nltk.sem.logic.LogicParser. Requires an additional logic_parser
parameter
pcfg    A probabilistic CFG.
pickle  A serialized python object, stored using the pickle
module.
raw     The raw (byte string) contents of a file.
text    The raw (unicode string) contents of a file.
val     A semantic valuation, parsed by
nltk.sem.Valuation.fromstring.
yaml    A serialized python object, stored using the yaml module.
`nltk.data.load()` will raise a ValueError if a bad format name is | |
specified: | |
>>> nltk.data.load('grammars/sample_grammars/toy.cfg', 'bar') | |
Traceback (most recent call last): | |
. . . | |
ValueError: Unknown format type! | |
By default, the ``"auto"`` format is used, which chooses a format | |
based on the filename's extension. The mapping from file extensions | |
to format names is specified by `nltk.data.AUTO_FORMATS`: | |
>>> for ext, format in sorted(nltk.data.AUTO_FORMATS.items()):
...     print('.%-7s -> %s' % (ext, format))
.cfg     -> cfg
.fcfg    -> fcfg
.fol     -> fol
.json    -> json
.logic   -> logic
.pcfg    -> pcfg
.pickle  -> pickle
.text    -> text
.txt     -> text
.val     -> val
.yaml    -> yaml
If `nltk.data.load()` is unable to determine the format based on the | |
filename's extension, it will raise a ValueError: | |
>>> nltk.data.load('foo.bar') | |
Traceback (most recent call last): | |
. . . | |
ValueError: Could not determine format for foo.bar based on its file | |
extension; use the "format" argument to specify the format explicitly. | |
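The extension-based lookup described above can be sketched without NLTK. The names ``AUTO_FORMATS_SKETCH`` and ``guess_format`` are hypothetical, the table is only a subset of the mapping listed above, and the error message is abbreviated; this is an illustration of the idea rather than the library's code.

```python
import os

# Hypothetical subset of an extension-to-format table (illustration only).
AUTO_FORMATS_SKETCH = {
    "cfg": "cfg",
    "fcfg": "fcfg",
    "pickle": "pickle",
    "text": "text",
    "txt": "text",
}


def guess_format(filename, format="auto"):
    """Return the explicit format, or infer one from the file extension."""
    if format != "auto":
        return format  # an explicit format always wins
    ext = os.path.splitext(filename)[1].lstrip(".")
    try:
        return AUTO_FORMATS_SKETCH[ext]
    except KeyError:
        raise ValueError(
            "Could not determine format for %s based on its file extension" % filename
        ) from None


print(guess_format("grammars/sample_grammars/toy.cfg"))
print(guess_format("grammars/sample_grammars/toy.cfg", "text"))
```

Passing an explicit format skips the table entirely, which is why a ``.cfg`` file can still be read as plain text.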
Note that by explicitly specifying the ``format`` argument, you can
override the load method's default processing behavior. For example,
to get the unparsed text of any file, simply use ``format="text"``:
>>> s = nltk.data.load('grammars/sample_grammars/toy.cfg', 'text') | |
>>> print(s) | |
S -> NP VP | |
PP -> P NP | |
NP -> Det N | NP PP | |
VP -> V NP | VP PP | |
... | |
Making Local Copies | |
~~~~~~~~~~~~~~~~~~~ | |
.. This will not be visible in the html output: create a tempdir to | |
play in. | |
>>> import tempfile, os | |
>>> tempdir = tempfile.mkdtemp() | |
>>> old_dir = os.path.abspath('.') | |
>>> os.chdir(tempdir) | |
The function `nltk.data.retrieve()` copies a given resource to a local | |
file. This can be useful, for example, if you want to edit one of the | |
sample grammars. | |
>>> nltk.data.retrieve('grammars/sample_grammars/toy.cfg') | |
Retrieving 'nltk:grammars/sample_grammars/toy.cfg', saving to 'toy.cfg' | |
>>> # Simulate editing the grammar. | |
>>> with open('toy.cfg') as inp:
...     s = inp.read().replace('NP', 'DP')
>>> with open('toy.cfg', 'w') as out:
...     _bytes_written = out.write(s)
>>> # Load the edited grammar, & display it. | |
>>> cfg = nltk.data.load('file:///' + os.path.abspath('toy.cfg')) | |
>>> print(cfg) | |
Grammar with 14 productions (start state = S) | |
S -> DP VP | |
PP -> P DP | |
... | |
P -> 'on' | |
P -> 'in' | |
The second argument to `nltk.data.retrieve()` specifies the filename | |
for the new copy of the file. By default, the source file's filename | |
is used. | |
>>> nltk.data.retrieve('grammars/sample_grammars/toy.cfg', 'mytoy.cfg') | |
Retrieving 'nltk:grammars/sample_grammars/toy.cfg', saving to 'mytoy.cfg' | |
>>> os.path.isfile('./mytoy.cfg') | |
True | |
>>> nltk.data.retrieve('grammars/sample_grammars/np.fcfg') | |
Retrieving 'nltk:grammars/sample_grammars/np.fcfg', saving to 'np.fcfg' | |
>>> os.path.isfile('./np.fcfg') | |
True | |
If a file with the specified (or default) filename already exists in | |
the current directory, then `nltk.data.retrieve()` will raise a | |
ValueError exception. It will *not* overwrite the file: | |
>>> os.path.isfile('./toy.cfg') | |
True | |
>>> nltk.data.retrieve('grammars/sample_grammars/toy.cfg') | |
Traceback (most recent call last): | |
. . . | |
ValueError: File '...toy.cfg' already exists! | |
.. This will not be visible in the html output: clean up the tempdir. | |
>>> os.chdir(old_dir) | |
>>> for f in os.listdir(tempdir):
...     os.remove(os.path.join(tempdir, f))
>>> os.rmdir(tempdir) | |
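The refuse-to-overwrite behavior described in this section can be sketched with the standard library alone. The function ``copy_without_overwrite`` is hypothetical and copies between local paths only; it is not how `nltk.data.retrieve()` is implemented, and the existence check is not atomic.

```python
import os
import shutil
import tempfile


def copy_without_overwrite(src, dst=None):
    """Copy src to dst, defaulting to src's base name; never overwrite."""
    if dst is None:
        dst = os.path.basename(src)
    if os.path.exists(dst):
        raise ValueError("File %r already exists!" % os.path.abspath(dst))
    shutil.copyfile(src, dst)
    return dst


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "toy.cfg")
    with open(src, "w") as out:
        out.write("S -> NP VP\n")
    dst = os.path.join(tmp, "mytoy.cfg")
    copy_without_overwrite(src, dst)      # first copy succeeds
    try:
        copy_without_overwrite(src, dst)  # second copy is refused
    except ValueError as err:
        print("refused:", err)
```

Raising instead of overwriting protects a local copy that may already contain edits.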
Finding Files in the NLTK Data Package | |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
The `nltk.data.find()` function searches the NLTK data package for a
given file, and returns a pointer to that file. This pointer is
either a `FileSystemPathPointer` (whose `path` attribute gives the
absolute path of the file) or a `ZipFilePathPointer`, which specifies a
zipfile and the name of an entry within that zipfile. Both pointer
types define the `open()` method, which returns a stream for reading
the contents of the file as bytes.
>>> path = nltk.data.find('corpora/abc/rural.txt') | |
>>> str(path) | |
'...rural.txt' | |
>>> print(path.open().read(60).decode()) | |
PM denies knowledge of AWB kickbacks | |
The Prime Minister has | |
Alternatively, the `nltk.data.load()` function can be used with the | |
keyword argument ``format="raw"``: | |
>>> s = nltk.data.load('corpora/abc/rural.txt', format='raw')[:60] | |
>>> print(s.decode()) | |
PM denies knowledge of AWB kickbacks | |
The Prime Minister has | |
To get the contents as an already decoded string, use the keyword argument ``format="text"`` instead:
>>> s = nltk.data.load('corpora/abc/rural.txt', format='text')[:60] | |
>>> print(s) | |
PM denies knowledge of AWB kickbacks | |
The Prime Minister has | |
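The idea of two pointer types sharing one `open()` interface can be sketched with the standard library. The classes ``FilePointerSketch`` and ``ZipPointerSketch`` and the helper ``read_head`` are hypothetical stand-ins, not NLTK's pointer classes; they only assume that `open()` returns a binary stream in both cases.

```python
import io
import os
import tempfile
import zipfile


class FilePointerSketch:
    """Points at a file on disk (hypothetical stand-in)."""

    def __init__(self, path):
        self.path = os.path.abspath(path)

    def open(self):
        return open(self.path, "rb")


class ZipPointerSketch:
    """Points at an entry inside a zipfile (hypothetical stand-in)."""

    def __init__(self, zip_path, entry):
        self.zip_path = zip_path
        self.entry = entry

    def open(self):
        with zipfile.ZipFile(self.zip_path) as zf:
            return io.BytesIO(zf.read(self.entry))


def read_head(pointer, n=60):
    """Read and decode the first n bytes, whatever kind of pointer it is."""
    with pointer.open() as stream:
        return stream.read(n).decode()


with tempfile.TemporaryDirectory() as tmp:
    text_path = os.path.join(tmp, "rural.txt")
    with open(text_path, "wb") as out:
        out.write(b"PM denies knowledge of AWB kickbacks\n")
    zip_path = os.path.join(tmp, "abc.zip")
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.write(text_path, "abc/rural.txt")
    from_file = read_head(FilePointerSketch(text_path))
    from_zip = read_head(ZipPointerSketch(zip_path, "abc/rural.txt"))

print(from_file == from_zip)
```

Because both classes expose the same method, calling code does not need to know whether a resource is stored unpacked or inside an archive.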
Resource Caching | |
~~~~~~~~~~~~~~~~ | |
NLTK uses a dictionary to maintain a cache of resources that
have been loaded. If you load a resource that is already stored in
the cache, then the cached copy will be returned. This behavior can
be seen by the trace output generated when ``verbose=True``:
>>> feat0 = nltk.data.load('grammars/book_grammars/feat0.fcfg', verbose=True) | |
<<Loading nltk:grammars/book_grammars/feat0.fcfg>> | |
>>> feat0 = nltk.data.load('grammars/book_grammars/feat0.fcfg', verbose=True) | |
<<Using cached copy of nltk:grammars/book_grammars/feat0.fcfg>> | |
If you wish to load a resource from its source, bypassing the cache, | |
use the ``cache=False`` argument to `nltk.data.load()`. This can be | |
useful, for example, if the resource is loaded from a local file, and | |
you are actively editing that file: | |
>>> feat0 = nltk.data.load('grammars/book_grammars/feat0.fcfg', cache=False, verbose=True)
<<Loading nltk:grammars/book_grammars/feat0.fcfg>> | |
The cache *no longer* uses weak references. A resource will not be | |
automatically expunged from the cache when no more objects are using | |
it. In the following example, when we clear the variable ``feat0``, | |
the reference count for the feature grammar object drops to zero. | |
However, the object remains cached: | |
>>> del feat0 | |
>>> feat0 = nltk.data.load('grammars/book_grammars/feat0.fcfg', | |
... verbose=True) | |
<<Using cached copy of nltk:grammars/book_grammars/feat0.fcfg>> | |
You can clear the entire contents of the cache, using | |
`nltk.data.clear_cache()`: | |
>>> nltk.data.clear_cache() | |
Retrieving Other Data Sources
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
>>> formulas = nltk.data.load('grammars/book_grammars/background.fol') | |
>>> for f in formulas: print(str(f)) | |
all x.(boxerdog(x) -> dog(x)) | |
all x.(boxer(x) -> person(x)) | |
all x.-(dog(x) & person(x)) | |
all x.(married(x) <-> exists y.marry(x,y)) | |
all x.(bark(x) -> dog(x)) | |
all x y.(marry(x,y) -> (person(x) & person(y))) | |
-(Vincent = Mia) | |
-(Vincent = Fido) | |
-(Mia = Fido) | |
Regression Tests | |
~~~~~~~~~~~~~~~~ | |
Create a temp dir for tests that write files: | |
>>> import tempfile, os | |
>>> tempdir = tempfile.mkdtemp() | |
>>> old_dir = os.path.abspath('.') | |
>>> os.chdir(tempdir) | |
The `retrieve()` function accepts all URL types:
>>> urls = ['https://raw.githubusercontent.com/nltk/nltk/develop/nltk/test/toy.cfg',
...         'file:%s' % nltk.data.find('grammars/sample_grammars/toy.cfg'),
...         'nltk:grammars/sample_grammars/toy.cfg',
...         'grammars/sample_grammars/toy.cfg']
>>> for i, url in enumerate(urls):
...     nltk.data.retrieve(url, 'toy-%d.cfg' % i)
Retrieving 'https://raw.githubusercontent.com/nltk/nltk/develop/nltk/test/toy.cfg', saving to 'toy-0.cfg' | |
Retrieving 'file:...toy.cfg', saving to 'toy-1.cfg' | |
Retrieving 'nltk:grammars/sample_grammars/toy.cfg', saving to 'toy-2.cfg' | |
Retrieving 'nltk:grammars/sample_grammars/toy.cfg', saving to 'toy-3.cfg' | |
Clean up the temp dir: | |
>>> os.chdir(old_dir) | |
>>> for f in os.listdir(tempdir):
...     os.remove(os.path.join(tempdir, f))
>>> os.rmdir(tempdir) | |
Lazy Loader | |
----------- | |
A lazy loader is a wrapper object that defers loading a resource until | |
it is accessed or used in any way. This is mainly intended for | |
internal use by NLTK's corpus readers. | |
>>> # Create a lazy loader for toy.cfg. | |
>>> ll = nltk.data.LazyLoader('grammars/sample_grammars/toy.cfg') | |
>>> # Show that it's not loaded yet: | |
>>> object.__repr__(ll) | |
'<nltk.data.LazyLoader object at ...>' | |
>>> # printing it is enough to cause it to be loaded: | |
>>> print(ll) | |
<Grammar with 14 productions> | |
>>> # Show that it's now been loaded: | |
>>> object.__repr__(ll) | |
'<nltk.grammar.CFG object at ...>' | |
>>> # Test that accessing an attribute also loads it: | |
>>> ll = nltk.data.LazyLoader('grammars/sample_grammars/toy.cfg') | |
>>> ll.start() | |
S | |
>>> object.__repr__(ll) | |
'<nltk.grammar.CFG object at ...>' | |
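One way to obtain the deferred-loading behavior shown above is a proxy that turns itself into the real object on first attribute access. The class ``LazySketch``, the ``factory`` callable, and the toy ``Grammar`` class are hypothetical and shown only to illustrate the technique; the sketch works for plain Python classes.

```python
class LazySketch:
    """Defer creating a resource until an attribute is first requested."""

    def __init__(self, factory):
        self._factory = factory

    def _load(self):
        resource = self._factory()
        # Become the loaded object: adopt its state, then its class.
        self.__dict__ = resource.__dict__
        self.__class__ = resource.__class__

    def __getattr__(self, name):
        # Only called when normal lookup fails, i.e. before loading.
        self._load()
        return getattr(self, name)


class Grammar:
    """Toy stand-in for a loaded resource."""

    def __init__(self):
        self.start_symbol = "S"

    def start(self):
        return self.start_symbol


loads = []


def factory():
    loads.append("loaded")
    return Grammar()


ll = LazySketch(factory)
print(type(ll).__name__, loads)  # nothing has been loaded yet
print(ll.start())                # attribute access triggers the load
print(type(ll).__name__, loads)
```

After the first access the proxy is indistinguishable from the loaded object, so later attribute lookups pay no extra cost.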
Buffered Gzip Reading and Writing | |
--------------------------------- | |
Write performance to gzip-compressed files is extremely poor when the files
become large, and file creation can become a bottleneck in those cases.
Read performance from large gzipped pickle files was improved in data.py by
buffering the reads. A similar fix can be applied to writes by buffering
the writes to a StringIO object first.
This is mainly intended for internal use. The test only checks that reading
and writing work as intended; it does not measure how much improvement
buffering provides.
>>> from io import StringIO | |
>>> test = nltk.data.BufferedGzipFile('testbuf.gz', 'wb', size=2**10) | |
>>> ans = [] | |
>>> for i in range(10000):
...     ans.append(str(i).encode('ascii'))
...     test.write(str(i).encode('ascii'))
>>> test.close() | |
>>> test = nltk.data.BufferedGzipFile('testbuf.gz', 'rb') | |
>>> test.read() == b''.join(ans) | |
True | |
>>> test.close() | |
>>> import os | |
>>> os.unlink('testbuf.gz') | |
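The write-buffering idea described above can be sketched with the standard library. The function ``write_gzip_buffered`` is hypothetical and is not the code in data.py; it collects small writes in an in-memory ``io.BytesIO`` buffer and hands them to the gzip stream in larger pieces. Like the test above, it checks correctness only, not speed.

```python
import gzip
import io
import os
import tempfile


def write_gzip_buffered(path, chunks, size=2**10):
    """Write byte chunks to a gzip file, flushing roughly every `size` bytes."""
    buffer = io.BytesIO()
    with gzip.open(path, "wb") as gz:
        for chunk in chunks:
            buffer.write(chunk)
            if buffer.tell() >= size:
                gz.write(buffer.getvalue())
                buffer = io.BytesIO()
        gz.write(buffer.getvalue())  # flush whatever is left


chunks = [str(i).encode("ascii") for i in range(10000)]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "testbuf.gz")
    write_gzip_buffered(path, chunks)
    with gzip.open(path, "rb") as gz:
        roundtrip_ok = gz.read() == b"".join(chunks)

print(roundtrip_ok)
```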
JSON Encoding and Decoding | |
-------------------------- | |
JSON serialization is used instead of pickle for some classes. | |
>>> from nltk import jsontags | |
>>> from nltk.jsontags import JSONTaggedEncoder, JSONTaggedDecoder, register_tag | |
>>> @jsontags.register_tag
... class JSONSerializable:
...     json_tag = 'JSONSerializable'
...
...     def __init__(self, n):
...         self.n = n
...
...     def encode_json_obj(self):
...         return self.n
...
...     @classmethod
...     def decode_json_obj(cls, obj):
...         n = obj
...         return cls(n)
...
>>> JSONTaggedEncoder().encode(JSONSerializable(1)) | |
'{"!JSONSerializable": 1}' | |
>>> JSONTaggedDecoder().decode('{"!JSONSerializable": 1}').n | |
1 | |
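The tagged encoding shown above can be approximated with the standard ``json`` module. Everything named here (``register``, ``TaggedEncoder``, ``tagged_hook``, ``Count``) is hypothetical and is not the code in ``nltk.jsontags``; the sketch assumes only the wire format visible in the example, a one-key object whose key is ``!`` followed by the class tag.

```python
import json

TAG_PREFIX = "!"
_registry = {}  # maps "!Tag" to the class that decodes it


def register(cls):
    """Class decorator: remember cls under its prefixed json_tag."""
    _registry[TAG_PREFIX + cls.json_tag] = cls
    return cls


class TaggedEncoder(json.JSONEncoder):
    def default(self, obj):
        # Called only for objects json cannot serialize natively.
        tag = getattr(obj, "json_tag", None)
        if tag is None:
            return super().default(obj)
        return {TAG_PREFIX + tag: obj.encode_json_obj()}


def tagged_hook(dct):
    """object_hook: rebuild registered classes from one-key tagged dicts."""
    if len(dct) == 1:
        ((key, value),) = dct.items()
        cls = _registry.get(key)
        if cls is not None:
            return cls.decode_json_obj(value)
    return dct


@register
class Count:
    json_tag = "Count"

    def __init__(self, n):
        self.n = n

    def encode_json_obj(self):
        return self.n

    @classmethod
    def decode_json_obj(cls, obj):
        return cls(obj)


encoded = json.dumps(Count(1), cls=TaggedEncoder)
decoded = json.loads(encoded, object_hook=tagged_hook)
print(encoded)
print(decoded.n)
```

Unlike pickle, this format stores only the data each class chooses to expose, and decoding can construct only classes that were explicitly registered.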