.. Copyright (C) 2001-2023 NLTK Project
.. For license information, see LICENSE.TXT

=======
Chat-80
=======

Chat-80 was a natural language system which allowed the user to
interrogate a Prolog knowledge base in the domain of world
geography. It was developed in the early '80s by Warren and Pereira; see
`<https://aclanthology.org/J82-3002.pdf>`_ for a description and
`<http://www.cis.upenn.edu/~pereira/oldies.html>`_ for the source
files.

The ``chat80`` module contains functions to extract data from the Chat-80
relation files ('the world database'), and convert them into a format
that can be incorporated in the FOL models of
``nltk.sem.evaluate``. The code assumes that the Prolog
input files are available in the NLTK corpora directory.

The Chat-80 World Database consists of the following files::

    world0.pl
    rivers.pl
    cities.pl
    countries.pl
    contain.pl
    borders.pl

This module uses a slightly modified version of ``world0.pl``, in which
a set of Prolog rules has been omitted. The modified file is named
``world1.pl``. Currently, the file ``rivers.pl`` is not read in, since
it uses a list rather than a string in the second field.

Reading Chat-80 Files
=====================

Chat-80 relations are like tables in a relational database. The
relation acts as the name of the table; the first argument acts as the
'primary key'; and subsequent arguments are further fields in the
table. In general, the name of the table provides a label for a unary
predicate whose extension is the set of all primary keys. For example,
relations in ``cities.pl`` are of the following form::

    'city(athens,greece,1368).'

Here, ``'athens'`` is the key, and will be mapped to a member of the
extension of the unary predicate *city*.

By analogy with NLTK corpora, ``chat80`` defines a number of 'items'
which correspond to the relations.

    >>> from nltk.sem import chat80
    >>> print(chat80.items)
    ('borders', 'circle_of_lat', 'circle_of_long', 'city', ...)

The fields in the table are mapped to binary predicates. The first
argument of the predicate is the primary key, while the second
argument is the data in the relevant field. Thus, in the above
example, the third field is mapped to the binary predicate
*population_of*, whose extension is a set of pairs such as
``'(athens, 1368)'``.

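
The mapping from a single fact to one unary predicate and a set of binary
predicates can be sketched in plain Python. The helpers ``parse_clause``
and ``clause_to_predicates`` are hypothetical names used only for
illustration; they are not part of ``nltk.sem.chat80``, and the ``_of``
suffix is assumed here simply because it matches the *population_of*
label mentioned above.

```python
import re

# Hypothetical helpers for illustration; not the chat80 implementation.
CLAUSE_RE = re.compile(r"^(\w+)\((.*)\)\.$")


def parse_clause(clause):
    """Split a simple Prolog fact into its relation name and arguments."""
    match = CLAUSE_RE.match(clause.strip())
    if match is None:
        raise ValueError("not a simple Prolog fact: %r" % clause)
    name, args = match.groups()
    return name, tuple(arg.strip() for arg in args.split(","))


def clause_to_predicates(clause, schema):
    """Map one fact to a unary entry plus one binary pair per extra field."""
    name, args = parse_clause(clause)
    key = args[0]                      # the first argument is the primary key
    predicates = {name: {key}}         # unary predicate: set of keys
    for field, value in zip(schema[1:], args[1:]):
        predicates[field + "_of"] = {(key, value)}   # binary predicate
    return predicates


preds = clause_to_predicates("city(athens,greece,1368).",
                             ["city", "country", "population"])
for label in sorted(preds):
    print(label, sorted(preds[label]))
```

Field values stay as strings in this sketch, so the population is
``'1368'`` rather than an integer.
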
An exception to this general framework is required by the relations in
the files ``borders.pl`` and ``contain.pl``. These contain facts of the
following form::

    'borders(albania,greece).'

    'contains0(africa,central_africa).'

We do not want to form a unary concept out of the element in
the first field of these records, and we want the label of the binary
relation just to be ``'border'``/``'contain'`` respectively.

In order to drive the extraction process, we use 'relation metadata bundles'
which are Python dictionaries such as the following::

    city = {'label': 'city',
            'closures': [],
            'schema': ['city', 'country', 'population'],
            'filename': 'cities.pl'}

According to this, the file ``city['filename']`` contains a list of
relational tuples (or more accurately, the corresponding strings in
Prolog form) whose predicate symbol is ``city['label']`` and whose
relational schema is ``city['schema']``. The notion of a ``closure`` is
discussed in the next section.

Concepts
========

In order to encapsulate the results of the extraction, a class of
``Concept``\ s is introduced. A ``Concept`` object has a number of
attributes, in particular a ``prefLabel``, an ``arity`` and an ``extension``.

    >>> c1 = chat80.Concept('dog', arity=1, extension=set(['d1', 'd2']))
    >>> print(c1)
    Label = 'dog'
    Arity = 1
    Extension = ['d1', 'd2']

The ``extension`` attribute makes it easier to inspect the output of
the extraction.

    >>> schema = ['city', 'country', 'population']
    >>> concepts = chat80.clause2concepts('cities.pl', 'city', schema)
    >>> concepts
    [Concept('city'), Concept('country_of'), Concept('population_of')]
    >>> for c in concepts:
    ...     print("%s:\n\t%s" % (c.prefLabel, c.extension[:4]))
    city:
        ['athens', 'bangkok', 'barcelona', 'berlin']
    country_of:
        [('athens', 'greece'), ('bangkok', 'thailand'), ('barcelona', 'spain'), ('berlin', 'east_germany')]
    population_of:
        [('athens', '1368'), ('bangkok', '1178'), ('barcelona', '1280'), ('berlin', '3481')]

In addition, the ``extension`` can be further
processed: in the case of the ``'border'`` relation, we check that the
relation is **symmetric**, and in the case of the ``'contain'``
relation, we carry out the **transitive closure**. The closure
properties associated with a concept are given in the ``'closures'``
entry of the relation metadata, as shown earlier.

    >>> borders = set([('a1', 'a2'), ('a2', 'a3')])
    >>> c2 = chat80.Concept('borders', arity=2, extension=borders)
    >>> print(c2)
    Label = 'borders'
    Arity = 2
    Extension = [('a1', 'a2'), ('a2', 'a3')]
    >>> c3 = chat80.Concept('borders', arity=2, closures=['symmetric'], extension=borders)
    >>> c3.close()
    >>> print(c3)
    Label = 'borders'
    Arity = 2
    Extension = [('a1', 'a2'), ('a2', 'a1'), ('a2', 'a3'), ('a3', 'a2')]

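
The two closure operations can be sketched independently of NLTK on plain
sets of pairs. This is an illustrative sketch only: the function names
``symmetric_closure`` and ``transitive_closure`` are hypothetical, the
region names ``r1`` to ``r3`` are made up, and nothing here is claimed to
mirror how ``Concept.close()`` is implemented.

```python
def symmetric_closure(pairs):
    """Return the pairs together with the reverse of every pair."""
    return set(pairs) | {(y, x) for (x, y) in pairs}


def transitive_closure(pairs):
    """Add (x, z) whenever (x, y) and (y, z) are present, until stable."""
    closure = set(pairs)
    while True:
        new_pairs = {(x, z)
                     for (x, y1) in closure
                     for (y2, z) in closure
                     if y1 == y2}
        if new_pairs <= closure:       # nothing new was derived: fixed point
            return closure
        closure |= new_pairs


borders = {("a1", "a2"), ("a2", "a3")}
print(sorted(symmetric_closure(borders)))

# Made-up containment chain: r1 contains r2, and r2 contains r3.
regions = {("r1", "r2"), ("r2", "r3")}
print(sorted(transitive_closure(regions)))
```

On the ``borders`` pairs, the symmetric closure yields the same four pairs
as the ``c3`` extension above; on the containment chain, the transitive
closure adds the pair ``('r1', 'r3')``.
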
The ``extension`` of a ``Concept`` object is then incorporated into a
``Valuation`` object.

Persistence
===========

The functions ``val_dump`` and ``val_load`` are provided to allow a
valuation to be stored in a persistent database and re-loaded, rather
than having to be re-computed each time.

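
The idea of storing a valuation and reloading it can be illustrated with
the standard library's ``shelve`` module. This is a sketch under stated
assumptions: the valuation is modelled as a plain dictionary from symbols
to sets, ``dump_valuation`` and ``load_valuation`` are hypothetical
stand-ins, and the signatures and storage format of the real ``val_dump``
and ``val_load`` are not shown here.

```python
import os
import shelve
import tempfile


def dump_valuation(valuation, path):
    """Write each symbol and its extension to a new shelve database."""
    with shelve.open(path, flag="n") as db:
        for symbol, extension in valuation.items():
            db[symbol] = extension


def load_valuation(path):
    """Read every symbol and its extension back into a dictionary."""
    with shelve.open(path, flag="r") as db:
        return {symbol: db[symbol] for symbol in db}


# A toy valuation standing in for the output of the extraction step.
valuation = {"city": {"athens"},
             "country_of": {("athens", "greece")}}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "chat80_demo")   # hypothetical file name
    dump_valuation(valuation, path)
    restored = load_valuation(path)

print(restored == valuation)
```
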
Individuals and Lexical Items
=============================

As well as deriving relations from the Chat-80 data, we also create a
set of individual constants, one for each entity in the domain. The
individual constants are string-identical to the entities. For
example, given a data item such as ``'zloty'``, we add to the valuation
a pair ``('zloty', 'zloty')``. In order to parse English sentences that
refer to these entities, we also create a lexical item such as the
following for each individual constant::

    PropN[num=sg, sem=<\P.(P zloty)>] -> 'Zloty'

The set of rules is written to the file ``chat_pnames.fcfg`` in the
current directory.

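
A minimal sketch of how such pairs and rules could be generated from a
list of entity names follows. The function ``proper_noun_rule`` and the
sample entity list are hypothetical, and the sketch only capitalizes the
first letter; how ``chat80`` treats multi-word constants such as
``east_germany`` is not covered.

```python
def proper_noun_rule(constant):
    """Build a feature-grammar rule for one individual constant."""
    word = constant.capitalize()       # 'zloty' -> 'Zloty'
    return "PropN[num=sg, sem=<\\P.(P %s)>] -> '%s'" % (constant, word)


# Sample entities taken from examples elsewhere in this document.
entities = ["zloty", "athens"]

# Each individual constant denotes the entity with the same name.
individuals = {(entity, entity) for entity in entities}

rules = [proper_noun_rule(entity) for entity in entities]
for rule in rules:
    print(rule)
```
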
SQL Query
=========

The ``city`` relation is also available in RDB form and can be queried
using SQL statements.

    >>> import nltk
    >>> q = "SELECT City, Population FROM city_table WHERE Country = 'china' and Population > 1000"
    >>> for answer in chat80.sql_query('corpora/city_database/city.db', q):
    ...     print("%-10s %4s" % answer)
    canton     1496
    chungking  1100
    mukden     1551
    peking     2031
    shanghai   5407
    tientsin   1795

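
The same kind of query can be tried without the NLTK data by building a
small in-memory table with ``sqlite3``. This is only a sketch: the rows
are copied from values shown elsewhere in this document, the column types
are assumptions, and the real ``city.db`` may be laid out differently.

```python
import sqlite3

# A handful of (City, Country, Population) rows quoted in this document.
rows = [
    ("athens", "greece", 1368),
    ("bangkok", "thailand", 1178),
    ("canton", "china", 1496),
    ("mukden", "china", 1551),
    ("peking", "china", 2031),
]

conn = sqlite3.connect(":memory:")
# Column types are assumed for this sketch.
conn.execute(
    "CREATE TABLE city_table (City TEXT, Country TEXT, Population INTEGER)"
)
conn.executemany("INSERT INTO city_table VALUES (?, ?, ?)", rows)

# Parameter placeholders keep the values out of the SQL string itself.
query = ("SELECT City, Population FROM city_table "
         "WHERE Country = ? AND Population > ? ORDER BY City")
answers = conn.execute(query, ("china", 1500)).fetchall()
conn.close()

for answer in answers:
    print("%-10s %4s" % answer)
```

With a threshold of 1500, only ``mukden`` and ``peking`` remain from the
sample rows.
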
The (deliberately naive) grammar ``sql0.fcfg`` translates from English
to SQL:

    >>> nltk.data.show_cfg('grammars/book_grammars/sql0.fcfg')
    % start S
    S[SEM=(?np + WHERE + ?vp)] -> NP[SEM=?np] VP[SEM=?vp]
    VP[SEM=(?v + ?pp)] -> IV[SEM=?v] PP[SEM=?pp]
    VP[SEM=(?v + ?ap)] -> IV[SEM=?v] AP[SEM=?ap]
    NP[SEM=(?det + ?n)] -> Det[SEM=?det] N[SEM=?n]
    PP[SEM=(?p + ?np)] -> P[SEM=?p] NP[SEM=?np]
    AP[SEM=?pp] -> A[SEM=?a] PP[SEM=?pp]
    NP[SEM='Country="greece"'] -> 'Greece'
    NP[SEM='Country="china"'] -> 'China'
    Det[SEM='SELECT'] -> 'Which' | 'What'
    N[SEM='City FROM city_table'] -> 'cities'
    IV[SEM=''] -> 'are'
    A[SEM=''] -> 'located'
    P[SEM=''] -> 'in'

Given this grammar, we can express, and then execute, queries in English.

    >>> cp = nltk.parse.load_parser('grammars/book_grammars/sql0.fcfg')
    >>> query = 'What cities are in China'
    >>> for tree in cp.parse(query.split()):
    ...     answer = tree.label()['SEM']
    ...     q = " ".join(answer)
    ...     print(q)
    ...
    SELECT City FROM city_table WHERE Country="china"

    >>> rows = chat80.sql_query('corpora/city_database/city.db', q)
    >>> for r in rows: print("%s" % r, end=' ')
    canton chungking dairen harbin kowloon mukden peking shanghai sian tientsin

Using Valuations
-----------------

In order to convert the extensions of a list of concepts, such as the
``concepts`` list built from ``cities.pl`` in the Concepts section, into
a valuation, we use the ``make_valuation()`` function; setting
``read=True`` creates and returns a new ``Valuation`` object which
contains the results.

    >>> val = chat80.make_valuation(concepts, read=True)
    >>> 'calcutta' in val['city']
    True
    >>> [town for (town, country) in val['country_of'] if country == 'india']
    ['bombay', 'calcutta', 'delhi', 'hyderabad', 'madras']
    >>> dom = val.domain
    >>> g = nltk.sem.Assignment(dom)
    >>> m = nltk.sem.Model(dom, val)
    >>> m.evaluate(r'population_of(jakarta, 533)', g)
    True