Spaces:
Sleeping
Sleeping
# Natural Language Toolkit: Tree Transformations | |
# | |
# Copyright (C) 2005-2007 Oregon Graduate Institute | |
# Author: Nathan Bodenstab <[email protected]> | |
# URL: <https://www.nltk.org/> | |
# For license information, see LICENSE.TXT | |
r""" | |
A collection of methods for tree (grammar) transformations used | |
in parsing natural language. | |
Although many of these methods are technically grammar transformations | |
(ie. Chomsky Norm Form), when working with treebanks it is much more | |
natural to visualize these modifications in a tree structure. Hence, | |
we will do all transformation directly to the tree itself. | |
Transforming the tree directly also allows us to do parent annotation. | |
A grammar can then be simply induced from the modified tree. | |
The following is a short tutorial on the available transformations. | |
1. Chomsky Normal Form (binarization) | |
It is well known that any grammar has a Chomsky Normal Form (CNF) | |
equivalent grammar where CNF is defined by every production having | |
either two non-terminals or one terminal on its right hand side. | |
When we have hierarchically structured data (ie. a treebank), it is | |
natural to view this in terms of productions where the root of every | |
subtree is the head (left hand side) of the production and all of | |
its children are the right hand side constituents. In order to | |
convert a tree into CNF, we simply need to ensure that every subtree | |
has either two subtrees as children (binarization), or one leaf node | |
(non-terminal). In order to binarize a subtree with more than two | |
children, we must introduce artificial nodes. | |
There are two popular methods to convert a tree into CNF: left | |
factoring and right factoring. The following example demonstrates | |
the difference between them. Example:: | |
Original Right-Factored Left-Factored | |
A A A | |
/ | \ / \ / \ | |
B C D ==> B A|<C-D> OR A|<B-C> D | |
/ \ / \ | |
C D B C | |
2. Parent Annotation | |
In addition to binarizing the tree, there are two standard | |
modifications to node labels we can do in the same traversal: parent | |
annotation and Markov order-N smoothing (or sibling smoothing). | |
The purpose of parent annotation is to refine the probabilities of | |
productions by adding a small amount of context. With this simple | |
addition, a CYK (inside-outside, dynamic programming chart parse) | |
can improve from 74% to 79% accuracy. A natural generalization from | |
parent annotation is to grandparent annotation and beyond. The | |
tradeoff becomes accuracy gain vs. computational complexity. We | |
must also keep in mind data sparcity issues. Example:: | |
Original Parent Annotation | |
A A^<?> | |
/ | \ / \ | |
B C D ==> B^<A> A|<C-D>^<?> where ? is the | |
/ \ parent of A | |
C^<A> D^<A> | |
3. Markov order-N smoothing | |
Markov smoothing combats data sparcity issues as well as decreasing | |
computational requirements by limiting the number of children | |
included in artificial nodes. In practice, most people use an order | |
2 grammar. Example:: | |
Original No Smoothing Markov order 1 Markov order 2 etc. | |
__A__ A A A | |
/ /|\ \ / \ / \ / \ | |
B C D E F ==> B A|<C-D-E-F> ==> B A|<C> ==> B A|<C-D> | |
/ \ / \ / \ | |
C ... C ... C ... | |
Annotation decisions can be thought about in the vertical direction | |
(parent, grandparent, etc) and the horizontal direction (number of | |
siblings to keep). Parameters to the following functions specify | |
these values. For more information see: | |
Dan Klein and Chris Manning (2003) "Accurate Unlexicalized | |
Parsing", ACL-03. https://www.aclweb.org/anthology/P03-1054 | |
4. Unary Collapsing | |
Collapse unary productions (ie. subtrees with a single child) into a | |
new non-terminal (Tree node). This is useful when working with | |
algorithms that do not allow unary productions, yet you do not wish | |
to lose the parent information. Example:: | |
A | |
| | |
B ==> A+B | |
/ \ / \ | |
C D C D | |
""" | |
from nltk.internals import deprecated | |
from nltk.tree.transforms import chomsky_normal_form as cnf | |
from nltk.tree.transforms import collapse_unary as cu | |
from nltk.tree.transforms import un_chomsky_normal_form as ucnf | |
chomsky_normal_form = deprecated( | |
"Import using `from nltk.tree import chomsky_normal_form` instead." | |
)(cnf) | |
un_chomsky_normal_form = deprecated( | |
"Import using `from nltk.tree import un_chomsky_normal_form` instead." | |
)(ucnf) | |
collapse_unary = deprecated( | |
"Import using `from nltk.tree import collapse_unary` instead." | |
)(cu) | |
__all__ = ["chomsky_normal_form", "un_chomsky_normal_form", "collapse_unary"] | |