File size: 5,288 Bytes
d916065
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
# Natural Language Toolkit: Tree Transformations
#
# Copyright (C) 2005-2007 Oregon Graduate Institute
# Author: Nathan Bodenstab <[email protected]>
# URL: <https://www.nltk.org/>
# For license information, see LICENSE.TXT

r"""

A collection of methods for tree (grammar) transformations used

in parsing natural language.



Although many of these methods are technically grammar transformations

(ie. Chomsky Norm Form), when working with treebanks it is much more

natural to visualize these modifications in a tree structure.  Hence,

we will do all transformation directly to the tree itself.

Transforming the tree directly also allows us to do parent annotation.

A grammar can then be simply induced from the modified tree.



The following is a short tutorial on the available transformations.



 1. Chomsky Normal Form (binarization)



    It is well known that any grammar has a Chomsky Normal Form (CNF)

    equivalent grammar where CNF is defined by every production having

    either two non-terminals or one terminal on its right hand side.

    When we have hierarchically structured data (ie. a treebank), it is

    natural to view this in terms of productions where the root of every

    subtree is the head (left hand side) of the production and all of

    its children are the right hand side constituents.  In order to

    convert a tree into CNF, we simply need to ensure that every subtree

    has either two subtrees as children (binarization), or one leaf node

    (non-terminal).  In order to binarize a subtree with more than two

    children, we must introduce artificial nodes.



    There are two popular methods to convert a tree into CNF: left

    factoring and right factoring.  The following example demonstrates

    the difference between them.  Example::



     Original       Right-Factored     Left-Factored



          A              A                      A

        / | \          /   \                  /   \

       B  C  D   ==>  B    A|<C-D>   OR   A|<B-C>  D

                            /  \          /  \

                           C    D        B    C



 2. Parent Annotation



    In addition to binarizing the tree, there are two standard

    modifications to node labels we can do in the same traversal: parent

    annotation and Markov order-N smoothing (or sibling smoothing).



    The purpose of parent annotation is to refine the probabilities of

    productions by adding a small amount of context.  With this simple

    addition, a CYK (inside-outside, dynamic programming chart parse)

    can improve from 74% to 79% accuracy.  A natural generalization from

    parent annotation is to grandparent annotation and beyond.  The

    tradeoff becomes accuracy gain vs. computational complexity.  We

    must also keep in mind data sparcity issues.  Example::



     Original       Parent Annotation



          A                A^<?>

        / | \             /   \

       B  C  D   ==>  B^<A>    A|<C-D>^<?>     where ? is the

                                 /  \          parent of A

                             C^<A>   D^<A>





 3. Markov order-N smoothing



    Markov smoothing combats data sparcity issues as well as decreasing

    computational requirements by limiting the number of children

    included in artificial nodes.  In practice, most people use an order

    2 grammar.  Example::



      Original       No Smoothing       Markov order 1   Markov order 2   etc.



       __A__            A                      A                A

      / /|\ \         /   \                  /   \            /   \

     B C D E F  ==>  B    A|<C-D-E-F>  ==>  B   A|<C>  ==>   B  A|<C-D>

                            /   \               /   \            /   \

                           C    ...            C    ...         C    ...







    Annotation decisions can be thought about in the vertical direction

    (parent, grandparent, etc) and the horizontal direction (number of

    siblings to keep).  Parameters to the following functions specify

    these values.  For more information see:



    Dan Klein and Chris Manning (2003) "Accurate Unlexicalized

    Parsing", ACL-03.  https://www.aclweb.org/anthology/P03-1054



 4. Unary Collapsing



    Collapse unary productions (ie. subtrees with a single child) into a

    new non-terminal (Tree node).  This is useful when working with

    algorithms that do not allow unary productions, yet you do not wish

    to lose the parent information.  Example::



       A

       |

       B   ==>   A+B

      / \        / \

     C   D      C   D



"""

from nltk.internals import deprecated
from nltk.tree.transforms import chomsky_normal_form as cnf
from nltk.tree.transforms import collapse_unary as cu
from nltk.tree.transforms import un_chomsky_normal_form as ucnf

chomsky_normal_form = deprecated(
    "Import using `from nltk.tree import chomsky_normal_form` instead."
)(cnf)
un_chomsky_normal_form = deprecated(
    "Import using `from nltk.tree import un_chomsky_normal_form` instead."
)(ucnf)
collapse_unary = deprecated(
    "Import using `from nltk.tree import collapse_unary` instead."
)(cu)


__all__ = ["chomsky_normal_form", "un_chomsky_normal_form", "collapse_unary"]