Hot-keys on this page

r m x p   toggle line displays

j k   next/prev highlighted chunk

0   (zero) top of page

1   (one) first highlighted chunk

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

37

38

39

40

41

42

43

44

45

46

47

48

49

50

51

52

53

54

55

56

57

58

59

60

61

62

63

64

65

66

67

68

69

70

71

72

73

74

75

76

77

78

79

80

81

82

83

84

85

86

87

88

89

90

91

92

93

94

95

96

97

98

99

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

120

121

122

123

124

125

126

127

128

129

130

131

132

133

134

135

136

137

138

139

140

141

142

143

144

145

146

147

148

149

150

151

152

153

154

155

156

157

158

159

160

161

162

163

164

165

166

167

168

169

170

171

172

173

174

175

176

177

178

179

180

181

182

183

184

185

186

187

188

189

190

191

192

193

194

195

196

197

198

199

200

201

202

203

204

205

206

207

208

209

210

211

212

213

214

215

216

217

218

219

220

221

222

223

224

225

226

227

228

229

230

231

232

233

234

235

236

237

238

239

240

241

242

243

244

245

246

247

248

249

250

251

252

253

254

255

256

257

258

259

260

261

262

263

264

265

266

267

268

269

270

271

272

273

274

275

276

277

278

279

280

281

282

283

284

285

286

287

288

289

290

291

292

293

294

295

296

297

298

299

300

301

302

303

304

305

306

307

308

309

# Natural Language Toolkit: Tree Transformations 

# 

# Copyright (C) 2005-2007 Oregon Graduate Institute 

# Author: Nathan Bodenstab <bodenstab@cslu.ogi.edu> 

# URL: <http://www.nltk.org/> 

# For license information, see LICENSE.TXT 

 

""" 

A collection of methods for tree (grammar) transformations used 

in parsing natural language. 

 

Although many of these methods are technically grammar transformations 

(ie. Chomsky Norm Form), when working with treebanks it is much more 

natural to visualize these modifications in a tree structure.  Hence, 

we will do all transformation directly to the tree itself. 

Transforming the tree directly also allows us to do parent annotation. 

A grammar can then be simply induced from the modified tree. 

 

The following is a short tutorial on the available transformations. 

 

1. Chomsky Normal Form (binarization) 

 

    It is well known that any grammar has a Chomsky Normal Form (CNF) 

    equivalent grammar where CNF is defined by every production having 

    either two non-terminals or one terminal on its right hand side. 

    When we have hierarchically structured data (ie. a treebank), it is 

    natural to view this in terms of productions where the root of every 

    subtree is the head (left hand side) of the production and all of 

    its children are the right hand side constituents.  In order to 

    convert a tree into CNF, we simply need to ensure that every subtree 

    has either two subtrees as children (binarization), or one leaf node 

    (non-terminal).  In order to binarize a subtree with more than two 

    children, we must introduce artificial nodes. 

 

    There are two popular methods to convert a tree into CNF: left 

    factoring and right factoring.  The following example demonstrates 

    the difference between them.  Example:: 

 

     Original       Right-Factored     Left-Factored 

 

          A              A                      A 

        / | \          /   \                  /   \ 

       B  C  D   ==>  B    A|<C-D>   OR   A|<B-C>  D 

                            /  \          /  \ 

                           C    D        B    C 

 

2. Parent Annotation 

 

    In addition to binarizing the tree, there are two standard 

    modifications to node labels we can do in the same traversal: parent 

    annotation and Markov order-N smoothing (or sibling smoothing). 

 

    The purpose of parent annotation is to refine the probabilities of 

    productions by adding a small amount of context.  With this simple 

    addition, a CYK (inside-outside, dynamic programming chart parse) 

    can improve from 74% to 79% accuracy.  A natural generalization from 

    parent annotation is to grandparent annotation and beyond.  The 

    tradeoff becomes accuracy gain vs. computational complexity.  We 

    must also keep in mind data sparcity issues.  Example:: 

 

     Original       Parent Annotation 

 

          A                A^<?> 

        / | \             /   \ 

       B  C  D   ==>  B^<A>    A|<C-D>^<?>     where ? is the 

                                 /  \          parent of A 

                             C^<A>   D^<A> 

 

 

3. Markov order-N smoothing 

 

    Markov smoothing combats data sparcity issues as well as decreasing 

    computational requirements by limiting the number of children 

    included in artificial nodes.  In practice, most people use an order 

    2 grammar.  Example:: 

 

      Original       No Smoothing       Markov order 1   Markov order 2   etc. 

 

       __A__            A                      A                A 

      / /|\ \         /   \                  /   \            /   \ 

     B C D E F  ==>  B    A|<C-D-E-F>  ==>  B   A|<C>  ==>   B  A|<C-D> 

                            /   \               /   \            /   \ 

                           C    ...            C    ...         C    ... 

 

 

 

    Annotation decisions can be thought about in the vertical direction 

    (parent, grandparent, etc) and the horizontal direction (number of 

    siblings to keep).  Parameters to the following functions specify 

    these values.  For more information see: 

 

    Dan Klein and Chris Manning (2003) "Accurate Unlexicalized 

    Parsing", ACL-03.  http://www.aclweb.org/anthology/P03-1054 

 

4. Unary Collapsing 

 

    Collapse unary productions (ie. subtrees with a single child) into a 

    new non-terminal (Tree node).  This is useful when working with 

    algorithms that do not allow unary productions, yet you do not wish 

    to lose the parent information.  Example:: 

 

       A 

       | 

       B   ==>   A+B 

      / \        / \ 

     C   D      C   D 

 

""" 

from __future__ import print_function 

 

from nltk.tree import Tree 

 

def chomsky_normal_form(tree, factor = "right", horzMarkov = None, vertMarkov = 0, childChar = "|", parentChar = "^"): 

    # assume all subtrees have homogeneous children 

    # assume all terminals have no siblings 

 

    # A semi-hack to have elegant looking code below.  As a result, 

    # any subtree with a branching factor greater than 999 will be incorrectly truncated. 

    if horzMarkov is None: horzMarkov = 999 

 

    # Traverse the tree depth-first keeping a list of ancestor nodes to the root. 

    # I chose not to use the tree.treepositions() method since it requires 

    # two traversals of the tree (one to get the positions, one to iterate 

    # over them) and node access time is proportional to the height of the node. 

    # This method is 7x faster which helps when parsing 40,000 sentences. 

 

    nodeList = [(tree, [tree.node])] 

    while nodeList != []: 

        node, parent = nodeList.pop() 

        if isinstance(node,Tree): 

 

            # parent annotation 

            parentString = "" 

            originalNode = node.node 

            if vertMarkov != 0 and node != tree and isinstance(node[0],Tree): 

                parentString = "%s<%s>" % (parentChar, "-".join(parent)) 

                node.node += parentString 

                parent = [originalNode] + parent[:vertMarkov - 1] 

 

            # add children to the agenda before we mess with them 

            for child in node: 

                nodeList.append((child, parent)) 

 

            # chomsky normal form factorization 

            if len(node) > 2: 

                childNodes = [child.node for child in node] 

                nodeCopy = node.copy() 

                node[0:] = [] # delete the children 

 

                curNode = node 

                numChildren = len(nodeCopy) 

                for i in range(1,numChildren - 1): 

                    if factor == "right": 

                        newHead = "%s%s<%s>%s" % (originalNode, childChar, "-".join(childNodes[i:min([i+horzMarkov,numChildren])]),parentString) # create new head 

                        newNode = Tree(newHead, []) 

                        curNode[0:] = [nodeCopy.pop(0), newNode] 

                    else: 

                        newHead = "%s%s<%s>%s" % (originalNode, childChar, "-".join(childNodes[max([numChildren-i-horzMarkov,0]):-i]),parentString) 

                        newNode = Tree(newHead, []) 

                        curNode[0:] = [newNode, nodeCopy.pop()] 

 

                    curNode = newNode 

 

                curNode[0:] = [child for child in nodeCopy] 

 

 

def un_chomsky_normal_form(tree, expandUnary = True, childChar = "|", parentChar = "^", unaryChar = "+"): 

    # Traverse the tree-depth first keeping a pointer to the parent for modification purposes. 

    nodeList = [(tree,[])] 

    while nodeList != []: 

        node,parent = nodeList.pop() 

        if isinstance(node,Tree): 

            # if the node contains the 'childChar' character it means that 

            # it is an artificial node and can be removed, although we still need 

            # to move its children to its parent 

            childIndex = node.node.find(childChar) 

            if childIndex != -1: 

                nodeIndex = parent.index(node) 

                parent.remove(parent[nodeIndex]) 

                # Generated node was on the left if the nodeIndex is 0 which 

                # means the grammar was left factored.  We must insert the children 

                # at the beginning of the parent's children 

                if nodeIndex == 0: 

                    parent.insert(0,node[0]) 

                    parent.insert(1,node[1]) 

                else: 

                    parent.extend([node[0],node[1]]) 

 

                # parent is now the current node so the children of parent will be added to the agenda 

                node = parent 

            else: 

                parentIndex = node.node.find(parentChar) 

                if parentIndex != -1: 

                    # strip the node name of the parent annotation 

                    node.node = node.node[:parentIndex] 

 

                # expand collapsed unary productions 

                if expandUnary == True: 

                    unaryIndex = node.node.find(unaryChar) 

                    if unaryIndex != -1: 

                        newNode = Tree(node.node[unaryIndex + 1:], [i for i in node]) 

                        node.node = node.node[:unaryIndex] 

                        node[0:] = [newNode] 

 

            for child in node: 

                nodeList.append((child,node)) 

 

 

def collapse_unary(tree, collapsePOS = False, collapseRoot = False, joinChar = "+"): 

    """ 

    Collapse subtrees with a single child (ie. unary productions) 

    into a new non-terminal (Tree node) joined by 'joinChar'. 

    This is useful when working with algorithms that do not allow 

    unary productions, and completely removing the unary productions 

    would require loss of useful information.  The Tree is modified 

    directly (since it is passed by reference) and no value is returned. 

 

    :param tree: The Tree to be collapsed 

    :type  tree: Tree 

    :param collapsePOS: 'False' (default) will not collapse the parent of leaf nodes (ie. 

                        Part-of-Speech tags) since they are always unary productions 

    :type  collapsePOS: bool 

    :param collapseRoot: 'False' (default) will not modify the root production 

                         if it is unary.  For the Penn WSJ treebank corpus, this corresponds 

                         to the TOP -> productions. 

    :type collapseRoot: bool 

    :param joinChar: A string used to connect collapsed node values (default = "+") 

    :type  joinChar: str 

    """ 

 

    if collapseRoot == False and isinstance(tree, Tree) and len(tree) == 1: 

        nodeList = [tree[0]] 

    else: 

        nodeList = [tree] 

 

    # depth-first traversal of tree 

    while nodeList != []: 

        node = nodeList.pop() 

        if isinstance(node,Tree): 

            if len(node) == 1 and isinstance(node[0], Tree) and (collapsePOS == True or isinstance(node[0,0], Tree)): 

                node.node += joinChar + node[0].node 

                node[0:] = [child for child in node[0]] 

                # since we assigned the child's children to the current node, 

                # evaluate the current node again 

                nodeList.append(node) 

            else: 

                for child in node: 

                    nodeList.append(child) 

 

################################################################# 

# Demonstration 

################################################################# 

 

def demo(): 

    """ 

    A demonstration showing how each tree transform can be used. 

    """ 

 

    from nltk.draw.tree import draw_trees 

    from nltk import tree, treetransforms 

    from copy import deepcopy 

 

    # original tree from WSJ bracketed text 

    sentence = """(TOP 

  (S 

    (S 

      (VP 

        (VBN Turned) 

        (ADVP (RB loose)) 

        (PP 

          (IN in) 

          (NP 

            (NP (NNP Shane) (NNP Longman) (POS 's)) 

            (NN trading) 

            (NN room))))) 

    (, ,) 

    (NP (DT the) (NN yuppie) (NNS dealers)) 

    (VP (AUX do) (NP (NP (RB little)) (ADJP (RB right)))) 

    (. .)))""" 

    t = tree.Tree.parse(sentence, remove_empty_top_bracketing=True) 

 

    # collapse subtrees with only one child 

    collapsedTree = deepcopy(t) 

    treetransforms.collapse_unary(collapsedTree) 

 

    # convert the tree to CNF 

    cnfTree = deepcopy(collapsedTree) 

    treetransforms.chomsky_normal_form(cnfTree) 

 

    # convert the tree to CNF with parent annotation (one level) and horizontal smoothing of order two 

    parentTree = deepcopy(collapsedTree) 

    treetransforms.chomsky_normal_form(parentTree, horzMarkov=2, vertMarkov=1) 

 

    # convert the tree back to its original form (used to make CYK results comparable) 

    original = deepcopy(parentTree) 

    treetransforms.un_chomsky_normal_form(original) 

 

    # convert tree back to bracketed text 

    sentence2 = original.pprint() 

    print(sentence) 

    print(sentence2) 

    print("Sentences the same? ", sentence == sentence2) 

 

    draw_trees(t, collapsedTree, cnfTree, parentTree, original) 

 

if __name__ == '__main__': 

    demo() 

 

__all__ = ["chomsky_normal_form", "un_chomsky_normal_form", "collapse_unary"]