Following on from my previous post about NLTK Trees, here is a short Python function to extract phrases from an NLTK Tree structure.
Recently I needed to extract noun phrases from a section of text, in an attempt to choose “interesting concept phrases”. N-gram collocations are a common way of doing this, but they also produced partial phrases that defined a concept poorly. Most of the phrases I was interested in were noun phrases, so I chose to tag and chunk the text. The noun phrases (tagged ‘NP’) were then extracted from the chunked tree structures.
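As a rough illustration of the tag-and-chunk step, here is a minimal sketch using NLTK's `RegexpParser`. The chunk grammar and the pre-tagged sentence are assumptions made for the example, not the ones used for this post, and the code assumes NLTK 3 on Python 3.

```python
import nltk

# Pre-tagged tokens, so the sketch needs no downloaded tagger model.
# In practice these would come from nltk.pos_tag(nltk.word_tokenize(text)).
tagged = [("I", "PRP"), ("enjoyed", "VBD"), ("my", "PRP$"),
          ("chocolate", "NN"), ("cookies", "NNS")]

# Illustrative grammar (an assumption): an NP is an optional determiner or
# possessive pronoun, any adjectives, then one or more nouns; a lone
# personal pronoun also counts as an NP.
grammar = r"""
  NP: {<DT|PRP\$>?<JJ.*>*<NN.*>+}
      {<PRP>}
"""
chunker = nltk.RegexpParser(grammar)
chunked = chunker.parse(tagged)  # an nltk.Tree whose root label is 'S'

# Leaves of a chunked tree are (word, tag) pairs; keep only the words.
noun_phrases = [" ".join(word for word, tag in subtree.leaves())
                for subtree in chunked.subtrees(lambda t: t.label() == "NP")]
print(noun_phrases)
```

The grammar is only a placeholder; any chunker that produces an NLTK Tree with ‘NP’ sub-trees can feed the extraction step in the same way.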
Here is the code:
from nltk.tree import *

# Tree manipulation

# Extract phrases from a parsed (chunked) tree
# Phrase = tag for the string phrase (sub-tree) to extract
# Returns: List of deep copies; Recursive
def ExtractPhrases( myTree, phrase):
    myPhrases = []
    if (myTree.node == phrase):
        myPhrases.append( myTree.copy(True) )
    for child in myTree:
        if (type(child) is Tree):
            list_of_phrases = ExtractPhrases(child, phrase)
            if (len(list_of_phrases) > 0):
                myPhrases.extend(list_of_phrases)
    return myPhrases
This function iterates through the tree, finding all sub-trees with matching tags (‘NP’ in my application), and returning a list of deep copies of these sub-trees.
Here is an example of the function’s usage:
test = Tree.parse('(S (NP I) (VP (V enjoyed) (NP my cookies)))')
print "Input tree: ", test

print "\nNoun phrases:"
list_of_noun_phrases = ExtractPhrases(test, 'NP')
for phrase in list_of_noun_phrases:
    print " ", phrase
This function is a simple demonstration of how the Tree structure can be easily processed using short functions.
It is written in a very procedural manner and is neither very functional nor Pythonic. Perhaps you could write a more elegant and Pythonic version?
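One possible answer, offered as a sketch and not as the definitive version: NLTK's `Tree.subtrees()` already walks the tree and accepts a filter function, so the recursion can be left to the library. This assumes NLTK 3 on Python 3, where the label is read with `label()` and trees are built with `Tree.fromstring()`; the name `extract_phrases` is my own.

```python
from nltk.tree import Tree


def extract_phrases(tree, label):
    """Return deep copies of every sub-tree of `tree` whose label equals `label`."""
    return [subtree.copy(deep=True)
            for subtree in tree.subtrees(lambda t: t.label() == label)]


tree = Tree.fromstring("(S (NP I) (VP (V enjoyed) (NP my cookies)))")
for phrase in extract_phrases(tree, "NP"):
    print(phrase)
```

Because `subtrees()` visits the tree itself first and then each descendant in order, the phrases come back in the same order as in the recursive version.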
I tried running the above code, but I get errors telling me to use a different method to access the node value and to set the label.
The above code is Python 2, of course. It was also written nearly four years ago, so it is possible that NLTK has changed slightly.
@Jateen: In line 3, replace node with label(). That is, myTree.label() == phrase.
Thanks – I think there must have been a change to the tree definition in NLTK.
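For anyone hitting the same errors, here is a minimal port of the original function and its usage example, assuming NLTK 3 on Python 3. As far as I can tell, the changes needed are `label()` in place of `node`, `Tree.fromstring()` in place of `Tree.parse()`, and `print` as a function; treat it as a sketch and check it against your installed NLTK version.

```python
from nltk.tree import Tree


# Extract phrases from a parsed (chunked) tree.
# phrase = label of the sub-trees to extract.
# Returns a list of deep copies; recursive.
def ExtractPhrases(myTree, phrase):
    myPhrases = []
    if myTree.label() == phrase:  # was: myTree.node
        myPhrases.append(myTree.copy(True))
    for child in myTree:
        if isinstance(child, Tree):
            myPhrases.extend(ExtractPhrases(child, phrase))
    return myPhrases


test = Tree.fromstring('(S (NP I) (VP (V enjoyed) (NP my cookies)))')  # was: Tree.parse
print("Input tree:", test)

print("\nNoun phrases:")
for phrase in ExtractPhrases(test, 'NP'):
    print(" ", phrase)
```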
This is a handy function. I have also written a similar one. Have you thought about NPs that overlap? For example, what if we have nested NPs? Does this function return all of them, or does it simply choose between the parent and the child NPs?
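This works on text that has already been parsed into a tree. A single tree cannot represent “overlapping” noun phrases, although it can represent nested ones. In the nested case the function as written returns all of them: it records a matching node and then still recurses into that node's children, so an inner NP appears inside its parent's copy and again as its own entry.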
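To see how nested noun phrases come out in practice, one can compare every matching sub-tree with only the outermost ones. The sketch below assumes NLTK 3 on Python 3; `extract_outermost` is a hypothetical helper written for this comparison, and the example sentence is made up.

```python
from nltk.tree import Tree


def extract_outermost(tree, label):
    """Return deep copies of matching sub-trees, without descending into a match."""
    if tree.label() == label:
        return [tree.copy(deep=True)]
    found = []
    for child in tree:
        if isinstance(child, Tree):
            found.extend(extract_outermost(child, label))
    return found


nested = Tree.fromstring(
    "(S (NP (NP the cookies) (PP (P on) (NP the table))) (VP (V vanished)))")

# Every NP, nested ones included, in the order the tree is walked.
all_nps = list(nested.subtrees(lambda t: t.label() == "NP"))
# Only the highest-level NP on each branch.
top_nps = extract_outermost(nested, "NP")

print(len(all_nps), "NPs in total;", len(top_nps), "outermost")
for np_tree in top_nps:
    print(" ".join(np_tree.leaves()))
```

Which behaviour is preferable depends on the application: for concept phrases, the outermost NP gives the fullest phrase, while the inner NPs give shorter candidate terms.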