Although the Python NLTK library contains a number of parsers and grammars, these can only parse words that are defined in the grammar. This also applies to the supported Probabilistic Context Free Grammars (PCFGs). Therefore, in order to work with a more general parser that can handle unseen words, you have to use a Python wrapper to call an external parser. This article shows you how to call the popular Charniak-Johnson two-stage parser from Python.
A probabilistic parser works in a similar way to the part of speech (POS) taggers which we have seen before. Instead of giving one parse, it gives a range of parses, with an estimated probability of each being correct. The NLTK parsers are lexicalized, i.e. they work with the actual words. A reasonable way to handle unseen words is to use a non-lexicalized parser, i.e. one which works with word categories. Typically POS or modified POS categories are used. Charniak was particularly successful with his non-lexicalized PCFG parser, which he trained on the Penn Treebank data.
Later, working with Johnson at the Brown Laboratory for Linguistic Information Processing (BLLIP), a second re-ranking stage was added to the parser. The original Charniak model works locally, and cannot take into account any global parsing information. For example, English parse trees tend to be right-branching, with larger constituents towards the end of the sentence. The Charniak parser cannot take these large-scale features into account. Therefore the first stage is set to produce the most likely parses (typically the 50 best), and a second "Discriminative Reranking" stage chooses the best of these parses by using features based on all or part of each parse tree.
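The reranking idea can be sketched in a few lines of Python. This is only an illustration of the second stage, not the actual BLLIP code; the function and feature names are made up, and a real reranker uses many learned features:

```python
def rerank(nbest, feature_weights, extract_features):
    """Pick the best parse from an n-best list (illustrative sketch).

    nbest            -- list of (parse_string, first_stage_score) pairs
    feature_weights  -- dict mapping feature name to a learned weight
    extract_features -- function mapping a parse string to a dict of
                        feature name -> value (global tree features the
                        first stage cannot see, e.g. right-branching)
    """
    def score(parse, base_score):
        feats = extract_features(parse)
        # Combine the first-stage score with the global feature scores
        return base_score + sum(feature_weights.get(name, 0.0) * value
                                for name, value in feats.items())

    return max(nbest, key=lambda pair: score(pair[0], pair[1]))
```

The first stage's generative score is kept as one input, and the discriminative features merely adjust the ranking, which is essentially why the second stage can correct the first stage's local decisions without re-parsing.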
The details of these algorithms are described in the references below.
Charniak has published the original Charniak-Johnson parser on his home page. Unfortunately this version is not maintained, and requires quite a few changes to compile on a modern Ubuntu/G++ system.
Although links on the Brown website tend to prefer the older version, Brown does have a corrected and maintained version on GitHub, called BLLIP-Parser. This is the version you should download. It includes trained models, and also a trainer, but you will need the Penn-Treebank3 data in order to recreate the models. BLLIP-Parser should build first time on a Unix or Linux system. You will need to install G++ and Flex if you do not have them installed already.
After building the BLLIP-Parser, you should check that it works okay by using the sample shell scripts (see the README file for details).
Next we need to get it working from Python. Luckily mitjat has saved us most of the work by writing his charniak_wrapper. Although this was written for the original Charniak-Johnson parser, it also works with the BLLIP parser. It calls the parser using a shell script, and supports coarse multi-threading by running multiple parsers on multiple threads. Note that this is less important now that the BLLIP parser itself supports multi-threading. charniak_wrapper can be run as a web service, or imported as a conventional Python library. I chose the latter, although I had to make a couple of changes to make it work properly. These have both been added to the GitHub master branch, so they should be included when you download the wrapper. The most significant fix was for a set of syntax errors in kill_parsers().
The log code also needs to be amended. The wrapper currently uses mitjat’s own logging scripts (not included). Luckily mitjat also implemented a “fallback logger” which simply substitutes each log entry with a print statement. This works fine. I would recommend you keep the logging in place whilst you are testing the script and BLLIP, and then disable it for more advanced development.
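The fallback pattern is simple enough to sketch. The module and function names below are illustrative, not the wrapper's actual identifiers; the point is just that a missing logging package degrades gracefully to print:

```python
# Fallback logger sketch: if the project's own logging module is not
# available, substitute each log entry with a plain print.
try:
    from mitjat_logging import log_info   # hypothetical external logger
except ImportError:
    def log_info(*args):
        # Fallback: join the arguments and print them as one line
        message = " ".join(str(a) for a in args)
        print(message)
        return message
```

Because the fallback has the same signature as the real logger, the rest of the script can call it unconditionally, which is why the wrapper "works fine" even without mitjat's logging scripts.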
You will also need to amend the configuration settings. These are all in capitals at the beginning of the script. PARSER_DIR should be set to point to the directory containing the BLLIP parser, i.e. the directory containing parse.sh. CURRENT_DIR is only used as a component of PARSER_DIR — I would recommend hard-coding it. I would also recommend setting an explicit path for the PARSER_CMD.
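Concretely, the top of the script ends up looking something like this. The variable names are the ones used in the wrapper, but the paths are illustrative and must point at your own checkouts:

```python
import os

# Hard-code the wrapper's directory rather than deriving it
CURRENT_DIR = "/home/me/charniak_wrapper"

# Directory containing the BLLIP parser, i.e. the one with parse.sh in it
PARSER_DIR = "/home/me/bllip-parser"

# Explicit path to the parser shell script
PARSER_CMD = os.path.join(PARSER_DIR, "parse.sh")
```

With explicit absolute paths, the wrapper behaves the same regardless of which directory you launch Python from.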
Here is my test script:
# Test script for the Charniak wrapper running the BLLIP parser
import parser_on_demand
from nltk import Tree

print "Calling init..."
parser_on_demand.init(num_parsers=1)
print "Initialised"

# Create some test data
sents = []
sents.append("The cat sat on the mat .")
sents.append("Wrapper around Charniak's parser, allowing it to use one sentence at a time while the engine is running.")

# Parse the test sentences
sresults = parser_on_demand.parse_sentences(sents)

# Combine the input and the results
results = zip(sents, sresults)

# Display each sentence and its parse, along with the NLTK tree version
for sent, parse in results:
    print
    print sent
    print parse
    t = Tree.parse(parse)
    print t

print "finished: cleaning up..."
parser_on_demand.kill_parsers()
print "EXIT"
This should be located in the same directory as the Charniak wrapper's parser_on_demand.py script. Note that the parsed strings use the same '( )' bracketed format as the NLTK Tree parse. Hence they can be very easily converted into NLTK parse trees for further processing.
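To see why the conversion is so direct, here is a tiny stand-alone reader that turns a bracketed parse string into nested Python lists. This is only a demonstration of the format; in practice NLTK's Tree.parse does the real job, including node labels and tree methods:

```python
def read_tree(s):
    """Read a Penn Treebank-style bracketed string into nested lists.

    Each "(LABEL child child ...)" becomes a list whose first element
    is the label; bare tokens become strings.
    """
    # Pad the brackets so a simple split() tokenizes the string
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()

    def walk(i):
        if tokens[i] == "(":
            node = []
            i += 1
            while tokens[i] != ")":
                child, i = walk(i)
                node.append(child)
            return node, i + 1       # skip the closing bracket
        return tokens[i], i + 1      # a leaf token

    tree, _ = walk(0)
    return tree
```

For example, `read_tree("(S (NP (DT The) (NN cat)) (VP (VBD sat)))")` yields the nested-list form of the tree, which mirrors exactly what Tree.parse builds as a Tree object.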
Foundations of Statistical Natural Language Processing, Manning & Schütze, has good coverage of the original Charniak parser.
Speech and Language Processing, Jurafsky & Martin, has less coverage of the original Charniak parser, but it does include some coverage of Discriminative Reranking.
Charniak & Johnson 2005, Coarse-to-fine n-best parsing and MaxEnt discriminative reranking
McClosky, Charniak & Johnson 2006, Effective Self-Training for Parsing