Why Python and NLTK?

Most modern natural language processing (NLP) depends heavily on statistics and complex statistical models. So why use Python, a relatively slow scripting language, for NLP?

Python’s strengths are in its text, list, and structure support. Structures are dynamically typed, but supported by a powerful set of language constructs in the form of list comprehensions and lambda functions. Also, as a modern general-purpose scripting language, Python is well suited to prototyping, so development is quick even if final processing times might be slow.
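As a small illustration of those constructs, here is a minimal sketch that filters and ranks words using only the standard library. The sample sentence and the variable names are made up for the example, and splitting on whitespace stands in for a real tokenizer.

```python
from collections import Counter

# Made-up sample text; whitespace splitting stands in for a real tokenizer.
text = "the quick brown fox jumps over the lazy dog and the cat"
tokens = text.split()

# List comprehension: lowercase each token and keep words longer than 3 characters.
long_words = [word.lower() for word in tokens if len(word) > 3]

# Lambda as a sort key: rank words by descending frequency, then alphabetically.
counts = Counter(tokens)
ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))

print(long_words)
print(ranked[:3])
```

Each step is a single expression, which is part of what makes this kind of code easy to modify while experimenting.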

Finally, Python has the open source Natural Language Toolkit (NLTK). This library is a rich set of natural language processing tools and datasets, intended for educational purposes. As such, NLTK and Python together are well suited to teaching NLP techniques, because powerful examples can be demonstrated with short sections of code. Students can then quickly modify this code in a meaningful way.

Built-in default implementations can be used for quick prototyping, and specific implementations replaced as required. For example, here is a simple part-of-speech (POS) tagger built on NLTK’s default word tokenizer and POS tagger:

>>> import nltk
>>> text = nltk.word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
('completely', 'RB'), ('different', 'JJ')]

This example is taken from the Natural Language Processing with Python book.

It really is that simple to get started! A more sophisticated implementation might use a word tokenizer tailored for the input data, or a POS tagger trained from user-supplied pre-tagged data.
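A rough sketch of both ideas follows, using NLTK's RegexpTokenizer, UnigramTagger, and DefaultTagger classes. The regular expression, the example sentence, and the tiny hand-tagged training set are all invented for illustration; real training data would be far larger, and a unigram tagger is only one of several trainable taggers that could be used here.

```python
from nltk.tag import DefaultTagger, UnigramTagger
from nltk.tokenize import RegexpTokenizer

# Hypothetical tokenizer for input where hyphenated terms should stay whole.
tokenizer = RegexpTokenizer(r"\w+(?:-\w+)*")
tokens = tokenizer.tokenize("A state-of-the-art tagger, trained quickly")

# Toy pre-tagged sentences standing in for user-supplied training data.
train_sents = [
    [("A", "DT"), ("tagger", "NN"), ("trained", "VBN"), ("quickly", "RB")],
    [("the", "DT"), ("tagger", "NN"), ("works", "VBZ")],
]

# Words not seen in training fall back to the default tag 'NN'.
backoff = DefaultTagger("NN")
tagger = UnigramTagger(train_sents, backoff=backoff)

tagged = tagger.tag(tokens)
print(tagged)
```

Neither class needs NLTK's downloadable models, so this runs with the library alone. Here "state-of-the-art" stays a single token and, being unseen in the training data, receives the backoff tag.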

NLTK could be used for production work, although processing times might restrict its use. For faster processing, NLTK can make use of SciPy and NumPy, and core components could be replaced by compiled C++ implementations.

Further Reading

The NLTK installer, further information, and samples can be found on the Natural Language Toolkit website.

O’Reilly publish the Natural Language Processing with Python book, which is also available online under a Creative Commons license. It is an excellent introduction to NLTK and also includes some basic coverage of the Python language.
