Book Review: Python Text Processing with NLTK 2.0 Cookbook

“Python Text Processing with NLTK 2.0 Cookbook” by Jacob Perkins is a useful complement to “Natural Language Processing with Python”. Rather than trying to introduce Python, NLP, and NLTK in one book, it focuses on practical worked examples.

(This review focuses on the print edition of the full version of the book. Packt have also produced a cheaper ‘lite’ edition, and the book is available in electronic form.)

Although “Natural Language Processing with Python” is considered the standard NLTK text, it is primarily a teaching book and, in my opinion, spreads itself a little thin by trying to cover basic Python programming, natural language processing, and NLTK all at once. Perkins takes a different approach: this cookbook provides over 80 worked examples that solve specific text processing problems, primarily with NLTK.

Each cookbook entry has five sections:

  • Getting Ready. Usually a ‘big picture’ explanation of how the recipe will work. Also lists any external data or libraries that might be required.
  • How to do it. This supplies the actual code.
  • How it works. This explains how the code actually works.
  • There’s more. This explains how the code can be modified or expanded.
  • See also. References other related cookbook entries.

In my experience, How it works and There’s more tend to be the most useful sections. So far, I have rarely used the code as-is, but these sections give enough explanation that the code can be adapted to the problem at hand.

The book is organized so that it starts with simpler tasks, such as word tokenization, and then uses these as building blocks for more sophisticated tasks.
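To give a flavour of what “building on simple blocks” means in practice, here is a minimal sketch of my own (not a recipe from the book). It assumes only the Python standard library, so the regex-based `tokenize` helper is a crude, hypothetical stand-in for NLTK's real tokenizers, and `top_words` is an equally hypothetical second step layered on top of it.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens.

    Assumption: a simple regex is good enough for illustration;
    a real tokenizer (such as those in NLTK) handles far more cases.
    """
    return re.findall(r"[a-z0-9']+", text.lower())


def top_words(text, n=3):
    """Build on the tokenizer: count tokens and return the n most frequent."""
    return Counter(tokenize(text)).most_common(n)


sample = "The cat sat on the mat. The mat was flat."
print(tokenize(sample))
print(top_words(sample, 2))
```

The point is the layering rather than the code itself: once tokenization is a reusable function, later steps such as counting, tagging, or classification can be written on top of it.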

The chapters are:

  1. Tokenizing Text and WordNet Basics. Splitting words, looking up words, finding collocations.
  2. Replacing and Correcting Words. Stemming, lemmatizing, correcting, translating words.
  3. Creating Custom Corpora.
  4. Part-of-Speech Tagging.
  5. Extracting Chunks.
  6. Transforming Chunks and Trees. Transforming phrases, singularizing, flattening trees.
  7. Text Classification.
  8. Distributed Processing and Handling Large Datasets. Using execnet and Redis.
  9. Parsing Specific Data. Parsing dates, timezones, URLs; cleaning HTML; detecting character encodings.

The book also includes an appendix listing Penn Treebank Part-of-Speech tags, and an index.

Compared to “Natural Language Processing with Python”, this book chooses a more practical set of topics. The more theoretical aspects of NLTK, such as logic and reasoning, are not covered. Instead, coverage is given to topics such as tokenizing, simple parse tree manipulations, and using parallel processing for practical applications.

In summary, this is an excellent book for someone who is looking to perform sophisticated text processing with Python, and/or use NLTK in practical applications. I recommend reading it alongside “Natural Language Processing with Python”.
