“Word Sense Disambiguation: The Case for Combinations of Knowledge Sources” by Mark Stevenson describes the author’s six-year research project into Word Sense Disambiguation, which started with his PhD in 1995. The book includes a brief literature review of previous attempts at Word Sense Disambiguation before building a framework that combines multiple models and filters to produce results of 90-95% accuracy.
This book outlines the author’s research project into Word Sense Disambiguation. It starts with chapters that explain why disambiguation is a problem, followed by a historical literature survey that describes the various approaches. This survey is very useful in itself, and provides much more background than the more popular NLTK books or individual research papers. These background chapters finish with a discussion of meaning, lexicons, synonyms, and the use of electronic dictionaries.
The rest of the book is devoted to the proposed framework. This is based around a series of “Disambiguation Modules” which come in three main types:
- Filters. These provide a strong indication of which senses are not appropriate for a particular context. For example, a filter could use part of speech tags to remove senses that do not match the word’s tag.
- Partial Taggers. These have some disambiguation ability, but they are not accurate enough to be relied upon.
- Feature Extractors. These modules are more auxiliary, and extract features that can be used to define the word’s context. For example, an extractor could determine the text’s topic.
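
To make the three module types concrete, here is a minimal Python sketch. It is my own illustration rather than code from the book: the sense inventory, the cue words, the topic lists, and every function name are hypothetical, and a real system would draw them from a lexicon or corpus.

```python
from typing import Dict, Sequence, Set

# Hypothetical toy sense inventory for "bank": sense id -> part of speech.
SENSES: Dict[str, str] = {
    "bank_river": "NOUN",
    "bank_finance": "NOUN",
    "bank_tilt": "VERB",
}

# Hypothetical cue words associated with each sense.
CUES: Dict[str, Set[str]] = {
    "bank_river": {"water", "river", "fishing"},
    "bank_finance": {"money", "loan", "account"},
    "bank_tilt": {"plane", "turn"},
}

# Hypothetical topic vocabularies.
TOPIC_WORDS: Dict[str, Set[str]] = {
    "finance": {"money", "loan", "interest"},
    "nature": {"river", "water", "tree"},
}


def pos_filter(candidates: Set[str], pos_tag: str) -> Set[str]:
    """Filter: discard senses whose part of speech does not match the tag."""
    return {s for s in candidates if SENSES[s] == pos_tag}


def cue_word_tagger(candidates: Set[str], context: Sequence[str]) -> Dict[str, int]:
    """Partial tagger: score each remaining sense by cue-word overlap."""
    words = set(context)
    return {s: len(CUES[s] & words) for s in candidates}


def topic_extractor(context: Sequence[str]) -> str:
    """Feature extractor: guess the topic, or 'unknown' if nothing matches."""
    words = set(context)
    best = max(TOPIC_WORDS, key=lambda t: len(TOPIC_WORDS[t] & words))
    return best if TOPIC_WORDS[best] & words else "unknown"


context = "she opened an account at the bank to get a loan".split()
remaining = pos_filter(set(SENSES), "NOUN")   # the verb sense is removed
scores = cue_word_tagger(remaining, context)  # weak evidence per sense
topic = topic_extractor(context)              # context feature
```

In this sketch the filter only narrows the candidates, the partial tagger only scores them, and the extractor only describes the context. None of them makes the final decision, which is left to whatever combines their outputs.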
A Word Sense Disambiguation implementation would typically use several of these modules, each drawing on different data sources and/or approaches. Stevenson readily admits that he is not the first person to think of combining multiple algorithms to produce a better, aggregate result. Even the various NLTK “back-off” algorithms implement a simple form of this.
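
The back-off idea can be sketched in a few lines of plain Python. This is an illustration of the general pattern only, not NLTK’s API and not Stevenson’s method; the two taggers and their rules are hypothetical.

```python
from typing import Callable, List, Optional, Sequence

# A tagger returns a sense id, or None when it cannot decide.
Tagger = Callable[[Sequence[str]], Optional[str]]


def back_off(taggers: List[Tagger], context: Sequence[str], default: str) -> str:
    """Try each tagger in order and return the first answer that is not None."""
    for tagger in taggers:
        sense = tagger(context)
        if sense is not None:
            return sense
    return default


def collocation_tagger(context: Sequence[str]) -> Optional[str]:
    # Hypothetical high-precision rule: only fires when "river" is present.
    return "bank_river" if "river" in context else None


def topic_tagger(context: Sequence[str]) -> Optional[str]:
    # Hypothetical lower-precision rule based on finance vocabulary.
    return "bank_finance" if {"loan", "money"} & set(context) else None


chain = [collocation_tagger, topic_tagger]
first = back_off(chain, ["river", "bank"], default="bank_finance")
second = back_off(chain, "a loan from the bank".split(), default="bank_river")
third = back_off(chain, ["the", "bank"], default="bank_finance")
```

Only one tagger’s answer is ever used for a given word, which is why this counts as a simple form of combination: evidence from the later taggers is ignored whenever an earlier one is willing to answer.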
Rather than a simple back-off implementation or a weighted sum, Stevenson combines the various results using a TiMBL learning system. TiMBL (“Tilburg Memory-Based Learner”) is an open source memory-based learning system. It implements a number of algorithms, but is typically used for tree-based k-nearest neighbor classification. Further details can be found at https://languagemachines.github.io/timbl/ (the URL in the book’s bibliography is out of date).
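
To give a feel for memory-based combination, the sketch below implements a tiny k-nearest neighbor classifier over module outputs in plain Python. It does not use TiMBL and makes no claim to match its algorithms or the book’s feature set: the feature layout (part of speech, a partial tagger’s vote, a topic), the training instances, and the simple mismatch-count distance are all assumptions chosen for illustration.

```python
from collections import Counter
from typing import List, Sequence, Tuple

# One stored instance: (symbolic feature vector, correct sense).
Instance = Tuple[Sequence[str], str]


def overlap_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Count the feature positions at which the two vectors disagree."""
    return sum(1 for x, y in zip(a, b) if x != y)


def knn_classify(memory: List[Instance], query: Sequence[str], k: int = 3) -> str:
    """Return the majority sense among the k stored instances nearest the query."""
    neighbours = sorted(memory, key=lambda inst: overlap_distance(inst[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical training memory.
# Features: (part of speech, partial tagger's vote, extracted topic).
memory: List[Instance] = [
    (("NOUN", "bank_finance", "finance"), "bank_finance"),
    (("NOUN", "bank_finance", "unknown"), "bank_finance"),
    (("NOUN", "bank_river", "nature"), "bank_river"),
    (("NOUN", "bank_river", "unknown"), "bank_river"),
    (("VERB", "bank_tilt", "unknown"), "bank_tilt"),
]

# The modules disagree here: the tagger votes "river", the topic says "finance".
prediction = knn_classify(memory, ("NOUN", "bank_river", "finance"), k=3)
```

Unlike back-off, every module’s output contributes to the decision, and the learner works out from the stored examples which modules to trust when they conflict.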
After presenting the framework, Stevenson gives a number of examples and further implementation details for what would be a complete system. He then finishes with a discussion of sense-tagged corpora (finding and creating them), and an evaluation of the framework through a couple of experiments.
Overall the book is academic in tone when compared to the previous books that I have reviewed. This is echoed by the use of TeX and Computer Modern to typeset the book, plus the widespread use of technical acronyms. However, considering the academic nature of the book, I found it much more readable than other academic texts in the NLP and machine learning fields.
Michael Swaine described this book as a major step forward when it was published in 2003, and it does feel like it represented the state of the art. Modern practitioners should still find it a useful addition to their bookshelf.