The NLP Stack

Processing natural language is a complicated business. Not that long ago, it seemed to many people to be an intractable problem. Although full understanding is still a cutting-edge research problem, large areas of natural language processing have become practical on most computing platforms. This is especially true when NLP techniques are applied to restricted application domains (e.g. parsing travel web sites).

This article looks at the main components of an NLP system and some of the applications which are practical today.

A typical NLP stack will use many of the following stages:

  • Sentence Segmentation
  • Word Tokenization
  • Part of Speech (POS) Tagging
  • Grammar Parsing
    • Chunking
    • Chinking
    • Named Entity Recognition
  • Understanding
    • Relation understanding
    • Fact extraction

These stages are not necessarily limited to being executed in order. For example, an advanced system may use the results of grammar parsing and chunking to improve the POS tagging.

Sentence Segmentation and Word Tokenization

These are the simplest stages and are often performed together. Many applications may not require sentence segmentation, but this is required for chunking and grammar parsing.

At first, both stages seem almost trivial: split sentences according to full stops (periods), exclamation marks, and question marks; split words according to spaces. Unfortunately, such a system quickly runs into problems. What about punctuation characters: should these be treated as separate tokens? If the text is split according to full stops, what about abbreviations? How should hyphenated words be handled? Is quoted speech classed as a new sentence or a special nested sentence? To further complicate matters, a tokenizer should ideally be able to handle poorly spaced text that a human could easily read.
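As a minimal sketch of the difference, compare a naive split on full stops with a trained tokenizer. The example below uses NLTK (one library choice among several), and assumes its “punkt” tokenizer data has been downloaded:

    import nltk  # assumes nltk is installed and its 'punkt' data downloaded

    text = "Dr. Smith arrived at the lab. He hadn't eaten."

    # Naive approach: split sentences on full stops, words on spaces.
    print(text.split(". "))
    # ['Dr', 'Smith arrived at the lab', "He hadn't eaten."] -- broken by "Dr."

    # A trained tokenizer knows about common abbreviations and treats
    # punctuation as separate tokens.
    print(nltk.sent_tokenize(text))
    # ['Dr. Smith arrived at the lab.', "He hadn't eaten."]
    print(nltk.word_tokenize("He hadn't eaten."))
    # ['He', 'had', "n't", 'eaten', '.']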

Part of Speech (POS) Tagging

This is the process of tagging each word with its “part of speech”, e.g. noun, verb, adjective. A simple dictionary look-up performs poorly because a large number of words can have multiple tags according to their usage. A simple example is the word “feed”, which could be a noun (“animal feed”) or a verb (“feed the animals”).
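A minimal sketch of the failure mode, using a deliberately tiny, invented lexicon:

    # Toy lookup tagger: exactly one tag per word, so ambiguity is lost.
    LEXICON = {"the": "DT", "animals": "NNS", "feed": "NN"}  # invented lexicon

    def lookup_tag(words):
        return [(w, LEXICON.get(w.lower(), "UNK")) for w in words]

    print(lookup_tag(["feed", "the", "animals"]))
    # [('feed', 'NN'), ('the', 'DT'), ('animals', 'NNS')]
    # Wrong: in this sentence "feed" is a verb (VB), but the lookup can
    # only ever return the single tag stored against it.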

Most POS taggers use a trained classifier. There are various approaches. The simplest is to base the classifier on bigrams, i.e. word pairs (tuples). This works because a word’s part of speech can often be determined by the preceding word (e.g. “animal” in “animal feed”). Such a scheme can be improved by expanding to trigrams (three-word tuples), but this requires more memory and training data for a limited improvement in performance. These diminishing returns mean larger tuples (e.g. 4-grams) are usually impractical.

A more sophisticated tagger can be created using a training model that uses pre-determined features (e.g. the current word, preceding word, and POS tag) and a probability model such as naive Bayes or maximum entropy. Further improvements can be realized by implementing a “back off” tagger. This tries the best-performing tagger first. If this cannot determine the tag, it “backs off” to a different tagger. Such a system might start with a maximum entropy classifier, but back off to a bigram tagger, and finally a dictionary lookup if necessary.
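NLTK’s sequential taggers make such a chain easy to sketch. The example below backs a bigram tagger off to a unigram tagger and then to a default tag (rather than the maximum entropy starting point described above), and assumes the Penn Treebank sample corpus has been downloaded:

    import nltk
    from nltk.corpus import treebank  # assumes 'treebank' data downloaded

    train = treebank.tagged_sents()[:3000]

    # Back-off chain: bigram -> unigram -> default.
    t0 = nltk.DefaultTagger("NN")               # last resort: guess "noun"
    t1 = nltk.UnigramTagger(train, backoff=t0)  # per-word statistics
    t2 = nltk.BigramTagger(train, backoff=t1)   # uses the preceding word

    print(t2.tag(nltk.word_tokenize("The cat sat on the mat")))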

Grammar Parsing

This stage converts the tagged words into a sentence structure. Full grammar parsing will produce a complete grammar tree, e.g.

Sample grammar tree for “The cat sat on the mat”
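The bracketed structure behind such a tree can be written down and printed directly. A minimal sketch using NLTK’s Tree class, with Penn Treebank-style tags:

    from nltk import Tree

    tree = Tree.fromstring(
        "(S (NP (DT The) (NN cat))"
        "   (VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))))"
    )
    tree.pretty_print()  # renders the tree as ASCII art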

This is actually a very difficult process due to the informal rules of grammar present in natural language, and ambiguity is commonplace. Some impressive results can be realized with a subset of grammar rules, but this is difficult to generalize. Instead, most applications attempt partial parsing using processes such as chunking and named entity recognition.

Chunking is the process of finding natural word groups (‘chunks’) such as noun phrases and verb phrases. The rules for finding these phrases are simpler, but by limiting the problem it becomes practical to use a classifier to chunk sentences.

Chinking is a top-down approach to chunking. Instead of adding words together to form a chunk, superfluous words (‘chinks’) are removed from existing phrases to form useful chunks.
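Both ideas can be expressed as patterns over POS tags. A minimal sketch using NLTK’s RegexpParser, where {...} chunks and }...{ chinks (the grammar below is the classic chink-based noun phrase example):

    import nltk

    # A sentence that has already been tokenized and POS tagged.
    sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
                ("dog", "NN"), ("barked", "VBD"), ("at", "IN"),
                ("the", "DT"), ("cat", "NN")]

    grammar = r"""
      NP:
        {<.*>+}        # chunk everything into one big NP...
        }<VBD|IN>+{    # ...then chink out verbs and prepositions
    """
    print(nltk.RegexpParser(grammar).parse(sentence))
    # (S (NP the/DT little/JJ yellow/JJ dog/NN) barked/VBD at/IN
    #    (NP the/DT cat/NN))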

Named entity recognition attempts to identify named objects and places within the sentence. These may include things like people, locations, and organizations; and recognition may include further metadata (e.g. what type of organization, or a location’s map coordinates). Named entity recognizers often use classifiers, but they can also incorporate gazetteers and other lists of names.
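NLTK bundles a pre-trained recognizer that can be sketched in a few lines (assuming its ‘maxent_ne_chunker’ and ‘words’ data have been downloaded):

    import nltk

    tokens = nltk.word_tokenize("Ada Lovelace worked with Charles Babbage in London.")
    tree = nltk.ne_chunk(nltk.pos_tag(tokens))
    print(tree)
    # Named entities appear as subtrees labelled PERSON, GPE, ORGANIZATION, etc.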

Understanding

I shall leave it to the philosophers to argue about what “understanding” really entails. In the practical world, a computer “understands” some text if it can extract information and facts from it, and then solve problems or answer questions based on that information.

There is some overlap between understanding and grammar parsing. A grammatically parsed sentence is well on the way to being “understood”. By understanding some of the text, it is possible to resolve some grammatical ambiguities.

Understanding is mainly a research-level problem, and can involve both propositional and first-order logic.
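NLTK includes a small first-order logic toolkit that gives a flavour of the end goal: once facts and rules have been extracted, a theorem prover can answer questions about them. A minimal sketch with hand-written (rather than extracted) logic:

    from nltk.sem import Expression
    from nltk.inference import ResolutionProver

    read = Expression.fromstring
    assumptions = [read("all x.(cat(x) -> animal(x))"),  # every cat is an animal
                   read("cat(felix)")]                   # Felix is a cat
    goal = read("animal(felix)")                         # is Felix an animal?

    print(ResolutionProver().prove(goal, assumptions))   # True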

 

Applications

Many practical NLP applications apply different techniques, but they use many of the above stages to prepare the data for manipulation. Here are some examples:

Text Classification

Text classification is probably the most visible NLP technique in use today. Almost every email client has a trained spam filter that classifies email according to the email’s text and other properties. Sentiment analysis is a similar problem, although the classification is more subtle and will often have to deal with parsing sentence structure, e.g. “Film X makes film Y look like an excellent movie” has negative sentiment despite using a strongly positive word.

All text classifiers require a word tokenization stage to extract words for use in the classifier. More sophisticated classifiers may use other information, such as POS tags and chunking, to extract noun or verb phrases for classification. Finally, full grammar analysis could be used to parse negative comparisons and double negatives.
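A minimal bag-of-words spam classifier can be sketched with NLTK’s naive Bayes implementation; the four training examples here are invented for illustration:

    import nltk

    def features(text):
        # Bag-of-words features built from the word tokenization stage.
        return {word.lower(): True for word in nltk.word_tokenize(text)}

    train = [(features("win a free prize now"), "spam"),
             (features("meeting moved to friday"), "ham"),
             (features("free offer click now"), "spam"),
             (features("lunch on friday?"), "ham")]

    classifier = nltk.NaiveBayesClassifier.train(train)
    print(classifier.classify(features("claim your free prize")))  # 'spam'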

Sentence Manipulation

NLP can be used to manipulate sentences, for example to add/remove plurals; manipulate pronouns; or to change verb tense. This kind of application will require most of the stages. The text has to be segmented, tokenized, tagged, and parsed into a grammar tree. Manipulation is then performed at the grammar tree level, before being converted back into text.
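A deliberately naive sketch of the tag-then-rewrite idea, handling regular plurals only (a real system would work on the full grammar tree and consult a lexicon for irregular forms):

    import nltk

    def pluralize_nouns(text):
        # Tag the words, rewrite singular nouns (NN) with a crude
        # regular plural, and rejoin the result.
        out = []
        for word, tag in nltk.pos_tag(nltk.word_tokenize(text)):
            if tag == "NN":
                word += "es" if word.endswith(("s", "x", "ch", "sh")) else "s"
            out.append(word)
        return " ".join(out)

    print(pluralize_nouns("The cat sat on the mat"))
    # "The cats sat on the mats"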

Language Translation

A similar approach could be taken for language translation. A simple word-for-word replacement will result in incorrect translations and very poor grammar. By parsing the text into a grammar tree, an attempt can be made to translate the grammar, and to correctly translate individual words according to their POS tags.
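A toy example shows the failure directly; the three-word English-to-French lexicon here is invented for illustration:

    # Word-for-word lookup with no grammar knowledge.
    LEXICON = {"the": "le", "black": "noir", "cat": "chat"}

    def word_for_word(words):
        return " ".join(LEXICON.get(w.lower(), w) for w in words)

    print(word_for_word(["the", "black", "cat"]))
    # "le noir chat" -- wrong word order: French puts most adjectives
    # after the noun ("le chat noir"), and gender agreement is ignored.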

Although such translation systems are available, their performance continues to be relatively poor. True, accurate language translation remains a major area of research.

 

Author’s Note: This article has been written to provide context to future articles. It is likely to be updated at a future date.
