Segmenting Words and Sentences

Even simple NLP tasks such as tokenizing words and segmenting sentences can have their complexities. Punctuation characters could be used to segment sentences, but this requires treating punctuation marks as separate tokens, which would split abbreviations into separate words and sentences.

This post uses a classification approach to create a parser that returns lists of sentences of tokenized words and punctuation.

Splitting text into words and sentences seems like it should be the simplest NLP task. Perhaps it is, but there are still a number of potential problems. For example, a simple approach could use space characters to divide words, and punctuation (full stop, question mark, exclamation mark) to divide sentences. This approach quickly runs into problems when an abbreviation is processed: “etc.” would be interpreted as a sentence terminator, and “U.N.E.S.C.O.” would be interpreted as six individual sentences, when both should be treated as single word tokens. How should hyphens be interpreted? What about speech marks and apostrophes?

Many of these questions will depend on your specific application: for example, sometimes it is appropriate to treat speech as a sentence fragment, and sometimes it should be treated as a complete sentence. Here, I shall treat punctuation marks as their own individual word tokens because I find this is usually the most appropriate.

The approach below is based on the Supervised Classification: Sentence Segmentation example provided by Bird et al. in Natural Language Processing with Python (Chapter 6.2, pp. 233–234). Note that the first (print) edition has a number of typographic errors in the code sample, so be sure to check the confirmed errata. The example uses a simple Naive Bayes classifier to classify a sequence of word tokens into sentences. The classifier is trained using the Treebank corpus supplied with NLTK.

The primary problem with this approach is that it cannot distinguish between “.” characters that are part of a word (as in an abbreviation) and genuine end-of-sentence full stops. The solution is to treat white space as a special “space separator” token. After classification, a run of word and full stop tokens with no intervening space separators can easily be collapsed into a single abbreviated word token.

To perform this, we need a word tokenizer that splits text into word, punctuation, and space tokens. Fortunately this is not too complicated. NLTK comes with its own word and punctuation tokenizer, WordPunctTokenizer, which uses the NLTK RegexpTokenizer (defined in nltk.tokenize.regexp). Our tokenizer is based on this, but uses its own regular expression that also treats white space sequences as tokens:
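The original listing is not reproduced here, but the key idea is a regular expression with three alternatives: runs of word characters, runs of other (punctuation) characters, and runs of white space. A minimal sketch using the standard re module (NLTK's RegexpTokenizer wraps this same kind of pattern; the post's actual subclass is not shown):

```python
import re

# Three alternatives: runs of word characters, runs of punctuation,
# and runs of white space -- so spaces survive as tokens of their own.
# (A sketch only; the post's actual tokenizer class is not shown.)
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+|\s+")

def tokenize(text):
    return TOKEN_PATTERN.findall(text)

print(tokenize("Dr. Who?  Yes."))
# -> ['Dr', '.', ' ', 'Who', '?', '  ', 'Yes', '.']
```

The same pattern can be passed to nltk.tokenize.RegexpTokenizer to stay within the NLTK tokenizer interface.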

Note that multiple successive white space characters are treated as one token.

Here is the classifier and tokenizer:
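The full listing is not reproduced here. Its core follows the book's published example; a condensed sketch of the boundary classifier (feature names taken from Bird et al., Chapter 6.2; the SentenceTokenizer wrapper and the final collapsing step are omitted) might look like this:

```python
def punct_features(tokens, i):
    """Features for deciding whether the punctuation token at position i
    ends a sentence (feature set from Bird et al., Chapter 6.2)."""
    return {"next-word-capitalized": tokens[i + 1][0].isupper(),
            "prev-word": tokens[i - 1].lower(),
            "punct": tokens[i],
            "prev-word-is-one-char": len(tokens[i - 1]) == 1}

def train_boundary_classifier():
    """Train a Naive Bayes sentence-boundary classifier on the Treebank
    corpus shipped with NLTK (requires the corpus data to be installed)."""
    # NLTK is only needed for training, so it is imported here;
    # the feature extractor above is dependency-free.
    import nltk
    sents = nltk.corpus.treebank_raw.sents()
    tokens, boundaries, offset = [], set(), 0
    for sent in sents:
        tokens.extend(sent)
        offset += len(sent)
        boundaries.add(offset - 1)  # index of each sentence-final token
    featuresets = [(punct_features(tokens, i), i in boundaries)
                   for i in range(1, len(tokens) - 1)
                   if tokens[i] in ".?!"]
    return nltk.NaiveBayesClassifier.train(featuresets)
```

Call train_boundary_classifier() once the NLTK Treebank data is installed; the resulting classifier answers, for each candidate punctuation token, whether it ends a sentence.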

The comments should explain most of what is happening here. The SentenceTokenizer class trains a new classifier in its constructor. This is a Naive Bayes classifier, so it is quick to train, but the complete trained classifier could be pickled for re-use instead of being created at the start of each application run.
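For instance, the trained classifier could be persisted with the standard pickle module (a sketch; the file name and function names here are my own, not the post's):

```python
import pickle

def save_classifier(classifier, path="segmenter.pickle"):
    # Persist the trained classifier so later runs can skip training.
    with open(path, "wb") as f:
        pickle.dump(classifier, f)

def load_classifier(path="segmenter.pickle"):
    # Restore a previously pickled classifier.
    with open(path, "rb") as f:
        return pickle.load(f)
```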

At classification time, white space tokens are collapsed into single space-character tokens. Tokens are also converted into (word, bool) tuples, where the boolean flag answers the question “Is this a space token?”. These pass through the classifier and are used in the final processing step to collapse abbreviations into single word tokens.
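A sketch of that final processing step, assuming each classified token is represented as a (text, is_space, ends_sentence) triple (this representation is my own, chosen for illustration, and not necessarily the post's):

```python
def build_sentences(annotated):
    """Rebuild sentences from classified tokens, gluing together
    word/"." runs that have no space separators between them,
    so "U" "." "N" "." becomes the single token "U.N."."""
    sentences, sentence = [], []
    prev_space = True
    for text, is_space, ends_sentence in annotated:
        if is_space:
            prev_space = True
            continue
        if text == "." and not ends_sentence and not prev_space and sentence:
            sentence[-1] += text            # abbreviation dot: glue onto word
        elif not prev_space and sentence and sentence[-1].endswith("."):
            sentence[-1] += text            # continue abbreviation: "U." + "N"
        else:
            sentence.append(text)           # ordinary word or punctuation token
        prev_space = False
        if ends_sentence:
            sentences.append(sentence)
            sentence = []
    if sentence:                            # flush any trailing sentence
        sentences.append(sentence)
    return sentences
```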

Each sentence is returned as a list of string word tokens. The final results are returned as a list of sentences.

Here is an example of the segmenter’s use:
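The original example listing is not shown here. As a self-contained illustration of the pipeline's overall shape, the sketch below substitutes a crude capitalization heuristic for the trained classifier (so, unlike the real classifier, it will misjudge some boundaries):

```python
import re

TOKEN_RE = re.compile(r"\w+|[^\w\s]+|\s+")  # words, punctuation, spaces

def segment(text):
    """Tokenize, guess sentence boundaries with a heuristic stand-in
    for the classifier, and collapse abbreviations."""
    tokens = TOKEN_RE.findall(text)
    sentences, sentence = [], []
    prev_space = True
    for i, tok in enumerate(tokens):
        if tok.isspace():
            prev_space = True
            continue
        following = [t for t in tokens[i + 1:] if not t.isspace()]
        # Heuristic boundary test: terminator punctuation followed by
        # white space (or end of text) and then a capitalized word.
        is_end = (tok in ".?!"
                  and (i + 1 == len(tokens) or tokens[i + 1].isspace())
                  and (not following or following[0][0].isupper()))
        if tok == "." and not is_end and not prev_space and sentence:
            sentence[-1] += tok             # abbreviation dot: glue onto word
        elif not prev_space and sentence and sentence[-1].endswith("."):
            sentence[-1] += tok             # continue abbreviation: "U." + "N"
        else:
            sentence.append(tok)
        prev_space = False
        if is_end:
            sentences.append(sentence)
            sentence = []
    if sentence:
        sentences.append(sentence)
    return sentences

print(segment("The U.N. met. How odd!"))
```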

The sample text is a few paragraphs of the Wikipedia article for “Treebank” as plain text. Here are the results (only the first few sentences are shown for brevity):

There we have it – text that has been tokenized and segmented ready for further processing such as classification, collocation detection, or POS tagging.

Update (Richard Marsden, Jun 13, 2012): An addendum that correctly handles multiple contiguous punctuation characters, such as “).”, has been written: Handling multiple punctuation characters.