Segmenting Words and Sentences

Even simple NLP tasks such as tokenizing words and segmenting sentences can have their complexities. Punctuation characters could be used to segment sentences, but this requires the punctuation marks to be treated as separate tokens. This would result in abbreviations being split into separate words and sentences.

This post uses a classification approach to create a parser that returns lists of sentences of tokenized words and punctuation.

Splitting text into words and sentences seems like it should be the simplest NLP task. It probably is, but there are still a number of potential problems. For example, a simple approach could use space characters to divide words. Punctuation (full stop, question mark, exclamation mark) could be used to divide sentences. This quickly runs into problems when an abbreviation is processed. “etc.” would be interpreted as a sentence terminator, and “U.N.E.S.C.O.” would be interpreted as six individual sentences, when both should be treated as single word tokens. How should hyphens be interpreted? What about speech marks and apostrophes?
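
To make the problem concrete, here is a minimal sketch of the naive approach just described: sentences end at a full stop, question mark, or exclamation mark, and words are divided by white space. The function name naive_sentences and the sample text are made up for illustration and are not part of the parser built later in this post.

```python
import re

def naive_sentences(text):
    """Deliberately naive: split on '.', '?' or '!', then on white space."""
    # Keep each terminator attached to the text that precedes it
    pieces = re.findall(r'[^.?!]+[.?!]?', text)
    # Split every piece into words on white space; drop empty pieces
    return [p.split() for p in pieces if p.strip()]

sample = "The U.N. sent food, water, etc. to the region. It helped."
for sent in naive_sentences(sample):
    print(sent)
```

A reader sees two sentences in the sample, but this sketch reports five, because “U.N.” and “etc.” are each treated as sentence endings. Commas and full stops also stay glued to the neighbouring words.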

Many of these questions will depend on your specific application: for example, sometimes it is appropriate to treat speech as a sentence fragment, and sometimes it should be treated as a complete sentence. Here, I shall treat punctuation marks as their own individual word tokens because I find this is usually the most appropriate.

The approach below is based on the Supervised Classification: Sentence Segmentation example provided by Bird et al. in Natural Language Processing with Python (Chapter 6.2, pp 233-4). Note that the first edition (print) has a number of typographic errors in the code sample, so be sure to check the confirmed errata. This example uses a simple Naive Bayes classifier to classify a sequence of word tokens into sentences. The classifier is trained using the Treebank Corpus supplied with NLTK.

The primary problem with this approach is that it cannot distinguish between punctuation “.” characters and end of sentence markers (full stops). The solution is to treat white space as a special “space separator” token. After classification, a repeated sequence of words and full stop tokens without space separators can be easily collapsed into one abbreviated word token.

To perform this, we need a word tokenizer that splits text into words, punctuation, and space tokens. Fortunately, this is not too complicated. NLTK comes with its own word and punctuation tokenizer, WordPunctTokenizer. This uses the NLTK RegexpTokenizer (defined in nltk.tokenize.regexp). Our tokenizer is based on this but uses its own regex to also treat white space sequences as tokens:

import nltk
import string

# Tokenize text into words, punctuation, and whitespace tokens
class ModifiedWPTokenizer(nltk.tokenize.RegexpTokenizer):
    def __init__(self):
        nltk.tokenize.RegexpTokenizer.__init__(self, r'\w+|[^\w\s]|\s+')

Note that multiple successive white space characters are treated as one token.
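
As a quick check of what this pattern produces, the sketch below applies the same regex with Python's re.findall. For a pattern without groups this should give the same tokens as the tokenizer, so NLTK is not needed to try it. The sample string is made up for illustration.

```python
import re

# Same pattern as ModifiedWPTokenizer: a word, a single punctuation
# character, or a run of white space
PATTERN = r'\w+|[^\w\s]|\s+'

sample = "Costs rose,  e.g. in the U.S.\nWhy?"
tokens = re.findall(PATTERN, sample)
print(tokens)

# Every character belongs to exactly one token, so nothing is lost
print(''.join(tokens) == sample)
```

The two spaces after the comma come back as one token, and the abbreviations are split into letters and full stops with no white space tokens between them. That missing white space is what later allows them to be stitched back together.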

Here is the classifier and tokenizer:

# Sentence Classification Class
# Based on O'Reilly, pp234 but also uses whitespace information
class SentenceTokenizer():

    # extract punctuation features from word list for position i
    # Features are: this word; previous word (lower case);
    # is the next word capitalized?; previous word only one char long?
    def punct_features(self, tokens, i):
        return {'next-word-capitalized': (i<len(tokens)-1) and tokens[i+1][0].isupper(),
                'prevword': tokens[i-1].lower(),
                'punct': tokens[i],
                'prev-word-is-one-char': len(tokens[i-1]) == 1}

    # Same as punct_features, but works with a list of
    # (word,bool) tuples for the tokens. Word is used as above, but the bool
    # flag (whitespace separator?) is ignored
    # This allows the same features to be extracted from tuples instead of
    # words
    def punct_features2(self,tokens, i):
        return {'next-word-capitalized': (i<len(tokens)-1) and tokens[i+1][0][0].isupper(),
                'prevword': tokens[i-1][0].lower(),
                'punct': tokens[i][0],
                'prev-word-is-one-char': len(tokens[i-1][0]) == 1}

    # The constructor builds a classifier using treebank training data
    # Naive Bayes is used for fast training
    # The entire dataset is used for training
    def __init__(self):
        self.tokenizer = ModifiedWPTokenizer()

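        training_sents = nltk.corpus.treebank_raw.sents()
        tokens = []
        boundaries = set()
        offset = 0
        for sent in training_sents:
            # flatten the sentences into one token list, recording the
            # index of the last token of each sentence as a boundary
            tokens.extend(sent)
            offset += len(sent)
            boundaries.add(offset-1)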

        # Create training features
        featuresets = [(self.punct_features(tokens,i), (i in boundaries))
                       for i in range(1, len(tokens)-1)
                       if tokens[i] in '.?!']

        train_set = featuresets
        self.classifier = nltk.NaiveBayesClassifier.train(train_set)

    # Use the classifier to segment word tokens into sentences
    # words is a list of (word,bool) tuples
    def classify_segment_sentences(self,words):
        start = 0
        sents = []
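        for i, word in enumerate(words):
            if word[0] in '.?!' and self.classifier.classify(self.punct_features2(words,i)) == True:
                # classified as a sentence boundary: close the current sentence
                sents.append(words[start:i+1])
                start = i+1
        if start < len(words):
            # keep any trailing tokens as a final sentence
            sents.append(words[start:])
        return sents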

    # Segment text into sentences and words
    # returns a list of sentences, each sentence is a list of words
    # punctuation chars are classed as word tokens (except abbreviations)
    def segment_text(self,full_text):

        # Split (tokenize) text into words. Count whitespace as
        # words. Keeping this information allows us to distinguish between
        # abbreviations and sentence terminators
        text_words_sp = self.tokenizer.tokenize(full_text)

        # Take tokenized words+spaces and create tuples of (token,bool)
        # with the bool entry indicating if the token is followed by
        # whitespace. The whitespace tokens themselves are then skipped
        # (a leading whitespace token is kept as a single sp char)
        word_tuples = []
        i =0
        while (i<len(text_words_sp)):
            word = text_words_sp[i]
            if (word.isspace()):
                word = " "    # convert all whitespace to a single sp char
            if (i == len(text_words_sp)-1):
                # last token: nothing follows it
                word_tuples.append( (word,False) )
            else:
                word2 = text_words_sp[i+1]
                if (word2.isspace()):
                    # followed by whitespace: set the flag, skip the space token
                    i = i +1
                    word_tuples.append( (word,True) )
                else:
                    word_tuples.append( (word,False) )
            i = i +1

        # Create list of sentences using the classifier
        sentences = []
        for sent in self.classify_segment_sentences(word_tuples):
            # sent holds the next sentence list of tokens
            # this is actually a list of (token,bool) tuples as above
            sentence = []
            i = 0
            tok = ""
            # loop over each token tuple, using separator boolean
            # to collapse abbreviations into single word tokens
            for i,tup in enumerate(sent):
                if (tup[0][0] in string.punctuation and not tup[0][0] in '.?!'):
                    # punctuation that should be kept as a single token
                    if (len(tok) > 0):
                        sentence.append(tok)
                        tok = ""
                    sentence.append(tup[0])
                elif (tup[1]):
                    # space character - finish a word token
                    sentence.append( tok+tup[0] )
                    tok = ""
                elif (i == len(sent)-2):
                    # penultimate end of the sentence - break off the punctuation
                    sentence.append( tok+tup[0] )
                    tok = ""
                else:
                    # no space => accumulate a token in tok
                    tok = tok + tup[0]
            # Add this token to the current sentence
            if len(tok) > 0:
                sentence.append(tok)
            # The sentence has been processed => save it
            sentences.append(sentence)
        # return the resulting list of sentences
        return sentences

The comments should explain most of what is happening here. The SentenceTokenizer class trains a new classifier in its constructor. This is a Naive Bayes classifier, so it is quick to train; but the complete trained classifier could be pickled for re-use instead of being created at the beginning of each application.
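
One possible way to do the pickling is sketched below. It assumes that a SentenceTokenizer instance (tokenizer plus trained classifier) pickles cleanly, which I have not verified here. The helper name load_or_train and the file name are hypothetical.

```python
import os
import pickle

def load_or_train(path, train):
    """Return the object cached at `path`, or call train() and cache it."""
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)
    obj = train()
    with open(path, 'wb') as f:
        pickle.dump(obj, f)
    return obj

# Hypothetical usage: train on the first run only, reload afterwards
# myTokenizer = load_or_train('sentence_tokenizer.pickle', SentenceTokenizer)
```

The class definitions must be importable when the pickle is loaded, and a cached file should be deleted if the training code or features change.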

At classification time, the white space tokens are folded into the tokens that precede them: each token becomes a (word,bool) tuple whose boolean flag answers “Is this token followed by white space?”. These tuples pass through the classifier, and the flags are used in the final processing to collapse abbreviations into single word tokens.
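
As a small worked example, the sketch below builds the tuples for a fragment containing “i.e.” and extracts the features for its final full stop. Here to_tuples is a hypothetical, condensed restatement of the loop in segment_text, and punct_features2 is copied from the class as a standalone function. No classifier is involved.

```python
def to_tuples(tokens):
    """Pair each token with a 'followed by white space?' flag."""
    tuples = []
    i = 0
    while i < len(tokens):
        word = " " if tokens[i].isspace() else tokens[i]
        followed_by_space = i + 1 < len(tokens) and tokens[i + 1].isspace()
        tuples.append((word, followed_by_space))
        # Skip the white space token that was folded into the flag
        i += 2 if followed_by_space else 1
    return tuples

def punct_features2(tokens, i):
    return {'next-word-capitalized': (i < len(tokens)-1) and tokens[i+1][0][0].isupper(),
            'prevword': tokens[i-1][0].lower(),
            'punct': tokens[i][0],
            'prev-word-is-one-char': len(tokens[i-1][0]) == 1}

# Tokens as the modified tokenizer would produce them for "parsed, i.e. annotated"
tokens = ['parsed', ',', ' ', 'i', '.', 'e', '.', ' ', 'annotated']
tuples = to_tuples(tokens)
print(tuples)
print(punct_features2(tuples, 5))   # features for the final '.' of "i.e."
```

Only the last full stop of “i.e.” carries a True flag. The tokens before it ('i', '.', 'e') are flagged False, so they accumulate into one word token during the final pass, provided the classifier does not mark either full stop as a sentence boundary.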

Each sentence is returned as a list of string word tokens. The final results are returned as a list of sentences.

Here is an example of the segmenter’s use:

import os
import string

# Import the SentenceTokenizer (defined in the word_parser module)
from word_parser import SentenceTokenizer

# Read the text as one big string
print("Reading text...")

# Unix/Linux path to some sample plain text
#f = open("/home/richard/nltk/wiki_tree.txt","r")
# Windows Path
f = open(r"D:\nltk\wiki_tree.txt")

lines_text = f.readlines()
f.close()
full_text = ""
for s in lines_text:
    full_text = full_text + " " + s

# Creating tokenizer
print("Creating tokenizer...")
myTokenizer = SentenceTokenizer()

print("Segmenting text into words and sentences...")
sentences = myTokenizer.segment_text(full_text)

print("Segmented sentences:")
for sentence in sentences:
    print(sentence)

The sample text is a few paragraphs of the Wikipedia article for “Treebank” as plain text. Here are the results (only the first few sentences are shown for brevity):

Reading text...
Creating tokenizer...
Segmenting text into words and sentences...
Segmented sentences:
[' A', 'treebank', 'or', 'parsed', 'corpus', 'is', 'a', 'text', 'corpus', 'in', 'which', 'each', 'sentence', 'has', 'been', 'parsed', ',', 'i.e.', 'annotated', 'with', 'syntactic', 'structure', '.']
['Syntactic', 'structure', 'is', 'commonly', 'represented', 'as', 'a', 'tree', 'structure', ',', 'hence', 'the', 'name', 'Treebank', '.']
['The', 'term', 'Parsed', 'Corpus', 'is', 'often', 'used', 'interchangeably', 'with', 'Treebank', ':', 'with', 'the', 'emphasis', 'on', 'the', 'primacy', 'of', 'sentences', 'rather', 'than', 'trees', '.']
['Treebanks', 'are', 'often', 'created', 'on', 'top', 'of', 'a', 'corpus', 'that', 'has', 'already', 'been', 'annotated', 'with', 'part', '-', 'of', '-', 'speech', 'tags', '.']
['In', 'turn', ',', 'treebanks', 'are', 'sometimes', 'enhanced', 'with', 'semantic', 'or', 'other', 'linguistic', 'information', '.']
['Treebanks', 'can', 'be', 'created', 'completely', 'manually', ',', 'where', 'linguists', 'annotate', 'each', 'sentence', 'with', 'syntactic', 'structure', ',', 'or', 'semi', '-', 'automatically', ',', 'where', 'a', 'parser', 'assigns', 'some', 'syntactic', 'structure', 'which', 'linguists', 'then', 'check', 'and', ',', 'if', 'necessary', ',', 'correct', '.']
['In', 'practice', ',', 'fully', 'checking', 'and', 'completing', 'the', 'parsing', 'of', 'natural', 'language', 'corpora', 'is', 'a', 'labour', 'intensive', 'project', 'that', 'can', 'take', 'teams', 'of', 'graduate', 'linguists', 'several', 'years', '.']

There we have it – text that has been tokenized and segmented ready for further processing such as classification, collocation detection, or POS tagging.
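
As a rough illustration of how the nested list output can feed a later step, the sketch below counts adjacent token pairs, a naive first step towards collocation detection. The function name bigram_counts is hypothetical, and the two short sentences are made-up stand-ins for real segmenter output.

```python
from collections import Counter

def bigram_counts(sentences):
    """Count adjacent token pairs, never pairing across a sentence boundary."""
    counts = Counter()
    for sentence in sentences:
        counts.update(zip(sentence, sentence[1:]))
    return counts

# Illustrative input in the same shape as the segmenter's output
sentences = [
    ['Treebanks', 'are', 'often', 'created', '.'],
    ['Treebanks', 'are', 'sometimes', 'enhanced', '.'],
]
counts = bigram_counts(sentences)
print(counts.most_common(1))
```

Because the text is already split into sentences, a pair such as ('.', 'Treebanks') is never counted.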

One comment on “Segmenting Words and Sentences”

  1. Richard Marsden Jun 13, 2012 11:41 am

    An addendum that correctly handles multiple contiguous punctuation characters, such as “).”, has been written: Handling multiple punctuation characters.
