Sentence Segmentation: Handling multiple punctuation characters

Previously, I showed you how to segment words and sentences while taking full stops (periods) and abbreviations into account. The problem with that implementation is that it is easily confused by contiguous punctuation characters. For example, “).” is not recognized as the end of a sentence. This article shows you how to correct this.

The problem lies in the word tokenization. The modified “words and whitespaces” regular expression correctly parses “).” as two tokens. However, the default NLTK regular expression tokenizer parses it as the single token “).”. Hence the classifier sees “).” as a valid sentence terminator during training, but the data to be classified only ever contains “)” followed by “.”. The solution is to modify the training tokenizer so that it parses contiguous punctuation characters as individual tokens. To complicate things, there are some valid multi-character punctuation tokens, such as “…”, “—”, and various mathematical operators (e.g. “<=”). You may also choose to parse “!!!” as one token which is then cleaned up as a single “!”. The following regular expressions correct this problem:

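import nltk
from nltk.tokenize import RegexpTokenizer

# Training tokenizer: words, runs of full stops, runs of -, <, > and =,
# then any other punctuation character as a token of its own
class ModifiedTrainingTokenizer(RegexpTokenizer):
    def __init__(self):
        RegexpTokenizer.__init__(self, r'\w+|\.+|[\-\<\>\=]+|[^\w\s]')

# Same pattern, but runs of whitespace are also kept as tokens
class ModifiedWPTokenizer(RegexpTokenizer):
    def __init__(self):
        RegexpTokenizer.__init__(self, r'\w+|\s+|\.+|[\-\<\>\=]+|[^\w\s]')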

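Note that the above expressions treat each character in a run of exclamation marks, or of most other punctuation marks, as an individual token. Only runs of full stops and runs of the characters “-”, “<”, “>” and “=” are kept together. The expressions are readily modified if you need different behaviour.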
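To make the difference concrete, here is a minimal sketch that compares the patterns on a sample string. It uses Python's re.findall as a stand-in for RegexpTokenizer so that it runs without NLTK, which is an assumption about equivalent behaviour for these simple patterns. DEFAULT_PATTERN is the pattern used by NLTK's WordPunctTokenizer. VARIANT_PATTERN and collapse_repeats() are hypothetical additions, not part of the code above, that show one way to keep “!!!” as a single token and then clean it up.

```python
import re

# Pattern used by NLTK's WordPunctTokenizer: runs of punctuation stay together.
DEFAULT_PATTERN = r'\w+|[^\w\s]+'
# Training pattern from ModifiedTrainingTokenizer.
MODIFIED_PATTERN = r'\w+|\.+|[\-\<\>\=]+|[^\w\s]'
# Hypothetical variant: additionally keep runs of "!" and "?" together.
VARIANT_PATTERN = r'\w+|\.+|[\-\<\>\=]+|[!?]+|[^\w\s]'


def tokenize(pattern, text):
    # re.findall stands in for RegexpTokenizer.tokenize (assumption).
    return re.findall(pattern, text)


def collapse_repeats(tokens):
    # Hypothetical clean-up: reduce a token such as "!!!" or "??" to one character.
    # Mixed runs such as "?!" are left untouched.
    cleaned = []
    for tok in tokens:
        if len(tok) > 1 and tok[0] in '!?' and len(set(tok)) == 1:
            cleaned.append(tok[0])
        else:
            cleaned.append(tok)
    return cleaned


sample = "(see above). Is x <= 3... Yes!!!"

default_tokens = tokenize(DEFAULT_PATTERN, sample)
modified_tokens = tokenize(MODIFIED_PATTERN, sample)
variant_tokens = collapse_repeats(tokenize(VARIANT_PATTERN, sample))

print(default_tokens)   # ")." appears as one token
print(modified_tokens)  # ")" and "." are separate, "<=" and "..." stay whole
print(variant_tokens)   # as above, but "!!!" becomes a single "!"
```

With the default pattern the full stop after the bracket never appears as a token of its own, which is exactly the mismatch between training data and input described above.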

ModifiedWPTokenizer is used as before, but ModifiedTrainingTokenizer is the new tokenizer which replaces the NLTK default. Replacing the default tokenizer requires some changes to the training code. Here is the modified constructor for the SentenceTokenizer() class:

    # The constructor builds a classifier using treebank training data
    def __init__(self):
        # Change training_path to point to your raw Penn Treebank (or other) data
        training_path = "/home/richard/nltk_data/corpora/treebank/raw/"
        self.tokenizer = ModifiedWPTokenizer()
        training_tok = ModifiedTrainingTokenizer()

        tokens = []
        boundaries = set()
        offset = 0
        reader = nltk.corpus.PlaintextCorpusReader(training_path, ".*", training_tok)
        for sent in reader.sents():
            # skip Penn Treebank prolog (".START" at the beginning of each file)
            bSkip = (len(sent) == 2 and sent[0] == "." and sent[1] == "START")
            if (not bSkip):
                # collect the tokens and record the index of the sentence-final token
                tokens.extend(sent)
                offset += len(sent)
                boundaries.add(offset-1)

        # Create training features
        featuresets = [(self.punct_features(tokens,i), (i in boundaries))
                       for i in range(1, len(tokens)-1)
                       if tokens[i] in '.?!']

        #size = int(len(featuresets)*0.1)
        #train_set, test_set = featuresets[size:], featuresets[:size]
        train_set = featuresets
        self.classifier = nltk.NaiveBayesClassifier.train(train_set)

Note that in order to override the default word tokenizer, we have to use the plain text corpus reader instead of the treebank raw corpus reader. Hence the “.START” file prolog has to be skipped manually.
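The bookkeeping in the constructor can be hard to follow without data, so here is a small, self-contained sketch of it on hand-made sentences. It assumes that each kept sentence's tokens are appended to the token list and that the index of its last token is recorded as a boundary. The names label_boundaries and toy_sents are hypothetical and used only for illustration, and the feature extraction step (punct_features) is left out.

```python
def label_boundaries(sents):
    """Flatten tokenized sentences and record the index of each final token."""
    tokens = []
    boundaries = set()
    offset = 0
    for sent in sents:
        # Skip the ".START" prolog, which tokenizes as "." followed by "START".
        if list(sent) == [".", "START"]:
            continue
        tokens.extend(sent)
        offset += len(sent)
        boundaries.add(offset - 1)  # index of the sentence-final token
    return tokens, boundaries


# Toy data standing in for reader.sents() (hypothetical, not treebank text).
toy_sents = [
    [".", "START"],
    ["Mr", ".", "Smith", "left", "."],
    ["He", "returned", "(", "later", ")", "."],
]

tokens, boundaries = label_boundaries(toy_sents)

# Same selection rule as the constructor: every ".", "?" or "!" token that is
# neither the first nor the last token becomes a labelled training candidate.
candidates = [(i, i in boundaries)
              for i in range(1, len(tokens) - 1)
              if tokens[i] in '.?!']

print(tokens)
print(sorted(boundaries))
print(candidates)
```

The full stop after “Mr” is labelled as a non-boundary and the one after “left” as a boundary, which is the distinction the classifier is trained to learn.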