Support for SciPy in NLTK’s Maximum Entropy methods

Recently I have been working with the Maximum Entropy classifiers in NLTK. Maximum entropy models are similar to the well-known Naive Bayes models, but they do not assume independence between the features – i.e. they are not “naive”.
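
To make the distinction concrete, here is a minimal sketch that trains both model families on the same data and compares their predicted probabilities. The toy dataset is invented for illustration; the features deliberately always co-occur, which is exactly the kind of dependence Naive Bayes cannot model:

import nltk

# Toy dataset (invented for illustration): features 'a' and 'b'
# always co-occur, so they are strongly dependent.
toy_train = [
    (dict(a=1, b=1), 'x'),
    (dict(a=1, b=1), 'x'),
    (dict(a=0, b=0), 'y'),
    (dict(a=0, b=0), 'y'),
    ]

nb = nltk.NaiveBayesClassifier.train(toy_train)
me = nltk.MaxentClassifier.train(toy_train, trace=0)

sample = dict(a=1, b=1)
print 'Naive Bayes p(x): %.2f' % nb.prob_classify(sample).prob('x')
print 'Maxent      p(x): %.2f' % me.prob_classify(sample).prob('x')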

SciPy has had problems with its Maximum Entropy code, and v0.8 must be used: v0.9 crashes the NLTK unit tests, and the Maximum Entropy package is being dropped from future versions entirely.
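
If there is any risk of the wrong SciPy being picked up, a simple guard at the top of the test script helps. This check is my own convention, not part of NLTK:

import scipy

# Insist on SciPy 0.8, the only version known to work with these tests.
if not scipy.__version__.startswith('0.8'):
    raise RuntimeError('SciPy 0.8 required, found %s' % scipy.__version__)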

Maximum entropy classifiers require an optimizer to find the set of parameters that best fits the model. NLTK’s maximum entropy classifier (nltk.MaxentClassifier) provides its own optimizers, implements several via SciPy, and also supports the third-party TADM and MEGAM programs. As an aside, I have found MEGAM to be effective, and I shall be covering it in a future posting.
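
The algorithm is selected by name as the second argument to train. A small sketch, using NLTK’s built-in GIS implementation (which needs no SciPy) on a stand-in dataset:

import nltk

# List the algorithm names NLTK knows about.
print nltk.classify.MaxentClassifier.ALGORITHMS

# 'mini_train' is a stand-in for any list of (featureset, label) pairs.
mini_train = [(dict(a=1, b=1), 'x'), (dict(a=0, b=0), 'y')]
classifier = nltk.MaxentClassifier.train(mini_train, 'GIS',
                                         trace=0, max_iter=10)
print classifier.classify(dict(a=1, b=0))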

The SciPy options include the LBFGSB, BFGS, and CG optimization methods. LBFGSB is a popular choice because it converges quickly on maximum entropy problems.
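
NLTK reaches these methods through SciPy’s maxentropy package, which in turn drives routines in scipy.optimize. Purely for intuition, here is a standalone call to the L-BFGS-B routine on a toy quadratic objective (not NLTK’s actual likelihood function):

import numpy
from scipy.optimize import fmin_l_bfgs_b

def objective(w):
    # Convex toy objective with its minimum at (1, 2).
    return (w[0] - 1.0)**2 + (w[1] - 2.0)**2

# approx_grad=True lets SciPy estimate the gradient numerically.
w_opt, f_min, info = fmin_l_bfgs_b(objective, numpy.zeros(2),
                                   approx_grad=True)
print 'minimum found at', w_opt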

The test code (below) is based on the NLTK unit tests. It runs correctly with SciPy v0.8 but crashes with v0.9. v0.10 has not been released yet, but the Maximum Entropy package has been deprecated and will be dropped from it. I have compared the maximum entropy code between v0.8 and v0.9, and the only differences are trivial things like exception objects. The crash must therefore be triggered by something at a lower level, although the root cause may well be a latent problem in the maximum entropy code itself.

These tests have been run with NumPy 1.6.1 and Python 2.7.

Here is the test code that I used:

# Classifier tester

import numpy
import scipy
import nltk

print 'NumPy Version: ', numpy.__version__
print 'SciPy Version: ', scipy.__version__
print 'NLTK Version: ', nltk.__version__

nltk.usage(nltk.ClassifierI)  # usage() prints its own report

# Training & test data

train = [
     (dict(a=1,b=1,c=1), 'y'),
     (dict(a=1,b=1,c=1), 'x'),
     (dict(a=1,b=1,c=0), 'y'),
     (dict(a=0,b=1,c=1), 'x'),
     (dict(a=0,b=1,c=1), 'y'),
     (dict(a=0,b=0,c=1), 'y'),
     (dict(a=0,b=1,c=0), 'x'),
     (dict(a=0,b=0,c=0), 'x'),
     (dict(a=0,b=1,c=1), 'y')
     ]
test = [
     dict(a=1,b=0,c=1), # unseen
     dict(a=1,b=0,c=0), # unseen
     dict(a=0,b=1,c=1), # seen 3 times, labels=y,y,x
     dict(a=0,b=1,c=0)  # seen 1 time, label=x
     ]

def test_maxent(algorithms):
    classifiers = {}
    # Iterate over the algorithms passed in by the caller.
    for algorithm in algorithms:
        if algorithm in ('MEGAM', 'TADM'):
            print '(skipping %s)' % algorithm
        else:
            try:
                classifiers[algorithm] = nltk.MaxentClassifier.train(
                    train, algorithm, trace=0, max_iter=1000)
            except Exception, e:
                classifiers[algorithm] = e
    print ' '*11+''.join(['      test[%s]  ' % i
                          for i in range(len(test))])
    print ' '*11+'     p(x)  p(y)'*len(test)
    print '-'*(11+15*len(test))
    for algorithm, classifier in classifiers.items():
        print '%11s' % algorithm,
        if isinstance(classifier, Exception):
            print 'Error: %r' % classifier; continue
        for featureset in test:
            pdist = classifier.prob_classify(featureset)
            print '%8.2f%6.2f' % (pdist.prob('x'), pdist.prob('y')),
        print

test_maxent(nltk.classify.MaxentClassifier.ALGORITHMS)
# test_maxent(['LBFGSB'])