Calculating Word Frequency Tables

Now that we can segment words and sentences, it is possible to produce word and tuple frequency tables. Here I show you how to create a word frequency table for a large collection of text files.

Word frequency tables are useful for a wide range of applications, including collocation detection, spelling correction, and n-gram modeling. This article concentrates on simple word frequencies, but the code can (and will) be extended to also calculate n-grams.

Applications that use word frequency tables need frequencies calculated across the problem domain. Typically the domain is a subset of texts (e.g. news articles), but it can also be a much larger domain (e.g. English literature). Either way, the sample text is probably going to be large, consisting of many (possibly thousands or tens of thousands of) texts. The script below was written to create a word frequency table for all of the text files in a directory. It has been used with both the entire English-language Gutenberg library and the entire set of English Wikipedia pages. Preparing these texts will be covered in future articles.

Frequency tables are stored using NLTK’s FreqDist class. This derives from a standard Python dictionary, but stores a count for each word (the key). This allows the resulting table to be used by NLTK, if so desired.
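
For example, a FreqDist behaves much like an ordinary dictionary of counts, with some convenience methods on top. A minimal sketch, using the NLTK 2.x API that the code below relies on (in NLTK 3, inc() was replaced by ordinary item assignment):

from nltk.probability import FreqDist

fd = FreqDist()
for word in ["the", "cat", "sat", "on", "the", "mat"]:
    fd.inc(word)      # count one occurrence of this word

print fd["the"]       # 2 - dictionary-style lookup of a word's count
print fd.B()          # 5 - number of distinct words (bins)
print fd.N()          # 6 - total number of word samples counted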

Words are segmented using my own word and sentence segmenter. Punctuation and numbers are dropped from the word counts, but these could easily be included if your application requires it. Acronyms and abbreviations are included, even when they contain numeric digits. E.g. “1970” is dropped because it is a number, but “1970s” (an abbreviation) is not.
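
To illustrate the drop rule, here is the same regular expression used in the class below, applied to a few sample tokens (a small standalone sketch):

import re

# The drop pattern from WordFrequencyBuilder: matches tokens made up
# entirely of digits, non-word characters, underscores, and e/E
regexDrop = re.compile(r'^[\d\WeE\_]+$')

for token in ["1970", "1970s", "3.14", "1e5", "...", "U.S.A.", "hello"]:
    if regexDrop.match(token):
        print token, "-> dropped"
    else:
        print token, "-> kept"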

Here is the code:

# Module of functions for calculating word frequency tables
# These can work on in-memory text, a text file, or a directory
# Also includes functions for combining tables, etc.

# Originally written to create word tables for the Gutenberg and
# Wikipedia Corpora

import sys
import os
import re
import gc

# Import the SentenceTokenizer (in module word_parser.py)
from word_parser import SentenceTokenizer
from nltk.probability import FreqDist

class WordFrequencyBuilder(object):
    def __init__ (self, diagnostics):
        self.diagnostics = diagnostics
        # Create one sentence tokenizer for re-use as required
        self.myTokenizer = SentenceTokenizer()
        # Create an empty frequency distribution
        self.myFD = FreqDist()

        # regex for 'words' to drop: tokens made up entirely of digits,
        # non-word characters, underscores, and the letters e/E
        # (the e/E also catches numbers written like "1e5")
        self.regexDrop = re.compile(r'^[\d\WeE\_]+$')

    # Accessors

    # Return a reference to the frequency distribution
    # This is an NLTK FreqDist object
    def FD(self):
        return self.myFD

    # Used by buildTableForFile() to process a section of text for word count
    # This text is assumed to be a paragraph
    # text: Text to process
    def processText(self, text):
        # segment the text into words and sentences
        # Only words are required, but sentence segmentation is involved
        # because we want to interpret full stops correctly
        sentences = self.myTokenizer.segment_text(text)
        for sentence in sentences:
            for word in sentence:
                if (not self.regexDrop.match(word) ):
                    self.myFD.inc( word.lower().strip() )

    # Add the words for a file to this frequency table
    # fname: Full path file name to read (plain text only)
    # Note: This distribution is cumulative. Create a new object
    # to reset the table
    def buildTableForFile(self, fname):
        # Read the file as a list of lines
        with open(fname, "r") as f:
            lines_text = f.readlines()

        # Process text in paragraphs, using empty lines as paragraph markers
        # This avoids inefficient text processing and re-allocations        
        full_text = ""
        for s in lines_text:
            ss = s.strip()
            if (len(ss) == 0  and len(full_text)>0):
                # Empty line => process what we have
                self.processText(full_text)
                full_text = ""
            else:
                # Accumulate this line
                full_text = full_text + " " + ss

        # Process any remaining text
        if (len(full_text)>0):
            self.processText(full_text)

    # Add the words for all files in the supplied directory to this
    # frequency table. The directory is recursed if necessary.
    # All files should be plain text.
    # path: Full path to the directory to read
    # Returns a reference to our frequency distribution
    # Note: This distribution is cumulative. Create a new object
    # to reset the table
    def buildTableForTextDir(self, path):
        counter=0
        for dirname, dirnames, filenames in os.walk(path):
            for f in filenames:
                #sys.stdout.write(f+':')
                #sys.stdout.flush()

                infile = os.path.join(dirname, f)
                self.buildTableForFile(infile)
                counter = counter+1
                if ( (counter % 100) == 0):
                    if (self.diagnostics):
                        print counter,": B=", self.myFD.B(),"  N=",self.myFD.N()
                    gc.collect()
        return self.myFD

# Main script - create frequency table for the supplied path
# Usage: python word_freqs.py /my/input/path /my/output/table.txt

if __name__ == '__main__':

    if (len(sys.argv) != 3):
        sys.stderr.write("Usage: python %s inputpath outputfile\n" % sys.argv[0])
        raise SystemExit(1)

    input_path = sys.argv[1]
    output_file = sys.argv[2]

    print "Scanning word frequencies..."
    myWF = WordFrequencyBuilder(True)

    fd = myWF.buildTableForTextDir( input_path )

    print "No. of different words = ", myWF.FD().B(), '(samples=', myWF.FD().N(),')'
    print "Writing to file..."

    with open(output_file, "w") as f:
        f.write( "ALL\t{0}\n".format( fd.N() ) )
        # In NLTK 2.x, FreqDist.keys() returns words most-frequent-first
        for word in fd.keys():
            f.write( "{0}\t{1}\n".format( word, fd[word] ) )

Note that this code includes a number of diagnostics to indicate the current progress. These could be removed for silent running (the periodic progress line can also be switched off by passing False to the WordFrequencyBuilder constructor) once you are confident that it is running okay.

Also, it can be used as a module (by importing it and creating a WordFrequencyBuilder object), or it can be used as a script from the command line. In the latter form, it should be called with two command-line parameters: the source path and the output file. Output is in the form of a tab-separated file, with one word per line. Tabs are used in case punctuation tokens are ever included; this avoids handling quote-escape sequences. The (lowercase) word is in the first column, followed by its count. Words are sorted with the most frequent first. The first line gives the total number of samples counted (against the word “ALL”).
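
If you later need to load the table back (e.g. for a spelling corrector), a minimal reader might look like this. This is a sketch that assumes the exact file layout written above; load_table() is a hypothetical helper, not part of the script:

# Load a frequency table written by word_freqs.py
# Returns the total sample count and a word -> count dictionary
def load_table(fname):
    total = 0
    counts = {}
    with open(fname, "r") as f:
        for line in f:
            word, count = line.rstrip("\n").split("\t")
            if word == "ALL":
                total = int(count)        # header line: total samples
            else:
                counts[word] = int(count)
    return total, counts

total, counts = load_table("/my/output/table.txt")
print "samples:", total, " distinct words:", len(counts)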

The FreqDist dictionary in the WordFrequencyBuilder class is cumulative, i.e. multiple calls to buildTableForTextDir() on the same WordFrequencyBuilder object can be used to count total word frequencies across multiple directories.
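
For example (with illustrative paths, and assuming the script is saved as word_freqs.py as in the usage line above):

from word_freqs import WordFrequencyBuilder

myWF = WordFrequencyBuilder(True)
myWF.buildTableForTextDir("/data/gutenberg")       # first corpus
fd = myWF.buildTableForTextDir("/data/wikipedia")  # adds to the same table
print "combined vocabulary size:", fd.B()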