Calculating N-Gram Frequency Tables

The Word Frequency Table scripts can be easily expanded to calculate N-Gram frequency tables. This post explains how.

The Word Frequency Table scripts are not limited to string (word) keys, but can work with any kind of valid Python data type. This means we could easily create an n-gram frequency table using tuples to represent n-grams. This can work very well for relatively small amounts of data, but memory usage and memory churn add inefficiencies that can become significant for large amounts of data. Therefore we shall use simple space-separated text strings to represent each n-gram.

The previous word frequency script only requires two modifications. First, the constructor must be modified in order to store the size of the required n-grams:

In this modified constructor, ‘N’ stores the size of the n-grams to search for. Setting this in the constructor ensures we do not accidentally mix different n-gram sizes in the same table.

Next we modify the processText() method to keep track of the n-grams rather than simply store word frequencies. We shall mark sentence beginnings and ends with ‘<s>’ and ‘</s>’ tags. This particular implementation can skip numbers and punctuation. Skipped numbers and punctuation will break tuples (i.e. tuples cannot span skipped tokens). The calling methods will need to be modified to set these parameters.

Here is the new processText() implementation:

That is all that is required to convert the word frequency table script into an n-gram frequency table script. The main limitation is that the frequency table is constructed entirely in memory. This is not a problem for simple word frequencies, but it becomes a major problems for large datasets such as the Wikipedia or Gutenberg texts. Even a bigram frequency table for either of these texts will take more than 8GB of RAM. In a future article I shall address this issue by creating a disk version of the table. A disk version can also perform more sophisticated subset queries (i.e. “find the frequency table for n-grams that start with the following words”).

Leave a Reply