Calculating N-Gram Frequency Tables

The Word Frequency Table scripts can easily be extended to calculate N-Gram frequency tables. This post explains how.
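
As a rough illustration of the idea (not the post's own script), an N-Gram table can be built the same way as a word table by counting tuples of consecutive tokens instead of single tokens. The function name `ngram_counts` and the sample sentence are made up for this sketch.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    # Zipping n staggered views of the token list yields each n-gram as a tuple;
    # zip stops at the shortest view, so no partial n-grams are produced.
    return Counter(zip(*(tokens[i:] for i in range(n))))


# Illustrative input only; a real run would feed tokens from a corpus.
tokens = "the cat sat on the mat and the cat slept".split()
bigrams = ngram_counts(tokens, 2)
print(bigrams.most_common(1))  # [(('the', 'cat'), 2)]
```

With `n=1` the same function degenerates to a plain word frequency table keyed by one-element tuples, which is why the extension from words to N-Grams is small.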

Calculating Word and N-Gram Statistics from a Wikipedia Corpora

As well as using the Gutenberg Corpus, it is possible to create a word frequency table for the English text of the Wikipedia encyclopedia.

Calculating Word Statistics from the Gutenberg Corpus

Following on from the previous article about scanning text files for word statistics, I shall extend this to real, large corpora. First I shall use the script to create statistics for the entire Gutenberg English-language corpus. Next I shall do the same with the entire English-language Wikipedia.

Calculating Word Frequency Tables

Now that we can segment words and sentences, it is possible to produce word and tuple frequency tables. Here I show you how to create a word frequency table for a large collection of text files.
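
A minimal sketch of such a table, assuming the collection is a directory of UTF-8 `.txt` files and that a crude regular expression is good enough for tokenizing. The names `word_frequencies` and `WORD_RE` are hypothetical and this is not the script from the post.

```python
import re
from collections import Counter
from pathlib import Path

# Assumption: lowercase runs of letters and apostrophes count as words.
# A real tokenizer would handle hyphens, digits, and non-ASCII text.
WORD_RE = re.compile(r"[a-z']+")


def word_frequencies(directory):
    """Return a Counter of word frequencies over all *.txt files in directory."""
    counts = Counter()
    for path in sorted(Path(directory).glob("*.txt")):
        # Read line by line so large files are never held in memory at once.
        with path.open(encoding="utf-8", errors="ignore") as handle:
            for line in handle:
                counts.update(WORD_RE.findall(line.lower()))
    return counts
```

Calling `word_frequencies("some_directory").most_common(20)` would then list the twenty most frequent words; the directory name here is a placeholder.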

Segmenting Words and Sentences

Even simple NLP tasks such as tokenizing words and segmenting sentences can have their complexities. Punctuation characters could be used to segment sentences, but this requires the punctuation marks to be treated as separate tokens. This would result in abbreviations being split into separate words and sentences. This post uses a classification approach to create ...

Book Review: Natural Language Understanding

Although “Natural Language Understanding” by James Allen is an older book, it still contains some useful content presented in a readable form. More modern books take a more statistical approach, but this one has good, clear presentations of formal grammar, logic, and conversational agent topics.

Extracting Noun Phrases from Parsed Trees

Following on from my previous post about NLTK Trees, here is a short Python function to extract phrases from an NLTK Tree structure.
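
One possible shape for such a function, sketched here with NLTK's `Tree.subtrees` and `Tree.leaves` methods. The name `extract_phrases` and the bracketed example sentence are assumptions for illustration, not the code from the post.

```python
from nltk.tree import Tree


def extract_phrases(tree, label="NP"):
    """Return the text of every subtree whose node label matches label."""
    # subtrees() walks the tree top-down, left to right; the filter keeps
    # only nodes carrying the requested label.
    matches = tree.subtrees(lambda t: t.label() == label)
    # Assumption: leaves are plain strings, as produced by Tree.fromstring.
    # Chunker output typically has (word, tag) tuples as leaves instead.
    return [" ".join(sub.leaves()) for sub in matches]


# Hand-written parse used only as sample input.
tree = Tree.fromstring(
    "(S (NP (DT the) (NN cat)) (VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))))"
)
print(extract_phrases(tree))  # ['the cat', 'the mat']
```

Nested phrases with the same label are each returned, so an NP containing another NP yields both the outer and the inner text.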

NLTK Trees

A number of NLTK functions work with Tree objects. For example, part-of-speech tagging and chunking classifiers naturally return trees. Sentence manipulation functions also work with trees. Although Natural Language Processing with Python (Bird et al.) includes a couple of pages about NLTK’s Tree module, coverage is generally sparse. The online documentation actually contains ...

Computational Linguistics v37 Issue 4 published

Apologies for the lack of posts over the past couple of weeks – this has been due to a major office move, and the Thanksgiving Holiday. Indeed, due to the holidays, the internet was only re-connected yesterday. Anyway, Volume 37, Issue 4 of Computational Linguistics has just been published. This is an open access journal ...

Book Review: Word Sense Disambiguation

“Word Sense Disambiguation: The Case for Combinations of Knowledge Sources” by Mark Stevenson describes the author’s six-year research project into Word Sense Disambiguation that started with his PhD in 1995. The book includes a summary literature review of previous attempts at Word Sense Disambiguation before building a framework that combines multiple models and filters ...