Calculating Word and N-Gram Statistics from a Wikipedia Corpus

As well as using the Gutenberg Corpus, it is possible to create a word frequency table for the English text of the Wikipedia encyclopedia.
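As a rough illustration of what word and n-gram statistics amount to, here is a minimal Python sketch. The `tokenize` and `ngram_counts` helpers and the sample sentence are invented for this example and are not taken from the article's script; the tokenizer is deliberately naive (lowercase runs of letters and apostrophes).

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return runs of letters/apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens, keyed by a tuple of words."""
    # zip over n staggered views of the token list yields each n-gram once.
    return Counter(zip(*(tokens[i:] for i in range(n))))

# Hypothetical sample text, used only to exercise the helpers.
sample = "the cat sat on the mat and the cat slept"
tokens = tokenize(sample)

word_freq = Counter(tokens)             # unigram frequency table
bigram_freq = ngram_counts(tokens, 2)   # bigram frequency table
```

Calling `word_freq.most_common(10)` or `bigram_freq.most_common(10)` would then list the most frequent entries. For a corpus the size of Wikipedia, the token list would presumably need to be processed in a streaming fashion rather than held in memory as it is here.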

Calculating Word Statistics from the Gutenberg Corpus

Following on from the previous article about scanning text files for word statistics, I shall extend the approach to large, real-world corpora. First I shall use the script to create statistics for the entire English-language Gutenberg corpus. Next I shall do the same with the entire English-language Wikipedia.
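Scanning a directory of text files for word statistics might look something like the sketch below. It is not the script from the article: the function names, the `*.txt` file pattern, the UTF-8 assumption, and the tab-separated output format are all assumptions made for illustration.

```python
import re
from collections import Counter
from pathlib import Path

WORD_RE = re.compile(r"[A-Za-z']+")

def count_words_in_files(root):
    """Accumulate lowercase word counts over every .txt file under root."""
    counts = Counter()
    for path in sorted(Path(root).rglob("*.txt")):
        # Read line by line so a large file is never loaded whole;
        # undecodable bytes are replaced rather than raising an error.
        with path.open(encoding="utf-8", errors="replace") as handle:
            for line in handle:
                counts.update(word.lower() for word in WORD_RE.findall(line))
    return counts

def write_frequency_table(counts, out_path):
    """Write 'word<TAB>count' lines, most frequent word first."""
    with open(out_path, "w", encoding="utf-8") as out:
        for word, count in counts.most_common():
            out.write(f"{word}\t{count}\n")
```

A call such as `write_frequency_table(count_words_in_files("corpus_dir"), "freq.tsv")` would produce the table, where `corpus_dir` is a placeholder for wherever the corpus files have been unpacked. Only the counter, not the text, is kept in memory, which is what makes this shape workable for a large corpus.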

Part of Speech Tags

A frequently asked question is “What do the Part of Speech tags (VB, JJ, etc.) mean?” The bottom line is that these tags mean whatever they meant in your original training data. You are free to invent your own tags in your training data, as long as you use them consistently. Training data ...