Calculating Word and N-Gram Statistics from a Wikipedia Corpus

As well as using the Gutenberg Corpus, it is possible to create a word frequency table for the English text of the Wikipedia encyclopedia.

The English-language download of Wikipedia (the pages-articles.xml.bz2 dump) contains a large amount of markup. Use the wikipedia2text scripts (see Generating a Plain Text corpus from Wikipedia for instructions and a download link) to produce consolidated plain text files of articles.

The consolidated files have XML wrappers around each individual entry, and they are stored in a directory hierarchy. As with the Gutenberg corpus, a custom Python script is used to remove unnecessary information and to copy the files into one large flat directory. Here is the script for the Wikipedia pages:


The script simply walks the directory tree, copying each file. During the copy, all XML tags are removed and HTML entity symbols are converted to their corresponding characters. Characters are converted to ISO-8859-1 (‘8-bit ASCII’, aka ‘latin-1’) as necessary. Because the same file names are reused in each sub-directory, new names are created that also incorporate the sub-directory name.
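The flattening step described above can be sketched as follows. This is a minimal reconstruction, not the original script: the directory names `wiki_text` and `wiki_flat`, the simple tag-stripping regex, and the tokenization of names are all assumptions.

```python
# Hypothetical sketch of the flattening script described above; the actual
# directory layout and XML wrapper format may differ.
import html
import os
import re

SRC = "wiki_text"   # root of the wikipedia2text output tree (assumed name)
DST = "wiki_flat"   # single flat output directory (assumed name)

tag_re = re.compile(r"<[^>]+>")  # matches simple XML wrapper tags

os.makedirs(DST, exist_ok=True)
for dirpath, _dirnames, filenames in os.walk(SRC):
    # Fold the sub-directory path into the new file name to avoid clashes.
    subdir = os.path.relpath(dirpath, SRC).replace(os.sep, "_")
    for name in filenames:
        path = os.path.join(dirpath, name)
        with open(path, encoding="utf-8", errors="replace") as f:
            text = f.read()
        text = tag_re.sub("", text)   # remove XML tags
        text = html.unescape(text)    # convert HTML entities (&amp; etc.)
        out_name = name if subdir == "." else f"{subdir}_{name}"
        with open(os.path.join(DST, out_name), "w",
                  encoding="latin-1", errors="replace") as f:
            f.write(text)             # re-encode as ISO-8859-1
```

Writing with `errors="replace"` silently substitutes any character that has no latin-1 equivalent, which matches the "as necessary" conversion described above.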

The resulting frequency table (created using the word frequency table scripts) can be downloaded as a GZIP file (21.6MB). A combined (Wikipedia and Gutenberg) frequency table is also available (45.7MB).

Note that although the word frequency table scripts could easily be modified to process N-grams, the sheer size of the Wikipedia dataset will prove a challenge. I shall address this in a future article.
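For reference, the modification itself is small: instead of counting individual words, count tuples of adjacent words. A minimal bigram sketch (the function name and the simple regex tokenizer are assumptions, and a real run over Wikipedia would need to stream files rather than hold text in memory):

```python
# Minimal N-gram counting sketch; the real scripts would process many
# large files and may tokenize differently.
import re
from collections import Counter

def ngram_counts(text, n=2):
    """Count N-grams in a lowercased sequence of words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(tuple(words[i:i + n])
                   for i in range(len(words) - n + 1))

counts = ngram_counts("the cat sat on the mat", 2)
# each adjacent pair, e.g. ('the', 'cat'), is counted once here
```

The memory challenge mentioned above comes from the key space: distinct bigrams and trigrams vastly outnumber distinct words, so an in-memory `Counter` stops being practical at Wikipedia scale.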

3 thoughts on “Calculating Word and N-Gram Statistics from a Wikipedia Corpus”

  1. Justin Black, Jul 3, 2012, 9:32 am

    This posting was very helpful. For a little program I’m writing, I need to sort nouns by popularity. Wikipedia word frequency works well for that.

    I’ve posted a small program on github comparing word frequency scores using these text files versus nltk corpora: brown, reuters, treebank, and gutenberg.

    If you’d like to see how they compare, check out:

  2. adam n, Nov 21, 2013, 2:44 am

    Hi, I found this very useful for some language processing tasks.
    Do you have similar resources for other languages? (I’m interested mainly in French, Spanish, and German.)
