Calculating Word Statistics from the Gutenberg Corpus 1

Following on from the previous article about scanning text files for word statistics, I shall extend this to use real large corpora. First we shall use this script to create statistics for the entire Gutenberg English language corpus. Next I shall do the same with the entire English language Wikipedia.

Project Gutenberg is a non-profit project to digitize books with expired copyright. It is noted for including large numbers of classic texts and novels. All of the texts are available for free online. Although the site could be spidered, this is strongly discouraged due to the limited resources of the project. Instead, a CD or DVD image can be downloaded (or purchased); or you can create a mirror. Instructions for doing this are here, and include the following recommend shell command:

I would strongly recommend you use an even restrictive command to exclude everything except for directories and files with the “.txt” suffix. If you do not do this, the rsync command will download a wide range of miscellaneous binary files including bitmaps, ISOs and RARs. some of these are very large. Even if you have the time, disk space, and bandwidth; it is definitely not polite to stress Project Gutenberg’s servers unnecessarily.

A restrictive “text only” fetch will probably still take a few hours to run. It will also create a complex directory hierarchy. This includes a range of unnecessary text files (e.g. readme files, and the complete Human Genome), and many texts will contain multiple copies in different formats (typically ASCII and 8 bit ISO-8859-1). The next job is to tidy this up, by copying all unique texts into one large flat directory. I chose to use the ISO-8859-1 format where available, and plain ASCII when it wasn’t. This would enable the correct format to be used for accented words that have been adopted into the English language (e.g. “déjà vu”). This copy and filter process is performed using a Python script.

Each Gutenberg text also includes a standard header and footer. We do not wish to include this information in the statistics, so it needs to be stripped off. For efficiency, the header and footer are removed by the same script. This also has the advantage that descriptive files (e.g. readme files) lack the header/footer markers and are automatically skipped.

IMPORTANT NOTE: The Gutenberg license prohibits the distribution of their texts without the header and footer. Do not distribute the texts in this form. If there is the chance you will be sharing your files (e.g. with colleagues), then the headers and footers should be kept in tact.

Here is the script:



Compared to the download, this runs quite quickly and the final directory should be much easier to process.

The resulting frequency table (created using the word frequency table scripts) can be downloaded as a GZIP file (28.5MB).

Note that although the word frequency table scripts could be easily modified to process N-grams, the shear size of the Gutenberg dataset will prove a challenge. I shall address this in a future article. A similar process can be used to create a word frequency table for the English language Wikipedia database.

One comment on “Calculating Word Statistics from the Gutenberg Corpus

  1. Reply Andrew Jun 1,2016 1:15 am

    The Gutenberg license prohibits the distribution of their texts without the header and footer

    I don’t think this is correct, or at least their licence has been updated since this post was written. In fact, the licence allows commercial use of these texts only if you don’t include any mention of Project Gutenberg:

    Creating the works from print editions not protected by U.S. copyright law means that no one owns a United States copyright in these works, so the Foundation (and you!) can copy and distribute it in the United States without permission and without paying copyright royalties. Special rules, set forth in the General Terms of Use part of this license, apply to copying and distributing Project Gutenberg-tm electronic works to protect the PROJECT GUTENBERG-tm concept and trademark. Project Gutenberg is a registered trademark, and may not be used if you charge for the eBooks, unless you receive specific permission. If you do not charge anything for copies of this eBook, complying with the rules is very easy. You may use this eBook for nearly any purpose such as creation of derivative works, reports, performances and research. They may be modified and printed and given away–you may do practically ANYTHING in the United States with eBooks not protected by U.S. copyright law. Redistribution is subject to the trademark license, especially commercial redistribution.


Leave a Reply