Calculating Word Statistics from the Gutenberg Corpus

Following on from the previous article about scanning text files for word statistics, I shall extend the approach to genuinely large corpora. First I shall use the script to create statistics for the entire Gutenberg English language corpus. Then I shall do the same with the entire English language Wikipedia.

Project Gutenberg is a non-profit project to digitize books whose copyright has expired. It is noted for including large numbers of classic texts and novels, all of which are available for free online. Although the site could be spidered, this is strongly discouraged due to the project’s limited resources. Instead, a CD or DVD image can be downloaded (or purchased), or you can create a local mirror. Instructions for doing this are here, and include the following recommended shell command:
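
(The mirror host and destination directory below are illustrative placeholders; check the current mirroring instructions for the exact module name before running anything.)

    rsync -av --del aleph.gutenberg.org::gutenberg ./gutenberg-mirror/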

I would strongly recommend using an even more restrictive command that excludes everything except directories and files with the “.txt” suffix. If you do not, rsync will download a wide range of miscellaneous binary files including bitmaps, ISOs and RARs, some of which are very large. Even if you have the time, disk space and bandwidth, it is definitely not polite to stress Project Gutenberg’s servers unnecessarily.
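
A sketch of such a restrictive fetch is given below (same illustrative host and destination as above). The --include/--exclude filters keep the directory tree but only transfer files ending in “.txt”, and --prune-empty-dirs drops directories that end up containing nothing:

    rsync -av --del --prune-empty-dirs \
        --include='*/' --include='*.txt' --exclude='*' \
        aleph.gutenberg.org::gutenberg ./gutenberg-mirror/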

A restrictive “text only” fetch will probably still take a few hours to run, and it will create a complex directory hierarchy. This includes a range of unnecessary text files (e.g. readme files, and the complete Human Genome), and many texts are present in multiple copies in different formats (typically plain ASCII and 8-bit ISO-8859-1). The next job is to tidy this up by copying all unique texts into one large flat directory. I chose the ISO-8859-1 version where available, and plain ASCII when it wasn’t; this preserves the correct accents for words that have been adopted into the English language (e.g. “déjà vu”). This copy and filter process is performed using a Python script.

Each Gutenberg text also includes a standard header and footer. We do not want this boilerplate in the statistics, so it needs to be stripped off. For efficiency, the header and footer are removed by the same script. This has the added advantage that descriptive files (e.g. readme files), which lack the header/footer markers, are automatically skipped.

IMPORTANT NOTE: The Gutenberg license prohibits the distribution of their texts without the header and footer. Do not distribute the texts in this form. If there is any chance you will be sharing your files (e.g. with colleagues), the headers and footers should be kept intact.

Here is the script:
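
The sketch below assumes the usual “*** START OF …” / “*** END OF …” marker lines, and that a “-8” filename suffix marks the ISO-8859-1 version of an e-text while a bare number is the plain ASCII one; adjust these to match your mirror.

    #!/usr/bin/env python3
    # Flatten a Project Gutenberg rsync mirror into one directory of unique,
    # header/footer-stripped texts.  The marker strings and the "-8" naming
    # convention are assumptions; adjust them to match your mirror.
    import os
    import sys

    SRC = sys.argv[1]    # root of the rsync mirror
    DEST = sys.argv[2]   # flat output directory

    # Standard Gutenberg boilerplate delimiters (exact wording varies slightly).
    START_MARKER = '*** START OF'
    END_MARKER = '*** END OF'

    def strip_header_footer(lines):
        """Return the lines between the start and end markers, or None if
        either marker is missing (readme files etc. are skipped this way)."""
        start = end = None
        for i, line in enumerate(lines):
            if start is None and START_MARKER in line:
                start = i + 1
            elif start is not None and END_MARKER in line:
                end = i
                break
        if start is None or end is None:
            return None
        return lines[start:end]

    def main():
        os.makedirs(DEST, exist_ok=True)
        chosen = {}  # e-text number -> path of the preferred version
        for dirpath, _dirnames, filenames in os.walk(SRC):
            for name in filenames:
                if not name.endswith('.txt'):
                    continue
                base = name[:-4]
                if base.endswith('-8'):      # "12345-8.txt": ISO-8859-1
                    etext, preferred = base[:-2], True
                elif base.isdigit():         # "12345.txt": plain ASCII
                    etext, preferred = base, False
                else:
                    continue                 # other encodings, oddly named files
                if preferred or etext not in chosen:
                    chosen[etext] = os.path.join(dirpath, name)

        for etext, path in chosen.items():
            with open(path, encoding='iso-8859-1') as f:
                lines = f.readlines()
            body = strip_header_footer(lines)
            if body is None:
                continue                     # no Gutenberg markers: skip readmes etc.
            with open(os.path.join(DEST, etext + '.txt'), 'w',
                      encoding='iso-8859-1') as f:
                f.writelines(body)

    if __name__ == '__main__':
        main()

Run it with the mirror root and an output directory as arguments, e.g. “python3 flatten_gutenberg.py ./gutenberg-mirror ./gutenberg-flat” (the script name here is just an example).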



Compared to the download, this script runs quite quickly, and the resulting flat directory is much easier to process.

The resulting frequency table (created using the word frequency table scripts) can be downloaded as a GZIP file (28.5MB).

Note that although the word frequency table scripts could easily be modified to process N-grams, the sheer size of the Gutenberg dataset will prove a challenge. I shall address this in a future article. A similar process can be used to create a word frequency table for the English language Wikipedia database.
