Following on from the previous article about scanning text files for word statistics, I shall extend the approach to real, large corpora. First, I shall use the script to create statistics for the entire Project Gutenberg English-language corpus. Next, I shall do the same with the entire English-language Wikipedia.
Project Gutenberg is a non-profit project to digitize books with expired copyright. It is noted for including large numbers of classic texts and novels. All of the texts are available for free online. Although the site could be spidered, this is strongly discouraged due to the limited resources of the project. Instead, a CD or DVD image can be downloaded (or purchased); or you can create a mirror. Instructions for doing this are here, and include the following recommended shell command:
rsync -avHS --delete --delete-after --exclude '*cache/generated' firstname.lastname@example.org::gutenberg /home/ftp/pub/mirrors/gutenberg
I would strongly recommend you use an even more restrictive command that excludes everything except directories and files with the “.txt” suffix. If you do not do this, the rsync command will download a wide range of miscellaneous binary files, including bitmaps, ISOs and RARs, some of which are very large. Even if you have the time, disk space, and bandwidth, it is definitely not polite to stress Project Gutenberg’s servers unnecessarily.
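One way to build such a restrictive command is with rsync’s include/exclude filters: allow descent into every directory, keep “.txt” files, and exclude everything else. The filter patterns and the -m (prune empty directories) flag are my own suggestion rather than Project Gutenberg’s, and the destination path here matches the one used by the script below:

```shell
rsync -avHSm --delete --delete-after \
    --include '*/' --include '*.txt' --exclude '*' \
    firstname.lastname@example.org::gutenberg /var/bigdisk/gutenberg/gt_text
```

The order of the filter rules matters: rsync applies the first rule that matches, so the catch-all exclude must come last.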
A restrictive “text only” fetch will probably still take a few hours to run. It will also create a complex directory hierarchy. This includes a range of unnecessary text files (e.g. readme files, and the complete Human Genome), and many texts are present in multiple copies in different formats (typically ASCII and 8 bit ISO-8859-1). The next job is to tidy this up, by copying all unique texts into one large flat directory. I chose to use the ISO-8859-1 format where available, and plain ASCII when it wasn’t. This enables the correct form to be kept for accented words that have been adopted into the English language (e.g. “déjà vu”). This copy and filter process is performed using a Python script.
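The format preference can be sketched as a small standalone helper (the function name is mine; the full script below folds this logic into its main loop):

```python
import os

def pick_preferred(filenames):
    """From one directory's Gutenberg file names, keep the ISO-8859-1
    ("-8.txt") version of each text where it exists, otherwise the
    plain ASCII version. Returns the names worth copying."""
    keep = []
    drop = set()
    for fname in filenames:
        if fname.endswith("-8.txt"):
            base = fname[:-len("-8.txt")]
            # an ISO-8859-1 copy exists => skip the ASCII duplicates
            drop.add(base + ".txt")
            drop.add(base + "-0.txt")
        if os.path.splitext(fname)[1] == ".txt":
            keep.append(fname)
    return [f for f in keep if f not in drop]
```

So given “100.txt”, “100-0.txt” and “100-8.txt”, only “100-8.txt” survives; a text with no “-8” copy is kept as-is.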
Each Gutenberg text also includes a standard header and footer. We do not wish to include this information in the statistics, so it needs to be stripped off. For efficiency, the header and footer are removed by the same script. This also has the advantage that descriptive files (e.g. readme files) lack the header/footer markers and are automatically skipped.
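The stripping logic can be tried in isolation. A minimal sketch (the function name is mine; the marker strings are the ones the full script below looks for):

```python
def body_lines(lines):
    """Yield only the lines between the Gutenberg start and end markers.
    A file without a start marker yields nothing, so readme-style files
    are skipped automatically."""
    copying = False
    for line in lines:
        if not copying:
            if line.startswith("*** START OF THIS PROJECT GUTENBERG EBOOK"):
                copying = True
        else:
            if line.startswith("*** END OF THIS PROJECT GUTENBERG EBOOK"):
                break
            yield line
```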
IMPORTANT NOTE: The Gutenberg license prohibits the distribution of their texts without the header and footer. Do not distribute the texts in this form. If there is any chance you will be sharing your files (e.g. with colleagues), then the headers and footers should be kept intact.
Here is the script:
# Loops over all of the gutenberg text files in /var/bigdisk/gutenberg/gt_text
# Extracting unique texts, and copies them to ./gt_raw
# Removes header and footer information - leaving just the text, ready for
# statistical processing
#
# Usage: Be sure to change these paths to point to the relevant directories on your system

import string
import os
import gc
import shutil

# The logic for keeping a file based on its name is put
# into a function to improve readability
# fname = Name of file without path or file extension
def keep_file(fname):
    # filter out readme, info, notes, etc
    if fname.lower().find("readme") != -1:
        return False
    if fname.find(".zip.info") != -1:
        return False
    if fname.find("pnote") != -1:
        return False
    # Filter out the Human Genome
    if len(fname) == 4:
        try:
            n = int(fname)
            if n >= 2201 and n <= 2224:
                print "*** Genome skipped:", n
                return False   # Human Genome
        except ValueError:
            pass
    # Looks good => keep this file
    return True

# Recursively walk the entire directory tree finding all .txt files which
# are not in old sub-directories. readme.txt files are also skipped.
# Empty the output directory
outputdir = "/var/bigdisk/gutenberg/gt_raw"
for f in os.listdir(outputdir):
    fpath = os.path.join(outputdir, f)
    try:
        if os.path.isfile(fpath):
            os.unlink(fpath)
    except Exception, e:
        print e

for dirname, dirnames, filenames in os.walk('/var/bigdisk/gutenberg/gt_text'):
    if dirname.find('old') == -1 and dirname.find('-h') == -1:
        # some files are duplicates, remove these and only copy a single copy
        # The -8 suffix takes priority (8 bit ISO-8859-1) over the
        # files with no suffix or -0 suffix (simple ASCII)
        # also remove auxiliaries: Names contain pnote or .zip.info
        flist = []
        flist_toremove = []
        for fname in filenames:
            fbase, fext = os.path.splitext(fname)
            if fext == '.txt':
                if keep_file(fbase):
                    flist.append(fname)
                if fname.endswith("-8.txt"):
                    # -8 takes priority => remove any duplicates
                    flist_toremove.append(fname[:len(fname) - 6] + ".txt")
                    flist_toremove.append(fname[:len(fname) - 6] + "-0.txt")
        flist_to_proc = [i for i in flist if i not in flist_toremove]
        # flist_to_proc now contains the files to copy
        # loop over them, copying line-by-line
        # Check for header/footer markers - file is skipped if header marker is missing
        for f in flist_to_proc:
            infile = os.path.join(dirname, f)
            outfile = os.path.join(outputdir, f)
            bCopying = False
            for line in open(infile):
                if not bCopying:
                    if line.startswith("*** START OF THIS PROJECT GUTENBERG EBOOK"):
                        fout = open(outfile, "w")
                        print "Copying: " + f
                        bCopying = True
                else:
                    if line.startswith("*** END OF THIS PROJECT GUTENBERG EBOOK"):
                        fout.close()
                        bCopying = False
                    else:
                        fout.write(line)
Compared to the download, this runs quite quickly and the final directory should be much easier to process.
Note that although the word frequency table scripts could easily be modified to process N-grams, the sheer size of the Gutenberg dataset will prove a challenge. I shall address this in a future article. A similar process can be used to create a word frequency table for the English-language Wikipedia database.
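For reference, the naive in-memory modification looks like this (a sketch only, with a hypothetical function name; at full-corpus scale a single in-memory counter will not cope, which is exactly the challenge mentioned above):

```python
from collections import Counter

def ngram_counts(words, n):
    """Count N-grams (as word tuples) in a token list. Fine for a single
    book; the full Gutenberg corpus would need an on-disk approach."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
```

For example, the bigrams of "to be or not to be" count ("to", "be") twice and every other pair once.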