Extracting Body Content from a Web Page

I recently needed to extract the main body content from a series of web pages and discard all of the ‘boilerplate’: headers, menus, footers, and advertising. The application was performing statistical comparisons between web pages, and although it produced the correct answers for my test data, identical body text could produce wildly different statistical scores depending on how much ‘junk’ boilerplate was present. This article introduces the Boilerpipe library, which can be used to perform this task.

Extracting the main content (‘body’) text from a web page is difficult in the general case. It would appear to lend itself to machine learning, and researchers have had some success in this arena. Specifically, Kohlschütter et al. describe a particularly useful set of techniques in their 2010 paper Boilerplate Detection using Shallow Text Features. Kohlschütter has also implemented his algorithm as the open source Boilerpipe library. This is written in Java, which I am not currently using for natural language processing. However, there is the NBoilerpipe port for Mono, and the python-boilerpipe wrapper library for Python. I shall address NBoilerpipe in a future post, and discuss python-boilerpipe here.

The project home and library downloads for python-boilerpipe can be found at: http://code.google.com/p/boilerpipe/ . The installer will download Boilerpipe on its own, but it does have some prerequisites. You must have Java installed, and the JAVA_HOME environment variable must be set. It also requires JPype (as a Java-Python bridge). I had quite a few problems installing these in my Python (Ubuntu) development environment, but I am not a Linux guru and I have limited Java installation experience. The main problem was that JPype could not find the Java runtime library (libjvm.so). Eventually I fixed this by editing python-boilerpipe’s __init__.py file to use an explicit path to libjvm.so.
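
If you hit the same libjvm.so problem, a quick way to find the path to hard-code is to search under JAVA_HOME. The following is only a minimal sketch: find_libjvm is a hypothetical helper of my own, not part of python-boilerpipe or JPype, and it assumes the library lives somewhere beneath the JAVA_HOME directory.

```python
import os


def find_libjvm(java_home, lib_name="libjvm.so"):
    """Return the first path to lib_name found under java_home, or None.

    Hypothetical helper for illustration; not part of python-boilerpipe.
    """
    for dirpath, dirnames, filenames in os.walk(java_home):
        dirnames.sort()  # visit subdirectories in a predictable order
        if lib_name in filenames:
            return os.path.join(dirpath, lib_name)
    return None


if __name__ == "__main__":
    java_home = os.environ.get("JAVA_HOME")
    if not java_home:
        print("JAVA_HOME is not set")
    else:
        print(find_libjvm(java_home) or "libjvm.so not found under " + java_home)
```

The path it prints is the sort of value that can be used wherever an explicit JVM library path is needed.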

Actual usage documentation for the python wrapper is limited, but luckily it is quite easy to use. For example, to load the library, and fetch a page for processing, use the following:

from boilerpipe.extract import Extractor
extractor = Extractor(extractor='ArticleExtractor', url="http://www.monlp.com")

Or if you have the page loaded as a string variable, you can use:

extractor = Extractor(extractor='ArticleExtractor', html=myWebPage)

Then the processed body text is read using the getText() or getHTML() methods:

processed_plaintext = extractor.getText()
highlighted_html = extractor.getHTML()

Boilerpipe implements a number of extraction algorithms for different circumstances. ArticleExtractor is a reasonable default for most cases, although it is tuned specifically for news articles.
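
One way to choose between the algorithms is to run several of them over the same page and see how much text each one keeps. The sketch below is an illustration under assumptions: compare_extractors is a hypothetical helper, and extractor names other than 'ArticleExtractor' (such as 'DefaultExtractor' or 'LargestContentExtractor') are taken from Boilerpipe's Java class names, on the assumption that the wrapper accepts them in the same way.

```python
def compare_extractors(html, names, make_extractor=None):
    """Map each extractor name to the number of characters of text it keeps.

    Hypothetical helper for illustration. make_extractor defaults to
    python-boilerpipe's Extractor class; it can be swapped for a stub.
    """
    if make_extractor is None:
        # Imported lazily because it needs a working Java and JPype setup.
        from boilerpipe.extract import Extractor
        make_extractor = Extractor
    results = {}
    for name in names:
        extractor = make_extractor(extractor=name, html=html)
        results[name] = len(extractor.getText())
    return results
```

For example, compare_extractors(myWebPage, ['ArticleExtractor', 'DefaultExtractor']) returns a dictionary of text lengths keyed by extractor name. A much shorter result from one extractor is a hint to look at what it discarded.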

The only problem I have had so far is that Boilerpipe does not extract the title tag. Therefore my code fetches the web page and passes the text through BeautifulSoup to extract the title, and through Boilerpipe to extract the content text.
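
The title step can be sketched as follows. My own code uses BeautifulSoup for this; the version below swaps in Python's standard html.parser module so that it runs without extra dependencies. TitleParser and extract_title are names invented for this illustration.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element of a page."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True  # ignore any later <title> elements

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        # Collapse runs of whitespace; return None if no title text was seen.
        text = " ".join("".join(self._parts).split())
        return text or None


def extract_title(html):
    """Return the page title as a string, or None if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title
```

The same page string can then be passed both to extract_title(myWebPage) for the title and to Extractor(extractor='ArticleExtractor', html=myWebPage) for the body text, so the page only needs to be fetched once.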

The original application that started this off is firmly in the .NET environment. Luckily, there is already a Mono port of Boilerpipe called NBoilerpipe, which I shall be looking at next.
