Extracting Body Content from a Web Page Using .NET

Boilerpipe is a useful library for extracting the body content from web pages and discarding the ‘boilerplate’ (menus, footers, advertising, etc.). It is a Java library, so it requires a bridge (e.g. JPype for Python) if you wish to use it in a non-Java environment. Luckily for C# users, Arif Ogan has ported Boilerpipe to C#/Mono. The port is called NBoilerpipe and can be downloaded from GitHub.

NBoilerpipe is easy to use in C#, although you will need to make some minor changes if you wish to build it for .NET rather than Mono. The following applies to .NET 4.0 but is probably applicable to virtually all versions of .NET.

The sample code was usable pretty much as is. Here is my working sample:
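
A minimal sketch of what such a sample might look like follows. It is a reconstruction under assumptions, not the original listing. It assumes the port mirrors the Java API and exposes `ArticleExtractor.INSTANCE.GetText(string)` in an `NBoilerpipe.Extractors` namespace. The URL is a placeholder.

```csharp
using System;
using System.Net;
using NBoilerpipe.Extractors; // assumed namespace, mirroring the Java package layout

class Program
{
    static void Main()
    {
        // Placeholder URL; substitute the page you want to process.
        string url = "http://example.com/article.html";

        // Download the raw HTML. WebClient is available in .NET 4.0.
        string html;
        using (var client = new WebClient())
        {
            html = client.DownloadString(url);
        }

        // Assumed entry point: the Java original exposes
        // ArticleExtractor.INSTANCE.getText(...); the port is assumed to
        // keep the same shape with .NET-style method casing.
        string text = ArticleExtractor.INSTANCE.GetText(html);

        Console.WriteLine(text);
    }
}
```

If the port follows the Java library, the other extractors (the source notes that they are present) should be usable the same way by swapping the class name.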

NBoilerpipe requires Sharpen and the HtmlAgilityPack. It ships with the Sharpen source code and Visual Studio project, and the binaries for HtmlAgilityPack. I had to refresh the project with the latest .NET 4 HtmlAgilityPack binaries and update the project references. I also had to add a reference to System.Xml in the NBoilerpipe project references.

Sharpen produced a number of compile errors, mainly relating to the zip support. This is not used by NBoilerpipe, so I simply commented out the relevant classes (e.g. DeflaterOutputStream). References to Mono.Unix also had to be removed, along with the (unused) methods that relied on this namespace (e.g. in FilePath.cs). These changes were pretty simple for a one-off, but could become a maintenance hassle if you had to refresh either library with new versions.
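
One possible way to reduce that maintenance burden is conditional compilation instead of commenting code out. This is an alternative sketched here, not what was done above. The symbol `SHARPEN_UNIX` and the class `FilePathSketch` are hypothetical names for illustration and are not part of Sharpen. The idea is that the Mono-only members stay in the source but are compiled only when the symbol is defined in the project settings.

```csharp
// Illustrative only: SHARPEN_UNIX and FilePathSketch are hypothetical names,
// not part of the Sharpen source.
#if SHARPEN_UNIX
using Mono.Unix;   // compiled only when SHARPEN_UNIX is defined (e.g. on Mono)
#endif
using System.IO;

namespace SharpenSketch
{
    public class FilePathSketch
    {
        private readonly string path;

        public FilePathSketch(string path)
        {
            this.path = path;
        }

        // Portable member: compiled on every platform.
        public bool Exists()
        {
            return File.Exists(path) || Directory.Exists(path);
        }

#if SHARPEN_UNIX
        // Unix-only member: excluded from a plain .NET build, so no
        // reference to Mono.Posix is needed there.
        public bool IsSymlink()
        {
            return UnixFileSystemInfo.GetFileSystemEntry(path).IsSymbolicLink;
        }
#endif
    }
}
```

With the symbol left undefined, a .NET build skips the guarded code entirely. The edits still have to be reapplied when the library is refreshed, but they are easier to spot in a diff than commented-out blocks.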

Although I have found NBoilerpipe to be easy to use (and to ‘port’), parts of the API appear to be missing. The different extractors are present, but the HTML output options are not. I have asked about this on Stack Overflow, but so far there have been no responses.
