MPCluster for Maptitude v2 released

Version 2 of MPCluster for Maptitude, our cluster analysis add-in for Caliper Maptitude, has just been released. v2 adds two license levels: Basic and Professional. Both licenses include the following updates: Hierarchical boundary polygons can now be concave, in line with the potentially concave nature of these clusters. Input data layers are limited to only those ...

New Maptitude 3D Surfaces Section

I have just added a new section to the Maptitude ‘Howto’ pages over at mapping-tools.com, discussing Maptitude’s 3D surface and landscape options. One example is an image of Snowdonia, created using Ordnance Survey elevation data combined with Google Maps Satellite imagery. Other examples include the Guadalupe Mountains (Texas), and geological overlays of both the Caprock Escarpment ...

Mapping the St Albans Sinkhole

On 1st October, a large sinkhole opened up in St Albans, UK, cutting off an entire cul-de-sac of houses. New sinkholes are very common, but this one quickly became international news due to its photogenic proximity to houses. We think of sinkholes as appearing in places like Florida or the Yorkshire Dales. Why did one ...

Running the Charniak-Johnson Parser from Python 2

Although the Python NLTK library contains a number of parsers and grammars, these only support words which are defined in the grammar. This also applies to the supported Probabilistic Context Free Grammars (PCFGs). Therefore, in order to work with a more general parser that can handle unseen words, you have to use a Python wrapper ...

Extracting Body content from a Web Page using .NET

Boilerpipe is a useful library for extracting body content from web pages and discarding the ‘boilerplate’ (menus, footers, advertising, etc.). It is a Java library, so it requires a bridge (e.g. JPype for Python) if you wish to use it in a non-Java environment. Luckily for C# users, Arif Ogan has ported Boilerpipe to C#/Mono. ...

Extracting Body Content from a Web Page

I recently encountered the problem of having to extract the main body content from a series of web pages, and to discard all of the ‘boilerplate’, i.e. header, menus, footer, and advertising. The application was performing statistical comparisons between web pages, and although it was producing the correct answers for my test data, ...

NLTK on the Raspberry Pi

If you haven’t heard of it yet, the Raspberry Pi is a $25/$35 barebones computer intended to excite kids with programming and hardware projects. It is very much modeled on the British experience of home computing in the early 1980s and even has a “Model A” and a “Model B” in homage to the BBC ...

Sentence Segmentation: Handling multiple punctuation characters

Previously, I showed you how to segment words and sentences whilst also taking into account full stops (periods) and abbreviations. The problem with this implementation is that it is easily confused by contiguous punctuation characters. For example “).” is not recognized as the end of a sentence. This article shows you how to correct this.
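
As a rough illustration of the idea (not the code from the article itself), the sketch below treats a whole run of punctuation such as “).” as a single sentence boundary. The `ABBREVIATIONS` set, the `split_sentences` name, and the choice of closing characters are assumptions made for this example.

```python
import re

# Hypothetical abbreviation list, for illustration only.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "e.g.", "i.e.", "etc."}

# Closing brackets and quotes that may surround a sentence terminator.
CLOSERS = r')\]"\'”’'

# A run of punctuation containing at least one of . ! ? and followed by
# whitespace or the end of the text counts as one candidate boundary.
BOUNDARY = re.compile(
    r'[' + CLOSERS + r']*[.!?]+[' + CLOSERS + r'.!?]*(?=\s|$)'
)


def split_sentences(text):
    """Split text into sentences, keeping contiguous punctuation together."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        candidate = text[start:match.end()].strip()
        if not candidate:
            continue
        # Skip boundaries that are really the end of a known abbreviation.
        if candidate.split()[-1].lower() in ABBREVIATIONS:
            continue
        sentences.append(candidate)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    print(split_sentences("He left (at last). She stayed."))
    print(split_sentences("See Dr. Smith now!"))
```

Matching the whole punctuation run in one pattern means the closing bracket stays attached to the sentence it ends, rather than starting the next one.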

Using BerkeleyDB to Create a Large N-gram Table

Previously, I showed you how to create N-Gram frequency tables from large text datasets. Unfortunately, when used on very large datasets such as the English language Wikipedia and Gutenberg corpora, memory constraints restricted these scripts to unigrams. Here, I show you how to use the BerkeleyDB database to create N-gram tables of these large datasets.
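
The general approach can be sketched as follows: keep the counts in an on-disk key-value store instead of an in-memory dictionary. This is only a minimal sketch, and it uses Python's standard-library `dbm` module as a stand-in for BerkeleyDB so that it runs without extra packages; the function names `count_ngrams` and `lookup` are assumptions for this example, not the article's own code.

```python
import dbm
import os
import tempfile


def count_ngrams(tokens, n, db_path):
    """Count n-grams in tokens, storing the counts on disk at db_path.

    Assumption: dbm stands in for BerkeleyDB here; both map byte keys
    to byte values, so the counting logic has the same shape.
    """
    with dbm.open(db_path, "c") as db:
        for i in range(len(tokens) - n + 1):
            key = " ".join(tokens[i:i + n]).encode("utf-8")
            current = int(db[key]) if key in db else 0
            db[key] = str(current + 1).encode("utf-8")


def lookup(db_path, ngram):
    """Return the stored count for an n-gram, or 0 if it was never seen."""
    with dbm.open(db_path, "r") as db:
        key = ngram.encode("utf-8")
        return int(db[key]) if key in db else 0


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "bigrams")
    count_ngrams("the cat sat on the cat".split(), 2, path)
    print(lookup(path, "the cat"))
```

Because every update goes through the database, memory use stays roughly constant as the corpus grows, at the cost of slower counting than an in-memory dictionary.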

Book Review: Foundations of Statistical Natural Language Processing

“Foundations of Statistical Natural Language Processing” by Christopher D. Manning and Hinrich Schütze has a relatively old publication date of 1999, but do not let this deter you from reading this useful book. It continues to be an important foundation text in a fast-moving field.