Plotting Place Names from Natural Text in Python 3

This example uses Maptitude’s new Python 3 interface to draw annotation on a map. The annotation consists of place names mentioned in H.G. Wells’ War of the Worlds. The example also serves as a basic demonstration of using NLTK (Natural Language Toolkit) to identify named entities (proper nouns) in the book text. Annotation is used merely as a demonstration; it could be argued that the data should instead be created as a point layer, which would allow further processing.

The following code was written using the Spyder environment and Python 3.5. Maptitude’s Python 3 interface is currently in beta testing and will become a standard part of Maptitude 2017.

NLTK 3.0 was also used. NLTK is a powerful toolkit of Python natural language tools intended for education. Personally, I have also found it useful for prototyping natural language processing flows. Further information can be found on the official website at www.nltk.org.

Here is the code:

# -*- coding: utf-8 -*-
"""
Maptitude Demo: Plots locations from HG Wells' War of the Worlds in Maptitude.
Demonstrates the use of the Python3 interface to search for locations and draw annotation.
(C) Copyright Winwaed Software Technology LLC 2016
"""

import caliper3
from win32com.client import Dispatch   # required for compound objects

import nltk, nltk.tag, nltk.chunk

# Extracts named entities (people, places, etc.) from a chunked sentence
def extract_entity_names(t):
    entity_names = []
    if hasattr(t, 'label') and t.label:
        if t.label() == 'NE':
            entity_names.append(' '.join([child[0] for child in t]))
        else:
            for child in t:
                entity_names.extend(extract_entity_names(child))

    return entity_names

# Load the book text into one large string
in_file = "c:\\Data\\howto\\nlp\\war_of_the_worlds.txt"
main_text = ""
with open(in_file, 'r') as f:
    # Read the whole file in one call rather than concatenating line by line
    main_text = f.read()

# convert the text into a list of sentences        
sentences = nltk.sent_tokenize(main_text)

# Tokenize the sentences - ie. split them into words
tokenized_sents = [nltk.word_tokenize(sentence) for sentence in sentences]

# Tag these words with Part-of-Speech tags (noun, verb, etc)
tagged_sents = [nltk.pos_tag(sentence) for sentence in tokenized_sents]

# Create named entity chunks from these tagged words
chunked_sents = nltk.ne_chunk_sents(tagged_sents, binary=True)

# Extract a list of named entities (people, places, etc)
entities = []
for tree in chunked_sents:
    entities.extend(extract_entity_names(tree))


# Start Maptitude, loading an empty UK map (wow.map)
map_file = "c:\\Data\\howto\\nlp\\wow.map"
dk = caliper3.Gisdk("Maptitude")
map_name = dk.OpenMap(map_file, {"Auto Project" : "True" } )
window_spec = "Map|" + map_name

# Delete any existing annotations
dk.DropAnnotations(window_spec , dk.GetAnnotations(window_spec) )
dk.RedrawMap(map_name)

# We set the zoom window for each annotation, to avoid strange rotations due to the projection
# Save the original scope, so we can restore it at the end
original_scope = dk.GetMapScope(map_name)

# Create some useful colors plus fill and line styles for the labels
font_color = dk.ColorRGB(0,0,0)
sym_color = dk.ColorRGB(65535,0,0)
frame_fill = dk.ColorRGB(65535,65535,65535)

str1 = "XXXXXXXX"
solid_fill = dk.FillStyle( [str1, str1, str1, str1, str1, str1, str1, str1] )
solid_line = dk.LineStyle( [[[0,-1,0]]] )

# Create the Maptitude Data.Finder object - this is used to geocode the locations
finder = dk.CreateGisdkObject("gis_ui","Data.Finder",None)
finder.SetRegion()

# This is a list of entities to skip. The initial values include false positives, etc.
# E.g. people's names that the geocoder matches with places; or places outside of the UK that get false matches
# Many of the chapter identifiers are tagged as named entities, so we exclude these.
# We will add named entities to this list as they are searched, in order to avoid duplicate searches and annotations
entities_to_skip = ["martian","ogilvy","philips","one","two","three","four","five","six","seven","eight","nine","ten","eleven","twelve",
                    "mars","martians","forty","ugh","moscow","france","zodiac", "yonder","none","dim","thick","vast","life","paris",
                    "venus","stent","champagne","book","earth","look", "cardigan"]

# Loop over the named entities, attempting to find them                    
for ne in entities:
    nel = ne.lower()
    if nel not in entities_to_skip:
        entities_to_skip.append(nel)  # no need to process the named entity twice

        found = False

        # First try to search for the named entity as a London Landmark
        result = dict( finder.Find("LANDMARK", [ne, "London"]) )

        # Error and Message are unreliable - check to see if there are any scopes (ie. coords) returned
        if "Scopes" in result:
            list_scopes = result["Scopes"]
            if len(list_scopes) == 1:
                print(ne+": FOUND (Location): "+str(result["Score"]) )
                found = True

        # If the landmark was not found, search for the named entity as a UK town or city
        if not found:
            result = dict( finder.Find("CITY", ne) )
            if "Scopes" in result:
                list_scopes = result["Scopes"]
                if len(list_scopes) == 1:
                    print(ne+": FOUND (City): "+str(result["Score"]) )
                    found = True
        if found:
            this_scope = Dispatch(list_scopes[0])    # casting required due to a type mismatch on the scope return
            dk.SetMapScope(map_name, this_scope) # zoom in, so we don't get any weird map projection rotation effects
            c = this_scope.Center
            # Write a marker at the scope's center, and add a text label next to it
            dk.AddAnnotation(window_spec,"Font Character",{ "Index": 53, "Location": c,
                                             "Font": "Caliper Cartographic|Bold|14", 
                                             "Color" : sym_color } )
            dk.AddAnnotation(window_spec,"Text",{ "Text": ne, "Location": [c,'NE'],
                                             "Font": "Arial|Bold|12", "Angle": 60.0,
                                             "Color" : font_color, "Framed" : "True", "Frame Type" : "Rectangle",
                                             "Frame Fill Color" : frame_fill, "Frame Border Color" : font_color,
                                             "Frame Fill Style" : solid_fill, "Frame Border Style" : solid_line,
                                             "Frame Border Width": 1 } )
        else:
            print("Not found: "+ne)


# reset the map's scope
dk.SetMapScope(map_name, original_scope)

# Save the map, and close Maptitude
dk.SaveMap( None, map_file)
dk.CloseMap(None)
dk = None

print("Finished")

The script requires an existing Maptitude UK map (wow.map) and the War of the Worlds text (war_of_the_worlds.txt). The text can be downloaded from Project Gutenberg.

Most of the code should be self-explanatory from the comments. Because the interface is still in beta testing, there is an issue with the casting of compound objects returned by Maptitude. The workaround is to use Dispatch() to cast the Maptitude scope object into a usable form (line 112 of the listing, where this_scope is assigned).

The text is processed by splitting it into sentences and then into words. The words are tagged with ‘part of speech’ tags (e.g. noun, verb, etc.) and then chunked into named entities (i.e. proper noun phrases). These are extracted and passed individually to Maptitude for geocoding. The geocoding first attempts to find the proper noun as a landmark within London. If this fails, it then tries to find it as a town or city in the UK. Some named entities are filtered out before geocoding. These include people’s names such as Ogilvy (“the chances of anything coming from Mars are a million to one”), international locations (Moscow, France), and other false positives (e.g. chapter numbers). For simplicity, all of these natural language steps use existing NLTK models. For a more robust solution, many of these false positives could be removed with better, larger training data.
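The filtering and two-stage lookup described above can be sketched without Maptitude. This is only an illustration under stated assumptions: find_landmark, find_city, and the two small lookup sets are hypothetical stand-ins for the finder.Find("LANDMARK", ...) and finder.Find("CITY", ...) calls in the script, and they do not reflect what Maptitude's geocoder actually returns.

```python
# Hypothetical stand-ins for the two Maptitude searches. A real run would
# call finder.Find("LANDMARK", ...) and finder.Find("CITY", ...) instead.
LONDON_LANDMARKS = {"waterloo bridge"}
UK_TOWNS = {"woking", "horsell", "weybridge"}


def find_landmark(name):
    """Pretend landmark search: True if the name is a known London landmark."""
    return name.lower() in LONDON_LANDMARKS


def find_city(name):
    """Pretend town/city search: True if the name is a known UK town."""
    return name.lower() in UK_TOWNS


def classify_entities(entities, skip):
    """Return (name, kind) pairs; kind is 'landmark', 'city' or None."""
    seen = {s.lower() for s in skip}  # a copy, so the caller's list is untouched
    results = []
    for name in entities:
        key = name.lower()
        if key in seen:
            continue  # filtered out, or already processed earlier
        seen.add(key)
        if find_landmark(name):      # first choice: London landmark
            results.append((name, "landmark"))
        elif find_city(name):        # fallback: UK town or city
            results.append((name, "city"))
        else:
            results.append((name, None))
    return results


sample = ["Woking", "Ogilvy", "Waterloo Bridge", "Woking", "Mars", "Byfleet"]
for name, kind in classify_entities(sample, skip=["ogilvy", "mars"]):
    print(name, "->", kind or "not found")
```

Using a set for the skip list keeps each membership test cheap, and adding every processed name to it gives the same "search each entity only once" behaviour as appending to entities_to_skip in the script.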

Here is the resulting UK map:

The results on a UK map. Most locations are accurate, but a few are misplaced. For example, the Strand and Hyde Park are geocoded to places elsewhere in the UK that share these names, whereas the book refers to the locations in London.

The script does a reasonably good job, although some locations are obviously wrong. For example, Strand (a street) and Hyde Park (a well-known park) were not located as landmarks within London, so the town/city geocoding step located them elsewhere in the UK.

Here is the map zoomed into the London area:

Zoomed into the London area. Note that most of the locations are in central London and to the south-west, as expected from the plot.

Zooming in, the placements look even better. Notice that Horsell is located, but not Horsell Common (where the first cylinder lands). The plot starts in the Horsell area and generally moves towards central London, as seen in this map.

This example demonstrates the use of Maptitude with Python 3. It also demonstrates that powerful applications can be created using third-party libraries such as NLTK.