Handling Unicode Input in Python

We have looked at reading data into Python, but have ignored the issue of character encoding. In English-speaking countries we often assume a text file or string is plain ASCII. More often than not, though, the file is actually Unicode. With Python 2, ignoring this issue would not usually result in any problems ...
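The excerpt above is truncated, but the core idea in Python 3 is to decode bytes explicitly rather than assume ASCII. A minimal sketch (the byte string here is just an illustration):

```python
# Bytes as they might arrive from a file or socket; assuming UTF-8 encoding
raw = b"caf\xc3\xa9 cost \xe2\x82\xac3"

# Decode explicitly instead of assuming ASCII
text = raw.decode("utf-8")
print(text)  # café cost €3
```

When reading files, passing `encoding="utf-8"` to `open()` achieves the same thing without a manual decode step.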

Python Data Validation: Date & Time

Regardless of language, handling dates and times is trickier than handling simple numbers and strings. Even within the Gregorian system there is a wide range of different formats, in addition to multiple time zones and daylight saving / summer time corrections. To complicate things further, the corrections vary according to date, and these ...
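As a small illustration of why formats matter, the same string parses to two different dates depending on which convention you assume (the format strings below are just examples):

```python
from datetime import datetime

s = "03/04/2021"
us = datetime.strptime(s, "%m/%d/%Y")  # US convention: 4 March -> no, March 4th
uk = datetime.strptime(s, "%d/%m/%Y")  # UK convention: 3rd April
print(us.date(), uk.date())  # 2021-03-04 2021-04-03
```

Validating dates therefore means knowing (or detecting) the source's format before parsing, not just checking that `strptime` succeeds.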

Python Data Validation

Python is a good scripting language for data analysis and processing, but are you sure your imported data is valid? Beyond import errors, the data itself may contain errors such as values in the wrong field, inconsistent values or fields, and unexpected situations. Immediately after reading the data, you should validate it, and ...
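A validation pass of the kind described might look like this minimal sketch (the field names and ranges are hypothetical):

```python
def validate_record(rec):
    """Check one imported record; raise ValueError on bad data."""
    # Required field present and non-empty
    if "name" not in rec or not rec["name"].strip():
        raise ValueError("missing or empty field: name")
    # Value in a plausible range
    age = rec.get("age")
    if not isinstance(age, int) or not 0 <= age <= 130:
        raise ValueError("age out of range: %r" % (age,))
    return rec

validate_record({"name": "Ada", "age": 36})  # passes silently
```

Running such checks immediately after import localises the failure to the bad record rather than to some later computation.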

Solving the Six Degrees of Kevin Bacon Problem

This article shows you how to solve the “Six Degrees of Kevin Bacon” game using a mixture of SPARQL and Python. SPARQL is a query language for triple stores that was born out of the Semantic Web effort. A triple is a simple three-part statement or ‘fact’, such as “Australia is a Country”. This ...
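To make the triple idea concrete, a triple store can be modelled in plain Python as a set of (subject, predicate, object) tuples, with SPARQL-style pattern matching reduced to filtering (the facts below are illustrative, not from the article):

```python
# Each fact is a (subject, predicate, object) triple
triples = {
    ("Australia", "is-a", "Country"),
    ("Canberra", "capital-of", "Australia"),
    ("France", "is-a", "Country"),
}

# Rough analogue of: SELECT ?s WHERE { ?s is-a Country }
countries = sorted(s for (s, p, o) in triples if p == "is-a" and o == "Country")
print(countries)  # ['Australia', 'France']
```

A real triple store indexes these patterns so that queries scale far beyond a linear scan.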

Importing Data into Python

Python is a popular tool for data manipulation and processing. In this first post about Python data manipulation and input, we look at a number of different ways to get your data files loaded into Python. Structured non-tabular data typically consists of data records with fields which are not always present, in ...
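For the simplest tabular case, the standard library's csv module is the usual starting point; a minimal sketch using an in-memory file (the column names are made up):

```python
import csv
import io

# In practice this would be open("data.csv", newline=""); an in-memory
# file keeps the example self-contained
data = io.StringIO("name,score\nalice,10\nbob,7\n")
rows = list(csv.DictReader(data))
print(rows[0]["name"], rows[0]["score"])  # alice 10
```

Note that `DictReader` yields every value as a string — converting `score` to an int is exactly the kind of step the validation post above is concerned with.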

New Book: “Artificial Intelligence with Python” by Prateek Joshi

Packt Publishing have just published “Artificial Intelligence with Python” by Prateek Joshi. I was the technical editor. The book serves as a good introduction to a wide range of AI techniques and Python libraries. Due to this breadth of coverage, there isn’t the space for really in-depth discussion of individual techniques, but the book should ...

Running the Charniak-Johnson Parser from Python

Although the Python NLTK library contains a number of parsers and grammars, these only support words which are defined in the grammar. This also applies to the supported Probabilistic Context Free Grammars (PCFGs). Therefore, in order to work with a more general parser that can handle unseen words, you have to use a Python wrapper ...

Extracting Body Content from a Web Page

I recently encountered the problem of having to extract the main body content from a series of web pages, and to discard all of the ‘boilerplate’: headers, menus, footers, and advertising. The application was performing statistical comparisons between web pages, and although it was producing the correct answers for my test data, ...

Using BerkeleyDB to Create a Large N-gram Table

Previously, I showed you how to create N-gram frequency tables from large text datasets. Unfortunately, on very large datasets such as the English-language Wikipedia and Gutenberg corpora, memory limitations restricted these scripts to unigrams. Here, I show you how to use the BerkeleyDB database to create N-gram tables for these large datasets.
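The post uses BerkeleyDB, but the same on-disk counting pattern can be sketched with the standard library's dbm module (keys and values must be bytes; the bigrams below are just examples):

```python
import dbm
import os
import tempfile

# Hypothetical on-disk table; BerkeleyDB bindings follow the same
# key/value pattern, with counts kept on disk rather than in memory
path = os.path.join(tempfile.mkdtemp(), "ngrams")

with dbm.open(path, "c") as db:
    for bigram in [b"new york", b"new york", b"york city"]:
        # Read the old count (default 0), increment, write back
        count = int(db.get(bigram, b"0")) + 1
        db[bigram] = str(count).encode("ascii")

with dbm.open(path, "r") as db:
    print(int(db[b"new york"]))  # 2
```

Because each count lives on disk, the table's size is bounded by disk space rather than RAM, which is what makes bigrams and trigrams feasible on Wikipedia-scale corpora.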

Calculating N-Gram Frequency Tables

The Word Frequency Table scripts can be easily expanded to calculate N-Gram frequency tables. This post explains how.
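The expansion is essentially a sliding window plus a counter; a minimal sketch (the sample sentence is just an illustration):

```python
from collections import Counter

def ngrams(tokens, n):
    # Slide a window of width n across the token list
    return zip(*(tokens[i:] for i in range(n)))

words = "the cat sat on the mat".split()
bigram_counts = Counter(ngrams(words, 2))
print(bigram_counts[("the", "cat")])  # 1
```

Setting `n=1` recovers the original word-frequency table, so the same code covers both cases.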