Geospatial Data Verification

Previously we looked at visual data verification using Python and Pandas. Here we extend this to geospatial verification of the Oklahoma Injection Well Dataset used earlier. Geospatial data can be managed and plotted using GeoPandas, a geospatial extension to Pandas. This comes with some basic basemap data, but you will probably ...

Visual Data Verification

Previously we looked at importing data into Python and its initial verification. Next we shall look at the visual verification of data. We shall use Pandas with Matplotlib to plot a series of graphs to check for erroneous data. We will use NumPy, Matplotlib, and Pandas, all of which can be installed with ...

Handling Unicode Input in Python

We have looked at reading data into Python, but have ignored the issue of character encoding. In English-speaking countries we often assume a text file or string is simple ASCII. More often than not, the file is actually a Unicode file. With Python 2, ignoring this issue would not usually result in any problems ...

Python Data Validation: Date & Time

Regardless of language, handling dates and times is trickier than handling simple numbers and strings. This is because, even within the Gregorian system, there is a wide range of different formats in addition to multiple time zones and daylight saving / summer time corrections. Just to complicate things, the corrections vary according to date, and these ...

Python Data Validation

Python is a good scripting language for data analysis and processing, but are you sure your imported data is valid? As well as import errors, it is possible the data itself contains errors such as values in the wrong field, inconsistent values/fields, and unexpected situations. Immediately after reading the data, you must validate it, and ...

Solving the Six Degrees of Kevin Bacon Problem

This article shows you how to solve the “Six Degrees of Kevin Bacon” game using a mixture of SPARQL and Python. SPARQL is a query language for triple stores that was born out of the ...

Importing Data into Python

Python is a popular tool for data manipulation and processing. In this first post about Python data manipulation and input, we look at a number of different ways to get your data files loaded into Python. Structured non-tabular data typically consists of data records with fields which are not always present, in ...

Running the Charniak-Johnson Parser from Python 2

Although the Python NLTK library contains a number of parsers and grammars, these only support words which are defined in the grammar. This also applies to the supported Probabilistic Context Free Grammars (PCFGs). Therefore, in order to work with a more general parser that can handle unseen words, you have to use a Python wrapper ...

Extracting Body Content from a Web Page

I recently encountered the problem of having to extract the main body content from a series of web pages, and to discard all of the ‘boilerplate’ (i.e. header, menus, footer, and advertising). The application was performing statistical comparisons between web pages, and although it was producing the correct answers for my test data, ...

Using BerkeleyDB to Create a Large N-gram Table

Previously, I showed you how to create N-gram frequency tables from large text datasets. Unfortunately, when used on very large datasets such as the English-language Wikipedia and Gutenberg corpora, memory constraints restricted these scripts to unigrams. Here, I show you how to use the BerkeleyDB database to create N-gram tables of these large datasets.