Python Data Validation

Python is a good scripting language for data analysis and processing, but are you sure your imported data is valid? As well as import errors, it is possible the data itself contains errors such as values in the wrong field, inconsistent values/fields, and unexpected situations. Immediately after reading the data, you must validate it, and apply corrections if relevant or reject the erroneous records.

A function can be easily defined that checks individual values in a data row. The data row could be passed as a dictionary (field names as keys), an object, a list, or a row from a Pandas frame.

The checks for individual values will depend on what they are and what they represent. For example, the following code checks an unknown value (which might be a string) represents an integer in the range 1…10:

This returns False if the value cannot be represented by an integer in the range 1…10.

Having an entire row’s data available can also enable more sophisticated checks that use more that one data field. For example, if there are start and end dates, you can check that the end data comes after the start date.

Localization

Most of this post is written assuming the data is written with English-style formatting. For example, numbers use “.” for the decimal point and there are no thousands separators.  Data conversion with different locales is handled by the locale module, which uses an ANSI C locale implementation. The current default locale can be queried with locale.getdefaultlocale(), whilst locale.getlocale() and locale.setLocale() can be used to get/set the current locale. The locale settings can be queried using locale.localeconv() which returns a dictionary of the locale’s conventions. Similarly, locale.nl_langinfo() can be used to query different format strings for numbers, dates, etc.

The locale module also implements a number of conversion methods that use the current locale. For example:

Full details are provided in the official locale module documentation.

Regexes

More sophisticated string validation and extraction can be performed using Regular Expressions. Also known as regexes, Regular Expressions provide a search pattern which Python attempts to match against your string. They originate in Stephen Kleene’s Regular Languages which are equivalent to finite automata, and are related to Turing Machines. Regular Expressions were introduced in Unix in the 1970s and are now supported by most modern programming languages. They can be difficult to read, so it is recommended you find an online regex tool (e.g. regex101.com), or download a regex app for your phone or tablet. This will prove invaluable when testing a new regex pattern.

Here is a simple example that matches C-style hexadecimal constant values:

The regex matches a lead 0x followed by one or more hexadecimal characters. Both upper and lower case characters can be used.

Here is a US zipcode example:

In theory regexes could be used for more complex examples such as Canadian and UK postcodes. Credit card numbers and network (IP) addresses are other potential applications.

Comprehensive coverage of regular expressions can be found in Mastering Regular Expressions by Jeffrey EF Friedl (O’Reilly). The theoretical underpinnings of regular languages can be found in Introduction to Automata Theory, Languages, and Computation by Hopcroft et al (Pearson).

 

Zero, False, Empty String, None, NaN, Inf

There are a range of different values and non-values which may appear in Python data input. Sometimes these are due to invalid data, sometimes they are not. For example “None” might be due to missing data in a corrupt record, or it might indicate a valid “not set” value (e.g. an optional comment field). It is important to distinguish between these values, and only convert them blindly to an internal default when appropriate.

Zero (0 or 0.0) is a numeric value, and can be an integer or floating point. Usually it indicates a zero value, but might result as a default when converting from an empty cell (eg. in Excel).

False is a boolean value. Again, it might appear as a default when a boolean value is expected.

Empty String (“”) is a valid string of zero length. This is often read when a string is expected but there is nothing in a text file. It is different to None, but some input code may return an empty string for both values.

None is the Python equivalent of NULL. This literally means no object. Be careful not to confuse it with an empty string. If a string might be None, be sure to check it is not None before performing any other operations on it. Performing a string operation on a None object will usually result in an exception. Pandas uses None to indicate an object (including string) values are ‘not available’.

NaN means “Not a Number”. This is due to a mathematical error such as 0 divide by 0. As it is a part of the IEEE standard, it is possible to read NaN values from a data input. Check a variable for NaN by calling math.isnan(val). Be aware that Pandas also uses NaN values as “not available”

inf indicates positive infinity. -inf indicates negative infinity. As with NaN, these values are typically generated in internal math operations, but are IEEE standards and can be imported from external data. You can check a value for infinity by calling math.isinf(val).

Next

Next we shall look at date and time formatting and validation.

This will be followed by deeper ‘big picture checks on the overall dataset.

Leave a Reply