Handling Unicode Input in Python

We have looked at reading data into Python, but have ignored the issue of character encoding. In English-speaking countries we often assume a text file or string is simple ASCII, but more often than not the file is actually Unicode. With Python 2, ignoring this issue would not usually cause any problems until you tried to read a non-ASCII character such as an accented letter. Python 3 makes the conversions explicit, so you will hit problems at development time rather than being surprised by strange failures months or years later.

Unicode

The original ASCII character set was limited to 7-bit characters, with unprintable (control) codes 0..31 (plus 127) and printable characters 32..126. With the widespread use of PCs in the 1980s, a range of different character encodings were used for codes 128..255. The best known was the IBM “OEM” set (code page 437), which included accented, box-drawing, and mathematical characters. Other languages used other, incompatible character sets which could not be interchanged.

This reached a breaking point with the widespread adoption of the Internet, but luckily Unicode had been invented by this time. Each Unicode character is assigned a number known as a code point; how many bytes it occupies depends on the encoding used. For example, the English letter ‘A’ has the Unicode code point U+0041. Unicode is big enough to support over a million code points, although only around 100,000 have currently been assigned. These include many historic and specialist scripts, as well as the ever-growing collection of emoji which have invaded our phones.
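
A quick way to see the relationship between characters and code points is with the built-in ord() and chr() functions:

print(hex(ord('A')))   # 0x41 - the code point of 'A' is U+0041
print(chr(0x00c5))     # Å
print('\u00c5')        # Å - the same character written as an escape sequence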

There are several different Unicode encodings. The most common is known as “UTF-8”. Every code point in the range 0..127 is stored as a single byte; code points above 127 are stored as a sequence of two to four bytes, each with its high bit set (the original design allowed up to six bytes, but the standard now limits a sequence to four). UTF-8 is compact, is backwards compatible with ‘classic’ ASCII, and supports all Unicode code points present and future.
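
For example, encoding a string as UTF-8 shows how the number of bytes grows with the code point:

print('A'.encode('utf-8'))       # b'A' - one byte
print('\u00e9'.encode('utf-8'))  # b'\xc3\xa9' - 'é' takes two bytes
print('\u20ac'.encode('utf-8'))  # b'\xe2\x82\xac' - '€' takes three bytes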

There are a lot of character encodings in widespread use, and most can only encode a subset of code points. For example, “Windows-1252” (the 1990s-era Windows character set) and ISO-8859-1 (aka ‘Latin-1’, used by many databases) can both encode most Western European languages, but neither can encode Cyrillic or Hebrew.
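
For example, Latin-1 can encode a French accented word but not a Cyrillic one:

print('café'.encode('latin-1'))      # b'caf\xe9'
try:
    'кофе'.encode('latin-1')
except UnicodeEncodeError as exc:
    print(exc)                       # 'latin-1' codec can't encode characters...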

Python 3 and Unicode

The Python 3 ‘str’ data type is defined as a Unicode data type. Therefore, once you have the data correctly loaded into Python, it can easily handle any Unicode text you have.

In addition to the ‘str’ data type, Python supports the ‘bytes’ type for raw byte values. This can be useful for loading raw byte data and then explicitly converting it to Unicode text using a specific encoding. For example:

# decode raw bytes into a str, using a known encoding
if isinstance(raw_data, bytes):
    str_value = raw_data.decode('utf-8')

The default encoding for file reading will depend on your locale, but is usually UTF-8. You can specify a different encoding in the call to open(), for example:

open('datafile.txt', 'r', encoding='latin-1').read()

Note: reading will raise a UnicodeDecodeError for byte sequences that are not valid in the specified encoding. (Latin-1 happens to map all 256 byte values, so it never fails, but stricter encodings such as UTF-8 will reject invalid input.)
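
If a file may contain the occasional bad byte, the errors parameter of open() controls how such bytes are handled (this sketch assumes the same datafile.txt as above):

# replace undecodable bytes with U+FFFD instead of raising an error
text = open('datafile.txt', 'r', encoding='utf-8', errors='replace').read()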

Pandas has similar parameters for setting the encoding when reading from a text file (for example, the encoding argument of read_csv()). You can also explicitly decode a byte-string Series using the Series.str.decode() method.
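
As a sketch (the file name and column name here are only placeholders), the Pandas equivalents look like this:

import pandas as pd

df = pd.read_csv('datafile.csv', encoding='latin-1')  # decode while reading
# assuming the 'name' column was loaded as raw bytes, decode it explicitly
df['name'] = df['name'].str.decode('utf-8')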

Ideally you should know the encoding of the file or data. In the case of a web response, the encoding is usually given in the Content-Type HTTP header, or in a <meta> tag within the HTML itself.
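
For example, with the third-party requests library the declared encoding is available on the response object (the URL here is just a placeholder):

import requests

resp = requests.get('https://example.com')
print(resp.headers.get('Content-Type'))  # e.g. 'text/html; charset=UTF-8'
print(resp.encoding)                     # encoding declared in the header
print(resp.apparent_encoding)            # encoding guessed from the raw bytes
text = resp.text                         # str decoded using resp.encoding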

It is impossible to automatically detect the encoding in all cases, but it is usually possible to make some intelligent guesses. For example, if all bytes are within the range 0..127, then a traditional ASCII encoding can be assumed. Third-party libraries that attempt to detect the encoding include UnicodeDammit (part of Beautiful Soup) and chardet, the Universal Character Encoding Detector.
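
As a sketch (again assuming a datafile.txt of unknown encoding), chardet reports a best guess plus a confidence level, while UnicodeDammit returns the decoded text directly:

import chardet
from bs4 import UnicodeDammit

raw = open('datafile.txt', 'rb').read()    # read raw bytes, no decoding

print(chardet.detect(raw))                 # e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}

dammit = UnicodeDammit(raw)
print(dammit.original_encoding)            # the encoding UnicodeDammit settled on
text = dammit.unicode_markup               # the decoded str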

Normalizing

With the large number of character sets and symbols, Unicode often has multiple code points that represent the same character. For example, the ‘Å’ (U+00C5) character has two other representations: U+212B (the Angstrom sign) and U+0041 followed by U+030A (Latin ‘A’ with a combining ring above). A process of normalization converts code points to canonical equivalents. This is fully discussed in the official Unicode Normalization FAQ, and there are a number of normalization forms in the Unicode standard.

It is a good idea to normalize Unicode input before trying to process it further and perform comparisons. This can be performed using the unicodedata library:

import unicodedata

# two visually identical strings built from different code points
str1 = '400 \u00c5'    # U+00C5, Latin capital A with ring above
str2 = '400 \u212b'    # U+212B, Angstrom sign
norm1 = unicodedata.normalize('NFC', str1)
norm2 = unicodedata.normalize('NFC', str2)
print(str1 == str2)    # False
print(norm1 == norm2)  # True

As well as NFC, unicodedata supports the NFD, NFKC, and NFKD normalization forms. The unicodedata module also contains utility functions for testing code points against the different character classes (digit, etc.). See the unicodedata documentation for further details.
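
For example, the compatibility forms (NFKC/NFKD) also fold ligatures and similar compatibility characters into their plain equivalents, and the utility functions report what kind of character a code point is:

import unicodedata

print(unicodedata.normalize('NFKC', '\ufb01ve'))  # 'five' - the 'fi' ligature is expanded
print(unicodedata.category('5'))                  # 'Nd' - decimal digit
print(unicodedata.name('\u00c5'))                 # 'LATIN CAPITAL LETTER A WITH RING ABOVE'
print(unicodedata.digit('\u0664'))                # 4 - Arabic-Indic digit four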


Using Regexes with Unicode Strings

The Python standard re regex library is aware of some Unicode character classes. For example, ‘\d’ matches any Unicode decimal digit, not just the ASCII ‘0’-‘9’ (code points U+0030 to U+0039). It can, though, have problems with areas such as upper/lower case handling and case-insensitive comparisons. Normalizing (see above) can help, and the third-party regex library on PyPI can be a useful replacement.
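
For example, ‘\d’ already matches non-ASCII digits, while case-insensitive comparison is often safer done with str.casefold() after normalizing:

import re

# \d matches any Unicode decimal digit, not just ASCII 0-9
print(re.findall(r'\d', 'ASCII 42 and Arabic-Indic \u0664\u0662'))  # ['4', '2', '٤', '٢']

# full case folding handles cases that simple lower() misses, such as the German ß
print('Straße'.casefold() == 'STRASSE'.casefold())  # True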

Next

Next we shall look at using graphical methods for quick, large-scale data validation.