Importing Data into Python

Python is a popular tool for data manipulation and processing. In this first post about Python data manipulation and input, we look at a number of different ways to get your data files loaded into Python.

Structured non-tabular data

Structured non-tabular data typically consists of data records with fields which are not always present, in addition to dictionaries and often nested records. The two most popular ways of transferring this type of data is either in an XML file or a JSON file.

XML

XML is potentially a very verbose but powerful data format. Data can be read and processed sequentially, or read as one complete data structure in one go. Here we assume the latter, because the intention is to read the data in for further processing.

One of the easiest ways to read an entire XML structure into memory is using the xml.etree.ElementTree package. E.g.

from xml.etree.ElementTree import parse, Element

doc = parse('myfile.xml')

root = doc.getroot()
for child in root:
    print(child.tag, child.attrib)

XML element paths can also be queried.

Further information on xml.etree.ElementTree can be found in the official documentaton. Note: parse() can also read from a urllib. Beware that ElementTree is vulnerable to attack from untrusted data sources. Make sure your data file is trusted first before using this library.

JSON

JSON is a more lightweight data format. It has the same flexibility as XML but lacks the meta and semantic support. JSON files are actually JavaScript data definitions, and have become very popular to transfer data to/from webservices. JSON can be read into a matching Python data structure using the json library:

import json

data = { 'name' : 'Wichita Falls', 'population' : 104000 }

# write json
with open('data.json', 'w') as f:
    json.dump(data, f)

with open('data.json', 'r') as f:
    indata = json.load(f)

print( indata['population'] )

The JSON library supports the None, bool, int, float, string data types; and list, tuple, and dictionary structures.

 

Tabular Data

Tabular data is data that can be represented by a table. The table has rows (records). Each row has a value defined for each column (field). Tabular data can be read from a relational database using that database’s interface (e.g. psycopg2 for Postgres). Popular text file formats include CSV (comma separated value) and fixed width files.

CSV Files

These are text files with one record per line. The first line typically contains column headers. Individual fields have a separator which is a comma for a true CSV file, but other common characters include tab and semicolons. Double quotes are often used to enclose field values – e.g. to include commas in field strings.

The Python CSV library is one of the most convenient ways to read these files:

import csv

with open('sample.csv') as csvfile:
    csvreader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in csvreader:
        print("; ".join(row) )

For tab-separated files, use ‘\t’ for the delimiter. These are the most common formatting options, but the library can also attempt to detect the CSV format.

If you have a single row header then this can be parsed by using a DictReader instead:

import csv

# input CSV file has two columns: name and population

with open('sample.csv') as csvfile:
    csvreader = csv.DictReader(csvfile)
    for row in csvreader:
        print( row['name'], row['population'] )

 

Fixed Width Files

Fixed width files are popular with older mainframe and Fortran applications, and continue to be frequently used by various Government data sources. These data files have one record per row. Each data field as defined by character position and not separator. For example, line chars 20-24 might define a numeric field, but 25 might be a single character field. The best approach to extract these using basic Python is to use string slicing, e.g. st = line[20:24]

 

Using Pandas

Pandas is a relatively new but very popular library for Python. It is intended for data analysis, filling a gap between NumPY / SciPY and specific data processing tools (e.g. for machine learning). As such, it is popularly combined with IPython and Jupyter, as an alternative to R. Full information on Pandas can be found on the Pandas website.

In common with R, Pandas handles data in frames. These frames are “packages” of data, typically in tabular form. Pandas supports a range of different file I/O methods. Here is a CSV example:

import pandas as pd

pd.read_csv('mydata.csv', sep=',', header=0)

This reads a file that uses a comma separator and parses the first row as column headers. Note that read_csv() has a lot of different options and is much more flexible than the standard Python CSV library. For example the sep parameter does not have to be a single character or even a string, it can be a regular expression! For example, sep=’\s+’ will match one or more whitespace characters.

Similarly, Pandas has the read_fwf() function to read fixed width data. For this you then have to provide a list of the column (field) specifications and widths.  Other import methods support Excel, JSON, SAS, and SQL.

 

Next

We have read the data into Python. The next post in this series will look at some validation techniques for the data.