In this series of posts we follow the Kaggle Python tutorial to construct a model that predicts survival on the Titanic. Whereas the Kaggle tutorial is a crash course on using Python to build and submit a basic model to the competition, my purpose is to work through the tutorial methodically. I will peek behind the curtain to shed light on how Python implements the model.

This post follows the tutorial as far as importing data from a CSV file into Python. In doing so it expounds the concepts of file objects, iterables, iterators, and for loops.

At its most basic, Python programming means creating objects in order to act on other objects. In our case we wish to use Python to import data that has already been provided to us in CSV format, and to perform analysis on that data.

Python has a standard library which contains modules, which in turn contain code instructing Python to do different things. The advantage of being able to import a module is that we do not have to write our own application programming interface (API), i.e. code that tells Python how to talk to CSV files. We can import the csv module and concern ourselves only with supplying the arguments the module needs in order to work.

We will join the Kaggle Python tutorial here…

import csv
import numpy as np
import pandas as pd

csv_file_object = csv.reader(open('/Users/williamneal/Scratch/Titanic/train.csv', 'rt'))
header = next(csv_file_object)   # in Python 2 this was csv_file_object.next(); more on this below

data=[]
for row in csv_file_object:
    data.append(row)

The csv module contains code which tells Python how to read and write CSV files; we are concerned with the part which enables us to read them. The "import" statement tells Python to load the respective modules, and "as" gives each module a nickname in order to avoid having to type out "numpy" each time you want to grab a tool from the numpy toolbox.

csv.reader(open('../csv/train.csv', 'rb'))

The open() function creates a file object which mediates access to the underlying CSV file. The open function effectively asks the operating system for a 'handle' in order to be able to open and access the underlying file. The first argument tells Python where this underlying file can be found. The path that I use reflects the fact that I have downloaded the dataset and stored it locally.
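As a quick illustration (a sketch, assuming the path is valid on your machine), the file object remembers where it came from and how it was opened, and the handle should be given back when we are done:

f = open('/Users/williamneal/Scratch/Titanic/train.csv', 'rt')
print(f.name)   # the path we passed in
print(f.mode)   # the mode we asked for
f.close()       # return the handle to the operating system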

csv.reader(open('../csv/train.csv', 'rb'))

As well as telling Python where the underlying file can be found, the second argument "rb" tells Python what you want to do with that file. Kaggle's Python guide is written for a particular version of Python, version 2.7. The 'r' stands for reading, i.e. we want Python to read the file. Windows (unlike Macs) makes a distinction between text files and binary files. Text files are those files which are meant to be read by us, and binary files are those meant to be understood by the computer; when reading a text file Windows translates line endings unless told otherwise. The 'b' in this context stands for binary and suppresses that translation. Together 'rb' tells Python 2.7 that the underlying CSV file is to be read in binary mode, which is what its csv module expects.

We hit a problem when we try to run the code in Python 3.5, which throws up the following error:

Error: iterator should return strings, not bytes (did you open the file in text mode?)

csv.reader returns its values as strings, and in Python 3 strings are Unicode text, not bytes. Because Python 3.x strictly separates binary data from text, if you wish to follow the Kaggle guide in Python 3.x you will need to modify the code so that the binary 'b' is replaced by the text 't', i.e. 'rt' (plain 'r' would also work, since text mode is the default).
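To see the separation concretely, here is a minimal Python 3.x sketch (assuming train.csv sits in the working directory): in binary mode the file yields bytes, which csv.reader rejects, while in text mode it yields strings. The csv documentation also recommends passing newline='' when opening a file destined for csv.reader:

import csv

with open('train.csv', 'rb') as f:
    print(type(f.readline()))     # <class 'bytes'> -- no good for csv.reader

with open('train.csv', 'rt', newline='') as f:
    reader = csv.reader(f)
    print(type(next(reader)[0]))  # <class 'str'> -- text, as csv.reader expects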

To summarize what has been achieved so far: csv.reader has created an object which gives us access to the underlying file; it does not copy the data. The open() call is itself used as the first argument of the csv.reader function. The file object, that which mediates with the underlying file, has a method called __iter__. csv.reader calls that method, which returns a second object whose job is to read through each line of the underlying CSV file.

csv_file_object = csv.reader(open('../csv/train.csv', 'rb')) 

This second object is called an iterator, and the file object (the first object) is said to be iterable, i.e. able to be iterated over; the iterator turns the iterable inside out, line by line. It is this which creates a representation of the data internal to Python. The iterator does this through possessing a method called next(), which reads the next line of the file and returns it. next() is executed again and again until it has exhausted the underlying CSV file.

You may recall that normally a function is called, processes its argument(s), returns a value, and then dies. What is special about the iterator is that it has the ability to hibernate and can retain state through this period of hibernation, i.e. it can remember whether it has been triggered before. When next() is called, it doesn't start from the beginning of the underlying file each time; it remembers where it left off last time.
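We can watch this hibernation with Python's built-in iter() and next() functions (a small sketch using a plain list instead of a file, but the protocol is the same):

rows = ['header', 'record 1', 'record 2']
it = iter(rows)    # ask the iterable for its iterator (this calls __iter__)
print(next(it))    # 'header' -- starts at the beginning
print(next(it))    # 'record 1' -- remembers where it left off
print(next(it))    # 'record 2'
# a fourth call to next(it) would raise StopIteration: the iterator is exhausted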

The underlying CSV file is structured as follows:

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
...

As you can see, the first line contains the field headers. Each line that follows contains a data entry which conforms to those headers; note that this correspondence is assumed by us as readers and is not intrinsic to the data set itself. From the CSV file's perspective each line is discrete, containing 12 items of data.

The result of csv.reader is a new Python object called csv_file_object. As the iterator works through the file, each line of the CSV becomes a list, and gathering these up will give us a list of lists whose items, as previously mentioned, are all strings.
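This parsing is less trivial than it looks: some Titanic names contain commas, protected by quotation marks. A small sketch (csv.reader accepts any iterable of strings, not just file objects, and the sample line below mirrors the first record of the training set):

import csv

line = ['1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S']
row = next(csv.reader(line))
print(len(row))    # 12 -- the comma inside the quoted name does not split the field
print(row[3])      # Braund, Mr. Owen Harris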

We would like to skip the header row, in order to focus on the data itself and avoid the risk of confusion later when it comes to data retrieval. We therefore call the next() function once on csv_file_object and store the result in a new variable…

header = csv_file_object.next()

Normally the iterator would start from the beginning, but by calling next() once we consume the first line: header now holds the list of column names, and the iterator has moved past it, so subsequent iteration starts with the actual data. Note that the two variables refer to different objects: csv_file_object is the reader itself, while header is just the list that next() returned. Either way, we will lose the rows the iterator produces if we do not create an object to contain the results.

Another difference that comes about through using Python 3.x raises its head. In Python 3.x the place where the next method lives has changed: previously it was a method called next() on the iterator itself, but it has been renamed __next__, and the idiomatic way to invoke it is through the built-in next() function.

Python 3x users should change it to the following:

header = next(csv_file_object)
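Putting the pieces together in Python 3.x (a sketch, again assuming the file is stored locally):

import csv

csv_file_object = csv.reader(open('/Users/williamneal/Scratch/Titanic/train.csv', 'rt'))
header = next(csv_file_object)   # consumes the header row...
print(header[0])                 # 'PassengerId'
row = next(csv_file_object)      # ...so the next call already yields the first data record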

Now we create an object to contain the results of the iteration:

data=[]
for row in csv_file_object:
    data.append(row)

Here we have created a Python object and labeled it "data". This object, a list, will act as a container for the data we wish to use in Python. The empty square brackets indicate that the list starts empty; it is yet to be populated with data.

data=[]
for row in csv_file_object:
    data.append(row)

You will recall that the iterator reads each line of the underlying CSV file in turn; each line is a discrete record, with the items of data within each record separated by commas. Python preserves this relationship between a single record and the items which make it up by making a list containing other lists, each inner list representing one record. Let us see an example… The first record in the underlying CSV file:

1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S

We have skipped the first line using csv_file_object.next() / next(csv_file_object), so the first entry is:

['1', '0', '3', 'Braund, Mr. Owen Harris', 'male', '22', '1', '0', 'A/5 21171', '7.25', '', 'S']

The first record is a list containing 12 items. This record is itself an item contained in another list, made up of all the other records (each containing 12 items), which together form the data set as it dwells within Python.
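Because data is a list of lists, a whole record is retrieved with one index and an item within that record with a second index. A sketch of what we would expect after the loop above:

print(len(data))     # 891 -- the number of records in the training set
print(data[0])       # the first record: a list of 12 strings
print(data[0][3])    # the fourth item of the first record: the passenger's name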

Instead of using a for-loop and appending each result to an initially empty list, we could depart from the tutorial and code a list comprehension instead:

data = [row for row in csv_file_object]

vs…

data=[]
for row in csv_file_object:
    data.append(row)

There is little point to using a list comprehension in this case other than typing less: the small size of the data set limits any optimization benefit, at the price of making the code more opaque. However, it is useful to note that on a larger dataset a list comprehension represents a possible optimization, because the looping machinery runs in C inside the interpreter rather than through repeated Python-level calls to data.append.

data = [row for row in csv_file_object]

The opacity of the list comprehension can be overcome if you read the first "row" as belonging to the new object being created, i.e. "a new row (in the new object named data) for each old row (in the old object named csv_file_object)".
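Incidentally, since csv_file_object is an iterator, the same result can be had even more directly by handing it straight to the list() constructor, which simply exhausts the iterator for us:

data = list(csv_file_object)   # collects every remaining row into a list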


The next post in the series will look at converting data types and manipulating data in NumPy arrays.

 
