In this series of posts we follow the Kwaggle Python Tutorial for constructing a model to predict deaths on the Titanic. Whereas the Kwaggle tutorial is a crash course on using Python to make and submit a basic model to the competition, my purpose is to go through the tutorial methodically. I will peek behind the curtain to shed light on how Python implements the model.
We pick up the Kwaggle tutorial from where we left off last time, and use it as an excuse to explore lists, numpy arrays and changing datatypes. We demonstrate how it is possible to mimic numpy array functions using lists, and how it is still probably better just to use numpy arrays. We follow the Kwaggle tutorial in order to select gender as the independent variable to predict survival rates on the Titanic.
Previously we used Python to make a file object in order to access the underlying CSV file. We then created an iterator which read the underlying CSV file line by line, and built a new data object arranged as a list of lists, where each inner list represents an individual record holding 12 categories of passenger information.
We did this with the following code:
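import csv as csv
import numpy as np
import pandas as pd

# 'rb' is the mode used in the tutorial, which targets Python 2
csv_file_object = csv.reader(open('/Users/williamneal/Scratch/Titanic/train.csv', 'rb'))
header = next(csv_file_object)  # the first line holds the column names
data = []  # will hold one list per record
for row in csv_file_object:
    data.append(row)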
I am using Python 3.5. As a consequence I edited the above code, changing the mode in which the underlying CSV file is opened from read-binary (rb) to read-text (rt), because in Python 3 csv.reader expects text rather than bytes.
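A minimal sketch of the text-mode version follows. Since the real train.csv is not reproduced here, a made-up two-record sample held in an io.StringIO object stands in for the opened file; the sample values and the names `sample` and `text_file` are assumptions for illustration only.

```python
import csv
import io

# Hypothetical two-record sample standing in for train.csv (made-up values).
sample = (
    "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
    '1,0,3,"Doe, Mr. A",male,22,1,0,T1,7.25,,S\n'
    '2,1,1,"Roe, Mrs. B",female,38,1,0,T2,71.28,C85,C\n'
)

# io.StringIO behaves like a file opened in text mode.
# With a real file this would be: open(path, 'rt', newline='')
with io.StringIO(sample, newline="") as text_file:
    csv_file_object = csv.reader(text_file)
    header = next(csv_file_object)  # first line holds the column names
    data = []
    for row in csv_file_object:     # every remaining line becomes one record
        data.append(row)

print(header[1], len(data), len(data[0]))
```

Note that the quoted name field contains a comma, yet csv.reader still returns it as a single item, which is why each record has 12 items.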
We can check the length of data by using the len function:
But that only returns the number of records and not the number of items in each record. We can look up the length of an individual record in the list:
Here we randomly chose three indices below 891 (valid indices run from 0 to 890), and as expected each one returned a length of 12, which corresponds to the 12 passenger data categories. Notice that, being a list of lists, data does not form a matrix where we can slice across the different lists (records) it contains.
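The checks described above can be sketched as follows. The three records are a made-up stand-in for the 891 real ones, and the flag `column_slice_failed` is a hypothetical name used only to record the outcome.

```python
# Hypothetical stand-in for the list of lists read from train.csv (made-up values).
data = [
    ["1", "0", "3", "Passenger A", "male", "22", "1", "0", "T1", "7.25", "", "S"],
    ["2", "1", "1", "Passenger B", "female", "38", "1", "0", "T2", "71.28", "C85", "C"],
    ["3", "1", "3", "Passenger C", "female", "26", "0", "0", "T3", "7.92", "", "S"],
]

print(len(data))                       # number of records
print(len(data[0]), len(data[2]))      # number of items in individual records

# A list of lists is not a matrix: slicing across records is not supported.
column_slice_failed = False
try:
    data[0:, 1]
except TypeError as error:
    column_slice_failed = True
    print("cannot slice across records:", error)
```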
There are several ways to approximate this; using nested for-loops is one way:
for row in data:
    for item in row:
The first for-loop iterates through records, the second for-loop iterates through the items of a record. It creates a new list object, ‘deadoralive’, by appending the second item of each record (list and array indices start at ‘0’, so index ‘1’ is the second item in each record). If we wish to return an object that gives us survivors by gender, this is again doable in Python: we would need to create another empty list and iterate through the records for the gender column, before using the function zip() to marry the two lists together.
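A minimal sketch of the list-only approach described above, run on made-up records. The use of enumerate() and the names `gender` and `survival_by_gender` are my own assumptions about how the steps could be written, not necessarily how the tutorial would do it.

```python
# Hypothetical stand-in records (made-up values); index 1 is survival, index 4 is gender.
data = [
    ["1", "0", "3", "Passenger A", "male", "22", "1", "0", "T1", "7.25", "", "S"],
    ["2", "1", "1", "Passenger B", "female", "38", "1", "0", "T2", "71.28", "C85", "C"],
    ["3", "1", "3", "Passenger C", "female", "26", "0", "0", "T3", "7.92", "", "S"],
]

deadoralive = []
for row in data:                        # outer loop: one record at a time
    for index, item in enumerate(row):  # inner loop: one item at a time
        if index == 1:                  # keep only the second item (survival)
            deadoralive.append(item)

# The same idea for the gender column, written as a list comprehension.
gender = [row[4] for row in data]

# zip() pairs the two lists up item by item.
survival_by_gender = list(zip(gender, deadoralive))
print(survival_by_gender)
```

The inner loop is only there to mirror the nested-loop description; `row[1]` would fetch the same item directly.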
The tutorial does not want us to follow this path but wants us to repackage the data into a numpy array. It should be borne in mind that many of the operations that follow can, as shown above, be carried out without numpy arrays. However, numpy arrays are faster, which becomes a greater benefit when working on large datasets. They are faster for two main reasons:
- Locality of reference: Numpy arrays store items of a single data type side by side in one block of memory, which is what allows them to support matrices. Numpy can therefore take a short cut and move directly between adjacent items, or between corresponding items of two arrays, without having to take a detour through the ‘proper’ entrance of each separate Python object, as it would with a list. Furthermore, numpy arrays, unlike lists, have a fixed length and cannot be dynamically extended (numpy makes a new array object rather than modifying the existing one), which further supports optimization.
- Closer to machine code: Because numpy's operations are implemented in compiled C, the looping over items is executed as machine code on the computer's CPU rather than being interpreted step by step by Python. Fewer hoops to jump through.
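The two points above can be illustrated with a small sketch. The values are made up, and the example only shows the behaviour (one vectorised call, fixed length); it does not measure speed.

```python
import numpy as np

values = [1.0, 2.0, 3.0, 4.0]
array = np.array(values)

# With a list, Python itself loops over the items one by one.
doubled_list = [value * 2 for value in values]

# With an array, a single call hands the whole loop over to compiled code.
doubled_array = array * 2

# Arrays have a fixed length: np.append returns a new array
# and leaves the original untouched.
extended = np.append(array, 5.0)

print(array.size, extended.size, extended is array)
print(array.flags['C_CONTIGUOUS'])  # items sit side by side in memory
```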
In our case we are not starting from a blank canvas and so we need to create an array from existing data.
data = np.array(data)
The first parameter requires an array_like object, which simply means an object that numpy can convert into an array: either one that exposes the array interface or a (nested) sequence such as our list of lists.
An object that exposes the array interface must provide as a minimum:
- Shape: A tuple whose elements describe the array's size in each dimension
- Typestr: A string which describes the homogeneous data type of the array
Although np.array() handles it for us, we know from running len() on data that it has 891 records each with 12 items, and therefore that its shape can be described as (891, 12). We also know that csv.reader returns every item as a string, and a string data type can be described by a typestr. np.array() therefore returns a 2D array containing 891*12 items.
We can check this by looking at the shape attribute of the array as well as printing it:
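A minimal sketch of that check, using three made-up records in place of the 891 real ones, so the shape printed here is (3, 12) rather than (891, 12). The name `rows` is an assumption for illustration.

```python
import numpy as np

# Hypothetical stand-in for the list of lists read from train.csv (made-up values).
rows = [
    ["1", "0", "3", "Passenger A", "male", "22", "1", "0", "T1", "7.25", "", "S"],
    ["2", "1", "1", "Passenger B", "female", "38", "1", "0", "T2", "71.28", "C85", "C"],
    ["3", "1", "3", "Passenger C", "female", "26", "0", "0", "T3", "7.92", "", "S"],
]

data = np.array(rows)

print(data.shape)   # (records, items per record)
print(data.size)    # total number of items
print(data.dtype)   # a string data type inferred by numpy

# The resulting array exposes the array interface, including shape and typestr.
interface = data.__array_interface__
print(interface['shape'], interface['typestr'])
```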
There are three objects which make up a numpy array: the array itself, the data-type object and the array scalar.
[Figure from the numpy documentation: the relationship between the array, its data-type object and the array scalar]
The array itself is a collection of items of a specified length (the aforementioned shape), which is linked to the data-type object that determines the data type of every item (the aforementioned typestr). When we access a single item of data, numpy returns a Python object whose type is an array scalar. A returned array scalar has the same methods as an array, because the scalar is internally converted into a 0-dimensional array which then calls the corresponding array method.
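A small sketch of the difference between the array and an array scalar, on a made-up array with three columns only. The names `item` and `column` are assumptions for illustration.

```python
import numpy as np

# Made-up miniature array: id, survival, gender.
data = np.array([["1", "0", "male"],
                 ["2", "1", "female"]])

item = data[0, 1]     # a single item: an array scalar
column = data[:, 1]   # a slice: still an array

print(type(item), isinstance(item, np.generic))  # np.generic is the base of all array scalars
print(item.ndim, item.shape)                     # array attributes are available on the scalar
print(type(column), column.shape)
```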
The tutorial wants us to define a subset of the data, survival by gender, and to build a trivial model to predict deaths on the Titanic by gender. The first step is to work out the survival rate in general:
# convert the survival column (index 1) from strings to floats before counting
number_passengers = np.size(data[0::,1].astype(float))
number_survived = np.sum(data[0::,1].astype(float))
proportion_survivors = number_survived / number_passengers
As we recall, the items in the data array have the string data type, so in order to do numeric calculations we need to convert them to floats. A numpy array has a method, astype(), which when called returns a copy of the array in the desired data type.
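A minimal sketch of the conversion on a made-up survival column. The built-in `float` is passed as the target type, which to my knowledge numpy accepts in every version; the names `survived_text` and `survived_numbers` are assumptions for illustration.

```python
import numpy as np

# Made-up survival column, held as strings just as csv.reader delivers it.
survived_text = np.array(["0", "1", "1", "0"])

# astype() returns a converted copy; the original array is left as strings.
survived_numbers = survived_text.astype(float)

print(survived_text.dtype.kind)   # 'U' means a unicode string type
print(survived_numbers.dtype)
print(np.sum(survived_numbers))
```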
The slice is defined by [start:stop:step, column], and determines the scope of the size and sum functions. Specifying only the start of the slice and the column number tells Python that the scope of the function is the entirety of the second column, which corresponds to the data field survival. It is incumbent upon us to remember which column corresponds to which field, a drawback of using numpy which will be overcome later in the series when the tutorial goes on to look at Pandas.
It should be remembered that the survival column in the original CSV file is binary, i.e. 0’s and 1’s. The size of the slice, which is the number of items holding the 0’s and 1’s, tells us the number of passengers. Summing the survival column gives us the number who survived.
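The slicing and counting can be sketched on four made-up records. `toy_record` is a hypothetical helper that pads each record out to 12 items so that the column indices match the real data; with this toy data the rate is 0.5, not the 38% of the real dataset.

```python
import numpy as np

def toy_record(passenger_id, survived, sex):
    # Hypothetical helper: only survival (index 1) and gender (index 4) matter here.
    return [passenger_id, survived, "3", "Name", sex, "30", "0", "0", "T", "8.05", "", "S"]

data = np.array([
    toy_record("1", "0", "male"),
    toy_record("2", "1", "female"),
    toy_record("3", "1", "female"),
    toy_record("4", "0", "male"),
])

# Leaving out stop and step selects every row, so these two slices are the same.
same_column = np.array_equal(data[0::, 1], data[:, 1])

survival = data[0::, 1].astype(float)
number_passengers = np.size(survival)   # how many 0's and 1's there are
number_survived = np.sum(survival)      # adding the 1's counts the survivors
proportion_survivors = number_survived / number_passengers

print(same_column, number_passengers, number_survived, proportion_survivors)
```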
This gives us a general survival rate of 38%, but the tutorial wants us to see if we can identify an independent variable that can act as a predictor in an unknown dataset. To this end it picks gender, and wishes us to calculate survival rates based on gender categories. The first step is to segregate data from the gender column:
women_only_stats = data[0::,4] == "female"
men_only_stats = data[0::,4] != "female"
Notice that, unlike the previous operation, there is no need to use astype. That is because the code is comparing one string with another: it asks each item in the slice “are you ‘female’?” and “are you not ‘female’?” and returns each answer as a boolean value. Python considers boolean values to be a type of integer, True = 1 and False = 0, but no conversion to or from strings takes place here. A numpy array is a structure that is only able to hold one data type at a time, so the comparison does not alter the string array. Instead it returns a new array of the same length whose data type is bool, populated with True and False values.
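A minimal sketch of the comparison on a made-up gender column (the name `gender` is an assumption for illustration):

```python
import numpy as np

# Made-up gender column.
gender = np.array(["male", "female", "female", "male"])

women_only_stats = gender == "female"   # one True/False answer per passenger
men_only_stats = gender != "female"

print(women_only_stats)
print(women_only_stats.dtype)           # the result has its own data type
print(women_only_stats.sum())           # True counts as 1, False as 0
print(gender.dtype.kind)                # the original array still holds strings
```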
Having established the gender categories, we are now going to use them as masks to select the survival values of the passengers in each category. Since we want to do numerical calculations on the results, we need to convert them into floats:
# the boolean arrays pick out the rows; index 1 picks the survival column
women_onboard = data[women_only_stats,1].astype(float)
men_onboard = data[men_only_stats,1].astype(float)
The survival values of the selected passengers, which were strings, are now converted into floats:
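The selection can be sketched on four made-up records. As before, `toy_record` is a hypothetical helper that pads each record to 12 items so the column indices match the real data.

```python
import numpy as np

def toy_record(passenger_id, survived, sex):
    # Hypothetical helper: only survival (index 1) and gender (index 4) matter here.
    return [passenger_id, survived, "3", "Name", sex, "30", "0", "0", "T", "8.05", "", "S"]

data = np.array([
    toy_record("1", "0", "male"),
    toy_record("2", "1", "female"),
    toy_record("3", "0", "female"),
    toy_record("4", "1", "male"),
])

women_only_stats = data[0::, 4] == "female"
men_only_stats = data[0::, 4] != "female"

# Rows where the mask is True are kept; column 1 is then converted to floats.
women_onboard = data[women_only_stats, 1].astype(float)
men_onboard = data[men_only_stats, 1].astype(float)

print(women_onboard, men_onboard)
```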
We now wish to work out the proportion of survivors by dividing the sum of each category (the number of survivors) by the size of that category (the number of passengers):
proportion_women_survived = np.sum(women_onboard) / np.size(women_onboard)
proportion_men_survived = np.sum(men_onboard) / np.size(men_onboard)
Which gives us the following:
Compared to the general survival rate of 38%, gender is a good predictor of survival. It should be borne in mind that this dataset is a subset of a larger dataset. The purpose of this subset is to train a model in order to make predictions as to the likelihood of survival in the unknown dataset. Assuming that we know the gender of passengers in this unknown dataset, we could use it as a more specific predictor than the general survival rate. We would be in a better position to break down the analysis and say who died and who survived.
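One way this could look in practice is sketched below on made-up data. The rule used here (predict survival when the training survival rate for that gender is above 0.5) and the names `train`, `unknown_gender`, `rates` and `predictions` are my own assumptions for illustration, not something taken from the tutorial.

```python
import numpy as np

def toy_record(passenger_id, survived, sex):
    # Hypothetical helper: only survival (index 1) and gender (index 4) matter here.
    return [passenger_id, survived, "3", "Name", sex, "30", "0", "0", "T", "8.05", "", "S"]

# Made-up training data.
train = np.array([
    toy_record("1", "0", "male"),
    toy_record("2", "1", "female"),
    toy_record("3", "1", "female"),
    toy_record("4", "0", "male"),
    toy_record("5", "0", "female"),
    toy_record("6", "1", "male"),
])

women_onboard = train[train[0::, 4] == "female", 1].astype(float)
men_onboard = train[train[0::, 4] != "female", 1].astype(float)

rates = {
    "female": np.sum(women_onboard) / np.size(women_onboard),
    "male": np.sum(men_onboard) / np.size(men_onboard),
}

# Made-up 'unknown' passengers for whom only the gender is known.
unknown_gender = ["female", "male", "male"]

# Assumed rule: predict survival (1) when that gender's rate is above 0.5.
predictions = [1 if rates[sex] > 0.5 else 0 for sex in unknown_gender]
print(rates, predictions)
```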
Numpy arrays, as well as being faster, are also useful for understanding Pandas, which is a data analysis module for Python. Pandas is built on top of numpy, and many Pandas functions and methods accept numpy arrays as arguments.