In this series of posts we follow the Kwaggle Python Tutorial for constructing a model to predict deaths on the Titanic. Whereas the Kwaggle tutorial is a crash course on using Python to make and submit a basic model to the competition, my purpose is to go through the tutorial methodically. I will peek behind the curtain to shed light on how Python implements the model.

We followed the tutorial in the first post of the series as it read in a training dataset, and in the second post we built a model to predict survival on the Titanic based on gender. In this third post we will follow the tutorial as it applies the model to a test dataset and writes the output for submission.

The model created in the second post is interpreted in a trivial way, so that all female passengers are predicted to survive and all male passengers are predicted to have died.

As we have done before with the training data, the first step is to read in the test dataset:

import csv

test_file = open('../csv/test.csv', 'rb')
test_file_object = csv.reader(test_file)
header = test_file_object.next()

It is worth comparing this to the code that we used first time around:

csv_file_object = csv.reader(open('../csv/train.csv', 'rb')) 
header = csv_file_object.next()

You will notice that this time round the tutorial splits the opening of the file and the creation of the reader object into two separate steps. In this instance it achieves the same result, a reader object over the open file, but separating the two steps more clearly signposts the element that needs to change in order to read in a different file in future: the path passed to open() when assigning the test_file variable. It also gives us a name, test_file, through which we can close the file later. In Python 3.x replace ‘rb’ with ‘rt’ (the csv documentation also recommends passing newline='' when opening the file).

test_file = open('../csv/test.csv', 'rb')
test_file_object = csv.reader(test_file)
header = test_file_object.next()

The header assignment, like before, reads the first line of the CSV file into the header variable, so that the loop which follows starts at the first row of data. In Python 3.x use the following instead:

header = next(test_file_object)
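To see what skipping the header does, here is a minimal Python 3 sketch. It assumes a made-up two-passenger sample held in memory (the names and IDs are invented for illustration) in place of the real ../csv/test.csv, and it keeps only the first four columns.

```python
import csv
import io

# Hypothetical sample standing in for ../csv/test.csv (invented rows, first four columns only)
sample = io.StringIO(
    "PassengerId,Pclass,Name,Sex\n"
    '1001,3,"Doe, Mr. John",male\n'
    '1002,1,"Roe, Mrs. Jane",female\n'
)

test_file_object = csv.reader(sample)
header = next(test_file_object)   # reads the first line, so the reader moves past it

rows = list(test_file_object)     # only the data rows are left
print(header)
print(rows[0][3])                 # the gender column of the first passenger
```

After the call to next(), the header row is stored in header and the reader yields data rows only.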

When it comes to submitting our CSV file, if the header is not skipped for any reason we will get an error (shown as a screenshot in the original post), as the computer expects to read a number and not a string as its first line!

We now create a new file in order to store our results:

prediction_file = open("genderbasedmodel.csv", "wb")
prediction_file_object = csv.writer(prediction_file)

Like the previous step, this code could have been written with only one variable, but the extra variable helps to identify the element that can be substituted should we wish to reuse the code to create a different file.

prediction_file = open("genderbasedmodel.csv", "wb")
prediction_file_object = csv.writer(prediction_file)

prediction_file is a file object which mediates with a file that is external to Python. ‘wb’ opens it for writing in binary format (in Python 3.x this will need to be changed to ‘wt’). Opening in write mode overwrites any existing data contained in the file; if you wish to retain existing data, open it with ‘a’ instead, which appends the data that is to be written.
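The difference between the write and append modes can be checked with a small Python 3 sketch. The scratch file demo.csv in a temporary directory is an assumption made for the example, so that no real file gets overwritten.

```python
import os
import tempfile

# Hypothetical scratch file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "demo.csv")

with open(path, "w", newline="") as f:
    f.write("first\n")

with open(path, "w", newline="") as f:   # 'w' truncates the file: "first" is lost
    f.write("second\n")

with open(path, "a", newline="") as f:   # 'a' adds to the end of the existing data
    f.write("third\n")

with open(path, newline="") as f:
    contents = f.read()

print(repr(contents))
```

Only the second and third writes survive, because the second open() in write mode emptied the file before writing.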

prediction_file = open("genderbasedmodel.csv", "wb")
prediction_file_object = csv.writer(prediction_file)

csv.writer wraps the file object in a writer object, which formats each row we give it as a line of CSV and writes it to the underlying file object, which in turn mediates access to the external CSV file.
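A quick way to peek at what the writer object does is to give it an in-memory buffer instead of a real file. This Python 3 sketch assumes io.StringIO as a stand-in for prediction_file, and the row values are invented.

```python
import csv
import io

buffer = io.StringIO()        # in-memory stand-in for prediction_file
writer = csv.writer(buffer)   # the writer only needs an object with a write() method

writer.writerow(["PassengerId", "Survived"])
writer.writerow(["1001", "0"])   # invented example row

text = buffer.getvalue()
print(repr(text))
```

Each writerow call joins the items with commas and ends the line with the default line terminator of the csv module, which is a carriage return followed by a newline.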

prediction_file_object.writerow(["PassengerId", "Survived"])
for row in test_file_object:       
    if row[3] == 'female':                                              
        prediction_file_object.writerow([row[0],'1'])    
    else:                                   
        prediction_file_object.writerow([row[0],'0'])    
test_file.close()
prediction_file.close()

The first writerow call creates the headers which the submission needs in order to be accepted by Kwaggle. It also establishes the two columns, passenger ID and prediction, that each row written by the iteration which follows will fill.

prediction_file_object.writerow(["PassengerId", "Survived"])
for row in test_file_object:       
    if row[3] == 'female':                                        
        prediction_file_object.writerow([row[0],'1'])    
    else:                                   
        prediction_file_object.writerow([row[0],'0'])    
test_file.close()
prediction_file.close()

The loop looks at the test.csv file, in which the data is structured in a table. row[3] holds the independent variable: the gender of the particular passenger.

If the gender of the passenger is female, the list passed to prediction_file_object.writerow has as its first item the record’s first item, which is the passenger ID. Its second item is the dependent variable, ‘1’, which predicts survival based on our model.

If the gender of the passenger is not female, the first item is again the passenger ID, and the second item is ‘0’, which predicts non-survival based on our model.

The for loop iterates through all of the records available in test.csv. When the reader runs out of records it raises StopIteration, which the for loop catches behind the scenes, bringing the iteration to an end.
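To make the role of StopIteration visible, the for loop can be spelled out by hand. This is a Python 3 sketch of roughly what the loop does behind the scenes, not how the tutorial writes it; the in-memory sample and its passengers are invented, and the predictions are collected in a list rather than written to a file.

```python
import csv
import io

# Hypothetical sample standing in for ../csv/test.csv (invented rows)
sample = io.StringIO(
    "PassengerId,Pclass,Name,Sex\n"
    '1001,3,"Doe, Mr. John",male\n'
    '1002,1,"Roe, Mrs. Jane",female\n'
)
reader = csv.reader(sample)
next(reader)   # skip the header row

predictions = []
while True:
    try:
        row = next(reader)        # ask the reader for the next record
    except StopIteration:         # raised once the records run out
        break
    survived = "1" if row[3] == "female" else "0"
    predictions.append([row[0], survived])

print(predictions)
```

The for statement in the tutorial does the same thing more compactly: it calls next() on our behalf and stops quietly when StopIteration is raised.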

prediction_file_object.writerow(["PassengerId", "Survived"])
for row in test_file_object:       
    if row[3] == 'female':                                        
        prediction_file_object.writerow([row[0],'1'])    
    else:                                   
        prediction_file_object.writerow([row[0],'0'])    
test_file.close()
prediction_file.close()

Explicitly closing is best practice but is not strictly necessary if we are using the CPython implementation. This is because CPython keeps a reference count of how many variables point to a particular object, and when the count for a file object reaches zero (for example when the script finishes) the object is destroyed and the file is closed. However, not all Python implementations operate in this manner and future versions of CPython may behave differently, so it is a good idea to close the open files explicitly in order to free up system resources and to make sure that everything we have written has actually been flushed to disk.
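One way to get the explicit close without having to remember it is the with statement, which closes the files when its block ends. The tutorial does not use it; the following is a Python 3 sketch of the same gender model, assuming an invented two-row test.csv created in a temporary directory so that the example is self-contained.

```python
import csv
import os
import tempfile

workdir = tempfile.mkdtemp()
test_path = os.path.join(workdir, "test.csv")
out_path = os.path.join(workdir, "genderbasedmodel.csv")

# Invented stand-in for the real test.csv (first four columns only)
with open(test_path, "w", newline="") as f:
    csv.writer(f).writerows([
        ["PassengerId", "Pclass", "Name", "Sex"],
        ["1001", "3", "Doe, Mr. John", "male"],
        ["1002", "1", "Roe, Mrs. Jane", "female"],
    ])

with open(test_path, newline="") as test_file, \
     open(out_path, "w", newline="") as prediction_file:
    reader = csv.reader(test_file)
    writer = csv.writer(prediction_file)
    next(reader)                                   # skip the header row
    writer.writerow(["PassengerId", "Survived"])
    for row in reader:
        writer.writerow([row[0], "1" if row[3] == "female" else "0"])

# Both files were closed when the with block ended
print(test_file.closed, prediction_file.closed)
```

The files are closed even if an error is raised inside the block, which is the main reason to prefer this form over separate close() calls.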

We can then go to the part of the Kwaggle website which allows us to submit our prediction based on gender.

Once submitted, it gives us feedback on our accuracy:

An accuracy of 76%!

As mentioned at the start of this post, the application of the gender model is trivial in that it simply ascribes survival based on gender. Surely some female passengers died and some male passengers survived. In the next post we will follow the tutorial as it makes a more accurate model for predictions.

 
