# Titanic Walkthrough

> WARNING: I am a bit loopy from post-operative drugs. Hope all this makes sense

### First, a non-Titanic Example

Let's take a look at this car MPG table:

| make | mpg | cylinders | cubic inches | HP | weight | secs 0-60 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|fiat 128| 30 | 4 | 68 | 49 | 1867 | 19.5 |
| chevrolet chevelle malibu | 10 | 8 | 307 | 130 | 3504 | 12 |
| plymouth 'cuda 340 | 15 | 8 | 340 | 160 | 3609 | 8 |
| datsun 1200 | 35 | 4 | 72 | 69 | 1613 | 18 |

We are trying to predict the MPG of these cars in 5 MPG increments. That is, given a new car with 8 cylinders, 400 cubic inches, 175 HP, 4464 pounds, and 0-60 in 11.5 seconds, we are trying to predict its MPG.

Here is the classifier code from chapter 5 slightly modified:


In [26]:
from urllib.request import urlopen


class Classifier:

    def __init__(self, url, normalize=True):

        self.medianAndDeviation = []
        self.normalize = normalize
        # reading the data in from the url

        html = urlopen(url)
        lines = html.read().decode('UTF-8').split('\n')
        self.format = lines[0].strip().split('\t')
        self.data = []
        for line in lines[1:]:
            fields = line.strip().split('\t')
            ignore = []
            vector = []
            for i in range(len(fields)):
                if self.format[i] == 'num':
                    vector.append(float(fields[i]))
                elif self.format[i] == 'comment':
                    ignore.append(fields[i])
                elif self.format[i] == 'class':
                    classification = fields[i]
            self.data.append((classification, vector, ignore))
        self.rawData = list(self.data)
        # get length of instance vector
        self.vlen = len(self.data[0][1])
        # now normalize the data
        if self.normalize:
            for i in range(self.vlen):
                self.normalizeColumn(i)


    ##################################################
    ###
    ###  CODE TO COMPUTE THE MODIFIED STANDARD SCORE

    def getMedian(self, alist):
        """return median of alist"""
        if alist == []:
            return []
        blist = sorted(alist)
        length = len(alist)
        if length % 2 == 1:
            # length of list is odd so return middle element
            return blist[int(((length + 1) / 2) - 1)]
        else:
            # length of list is even so compute midpoint
            v1 = blist[int(length / 2)]
            v2 = blist[(int(length / 2) - 1)]
            return (v1 + v2) / 2.0


    def getAbsoluteStandardDeviation(self, alist, median):
        """given alist and median return absolute standard deviation"""
        sum = 0
        for item in alist:
            sum += abs(item - median)
        return sum / len(alist)


    def normalizeColumn(self, columnNumber):
        """given a column number, normalize that column in self.data"""
        # first extract values to list
        col = [v[1][columnNumber] for v in self.data]
        median = self.getMedian(col)
        asd = self.getAbsoluteStandardDeviation(col, median)
        #print("Median: %f   ASD = %f" % (median, asd))
        self.medianAndDeviation.append((median, asd))
        for v in self.data:
            v[1][columnNumber] = (v[1][columnNumber] - median) / asd


    def normalizeVector(self, v):
        """We have stored the median and asd for each column.
        We now use them to normalize vector v"""
        vector = list(v)
        if self.normalize:
            for i in range(len(vector)):
                (median, asd) = self.medianAndDeviation[i]
                vector[i] = (vector[i] - median) / asd
        return vector


    ###
    ### END NORMALIZATION
    ##################################################


    def manhattan(self, vector1, vector2):
        """Computes the Manhattan distance."""
        return sum(map(lambda v1, v2: abs(v1 - v2), vector1, vector2))


    def nearestNeighbor(self, itemVector):
        """return nearest neighbor to itemVector"""
        return min([(self.manhattan(itemVector, item[1]), item)
                    for item in self.data])

    def classify(self, itemVector):
        """Return class we think item Vector is in"""
        return(self.nearestNeighbor(self.normalizeVector(itemVector))[1][0])
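
To make the normalization step concrete, here is a small standalone sketch of the modified standard score, (value - median) / absolute standard deviation, applied to the weight column of the four cars in the table above. The function names here are my own for illustration and are not part of the `Classifier` class.

```python
def median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2.0


def absolute_standard_deviation(values, med):
    """Return the mean absolute distance of the values from the median."""
    return sum(abs(v - med) for v in values) / len(values)


def modified_standard_scores(values):
    """Return (value - median) / asd for every value in the list."""
    med = median(values)
    asd = absolute_standard_deviation(values, med)
    return [(v - med) / asd for v in values]


# Weights (in pounds) of the four cars in the table above.
weights = [1867, 3504, 3609, 1613]
scores = modified_standard_scores(weights)
for weight, score in zip(weights, scores):
    print("%6d  %7.4f" % (weight, score))
```

After this step every column is on a comparable scale, so a column with large raw numbers (like weight) does not swamp a column with small ones (like cylinders) in the Manhattan distance.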
 



This is our same old nearest neighbor code converted to a classifier. It is short and sweet. Before I wrote the class I wrote the following `unitTest` code:

In [27]:
def unitTest():
    classifier = Classifier('https://raw.githubusercontent.com/zacharski/pg2dm-python/master/data/ch4/athletesTrainingSet.txt')
    br = ('Basketball', [72, 162], ['Brittainey Raven'])
    nl = ('Gymnastics', [61, 76], ['Viktoria Komova'])
    cl = ("Basketball", [74, 190], ['Crystal Langhorne'])
    # first check normalize function
    brNorm = classifier.normalizeVector(br[1])
    nlNorm = classifier.normalizeVector(nl[1])
    clNorm = classifier.normalizeVector(cl[1])
    assert(brNorm == classifier.data[1][1])
    assert(nlNorm == classifier.data[-1][1])
    print('normalizeVector fn OK')
    # check distance
    assert (round(classifier.manhattan(clNorm, classifier.data[1][1]), 5) == 1.16823)
    assert(classifier.manhattan(brNorm, classifier.data[1][1]) == 0)
    assert(classifier.manhattan(nlNorm, classifier.data[-1][1]) == 0)
    print('Manhattan distance fn OK')
    # Brittainey Raven's nearest neighbor should be herself
    result = classifier.nearestNeighbor(brNorm)
    assert(result[1][2] == br[2])
    # Viktoria Komova's nearest neighbor should be herself
    result = classifier.nearestNeighbor(nlNorm)
    assert(result[1][2] == nl[2])
    # Crystal Langhorne's nearest neighbor is Jennifer Lacy
    assert(classifier.nearestNeighbor(clNorm)[1][2][0] == "Jennifer Lacy")
    print("Nearest Neighbor fn OK")
    # Check if classify correctly identifies sports
    assert(classifier.classify(br[1]) == 'Basketball')
    assert(classifier.classify(cl[1]) == 'Basketball')
    assert(classifier.classify(nl[1]) == 'Gymnastics')
    print('Classify fn OK')



This function just checks the methods I wrote to make sure they work as expected.
If you are not familiar with `assert`, consider something like

    x = 3
    assert(x == 5)

The `assert(x == 5)` line checks that x equals five. If it does not, as in this case, Python raises an `AssertionError`, the program terminates, and the traceback shows the assert that failed. It is good practice to write test code before starting to write the actual code. When writing the class, I first wrote `normalizeVector`, then `manhattan`, then `nearestNeighbor`, and finally `classify`, and my `unitTest` follows that order. Let's run it now to make sure the code passes the unit test.

In [28]:
unitTest()

normalizeVector fn OK
Manhattan distance fn OK
Nearest Neighbor fn OK
Classify fn OK


Great. Now we have some confidence that our code works.

## My classifier is better than your classifier

Now, we would like to have a somewhat objective way of saying if one classifier is better than another. One way is to report the accuracy. So if our classifier was correct 90 out of 100 times we would say it is 90% accurate. That makes sense. 
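
As a quick illustration of that definition, here is a minimal sketch of an accuracy calculation. The function name `accuracy` and the two short lists of labels are made up for this example; they are not results from the MPG data.

```python
def accuracy(predicted, actual):
    """Return the percentage of predictions that match the actual classes."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("need at least one instance to score")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return 100.0 * correct / len(actual)


# Made-up labels, only to show the arithmetic: 3 of the 5 match.
predicted = ['15', '15', '20', '25', '35']
actual = ['15', '15', '15', '25', '25']
print("%4.2f%% correct" % accuracy(predicted, actual))
```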

### First try
For our first attempt we will load the data as before. We will call that our training set. That might look like:



| make | mpg | cylinders | cubic inches | HP | weight | secs 0-60 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|fiat 128| 30 | 4 | 68 | 49 | 1867 | 19.5 |
| chevrolet chevelle malibu | 10 | 8 | 307 | 130 | 3504 | 12 |
| plymouth 'cuda 340 | 15 | 8 | 340 | 160 | 3609 | 8 |
| datsun 1200 | 35 | 4 | 72 | 69 | 1613 | 18 |

Next we are going to go through that table again, but now, for each entry, we are going to find its nearest neighbor and use that to predict the MPG. Then we will see if our predicted value matches the actual value. We will just count all those up and compute the accuracy. 
Here's the problem. 

1. we are trying to get a predicted class for fiat 128
2. we find the nearest neighbor for fiat 128. (it will be itself)
3. we see if the mpg of the nearest neighbor matches the actual mpg. (it does)
4. and we are on our way to a wildly optimistic estimate of being 100% accurate.
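
The four steps above can be reproduced in a few lines. This is a standalone sketch that uses the four cars from the table and a plain Manhattan distance without normalization (an instance is always at distance 0 from itself, so normalizing would not change the outcome here). The helper names are my own, not part of the `Classifier` class.

```python
def manhattan(v1, v2):
    """Manhattan distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(v1, v2))


# (make, mpg class, [cylinders, cubic inches, HP, weight, secs 0-60])
cars = [
    ('fiat 128', 30, [4, 68, 49, 1867, 19.5]),
    ('chevrolet chevelle malibu', 10, [8, 307, 130, 3504, 12]),
    ("plymouth 'cuda 340", 15, [8, 340, 160, 3609, 8]),
    ('datsun 1200', 35, [4, 72, 69, 1613, 18]),
]


def nearest(vector, data):
    """Return the entry in data whose vector is closest to vector."""
    return min(data, key=lambda car: manhattan(vector, car[2]))


# "Test" on the very same data we trained on.
correct = 0
for make, mpg, vector in cars:
    neighbor = nearest(vector, cars)
    if neighbor[1] == mpg:
        correct += 1
print("%4.2f%% correct" % (100.0 * correct / len(cars)))
```

Every car finds itself as its nearest neighbor, so the reported accuracy says nothing about how the classifier would do on a car it has never seen.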

Let's see if we can improve on this.

### Try 2
The simplest solution is to divide our data into two parts. One part we will call the training data, and that is what we will load into the classifier. The second part we will call the test data, and that is what we will use to test the classifier. So, if Fiat 128 is in the training data, it will not be in the test data and vice versa.

> NOTE: this division of training and testing data IS NOT THE SAME AS the Titanic files `train.csv` and `test.csv`. What I am talking about here is dividing the `train.csv` data into 2 parts.

I've done that with this simple MPG data set.

* [Here is my training set](https://raw.githubusercontent.com/zacharski/pg2dm-python/master/data/ch4/mpgTrainingSet.txt)
* [And here is my test set](https://raw.githubusercontent.com/zacharski/pg2dm-python/master/data/ch4/mpgTestSet.txt)

I will build my classifier with the data in the training set, and create a little test program that tests the classifier with data from the test set. Here is that code:


In [29]:
from urllib.request import urlopen

def test(training_url, test_url):
    """Test the classifier on a test set of data"""
    classifier = Classifier(training_url)

    html = urlopen(test_url)
    lines = html.read().decode('UTF-8').split('\n')

    numCorrect = 0.0
    numTested = 0
    for line in lines:
        data = line.strip().split('\t')
        #print(data)
        if data != ['']:
            vector = []
            classInColumn = -1
            for i in range(len(classifier.format)):
                if classifier.format[i] == 'num':
                    vector.append(float(data[i]))
                elif classifier.format[i] == 'class':
                    classInColumn = i
            theClass = classifier.classify(vector)
            numTested += 1
            prefix = '-'
            if theClass == data[classInColumn]:
                # it is correct
                numCorrect += 1
                prefix = '+'
            print("%s %12s %s" % (prefix, theClass, line))
    # divide by the number of instances actually tested rather than
    # len(lines), so blank lines in the file do not lower the accuracy
    print("%4.2f%% correct" % (numCorrect * 100 / max(numTested, 1)))

Now let's test how well we do with the MPG data set:


In [30]:
training_url = 'https://raw.githubusercontent.com/zacharski/pg2dm-python/master/data/ch4/mpgTrainingSet.txt'
test_url = 'https://raw.githubusercontent.com/zacharski/pg2dm-python/master/data/ch4/mpgTestSet.txt'
test(training_url, test_url)

+ 15 15	8	390.0	190.0	3850	8.5	amc ambassador dpl
+ 15 15	8	383.0	170.0	3563	10.0	dodge challenger se
+ 15 15	8	340.0	160.0	3609	8.0	plymouth 'cuda 340
- 20 15	8	400.0	150.0	3761	9.5	chevrolet monte carlo
+ 15 15	8	455.0	225.0	3086	10.0	buick estate wagon (sw)
+ 25 25	4	113.0	95.00	2372	15.0	toyota corona mark ii
- 25 20	6	198.0	95.00	2833	15.5	plymouth duster
- 25 20	6	199.0	97.00	2774	15.5	amc hornet
+ 20 20	6	200.0	85.00	2587	16.0	ford maverick
- 35 25	4	97.00	88.00	2130	14.5	datsun pl510
+ 25 25	4	97.00	46.00	1835	20.5	volkswagen 1131 deluxe sedan
+ 25 25	4	110.0	87.00	2672	17.5	peugeot 504
- 35 25	4	107.0	90.00	2430	14.5	audi 100 ls
- 30 25	4	104.0	95.00	2375	17.5	saab 99e
- 20 25	4	121.0	113.0	2234	12.5	bmw 2002
+ 20 20	6	199.0	90.00	2648	15.0	amc gremlin
- 15 10	8	360.0	215.0	4615	14.0	ford f250
- 15 10	8	307.0	200.0	4376	15.0	chevy c20
+ 10 10	8	318.0	210.0	4382	13.5	dodge d200
- 15 10	8	304.0	193.0	4732	18.5	hi 1200d
- 35 25	4	97.00	88.00	2130	14.5	datsun pl510
- 25 30	4	140.0

The '+' means we classified that instance correctly and the '-' means we didn't. So we were about 55% accurate. There were 8 different classes we were trying to predict: 10, 15, 20, 25, 30, 35, 40, and 45. So if we just picked one of them at random each time we would only be 1/8 = 12.5% accurate. So 55% doesn't sound so bad.

Let's see if we can improve on that. Suppose we don't normalize the data. Let's write another test function that does that:

In [31]:
from urllib.request import urlopen

def test(training_url, test_url):
    """Test the classifier on a test set of data"""
    classifier = Classifier(training_url, normalize=False)

    html = urlopen(test_url)
    lines = html.read().decode('UTF-8').split('\n')

    numCorrect = 0.0
    numTested = 0
    for line in lines:
        data = line.strip().split('\t')
        #print(data)
        if data != ['']:
            vector = []
            classInColumn = -1
            for i in range(len(classifier.format)):
                if classifier.format[i] == 'num':
                    vector.append(float(data[i]))
                elif classifier.format[i] == 'class':
                    classInColumn = i
            theClass = classifier.classify(vector)
            numTested += 1
            prefix = '-'
            if theClass == data[classInColumn]:
                # it is correct
                numCorrect += 1
                prefix = '+'
            print("%s %12s %s" % (prefix, theClass, line))
    # divide by the number of instances actually tested rather than
    # len(lines), so blank lines in the file do not lower the accuracy
    print("%4.2f%% correct" % (numCorrect * 100 / max(numTested, 1)))

And run it:

In [32]:
training_url = 'https://raw.githubusercontent.com/zacharski/pg2dm-python/master/data/ch4/mpgTrainingSet.txt'
test_url = 'https://raw.githubusercontent.com/zacharski/pg2dm-python/master/data/ch4/mpgTestSet.txt'
test(training_url, test_url)

+ 15 15	8	390.0	190.0	3850	8.5	amc ambassador dpl
- 20 15	8	383.0	170.0	3563	10.0	dodge challenger se
- 10 15	8	340.0	160.0	3609	8.0	plymouth 'cuda 340
+ 15 15	8	400.0	150.0	3761	9.5	chevrolet monte carlo
+ 15 15	8	455.0	225.0	3086	10.0	buick estate wagon (sw)
- 20 25	4	113.0	95.00	2372	15.0	toyota corona mark ii
+ 20 20	6	198.0	95.00	2833	15.5	plymouth duster
+ 20 20	6	199.0	97.00	2774	15.5	amc hornet
- 25 20	6	200.0	85.00	2587	16.0	ford maverick
- 30 25	4	97.00	88.00	2130	14.5	datsun pl510
- 30 25	4	97.00	46.00	1835	20.5	volkswagen 1131 deluxe sedan
+ 25 25	4	110.0	87.00	2672	17.5	peugeot 504
- 35 25	4	107.0	90.00	2430	14.5	audi 100 ls
- 20 25	4	104.0	95.00	2375	17.5	saab 99e
- 20 25	4	121.0	113.0	2234	12.5	bmw 2002
- 25 20	6	199.0	90.00	2648	15.0	amc gremlin
- 15 10	8	360.0	215.0	4615	14.0	ford f250
- 15 10	8	307.0	200.0	4376	15.0	chevy c20
- 15 10	8	318.0	210.0	4382	13.5	dodge d200
- 15 10	8	304.0	193.0	4732	18.5	hi 1200d
- 30 25	4	97.00	88.00	2130	14.5	datsun pl510
- 25 30	4	140.0

Hmmm. That seemed to make it worse. But now, with this test procedure and our data divided into training and test sets, we can fiddle with which columns to include, or with the weights of the different columns (maybe 0-60 should weigh more heavily than the number of cylinders, for example), and quickly see what improves our accuracy.
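
One way to fiddle with column weights is a weighted Manhattan distance. The sketch below is only an illustration of the idea: `weighted_manhattan` is a hypothetical helper that is not in the `Classifier` class, and the vectors and weights are made-up normalized values, not numbers from the MPG data.

```python
def weighted_manhattan(vector1, vector2, weights):
    """Manhattan distance where each column's difference is scaled by a weight."""
    if not (len(vector1) == len(vector2) == len(weights)):
        raise ValueError("vectors and weights must have the same length")
    return sum(w * abs(a - b) for a, b, w in zip(vector1, vector2, weights))


# Columns: cylinders, cubic inches, HP, weight, secs 0-60 (made-up values).
a = [1.0, 0.5, 0.25, 0.0, -1.0]
b = [0.0, 0.5, 0.75, 0.0, 1.0]

equal_weights = [1, 1, 1, 1, 1]    # same as the plain Manhattan distance
heavier_0_60 = [1, 1, 1, 1, 3]     # 0-60 time counts three times as much
drop_cylinders = [0, 1, 1, 1, 1]   # a weight of 0 ignores a column entirely

print(weighted_manhattan(a, b, equal_weights))
print(weighted_manhattan(a, b, heavier_0_60))
print(weighted_manhattan(a, b, drop_cylinders))
```

A weight of 0 doubles as a way to leave a column out, so one function covers both kinds of tuning mentioned above. Each time you change the weights, rerun the test procedure and compare the accuracy.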

### How good is it?
Now we are done tuning our classifier and the accuracy seems fine. Maybe we would like to submit it to a data mining programming competition, or write a research paper saying how wonderfully accurate it is, or just simply finish the Titanic project. So we need to say how accurate it is. It would be tempting to report the accuracy we just computed that used our test set. But the problem is, we spent a lot of time **tuning** our classifier **on** that test set. Of course it will do well on that test set. This accuracy may not reflect the true accuracy of our classifier on other data. So often a **second test set** is used. And this is what the Titanic `test.csv` file is. In the Titanic test dataset we don't even know the correct classification, so we cannot use it for tuning. But I can use the results you get from running your classifier on the second test set to determine the accuracy of your classifier.
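
If you have to carve out that second test set yourself, a three-way split is one way to do it. This is a rough sketch under my own assumptions: the function name, the 20% fractions, and the fixed random seed are arbitrary choices for illustration. For the Titanic project the final set is already handed to you as `test.csv`, so there you only need the training/tuning part.

```python
import random


def three_way_split(instances, tune_fraction=0.2, final_fraction=0.2, seed=42):
    """Shuffle the instances and split them into (train, tune, final) lists.

    train: loaded into the classifier
    tune:  the test set used while fiddling with columns and weights
    final: held back and scored only once, after tuning is done
    """
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    n_final = int(len(shuffled) * final_fraction)
    n_tune = int(len(shuffled) * tune_fraction)
    final = shuffled[:n_final]
    tune = shuffled[n_final:n_final + n_tune]
    train = shuffled[n_final + n_tune:]
    return train, tune, final


train, tune, final = three_way_split(range(100))
print(len(train), len(tune), len(final))
```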

#### my accuracy went down
When I was in the tuning phase of my classifier I was getting in the low 80% accuracy range using the method I just showed above. When I ran that classifier on the second test set, it was only in the mid 70% accuracy range. 




## How I approached the Titanic Problem
#### 1. I did all the above, used the chapter 4 code, did the unitTest and ran through a few datasets including the MPG dataset. 
#### 2. I massaged both Titanic Data Files.
The original files looked like:

```
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S
```

I converted them to tab separated fields with no quotes:

```
1	0	3	Braund, Mr. Owen Harris	male	22	1	0	A/5 21171	7.25		S	0
2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Thayer)	female	38	1	0	PC 17599	71.2833	C85	C	100
3	1	3	Heikkinen, Miss. Laina	female	26	0	0	STON/O2. 3101282	7.925		S	100
4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35	1	0	113803	53.1	C123	S	100
5	0	3	Allen, Mr. William Henry	male	35	0	0	373450	8.05		S	0
```
I could have handled the original format in my original code but this seemed easier.
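
For what it's worth, the quote and delimiter part of that conversion can be done with Python's standard `csv` module. This is a sketch of one possible way, not necessarily how I produced the files: `csv_to_tsv` is a name made up for this example, it assumes no field contains a tab, and it does not add the extra final column that appears in the converted sample above.

```python
import csv
import io


def csv_to_tsv(csv_text):
    """Convert comma separated text (with quoted fields) to tab separated text."""
    reader = csv.reader(io.StringIO(csv_text, newline=''))
    # csv.reader strips the quotes and keeps commas inside quoted fields
    return '\n'.join('\t'.join(row) for row in reader)


sample = '1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S\n'
print(csv_to_tsv(sample))
```

The name field keeps its comma but loses its quotes, and the empty cabin field survives as two tabs in a row, which is what the tab-splitting code above expects.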


#### 3. I renamed the original Titanic `test.csv` file `unknown.csv` (I didn't really do this at this point but I figured it makes this more understandable)

#### 4. Outside of the original Python code, I divided the `train.csv` Titanic dataset into two files.
About 100 lines of the file I put into a new file called `testing.csv`. The remaining lines I put in a file called `training.csv`. 
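
I did this by hand, but it could also be scripted. The sketch below makes a few assumptions that are mine: the function name `split_lines`, picking the test lines at random, and the fixed seed. It puts the format line only at the top of the training lines, because the `Classifier` above reads the column format from the first line of the training file while `test` treats every line of the test file as data. The rows in the example are placeholders, not Titanic data.

```python
import random


def split_lines(format_line, rows, n_test=100, seed=42):
    """Return (training_lines, testing_lines) for the classifier code above."""
    rows = [row for row in rows if row.strip()]   # drop blank lines
    if n_test >= len(rows):
        raise ValueError("n_test must be smaller than the number of rows")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    testing_lines = shuffled[:n_test]
    # only the training file starts with the format line
    training_lines = [format_line] + shuffled[n_test:]
    return training_lines, testing_lines


# Placeholder rows standing in for the tab separated data.
format_line = 'comment\tclass\tnum'
rows = ['passenger %d\t%d\t%d' % (i, i % 2, 20 + i) for i in range(10)]
training_lines, testing_lines = split_lines(format_line, rows, n_test=3)
print(len(training_lines), len(testing_lines))
```

Writing `'\n'.join(training_lines)` to `training.csv` and `'\n'.join(testing_lines)` to `testing.csv` would then give the two files.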
#### 5. You need to either modify the code above to handle local files (like the code in the book) or put these files you created on a web