{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Titanic Walkthough\n", "\n", "> WARNING: I am a bit loopy from post-operative drugs. Hope all this makes sense\n", "\n", "### First, a non-Titanic Example\n", "\n", "Let's take a look at this car MPG table:\n", "\n", "| make | mpg | cylinders | cubic inches | HP | weight | secs 0-60 |\n", "| :---: | :---: | :---: | :---: | :---: | :---: | :---: |\n", "|fiat 128| 30 | 4 | 68 | 49 | 1867 | 19.5 |\n", "| chevrolet chevelle malibu | 10 | 8 | 307 | 130 | 3504 | 12 |\n", "| plymouth 'cuda 340 | 15 | 8 | 340 | 160 | 3609 | 8 |\n", "| datsun 1200 | 35 | 4 | 72 | 69 | 1613 | 18 |\n", "\n", "and we are trying to predict the MPG in 5 MPG increments of these cars. That is, given a new car with 8 cylinders, 400c.i., 175 HP, 4464 pounds and 0-60 in 11.5 seconds we are trying to predict its MPG.\n", "\n", "Here is the classifier code from chapter 5 slightly modified:\n" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from urllib.request import urlopen \n", "\n", "class Classifier:\n", "\n", " def __init__(self, url, normalize=True):\n", "\n", " self.medianAndDeviation = []\n", " self.normalize = normalize\n", " # reading the data in from the url\n", " \n", " html = urlopen(url)\n", " lines = html.read().decode('UTF-8').split('\\n')\n", " self.format = lines[0].strip().split('\\t')\n", " self.data = []\n", " for line in lines[1:]:\n", " fields = line.strip().split('\\t')\n", " ignore = []\n", " vector = []\n", " for i in range(len(fields)):\n", " if self.format[i] == 'num':\n", " vector.append(float(fields[i]))\n", " elif self.format[i] == 'comment':\n", " ignore.append(fields[i])\n", " elif self.format[i] == 'class':\n", " classification = fields[i]\n", " self.data.append((classification, vector, ignore))\n", " self.rawData = list(self.data)\n", " # get length of instance vector\n", " self.vlen = len(self.data[0][1])\n", " # now normalize the data\n", " if self.normalize:\n", " for i in range(self.vlen):\n", " self.normalizeColumn(i)\n", "\n", "\n", " \n", " \n", " ##################################################\n", " ###\n", " ### CODE TO COMPUTE THE MODIFIED STANDARD SCORE\n", "\n", " def getMedian(self, alist):\n", " \"\"\"return median of alist\"\"\"\n", " if alist == []:\n", " return []\n", " blist = sorted(alist)\n", " length = len(alist)\n", " if length % 2 == 1:\n", " # length of list is odd so return middle element\n", " return blist[int(((length + 1) / 2) - 1)]\n", " else:\n", " # length of list is even so compute midpoint\n", " v1 = blist[int(length / 2)]\n", " v2 =blist[(int(length / 2) - 1)]\n", " return (v1 + v2) / 2.0\n", " \n", "\n", " def getAbsoluteStandardDeviation(self, alist, median):\n", " \"\"\"given alist and median return absolute standard deviation\"\"\"\n", " sum = 0\n", " for item in alist:\n", " sum += abs(item - median)\n", " return sum / len(alist)\n", "\n", "\n", " def normalizeColumn(self, columnNumber):\n", " \"\"\"given a column number, normalize that column in self.data\"\"\"\n", " # first extract values to list\n", " col = [v[1][columnNumber] for v in self.data]\n", " median = self.getMedian(col)\n", " asd = self.getAbsoluteStandardDeviation(col, median)\n", " #print(\"Median: %f ASD = %f\" % (median, asd))\n", " self.medianAndDeviation.append((median, asd))\n", " for v in self.data:\n", " v[1][columnNumber] = (v[1][columnNumber] - median) / asd\n", "\n", "\n", " def normalizeVector(self, v):\n", " \"\"\"We have stored the median and asd for each column.\n", " We now use them to normalize vector v\"\"\"\n", " vector = list(v)\n", " if self.normalize:\n", " for i in range(len(vector)):\n", " (median, asd) = self.medianAndDeviation[i]\n", " vector[i] = (vector[i] - median) / asd\n", " return vector\n", "\n", " \n", " ###\n", " ### END NORMALIZATION\n", " ##################################################\n", "\n", "\n", "\n", " def manhattan(self, vector1, vector2):\n", " \"\"\"Computes the Manhattan distance.\"\"\"\n", " return sum(map(lambda v1, v2: abs(v1 - v2), vector1, vector2))\n", "\n", "\n", " def nearestNeighbor(self, itemVector):\n", " \"\"\"return nearest neighbor to itemVector\"\"\"\n", " return min([ (self.manhattan(itemVector, item[1]), item)\n", " for item in self.data])\n", " \n", " def classify(self, itemVector):\n", " \"\"\"Return class we think item Vector is in\"\"\"\n", " return(self.nearestNeighbor(self.normalizeVector(itemVector))[1][0])\n", " \n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is our same old nearest neighbor code converted to a classifier. It is short and sweet. Before I wrote the class I wrote the following `unitTest` code:. " ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def unitTest():\n", " classifier = Classifier('https://raw.githubusercontent.com/zacharski/pg2dm-python/master/data/ch4/athletesTrainingSet.txt')\n", " br = ('Basketball', [72, 162], ['Brittainey Raven'])\n", " nl = ('Gymnastics', [61, 76], ['Viktoria Komova'])\n", " cl = (\"Basketball\", [74, 190], ['Crystal Langhorne'])\n", " # first check normalize function\n", " brNorm = classifier.normalizeVector(br[1])\n", " nlNorm = classifier.normalizeVector(nl[1])\n", " clNorm = classifier.normalizeVector(cl[1])\n", " assert(brNorm == classifier.data[1][1])\n", " assert(nlNorm == classifier.data[-1][1])\n", " print('normalizeVector fn OK')\n", " # check distance\n", " assert (round(classifier.manhattan(clNorm, classifier.data[1][1]), 5) == 1.16823)\n", " assert(classifier.manhattan(brNorm, classifier.data[1][1]) == 0)\n", " assert(classifier.manhattan(nlNorm, classifier.data[-1][1]) == 0)\n", " print('Manhattan distance fn OK')\n", " # Brittainey Raven's nearest neighbor should be herself\n", " result = classifier.nearestNeighbor(brNorm)\n", " assert(result[1][2]== br[2])\n", " # Nastia Liukin's nearest neighbor should be herself\n", " result = classifier.nearestNeighbor(nlNorm)\n", " assert(result[1][2]== nl[2])\n", " # Crystal Langhorne's nearest neighbor is Jennifer Lacy\"\n", " assert(classifier.nearestNeighbor(clNorm)[1][2][0] == \"Jennifer Lacy\")\n", " print(\"Nearest Neighbor fn OK\")\n", " # Check if classify correctly identifies sports\n", " assert(classifier.classify(br[1]) == 'Basketball')\n", " assert(classifier.classify(cl[1]) == 'Basketball')\n", " assert(classifier.classify(nl[1]) == 'Gymnastics')\n", " print('Classify fn OK')\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This method just checks the other methods that I write to make sure they work as expected.\n", "If you are not familiar with `assert`, something like\n", "\n", " x = 3\n", " assert(x == 5)\n", "\n", "so that `assert(x == 5)` line checks to make sure x equals five. If not, as in this case, the program terminates and prints out the assert that fails. It is good practice to write test code before starting to write the actual code. In my writing of the class, I first wrote `normalizeVector`, then `manhattan`, then `nearestNeighbor` and finally `classify` and my unitTest matches that order. Let's run it now to make sure the code passes the unit test." ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "normalizeVector fn OK\n", "Manhattan distance fn OK\n", "Nearest Neighbor fn OK\n", "Classify fn OK\n" ] } ], "source": [ "unitTest()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Great. Now we have some confidence that our code works.\n", "\n", "## My classifier is better than your classifer\n", "\n", "Now, we would like to have a somewhat objective way of saying if one classifier is better than another. One way is to report the accuracy. So if our classifier was correct 90 out of 100 times we would say it is 90% accurate. That makes sense. \n", "\n", "### First try\n", "For our first attempt we will load the data as before. We will call that our training set. That might look like:\n", "\n", "\n", "\n", "| make | mpg | cylinders | cubic inches | HP | weight | secs 0-60 |\n", "| :---: | :---: | :---: | :---: | :---: | :---: | :---: |\n", "|fiat 128| 30 | 4 | 68 | 49 | 1867 | 19.5 |\n", "| chevrolet chevelle malibu | 10 | 8 | 307 | 130 | 3504 | 12 |\n", "| plymouth 'cuda 340 | 15 | 8 | 340 | 160 | 3609 | 8 |\n", "| datsun 1200 | 35 | 4 | 72 | 69 | 1613 | 18 |\n", "\n", "Next we are going to go through that table again, but now, for each entry, we are going to find its nearest neighbor and use that to predict the MPG. Then we will see if our predicted value matches the actual value. We will just count all those up and compute the accuracy. \n", "Here's the problem. \n", "\n", "1. we are trying to get a predicted class for fiat 128\n", "2. we find the nearest neighbor for fiat 128. (it will be itself)\n", "3. we see if the mpg of the nearest neighbor matches the actual mpg. (it does)\n", "4. and we are on our way to a wildly optimistic estimate of being 100% accurate.\n", "\n", "Let's see if we can improve on this\n", "\n", "### Try 2\n", "The simpliest solution is to divide our data into two parts. One part, we will call the training data and that is what we will use to load into the classifier. The second part, we will call the test data, and that is what we will use to test the classifier. So, if Fiat 128 is in the training data, it will not be in the testing data and vice versa.\n", "\n", "> NOTE: this division of training and testing data IS NOT THE SAME AS the Titanic files training.csv and testing.csv. In what I am talking about here we need to divide the training.csv data into 2 parts.\n", "\n", "I've done that with this simple MPG data set.\n", "\n", "* [Here is my training set](https://raw.githubusercontent.com/zacharski/pg2dm-python/master/data/ch4/mpgTrainingSet.txt)\n", "* [And here is my test set](https://raw.githubusercontent.com/zacharski/pg2dm-python/master/data/ch4/mpgTestSet.txt)\n", "\n", "And I will build my classifier with the data in the training set, and create a little test program that tests the classifier with data from test set. Here is that code:\n" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from urllib.request import urlopen\n", "\n", "def test(training_url, test_url):\n", " \"\"\"Test the classifier on a test set of data\"\"\"\n", " classifier = Classifier(training_url)\n", " \n", " \n", " html = urlopen(test_url)\n", " lines = html.read().decode('UTF-8').split('\\n')\n", " \n", " numCorrect = 0.0\n", " for line in lines:\n", " data = line.strip().split('\\t')\n", " #print(data)\n", " if data != ['']:\n", " vector = []\n", " classInColumn = -1\n", " for i in range(len(classifier.format)):\n", " if classifier.format[i] == 'num':\n", " vector.append(float(data[i]))\n", " elif classifier.format[i] == 'class':\n", " classInColumn = i\n", " theClass= classifier.classify(vector)\n", " prefix = '-'\n", " if theClass == data[classInColumn]:\n", " # it is correct\n", " numCorrect += 1\n", " prefix = '+'\n", " print(\"%s %12s %s\" % (prefix, theClass, line))\n", " print(\"%4.2f%% correct\" % (numCorrect * 100/ len(lines)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and now let's test how well we do with the mpg data set:\n" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+ 15 15\t8\t390.0\t190.0\t3850\t8.5\tamc ambassador dpl\n", "+ 15 15\t8\t383.0\t170.0\t3563\t10.0\tdodge challenger se\n", "+ 15 15\t8\t340.0\t160.0\t3609\t8.0\tplymouth 'cuda 340\n", "- 20 15\t8\t400.0\t150.0\t3761\t9.5\tchevrolet monte carlo\n", "+ 15 15\t8\t455.0\t225.0\t3086\t10.0\tbuick estate wagon (sw)\n", "+ 25 25\t4\t113.0\t95.00\t2372\t15.0\ttoyota corona mark ii\n", "- 25 20\t6\t198.0\t95.00\t2833\t15.5\tplymouth duster\n", "- 25 20\t6\t199.0\t97.00\t2774\t15.5\tamc hornet\n", "+ 20 20\t6\t200.0\t85.00\t2587\t16.0\tford maverick\n", "- 35 25\t4\t97.00\t88.00\t2130\t14.5\tdatsun pl510\n", "+ 25 25\t4\t97.00\t46.00\t1835\t20.5\tvolkswagen 1131 deluxe sedan\n", "+ 25 25\t4\t110.0\t87.00\t2672\t17.5\tpeugeot 504\n", "- 35 25\t4\t107.0\t90.00\t2430\t14.5\taudi 100 ls\n", "- 30 25\t4\t104.0\t95.00\t2375\t17.5\tsaab 99e\n", "- 20 25\t4\t121.0\t113.0\t2234\t12.5\tbmw 2002\n", "+ 20 20\t6\t199.0\t90.00\t2648\t15.0\tamc gremlin\n", "- 15 10\t8\t360.0\t215.0\t4615\t14.0\tford f250\n", "- 15 10\t8\t307.0\t200.0\t4376\t15.0\tchevy c20\n", "+ 10 10\t8\t318.0\t210.0\t4382\t13.5\tdodge d200\n", "- 15 10\t8\t304.0\t193.0\t4732\t18.5\thi 1200d\n", "- 35 25\t4\t97.00\t88.00\t2130\t14.5\tdatsun pl510\n", "- 25 30\t4\t140.0\t90.00\t2264\t15.5\tchevrolet vega 2300\n", "+ 25 25\t4\t113.0\t95.00\t2228\t14.0\ttoyota corona\n", "+ 20 20\t6\t232.0\t100.0\t2634\t13.0\tamc gremlin\n", "- 20 15\t6\t225.0\t105.0\t3439\t15.5\tplymouth satellite custom\n", "- 20 15\t6\t250.0\t100.0\t3329\t15.5\tchevrolet chevelle malibu\n", "+ 20 20\t6\t250.0\t88.00\t3302\t15.5\tford torino 500\n", "+ 20 20\t6\t232.0\t100.0\t3288\t15.5\tamc matador\n", "+ 15 15\t8\t350.0\t165.0\t4209\t12.0\tchevrolet impala\n", "+ 15 15\t8\t400.0\t175.0\t4464\t11.5\tpontiac catalina brougham\n", "+ 15 15\t8\t351.0\t153.0\t4154\t13.5\tford galaxie 500\n", "+ 15 15\t8\t318.0\t150.0\t4096\t13.0\tplymouth fury iii\n", "- 15 10\t8\t383.0\t180.0\t4955\t11.5\tdodge monaco (sw)\n", "+ 15 15\t8\t400.0\t170.0\t4746\t12.0\tford country squire (sw)\n", "- 10 15\t8\t400.0\t175.0\t5140\t12.0\tpontiac safari (sw)\n", "+ 20 20\t6\t258.0\t110.0\t2962\t13.5\tamc hornet sportabout (sw)\n", "+ 20 20\t4\t140.0\t72.00\t2408\t19.0\tchevrolet vega (sw)\n", "+ 20 20\t6\t250.0\t100.0\t3282\t15.0\tpontiac firebird\n", "+ 20 20\t6\t250.0\t88.00\t3139\t14.5\tford mustang\n", "- 35 25\t4\t122.0\t86.00\t2220\t14.0\tmercury capri 2000\n", "- 35 30\t4\t116.0\t90.00\t2123\t14.0\topel 1900\n", "- 35 30\t4\t79.00\t70.00\t2074\t19.5\tpeugeot 304\n", "+ 30 30\t4\t88.00\t76.00\t2065\t14.5\tfiat 124b\n", "+ 30 30\t4\t71.00\t65.00\t1773\t19.0\ttoyota corolla 1200\n", "+ 35 35\t4\t72.00\t69.00\t1613\t18.0\tdatsun 1200\n", "- 40 25\t4\t97.00\t60.00\t1834\t19.0\tvolkswagen model 111\n", "- 35 25\t4\t91.00\t70.00\t1955\t20.5\tplymouth cricket\n", "+ 25 25\t4\t113.0\t95.00\t2278\t15.5\ttoyota corona hardtop\n", "+ 25 25\t4\t97.50\t80.00\t2126\t17.0\tdodge colt hardtop\n", "- 45 25\t4\t97.00\t54.00\t2254\t23.5\tvolkswagen type 3\n", "54.90% correct\n" ] } ], "source": [ "training_url = 'https://raw.githubusercontent.com/zacharski/pg2dm-python/master/data/ch4/mpgTrainingSet.txt'\n", "test_url = 'https://raw.githubusercontent.com/zacharski/pg2dm-python/master/data/ch4/mpgTestSet.txt'\n", "test(training_url, test_url)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The '+' means we classified that instance correctly and the '-' means we didn't. So we were about 55% accurate. There were 8 different classes we were trying to predict: 10, 15, 20, 25, 30, 35, 40, and 45. So just by guessing we would only be 1/8 = 12.5% accurate. So 55% doesn't sound so bad. \n", "\n", "Let's see if we can improve on that. Suppose we don't normalize the data. Let's write another test function that does that:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from urllib.request import urlopen\n", "\n", "def test(training_url, test_url):\n", " \"\"\"Test the classifier on a test set of data\"\"\"\n", " classifier = Classifier(training_url, normalize=False)\n", " \n", " \n", " html = urlopen(test_url)\n", " lines = html.read().decode('UTF-8').split('\\n')\n", " \n", " numCorrect = 0.0\n", " for line in lines:\n", " data = line.strip().split('\\t')\n", " #print(data)\n", " if data != ['']:\n", " vector = []\n", " classInColumn = -1\n", " for i in range(len(classifier.format)):\n", " if classifier.format[i] == 'num':\n", " vector.append(float(data[i]))\n", " elif classifier.format[i] == 'class':\n", " classInColumn = i\n", " theClass= classifier.classify(vector)\n", " prefix = '-'\n", " if theClass == data[classInColumn]:\n", " # it is correct\n", " numCorrect += 1\n", " prefix = '+'\n", " print(\"%s %12s %s\" % (prefix, theClass, line))\n", " print(\"%4.2f%% correct\" % (numCorrect * 100/ len(lines)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and run it" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+ 15 15\t8\t390.0\t190.0\t3850\t8.5\tamc ambassador dpl\n", "- 20 15\t8\t383.0\t170.0\t3563\t10.0\tdodge challenger se\n", "- 10 15\t8\t340.0\t160.0\t3609\t8.0\tplymouth 'cuda 340\n", "+ 15 15\t8\t400.0\t150.0\t3761\t9.5\tchevrolet monte carlo\n", "+ 15 15\t8\t455.0\t225.0\t3086\t10.0\tbuick estate wagon (sw)\n", "- 20 25\t4\t113.0\t95.00\t2372\t15.0\ttoyota corona mark ii\n", "+ 20 20\t6\t198.0\t95.00\t2833\t15.5\tplymouth duster\n", "+ 20 20\t6\t199.0\t97.00\t2774\t15.5\tamc hornet\n", "- 25 20\t6\t200.0\t85.00\t2587\t16.0\tford maverick\n", "- 30 25\t4\t97.00\t88.00\t2130\t14.5\tdatsun pl510\n", "- 30 25\t4\t97.00\t46.00\t1835\t20.5\tvolkswagen 1131 deluxe sedan\n", "+ 25 25\t4\t110.0\t87.00\t2672\t17.5\tpeugeot 504\n", "- 35 25\t4\t107.0\t90.00\t2430\t14.5\taudi 100 ls\n", "- 20 25\t4\t104.0\t95.00\t2375\t17.5\tsaab 99e\n", "- 20 25\t4\t121.0\t113.0\t2234\t12.5\tbmw 2002\n", "- 25 20\t6\t199.0\t90.00\t2648\t15.0\tamc gremlin\n", "- 15 10\t8\t360.0\t215.0\t4615\t14.0\tford f250\n", "- 15 10\t8\t307.0\t200.0\t4376\t15.0\tchevy c20\n", "- 15 10\t8\t318.0\t210.0\t4382\t13.5\tdodge d200\n", "- 15 10\t8\t304.0\t193.0\t4732\t18.5\thi 1200d\n", "- 30 25\t4\t97.00\t88.00\t2130\t14.5\tdatsun pl510\n", "- 25 30\t4\t140.0\t90.00\t2264\t15.5\tchevrolet vega 2300\n", "- 20 25\t4\t113.0\t95.00\t2228\t14.0\ttoyota corona\n", "- 25 20\t6\t232.0\t100.0\t2634\t13.0\tamc gremlin\n", "- 20 15\t6\t225.0\t105.0\t3439\t15.5\tplymouth satellite custom\n", "+ 15 15\t6\t250.0\t100.0\t3329\t15.5\tchevrolet chevelle malibu\n", "- 15 20\t6\t250.0\t88.00\t3302\t15.5\tford torino 500\n", "- 15 20\t6\t232.0\t100.0\t3288\t15.5\tamc matador\n", "+ 15 15\t8\t350.0\t165.0\t4209\t12.0\tchevrolet impala\n", "+ 15 15\t8\t400.0\t175.0\t4464\t11.5\tpontiac catalina brougham\n", "+ 15 15\t8\t351.0\t153.0\t4154\t13.5\tford galaxie 500\n", "+ 15 15\t8\t318.0\t150.0\t4096\t13.0\tplymouth fury iii\n", "+ 10 10\t8\t383.0\t180.0\t4955\t11.5\tdodge monaco (sw)\n", "+ 15 15\t8\t400.0\t170.0\t4746\t12.0\tford country squire (sw)\n", "- 10 15\t8\t400.0\t175.0\t5140\t12.0\tpontiac safari (sw)\n", "+ 20 20\t6\t258.0\t110.0\t2962\t13.5\tamc hornet sportabout (sw)\n", "+ 20 20\t4\t140.0\t72.00\t2408\t19.0\tchevrolet vega (sw)\n", "- 15 20\t6\t250.0\t100.0\t3282\t15.0\tpontiac firebird\n", "- 15 20\t6\t250.0\t88.00\t3139\t14.5\tford mustang\n", "- 20 25\t4\t122.0\t86.00\t2220\t14.0\tmercury capri 2000\n", "- 40 30\t4\t116.0\t90.00\t2123\t14.0\topel 1900\n", "- 40 30\t4\t79.00\t70.00\t2074\t19.5\tpeugeot 304\n", "- 40 30\t4\t88.00\t76.00\t2065\t14.5\tfiat 124b\n", "- 35 30\t4\t71.00\t65.00\t1773\t19.0\ttoyota corolla 1200\n", "- 30 35\t4\t72.00\t69.00\t1613\t18.0\tdatsun 1200\n", "- 30 25\t4\t97.00\t60.00\t1834\t19.0\tvolkswagen model 111\n", "- 30 25\t4\t91.00\t70.00\t1955\t20.5\tplymouth cricket\n", "- 20 25\t4\t113.0\t95.00\t2278\t15.5\ttoyota corona hardtop\n", "- 35 25\t4\t97.50\t80.00\t2126\t17.0\tdodge colt hardtop\n", "+ 25 25\t4\t97.00\t54.00\t2254\t23.5\tvolkswagen type 3\n", "31.37% correct\n" ] } ], "source": [ "training_url = 'https://raw.githubusercontent.com/zacharski/pg2dm-python/master/data/ch4/mpgTrainingSet.txt'\n", "test_url = 'https://raw.githubusercontent.com/zacharski/pg2dm-python/master/data/ch4/mpgTestSet.txt'\n", "test(training_url, test_url)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hmmm. That seemed to make it worse. But now, with this test procedure, and our code divided into training and test sets we can fiddle with what columns to include, or with the weights of the different columns (maybe 0-60 should weight heavier than the number of cylinders, for example) and quickly see if what improves our accuracy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### How good is it.\n", "Now we are done tuning our classifier and the accuracy seems fine. Maybe we would like to submit it to a data mining programming competition, or write a research paper saying how wonderfully accurate it is, or just simply finish the Titanic project. So we need to say how accurate it is. It would be tempting to report the accuracy we just computed that used our test set. But the problem is, we spent a lot of time **tuning** our classifier **on** that test set. Of course it will do well on that test set. This accuracy may not reflect the true accuracy of our classifier on other data. So often a **second test set** is used. And this is what the Titanic `test.csv` file is. In the Titanic test dataset we don't even know the correct classification, so we cannot use it for tuning. But I can use the results you get from running your classifier on the second test set to determine the accuracy of your classifier.\n", "\n", "#### my accuracy went down\n", "When I was in the tuning phase of my classifier I was getting in the low 80% accuracy range using the method I just showed above. When I ran that classifier on the second test set, it was only in the mid 70% accuracy range. \n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How I approached the Titanic Problem\n", "#### 1. I did all the above, used the chapter 4 code, did the unitTest and ran through a few datasets including the MPG dataset. \n", "#### 2. I massaged both Titanic Data Files.\n", "The original files looked like:\n", "\n", "```\n", "1,0,3,\"Braund, Mr. Owen Harris\",male,22,1,0,A/5 21171,7.25,,S\n", "2,1,1,\"Cumings, Mrs. John Bradley (Florence Briggs Thayer)\",female,38,1,0,PC 17599,71.2833,C85,C\n", "3,1,3,\"Heikkinen, Miss. Laina\",female,26,0,0,STON/O2. 3101282,7.925,,S\n", "4,1,1,\"Futrelle, Mrs. Jacques Heath (Lily May Peel)\",female,35,1,0,113803,53.1,C123,S\n", "5,0,3,\"Allen, Mr. William Henry\",male,35,0,0,373450,8.05,,S\n", "```\n", "\n", "I converted them to tab separated fields with no quotes:\n", "\n", "```\n", "1\t0\t3\tBraund, Mr. Owen Harris\tmale\t22\t1\t0\tA/5 21171\t7.25\t\tS\t0\n", "2\t1\t1\tCumings, Mrs. John Bradley (Florence Briggs Thayer)\tfemale\t38\t1\t0\tPC 17599\t71.2833\tC85\tC\t100\n", "3\t1\t3\tHeikkinen, Miss. Laina\tfemale\t26\t0\t0\tSTON/O2. 3101282\t7.925\t\tS\t100\n", "4\t1\t1\tFutrelle, Mrs. Jacques Heath (Lily May Peel)\tfemale\t35\t1\t0\t113803\t53.1\tC123\tS\t100\n", "5\t0\t3\tAllen, Mr. William Henry\tmale\t35\t0\t0\t373450\t8.05\t\tS\t0\n", "```\n", "I could have handled the original format in my original code but this seemed easier.\n", "\n", "\n", "####3. I renamed the original Titanic `test.csv` file `unknown.csv` (I didn't really do this at this point but I figured it makes this more understandable)\n", "\n", "####4. Outside of the original Python code, I divided the `train.csv` Titanic dataset into two files. \n", "About 100 lines of the file I put into a new file called `testing.csv`. The remaining lines I put in a file called `training.csv`. \n", "####5. You need to either modify the code above to handle local files (like the code in the book) or put these files you created on a web" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.4.3" } }, "nbformat": 4, "nbformat_minor": 0 }