{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Naive Bayes Classifier\n", "\n", "For this notebook we are going to yet again use the Titanic data we used in previous notebooks:\n", "\n", "| ID | name | survived | sex | age group | PcClass | SibSp |\n", "| :--: | :--: | :--: | :--: | :--: | :--: | :--: |\n", "|1 | Braund, Mr. Owen Harris | 0 | male | adult | 3 | Y |\n", "|2 | Cumings, Mrs. John Bradley | 1 | female | adult | 1 | Y |\n", "|8 | Palsson, Master. Gosta Leonard | 0 | male | child | 3 | Y |\n", "|11 | Sandstrom, Miss. Marguerite Rut | 1 | female | child | 1| Y |\n", "| 40 | Nicola-Yarred, Miss. Jamila | 1 | female | teen | 1 | Y |\n", "|50 | Arnold-Franchi, Mrs. Josef | 0 | female | teen | 3 | Y |\n", "|205 | Cohen, Mr. Gurshon \"\"Gus\" | 1 | male | teen |3 | N |\n", "|307 | Allison, Master. Hudson Trevor | 1 | male | child | 1 | Y |\n", "|507| Quick, Mrs. Frederick Charles | 1 | female | adult | 2 | Y |\n", "|529 | Salonen, Mr. Johan Werner | 0 | male | adult | 3 | N |\n", "|581 | Christy, Miss. Julie Rachel | 1 | female | adult | 2 | Y |\n", "| 871 | Balkic, Mr. Cerin | 0 | male | adult | 3 | N |\n", "| 17 | Rice, Master. Eugene | 0 | male | child | 3 | Y |\n", "|16 |Hewlett, Mrs. (Mary D Kingcome)| 1 | female | adult | 2 | N|\n", "|68 | Crease, Mr. Ernest James | 0 | male | teen | 3 | N |\n", "|263| Taussig, Mr. Emil | 0 | male | adult | 1 | N |\n", "|330| Hippach, Miss. Jean Gertrude | 1 | female | teen | 1 | N|\n", "|505 | Maioni, Miss. Roberta | 1 | female | teen| 1| N |\n", "|533 | Elias, Mr. Joseph Jr | 0 | male | teen | 3 | Y |\n", "| 731 | Ilmakangas, Miss. Pieta Sofia | 0 | female | adult | 3 | Y |\n", "\n", "\n", "Here is a function that reads this information from a file.\n", "\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[('0', ['male', 'adult', '3', 'Y']), ('1', ['female', 'adult', '1', 'Y']), ('0', ['male', 'child', '3', 'Y']), ('1', ['female', 'child', '1', 'Y']), ('1', ['female', 'teen', '1', 'Y']), ('0', ['female', 'teen', '3', 'Y']), ('1', ['male', 'teen', '3', 'N']), ('1', ['male', 'child', '1', 'Y']), ('1', ['female', 'adult', '2', 'Y']), ('0', ['male', 'adult', '3', 'N']), ('1', ['female', 'adult', '2', 'Y']), ('0', ['male', 'adult', '3', 'N']), ('0', ['male', 'child', '3', 'Y']), ('1', ['female', 'adult', '2', 'N']), ('0', ['male', 'teen', '3', 'N']), ('0', ['male', 'adult', '1', 'N']), ('1', ['female', 'teen', '1', 'N']), ('1', ['female', 'teen', '1', 'N']), ('0', ['male', 'teen', '3', 'Y']), ('0', ['female', 'adult', '3', 'Y'])]\n" ] } ], "source": [ "from urllib.request import urlopen \n", "def readDataFile(urlFile, dataFormat, skipHeader=True):\n", " \n", " html = urlopen(urlFile) \n", " tempLines = html.read().decode('UTF-8').split('\\n')\n", " #header = lines[0].strip().split('\\t')\n", " \n", " dFormat = dataFormat.split(\"\\t\")\n", " attributes = []\n", " for i in range(len(dFormat)):\n", " if dFormat[i] == 'class':\n", " classificationColumn = i\n", " \n", " elif dFormat[i] == 'attr':\n", " attributes.append(i)\n", " #print(classificationColumn, attributes)\n", " \n", " \n", " data = []\n", " if skipHeader == True:\n", " lines = tempLines[1:]\n", " else:\n", " lines = tempLines\n", " for line in lines:\n", " attrVector = []\n", " instance = line.strip().split('\\t')\n", " #currentClass = \"%s=%s\" %(classificationName, instance[classificationColumn])\n", " currentClass = instance[classificationColumn]\n", " for attribute in attributes:\n", " attrVector.append(instance[attribute])\n", " \n", " \n", " data.append((currentClass, attrVector))\n", " return data\n", "\n", " \n", "theFormat = 'comment\\tcomment\\tclass\\tattr\\tattr\\tattr\\tattr'\n", "data = readDataFile('https://raw.githubusercontent.com/zacharski/cs419/master/entropyNotebookData.txt', theFormat) \n", "print(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, each line of the data file is converted to a tuple. The first element of the tuple is the classification of that instance, and the second element is a list of the attribute values of that instance.\n", "\n", "So this row of the table\n", "\n", "| ID | name | survived | sex | age group | PcClass | SibSp |\n", "| :--: | :--: | :--: | :--: | :--: | :--: | :--: |\n", "|1 | Braund, Mr. Owen Harris | 0 | male | adult | 3 | Y |\n", "\n", "is converted to the tuple\n", "\n", "`('0', ['male', 'adult', '3', 'Y'])`\n", "\n", "\n", "## Prior probability\n", "The first thing we need to do is compute the prior probability (in our case `P(survived)` and `P(not survived)`). Let's call this function `computePrior` and the function signature will be\n", "\n", "`computePrior(data)`\n", "\n", "which will compute the prior probabilities of the classifications. In this example, the classifications are '1' meaning *survived* and '0' meaning *not survived.* It should return a dictionary of the form:\n", "\n", "`{'0': 0.5, '1': 0.5}`\n", "\n", "So `P(survived=0) = 0.5` and `P(survived=1) = 0.5`\n", "\n", "\n", "####Checkpoint 1: 10xp\n", "\n", "Go ahead and write & test that function.\n" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "defaultdict(, {'0': 0.5, '1': 0.5})\n" ] } ], "source": [ "from collections import defaultdict\n", "\n", "def computePrior(data):\n", " ### To Be Done (TBD)\n", " results = defaultdict(int)\n", " # TBD\n", " return results\n", "\n", "priorProbabilities = computePrior(data)\n", "print(priorProbabilities)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "my answer: `defaultdict(, {'0': 0.5, '1': 0.5})`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Conditional Probabilities\n", "\n", "Next, we need to compute conditional probabilities, such as `P(PcClass=1|survived = 1)`\n", "\n", "Let's call this `computeConditionals` and the signature will be\n", "\n", "`computeConditionals(data)`\n", "\n", "For each of the attribute columns, we are computing all the conditional probabilities of the form `P(attributeColumn=x | classificationColumn=y)\n", "\n", "The `classificationColumn` will be the name of the column that contains the categories we are trying to train our classifier to predict (in this case `survived`). \n", "\n", "The output will be a list of conditional probabilities. The data structure to store the probabilities is up to you. I used the following:\n", "\n", "`defaultdict(lambda : defaultdict(lambda : defaultdict(int)))`\n", "\n", "So a dictionary of dictionaries of dictionaries of integers. \n", "\n", "The probability `P(column0=male|survived=1)` is represented as `result['1'][0]['male']`\n", "\n", "where '1' is the value of the survived column, 0 represents the 0th column and 'male' is the value of that 0th column.\n", "\n", "Once I collect the counts, I need to convert them to probabilities by iterating over the counts. So something like:\n", "\n", " # now compute probability\n", " for (category, columns) in results.items():\n", " for (column, values) in columns.items():\n", " for (value, count) in values.items():\n", " # do work here\n", " \n", "\n", "\n", "\n", "The format of the data is:\n", "\n", " [('0', ['male', 'adult', '3', 'Y']), \n", " ('1', ['female', 'adult', '1', 'Y']), \n", " ('0', ['male', 'child', '3', 'Y']), \n", " ('1', ['female', 'child', '1', 'Y']), \n", " ('1', ['female', 'teen', '1', 'Y']), \n", " ('0', ['female', 'teen', '3', 'Y']), \n", " ('1', ['male', 'teen', '3', 'N']), \n", " ('1', ['male', 'child', '1', 'Y']), \n", " ('1', ['female', 'adult', '2', 'Y']), \n", " ...]\n", " \n", "and, again, we are going to compute probabilities like `P('male'|'1')` (the probability of a male given that the person survived).\n", "\n", "Before you write this function, you should know what you expect the function to return (how will you know that the function is implemented correctly. To that end please compute by hand the following (edit this cell and execute it):\n", "\n", "#### Checkpoint 2 10 xp\n", "\n", "| probability | value |\n", "| :---: | ---: |\n", "| P(male | survived) | 0.25 |\n", "|P(1st class | survived) | ? |\n", "|P(2nd class | survived) | ? |\n", "|P(3rd class | survived) | ? |\n", "|P(adult | survived) | ? |" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "defaultdict(. at 0x105a26f28>, {})\n", "0\n", "0\n" ] } ], "source": [ "from collections import defaultdict\n", "def computeConditionals(data):\n", " categoryCounts = defaultdict(int)\n", " results = defaultdict(lambda : defaultdict(lambda : defaultdict(int)))\n", " # TBD\n", " return results\n", "\n", "conditionalProbabilities = computeConditionals(data)\n", "print(conditionalProbabilities)\n", "\n", "print(conditionalProbabilities['1'][0]['male']) # should return .25\n", "print(conditionalProbabilities['1'][1]['teen']) # should return .3846...\n", "## please add test cases" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "that was\n", "\n", "####Checkpoint 3: 40xp\n", "\n", "My output is \n", "\n", " defaultdict(, {'0': 10, '1': 10})\n", " [['male', 'female'], ['adult', 'child', 'teen'], ['3', '1', '2'], ['Y', 'N']]\n", " defaultdict(. at 0x108e0d488>, {'0': defaultdict(... at 0x108e0d378>, {0: defaultdict(, {'male': \n", " 0.8, 'female': 0.2}), 1: defaultdict(, {'child': 0.2, 'teen': 0.3, 'adult': 0.5}), 2: \n", " defaultdict(, {'1': 0.1, '3': 0.9}), 3: defaultdict(, {'Y': 0.6, 'N': 0.4})}), '1': \n", " defaultdict(... at 0x108e0d950>, {0: \n", " defaultdict(, {'male': 0.2, 'female': 0.8}), 1: defaultdict(, {'child': 0.2, 'teen': \n", " 0.4, 'adult': 0.4}), 2: defaultdict(, {'1': 0.6, '3': 0.1, '2': 0.3}), 3: defaultdict(, \n", " {'Y': 0.6, 'N': 0.4})})})\n", "\n", " 0.25\n", "\n", " 0.38461538461538464\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Classifier\n", "\n", "Finally we are going to use the prior and conditional probabilities to classify instances. For example, did Miss Anna McGowan, a teen in third class with no siblings onboard survive or not? We would represent those attributes in the vector `['female', 'teen', '1', 'Y']` and we might call our classifier with something like:\n", "\n", "`classify(['female', 'teen', '1', 'Y'], priorProbabilities, conditionalProbabilities)`\n", "\n", "To help in verifying your code is correct I would like you to output a tuple of the form:\n", "\n", "`('1', 0.9292035398230087)`\n", "\n", "meaning we would classify Miss McGowan as '1' (survived) with the normalized probability of .9292\n", "\n", "\n", "Go ahead and write that function now:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "('temp', 9999.0)" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from operator import itemgetter\n", "\n", "def classifier(instance, priorP, conditionalP):\n", " results = []\n", " # just adding these 2 lines to have something\n", " t = 1\n", " results.append(('temp', 9999))\n", " r2 = sorted(results, key=itemgetter(1), reverse=True)\n", " return((r2[0][0], r2[0][1] / t))\n", "\n", "classifier(['female', 'teen', '1', 'Y'], priorProbabilities, conditionalProbabilities)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### my output was\n", "\n", "`('1', 0.9292035398230087)`\n", "\n", "that was\n", "\n", "#### Checkpoint 4 - 40xp" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#####Congratulations you just finished writing a Bayes classifier!\n", "\n", "# Additional work:\n", "### 75 xp for the first of these 40 xp for the other.\n", "\n", "####Can you write a classifier that trains on the full Titanic dataset? \n", "You will need to convert the column age to be categories like 'teen', 'adult', 'child'\n", "Test it on 100 instances you haven't trained on and print out the accuracy.\n", "\n", "####Can you write a classifier that works with [The House Vote dataset](https://raw.githubusercontent.com/zacharski/pg2dm-python/master/data/ch6/house-votes.zip)?\n", "\n", "### remember a decision tree program that uses the full Titanic dataset is worth 100xp\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.4.3" } }, "nbformat": 4, "nbformat_minor": 0 }