{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "#Entropy\n", "\n", "For this notebook we are going to use the Titanic data we used in class:\n", "\n", "| ID | name | survived | sex | age group | PcClass | SibSp |\n", "| :--: | :--: | :--: | :--: | :--: | :--: | :--: |\n", "|1 | Braund, Mr. Owen Harris | 0 | male | adult | 3 | Y |\n", "|2 | Cumings, Mrs. John Bradley | 1 | female | adult | 1 | Y |\n", "|8 | Palsson, Master. Gosta Leonard | 0 | male | child | 3 | Y |\n", "|11 | Sandstrom, Miss. Marguerite Rut | 1 | female | child | 1| Y |\n", "| 40 | Nicola-Yarred, Miss. Jamila | 1 | female | teen | 1 | Y |\n", "|50 | Arnold-Franchi, Mrs. Josef | 0 | female | teen | 3 | Y |\n", "|205 | Cohen, Mr. Gurshon \"\"Gus\" | 1 | male | teen |3 | N |\n", "|307 | Allison, Master. Hudson Trevor | 1 | male | child | 1 | Y |\n", "|507| Quick, Mrs. Frederick Charles | 1 | female | adult | 2 | Y |\n", "|529 | Salonen, Mr. Johan Werner | 0 | male | adult | 3 | N |\n", "|581 | Christy, Miss. Julie Rachel | 1 | female | adult | 2 | Y |\n", "| 871 | Balkic, Mr. Cerin | 0 | male | adult | 3 | N |\n", "| 17 | Rice, Master. Eugene | 0 | male | child | 3 | Y |\n", "|16 |Hewlett, Mrs. (Mary D Kingcome)| 1 | female | adult | 2 | N|\n", "|68 | Crease, Mr. Ernest James | 0 | male | teen | 3 | N |\n", "|263| Taussig, Mr. Emil | 0 | male | adult | 1 | N |\n", "|330| Hippach, Miss. Jean Gertrude | 1 | female | teen | 1 | N|\n", "|505 | Maioni, Miss. Roberta | 1 | female | teen| 1| N |\n", "|533 | Elias, Mr. Joseph Jr | 0 | male | teen | 3 | Y |\n", "| 731 | Ilmakangas, Miss. Pieta Sofia | 0 | female | adult | 3 | Y |\n", "\n", "In a bit, we are going to represent this data in Python and compute entropy on it. \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## But first ... entropy\n", "\n", "In class we did a bunch of work with the entropy formula:\n", "\n", "$$H(X)=-\\sum_{i\\in{X}}p(i)\\log_2(p(i))$$\n", "\n", "One example we computed was that of transmitting a message about the status of Professor Davies house indicating one of 4 conditions: no one home, both Professor Davies and his wife were home together, only Professor Davies and only his wife and we gave the probabilities as follows:\n", "\n", "| condition | prob |\n", "| :--: | :--: |\n", "|no one home | 0.5 |\n", "|both home | 0.25 |\n", "| only Davies## | 0.125 |\n", "| only spouse | 0.125 |\n", "\n", "and we computed the entropy as:\n", "\n", "$$H(X) = - (p(1/2)\\log_2(p(1/2)) + p(1/4)\\log_2(p(1/4)) + p(1/8)\\log_2(p(1/8)) + p(1/8)\\log_2(p(1/8)))$$\n", "\n", "$$ = - (p(1/2)(-1)) + p(1/4)(-2)) + p(1/8)\\log_2(-3)) + p(1/8)(-3))) = - ( -1/2 + - 1/2 + - 3/8 + -3/8) = 1.75$$\n", "\n", "In data mining/machine learning we often use the notation info[4, 2, 1, 1] to represent this entropy formula. The number of numbers in the list [4, 2, 1, 1] represent how many conditions there are and each number represents how many occurences are in each condition. So in this case we are interested in 4 conditions (no one home, both home, only D, and only spouse) and in our little sample 4 times there was no one home, twice both were home and so on. We compute info[4, 2, 1, 1] the same way we did entropy above. The denominator of the fractions is the sum of all the numbers in the list, the numerator for each component in the associated number in the list. So:\n", "\n", "\n", "$$info[4, 2, 1, 1\\ = - (p(4/8)\\log_2(p(4/8)) + p(2/8)\\log_2(p(2/8)) + p(1/8)\\log_2(p(1/8)) + p(1/8)\\log_2(p(1/8)))$$\n", "\n", "$$ = - (p(1/2)(-1)) + p(1/4)(-2)) + p(1/8)\\log_2(-3)) + p(1/8)(-3))) = - ( -1/2 + - 1/2 + - 3/8 + -3/8) = 1.75$$\n", "\n", "and we get the same result.\n", "\n", "Let's say there are 15 women in my 110 class and 5 men and I want to compute the entropy (or info) of the sex variable. That would be represented as:\n", "\n", "$$ info[15, 5]$$\n", "\n", "and we would compute that as:\n", "\n", "$$ info[15, 5] = - (p(15/20)\\log_2(p(15/20)) + p(5/20)\\log_2(p(5/20)) = - (.75(-0.415) + .25(-2)) = .81125\n", "\n", "### we need an info function\n", "I would like you to write a function, info, that takes a list as an argument and returns the entropy of that list.\n", "So for example info([4, 2, 1, 1]) should return 1.75 and info([15, 5]) should return about 0.81125.\n", "I wrote a unit test for this info function that you should use.\n" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "INFO passes all tests\n" ] } ], "source": [ "from math import log\n", "\n", "def info(conditions):\n", " \"\"\"Conditions is a list, for example [4, 3, 2, 1] \n", " and info should return the entropy of that list of conditions\"\"\"\n", " \n", " # TBD\n", " \n", " return 0\n", "\n", "def infoUnitTest():\n", " assert(round(info([4, 2, 1, 1]), 5) == 1.75)\n", " assert(round(info([15, 5]), 5) == 0.81128)\n", " assert(info([15 , 0]) == 0)\n", " assert(info([4, 2, 0, 1, 0, 1]) == 1.75)\n", " print (\"INFO passes all tests\")\n", " \n", "infoUnitTest()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That was \n", "### Checkpoint 1\n", "Congratulations" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Representing the data\n", "Now we need to represent the data in the above table (also available in [this file](https://raw.githubusercontent.com/zacharski/cs419/master/entropyNotebookData.txt) ). In a Python data structure. When you are thinking about this data structure keep in mind that eventually we will want to deal with subsets of the data. For example, \"all instances where sex is male\" or \"all instances where sex is female and class is first class.\" Why don't you do that now while I relax...\n", "\n", "### ok, if you want to read data from a file here is a little start...\n", "This is just a sketch of what I did. You can do something completely different" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['ID', 'name', 'survived', 'sex', 'age group', 'PcClass', 'SibSp']\n" ] } ], "source": [ "from urllib.request import urlopen \n", "def readDataFile(urlFile):\n", " html = urlopen(urlFile)\n", " lines = html.read().decode('UTF-8').split('\\n')\n", " header = lines[0].strip().split('\\t')\n", " print(header)\n", " data = []\n", " for line in lines[1:]:\n", " print(line)\n", " return (header, data)\n", "\n", " \n", " \n", "data = readDataFile('https://raw.githubusercontent.com/zacharski/cs419/master/entropyNotebookData.txt') \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## InfoVal\n", "This function will change a bit later but let's start one step at a time. I want to write a function `infoVal` that takes three arguments: the dataset, the column, the value from that column we are interested in, and finally the column we are modeling. The column we are modeling means, which columns contains the variable distribution we are modeling. In the Titanic case we are modeling the distribution of those who survived and those who didn't. Those variables are in the *survived* column. The function will return the `info` of that data. So if the above dataset is called `data` consider\n", "\n", "```\n", "infoVal(data, 'PcClass', '1', 'survived')\n", "```\n", "\n", "There are 7 instances of first class passengers. 6 of those survived and 1 did not so we want\n", "```\n", "infoVal(data, 'PcClass', '1', 'survived')\n", "```\n", "to return\n", "\n", "``` \n", "info([6, 1])\n", "```\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from collections import defaultdict\n", "\n", "def infoVal(data, column, val, modelColumn):\n", " # TBD\n", " return info([1, 1])\n", "\n", "\n", "\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Okay, let's test that function (you might need to alter this to match you data and infoVal method):" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "all tests passed!\n" ] } ], "source": [ "def unitTestInfoVal():\n", " # the following should return the results of info[6,1]\n", " assert(round(infoVal(data, 'PcClass', '1', 'survived'), 5) == round(info([6, 1]), 5)) \n", " assert(infoVal(data, 'PcClass', '2', 'survived') == 0)\n", " assert(round(infoVal(data, 'sex', 'female', 'PcClass'), 5) == 1.48548)\n", " print(\"all tests passed!\")\n", "unitTestInfoVal()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Fantastic.\n", "\n", "\n", "Finally, I would like you to write a function to compute the entropy for a column. We went over this in class. You should first write a unitTest function. You can collaborate with others when writing this function. (**If you don't write a unit test, how do you know your code is working correctly?**) Next, on your own, write the actual function." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from collections import defaultdict\n", "\n", "def columnEntropy(data, col, mcol)...\n", "\n", "def unitTestColumnEntropy):\n", " TBD" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Picking a good question\n", "The goal of our project is to come up with a 20-question like approach to classifying instances in our test set. And we are trying to come up with the 'best' first question. In the Titanic example, what should we ask first? A question about sex, age group, Passenger class, or Sibling Spouse? Write a function that given a datasetand a modelColumn, will return the best question to ask. " ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "####XP\n", "\n", "* `info` 20xp\n", "* `infoVal` 25xp\n", "* column entropy 25xp\n", "* good question 10xp" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "###Current status\n", "\n", "Ok, So it looks like PcClass is the best first question. Here is where we are at in my rough drawing of this tree:\n", "\n", "\n", " +---1---> 6 survived, 1 did not, e = 0.5917 -> ?\n", " |\n", " |\n", " S --> PcClass -- +---2---> 3 survivied, e = 0\n", " |\n", " |\n", " +---3---> 1 survived, 6 did not, entropy 0.469\n", "\n", "\n", "Now we are at the question mark at the 1 class part of the tree (where 6 survived and 1 did not). We want to know what is the best next question to ask at that point. So in the case of a first class passenger, is it better to ask about sex, age group, or sib/sp? This is a refinement of what we did above.\n", "\n", "Can you finish writing the decision tree code? (50xp)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.4.3" } }, "nbformat": 4, "nbformat_minor": 0 }