{ "metadata": { "name": "", "signature": "sha256:29da615d0ea91abd5639debfb13a98d23cb21b1148828c89e63dc435c854160c" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [
{ "cell_type": "markdown", "metadata": {}, "source": [ "One student emailed with the following question:\n", "\n", "<blockquote>\n", "<p>\n", "Right now I'm trying to edit the example entropy function to the one you wrote on the board in class. \n", "</p>\n", "<p>\n", "My question is the example code only has one P_s, right? Our goal is to add four different values but I don't understand how to code P_w, P_h and so on. Would you give me more details and advice on this?\n", "</p>\n", "</blockquote>\n", "\n", "As a hint, I will rewrite the `entropy` formula I wrote to explicitly look up the various categories." ] },
{ "cell_type": "code", "collapsed": false, "input": [ "def entropy(series):\n", "    \"\"\"Normalized Shannon Index\"\"\"\n", "    # a series in which all the entries are equal should result in normalized entropy of 1.0\n", "    \n", "    # eliminate 0s\n", "    series1 = series[series!=0]\n", "\n", "    # if len(series1) < 2 (i.e., 0 or 1) then return 0\n", "    \n", "    if len(series1) > 1:\n", "        # calculate the maximum possible entropy for given length of input series\n", "        max_s = -np.log(1.0/len(series))\n", "        \n", "        total = float(sum(series1))\n", "        p = series1.astype('float')/total\n", "        return sum(-p*np.log(p))/max_s\n", "    else:\n", "        return 0.0" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 8 },
{ "cell_type": "code", "collapsed": false, "input": [ "# supporting imports\n", "\n", "import numpy as np\n", "from pandas import Series" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 7 },
{ "cell_type": "code", "collapsed": false, "input": [ "def entropy_term(p):\n", "    \"\"\"Individual Shannon entropy term -- handles the case in which p is 0\"\"\"\n", "    if p == 0:\n", "        return 0\n", "    else:\n", "        return -p*np.log(p)\n", "\n", "def entropy5_explicit_labels(series):\n", "    \"\"\"entropy5 calculation for an input Series with 5 categories\"\"\"\n", "    # calculate the normalizing term -- what's the maximum entropy\n", "    # there are five categories here\n", "    max_s = -np.log(1.0/5)\n", "    total = float(series['White']+series['Black']+series['Asian']+ \\\n", "                  series['Hispanic']+series['Other'])\n", "    \n", "    s = entropy_term(series['White']/total) + \\\n", "        entropy_term(series['Black']/total) + \\\n", "        entropy_term(series['Asian']/total) + \\\n", "        entropy_term(series['Hispanic']/total) + \\\n", "        entropy_term(series['Other']/total)\n", "    \n", "    s = s/max_s\n", "    return s\n", "\n", "def entropy4_explicit_labels(series):\n", "    \"\"\"entropy4 calculation for an input Series with 4 categories\"\"\"\n", "    # calculate the normalizing term -- what's the maximum entropy\n", "    # there are four categories here\n", "    max_s = -np.log(1.0/4)\n", "    # don't include Other in the total\n", "    total = float(series['White']+series['Black']+series['Asian']+ \\\n", "                  series['Hispanic'])\n", "    \n", "    s = entropy_term(series['White']/total) + \\\n", "        entropy_term(series['Black']/total) + \\\n", "        entropy_term(series['Asian']/total) + \\\n", "        entropy_term(series['Hispanic']/total)\n", "    \n", "    s = s/max_s\n", "    return s" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 24 },
{ "cell_type": "code", "collapsed": false, "input": [ "# Using the population figures for the Houston Metro Area\n", "# Make a pandas Series out of the dict\n", "\n", "houston = Series({'Asian': 384596,\n", "                  'Black': 998883,\n", "                  'Hispanic': 2099412,\n", "                  'Other': 103437,\n", "                  'White': 2360472})" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 30 },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Note how the `entropy` function can be used to do both the entropy5 and the entropy4 calculation\n", "by just changing the subset of the `houston` Series being passed into `entropy`." ] },
{ "cell_type": "code", "collapsed": false, "input": [ "# comparing two ways of doing the entropy5 calculation\n", "(entropy(houston[['White', 'Black', 'Asian', 'Hispanic', 'Other']]),\n", " entropy5_explicit_labels(houston))" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 36, "text": [ "(0.79628076626851163, 0.79628076626851163)" ] } ], "prompt_number": 36 },
{ "cell_type": "code", "collapsed": false, "input": [ "# comparing two ways of doing the entropy4 calculation\n", "# don't include Other\n", "\n", "(entropy(houston[['White', 'Black', 'Asian', 'Hispanic']]),\n", " entropy4_explicit_labels(houston))" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 37, "text": [ "(0.87642479416885899, 0.87642479416885899)" ] } ], "prompt_number": 37 },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Calculating an `entropy_rice` function is left to the reader...." ] },
{ "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }
] , "metadata": {} } ] }