{ "metadata": { "name": "", "signature": "sha256:29da615d0ea91abd5639debfb13a98d23cb21b1148828c89e63dc435c854160c" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [
{ "cell_type": "markdown", "metadata": {}, "source": [ "One student emailed with the following question:\n", "\n", "<blockquote>\n", "<p>\n", "Right now I'm trying to edit the example entropy function to the one you wrote on the board in class. \n", "</p>\n", "<p>\n", "My question is the example code only has one P_s, right? Our goal is to add four different values but I don't understand how to code P_w, P_h and so on. Would you give me more details and advice on this?\n", "</p>\n", "</blockquote>\n", "\n", "As a hint, I will rewrite the `entropy` formula I wrote to explicitly look up the various categories." ] },
{ "cell_type": "code", "collapsed": false, "input": [ "def entropy(series):\n", "    \"\"\"Normalized Shannon Index\"\"\"\n", "    # a series in which all the entries are equal should result in normalized entropy of 1.0\n", "    \n", "    # eliminate 0s\n", "    series1 = series[series!=0]\n", "\n", "    # if len(series1) < 2 (i.e., 0 or 1) then return 0\n", "    \n", "    if len(series1) > 1:\n", "        # calculate the maximum possible entropy for given length of input series\n", "        max_s = -np.log(1.0/len(series))\n", "        \n", "        total = float(sum(series1))\n", "        p = series1.astype('float')/total\n", "        return sum(-p*np.log(p))/max_s\n", "    else:\n", "        return 0.0" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 8 },
{ "cell_type": "code", "collapsed": false, "input": [ "# supporting imports\n", "\n", "import numpy as np\n", "from pandas import Series" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 7 },
{ "cell_type": "code", "collapsed": false, "input": [ "def entropy_term(p):\n", "    \"\"\"Individual Shannon entropy term -- handles the case in which p is 0\"\"\"\n", "    if p == 0:\n", "        return 0\n", "    else:\n", "        return -p*np.log(p)\n", "\n", "def entropy5_explicit_labels(series):\n", "    \"\"\"entropy5 calculation for an input Series with 5 categories\"\"\"\n", "    # calculate the normalizing term -- what's the maximum entropy\n", "    # there are five categories here\n", "    max_s = -np.log(1.0/5)\n", "    total = float(series['White']+series['Black']+series['Asian']+ \\\n", "                  series['Hispanic']+series['Other'])\n", "    \n", "    s = entropy_term(series['White']/total) + \\\n", "        entropy_term(series['Black']/total) + \\\n", "        entropy_term(series['Asian']/total) + \\\n", "        entropy_term(series['Hispanic']/total) + \\\n", "        entropy_term(series['Other']/total)\n", "    \n", "    s = s/max_s\n", "    return s\n", "\n", "def entropy4_explicit_labels(series):\n", "    \"\"\"entropy4 calculation for an input Series with 4 categories\"\"\"\n", "    # calculate the normalizing term -- what's the maximum entropy\n", "    # there are four categories here\n", "    max_s = -np.log(1.0/4)\n", "    # don't include Other in the total\n", "    total = float(series['White']+series['Black']+series['Asian']+ \\\n", "                  series['Hispanic'])\n", "    \n", "    s = entropy_term(series['White']/total) + \\\n", "        entropy_term(series['Black']/total) + \\\n", "        entropy_term(series['Asian']/total) + \\\n", "        entropy_term(series['Hispanic']/total)\n", "    \n", "    s = s/max_s\n", "    return s" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 24 },
{ "cell_type": "code", "collapsed": false, "input": [ "# Using the population figures for the Houston Metro Area\n", "# Make a pandas Series out of the dict\n", "\n", "houston = Series({'Asian': 384596,\n", "                  'Black': 998883,\n", "                  'Hispanic': 2099412,\n", "                  'Other': 103437,\n", "                  'White': 2360472})" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 30 },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Note how the `entropy` function can be used to do both the entropy5 and the entropy4 calculation\n", "by just changing the subset of the `houston` Series being passed into `entropy`." ] },
{ "cell_type": "code", "collapsed": false, "input": [ "# comparing two ways of doing the entropy5 calculation\n", "(entropy(houston[['White', 'Black', 'Asian', 'Hispanic', 'Other']]),\n", " entropy5_explicit_labels(houston))" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 36, "text": [ "(0.79628076626851163, 0.79628076626851163)" ] } ], "prompt_number": 36 },
{ "cell_type": "code", "collapsed": false, "input": [ "# comparing two ways of doing the entropy4 calculation\n", "# don't include Other\n", "\n", "(entropy(houston[['White', 'Black', 'Asian', 'Hispanic']]),\n", " entropy4_explicit_labels(houston))" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 37, "text": [ "(0.87642479416885899, 0.87642479416885899)" ] } ], "prompt_number": 37 },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Calculating an `entropy_rice` function is left to the reader...." ] },
{ "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }
] , "metadata": {} } ] }