{
 "metadata": {
  "name": "",
  "signature": "sha256:6c7e72aac55581fc1d765007dbd591dcecd846eaac6ea8dcbee626783e04095d"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "*Working with Open Data* Midterm (March 18, 2014)\n",
      "\n",
      "There are **94** points in this exam:  2 each for the **47 questions**.  The questions are either **multiple choice** or **short answers**.  For **multiple choice**, just write the **number** of the choice selected.\n",
      "\n",
      "\n",
      "\n",
      "Name: `______________________________________`\n",
      "\n",
      "\n",
      "`"
     ]
    },
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "World Population"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Consider this code to construct a DataFrame of populations of countries."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import json\n",
      "import requests\n",
      "from pandas import DataFrame\n",
      "\n",
      "# read population in from JSON-formatted data derived from the Wikipedia\n",
      "pop_json_url = \"https://gist.github.com/rdhyee/8511607/\" + \\\n",
      "     \"raw/f16257434352916574473e63612fcea55a0c1b1c/population_of_countries.json\"\n",
      "pop_list= requests.get(pop_json_url).json()\n",
      "\n",
      "df = DataFrame(pop_list)\n",
      "df[:5]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>0</th>\n",
        "      <th>1</th>\n",
        "      <th>2</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>0</th>\n",
        "      <td> 1</td>\n",
        "      <td>         China</td>\n",
        "      <td> 1385566537</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>1</th>\n",
        "      <td> 2</td>\n",
        "      <td>         India</td>\n",
        "      <td> 1252139596</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>2</th>\n",
        "      <td> 3</td>\n",
        "      <td> United States</td>\n",
        "      <td>  320050716</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>3</th>\n",
        "      <td> 4</td>\n",
        "      <td>     Indonesia</td>\n",
        "      <td>  249865631</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>4</th>\n",
        "      <td> 5</td>\n",
        "      <td>        Brazil</td>\n",
        "      <td>  200361925</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "<p>5 rows \u00d7 3 columns</p>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 1,
       "text": [
        "   0              1           2\n",
        "0  1          China  1385566537\n",
        "1  2          India  1252139596\n",
        "2  3  United States   320050716\n",
        "3  4      Indonesia   249865631\n",
        "4  5         Brazil   200361925\n",
        "\n",
        "[5 rows x 3 columns]"
       ]
      }
     ],
     "prompt_number": 1
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Note the `dtypes` of the columns."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "df.dtypes"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 2,
       "text": [
        "0    float64\n",
        "1     object\n",
        "2      int64\n",
        "dtype: object"
       ]
      }
     ],
     "prompt_number": 2
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Q1**: What is the relationship between `s` and the population of China, where `s` is defined as\n",
      "\n",
      "    s = sum(df[df[1].str.startswith('C')][2])\n",
      "    \n",
      "1. `s` is **greater** than the population of China\n",
      "2. `s` is the **same** as the population of China\n",
      "3. `s` is **less** than the population of China\n",
      "4. `s` is not a number.\n",
      "\n"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**A1**:\n",
      "<pre>\n",
      "\n",
      "</pre>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Q2**: What is the relationship between `s2` and the population of China, where `s2` is defined by:\n",
      "\n",
      "    s2 = sum(df[df[1].str.startswith('X')][2])\n",
      "    \n",
      "1. `s2` is **greater** than the population of China\n",
      "1. `s2` is the **same** as the population of China\n",
      "1. `s2` is **less** than the population of China\n",
      "1. `s2` is not a number.\n"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**A2**:\n",
      "<pre>\n",
      "\n",
      "</pre>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Q3**: What happens when the following statement is run?\n",
      "\n",
      "    df.columns = ['Number','Country','Population']\n",
      "    \n",
      "1. Nothing\n",
      "1. `df` gets a new attribute called `columns`\n",
      "1. `df`'s columns are renamed based on the list\n",
      "1. Throws an exception"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**A3**:\n",
      "<pre>\n",
      "\n",
      "</pre>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Q4**: This statement does the following\n",
      "\n",
      "    df.columns = ['Number','Country']\n",
      "    \n",
      "1. Nothing\n",
      "1. `df` gets a new attribute called `columns`\n",
      "1. `df`'s columns are renamed based on the list\n",
      "1. Throws an exception"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**A4**:\n",
      "<pre>\n",
      "\n",
      "\n",
      "</pre>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Q5**: How would you rewrite the following statement to get the same result as\n",
      "\n",
      "    s = sum(df[df[1].str.startswith('C')][2])\n",
      "\n",
      "after running:\n",
      "\n",
      "    df.columns = ['Number','Country','Population']"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**A5**:\n",
      "<pre>\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "</pre>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Q6**.  What is\n",
      "\n",
      "```Python\n",
      "    len(df[df[\"Population\"] > 1000000000])\n",
      "```"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**A6**:\n",
      "<pre>\n",
      "\n",
      "\n",
      "</pre>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Q7**. What is\n",
      "\n",
      "```Python\n",
      "    \";\".join(df[df['Population']>1000000000]['Country'].apply(lambda s: s[0]))\n",
      "```"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**A7**:\n",
      "<pre>\n",
      "\n",
      "\n",
      "</pre>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Q8**. What is\n",
      "\n",
      " ```Python\n",
      "      len(\";\".join(df[df['Population']>1000000000]['Country'].apply(lambda s: s[0])))\n",
      " ```"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**A8**:\n",
      "<pre>\n",
      "\n",
      "\n",
      "</pre>"
     ]
    },
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Pandas Series"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from pandas import DataFrame, Series\n",
      "import numpy as np\n",
      "\n",
      "s1 = Series(np.arange(-1,4))\n",
      "s1"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 11,
       "text": [
        "0   -1\n",
        "1    0\n",
        "2    1\n",
        "3    2\n",
        "4    3\n",
        "dtype: int64"
       ]
      }
     ],
     "prompt_number": 11
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Q9**: What is\n",
      "\n",
      "```Python\n",
      "    s1 + 1\n",
      "```"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**A9**:\n",
      "<pre>\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "</pre>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Q10**: What is\n",
      "\n",
      "```Python\n",
      "s1.apply(lambda k: 2*k).sum()\n",
      "```"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**A10**:\n",
      "<pre>\n",
      "\n",
      "\n",
      "</pre>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Q11**: What is \n",
      "\n",
      "```Python\n",
      "    s1.cumsum()[3]\n",
      "```"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**A11**:\n",
      "<pre>\n",
      "\n",
      "\n",
      "</pre>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Q12**: What is\n",
      "\n",
      "    s1.cumsum() - s1.cumsum()"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**A12**:\n",
      "<pre>\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "</pre>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Q13**. What is\n",
      "\n",
      "```Python\n",
      "    len(s1.cumsum() - s1.cumsum())\n",
      "```"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**A13**:\n",
      "<pre>\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "</pre>\n",
      "\n"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Q14**: What is\n",
      "\n",
      "    np.any(s1 > 2)"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**A14**:\n",
      "<pre>\n",
      "\n",
      "\n",
      "</pre>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Q15**. What is\n",
      "\n",
      "```Python\n",
      "    np.all(s1<3)\n",
      "```"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**A15**:\n",
      "<pre>\n",
      "\n",
      "\n",
      "</pre>"
     ]
    },
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Census API"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Consider the following code to load population(s) from the Census API."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from census import Census\n",
      "from us import states\n",
      "\n",
      "import settings\n",
      "\n",
      "c = Census(settings.CENSUS_KEY)\n",
      "c.sf1.get(('NAME', 'P0010001'), {'for': 'state:%s' % states.CA.fips})"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 19,
       "text": [
        "[{u'NAME': u'California', u'P0010001': u'37253956', u'state': u'06'}]"
       ]
      }
     ],
     "prompt_number": 19
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Q16**: What is the purpose of `settings.CENSUS_KEY`?\n",
      "\n",
      "1. It is the password for the Census Python package\n",
      "1. It is an API Access key for authentication with the Census API\n",
      "1. It is an API Access key for authentication with Github\n",
      "1. It is key shared by all users of the Census API"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**A16**:\n",
      "<pre>\n",
      "\n",
      "</pre>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Q17**. When we run\n",
      "\n",
      "```bash\n",
      "pip install census\n",
      "```\n",
      "\n",
      "we are:\n",
      "\n",
      "1. installing a Python module from PyPI\n",
      "1. installing the Python module census from continuum.io's repository\n",
      "1. signing ourselves up for a census API key\n",
      "1. None of the above"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**A17**:\n",
      "<pre>\n",
      "\n",
      "</pre>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Consider `r1` and `r2`:"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Q18**: What is the difference between `r1` and `r2`?\n",
      "\n",
      "```Python\n",
      "    r1 = c.sf1.get(('NAME', 'P0010001'), {'for': 'county:*', 'in': 'state:%s' % states.CA.fips})\n",
      "    r2 = c.sf1.get(('NAME', 'P0010001'), {'for': 'county:*', 'in': 'state:*' })\n",
      "```"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**A18**:\n",
      "<pre>\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "</pre>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Q19**. What's the relationship between `len(r1)` and `len(r2)`?\n",
      "\n",
      "1. `len(r1)` is less than `len(r2)`\n",
      "1. `len(r1)` equals `len(r2)`\n",
      "1. `len(r1)` is greater than `len(r2)`\n",
      "1. None of the above"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**A19**:\n",
      "<pre>\n",
      "\n",
      "</pre>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Q20**: Which is a correct geographic hierarchy?\n",
      "\n",
      "Nation > States = Nation is subdivided into States\n",
      "\n",
      "1. Counties > States\n",
      "1. Counties > Block Groups > Census Tracks\n",
      "1. Census Tracts > Block Groups > Census Blocks\n",
      "1. Places > Counties\n"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**A20**:\n",
      "<pre>\n",
      "\n",
      "</pre>"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from pandas import DataFrame\n",
      "import numpy as np\n",
      "from census import Census\n",
      "from us import states\n",
      "\n",
      "import settings\n",
      "\n",
      "c = Census(settings.CENSUS_KEY)\n",
      "\n",
      "r = c.sf1.get(('NAME', 'P0010001'), {'for': 'state:*'})\n",
      "df1 = DataFrame(r)\n",
      "\n",
      "df1.head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>NAME</th>\n",
        "      <th>P0010001</th>\n",
        "      <th>state</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>0</th>\n",
        "      <td>    Alabama</td>\n",
        "      <td>  4779736</td>\n",
        "      <td> 01</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>1</th>\n",
        "      <td>     Alaska</td>\n",
        "      <td>   710231</td>\n",
        "      <td> 02</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>2</th>\n",
        "      <td>    Arizona</td>\n",
        "      <td>  6392017</td>\n",
        "      <td> 04</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>3</th>\n",
        "      <td>   Arkansas</td>\n",
        "      <td>  2915918</td>\n",
        "      <td> 05</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>4</th>\n",
        "      <td> California</td>\n",
        "      <td> 37253956</td>\n",
        "      <td> 06</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "<p>5 rows \u00d7 3 columns</p>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 21,
       "text": [
        "         NAME  P0010001 state\n",
        "0     Alabama   4779736    01\n",
        "1      Alaska    710231    02\n",
        "2     Arizona   6392017    04\n",
        "3    Arkansas   2915918    05\n",
        "4  California  37253956    06\n",
        "\n",
        "[5 rows x 3 columns]"
       ]
      }
     ],
     "prompt_number": 21
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "len(df1)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 22,
       "text": [
        "52"
       ]
      }
     ],
     "prompt_number": 22
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Q21**: Why does `df` have 52 items? Please explain"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**A21**:\n",
      "<pre>\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "</pre>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Consider the two following expressions:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print df1.P0010001.sum()\n",
      "print\n",
      "print df1.P0010001.astype(int).sum()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "477973671023163920172915918372539565029196357409789793460172318801310968765313603011567582128306326483802304635528531184339367453337213283615773552654762998836405303925296729759889279894151826341270055113164708791894205917919378102953548367259111536504375135138310741270237910525674625364814180634610525145561276388562574180010246724540185299456869865636263725789\n",
        "\n",
        "312471327\n"
       ]
      }
     ],
     "prompt_number": 23
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Q22**: Why is `df1.P0010001.sum()` different from `df1.P0010001.astype(int).sum()`? "
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**A22**:\n",
      "<pre>\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "</pre>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Q23**: Describe the output of the following:\n",
      "\n",
      "```Python\n",
      "df1.P0010001 = df1.P0010001.astype(int)\n",
      "df1[['NAME','P0010001']].sort('P0010001', ascending=True).head()\n",
      "```"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**A23**:\n",
      "<pre>\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "</pre>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Q24**: After running:\n",
      "\n",
      "```Python\n",
      "    df1.set_index('NAME', inplace=True)\n",
      "```\n",
      "\n",
      "how would you access the Series for the state of Nebraska?\n",
      "\n",
      "1. `df1['Nebraska']`\n",
      "1. `df1[1]`\n",
      "1. `df1.ix['Nebraska']`\n",
      "1. `df1[df1['NAME'] == 'Nebraska']`"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**A24**:\n",
      "<pre>\n",
      "\n",
      "</pre>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Q25**. What is `len(states.STATES)`?"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**A25**:\n",
      "<pre>\n",
      "\n",
      "\n",
      "</pre>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Q26**. What is\n",
      "\n",
      "```Python\n",
      "len(df1[np.in1d(df1.state, [s.fips for s in states.STATES])])\n",
      "```"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**A26**:\n",
      "<pre>\n",
      "\n",
      "\n",
      "</pre>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In the next question, we will make use of the negation operator `~`.  Take a look at a specific example"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "~Series([True, True, False, True])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 28,
       "text": [
        "0    False\n",
        "1    False\n",
        "2     True\n",
        "3    False\n",
        "dtype: bool"
       ]
      }
     ],
     "prompt_number": 28
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Q27**. What is\n",
      "\n",
      "```Python\n",
      "    list(df1[~np.in1d(df1.state, [s.fips for s in states.STATES])].index)[0]\n",
      "```    "
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**A27**:\n",
      "<pre>\n",
      "\n",
      "\n",
      "</pre>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Consider `pop1` and `pop2`:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "pop1 = df1['P0010001'].astype('int').sum() \n",
      "pop2 = df1[np.in1d(df1.state, [s.fips for s in states.STATES])]['P0010001'].astype('int').sum()\n",
      "\n",
      "pop1-pop2"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 30,
       "text": [
        "3725789"
       ]
      }
     ],
     "prompt_number": 30
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Q28**. What does `pop11 - pop2` represent?"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**A28**:\n",
      "<pre>\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "</pre>"
     ]
    },
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Generator and range"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Q29**. Given that\n",
      "\n",
      "    range(10)\n",
      "    \n",
      "is\n",
      "\n",
      "    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]\n",
      "\n",
      "How to get the total of every integer from 1 to 100?\n",
      "\n",
      "1. `sum(range(1, 101))`\n",
      "1. `sum(range(100))`\n",
      "1. `sum(range(1, 100))`\n",
      "1. None of the above"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**A29**:\n",
      "<pre>\n",
      "\n",
      "</pre>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Q30**. What output is produced from\n",
      "\n",
      "```Python\n",
      "# itertools is a great library\n",
      "# http://docs.python.org/2/library/itertools.html#itertools.count\n",
      "# itertools.count(start=0, step=1):\n",
      "# \"Make an iterator that returns evenly spaced values starting with step.\"\n",
      "\n",
      "from itertools import islice, count\n",
      "c = count(0, 1)\n",
      "print c.next()\n",
      "print c.next()\n",
      "```"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**A30**:\n",
      "<pre>\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "</pre>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Q31**. Recalling that \n",
      "\n",
      "    1+2+3+...+100 = 5050\n",
      "    \n",
      "what is:\n",
      "\n",
      "```Python\n",
      "(2*Series(np.arange(101))).sum()\n",
      "```"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**A31**:\n",
      "<pre>\n",
      "\n",
      "\n",
      "</pre>"
     ]
    },
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Census Places"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Consider the follow generator that we used to query for census places."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import pandas as pd\n",
      "from pandas import DataFrame\n",
      "\n",
      "import census\n",
      "import settings\n",
      "import us\n",
      "\n",
      "from itertools import islice\n",
      "\n",
      "c=census.Census(settings.CENSUS_KEY)\n",
      "\n",
      "def places(variables=\"NAME\"):\n",
      "    \n",
      "    for state in us.states.STATES:\n",
      "        geo = {'for':'place:*', 'in':'state:{s_fips}'.format(s_fips=state.fips)}\n",
      "        for place in c.sf1.get(variables, geo=geo):\n",
      "            yield place\n",
      "   "
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 34
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now we compute a DataFrame for the places: `places_df`"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "r = list(islice(places(\"NAME,P0010001\"), None))\n",
      "places_df = DataFrame(r)\n",
      "places_df.P0010001 = places_df.P0010001.astype('int')\n",
      "\n",
      "print \"number of places\", len(places_df)\n",
      "print \"total pop\", places_df.P0010001.sum()\n",
      "places_df.head()\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "number of places 29261\n",
        "total pop 228457238\n"
       ]
      },
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>NAME</th>\n",
        "      <th>P0010001</th>\n",
        "      <th>place</th>\n",
        "      <th>state</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>0</th>\n",
        "      <td>      Abanda CDP</td>\n",
        "      <td>  192</td>\n",
        "      <td> 00100</td>\n",
        "      <td> 01</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>1</th>\n",
        "      <td>  Abbeville city</td>\n",
        "      <td> 2688</td>\n",
        "      <td> 00124</td>\n",
        "      <td> 01</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>2</th>\n",
        "      <td> Adamsville city</td>\n",
        "      <td> 4522</td>\n",
        "      <td> 00460</td>\n",
        "      <td> 01</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>3</th>\n",
        "      <td>    Addison town</td>\n",
        "      <td>  758</td>\n",
        "      <td> 00484</td>\n",
        "      <td> 01</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>4</th>\n",
        "      <td>      Akron town</td>\n",
        "      <td>  356</td>\n",
        "      <td> 00676</td>\n",
        "      <td> 01</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "<p>5 rows \u00d7 4 columns</p>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 35,
       "text": [
        "              NAME  P0010001  place state\n",
        "0       Abanda CDP       192  00100    01\n",
        "1   Abbeville city      2688  00124    01\n",
        "2  Adamsville city      4522  00460    01\n",
        "3     Addison town       758  00484    01\n",
        "4       Akron town       356  00676    01\n",
        "\n",
        "[5 rows x 4 columns]"
       ]
      }
     ],
     "prompt_number": 35
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We display the most populous places from California"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "places_df[places_df.state=='06'].sort_index(by='P0010001', ascending=False).head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>NAME</th>\n",
        "      <th>P0010001</th>\n",
        "      <th>place</th>\n",
        "      <th>state</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>2714</th>\n",
        "      <td>   Los Angeles city</td>\n",
        "      <td> 3792621</td>\n",
        "      <td> 44000</td>\n",
        "      <td> 06</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>3112</th>\n",
        "      <td>     San Diego city</td>\n",
        "      <td> 1307402</td>\n",
        "      <td> 66000</td>\n",
        "      <td> 06</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>3122</th>\n",
        "      <td>      San Jose city</td>\n",
        "      <td>  945942</td>\n",
        "      <td> 68000</td>\n",
        "      <td> 06</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>3116</th>\n",
        "      <td> San Francisco city</td>\n",
        "      <td>  805235</td>\n",
        "      <td> 67000</td>\n",
        "      <td> 06</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>2425</th>\n",
        "      <td>        Fresno city</td>\n",
        "      <td>  494665</td>\n",
        "      <td> 27000</td>\n",
        "      <td> 06</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "<p>5 rows \u00d7 4 columns</p>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 36,
       "text": [
        "                    NAME  P0010001  place state\n",
        "2714    Los Angeles city   3792621  44000    06\n",
        "3112      San Diego city   1307402  66000    06\n",
        "3122       San Jose city    945942  68000    06\n",
        "3116  San Francisco city    805235  67000    06\n",
        "2425         Fresno city    494665  27000    06\n",
        "\n",
        "[5 rows x 4 columns]"
       ]
      }
     ],
     "prompt_number": 36
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Q32**. Given that\n",
      "\n",
      "    places_df[places_df.state=='06'].sort_index(by='P0010001', ascending=False).head()\n",
      "\n",
      "is\n",
      "\n",
      "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
      "<table border=\"1\" class=\"dataframe\">\n",
      "  <thead>\n",
      "    <tr style=\"text-align: right;\">\n",
      "      <th></th>\n",
      "      <th>NAME</th>\n",
      "      <th>P0010001</th>\n",
      "      <th>place</th>\n",
      "      <th>state</th>\n",
      "    </tr>\n",
      "  </thead>\n",
      "  <tbody>\n",
      "    <tr>\n",
      "      <th>2714</th>\n",
      "      <td>   Los Angeles city</td>\n",
      "      <td> 3792621</td>\n",
      "      <td> 44000</td>\n",
      "      <td> 06</td>\n",
      "    </tr>\n",
      "    <tr>\n",
      "      <th>3112</th>\n",
      "      <td>     San Diego city</td>\n",
      "      <td> 1307402</td>\n",
      "      <td> 66000</td>\n",
      "      <td> 06</td>\n",
      "    </tr>\n",
      "    <tr>\n",
      "      <th>3122</th>\n",
      "      <td>      San Jose city</td>\n",
      "      <td>  945942</td>\n",
      "      <td> 68000</td>\n",
      "      <td> 06</td>\n",
      "    </tr>\n",
      "    <tr>\n",
      "      <th>3116</th>\n",
      "      <td> San Francisco city</td>\n",
      "      <td>  805235</td>\n",
      "      <td> 67000</td>\n",
      "      <td> 06</td>\n",
      "    </tr>\n",
      "    <tr>\n",
      "      <th>2425</th>\n",
      "      <td>        Fresno city</td>\n",
      "      <td>  494665</td>\n",
      "      <td> 27000</td>\n",
      "      <td> 06</td>\n",
      "    </tr>\n",
      "  </tbody>\n",
      "</table>\n",
      "<p>5 rows \u00d7 4 columns</p>\n",
      "</div>\n",
      "<br/>\n",
      "what is\n",
      "\n",
      "```Python\n",
      "places_df.ix[3122]['label']\n",
      "```\n",
      "\n",
      "after we add the `label` column with:\n",
      "\n",
      "```Python\n",
      "places_df['label'] = places_df.apply(lambda s: s['state']+s['place'], axis=1)\n",
      "```\n"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**A32**:\n",
      "<pre>\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "</pre>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Q33**.  What is\n",
      "\n",
      "```Python\n",
      "places_df[\"NAME\"][3122]\n",
      "```"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**A33**:\n",
      "<pre>\n",
      "\n",
      "\n",
      "</pre>"
     ]
    },
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Alphabet and apply"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now let's set up a DataFrame with some letters and properties of letters."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# numpy and pandas related imports \n",
      "\n",
      "import numpy as np\n",
      "from pandas import Series, DataFrame\n",
      "import pandas as pd\n",
      "\n",
      "# for example, using lower and uppercase English letters\n",
      "\n",
      "import string\n",
      "\n",
      "lower = Series(list(string.lowercase), name='lower')\n",
      "upper = Series(list(string.uppercase), name='upper')\n",
      "\n",
      "df2 = pd.concat((lower, upper), axis=1)\n",
      "df2['ord'] = df2['lower'].apply(ord)\n",
      "df2.head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>lower</th>\n",
        "      <th>upper</th>\n",
        "      <th>ord</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>0</th>\n",
        "      <td> a</td>\n",
        "      <td> A</td>\n",
        "      <td>  97</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>1</th>\n",
        "      <td> b</td>\n",
        "      <td> B</td>\n",
        "      <td>  98</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>2</th>\n",
        "      <td> c</td>\n",
        "      <td> C</td>\n",
        "      <td>  99</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>3</th>\n",
        "      <td> d</td>\n",
        "      <td> D</td>\n",
        "      <td> 100</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>4</th>\n",
        "      <td> e</td>\n",
        "      <td> E</td>\n",
        "      <td> 101</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "<p>5 rows \u00d7 3 columns</p>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 39,
       "text": [
        "  lower upper  ord\n",
        "0     a     A   97\n",
        "1     b     B   98\n",
        "2     c     C   99\n",
        "3     d     D  100\n",
        "4     e     E  101\n",
        "\n",
        "[5 rows x 3 columns]"
       ]
      }
     ],
     "prompt_number": 39
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Note that `string.upper` takes a string and returns a copy of it converted to uppercase.  For example:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "string.upper('b')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 40,
       "text": [
        "'B'"
       ]
      }
     ],
     "prompt_number": 40
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Q34**. What is\n",
      "\n",
      "```Python\n",
      "np.all(df2['lower'].apply(string.upper) == df2['upper'])\n",
      "```\n"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**A34**:\n",
      "<pre>\n",
      "\n",
      "\n",
      "\n",
      "</pre>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Q35**. What is\n",
      "\n",
      "```Python\n",
      "df2.apply(lambda s: s['lower'] + s['upper'], axis=1)[6]\n",
      "```"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**A35**:\n",
      "<pre>\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "</pre>"
     ]
    },
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Berkeley I School generator"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Please remind yourself what `enumerate` does."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "words = ['Berkeley', 'I', 'School']\n",
      "\n",
      "for (i, word) in islice(enumerate(words),1):\n",
      "    print (i, word)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "(0, 'Berkeley')\n"
       ]
      }
     ],
     "prompt_number": 43
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Q36**. What is \n",
      "\n",
      "```Python\n",
      "list(enumerate(words))[2][1]\n",
      "```"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**A36**:\n",
      "<pre>\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "</pre>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now consider the generator `g2`:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def g2():\n",
      "    words = ['Berkeley', 'I', 'School']\n",
      "    for word in words:\n",
      "        if word != 'I':\n",
      "            for letter in list(word):\n",
      "                yield letter\n",
      "            \n",
      "my_g2 = g2()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 45
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Q37**. What is\n",
      "\n",
      "```Python\n",
      "len(list(my_g2))\n",
      "```"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**A37**:\n",
      "<pre>\n",
      "\n",
      "\n",
      "</pre>"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def g3():\n",
      "    words = ['Berkeley', 'I', 'School']\n",
      "    for word in words:\n",
      "        yield words\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 47
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Q38**. What is\n",
      "\n",
      "```Python\n",
      "len(list(g3()))\n",
      "```"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**A38**:\n",
      "<pre>\n",
      "\n",
      "\n",
      "</pre>"
     ]
    },
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Groupby"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Consider using `groupby` on a DataFrame of state populations."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import us\n",
      "import census\n",
      "import settings\n",
      "\n",
      "import pandas as pd\n",
      "import numpy as np\n",
      "from pandas import DataFrame, Series\n",
      "from itertools import islice\n",
      "\n",
      "c = census.Census(settings.CENSUS_KEY)\n",
      "\n",
      "def states(variables='NAME'):\n",
      "    geo={'for':'state:*'}\n",
      "    \n",
      "    states_fips = set([state.fips for state in us.states.STATES])\n",
      "    # need to filter out non-states\n",
      "    for r in c.sf1.get(variables, geo=geo, year=2010):\n",
      "        if r['state'] in states_fips:\n",
      "            yield r\n",
      "            \n",
      "# make a dataframe from the total populations of states in the 2010 Census\n",
      "\n",
      "df = DataFrame(states('NAME,P0010001'))\n",
      "df.P0010001 = df.P0010001.astype('int')\n",
      "df['first_letter'] = df.NAME.apply(lambda s:s[0])\n",
      "\n",
      "df.head()\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>NAME</th>\n",
        "      <th>P0010001</th>\n",
        "      <th>state</th>\n",
        "      <th>first_letter</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>0</th>\n",
        "      <td>    Alabama</td>\n",
        "      <td>  4779736</td>\n",
        "      <td> 01</td>\n",
        "      <td> A</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>1</th>\n",
        "      <td>     Alaska</td>\n",
        "      <td>   710231</td>\n",
        "      <td> 02</td>\n",
        "      <td> A</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>2</th>\n",
        "      <td>    Arizona</td>\n",
        "      <td>  6392017</td>\n",
        "      <td> 04</td>\n",
        "      <td> A</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>3</th>\n",
        "      <td>   Arkansas</td>\n",
        "      <td>  2915918</td>\n",
        "      <td> 05</td>\n",
        "      <td> A</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>4</th>\n",
        "      <td> California</td>\n",
        "      <td> 37253956</td>\n",
        "      <td> 06</td>\n",
        "      <td> C</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "<p>5 rows \u00d7 4 columns</p>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 49,
       "text": [
        "         NAME  P0010001 state first_letter\n",
        "0     Alabama   4779736    01            A\n",
        "1      Alaska    710231    02            A\n",
        "2     Arizona   6392017    04            A\n",
        "3    Arkansas   2915918    05            A\n",
        "4  California  37253956    06            C\n",
        "\n",
        "[5 rows x 4 columns]"
       ]
      }
     ],
     "prompt_number": 49
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "For reference, here is a list of all the names in `df.NAME`:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print list(df.NAME)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[u'Alabama', u'Alaska', u'Arizona', u'Arkansas', u'California', u'Colorado', u'Connecticut', u'Delaware', u'District of Columbia', u'Florida', u'Georgia', u'Hawaii', u'Idaho', u'Illinois', u'Indiana', u'Iowa', u'Kansas', u'Kentucky', u'Louisiana', u'Maine', u'Maryland', u'Massachusetts', u'Michigan', u'Minnesota', u'Mississippi', u'Missouri', u'Montana', u'Nebraska', u'Nevada', u'New Hampshire', u'New Jersey', u'New Mexico', u'New York', u'North Carolina', u'North Dakota', u'Ohio', u'Oklahoma', u'Oregon', u'Pennsylvania', u'Rhode Island', u'South Carolina', u'South Dakota', u'Tennessee', u'Texas', u'Utah', u'Vermont', u'Virginia', u'Washington', u'West Virginia', u'Wisconsin', u'Wyoming']\n"
       ]
      }
     ],
     "prompt_number": 50
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Q39**. What is\n",
      "\n",
      "\n",
      "```Python\n",
      "df.groupby('first_letter').apply(lambda g:list(g.NAME))['C']\n",
      "```\n"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**A39**:\n",
      "<pre>\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "</pre>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Q40**. What is\n",
      "\n",
      "\n",
      "```Python\n",
      "df.groupby('first_letter').apply(lambda g:len(g.NAME))['A']\n",
      "```\n"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**A40**:\n",
      "<pre>\n",
      "\n",
      "\n",
      "</pre>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Q41**. What is\n",
      "\n",
      "\n",
      "```Python\n",
      "df.groupby('first_letter').agg('count')['first_letter']['P']\n",
      "```\n"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**A41**:\n",
      "<pre>\n",
      "\n",
      "\n",
      "</pre>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Q42**. What is\n",
      "\n",
      "```Python\n",
      "len(df.groupby('NAME'))\n",
      "```"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**A42**:\n",
      "<pre>\n",
      "\n",
      "\n",
      "</pre>"
     ]
    },
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Diversity Index"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Recall the code from the diversity calculations:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def normalize(s):\n",
      "    \"\"\"take a Series and divide each item by the sum so that the new series adds up to 1.0\"\"\"\n",
      "    total = np.sum(s)\n",
      "    return s.astype('float') / total\n",
      "\n",
      "def entropy(series):\n",
      "    \"\"\"Normalized Shannon Index\"\"\"\n",
      "    # a series in which all the entries are equal should result in normalized entropy of 1.0\n",
      "    \n",
      "    # eliminate 0s\n",
      "    series1 = series[series!=0]\n",
      "\n",
      "    # if fewer than 2 nonzero entries remain (len(series1) is 0 or 1), return 0\n",
      "    \n",
      "    if len(series1) > 1:\n",
      "        # calculate the maximum possible entropy for given length of input series\n",
      "        max_s = -np.log(1.0/len(series))\n",
      "    \n",
      "        total = float(sum(series1))\n",
      "        p = series1.astype('float')/float(total)\n",
      "        return sum(-p*np.log(p))/max_s\n",
      "    else:\n",
      "        return 0.0\n",
      "\n",
      "def gini_simpson(s):\n",
      "    # https://en.wikipedia.org/wiki/Diversity_index#Gini.E2.80.93Simpson_index\n",
      "    s1 = normalize(s)\n",
      "    return 1-np.sum(s1*s1)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 55
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Q43**.  Suppose you have 10 people and 5 categories.  How would you distribute the people to maximize the Shannon entropy?\n",
      "\n",
      "1. Regardless of how you distribute the people, you'll get the same entropy.\n",
      "1. Put 10 people in any single category, and then 0 in the rest.\n",
      "1. Distribute the people evenly over all the categories.\n",
      "1. Put 5 people in each category."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**A43**:\n",
      "<pre>\n",
      "\n",
      "</pre>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Q44**. What is\n",
      "\n",
      "```Python\n",
      "entropy(Series([0,0,10,0,0]))\n",
      "```"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**A44**:\n",
      "<pre>\n",
      "\n",
      "\n",
      "</pre>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Q45**. What is\n",
      "\n",
      "```Python\n",
      "entropy(Series([10,0,0,0,0]))\n",
      "```"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**A45**:\n",
      "<pre>\n",
      "\n",
      "\n",
      "</pre>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Q46**. What is\n",
      "\n",
      "```Python\n",
      "entropy(Series([1,1,1,1,1]))\n",
      "```"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**A46**:\n",
      "<pre>\n",
      "\n",
      "\n",
      "</pre>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Q47**. What is\n",
      "\n",
      "```Python\n",
      "gini_simpson(Series([2,2,2,2,2]))\n",
      "```"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "\n",
      "**A47**:\n",
      "<pre>\n",
      "\n",
      "\n",
      "</pre>"
     ]
    }
   ],
   "metadata": {}
  }
 ]
}