{
 "metadata": {
  "name": ""
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {
      "slideshow": {
       "slide_type": "slide"
      }
     },
     "source": [
      "# Hi!\n",
      "\n",
      "I'm Julia."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%pylab inline\n",
      "import pandas as pd\n",
      "pd.set_option('display.mpl_style', 'default')\n",
      "figsize(15, 6)\n",
      "pd.set_option('display.line_width', 4000)\n",
      "pd.set_option('display.max_columns', 100)"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "skip"
      }
     },
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {
      "slideshow": {
       "slide_type": "slide"
      }
     },
     "source": [
      "# Goal (Today)\n",
      "\n",
      "Know how to use pandas to answer some specific questions about a dataset\n",
      "\n",
      "Roadmap:\n",
      "\n",
      "1. Demo with rats\n",
      "2. Dataframes: what makes pandas *powerful*\n",
      "3. Selecting data from a dataframe\n",
      "4. Time series and indexes and resampling\n",
      "5. Groupby + aggregate"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Some notes about installation:\n",
      "\n",
      "### Don't do this:\n",
      "\n",
      "```\n",
      "sudo apt-get install ipython-notebook\n",
      "```\n",
      "\n",
      "### Instead, do this:\n",
      "\n",
      "```\n",
      "pip install ipython tornado pyzmq\n",
      "```\n",
      "\n",
      "or install Anaconda from [http://store.continuum.io](http://store.continuum.io) (what I do)\n",
      "\n",
      "You can start IPython notebook by running\n",
      "\n",
      "```\n",
      "ipython notebook --pylab inline\n",
      "```"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# First: Read the data"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Download and read the data\n",
      "!wget \"http://bit.ly/311-data-tar-gz\" -O 311-data.tar.gz\n",
      "!wget \"https://raw2.github.com/jvns/talks/master/pyladiesfeb2014/tiny.csv\" -O tiny.csv\n",
      "!tar -xzf \"311-data.tar.gz\" # wget does different things\n",
      "orig_data = pd.read_csv('./311-service-requests.csv', nrows=100000, parse_dates=['Created Date'])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 45
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "plot(orig_data['Longitude'], orig_data['Latitude'], '.', color=\"purple\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {
      "slideshow": {
       "slide_type": "slide"
      }
     },
     "source": [
      "# Example 1: Graph the number of noise complaints each hour in New York"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "complaints = orig_data[['Created Date', 'Complaint Type']]\n",
      "noise_complaints = complaints[complaints['Complaint Type'] == 'Noise - Street/Sidewalk']\n",
      "noise_complaints.set_index('Created Date').sort_index().resample('H', how=len).plot()"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "fragment"
      }
     },
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {
      "slideshow": {
       "slide_type": "slide"
      }
     },
     "source": [
      "# Example 2: What are the most common complaint types?"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "orig_data['Complaint Type'].value_counts()[:20].plot(kind='bar')"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "fragment"
      }
     },
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {
      "slideshow": {
       "slide_type": "slide"
      }
     },
     "source": [
      "# Roadmap:\n",
      "\n",
      "1. Dataframes: what makes pandas *powerful*\n",
      "1. Selecting data from a dataframe\n",
      "1. Time series and indexes\n",
      "1. Graphing"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {
      "slideshow": {
       "slide_type": "slide"
      }
     },
     "source": [
      "# What is pandas?\n",
      "<img src=\"https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcTssUO5rA3mL-WaA0LnTu4s0sdcoUAOFmUOle_i8EEIG-SpS9oT\">"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {
      "slideshow": {
       "slide_type": "slide"
      }
     },
     "source": [
      "# A few awesome things about pandas\n",
      "\n",
      "* Really, really, really, really good at time series\n",
      "* Can import Excel files (!!!)\n",
      "* Fast (joining dataframes, etc.)"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {
      "slideshow": {
       "slide_type": "skip"
      }
     },
     "source": [
      "This is what lets you manipulate data easily -- the dataframe is basically the whole reason for pandas. It's a powerful concept from the statistical computing language *R*.\n",
      "\n",
      "If you don't know R, you can think of it like a database table (it has rows and columns), or like a table of numbers."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {
      "slideshow": {
       "slide_type": "slide"
      }
     },
     "source": [
      "# 1. Dataframes: what makes pandas powerful"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "people = pd.read_csv('tiny.csv')\n",
      "people"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "fragment"
      }
     },
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {
      "slideshow": {
       "slide_type": "skip"
      }
     },
     "source": [
      "This is a like a SQL database, or an R dataframe. There are 3 *columns*, called 'name', 'age', and 'height, and 6 *rows*."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {
      "slideshow": {
       "slide_type": "slide"
      }
     },
     "source": [
      "# 2. Selecting data from a dataframe\n",
      "\n",
      "I want you to know about this because you almost always only want a subset of the data you're working on. We are going to look at a CSV with 40 columns and 1,000,000 rows. "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Load the first 5 rows of our CSV\n",
      "small_requests = pd.read_csv('./311-service-requests.csv', nrows=5)"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "fragment"
      }
     },
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# How to get a column\n",
      "small_requests['Complaint Type']"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "fragment"
      }
     },
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# How to get a subset of the columns\n",
      "small_requests[['Complaint Type', 'Created Date']]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# How to get 3 rows\n",
      "small_requests[:3]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {
      "slideshow": {
       "slide_type": "slide"
      }
     },
     "source": [
      "# Get the first 3 rows of a column"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "small_requests['Agency Name'][:3]"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "fragment"
      }
     },
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "small_requests[:3]['Agency Name']"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "fragment"
      }
     },
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {
      "slideshow": {
       "slide_type": "slide"
      }
     },
     "source": [
      "# Compare a column to a value"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "small_requests['Complaint Type']"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "fragment"
      }
     },
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# This is like our numpy example from before\n",
      "small_requests['Complaint Type'] == 'Noise - Street/Sidewalk'"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "fragment"
      }
     },
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {
      "slideshow": {
       "slide_type": "skip"
      }
     },
     "source": [
      "That's numpy in action! Using `==` on a column of a dataframe gives us a series of `True` and `False` values"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {
      "slideshow": {
       "slide_type": "slide"
      }
     },
     "source": [
      "# Selecting only the rows with noise complaints"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# This is like our numpy example earlier\n",
      "noise_complaints = small_requests[small_requests['Complaint Type'] == 'Noise - Street/Sidewalk']\n",
      "noise_complaints"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "fragment"
      }
     },
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Any Dataframe has an *index*, which is a integer or date or something else associated to each row."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# How to get a specific row\n",
      "small_requests.ix[0]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# How not to get a row\n",
      "small_requests[0]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Exercise 1: Selecting things from dataframes\n",
      "\n",
      "* Find out how many complaints were filed with the NYPD\n",
      "* How many complaints were filed in the zip code 10007?"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Your code here"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Back to our example"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# We ran this at the beginning, so we don't have to run it again. Just here as a reminder.\n",
      "#orig_data = pd.read_csv('./311-service-requests.csv', nrows=100000, parse_dates=['Created Date'])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "complaints = orig_data[['Created Date', 'Complaint Type']]\n",
      "noise_complaints = complaints[complaints['Complaint Type'] == 'Noise - Street/Sidewalk']\n",
      "noise_complaints.set_index('Created Date').sort_index().resample('H', how=len).plot()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {
      "slideshow": {
       "slide_type": "slide"
      }
     },
     "source": [
      "# Indexes"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "noise_complaints[:3]"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "fragment"
      }
     },
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "noise_complaints = noise_complaints.set_index('Created Date')"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "fragment"
      }
     },
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "noise_complaints[:3]"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "fragment"
      }
     },
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {
      "slideshow": {
       "slide_type": "slide"
      }
     },
     "source": [
      "# Sorting the index"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {
      "slideshow": {
       "slide_type": "skip"
      }
     },
     "source": [
      "Pandas is awesome for date time index stuff. It was built for dealing with financial data is which is ALL TIME SERIES"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "noise_complaints = noise_complaints.sort_index()\n",
      "noise_complaints[:3]"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "fragment"
      }
     },
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {
      "slideshow": {
       "slide_type": "slide"
      }
     },
     "source": [
      "# Counting the complaints each hour"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "noise_complaints.resample('H', how=len)[:3]"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "fragment"
      }
     },
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {
      "slideshow": {
       "slide_type": "slide"
      }
     },
     "source": [
      "# Example 1: done!"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "noise_complaints.resample('H', how=len).plot()"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "fragment"
      }
     },
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {
      "slideshow": {
       "slide_type": "slide"
      }
     },
     "source": [
      "# Chaining commands together"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "complaints = orig_data[['Created Date', 'Complaint Type']]\n",
      "noise_complaints = complaints[complaints['Complaint Type'] == 'Noise - Street/Sidewalk']\n",
      "noise_complaints.set_index('Created Date').sort_index().resample('H', how=len).plot()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Exercise 2: Time series resampling\n",
      "\n",
      "* Find the number of noise complaints every day!\n",
      "* Find how many complaints about rodents there are each week. Make a graph!"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {
      "slideshow": {
       "slide_type": "slide"
      }
     },
     "source": [
      "# Example 2: What are the most common complaint types?"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "orig_data['Complaint Type'].value_counts()"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "fragment"
      }
     },
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "orig_data['Complaint Type'].value_counts()[:20].plot(kind='bar')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Exercise 3: Do the same thing for a different column"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Your code here."
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Example 3: Which weekday has the most noise complaints?"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "complaints = orig_data[['Created Date', 'Complaint Type']]\n",
      "noise_complaints = complaints[complaints['Complaint Type'] == 'Noise - Street/Sidewalk']\n",
      "noise_complaints = noise_complaints.set_index(\"Created Date\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "noise_complaints['weekday'] = noise_complaints.index.weekday\n",
      "noise_complaints[:3]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Count the complaints by weekday\n",
      "counts_by_weekday = noise_complaints.groupby('weekday').aggregate(len)\n",
      "counts_by_weekday"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# change the index to be actual days\n",
      "counts_by_weekday.index = [\"Sunday\", \"Monday\", \"Tuesday\", \"Wednesday\", \"Thursday\", \"Friday\", \"Saturday\"]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "counts_by_weekday.plot(kind='bar')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Exercise 4: Count the complaints by hour instead"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Your code here"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# A few more cool things"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# String searching"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# We need to get rid of the NA values for this to work\n",
      "street_names = orig_data['Street Name'].fillna('')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "manhattan_streets = street_names[street_names.str.contains(\"MANHATTAN\")]\n",
      "manhattan_streets"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "manhattan_streets.value_counts()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Looking at complaints close to us"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Our current latitude and longitude\n",
      "our_lat, our_long = 40.714151,-74.00878"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "distance_from_us = (orig_data['Longitude'] - our_long)**2 + (orig_data['Latitude'] - our_lat)**2"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "pd.Series(distance_from_us).hist()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "close_complaints = orig_data[distance_from_us < 0.00005]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "close_complaints['Complaint Type'].value_counts()[:20].plot(kind='bar')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}