{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Hi!\n", "\n", "I'm Julia." ] }, { "cell_type": "code", "collapsed": false, "input": [ "%pylab inline\n", "import pandas as pd\n", "pd.set_option('display.mpl_style', 'default')\n", "figsize(15, 6)\n", "pd.set_option('display.line_width', 4000)\n", "pd.set_option('display.max_columns', 100)" ], "language": "python", "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Goal (Today)\n", "\n", "Know how to use pandas to answer some specific questions about a dataset\n", "\n", "Roadmap:\n", "\n", "1. Demo with rats\n", "2. Dataframes: what makes pandas *powerful*\n", "3. Selecting data from a dataframe\n", "4. Time series and indexes and resampling\n", "5. Groupby + aggregate" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Some notes about installation:\n", "\n", "### Don't do this:\n", "\n", "```\n", "sudo apt-get install ipython-notebook\n", "```\n", "\n", "### Instead, do this:\n", "\n", "```\n", "pip install ipython tornado pyzmq\n", "```\n", "\n", "or install Anaconda from [http://store.continuum.io](http://store.continuum.io) (what I do)\n", "\n", "You can start IPython notebook by running\n", "\n", "```\n", "ipython notebook --pylab inline\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# First: Read the data" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Download and read the data\n", "!wget \"http://bit.ly/311-data-tar-gz\" -O 311-data.tar.gz\n", "!wget \"https://raw2.github.com/jvns/talks/master/pyladiesfeb2014/tiny.csv\" -O tiny.csv\n", "!tar -xzf \"311-data.tar.gz\" # wget does different things\n", "orig_data = pd.read_csv('./311-service-requests.csv', nrows=100000, parse_dates=['Created Date'])" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 45 }, { "cell_type": "code", "collapsed": false, "input": [ "plot(orig_data['Longitude'], orig_data['Latitude'], '.', color=\"purple\")" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Example 1: Graph the number of noise complaints each hour in New York" ] }, { "cell_type": "code", "collapsed": false, "input": [ "complaints = orig_data[['Created Date', 'Complaint Type']]\n", "noise_complaints = complaints[complaints['Complaint Type'] == 'Noise - Street/Sidewalk']\n", "noise_complaints.set_index('Created Date').sort_index().resample('H', how=len).plot()" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Example 2: What are the most common complaint types?" ] }, { "cell_type": "code", "collapsed": false, "input": [ "orig_data['Complaint Type'].value_counts()[:20].plot(kind='bar')" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Roadmap:\n", "\n", "1. Dataframes: what makes pandas *powerful*\n", "1. Selecting data from a dataframe\n", "1. Time series and indexes\n", "1. Graphing" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# What is pandas?\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# A few awesome things about pandas\n", "\n", "* Really, really, really, really good at time series\n", "* Can import Excel files (!!!)\n", "* Fast (joining dataframes, etc.)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "This is what lets you manipulate data easily -- the dataframe is basically the whole reason for pandas. It's a powerful concept from the statistical computing language *R*.\n", "\n", "If you don't know R, you can think of it like a database table (it has rows and columns), or like a table of numbers." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# 1. Dataframes: what makes pandas powerful" ] }, { "cell_type": "code", "collapsed": false, "input": [ "people = pd.read_csv('tiny.csv')\n", "people" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "This is a like a SQL database, or an R dataframe. There are 3 *columns*, called 'name', 'age', and 'height, and 6 *rows*." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# 2. Selecting data from a dataframe\n", "\n", "I want you to know about this because you almost always only want a subset of the data you're working on. We are going to look at a CSV with 40 columns and 1,000,000 rows. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Load the first 5 rows of our CSV\n", "small_requests = pd.read_csv('./311-service-requests.csv', nrows=5)" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# How to get a column\n", "small_requests['Complaint Type']" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# How to get a subset of the columns\n", "small_requests[['Complaint Type', 'Created Date']]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# How to get 3 rows\n", "small_requests[:3]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Get the first 3 rows of a column" ] }, { "cell_type": "code", "collapsed": false, "input": [ "small_requests['Agency Name'][:3]" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "small_requests[:3]['Agency Name']" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Compare a column to a value" ] }, { "cell_type": "code", "collapsed": false, "input": [ "small_requests['Complaint Type']" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# This is like our numpy example from before\n", "small_requests['Complaint Type'] == 'Noise - Street/Sidewalk'" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "That's numpy in action! Using `==` on a column of a dataframe gives us a series of `True` and `False` values" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Selecting only the rows with noise complaints" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# This is like our numpy example earlier\n", "noise_complaints = small_requests[small_requests['Complaint Type'] == 'Noise - Street/Sidewalk']\n", "noise_complaints" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Any Dataframe has an *index*, which is a integer or date or something else associated to each row." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# How to get a specific row\n", "small_requests.ix[0]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# How not to get a row\n", "small_requests[0]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise 1: Selecting things from dataframes\n", "\n", "* Find out how many complaints were filed with the NYPD\n", "* How many complaints were filed in the zip code 10007?" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Your code here" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Back to our example" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# We ran this at the beginning, so we don't have to run it again. Just here as a reminder.\n", "#orig_data = pd.read_csv('./311-service-requests.csv', nrows=100000, parse_dates=['Created Date'])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "complaints = orig_data[['Created Date', 'Complaint Type']]\n", "noise_complaints = complaints[complaints['Complaint Type'] == 'Noise - Street/Sidewalk']\n", "noise_complaints.set_index('Created Date').sort_index().resample('H', how=len).plot()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Indexes" ] }, { "cell_type": "code", "collapsed": false, "input": [ "noise_complaints[:3]" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "noise_complaints = noise_complaints.set_index('Created Date')" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "noise_complaints[:3]" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Sorting the index" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Pandas is awesome for date time index stuff. It was built for dealing with financial data is which is ALL TIME SERIES" ] }, { "cell_type": "code", "collapsed": false, "input": [ "noise_complaints = noise_complaints.sort_index()\n", "noise_complaints[:3]" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Counting the complaints each hour" ] }, { "cell_type": "code", "collapsed": false, "input": [ "noise_complaints.resample('H', how=len)[:3]" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Example 1: done!" ] }, { "cell_type": "code", "collapsed": false, "input": [ "noise_complaints.resample('H', how=len).plot()" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Chaining commands together" ] }, { "cell_type": "code", "collapsed": false, "input": [ "complaints = orig_data[['Created Date', 'Complaint Type']]\n", "noise_complaints = complaints[complaints['Complaint Type'] == 'Noise - Street/Sidewalk']\n", "noise_complaints.set_index('Created Date').sort_index().resample('H', how=len).plot()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise 2: Time series resampling\n", "\n", "* Find the number of noise complaints every day!\n", "* Find how many complaints about rodents there are each week. Make a graph!" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Example 2: What are the most common complaint types?" ] }, { "cell_type": "code", "collapsed": false, "input": [ "orig_data['Complaint Type'].value_counts()" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "orig_data['Complaint Type'].value_counts()[:20].plot(kind='bar')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise 3: Do the same thing for a different column" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Your code here." ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Example 3: Which weekday has the most noise complaints?" ] }, { "cell_type": "code", "collapsed": false, "input": [ "complaints = orig_data[['Created Date', 'Complaint Type']]\n", "noise_complaints = complaints[complaints['Complaint Type'] == 'Noise - Street/Sidewalk']\n", "noise_complaints = noise_complaints.set_index(\"Created Date\")" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "noise_complaints['weekday'] = noise_complaints.index.weekday\n", "noise_complaints[:3]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# Count the complaints by weekday\n", "counts_by_weekday = noise_complaints.groupby('weekday').aggregate(len)\n", "counts_by_weekday" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# change the index to be actual days\n", "counts_by_weekday.index = [\"Sunday\", \"Monday\", \"Tuesday\", \"Wednesday\", \"Thursday\", \"Friday\", \"Saturday\"]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "counts_by_weekday.plot(kind='bar')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise 4: Count the complaints by hour instead" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Your code here" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# A few more cool things" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# String searching" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# We need to get rid of the NA values for this to work\n", "street_names = orig_data['Street Name'].fillna('')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "manhattan_streets = street_names[street_names.str.contains(\"MANHATTAN\")]\n", "manhattan_streets" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "manhattan_streets.value_counts()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Looking at complaints close to us" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Our current latitude and longitude\n", "our_lat, our_long = 40.714151,-74.00878" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "distance_from_us = (orig_data['Longitude'] - our_long)**2 + (orig_data['Latitude'] - our_lat)**2" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "pd.Series(distance_from_us).hist()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "close_complaints = orig_data[distance_from_us < 0.00005]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "close_complaints['Complaint Type'].value_counts()[:20].plot(kind='bar')" ], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }