{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook will give an overview of using the excellent [HDF5 Data Format](https://www.hdfgroup.org/HDF5/) for high performance computing and [Plotly](https://plotly.com/) to graph data stored in this files.\n",
    " Plotly is a web-based graphing platform that lets you make and share interactive graphs and dashboards. You can use it for free online--sign up for an account [here](https:www.plot.ly)--and on-premise with [Chart Studio Enterprise](https://plotly.com/product/enterprise/).\n",
    "\n",
    "For those unfamilar with the HDF5 file format:\n",
    "HDF5 is a data model, library, and file format for storing and managing data. It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and complex data. HDF5 is portable and is extensible, allowing applications to evolve in their use of HDF5. The HDF5 Technology suite includes tools and applications for managing, manipulating, viewing, and analyzing data in the HDF5 format.\n",
    "\n",
    "-- [The HDF5 Group](https://www.hdfgroup.org/HDF5/)\n",
    "\n",
    "The HDF group has some great reasons to use their files - namely that it works great with all kind of data. You can [read more here.](https://www.hdfgroup.org/why_hdf/)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from IPython.display import display\n",
    "import plotly.plotly as py # interactive graphing\n",
    "from plotly.graph_objs import Bar, Scatter, Marker, Layout, Data, Figure, Heatmap, XAxis, YAxis\n",
    "import plotly.tools as tls\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The dataset that we'll be using is data from [NYC's open data portal](https://nycopendata.socrata.com/data). We'll be exploring a 100mb dataset covering traffic accidents in NYC. While we are capable of fitting this data into memory, the HDF5 file format has some unique affordances that allow us to query and save data in convenient ways."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now the first thing we'll want to do is open up an access point to this HDF5 file, doing so is simple because pandas provides ready access to doing so."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "pd.set_option('io.hdf.default_format','table')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "store = pd.HDFStore('nypd_motors.h5')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we've opened up our store, let's start storing some data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "# df = pd.read_csv('NYPD_motor_collisions.csv', parse_dates=['DATE'])\n",
    "# df.columns = [col.lower().replace(\" \", \"_\") for col in df.columns]\n",
    "# store.append(\"nypd\", df,format='table',data_columns=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<class 'pandas.io.pytables.HDFStore'>\n",
       "File path: nypd_motors.h5\n",
       "/nypd            frame_table  (typ->appendable,nrows->596990,ncols->29,indexers->[index],dc->[date,time,borough,zip_code,latitude,longitude,location,on_street_name,cross_street_name,off_street_name,number_of_persons_injured,number_of_persons_killed,number_of_pedestrians_injured,number_of_pedestrians_killed,number_of_cyclist_injured,number_of_cyclist_killed,number_of_motorist_injured,number_of_motorist_killed,contributing_factor_vehicle_1,contributing_factor_vehicle_2,contributing_factor_vehicle_3,contributing_factor_vehicle_4,contributing_factor_vehicle_5,unique_key,vehicle_type_code_1,vehicle_type_code_2,vehicle_type_code_3,vehicle_type_code_4,vehicle_type_code_5])"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "store"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One thing that's nice about the HDF5 file is that it's kind of like a key value store. It's simple to use, and allows you to store things just like you might in a file system type hierarchy."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What's awesome about the HDF5 format is that it's almost like a miniature file system. It supports hierarchical data and is accessed like a python dictionary."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "store.get_storer(\"df\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>date</th>\n",
       "      <th>time</th>\n",
       "      <th>borough</th>\n",
       "      <th>zip_code</th>\n",
       "      <th>latitude</th>\n",
       "      <th>longitude</th>\n",
       "      <th>location</th>\n",
       "      <th>on_street_name</th>\n",
       "      <th>cross_street_name</th>\n",
       "      <th>off_street_name</th>\n",
       "      <th>...</th>\n",
       "      <th>contributing_factor_vehicle_2</th>\n",
       "      <th>contributing_factor_vehicle_3</th>\n",
       "      <th>contributing_factor_vehicle_4</th>\n",
       "      <th>contributing_factor_vehicle_5</th>\n",
       "      <th>unique_key</th>\n",
       "      <th>vehicle_type_code_1</th>\n",
       "      <th>vehicle_type_code_2</th>\n",
       "      <th>vehicle_type_code_3</th>\n",
       "      <th>vehicle_type_code_4</th>\n",
       "      <th>vehicle_type_code_5</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2015-06-02</td>\n",
       "      <td>13:48</td>\n",
       "      <td>MANHATTAN</td>\n",
       "      <td>10038</td>\n",
       "      <td>40.711780</td>\n",
       "      <td>-73.999701</td>\n",
       "      <td>(40.7117796, -73.9997006)</td>\n",
       "      <td>ST JAMES PLACE</td>\n",
       "      <td>MADISON STREET</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3232026</td>\n",
       "      <td>VAN</td>\n",
       "      <td>VAN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2015-06-02</td>\n",
       "      <td>13:40</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>Turning Improperly</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3232021</td>\n",
       "      <td>PASSENGER VEHICLE</td>\n",
       "      <td>SPORT UTILITY / STATION WAGON</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2015-06-02</td>\n",
       "      <td>13:40</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1200 WATERS PLACE - PARKING LOT</td>\n",
       "      <td>...</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3232261</td>\n",
       "      <td>PASSENGER VEHICLE</td>\n",
       "      <td>PASSENGER VEHICLE</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2015-06-02</td>\n",
       "      <td>13:40</td>\n",
       "      <td>MANHATTAN</td>\n",
       "      <td>10004</td>\n",
       "      <td>40.706701</td>\n",
       "      <td>-74.016047</td>\n",
       "      <td>(40.7067007, -74.0160467)</td>\n",
       "      <td>WEST STREET</td>\n",
       "      <td>MORRIS STREET</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3232015</td>\n",
       "      <td>UNKNOWN</td>\n",
       "      <td>PASSENGER VEHICLE</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2015-06-02</td>\n",
       "      <td>13:38</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>WOOLLEY AVENUE</td>\n",
       "      <td>GURDON STREET</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>Other Vehicular</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3233372</td>\n",
       "      <td>PASSENGER VEHICLE</td>\n",
       "      <td>PASSENGER VEHICLE</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 29 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "        date   time    borough  zip_code   latitude  longitude  \\\n",
       "0 2015-06-02  13:48  MANHATTAN     10038  40.711780 -73.999701   \n",
       "1 2015-06-02  13:40        NaN       NaN        NaN        NaN   \n",
       "2 2015-06-02  13:40        NaN       NaN        NaN        NaN   \n",
       "3 2015-06-02  13:40  MANHATTAN     10004  40.706701 -74.016047   \n",
       "4 2015-06-02  13:38        NaN       NaN        NaN        NaN   \n",
       "\n",
       "                    location  on_street_name cross_street_name  \\\n",
       "0  (40.7117796, -73.9997006)  ST JAMES PLACE    MADISON STREET   \n",
       "1                        NaN             NaN               NaN   \n",
       "2                        NaN             NaN               NaN   \n",
       "3  (40.7067007, -74.0160467)     WEST STREET     MORRIS STREET   \n",
       "4                        NaN  WOOLLEY AVENUE     GURDON STREET   \n",
       "\n",
       "                   off_street_name         ...          \\\n",
       "0                              NaN         ...           \n",
       "1                              NaN         ...           \n",
       "2  1200 WATERS PLACE - PARKING LOT         ...           \n",
       "3                              NaN         ...           \n",
       "4                              NaN         ...           \n",
       "\n",
       "   contributing_factor_vehicle_2  contributing_factor_vehicle_3  \\\n",
       "0                    Unspecified                            NaN   \n",
       "1             Turning Improperly                            NaN   \n",
       "2                    Unspecified                            NaN   \n",
       "3                    Unspecified                            NaN   \n",
       "4                Other Vehicular                            NaN   \n",
       "\n",
       "   contributing_factor_vehicle_4  contributing_factor_vehicle_5  unique_key  \\\n",
       "0                            NaN                            NaN     3232026   \n",
       "1                            NaN                            NaN     3232021   \n",
       "2                            NaN                            NaN     3232261   \n",
       "3                            NaN                            NaN     3232015   \n",
       "4                            NaN                            NaN     3233372   \n",
       "\n",
       "   vehicle_type_code_1            vehicle_type_code_2  vehicle_type_code_3  \\\n",
       "0                  VAN                            VAN                  NaN   \n",
       "1    PASSENGER VEHICLE  SPORT UTILITY / STATION WAGON                  NaN   \n",
       "2    PASSENGER VEHICLE              PASSENGER VEHICLE                  NaN   \n",
       "3              UNKNOWN              PASSENGER VEHICLE                  NaN   \n",
       "4    PASSENGER VEHICLE              PASSENGER VEHICLE                  NaN   \n",
       "\n",
       "  vehicle_type_code_4 vehicle_type_code_5  \n",
       "0                 NaN                 NaN  \n",
       "1                 NaN                 NaN  \n",
       "2                 NaN                 NaN  \n",
       "3                 NaN                 NaN  \n",
       "4                 NaN                 NaN  \n",
       "\n",
       "[5 rows x 29 columns]"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "store.select(\"nypd\").head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "boroughs = store.select(\"nypd\", \"columns=['borough']\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "boroughs['COUNT'] = 1\n",
    "borough_groups = boroughs.groupby('borough')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index([u'BRONX', u'BROOKLYN', u'MANHATTAN', u'QUEENS', u'STATEN ISLAND'], dtype='object')"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "borough_groups.sum().index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = Data([Bar(y=borough_groups.sum()['COUNT'], x=borough_groups.sum().index)])\n",
    "layout = Layout(xaxis=XAxis(title=\"Borough\"), yaxis=YAxis(title='Accident Count'))\n",
    "fig = Figure(data=data, layout=layout)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<iframe id=\"igraph\" scrolling=\"no\" style=\"border:none;\"seamless=\"seamless\" src=\"https://plotly.com/~bill_chambers/474.embed\" height=\"525\" width=\"100%\"></iframe>"
      ],
      "text/plain": [
       "<plotly.tools.PlotlyDisplay object>"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "py.iplot(fig)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "dates_borough = store.select(\"nypd\", \"columns=['date', 'borough']\").sort('date')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "dates_borough['COUNT'] = 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>COUNT</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>borough</th>\n",
       "      <th>date</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"5\" valign=\"top\">BRONX</th>\n",
       "      <th>2012-07-01</th>\n",
       "      <td>39</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2012-07-02</th>\n",
       "      <td>71</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2012-07-03</th>\n",
       "      <td>73</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2012-07-04</th>\n",
       "      <td>51</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2012-07-05</th>\n",
       "      <td>60</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                    COUNT\n",
       "borough date             \n",
       "BRONX   2012-07-01     39\n",
       "        2012-07-02     71\n",
       "        2012-07-03     73\n",
       "        2012-07-04     51\n",
       "        2012-07-05     60"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "date_borough_sum = dates_borough.groupby(['borough', \"date\"]).sum()\n",
    "date_borough_sum.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = []\n",
    "for g, df in date_borough_sum.reset_index().groupby('borough'):\n",
    "    data.append(Scatter(x= df.date, y=df.COUNT,name=g))\n",
    "layout = Layout(xaxis=XAxis(title=\"Date\"), yaxis=YAxis(title=\"Accident Count\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<iframe id=\"igraph\" scrolling=\"no\" style=\"border:none;\"seamless=\"seamless\" src=\"https://plotly.com/~bill_chambers/274.embed\" height=\"525\" width=\"100%\"></iframe>"
      ],
      "text/plain": [
       "<plotly.tools.PlotlyDisplay object>"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "py.iplot(Figure(data=Data(data), layout=layout), filename='nypd_crashes/over_time')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Luckily for us, while this graph is a bit of a mess, we can still zoom in on specific times and ranges. This makes plotly perfect for exploring datasets. You can create a high level visual of the data then zoom into a more detailed level.\n",
    "\n",
    "See below where using the above graph I could zoom in on a particular point and anontate it for future investigation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<iframe id=\"igraph\" scrolling=\"no\" style=\"border:none;\"seamless=\"seamless\" src=\"https://plotly.com/~bill_chambers/274.embed\" height=\"525\" width=\"100%\"></iframe>"
      ],
      "text/plain": [
       "<plotly.tools.PlotlyDisplay object>"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tls.embed(\"https://plotly.com/~bill_chambers/274\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "car_types = store.select(\"nypd\", \"columns=['vehicle_type_code_1', 'vehicle_type_code_2']\")\n",
    "car_types['COUNT'] = 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "code_1 = car_types.groupby('vehicle_type_code_1').sum()\n",
    "code_2 = car_types.groupby('vehicle_type_code_2').sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "data = Data([\n",
    "        Bar(x=code_1.index, y=code_1.COUNT,name='First Vehicle Type'),\n",
    "        Bar(x=code_2.index, y=code_2.COUNT,name='Second Vehicle Type')\n",
    "     ])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<iframe id=\"igraph\" scrolling=\"no\" style=\"border:none;\"seamless=\"seamless\" src=\"https://plotly.com/~bill_chambers/476.embed\" height=\"525\" width=\"100%\"></iframe>"
      ],
      "text/plain": [
       "<plotly.tools.PlotlyDisplay object>"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "py.iplot(Figure(data=data, layout=Layout(barmode='group', yaxis=YAxis(title=\"Vehicle Incidents\"))))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "No big surprises here, we can see that passenger vehicles, likely being the most prevalent vehicles, are the ones involved in the most accidents for the first and second vehicles. However this does make for some more interesting questions, does this extrapolate to each vehicle class. That is, do all kinds of vehicles hit all other vehicles in more or less the same frequency? \n",
    "\n",
    "Let's explore large commercial vehicles."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "large_vehicles = car_types.groupby(\n",
    "    'vehicle_type_code_1'\n",
    ").get_group(\n",
    "    'LARGE COM VEH(6 OR MORE TIRES)'\n",
    ").groupby('vehicle_type_code_2').sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<iframe id=\"igraph\" scrolling=\"no\" style=\"border:none;\"seamless=\"seamless\" src=\"https://plotly.com/~bill_chambers/478.embed\" height=\"525\" width=\"100%\"></iframe>"
      ],
      "text/plain": [
       "<plotly.tools.PlotlyDisplay object>"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data = Data([Bar(x=large_vehicles.index,y=large_vehicles.COUNT)])\n",
    "py.iplot(Figure(data=data, layout=Layout(yaxis=YAxis(title=\"Incident Per Vehicle Type\"))))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "At first glance it seems alright, but it's worth more exploration - let's Z-Score the data and compare their scores."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>COUNT</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>vehicle_type_code_2</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>AMBULANCE</th>\n",
       "      <td>9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>BICYCLE</th>\n",
       "      <td>98</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>BUS</th>\n",
       "      <td>151</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>FIRE TRUCK</th>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>LARGE COM VEH(6 OR MORE TIRES)</th>\n",
       "      <td>878</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                COUNT\n",
       "vehicle_type_code_2                  \n",
       "AMBULANCE                           9\n",
       "BICYCLE                            98\n",
       "BUS                               151\n",
       "FIRE TRUCK                          6\n",
       "LARGE COM VEH(6 OR MORE TIRES)    878"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "large_vehicles.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>COUNT</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>vehicle_type_code_2</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>AMBULANCE</th>\n",
       "      <td>842</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>BICYCLE</th>\n",
       "      <td>13891</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>BUS</th>\n",
       "      <td>8935</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>FIRE TRUCK</th>\n",
       "      <td>412</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>LARGE COM VEH(6 OR MORE TIRES)</th>\n",
       "      <td>10299</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                COUNT\n",
       "vehicle_type_code_2                  \n",
       "AMBULANCE                         842\n",
       "BICYCLE                         13891\n",
       "BUS                              8935\n",
       "FIRE TRUCK                        412\n",
       "LARGE COM VEH(6 OR MORE TIRES)  10299"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "code_2.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def z_score(df):\n",
    "    df['zscore'] = ((df.COUNT - df.COUNT.mean())/df.COUNT.std())\n",
    "    return df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = Data([\n",
    "        Bar(x=z_score(code_2).index,y=z_score(code_2).zscore, name='All Vehicles'),\n",
    "        Bar(x=z_score(large_vehicles).index,y=z_score(large_vehicles).zscore,name='Large Vehicles'),\n",
    "        \n",
    "     ])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<iframe id=\"igraph\" scrolling=\"no\" style=\"border:none;\"seamless=\"seamless\" src=\"https://plotly.com/~bill_chambers/480.embed\" height=\"525\" width=\"100%\"></iframe>"
      ],
      "text/plain": [
       "<plotly.tools.PlotlyDisplay object>"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "py.iplot(Figure(data=data, layout=Layout(yaxis=YAxis(title=\"Incident Per Vehicle Type\"))),name='nypd_crashes/large vs all vehicles')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that things are relatively similar, except that large vehicles seem to hit large vehicles much more than most others. This could warrant further investigation.\n",
    "\n",
    "While grouped bar charts can be useful for these kinds of comparisons, it can be great to visualize this data with heatmaps as well. We can create one of these by creation a contingency table or cross tabulation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [],
   "source": [
    "cont_table = pd.crosstab(car_types['vehicle_type_code_1'], car_types['vehicle_type_code_2']).apply(np.log)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because of the different magnitudes of data, I decided to log scale it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<iframe id=\"igraph\" scrolling=\"no\" style=\"border:none;\"seamless=\"seamless\" src=\"https://plotly.com/~bill_chambers/333.embed\" height=\"525\" width=\"100%\"></iframe>"
      ],
      "text/plain": [
       "<plotly.tools.PlotlyDisplay object>"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "py.iplot(Data([\n",
    "            Heatmap(z = cont_table.values, x=cont_table.columns, y=cont_table.index, colorscale='Jet')\n",
    "        ]),filename='nypd_crashes/vehicle to vehicle heatmap')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With this we are able to see more interesting nuances in the data. For instance taxis seems to have lots of accidents with other taxis, while vans and station wagons also seem to have many accidents.\n",
    "\n",
    "There's clearly a lot to explore in this dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<link href=\"//fonts.googleapis.com/css?family=Open+Sans:600,400,300,200|Inconsolata|Ubuntu+Mono:400,700\" rel=\"stylesheet\" type=\"text/css\" />"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<link rel=\"stylesheet\" type=\"text/css\" href=\"http://help.plot.ly/documentation/all_static/css/ipython-notebook-custom.css\">"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Requirement already up-to-date: publisher in /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages\r\n"
     ]
    }
   ],
   "source": [
    "from IPython.display import display, HTML\n",
    "\n",
    "display(HTML('<link href=\"//fonts.googleapis.com/css?family=Open+Sans:600,400,300,200|Inconsolata|Ubuntu+Mono:400,700\" rel=\"stylesheet\" type=\"text/css\" />'))\n",
    "display(HTML('<link rel=\"stylesheet\" type=\"text/css\" href=\"http://help.plot.ly/documentation/all_static/css/ipython-notebook-custom.css\">'))\n",
    "\n",
    "! pip install publisher --upgrade\n",
    "import publisher\n",
    "publisher.publish(\n",
    "    'PyTables.ipynb', 'ipython-notebooks/pytables/', 'PyTables and HDF5 Tutorial',\n",
    "    'An introduction to PyTables and HDF5 for data analysis in Python.', name='Pytables and Plotly')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}