{ "cells": [ { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import datetime\n", "from datetime import date\n", "from dateutil.rrule import rrule, DAILY\n", "from datetime import datetime\n", "from dateutil import tz\n", "from __future__ import division\n", "import geoplotlib as glp\n", "from geoplotlib.utils import BoundingBox, DataAccessObject\n", "from sklearn.ensemble import RandomForestClassifier\n", "import itertools\n", "from sklearn import cluster\n", "from sklearn.cross_validation import cross_val_score\n", "\n", "pd.set_option('display.max_columns', None)\n", "%matplotlib inline " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# NYC Traffic and Weather - Explainer Notebook\n", "*This notebook is an 'explainer' for the visualization site found [here](https://masve.github.io/saav-deliveries).*\n", "\n", "*For more details all the notebooks that have been created for this project can be found [here](https://github.com/masve/saav-deliveries/tree/master/project).*\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Motivation\n", "> What is your dataset?\n", "\n", "Our dataset is [NYPD Motor Vehicle Collisions](https://data.cityofnewyork.us/Public-Safety/NYPD-Motor-Vehicle-Collisions/h9gi-nx95). It contains records for every reported incident in the NYC area. Records are available from July 2012 till today. Specifically, where the collision took place, the cause of the collision, injuries, fatalities and more.\n", "\n", "The other dataset used was an extraction of weather conditions for the NYC area. This was pulled from [Weather Underground](https://www.wunderground.com) using their csv service for hourly updates. For each hour, there are measurements of the temperature, visibility, windspeed as well as the overall weather conditions such as Rain, Snow, Clear etc.\n", "\n", "> Why did you choose this/these particular dataset(s)?\n", "\n", "Choosing the collisions dataset, was mainly out of interest in finding out where, how and why collisions happen.\n", "\n", "The weather dataset was an afterthough and was mainly something we wanted to investigate after having looked collisions and the road conditions / traffic infrastructure.\n", "\n", "\n", "\n", "> What was your goal for the end user's experience?\n", "\n", "The main goal is to provide the user/reader with knowledge about how the weather affects the road and in that sense the collision rate. We would assume that more collisions happens in bad conditions, but can we quantify that? The user should also end up with an idea of where incidents happens the most, maybe to give an idea for the government of NYC for improvements. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Basic Stats. Let's understand the dataset better\n", ">Write about your choices in data cleaning and preprocessing\n", "\n", "Talking about the collisions dataset, the cleaning was mainly done for the columns that we knew we wanted to investigate or otherwise we though was important in the data exploration. Notably we wanted to make sure that we had location data for all the rows we investigated.\n", "\n", "The weather information was not directly available to us and required a lot of HTTP requests to Weather Undergrounds servers. Below is how we did it:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# ============================================\n", "# Downloading Weather Data (KJFK)\n", "# ============================================\n", "\n", "# Getting weather data from wunderground\n", "start_date = date(2012, 7, 1)\n", "end_date = date(2016, 2, 29)\n", "\n", "# csv container for all daily weather infromation\n", "frames = []\n", "\n", "# url template for http requests. %s/%s/%s represent year/month/day\n", "url_template = 'https://www.wunderground.com/history/airport/KJFK/%s/%s/%s/DailyHistory.html?\\\n", " req_city=New+York&req_state=NY&req_statename=New+York&reqdb.zip=10001\\\n", " &reqdb.magic=4&reqdb.wmo=99999&format=1.csv'\n", "month = \"\"\n", "\n", "# Query wunderground for daily weather information (returned as csv)\n", "for dt in rrule(DAILY, dtstart=start_date, until=end_date):\n", " if (month != dt.strftime(\"%m\")):\n", " month = dt.strftime(\"%m\")\n", " print 'Downloading to memory: ' + dt.strftime(\"%Y-%m\")\n", " # download and append csv file to csv container\n", " frames.append(pd.read_csv(url_template % (dt.strftime(\"%Y\"),dt.strftime(\"%m\"), dt.strftime(\"%d\"))))\n", "\n", "# Combine all csv's to one big and save it\n", "print \"Saving data to csv...\"\n", "data = pd.concat(frames)\n", "data.to_csv('weather_data_nyc_kjfk.csv', sep=',')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now have two datasets. Collisions and weather. However to avoid having to lookup in in a secondary dataset, that is the weather information, we merged the two datasets together. For the most part we had weather data for each hour for all the rows we wanted to investigate, with only a combined gap of a couple of days. In additions some hours had more than one row of weather information. We ignored these factors as we saw them insignificant to the overall result anyways.\n", "\n", "In order to join the datasets they both had to have some columns in common. Which needed to be the date and time (hour). The weather dataset already had a datetime column in UTC. What we did was convert to NYC local time and add the columns Year, Month, Day, and Hour:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# ============================================\n", "# Cleaning the Weather Data\n", "# ============================================\n", "\n", "# Read downloaded dataset\n", "weather = pd.read_csv('datasets/weather_data_nyc_kjfk.csv')\n", "\n", "# Convert UTC time to NYC actual\n", "def UTCtoActual(utcDate):\n", " from_zone = tz.gettz('UTC')\n", " to_zone = tz.gettz('America/New_York')\n", " \n", " utc = datetime.strptime(utcDate.DateUTC, '%Y-%m-%d %H:%M:%S')\\\n", " .replace(tzinfo=from_zone)\\\n", " .astimezone(to_zone)\n", " s = pd.Series([utc.year, utc.month, utc.day, utc.hour])\n", " s.columns = ['Year', 'Month', 'Day', 'Hour']\n", " return s\n", "\n", "# Apply the above function to every row in the weather dataset and save the file.\n", "weather[['Year', 'Month', 'Day', 'Hour']] = weather.apply(UTCtoActual, axis=1)\n", "weather.to_csv('datasets/weather_data_nyc_kjfk_clean.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With both datasets now having a 'common' ground for joining. This can now be done:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# ============================================\n", "# Merging Datasets\n", "# NYPD Motor Vehicle Collisions\n", "# and \n", "# Weather Underground Extract (KJFK)\n", "# ============================================\n", "\n", "# Read the datasets to be merged\n", "incidents = pd.read_csv('datasets/NYPD_Motor_Vehicle_Collisions.csv')\n", "weather = pd.read_csv('datasets/weather_data_nyc_kjfk_clean2.csv')\n", "\n", "# Features from the dataset to merge in\n", "features = ['Conditions', 'Precipitationmm', \\\n", " 'TemperatureC', 'VisibilityKm']\n", "\n", "# Looks up weather data on the date and hour requested\n", "def lookup_weather2(year, month, day, hour):\n", " w = weather[(weather.Year == year) & (weather.Month == month) & (weather.Day == day) & (weather.Hour == hour)]\n", " return w\n", "\n", "# Looks up weather data, if nothing is found on the hour, it either looks an hour ahead or back.\n", "# Otherwise returns empty.\n", "def lookup_weather(date, time):\n", " month = int(date.split('/')[0])\n", " day = int(date.split('/')[1])\n", " year = int(date.split('/')[2])\n", " hour = int(time.split(':')[0])\n", " d = lookup_weather2(year, month, day, hour).head(1)\n", " if (d.empty):\n", " dt_back = datetime.datetime(year, month, day, hour) - datetime.timedelta(hours=1)\n", " dt_forward = datetime.datetime(year, month, day, hour) + datetime.timedelta(hours=1)\n", " \n", " d_back = lookup_weather2(dt_back.year, dt_back.month, dt_back.day, dt_back.hour)\n", " if (not d_back.empty): return d_back\n", " \n", " d_forward = lookup_weather2(dt_forward.year, dt_forward.month, dt_forward.day, dt_forward.hour)\n", " if (not d_forward.empty): return d_forward\n", " return d\n", "\n", "# Merges the datasets\n", "def merge_weather(incident):\n", " date = incident.DATE\n", " time = incident.TIME\n", "\n", " w = lookup_weather(date, time)\n", "\n", " try:\n", " # Default values\n", " con = \"-\"\n", " temp = \"-\"\n", " rainmm = \"-\"\n", " viskm = \"-\"\n", "\n", " # If weather data is different from null\n", " if (not pd.isnull(w['Conditions'].iloc[0])):\n", " con = w['Conditions'].iloc[0]\n", " if (not pd.isnull(w['TemperatureC'].iloc[0])):\n", " temp = w['TemperatureC'].iloc[0]\n", " if (not pd.isnull(w['Precipitationmm'].iloc[0])):\n", " rainmm = w['Precipitationmm'].iloc[0]\n", " if (not pd.isnull(w['VisibilityKm'].iloc[0])):\n", " viskm = w['VisibilityKm'].iloc[0]\n", " \n", " s = pd.Series([con, rainmm, temp, viskm])\n", " return s\n", " except:\n", " print date + \"x\" + time\n", " s = pd.Series([None,None,None,None])\n", " return s\n", " \n", "print \"Applying weather data to incidents...\"\n", "incidents[features] = incidents[incidents.DATE.str.split('/').str.get(2) != '2016'].apply(merge_weather, axis=1)\n", "print \"Saving weather in-riched incident data...\"\n", "incidents.to_csv('datasets/NYPD_Motor_Vehicle_Collisions_Weather_FINAL.csv', sep=',')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This was how we preprocessed and got all the available information that we require for making our data analysis." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ ">Write a short section that discusses the dataset stats (here you can recycle the work you did for Project Assignment A)\n", "\n", "Our primary focus is on the dataset addressing NYC Motor Vehicle Collisions and the merged dataset, consisting of incidents along with their weather conditions.\n", "\n", "#### NYPD Motor Vehicle Collisions\n", "\n", "[Source](https://data.cityofnewyork.us/Public-Safety/NYPD-Motor-Vehicle-Collisions/h9gi-nx95)\n", "\n", "The dataset contained within a csv file is of size 149MB, with 29 features and 769054 records.\n", "\n", "\n", "\n", "We started on getting finding answers to our burning questions and some basic statistics on our data. Some questions that we wanted to address was the number of collisions that are happening within each borough. As an example shown in below chart, it shows the number of incidents recorded within each borough over the years.\n", "\n", "\n", "\n", "Further investigation on the dataset, we made use of the geolocation to get an even more precise geographical representation on where incidents are happening than just grouping by borough. We presented the location of incidents in a heatmap by using geoplotlib.kde to map and show the density incidents in their respective geographical location. \n", "\n", "\n", "\n", "#### Weather Underground\n", "\n", "[Source](https://www.wunderground.com), processed dataset can he found [here](https://github.com/masve/saav-deliveries/blob/master/project/datasets/weather_data_nyc_kjfk_clean2.csv.zip).\n", "\n", "The dataset for the weather, is contained within a csv file with a file size of 3.9MB. The data is constructed with 22 features and 36002 rows.\n", "\n", "Now that we have made some modelling on the collision data, we wanted to know the cause of action for these collisions. There are definitely many contribufaction factors for each recorded incidents, however we thought there could be other factors that lies beneath these incidents, which is why we chose a common external factor, the weather. \n", "\n", "We retrieved a set of weather data, from the same time period as the dataset for the motor vehicle collisions and merged them by the time. The data contains 30 different weather conditions\n", "\n", "
\n",
" \n", " | \n",
" \n",
" \n", " | \n",
"
---|