{ "cells": [ { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.046585, "end_time": "2021-03-14T03:52:35.098103", "exception": false, "start_time": "2021-03-14T03:52:35.051518", "status": "completed" }, "tags": [] }, "source": [ "# Accidents and Traffic in Atlanta\n", "\n", "### What is my goal for this project?\n", "As someone studying machine learning and data science, I didn't really have a good portfolio of projects to show. I had some experience working on this type of stuff in graduate school, and I have completed several online machine learning and data science courses, but I didn't have many projects to show for that. \n", "\n", "My initial idea for this project was to browse the data sets on Kaggle and pick one that seemed interesting. I would then look at the data, train a classifier or regressor to do some predictions, and put together some written thoughts on it. I didn't really have a plan when I picked the dataset; I just wanted to mess around with it and see what came to me.\n", "\n", "### What dataset did I choose?\n", "After spending some time browsing through various data sets, I ended up choosing [US Accidents (4.2 million records)\n", "A Countrywide Traffic Accident Dataset (2016 - 2020)](https://www.kaggle.com/sobhanmoosavi/us-accidents). There is an accompanying paper for this data set located at [arxiv.org](https://arxiv.org/abs/1906.05409). Using this data set, I narrowed the scope of the data down to the Atlanta, Georgia area, where I recently moved.\n", "\n", "### About the Data:\n", "\n", "According to the author, the data was collected in real time using multiple traffic APIs. The majority of the data, approximately 63% and 36%, came from MapQuest and Bing respectively. The data was collected from February 2016 until December 2020. The author of the data set made some comments on the validity of the data [here](https://www.kaggle.com/sobhanmoosavi/us-accidents/discussion/159189). 
Overall, the author believes that this data set covers only a subset of all accidents in the United States. The author also discussed a change in collection techniques for MapQuest after August 2017 [here](https://www.kaggle.com/sobhanmoosavi/us-accidents/discussion/126883). \n", "\n", "Given the above information, while the data may not be complete, it should be enough to give an interesting view into the traffic around the Atlanta area.\n", "\n", "The following table is a description of the column values as per the author's [website](https://smoosavi.org/datasets/us_accidents).\n", "\n", "
| Attribute | Description |\n",
"| :--- | :--- |\n",
"| ID | This is a unique identifier of the accident record. |\n",
"| Source | Indicates source of the accident report (i.e., the API which reported the accident). |\n",
"| TMC | A traffic accident may have a Traffic Message Channel (TMC) code which provides a more detailed description of the event. |\n",
"| Severity | Shows the severity of the accident, a number between 1 and 4, where 1 indicates the least impact on traffic (i.e., short delay as a result of the accident) and 4 indicates a significant impact on traffic (i.e., long delay). |\n",
"| Start_Time | Shows start time of the accident in local time zone. |\n",
"| End_Time | Shows end time of the accident in local time zone. End time here refers to when the impact of accident on traffic flow was dismissed. |\n",
"| Start_Lat | Shows latitude in GPS coordinate of the start point. |\n",
"| Start_Lng | Shows longitude in GPS coordinate of the start point. |\n",
"| End_Lat | Shows latitude in GPS coordinate of the end point. |\n",
"| End_Lng | Shows longitude in GPS coordinate of the end point. |\n",
"| Distance(mi) | The length of the road extent affected by the accident. |\n",
"| Description | Shows natural language description of the accident. |\n",
"| Number | Shows the street number in address field. |\n",
"| Street | Shows the street name in address field. |\n",
"| Side | Shows the relative side of the street (Right/Left) in address field. |\n",
"| City | Shows the city in address field. |\n",
"| County | Shows the county in address field. |\n",
"| State | Shows the state in address field. |\n",
"| Zipcode | Shows the zipcode in address field. |\n",
"| Country | Shows the country in address field. |\n",
"| Timezone | Shows timezone based on the location of the accident (eastern, central, etc.). |\n",
"| Airport_Code | Denotes an airport-based weather station which is the closest one to location of the accident. |\n",
"| Weather_Timestamp | Shows the time-stamp of weather observation record (in local time). |\n",
"| Temperature(F) | Shows the temperature (in Fahrenheit). |\n",
"| Wind_Chill(F) | Shows the wind chill (in Fahrenheit). |\n",
"| Humidity(%) | Shows the humidity (in percentage). |\n",
"| Pressure(in) | Shows the air pressure (in inches). |\n",
"| Visibility(mi) | Shows visibility (in miles). |\n",
"| Wind_Direction | Shows wind direction. |\n",
"| Wind_Speed(mph) | Shows wind speed (in miles per hour). |\n",
"| Precipitation(in) | Shows precipitation amount in inches, if there is any. |\n",
"| Weather_Condition | Shows the weather condition (rain, snow, thunderstorm, fog, etc.). |\n",
"| Amenity | A POI annotation which indicates presence of amenity in a nearby location. |\n",
"| Bump | A POI annotation which indicates presence of speed bump or hump in a nearby location. |\n",
"| Crossing | A POI annotation which indicates presence of crossing in a nearby location. |\n",
"| Give_Way | A POI annotation which indicates presence of give_way in a nearby location. |\n",
"| Junction | A POI annotation which indicates presence of junction in a nearby location. |\n",
"| No_Exit | A POI annotation which indicates presence of no_exit in a nearby location. |\n",
"| Railway | A POI annotation which indicates presence of railway in a nearby location. |\n",
"| Roundabout | A POI annotation which indicates presence of roundabout in a nearby location. |\n",
"| Station | A POI annotation which indicates presence of station in a nearby location. |\n",
"| Stop | A POI annotation which indicates presence of stop in a nearby location. |\n",
"| Traffic_Calming | A POI annotation which indicates presence of traffic_calming in a nearby location. |\n",
"| Traffic_Signal | A POI annotation which indicates presence of traffic_signal in a nearby location. |\n",
"| Turning_Loop | A POI annotation which indicates presence of turning_loop in a nearby location. |\n",
"| Sunrise_Sunset | Shows the period of day (i.e. day or night) based on sunrise/sunset. |\n",
"| Civil_Twilight | Shows the period of day (i.e. day or night) based on civil twilight. |\n",
"| Nautical_Twilight | Shows the period of day (i.e. day or night) based on nautical twilight. |\n",
"| Astronomical_Twilight | Shows the period of day (i.e. day or night) based on astronomical twilight. |
\n", "
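The source split quoted in the data description (roughly 63% MapQuest and 36% Bing) is the kind of figure that can be checked once the CSV is loaded. The snippet below is a minimal sketch, assuming a DataFrame that has the `Source` column listed in the table above; the `source_share` helper and the tiny sample frame are hypothetical stand-ins and are not part of the original analysis.

```python
import pandas as pd


def source_share(df):
    # Percentage of rows reported by each API, largest share first.
    # Assumes the frame has the Source column from the column table.
    return (df['Source'].value_counts(normalize=True) * 100).round(1)


# Made-up sample standing in for the full data set (assumption, not real data).
sample = pd.DataFrame(
    {'Source': ['MapQuest', 'MapQuest', 'Bing', 'MapQuest-Bing']}
)
print(source_share(sample))
```

On the real data the same call would be `source_share(df_all)` once the CSV has been read in; the percentages it prints may differ slightly from the author's figures depending on the data set version.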
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What is the project?\n", "For this project, I will be looking at plots of traffic accidents and the impact on traffic when there is an auto accident. The impacts on traffic in this data set are rated by severity from 1 to 4 with 1 meaning the accident caused minimal traffic delays and 4 meaning an accident had an extreme effect on traffic delays. \n", "\n", "The data set doesn't define a clear timetable for the severity of the delays. However, the author posted some approximate estimates [here](https://www.kaggle.com/sobhanmoosavi/us-accidents/discussion/152370). Delays are estimated at the following.\n", "\n", "| Severity | Time |\n", "| :--- | :--- |\n", "| 1 | 2m 30s |\n", "| 2 | 3m 15s |\n", "| 3 | 8m |\n", "| 4 | 18m |\n", "\n", "
\n", "\n", "To examine the data, I will plot several different types of traffic maps. Next, I will divide the data into train and test sets in order to train a severity classifier that predicts the traffic delay given an accident's location and other relevant information. \n", "\n", "To look at the data and train the classifier, I will use the years 2017 through 2019. I will discard 2016 since data from the first part of the year is missing. I will also not consider the year 2020 in the first part of the project since various COVID measures had a serious impact on traffic in Atlanta. In the last part of the project, I will compare the 2020 data to the previous years to see how, according to the data, COVID affected traffic delays and accidents." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "papermill": { "duration": 1.441403, "end_time": "2021-03-14T03:52:36.581869", "exception": false, "start_time": "2021-03-14T03:52:35.140466", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "pd.set_option('display.max_columns', None)\n", "\n", "import plotly.graph_objects as go\n", "from plotly.subplots import make_subplots\n", "\n", "from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, ExtraTreesClassifier, VotingClassifier\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.dummy import DummyClassifier\n", "from sklearn.model_selection import RandomizedSearchCV, GridSearchCV\n", "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.feature_selection import RFECV\n", "from sklearn.model_selection import cross_val_score\n", "from sklearn.feature_selection import SelectKBest, chi2, f_classif, mutual_info_classif\n", "\n", "import json\n", "\n", "# Read the Mapbox API key from a local credentials file\n", "with open('credentials.json') as f:\n", "    json_data = json.load(f)\n", "    mapbox_key = json_data['mapbox_key']\n", "\n", "random_state = 0\n", "\n", "import plotly.io as pio\n", 
"pio.renderers.default = \"notebook\"" ] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.043435, "end_time": "2021-03-14T03:52:36.669388", "exception": false, "start_time": "2021-03-14T03:52:36.625953", "status": "completed" }, "tags": [] }, "source": [ "# Cleaning and processing the data\n", "\n", "The first step is to read in the data and take a look at it. The data contains 49 columns of information and 4,232,541 rows of accident entries. The columns contain varying information on location, weather, dates, times, traffic impact, source, and other information. The exact explanation of each column can be found [here](https://smoosavi.org/datasets/us_accidents)." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "papermill": { "duration": 55.308991, "end_time": "2021-03-14T03:53:32.021522", "exception": false, "start_time": "2021-03-14T03:52:36.712531", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
4232541 rows × 49 columns
" ], "text/plain": [ " ID Source TMC Severity Start_Time \\\n", "0 A-1 MapQuest 201.0 3 2016-02-08 05:46:00 \n", "1 A-2 MapQuest 201.0 2 2016-02-08 06:07:59 \n", "2 A-3 MapQuest 201.0 2 2016-02-08 06:49:27 \n", "3 A-4 MapQuest 201.0 3 2016-02-08 07:23:34 \n", "4 A-5 MapQuest 201.0 2 2016-02-08 07:39:07 \n", "... ... ... ... ... ... \n", "4232536 A-4239402 Bing NaN 2 2019-08-23 18:03:25 \n", "4232537 A-4239403 Bing NaN 2 2019-08-23 19:11:30 \n", "4232538 A-4239404 Bing NaN 2 2019-08-23 19:00:21 \n", "4232539 A-4239405 Bing NaN 2 2019-08-23 19:00:21 \n", "4232540 A-4239406 Bing NaN 2 2019-08-23 18:52:06 \n", "\n", " End_Time Start_Lat Start_Lng End_Lat End_Lng \\\n", "0 2016-02-08 11:00:00 39.865147 -84.058723 NaN NaN \n", "1 2016-02-08 06:37:59 39.928059 -82.831184 NaN NaN \n", "2 2016-02-08 07:19:27 39.063148 -84.032608 NaN NaN \n", "3 2016-02-08 07:53:34 39.747753 -84.205582 NaN NaN \n", "4 2016-02-08 08:09:07 39.627781 -84.188354 NaN NaN \n", "... ... ... ... ... ... \n", "4232536 2019-08-23 18:32:01 34.002480 -117.379360 33.99888 -117.37094 \n", "4232537 2019-08-23 19:38:23 32.766960 -117.148060 32.76555 -117.15363 \n", "4232538 2019-08-23 19:28:49 33.775450 -117.847790 33.77740 -117.85727 \n", "4232539 2019-08-23 19:29:42 33.992460 -118.403020 33.98311 -118.39565 \n", "4232540 2019-08-23 19:21:31 34.133930 -117.230920 34.13736 -117.23934 \n", "\n", " Distance(mi) Description \\\n", "0 0.010 Right lane blocked due to accident on I-70 Eas... \n", "1 0.010 Accident on Brice Rd at Tussing Rd. Expect del... \n", "2 0.010 Accident on OH-32 State Route 32 Westbound at ... \n", "3 0.010 Accident on I-75 Southbound at Exits 52 52B US... \n", "4 0.010 Accident on McEwen Rd at OH-725 Miamisburg Cen... \n", "... ... ... \n", "4232536 0.543 At Market St - Accident. \n", "4232537 0.338 At Camino Del Rio/Mission Center Rd - Accident. \n", "4232538 0.561 At Glassell St/Grand Ave - Accident. in the ri... \n", "4232539 0.772 At CA-90/Marina Fwy/Jefferson Blvd - Accident. 
\n", "4232540 0.537 At Highland Ave/Arden Ave - Accident. \n", "\n", " Number Street Side City County \\\n", "0 NaN I-70 E R Dayton Montgomery \n", "1 2584.0 Brice Rd L Reynoldsburg Franklin \n", "2 NaN State Route 32 R Williamsburg Clermont \n", "3 NaN I-75 S R Dayton Montgomery \n", "4 NaN Miamisburg Centerville Rd R Dayton Montgomery \n", "... ... ... ... ... ... \n", "4232536 NaN Pomona Fwy E R Riverside Riverside \n", "4232537 NaN I-8 W R San Diego San Diego \n", "4232538 NaN Garden Grove Fwy R Orange Orange \n", "4232539 NaN San Diego Fwy S R Culver City Los Angeles \n", "4232540 NaN CA-210 W R Highland San Bernardino \n", "\n", " State Zipcode Country Timezone Airport_Code \\\n", "0 OH 45424 US US/Eastern KFFO \n", "1 OH 43068-3402 US US/Eastern KCMH \n", "2 OH 45176 US US/Eastern KI69 \n", "3 OH 45417 US US/Eastern KDAY \n", "4 OH 45459 US US/Eastern KMGY \n", "... ... ... ... ... ... \n", "4232536 CA 92501 US US/Pacific KRAL \n", "4232537 CA 92108 US US/Pacific KMYF \n", "4232538 CA 92866 US US/Pacific KSNA \n", "4232539 CA 90230 US US/Pacific KSMO \n", "4232540 CA 92346 US US/Pacific KSBD \n", "\n", " Weather_Timestamp Temperature(F) Wind_Chill(F) Humidity(%) \\\n", "0 2016-02-08 05:58:00 36.9 NaN 91.0 \n", "1 2016-02-08 05:51:00 37.9 NaN 100.0 \n", "2 2016-02-08 06:56:00 36.0 33.3 100.0 \n", "3 2016-02-08 07:38:00 35.1 31.0 96.0 \n", "4 2016-02-08 07:53:00 36.0 33.3 89.0 \n", "... ... ... ... ... \n", "4232536 2019-08-23 17:53:00 86.0 86.0 40.0 \n", "4232537 2019-08-23 18:53:00 70.0 70.0 73.0 \n", "4232538 2019-08-23 18:53:00 73.0 73.0 64.0 \n", "4232539 2019-08-23 18:51:00 71.0 71.0 81.0 \n", "4232540 2019-08-23 20:50:00 79.0 79.0 47.0 \n", "\n", " Pressure(in) Visibility(mi) Wind_Direction Wind_Speed(mph) \\\n", "0 29.68 10.0 Calm NaN \n", "1 29.65 10.0 Calm NaN \n", "2 29.67 10.0 SW 3.5 \n", "3 29.64 9.0 SW 4.6 \n", "4 29.65 6.0 SW 3.5 \n", "... ... ... ... ... 
\n", "4232536 28.92 10.0 W 13.0 \n", "4232537 29.39 10.0 SW 6.0 \n", "4232538 29.74 10.0 SSW 10.0 \n", "4232539 29.62 10.0 SW 8.0 \n", "4232540 28.63 7.0 SW 7.0 \n", "\n", " Precipitation(in) Weather_Condition Amenity Bump Crossing \\\n", "0 0.02 Light Rain False False False \n", "1 0.00 Light Rain False False False \n", "2 NaN Overcast False False False \n", "3 NaN Mostly Cloudy False False False \n", "4 NaN Mostly Cloudy False False False \n", "... ... ... ... ... ... \n", "4232536 0.00 Fair False False False \n", "4232537 0.00 Fair False False False \n", "4232538 0.00 Partly Cloudy False False False \n", "4232539 0.00 Fair False False False \n", "4232540 0.00 Fair False False False \n", "\n", " Give_Way Junction No_Exit Railway Roundabout Station Stop \\\n", "0 False False False False False False False \n", "1 False False False False False False False \n", "2 False False False False False False False \n", "3 False False False False False False False \n", "4 False False False False False False False \n", "... ... ... ... ... ... ... ... \n", "4232536 False False False False False False False \n", "4232537 False False False False False False False \n", "4232538 False True False False False False False \n", "4232539 False False False False False False False \n", "4232540 False False False False False False False \n", "\n", " Traffic_Calming Traffic_Signal Turning_Loop Sunrise_Sunset \\\n", "0 False False False Night \n", "1 False False False Night \n", "2 False True False Night \n", "3 False False False Night \n", "4 False True False Day \n", "... ... ... ... ... \n", "4232536 False False False Day \n", "4232537 False False False Day \n", "4232538 False False False Day \n", "4232539 False False False Day \n", "4232540 False False False Day \n", "\n", " Civil_Twilight Nautical_Twilight Astronomical_Twilight \n", "0 Night Night Night \n", "1 Night Night Day \n", "2 Night Day Day \n", "3 Day Day Day \n", "4 Day Day Day \n", "... ... ... ... 
\n", "4232536 Day Day Day \n", "4232537 Day Day Day \n", "4232538 Day Day Day \n", "4232539 Day Day Day \n", "4232540 Day Day Day \n", "\n", "[4232541 rows x 49 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_all = pd.read_csv('US_Accidents_Dec20.csv')\n", "df_all" ] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.043944, "end_time": "2021-03-14T03:53:32.110338", "exception": false, "start_time": "2021-03-14T03:53:32.066394", "status": "completed" }, "tags": [] }, "source": [ "### Narrowing the data to Atlanta\n", "In order to narrow the data to the Atlanta, Georgia area, I started by retrieving the latitude and longitude coordinates for Atlanta via Google. The listed coordinates for Atlanta are 33.7490° N, 84.3880° W. To select the area around Atlanta, I opened a map and selected what I thought was a good representation of the city and outer suburbs, which is approximately a 35 square mile area. The first step is to retrieve the rows in the data set that correspond to this area." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "papermill": { "duration": 0.210946, "end_time": "2021-03-14T03:53:32.365966", "exception": false, "start_time": "2021-03-14T03:53:32.155020", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " ID Source TMC Severity Start_Time \\\n", "146255 A-146262 MapQuest 245.0 3 2016-11-30 15:13:44 \n", "146256 A-146263 MapQuest 201.0 3 2016-11-30 15:25:27 \n", "146257 A-146264 MapQuest 229.0 3 2016-11-30 14:42:27 \n", "146258 A-146265 MapQuest 201.0 2 2016-11-30 16:27:58 \n", "146259 A-146266 MapQuest 201.0 3 2016-11-30 16:14:20 \n", "\n", " End_Time Start_Lat Start_Lng End_Lat End_Lng \\\n", "146255 2016-11-30 17:26:28 33.546177 -84.577347 NaN NaN \n", "146256 2016-11-30 16:54:36 33.766376 -84.527321 NaN NaN \n", "146257 2016-11-30 16:57:07 33.786896 -84.493134 NaN NaN \n", "146258 2016-11-30 16:57:41 33.697849 -84.418266 NaN NaN \n", "146259 2016-11-30 16:58:59 33.696915 -84.404984 NaN NaN \n", "\n", " Distance(mi) Description \\\n", "146255 0.01 Two lanes blocked due to accident on I-85 Nort... \n", "146256 0.01 Restrictions on exit ramp due to accident on I... \n", "146257 0.01 Slow traffic due to accident on I-285 Southbou... \n", "146258 0.01 Accident on GA-166 Arthur Langford Pkwy at Syl... \n", "146259 0.01 Accident on I-75 Southbound at Exits 242 243 I... 
\n", "\n", " Number Street Side City County State \\\n", "146255 NaN Senoia Rd R Fairburn Fulton GA \n", "146256 NaN GA-402 E R Atlanta Fulton GA \n", "146257 NaN Donald Lee Hollowell Pkwy NW R Atlanta Fulton GA \n", "146258 NaN Sylvan Rd R Atlanta Fulton GA \n", "146259 NaN Arthur Langford Pkwy E R Atlanta Fulton GA \n", "\n", " Zipcode Country Timezone Airport_Code Weather_Timestamp \\\n", "146255 30213 US US/Eastern KATL 2016-11-30 15:09:00 \n", "146256 30336 US US/Eastern KFTY 2016-11-30 15:09:00 \n", "146257 30331 US US/Eastern KFTY 2016-11-30 14:40:00 \n", "146258 30344 US US/Eastern KATL 2016-11-30 16:52:00 \n", "146259 30315 US US/Eastern KATL 2016-11-30 15:52:00 \n", "\n", " Temperature(F) Wind_Chill(F) Humidity(%) Pressure(in) \\\n", "146255 63.0 NaN 97.0 29.75 \n", "146256 63.0 NaN 90.0 29.73 \n", "146257 63.0 NaN 90.0 29.73 \n", "146258 63.0 NaN 97.0 29.77 \n", "146259 63.0 NaN 97.0 29.70 \n", "\n", " Visibility(mi) Wind_Direction Wind_Speed(mph) Precipitation(in) \\\n", "146255 3.0 WSW 9.2 0.05 \n", "146256 3.0 SSW 5.8 0.04 \n", "146257 2.5 SSW 8.1 0.62 \n", "146258 9.0 WSW 10.4 0.01 \n", "146259 10.0 SW 8.1 0.13 \n", "\n", " Weather_Condition Amenity Bump Crossing Give_Way Junction \\\n", "146255 Rain False False False False False \n", "146256 Rain False False False False False \n", "146257 Heavy Rain False False False False False \n", "146258 Overcast False False False False False \n", "146259 Light Rain False False False False False \n", "\n", " No_Exit Railway Roundabout Station Stop Traffic_Calming \\\n", "146255 False False False False False False \n", "146256 False False False False False False \n", "146257 False False False False False False \n", "146258 False False False False False False \n", "146259 False False False False False False \n", "\n", " Traffic_Signal Turning_Loop Sunrise_Sunset Civil_Twilight \\\n", "146255 False False Day Day \n", "146256 False False Day Day \n", "146257 False False Day Day \n", "146258 False False Day Day \n", 
"146259 False False Day Day \n", "\n", " Nautical_Twilight Astronomical_Twilight \n", "146255 Day Day \n", "146256 Day Day \n", "146257 Day Day \n", "146258 Day Day \n", "146259 Day Day " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "atlanta_lat = 33.7490\n", "atlanta_lng = -84.3880\n", "radius = 0.3\n", "\n", "df = df_all[(df_all['Start_Lat'] <= atlanta_lat + radius) & \n", " (df_all['Start_Lat'] >= atlanta_lat - radius) & \n", " (df_all['Start_Lng'] >= atlanta_lng - radius) & \n", " (df_all['Start_Lng'] <= atlanta_lng + radius)]\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.045056, "end_time": "2021-03-14T03:53:32.456567", "exception": false, "start_time": "2021-03-14T03:53:32.411511", "status": "completed" }, "tags": [] }, "source": [ "### Removing non accident data\n", "For the column [TMC (Traffic Message Channel)](https://wiki.openstreetmap.org/wiki/TMC/Event_Code_List), it lists a variety of codes used to describe traffic incidents. After referencing all the codes listed in the Atlanta data, there are 140 entries for code 406 which is described as 'entry slip road closed'. This didn't sound like an accident, so I decided to investigate further. Fortunately, the accidents have a text description. Looking through these descriptions, several incidents didn't seem to be accidents at all. For example, several roads were closed due to fallen trees, protests, and various other activities. After looking through them, most of the entries that reference an accident have the text 'accident' in the description. So, I will filter out any rows that don't have that description." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "papermill": { "duration": 0.089682, "end_time": "2021-03-14T03:53:32.591887", "exception": false, "start_time": "2021-03-14T03:53:32.502205", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "to_remove = ~df[df['TMC'] == 406]['Description'].str.contains('accident')\n", "df = df.drop(df[df['TMC'] == 406][to_remove].index)" ] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.045422, "end_time": "2021-03-14T03:53:32.683303", "exception": false, "start_time": "2021-03-14T03:53:32.637881", "status": "completed" }, "tags": [] }, "source": [ "### Removing columns that won't be used\n", "The data set contains certain columns that won't be used for mapping and predictions. \n", "\n", "In the paper, the original traffic accident data only contained GPS data. The author used a tool that converted that GPS data into the data for number, street, side, city, county, state, country, and zip code. Given that the original data didn't contain this information, I decided to stick with the original latitude and longitude coordinates to do mapping and prediction.\n", "\n", "For the columns End_Lat and End_Lng, many values were missing, so I decided to use the starting coordinates. Distance(mi) is derived from the start and end coordinates, so I removed it as well.\n", "\n", "The columns ID and Source were irrelevant, so I removed those.\n", "\n", "The column TMC was used for cleaning in a previous step, so it is no longer needed. Also, TMC codes may not be available at the time of an accident to predict the delay. \n", "\n", "The column End_Time will be removed since the Severity column will be used as a target. 
Knowing the end time of an accident isn't something that would be known beforehand in prediction.\n", "\n", "The column Description will be removed because it is a text description of the accident, and also wouldn't be known beforehand.\n", "\n", "The column Timezone will be irrelevant since the entire area is in the same timezone.\n", "\n", "The column Airport_Code will be removed since it gives no localized information on the accident.\n", "\n", "The column Weather_Timestamp will be removed since the time of the last weather update can't provide additional information. After some research, I found several websites that have historical weather APIs, but they all charged a fee which I'm avoiding for this project. So, checking a timestamp to see if more accurate weather data is available can't be accomplished.\n", "\n", "This leaves the following columns:\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "papermill": { "duration": 0.076281, "end_time": "2021-03-14T03:53:32.805395", "exception": false, "start_time": "2021-03-14T03:53:32.729114", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "Index(['Severity', 'Start_Time', 'Start_Lat', 'Start_Lng', 'Temperature(F)',\n", " 'Wind_Chill(F)', 'Humidity(%)', 'Pressure(in)', 'Visibility(mi)',\n", " 'Wind_Direction', 'Wind_Speed(mph)', 'Precipitation(in)',\n", " 'Weather_Condition', 'Amenity', 'Bump', 'Crossing', 'Give_Way',\n", " 'Junction', 'No_Exit', 'Railway', 'Roundabout', 'Station', 'Stop',\n", " 'Traffic_Calming', 'Traffic_Signal', 'Turning_Loop', 'Sunrise_Sunset',\n", " 'Civil_Twilight', 'Nautical_Twilight', 'Astronomical_Twilight'],\n", " dtype='object')" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = df.drop(['ID', 'Source', 'TMC', 'End_Time', 'End_Lat', 'End_Lng', 'Distance(mi)', 'Description', 'Number', 'Street', \n", " 'Side', 'City', 'State', 'Zipcode', 'Country', 'Timezone', 'Airport_Code', 
'Weather_Timestamp', 'County'], axis=1)\n", "df.columns" ] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.04686, "end_time": "2021-03-14T03:53:32.898319", "exception": false, "start_time": "2021-03-14T03:53:32.851459", "status": "completed" }, "tags": [] }, "source": [ "### Converting the timestamp to columns\n", "Since I'm going to plot several charts by time, I will split the timestamp into its own separate columns for simplicity. Columns will be made for Year, Month, Day, Hour, and DayOfWeek. Each timestamp is rounded to the nearest hour before it is split, so an accident at 23:40 is counted in hour 0 of the following day. DayOfWeek ranges from 0 to 6, where 0 is Monday." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "papermill": { "duration": 0.214784, "end_time": "2021-03-14T03:53:33.161005", "exception": false, "start_time": "2021-03-14T03:53:32.946221", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Severity Year Month Day Hour DayOfWeek Start_Lat Start_Lng \\\n", "146255 3 2016 11 30 15 2 33.546177 -84.577347 \n", "146256 3 2016 11 30 15 2 33.766376 -84.527321 \n", "146257 3 2016 11 30 15 2 33.786896 -84.493134 \n", "146258 2 2016 11 30 16 2 33.697849 -84.418266 \n", "146259 3 2016 11 30 16 2 33.696915 -84.404984 \n", "\n", " Temperature(F) Wind_Chill(F) Humidity(%) Pressure(in) \\\n", "146255 63.0 NaN 97.0 29.75 \n", "146256 63.0 NaN 90.0 29.73 \n", "146257 63.0 NaN 90.0 29.73 \n", "146258 63.0 NaN 97.0 29.77 \n", "146259 63.0 NaN 97.0 29.70 \n", "\n", " Visibility(mi) Wind_Direction Wind_Speed(mph) Precipitation(in) \\\n", "146255 3.0 WSW 9.2 0.05 \n", "146256 3.0 SSW 5.8 0.04 \n", "146257 2.5 SSW 8.1 0.62 \n", "146258 9.0 WSW 10.4 0.01 \n", "146259 10.0 SW 8.1 0.13 \n", "\n", " Weather_Condition Amenity Bump Crossing Give_Way Junction \\\n", "146255 Rain False False False False False \n", "146256 Rain False False False False False \n", "146257 Heavy Rain False False False False False \n", "146258 Overcast False False False False False \n", "146259 Light Rain False False False False False \n", "\n", " No_Exit Railway Roundabout Station Stop Traffic_Calming \\\n", "146255 False False False False False False \n", "146256 False False False False False False \n", "146257 False False False False False False \n", "146258 False False False False False False \n", "146259 False False False False False False \n", "\n", " Traffic_Signal Turning_Loop Sunrise_Sunset Civil_Twilight \\\n", "146255 False False Day Day \n", "146256 False False Day Day \n", "146257 False False Day Day \n", "146258 False False Day Day \n", "146259 False False Day Day \n", "\n", " Nautical_Twilight Astronomical_Twilight \n", "146255 Day Day \n", "146256 Day Day \n", "146257 Day Day \n", "146258 Day Day \n", "146259 Day Day " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "#convert to datetime format\n", "df['Start_Time'] = 
pd.to_datetime(df['Start_Time'], infer_datetime_format=True)\n", "#round to nearest hour\n", "df['Start_Time'] = df['Start_Time'].dt.round(\"H\")\n", "#create day of week column (0 = Monday)\n", "df.insert(2, 'DayOfWeek', '')\n", "df['DayOfWeek'] = df['Start_Time'].dt.weekday\n", "#create an hour of the day column\n", "df.insert(2, 'Hour', '')\n", "df['Hour'] = df['Start_Time'].dt.hour\n", "#create day of month column\n", "df.insert(2, 'Day', '')\n", "df['Day'] = df['Start_Time'].dt.day\n", "#create month column\n", "df.insert(2, 'Month', '')\n", "df['Month'] = df['Start_Time'].dt.month\n", "#create year column\n", "df.insert(2, 'Year', '')\n", "df['Year'] = df['Start_Time'].dt.year\n", "df = df.drop(['Start_Time'], axis=1)\n", "display(df.head())" ] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.075389, "end_time": "2021-03-14T03:53:33.296660", "exception": false, "start_time": "2021-03-14T03:53:33.221271", "status": "completed" }, "tags": [] }, "source": [ "### Checking for missing data\n", "Before the data can be used, any missing data must be dealt with. The first step is to identify which values are missing and how many. " ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "papermill": { "duration": 0.26513, "end_time": "2021-03-14T03:53:33.640470", "exception": false, "start_time": "2021-03-14T03:53:33.375340", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " null counts percentage\n", "Wind_Chill(F) 39253 56.60\n", "Precipitation(in) 38395 55.36\n", "Wind_Speed(mph) 8059 11.62\n", "Humidity(%) 338 0.49\n", "Temperature(F) 317 0.46\n", "Wind_Direction 281 0.41\n", "Visibility(mi) 265 0.38\n", "Pressure(in) 239 0.34\n", "Weather_Condition 237 0.34" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def check_na():\n", " return pd.merge(left=df.isna().sum()[df.isna().sum() != 0].rename('null counts'), \n", " right=df.isnull().mean()[df.isnull().mean() != 0].rename('percentage').round(4)*100, \n", " left_index=True, right_index=True).sort_values(ascending=False, by=['percentage'])\n", "check_na()" ] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.047583, "end_time": "2021-03-14T03:53:33.735683", "exception": false, "start_time": "2021-03-14T03:53:33.688100", "status": "completed" }, "tags": [] }, "source": [ "There are nine columns with missing values. For Wind_Chill(F) and Precipitation(in), over half of the values are missing. Since I don't have access to historical weather data in order to fill those in, those two columns will be completely dropped. Wind_Chill(F) is computed from temperature and wind speed, so that information should still be encoded in the data. While Precipitation(in) would probably be a good indicator, there are just too many missing values to fill in without having access to an outside data source.\n", "\n", "For the remaining missing values, it is possible to fill in some data from surrounding values. For a missing value, we can check another entry on that same date in the same area and use that value. For example, if Weather_Condition is missing from an entry with lat/lng x1 y1 on date Jan 1 2000 at 8am, and we have another entry with lat/lng x2 y2 on Jan 1 2000 at 8am, it's reasonable to assume that those would have the same Weather_Condition or other weather values." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": { "papermill": { "duration": 193.186337, "end_time": "2021-03-14T03:56:46.970744", "exception": false, "start_time": "2021-03-14T03:53:33.784407", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " null counts percentage\n", "Wind_Speed(mph) 144 0.21\n", "Temperature(F) 136 0.20\n", "Humidity(%) 136 0.20\n", "Pressure(in) 136 0.20\n", "Visibility(mi) 136 0.20\n", "Wind_Direction 136 0.20\n", "Weather_Condition 136 0.20" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = df.drop(['Wind_Chill(F)', 'Precipitation(in)'], axis=1)\n", "\n", "def get_missing_row_values(row_id):\n", "    #attempts to fill in the missing values of a row from other incidents that happened on the same day\n", "    row = df.loc[row_id].copy()\n", "    #find all entries that happen on that day\n", "    columns = ['Year', 'Month', 'Day']\n", "    values = [row.Year, row.Month, row.Day]\n", "    matches = df[(df[columns] == values).all(axis=1)].copy()\n", "    #drop the row that was passed in so it can't match itself\n", "    matches = matches.drop(row_id)\n", "    matches.insert(6, 'LL_Dist', '')\n", "    #compute distance between the coordinates\n", "    matches['LL_Dist'] = (np.sqrt((row.Start_Lat - matches.Start_Lat)**2 + (row.Start_Lng - matches.Start_Lng)**2))\n", "    matches.insert(6, 'Time_Diff', '')\n", "    #compute time between the accidents\n", "    matches['Time_Diff'] = abs(row.Hour - matches.Hour)\n", "    #sort matches by time differential then distance\n", "    matches = matches.sort_values(['Time_Diff', 'LL_Dist'], ascending=[True, True])\n", "    #to make things simple, we will take the match that is closest hour and least distance\n", "    #this may lead to some small inaccuracy, but the missing values are few, so it should be ok\n", "    #since the match dataframe is sorted by hour then distance, the first matching value should be used\n", "    #possible missing values are:\n", "    #'Temperature(F)', 'Humidity(%)', 'Pressure(in)', 'Visibility(mi)', 'Wind_Direction', 'Wind_Speed(mph)', 'Weather_Condition'\n", "    def fill_row(col_name):\n", "        nonlocal row\n", "        nonlocal matches\n", "        if pd.isna(row[col_name]):\n", "            #index label of the first match that has a value for this column, or None\n", "            value = matches[col_name].first_valid_index()\n", "            if value is not None:\n", "                row[col_name] = matches.loc[value, col_name]\n", "    fill_row('Temperature(F)')\n", "    fill_row('Humidity(%)')\n", "    fill_row('Pressure(in)')\n", "    fill_row('Visibility(mi)')\n", "    fill_row('Wind_Direction')\n", "    fill_row('Wind_Speed(mph)')\n", "    fill_row('Weather_Condition')\n", "    return row\n", "\n", "df_nan = df[df.isna().any(axis=1)].copy()\n", "for i, row in df_nan.iterrows():\n", "    filled_row = get_missing_row_values(i)\n", "    df.loc[i] = filled_row\n", "\n", "check_na()" ] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.047405, "end_time": "2021-03-14T03:56:47.066282", "exception": false, "start_time": "2021-03-14T03:56:47.018877", "status": "completed" }, "tags": [] }, "source": [ "After filling in missing weather values from other accidents on the same day, most of the gaps have been filled in. There are still a small number missing. Next, I will take a look at the rows that still have missing values and the dates they fall on." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "papermill": { "duration": 0.105227, "end_time": "2021-03-14T03:56:47.219054", "exception": false, "start_time": "2021-03-14T03:56:47.113827", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "Year Month Day\n", "2016 7 17 1\n", " 10 29 7\n", "2018 6 29 72\n", " 30 16\n", "2020 11 8 48\n", "dtype: int64" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df_nan = df[df.isna().any(axis=1)].copy()\n", "display(df_nan.groupby(['Year', 'Month', 'Day']).size())" ] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.048801, "end_time": "2021-03-14T03:56:47.317037", "exception": false, "start_time": "2021-03-14T03:56:47.268236", "status": "completed" }, "tags": [] }, "source": [ "The missing data is spread out over five separate days. Next, let's take a look at other data from those days to see why no values were filled in.\n", "Note: I will use head() to display part of the data for brevity." 
] }, { "cell_type": "code", "execution_count": 11, "metadata": { "papermill": { "duration": 0.262568, "end_time": "2021-03-14T03:56:47.628557", "exception": false, "start_time": "2021-03-14T03:56:47.365989", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Severity Year Month Day Hour DayOfWeek Start_Lat Start_Lng \\\n", "2809500 4 2016 7 17 0 6 33.868 -84.28513 \n", "\n", " Temperature(F) Humidity(%) Pressure(in) Visibility(mi) \\\n", "2809500 72.0 91.0 30.11 10.0 \n", "\n", " Wind_Direction Wind_Speed(mph) Weather_Condition Amenity Bump \\\n", "2809500 Calm NaN Clear False False \n", "\n", " Crossing Give_Way Junction No_Exit Railway Roundabout Station \\\n", "2809500 False False False False False False False \n", "\n", " Stop Traffic_Calming Traffic_Signal Turning_Loop Sunrise_Sunset \\\n", "2809500 False False False False Night \n", "\n", " Civil_Twilight Nautical_Twilight Astronomical_Twilight \n", "2809500 Night Night Night " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
" ], "text/plain": [ " Severity Year Month Day Hour DayOfWeek Start_Lat Start_Lng \\\n", "150475 3 2016 10 29 19 5 33.917973 -84.338135 \n", "150476 3 2016 10 29 19 5 33.937080 -84.158661 \n", "150477 3 2016 10 29 20 5 33.912239 -84.207817 \n", "150478 3 2016 10 29 22 5 33.699013 -84.266167 \n", "150479 3 2016 10 29 22 5 33.827091 -84.252602 \n", "150480 3 2016 10 29 22 5 33.821548 -84.359383 \n", "150481 3 2016 10 29 23 5 33.864868 -84.439674 \n", "\n", " Temperature(F) Humidity(%) Pressure(in) Visibility(mi) \\\n", "150475 73.0 46.0 30.14 10.0 \n", "150476 68.0 56.0 30.18 10.0 \n", "150477 66.9 59.0 30.17 10.0 \n", "150478 70.0 57.0 30.18 10.0 \n", "150479 61.0 75.0 30.18 10.0 \n", "150480 60.1 78.0 30.19 9.0 \n", "150481 62.8 72.0 30.20 10.0 \n", "\n", " Wind_Direction Wind_Speed(mph) Weather_Condition Amenity Bump \\\n", "150475 Calm NaN Clear False False \n", "150476 Calm NaN Clear False False \n", "150477 Calm NaN Clear False False \n", "150478 Calm NaN Clear False False \n", "150479 Calm NaN Clear False False \n", "150480 Calm NaN Clear False False \n", "150481 Calm NaN Clear False False \n", "\n", " Crossing Give_Way Junction No_Exit Railway Roundabout Station \\\n", "150475 False False False False False False False \n", "150476 False False False False False False False \n", "150477 False False False False False False False \n", "150478 False False False False False False False \n", "150479 False False False False False False False \n", "150480 False False False False False False False \n", "150481 False False False False False False False \n", "\n", " Stop Traffic_Calming Traffic_Signal Turning_Loop Sunrise_Sunset \\\n", "150475 False False False False Night \n", "150476 False False False False Night \n", "150477 False False False False Night \n", "150478 False False False False Night \n", "150479 False False False False Night \n", "150480 False False False False Night \n", "150481 False False False False Night \n", "\n", " Civil_Twilight Nautical_Twilight 
Astronomical_Twilight \n", "150475 Night Day Day \n", "150476 Night Day Day \n", "150477 Night Night Night \n", "150478 Night Night Night \n", "150479 Night Night Night \n", "150480 Night Night Night \n", "150481 Night Night Night " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
" ], "text/plain": [ " Severity Year Month Day Hour DayOfWeek Start_Lat Start_Lng \\\n", "2094112 2 2018 6 29 4 4 33.849075 -84.430061 \n", "2094116 3 2018 6 29 4 4 33.711597 -84.217415 \n", "2094124 3 2018 6 29 6 4 33.746490 -84.430565 \n", "2094125 3 2018 6 29 6 4 33.742832 -84.403595 \n", "2094136 2 2018 6 29 7 4 33.823612 -84.367332 \n", "\n", " Temperature(F) Humidity(%) Pressure(in) Visibility(mi) \\\n", "2094112 NaN NaN NaN NaN \n", "2094116 NaN NaN NaN NaN \n", "2094124 NaN NaN NaN NaN \n", "2094125 NaN NaN NaN NaN \n", "2094136 NaN NaN NaN NaN \n", "\n", " Wind_Direction Wind_Speed(mph) Weather_Condition Amenity Bump \\\n", "2094112 NaN NaN NaN False False \n", "2094116 NaN NaN NaN False False \n", "2094124 NaN NaN NaN False False \n", "2094125 NaN NaN NaN False False \n", "2094136 NaN NaN NaN False False \n", "\n", " Crossing Give_Way Junction No_Exit Railway Roundabout Station \\\n", "2094112 False False False False False False False \n", "2094116 False False False False False False False \n", "2094124 False False False False False False False \n", "2094125 False False True False False False False \n", "2094136 False False False False False False False \n", "\n", " Stop Traffic_Calming Traffic_Signal Turning_Loop Sunrise_Sunset \\\n", "2094112 False False True False Night \n", "2094116 False False False False Night \n", "2094124 False False False False Night \n", "2094125 False False False False Night \n", "2094136 False False True False Day \n", "\n", " Civil_Twilight Nautical_Twilight Astronomical_Twilight \n", "2094112 Night Night Night \n", "2094116 Night Night Night \n", "2094124 Day Day Day \n", "2094125 Day Day Day \n", "2094136 Day Day Day " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
" ], "text/plain": [ " Severity Year Month Day Hour DayOfWeek Start_Lat Start_Lng \\\n", "1983936 3 2018 6 30 11 5 34.037842 -84.562691 \n", "1983948 3 2018 6 30 13 5 33.703480 -84.170181 \n", "1983949 3 2018 6 30 13 5 33.928226 -84.176018 \n", "1983950 3 2018 6 30 14 5 33.793625 -84.392815 \n", "1983955 3 2018 6 30 15 5 33.793625 -84.392815 \n", "\n", " Temperature(F) Humidity(%) Pressure(in) Visibility(mi) \\\n", "1983936 NaN NaN NaN NaN \n", "1983948 NaN NaN NaN NaN \n", "1983949 NaN NaN NaN NaN \n", "1983950 NaN NaN NaN NaN \n", "1983955 NaN NaN NaN NaN \n", "\n", " Wind_Direction Wind_Speed(mph) Weather_Condition Amenity Bump \\\n", "1983936 NaN NaN NaN False False \n", "1983948 NaN NaN NaN False False \n", "1983949 NaN NaN NaN False False \n", "1983950 NaN NaN NaN False False \n", "1983955 NaN NaN NaN False False \n", "\n", " Crossing Give_Way Junction No_Exit Railway Roundabout Station \\\n", "1983936 False False False False False False False \n", "1983948 False False False False False False False \n", "1983949 False False False False False False False \n", "1983950 False False False False False False False \n", "1983955 False False False False False False False \n", "\n", " Stop Traffic_Calming Traffic_Signal Turning_Loop Sunrise_Sunset \\\n", "1983936 False False False False Day \n", "1983948 False False False False Day \n", "1983949 False False False False Day \n", "1983950 False False False False Day \n", "1983955 False False False False Day \n", "\n", " Civil_Twilight Nautical_Twilight Astronomical_Twilight \n", "1983936 Day Day Day \n", "1983948 Day Day Day \n", "1983949 Day Day Day \n", "1983950 Day Day Day \n", "1983955 Day Day Day " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
" ], "text/plain": [ " Severity Year Month Day Hour DayOfWeek Start_Lat Start_Lng \\\n", "580121 2 2020 11 8 12 6 33.852207 -84.369438 \n", "580123 3 2020 11 8 15 6 33.745125 -84.390442 \n", "580125 3 2020 11 8 15 6 33.760025 -84.493500 \n", "580128 4 2020 11 8 17 6 33.834286 -84.250153 \n", "580130 3 2020 11 8 18 6 33.797981 -84.395782 \n", "\n", " Temperature(F) Humidity(%) Pressure(in) Visibility(mi) \\\n", "580121 NaN NaN NaN NaN \n", "580123 NaN NaN NaN NaN \n", "580125 NaN NaN NaN NaN \n", "580128 NaN NaN NaN NaN \n", "580130 NaN NaN NaN NaN \n", "\n", " Wind_Direction Wind_Speed(mph) Weather_Condition Amenity Bump \\\n", "580121 NaN NaN NaN False False \n", "580123 NaN NaN NaN False False \n", "580125 NaN NaN NaN False False \n", "580128 NaN NaN NaN False False \n", "580130 NaN NaN NaN False False \n", "\n", " Crossing Give_Way Junction No_Exit Railway Roundabout Station \\\n", "580121 False False False False False False False \n", "580123 False False False False False False False \n", "580125 False False False False False False False \n", "580128 False False False False False False False \n", "580130 False False True False False False False \n", "\n", " Stop Traffic_Calming Traffic_Signal Turning_Loop Sunrise_Sunset \\\n", "580121 False False True False Day \n", "580123 False False False False Day \n", "580125 False False False False Day \n", "580128 False False False False Day \n", "580130 False False False False Day \n", "\n", " Civil_Twilight Nautical_Twilight Astronomical_Twilight \n", "580121 Day Day Day \n", "580123 Day Day Day \n", "580125 Day Day Day \n", "580128 Day Day Day \n", "580130 Day Day Day " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display(df.groupby(['Year', 'Month', 'Day']).get_group((2016, 7, 17)))\n", "display(df.groupby(['Year', 'Month', 'Day']).get_group((2016, 10, 29)))\n", "display(df.groupby(['Year', 'Month', 'Day']).get_group((2018, 6, 29)).head(5))\n", "display(df.groupby(['Year', 'Month', 
'Day']).get_group((2018, 6, 30)).head(5))\n", "display(df.groupby(['Year', 'Month', 'Day']).get_group((2020, 11, 8)).head(5))" ] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.051836, "end_time": "2021-03-14T03:56:47.733016", "exception": false, "start_time": "2021-03-14T03:56:47.681180", "status": "completed" }, "tags": [] }, "source": [ "For June 29, 2018, June 30, 2018, and November 8, 2020, all weather data is missing. Since I have no access to historical weather data and only a small number of rows are affected, I will drop those rows." ] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.052386, "end_time": "2021-03-14T03:56:47.837821", "exception": false, "start_time": "2021-03-14T03:56:47.785435", "status": "completed" }, "tags": [] }, "source": [ "For July 17, 2016 and October 29, 2016, only Wind_Speed(mph) is missing. The Wind_Direction for every affected row is Calm, so I will fill these missing values with the mean wind speed of all Calm rows. First, I will look at the mean wind speed for each Wind_Direction category." 
] }, { "cell_type": "code", "execution_count": 12, "metadata": { "papermill": { "duration": 0.09039, "end_time": "2021-03-14T03:56:47.980849", "exception": false, "start_time": "2021-03-14T03:56:47.890459", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "Wind_Direction\n", "CALM 0.000000\n", "Calm 4.953956\n", "E 8.261937\n", "ENE 8.385809\n", "ESE 7.376227\n", "East 8.070343\n", "N 6.769912\n", "NE 7.790666\n", "NNE 6.377500\n", "NNW 9.618584\n", "NW 10.239534\n", "North 7.409104\n", "S 7.427520\n", "SE 6.816618\n", "SSE 7.167884\n", "SSW 7.515848\n", "SW 7.628337\n", "South 7.583828\n", "VAR 4.663056\n", "Variable 4.815705\n", "W 8.767478\n", "WNW 9.799287\n", "WSW 7.646964\n", "West 8.538314\n", "Name: Wind_Speed(mph), dtype: float64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Select the column before aggregating so non-numeric columns are never averaged\n", "df.groupby('Wind_Direction')['Wind_Speed(mph)'].mean()" ] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.051591, "end_time": "2021-03-14T03:56:48.085222", "exception": false, "start_time": "2021-03-14T03:56:48.033631", "status": "completed" }, "tags": [] }, "source": [ "As the output above shows, several Wind_Direction categories are the same direction under different labels, such as CALM and Calm, or E and East. Next, I will merge these duplicate labels and fill in the missing Wind_Speed(mph) values with the mean for the Calm category." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "papermill": { "duration": 0.339672, "end_time": "2021-03-14T03:56:48.479218", "exception": false, "start_time": "2021-03-14T03:56:48.139546", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Severity Year Month Day Hour DayOfWeek Start_Lat Start_Lng \\\n", "2809500 4 2016 7 17 0 6 33.868 -84.28513 \n", "\n", " Temperature(F) Humidity(%) Pressure(in) Visibility(mi) \\\n", "2809500 72.0 91.0 30.11 10.0 \n", "\n", " Wind_Direction Wind_Speed(mph) Weather_Condition Amenity Bump \\\n", "2809500 Calm 3.1 Clear False False \n", "\n", " Crossing Give_Way Junction No_Exit Railway Roundabout Station \\\n", "2809500 False False False False False False False \n", "\n", " Stop Traffic_Calming Traffic_Signal Turning_Loop Sunrise_Sunset \\\n", "2809500 False False False False Night \n", "\n", " Civil_Twilight Nautical_Twilight Astronomical_Twilight \n", "2809500 Night Night Night " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
" ], "text/plain": [ " Severity Year Month Day Hour DayOfWeek Start_Lat Start_Lng \\\n", "150475 3 2016 10 29 19 5 33.917973 -84.338135 \n", "150476 3 2016 10 29 19 5 33.937080 -84.158661 \n", "150477 3 2016 10 29 20 5 33.912239 -84.207817 \n", "150478 3 2016 10 29 22 5 33.699013 -84.266167 \n", "150479 3 2016 10 29 22 5 33.827091 -84.252602 \n", "150480 3 2016 10 29 22 5 33.821548 -84.359383 \n", "150481 3 2016 10 29 23 5 33.864868 -84.439674 \n", "\n", " Temperature(F) Humidity(%) Pressure(in) Visibility(mi) \\\n", "150475 73.0 46.0 30.14 10.0 \n", "150476 68.0 56.0 30.18 10.0 \n", "150477 66.9 59.0 30.17 10.0 \n", "150478 70.0 57.0 30.18 10.0 \n", "150479 61.0 75.0 30.18 10.0 \n", "150480 60.1 78.0 30.19 9.0 \n", "150481 62.8 72.0 30.20 10.0 \n", "\n", " Wind_Direction Wind_Speed(mph) Weather_Condition Amenity Bump \\\n", "150475 Calm 3.1 Clear False False \n", "150476 Calm 3.1 Clear False False \n", "150477 Calm 3.1 Clear False False \n", "150478 Calm 3.1 Clear False False \n", "150479 Calm 3.1 Clear False False \n", "150480 Calm 3.1 Clear False False \n", "150481 Calm 3.1 Clear False False \n", "\n", " Crossing Give_Way Junction No_Exit Railway Roundabout Station \\\n", "150475 False False False False False False False \n", "150476 False False False False False False False \n", "150477 False False False False False False False \n", "150478 False False False False False False False \n", "150479 False False False False False False False \n", "150480 False False False False False False False \n", "150481 False False False False False False False \n", "\n", " Stop Traffic_Calming Traffic_Signal Turning_Loop Sunrise_Sunset \\\n", "150475 False False False False Night \n", "150476 False False False False Night \n", "150477 False False False False Night \n", "150478 False False False False Night \n", "150479 False False False False Night \n", "150480 False False False False Night \n", "150481 False False False False Night \n", "\n", " Civil_Twilight Nautical_Twilight 
Astronomical_Twilight \n", "150475 Night Day Day \n", "150476 Night Day Day \n", "150477 Night Night Night \n", "150478 Night Night Night \n", "150479 Night Night Night \n", "150480 Night Night Night \n", "150481 Night Night Night " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Merge duplicate Wind_Direction labels into a single spelling\n", "df.replace({'CALM' : 'Calm', 'East' : 'E', 'North' : 'N', 'South' : 'S', 'VAR' : 'Variable', 'West' : 'W'}, inplace=True)\n", "\n", "# Mean wind speed of the merged Calm group, rounded to one decimal place\n", "ws_mean = df.groupby('Wind_Direction')['Wind_Speed(mph)'].mean().loc['Calm'].round(1)\n", "\n", "# Fill the missing wind speeds on the two affected dates\n", "ws_index = df.groupby(['Year', 'Month', 'Day']).get_group((2016, 7, 17)).index\n", "df.loc[ws_index, 'Wind_Speed(mph)'] = ws_mean\n", "display(df.loc[ws_index])\n", "\n", "ws_index = df.groupby(['Year', 'Month', 'Day']).get_group((2016, 10, 29)).index\n", "df.loc[ws_index, 'Wind_Speed(mph)'] = ws_mean\n", "display(df.loc[ws_index])" ] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.055617, "end_time": "2021-03-14T03:56:48.590385", "exception": false, "start_time": "2021-03-14T03:56:48.534768", "status": "completed" }, "tags": [] }, "source": [ "All weather data is missing for June 29, 2018, June 30, 2018, and November 8, 2020. Since I have no access to historical weather data and only a small number of rows are affected, I will drop those rows. These dates account for the only remaining NaN values, so I will drop all NaN rows." 
] }, { "cell_type": "code", "execution_count": 14, "metadata": { "papermill": { "duration": 0.073552, "end_time": "2021-03-14T03:56:48.718529", "exception": false, "start_time": "2021-03-14T03:56:48.644977", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "df = df.drop(df_nan.index)" ] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.05683, "end_time": "2021-03-14T03:56:48.831079", "exception": false, "start_time": "2021-03-14T03:56:48.774249", "status": "completed" }, "tags": [] }, "source": [ "### Converting binary values\n", "To use the data set in learning algorithms, the data values must be converted to numerical form. There are several columns of True and False values. There are also several columns of Day and Night values. These values will be converted to 1 and 0." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "papermill": { "duration": 0.452565, "end_time": "2021-03-14T03:56:49.339503", "exception": false, "start_time": "2021-03-14T03:56:48.886938", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "tf_columns = ['Amenity', 'Bump', 'Crossing', 'Give_Way',\n", " 'Junction', 'No_Exit', 'Railway', 'Roundabout', 'Station', 'Stop',\n", " 'Traffic_Calming', 'Traffic_Signal', 'Turning_Loop']\n", "df.replace({False : 0, True : 1}, inplace=True)\n", "\n", "df.replace({'Day' : 1, 'Night' : 0}, inplace=True)" ] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.054935, "end_time": "2021-03-14T03:56:49.449546", "exception": false, "start_time": "2021-03-14T03:56:49.394611", "status": "completed" }, "tags": [] }, "source": [ "### Converting categorical values\n", "Now categorical values need to be converted to numerical values. In the Wind_Direction column, there are values such as N, NE, NNE, etc that need to be converted. The Weather_Condition column will be skipped for now. 
I plan to examine the weather in one of the plots, where the original labels such as Rainy, Windy, Clear, etc. will be useful. " ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "papermill": { "duration": 0.071753, "end_time": "2021-03-14T03:56:49.577011", "exception": false, "start_time": "2021-03-14T03:56:49.505258", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "df['Wind_Direction'] = df['Wind_Direction'].astype('category').cat.codes" ] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.054502, "end_time": "2021-03-14T03:56:49.686744", "exception": false, "start_time": "2021-03-14T03:56:49.632242", "status": "completed" }, "tags": [] }, "source": [ "### Removing columns with single data values\n", "\n", "Some columns have no variance. A column that contains only 0s or only 1s carries no information a model can use, so such columns should be dropped." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "papermill": { "duration": 0.143577, "end_time": "2021-03-14T03:56:49.886625", "exception": false, "start_time": "2021-03-14T03:56:49.743048", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "0 69207\n", "Name: Bump, dtype: int64" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0 69207\n", "Name: Roundabout, dtype: int64" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0 69207\n", "Name: Turning_Loop, dtype: int64" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "drop_no_variance = []\n", "for col in df.columns:\n", "    if len(df[col].value_counts()) <= 1:\n", "        drop_no_variance.append(col)\n", "        display(df[col].value_counts())\n", "df = df.drop(drop_no_variance, axis=1)" ] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.05806, "end_time": "2021-03-14T03:56:50.001999", "exception": false, "start_time": "2021-03-14T03:56:49.943939", "status": "completed" }, 
"tags": [] }, "source": [ "### Display and Save the data\n", "The data has been cleaned and is ready to be plotted and fed into a learning algorithm. The output below confirms that there are no missing values and shows the cleaned data set." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "papermill": { "duration": 1.059866, "end_time": "2021-03-14T03:56:51.120629", "exception": false, "start_time": "2021-03-14T03:56:50.060763", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "Severity 0\n", "Year 0\n", "Month 0\n", "Day 0\n", "Hour 0\n", "DayOfWeek 0\n", "Start_Lat 0\n", "Start_Lng 0\n", "Temperature(F) 0\n", "Humidity(%) 0\n", "Pressure(in) 0\n", "Visibility(mi) 0\n", "Wind_Direction 0\n", "Wind_Speed(mph) 0\n", "Weather_Condition 0\n", "Amenity 0\n", "Crossing 0\n", "Give_Way 0\n", "Junction 0\n", "No_Exit 0\n", "Railway 0\n", "Station 0\n", "Stop 0\n", "Traffic_Calming 0\n", "Traffic_Signal 0\n", "Sunrise_Sunset 0\n", "Civil_Twilight 0\n", "Nautical_Twilight 0\n", "Astronomical_Twilight 0\n", "dtype: int64" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
" ], "text/plain": [ " Severity Year Month Day Hour DayOfWeek Start_Lat Start_Lng \\\n", "0 3 2016 11 30 15 2 33.546177 -84.577347 \n", "1 3 2016 11 30 15 2 33.766376 -84.527321 \n", "2 3 2016 11 30 15 2 33.786896 -84.493134 \n", "3 2 2016 11 30 16 2 33.697849 -84.418266 \n", "4 3 2016 11 30 16 2 33.696915 -84.404984 \n", "... ... ... ... ... ... ... ... ... \n", "69202 2 2019 8 23 18 4 33.665910 -84.344760 \n", "69203 2 2019 8 23 21 4 33.920200 -84.320150 \n", "69204 2 2019 8 23 19 4 33.803530 -84.249600 \n", "69205 2 2019 8 23 19 4 33.803350 -84.249240 \n", "69206 2 2019 8 23 20 4 33.895722 -84.252706 \n", "\n", " Temperature(F) Humidity(%) Pressure(in) Visibility(mi) \\\n", "0 63.0 97.0 29.75 3.0 \n", "1 63.0 90.0 29.73 3.0 \n", "2 63.0 90.0 29.73 2.5 \n", "3 63.0 97.0 29.77 9.0 \n", "4 63.0 97.0 29.70 10.0 \n", "... ... ... ... ... \n", "69202 83.0 67.0 28.91 10.0 \n", "69203 80.0 74.0 28.94 10.0 \n", "69204 85.0 63.0 28.92 10.0 \n", "69205 85.0 63.0 28.92 10.0 \n", "69206 80.0 74.0 28.94 10.0 \n", "\n", " Wind_Direction Wind_Speed(mph) Weather_Condition Amenity Crossing \\\n", "0 17 9.2 Rain 0 0 \n", "1 12 5.8 Rain 0 0 \n", "2 12 8.1 Heavy Rain 0 0 \n", "3 17 10.4 Overcast 0 0 \n", "4 13 8.1 Light Rain 0 0 \n", "... ... ... ... ... ... \n", "69202 7 7.0 Mostly Cloudy 0 0 \n", "69203 4 5.0 Fair 0 0 \n", "69204 7 5.0 Partly Cloudy 0 0 \n", "69205 7 5.0 Partly Cloudy 0 0 \n", "69206 4 5.0 Fair 0 0 \n", "\n", " Give_Way Junction No_Exit Railway Station Stop Traffic_Calming \\\n", "0 0 0 0 0 0 0 0 \n", "1 0 0 0 0 0 0 0 \n", "2 0 0 0 0 0 0 0 \n", "3 0 0 0 0 0 0 0 \n", "4 0 0 0 0 0 0 0 \n", "... ... ... ... ... ... ... ... \n", "69202 0 0 0 0 0 0 0 \n", "69203 0 1 0 0 0 0 0 \n", "69204 0 0 0 0 0 0 0 \n", "69205 0 0 0 0 0 0 0 \n", "69206 0 0 0 0 0 0 0 \n", "\n", " Traffic_Signal Sunrise_Sunset Civil_Twilight Nautical_Twilight \\\n", "0 0 1 1 1 \n", "1 0 1 1 1 \n", "2 0 1 1 1 \n", "3 0 1 1 1 \n", "4 0 1 1 1 \n", "... ... ... ... ... 
\n", "69202 0 1 1 1 \n", "69203 0 0 0 1 \n", "69204 1 1 1 1 \n", "69205 1 1 1 1 \n", "69206 0 0 1 1 \n", "\n", " Astronomical_Twilight \n", "0 1 \n", "1 1 \n", "2 1 \n", "3 1 \n", "4 1 \n", "... ... \n", "69202 1 \n", "69203 1 \n", "69204 1 \n", "69205 1 \n", "69206 1 \n", "\n", "[69207 rows x 29 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df = df.reset_index(drop=True)\n", "display(df.isna().sum())\n", "display(df)\n", "df.to_csv('clean_atlanta_accidents.csv', index=False)" ] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.093262, "end_time": "2021-03-14T03:56:51.285981", "exception": false, "start_time": "2021-03-14T03:56:51.192719", "status": "completed" }, "tags": [] }, "source": [ "# Exploring the data\n", "\n", "In this section, I'll take a deeper look at the data. Several plots will be made to get an idea of the traffic accident patterns.\n", "\n", "I will be looking at data from the years 2017, 2018, and 2019. Since 2016 is an incomplete year, it won't be considered. I also won't be considering the year 2020 since COVID-19 was reported to have a significant effect on Atlanta traffic. However, in the last section of this project, I will take a look at the data and make some comparisons between 2020 and the other three years.\n", "\n", "So, the first step is to remove and separate the years of data." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "papermill": { "duration": 0.149314, "end_time": "2021-03-14T03:56:51.519492", "exception": false, "start_time": "2021-03-14T03:56:51.370178", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Severity Year Month Day Hour DayOfWeek Start_Lat Start_Lng \\\n", "0 3 2017 1 2 14 0 33.652431 -84.396278 \n", "1 3 2017 1 2 14 0 33.744976 -84.390343 \n", "2 3 2017 1 2 14 0 33.928226 -84.176018 \n", "3 2 2017 1 2 15 0 33.821548 -84.359383 \n", "4 3 2017 1 2 15 0 33.843380 -84.487671 \n", "\n", " Temperature(F) Humidity(%) Pressure(in) Visibility(mi) Wind_Direction \\\n", "0 55.9 100.0 30.11 0.2 6 \n", "1 55.9 100.0 30.11 0.2 6 \n", "2 53.1 96.0 30.13 0.5 5 \n", "3 53.1 96.0 30.12 1.0 14 \n", "4 53.6 100.0 30.13 1.5 3 \n", "\n", " Wind_Speed(mph) Weather_Condition Amenity Crossing Give_Way Junction \\\n", "0 8.1 Light Rain 0 0 0 1 \n", "1 8.1 Light Rain 0 0 0 0 \n", "2 6.9 Light Rain 0 0 0 0 \n", "3 4.6 Light Rain 0 0 0 0 \n", "4 6.9 Rain 0 0 0 0 \n", "\n", " No_Exit Railway Station Stop Traffic_Calming Traffic_Signal \\\n", "0 0 0 0 0 0 0 \n", "1 0 0 0 0 0 0 \n", "2 0 0 0 0 0 0 \n", "3 0 0 0 0 0 0 \n", "4 0 0 0 0 0 0 \n", "\n", " Sunrise_Sunset Civil_Twilight Nautical_Twilight Astronomical_Twilight \n", "0 1 1 1 1 \n", "1 1 1 1 1 \n", "2 1 1 1 1 \n", "3 1 1 1 1 \n", "4 1 1 1 1 " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
" ], "text/plain": [ " Severity Year Month Day Hour DayOfWeek Start_Lat Start_Lng \\\n", "0 2 2020 12 31 6 3 33.744804 -84.350777 \n", "1 3 2020 12 31 13 3 33.758720 -84.379395 \n", "2 3 2020 12 31 14 3 33.711750 -84.217361 \n", "3 3 2020 12 31 15 3 33.912098 -84.208755 \n", "4 3 2020 12 31 16 3 33.797970 -84.392929 \n", "\n", " Temperature(F) Humidity(%) Pressure(in) Visibility(mi) Wind_Direction \\\n", "0 45.0 97.0 29.00 0.25 3 \n", "1 55.0 83.0 29.20 7.00 1 \n", "2 52.0 97.0 29.02 3.00 0 \n", "3 54.0 93.0 29.01 3.00 1 \n", "4 57.0 89.0 29.18 6.00 6 \n", "\n", " Wind_Speed(mph) Weather_Condition Amenity Crossing Give_Way Junction \\\n", "0 5.0 Fog 0 0 0 0 \n", "1 6.0 Cloudy 0 0 0 0 \n", "2 0.0 Cloudy 0 0 0 0 \n", "3 3.0 Cloudy 0 0 0 0 \n", "4 3.0 Cloudy 0 0 0 0 \n", "\n", " No_Exit Railway Station Stop Traffic_Calming Traffic_Signal \\\n", "0 0 0 0 0 0 0 \n", "1 0 0 0 0 0 0 \n", "2 0 0 0 0 0 0 \n", "3 0 0 0 0 0 0 \n", "4 0 0 0 0 0 0 \n", "\n", " Sunrise_Sunset Civil_Twilight Nautical_Twilight Astronomical_Twilight \n", "0 0 0 0 0 \n", "1 1 1 1 1 \n", "2 1 1 1 1 \n", "3 1 1 1 1 \n", "4 1 1 1 1 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df_covid = df[df['Year'] == 2020].copy().reset_index(drop=True)\n", "df.drop(df.loc[df['Year'] == 2016].index, inplace=True)\n", "df.drop(df.loc[df['Year'] == 2020].index, inplace=True)\n", "df = df.reset_index(drop=True)\n", "display(df.head())\n", "display(df_covid.head())" ] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.061651, "end_time": "2021-03-14T03:56:51.647579", "exception": false, "start_time": "2021-03-14T03:56:51.585928", "status": "completed" }, "tags": [] }, "source": [ "### Looking at traffic density\n", "Next, we'll take a look at the accident density for 2017, 2018, and 2019. Each year will be plotted separately using a heatmap to plot the accident locations. This should give us an idea of any hot spots in the area. 
Having lived in Atlanta for several years and recently moved back, I would expect the hot spots to be on the interstates around the city. The I-285 perimeter should be fairly heavy. I-20, I-75, and I-85 should also be very heavy, with an even larger hot spot in the downtown connector area." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "papermill": { "duration": 0.581903, "end_time": "2021-03-14T03:56:52.293407", "exception": false, "start_time": "2021-03-14T03:56:51.711504", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ " \n", " " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "def plot_yearly_heatmap():\n", " zmin = 0.9\n", " zmax = 5.1\n", " radius = 2\n", " mapbox_style = 'light'\n", "\n", " df_2017 = df[df['Year'] == 2017]\n", " df_2018 = df[df['Year'] == 2018]\n", " df_2019 = df[df['Year'] == 2019]\n", "\n", " fig = make_subplots(rows=1, cols=3, specs=[[dict(type='mapbox'), dict(type='mapbox'), dict(type='mapbox')]], \n", " subplot_titles=('2017 Traffic Accidents
({} Accidents)'.format(df_2017.shape[0]), \n", " '2018 Traffic Accidents
({} Accidents)'.format(df_2018.shape[0]), \n", " '2019 Traffic Accidents
({} Accidents)'.format(df_2019.shape[0])), \n", " vertical_spacing=0.05, horizontal_spacing=0.01)\n", " \n", " fig.add_trace(go.Densitymapbox(lat=df_2017['Start_Lat'], lon=df_2017['Start_Lng'], z=[1] * df_2017.shape[0], radius=radius, colorscale='Turbo', \n", " colorbar=dict(tickmode='linear'), zmin=zmin, zmax=zmax), row=1, col=1)\n", "\n", " fig.add_trace(go.Densitymapbox(lat=df_2018['Start_Lat'], lon=df_2018['Start_Lng'], z=[1] * df_2018.shape[0], radius=radius, colorscale='Turbo', \n", " colorbar=dict(tickmode='linear'), zmin=zmin, zmax=zmax), row=1, col=2)\n", "\n", " fig.add_trace(go.Densitymapbox(lat=df_2019['Start_Lat'], lon=df_2019['Start_Lng'], z=[1] * df_2019.shape[0], radius=radius, colorscale='Turbo', \n", " colorbar=dict(tickmode='linear'), zmin=zmin, zmax=zmax), row=1, col=3)\n", "\n", " fig.update_layout(width=900, height=450, showlegend=False,\n", " mapbox=dict(center=dict(lat=atlanta_lat, lon=atlanta_lng), accesstoken=mapbox_key, zoom=8, style=mapbox_style), \n", " mapbox2=dict(center=dict(lat=atlanta_lat, lon=atlanta_lng), accesstoken=mapbox_key, zoom=8, style=mapbox_style), \n", " mapbox3=dict(center=dict(lat=atlanta_lat, lon=atlanta_lng), accesstoken=mapbox_key, zoom=8, style=mapbox_style))\n", "\n", " #fig.write_html('yearly_heatmap.html')\n", " fig.show()\n", " \n", " \n", "plot_yearly_heatmap()" ] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.085748, "end_time": "2021-03-14T03:56:52.466452", "exception": false, "start_time": "2021-03-14T03:56:52.380704", "status": "completed" }, "tags": [] }, "source": [ "As expected, the outlines of I-20, I-75, and I-85 are clearly visible, and the downtown connector area is also quite heavy. Zooming in on the maps, you can also see smaller clusters of accidents scattered around the area, but they don't show up as well since they aren't as dense. 
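As a rough numerical cross-check on the heatmaps, the accident coordinates could also be binned into a coarse latitude/longitude grid and the busiest cells listed. The sketch below is only an illustration and is not part of the original analysis: the helper name `top_grid_cells`, the 0.01 degree cell size, and the sample points are all assumptions, and it uses plain Python so it runs without the notebook's data.

```python
from collections import Counter
import math


def top_grid_cells(points, cell_size=0.01, n=3):
    '''Count points per grid cell and return the n busiest cells.

    points: iterable of (lat, lng) pairs in decimal degrees.
    cell_size: cell edge length in degrees (0.01 is an assumed value).
    Returns a list of ((lat_index, lng_index), count) tuples, where each
    index is the floor of the coordinate divided by cell_size.
    '''
    counts = Counter(
        (math.floor(lat / cell_size), math.floor(lng / cell_size))
        for lat, lng in points
    )
    return counts.most_common(n)


# Made-up sample points for illustration (not taken from the dataset).
sample_points = [
    (33.7490, -84.3880),
    (33.7495, -84.3885),
    (33.7499, -84.3889),
    (33.9205, -84.3205),
]
busiest = top_grid_cells(sample_points, n=1)
print(busiest)
```

With the notebook's data, something like `top_grid_cells(zip(df['Start_Lat'], df['Start_Lng']))` could be used instead of the sample points. A smaller cell size would separate nearby interchanges at the cost of noisier counts.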
\n", "\n", "### Looking at weather over 2017-2019\n", "Next, I will plot traffic accidents for the combined three years with several weather conditions taken into account. The weather conditions are listed as follows." ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "papermill": { "duration": 0.109127, "end_time": "2021-03-14T03:56:52.661263", "exception": false, "start_time": "2021-03-14T03:56:52.552136", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "Clear 10593\n", "Mostly Cloudy 8245\n", "Overcast 6842\n", "Partly Cloudy 4741\n", "Fair 4630\n", "Light Rain 3404\n", "Scattered Clouds 3198\n", "Cloudy 1414\n", "Rain 602\n", "Light Drizzle 432\n", "Heavy Rain 262\n", "Fog 245\n", "Haze 172\n", "Light Thunderstorms and Rain 111\n", "Thunderstorm 92\n", "Heavy Thunderstorms and Rain 72\n", "Thunder in the Vicinity 48\n", "Light Snow 45\n", "Thunderstorms and Rain 44\n", "Heavy T-Storm 40\n", "Mist 30\n", "T-Storm 30\n", "Light Rain with Thunder 24\n", "Thunder 23\n", "Mostly Cloudy / Windy 17\n", "Light Rain Showers 12\n", "Light Rain / Windy 11\n", "Drizzle and Fog 10\n", "Rain / Windy 10\n", "Light Ice Pellets 6\n", "Snow 6\n", "Light Freezing Rain 6\n", "Fair / Windy 6\n", "Patches of Fog 6\n", "Partly Cloudy / Windy 5\n", "Cloudy / Windy 5\n", "Heavy Rain / Windy 4\n", "Wintry Mix 4\n", "Light Rain Shower 4\n", "Smoke 3\n", "Rain Showers 2\n", "T-Storm / Windy 2\n", "Heavy T-Storm / Windy 1\n", "Squalls 1\n", "Thunder / Windy 1\n", "Drizzle 1\n", "Name: Weather_Condition, dtype: int64" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['Weather_Condition'].value_counts()" ] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.08755, "end_time": "2021-03-14T03:56:52.836112", "exception": false, "start_time": "2021-03-14T03:56:52.748562", "status": "completed" }, "tags": [] }, "source": [ "Most of these conditions have too few records to map on their own. However, many of them describe overlapping types of weather. For example, T-Storm, Heavy T-Storm, Thunderstorms and Rain, Heavy Thunderstorms and Rain, and others could all be combined into a general Thunderstorm category. For this plot, I will divide the data into 7 categories: Clear, Rain, Thunderstorm, Windy, Winter, Cloudy, and Fog. These might not be as granular as they could be, but they are a good starting point for looking at the weather effects.\n", "\n", "My expectation is to see larger traffic delays in worse weather." ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "papermill": { "duration": 0.313914, "end_time": "2021-03-14T03:56:53.236798", "exception": false, "start_time": "2021-03-14T03:56:52.922884", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "def plot_density_weather_map(df_data):\n", " weather_types = {}\n", " weather_types['Clear'] = ['Clear', 'Fair']\n", " weather_types['Rain'] = ['Light Rain', 'Rain', 'Light Drizzle', 'Heavy Rain', 'Mist', 'Light Rain Showers', 'Rain / Windy', 'Light Rain Shower', 'Rain Showers', \n", " 'Drizzle']\n", " weather_types['Thunderstorms'] = ['Light Thunderstorms and Rain', 'Thunderstorm', 'Heavy Thunderstorms and Rain', 'Thunder in the Vicinity', 'Thunderstorms and Rain', \n", " 'Heavy T-Storm', 'T-Storm', 'Light Rain with Thunder', 'Thunder', 'Light Rain / Windy', 'Heavy Rain / Windy', 'T-Storm / Windy', \n", " 'Squalls', 'Heavy T-Storm / Windy', 'Thunder / Windy']\n", " weather_types['Windy'] = ['Mostly Cloudy / Windy', 'Fair / Windy', 'Cloudy / Windy', 'Partly Cloudy / Windy']\n", " weather_types['Winter'] = ['Light Snow', 'Light Ice Pellets', 'Snow', 'Light Freezing Rain', 'Wintry Mix']\n", " weather_types['Cloudy'] = ['Mostly Cloudy', 'Overcast', 'Partly Cloudy', 'Scattered Clouds', 'Cloudy']\n", " weather_types['Fog'] = ['Fog', 'Haze', 'Drizzle and Fog', 'Patches of Fog', 'Smoke']\n", " \n", " \n", " df_clear = df_data[df_data['Weather_Condition'].isin(weather_types['Clear'])]\n", " df_rain = df_data[df_data['Weather_Condition'].isin(weather_types['Rain'])]\n", " df_tstorm = df_data[df_data['Weather_Condition'].isin(weather_types['Thunderstorms'])]\n", " df_windy = df_data[df_data['Weather_Condition'].isin(weather_types['Windy'])]\n", " df_winter = df_data[df_data['Weather_Condition'].isin(weather_types['Winter'])]\n", " df_cloudy = df_data[df_data['Weather_Condition'].isin(weather_types['Cloudy'])]\n", " df_fog = df_data[df_data['Weather_Condition'].isin(weather_types['Fog'])]\n", " \n", " fig = go.Figure()\n", " \n", " zmin = 0.9\n", " zmax = 5.1\n", " colorscale = ['blue', 'yellow', 'orange', 'red']\n", " fig.add_densitymapbox(lat=df_clear['Start_Lat'], lon=df_clear['Start_Lng'], z=[1] * 
df_clear.shape[0], radius=2, colorscale='Turbo', visible=True, zmin=zmin, zmax=zmax)\n", " fig.add_densitymapbox(lat=df_rain['Start_Lat'], lon=df_rain['Start_Lng'], z=[1] * df_rain.shape[0], radius=2, colorscale='Turbo', visible=False, zmin=zmin, zmax=zmax)\n", " fig.add_densitymapbox(lat=df_tstorm['Start_Lat'], lon=df_tstorm['Start_Lng'], z=[1] * df_tstorm.shape[0], radius=2, colorscale='Turbo', visible=False, zmin=zmin, zmax=zmax)\n", " fig.add_densitymapbox(lat=df_windy['Start_Lat'], lon=df_windy['Start_Lng'], z=[1] * df_windy.shape[0], radius=2, colorscale='Turbo', visible=False, zmin=zmin, zmax=zmax)\n", " fig.add_densitymapbox(lat=df_winter['Start_Lat'], lon=df_winter['Start_Lng'], z=[1] * df_winter.shape[0], radius=2, colorscale='Turbo', visible=False, zmin=zmin, zmax=zmax)\n", " fig.add_densitymapbox(lat=df_cloudy['Start_Lat'], lon=df_cloudy['Start_Lng'], z=[1] * df_cloudy.shape[0], radius=2, colorscale='Turbo', visible=False, zmin=zmin, zmax=zmax)\n", " fig.add_densitymapbox(lat=df_fog['Start_Lat'], lon=df_fog['Start_Lng'], z=[1] * df_fog.shape[0], radius=2, colorscale='Turbo', visible=False, zmin=zmin, zmax=zmax)\n", " \n", " label_string = 'Number of Accidents: {}'\n", " annotation_clear = dict(text=label_string.format(df_clear.shape[0]), showarrow=False, x=0.7, y=1.06, yref='paper')\n", " annotation_rain = dict(text=label_string.format(df_rain.shape[0]), showarrow=False, x=0.7, y=1.06, yref='paper')\n", " annotation_tstorm = dict(text=label_string.format(df_tstorm.shape[0]), showarrow=False, x=0.7, y=1.06, yref='paper')\n", " annotation_windy = dict(text=label_string.format(df_windy.shape[0]), showarrow=False, x=0.7, y=1.06, yref='paper')\n", " annotation_winter = dict(text=label_string.format(df_winter.shape[0]), showarrow=False, x=0.7, y=1.06, yref='paper')\n", " annotation_cloudy = dict(text=label_string.format(df_cloudy.shape[0]), showarrow=False, x=0.7, y=1.06, yref='paper')\n", " annotation_fog = dict(text=label_string.format(df_fog.shape[0]), 
showarrow=False, x=0.7, y=1.06, yref='paper')\n", " annotation_weather_label = dict(text='Weather Condition', showarrow=False, x=-0.3, y=1.06, yref='paper')\n", "\n", " button_clear = dict(method='update', args=[dict(visible=[True, False, False, False, False, False, False]), dict(annotations=[annotation_weather_label, annotation_clear])], label='Clear')\n", " button_rain = dict(method='update', args=[dict(visible=[False, True, False, False, False, False, False]), dict(annotations=[annotation_weather_label, annotation_rain])], label='Rain')\n", " button_tstorm = dict(method='update', args=[dict(visible=[False, False, True, False, False, False, False]), dict(annotations=[annotation_weather_label, annotation_tstorm])], label='Thunderstorm')\n", " button_windy = dict(method='update', args=[dict(visible=[False, False, False, True, False, False, False]), dict(annotations=[annotation_weather_label, annotation_windy])], label='Windy')\n", " button_winter = dict(method='update', args=[dict(visible=[False, False, False, False, True, False, False]), dict(annotations=[annotation_weather_label, annotation_winter])], label='Winter')\n", " button_cloudy = dict(method='update', args=[dict(visible=[False, False, False, False, False, True, False]), dict(annotations=[annotation_weather_label, annotation_cloudy])], label='Cloudy')\n", " button_fog = dict(method='update', args=[dict(visible=[False, False, False, False, False, False, True]), dict(annotations=[annotation_weather_label, annotation_fog])], label='Fog')\n", " \n", " fig.update_layout(width=600, height=600, mapbox=dict(center=dict(lat=atlanta_lat, lon=atlanta_lng), accesstoken=mapbox_key, zoom=8.5, style='light'), \n", " title=dict(text='Atlanta Accident Map', x=0.5, xanchor='center', xref='paper'), \n", " updatemenus=[dict(buttons=[button_clear, button_rain, button_tstorm, button_windy, button_winter, button_cloudy, button_fog], type='buttons')], \n", " annotations=[annotation_weather_label, annotation_clear], 
margin=dict(l=100))\n", " #fig.write_html('weather_heatmap.html')\n", " fig.show()\n", " \n", "plot_density_weather_map(df)" ] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.115046, "end_time": "2021-03-14T03:56:53.467940", "exception": false, "start_time": "2021-03-14T03:56:53.352894", "status": "completed" }, "tags": [] }, "source": [ "Above, we can view the accidents for each weather category by selecting a button on the left side. The results aren't quite what I anticipated, but they make sense. The data is imbalanced towards Clear and Cloudy. With Atlanta being in the South, where it is usually warm, I expected a small count of Winter accidents. Fog, while not rare, isn't too common, so I expected it to be low as well. The Windy category was also expected to be low, since most windy conditions were grouped into other categories; for example, Heavy Rain / Windy and Heavy T-Storm / Windy were grouped into Thunderstorms. However, I expected Rain and Thunderstorms to be much higher. I looked briefly for more information on how MapQuest and Bing categorize weather data, but I didn't turn up anything. One possible explanation is that the counts are lower because thunderstorms and rain are brief events. For example, on a rainy day it may only rain for a couple of hours, so there is only a small window for accidents to occur in. It depends entirely on how the weather is classified. Does Rain mean rain was falling at the time of the accident? Does it mean that it rained at some point that day? There are many questions to answer before this data can be quantified, but I don't have that information available. So, with no further information, I will continue to the next section.\n", "\n", "### Looking at daily and hourly accidents\n", "\n", "One of the defining features of big city traffic is the daily commute. People flood the roadways during 'rush hour' to go to and from work. News stations report on it constantly throughout the day. 
Rush hour usually runs from early to mid morning and again from late afternoon until evening. To look at this pattern, I'll plot the hourly accident data over a week for each of the years 2017-2019. The plot is animated and cycles through every hour of the seven-day week for each year." ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "papermill": { "duration": 3.168731, "end_time": "2021-03-14T03:56:56.752719", "exception": false, "start_time": "2021-03-14T03:56:53.583988", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "#animate all yearly and covid\n", "def plot_animate_delays_by_year():\n", " df_2017 = df[df['Year'] == 2017]\n", " df_2018 = df[df['Year'] == 2018]\n", " df_2019 = df[df['Year'] == 2019]\n", " mapbox_style = 'light'\n", " zmin = 0\n", " zmax = 2\n", " buttons_list = [dict(type=\"buttons\",\n", " buttons=[dict(label=\"Play\", method=\"animate\", args=[None,dict(frame=dict(duration=500,redraw=True),fromcurrent=True)]),\n", " dict(label=\"Pause\", method=\"animate\", args=[[None],dict(frame=dict(duration=0,redraw=True),mode=\"immediate\")])],\n", " direction=\"left\", pad={\"r\": 10, \"t\": 35}, showactive=False, x=0.27, xanchor=\"right\", y=0, yanchor=\"top\")]\n", " sliders_list = [dict(active=0, visible=True, yanchor=\"top\", xanchor=\"left\", \n", " currentvalue=dict(font=dict(size=20, color='#000000'), prefix=\"\", visible=True, xanchor=\"center\"), \n", " pad=dict(b=10, t=10), len=0.8, x=0.3, y=0, tickcolor='white', font={'color' : 'white'}, steps=[])]\n", " fig = make_subplots(rows=1, cols=3, specs=[[dict(type='mapbox'), dict(type='mapbox'), dict(type='mapbox')]], \n", " vertical_spacing=0.05, horizontal_spacing=0.01)\n", "\n", " fig_frames = []\n", " label_string = 'Number of Accidents: {}'\n", " title_annotations = [dict(text='2017 Accidents', showarrow=False, x=0.1, y=1.13, yref='paper'), \n", " dict(text='2018 Accidents', showarrow=False, x=0.5, y=1.13, yref='paper'), \n", " dict(text='2019 Accidents', showarrow=False, x=0.9, y=1.13, yref='paper')]\n", "\n", " for d in range(0, 7):\n", " for h in range(0, 24):\n", " annotation_list = title_annotations.copy()\n", " df_2017_temp = df_2017[(df_2017['DayOfWeek'] == d) & (df_2017['Hour'] == h)]\n", " df_2018_temp = df_2018[(df_2018['DayOfWeek'] == d) & (df_2018['Hour'] == h)]\n", " df_2019_temp = df_2019[(df_2019['DayOfWeek'] == d) & (df_2019['Hour'] == h)]\n", " annotation_list.append(dict(text=label_string.format(df_2017_temp.shape[0]), 
showarrow=False, x=0.075, y=1.07, yref='paper'))\n", " annotation_list.append(dict(text=label_string.format(df_2018_temp.shape[0]), showarrow=False, x=0.5, y=1.07, yref='paper'))\n", " annotation_list.append(dict(text=label_string.format(df_2019_temp.shape[0]), showarrow=False, x=0.94, y=1.07, yref='paper'))\n", " if d == 0 and h == 0:\n", " fig.add_trace(go.Densitymapbox(lat=df_2017_temp['Start_Lat'], lon=df_2017_temp['Start_Lng'], z=[1] * df_2017_temp.shape[0], radius=2, zmin=zmin, zmax=zmax, \n", " colorscale='Turbo', colorbar=dict(tickmode='linear')), row=1, col=1)\n", " fig.add_trace(go.Densitymapbox(lat=df_2018_temp['Start_Lat'], lon=df_2018_temp['Start_Lng'], z=[1] * df_2018_temp.shape[0], radius=2, zmin=zmin, zmax=zmax, \n", " colorscale='Turbo', colorbar=dict(tickmode='linear')), row=1, col=2)\n", " fig.add_trace(go.Densitymapbox(lat=df_2019_temp['Start_Lat'], lon=df_2019_temp['Start_Lng'], z=[1] * df_2019_temp.shape[0], radius=2, zmin=zmin, zmax=zmax, \n", " colorscale='Turbo', colorbar=dict(tickmode='linear')), row=1, col=3)\n", " starting_annotations = annotation_list.copy()\n", "\n", " day = d + 1\n", " hour = '0{}'.format(h) if h <= 9 else '{}'.format(h)\n", " timestamp = pd.Timestamp('2021-02-0{}T{}:00:00'.format(day, hour))\n", " day_text = timestamp.strftime('%A - %I:%M %p')\n", " f1 = go.Densitymapbox(lat=df_2017_temp['Start_Lat'], lon=df_2017_temp['Start_Lng'], z=[1] * df_2017_temp.shape[0], radius=2, zmin=zmin, zmax=zmax, \n", " colorscale='Turbo', colorbar=dict(tickmode='linear'))\n", " f2 = go.Densitymapbox(lat=df_2018_temp['Start_Lat'], lon=df_2018_temp['Start_Lng'], z=[1] * df_2018_temp.shape[0], radius=2, zmin=zmin, zmax=zmax, \n", " colorscale='Turbo', colorbar=dict(tickmode='linear'))\n", " f3 = go.Densitymapbox(lat=df_2019_temp['Start_Lat'], lon=df_2019_temp['Start_Lng'], z=[1] * df_2019_temp.shape[0], radius=2, zmin=zmin, zmax=zmax, \n", " colorscale='Turbo', colorbar=dict(tickmode='linear'))\n", " layout = 
go.Layout(annotations=annotation_list)\n", " frame = go.Frame(data=[f1, f2, f3], traces=[0, 1, 2], name=day_text, layout=layout)\n", " fig_frames.append(frame)\n", " slider_time_step = dict(args=[[day_text], dict(mode='immediate', frame=dict(duration=300, redraw=True))], method='animate', label=day_text)\n", " sliders_list[0]['steps'].append(slider_time_step)\n", "\n", " fig_layout = go.Layout(width=900, height=450, showlegend=False,\n", " mapbox=dict(center=dict(lat=atlanta_lat, lon=atlanta_lng), accesstoken=mapbox_key, zoom=8, style=mapbox_style), \n", " mapbox2=dict(center=dict(lat=atlanta_lat, lon=atlanta_lng), accesstoken=mapbox_key, zoom=8, style=mapbox_style), \n", " mapbox3=dict(center=dict(lat=atlanta_lat, lon=atlanta_lng), accesstoken=mapbox_key, zoom=8, style=mapbox_style),\n", " updatemenus=buttons_list, sliders=sliders_list, annotations=starting_annotations)\n", "\n", " fig.frames = fig_frames\n", " fig.update_layout(fig_layout)\n", " #fig.write_html('yearly_animation_heatmap.html')\n", " fig.show()\n", "\n", "plot_animate_delays_by_year()" ] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.149916, "end_time": "2021-03-14T03:56:57.054285", "exception": false, "start_time": "2021-03-14T03:56:56.904369", "status": "completed" }, "tags": [] }, "source": [ "Looking at the above plots gives an interesting look into how accidents occur over the week, but it is hard to get an accurate view of the information. You can see accidents pick up in the morning rush hour time, but there seems to be a steady flow of accidents throughout the rest of the daylight hours. Below, we will look at a more detailed plot." ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "papermill": { "duration": 0.549744, "end_time": "2021-03-14T03:56:57.752679", "exception": false, "start_time": "2021-03-14T03:56:57.202935", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "def plot_line_delays_by_year():\n", " df_2017 = df[df['Year'] == 2017]\n", " df_2018 = df[df['Year'] == 2018]\n", " df_2019 = df[df['Year'] == 2019]\n", " \n", " df_2017_counts = df_2017.groupby(['DayOfWeek', 'Hour']).count()\n", " df_2018_counts = df_2018.groupby(['DayOfWeek', 'Hour']).count()\n", " df_2019_counts = df_2019.groupby(['DayOfWeek', 'Hour']).count()\n", " \n", " hours = [pd.Timestamp(2021, 2 , 1, x).strftime('%I %p') for x in range(0, 24)]\n", " y_2017_data_list = []\n", " y_2018_data_list = []\n", " y_2019_data_list = []\n", " \n", " #some days/hour are empty, 0 counts need to be filled in\n", " for d in range(0, 7):\n", " data_list_2017 = []\n", " data_list_2018 = []\n", " data_list_2019 = []\n", " for h in range(0, 24):\n", " if (d, h) not in df_2017_counts.index:\n", " data_list_2017.append(0)\n", " else:\n", " data_list_2017.append(df_2017_counts.loc[(d, h)][0])\n", " if (d, h) not in df_2018_counts.index:\n", " data_list_2018.append(0)\n", " else:\n", " data_list_2018.append(df_2018_counts.loc[(d, h)][0])\n", " if (d, h) not in df_2019_counts.index:\n", " data_list_2019.append(0)\n", " else:\n", " data_list_2019.append(df_2019_counts.loc[(d, h)][0])\n", " y_2017_data_list.append(data_list_2017)\n", " y_2018_data_list.append(data_list_2018)\n", " y_2019_data_list.append(data_list_2019)\n", " \n", " fig = make_subplots(rows=5, cols=2, subplot_titles=['Monday', 'Saturday', 'Tuesday', 'Sunday', 'Wednesday', '', 'Thursday', '', 'Friday'], \n", " vertical_spacing=0.125, horizontal_spacing=0.1)\n", " \n", " line_styles = [dict(color='blue'), \n", " dict(color='green'), \n", " dict(color='purple')]\n", " \n", " mark_styles = [dict(color='blue'), \n", " dict(color='green'), \n", " dict(color='purple')]\n", " \n", " fig.add_trace(go.Scatter(x=hours, y=y_2017_data_list[0], name='2017', line=line_styles[0], marker=mark_styles[0], opacity=0.5), row=1, col=1)\n", " 
fig.add_trace(go.Scatter(x=hours, y=y_2018_data_list[0], name='2018', line=line_styles[1], marker=mark_styles[1], opacity=0.5), row=1, col=1)\n", " fig.add_trace(go.Scatter(x=hours, y=y_2019_data_list[0], name='2019', line=line_styles[2], marker=mark_styles[2], opacity=0.5), row=1, col=1)\n", " \n", " fig.add_trace(go.Scatter(x=hours, y=y_2017_data_list[1], name='2017', line=line_styles[0], marker=mark_styles[0], opacity=0.5, showlegend=False), row=2, col=1)\n", " fig.add_trace(go.Scatter(x=hours, y=y_2018_data_list[1], name='2018', line=line_styles[1], marker=mark_styles[1], opacity=0.5, showlegend=False), row=2, col=1)\n", " fig.add_trace(go.Scatter(x=hours, y=y_2019_data_list[1], name='2019', line=line_styles[2], marker=mark_styles[2], opacity=0.5, showlegend=False), row=2, col=1)\n", " \n", " fig.add_trace(go.Scatter(x=hours, y=y_2017_data_list[2], name='2017', line=line_styles[0], marker=mark_styles[0], opacity=0.5, showlegend=False), row=3, col=1)\n", " fig.add_trace(go.Scatter(x=hours, y=y_2018_data_list[2], name='2018', line=line_styles[1], marker=mark_styles[1], opacity=0.5, showlegend=False), row=3, col=1)\n", " fig.add_trace(go.Scatter(x=hours, y=y_2019_data_list[2], name='2019', line=line_styles[2], marker=mark_styles[2], opacity=0.5, showlegend=False), row=3, col=1)\n", " \n", " fig.add_trace(go.Scatter(x=hours, y=y_2017_data_list[3], name='2017', line=line_styles[0], marker=mark_styles[0], opacity=0.5, showlegend=False), row=4, col=1)\n", " fig.add_trace(go.Scatter(x=hours, y=y_2018_data_list[3], name='2018', line=line_styles[1], marker=mark_styles[1], opacity=0.5, showlegend=False), row=4, col=1)\n", " fig.add_trace(go.Scatter(x=hours, y=y_2019_data_list[3], name='2019', line=line_styles[2], marker=mark_styles[2], opacity=0.5, showlegend=False), row=4, col=1)\n", " \n", " fig.add_trace(go.Scatter(x=hours, y=y_2017_data_list[4], name='2017', line=line_styles[0], marker=mark_styles[0], opacity=0.5, showlegend=False), row=5, col=1)\n", " 
fig.add_trace(go.Scatter(x=hours, y=y_2018_data_list[4], name='2018', line=line_styles[1], marker=mark_styles[1], opacity=0.5, showlegend=False), row=5, col=1)\n", " fig.add_trace(go.Scatter(x=hours, y=y_2019_data_list[4], name='2019', line=line_styles[2], marker=mark_styles[2], opacity=0.5, showlegend=False), row=5, col=1)\n", " \n", " fig.add_trace(go.Scatter(x=hours, y=y_2017_data_list[5], name='2017', line=line_styles[0], marker=mark_styles[0], opacity=0.5, showlegend=False), row=1, col=2)\n", " fig.add_trace(go.Scatter(x=hours, y=y_2018_data_list[5], name='2018', line=line_styles[1], marker=mark_styles[1], opacity=0.5, showlegend=False), row=1, col=2)\n", " fig.add_trace(go.Scatter(x=hours, y=y_2019_data_list[5], name='2019', line=line_styles[2], marker=mark_styles[2], opacity=0.5, showlegend=False), row=1, col=2)\n", " \n", " fig.add_trace(go.Scatter(x=hours, y=y_2017_data_list[6], name='2017', line=line_styles[0], marker=mark_styles[0], opacity=0.5, showlegend=False), row=2, col=2)\n", " fig.add_trace(go.Scatter(x=hours, y=y_2018_data_list[6], name='2018', line=line_styles[1], marker=mark_styles[1], opacity=0.5, showlegend=False), row=2, col=2)\n", " fig.add_trace(go.Scatter(x=hours, y=y_2019_data_list[6], name='2019', line=line_styles[2], marker=mark_styles[2], opacity=0.5, showlegend=False), row=2, col=2)\n", " \n", " fig.update_layout(width=800, height=800)\n", " \n", " for r in range(1, 6):\n", " for c in range(1, 3):\n", " fig.update_yaxes(row=r, col=c, range=[0, 400])\n", " #fig.write_html('weekday_hourly_plots.html')\n", " fig.show()\n", "plot_line_delays_by_year()" ] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.156871, "end_time": "2021-03-14T03:56:58.063394", "exception": false, "start_time": "2021-03-14T03:56:57.906523", "status": "completed" }, "tags": [] }, "source": [ "Given the plots above, it is easier to see the weekly rush hour traffic effect. 
During the Monday to Friday work week, you see a sharp spike around 7-8 AM and another small hump in the evening around 5-6 PM. The number of accidents during the morning rush hour seems to be significantly higher than during the evening rush hour. Also, the weekend takes on an entirely different shape. Weekend accidents seem to occur in a single hump that falls around the afternoon, and there are far fewer of them.\n", "\n", "# Predicting traffic severity from an accident\n", "\n", "In the previous sections, I looked at the locations of traffic accidents around Atlanta. These accidents also came with a rating of the severity of the traffic delays they caused. In the following section, I will train a classifier to predict the delay an accident might cause. I will split the data into train and test sets, and see what kind of accuracy I can get from predicting the severity of an accident. I will use several types of classifiers to see which gives the best accuracy.\n", "\n", "\n", "### Weather Encoding\n", "\n", "Before splitting the data into train and test sets, there is one thing I must fix. In the previous sections, I used the weather data to make a plot. For the weather data to be used in the following classifiers, it must be converted to a numerical format.\n", "\n", "Initially, I used one hot encoding to convert the data to numerical format. I did this because it seemed difficult to apply ordinality to the weather conditions. If the weather were ordinal, you could rank the weather conditions from least to most severe. While this may be true to some extent, it's not clear cut. Yes, thunderstorms are worse than a clear day, but is fog worse than light or heavy rain? Is wind worse than light rain? Is heavy wind worse than light rain? There were many different weather types that would be hard to rank, and many that would seem to overlap. 
Thus, one hot encoding seemed the best option.\n", "\n", "After comparing categorical values to one hot encoding, I noticed no real increase in accuracy with one hot encoding; in fact, most models were worse. This is most likely because a lot of the weather conditions were sparse, as seen in a previous section where many weather conditions had only a single-digit number of data points. The weather conditions in this dataset are poorly categorized, with many overlapping types, as seen in the section where I plotted the weather types vs accidents. Also, using one hot encoding caused a substantial increase in run time for certain models.\n", "\n", "However, combining the weather data as I did in the previous weather plot, or in some different grouping, may help results. I may come back later to attempt that, but for now, I will use categorical values. In this specific project, the one hot encoding of the original weather conditions didn't increase the accuracy." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "df['Weather_Condition'] = df['Weather_Condition'].astype('category').cat.codes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Splitting the train/test data\n", "\n", "The first step is to randomly divide the data into train and test splits. I will use an 80/20 split of the data." 
] }, { "cell_type": "code", "execution_count": 26, "metadata": { "papermill": { "duration": 0.181773, "end_time": "2021-03-14T03:56:58.397951", "exception": false, "start_time": "2021-03-14T03:56:58.216178", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "features = ['Hour', 'DayOfWeek', 'Temperature(F)', 'Humidity(%)', 'Pressure(in)',\n", " 'Visibility(mi)', 'Wind_Direction', 'Wind_Speed(mph)',\n", " 'Weather_Condition', 'Amenity', 'Crossing', 'Give_Way',\n", " 'Junction', 'No_Exit', 'Railway', 'Station', 'Stop',\n", " 'Traffic_Calming', 'Traffic_Signal', 'Sunrise_Sunset',\n", " 'Civil_Twilight', 'Nautical_Twilight', 'Astronomical_Twilight',\n", " 'Start_Lat', 'Start_Lng']\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(df[features], df['Severity'], test_size=0.2, random_state=random_state)" ] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.151091, "end_time": "2021-03-14T03:56:58.700666", "exception": false, "start_time": "2021-03-14T03:56:58.549575", "status": "completed" }, "tags": [] }, "source": [ "### Dummy Classifier \n", "\n", "The first classifier I will implement is a dummy classifier. This classifier will simply predict the target with the most frequent count that it has seen with the training data. It is a good measurement to compare other classifiers to." 
] }, { "cell_type": "code", "execution_count": 26, "metadata": { "papermill": { "duration": 0.168854, "end_time": "2021-03-14T03:56:59.019812", "exception": false, "start_time": "2021-03-14T03:56:58.850958", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "3 22097\n", "2 12550\n", "4 1708\n", "1 14\n", "Name: Severity, dtype: int64" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'Dummy Classifier Score: 0.6069504014076762'" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "clf_dummy = DummyClassifier(strategy='most_frequent', random_state=random_state)\n", "clf_dummy.fit(X_train, y_train)\n", "clf_dummy_score = clf_dummy.score(X_test, y_test)\n", "display(y_train.value_counts())\n", "display('Dummy Classifier Score: {}'.format(clf_dummy_score))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As can be seen in the value_counts of the training set above, traffic delays of severity 3 are the most common target. The dummy classifier always predicts a severity of 3. Using this technique, it achieves an accuracy rating of approximately 60.7%.\n", "\n", "### Feature Selection and Models\n", "\n", "For this project, I was looking to read up on and get experience with several different classifiers in the scikit-learn library. For the following models, I will use multiple types of feature selection, and I will use the same types for each different classifier in order to see a comparison to how well they work.\n", "\n", "For the first model, I will use all the features. The models will be trained using default settings.\n", "\n", "For the second model, I will use the features I feel would give the most information on an accident. The models will be trained on default settings. 
The features used will be:\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| Feature | Why |
| --- | --- |
| Latitude | Location of an accident is important. |
| Longitude | Location of an accident is important. |
| Traffic Signal | Many accidents happen at stop lights. Blocking of intersections can cause extreme delays. |
| Crossing | Accidents at road crossings without stop lights can back up the roads, causing delays. |
| Stop | Accidents at stop signs could cause traffic to back up, causing extra delays. |
| Hour | Time of day can affect delays. Example: Delays can be more or less severe depending on whether they happen in rush hour. |
| Day Of The Week | Previous graphs showed fewer accidents happened on the weekend, so this could help predict delays. |
| Weather Condition | Weather could be important in predicting delays. Example: Rain slows down traffic. Having an accident could increase the slowdown. |
\n", "\n", "There are several features I will choose to skip. For example, hour of the day should capture the same information as twilight and sunset, so I will leave those out. Wind speed, humidity, visibility, and pressure should be somewhat captured in the weather condition category. Other features didn't seem like they would add much information.\n", "\n", "For the third model, I will use the top seven features from the SelectKBest algorithm from the scikit-learn library. The model will be trained on default settings. The chart below shows the SelectKBest scores for the dataset. Latitude and longitude had the strongest scores. There was a steep dropoff after that. I decided to go with the top seven because it includes hour and day of the week. Day of the week seems like it could be useful since the previous charts showed Saturdays and Sundays having fewer accidents. \n", " \n", "For the fourth model, I will use Recursive Feature Elimination with Cross Validation (RFECV) from the scikit-learn library. The model will be trained on default settings.\n", "\n", "For the fifth model, I will use all features as used in the first model; however, I will perform a grid search with a few hand selected parameter values. To keep the run time of the notebook down, I ran the parameter search once and hard coded the resulting parameters." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "best_features = SelectKBest(score_func=mutual_info_classif, k='all')\n", "fit = best_features.fit(X_train, y_train)\n", "feature_scores = pd.Series(fit.scores_, index=X_train.columns)\n", "feature_scores.sort_values(ascending=True).plot(kind='barh', figsize=(5,10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below, I will select the seven features from the SelectKBest algorithm and select my choices for the second feature set." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "k_best_features = ['Hour', 'DayOfWeek', 'Crossing', 'Traffic_Signal', 'Astronomical_Twilight', 'Start_Lat', 'Start_Lng']\n", "my_selected_features = ['Hour', 'DayOfWeek', 'Weather_Condition', 'Crossing', 'Stop', 'Traffic_Signal', 'Start_Lat', 'Start_Lng']" ] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.153831, "end_time": "2021-03-14T03:56:59.328825", "exception": false, "start_time": "2021-03-14T03:56:59.174994", "status": "completed" }, "tags": [] }, "source": [ "## Decision Tree Classifier\n", "\n", "For the first classifier, I will use a simple decision tree. As stated in the feature selection and model section, I will run five different variations of the decision tree." 
] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Decision Tree - All Features with Default Model Parameters:'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'Score: 0.7060376113493897'" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "clf_dt_all = DecisionTreeClassifier(random_state=random_state)\n", "clf_dt_all.fit(X_train, y_train)\n", "clf_dt_all_score = clf_dt_all.score(X_test, y_test)\n", "display('Decision Tree - All Features with Default Model Parameters:')\n", "display('Score: {}'.format(clf_dt_all_score))" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Decision Tree - My Features with Default Model Parameters:'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'Score: 0.7225338172220389'" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "clf_dt_my = DecisionTreeClassifier(random_state=random_state)\n", "clf_dt_my.fit(X_train[my_selected_features], y_train)\n", "clf_dt_my_score = clf_dt_my.score(X_test[my_selected_features], y_test)\n", "display('Decision Tree - My Features with Default Model Parameters:')\n", "display('Score: {}'.format(clf_dt_my_score))" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Decision Tree - KBest(7) with Default Model Parameters:'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'Score: 0.7373804025074233'" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "clf_dt_kb = DecisionTreeClassifier(random_state=random_state)\n", "clf_dt_kb.fit(X_train[k_best_features], y_train)\n", "clf_dt_kb_score = clf_dt_kb.score(X_test[k_best_features], y_test)\n", "display('Decision Tree - KBest(7) with Default Model Parameters:')\n", "display('Score: {}'.format(clf_dt_kb_score))" ] }, { "cell_type": 
"code", "execution_count": 31, "metadata": { "papermill": { "duration": 41.662366, "end_time": "2021-03-14T03:57:41.142328", "exception": false, "start_time": "2021-03-14T03:56:59.479962", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "'Decision Tree - Recursive Feature Elimination with Cross Validation with Default Model Parameters'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "\"Features Selected: ['Start_Lat', 'Start_Lng']\"" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'Score: 0.7627845595513032'" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "estimator_dt = DecisionTreeClassifier(random_state=random_state)\n", "selector_dt = RFECV(estimator_dt, scoring='accuracy')\n", "selector_dt.fit(X_train, y_train)\n", "selector_dt_score = selector_dt.score(X_test, y_test)\n", "features_dt = [features[x] for x in np.where(selector_dt.support_)[0]]\n", "display('Decision Tree - Recursive Feature Elimination with Cross Validation with Default Model Parameters')\n", "display('Features Selected: {}'.format(features_dt))\n", "display('Score: {}'.format(selector_dt_score))" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "papermill": { "duration": 0.5076, "end_time": "2021-03-14T03:57:42.573484", "exception": false, "start_time": "2021-03-14T03:57:42.065884", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "'Decision Tree - All Features with Grid Search'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'Score: 0.7462883536786539'" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# clf_dt = DecisionTreeClassifier(random_state=random_state)\n", "# params_dt = {'criterion' : ['gini', 'entropy'], \n", "# 'max_depth': [15, 25, 50, 100, None], \n", "# 'min_samples_split': [2, 5, 10, 15, 20, 30, 50, 100, 200], \n", "# 'min_samples_leaf' : [1, 2, 5, 10, 
15, 20], \n", "# 'splitter' : ['best', 'random']}\n", "# clf_dt_search = GridSearchCV(estimator=clf_dt, param_grid=params_dt, cv=3, n_jobs=-1, verbose=0)\n", "# clf_dt_search.fit(X_train, y_train)\n", "# clf_dt_search_score = clf_dt_search.score(X_test, y_test)\n", "# display('Decision Tree Search Score: {}'.format(clf_dt_search_score))\n", "# display('Search Best Parameters: {}'.format(clf_dt_search.best_params_))\n", "\n", "params_dt_found = {'criterion': 'gini', \n", " 'max_depth': 25, \n", " 'min_samples_leaf': 5, \n", " 'min_samples_split': 100, \n", " 'splitter': 'best'}\n", "clf_dt_search = DecisionTreeClassifier(random_state=random_state, **params_dt_found)\n", "clf_dt_search.fit(X_train, y_train)\n", "clf_dt_search_score = clf_dt_search.score(X_test, y_test)\n", "display('Decision Tree - All Features with Grid Search')\n", "display('Score: {}'.format(clf_dt_search_score))" ] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.157085, "end_time": "2021-03-14T03:57:43.212370", "exception": false, "start_time": "2021-03-14T03:57:43.055285", "status": "completed" }, "tags": [] }, "source": [ "For the standard decision tree, the best accuracy was achieved using recursive feature elimination. The model was able to get an accuracy of 76.3% using only the latitude and longitude features. This is an improvement of approximately 15.6 percentage points over the dummy classifier, which had an accuracy of 60.7%.\n", "\n", "### Random Forest Classifier\n", "\n", "For the next classifier, I'll try the random forest classifier. For the fifth variation of the previous model, I used GridSearchCV to narrow down parameters. For this model, GridSearchCV was taking much longer. Due to this, all the following classifiers will use a RandomizedSearchCV. The randomized search does not try every combination of the parameters provided. You provide the number of iterations you would like to try, and it randomly samples a combination of parameter values for each of those iterations." 
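, "

As a rough illustration of why the exhaustive search was so much slower, the sketch below counts the candidates each strategy would evaluate for the random forest search space used in this notebook. It assumes scikit-learn's ParameterGrid and ParameterSampler helpers, which the notebook itself does not call, and it only counts combinations; no model is trained.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Search space copied from the random forest search in this notebook.
# The values are only counted here, never passed to a model.
params_rf = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [15, 25, 50, 100, None],
    'max_features': ['auto', 'log2', 0.1, 0.2, 0.3, 0.4, 0.5],
    'min_samples_split': [2, 5, 10, 15, 20, 30, 50, 100, 200],
    'min_samples_leaf': [1, 2, 5, 10, 15, 20],
    'bootstrap': [True, False],
    'n_estimators': [50, 100, 200, 300],
}

# An exhaustive grid search evaluates every combination of values.
n_grid = len(ParameterGrid(params_rf))

# A randomized search evaluates only n_iter sampled combinations.
# random_state=0 is an arbitrary choice made for this sketch.
n_iter = 200
sampled = list(ParameterSampler(params_rf, n_iter=n_iter, random_state=0))

cv = 3  # number of cross validation folds, as in the notebook's searches
print('grid search candidates:', n_grid, 'model fits:', n_grid * cv)
print('randomized search candidates:', len(sampled), 'model fits:', len(sampled) * cv)
```

With 3-fold cross validation, the full grid would need 30,240 x 3 = 90,720 model fits, while the randomized search with n_iter=200 needs 600. The trade-off is that the randomized search may miss the best combination in the grid.
"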
] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Random Forest - All Features with Default Model Parameters:'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'Score: 0.7529968107335313'" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "clf_rf_all = RandomForestClassifier(random_state=random_state)\n", "clf_rf_all.fit(X_train, y_train)\n", "clf_rf_all_score = clf_rf_all.score(X_test, y_test)\n", "display('Random Forest - All Features with Default Model Parameters:')\n", "display('Score: {}'.format(clf_rf_all_score))" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Random Forest - My Features with Default Model Parameters:'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'Score: 0.7638843066094798'" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "clf_rf_my = RandomForestClassifier(random_state=random_state)\n", "clf_rf_my.fit(X_train[my_selected_features], y_train)\n", "clf_rf_my_score = clf_rf_my.score(X_test[my_selected_features], y_test)\n", "display('Random Forest - My Features with Default Model Parameters:')\n", "display('Score: {}'.format(clf_rf_my_score))" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Random Forest - KBest(7) with Default Model Parameters:'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'Score: 0.752446937204443'" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "clf_rf_kb = RandomForestClassifier(random_state=random_state)\n", "clf_rf_kb.fit(X_train[k_best_features], y_train)\n", "clf_rf_kb_score = clf_rf_kb.score(X_test[k_best_features], y_test)\n", "display('Random Forest - KBest(7) with Default Model Parameters:')\n", "display('Score: {}'.format(clf_rf_kb_score))" ] }, { "cell_type": 
"code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Random Forest - Recursive Feature Elimination with Cross Validation with Default Model Parameters'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "\"Features Selected: ['Start_Lat', 'Start_Lng']\"" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'Score: 0.7740019795447047'" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "estimator_rf = RandomForestClassifier(random_state=random_state)\n", "selector_rf = RFECV(estimator_rf, scoring='accuracy')\n", "selector_rf.fit(X_train, y_train)\n", "selector_rf_score = selector_rf.score(X_test, y_test)\n", "features_rf = [features[x] for x in np.where(selector_rf.support_)[0]]\n", "display('Random Forest - Recursive Feature Elimination with Cross Validation with Default Model Parameters')\n", "display('Features Selected: {}'.format(features_rf))\n", "display('Score: {}'.format(selector_rf_score))" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Random Forest - All Features with Randomized Search'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'Score: 0.7826899813043'" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# clf_rf = RandomForestClassifier(random_state=random_state)\n", "# params_rf = {'criterion' : ['gini', 'entropy'], \n", "# 'max_depth': [15, 25, 50, 100, None], \n", "# 'max_features': ['auto', 'log2', 0.1, 0.2, 0.3, 0.4, 0.5], \n", "# 'min_samples_split': [2, 5, 10, 15, 20, 30, 50, 100, 200], \n", "# 'min_samples_leaf' : [1, 2, 5, 10, 15, 20], \n", "# 'bootstrap' : [True, False], \n", "# 'n_estimators' : [50, 100, 200, 300]}\n", "# clf_rf_search = RandomizedSearchCV(estimator=clf_rf, param_distributions=params_rf, n_iter=200, cv=3, n_jobs=-1, verbose=0)\n", "# clf_rf_search.fit(X_train, y_train)\n", "# 
clf_rf_search_score = clf_rf_search.score(X_test, y_test)\n", "# display('Random Forest Search Score: {}'.format(clf_rf_search_score))\n", "# display('Search Best Parameters: {}'.format(clf_rf_search.best_params_))\n", "\n", "params_rf_found = {'criterion' : 'entropy', \n", " 'max_depth': None, \n", " 'max_features': 0.5, \n", " 'min_samples_split': 15, \n", " 'min_samples_leaf' : 1, \n", " 'bootstrap' : False, \n", " 'n_estimators' : 50}\n", "clf_rf_search = RandomForestClassifier(random_state=random_state, **params_rf_found)\n", "clf_rf_search.fit(X_train, y_train)\n", "clf_rf_search_score = clf_rf_search.score(X_test, y_test)\n", "display('Random Forest - All Features with Randomized Search')\n", "display('Score: {}'.format(clf_rf_search_score))" ] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.160183, "end_time": "2021-03-14T04:12:09.857544", "exception": false, "start_time": "2021-03-14T04:12:09.697361", "status": "completed" }, "tags": [] }, "source": [ "For the random forest classifier, the best accuracy was achieved using a randomized parameter search. The random forest had an accuracy of 78.3%, which is approximately 17.6 percentage points higher than the dummy classifier's 60.7%.\n", "\n", "### Gradient Boosting Classifier\n", "\n", "For the next classifier, I'll try the gradient boosting classifier. I will use another random parameter search to find the best parameters." 
] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Gradient Boosting - All Features with Default Model Parameters:'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'Score: 0.7316617178049049'" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "clf_gb_all = GradientBoostingClassifier(random_state=random_state)\n", "clf_gb_all.fit(X_train, y_train)\n", "clf_gb_all_score = clf_gb_all.score(X_test, y_test)\n", "display('Gradient Boosting - All Features with Default Model Parameters:')\n", "display('Score: {}'.format(clf_gb_all_score))" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Gradient Boosting - My Features with Default Model Parameters:'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'Score: 0.7321016166281755'" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "clf_gb_my = GradientBoostingClassifier(random_state=random_state)\n", "clf_gb_my.fit(X_train[my_selected_features], y_train)\n", "clf_gb_my_score = clf_gb_my.score(X_test[my_selected_features], y_test)\n", "display('Gradient Boosting - My Features with Default Model Parameters:')\n", "display('Score: {}'.format(clf_gb_my_score))" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Gradient Boosting - KBest(7) with Default Model Parameters:'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'Score: 0.7343011107445288'" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "clf_gb_kb = GradientBoostingClassifier(random_state=random_state)\n", "clf_gb_kb.fit(X_train[k_best_features], y_train)\n", "clf_gb_kb_score = clf_gb_kb.score(X_test[k_best_features], y_test)\n", "display('Gradient Boosting - KBest(7) with Default Model Parameters:')\n", "display('Score: 
{}'.format(clf_gb_kb_score))" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "papermill": { "duration": 1466.875694, "end_time": "2021-03-14T04:36:36.893205", "exception": false, "start_time": "2021-03-14T04:12:10.017511", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "'Gradient Boosting - Recursive Feature Elimination with Cross Validation with Default Model Parameters'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "\"Features Selected: ['Hour', 'DayOfWeek', 'Temperature(F)', 'Pressure(in)', 'Weather_Condition', 'Amenity', 'Crossing', 'Junction', 'Traffic_Signal', 'Start_Lat', 'Start_Lng']\"" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'Score: 0.6914109754756406'" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "estimator_gb = GradientBoostingClassifier(random_state=random_state, n_iter_no_change=3)\n", "selector_gb = RFECV(estimator_gb, scoring='accuracy')\n", "selector_gb.fit(X_train, y_train)\n", "selector_gb_score = selector_gb.score(X_test, y_test)\n", "features_gb = [features[x] for x in np.where(selector_gb.support_)[0]]\n", "display('Gradient Boosting - Recursive Feature Elimination with Cross Validation with Default Model Parameters')\n", "display('Features Selected: {}'.format(features_gb))\n", "display('Score: {}'.format(selector_gb_score))" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "papermill": { "duration": 57.412115, "end_time": "2021-03-14T04:38:05.883145", "exception": false, "start_time": "2021-03-14T04:37:08.471030", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "'Gradient Boosting - All Features with Randomized Search'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'Score: 0.7886286154184537'" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# clf_gb = 
GradientBoostingClassifier(random_state=random_state, n_iter_no_change=3)\n", "# params_gb = {'learning_rate' : [0.09, 0.1, 0.11],\n", "# 'subsample' : [0.9, 1.0],\n", "# 'max_depth': [5, 8, 10], \n", "# 'max_features': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],\n", "# 'n_estimators' : [125, 150, 175]}\n", "# clf_gb_search = RandomizedSearchCV(estimator=clf_gb, param_distributions=params_gb, n_iter=200, cv=3, n_jobs=-1, verbose=2)\n", "# clf_gb_search.fit(X_train, y_train)\n", "# clf_gb_search_score = clf_gb_search.score(X_test, y_test)\n", "# display('Gradient Boosting Search Score: {}'.format(clf_gb_search_score))\n", "# display('Gradient Boosting Parameters: {}'.format(clf_gb_search.best_params_))\n", "\n", "params_gb_found = {'learning_rate' : 0.09,\n", " 'subsample' : 0.9,\n", " 'max_depth': 10, \n", " 'max_features': 0.6, \n", " 'n_estimators' : 150}\n", "clf_gb_search = GradientBoostingClassifier(random_state=random_state, n_iter_no_change=3, **params_gb_found)\n", "clf_gb_search.fit(X_train, y_train)\n", "clf_gb_search_score = clf_gb_search.score(X_test, y_test)\n", "display('Gradient Boosting - All Features with Randomized Search')\n", "display('Score: {}'.format(clf_gb_search_score))" ] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.171678, "end_time": "2021-03-14T04:38:06.880678", "exception": false, "start_time": "2021-03-14T04:38:06.709000", "status": "completed" }, "tags": [] }, "source": [ "The gradient boosting classifier managed an accuracy of 78.9% using a randomized parameter search. This is approximately 18.2 percentage points better than the dummy classifier's 60.7%.\n", "\n", "### Extra Tree Classifier\n", "\n", "For the next classifier, I'll try the extra tree classifier. I will use another random parameter search to find the best parameters." 
] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Extra Tree - All Features with Default Model Parameters:'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'Score: 0.7149455625206202'" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "clf_et_all = ExtraTreesClassifier(random_state=random_state)\n", "clf_et_all.fit(X_train, y_train)\n", "clf_et_all_score = clf_et_all.score(X_test, y_test)\n", "display('Extra Tree - All Features with Default Model Parameters:')\n", "display('Score: {}'.format(clf_et_all_score))" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Extra Tree - My Features with Default Model Parameters:'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'Score: 0.7216540195754977'" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "clf_et_my = ExtraTreesClassifier(random_state=random_state)\n", "clf_et_my.fit(X_train[my_selected_features], y_train)\n", "clf_et_my_score = clf_et_my.score(X_test[my_selected_features], y_test)\n", "display('Extra Tree - My Features with Default Model Parameters:')\n", "display('Score: {}'.format(clf_et_my_score))" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Extra Tree - KBest(7) with Default Model Parameters:'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'Score: 0.7201143736940504'" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "clf_et_kb = ExtraTreesClassifier(random_state=random_state)\n", "clf_et_kb.fit(X_train[k_best_features], y_train)\n", "clf_et_kb_score = clf_et_kb.score(X_test[k_best_features], y_test)\n", "display('Extra Tree - KBest(7) with Default Model Parameters:')\n", "display('Score: {}'.format(clf_et_kb_score))" ] }, { "cell_type": "code", "execution_count": 
46, "metadata": { "papermill": { "duration": 528.986408, "end_time": "2021-03-14T04:46:56.031625", "exception": false, "start_time": "2021-03-14T04:38:07.045217", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "'Extra Tree - Recursive Feature Elimination with Cross Validation with Default Model Parameters'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "\"Features Selected: ['Start_Lat', 'Start_Lng']\"" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'Score: 0.7726822830748927'" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "estimator_et = ExtraTreesClassifier(random_state=random_state)\n", "selector_et = RFECV(estimator_et, scoring='accuracy')\n", "selector_et.fit(X_train, y_train)\n", "selector_et_score = selector_et.score(X_test, y_test)\n", "features_et = [features[x] for x in np.where(selector_et.support_)[0]]\n", "display('Extra Tree - Recursive Feature Elimination with Cross Validation with Default Model Parameters')\n", "display('Features Selected: {}'.format(features_et))\n", "display('Score: {}'.format(selector_et_score))" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "papermill": { "duration": 21.05748, "end_time": "2021-03-14T04:47:22.966051", "exception": false, "start_time": "2021-03-14T04:47:01.908571", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "'Extra Tree - All Features with Randomized Search'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'Score: 0.7394699219179589'" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# clf_et = ExtraTreesClassifier(random_state=random_state)\n", "# params_et = {'criterion' : ['gini', 'entropy'], \n", "# 'max_depth': [30, 40, 50, 60, None], \n", "# 'max_features': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6], \n", "# 'n_estimators' : [50, 100, 200, 300], \n", "# 'bootstrap' : [True, False]}\n", 
"# clf_et_search = RandomizedSearchCV(estimator=clf_et, param_distributions=params_et, n_iter=200, cv=3, n_jobs=-1, verbose=3)\n", "# clf_et_search.fit(X_train, y_train)\n", "# clf_et_search_score = clf_et_search.score(X_test, y_test)\n", "# display('Extra Tree Search Score: {}'.format(clf_et_search_score))\n", "# display('Extra Tree Parameters: {}'.format(clf_et_search.best_params_))\n", "\n", "params_et_found = {'criterion' : 'entropy', \n", " 'max_depth': None, \n", " 'max_features': 0.6, \n", " 'n_estimators' : 200, \n", " 'bootstrap' : False}\n", "clf_et_search = ExtraTreesClassifier(random_state=random_state, **params_et_found)\n", "clf_et_search.fit(X_train, y_train)\n", "clf_et_search_score = clf_et_search.score(X_test, y_test)\n", "display('Extra Tree - All Features with Randomized Search')\n", "display('Score: {}'.format(clf_et_search_score))" ] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.169013, "end_time": "2021-03-14T04:47:23.959471", "exception": false, "start_time": "2021-03-14T04:47:23.790458", "status": "completed" }, "tags": [] }, "source": [ "The extra tree classifier managed an accuracy of 77.3% using recursive feature elimination. This was approximately 16.6 percentage points higher than the 60.7% accuracy of the dummy classifier.\n", "\n", "### Voting Classifier\n", "\n", "All four classifiers above returned results significantly better than the dummy classifier. The best was the gradient boosted trees model that used a randomized parameter search, with an accuracy of 78.9%. Overall, the best results from the four models differed by only a couple of percentage points. I tried other classifiers in my initial prototyping, but these four tree-based classifiers gave the best results. \n", "\n", "Next, I'm going to combine these four best performing models into a voting classifier. A voting classifier takes multiple models. Each model makes a prediction. 
The voting classifier then combines those predictions by vote to produce a final prediction.\n", "\n", "There are multiple ways to count the vote. For a 'hard' vote, each classifier gets one vote and the classification with the most votes wins. For a 'soft' vote, the predicted class probabilities from all the models are averaged, and the class with the highest average probability wins. I will also use weighted voting where each classifier is given a weight based on its test accuracy. The accuracy values from the previous test data will be normalized so they all sum to 1.\n", "\n", "Below, I will take a hard and soft vote. Then, I will take a weighted hard and soft vote. The best performing models from above were the decision tree with RFECV with an accuracy of 76.3%, random forest randomized search with an accuracy of 78.3%, gradient boosting randomized search with an accuracy of 78.9%, and extra tree RFECV with an accuracy of 77.3%." ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "papermill": { "duration": 1276.21342, "end_time": "2021-03-14T05:08:40.343579", "exception": false, "start_time": "2021-03-14T04:47:24.130159", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "'Hard Voting Score: 0.7821401077752117'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'Soft Voting Score: 0.7930276036511602'" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "clf_vote = VotingClassifier(estimators=[('DT', selector_dt), ('RF', clf_rf_search), ('GB', clf_gb_search), ('ET', selector_et)], voting='hard')\n", "clf_vote.fit(X_train, y_train)\n", "clf_vote_score = clf_vote.score(X_test, y_test)\n", "display('Hard Voting Score: {}'.format(clf_vote_score))\n", "\n", "clf_vote = VotingClassifier(estimators=[('DT', selector_dt), ('RF', clf_rf_search), ('GB', clf_gb_search), ('ET', selector_et)], voting='soft')\n", "clf_vote.fit(X_train, y_train)\n", "clf_vote_score = clf_vote.score(X_test, y_test)\n", "display('Soft Voting Score: 
{}'.format(clf_vote_score))" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "papermill": { "duration": 1290.739021, "end_time": "2021-03-14T05:30:11.247355", "exception": false, "start_time": "2021-03-14T05:08:40.508334", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "'Weighted Hard Voting Score: 0.7964368195315078'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'Weighted Soft Voting Score: 0.7933575277686132'" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "weights = np.array([selector_dt_score, clf_rf_search_score, clf_gb_search_score, selector_et_score])\n", "weights = weights / weights.sum()\n", "\n", "clf_vote = VotingClassifier(estimators=[('DT', selector_dt), ('RF', clf_rf_search), ('GB', clf_gb_search), ('ET', selector_et)], voting='hard', weights=weights)\n", "clf_vote.fit(X_train, y_train)\n", "clf_vote_score = clf_vote.score(X_test, y_test)\n", "display('Weighted Hard Voting Score: {}'.format(clf_vote_score))\n", "\n", "clf_vote = VotingClassifier(estimators=[('DT', selector_dt), ('RF', clf_rf_search), ('GB', clf_gb_search), ('ET', selector_et)], voting='soft', weights=weights)\n", "clf_vote.fit(X_train, y_train)\n", "clf_vote_score = clf_vote.score(X_test, y_test)\n", "display('Weighted Soft Voting Score: {}'.format(clf_vote_score))" ] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.168781, "end_time": "2021-03-14T05:30:11.588095", "exception": false, "start_time": "2021-03-14T05:30:11.419314", "status": "completed" }, "tags": [] }, "source": [ "Of the individual models above, the best accuracy before the voting classifier came from the randomized search on the gradient boosted classifier, at 78.9%. Combining several of these classifiers into a voting classifier yielded an accuracy of 79.6% for the weighted hard voting classifier. 
Considering the voting classifier must store and use all four of the models, which increases memory use and makes it slower, this isn't a very significant increase in accuracy." ] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.171627, "end_time": "2021-03-14T05:30:11.931589", "exception": false, "start_time": "2021-03-14T05:30:11.759962", "status": "completed" }, "tags": [] }, "source": [ "# A look at traffic during Covid-19\n", "\n", "In the earlier section, I removed the traffic data from the year 2020 due to the possibility it could be anomalous. It was widely reported in Atlanta that traffic volumes were lower. Also, [there were some reports that accidents were more severe because of this.](https://www.wsbtv.com/news/local/atlanta/despite-covid-19-shutdown-there-was-huge-rise-deadly-wrecks-georgia-roads-over-last-year/FIYWJJQS6ZDJPGN67T7YHY3R7E/) The reasoning was that with less traffic on the road, people were traveling at higher speeds, which made the accidents worse. However, a more severe accident wouldn't necessarily cause traffic delays, since there may be fewer vehicles on the roads to impact. Given those thoughts, I thought it was worth taking a look at some plots of the data for the year.\n", "\n", "Atlanta began to see an impact from Covid-19 in March. After the first case in early March, Atlanta slowly began shutting down. It started with cancellations of events and school closures, and eventually led to shelter-in-place orders. It was widely reported that there was less traffic on the roads. Given that information, we would expect 2020 to be significantly lower in the various traffic plots. We will take a look at that below.\n", "\n", "### Monthly accident count\n", "\n", "Given the discussion above, I would expect to see a dip in the accident counts for 2020 starting in March. With a lot of businesses in Atlanta switching to remote work, I would expect that trend to continue somewhat into the rest of the year. 
However, when I moved to Atlanta in August of 2020, there seemed to be plenty of traffic on the roads, and business seemed to be going on for the most part as usual. It had been almost 20 years since I lived in Atlanta, but I definitely wouldn't have called the traffic light." ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "papermill": { "duration": 0.297355, "end_time": "2021-03-14T05:30:12.402709", "exception": false, "start_time": "2021-03-14T05:30:12.105354", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "def plot_crash_count_severity_per_month():\n", " df_2017 = df[df['Year'] == 2017]\n", " df_2018 = df[df['Year'] == 2018]\n", " df_2019 = df[df['Year'] == 2019]\n", " \n", " df_2017_count = df_2017.groupby(['Month']).count().iloc[:,0]\n", " df_2018_count = df_2018.groupby(['Month']).count().iloc[:,0]\n", " df_2019_count = df_2019.groupby(['Month']).count().iloc[:,0]\n", " df_2020_count = df_covid.groupby(['Month']).count().iloc[:,0]\n", " \n", " df_2017_mean = df_2017.groupby(['Month']).mean()['Severity']\n", " df_2018_mean = df_2018.groupby(['Month']).mean()['Severity']\n", " df_2019_mean = df_2019.groupby(['Month']).mean()['Severity']\n", " df_2020_mean = df_covid.groupby(['Month']).mean()['Severity']\n", " \n", " months = [pd.Timestamp(1900, x , 1).strftime('%b') for x in range(1, 13)]\n", " \n", " fig = make_subplots(rows=2, cols=1, subplot_titles=['Accident Count', 'Traffic Impact Severity (Mean)'])\n", " \n", " line_styles = [dict(color='blue'), \n", " dict(color='green'), \n", " dict(color='purple'), \n", " dict(color='red')]\n", " \n", " mark_styles = [dict(color='blue'), \n", " dict(color='green'), \n", " dict(color='purple'), \n", " dict(color='red')]\n", "\n", " fig.add_trace(go.Scatter(x=months, y=df_2017_count, name='2017', line=line_styles[0], marker=mark_styles[0]), row=1, col=1)\n", " fig.add_trace(go.Scatter(x=months, y=df_2018_count, name='2018', line=line_styles[1], marker=mark_styles[1]), row=1, col=1)\n", " fig.add_trace(go.Scatter(x=months, y=df_2019_count, name='2019', line=line_styles[2], marker=mark_styles[2]), row=1, col=1)\n", " fig.add_trace(go.Scatter(x=months, y=df_2020_count, name='2020', line=line_styles[3], marker=mark_styles[3]), row=1, col=1)\n", " \n", " fig.add_trace(go.Scatter(x=months, y=df_2017_mean, name='2017', line=line_styles[0], marker=mark_styles[0], showlegend=False), row=2, col=1)\n", " fig.add_trace(go.Scatter(x=months, y=df_2018_mean, name='2018', 
line=line_styles[1], marker=mark_styles[1], showlegend=False), row=2, col=1)\n", " fig.add_trace(go.Scatter(x=months, y=df_2019_mean, name='2019', line=line_styles[2], marker=mark_styles[2], showlegend=False), row=2, col=1)\n", " fig.add_trace(go.Scatter(x=months, y=df_2020_mean, name='2020', line=line_styles[3], marker=mark_styles[3], showlegend=False), row=2, col=1)\n", " \n", " fig.update_layout(width=900, height=600)\n", " #fig.write_html('monthly_covid_plot.html')\n", " fig.show()\n", " \n", " \n", "\n", "plot_crash_count_severity_per_month()" ] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.172185, "end_time": "2021-03-14T05:30:12.746113", "exception": false, "start_time": "2021-03-14T05:30:12.573928", "status": "completed" }, "tags": [] }, "source": [ "The plots above show some interesting results. The accident count for the year 2020 was, for the most part, below the other three years until September of 2020. In September, there was a fairly steep climb in accident count. While I commented that traffic seemed normal when I moved here, I didn't expect to see accidents above normal levels. Also, on the traffic impact severity chart, we see a fairly significant drop in severity from September 2020 onward.\n", "\n", "### Weekly accident count\n", "\n", "The monthly count and impact severity weren't quite what was expected, so let's take a look at the count and traffic impact severity for each day of the week." ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "papermill": { "duration": 0.280283, "end_time": "2021-03-14T05:30:13.199608", "exception": false, "start_time": "2021-03-14T05:30:12.919325", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "def plot_crash_count_severity_per_weekday():\n", " df_2017 = df[df['Year'] == 2017]\n", " df_2018 = df[df['Year'] == 2018]\n", " df_2019 = df[df['Year'] == 2019]\n", " \n", " df_2017_count = df_2017.groupby(['DayOfWeek']).count().iloc[:,0]\n", " df_2018_count = df_2018.groupby(['DayOfWeek']).count().iloc[:,0]\n", " df_2019_count = df_2019.groupby(['DayOfWeek']).count().iloc[:,0]\n", " df_2020_count = df_covid.groupby(['DayOfWeek']).count().iloc[:,0]\n", " \n", " df_2017_mean = df_2017.groupby(['DayOfWeek']).mean()['Severity']\n", " df_2018_mean = df_2018.groupby(['DayOfWeek']).mean()['Severity']\n", " df_2019_mean = df_2019.groupby(['DayOfWeek']).mean()['Severity']\n", " df_2020_mean = df_covid.groupby(['DayOfWeek']).mean()['Severity']\n", " \n", " days = [pd.Timestamp(2021, 2 , x).strftime('%a') for x in range(1, 8)]\n", " \n", " fig = make_subplots(rows=2, cols=1, subplot_titles=['Accident Count', 'Traffic Impact Severity (Mean)'])\n", " \n", " line_styles = [dict(color='blue'), \n", " dict(color='green'), \n", " dict(color='purple'), \n", " dict(color='red')]\n", " \n", " mark_styles = [dict(color='blue'), \n", " dict(color='green'), \n", " dict(color='purple'), \n", " dict(color='red')]\n", "\n", " fig.add_trace(go.Scatter(x=days, y=df_2017_count, name='2017', line=line_styles[0], marker=mark_styles[0]), row=1, col=1)\n", " fig.add_trace(go.Scatter(x=days, y=df_2018_count, name='2018', line=line_styles[1], marker=mark_styles[1]), row=1, col=1)\n", " fig.add_trace(go.Scatter(x=days, y=df_2019_count, name='2019', line=line_styles[2], marker=mark_styles[2]), row=1, col=1)\n", " fig.add_trace(go.Scatter(x=days, y=df_2020_count, name='2020', line=line_styles[3], marker=mark_styles[3]), row=1, col=1)\n", " \n", " fig.add_trace(go.Scatter(x=days, y=df_2017_mean, name='2017', line=line_styles[0], marker=mark_styles[0], showlegend=False), row=2, col=1)\n", " fig.add_trace(go.Scatter(x=days, 
y=df_2018_mean, name='2018', line=line_styles[1], marker=mark_styles[1], showlegend=False), row=2, col=1)\n", " fig.add_trace(go.Scatter(x=days, y=df_2019_mean, name='2019', line=line_styles[2], marker=mark_styles[2], showlegend=False), row=2, col=1)\n", " fig.add_trace(go.Scatter(x=days, y=df_2020_mean, name='2020', line=line_styles[3], marker=mark_styles[3], showlegend=False), row=2, col=1)\n", " \n", " fig.update_layout(width=900, height=600)\n", " #fig.write_html('daily_covid_plot.html')\n", " fig.show()\n", " \n", " \n", "\n", "plot_crash_count_severity_per_weekday()" ] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.172368, "end_time": "2021-03-14T05:30:13.548368", "exception": false, "start_time": "2021-03-14T05:30:13.376000", "status": "completed" }, "tags": [] }, "source": [ "Taking a look at the weekly data, we can see that it follows a pattern that was seen earlier where counts are higher during the week and drop off on the weekends. We can see the accident counts are slightly lower for 2020 during the week, but not by much. However, on the weekends, the counts are slightly higher for 2020. Perhaps this could be attributed to remote work? Perhaps people being stuck inside their home all week get out on the weekends to do something? Next, we can see that traffic impact is lower than the previous three years on all days. During 2017-2019, there seemed to be higher delays on the weekend even though there were lower accident counts. However, 2020 didn't see those same impacts over the weekend.\n", "\n", "### Hourly accident count\n", "\n", "In this last plot, I'm going to look at the hourly accident counts and traffic impacts. Given the difference in weekday and weekend data seen in previous sections, I'm going to split the data into hourly plots for the weekday and weekend for all years." 
] }, { "cell_type": "code", "execution_count": 32, "metadata": { "papermill": { "duration": 0.824201, "end_time": "2021-03-14T05:30:14.549744", "exception": false, "start_time": "2021-03-14T05:30:13.725543", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "def plot_crash_count_severity_per_hour():\n", " df_2017 = df[df['Year'] == 2017]\n", " df_2018 = df[df['Year'] == 2018]\n", " df_2019 = df[df['Year'] == 2019]\n", " \n", " df_2017_counts = df_2017.groupby(['DayOfWeek', 'Hour']).count()\n", " df_2018_counts = df_2018.groupby(['DayOfWeek', 'Hour']).count()\n", " df_2019_counts = df_2019.groupby(['DayOfWeek', 'Hour']).count()\n", " df_2020_counts = df_covid.groupby(['DayOfWeek', 'Hour']).count()\n", " \n", " df_2017_means = df_2017.groupby(['DayOfWeek', 'Hour']).mean()\n", " df_2018_means = df_2018.groupby(['DayOfWeek', 'Hour']).mean()\n", " df_2019_means = df_2019.groupby(['DayOfWeek', 'Hour']).mean()\n", " df_2020_means = df_covid.groupby(['DayOfWeek', 'Hour']).mean()\n", " \n", " hours = [pd.Timestamp(2021, 2 , 1, x).strftime('%I %p') for x in range(0, 24)]\n", " y_2017_data_list = [[], []]\n", " y_2018_data_list = [[], []]\n", " y_2019_data_list = [[], []]\n", " y_2020_data_list = [[], []]\n", " \n", " #some days/hour are empty, 0 counts need to be filled in\n", " for d in range(0, 7):\n", " data_list_2017_c, data_list_2017_m = [], []\n", " data_list_2018_c, data_list_2018_m = [], []\n", " data_list_2019_c, data_list_2019_m = [], []\n", " data_list_2020_c, data_list_2020_m = [], []\n", " for h in range(0, 24):\n", " if (d, h) not in df_2017_counts.index:\n", " data_list_2017_c.append(0)\n", " data_list_2017_m.append(0)\n", " else:\n", " data_list_2017_c.append(df_2017_counts.loc[(d, h)][0])\n", " data_list_2017_m.append(df_2017_means.loc[(d, h)]['Severity'])\n", " if (d, h) not in df_2018_counts.index:\n", " data_list_2018_c.append(0)\n", " data_list_2018_m.append(0)\n", " else:\n", " data_list_2018_c.append(df_2018_counts.loc[(d, h)][0])\n", " data_list_2018_m.append(df_2018_means.loc[(d, h)]['Severity'])\n", " if (d, h) not in df_2019_counts.index:\n", " data_list_2019_c.append(0)\n", " data_list_2019_m.append(0)\n", " else:\n", " 
data_list_2019_c.append(df_2019_counts.loc[(d, h)][0])\n", " data_list_2019_m.append(df_2019_means.loc[(d, h)]['Severity'])\n", " if (d, h) not in df_2020_counts.index:\n", " data_list_2020_c.append(0)\n", " data_list_2020_m.append(0)\n", " else:\n", " data_list_2020_c.append(df_2020_counts.loc[(d, h)][0])\n", " data_list_2020_m.append(df_2020_means.loc[(d, h)]['Severity'])\n", " y_2017_data_list[0].append(data_list_2017_c)\n", " y_2018_data_list[0].append(data_list_2018_c)\n", " y_2019_data_list[0].append(data_list_2019_c)\n", " y_2020_data_list[0].append(data_list_2020_c)\n", " y_2017_data_list[1].append(data_list_2017_m)\n", " y_2018_data_list[1].append(data_list_2018_m)\n", " y_2019_data_list[1].append(data_list_2019_m)\n", " y_2020_data_list[1].append(data_list_2020_m)\n", " \n", " fig = make_subplots(rows=2, cols=2, subplot_titles=['Weekday Accident Count', 'Weekend Accident Count', 'Weekday Traffic Impact Severity (Mean)', 'Weekend Traffic Impact Severity (Mean)'])\n", " \n", " line_styles = [dict(color='blue'), \n", " dict(color='green'), \n", " dict(color='purple'), \n", " dict(color='red')]\n", " \n", " mark_styles = [dict(color='blue'), \n", " dict(color='green'), \n", " dict(color='purple'), \n", " dict(color='red')]\n", "\n", " fig.add_trace(go.Scatter(x=hours, y=np.array(y_2017_data_list[0][:5]).sum(axis=0), name='2017', line=line_styles[0], marker=mark_styles[0]), row=1, col=1)\n", " fig.add_trace(go.Scatter(x=hours, y=np.array(y_2018_data_list[0][:5]).sum(axis=0), name='2018', line=line_styles[1], marker=mark_styles[1]), row=1, col=1)\n", " fig.add_trace(go.Scatter(x=hours, y=np.array(y_2019_data_list[0][:5]).sum(axis=0), name='2019', line=line_styles[2], marker=mark_styles[2]), row=1, col=1)\n", " fig.add_trace(go.Scatter(x=hours, y=np.array(y_2020_data_list[0][:5]).sum(axis=0), name='2020', line=line_styles[3], marker=mark_styles[3]), row=1, col=1)\n", " \n", " fig.add_trace(go.Scatter(x=hours, 
y=np.array(y_2017_data_list[0][5:]).sum(axis=0), name='2017', line=line_styles[0], marker=mark_styles[0], showlegend=False), \n", " row=1, col=2)\n", " fig.add_trace(go.Scatter(x=hours, y=np.array(y_2018_data_list[0][5:]).sum(axis=0), name='2018', line=line_styles[1], marker=mark_styles[1], showlegend=False), \n", " row=1, col=2)\n", " fig.add_trace(go.Scatter(x=hours, y=np.array(y_2019_data_list[0][5:]).sum(axis=0), name='2019', line=line_styles[2], marker=mark_styles[2], showlegend=False), \n", " row=1, col=2)\n", " fig.add_trace(go.Scatter(x=hours, y=np.array(y_2020_data_list[0][5:]).sum(axis=0), name='2020', line=line_styles[3], marker=mark_styles[3], showlegend=False), \n", " row=1, col=2)\n", "\n", " fig.add_trace(go.Scatter(x=hours, y=np.array(y_2017_data_list[1][:5]).mean(axis=0), name='2017', line=line_styles[0], marker=mark_styles[0], showlegend=False), \n", " row=2, col=1)\n", " fig.add_trace(go.Scatter(x=hours, y=np.array(y_2018_data_list[1][:5]).mean(axis=0), name='2018', line=line_styles[1], marker=mark_styles[1], showlegend=False), \n", " row=2, col=1)\n", " fig.add_trace(go.Scatter(x=hours, y=np.array(y_2019_data_list[1][:5]).mean(axis=0), name='2019', line=line_styles[2], marker=mark_styles[2], showlegend=False), \n", " row=2, col=1)\n", " fig.add_trace(go.Scatter(x=hours, y=np.array(y_2020_data_list[1][:5]).mean(axis=0), name='2020', line=line_styles[3], marker=mark_styles[3], showlegend=False), \n", " row=2, col=1)\n", " \n", " fig.add_trace(go.Scatter(x=hours, y=np.array(y_2017_data_list[1][5:]).mean(axis=0), name='2017', line=line_styles[0], marker=mark_styles[0], showlegend=False), \n", " row=2, col=2)\n", " fig.add_trace(go.Scatter(x=hours, y=np.array(y_2018_data_list[1][5:]).mean(axis=0), name='2018', line=line_styles[1], marker=mark_styles[1], showlegend=False), \n", " row=2, col=2)\n", " fig.add_trace(go.Scatter(x=hours, y=np.array(y_2019_data_list[1][5:]).mean(axis=0), name='2019', line=line_styles[2], marker=mark_styles[2], 
showlegend=False), \n", " row=2, col=2)\n", " fig.add_trace(go.Scatter(x=hours, y=np.array(y_2020_data_list[1][5:]).mean(axis=0), name='2020', line=line_styles[3], marker=mark_styles[3], showlegend=False), \n", " row=2, col=2)\n", " \n", " \n", " fig.update_layout(width=900, height=600, xaxis1=dict(tickmode='linear', tickangle=90), xaxis2=dict(tickmode='linear', tickangle=90))\n", " fig.update_yaxes(row=1, col=1, range=[0, 1800])\n", " fig.update_yaxes(row=1, col=2, range=[0, 1800])\n", " fig.update_yaxes(row=2, col=1, range=[0, 5])\n", " fig.update_yaxes(row=2, col=2, range=[0, 5])\n", " #fig.write_html('weekday_hourly_covid_plots.html')\n", " fig.show()\n", " \n", " \n", "\n", "plot_crash_count_severity_per_hour()" ] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.175422, "end_time": "2021-03-14T05:30:14.899741", "exception": false, "start_time": "2021-03-14T05:30:14.724319", "status": "completed" }, "tags": [] }, "source": [ "As can be seen in the above charts, the weekday accident counts for 2020 are significantly lower during the morning rush hour, while for the rest of the day they fall close to the counts for the other three years. There is also not much difference in traffic impact severity on weekdays between the years. For the weekend data, all years are close except for a slight dip in traffic impact severity in the early morning hours.\n", "\n", "# Summary\n", "\n", "In summary, this project was meant to be an exercise to get familiar with cleaning an unprocessed data set, making interactive plots, and getting more familiar with the scikit-learn library. \n", "\n", "The data was a set of traffic accidents and the traffic delays caused by them. To begin, I narrowed the data down to the Atlanta, Georgia area. I cleaned the data by removing values that weren't needed and filling in missing values. 
After the data was prepared, I used the Plotly library to make various interactive plots to visualize the data. \n", "\n", "I looked at heatmap plots for the years 2017, 2018, and 2019. The heatmaps showed large concentrations of accidents located around Atlanta's interstates. I looked at a weather plot for all years that showed a heatmap for each type of weather condition. This data was imbalanced towards clear and cloudy, so not much information could be gathered from the plot. Next, I plotted animated heatmaps for 2017, 2018, and 2019 that showed hourly heatmaps of accidents from Monday through Sunday. You could see that there was a slight increase in accidents during rush hour times, but it didn't give a clear picture of the data. I looked further at the rush hour traffic incidents by plotting line graphs that showed each weekday broken down into hours, where you could clearly see a rise in rush hour accidents during the week.\n", "\n", "After looking at various plots, I divided the data into train/test sets which I used to train Decision Tree, Random Forest, Gradient Boosted Trees, and Extra Tree classifiers. For each of these classifiers, I trained several models using various techniques. Next, I combined some of the best models into a voting classifier. Overall, I managed to get significantly better results than the dummy classifier, which chose the most frequent category. \n", "\n", "Finally, I took a look at traffic data for the year 2020. I compared this data to the data I used in previous sections in order to get an idea of how Covid-19 affected accidents and delays. I looked at the monthly accident count for all years. For that plot, 2020 started off below the other years but had risen above them towards the end of the year. I plotted the weekly accident count to look at the Monday-Sunday week, which showed 2020 was slightly below the other years during the week but slightly above them on the weekends. 
Finally, I plotted the hourly data for the four years, which showed a significant drop only during the weekday morning rush hour; all other times were reasonably close. Overall, the data for 2020 didn't match what I expected.\n", "\n", "I have one final note about the plots and data for this project. Many of the traffic plots had different results than what was expected. I want to call back to the About the Data section. In that section, it was mentioned that this dataset is most likely a subset of all the accidents that happened. While the data came from multiple large reliable sources, it is unlikely that they captured all accidents that happened. With that information, it is impossible to predict what data is missing, and this could be the cause of the results being so different than expected. This project was intended to be a practice exercise and not a completely accurate representation of traffic accidents." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" }, "papermill": { "default_parameters": {}, "duration": 5868.621768, "end_time": "2021-03-14T05:30:17.996621", "environment_variables": {}, "exception": null, "input_path": "__notebook__.ipynb", "output_path": "__notebook__.ipynb", "parameters": {}, "start_time": "2021-03-14T03:52:29.374853", "version": "2.2.2" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }