{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Cleaning\n", "\n", "In this note book we check what data is missing data and look at outliers. \n", "\n", "We find that most stations have 1-2% of days of data missing, however Newtown has 10% data. A large proporition is Sat/Sun data from Feb to June of 2013. Major station work was undertaking in 2012 and the station reopend in Oct 29, 2012. It is possible the station was closed on the weekend though to June 2013 to complete the second stage of work.\n", "\n", "We find there are a relatively small number of outliers that can be easily detected. We then create cleaned hourly data by replacing these outliers with values interpolated from the neightbours. Finally we we create daily data for each station/date, from both the cleaned and uncleaned hourly data." ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2017-02-06T02:51:50.797948", "start_time": "2017-02-06T02:51:50.794627" }, "code_folding": [] }, "source": [ "## The Data" ] }, { "cell_type": "markdown", "metadata": { "run_control": { "frozen": false, "read_only": false } }, "source": [ "This data is based on the train turnstile \"validation\" data from 2013 to June 2016 provided by Transport for NSW. It has transformed so there is an a single observation for each Station, Date and Hour for the Exit and Entry figure. \n", "\n", "It includes:\n", "* Station: The train station\n", "* Hours: 5pm - 2am\n", "* Date: The date of the observation such that trips between 5pm and 2am the next day are the same date.\n", "* Night: THe name of the day for Date, e.g., \"Friday\".\n", "* Exit: The number of people exiting during the hour\n", "* Entry: The number of people entering during the hour" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2017-02-06T03:44:39.654542", "start_time": "2017-02-06T03:44:38.260398" }, "collapsed": false, "init_cell": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "from datetime import timedelta, datetime\n", "\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "plt.rcParams['figure.figsize'] = (12, 2)\n", "\n", "import seaborn as sns\n", "sns.set(font_scale=0.6)\n", "\n", "import outlier\n", "from utilities import *" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2017-02-06T03:44:39.669136", "start_time": "2017-02-06T03:44:39.656522" }, "collapsed": true }, "outputs": [], "source": [ "df_hourly = pd.read_pickle(\"TrainValidationData/df_hourly.pkl\")" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2017-02-06T03:44:39.677813", "start_time": "2017-02-06T03:44:39.671474" }, "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['5PM', '6PM', '7PM', '8PM', '9PM', '10PM', '11PM', '12AM', '1AM', '2AM']\n", "['Friday', 'Saturday', 'Sunday']\n", "['Bondi Junction Station', 'Central Station', 'Circular Quay Station', 'Kings Cross Station', 'Martin Place Station', 'Museum Station', 'Newtown Station', 'North Sydney Station', 'Parramatta Station', 'St James Station', 'Town Hall Station']\n" ] } ], "source": [ "print(list(df_hourly.Hour.cat.categories))\n", "print(list(df_hourly.Night.cat.categories))\n", "print(list(df_hourly.Station.cat.categories))" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2017-01-25T05:46:54.617604", "start_time": "2017-01-25T05:46:54.610586" }, "run_control": { "frozen": false, "read_only": false } }, "source": [ "# Missing data\n", "\n", "In summary in this section we discover that:\n", "1. Most stations are missing 1% to 2% of day however Newtown Station is missing 10% (40 days).\n", "2. The missing data for Kings Cross and Parramatta is a few whole days on Saturday and probably due to the station being closed for station or trackwork.\n", "3. The missing Newtown data is a combination of two factors. The the first 5 months of Saturdays from Feb-June 2013 are missing, which is probably the second stage of station upgrade works. There is also evidence that data is more likely to be missing later in the evening, possibly due to the gates being left open and people not swiping out." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2017-02-06T03:44:39.711126", "start_time": "2017-02-06T03:44:39.680264" }, "collapsed": false, "run_control": { "frozen": false, "read_only": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of days 549\n", "Bondi Junction Station: 11\n", " Central Station: 2\n", " Circular Quay Station: 13\n", " Kings Cross Station: 14\n", " Martin Place Station: 10\n", " Museum Station: 16\n", " Newtown Station: 58\n", " North Sydney Station: 12\n", " Parramatta Station: 6\n", " St James Station: 17\n", " Town Hall Station: 0\n" ] } ], "source": [ "daily = df_hourly.reset_index().pivot_table(index = 'Date', columns = ['Station'], values='Exit', aggfunc=sum)\n", "print(\"Number of days\", len(daily))\n", "for station in daily.columns:\n", " print(\"{:>22}: {:}\".format(station, daily[station].isnull().sum()))" ] }, { "cell_type": "markdown", "metadata": { "run_control": { "frozen": false, "read_only": false } }, "source": [ "In the above we see number of days where there was no data for the entier period (7pm-1am). We see that: \n", "- Town Hall has no missing days and Centeral only 1 \n", "- Paramatta is about 1% missing days\n", "- All other except Newtown are around 2%\n", "- Newtown is 10% and needs closers investigation" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2017-02-06T03:44:39.717534", "start_time": "2017-02-06T03:44:39.713536" }, "code_folding": [], "collapsed": false, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "def missing_days(station, daily):\n", " for dt in daily[daily[station].isnull()].index:\n", " print(dt, dt.weekday_name)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2017-02-06T03:44:39.729484", "start_time": "2017-02-06T03:44:39.719469" }, "collapsed": false, "run_control": { "frozen": false, "read_only": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2013-03-09 00:00:00 Saturday\n", "2013-03-10 00:00:00 Sunday\n", "2013-05-18 00:00:00 Saturday\n", "2013-05-19 00:00:00 Sunday\n", "2013-07-21 00:00:00 Sunday\n", "2013-08-03 00:00:00 Saturday\n", "2013-08-04 00:00:00 Sunday\n", "2013-10-26 00:00:00 Saturday\n", "2014-08-31 00:00:00 Sunday\n", "2014-11-29 00:00:00 Saturday\n", "2014-11-30 00:00:00 Sunday\n", "2016-01-31 00:00:00 Sunday\n", "2016-04-16 00:00:00 Saturday\n", "2016-07-09 00:00:00 Saturday\n" ] } ], "source": [ "missing_days('Kings Cross Station', daily)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2017-02-06T03:44:39.738227", "start_time": "2017-02-06T03:44:39.731944" }, "collapsed": false, "run_control": { "frozen": false, "read_only": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2013-03-16 00:00:00 Saturday\n", "2013-03-17 00:00:00 Sunday\n", "2013-05-05 00:00:00 Sunday\n", "2013-11-16 00:00:00 Saturday\n", "2013-11-17 00:00:00 Sunday\n", "2014-04-06 00:00:00 Sunday\n" ] } ], "source": [ "missing_days('Parramatta Station', daily)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From this we see that Kings Cross and Parramatta are missing just a few Saturdays of data. This is likely to be station closures for works." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2017-02-06T03:44:39.758825", "start_time": "2017-02-06T03:44:39.740739" }, "collapsed": false, "run_control": { "frozen": false, "read_only": false }, "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2013-02-02 00:00:00 Saturday\n", "2013-02-03 00:00:00 Sunday\n", "2013-02-09 00:00:00 Saturday\n", "2013-02-10 00:00:00 Sunday\n", "2013-02-16 00:00:00 Saturday\n", "2013-02-17 00:00:00 Sunday\n", "2013-02-23 00:00:00 Saturday\n", "2013-02-24 00:00:00 Sunday\n", "2013-03-02 00:00:00 Saturday\n", "2013-03-09 00:00:00 Saturday\n", "2013-03-10 00:00:00 Sunday\n", "2013-03-16 00:00:00 Saturday\n", "2013-03-17 00:00:00 Sunday\n", "2013-03-23 00:00:00 Saturday\n", "2013-03-24 00:00:00 Sunday\n", "2013-03-29 00:00:00 Friday\n", "2013-03-30 00:00:00 Saturday\n", "2013-03-31 00:00:00 Sunday\n", "2013-04-06 00:00:00 Saturday\n", "2013-04-07 00:00:00 Sunday\n", "2013-04-13 00:00:00 Saturday\n", "2013-04-14 00:00:00 Sunday\n", "2013-04-20 00:00:00 Saturday\n", "2013-04-21 00:00:00 Sunday\n", "2013-04-27 00:00:00 Saturday\n", "2013-04-28 00:00:00 Sunday\n", "2013-05-04 00:00:00 Saturday\n", "2013-05-05 00:00:00 Sunday\n", "2013-05-11 00:00:00 Saturday\n", "2013-05-12 00:00:00 Sunday\n", "2013-05-18 00:00:00 Saturday\n", "2013-06-22 00:00:00 Saturday\n", "2013-06-23 00:00:00 Sunday\n", "2013-06-29 00:00:00 Saturday\n", "2013-06-30 00:00:00 Sunday\n", "2013-07-13 00:00:00 Saturday\n", "2013-07-14 00:00:00 Sunday\n", "2013-07-27 00:00:00 Saturday\n", "2013-07-28 00:00:00 Sunday\n", "2013-08-03 00:00:00 Saturday\n", "2013-08-04 00:00:00 Sunday\n", "2013-10-12 00:00:00 Saturday\n", "2013-10-13 00:00:00 Sunday\n", "2013-11-02 00:00:00 Saturday\n", "2013-11-03 00:00:00 Sunday\n", "2014-02-09 00:00:00 Sunday\n", "2014-02-16 00:00:00 Sunday\n", "2014-04-26 00:00:00 Saturday\n", "2014-07-26 00:00:00 Saturday\n", "2015-04-11 00:00:00 Saturday\n", "2015-08-29 00:00:00 Saturday\n", "2015-11-14 00:00:00 Saturday\n", "2015-11-21 00:00:00 Saturday\n", "2015-11-22 00:00:00 Sunday\n", "2015-12-19 00:00:00 Saturday\n", "2016-01-23 00:00:00 Saturday\n", "2016-05-21 00:00:00 Saturday\n", "2016-07-30 00:00:00 Saturday\n" ] } ], "source": [ "missing_days('Newtown Station', daily)" ] }, { "cell_type": "markdown", "metadata": { "run_control": { "frozen": false, "read_only": false } }, "source": [ "This has a long run of Saturdays (Feb - June 2013) are missing. Major station work was undertaking in 2012 and the station reopend in Oct 29, 2012. It is possible the station was closed on the weekend though to June 2013 to complete the second stage of work. " ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2017-02-02T23:17:32.347477", "start_time": "2017-02-02T23:17:32.344830" } }, "source": [ "## Missing Data By Hour" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2017-02-02T23:01:39.268037", "start_time": "2017-02-02T23:01:39.262261" } }, "source": [ "For shows some stations (Central, Kings Cross, Paramatter) the missing data does not change over time, implying that the entire day is missing. For others, the count of missing data increases later at night indicating that the station closes early or the gates are left open and people do not swipe off. " ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2017-02-06T03:44:39.837821", "start_time": "2017-02-06T03:44:39.761256" }, "code_folding": [], "collapsed": false, "run_control": { "frozen": false, "read_only": false } }, "outputs": [ { "data": { "text/html": [ "
Hour | \n", "5PM | \n", "6PM | \n", "7PM | \n", "8PM | \n", "9PM | \n", "10PM | \n", "11PM | \n", "12AM | \n", "1AM | \n", "2AM | \n", "
---|---|---|---|---|---|---|---|---|---|---|
Station | \n", "\n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " |
Bondi Junction Station | \n", "25 | \n", "23 | \n", "26 | \n", "29 | \n", "26 | \n", "61 | \n", "69 | \n", "61 | \n", "187 | \n", "498 | \n", "
Central Station | \n", "2 | \n", "2 | \n", "2 | \n", "2 | \n", "2 | \n", "5 | \n", "6 | \n", "6 | \n", "6 | \n", "33 | \n", "
Circular Quay Station | \n", "27 | \n", "27 | \n", "27 | \n", "30 | \n", "30 | \n", "33 | \n", "32 | \n", "20 | \n", "214 | \n", "528 | \n", "
Kings Cross Station | \n", "27 | \n", "26 | \n", "28 | \n", "29 | \n", "29 | \n", "30 | \n", "29 | \n", "23 | \n", "134 | \n", "499 | \n", "
Martin Place Station | \n", "21 | \n", "22 | \n", "24 | \n", "23 | \n", "25 | \n", "62 | \n", "92 | \n", "142 | \n", "272 | \n", "525 | \n", "
Museum Station | \n", "32 | \n", "36 | \n", "36 | \n", "35 | \n", "36 | \n", "119 | \n", "118 | \n", "127 | \n", "290 | \n", "536 | \n", "
Newtown Station | \n", "93 | \n", "92 | \n", "96 | \n", "112 | \n", "111 | \n", "178 | \n", "197 | \n", "184 | \n", "319 | \n", "470 | \n", "
North Sydney Station | \n", "29 | \n", "28 | \n", "26 | \n", "30 | \n", "29 | \n", "29 | \n", "30 | \n", "21 | \n", "132 | \n", "491 | \n", "
Parramatta Station | \n", "15 | \n", "12 | \n", "15 | \n", "13 | \n", "18 | \n", "19 | \n", "20 | \n", "15 | \n", "76 | \n", "403 | \n", "
St James Station | \n", "31 | \n", "32 | \n", "32 | \n", "41 | \n", "144 | \n", "157 | \n", "170 | \n", "192 | \n", "369 | \n", "537 | \n", "
Town Hall Station | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "36 | \n", "435 | \n", "