{ "cells": [ { "cell_type": "markdown", "metadata": { "run_control": { "frozen": false, "read_only": false } }, "source": [ "# Exploritory Data Analysis" ] }, { "cell_type": "markdown", "metadata": { "run_control": { "frozen": false, "read_only": false } }, "source": [ "## Abstract\n", "\n", "We examine the NSW train Exit and Entry traffic for the Kings Cross, Newtown, Parramatta, Circular Quay, Museum, Central and Town Hall stations for evidence of changes due to the introduction of the NSW Lockout Law on 24-Feb-2014. \n", "\n", "Using Baysian Change Point detection we found:\n", "1. No evidence of changes to Kings Cross or Parramatta Exit traffic from the introduction of the lockout law.\n", "2. Evidence of strong growth in the Parramatta Friday night Exit traffic that is unrelated of the lockout laws and which has increased traffic by 200% since Jan-2013\n", "2. Evidence of changes in the Newtown Friday night Exit traffic as a result of the lockout laws and which has increased traffic by 300% since the law came into effect." ] }, { "cell_type": "markdown", "metadata": { "run_control": { "frozen": false, "read_only": false } }, "source": [ "## The Data \n", "We were provided with train turnstile \"validation\" data by Transport for NSW. This data is a summary of Exit and Entry traffic by Station, Hour and Date. \n", "\n", "The data covers the period 2013-02-01 to 2016-07-31, for Friday, Saturday and Sunday nights from 5PM to 2AM. Since this spans the transition from Magnetic tickets to OPAL cards the data is further divided by Source.\n", "\n", "In the notebook [1_Data_Transformation](1_Data_Transformation.ipynb) we transform the data from its raw form to one more suitable for our analysis. Specifically we need observations between midnight and the last train (2pm) to have the Date as the previous day. This means all observation in the period 5pm-2am will have the same Date and be considered part of the same Night. " ] }, { "cell_type": "code", "execution_count": 86, "metadata": { "ExecuteTime": { "end_time": "2017-02-09T05:18:26.796721", "start_time": "2017-02-09T05:18:26.760416" }, "collapsed": false, "hide_input": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "from datetime import timedelta, datetime\n", "\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "plt.rcParams['figure.figsize'] = (12, 2)\n", "\n", "import seaborn as sns\n", "sns.set(font_scale=0.6)\n", "\n", "import outlier\n", "import utilities as util" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2017-02-09T04:03:54.513837", "start_time": "2017-02-09T04:03:54.445172" }, "collapsed": true }, "outputs": [], "source": [ "df_raw = pd.read_pickle(\"TrainValidationData/df_raw.pkl\") \n", "stations= ['Kings Cross Station', 'Newtown Station', 'Parramatta Station',\n", " 'Circular Quay Station', 'Museum Station',\n", " 'Central Station', 'Town Hall Station']" ] }, { "cell_type": "markdown", "metadata": { "run_control": { "frozen": false, "read_only": false } }, "source": [ "In the notebook [2_Cleaning](2_Cleaning.ipynb) we examine the data in more detail. Unsurprisingly we found some data is missing, however we conclude the impact is minimal and we can work with this data.\n", "\n", "Counting the number of whole days of missing data we find:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2017-02-09T04:03:55.021121", "start_time": "2017-02-09T04:03:54.517554" }, "collapsed": false, "hide_input": false, "run_control": { "frozen": false, "read_only": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of days 549\n", " Central Station: 2\n", " Circular Quay Station: 13\n", " Kings Cross Station: 14\n", " Museum Station: 16\n", " Newtown Station: 58\n", " Parramatta Station: 6\n", " Town Hall Station: 0\n" ] } ], "source": [ "daily = (df_raw.query('Station in @stations').\n", " reset_index().\n", " pivot_table(index = 'Date', columns = ['Station'], values='Exit', aggfunc=sum).dropna(axis='columns', how='all'))\n", "print(\"Number of days\", len(daily))\n", "for station in daily.columns:\n", " print(\"{:>22}: {:}\".format(station, daily[station].isnull().sum()))" ] }, { "cell_type": "markdown", "metadata": { "run_control": { "frozen": false, "read_only": false } }, "source": [ "We observe \n", "- Town Hall has no missing data and Central just two days.\n", "- Newtown is missing 10% (40 days), which needs further investigation.\n", "\n", "We examine the missing data more closely in [2_Cleaning](2_Cleaning.ipynb) and discover the missing data for:\n", "1. Kings Cross and Parramatta is a few whole days on Saturday which is probably due to the station being closed for station or trackwork.\n", "3. Newtown is largely the first 5 months of Saturday data from Feb to June of 2013. This is probably a result of the station upgrade works started in 2013.\n", "\n", "Counting the missing data by Hour of the day we get:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2017-02-09T04:03:59.376009", "start_time": "2017-02-09T04:03:58.553786" }, "collapsed": false, "hide_input": false, "run_control": { "frozen": false, "read_only": false } }, "outputs": [ { "data": { "text/html": [ "
Hour | \n", "5PM | \n", "6PM | \n", "7PM | \n", "8PM | \n", "9PM | \n", "10PM | \n", "11PM | \n", "12AM | \n", "1AM | \n", "2AM | \n", "
---|---|---|---|---|---|---|---|---|---|---|
Station | \n", "\n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " |
Central Station | \n", "2 | \n", "2 | \n", "2 | \n", "2 | \n", "2 | \n", "5 | \n", "6 | \n", "6 | \n", "6 | \n", "33 | \n", "
Circular Quay Station | \n", "27 | \n", "27 | \n", "27 | \n", "30 | \n", "30 | \n", "33 | \n", "32 | \n", "20 | \n", "214 | \n", "528 | \n", "
Kings Cross Station | \n", "27 | \n", "26 | \n", "28 | \n", "29 | \n", "29 | \n", "30 | \n", "29 | \n", "23 | \n", "134 | \n", "499 | \n", "
Museum Station | \n", "32 | \n", "36 | \n", "36 | \n", "35 | \n", "36 | \n", "119 | \n", "118 | \n", "127 | \n", "290 | \n", "536 | \n", "
Newtown Station | \n", "93 | \n", "92 | \n", "96 | \n", "112 | \n", "111 | \n", "178 | \n", "197 | \n", "184 | \n", "319 | \n", "470 | \n", "
Parramatta Station | \n", "15 | \n", "12 | \n", "15 | \n", "13 | \n", "18 | \n", "19 | \n", "20 | \n", "15 | \n", "76 | \n", "403 | \n", "
Town Hall Station | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "36 | \n", "435 | \n", "
\n", " | Source | \n", "Bands | \n", "Station | \n", "Entry | \n", "Exit | \n", "Datetime | \n", "Hour | \n", "Night | \n", "
---|---|---|---|---|---|---|---|---|
Date | \n", "\n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " |
2014-05-23 | \n", "Magnetic tickets | \n", "17 to 18 | \n", "Circular Quay Station | \n", "554 | \n", "197 | \n", "2014-05-23 17:00:00 | \n", "5PM | \n", "Friday | \n", "
2014-05-23 | \n", "Magnetic tickets | \n", "17 to 18 | \n", "Circular Quay Station | \n", "32 | \n", "33 | \n", "2014-05-23 17:00:00 | \n", "5PM | \n", "Friday | \n", "
2014-05-23 | \n", "Magnetic tickets | \n", "17 to 18 | \n", "Circular Quay Station | \n", "537 | \n", "116 | \n", "2014-05-23 17:00:00 | \n", "5PM | \n", "Friday | \n", "
2014-05-23 | \n", "Magnetic tickets | \n", "17 to 18 | \n", "Circular Quay Station | \n", "8 | \n", "13 | \n", "2014-05-23 17:00:00 | \n", "5PM | \n", "Friday | \n", "
2014-05-23 | \n", "Magnetic tickets | \n", "17 to 18 | \n", "Circular Quay Station | \n", "36 | \n", "46 | \n", "2014-05-23 17:00:00 | \n", "5PM | \n", "Friday | \n", "
2014-05-23 | \n", "Magnetic tickets | \n", "17 to 18 | \n", "Circular Quay Station | \n", "15 | \n", "27 | \n", "2014-05-23 17:00:00 | \n", "5PM | \n", "Friday | \n", "
2014-05-23 | \n", "Magnetic tickets | \n", "17 to 18 | \n", "Circular Quay Station | \n", "42 | \n", "74 | \n", "2014-05-23 17:00:00 | \n", "5PM | \n", "Friday | \n", "
2014-05-23 | \n", "Magnetic tickets | \n", "17 to 18 | \n", "Circular Quay Station | \n", "80 | \n", "160 | \n", "2014-05-23 17:00:00 | \n", "5PM | \n", "Friday | \n", "
2014-05-23 | \n", "Magnetic tickets | \n", "17 to 18 | \n", "Circular Quay Station | \n", "94 | \n", "149 | \n", "2014-05-23 17:00:00 | \n", "5PM | \n", "Friday | \n", "
2014-05-23 | \n", "Magnetic tickets | \n", "17 to 18 | \n", "Circular Quay Station | \n", "7 | \n", "9 | \n", "2014-05-23 17:00:00 | \n", "5PM | \n", "Friday | \n", "
2014-05-23 | \n", "Magnetic tickets | \n", "17 to 18 | \n", "Circular Quay Station | \n", "14 | \n", "13 | \n", "2014-05-23 17:00:00 | \n", "5PM | \n", "Friday | \n", "
2014-05-23 | \n", "Magnetic tickets | \n", "17 to 18 | \n", "Circular Quay Station | \n", "10 | \n", "11 | \n", "2014-05-23 17:00:00 | \n", "5PM | \n", "Friday | \n", "
2014-05-23 | \n", "Magnetic tickets | \n", "17 to 18 | \n", "Circular Quay Station | \n", "108 | \n", "24 | \n", "2014-05-23 17:00:00 | \n", "5PM | \n", "Friday | \n", "
2014-05-23 | \n", "Magnetic tickets | \n", "17 to 18 | \n", "Circular Quay Station | \n", "389 | \n", "438 | \n", "2014-05-23 17:00:00 | \n", "5PM | \n", "Friday | \n", "
2014-05-23 | \n", "Magnetic tickets | \n", "17 to 18 | \n", "Circular Quay Station | \n", "7 | \n", "13 | \n", "2014-05-23 17:00:00 | \n", "5PM | \n", "Friday | \n", "
2014-05-23 | \n", "Magnetic tickets | \n", "17 to 18 | \n", "Circular Quay Station | \n", "0 | \n", "1 | \n", "2014-05-23 17:00:00 | \n", "5PM | \n", "Friday | \n", "
2014-05-23 | \n", "Magnetic tickets | \n", "17 to 18 | \n", "Circular Quay Station | \n", "0 | \n", "1 | \n", "2014-05-23 17:00:00 | \n", "5PM | \n", "Friday | \n", "
2014-05-23 | \n", "Magnetic tickets | \n", "18 to 19 | \n", "Circular Quay Station | \n", "298 | \n", "210 | \n", "2014-05-23 18:00:00 | \n", "6PM | \n", "Friday | \n", "
2014-05-23 | \n", "Magnetic tickets | \n", "18 to 19 | \n", "Circular Quay Station | \n", "43 | \n", "18 | \n", "2014-05-23 18:00:00 | \n", "6PM | \n", "Friday | \n", "
2014-05-23 | \n", "Magnetic tickets | \n", "18 to 19 | \n", "Circular Quay Station | \n", "244 | \n", "134 | \n", "2014-05-23 18:00:00 | \n", "6PM | \n", "Friday | \n", "
2014-05-23 | \n", "Magnetic tickets | \n", "18 to 19 | \n", "Circular Quay Station | \n", "7 | \n", "14 | \n", "2014-05-23 18:00:00 | \n", "6PM | \n", "Friday | \n", "
2014-05-23 | \n", "Magnetic tickets | \n", "18 to 19 | \n", "Circular Quay Station | \n", "14 | \n", "48 | \n", "2014-05-23 18:00:00 | \n", "6PM | \n", "Friday | \n", "
2014-05-23 | \n", "Magnetic tickets | \n", "18 to 19 | \n", "Circular Quay Station | \n", "38 | \n", "36 | \n", "2014-05-23 18:00:00 | \n", "6PM | \n", "Friday | \n", "
2014-05-23 | \n", "Magnetic tickets | \n", "18 to 19 | \n", "Circular Quay Station | \n", "43 | \n", "84 | \n", "2014-05-23 18:00:00 | \n", "6PM | \n", "Friday | \n", "
2014-05-23 | \n", "Magnetic tickets | \n", "18 to 19 | \n", "Circular Quay Station | \n", "79 | \n", "125 | \n", "2014-05-23 18:00:00 | \n", "6PM | \n", "Friday | \n", "
2014-05-23 | \n", "Magnetic tickets | \n", "18 to 19 | \n", "Circular Quay Station | \n", "66 | \n", "123 | \n", "2014-05-23 18:00:00 | \n", "6PM | \n", "Friday | \n", "
2014-05-23 | \n", "Magnetic tickets | \n", "18 to 19 | \n", "Circular Quay Station | \n", "6 | \n", "8 | \n", "2014-05-23 18:00:00 | \n", "6PM | \n", "Friday | \n", "
2014-05-23 | \n", "Magnetic tickets | \n", "18 to 19 | \n", "Circular Quay Station | \n", "7 | \n", "9 | \n", "2014-05-23 18:00:00 | \n", "6PM | \n", "Friday | \n", "
2014-05-23 | \n", "Magnetic tickets | \n", "18 to 19 | \n", "Circular Quay Station | \n", "6 | \n", "2 | \n", "2014-05-23 18:00:00 | \n", "6PM | \n", "Friday | \n", "
2014-05-23 | \n", "Magnetic tickets | \n", "18 to 19 | \n", "Circular Quay Station | \n", "48 | \n", "6 | \n", "2014-05-23 18:00:00 | \n", "6PM | \n", "Friday | \n", "
... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "
2013-06-07 | \n", "Magnetic tickets | \n", "21 to 22 | \n", "Circular Quay Station | \n", "17 | \n", "16 | \n", "2013-06-07 21:00:00 | \n", "9PM | \n", "Friday | \n", "
2013-06-07 | \n", "Magnetic tickets | \n", "21 to 22 | \n", "Circular Quay Station | \n", "54 | \n", "22 | \n", "2013-06-07 21:00:00 | \n", "9PM | \n", "Friday | \n", "
2013-06-07 | \n", "Magnetic tickets | \n", "21 to 22 | \n", "Circular Quay Station | \n", "19 | \n", "11 | \n", "2013-06-07 21:00:00 | \n", "9PM | \n", "Friday | \n", "
2013-06-07 | \n", "Magnetic tickets | \n", "21 to 22 | \n", "Circular Quay Station | \n", "9 | \n", "2 | \n", "2013-06-07 21:00:00 | \n", "9PM | \n", "Friday | \n", "
2013-06-07 | \n", "Magnetic tickets | \n", "21 to 22 | \n", "Circular Quay Station | \n", "250 | \n", "120 | \n", "2013-06-07 21:00:00 | \n", "9PM | \n", "Friday | \n", "
2013-06-07 | \n", "Magnetic tickets | \n", "21 to 22 | \n", "Circular Quay Station | \n", "132 | \n", "104 | \n", "2013-06-07 21:00:00 | \n", "9PM | \n", "Friday | \n", "
2013-06-07 | \n", "Magnetic tickets | \n", "22 to 23 | \n", "Circular Quay Station | \n", "51 | \n", "2 | \n", "2013-06-07 22:00:00 | \n", "10PM | \n", "Friday | \n", "
2013-06-07 | \n", "Magnetic tickets | \n", "22 to 23 | \n", "Circular Quay Station | \n", "18 | \n", "2 | \n", "2013-06-07 22:00:00 | \n", "10PM | \n", "Friday | \n", "
2013-06-07 | \n", "Magnetic tickets | \n", "22 to 23 | \n", "Circular Quay Station | \n", "11 | \n", "8 | \n", "2013-06-07 22:00:00 | \n", "10PM | \n", "Friday | \n", "
2013-06-07 | \n", "Magnetic tickets | \n", "22 to 23 | \n", "Circular Quay Station | \n", "51 | \n", "6 | \n", "2013-06-07 22:00:00 | \n", "10PM | \n", "Friday | \n", "
2013-06-07 | \n", "Magnetic tickets | \n", "22 to 23 | \n", "Circular Quay Station | \n", "16 | \n", "6 | \n", "2013-06-07 22:00:00 | \n", "10PM | \n", "Friday | \n", "
2013-06-07 | \n", "Magnetic tickets | \n", "22 to 23 | \n", "Circular Quay Station | \n", "6 | \n", "1 | \n", "2013-06-07 22:00:00 | \n", "10PM | \n", "Friday | \n", "
2013-06-07 | \n", "Magnetic tickets | \n", "22 to 23 | \n", "Circular Quay Station | \n", "216 | \n", "27 | \n", "2013-06-07 22:00:00 | \n", "10PM | \n", "Friday | \n", "
2013-06-07 | \n", "Magnetic tickets | \n", "22 to 23 | \n", "Circular Quay Station | \n", "107 | \n", "61 | \n", "2013-06-07 22:00:00 | \n", "10PM | \n", "Friday | \n", "
2013-06-07 | \n", "Magnetic tickets | \n", "23 to 24 | \n", "Circular Quay Station | \n", "35 | \n", "0 | \n", "2013-06-07 23:00:00 | \n", "11PM | \n", "Friday | \n", "
2013-06-07 | \n", "Magnetic tickets | \n", "23 to 24 | \n", "Circular Quay Station | \n", "8 | \n", "1 | \n", "2013-06-07 23:00:00 | \n", "11PM | \n", "Friday | \n", "
2013-06-07 | \n", "Magnetic tickets | \n", "23 to 24 | \n", "Circular Quay Station | \n", "16 | \n", "5 | \n", "2013-06-07 23:00:00 | \n", "11PM | \n", "Friday | \n", "
2013-06-07 | \n", "Magnetic tickets | \n", "23 to 24 | \n", "Circular Quay Station | \n", "35 | \n", "0 | \n", "2013-06-07 23:00:00 | \n", "11PM | \n", "Friday | \n", "
2013-06-07 | \n", "Magnetic tickets | \n", "23 to 24 | \n", "Circular Quay Station | \n", "11 | \n", "2 | \n", "2013-06-07 23:00:00 | \n", "11PM | \n", "Friday | \n", "
2013-06-07 | \n", "Magnetic tickets | \n", "23 to 24 | \n", "Circular Quay Station | \n", "4 | \n", "2 | \n", "2013-06-07 23:00:00 | \n", "11PM | \n", "Friday | \n", "
2013-06-07 | \n", "Magnetic tickets | \n", "23 to 24 | \n", "Circular Quay Station | \n", "152 | \n", "11 | \n", "2013-06-07 23:00:00 | \n", "11PM | \n", "Friday | \n", "
2013-06-07 | \n", "Magnetic tickets | \n", "23 to 24 | \n", "Circular Quay Station | \n", "105 | \n", "31 | \n", "2013-06-07 23:00:00 | \n", "11PM | \n", "Friday | \n", "
2013-06-07 | \n", "Magnetic tickets | \n", "24 to 25 | \n", "Circular Quay Station | \n", "12 | \n", "0 | \n", "2013-06-08 00:00:00 | \n", "12AM | \n", "Friday | \n", "
2013-06-07 | \n", "Magnetic tickets | \n", "24 to 25 | \n", "Circular Quay Station | \n", "2 | \n", "0 | \n", "2013-06-08 00:00:00 | \n", "12AM | \n", "Friday | \n", "
2013-06-07 | \n", "Magnetic tickets | \n", "24 to 25 | \n", "Circular Quay Station | \n", "7 | \n", "1 | \n", "2013-06-08 00:00:00 | \n", "12AM | \n", "Friday | \n", "
2013-06-07 | \n", "Magnetic tickets | \n", "24 to 25 | \n", "Circular Quay Station | \n", "8 | \n", "1 | \n", "2013-06-08 00:00:00 | \n", "12AM | \n", "Friday | \n", "
2013-06-07 | \n", "Magnetic tickets | \n", "24 to 25 | \n", "Circular Quay Station | \n", "3 | \n", "0 | \n", "2013-06-08 00:00:00 | \n", "12AM | \n", "Friday | \n", "
2013-06-07 | \n", "Magnetic tickets | \n", "24 to 25 | \n", "Circular Quay Station | \n", "31 | \n", "0 | \n", "2013-06-08 00:00:00 | \n", "12AM | \n", "Friday | \n", "
2013-06-07 | \n", "Magnetic tickets | \n", "24 to 25 | \n", "Circular Quay Station | \n", "44 | \n", "2 | \n", "2013-06-08 00:00:00 | \n", "12AM | \n", "Friday | \n", "
2013-06-07 | \n", "Magnetic tickets | \n", "25 to 26 | \n", "Circular Quay Station | \n", "1 | \n", "0 | \n", "2013-06-08 01:00:00 | \n", "1AM | \n", "Friday | \n", "
2308 rows × 8 columns
\n", "