{ "cells": [ { "cell_type": "markdown", "metadata": { "run_control": { "frozen": false, "read_only": false } }, "source": [ "This data was provided by Transport for NSW. It is train turnstile \"validation\" data from 2013 to June 2016.\n", "\n", "It includes:\n", "* Date: The date of the reading. If the reading is early in the morning the date is the prior day. This is done so all trips between 5pm and 5am are considered the same date.\n", "* Hours: The Hoursperiod the reading was taken. This is in the format \"17 to 18\" and is supposed to range from \"17 to 18\" to \"26 to 27\" where\n", " * \"17 to 18\" means 5pm - 6pm of the current date\n", " * \"25 to 26\" means 1pm-2pm the next day\n", "* Line: The train line \n", "* Station: The tran station\n", "* Exit: The number of people exiting in that hour\n", "* Entry: The number of people entering in that hour\n", "* Night: We added name of the day for the date of the reading, e.g., Friday.\n", "\n", "We found that Hours also included \"0 to 1\" and \"1 to 2\", whic we assume means 1am of the current date. These are rows are corrected so the date is the prior day and the Hours is \"24 to 25\" or \"25 to 26\"" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2017-02-08T21:39:44.009268", "start_time": "2017-02-08T21:39:42.374864" }, "collapsed": false, "init_cell": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Use scipy logsumexp().\n" ] } ], "source": [ "import numpy as np\n", "import pandas as pd\n", "from datetime import timedelta, datetime\n", "\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "plt.rcParams['figure.figsize'] = (12, 2)\n", "\n", "import seaborn as sns\n", "sns.set(font_scale=0.6)\n", "\n", "import outlier\n", "import utilities as util" ] }, { "cell_type": "markdown", "metadata": { "run_control": { "frozen": false, "read_only": false } }, "source": [ "## Import and transform data\n", "\n", "Read in the 4 years of zipped CSV data as a single dataframe. Convert Source, Line, Station and Bands to categoricals to save space and speed processing." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2017-02-05T20:16:58.872052", "start_time": "2017-02-05T20:16:57.056656" }, "collapsed": false, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "names= ['Source', 'Line','Bands', 'Station', 'Entry', 'Exit', 'Date']\n", "dtype= {'Source':'category', 'Bands':'category', 'Station':'category', 'Entry':np.int32, 'Exit':np.int32}\n", "usecols = ['Source', 'Station', 'Date', 'Bands', 'Entry', 'Exit']\n", "\n", "df_raw = pd.concat([pd.read_csv(file, names=names, usecols=usecols, dtype=dtype, na_values=['', ' '], header=0)\n", " for file in glob('TrainValidationData/*.zip')])" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2017-02-05T20:17:04.839065", "start_time": "2017-02-05T20:16:58.874786" }, "collapsed": false, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "df_raw['Date'] = pd.to_datetime(df_raw.Date, format='%d/%m/%Y %H:%M')" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2017-01-27T21:32:27.890116", "start_time": "2017-01-27T21:32:27.883372" }, "run_control": { "frozen": false, "read_only": false } }, "source": [ "Make Hour a categorical where the code is an interger for the hour, i.e., '17 to 18' == 17 " ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2017-02-05T20:17:04.927751", "start_time": "2017-02-05T20:17:04.840937" }, "collapsed": false, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "BANDS = [\"{} to {}\".format(i,i+1) for i in range(0,27)]\n", "df_raw.Bands = df_raw.Bands.astype('category', ordered=True, categories=BANDS)" ] }, { "cell_type": "markdown", "metadata": { "run_control": { "frozen": false, "read_only": false } }, "source": [ "Calculate the actual Datetime from Date + Hours.code, then recompute Date so all observations between 5pm and 5am have the same date" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2017-02-05T20:17:10.137360", "start_time": "2017-02-05T20:17:04.929630" }, "collapsed": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "df_raw['Datetime'] = df_raw.Date + pd.to_timedelta(df_raw.Bands.cat.codes, unit='h')\n", "df_raw['Date'] = pd.to_datetime((df_raw['Datetime'] - timedelta(hours = 6)).dt.date)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2017-02-05T20:17:19.575881", "start_time": "2017-02-05T20:17:10.139415" }, "collapsed": false, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "df_raw['Hour'] = df_raw.Datetime.dt.strftime('%I%p').astype('category', ordered=True, categories=['05PM', '06PM', '07PM', '08PM', '09PM', '10PM', '11PM', '12AM','01AM', '02AM'])\n", "df_raw['Hour'] = df_raw.Hour.cat.rename_categories(['5PM', '6PM', '7PM', '8PM', '9PM', '10PM', '11PM', '12AM','1AM', '2AM'])\n", "df_raw['Night'] = (df_raw.Date.dt.weekday_name.\n", " astype('category', ordered=True, \n", " categories=['Thursday', 'Friday', 'Saturday', 'Sunday', 'Monday']))" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2017-02-05T03:26:25.793768", "start_time": "2017-02-05T03:26:25.296738" }, "collapsed": true, "run_control": { "frozen": false, "read_only": false } }, "source": [ "Sum multiple entries for Station, Night, Date, Hour to aggrigate the Source and create a unique observation Station, Night, Date, Hour" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "ExecuteTime": { "end_time": "2017-02-05T20:39:43.791197", "start_time": "2017-02-05T20:39:43.650316" }, "collapsed": false, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "df_hourly = df_raw.groupby(['Station', 'Night', 'Date', 'Hour'], as_index=False).agg({'Entry':'sum', 'Exit':'sum'})" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "ExecuteTime": { "end_time": "2017-02-05T20:39:44.039418", "start_time": "2017-02-05T20:39:44.031056" }, "code_folding": [], "collapsed": false, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "def check(new, old):\n", " \n", " print(\"Raw Lengths: \", len(new), len(old))\n", " print(\"Non null lengths\", len(new.dropna()), len(old.dropna()))\n", " \n", " assert new.Exit.sum() == old.Exit.sum(), \"Exit totals don't match\"\n", " assert new.Entry.sum() == old.Entry.sum(), \"Exit totals don't match\"\n", " \n", " new = new.groupby(['Station', 'Date', 'Hour'], as_index=False).agg({'Entry':'sum', 'Exit':'sum'})\n", " old = new.groupby(['Station', 'Date', 'Hour'], as_index=False).agg({'Entry':'sum', 'Exit':'sum'})\n", " assert len(new) == len(old), \"Number of observations are different\"\n", " " ] }, { "cell_type": "markdown", "metadata": { "run_control": { "frozen": false, "read_only": false } }, "source": [ "Make each row a unique observation of Exit/Entry for a give Station, Hour and Date" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "ExecuteTime": { "end_time": "2017-02-05T20:39:45.826570", "start_time": "2017-02-05T20:39:45.055249" }, "collapsed": false, "run_control": { "frozen": false, "read_only": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Raw Lengths: 68167 1526208\n", "Non null lengths 68167 1526208\n" ] } ], "source": [ "check(df_hourly, df_raw)" ] }, { "cell_type": "markdown", "metadata": { "run_control": { "frozen": false, "read_only": false } }, "source": [ "The data should have readings for all stations from 5pm - 2pm for all dates. Add missing observations as NaN" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "ExecuteTime": { "end_time": "2017-02-05T20:39:47.596424", "start_time": "2017-02-05T20:39:47.325771" }, "collapsed": false, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "idx_levels = df_raw.set_index(['Station','Hour', 'Date']).index.levels\n", "df_hourly = df_hourly.set_index(['Station','Hour', 'Date'])\n", "\n", "idx = pd.MultiIndex.from_product(idx_levels, names=['Station','Hour', 'Date'])\n", "df_hourly = df_hourly.reindex(index=idx).reset_index()\n", "\n", "assert df_raw.Date.min() == df_hourly.Date.min(), \"Wrong start dates\"\n", "assert df_raw.Date.max() == df_hourly.Date.max(), \"Wrong end dates\"" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "ExecuteTime": { "end_time": "2017-02-05T20:39:48.368245", "start_time": "2017-02-05T20:39:48.354193" }, "collapsed": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "df_hourly['Night'] = (df_hourly.Date.dt.weekday_name.\n", " astype('category', ordered=True, \n", " categories=['Thursday', 'Friday', 'Saturday', 'Sunday', 'Monday']))" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "ExecuteTime": { "end_time": "2017-02-05T20:39:49.991021", "start_time": "2017-02-05T20:39:49.200744" }, "collapsed": false, "run_control": { "frozen": false, "read_only": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Raw Lengths: 100320 1526208\n", "Non null lengths 68167 1526208\n" ] } ], "source": [ "check(df_hourly, df_raw)\n", "\n", "assert (df_hourly.Date.dt.weekday_name.unique() == df_hourly.Night.cat.categories).all(), \"missing Night categories\"\n", "assert (df_hourly.Hour.unique() == df_hourly.Hour.cat.categories).all(), \"missing Hour categories\"\n", "assert (df_hourly.Station.unique() == df_hourly.Station.cat.categories).all(), \"missing Station categories\"" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "ExecuteTime": { "end_time": "2017-02-05T20:45:34.630251", "start_time": "2017-02-05T20:45:34.444764" }, "collapsed": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "df_hourly= df_hourly.query('Night in [\"Friday\", \"Saturday\", \"Sunday\"]')\n", "df_hourly.Night = df_hourly.Night.cat.remove_unused_categories()\n", "\n", "df_raw = df_raw.query('Night in [\"Friday\", \"Saturday\", \"Sunday\"]')\n", "df_raw.Night = df_raw.Night.cat.remove_unused_categories()" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "ExecuteTime": { "end_time": "2017-02-05T20:47:40.724230", "start_time": "2017-02-05T20:47:40.710544" }, "collapsed": false, "run_control": { "frozen": false, "read_only": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2013-02-01 2016-07-31\n" ] } ], "source": [ "print(df_raw.Date.min().date(), df_raw.Date.max().date()) " ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "ExecuteTime": { "end_time": "2017-02-05T20:45:37.888205", "start_time": "2017-02-05T20:45:37.282601" }, "collapsed": true, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "df_raw.set_index('Date').to_pickle(\"TrainValidationData/df_raw.pkl\")\n", "df_hourly.set_index('Date').to_pickle(\"TrainValidationData/df_hourly.pkl\")" ] } ], "metadata": { "hide_input": false, "kernelspec": { "display_name": "Python [default]", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" }, "toc": { "colors": { "hover_highlight": "#DAA520", "running_highlight": "#FF0000", "selected_highlight": "#FFD700" }, "moveMenuLeft": true, "navigate_menu": false, "number_sections": true, "sideBar": true, "threshold": 4, "toc_cell": false, "toc_section_display": "block", "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }