{ "metadata": { "name": "", "signature": "sha256:0e0da9e8929491e6d2d926da0152dbc26d5d1a97c37f43e2e96c84cb8bb3e13e" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Bike Sharing Demand - Kaggle Competition" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is an IPython notebook detailing the work done in building a regressor for 'Bike Sharing Demand', one of Kaggle's 'Knowledge' competitions." ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "We start by loading the usual Anaconda scientific libraries and defining helper functions for cleaner plots." ] }, { "cell_type": "code", "collapsed": false, "input": [ "%matplotlib inline\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "import brewer2mpl\n", "from matplotlib import rcParams\n", "\n", "# colorbrewer2 Dark2 qualitative color table\n", "dark2_cmap = brewer2mpl.get_map('Dark2', 'Qualitative', 7)\n", "dark2_colors = dark2_cmap.mpl_colors\n", "\n", "rcParams['figure.figsize'] = (10, 6)\n", "rcParams['figure.dpi'] = 150\n", "# note: newer matplotlib releases use 'axes.prop_cycle' in place of 'axes.color_cycle'\n", "rcParams['axes.color_cycle'] = dark2_colors\n", "rcParams['lines.linewidth'] = 2\n", "rcParams['axes.facecolor'] = 'white'\n", "rcParams['font.size'] = 20\n", "rcParams['patch.edgecolor'] = 'white'\n", "rcParams['patch.facecolor'] = dark2_colors[0]\n", "rcParams['font.family'] = 'StixGeneral'\n", "\n", "\n", "def remove_border(axes=None, top=False, right=False, left=True, bottom=True):\n", " \"\"\"\n", " Minimize chartjunk by stripping out unnecessary plot borders and axis ticks\n", " \n", " The top/right/left/bottom keywords toggle whether the corresponding plot border is drawn\n", " \"\"\"\n", " ax = axes or plt.gca()\n", " ax.spines['top'].set_visible(top)\n", " ax.spines['right'].set_visible(right)\n", " ax.spines['left'].set_visible(left)\n", " ax.spines['bottom'].set_visible(bottom)\n", " \n", " # turn off all ticks\n", " 
ax.yaxis.set_ticks_position('none')\n", " ax.xaxis.set_ticks_position('none')\n", " \n", " # now re-enable the visible ones\n", " if top:\n", " ax.xaxis.tick_top()\n", " if bottom:\n", " ax.xaxis.tick_bottom()\n", " if left:\n", " ax.yaxis.tick_left()\n", " if right:\n", " ax.yaxis.tick_right()" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 20 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "1. Load the training data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Data Fields (as provided by Kaggle):\n", "\n", "1. datetime - hourly date + timestamp \n", "2. season (indices changed from the defaults received):\n", " - 1 = winter\n", " - 2 = spring\n", " - 3 = summer\n", " - 4 = fall \n", "3. holiday - whether the day is considered a holiday\n", "4. workingday - whether the day is neither a weekend nor holiday\n", "5. weather - encoded to make explicit various extreme weather events\n", "6. temp - temperature in Celsius\n", "7. atemp - \"feels like\" temperature in Celsius\n", "8. humidity - relative humidity\n", "9. windspeed - wind speed\n", "10. casual - number of non-registered user rentals initiated\n", "11. registered - number of registered user rentals initiated\n", "12. count - number of total rentals" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import csv\n", "results = []\n", "path = \"train.csv\"\n", "with open(path, \"r\") as d:\n", " # read the header row first; next(d) works in both Python 2 and 3\n", " header = next(d).strip(\"\\n\").split(\",\")\n", " for line in d:\n", " results.append(line.strip(\"\\n\").split(\",\"))\n", "# every value is still a string at this point\n", "data = pd.DataFrame(data=np.asarray(results), columns=header)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "2. Data Processing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are asked to predict the number of rentals during each hour of the day, so it makes sense to extract the 'hour of the day' from the datetime column. 
We also derive 'day of the week' and 'month of the year' from it, as these may be useful features for the model and may provide insights. For example, demand may differ between weekdays and weekends, or rise at the start of the year when people take up cycling as a New Year's resolution." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from datetime import datetime, date, time\n", "# parse each timestamp once, then pull out the individual components\n", "parsed = data['datetime'].map(lambda x: datetime.strptime(x, \"%Y-%m-%d %H:%M:%S\"))\n", "data['hour'] = parsed.map(lambda t: t.hour)\n", "# weekday() counts from Monday = 0 to Sunday = 6\n", "data['weekday'] = parsed.map(lambda t: t.weekday())\n", "data['month'] = parsed.map(lambda t: t.month)\n", "data.head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
| | datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | count | hour | weekday | month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 00:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0 | 3 | 13 | 16 | 0 | 5 | 1 |
| 1 | 2011-01-01 01:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0 | 8 | 32 | 40 | 1 | 5 | 1 |
| 2 | 2011-01-01 02:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0 | 5 | 27 | 32 | 2 | 5 | 1 |
| 3 | 2011-01-01 03:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0 | 3 | 10 | 13 | 3 | 5 | 1 |
| 4 | 2011-01-01 04:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0 | 0 | 1 | 1 | 4 | 5 | 1 |
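
The hour, weekday and month columns shown above can also be derived with pandas' vectorized datetime accessors. The following is a minimal sketch rather than the notebook's own code; the two-row `sample` frame is made up for illustration and stands in for `data`.

```python
import pandas as pd

# Hypothetical stand-in for the `data` frame loaded from train.csv.
sample = pd.DataFrame({"datetime": ["2011-01-01 00:00:00", "2011-01-03 17:00:00"]})

# Parse the strings once, then read each component through the .dt accessor.
parsed = pd.to_datetime(sample["datetime"], format="%Y-%m-%d %H:%M:%S")
sample["hour"] = parsed.dt.hour
sample["weekday"] = parsed.dt.weekday  # Monday is 0, so a Saturday is 5
sample["month"] = parsed.dt.month

print(sample)
```

Because 2011-01-01 fell on a Saturday, the weekday value of 5 in the first rows of the output is consistent with this numbering.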

| | temp | atemp | humidity | windspeed | season_1 | season_2 | season_3 | season_4 | holiday_0 | holiday_1 | ... | month_6 | month_7 | month_8 | month_9 | month_10 | month_11 | month_12 | casual | registered | count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9.84 | 14.395 | 81 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 13 | 16 |
| 1 | 9.02 | 13.635 | 80 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8 | 32 | 40 |
| 2 | 9.02 | 13.635 | 80 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 27 | 32 |
| 3 | 9.84 | 14.395 | 75 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 10 | 13 |
| 4 | 9.84 | 14.395 | 75 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |

5 rows × 62 columns
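
The column names in the table above (`season_1` to `season_4`, `holiday_0`, `holiday_1`, `month_6` and so on) look like one-hot encodings of the categorical fields. One way to produce columns of that shape is `pandas.get_dummies`. That this is the approach used is an assumption, and the tiny `sample` frame below is invented for illustration.

```python
import pandas as pd

# Hypothetical miniature of the training frame; the values are illustrative only.
sample = pd.DataFrame({
    "temp": [9.84, 9.02, 22.14],
    "season": [1, 1, 3],
    "holiday": [0, 0, 1],
})

# Expand each categorical column into one 0/1 indicator column per observed level.
encoded = pd.get_dummies(sample, columns=["season", "holiday"], dtype=int)

print(encoded.columns.tolist())
```

A level that never occurs in a frame gets no indicator column, so separately encoded training and test frames may need their columns aligned afterwards.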

| | temp | atemp | humidity | windspeed | season_1 | season_2 | season_3 | season_4 | holiday_0 | holiday_1 | ... | month_4 | month_5 | month_6 | month_7 | month_8 | month_9 | month_10 | month_11 | month_12 | log_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9.84 | 14.395 | 81 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.772589 |
| 1 | 9.02 | 13.635 | 80 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3.688879 |
| 2 | 9.02 | 13.635 | 80 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3.465736 |
| 3 | 9.84 | 14.395 | 75 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.564949 |
| 4 | 9.84 | 14.395 | 75 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000000 |

5 rows × 60 columns
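
The `log_count` values in the table above match the natural logarithm of the `count` values for the same five rows (for example, a count of 16 gives 2.772589). The check below is a small sketch that assumes the transform is `numpy.log`; the array simply repeats the five counts from the training table.

```python
import numpy as np

# First five hourly counts from the training table.
counts = np.array([16, 40, 32, 13, 1])

# Natural log of the counts; np.exp inverts it when mapping predictions back.
log_count = np.log(counts)

print(np.round(log_count, 6))
```

Plain `log` is undefined at zero, so `numpy.log1p` is a common alternative when a target can contain zero counts.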

| | temp | atemp | humidity | windspeed | season_1 | season_2 | season_3 | season_4 | holiday_0 | holiday_1 | ... | month_3 | month_4 | month_5 | month_6 | month_7 | month_8 | month_9 | month_10 | month_11 | month_12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10.66 | 11.365 | 56 | 26.0027 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 10.66 | 13.635 | 56 | 0.0000 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 10.66 | 13.635 | 56 | 0.0000 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 10.66 | 12.880 | 56 | 11.0014 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 10.66 | 12.880 | 56 | 11.0014 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

5 rows × 59 columns