{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Time-related feature engineering\n\nThis notebook introduces different strategies to leverage time-related features\nfor a bike sharing demand regression task that is highly dependent on business\ncycles (days, weeks, months) and yearly season cycles.\n\nIn the process, we introduce how to perform periodic feature engineering using\nthe :class:`sklearn.preprocessing.SplineTransformer` class and its\n`extrapolation=\"periodic\"` option.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data exploration on the Bike Sharing Demand dataset\n\nWe start by loading the data from the OpenML repository.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.datasets import fetch_openml\n\nbike_sharing = fetch_openml(\"Bike_Sharing_Demand\", version=2, as_frame=True)\ndf = bike_sharing.frame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To get a quick understanding of the periodic patterns of the data, let us\nhave a look at the average demand per hour during a week.\n\nNote that the week starts on a Sunday, during the weekend. 
We can clearly\ndistinguish the commute patterns in the mornings and evenings of the workdays\nand the leisure use of the bikes on the weekends, with a broader peak in\ndemand around the middle of the day:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n\nfig, ax = plt.subplots(figsize=(12, 4))\naverage_week_demand = df.groupby([\"weekday\", \"hour\"])[\"count\"].mean()\naverage_week_demand.plot(ax=ax)\n_ = ax.set(\n title=\"Average hourly bike demand during the week\",\n xticks=[i * 24 for i in range(7)],\n xticklabels=[\"Sun\", \"Mon\", \"Tue\", \"Wed\", \"Thu\", \"Fri\", \"Sat\"],\n xlabel=\"Time of the week\",\n ylabel=\"Number of bike rentals\",\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The target of the prediction problem is the absolute count of bike rentals on\nan hourly basis:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df[\"count\"].max()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us rescale the target variable (number of hourly bike rentals) to predict\na relative demand so that the mean absolute error is more easily interpreted\nas a fraction of the maximum demand.\n\n
The fit methods of the models used in this notebook all minimize the\n mean squared error, which estimates the conditional mean.\n Minimizing the mean absolute error would instead estimate the\n conditional median.\n\n Nevertheless, when reporting performance measures on the test set in\n the discussion, we choose to focus on the mean absolute error rather\n than the (root) mean squared error because it is more intuitive to\n interpret. Note, however, that in this study the best models for one\n metric are also the best ones in terms of the other metric.
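The rescaling described above can be sketched as follows; this is a minimal, hedged example using a small synthetic frame in place of the OpenML data (the toy values are assumptions):

```python
import pandas as pd

# Minimal sketch of the rescaling step: divide the hourly rental counts by
# the maximum observed demand so the target lies in (0, 1]. A synthetic
# frame stands in for the Bike Sharing Demand data here.
df = pd.DataFrame({"count": [3, 120, 500, 977]})
df["count"] = df["count"] / df["count"].max()
# After rescaling, a mean absolute error of e.g. 0.05 reads directly as
# 5% of the maximum observed demand.
print(df["count"].max())  # 1.0
```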
If the time information were only present as a date or datetime column, we\n could have expanded it into hour-in-the-day, day-in-the-week,\n day-in-the-month, and month-in-the-year features using pandas:\n https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#time-date-components
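As a hedged sketch of that expansion (the column name `timestamp` and the toy values are assumptions; the real dataset already ships these components as separate columns), the pandas `.dt` accessors would do the job:

```python
import pandas as pd

# Expand a datetime column into periodic components with pandas .dt accessors.
frame = pd.DataFrame(
    {"timestamp": pd.to_datetime(["2011-01-01 00:00", "2011-06-15 17:00"])}
)
frame["hour"] = frame["timestamp"].dt.hour          # hour-in-the-day
frame["weekday"] = frame["timestamp"].dt.dayofweek  # day-in-the-week (Mon=0)
frame["day"] = frame["timestamp"].dt.day            # day-in-the-month
frame["month"] = frame["timestamp"].dt.month        # month-in-the-year
print(frame[["hour", "weekday", "day", "month"]])
```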