{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Structured & Time Series Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook walks through an implementation of a deep learning model for structured time series data using Keras. We’ll use the dataset from Kaggle’s [Rossmann Store Sales competition](https://www.kaggle.com/c/rossmann-store-sales). The steps outlined below are inspired by (and partially based on) lesson 3 of Jeremy Howard’s [fast.ai course](http://course.fast.ai), where he builds a model for the Rossmann dataset using PyTorch and the fast.ai library.\n", "\n", "The focus here is on implementing a deep learning model for structured data. I’ve skipped a bunch of pre-processing steps that are specific to this particular dataset but don’t reflect general principles about applying deep learning to tabular datasets. If you’re interested, you’ll find complete step-by-step instructions on creating the “joined” dataset in [this notebook](https://github.com/fastai/fastai/blob/master/courses/dl1/lesson3-rossman.ipynb). With that, let’s get started!\n", "\n", "First we need to get a few imports out of the way. All of these should come standard with an Anaconda install. I’m also specifying the path where I’ve pre-saved the “joined” dataset that we’ll use as a starting point (created by running the first few sections of the above-referenced notebook).\n", "\n", "(As an aside, I’m using [Paperspace](https://www.paperspace.com) to run this notebook. If you’re not familiar with it, Paperspace is a cloud service that lets you rent GPU instances for much less than AWS. 
It’s a great way to get started if you don’t have your own hardware.)\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import datetime\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "from sklearn.decomposition import PCA\n", "from sklearn.preprocessing import LabelEncoder, StandardScaler\n", "\n", "PATH = '/home/paperspace/data/rossmann/'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Read the data file into a pandas dataframe and take a peek at the data to see what we’re working with." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(844338, 93)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.read_feather(f'{PATH}joined')\n", "data.shape" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
 | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
index | 0 | 1 | 2 | 3 | 4 |
Store | 1 | 2 | 3 | 4 | 5 |
DayOfWeek | 5 | 5 | 5 | 5 | 5 |
Date | 2015-07-31 00:00:00 | 2015-07-31 00:00:00 | 2015-07-31 00:00:00 | 2015-07-31 00:00:00 | 2015-07-31 00:00:00 |
Sales | 5263 | 6064 | 8314 | 13995 | 4822 |
Customers | 555 | 625 | 821 | 1498 | 559 |
Open | 1 | 1 | 1 | 1 | 1 |
Promo | 1 | 1 | 1 | 1 | 1 |
StateHoliday | False | False | False | False | False |
SchoolHoliday | 1 | 1 | 1 | 1 | 1 |
Year | 2015 | 2015 | 2015 | 2015 | 2015 |
Month | 7 | 7 | 7 | 7 | 7 |
Week | 31 | 31 | 31 | 31 | 31 |
Day | 31 | 31 | 31 | 31 | 31 |
Dayofweek | 4 | 4 | 4 | 4 | 4 |
Dayofyear | 212 | 212 | 212 | 212 | 212 |
Is_month_end | True | True | True | True | True |
Is_month_start | False | False | False | False | False |
Is_quarter_end | False | False | False | False | False |
Is_quarter_start | False | False | False | False | False |
Is_year_end | False | False | False | False | False |
Is_year_start | False | False | False | False | False |
Elapsed | 1438300800 | 1438300800 | 1438300800 | 1438300800 | 1438300800 |
StoreType | c | a | a | c | a |
Assortment | a | a | a | c | a |
CompetitionDistance | 1270 | 570 | 14130 | 620 | 29910 |
CompetitionOpenSinceMonth | 9 | 11 | 12 | 9 | 4 |
CompetitionOpenSinceYear | 2008 | 2007 | 2006 | 2009 | 2015 |
Promo2 | 0 | 1 | 1 | 0 | 0 |
Promo2SinceWeek | 1 | 13 | 14 | 1 | 1 |
... | ... | ... | ... | ... | ... |
Min_Sea_Level_PressurehPa | 1015 | 1017 | 1017 | 1014 | 1016 |
Max_VisibilityKm | 31 | 10 | 31 | 10 | 10 |
Mean_VisibilityKm | 15 | 10 | 14 | 10 | 10 |
Min_VisibilitykM | 10 | 10 | 10 | 10 | 10 |
Max_Wind_SpeedKm_h | 24 | 14 | 14 | 23 | 14 |
Mean_Wind_SpeedKm_h | 11 | 11 | 5 | 16 | 11 |
Max_Gust_SpeedKm_h | NaN | NaN | NaN | NaN | NaN |
Precipitationmm | 0 | 0 | 0 | 0 | 0 |
CloudCover | 1 | 4 | 2 | 6 | 4 |
Events | Fog | Fog | Fog | None | None |
WindDirDegrees | 13 | 309 | 354 | 282 | 290 |
StateName | Hessen | Thueringen | NordrheinWestfalen | Berlin | Sachsen |
CompetitionOpenSince | 2008-09-15 00:00:00 | 2007-11-15 00:00:00 | 2006-12-15 00:00:00 | 2009-09-15 00:00:00 | 2015-04-15 00:00:00 |
CompetitionDaysOpen | 2510 | 2815 | 3150 | 2145 | 107 |
CompetitionMonthsOpen | 24 | 24 | 24 | 24 | 3 |
Promo2Since | 1900-01-01 00:00:00 | 2010-03-29 00:00:00 | 2011-04-04 00:00:00 | 1900-01-01 00:00:00 | 1900-01-01 00:00:00 |
Promo2Days | 0 | 1950 | 1579 | 0 | 0 |
Promo2Weeks | 0 | 25 | 25 | 0 | 0 |
AfterSchoolHoliday | 0 | 0 | 0 | 0 | 0 |
BeforeSchoolHoliday | 0 | 0 | 0 | 0 | 0 |
AfterStateHoliday | 57 | 67 | 57 | 67 | 57 |
BeforeStateHoliday | 0 | 0 | 0 | 0 | 0 |
AfterPromo | 0 | 0 | 0 | 0 | 0 |
BeforePromo | 0 | 0 | 0 | 0 | 0 |
SchoolHoliday_bw | 5 | 5 | 5 | 5 | 5 |
StateHoliday_bw | 0 | 0 | 0 | 0 | 0 |
Promo_bw | 5 | 5 | 5 | 5 | 5 |
SchoolHoliday_fw | 1 | 1 | 1 | 1 | 1 |
StateHoliday_fw | 0 | 0 | 0 | 0 | 0 |
Promo_fw | 1 | 1 | 1 | 1 | 1 |
93 rows × 5 columns
Date | Store | DayOfWeek | Year | Month | Day | StateHoliday | CompetitionMonthsOpen | Promo2Weeks | StoreType | Assortment | ... | Min_Humidity | Max_Wind_SpeedKm_h | Mean_Wind_SpeedKm_h | CloudCover | trend | trend_DE | AfterStateHoliday | BeforeStateHoliday | Promo | SchoolHoliday |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2015-06-30 | 0 | 1 | 2 | 5 | 29 | 0 | 24 | 0 | 2 | 0 | ... | -1.964009 | 0.047353 | -0.310342 | -2.314223 | 1.008659 | 0.885609 | -0.381079 | 1.159128 | 1.116479 | -0.476624 |
2015-06-30 | 1 | 1 | 2 | 5 | 29 | 0 | 24 | 25 | 0 | 0 | ... | -1.147185 | -1.065656 | -0.646876 | -0.502029 | 1.008659 | 0.885609 | -0.063489 | 1.159128 | 1.116479 | -0.476624 |
2015-06-30 | 2 | 1 | 2 | 5 | 29 | 0 | 24 | 25 | 0 | 0 | ... | -1.453494 | -0.397851 | -1.151678 | -1.861175 | 1.544990 | 0.885609 | -0.381079 | 1.159128 | 1.116479 | 2.098092 |
2015-06-30 | 3 | 1 | 2 | 5 | 29 | 0 | 24 | 0 | 2 | 2 | ... | -1.453494 | -0.175249 | -0.310342 | -0.502029 | 0.025384 | 0.885609 | -0.063489 | 1.159128 | 1.116479 | -0.476624 |
2015-06-30 | 4 | 1 | 2 | 5 | 29 | 0 | 2 | 0 | 0 | 0 | ... | -1.096133 | -0.954355 | -0.310342 | -0.048980 | -0.421559 | 0.885609 | -0.381079 | 1.159128 | 1.116479 | -0.476624 |
5 rows × 38 columns
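
In the table above, the categorical columns (for example `StoreType` and `Assortment`) hold integer codes and the continuous columns hold roughly zero-centered values, with `Date` as the index. The following is a minimal sketch of how a frame like that might be produced with the `LabelEncoder` and `StandardScaler` classes imported earlier. The toy frame, its values, and the `cat_vars` / `cont_vars` lists are assumptions made up for illustration, not the notebook's actual preprocessing code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Toy stand-in for the joined data; the values are invented for illustration.
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2015-06-30', '2015-06-30', '2015-06-29', '2015-06-29']),
    'StoreType': ['c', 'a', 'a', 'c'],
    'Assortment': ['a', 'a', 'c', 'a'],
    'CloudCover': [1.0, 4.0, 2.0, 6.0],
    'Max_Wind_SpeedKm_h': [24.0, 14.0, 14.0, 23.0],
})

cat_vars = ['StoreType', 'Assortment']            # assumed categorical columns
cont_vars = ['CloudCover', 'Max_Wind_SpeedKm_h']  # assumed continuous columns

# Use the date as the index, as in the table above.
toy = toy.set_index('Date')

# Replace each category with an integer code (0..n_classes-1, in sorted order).
for col in cat_vars:
    toy[col] = LabelEncoder().fit_transform(toy[col])

# Rescale each continuous column to zero mean and unit variance.
toy[cont_vars] = StandardScaler().fit_transform(toy[cont_vars])

print(toy)
```

Fitting a separate `LabelEncoder` per column keeps the codes independent across columns. In a real pipeline you would likely keep the fitted encoders and scaler around so the same mapping can be applied to the test set.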