{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Rossman data preparation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To illustrate the techniques we need to apply before feeding all the data to a Deep Learning model, we are going to take the example of the [Rossmann sales Kaggle competition](https://www.kaggle.com/c/rossmann-store-sales). Given a wide range of information about a store, we are going to try predict their sale number on a given day. This is very useful to be able to manage stock properly and be able to properly satisfy the demand without wasting anything. The official training set was giving a lot of informations about various stores in Germany, but it was also allowed to use additional data, as long as it was made public and available to all participants.\n", "\n", "We are going to reproduce most of the steps of one of the winning teams that they highlighted in [Entity Embeddings of Categorical Variables](https://arxiv.org/pdf/1604.06737.pdf). In addition to the official data, teams in the top of the leaderboard also used information about the weather, the states of the stores or the Google trends of those days. We have assembled all that additional data in one file available for download [here](http://files.fast.ai/part2/lesson14/rossmann.tgz) if you want to replicate those steps." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### A first look at the data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First things first, let's import everything we will need." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from fastai2.tabular.all import *" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you have download the previous file and decompressed it in a folder named rossmann in the fastai data folder, you should see the following list of files with this instruction:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(#9) [/home/sgugger/.fastai/data/rossmann/weather.csv,/home/sgugger/.fastai/data/rossmann/sample_submission.csv,/home/sgugger/.fastai/data/rossmann/googletrend.csv,/home/sgugger/.fastai/data/rossmann/test.csv,/home/sgugger/.fastai/data/rossmann/store.csv,/home/sgugger/.fastai/data/rossmann/store_states.csv,/home/sgugger/.fastai/data/rossmann/state_names.csv,/home/sgugger/.fastai/data/rossmann/train.csv,/home/sgugger/.fastai/data/rossmann/rossmann.tgz]" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path = Config().data/'rossmann'\n", "path.ls()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data that comes from Kaggle is in 'train.csv', 'test.csv', 'store.csv' and 'sample_submission.csv'. The other files are the additional data we were talking about. Let's start by loading everything using pandas." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "table_names = ['train', 'store', 'store_states', 'state_names', 'googletrend', 'weather', 'test']\n", "tables = [pd.read_csv(path/f'{fname}.csv', low_memory=False) for fname in table_names]\n", "train, store, store_states, state_names, googletrend, weather, test = tables" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To get an idea of the amount of data available, let's just look at the length of the training and test tables." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1017209, 41088)" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(train), len(test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So we have more than one million records available. Let's have a look at what's inside:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
StoreDayOfWeekDateSalesCustomersOpenPromoStateHolidaySchoolHoliday
0152015-07-3152635551101
1252015-07-3160646251101
2352015-07-3183148211101
3452015-07-311399514981101
4552015-07-3148225591101
\n", "
" ], "text/plain": [ " Store DayOfWeek Date Sales Customers Open Promo StateHoliday \\\n", "0 1 5 2015-07-31 5263 555 1 1 0 \n", "1 2 5 2015-07-31 6064 625 1 1 0 \n", "2 3 5 2015-07-31 8314 821 1 1 0 \n", "3 4 5 2015-07-31 13995 1498 1 1 0 \n", "4 5 5 2015-07-31 4822 559 1 1 0 \n", "\n", " SchoolHoliday \n", "0 1 \n", "1 1 \n", "2 1 \n", "3 1 \n", "4 1 " ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `Store` column contains the id of the stores, then we are given the id of the day of the week, the exact date, if the store was open on that day, if there were any promotion in that store during that day, and if it was a state or school holiday. The `Customers` column is given as an indication, and the `Sales` column is what we will try to predict.\n", "\n", "If we look at the test table, we have the same columns, minus `Sales` and `Customers`, and it looks like we will have to predict on dates that are after the ones of the train table." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IdStoreDayOfWeekDateOpenPromoStateHolidaySchoolHoliday
01142015-09-171.0100
12342015-09-171.0100
23742015-09-171.0100
34842015-09-171.0100
45942015-09-171.0100
\n", "
" ], "text/plain": [ " Id Store DayOfWeek Date Open Promo StateHoliday SchoolHoliday\n", "0 1 1 4 2015-09-17 1.0 1 0 0\n", "1 2 3 4 2015-09-17 1.0 1 0 0\n", "2 3 7 4 2015-09-17 1.0 1 0 0\n", "3 4 8 4 2015-09-17 1.0 1 0 0\n", "4 5 9 4 2015-09-17 1.0 1 0 0" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The other table given by Kaggle contains some information specific to the stores: their type, what the competition looks like, if they are engaged in a permanent promotion program, and if so since then." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
01234
Store12345
StoreTypecaaca
Assortmentaaaca
CompetitionDistance12705701413062029910
CompetitionOpenSinceMonth9111294
CompetitionOpenSinceYear20082007200620092015
Promo201100
Promo2SinceWeekNaN1314NaNNaN
Promo2SinceYearNaN20102011NaNNaN
PromoIntervalNaNJan,Apr,Jul,OctJan,Apr,Jul,OctNaNNaN
\n", "
" ], "text/plain": [ " 0 1 2 3 4\n", "Store 1 2 3 4 5\n", "StoreType c a a c a\n", "Assortment a a a c a\n", "CompetitionDistance 1270 570 14130 620 29910\n", "CompetitionOpenSinceMonth 9 11 12 9 4\n", "CompetitionOpenSinceYear 2008 2007 2006 2009 2015\n", "Promo2 0 1 1 0 0\n", "Promo2SinceWeek NaN 13 14 NaN NaN\n", "Promo2SinceYear NaN 2010 2011 NaN NaN\n", "PromoInterval NaN Jan,Apr,Jul,Oct Jan,Apr,Jul,Oct NaN NaN" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "store.head().T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's have a quick look at our four additional dataframes. `store_states` just gives us the abbreviated name of the sate of each store." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
StoreState
01HE
12TH
23NW
34BE
45SN
\n", "
" ], "text/plain": [ " Store State\n", "0 1 HE\n", "1 2 TH\n", "2 3 NW\n", "3 4 BE\n", "4 5 SN" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "store_states.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can match them to their real names with `state_names`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
StateNameState
0BadenWuerttembergBW
1BayernBY
2BerlinBE
3BrandenburgBB
4BremenHB
\n", "
" ], "text/plain": [ " StateName State\n", "0 BadenWuerttemberg BW\n", "1 Bayern BY\n", "2 Berlin BE\n", "3 Brandenburg BB\n", "4 Bremen HB" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "state_names.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Which is going to be necessary if we want to use the `weather` table:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
01234
fileNordrheinWestfalenNordrheinWestfalenNordrheinWestfalenNordrheinWestfalenNordrheinWestfalen
Date2013-01-012013-01-022013-01-032013-01-042013-01-05
Max_TemperatureC871198
Mean_TemperatureC44898
Min_TemperatureC21687
Dew_PointC751098
MeanDew_PointC53897
Min_DewpointC12486
Max_Humidity9493100100100
Mean_Humidity8785939494
Min_Humidity6578778787
Max_Sea_Level_PressurehPa10131028103510361035
Mean_Sea_Level_PressurehPa10051022103010351034
Min_Sea_Level_PressurehPa10011014102610341033
Max_VisibilityKm3131311110
Mean_VisibilityKm1214856
Min_VisibilitykM410223
Max_Wind_SpeedKm_h3924262316
Mean_Wind_SpeedKm_h2616211410
Max_Gust_SpeedKm_h58NaNNaNNaNNaN
Precipitationmm5.0801.020.250
CloudCover66777
EventsRainRainRainRainRain
WindDirDegrees215225240263268
\n", "
" ], "text/plain": [ " 0 1 \\\n", "file NordrheinWestfalen NordrheinWestfalen \n", "Date 2013-01-01 2013-01-02 \n", "Max_TemperatureC 8 7 \n", "Mean_TemperatureC 4 4 \n", "Min_TemperatureC 2 1 \n", "Dew_PointC 7 5 \n", "MeanDew_PointC 5 3 \n", "Min_DewpointC 1 2 \n", "Max_Humidity 94 93 \n", "Mean_Humidity 87 85 \n", "Min_Humidity 65 78 \n", "Max_Sea_Level_PressurehPa 1013 1028 \n", "Mean_Sea_Level_PressurehPa 1005 1022 \n", "Min_Sea_Level_PressurehPa 1001 1014 \n", "Max_VisibilityKm 31 31 \n", "Mean_VisibilityKm 12 14 \n", "Min_VisibilitykM 4 10 \n", "Max_Wind_SpeedKm_h 39 24 \n", "Mean_Wind_SpeedKm_h 26 16 \n", "Max_Gust_SpeedKm_h 58 NaN \n", "Precipitationmm 5.08 0 \n", "CloudCover 6 6 \n", "Events Rain Rain \n", "WindDirDegrees 215 225 \n", "\n", " 2 3 \\\n", "file NordrheinWestfalen NordrheinWestfalen \n", "Date 2013-01-03 2013-01-04 \n", "Max_TemperatureC 11 9 \n", "Mean_TemperatureC 8 9 \n", "Min_TemperatureC 6 8 \n", "Dew_PointC 10 9 \n", "MeanDew_PointC 8 9 \n", "Min_DewpointC 4 8 \n", "Max_Humidity 100 100 \n", "Mean_Humidity 93 94 \n", "Min_Humidity 77 87 \n", "Max_Sea_Level_PressurehPa 1035 1036 \n", "Mean_Sea_Level_PressurehPa 1030 1035 \n", "Min_Sea_Level_PressurehPa 1026 1034 \n", "Max_VisibilityKm 31 11 \n", "Mean_VisibilityKm 8 5 \n", "Min_VisibilitykM 2 2 \n", "Max_Wind_SpeedKm_h 26 23 \n", "Mean_Wind_SpeedKm_h 21 14 \n", "Max_Gust_SpeedKm_h NaN NaN \n", "Precipitationmm 1.02 0.25 \n", "CloudCover 7 7 \n", "Events Rain Rain \n", "WindDirDegrees 240 263 \n", "\n", " 4 \n", "file NordrheinWestfalen \n", "Date 2013-01-05 \n", "Max_TemperatureC 8 \n", "Mean_TemperatureC 8 \n", "Min_TemperatureC 7 \n", "Dew_PointC 8 \n", "MeanDew_PointC 7 \n", "Min_DewpointC 6 \n", "Max_Humidity 100 \n", "Mean_Humidity 94 \n", "Min_Humidity 87 \n", "Max_Sea_Level_PressurehPa 1035 \n", "Mean_Sea_Level_PressurehPa 1034 \n", "Min_Sea_Level_PressurehPa 1033 \n", "Max_VisibilityKm 10 \n", "Mean_VisibilityKm 6 \n", "Min_VisibilitykM 3 \n", "Max_Wind_SpeedKm_h 16 \n", "Mean_Wind_SpeedKm_h 10 \n", "Max_Gust_SpeedKm_h NaN \n", "Precipitationmm 0 \n", "CloudCover 7 \n", "Events Rain \n", "WindDirDegrees 268 " ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "weather.head().T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lastly the googletrend table gives us the trend of the brand in each state and in the whole of Germany." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
fileweektrend
0Rossmann_DE_SN2012-12-02 - 2012-12-0896
1Rossmann_DE_SN2012-12-09 - 2012-12-1595
2Rossmann_DE_SN2012-12-16 - 2012-12-2291
3Rossmann_DE_SN2012-12-23 - 2012-12-2948
4Rossmann_DE_SN2012-12-30 - 2013-01-0567
\n", "
" ], "text/plain": [ " file week trend\n", "0 Rossmann_DE_SN 2012-12-02 - 2012-12-08 96\n", "1 Rossmann_DE_SN 2012-12-09 - 2012-12-15 95\n", "2 Rossmann_DE_SN 2012-12-16 - 2012-12-22 91\n", "3 Rossmann_DE_SN 2012-12-23 - 2012-12-29 48\n", "4 Rossmann_DE_SN 2012-12-30 - 2013-01-05 67" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "googletrend.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before we apply the fastai preprocessing, we will need to join the store table and the additional ones with our training and test table. Then, as we saw in our first example in chapter 1, we will need to split our variables between categorical and continuous. Before we do that, though, there is one type of variable that is a bit different from the others: dates.\n", "\n", "We could turn each particular day in a category but there are cyclical information in dates we would miss if we did that. We already have the day of the week in our tables, but maybe the day of the month also bears some significance. People might be more inclined to go shopping at the beggining or the end of the month. The number of the week/month is also important to detect seasonal influences.\n", "\n", "Then we will try to exctract meaningful information from those dates. For instance promotions on their own are important inputs, but maybe the number of running weeks with promotion is another useful information as it will influence customers. A state holiday in itself is important, but it's more significant to know if we are the day before or after such a holiday as it will impact sales. All of those might seem very specific to this dataset, but you can actually apply them in any tabular data containing time information.\n", "\n", "This first step is called feature-engineering and is extremely important: your model will try to extract useful information from your data but any extra help you can give it in advance is going to make training easier, and the final result better. In Kaggle Competitions using tabular data, it's often the way people prepared their data that makes the difference in the final leaderboard, not the exact model used." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Feature Engineering" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Merging tables" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To merge tables together, we will use this little helper function that relies on the pandas library. It will merge the tables `left` and `right` by looking at the column(s) which names are in `left_on` and `right_on`: the information in `right` will be added to the rows of the tables in `left` when the data in `left_on` inside `left` is the same as the data in `right_on` inside `right`. If `left_on` and `right_on` are the same, we don't have to pass `right_on`. We keep the fields in `right` that have the same names as fields in `left` and add a `_y` suffix (by default) to those field names." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def join_df(left, right, left_on, right_on=None, suffix='_y'):\n", " if right_on is None: right_on = left_on\n", " return left.merge(right, how='left', left_on=left_on, right_on=right_on, \n", " suffixes=(\"\", suffix))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, let's replace the state names in the weather table by the abbreviations, since that's what is used in the other tables." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
fileDateStateStateName
0NordrheinWestfalen2013-01-01NWNordrheinWestfalen
1NordrheinWestfalen2013-01-02NWNordrheinWestfalen
2NordrheinWestfalen2013-01-03NWNordrheinWestfalen
3NordrheinWestfalen2013-01-04NWNordrheinWestfalen
4NordrheinWestfalen2013-01-05NWNordrheinWestfalen
\n", "
" ], "text/plain": [ " file Date State StateName\n", "0 NordrheinWestfalen 2013-01-01 NW NordrheinWestfalen\n", "1 NordrheinWestfalen 2013-01-02 NW NordrheinWestfalen\n", "2 NordrheinWestfalen 2013-01-03 NW NordrheinWestfalen\n", "3 NordrheinWestfalen 2013-01-04 NW NordrheinWestfalen\n", "4 NordrheinWestfalen 2013-01-05 NW NordrheinWestfalen" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "weather = join_df(weather, state_names, \"file\", \"StateName\")\n", "weather[['file', 'Date', 'State', 'StateName']].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To double-check the merge happened without incident, we can check that every row has a `State` with this line:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(weather[weather.State.isnull()])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now safely remove the columns with the state names (`file` and `StateName`) since they we'll use the short codes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "weather.drop(columns=['file', 'StateName'], inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To add the weather informations to our `store` table, we first use the table `store_states` to match a store code with the corresponding state, then we merge with our weather table." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "store = join_df(store, store_states, 'Store')\n", "store = join_df(store, weather, 'State')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And again, we can check if the merge went well by looking if new NaNs where introduced." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(store[store.Mean_TemperatureC.isnull()])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we want to join the `googletrend` table to this `store` table. If you remember from our previous look at it, it's not exactly in the same format:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
fileweektrend
0Rossmann_DE_SN2012-12-02 - 2012-12-0896
1Rossmann_DE_SN2012-12-09 - 2012-12-1595
2Rossmann_DE_SN2012-12-16 - 2012-12-2291
3Rossmann_DE_SN2012-12-23 - 2012-12-2948
4Rossmann_DE_SN2012-12-30 - 2013-01-0567
\n", "
" ], "text/plain": [ " file week trend\n", "0 Rossmann_DE_SN 2012-12-02 - 2012-12-08 96\n", "1 Rossmann_DE_SN 2012-12-09 - 2012-12-15 95\n", "2 Rossmann_DE_SN 2012-12-16 - 2012-12-22 91\n", "3 Rossmann_DE_SN 2012-12-23 - 2012-12-29 48\n", "4 Rossmann_DE_SN 2012-12-30 - 2013-01-05 67" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "googletrend.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will need to change the column with the states and the columns with the dates:\n", "- in the column `fil`, the state names contain `Rossmann_DE_XX` with `XX` being the code of the state, so we want to remove `Rossmann_DE`. We will do this by creating a new column containing the last part of a split of the string by '\\_'.\n", "- in the column `week`, we will extract the date corresponding to the beginning of the week in a new column by taking the last part of a split on ' - '.\n", "\n", "In pandas, creating a new column is very easy: you just have to define them." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
fileweektrendDateState
0Rossmann_DE_SN2012-12-02 - 2012-12-08962012-12-02SN
1Rossmann_DE_SN2012-12-09 - 2012-12-15952012-12-09SN
2Rossmann_DE_SN2012-12-16 - 2012-12-22912012-12-16SN
3Rossmann_DE_SN2012-12-23 - 2012-12-29482012-12-23SN
4Rossmann_DE_SN2012-12-30 - 2013-01-05672012-12-30SN
\n", "
" ], "text/plain": [ " file week trend Date State\n", "0 Rossmann_DE_SN 2012-12-02 - 2012-12-08 96 2012-12-02 SN\n", "1 Rossmann_DE_SN 2012-12-09 - 2012-12-15 95 2012-12-09 SN\n", "2 Rossmann_DE_SN 2012-12-16 - 2012-12-22 91 2012-12-16 SN\n", "3 Rossmann_DE_SN 2012-12-23 - 2012-12-29 48 2012-12-23 SN\n", "4 Rossmann_DE_SN 2012-12-30 - 2013-01-05 67 2012-12-30 SN" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "googletrend['Date'] = googletrend.week.str.split(' - ', expand=True)[0]\n", "googletrend['State'] = googletrend.file.str.split('_', expand=True)[2]\n", "googletrend.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's check everything went well by looking at the values in the new `State` column of our `googletrend` table." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(array(['HE', 'TH', 'NW', 'BE', 'SN', 'SH', 'HB,NI', 'BY', 'BW', 'RP',\n", " 'ST', 'HH'], dtype=object),\n", " array(['SN', None, 'BY', 'TH', 'NW', 'BE', 'RP', 'BW', 'NI', 'SH', 'HE',\n", " 'ST', 'HH', 'SL'], dtype=object))" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "store['State'].unique(),googletrend['State'].unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have two additional values in the second (`None` and 'SL') but this isn't a problem since they'll be ignored when we join. One problem however is that 'HB,NI' in the first table is named 'NI' in the second one, so we need to change that." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "googletrend.loc[googletrend.State=='NI', \"State\"] = 'HB,NI'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Why do we have a `None` in state? As we said before, there is a global trend for Germany that corresponds to `Rosmann_DE` in the field `file`. For those, the previous split failed which gave the `None` value. We will keep this global trend and put it in a new column." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "trend_de = googletrend[googletrend.file == 'Rossmann_DE'][['Date', 'trend']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we can merge it with the rest of our trends, by adding the suffix '\\_DE' to know it's the general trend." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "googletrend = join_df(googletrend, trend_de, 'Date', suffix='_DE')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then at this stage, we can remove the columns `file` and `week`since they won't be useful anymore, as well as the rows where `State` is `None` (since they correspond to the global trend that we saved in another column)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "googletrend.drop(columns=['file', 'week'], axis=1, inplace=True)\n", "googletrend = googletrend[~googletrend['State'].isnull()]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The last thing missing to be able to join this with or store table is to extract the week from the date in this table and in the store table: we need to join them on week values since each trend is given for the full week that starts on the indicated date. This is linked to the next topic in feature engineering: extracting dateparts." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Adding dateparts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If your table contains dates, you will need to split the information there in several column for your Deep Learning model to be able to train properly. There is the basic stuff, such as the day number, week number, month number or year number, but anything that can be relevant to your problem is also useful. Is it the beginning or the end of the month? Is it a holiday?\n", "\n", "To help with this, the fastai library as a convenience function called `add_datepart`. It will take a dataframe and a column you indicate, try to read it as a date, then add all those new columns. If we go back to our `googletrend` table, we now have gour columns." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
trendDateStatetrend_DE
0962012-12-02SN77
1952012-12-09SN76
2912012-12-16SN85
3482012-12-23SN59
4672012-12-30SN61
\n", "
" ], "text/plain": [ " trend Date State trend_DE\n", "0 96 2012-12-02 SN 77\n", "1 95 2012-12-09 SN 76\n", "2 91 2012-12-16 SN 85\n", "3 48 2012-12-23 SN 59\n", "4 67 2012-12-30 SN 61" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "googletrend.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we add the dateparts, we will gain a lot more" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "googletrend = add_datepart(googletrend, 'Date', drop=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
01234
trend9695914867
Date2012-12-02 00:00:002012-12-09 00:00:002012-12-16 00:00:002012-12-23 00:00:002012-12-30 00:00:00
StateSNSNSNSNSN
trend_DE7776855961
Year20122012201220122012
Month1212121212
Week4849505152
Day29162330
Dayofweek66666
Dayofyear337344351358365
Is_month_endFalseFalseFalseFalseFalse
Is_month_startFalseFalseFalseFalseFalse
Is_quarter_endFalseFalseFalseFalseFalse
Is_quarter_startFalseFalseFalseFalseFalse
Is_year_endFalseFalseFalseFalseFalse
Is_year_startFalseFalseFalseFalseFalse
Elapsed13544064001355011200135561600013562208001356825600
\n", "
" ], "text/plain": [ " 0 1 \\\n", "trend 96 95 \n", "Date 2012-12-02 00:00:00 2012-12-09 00:00:00 \n", "State SN SN \n", "trend_DE 77 76 \n", "Year 2012 2012 \n", "Month 12 12 \n", "Week 48 49 \n", "Day 2 9 \n", "Dayofweek 6 6 \n", "Dayofyear 337 344 \n", "Is_month_end False False \n", "Is_month_start False False \n", "Is_quarter_end False False \n", "Is_quarter_start False False \n", "Is_year_end False False \n", "Is_year_start False False \n", "Elapsed 1354406400 1355011200 \n", "\n", " 2 3 \\\n", "trend 91 48 \n", "Date 2012-12-16 00:00:00 2012-12-23 00:00:00 \n", "State SN SN \n", "trend_DE 85 59 \n", "Year 2012 2012 \n", "Month 12 12 \n", "Week 50 51 \n", "Day 16 23 \n", "Dayofweek 6 6 \n", "Dayofyear 351 358 \n", "Is_month_end False False \n", "Is_month_start False False \n", "Is_quarter_end False False \n", "Is_quarter_start False False \n", "Is_year_end False False \n", "Is_year_start False False \n", "Elapsed 1355616000 1356220800 \n", "\n", " 4 \n", "trend 67 \n", "Date 2012-12-30 00:00:00 \n", "State SN \n", "trend_DE 61 \n", "Year 2012 \n", "Month 12 \n", "Week 52 \n", "Day 30 \n", "Dayofweek 6 \n", "Dayofyear 365 \n", "Is_month_end False \n", "Is_month_start False \n", "Is_quarter_end False \n", "Is_quarter_start False \n", "Is_year_end False \n", "Is_year_start False \n", "Elapsed 1356825600 " ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "googletrend.head().T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We chose the option `drop=False` as we want to keep the `Date` column for now. Another option is to add the `time` part of the date, but it's not relevant to our problem here. \n", "\n", "Now we can join our Google trends with the information in the `store` table, it's just a join on \\['Week', 'Year'\\] once we apply `add_datepart` to that table. Note that we only keep the initial columns of `googletrend` with `Week` and `Year` to avoid all the duplicates." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "googletrend = googletrend[['trend', 'State', 'trend_DE', 'Week', 'Year']]\n", "store = add_datepart(store, 'Date', drop=False)\n", "store = join_df(store, googletrend, ['Week', 'Year', 'State'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At this stage, `store` contains all the information about the stores, the weather on that day and the Google trends applicable. We only have to join it with our training and test table. We have to use `make_date` before being able to execute that merge, to convert the `Date` column of `train` and `test` to proper date format." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "make_date(train, 'Date')\n", "make_date(test, 'Date')\n", "train_fe = join_df(train, store, ['Store', 'Date'])\n", "test_fe = join_df(test, store, ['Store', 'Date'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Elapsed times" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another feature that can be useful is the elapsed time before/after a certain event occurs. For instance the number of days since the last promotion or before the next school holiday. 
As with the date parts, there is a fastai convenience function that will add them automatically.\n", "\n", "One thing to take into account here is that you will need to use that function on the whole time series you have, even the test data: a school holiday that takes place during the training period will impact those new features in the test data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "all_ftrs = train_fe.append(test_fe, sort=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will consider the elapsed times for three events: 'Promo', 'StateHoliday' and 'SchoolHoliday'. Note that those must correspond to booleans in your dataframe. 'Promo' and 'SchoolHoliday' already are (only 0s and 1s) but 'StateHoliday' has multiple values." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['0', 'a', 'b', 'c'], dtype=object)" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_ftrs['StateHoliday'].unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we refer to the explanation on Kaggle, 'b' is for Easter, 'c' for Christmas and 'a' for the other public holidays. We will just convert this into a boolean that flags any holiday." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "all_ftrs.StateHoliday = all_ftrs.StateHoliday!='0'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can add, for each store, the number of days since the last or until the next promotion, state holiday or school holiday. This will take a little while since the whole table is big." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "all_ftrs = add_elapsed_times(all_ftrs, ['Promo', 'StateHoliday', 'SchoolHoliday'], \n", "                             date_field='Date', base_field='Store')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It added four new features for each event. If we look at 'StateHoliday' for instance:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['StateHoliday',\n", " 'AfterStateHoliday',\n", " 'BeforeStateHoliday',\n", " 'StateHoliday_bw',\n", " 'StateHoliday_fw']" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[c for c in all_ftrs.columns if 'StateHoliday' in c]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The column 'AfterStateHoliday' contains the number of days since the last state holiday, 'BeforeStateHoliday' the number of days until the next one. As for 'StateHoliday_bw' and 'StateHoliday_fw', they contain the number of state holidays in the past or future seven days respectively. The same four columns have been added for 'Promo' and 'SchoolHoliday'.\n", "\n", "Now that we have added those features, we can split our table back into its training and test parts." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_df = all_ftrs.iloc[:len(train_fe)]\n", "test_df = all_ftrs.iloc[len(train_fe):]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One last thing the authors of this winning solution did was to remove the rows with no sales, which correspond to exceptional closures of the stores. 
This might not have been such a good idea: even if we don't have access to the same information in the test data, knowing a store had just been closed can explain some of the spikes we see in the training data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_df = train_df[train_df.Sales != 0.]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will use those tables for training, but since all those steps took a bit of time, it's a good idea to save our progress. We will just pickle both tables to the hard drive (they can be loaded back later with `pd.read_pickle`)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_df.to_pickle(path/'train_clean')\n", "test_df.to_pickle(path/'test_clean')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 2 }