{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Rossman data preparation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To illustrate the techniques we need to apply before feeding all the data to a Deep Learning model, we are going to take the example of the [Rossmann sales Kaggle competition](https://www.kaggle.com/c/rossmann-store-sales). Given a wide range of information about a store, we are going to try predict their sale number on a given day. This is very useful to be able to manage stock properly and be able to properly satisfy the demand without wasting anything. The official training set was giving a lot of informations about various stores in Germany, but it was also allowed to use additional data, as long as it was made public and available to all participants.\n", "\n", "We are going to reproduce most of the steps of one of the winning teams that they highlighted in [Entity Embeddings of Categorical Variables](https://arxiv.org/pdf/1604.06737.pdf). In addition to the official data, teams in the top of the leaderboard also used information about the weather, the states of the stores or the Google trends of those days. We have assembled all that additional data in one file available for download [here](http://files.fast.ai/part2/lesson14/rossmann.tgz) if you want to replicate those steps." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### A first look at the data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First things first, let's import everything we will need." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from fastai2.tabular.all import *" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you have download the previous file and decompressed it in a folder named rossmann in the fastai data folder, you should see the following list of files with this instruction:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(#9) [/home/sgugger/.fastai/data/rossmann/weather.csv,/home/sgugger/.fastai/data/rossmann/sample_submission.csv,/home/sgugger/.fastai/data/rossmann/googletrend.csv,/home/sgugger/.fastai/data/rossmann/test.csv,/home/sgugger/.fastai/data/rossmann/store.csv,/home/sgugger/.fastai/data/rossmann/store_states.csv,/home/sgugger/.fastai/data/rossmann/state_names.csv,/home/sgugger/.fastai/data/rossmann/train.csv,/home/sgugger/.fastai/data/rossmann/rossmann.tgz]" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path = Config().data/'rossmann'\n", "path.ls()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data that comes from Kaggle is in 'train.csv', 'test.csv', 'store.csv' and 'sample_submission.csv'. The other files are the additional data we were talking about. Let's start by loading everything using pandas." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "table_names = ['train', 'store', 'store_states', 'state_names', 'googletrend', 'weather', 'test']\n", "tables = [pd.read_csv(path/f'{fname}.csv', low_memory=False) for fname in table_names]\n", "train, store, store_states, state_names, googletrend, weather, test = tables" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To get an idea of the amount of data available, let's just look at the length of the training and test tables." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1017209, 41088)" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(train), len(test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So we have more than one million records available. Let's have a look at what's inside:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
StoreDayOfWeekDateSalesCustomersOpenPromoStateHolidaySchoolHoliday
0152015-07-3152635551101
1252015-07-3160646251101
2352015-07-3183148211101
3452015-07-311399514981101
4552015-07-3148225591101
\n", "
" ], "text/plain": [ " Store DayOfWeek Date Sales Customers Open Promo StateHoliday \\\n", "0 1 5 2015-07-31 5263 555 1 1 0 \n", "1 2 5 2015-07-31 6064 625 1 1 0 \n", "2 3 5 2015-07-31 8314 821 1 1 0 \n", "3 4 5 2015-07-31 13995 1498 1 1 0 \n", "4 5 5 2015-07-31 4822 559 1 1 0 \n", "\n", " SchoolHoliday \n", "0 1 \n", "1 1 \n", "2 1 \n", "3 1 \n", "4 1 " ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `Store` column contains the id of the stores, then we are given the id of the day of the week, the exact date, if the store was open on that day, if there were any promotion in that store during that day, and if it was a state or school holiday. The `Customers` column is given as an indication, and the `Sales` column is what we will try to predict.\n", "\n", "If we look at the test table, we have the same columns, minus `Sales` and `Customers`, and it looks like we will have to predict on dates that are after the ones of the train table." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IdStoreDayOfWeekDateOpenPromoStateHolidaySchoolHoliday
01142015-09-171.0100
12342015-09-171.0100
23742015-09-171.0100
34842015-09-171.0100
45942015-09-171.0100
\n", "
" ], "text/plain": [ " Id Store DayOfWeek Date Open Promo StateHoliday SchoolHoliday\n", "0 1 1 4 2015-09-17 1.0 1 0 0\n", "1 2 3 4 2015-09-17 1.0 1 0 0\n", "2 3 7 4 2015-09-17 1.0 1 0 0\n", "3 4 8 4 2015-09-17 1.0 1 0 0\n", "4 5 9 4 2015-09-17 1.0 1 0 0" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The other table given by Kaggle contains some information specific to the stores: their type, what the competition looks like, if they are engaged in a permanent promotion program, and if so since then." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
01234
Store12345
StoreTypecaaca
Assortmentaaaca
CompetitionDistance12705701413062029910
CompetitionOpenSinceMonth9111294
CompetitionOpenSinceYear20082007200620092015
Promo201100
Promo2SinceWeekNaN1314NaNNaN
Promo2SinceYearNaN20102011NaNNaN
PromoIntervalNaNJan,Apr,Jul,OctJan,Apr,Jul,OctNaNNaN
\n", "
" ], "text/plain": [ " 0 1 2 3 4\n", "Store 1 2 3 4 5\n", "StoreType c a a c a\n", "Assortment a a a c a\n", "CompetitionDistance 1270 570 14130 620 29910\n", "CompetitionOpenSinceMonth 9 11 12 9 4\n", "CompetitionOpenSinceYear 2008 2007 2006 2009 2015\n", "Promo2 0 1 1 0 0\n", "Promo2SinceWeek NaN 13 14 NaN NaN\n", "Promo2SinceYear NaN 2010 2011 NaN NaN\n", "PromoInterval NaN Jan,Apr,Jul,Oct Jan,Apr,Jul,Oct NaN NaN" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "store.head().T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's have a quick look at our four additional dataframes. `store_states` just gives us the abbreviated name of the sate of each store." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
StoreState
01HE
12TH
23NW
34BE
45SN
\n", "
" ], "text/plain": [ " Store State\n", "0 1 HE\n", "1 2 TH\n", "2 3 NW\n", "3 4 BE\n", "4 5 SN" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "store_states.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can match them to their real names with `state_names`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
StateNameState
0BadenWuerttembergBW
1BayernBY
2BerlinBE
3BrandenburgBB
4BremenHB
\n", "
" ], "text/plain": [ " StateName State\n", "0 BadenWuerttemberg BW\n", "1 Bayern BY\n", "2 Berlin BE\n", "3 Brandenburg BB\n", "4 Bremen HB" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "state_names.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Which is going to be necessary if we want to use the `weather` table:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
01234
fileNordrheinWestfalenNordrheinWestfalenNordrheinWestfalenNordrheinWestfalenNordrheinWestfalen
Date2013-01-012013-01-022013-01-032013-01-042013-01-05
Max_TemperatureC871198
Mean_TemperatureC44898
Min_TemperatureC21687
Dew_PointC751098
MeanDew_PointC53897
Min_DewpointC12486
Max_Humidity9493100100100
Mean_Humidity8785939494
Min_Humidity6578778787
Max_Sea_Level_PressurehPa10131028103510361035
Mean_Sea_Level_PressurehPa10051022103010351034
Min_Sea_Level_PressurehPa10011014102610341033
Max_VisibilityKm3131311110
Mean_VisibilityKm1214856
Min_VisibilitykM410223
Max_Wind_SpeedKm_h3924262316
Mean_Wind_SpeedKm_h2616211410
Max_Gust_SpeedKm_h58NaNNaNNaNNaN
Precipitationmm5.0801.020.250
CloudCover66777
EventsRainRainRainRainRain
WindDirDegrees215225240263268
\n", "
" ], "text/plain": [ " 0 1 \\\n", "file NordrheinWestfalen NordrheinWestfalen \n", "Date 2013-01-01 2013-01-02 \n", "Max_TemperatureC 8 7 \n", "Mean_TemperatureC 4 4 \n", "Min_TemperatureC 2 1 \n", "Dew_PointC 7 5 \n", "MeanDew_PointC 5 3 \n", "Min_DewpointC 1 2 \n", "Max_Humidity 94 93 \n", "Mean_Humidity 87 85 \n", "Min_Humidity 65 78 \n", "Max_Sea_Level_PressurehPa 1013 1028 \n", "Mean_Sea_Level_PressurehPa 1005 1022 \n", "Min_Sea_Level_PressurehPa 1001 1014 \n", "Max_VisibilityKm 31 31 \n", "Mean_VisibilityKm 12 14 \n", "Min_VisibilitykM 4 10 \n", "Max_Wind_SpeedKm_h 39 24 \n", "Mean_Wind_SpeedKm_h 26 16 \n", "Max_Gust_SpeedKm_h 58 NaN \n", "Precipitationmm 5.08 0 \n", "CloudCover 6 6 \n", "Events Rain Rain \n", "WindDirDegrees 215 225 \n", "\n", " 2 3 \\\n", "file NordrheinWestfalen NordrheinWestfalen \n", "Date 2013-01-03 2013-01-04 \n", "Max_TemperatureC 11 9 \n", "Mean_TemperatureC 8 9 \n", "Min_TemperatureC 6 8 \n", "Dew_PointC 10 9 \n", "MeanDew_PointC 8 9 \n", "Min_DewpointC 4 8 \n", "Max_Humidity 100 100 \n", "Mean_Humidity 93 94 \n", "Min_Humidity 77 87 \n", "Max_Sea_Level_PressurehPa 1035 1036 \n", "Mean_Sea_Level_PressurehPa 1030 1035 \n", "Min_Sea_Level_PressurehPa 1026 1034 \n", "Max_VisibilityKm 31 11 \n", "Mean_VisibilityKm 8 5 \n", "Min_VisibilitykM 2 2 \n", "Max_Wind_SpeedKm_h 26 23 \n", "Mean_Wind_SpeedKm_h 21 14 \n", "Max_Gust_SpeedKm_h NaN NaN \n", "Precipitationmm 1.02 0.25 \n", "CloudCover 7 7 \n", "Events Rain Rain \n", "WindDirDegrees 240 263 \n", "\n", " 4 \n", "file NordrheinWestfalen \n", "Date 2013-01-05 \n", "Max_TemperatureC 8 \n", "Mean_TemperatureC 8 \n", "Min_TemperatureC 7 \n", "Dew_PointC 8 \n", "MeanDew_PointC 7 \n", "Min_DewpointC 6 \n", "Max_Humidity 100 \n", "Mean_Humidity 94 \n", "Min_Humidity 87 \n", "Max_Sea_Level_PressurehPa 1035 \n", "Mean_Sea_Level_PressurehPa 1034 \n", "Min_Sea_Level_PressurehPa 1033 \n", "Max_VisibilityKm 10 \n", "Mean_VisibilityKm 6 \n", "Min_VisibilitykM 3 \n", "Max_Wind_SpeedKm_h 16 \n", "Mean_Wind_SpeedKm_h 10 \n", "Max_Gust_SpeedKm_h NaN \n", "Precipitationmm 0 \n", "CloudCover 7 \n", "Events Rain \n", "WindDirDegrees 268 " ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "weather.head().T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lastly the googletrend table gives us the trend of the brand in each state and in the whole of Germany." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
fileweektrend
0Rossmann_DE_SN2012-12-02 - 2012-12-0896
1Rossmann_DE_SN2012-12-09 - 2012-12-1595
2Rossmann_DE_SN2012-12-16 - 2012-12-2291
3Rossmann_DE_SN2012-12-23 - 2012-12-2948
4Rossmann_DE_SN2012-12-30 - 2013-01-0567
\n", "
" ], "text/plain": [ " file week trend\n", "0 Rossmann_DE_SN 2012-12-02 - 2012-12-08 96\n", "1 Rossmann_DE_SN 2012-12-09 - 2012-12-15 95\n", "2 Rossmann_DE_SN 2012-12-16 - 2012-12-22 91\n", "3 Rossmann_DE_SN 2012-12-23 - 2012-12-29 48\n", "4 Rossmann_DE_SN 2012-12-30 - 2013-01-05 67" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "googletrend.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before we apply the fastai preprocessing, we will need to join the store table and the additional ones with our training and test table. Then, as we saw in our first example in chapter 1, we will need to split our variables between categorical and continuous. Before we do that, though, there is one type of variable that is a bit different from the others: dates.\n", "\n", "We could turn each particular day in a category but there are cyclical information in dates we would miss if we did that. We already have the day of the week in our tables, but maybe the day of the month also bears some significance. People might be more inclined to go shopping at the beggining or the end of the month. The number of the week/month is also important to detect seasonal influences.\n", "\n", "Then we will try to exctract meaningful information from those dates. For instance promotions on their own are important inputs, but maybe the number of running weeks with promotion is another useful information as it will influence customers. A state holiday in itself is important, but it's more significant to know if we are the day before or after such a holiday as it will impact sales. All of those might seem very specific to this dataset, but you can actually apply them in any tabular data containing time information.\n", "\n", "This first step is called feature-engineering and is extremely important: your model will try to extract useful information from your data but any extra help you can give it in advance is going to make training easier, and the final result better. In Kaggle Competitions using tabular data, it's often the way people prepared their data that makes the difference in the final leaderboard, not the exact model used." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Feature Engineering" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Merging tables" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To merge tables together, we will use this little helper function that relies on the pandas library. It will merge the tables `left` and `right` by looking at the column(s) which names are in `left_on` and `right_on`: the information in `right` will be added to the rows of the tables in `left` when the data in `left_on` inside `left` is the same as the data in `right_on` inside `right`. If `left_on` and `right_on` are the same, we don't have to pass `right_on`. We keep the fields in `right` that have the same names as fields in `left` and add a `_y` suffix (by default) to those field names." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def join_df(left, right, left_on, right_on=None, suffix='_y'):\n", " if right_on is None: right_on = left_on\n", " return left.merge(right, how='left', left_on=left_on, right_on=right_on, \n", " suffixes=(\"\", suffix))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, let's replace the state names in the weather table by the abbreviations, since that's what is used in the other tables." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
fileDateStateStateName
0NordrheinWestfalen2013-01-01NWNordrheinWestfalen
1NordrheinWestfalen2013-01-02NWNordrheinWestfalen
2NordrheinWestfalen2013-01-03NWNordrheinWestfalen
3NordrheinWestfalen2013-01-04NWNordrheinWestfalen
4NordrheinWestfalen2013-01-05NWNordrheinWestfalen
\n", "
" ], "text/plain": [ " file Date State StateName\n", "0 NordrheinWestfalen 2013-01-01 NW NordrheinWestfalen\n", "1 NordrheinWestfalen 2013-01-02 NW NordrheinWestfalen\n", "2 NordrheinWestfalen 2013-01-03 NW NordrheinWestfalen\n", "3 NordrheinWestfalen 2013-01-04 NW NordrheinWestfalen\n", "4 NordrheinWestfalen 2013-01-05 NW NordrheinWestfalen" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "weather = join_df(weather, state_names, \"file\", \"StateName\")\n", "weather[['file', 'Date', 'State', 'StateName']].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To double-check the merge happened without incident, we can check that every row has a `State` with this line:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(weather[weather.State.isnull()])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now safely remove the columns with the state names (`file` and `StateName`) since they we'll use the short codes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "weather.drop(columns=['file', 'StateName'], inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To add the weather informations to our `store` table, we first use the table `store_states` to match a store code with the corresponding state, then we merge with our weather table." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "store = join_df(store, store_states, 'Store')\n", "store = join_df(store, weather, 'State')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And again, we can check if the merge went well by looking if new NaNs where introduced." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(store[store.Mean_TemperatureC.isnull()])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we want to join the `googletrend` table to this `store` table. If you remember from our previous look at it, it's not exactly in the same format:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
fileweektrend
0Rossmann_DE_SN2012-12-02 - 2012-12-0896
1Rossmann_DE_SN2012-12-09 - 2012-12-1595
2Rossmann_DE_SN2012-12-16 - 2012-12-2291
3Rossmann_DE_SN2012-12-23 - 2012-12-2948
4Rossmann_DE_SN2012-12-30 - 2013-01-0567
\n", "
" ], "text/plain": [ " file week trend\n", "0 Rossmann_DE_SN 2012-12-02 - 2012-12-08 96\n", "1 Rossmann_DE_SN 2012-12-09 - 2012-12-15 95\n", "2 Rossmann_DE_SN 2012-12-16 - 2012-12-22 91\n", "3 Rossmann_DE_SN 2012-12-23 - 2012-12-29 48\n", "4 Rossmann_DE_SN 2012-12-30 - 2013-01-05 67" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "googletrend.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will need to change the column with the states and the columns with the dates:\n", "- in the column `fil`, the state names contain `Rossmann_DE_XX` with `XX` being the code of the state, so we want to remove `Rossmann_DE`. We will do this by creating a new column containing the last part of a split of the string by '\\_'.\n", "- in the column `week`, we will extract the date corresponding to the beginning of the week in a new column by taking the last part of a split on ' - '.\n", "\n", "In pandas, creating a new column is very easy: you just have to define them." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
fileweektrendDateState
0Rossmann_DE_SN2012-12-02 - 2012-12-08962012-12-02SN
1Rossmann_DE_SN2012-12-09 - 2012-12-15952012-12-09SN
2Rossmann_DE_SN2012-12-16 - 2012-12-22912012-12-16SN
3Rossmann_DE_SN2012-12-23 - 2012-12-29482012-12-23SN
4Rossmann_DE_SN2012-12-30 - 2013-01-05672012-12-30SN
\n", "
" ], "text/plain": [ " file week trend Date State\n", "0 Rossmann_DE_SN 2012-12-02 - 2012-12-08 96 2012-12-02 SN\n", "1 Rossmann_DE_SN 2012-12-09 - 2012-12-15 95 2012-12-09 SN\n", "2 Rossmann_DE_SN 2012-12-16 - 2012-12-22 91 2012-12-16 SN\n", "3 Rossmann_DE_SN 2012-12-23 - 2012-12-29 48 2012-12-23 SN\n", "4 Rossmann_DE_SN 2012-12-30 - 2013-01-05 67 2012-12-30 SN" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "googletrend['Date'] = googletrend.week.str.split(' - ', expand=True)[0]\n", "googletrend['State'] = googletrend.file.str.split('_', expand=True)[2]\n", "googletrend.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's check everything went well by looking at the values in the new `State` column of our `googletrend` table." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(array(['HE', 'TH', 'NW', 'BE', 'SN', 'SH', 'HB,NI', 'BY', 'BW', 'RP',\n", " 'ST', 'HH'], dtype=object),\n", " array(['SN', None, 'BY', 'TH', 'NW', 'BE', 'RP', 'BW', 'NI', 'SH', 'HE',\n", " 'ST', 'HH', 'SL'], dtype=object))" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "store['State'].unique(),googletrend['State'].unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have two additional values in the second (`None` and 'SL') but this isn't a problem since they'll be ignored when we join. One problem however is that 'HB,NI' in the first table is named 'NI' in the second one, so we need to change that." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "googletrend.loc[googletrend.State=='NI', \"State\"] = 'HB,NI'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Why do we have a `None` in state? As we said before, there is a global trend for Germany that corresponds to `Rosmann_DE` in the field `file`. For those, the previous split failed which gave the `None` value. We will keep this global trend and put it in a new column." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "trend_de = googletrend[googletrend.file == 'Rossmann_DE'][['Date', 'trend']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we can merge it with the rest of our trends, by adding the suffix '\\_DE' to know it's the general trend." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "googletrend = join_df(googletrend, trend_de, 'Date', suffix='_DE')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then at this stage, we can remove the columns `file` and `week`since they won't be useful anymore, as well as the rows where `State` is `None` (since they correspond to the global trend that we saved in another column)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "googletrend.drop(columns=['file', 'week'], axis=1, inplace=True)\n", "googletrend = googletrend[~googletrend['State'].isnull()]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The last thing missing to be able to join this with or store table is to extract the week from the date in this table and in the store table: we need to join them on week values since each trend is given for the full week that starts on the indicated date. This is linked to the next topic in feature engineering: extracting dateparts." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Adding dateparts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If your table contains dates, you will need to split the information there in several column for your Deep Learning model to be able to train properly. There is the basic stuff, such as the day number, week number, month number or year number, but anything that can be relevant to your problem is also useful. Is it the beginning or the end of the month? Is it a holiday?\n", "\n", "To help with this, the fastai library as a convenience function called `add_datepart`. It will take a dataframe and a column you indicate, try to read it as a date, then add all those new columns. If we go back to our `googletrend` table, we now have gour columns." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
trendDateStatetrend_DE
0962012-12-02SN77
1952012-12-09SN76
2912012-12-16SN85
3482012-12-23SN59
4672012-12-30SN61
\n", "
" ], "text/plain": [ " trend Date State trend_DE\n", "0 96 2012-12-02 SN 77\n", "1 95 2012-12-09 SN 76\n", "2 91 2012-12-16 SN 85\n", "3 48 2012-12-23 SN 59\n", "4 67 2012-12-30 SN 61" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "googletrend.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we add the dateparts, we will gain a lot more" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "googletrend = add_datepart(googletrend, 'Date', drop=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
01234
trend9695914867
Date2012-12-02 00:00:002012-12-09 00:00:002012-12-16 00:00:002012-12-23 00:00:002012-12-30 00:00:00
StateSNSNSNSNSN
trend_DE7776855961
Year20122012201220122012
Month1212121212
Week4849505152
Day29162330
Dayofweek66666
Dayofyear337344351358365
Is_month_endFalseFalseFalseFalseFalse
Is_month_startFalseFalseFalseFalseFalse
Is_quarter_endFalseFalseFalseFalseFalse
Is_quarter_startFalseFalseFalseFalseFalse
Is_year_endFalseFalseFalseFalseFalse
Is_year_startFalseFalseFalseFalseFalse
Elapsed13544064001355011200135561600013562208001356825600
\n", "
" ], "text/plain": [ " 0 1 \\\n", "trend 96 95 \n", "Date 2012-12-02 00:00:00 2012-12-09 00:00:00 \n", "State SN SN \n", "trend_DE 77 76 \n", "Year 2012 2012 \n", "Month 12 12 \n", "Week 48 49 \n", "Day 2 9 \n", "Dayofweek 6 6 \n", "Dayofyear 337 344 \n", "Is_month_end False False \n", "Is_month_start False False \n", "Is_quarter_end False False \n", "Is_quarter_start False False \n", "Is_year_end False False \n", "Is_year_start False False \n", "Elapsed 1354406400 1355011200 \n", "\n", " 2 3 \\\n", "trend 91 48 \n", "Date 2012-12-16 00:00:00 2012-12-23 00:00:00 \n", "State SN SN \n", "trend_DE 85 59 \n", "Year 2012 2012 \n", "Month 12 12 \n", "Week 50 51 \n", "Day 16 23 \n", "Dayofweek 6 6 \n", "Dayofyear 351 358 \n", "Is_month_end False False \n", "Is_month_start False False \n", "Is_quarter_end False False \n", "Is_quarter_start False False \n", "Is_year_end False False \n", "Is_year_start False False \n", "Elapsed 1355616000 1356220800 \n", "\n", " 4 \n", "trend 67 \n", "Date 2012-12-30 00:00:00 \n", "State SN \n", "trend_DE 61 \n", "Year 2012 \n", "Month 12 \n", "Week 52 \n", "Day 30 \n", "Dayofweek 6 \n", "Dayofyear 365 \n", "Is_month_end False \n", "Is_month_start False \n", "Is_quarter_end False \n", "Is_quarter_start False \n", "Is_year_end False \n", "Is_year_start False \n", "Elapsed 1356825600 " ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "googletrend.head().T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We chose the option `drop=False` as we want to keep the `Date` column for now. Another option is to add the `time` part of the date, but it's not relevant to our problem here. \n", "\n", "Now we can join our Google trends with the information in the `store` table, it's just a join on \\['Week', 'Year'\\] once we apply `add_datepart` to that table. Note that we only keep the initial columns of `googletrend` with `Week` and `Year` to avoid all the duplicates." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "googletrend = googletrend[['trend', 'State', 'trend_DE', 'Week', 'Year']]\n", "store = add_datepart(store, 'Date', drop=False)\n", "store = join_df(store, googletrend, ['Week', 'Year', 'State'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At this stage, `store` contains all the information about the stores, the weather on that day and the Google trends applicable. We only have to join it with our training and test table. We have to use `make_date` before being able to execute that merge, to convert the `Date` column of `train` and `test` to proper date format." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "make_date(train, 'Date')\n", "make_date(test, 'Date')\n", "train_fe = join_df(train, store, ['Store', 'Date'])\n", "test_fe = join_df(test, store, ['Store', 'Date'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Elapsed times" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another feature that can be useful is the elapsed time before/after a certain event occurs. For instance the number of days since the last promotion or before the next school holiday. 
As with the date parts, there is a fastai convenience function that will add them automatically.\n", "\n", "One thing to take into account here is that you will need to use that function on the whole time series you have, even the test data: a school holiday that takes place during the training period will impact those new features in the test data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "all_ftrs = train_fe.append(test_fe, sort=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will consider the elapsed times for three events: 'Promo', 'StateHoliday' and 'SchoolHoliday'. Note that those must correspond to booleans in your dataframe. 'Promo' and 'SchoolHoliday' already are (only 0s and 1s) but 'StateHoliday' has multiple values." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['0', 'a', 'b', 'c'], dtype=object)" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_ftrs['StateHoliday'].unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we refer to the explanation on Kaggle, 'b' is for Easter, 'c' for Christmas and 'a' for the other public holidays. We will just convert this into a boolean that flags any holiday." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "all_ftrs.StateHoliday = all_ftrs.StateHoliday!='0'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can add, for each store, the number of days since the last or until the next promotion, state holiday or school holiday. This will take a little while since the whole table is big." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "all_ftrs = add_elapsed_times(all_ftrs, ['Promo', 'StateHoliday', 'SchoolHoliday'], \n", "                             date_field='Date', base_field='Store')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It added four new features for each event. If we look at 'StateHoliday' for instance:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['StateHoliday',\n", " 'AfterStateHoliday',\n", " 'BeforeStateHoliday',\n", " 'StateHoliday_bw',\n", " 'StateHoliday_fw']" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[c for c in all_ftrs.columns if 'StateHoliday' in c]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The column 'AfterStateHoliday' contains the number of days since the last state holiday, 'BeforeStateHoliday' the number of days until the next one. As for 'StateHoliday_bw' and 'StateHoliday_fw', they contain the number of state holidays in the past or future seven days respectively. The same four columns have been added for 'Promo' and 'SchoolHoliday'.\n", "\n", "Now that we have added those features, we can split our table back into its training and test parts." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_df = all_ftrs.iloc[:len(train_fe)]\n", "test_df = all_ftrs.iloc[len(train_fe):]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One last thing the authors of this winning solution did was to remove the rows with no sales, which correspond to exceptional closures of the stores. 
This might not have been such a good idea: even if we don't have access to the same information in the test data, knowing a store had just been closed can explain some of the spikes we see in the training data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_df = train_df[train_df.Sales != 0.]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will use those tables for training, but since all those steps took a bit of time, it's a good idea to save our progress. We will just pickle both tables to the hard drive (they can be loaded back later with `pd.read_pickle`)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_df.to_pickle(path/'train_clean')\n", "test_df.to_pickle(path/'test_clean')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 2 }