{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Structured and time series data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook contains an implementation of the third place result in the Rossman Kaggle competition as detailed in Guo/Berkhahn's [Entity Embeddings of Categorical Variables](https://arxiv.org/abs/1604.06737).\n",
    "\n",
    "The motivation behind exploring this architecture is it's relevance to real-world application. Most data used for decision making day-to-day in industry is structured and/or time-series data. Here we explore the end-to-end process of using neural networks with practical structured data problems."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "%reload_ext autoreload\n",
    "%autoreload 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from fastai.structured import *\n",
    "from fastai.column_data import *\n",
    "np.set_printoptions(threshold=50, edgeitems=20)\n",
    "\n",
    "PATH='data/rossmann/'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true,
    "heading_collapsed": true,
    "scrolled": false
   },
   "source": [
    "## Create datasets"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "In addition to the provided data, we will be using external datasets put together by participants in the Kaggle competition. You can download all of them [here](http://files.fast.ai/part2/lesson14/rossmann.tgz).\n",
    "\n",
    "For completeness, the implementation used to put them together is included below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "def concat_csvs(dirname):\n",
    "    path = f'{PATH}{dirname}'\n",
    "    filenames=glob(f\"{PATH}/*.csv\")\n",
    "\n",
    "    wrote_header = False\n",
    "    with open(f\"{path}.csv\",\"w\") as outputfile:\n",
    "        for filename in filenames:\n",
    "            name = filename.split(\".\")[0]\n",
    "            with open(filename) as f:\n",
    "                line = f.readline()\n",
    "                if not wrote_header:\n",
    "                    wrote_header = True\n",
    "                    outputfile.write(\"file,\"+line)\n",
    "                for line in f:\n",
    "                     outputfile.write(name + \",\" + line)\n",
    "                outputfile.write(\"\\n\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "# concat_csvs('googletrend')\n",
    "# concat_csvs('weather')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "Feature Space:\n",
    "* train: Training set provided by competition\n",
    "* store: List of stores\n",
    "* store_states: mapping of store to the German state they are in\n",
    "* List of German state names\n",
    "* googletrend: trend of certain google keywords over time, found by users to correlate well w/ given data\n",
    "* weather: weather\n",
    "* test: testing set"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "table_names = ['train', 'store', 'store_states', 'state_names', \n",
    "               'googletrend', 'weather', 'test']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "We'll be using the popular data manipulation framework `pandas`. Among other things, pandas allows you to manipulate tables/data frames in python as one would in a database.\n",
    "\n",
    "We're going to go ahead and load all of our csv's as dataframes into the list `tables`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "tables = [pd.read_csv(f'{PATH}{fname}.csv', low_memory=False) for fname in table_names]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "from IPython.display import HTML"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "We can use `head()` to get a quick look at the contents of each table:\n",
    "* train: Contains store information on a daily basis, tracks things like sales, customers, whether that day was a holdiay, etc.\n",
    "* store: general info about the store including competition, etc.\n",
    "* store_states: maps store to state it is in\n",
    "* state_names: Maps state abbreviations to names\n",
    "* googletrend: trend data for particular week/state\n",
    "* weather: weather conditions for each state\n",
    "* test: Same as training table, w/o sales and customers\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "hidden": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Store</th>\n",
       "      <th>DayOfWeek</th>\n",
       "      <th>Date</th>\n",
       "      <th>Sales</th>\n",
       "      <th>Customers</th>\n",
       "      <th>Open</th>\n",
       "      <th>Promo</th>\n",
       "      <th>StateHoliday</th>\n",
       "      <th>SchoolHoliday</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>5</td>\n",
       "      <td>2015-07-31</td>\n",
       "      <td>5263</td>\n",
       "      <td>555</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>5</td>\n",
       "      <td>2015-07-31</td>\n",
       "      <td>6064</td>\n",
       "      <td>625</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>5</td>\n",
       "      <td>2015-07-31</td>\n",
       "      <td>8314</td>\n",
       "      <td>821</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>5</td>\n",
       "      <td>2015-07-31</td>\n",
       "      <td>13995</td>\n",
       "      <td>1498</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>5</td>\n",
       "      <td>2015-07-31</td>\n",
       "      <td>4822</td>\n",
       "      <td>559</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Store  DayOfWeek        Date  Sales  Customers  Open  Promo StateHoliday  \\\n",
       "0      1          5  2015-07-31   5263        555     1      1            0   \n",
       "1      2          5  2015-07-31   6064        625     1      1            0   \n",
       "2      3          5  2015-07-31   8314        821     1      1            0   \n",
       "3      4          5  2015-07-31  13995       1498     1      1            0   \n",
       "4      5          5  2015-07-31   4822        559     1      1            0   \n",
       "\n",
       "   SchoolHoliday  \n",
       "0              1  \n",
       "1              1  \n",
       "2              1  \n",
       "3              1  \n",
       "4              1  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Store</th>\n",
       "      <th>StoreType</th>\n",
       "      <th>Assortment</th>\n",
       "      <th>CompetitionDistance</th>\n",
       "      <th>CompetitionOpenSinceMonth</th>\n",
       "      <th>CompetitionOpenSinceYear</th>\n",
       "      <th>Promo2</th>\n",
       "      <th>Promo2SinceWeek</th>\n",
       "      <th>Promo2SinceYear</th>\n",
       "      <th>PromoInterval</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>c</td>\n",
       "      <td>a</td>\n",
       "      <td>1270.0</td>\n",
       "      <td>9.0</td>\n",
       "      <td>2008.0</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>a</td>\n",
       "      <td>a</td>\n",
       "      <td>570.0</td>\n",
       "      <td>11.0</td>\n",
       "      <td>2007.0</td>\n",
       "      <td>1</td>\n",
       "      <td>13.0</td>\n",
       "      <td>2010.0</td>\n",
       "      <td>Jan,Apr,Jul,Oct</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>a</td>\n",
       "      <td>a</td>\n",
       "      <td>14130.0</td>\n",
       "      <td>12.0</td>\n",
       "      <td>2006.0</td>\n",
       "      <td>1</td>\n",
       "      <td>14.0</td>\n",
       "      <td>2011.0</td>\n",
       "      <td>Jan,Apr,Jul,Oct</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>c</td>\n",
       "      <td>c</td>\n",
       "      <td>620.0</td>\n",
       "      <td>9.0</td>\n",
       "      <td>2009.0</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>a</td>\n",
       "      <td>a</td>\n",
       "      <td>29910.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>2015.0</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Store StoreType Assortment  CompetitionDistance  CompetitionOpenSinceMonth  \\\n",
       "0      1         c          a               1270.0                        9.0   \n",
       "1      2         a          a                570.0                       11.0   \n",
       "2      3         a          a              14130.0                       12.0   \n",
       "3      4         c          c                620.0                        9.0   \n",
       "4      5         a          a              29910.0                        4.0   \n",
       "\n",
       "   CompetitionOpenSinceYear  Promo2  Promo2SinceWeek  Promo2SinceYear  \\\n",
       "0                    2008.0       0              NaN              NaN   \n",
       "1                    2007.0       1             13.0           2010.0   \n",
       "2                    2006.0       1             14.0           2011.0   \n",
       "3                    2009.0       0              NaN              NaN   \n",
       "4                    2015.0       0              NaN              NaN   \n",
       "\n",
       "     PromoInterval  \n",
       "0              NaN  \n",
       "1  Jan,Apr,Jul,Oct  \n",
       "2  Jan,Apr,Jul,Oct  \n",
       "3              NaN  \n",
       "4              NaN  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Store</th>\n",
       "      <th>State</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>HE</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>TH</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>NW</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>BE</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>SN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Store State\n",
       "0      1    HE\n",
       "1      2    TH\n",
       "2      3    NW\n",
       "3      4    BE\n",
       "4      5    SN"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>StateName</th>\n",
       "      <th>State</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>BadenWuerttemberg</td>\n",
       "      <td>BW</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Bayern</td>\n",
       "      <td>BY</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Berlin</td>\n",
       "      <td>BE</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Brandenburg</td>\n",
       "      <td>BB</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Bremen</td>\n",
       "      <td>HB</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "           StateName State\n",
       "0  BadenWuerttemberg    BW\n",
       "1             Bayern    BY\n",
       "2             Berlin    BE\n",
       "3        Brandenburg    BB\n",
       "4             Bremen    HB"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>file</th>\n",
       "      <th>week</th>\n",
       "      <th>trend</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Rossmann_DE_SN</td>\n",
       "      <td>2012-12-02 - 2012-12-08</td>\n",
       "      <td>96</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Rossmann_DE_SN</td>\n",
       "      <td>2012-12-09 - 2012-12-15</td>\n",
       "      <td>95</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Rossmann_DE_SN</td>\n",
       "      <td>2012-12-16 - 2012-12-22</td>\n",
       "      <td>91</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Rossmann_DE_SN</td>\n",
       "      <td>2012-12-23 - 2012-12-29</td>\n",
       "      <td>48</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Rossmann_DE_SN</td>\n",
       "      <td>2012-12-30 - 2013-01-05</td>\n",
       "      <td>67</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "             file                     week  trend\n",
       "0  Rossmann_DE_SN  2012-12-02 - 2012-12-08     96\n",
       "1  Rossmann_DE_SN  2012-12-09 - 2012-12-15     95\n",
       "2  Rossmann_DE_SN  2012-12-16 - 2012-12-22     91\n",
       "3  Rossmann_DE_SN  2012-12-23 - 2012-12-29     48\n",
       "4  Rossmann_DE_SN  2012-12-30 - 2013-01-05     67"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>file</th>\n",
       "      <th>Date</th>\n",
       "      <th>Max_TemperatureC</th>\n",
       "      <th>Mean_TemperatureC</th>\n",
       "      <th>Min_TemperatureC</th>\n",
       "      <th>Dew_PointC</th>\n",
       "      <th>MeanDew_PointC</th>\n",
       "      <th>Min_DewpointC</th>\n",
       "      <th>Max_Humidity</th>\n",
       "      <th>Mean_Humidity</th>\n",
       "      <th>...</th>\n",
       "      <th>Max_VisibilityKm</th>\n",
       "      <th>Mean_VisibilityKm</th>\n",
       "      <th>Min_VisibilitykM</th>\n",
       "      <th>Max_Wind_SpeedKm_h</th>\n",
       "      <th>Mean_Wind_SpeedKm_h</th>\n",
       "      <th>Max_Gust_SpeedKm_h</th>\n",
       "      <th>Precipitationmm</th>\n",
       "      <th>CloudCover</th>\n",
       "      <th>Events</th>\n",
       "      <th>WindDirDegrees</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>NordrheinWestfalen</td>\n",
       "      <td>2013-01-01</td>\n",
       "      <td>8</td>\n",
       "      <td>4</td>\n",
       "      <td>2</td>\n",
       "      <td>7</td>\n",
       "      <td>5</td>\n",
       "      <td>1</td>\n",
       "      <td>94</td>\n",
       "      <td>87</td>\n",
       "      <td>...</td>\n",
       "      <td>31.0</td>\n",
       "      <td>12.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>39</td>\n",
       "      <td>26</td>\n",
       "      <td>58.0</td>\n",
       "      <td>5.08</td>\n",
       "      <td>6.0</td>\n",
       "      <td>Rain</td>\n",
       "      <td>215</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>NordrheinWestfalen</td>\n",
       "      <td>2013-01-02</td>\n",
       "      <td>7</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>5</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>93</td>\n",
       "      <td>85</td>\n",
       "      <td>...</td>\n",
       "      <td>31.0</td>\n",
       "      <td>14.0</td>\n",
       "      <td>10.0</td>\n",
       "      <td>24</td>\n",
       "      <td>16</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.00</td>\n",
       "      <td>6.0</td>\n",
       "      <td>Rain</td>\n",
       "      <td>225</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>NordrheinWestfalen</td>\n",
       "      <td>2013-01-03</td>\n",
       "      <td>11</td>\n",
       "      <td>8</td>\n",
       "      <td>6</td>\n",
       "      <td>10</td>\n",
       "      <td>8</td>\n",
       "      <td>4</td>\n",
       "      <td>100</td>\n",
       "      <td>93</td>\n",
       "      <td>...</td>\n",
       "      <td>31.0</td>\n",
       "      <td>8.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>26</td>\n",
       "      <td>21</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.02</td>\n",
       "      <td>7.0</td>\n",
       "      <td>Rain</td>\n",
       "      <td>240</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>NordrheinWestfalen</td>\n",
       "      <td>2013-01-04</td>\n",
       "      <td>9</td>\n",
       "      <td>9</td>\n",
       "      <td>8</td>\n",
       "      <td>9</td>\n",
       "      <td>9</td>\n",
       "      <td>8</td>\n",
       "      <td>100</td>\n",
       "      <td>94</td>\n",
       "      <td>...</td>\n",
       "      <td>11.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>23</td>\n",
       "      <td>14</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.25</td>\n",
       "      <td>7.0</td>\n",
       "      <td>Rain</td>\n",
       "      <td>263</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>NordrheinWestfalen</td>\n",
       "      <td>2013-01-05</td>\n",
       "      <td>8</td>\n",
       "      <td>8</td>\n",
       "      <td>7</td>\n",
       "      <td>8</td>\n",
       "      <td>7</td>\n",
       "      <td>6</td>\n",
       "      <td>100</td>\n",
       "      <td>94</td>\n",
       "      <td>...</td>\n",
       "      <td>10.0</td>\n",
       "      <td>6.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>16</td>\n",
       "      <td>10</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.00</td>\n",
       "      <td>7.0</td>\n",
       "      <td>Rain</td>\n",
       "      <td>268</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 24 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                 file        Date  Max_TemperatureC  Mean_TemperatureC  \\\n",
       "0  NordrheinWestfalen  2013-01-01                 8                  4   \n",
       "1  NordrheinWestfalen  2013-01-02                 7                  4   \n",
       "2  NordrheinWestfalen  2013-01-03                11                  8   \n",
       "3  NordrheinWestfalen  2013-01-04                 9                  9   \n",
       "4  NordrheinWestfalen  2013-01-05                 8                  8   \n",
       "\n",
       "   Min_TemperatureC  Dew_PointC  MeanDew_PointC  Min_DewpointC  Max_Humidity  \\\n",
       "0                 2           7               5              1            94   \n",
       "1                 1           5               3              2            93   \n",
       "2                 6          10               8              4           100   \n",
       "3                 8           9               9              8           100   \n",
       "4                 7           8               7              6           100   \n",
       "\n",
       "   Mean_Humidity       ...        Max_VisibilityKm  Mean_VisibilityKm  \\\n",
       "0             87       ...                    31.0               12.0   \n",
       "1             85       ...                    31.0               14.0   \n",
       "2             93       ...                    31.0                8.0   \n",
       "3             94       ...                    11.0                5.0   \n",
       "4             94       ...                    10.0                6.0   \n",
       "\n",
       "   Min_VisibilitykM  Max_Wind_SpeedKm_h  Mean_Wind_SpeedKm_h  \\\n",
       "0               4.0                  39                   26   \n",
       "1              10.0                  24                   16   \n",
       "2               2.0                  26                   21   \n",
       "3               2.0                  23                   14   \n",
       "4               3.0                  16                   10   \n",
       "\n",
       "   Max_Gust_SpeedKm_h  Precipitationmm  CloudCover  Events  WindDirDegrees  \n",
       "0                58.0             5.08         6.0    Rain             215  \n",
       "1                 NaN             0.00         6.0    Rain             225  \n",
       "2                 NaN             1.02         7.0    Rain             240  \n",
       "3                 NaN             0.25         7.0    Rain             263  \n",
       "4                 NaN             0.00         7.0    Rain             268  \n",
       "\n",
       "[5 rows x 24 columns]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Id</th>\n",
       "      <th>Store</th>\n",
       "      <th>DayOfWeek</th>\n",
       "      <th>Date</th>\n",
       "      <th>Open</th>\n",
       "      <th>Promo</th>\n",
       "      <th>StateHoliday</th>\n",
       "      <th>SchoolHoliday</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "      <td>2015-09-17</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "      <td>4</td>\n",
       "      <td>2015-09-17</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>7</td>\n",
       "      <td>4</td>\n",
       "      <td>2015-09-17</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>8</td>\n",
       "      <td>4</td>\n",
       "      <td>2015-09-17</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>9</td>\n",
       "      <td>4</td>\n",
       "      <td>2015-09-17</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Id  Store  DayOfWeek        Date  Open  Promo StateHoliday  SchoolHoliday\n",
       "0   1      1          4  2015-09-17   1.0      1            0              0\n",
       "1   2      3          4  2015-09-17   1.0      1            0              0\n",
       "2   3      7          4  2015-09-17   1.0      1            0              0\n",
       "3   4      8          4  2015-09-17   1.0      1            0              0\n",
       "4   5      9          4  2015-09-17   1.0      1            0              0"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "for t in tables: display(t.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "This is very representative of a typical industry dataset.\n",
    "\n",
    "The following returns summarized aggregate information to each table accross each field."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "hidden": true,
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Store</th>\n",
       "      <th>DayOfWeek</th>\n",
       "      <th>Date</th>\n",
       "      <th>Sales</th>\n",
       "      <th>Customers</th>\n",
       "      <th>Open</th>\n",
       "      <th>Promo</th>\n",
       "      <th>StateHoliday</th>\n",
       "      <th>SchoolHoliday</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>1.01721e+06</td>\n",
       "      <td>1.01721e+06</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.01721e+06</td>\n",
       "      <td>1.01721e+06</td>\n",
       "      <td>1.01721e+06</td>\n",
       "      <td>1.01721e+06</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.01721e+06</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>558.43</td>\n",
       "      <td>3.99834</td>\n",
       "      <td>NaN</td>\n",
       "      <td>5773.82</td>\n",
       "      <td>633.146</td>\n",
       "      <td>0.830107</td>\n",
       "      <td>0.381515</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.178647</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>321.909</td>\n",
       "      <td>1.99739</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3849.93</td>\n",
       "      <td>464.412</td>\n",
       "      <td>0.375539</td>\n",
       "      <td>0.485759</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.383056</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>280</td>\n",
       "      <td>2</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3727</td>\n",
       "      <td>405</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>558</td>\n",
       "      <td>4</td>\n",
       "      <td>NaN</td>\n",
       "      <td>5744</td>\n",
       "      <td>609</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>838</td>\n",
       "      <td>6</td>\n",
       "      <td>NaN</td>\n",
       "      <td>7856</td>\n",
       "      <td>837</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>1115</td>\n",
       "      <td>7</td>\n",
       "      <td>NaN</td>\n",
       "      <td>41551</td>\n",
       "      <td>7388</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>counts</th>\n",
       "      <td>1017209</td>\n",
       "      <td>1017209</td>\n",
       "      <td>1017209</td>\n",
       "      <td>1017209</td>\n",
       "      <td>1017209</td>\n",
       "      <td>1017209</td>\n",
       "      <td>1017209</td>\n",
       "      <td>1017209</td>\n",
       "      <td>1017209</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>uniques</th>\n",
       "      <td>1115</td>\n",
       "      <td>7</td>\n",
       "      <td>942</td>\n",
       "      <td>21734</td>\n",
       "      <td>4086</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>4</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>missing</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>missing_perc</th>\n",
       "      <td>0%</td>\n",
       "      <td>0%</td>\n",
       "      <td>0%</td>\n",
       "      <td>0%</td>\n",
       "      <td>0%</td>\n",
       "      <td>0%</td>\n",
       "      <td>0%</td>\n",
       "      <td>0%</td>\n",
       "      <td>0%</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>types</th>\n",
       "      <td>numeric</td>\n",
       "      <td>numeric</td>\n",
       "      <td>categorical</td>\n",
       "      <td>numeric</td>\n",
       "      <td>numeric</td>\n",
       "      <td>bool</td>\n",
       "      <td>bool</td>\n",
       "      <td>categorical</td>\n",
       "      <td>bool</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                    Store    DayOfWeek         Date        Sales    Customers  \\\n",
       "count         1.01721e+06  1.01721e+06          NaN  1.01721e+06  1.01721e+06   \n",
       "mean               558.43      3.99834          NaN      5773.82      633.146   \n",
       "std               321.909      1.99739          NaN      3849.93      464.412   \n",
       "min                     1            1          NaN            0            0   \n",
       "25%                   280            2          NaN         3727          405   \n",
       "50%                   558            4          NaN         5744          609   \n",
       "75%                   838            6          NaN         7856          837   \n",
       "max                  1115            7          NaN        41551         7388   \n",
       "counts            1017209      1017209      1017209      1017209      1017209   \n",
       "uniques              1115            7          942        21734         4086   \n",
       "missing                 0            0            0            0            0   \n",
       "missing_perc           0%           0%           0%           0%           0%   \n",
       "types             numeric      numeric  categorical      numeric      numeric   \n",
       "\n",
       "                     Open        Promo StateHoliday SchoolHoliday  \n",
       "count         1.01721e+06  1.01721e+06          NaN   1.01721e+06  \n",
       "mean             0.830107     0.381515          NaN      0.178647  \n",
       "std              0.375539     0.485759          NaN      0.383056  \n",
       "min                     0            0          NaN             0  \n",
       "25%                     1            0          NaN             0  \n",
       "50%                     1            0          NaN             0  \n",
       "75%                     1            1          NaN             0  \n",
       "max                     1            1          NaN             1  \n",
       "counts            1017209      1017209      1017209       1017209  \n",
       "uniques                 2            2            4             2  \n",
       "missing                 0            0            0             0  \n",
       "missing_perc           0%           0%           0%            0%  \n",
       "types                bool         bool  categorical          bool  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Store</th>\n",
       "      <th>StoreType</th>\n",
       "      <th>Assortment</th>\n",
       "      <th>CompetitionDistance</th>\n",
       "      <th>CompetitionOpenSinceMonth</th>\n",
       "      <th>CompetitionOpenSinceYear</th>\n",
       "      <th>Promo2</th>\n",
       "      <th>Promo2SinceWeek</th>\n",
       "      <th>Promo2SinceYear</th>\n",
       "      <th>PromoInterval</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>1115</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1112</td>\n",
       "      <td>761</td>\n",
       "      <td>761</td>\n",
       "      <td>1115</td>\n",
       "      <td>571</td>\n",
       "      <td>571</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>558</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>5404.9</td>\n",
       "      <td>7.2247</td>\n",
       "      <td>2008.67</td>\n",
       "      <td>0.512108</td>\n",
       "      <td>23.5954</td>\n",
       "      <td>2011.76</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>322.017</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>7663.17</td>\n",
       "      <td>3.21235</td>\n",
       "      <td>6.19598</td>\n",
       "      <td>0.500078</td>\n",
       "      <td>14.142</td>\n",
       "      <td>1.67494</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>20</td>\n",
       "      <td>1</td>\n",
       "      <td>1900</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>2009</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>279.5</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>717.5</td>\n",
       "      <td>4</td>\n",
       "      <td>2006</td>\n",
       "      <td>0</td>\n",
       "      <td>13</td>\n",
       "      <td>2011</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>558</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2325</td>\n",
       "      <td>8</td>\n",
       "      <td>2010</td>\n",
       "      <td>1</td>\n",
       "      <td>22</td>\n",
       "      <td>2012</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>836.5</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>6882.5</td>\n",
       "      <td>10</td>\n",
       "      <td>2013</td>\n",
       "      <td>1</td>\n",
       "      <td>37</td>\n",
       "      <td>2013</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>1115</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>75860</td>\n",
       "      <td>12</td>\n",
       "      <td>2015</td>\n",
       "      <td>1</td>\n",
       "      <td>50</td>\n",
       "      <td>2015</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>counts</th>\n",
       "      <td>1115</td>\n",
       "      <td>1115</td>\n",
       "      <td>1115</td>\n",
       "      <td>1112</td>\n",
       "      <td>761</td>\n",
       "      <td>761</td>\n",
       "      <td>1115</td>\n",
       "      <td>571</td>\n",
       "      <td>571</td>\n",
       "      <td>571</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>uniques</th>\n",
       "      <td>1115</td>\n",
       "      <td>4</td>\n",
       "      <td>3</td>\n",
       "      <td>654</td>\n",
       "      <td>12</td>\n",
       "      <td>23</td>\n",
       "      <td>2</td>\n",
       "      <td>24</td>\n",
       "      <td>7</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>missing</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>354</td>\n",
       "      <td>354</td>\n",
       "      <td>0</td>\n",
       "      <td>544</td>\n",
       "      <td>544</td>\n",
       "      <td>544</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>missing_perc</th>\n",
       "      <td>0%</td>\n",
       "      <td>0%</td>\n",
       "      <td>0%</td>\n",
       "      <td>0.27%</td>\n",
       "      <td>31.75%</td>\n",
       "      <td>31.75%</td>\n",
       "      <td>0%</td>\n",
       "      <td>48.79%</td>\n",
       "      <td>48.79%</td>\n",
       "      <td>48.79%</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>types</th>\n",
       "      <td>numeric</td>\n",
       "      <td>categorical</td>\n",
       "      <td>categorical</td>\n",
       "      <td>numeric</td>\n",
       "      <td>numeric</td>\n",
       "      <td>numeric</td>\n",
       "      <td>bool</td>\n",
       "      <td>numeric</td>\n",
       "      <td>numeric</td>\n",
       "      <td>categorical</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                Store    StoreType   Assortment CompetitionDistance  \\\n",
       "count            1115          NaN          NaN                1112   \n",
       "mean              558          NaN          NaN              5404.9   \n",
       "std           322.017          NaN          NaN             7663.17   \n",
       "min                 1          NaN          NaN                  20   \n",
       "25%             279.5          NaN          NaN               717.5   \n",
       "50%               558          NaN          NaN                2325   \n",
       "75%             836.5          NaN          NaN              6882.5   \n",
       "max              1115          NaN          NaN               75860   \n",
       "counts           1115         1115         1115                1112   \n",
       "uniques          1115            4            3                 654   \n",
       "missing             0            0            0                   3   \n",
       "missing_perc       0%           0%           0%               0.27%   \n",
       "types         numeric  categorical  categorical             numeric   \n",
       "\n",
       "             CompetitionOpenSinceMonth CompetitionOpenSinceYear    Promo2  \\\n",
       "count                              761                      761      1115   \n",
       "mean                            7.2247                  2008.67  0.512108   \n",
       "std                            3.21235                  6.19598  0.500078   \n",
       "min                                  1                     1900         0   \n",
       "25%                                  4                     2006         0   \n",
       "50%                                  8                     2010         1   \n",
       "75%                                 10                     2013         1   \n",
       "max                                 12                     2015         1   \n",
       "counts                             761                      761      1115   \n",
       "uniques                             12                       23         2   \n",
       "missing                            354                      354         0   \n",
       "missing_perc                    31.75%                   31.75%        0%   \n",
       "types                          numeric                  numeric      bool   \n",
       "\n",
       "             Promo2SinceWeek Promo2SinceYear PromoInterval  \n",
       "count                    571             571           NaN  \n",
       "mean                 23.5954         2011.76           NaN  \n",
       "std                   14.142         1.67494           NaN  \n",
       "min                        1            2009           NaN  \n",
       "25%                       13            2011           NaN  \n",
       "50%                       22            2012           NaN  \n",
       "75%                       37            2013           NaN  \n",
       "max                       50            2015           NaN  \n",
       "counts                   571             571           571  \n",
       "uniques                   24               7             3  \n",
       "missing                  544             544           544  \n",
       "missing_perc          48.79%          48.79%        48.79%  \n",
       "types                numeric         numeric   categorical  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Store</th>\n",
       "      <th>State</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>1115</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>558</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>322.017</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>279.5</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>558</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>836.5</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>1115</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>counts</th>\n",
       "      <td>1115</td>\n",
       "      <td>1115</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>uniques</th>\n",
       "      <td>1115</td>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>missing</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>missing_perc</th>\n",
       "      <td>0%</td>\n",
       "      <td>0%</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>types</th>\n",
       "      <td>numeric</td>\n",
       "      <td>categorical</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                Store        State\n",
       "count            1115          NaN\n",
       "mean              558          NaN\n",
       "std           322.017          NaN\n",
       "min                 1          NaN\n",
       "25%             279.5          NaN\n",
       "50%               558          NaN\n",
       "75%             836.5          NaN\n",
       "max              1115          NaN\n",
       "counts           1115         1115\n",
       "uniques          1115           12\n",
       "missing             0            0\n",
       "missing_perc       0%           0%\n",
       "types         numeric  categorical"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>StateName</th>\n",
       "      <th>State</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>16</td>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>unique</th>\n",
       "      <td>16</td>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>top</th>\n",
       "      <td>Hamburg</td>\n",
       "      <td>BB</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>freq</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>counts</th>\n",
       "      <td>16</td>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>uniques</th>\n",
       "      <td>16</td>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>missing</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>missing_perc</th>\n",
       "      <td>0%</td>\n",
       "      <td>0%</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>types</th>\n",
       "      <td>unique</td>\n",
       "      <td>unique</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "             StateName   State\n",
       "count               16      16\n",
       "unique              16      16\n",
       "top            Hamburg      BB\n",
       "freq                 1       1\n",
       "counts              16      16\n",
       "uniques             16      16\n",
       "missing              0       0\n",
       "missing_perc        0%      0%\n",
       "types           unique  unique"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>file</th>\n",
       "      <th>week</th>\n",
       "      <th>trend</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2072</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>63.8142</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>12.6502</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>55</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>64</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>72</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>100</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>counts</th>\n",
       "      <td>2072</td>\n",
       "      <td>2072</td>\n",
       "      <td>2072</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>uniques</th>\n",
       "      <td>14</td>\n",
       "      <td>148</td>\n",
       "      <td>68</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>missing</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>missing_perc</th>\n",
       "      <td>0%</td>\n",
       "      <td>0%</td>\n",
       "      <td>0%</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>types</th>\n",
       "      <td>categorical</td>\n",
       "      <td>categorical</td>\n",
       "      <td>numeric</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                     file         week    trend\n",
       "count                 NaN          NaN     2072\n",
       "mean                  NaN          NaN  63.8142\n",
       "std                   NaN          NaN  12.6502\n",
       "min                   NaN          NaN        0\n",
       "25%                   NaN          NaN       55\n",
       "50%                   NaN          NaN       64\n",
       "75%                   NaN          NaN       72\n",
       "max                   NaN          NaN      100\n",
       "counts               2072         2072     2072\n",
       "uniques                14          148       68\n",
       "missing                 0            0        0\n",
       "missing_perc           0%           0%       0%\n",
       "types         categorical  categorical  numeric"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>file</th>\n",
       "      <th>Date</th>\n",
       "      <th>Max_TemperatureC</th>\n",
       "      <th>Mean_TemperatureC</th>\n",
       "      <th>Min_TemperatureC</th>\n",
       "      <th>Dew_PointC</th>\n",
       "      <th>MeanDew_PointC</th>\n",
       "      <th>Min_DewpointC</th>\n",
       "      <th>Max_Humidity</th>\n",
       "      <th>Mean_Humidity</th>\n",
       "      <th>...</th>\n",
       "      <th>Max_VisibilityKm</th>\n",
       "      <th>Mean_VisibilityKm</th>\n",
       "      <th>Min_VisibilitykM</th>\n",
       "      <th>Max_Wind_SpeedKm_h</th>\n",
       "      <th>Mean_Wind_SpeedKm_h</th>\n",
       "      <th>Max_Gust_SpeedKm_h</th>\n",
       "      <th>Precipitationmm</th>\n",
       "      <th>CloudCover</th>\n",
       "      <th>Events</th>\n",
       "      <th>WindDirDegrees</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>15840</td>\n",
       "      <td>15840</td>\n",
       "      <td>15840</td>\n",
       "      <td>15840</td>\n",
       "      <td>15840</td>\n",
       "      <td>15840</td>\n",
       "      <td>15840</td>\n",
       "      <td>15840</td>\n",
       "      <td>...</td>\n",
       "      <td>15459</td>\n",
       "      <td>15459</td>\n",
       "      <td>15459</td>\n",
       "      <td>15840</td>\n",
       "      <td>15840</td>\n",
       "      <td>3604</td>\n",
       "      <td>15840</td>\n",
       "      <td>14667</td>\n",
       "      <td>NaN</td>\n",
       "      <td>15840</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>14.6441</td>\n",
       "      <td>10.389</td>\n",
       "      <td>6.19899</td>\n",
       "      <td>8.58782</td>\n",
       "      <td>6.20581</td>\n",
       "      <td>3.62614</td>\n",
       "      <td>93.6596</td>\n",
       "      <td>74.2829</td>\n",
       "      <td>...</td>\n",
       "      <td>24.0576</td>\n",
       "      <td>12.2398</td>\n",
       "      <td>7.02516</td>\n",
       "      <td>22.7666</td>\n",
       "      <td>11.9722</td>\n",
       "      <td>48.8643</td>\n",
       "      <td>0.831718</td>\n",
       "      <td>5.55131</td>\n",
       "      <td>NaN</td>\n",
       "      <td>175.897</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>8.64601</td>\n",
       "      <td>7.37926</td>\n",
       "      <td>6.52639</td>\n",
       "      <td>6.24478</td>\n",
       "      <td>6.08677</td>\n",
       "      <td>6.12839</td>\n",
       "      <td>7.67853</td>\n",
       "      <td>13.4866</td>\n",
       "      <td>...</td>\n",
       "      <td>8.9768</td>\n",
       "      <td>5.06794</td>\n",
       "      <td>4.9806</td>\n",
       "      <td>8.98862</td>\n",
       "      <td>5.87284</td>\n",
       "      <td>13.027</td>\n",
       "      <td>2.51351</td>\n",
       "      <td>1.68771</td>\n",
       "      <td>NaN</td>\n",
       "      <td>101.589</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>-11</td>\n",
       "      <td>-13</td>\n",
       "      <td>-15</td>\n",
       "      <td>-14</td>\n",
       "      <td>-15</td>\n",
       "      <td>-73</td>\n",
       "      <td>44</td>\n",
       "      <td>30</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>21</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>-1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>8</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "      <td>2</td>\n",
       "      <td>-1</td>\n",
       "      <td>90.75</td>\n",
       "      <td>65</td>\n",
       "      <td>...</td>\n",
       "      <td>14</td>\n",
       "      <td>10</td>\n",
       "      <td>3</td>\n",
       "      <td>16</td>\n",
       "      <td>8</td>\n",
       "      <td>39</td>\n",
       "      <td>0</td>\n",
       "      <td>5</td>\n",
       "      <td>NaN</td>\n",
       "      <td>80</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>15</td>\n",
       "      <td>11</td>\n",
       "      <td>7</td>\n",
       "      <td>9</td>\n",
       "      <td>7</td>\n",
       "      <td>4</td>\n",
       "      <td>94</td>\n",
       "      <td>76</td>\n",
       "      <td>...</td>\n",
       "      <td>31</td>\n",
       "      <td>11</td>\n",
       "      <td>7</td>\n",
       "      <td>21</td>\n",
       "      <td>11</td>\n",
       "      <td>48</td>\n",
       "      <td>0</td>\n",
       "      <td>6</td>\n",
       "      <td>NaN</td>\n",
       "      <td>202</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>21</td>\n",
       "      <td>16</td>\n",
       "      <td>11</td>\n",
       "      <td>13</td>\n",
       "      <td>11</td>\n",
       "      <td>8</td>\n",
       "      <td>100</td>\n",
       "      <td>85</td>\n",
       "      <td>...</td>\n",
       "      <td>31</td>\n",
       "      <td>14</td>\n",
       "      <td>10</td>\n",
       "      <td>27</td>\n",
       "      <td>14</td>\n",
       "      <td>55</td>\n",
       "      <td>0.25</td>\n",
       "      <td>7</td>\n",
       "      <td>NaN</td>\n",
       "      <td>256</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>39</td>\n",
       "      <td>31</td>\n",
       "      <td>24</td>\n",
       "      <td>25</td>\n",
       "      <td>20</td>\n",
       "      <td>19</td>\n",
       "      <td>100</td>\n",
       "      <td>100</td>\n",
       "      <td>...</td>\n",
       "      <td>31</td>\n",
       "      <td>31</td>\n",
       "      <td>31</td>\n",
       "      <td>101</td>\n",
       "      <td>53</td>\n",
       "      <td>111</td>\n",
       "      <td>58.93</td>\n",
       "      <td>8</td>\n",
       "      <td>NaN</td>\n",
       "      <td>360</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>counts</th>\n",
       "      <td>15840</td>\n",
       "      <td>15840</td>\n",
       "      <td>15840</td>\n",
       "      <td>15840</td>\n",
       "      <td>15840</td>\n",
       "      <td>15840</td>\n",
       "      <td>15840</td>\n",
       "      <td>15840</td>\n",
       "      <td>15840</td>\n",
       "      <td>15840</td>\n",
       "      <td>...</td>\n",
       "      <td>15459</td>\n",
       "      <td>15459</td>\n",
       "      <td>15459</td>\n",
       "      <td>15840</td>\n",
       "      <td>15840</td>\n",
       "      <td>3604</td>\n",
       "      <td>15840</td>\n",
       "      <td>14667</td>\n",
       "      <td>11889</td>\n",
       "      <td>15840</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>uniques</th>\n",
       "      <td>16</td>\n",
       "      <td>990</td>\n",
       "      <td>51</td>\n",
       "      <td>45</td>\n",
       "      <td>40</td>\n",
       "      <td>40</td>\n",
       "      <td>36</td>\n",
       "      <td>40</td>\n",
       "      <td>53</td>\n",
       "      <td>71</td>\n",
       "      <td>...</td>\n",
       "      <td>24</td>\n",
       "      <td>32</td>\n",
       "      <td>24</td>\n",
       "      <td>44</td>\n",
       "      <td>29</td>\n",
       "      <td>47</td>\n",
       "      <td>41</td>\n",
       "      <td>9</td>\n",
       "      <td>21</td>\n",
       "      <td>362</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>missing</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>381</td>\n",
       "      <td>381</td>\n",
       "      <td>381</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>12236</td>\n",
       "      <td>0</td>\n",
       "      <td>1173</td>\n",
       "      <td>3951</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>missing_perc</th>\n",
       "      <td>0%</td>\n",
       "      <td>0%</td>\n",
       "      <td>0%</td>\n",
       "      <td>0%</td>\n",
       "      <td>0%</td>\n",
       "      <td>0%</td>\n",
       "      <td>0%</td>\n",
       "      <td>0%</td>\n",
       "      <td>0%</td>\n",
       "      <td>0%</td>\n",
       "      <td>...</td>\n",
       "      <td>2.41%</td>\n",
       "      <td>2.41%</td>\n",
       "      <td>2.41%</td>\n",
       "      <td>0%</td>\n",
       "      <td>0%</td>\n",
       "      <td>77.25%</td>\n",
       "      <td>0%</td>\n",
       "      <td>7.41%</td>\n",
       "      <td>24.94%</td>\n",
       "      <td>0%</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>types</th>\n",
       "      <td>categorical</td>\n",
       "      <td>categorical</td>\n",
       "      <td>numeric</td>\n",
       "      <td>numeric</td>\n",
       "      <td>numeric</td>\n",
       "      <td>numeric</td>\n",
       "      <td>numeric</td>\n",
       "      <td>numeric</td>\n",
       "      <td>numeric</td>\n",
       "      <td>numeric</td>\n",
       "      <td>...</td>\n",
       "      <td>numeric</td>\n",
       "      <td>numeric</td>\n",
       "      <td>numeric</td>\n",
       "      <td>numeric</td>\n",
       "      <td>numeric</td>\n",
       "      <td>numeric</td>\n",
       "      <td>numeric</td>\n",
       "      <td>numeric</td>\n",
       "      <td>categorical</td>\n",
       "      <td>numeric</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>13 rows × 24 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                     file         Date Max_TemperatureC Mean_TemperatureC  \\\n",
       "count                 NaN          NaN            15840             15840   \n",
       "mean                  NaN          NaN          14.6441            10.389   \n",
       "std                   NaN          NaN          8.64601           7.37926   \n",
       "min                   NaN          NaN              -11               -13   \n",
       "25%                   NaN          NaN                8                 4   \n",
       "50%                   NaN          NaN               15                11   \n",
       "75%                   NaN          NaN               21                16   \n",
       "max                   NaN          NaN               39                31   \n",
       "counts              15840        15840            15840             15840   \n",
       "uniques                16          990               51                45   \n",
       "missing                 0            0                0                 0   \n",
       "missing_perc           0%           0%               0%                0%   \n",
       "types         categorical  categorical          numeric           numeric   \n",
       "\n",
       "             Min_TemperatureC Dew_PointC MeanDew_PointC Min_DewpointC  \\\n",
       "count                   15840      15840          15840         15840   \n",
       "mean                  6.19899    8.58782        6.20581       3.62614   \n",
       "std                   6.52639    6.24478        6.08677       6.12839   \n",
       "min                       -15        -14            -15           -73   \n",
       "25%                         1          4              2            -1   \n",
       "50%                         7          9              7             4   \n",
       "75%                        11         13             11             8   \n",
       "max                        24         25             20            19   \n",
       "counts                  15840      15840          15840         15840   \n",
       "uniques                    40         40             36            40   \n",
       "missing                     0          0              0             0   \n",
       "missing_perc               0%         0%             0%            0%   \n",
       "types                 numeric    numeric        numeric       numeric   \n",
       "\n",
       "             Max_Humidity Mean_Humidity      ...       Max_VisibilityKm  \\\n",
       "count               15840         15840      ...                  15459   \n",
       "mean              93.6596       74.2829      ...                24.0576   \n",
       "std               7.67853       13.4866      ...                 8.9768   \n",
       "min                    44            30      ...                      0   \n",
       "25%                 90.75            65      ...                     14   \n",
       "50%                    94            76      ...                     31   \n",
       "75%                   100            85      ...                     31   \n",
       "max                   100           100      ...                     31   \n",
       "counts              15840         15840      ...                  15459   \n",
       "uniques                53            71      ...                     24   \n",
       "missing                 0             0      ...                    381   \n",
       "missing_perc           0%            0%      ...                  2.41%   \n",
       "types             numeric       numeric      ...                numeric   \n",
       "\n",
       "             Mean_VisibilityKm Min_VisibilitykM Max_Wind_SpeedKm_h  \\\n",
       "count                    15459            15459              15840   \n",
       "mean                   12.2398          7.02516            22.7666   \n",
       "std                    5.06794           4.9806            8.98862   \n",
       "min                          0                0                  3   \n",
       "25%                         10                3                 16   \n",
       "50%                         11                7                 21   \n",
       "75%                         14               10                 27   \n",
       "max                         31               31                101   \n",
       "counts                   15459            15459              15840   \n",
       "uniques                     32               24                 44   \n",
       "missing                    381              381                  0   \n",
       "missing_perc             2.41%            2.41%                 0%   \n",
       "types                  numeric          numeric            numeric   \n",
       "\n",
       "             Mean_Wind_SpeedKm_h Max_Gust_SpeedKm_h Precipitationmm  \\\n",
       "count                      15840               3604           15840   \n",
       "mean                     11.9722            48.8643        0.831718   \n",
       "std                      5.87284             13.027         2.51351   \n",
       "min                            2                 21               0   \n",
       "25%                            8                 39               0   \n",
       "50%                           11                 48               0   \n",
       "75%                           14                 55            0.25   \n",
       "max                           53                111           58.93   \n",
       "counts                     15840               3604           15840   \n",
       "uniques                       29                 47              41   \n",
       "missing                        0              12236               0   \n",
       "missing_perc                  0%             77.25%              0%   \n",
       "types                    numeric            numeric         numeric   \n",
       "\n",
       "             CloudCover       Events WindDirDegrees  \n",
       "count             14667          NaN          15840  \n",
       "mean            5.55131          NaN        175.897  \n",
       "std             1.68771          NaN        101.589  \n",
       "min                   0          NaN             -1  \n",
       "25%                   5          NaN             80  \n",
       "50%                   6          NaN            202  \n",
       "75%                   7          NaN            256  \n",
       "max                   8          NaN            360  \n",
       "counts            14667        11889          15840  \n",
       "uniques               9           21            362  \n",
       "missing            1173         3951              0  \n",
       "missing_perc      7.41%       24.94%             0%  \n",
       "types           numeric  categorical        numeric  \n",
       "\n",
       "[13 rows x 24 columns]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Id</th>\n",
       "      <th>Store</th>\n",
       "      <th>DayOfWeek</th>\n",
       "      <th>Date</th>\n",
       "      <th>Open</th>\n",
       "      <th>Promo</th>\n",
       "      <th>StateHoliday</th>\n",
       "      <th>SchoolHoliday</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>41088</td>\n",
       "      <td>41088</td>\n",
       "      <td>41088</td>\n",
       "      <td>NaN</td>\n",
       "      <td>41077</td>\n",
       "      <td>41088</td>\n",
       "      <td>NaN</td>\n",
       "      <td>41088</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>20544.5</td>\n",
       "      <td>555.9</td>\n",
       "      <td>3.97917</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.854322</td>\n",
       "      <td>0.395833</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.443487</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>11861.2</td>\n",
       "      <td>320.274</td>\n",
       "      <td>2.01548</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.352787</td>\n",
       "      <td>0.489035</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.496802</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>10272.8</td>\n",
       "      <td>279.75</td>\n",
       "      <td>2</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>20544.5</td>\n",
       "      <td>553.5</td>\n",
       "      <td>4</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>30816.2</td>\n",
       "      <td>832.25</td>\n",
       "      <td>6</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>41088</td>\n",
       "      <td>1115</td>\n",
       "      <td>7</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>counts</th>\n",
       "      <td>41088</td>\n",
       "      <td>41088</td>\n",
       "      <td>41088</td>\n",
       "      <td>41088</td>\n",
       "      <td>41077</td>\n",
       "      <td>41088</td>\n",
       "      <td>41088</td>\n",
       "      <td>41088</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>uniques</th>\n",
       "      <td>41088</td>\n",
       "      <td>856</td>\n",
       "      <td>7</td>\n",
       "      <td>48</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>missing</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>11</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>missing_perc</th>\n",
       "      <td>0%</td>\n",
       "      <td>0%</td>\n",
       "      <td>0%</td>\n",
       "      <td>0%</td>\n",
       "      <td>0.03%</td>\n",
       "      <td>0%</td>\n",
       "      <td>0%</td>\n",
       "      <td>0%</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>types</th>\n",
       "      <td>numeric</td>\n",
       "      <td>numeric</td>\n",
       "      <td>numeric</td>\n",
       "      <td>categorical</td>\n",
       "      <td>bool</td>\n",
       "      <td>bool</td>\n",
       "      <td>bool</td>\n",
       "      <td>bool</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                   Id    Store DayOfWeek         Date      Open     Promo  \\\n",
       "count           41088    41088     41088          NaN     41077     41088   \n",
       "mean          20544.5    555.9   3.97917          NaN  0.854322  0.395833   \n",
       "std           11861.2  320.274   2.01548          NaN  0.352787  0.489035   \n",
       "min                 1        1         1          NaN         0         0   \n",
       "25%           10272.8   279.75         2          NaN         1         0   \n",
       "50%           20544.5    553.5         4          NaN         1         0   \n",
       "75%           30816.2   832.25         6          NaN         1         1   \n",
       "max             41088     1115         7          NaN         1         1   \n",
       "counts          41088    41088     41088        41088     41077     41088   \n",
       "uniques         41088      856         7           48         2         2   \n",
       "missing             0        0         0            0        11         0   \n",
       "missing_perc       0%       0%        0%           0%     0.03%        0%   \n",
       "types         numeric  numeric   numeric  categorical      bool      bool   \n",
       "\n",
       "             StateHoliday SchoolHoliday  \n",
       "count                 NaN         41088  \n",
       "mean                  NaN      0.443487  \n",
       "std                   NaN      0.496802  \n",
       "min                   NaN             0  \n",
       "25%                   NaN             0  \n",
       "50%                   NaN             0  \n",
       "75%                   NaN             1  \n",
       "max                   NaN             1  \n",
       "counts              41088         41088  \n",
       "uniques                 2             2  \n",
       "missing                 0             0  \n",
       "missing_perc           0%            0%  \n",
       "types                bool          bool  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "for t in tables: display(DataFrameSummary(t).summary())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Cleaning / Feature Engineering"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a structured data problem, we necessarily have to go through all the cleaning and feature engineering, even though we're using a neural network."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "train, store, store_states, state_names, googletrend, weather, test = tables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(1017209, 41088)"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(train), len(test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We turn state Holidays to booleans, to make them more convenient for modeling. We can do calculations on pandas fields using notation very similar (often identical) to numpy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "train.StateHoliday = train.StateHoliday != '0'\n",
    "test.StateHoliday = test.StateHoliday != '0'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`join_df` is a function for joining tables on specific fields. By default, we'll be doing a left outer join of `right` on the `left` argument using the given fields for each table.\n",
    "\n",
    "Pandas does joins using the `merge` method. The `suffixes` argument describes the naming convention for duplicate fields. We've elected to leave the duplicate field names on the left untouched, and append a \"\\_y\" to those on the right."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "def join_df(left, right, left_on, right_on=None, suffix='_y'):\n",
    "    if right_on is None: right_on = left_on\n",
    "    return left.merge(right, how='left', left_on=left_on, right_on=right_on, \n",
    "                      suffixes=(\"\", suffix))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Join weather/state names."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "weather = join_df(weather, state_names, \"file\", \"StateName\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In pandas you can add new columns to a dataframe by simply defining it. We'll do this for googletrends by extracting dates and state names from the given data and adding those columns.\n",
    "\n",
    "We're also going to replace all instances of state name 'NI' to match the usage in the rest of the data: 'HB,NI'. This is a good opportunity to highlight pandas indexing. We can use `.loc[rows, cols]` to select a list of rows and a list of columns from the dataframe. In this case, we're selecting rows w/ statename 'NI' by using a boolean list `googletrend.State=='NI'` and selecting \"State\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "googletrend['Date'] = googletrend.week.str.split(' - ', expand=True)[0]\n",
    "googletrend['State'] = googletrend.file.str.split('_', expand=True)[2]\n",
    "googletrend.loc[googletrend.State=='NI', \"State\"] = 'HB,NI'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following extracts particular date fields from a complete datetime for the purpose of constructing categoricals.\n",
    "\n",
    "You should *always* consider this feature extraction step when working with date-time. Without expanding your date-time into these additional fields, you can't capture any trend/cyclical behavior as a function of time at any of these granularities. We'll add to every table with a date field."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "add_datepart(weather, \"Date\", drop=False)\n",
    "add_datepart(googletrend, \"Date\", drop=False)\n",
    "add_datepart(train, \"Date\", drop=False)\n",
    "add_datepart(test, \"Date\", drop=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The Google trends data has a special category for the whole of the Germany - we'll pull that out so we can use it explicitly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "trend_de = googletrend[googletrend.file == 'Rossmann_DE']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can outer join all of our data into a single dataframe. Recall that in outer joins everytime a value in the joining field on the left table does not have a corresponding value on the right table, the corresponding row in the new table has Null values for all right table fields. One way to check that all records are consistent and complete is to check for Null values post-join, as we do here.\n",
    "\n",
    "*Aside*: Why note just do an inner join?\n",
    "If you are assuming that all records are complete and match on the field you desire, an inner join will do the same thing as an outer join. However, in the event you are wrong or a mistake is made, an outer join followed by a null-check will catch it. (Comparing before/after # of rows for inner join is equivalent, but requires keeping track of before/after row #'s. Outer join is easier.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "store = join_df(store, store_states, \"Store\")\n",
    "len(store[store.State.isnull()])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(0, 0)"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "joined = join_df(train, store, \"Store\")\n",
    "joined_test = join_df(test, store, \"Store\")\n",
    "len(joined[joined.StoreType.isnull()]),len(joined_test[joined_test.StoreType.isnull()])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(0, 0)"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "joined = join_df(joined, googletrend, [\"State\",\"Year\", \"Week\"])\n",
    "joined_test = join_df(joined_test, googletrend, [\"State\",\"Year\", \"Week\"])\n",
    "len(joined[joined.trend.isnull()]),len(joined_test[joined_test.trend.isnull()])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(0, 0)"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "joined = joined.merge(trend_de, 'left', [\"Year\", \"Week\"], suffixes=('', '_DE'))\n",
    "joined_test = joined_test.merge(trend_de, 'left', [\"Year\", \"Week\"], suffixes=('', '_DE'))\n",
    "len(joined[joined.trend_DE.isnull()]),len(joined_test[joined_test.trend_DE.isnull()])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(0, 0)"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "joined = join_df(joined, weather, [\"State\",\"Date\"])\n",
    "joined_test = join_df(joined_test, weather, [\"State\",\"Date\"])\n",
    "len(joined[joined.Mean_TemperatureC.isnull()]),len(joined_test[joined_test.Mean_TemperatureC.isnull()])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "for df in (joined, joined_test):\n",
    "    for c in df.columns:\n",
    "        if c.endswith('_y'):\n",
    "            if c in df.columns: df.drop(c, inplace=True, axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next we'll fill in missing values to avoid complications with `NA`'s. `NA` (not available) is how Pandas indicates missing values; many models have problems when missing values are present, so it's always important to think about how to deal with them. In these cases, we are picking an arbitrary *signal value* that doesn't otherwise appear in the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "for df in (joined,joined_test):\n",
    "    df['CompetitionOpenSinceYear'] = df.CompetitionOpenSinceYear.fillna(1900).astype(np.int32)\n",
    "    df['CompetitionOpenSinceMonth'] = df.CompetitionOpenSinceMonth.fillna(1).astype(np.int32)\n",
    "    df['Promo2SinceYear'] = df.Promo2SinceYear.fillna(1900).astype(np.int32)\n",
    "    df['Promo2SinceWeek'] = df.Promo2SinceWeek.fillna(1).astype(np.int32)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next we'll extract features \"CompetitionOpenSince\" and \"CompetitionDaysOpen\". Note the use of `apply()` in mapping a function across dataframe values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "for df in (joined,joined_test):\n",
    "    df[\"CompetitionOpenSince\"] = pd.to_datetime(dict(year=df.CompetitionOpenSinceYear, \n",
    "                                                     month=df.CompetitionOpenSinceMonth, day=15))\n",
    "    df[\"CompetitionDaysOpen\"] = df.Date.subtract(df.CompetitionOpenSince).dt.days"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll replace some erroneous / outlying data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "for df in (joined,joined_test):\n",
    "    df.loc[df.CompetitionDaysOpen < 0, \"CompetitionDaysOpen\"] = 0\n",
    "    df.loc[df.CompetitionOpenSinceYear < 1990, \"CompetitionDaysOpen\"] = 0"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We add \"CompetitionMonthsOpen\" field, limiting the maximum to 2 years to limit number of unique categories."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([24,  3, 19,  9,  0, 16, 17,  7, 15, 22, 11, 13,  2, 23, 12,  4, 10,  1, 14, 20,  8, 18,  6, 21,  5])"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "for df in (joined,joined_test):\n",
    "    df[\"CompetitionMonthsOpen\"] = df[\"CompetitionDaysOpen\"] // 30\n",
    "    df.loc[df.CompetitionMonthsOpen > 24, \"CompetitionMonthsOpen\"] = 24\n",
    "joined.CompetitionMonthsOpen.unique()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Same process for Promo dates."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [],
   "source": [
    "for df in (joined,joined_test):\n",
    "    df[\"Promo2Since\"] = pd.to_datetime(df.apply(lambda x: Week(\n",
    "        x.Promo2SinceYear, x.Promo2SinceWeek).monday(), axis=1).astype(pd.datetime))\n",
    "    df[\"Promo2Days\"] = df.Date.subtract(df[\"Promo2Since\"]).dt.days"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "for df in (joined,joined_test):\n",
    "    df.loc[df.Promo2Days < 0, \"Promo2Days\"] = 0\n",
    "    df.loc[df.Promo2SinceYear < 1990, \"Promo2Days\"] = 0\n",
    "    df[\"Promo2Weeks\"] = df[\"Promo2Days\"] // 7\n",
    "    df.loc[df.Promo2Weeks < 0, \"Promo2Weeks\"] = 0\n",
    "    df.loc[df.Promo2Weeks > 25, \"Promo2Weeks\"] = 25\n",
    "    df.Promo2Weeks.unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
    "joined.to_feather(f'{PATH}joined')\n",
    "joined_test.to_feather(f'{PATH}joined_test')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Durations"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It is common when working with time series data to extract data that explains relationships across rows as opposed to columns, e.g.:\n",
    "* Running averages\n",
    "* Time until next event\n",
    "* Time since last event\n",
    "\n",
    "This is often difficult to do with most table manipulation frameworks, since they are designed to work with relationships across columns. As such, we've created a class to handle this type of data.\n",
    "\n",
    "We'll define a function `get_elapsed` for cumulative counting across a sorted dataframe. Given a particular field `fld` to monitor, this function will start tracking time since the last occurrence of that field. When the field is seen again, the counter is set to zero.\n",
    "\n",
    "Upon initialization, this will result in datetime na's until the field is encountered. This is reset every time a new store is seen. We'll see how to use this shortly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_elapsed(df, fld, pre):\n",
    "    day1 = np.timedelta64(1, 'D')\n",
    "    last_date = np.datetime64()\n",
    "    last_store = 0\n",
    "    res = []\n",
    "\n",
    "    for s,v,d in zip(df.Store.values,df[fld].values, df.Date.values):\n",
    "        if s != last_store:\n",
    "            last_date = np.datetime64()\n",
    "            last_store = s\n",
    "        if v: last_date = d\n",
    "        res.append(((d-last_date).astype('timedelta64[D]') / day1))\n",
    "    df[pre+fld] = res"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll be applying this to a subset of columns:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [],
   "source": [
    "columns = [\"Date\", \"Store\", \"Promo\", \"StateHoliday\", \"SchoolHoliday\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = train[columns]\n",
    "df_test = test[columns]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's walk through an example.\n",
    "\n",
    "Say we're looking at School Holiday. We'll first sort by Store, then Date, and then call `add_elapsed('SchoolHoliday', 'After')`:\n",
    "This will apply to each row with School Holiday:\n",
    "* A applied to every row of the dataframe in order of store and date\n",
    "* Will add to the dataframe the days since seeing a School Holiday\n",
    "* If we sort in the other direction, this will count the days until another holiday."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [],
   "source": [
    "fld = 'SchoolHoliday'\n",
    "df = df.sort_values(['Store', 'Date'])\n",
    "get_elapsed(df, fld, 'After')\n",
    "df = df.sort_values(['Store', 'Date'], ascending=[True, False])\n",
    "get_elapsed(df, fld, 'Before')\n",
    "\n",
    "fld = 'SchoolHoliday'\n",
    "df_test = df_test.sort_values(['Store', 'Date'])\n",
    "get_elapsed(df_test, fld, 'After')\n",
    "df_test = df_test.sort_values(['Store', 'Date'], ascending=[True, False])\n",
    "get_elapsed(df_test, fld, 'Before')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll do this for two more fields."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [],
   "source": [
    "fld = 'StateHoliday'\n",
    "df = df.sort_values(['Store', 'Date'])\n",
    "get_elapsed(df, fld, 'After')\n",
    "df = df.sort_values(['Store', 'Date'], ascending=[True, False])\n",
    "get_elapsed(df, fld, 'Before')\n",
    "\n",
    "fld = 'StateHoliday'\n",
    "df_test = df_test.sort_values(['Store', 'Date'])\n",
    "get_elapsed(df_test, fld, 'After')\n",
    "df_test = df_test.sort_values(['Store', 'Date'], ascending=[True, False])\n",
    "get_elapsed(df_test, fld, 'Before')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [],
   "source": [
    "fld = 'Promo'\n",
    "df = df.sort_values(['Store', 'Date'])\n",
    "get_elapsed(df, fld, 'After')\n",
    "df = df.sort_values(['Store', 'Date'], ascending=[True, False])\n",
    "get_elapsed(df, fld, 'Before')\n",
    "\n",
    "fld = 'Promo'\n",
    "df_test = df_test.sort_values(['Store', 'Date'])\n",
    "get_elapsed(df_test, fld, 'After')\n",
    "df_test = df_test.sort_values(['Store', 'Date'], ascending=[True, False])\n",
    "get_elapsed(df_test, fld, 'Before')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We're going to set the active index to Date."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = df.set_index('Date')\n",
    "df_test = df_test.set_index('Date')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then set null values from elapsed field calculations to 0."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [],
   "source": [
    "columns = ['SchoolHoliday', 'StateHoliday', 'Promo']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [],
   "source": [
    "for o in ['Before', 'After']:\n",
    "    for p in columns:\n",
    "        a = o+p\n",
    "        df[a] = df[a].fillna(0).astype(int)\n",
    "        df_test[a] = df_test[a].fillna(0).astype(int)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next we'll demonstrate window functions in pandas to calculate rolling quantities.\n",
    "\n",
    "Here we're sorting by date (`sort_index()`) and counting the number of events of interest (`sum()`) defined in `columns` in the following week (`rolling()`), grouped by Store (`groupby()`). We do the same in the opposite direction."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [],
   "source": [
    "bwd = df[['Store']+columns].sort_index().groupby(\"Store\").rolling(7, min_periods=1).sum()\n",
    "bwd_test = df_test[['Store']+columns].sort_index().groupby(\"Store\").rolling(7, min_periods=1).sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [],
   "source": [
    "fwd = df[['Store']+columns].sort_index(ascending=False\n",
    "                                      ).groupby(\"Store\").rolling(7, min_periods=1).sum()\n",
    "fwd_test = df_test[['Store']+columns].sort_index(ascending=False\n",
    "                                      ).groupby(\"Store\").rolling(7, min_periods=1).sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next we want to drop the Store indices grouped together in the window function.\n",
    "\n",
    "Often in pandas, there is an option to do this in place. This is time and memory efficient when working with large datasets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [],
   "source": [
    "bwd.drop('Store',1,inplace=True)\n",
    "bwd.reset_index(inplace=True)\n",
    "\n",
    "bwd_test.drop('Store',1,inplace=True)\n",
    "bwd_test.reset_index(inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [],
   "source": [
    "fwd.drop('Store',1,inplace=True)\n",
    "fwd.reset_index(inplace=True)\n",
    "\n",
    "fwd_test.drop('Store',1,inplace=True)\n",
    "fwd_test.reset_index(inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.reset_index(inplace=True)\n",
    "df_test.reset_index(inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we'll merge these values onto the df."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = df.merge(bwd, 'left', ['Date', 'Store'], suffixes=['', '_bw'])\n",
    "df = df.merge(fwd, 'left', ['Date', 'Store'], suffixes=['', '_fw'])\n",
    "\n",
    "df_test = df_test.merge(bwd_test, 'left', ['Date', 'Store'], suffixes=['', '_bw'])\n",
    "df_test = df_test.merge(fwd_test, 'left', ['Date', 'Store'], suffixes=['', '_fw'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.drop(columns,1,inplace=True)\n",
    "df_test.drop(columns,1,inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Date</th>\n",
       "      <th>Store</th>\n",
       "      <th>AfterSchoolHoliday</th>\n",
       "      <th>BeforeSchoolHoliday</th>\n",
       "      <th>AfterStateHoliday</th>\n",
       "      <th>BeforeStateHoliday</th>\n",
       "      <th>AfterPromo</th>\n",
       "      <th>BeforePromo</th>\n",
       "      <th>SchoolHoliday_bw</th>\n",
       "      <th>StateHoliday_bw</th>\n",
       "      <th>Promo_bw</th>\n",
       "      <th>SchoolHoliday_fw</th>\n",
       "      <th>StateHoliday_fw</th>\n",
       "      <th>Promo_fw</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2015-07-31</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>57</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2015-07-30</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>56</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2015-07-29</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>55</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2015-07-28</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>54</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>4.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2015-07-27</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>53</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>5.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        Date  Store  AfterSchoolHoliday  BeforeSchoolHoliday  \\\n",
       "0 2015-07-31      1                   0                    0   \n",
       "1 2015-07-30      1                   0                    0   \n",
       "2 2015-07-29      1                   0                    0   \n",
       "3 2015-07-28      1                   0                    0   \n",
       "4 2015-07-27      1                   0                    0   \n",
       "\n",
       "   AfterStateHoliday  BeforeStateHoliday  AfterPromo  BeforePromo  \\\n",
       "0                 57                   0           0            0   \n",
       "1                 56                   0           0            0   \n",
       "2                 55                   0           0            0   \n",
       "3                 54                   0           0            0   \n",
       "4                 53                   0           0            0   \n",
       "\n",
       "   SchoolHoliday_bw  StateHoliday_bw  Promo_bw  SchoolHoliday_fw  \\\n",
       "0               5.0              0.0       5.0               1.0   \n",
       "1               4.0              0.0       4.0               2.0   \n",
       "2               3.0              0.0       3.0               3.0   \n",
       "3               2.0              0.0       2.0               4.0   \n",
       "4               1.0              0.0       1.0               5.0   \n",
       "\n",
       "   StateHoliday_fw  Promo_fw  \n",
       "0              0.0       1.0  \n",
       "1              0.0       2.0  \n",
       "2              0.0       3.0  \n",
       "3              0.0       4.0  \n",
       "4              0.0       5.0  "
      ]
     },
     "execution_count": 58,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It's usually a good idea to back up large tables of extracted / wrangled features before you join them onto another one, that way you can go back to it easily if you need to make changes to it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.to_feather(f'{PATH}df')\n",
    "df_test.to_feather(f'{PATH}df_test')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_feather(f'{PATH}df')\n",
    "df_test = pd.read_feather(f'{PATH}df_test')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['Date'] = pd.to_datetime(df.Date)\n",
    "df_test['Date'] = pd.to_datetime(df_test.Date)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['Date', 'Store', 'AfterSchoolHoliday', 'BeforeSchoolHoliday',\n",
       "       'AfterStateHoliday', 'BeforeStateHoliday', 'AfterPromo', 'BeforePromo',\n",
       "       'SchoolHoliday_bw', 'StateHoliday_bw', 'Promo_bw', 'SchoolHoliday_fw',\n",
       "       'StateHoliday_fw', 'Promo_fw'],\n",
       "      dtype='object')"
      ]
     },
     "execution_count": 62,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {},
   "outputs": [],
   "source": [
    "joined = join_df(joined, df, ['Store', 'Date'])\n",
    "joined_test = join_df(joined_test, df_test, ['Store', 'Date'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The authors also removed all instances where the store had zero sale / was closed. We speculate that this may have cost them a higher standing in the competition. One reason this may be the case is that a little exploratory data analysis reveals that there are often periods where stores are closed, typically for refurbishment. Before and after these periods, there are naturally spikes in sales that one might expect. By ommitting this data from their training, the authors gave up the ability to leverage information about these periods to predict this otherwise volatile behavior."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {},
   "outputs": [],
   "source": [
    "joined = joined[joined.Sales != 0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll back this up as well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {},
   "outputs": [],
   "source": [
    "joined.reset_index(inplace=True)\n",
    "joined_test.reset_index(inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {},
   "outputs": [],
   "source": [
    "joined.to_feather(f'{PATH}joined')\n",
    "joined_test.to_feather(f'{PATH}joined_test')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now have our final set of engineered features.\n",
    "\n",
    "While these steps were explicitly outlined in the paper, these are all fairly typical feature engineering steps for dealing with time series data and are practical in any similar setting."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true,
    "heading_collapsed": true
   },
   "source": [
    "## Create features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "joined = pd.read_feather(f'{PATH}joined')\n",
    "joined_test = pd.read_feather(f'{PATH}joined_test')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {
    "hidden": true,
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "      <th>4</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>index</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Store</th>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "      <td>4</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>DayOfWeek</th>\n",
       "      <td>5</td>\n",
       "      <td>5</td>\n",
       "      <td>5</td>\n",
       "      <td>5</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Date</th>\n",
       "      <td>2015-07-31 00:00:00</td>\n",
       "      <td>2015-07-31 00:00:00</td>\n",
       "      <td>2015-07-31 00:00:00</td>\n",
       "      <td>2015-07-31 00:00:00</td>\n",
       "      <td>2015-07-31 00:00:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Sales</th>\n",
       "      <td>5263</td>\n",
       "      <td>6064</td>\n",
       "      <td>8314</td>\n",
       "      <td>13995</td>\n",
       "      <td>4822</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Customers</th>\n",
       "      <td>555</td>\n",
       "      <td>625</td>\n",
       "      <td>821</td>\n",
       "      <td>1498</td>\n",
       "      <td>559</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Open</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Promo</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>StateHoliday</th>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>SchoolHoliday</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Year</th>\n",
       "      <td>2015</td>\n",
       "      <td>2015</td>\n",
       "      <td>2015</td>\n",
       "      <td>2015</td>\n",
       "      <td>2015</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Month</th>\n",
       "      <td>7</td>\n",
       "      <td>7</td>\n",
       "      <td>7</td>\n",
       "      <td>7</td>\n",
       "      <td>7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Week</th>\n",
       "      <td>31</td>\n",
       "      <td>31</td>\n",
       "      <td>31</td>\n",
       "      <td>31</td>\n",
       "      <td>31</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Day</th>\n",
       "      <td>31</td>\n",
       "      <td>31</td>\n",
       "      <td>31</td>\n",
       "      <td>31</td>\n",
       "      <td>31</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Dayofweek</th>\n",
       "      <td>4</td>\n",
       "      <td>4</td>\n",
       "      <td>4</td>\n",
       "      <td>4</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Dayofyear</th>\n",
       "      <td>212</td>\n",
       "      <td>212</td>\n",
       "      <td>212</td>\n",
       "      <td>212</td>\n",
       "      <td>212</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Is_month_end</th>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Is_month_start</th>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Is_quarter_end</th>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Is_quarter_start</th>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Is_year_end</th>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Is_year_start</th>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Elapsed</th>\n",
       "      <td>1438300800</td>\n",
       "      <td>1438300800</td>\n",
       "      <td>1438300800</td>\n",
       "      <td>1438300800</td>\n",
       "      <td>1438300800</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>StoreType</th>\n",
       "      <td>c</td>\n",
       "      <td>a</td>\n",
       "      <td>a</td>\n",
       "      <td>c</td>\n",
       "      <td>a</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Assortment</th>\n",
       "      <td>a</td>\n",
       "      <td>a</td>\n",
       "      <td>a</td>\n",
       "      <td>c</td>\n",
       "      <td>a</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>CompetitionDistance</th>\n",
       "      <td>1270</td>\n",
       "      <td>570</td>\n",
       "      <td>14130</td>\n",
       "      <td>620</td>\n",
       "      <td>29910</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>CompetitionOpenSinceMonth</th>\n",
       "      <td>9</td>\n",
       "      <td>11</td>\n",
       "      <td>12</td>\n",
       "      <td>9</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>CompetitionOpenSinceYear</th>\n",
       "      <td>2008</td>\n",
       "      <td>2007</td>\n",
       "      <td>2006</td>\n",
       "      <td>2009</td>\n",
       "      <td>2015</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Promo2</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Promo2SinceWeek</th>\n",
       "      <td>1</td>\n",
       "      <td>13</td>\n",
       "      <td>14</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Promo2SinceYear</th>\n",
       "      <td>1900</td>\n",
       "      <td>2010</td>\n",
       "      <td>2011</td>\n",
       "      <td>1900</td>\n",
       "      <td>1900</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>PromoInterval</th>\n",
       "      <td>None</td>\n",
       "      <td>Jan,Apr,Jul,Oct</td>\n",
       "      <td>Jan,Apr,Jul,Oct</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>State</th>\n",
       "      <td>HE</td>\n",
       "      <td>TH</td>\n",
       "      <td>NW</td>\n",
       "      <td>BE</td>\n",
       "      <td>SN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>file</th>\n",
       "      <td>Rossmann_DE_HE</td>\n",
       "      <td>Rossmann_DE_TH</td>\n",
       "      <td>Rossmann_DE_NW</td>\n",
       "      <td>Rossmann_DE_BE</td>\n",
       "      <td>Rossmann_DE_SN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>week</th>\n",
       "      <td>2015-08-02 - 2015-08-08</td>\n",
       "      <td>2015-08-02 - 2015-08-08</td>\n",
       "      <td>2015-08-02 - 2015-08-08</td>\n",
       "      <td>2015-08-02 - 2015-08-08</td>\n",
       "      <td>2015-08-02 - 2015-08-08</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>trend</th>\n",
       "      <td>85</td>\n",
       "      <td>80</td>\n",
       "      <td>86</td>\n",
       "      <td>74</td>\n",
       "      <td>82</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>file_DE</th>\n",
       "      <td>Rossmann_DE</td>\n",
       "      <td>Rossmann_DE</td>\n",
       "      <td>Rossmann_DE</td>\n",
       "      <td>Rossmann_DE</td>\n",
       "      <td>Rossmann_DE</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>week_DE</th>\n",
       "      <td>2015-08-02 - 2015-08-08</td>\n",
       "      <td>2015-08-02 - 2015-08-08</td>\n",
       "      <td>2015-08-02 - 2015-08-08</td>\n",
       "      <td>2015-08-02 - 2015-08-08</td>\n",
       "      <td>2015-08-02 - 2015-08-08</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>trend_DE</th>\n",
       "      <td>83</td>\n",
       "      <td>83</td>\n",
       "      <td>83</td>\n",
       "      <td>83</td>\n",
       "      <td>83</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Date_DE</th>\n",
       "      <td>2015-08-02 00:00:00</td>\n",
       "      <td>2015-08-02 00:00:00</td>\n",
       "      <td>2015-08-02 00:00:00</td>\n",
       "      <td>2015-08-02 00:00:00</td>\n",
       "      <td>2015-08-02 00:00:00</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                 0                        1  \\\n",
       "index                                            0                        1   \n",
       "Store                                            1                        2   \n",
       "DayOfWeek                                        5                        5   \n",
       "Date                           2015-07-31 00:00:00      2015-07-31 00:00:00   \n",
       "Sales                                         5263                     6064   \n",
       "Customers                                      555                      625   \n",
       "Open                                             1                        1   \n",
       "Promo                                            1                        1   \n",
       "StateHoliday                                 False                    False   \n",
       "SchoolHoliday                                    1                        1   \n",
       "Year                                          2015                     2015   \n",
       "Month                                            7                        7   \n",
       "Week                                            31                       31   \n",
       "Day                                             31                       31   \n",
       "Dayofweek                                        4                        4   \n",
       "Dayofyear                                      212                      212   \n",
       "Is_month_end                                  True                     True   \n",
       "Is_month_start                               False                    False   \n",
       "Is_quarter_end                               False                    False   \n",
       "Is_quarter_start                             False                    False   \n",
       "Is_year_end                                  False                    False   \n",
       "Is_year_start                                False                    False   \n",
       "Elapsed                                 1438300800               1438300800   \n",
       "StoreType                                        c                        a   \n",
       "Assortment                                       a                        a   \n",
       "CompetitionDistance                           1270                      570   \n",
       "CompetitionOpenSinceMonth                        9                       11   \n",
       "CompetitionOpenSinceYear                      2008                     2007   \n",
       "Promo2                                           0                        1   \n",
       "Promo2SinceWeek                                  1                       13   \n",
       "Promo2SinceYear                               1900                     2010   \n",
       "PromoInterval                                 None          Jan,Apr,Jul,Oct   \n",
       "State                                           HE                       TH   \n",
       "file                                Rossmann_DE_HE           Rossmann_DE_TH   \n",
       "week                       2015-08-02 - 2015-08-08  2015-08-02 - 2015-08-08   \n",
       "trend                                           85                       80   \n",
       "file_DE                                Rossmann_DE              Rossmann_DE   \n",
       "week_DE                    2015-08-02 - 2015-08-08  2015-08-02 - 2015-08-08   \n",
       "trend_DE                                        83                       83   \n",
       "Date_DE                        2015-08-02 00:00:00      2015-08-02 00:00:00   \n",
       "\n",
       "                                                 2                        3  \\\n",
       "index                                            2                        3   \n",
       "Store                                            3                        4   \n",
       "DayOfWeek                                        5                        5   \n",
       "Date                           2015-07-31 00:00:00      2015-07-31 00:00:00   \n",
       "Sales                                         8314                    13995   \n",
       "Customers                                      821                     1498   \n",
       "Open                                             1                        1   \n",
       "Promo                                            1                        1   \n",
       "StateHoliday                                 False                    False   \n",
       "SchoolHoliday                                    1                        1   \n",
       "Year                                          2015                     2015   \n",
       "Month                                            7                        7   \n",
       "Week                                            31                       31   \n",
       "Day                                             31                       31   \n",
       "Dayofweek                                        4                        4   \n",
       "Dayofyear                                      212                      212   \n",
       "Is_month_end                                  True                     True   \n",
       "Is_month_start                               False                    False   \n",
       "Is_quarter_end                               False                    False   \n",
       "Is_quarter_start                             False                    False   \n",
       "Is_year_end                                  False                    False   \n",
       "Is_year_start                                False                    False   \n",
       "Elapsed                                 1438300800               1438300800   \n",
       "StoreType                                        a                        c   \n",
       "Assortment                                       a                        c   \n",
       "CompetitionDistance                          14130                      620   \n",
       "CompetitionOpenSinceMonth                       12                        9   \n",
       "CompetitionOpenSinceYear                      2006                     2009   \n",
       "Promo2                                           1                        0   \n",
       "Promo2SinceWeek                                 14                        1   \n",
       "Promo2SinceYear                               2011                     1900   \n",
       "PromoInterval                      Jan,Apr,Jul,Oct                     None   \n",
       "State                                           NW                       BE   \n",
       "file                                Rossmann_DE_NW           Rossmann_DE_BE   \n",
       "week                       2015-08-02 - 2015-08-08  2015-08-02 - 2015-08-08   \n",
       "trend                                           86                       74   \n",
       "file_DE                                Rossmann_DE              Rossmann_DE   \n",
       "week_DE                    2015-08-02 - 2015-08-08  2015-08-02 - 2015-08-08   \n",
       "trend_DE                                        83                       83   \n",
       "Date_DE                        2015-08-02 00:00:00      2015-08-02 00:00:00   \n",
       "\n",
       "                                                 4  \n",
       "index                                            4  \n",
       "Store                                            5  \n",
       "DayOfWeek                                        5  \n",
       "Date                           2015-07-31 00:00:00  \n",
       "Sales                                         4822  \n",
       "Customers                                      559  \n",
       "Open                                             1  \n",
       "Promo                                            1  \n",
       "StateHoliday                                 False  \n",
       "SchoolHoliday                                    1  \n",
       "Year                                          2015  \n",
       "Month                                            7  \n",
       "Week                                            31  \n",
       "Day                                             31  \n",
       "Dayofweek                                        4  \n",
       "Dayofyear                                      212  \n",
       "Is_month_end                                  True  \n",
       "Is_month_start                               False  \n",
       "Is_quarter_end                               False  \n",
       "Is_quarter_start                             False  \n",
       "Is_year_end                                  False  \n",
       "Is_year_start                                False  \n",
       "Elapsed                                 1438300800  \n",
       "StoreType                                        a  \n",
       "Assortment                                       a  \n",
       "CompetitionDistance                          29910  \n",
       "CompetitionOpenSinceMonth                        4  \n",
       "CompetitionOpenSinceYear                      2015  \n",
       "Promo2                                           0  \n",
       "Promo2SinceWeek                                  1  \n",
       "Promo2SinceYear                               1900  \n",
       "PromoInterval                                 None  \n",
       "State                                           SN  \n",
       "file                                Rossmann_DE_SN  \n",
       "week                       2015-08-02 - 2015-08-08  \n",
       "trend                                           82  \n",
       "file_DE                                Rossmann_DE  \n",
       "week_DE                    2015-08-02 - 2015-08-08  \n",
       "trend_DE                                        83  \n",
       "Date_DE                        2015-08-02 00:00:00  "
      ]
     },
     "execution_count": 68,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "joined.head().T.head(40)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "Now that we've engineered all our features, we need to convert to input compatible with a neural network.\n",
    "\n",
    "This includes converting categorical variables into contiguous integers or one-hot encodings, normalizing continuous features to standard normal, etc..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {
    "hidden": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "844338"
      ]
     },
     "execution_count": 69,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cat_vars = ['Store', 'DayOfWeek', 'Year', 'Month', 'Day', 'StateHoliday', 'CompetitionMonthsOpen',\n",
    "    'Promo2Weeks', 'StoreType', 'Assortment', 'PromoInterval', 'CompetitionOpenSinceYear', 'Promo2SinceYear',\n",
    "    'State', 'Week', 'Events', 'Promo_fw', 'Promo_bw', 'StateHoliday_fw', 'StateHoliday_bw',\n",
    "    'SchoolHoliday_fw', 'SchoolHoliday_bw']\n",
    "\n",
    "contin_vars = ['CompetitionDistance', 'Max_TemperatureC', 'Mean_TemperatureC', 'Min_TemperatureC',\n",
    "   'Max_Humidity', 'Mean_Humidity', 'Min_Humidity', 'Max_Wind_SpeedKm_h', \n",
    "   'Mean_Wind_SpeedKm_h', 'CloudCover', 'trend', 'trend_DE',\n",
    "   'AfterStateHoliday', 'BeforeStateHoliday', 'Promo', 'SchoolHoliday']\n",
    "\n",
    "n = len(joined)\n",
    "n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "dep = 'Sales'\n",
    "joined = joined[cat_vars+contin_vars + [dep, 'Date']].copy()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "joined_test[dep] = 0\n",
    "joined_test = joined_test[cat_vars + contin_vars + [dep, 'Date', 'Id']].copy()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "for v in cat_vars: joined[v] = joined[v].astype('category').cat.as_ordered()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "apply_cats(joined_test, joined)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "for v in contin_vars:\n",
    "    joined[v] = joined[v].fillna(0).astype('float32')\n",
    "    joined_test[v] = joined_test[v].fillna(0).astype('float32')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "We're going to run on a sample."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "metadata": {
    "hidden": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "150000"
      ]
     },
     "execution_count": 75,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "idxs = get_cv_idxs(n, val_pct=150000/n)\n",
    "joined_samp = joined.iloc[idxs].set_index('Date')\n",
    "samp_size = len(joined_samp); samp_size"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "To run on the full dataset, use this instead:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "samp_size = n\n",
    "joined_samp = joined.set_index('Date')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "We can now process our data..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "metadata": {
    "hidden": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Store</th>\n",
       "      <th>DayOfWeek</th>\n",
       "      <th>Year</th>\n",
       "      <th>Month</th>\n",
       "      <th>Day</th>\n",
       "      <th>StateHoliday</th>\n",
       "      <th>CompetitionMonthsOpen</th>\n",
       "      <th>Promo2Weeks</th>\n",
       "      <th>StoreType</th>\n",
       "      <th>Assortment</th>\n",
       "      <th>...</th>\n",
       "      <th>Max_Wind_SpeedKm_h</th>\n",
       "      <th>Mean_Wind_SpeedKm_h</th>\n",
       "      <th>CloudCover</th>\n",
       "      <th>trend</th>\n",
       "      <th>trend_DE</th>\n",
       "      <th>AfterStateHoliday</th>\n",
       "      <th>BeforeStateHoliday</th>\n",
       "      <th>Promo</th>\n",
       "      <th>SchoolHoliday</th>\n",
       "      <th>Sales</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Date</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2015-07-31</th>\n",
       "      <td>1</td>\n",
       "      <td>5</td>\n",
       "      <td>2015</td>\n",
       "      <td>7</td>\n",
       "      <td>31</td>\n",
       "      <td>False</td>\n",
       "      <td>24</td>\n",
       "      <td>0</td>\n",
       "      <td>c</td>\n",
       "      <td>a</td>\n",
       "      <td>...</td>\n",
       "      <td>24.0</td>\n",
       "      <td>11.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>85.0</td>\n",
       "      <td>83.0</td>\n",
       "      <td>57.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>5263</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2015-07-31</th>\n",
       "      <td>2</td>\n",
       "      <td>5</td>\n",
       "      <td>2015</td>\n",
       "      <td>7</td>\n",
       "      <td>31</td>\n",
       "      <td>False</td>\n",
       "      <td>24</td>\n",
       "      <td>25</td>\n",
       "      <td>a</td>\n",
       "      <td>a</td>\n",
       "      <td>...</td>\n",
       "      <td>14.0</td>\n",
       "      <td>11.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>80.0</td>\n",
       "      <td>83.0</td>\n",
       "      <td>67.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>6064</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>2 rows × 39 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "           Store DayOfWeek  Year Month Day StateHoliday CompetitionMonthsOpen  \\\n",
       "Date                                                                            \n",
       "2015-07-31     1         5  2015     7  31        False                    24   \n",
       "2015-07-31     2         5  2015     7  31        False                    24   \n",
       "\n",
       "           Promo2Weeks StoreType Assortment  ...  Max_Wind_SpeedKm_h  \\\n",
       "Date                                         ...                       \n",
       "2015-07-31           0         c          a  ...                24.0   \n",
       "2015-07-31          25         a          a  ...                14.0   \n",
       "\n",
       "           Mean_Wind_SpeedKm_h CloudCover trend trend_DE AfterStateHoliday  \\\n",
       "Date                                                                         \n",
       "2015-07-31                11.0        1.0  85.0     83.0              57.0   \n",
       "2015-07-31                11.0        4.0  80.0     83.0              67.0   \n",
       "\n",
       "           BeforeStateHoliday Promo SchoolHoliday Sales  \n",
       "Date                                                     \n",
       "2015-07-31                0.0   1.0           1.0  5263  \n",
       "2015-07-31                0.0   1.0           1.0  6064  \n",
       "\n",
       "[2 rows x 39 columns]"
      ]
     },
     "execution_count": 77,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "joined_samp.head(2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 78,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "df, y, nas, mapper = proc_df(joined_samp, 'Sales', do_scale=True)\n",
    "yl = np.log(y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "joined_test = joined_test.set_index('Date')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "df_test, _, nas, mapper = proc_df(joined_test, 'Sales', do_scale=True, skip_flds=['Id'],\n",
    "                                  mapper=mapper, na_dict=nas)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 81,
   "metadata": {
    "hidden": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Store</th>\n",
       "      <th>DayOfWeek</th>\n",
       "      <th>Year</th>\n",
       "      <th>Month</th>\n",
       "      <th>Day</th>\n",
       "      <th>StateHoliday</th>\n",
       "      <th>CompetitionMonthsOpen</th>\n",
       "      <th>Promo2Weeks</th>\n",
       "      <th>StoreType</th>\n",
       "      <th>Assortment</th>\n",
       "      <th>...</th>\n",
       "      <th>Min_Humidity</th>\n",
       "      <th>Max_Wind_SpeedKm_h</th>\n",
       "      <th>Mean_Wind_SpeedKm_h</th>\n",
       "      <th>CloudCover</th>\n",
       "      <th>trend</th>\n",
       "      <th>trend_DE</th>\n",
       "      <th>AfterStateHoliday</th>\n",
       "      <th>BeforeStateHoliday</th>\n",
       "      <th>Promo</th>\n",
       "      <th>SchoolHoliday</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Date</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2015-07-31</th>\n",
       "      <td>1</td>\n",
       "      <td>5</td>\n",
       "      <td>3</td>\n",
       "      <td>7</td>\n",
       "      <td>31</td>\n",
       "      <td>1</td>\n",
       "      <td>25</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>...</td>\n",
       "      <td>-1.620066</td>\n",
       "      <td>0.149027</td>\n",
       "      <td>-0.142774</td>\n",
       "      <td>-1.844823</td>\n",
       "      <td>1.732492</td>\n",
       "      <td>1.724334</td>\n",
       "      <td>0.604461</td>\n",
       "      <td>1.111827</td>\n",
       "      <td>1.113717</td>\n",
       "      <td>2.04105</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2015-07-31</th>\n",
       "      <td>2</td>\n",
       "      <td>5</td>\n",
       "      <td>3</td>\n",
       "      <td>7</td>\n",
       "      <td>31</td>\n",
       "      <td>1</td>\n",
       "      <td>25</td>\n",
       "      <td>26</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>...</td>\n",
       "      <td>-1.264031</td>\n",
       "      <td>-0.960613</td>\n",
       "      <td>-0.142774</td>\n",
       "      <td>-0.488722</td>\n",
       "      <td>1.294578</td>\n",
       "      <td>1.724334</td>\n",
       "      <td>0.926957</td>\n",
       "      <td>1.111827</td>\n",
       "      <td>1.113717</td>\n",
       "      <td>2.04105</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>2 rows × 38 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "            Store  DayOfWeek  Year  Month  Day  StateHoliday  \\\n",
       "Date                                                           \n",
       "2015-07-31      1          5     3      7   31             1   \n",
       "2015-07-31      2          5     3      7   31             1   \n",
       "\n",
       "            CompetitionMonthsOpen  Promo2Weeks  StoreType  Assortment  \\\n",
       "Date                                                                    \n",
       "2015-07-31                     25            1          3           1   \n",
       "2015-07-31                     25           26          1           1   \n",
       "\n",
       "                ...        Min_Humidity  Max_Wind_SpeedKm_h  \\\n",
       "Date            ...                                           \n",
       "2015-07-31      ...           -1.620066            0.149027   \n",
       "2015-07-31      ...           -1.264031           -0.960613   \n",
       "\n",
       "            Mean_Wind_SpeedKm_h  CloudCover     trend  trend_DE  \\\n",
       "Date                                                              \n",
       "2015-07-31            -0.142774   -1.844823  1.732492  1.724334   \n",
       "2015-07-31            -0.142774   -0.488722  1.294578  1.724334   \n",
       "\n",
       "            AfterStateHoliday  BeforeStateHoliday     Promo  SchoolHoliday  \n",
       "Date                                                                        \n",
       "2015-07-31           0.604461            1.111827  1.113717        2.04105  \n",
       "2015-07-31           0.926957            1.111827  1.113717        2.04105  \n",
       "\n",
       "[2 rows x 38 columns]"
      ]
     },
     "execution_count": 81,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "In time series data, cross-validation is not random. Instead, our holdout data is generally the most recent data, as it would be in real application. This issue is discussed in detail in [this post](http://www.fast.ai/2017/11/13/validation-sets/) on our web site.\n",
    "\n",
    "One approach is to take the last 25% of rows (sorted by date) as our validation set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 82,
   "metadata": {
    "hidden": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "633253"
      ]
     },
     "execution_count": 82,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train_ratio = 0.75\n",
    "train_size = int(samp_size * train_ratio)\n",
    "val_idx = list(range(train_size, len(df)))\n",
    "\n",
    "train_size"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "An even better option for picking a validation set is using the exact same length of time period as the test set uses - this is implemented here:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 83,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "val_idx = np.flatnonzero(\n",
    "    (df.index <= datetime.datetime(2014, 9, 17)) & (df.index >= datetime.datetime(2014, 8, 1)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 84,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "val_idx = [0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## DL"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We're ready to put together our models.\n",
    "\n",
    "Root-mean-squared percent error is the metric Kaggle used for this competition."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 85,
   "metadata": {},
   "outputs": [],
   "source": [
    "def inv_y(a): return np.exp(a)\n",
    "\n",
    "def exp_rmspe(y_pred, targ):\n",
    "    targ = inv_y(targ)\n",
    "    pct_var = (targ - inv_y(y_pred))/targ\n",
    "    return math.sqrt((pct_var ** 2).mean())\n",
    "\n",
    "max_log_y = np.max(yl)\n",
    "y_range = (0, max_log_y * 1.2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can create a ModelData object directly from out data frame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 86,
   "metadata": {},
   "outputs": [],
   "source": [
    "md = ColumnarModelData.from_data_frame(PATH, val_idx, df, yl.astype(np.float32), cat_flds=cat_vars, bs=128,\n",
    "                                       test_df=df_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Some categorical variables have a lot more levels than others. Store, in particular, has over a thousand!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 87,
   "metadata": {},
   "outputs": [],
   "source": [
    "cat_sz = [(c, len(joined_samp[c].cat.categories) + 1) for c in cat_vars]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 88,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('Store', 1116),\n",
       " ('DayOfWeek', 8),\n",
       " ('Year', 4),\n",
       " ('Month', 13),\n",
       " ('Day', 32),\n",
       " ('StateHoliday', 3),\n",
       " ('CompetitionMonthsOpen', 26),\n",
       " ('Promo2Weeks', 27),\n",
       " ('StoreType', 5),\n",
       " ('Assortment', 4),\n",
       " ('PromoInterval', 4),\n",
       " ('CompetitionOpenSinceYear', 24),\n",
       " ('Promo2SinceYear', 9),\n",
       " ('State', 13),\n",
       " ('Week', 53),\n",
       " ('Events', 22),\n",
       " ('Promo_fw', 7),\n",
       " ('Promo_bw', 7),\n",
       " ('StateHoliday_fw', 4),\n",
       " ('StateHoliday_bw', 4),\n",
       " ('SchoolHoliday_fw', 9),\n",
       " ('SchoolHoliday_bw', 9)]"
      ]
     },
     "execution_count": 88,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cat_sz"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We use the *cardinality* of each variable (that is, its number of unique values) to decide how large to make its *embeddings*. Each level will be associated with a vector with length defined as below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 89,
   "metadata": {},
   "outputs": [],
   "source": [
    "emb_szs = [(c, min(50, (c + 1) // 2)) for _, c in cat_sz]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 90,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(1116, 50),\n",
       " (8, 4),\n",
       " (4, 2),\n",
       " (13, 7),\n",
       " (32, 16),\n",
       " (3, 2),\n",
       " (26, 13),\n",
       " (27, 14),\n",
       " (5, 3),\n",
       " (4, 2),\n",
       " (4, 2),\n",
       " (24, 12),\n",
       " (9, 5),\n",
       " (13, 7),\n",
       " (53, 27),\n",
       " (22, 11),\n",
       " (7, 4),\n",
       " (7, 4),\n",
       " (4, 2),\n",
       " (4, 2),\n",
       " (9, 5),\n",
       " (9, 5)]"
      ]
     },
     "execution_count": 90,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "emb_szs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 91,
   "metadata": {},
   "outputs": [],
   "source": [
    "m = md.get_learner(emb_szs, len(df.columns) - len(cat_vars),\n",
    "                   0.04, 1, [1000, 500], [0.001, 0.01], y_range=y_range)\n",
    "lr = 1e-3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 92,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "f75170882266498285643d2b090b6848",
       "version_major": 2,
       "version_minor": 0
      },
      "text/html": [
       "<p>Failed to display Jupyter Widget of type <code>HBox</code>.</p>\n",
       "<p>\n",
       "  If you're reading this message in the Jupyter Notebook or JupyterLab Notebook, it may mean\n",
       "  that the widgets JavaScript is still loading. If this message persists, it\n",
       "  likely means that the widgets JavaScript library is either not installed or\n",
       "  not enabled. See the <a href=\"https://ipywidgets.readthedocs.io/en/stable/user_install.html\">Jupyter\n",
       "  Widgets Documentation</a> for setup instructions.\n",
       "</p>\n",
       "<p>\n",
       "  If you're reading this message in another frontend (for example, a static\n",
       "  rendering on GitHub or <a href=\"https://nbviewer.jupyter.org/\">NBViewer</a>),\n",
       "  it may mean that your frontend doesn't currently support widgets.\n",
       "</p>\n"
      ],
      "text/plain": [
       "HBox(children=(IntProgress(value=0, description='Epoch', max=1), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " 59%|█████▉    | 3925/6597 [00:50<00:34, 78.42it/s, loss=0.133]  "
     ]
    }
   ],
   "source": [
    "m.lr_find()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 93,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYUAAAEOCAYAAABmVAtTAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4xLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvAOZPmwAAIABJREFUeJzt3Xl8VOXd9/HPL5N9h5CwBQiriMqiAYVaxK2itlDvWoW2Kt2sPre1ra1Vq+1T9W7dWn20Wqu1tnet1rVaVBTcqLsQkFUWWSWsCZAQsi/X88cM4xAmECAnJ8l8369XXpxz5sw5vxzH+eY6y3WZcw4RERGAOL8LEBGRjkOhICIiYQoFEREJUyiIiEiYQkFERMIUCiIiEqZQEBGRMIWCiIiEKRRERCRMoSAiImHxfhdwuHr06OEKCgr8LkNEpFNZsGBBqXMu91DrdbpQKCgooKioyO8yREQ6FTPb2Jr1dPpIRETCFAoiIhKmUBARkTCFgoiIhCkUREQkTKEgIiJhMRMKG3dW8srSrWj4URGRlsVMKLy6bBtXPr6QyrpGv0sREemwYiYUslISACivrve5EhGRjiv2QqFKoSAi0pLYCwW1FEREWhQzoZCpUBAROSRPQ8HMJpvZKjNbY2bXR3l9hpmVmNmi0M/3vKplX0thT41CQUSkJZ71kmpmAeAB4GygGJhvZjOdc580W/Up59xVXtWxT2ZyMBQqahq83pWISKflZUthHLDGObfOOVcHPAlM9XB/B5WWFABgr0JBRKRFXoZCX2BTxHxxaFlzXzOzJWb2rJn186qY+EAcqYkBKnT6SESkRV6GgkVZ1vxx4heBAufcSOB14H+jbsjscjMrMrOikpKSIy4oIzlep49ERA7Cy1AoBiL/8s8HtkSu4Jzb6ZyrDc3+GTgp2oaccw875wqdc4W5uYccTa5FmckJuvtIROQgvAyF+cBQMxtoZonANGBm5Apm1jtidgqwwsN6yEpRKIiIHIxndx855xrM7CpgNhAAHnXOLTezW4Ai59xM4GozmwI0ALuAGV7VA5Cdmkjx7iovdyEi0ql5FgoAzrlZwKxmy34VMX0DcIOXNUTqlprAss1qKYiItCRmnmgG6JaWSFl1nd9liIh0WDEVCtmpCdTUN1FTr+6zRUSiialQ6JaaCMDuKrUWRESiiclQ2LlXoSAiEk1MhUJOejAUdlUqFEREoompUOieplAQETmY2AoFXVMQETmomAqFfQPtlGlIThGRqGIqFAJxRmZyvLq6EBFpQUyFAgS7uijT6SMRkahiMBQS2K3TRyIiUcVgKCRSptNHIiJRxV4opCRQrtNHIiJRxV4opCaopSAi0oLYC4XQQDtNTc1HBhURkZgLhazURJxDYzWLiEQRc6GQve8BNo2rICJygJgLhUCcATB/w26fKxER6XhiLhR6ZyUDMG/9Tp8rERHpeGIuFE4a0A0z6JWV4ncpIiIdTsyFQnwgDufgvjc+9bsUEZEOJ+ZCQUREWhaTofDdUweSnBCnZxVERJqJyVAYkJNKTX0TpZW1fpciItKhxGQodAuNwKbBdkRE9hfTobBbYzWLiOwnNkMhLfhU8y6FgojIfmIyFHIzkgAo3atrCiIikWIyFHLSkogzKKlQKIiIRIrJUAjEGd3TkihRS0FEZD8xGQoAeRlJaimIiDTjaSiY2WQzW2Vma8zs+oOsd6GZOTMr9LKeSLkKBRGRA3gWCmYWAB4AzgVGANPNbESU9TKAq4GPvKolGoWCiMiBvGwpjAPWOOfWOefqgCeBqVHWuxW4E6jxsJYD5GYEryk4p64uRET28TIU+gKbIuaLQ8vCzGwM0M8595KHdUSVm55EfaPTU80iIhG8DAWLsiz8Z7mZxQH3AD895IbMLjezIjMrKikpaZPi8jKDzyroDiQRkc95GQrFQL+I+XxgS8R8BnA8MNfMNgCnADOjXWx2zj3snCt0zhXm5ua2SXG56cFQ2LFHoSAiso+XoTAfGGpmA80sEZgGzNz3onOu3DnXwzlX4JwrAD4EpjjnijysKSwvMzgsZ8nedr2UISLSoXkWCs65BuAqYDawAnjaObfczG4xsyle7be19nV1sV0tBRGRsHgvN+6cmwXMarbsVy2sO8nLWppLT4qne1oiS4vL23O3IiIdWsw+0QwwJDedVdsr/C5DRKTDiOlQGNYrXQ+wiYhEiOlQ6JudSnl1PXtq9KyCiAjEeCgMyk0DYF1Jpc+ViIh0DDEdCkPy0gFYUlzmcyUiIh1DTIfCoB5pdE9LZMXWPX6XIiLSIcR0KJgZ+d1S2FKmB9hERCDGQwGgd1Yym8uq/S5DRKRDiPlQ6NctleLdVepCW0QEhQL9c1KpqW9Sb6kiIigU6NctFYBNu6p8rkRExH8Khe77QkHXFUREYj4U8rulAPCZWgoiIgqF5IQAPTOTdPpIRASFAgD9u6eycadCQUREoQAM7JHGutK9fpchIuI7hQLBPpBK99ZRVlXndykiIr5SKADH9MoE4J/zNvlciYiIvxQKwJj+2QDqGE9EYp5CAchMTmBoXjozF2/xuxQREV8pFELGD84BYHelriuISOxSKIScNiwXgHWlGoVNRGKXQiGkf6i7i+Ldel5BRGKXQiEkP9Qx3iPvrPe5EhER/ygUQlISAwBsLdcobCISuxQKEaaP609tQyMNjU1+lyIi4guFQoQvDMmhoqaBf87XQ2wiEpsUChH23YH0yxeW+VyJiIg/FAoRMpITuGBMX+LjjD019X6XIyLS7hQKzUwf15+GJsfrn2z3uxQRkXanUGhmbEE3eqQn8sHanX6XIiLS7jwNBTObbGarzGyNmV0f5fUrzGypmS0ys3fNbISX9bSGmTE0L4NPd2h8BRGJPZ6FgpkFgAeAc4ERwPQoX/pPOOdOcM6NBu4E7vaqnsMxJC+dRZvKdGuqiMQcL1sK44A1zrl1zrk64ElgauQKzrnIvqrTAOdhPa02JC8dgFXbK3yuRESkfXkZCn2ByBv+i0PL9mNm/21mawm2FK72sJ5WmxDqMXXVNoWCiMQWL0PBoiw7oCXgnHvAOTcYuA64KeqGzC43syIzKyopKWnjMg80sEcaiYE4hYKIxBwvQ6EY6Bcxnw8cbBSbJ4GvRnvBOfewc67QOVeYm5vbhiVGFx+IY3BeOqt1+khEYoyXoTAfGGpmA80sEZgGzIxcwcyGRsyeD3zqYT2HZVjPdFZv1x1IIhJbWhUKZvYjM8u0oL+Y2UIz+9LB3uOcawCuAmYDK4CnnXPLzewWM5sSWu0qM1tuZouAa4DLjuJ3aVPDemawuayaCj3ZLCIxJL6V633HOXevmZ0D5ALfBv4KzDnYm5xzs4BZzZb9KmL6R4dXbvsZ1jMDgPfW7GTy8b18rkZEpH209vTRvovG5wF/dc4tJvqF5C7juD6ZADz+0UafKxERaT+tDYUFZjaHYCjMNrMMoEs/2dUnO4W+2SmUV+v0kYjEjtaGwneB64GxzrkqIIHgKaQuLT0pniXF5fz9gw1+lyIi0i5aGwrjgVXOuTIz+xbB5wnKvSurY7jr6yMBeEqD7ohIjGhtKDwIVJnZKODnwEbg755V1UGMzM9m6ug+LN+yh+q6Rr/LERHxXGtDocE55wj2XXSvc+5eIMO7sjqO04/JA+DHT33scyUiIt5r7S2pFWZ2A3AJ8MVQD6gJ3pXVcewLhSXFXf5smYhIq1sKFwO1BJ9X2EawY7u7PKuqA8lKTWDGhAK2ltcwe/k2v8sREfFUq0IhFASPA1lm9mWgxjnX5a8p7HPasGB/Sz94bAEfrdOIbCLSdbW2m4uLgHnA14GLgI/M7EIvC+tIvji0B4Ny0wD4+XNLfK5GRMQ7rT19dCPBZxQuc85dSnAAnV96V1bHEh+I482fTuLM4Xls3FnF5rJqv0sSEfFEa0Mhzjm3I2J+52G8t8v4wWmDAXh+YbHPlYiIeKO1X+yvmtlsM5thZjOAl2nW0V0sGDewO32zU/jdnNUE79AVEelaWnuh+VrgYWAkMAp42Dl3nZeFdVQTQxed53yy3edKRETaXqtPATnnnnPOXeOc+4lz7nkvi+rIbjr/WCB4J5KISFdz0FAwswoz2xPlp8LM9rRXkR1JWlI8ZwwPPtCmHlRFpKs5aCg45zKcc5lRfjKcc5ntVWRHc+n4AQCMunkOTU26tiAiXUfM3UHUFr4wpEd4WtcWRKQrUSgcgYRAHK9fMxGAK/6xQKeRRKTLUCgcoSF5GXRLDfYJ+PDba32uRkSkbSgUjsKHvzgTgAfeWstS9aIqIl2AQuEoJMUHwtNfuf9dPdAmIp2eQuEoPfG9k8PTzy5Q9xci0rkpFI7ShCE9uHfaaACufXYJK7bG5OMbItJFKBTawNTRfTnr2OADbY++u97nakREjpxCoY08ctlYxvTP5pkFxVz7zGK/yxEROSIKhTZ085TjAHhmQTEF17/Mym06lSQinYtCoQ2NzM/mX/9nQnh+8v97h4oaPdgmIp2HQqGNndi/G/NuPDM8/40/f+RjNSIih0eh4IG8jGTW33YeAEs3l/PK0q3sqqzzuSoRkUNTKHjEzHjxqlMBuPLxhZx462t86xG1GkSkY/M0FMxsspmtMrM1ZnZ9lNevMbNPzGyJmb1hZgO8rKe9nZCfxT0XjwrPv7umlL21DT5WJCJycJ6FgpkFgAeAc4ERwHQzG9FstY+BQufcSOBZ4E6v6vHLBWPyWXHLZO6+KBgOb68u8bkiEZGWedlSGAescc6tc87VAU8CUyNXcM695ZyrCs1+COR7WI9vUhIDTB3dl+5picxevs3vckREWuRlKPQFNkXMF4eWteS7wCse1uOrQJwxcWgP/r1oC2f8fi5VdQ2UVenis4h0LPEebtuiLIvajaiZfQsoBE5r4fXLgcsB+vfv31b1tbvvTxzEC4u2sK6kkhG/mg1AelI8711/BlkpCT5XJyLibUuhGOgXMZ8PbGm+kpmdBdwITHHO1UbbkHPuYedcoXOuMDc315Ni28NxfbJ462eT9lu2t7aBUTfP4bEPN/pTlIhIBC9bCvOBoWY2ENgMTAO+EbmCmY0BHgImO+d2eFhLhzGwRxobbj8fgK3l1Yy/7U0AfvnCMkbnZ3NCfpaf5YlIjPOspeCcawCuAmYDK4CnnXPLzewWM5sSWu0uIB14xswWmdlMr+rpiHpnpbDh9vP5wcRBQHCgnobGJp+rEpFYZp1ttLDCwkJXVFTkdxltruD6l8PTb197Ov1zUn2sRkS6GjNb4JwrPNR6eqK5g/j0N+eGpyfe9ZaG9hQRXygUOoiEQBzrfnseBaEWwsAbZlG6N+p1dxERzygUOpC4OGP2TyaG5wv/53XunrOKqjp1jSEi7UOh0MEkxQe4Zepx4fn73lzDiF/N5h+6ZVVE2oFCoQO6dHwBG24/n5evPjW87KYXlnHX7JXUNejuJBHxjkKhAzuuTxYv/fBULhgT7B3kgbfWMuymV7j7tdU+VyYiXZWXD69JGzi+bxb3XDya9KT48FPP973xKU1NjovH9qNfd926KiJtR88pdDL/WV3CZY/O22/ZacNy+d/vjPOpIhHpDPScQhd12rBcbvuvE/Zb9p/VJRT+z2tsLa/mpSVbqK5r9Kk6Eens1FLoxFZs3cPvZq/ijZUHdhv13vVn0Dc7xYeqRKQjam1LQaHQBVTWNvCV+99lXUnlAa/N+8WZ5GUm+1CViHQkOn0UQ9KS4nnzp5PYcPv5bLj9fKaP+3zMid/MWuFjZSLS2ejuoy7opvOPZfmWcpYUl/PvRVv496LgMBa9s5J597ozCMRFG/9IREQthS4pLSmemVedynNXjt9v+dbyGgb/Yhb16p5bRFqgUOjCThrQnQU3ncWUUX3Iy0gKLx964ysKBhGJSqePuric9CTumz4mPD/uN6+zo6KWR95Zz7G9M5h0TJ6P1YlIR6O7j2JMTX0jo2+ZQ0395y2FlbdOJjkh4GNVIuI13X0kUSUnBHj6B/tfa7j8sQUa1EdEAIVCTBqZn82G289n/W3nccqg7ry9uoS5q0v8LktEOgCFQgwzM/7+nZMpyEnl23+dz+7KOr9LEhGfKRRiXGJ8HD+fPByAMbe+Rk29+k0SiWUKBeG8E3qHpy966AMfKxERvykUBIDXQmNDL9+yh027qnyuRkT8olAQAIb2zGDOTyaSGIjj6ic/1t1IIjFKoSBhw3pmcO05x/DxZ2XMXLzF73JExAcKBdnPjAkFDO+Vwd2vrVZXGCIxSKEg+4mLM6495xg27qxi+sMf+l2OiLQzhYIc4Izhwf6QijbuZsHG3T5XIyLtSaEgBzAz3r/+DBICxtX//JjGJl10FokVCgWJqk92Cn+YPobNZdX86t/LdDeSSIxQKEiLzjmuF+eP7M3jH33GwBtmUV2np51FujpPQ8HMJpvZKjNbY2bXR3l9opktNLMGM7vQy1rk8JkZf5j2+VgMF/zxPbUYRLo4z0LBzALAA8C5wAhgupmNaLbaZ8AM4Amv6pCjExdnrL/tPE4d0oOV2yp4/uPNfpckIh7ysqUwDljjnFvnnKsDngSmRq7gnNvgnFsC6Ib4DszMeOiSkwC4/8011DboNJJIV+VlKPQFNkXMF4eWSSeUlhTPX2eMZV1pJefd+47f5YiIR7wMBYuy7IhOSJvZ5WZWZGZFJSUaDMYvpw/P44tDe7C2pJJfvrDM73JExANehkIx0C9iPh84og51nHMPO+cKnXOFubm5bVKcHJlHLiskPSmexz7cyLG/fJXte2r8LklE2pCXoTAfGGpmA80sEZgGzPRwf9IOkuIDFN10FjMmFFBd38hvZ63wuyQRaUOehYJzrgG4CpgNrACeds4tN7NbzGwKgJmNNbNi4OvAQ2a23Kt6pO0kJwT49ZTjuHT8AP69aAtPzf/M75JEpI1YZ7vvvLCw0BUVFfldhgDlVfWMumUOAPdcPIoLxuT7XJFI19TY5Hjsgw1MG9ef5ITAEW3DzBY45woPtZ6eaJYjlpWawIc3nAnAT55aTMH1L7O1vNrnqkS6nucWFvPrFz/h93NWeb4vhYIclV5Zyfzz+6eE58ff9iZ7aup9rEik63nio+Ap2ktOKfB8XwoFOWrjB+ew5jfncufXRgJwxWML1LOqSBvaXVUHQP+cVM/3pVCQNhEfiOOisf244dzhvL92Jz968mOaFAwibSIjOZ4zQ+OceE2hIG3q8omDmD6uPy8t2cpDb6/zuxyRLqGqrpHkxCO7wHy4FArSpsyM315wPF8a0ZM7Xl3JpLve4p1P9RS6yJFyzrGnuoE0hYJ0VmbGA988kR+fNZQNO6u45C/z+M7f5lNZ2+B3aSKdzqZd1ZTureX4vlntsj+FgngiIRDHj88axopbJnPasFzeXLmDUTfP4c5XV/LZziq/yxPpNIrLgv+/DOuZ0S77UyiIp1ISA/zlskJ+evYwBuem88e5a5l411vc8epKqurUchA5lPKq4C3eWSkJ7bK/+HbZi8S0+EAcPzxzKN+fOIhrn13C9vIaHpy7lgfnriXOYNIxeZx+TC6FBd3J75ZCRvKhP/wrt+3hZ88s5hfnHcv4QTmYReuUV6Tz21kZvB01O1WhIF1MckKAP0wPDu/56rKtXPGPhTQ5eHPlDt5cuSO8XpxBXkYyqUkBqusamTq6L1ecNojs1EQA/vSftdz+ykoAvvHnj7jwpHzuunCkgkG6pBcXbyExPo6ctKR22Z/6PhJfOeeob3Qs2Lib/6wu4d01JSzbvKfF9c3AOUhNDPDzc47hr+9vYGPoGsV5J/RibEF3Lh1fQCBOASGdX3l1PaNunkN+txTeve6Mo9pWa/s+UktBfGVmJMYb4wfnMH5wDjCcsqo6VmytoLCgG+XV9azeXsHPnl7MlvIa9v0N88ZPT6N3VgqXTSjgobfXcfsrK5m1dBuzlm5j9vJtXDlpCHe8spK9tQ0M7JHGvdNGs3zLHmrqGznz2J4AVNU1kBCIIyGgS2vSMS0tLgfg2nOOabd9qqUgnUZtQyNxoVNEzb/Ia+obeW5hMU/N38SS0P9IAInxcdQ17D8E+ODcNNaWVAKQk5bIzso6slMTmD6uPykJAUbmZzHpmKN7enTm4i0c2yuDoe10x4h0TQ/OXcsdr67k41+eTbe0xKPalloK0uUkxbf88E5yQoBvnjyAb548gAUbd7Ng4y56ZaUwZVQf7n39U+55fTXjBnZncG4a/1q4Ofy+qrpGAMqq6nlw7tr9tvnU5adQ29BEYnwchQO6ER8Kopr6RpLi4w64huGc47EPN3L/m2vYUVELwE3nH8v3vjjosH/XJz76jCXFZXxxaC7nHt8LM3TNJMYs2LibO14NXjs72kA4HGopSMz7bGcVPbOSqKlrory6nmueXkTRxt37rdM9LZET+maxo6KWFVv3kBQfR2FBN5LiA+SkJdIzM5mXlmxhQ+j6xr5rHwDdUhPIy0hmZH4WK7dVsHRzOVkpCSTGxzGgeyo90pMY0z+baWP7k5WawPrSSk7/3dwD6tSYFbHjw3U7mfbwhwAMyUvn9WtOO+pttraloFAQiWLTrioWF5exu6qexIDx3pqdLN1cTnpSPLkZSeyqrKN0by3Fuz8fPyI7NYGpo/rw32cMIS8jmZr6Rn7xr6VsKa9m4Wdl+53GKhzQjW17avZ7P8DXTsxn0+4q5q3fxZ8vLeTNldt5dkEx9Y2O+Djj/m+cyDnH9VSroQsqqaile1oiD729ljtfDY6bcNn4Adxw3rFHPLBOJIWCSDtoanLsrKxjV2Udg3PTwqeYmmtobKKhyfHR+l0U5KQyICct/JpzjiXF5Tzy7npeXLwlvHzhL8+me+i0QeneWi75yzxWbN3DCX2zuOfi0QzJS/f2l+uk/vDGp7y7ppT7po+hZ2ay3+UcVENjE5t2VzNr6VbueW01cWbUNQb/eLh32mimju7bZvtSKIh0QhtKK3mqaBPfPLk/+d327zu/pr6Rv3+wgd/OWkliII7zR/Zm2th+9M5KaZd+9juDt1bt4Nt/nQ8EW25/vrSQsQXdfa4qurtmr+SBt9ZGfa3oprPokd62zyUoFES6qOLdVdz4/DL+s/rz3mdPGdSdP33rpPADfgdT39jEmyt3UDigGzlt/MXjJ+ccA2+YBcAdXzuB655bCsDdF43iq6P7EtfCsyt7axtITQi0+LoX3ltTyjcf+QiAcQXdOef4Xlw8th/pSd7d+6NQEOnCnHNs3FnFPz7cyMZdVby+YjvOwZnD8+iTncKuqjpOGZRDSkKAz3ZVce7xvchJT+TFxVu59aVPwtsZV9CdSycMYMLgHhjwxsodvL+2lFMG5TAkL53j+mTy82eX8PbqEkb1y+a6ycM5tnemZ7/X2pK9vLFiO1kpCXx5ZB/SkuJZs2Mvcz7ZRt/sFCYNyyMrSncP768t5dpnlrC5rJpzjuvJQ5cUsrW8mgsf/IDNZdUc1yeTGRMKuPCk/PD1GOccr6/Ywff/XkTvrGT++/Qh1NQ38q+Fm0lJDFCQk8bwXhlcNLYf60srWbF1DxeelH9Ez7W8umwreZnJDOuZwYNz1/DAW2sZmpfOs1dMiPr7eEGhIBJDlm0u5xfPL93vGY2DGdE7k4SAsTjK+pF3TkUzul82XxnVh7OP7bnfaat9QZWbkURa6C/e7XtquOPVlawrqaSqroHK2ka+PLI3Y/p3o6GpibNH9CRgRnV9I2tLKrnwwfdpCI3Y1yM9kT01DQc8Z/L9Lw7k4rH9SAwEqG9q4j+rSrjlpU8IxBkTh/bgwW+dFL4wW9vQyL8/3sK9b3zK5rJqvjAkhymj+nDj88swg/rG4L5yM5IoCd1G3L97KonxcWworQzXEvm7//6iUWSnJFCyt5aCnLRDXgR+ZelWrnx84QHH/9EZY+mV1X7XPBQKIjHIOUdDk2P+hl28tXIH4wfnsGDjbhIDATKS4xnWM4MJg3PCp0rqG5uYt34Xr32yna3l1Zx+TB4XnNiXxZvKmbtqB/PW76JHehIPfPNE9lTXc+fsVbyxYnv4OQwI3jI5pl8276/dyeay4N1UPTOTGJWfzXtrSqmsayQ5IY4heeks27yH+Dg74Mt2HzN49LKxfLypjI8/2807n5YC8MT3T6amvpHfzlrJupK9NH9776xknrtyAn2yU6Jut6nJ8fi8z7h55vLwvlMSAhzfN5P/+5Xj6JmZzHtrShnaM53j+gTHLaipb+T+N9dQVl1HdkoiqUkBHpy7loqaz3v3TQgY55/Qm+vOHU7vrM/3vfCz3SzZVMaiTWW8sGgLPdKT+MHEQVTWNZCVksC3ThnQ7k/SKxRExDMbSiuZuXgLd7+2mn7dU9hSVkNjk+MbJ/cnIc54buFm+manUFFTz7WTj9nv+YqyqjpWbatgXWkl89fvYtPuKgoLulNV28A5x/diwuAe4XWbmtwB5/pLKmp5eckWkhIC1NY38tKSrTz23ZNJacXIZCu27mHZ5nKO7Z15RIPW7Kio4Z7XPuX5j4s5dUguaUkBXlm6jbrGJvp3TyUtKZ6Kmvr9bjXOTI7n9WtOI8/nO6EUCiLSbpxzMfvsxPrSSh59dz3rSvdSUdNAz8xkRuVncfrwPIbmZZAY3zH61lI3FyLSbmI1EAAG9kjj1q8e73cZbaZjRJiIiHQICgUREQlTKIiISJhCQUREwhQKIiISplAQEZEwhYKIiIQpFEREJKzTPdFsZiXAxtBsFhDZo1cPoLSdSmm+b6/ee6h1D/Z6S69FW96aZe11fDvKsT3UOkdzfPXZ1Wf3aN57JJ/dAc653ENu2TnXaX+Ah5vNF/m1b6/ee6h1D/Z6S69FW96aZe11fDvKsfXy+Oqzq8+ul8f2aGrp7KePXuyk+z6c9x5q3YO93tJr0Za3dll76CjH9lDrHM3x1WdXn92jee/RfnZb1OlOHx2MmRW5VnT4JEdGx9c7Orbe0vFtvc7eUmjuYb8L6OJ0fL2jY+stHd9W6lItBREROTpdraUgIiJHQaEgIiJhCgUREQmLmVAws0lm9o6Z/cnMJvldT1djZmlmtsDMvux3LV2NmR0b+tw+a2ZX+l1PV2JmXzWzP5vZv83sS37X0xF0ilAws0fNbIeZLWu2fLKZrTKzNWZ2/SE244C9QDIgKa9cAAAG/klEQVRQ7FWtnU0bHVuA64Cnvamy82qL4+ucW+GcuwK4CNBtlSFtdGxfcM59H5gBXOxhuZ1Gp7j7yMwmEvxC/7tz7vjQsgCwGjib4Jf8fGA6EABua7aJ7wClzrkmM+sJ3O2c+2Z71d+RtdGxHUmwG4Fkgsf5pfapvuNri+PrnNthZlOA64H7nXNPtFf9HVlbHdvQ+34PPO6cW9hO5XdY8X4X0BrOubfNrKDZ4nHAGufcOgAzexKY6py7DTjYKYzdQJIXdXZGbXFszex0IA0YAVSb2SznXJOnhXcSbfXZdc7NBGaa2cuAQoE2++wacDvwigIhqFOEQgv6Apsi5ouBk1ta2cz+CzgHyAbu97a0Tu+wjq1z7kYAM5tBqEXmaXWd3+F+dicB/0Xwj5lZnlbW+R3WsQV+CJwFZJnZEOfcn7wsrjPozKFgUZa1eC7MOfcv4F/eldOlHNaxDa/g3N/avpQu6XA/u3OBuV4V08Uc7rG9D7jPu3I6n05xobkFxUC/iPl8YItPtXQ1Orbe0vH1jo7tUerMoTAfGGpmA80sEZgGzPS5pq5Cx9ZbOr7e0bE9Sp0iFMzsn8AHwDFmVmxm33XONQBXAbOBFcDTzrnlftbZGenYekvH1zs6tt7oFLekiohI++gULQUREWkfCgUREQlTKIiISJhCQUREwhQKIiISplAQEZEwhYJ4zsz2tsM+prSyi++23OckM5twBO8bY2aPhKZnmFmH6IvLzAqad0MdZZ1cM3u1vWqS9qdQkE4j1C1yVM65mc652z3Y58H6B5sEHHYoAL8A/nBEBfnMOVcCbDWzL/hdi3hDoSDtysyuNbP5ZrbEzG6OWP5CaOS25WZ2ecTyvWZ2i5l9BIw3sw1mdrOZLTSzpWY2PLRe+C9uM/ubmd1nZu+b2TozuzC0PM7M/hjax0tmNmvfa81qnGtmvzWz/wA/MrOvmNlHZvaxmb1uZj1DXTZfAfzEzBaZ2RdDf0U/F/r95kf74jSzDGCkc25xlNcGmNkboWPzhpn1Dy0fbGYfhrZ5S7SWlwVHvnvZzBab2TIzuzi0fGzoOCw2s3lmlhFqEbwTOoYLo7V2zCxgZndF/Lf6QcTLLwAaj6Srcs7pRz+e/gB7Q/9+CXiYYE+WccBLwMTQa91D/6YAy4Cc0LwDLorY1gbgh6Hp/wM8EpqeQXAAGoC/Ac+E9jGCYP/6ABcS7Ho6DuhFcGyNC6PUOxf4Y8R8Nz5/+v97wO9D078Gfhax3hPAqaHp/sCKKNs+HXguYj6y7heBy0LT3wFeCE2/BEwPTV+x73g22+7XgD9HzGcBicA6YGxoWSbBnpFTgeTQsqFAUWi6AFgWmr4cuCk0nQQUAQND832BpX5/rvTjzU9n7jpbOp8vhX4+Ds2nE/xSehu42swuCC3vF1q+E2gEnmu2nX1doC8gOM5ANC+44LgOn1hwtD2AU4FnQsu3mdlbB6n1qYjpfOApM+tN8It2fQvvOQsYERy3BYBMM8twzlVErNMbKGnh/eMjfp/HgDsjln81NP0E8Lso710K/M7M7gBecs69Y2YnAFudc/MBnHN7INiqAO43s9EEj++wKNv7EjAyoiWVRfC/yXpgB9Cnhd9BOjmFgrQnA25zzj2038LgIDJnAeOdc1VmNpfg0J4ANc65xmbbqQ3920jLn+HaiGlr9m9rVEZM/4HgEK4zQ7X+uoX3xBH8HaoPst1qPv/dDqXVHZM551ab2UnAecBtZjaH4GmeaNv4CbAdGBWquSbKOkawRTY7ymvJBH8P6YJ0TUHa02zgO2aWDmBmfc0sj+BfobtDgTAcOMWj/b8LfC10baEnwQvFrZEFbA5NXxaxvALIiJifQ7CHTgBCf4k3twIY0sJ+3ifY1TMEz9m/G5r+kODpISJe34+Z9QGqnHP/INiSOBFYCfQxs7GhdTJCF86zCLYgmoBLCI5f3Nxs4EozSwi9d1iohQHBlsVB71KSzkuhIO3GOTeH4OmPD8xsKfAswS/VV4F4M1sC3ErwS9ALzxEchGUZ8BDwEVDeivf9GnjGzN4BSiOWvwhcsO9CM3A1UBi6MPsJwfP/+3HOrSQ49GNG89dC7/926DhcAvwotPzHwDVmNo/g6adoNZ8AzDOzRcCNwP845+qAi4E/mNli4DWCf+X/EbjMzD4k+AVfGWV7jwCfAAtDt6k+xOetstOBl6O8R7oAdZ0tMcXM0p1ze80sB5gHfME5t62da/gJUOGce6SV66cC1c45Z2bTCF50nuppkQev521gqnNut181iHd0TUFizUtmlk3wgvGt7R0IIQ8CXz+M9U8ieGHYgDKCdyb5wsxyCV5fUSB0UWopiIhImK4piIhImEJBRETCFAoiIhKmUBARkTCFgoiIhCkUREQk7P8DDyXj9dAru9MAAAAASUVORK5CYII=\n",
      "text/plain": [
       "<matplotlib.figure.Figure at 0x7f28219f1198>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "m.sched.plot(100)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "heading_collapsed": true
   },
   "source": [
    "### Sample"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 94,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars),\n",
    "                   0.04, 1, [1000,500], [0.001,0.01], y_range=y_range)\n",
    "lr = 1e-3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 95,
   "metadata": {
    "hidden": true
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "c37e4f0d61b4456686991170a6960104",
       "version_major": 2,
       "version_minor": 0
      },
      "text/html": [
       "<p>Failed to display Jupyter Widget of type <code>HBox</code>.</p>\n",
       "<p>\n",
       "  If you're reading this message in the Jupyter Notebook or JupyterLab Notebook, it may mean\n",
       "  that the widgets JavaScript is still loading. If this message persists, it\n",
       "  likely means that the widgets JavaScript library is either not installed or\n",
       "  not enabled. See the <a href=\"https://ipywidgets.readthedocs.io/en/stable/user_install.html\">Jupyter\n",
       "  Widgets Documentation</a> for setup instructions.\n",
       "</p>\n",
       "<p>\n",
       "  If you're reading this message in another frontend (for example, a static\n",
       "  rendering on GitHub or <a href=\"https://nbviewer.jupyter.org/\">NBViewer</a>),\n",
       "  it may mean that your frontend doesn't currently support widgets.\n",
       "</p>\n"
      ],
      "text/plain": [
       "HBox(children=(IntProgress(value=0, description='Epoch', max=3), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " 10%|█         | 673/6597 [00:05<00:52, 113.50it/s, loss=0.0324]\n",
      " 11%|█         | 699/6597 [00:06<00:51, 113.92it/s, loss=0.0326]"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Exception in thread Thread-4:\n",
      "Traceback (most recent call last):\n",
      "  File \"/home/paperspace/anaconda3/envs/fastai/lib/python3.6/threading.py\", line 916, in _bootstrap_inner\n",
      "    self.run()\n",
      "  File \"/home/paperspace/anaconda3/envs/fastai/lib/python3.6/site-packages/tqdm/_tqdm.py\", line 144, in run\n",
      "    for instance in self.tqdm_cls._instances:\n",
      "  File \"/home/paperspace/anaconda3/envs/fastai/lib/python3.6/_weakrefset.py\", line 60, in __iter__\n",
      "    for itemref in self.data:\n",
      "RuntimeError: Set changed size during iteration\n",
      "\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "epoch      trn_loss   val_loss   exp_rmspe                       \n",
      "    0      0.014263   0.000139   0.011843  \n",
      "    1      0.01102    0.004146   0.06236                         \n",
      "    2      0.009234   0.00016    0.012563                         \n",
      "\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[array([ 0.00016]), 0.012563288901677318]"
      ]
     },
     "execution_count": 95,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "m.fit(lr, 3, metrics=[exp_rmspe])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 96,
   "metadata": {
    "hidden": true
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "5fac79e4084b4fdfa1e5a9c578e25803",
       "version_major": 2,
       "version_minor": 0
      },
      "text/html": [
       "<p>Failed to display Jupyter Widget of type <code>HBox</code>.</p>\n",
       "<p>\n",
       "  If you're reading this message in the Jupyter Notebook or JupyterLab Notebook, it may mean\n",
       "  that the widgets JavaScript is still loading. If this message persists, it\n",
       "  likely means that the widgets JavaScript library is either not installed or\n",
       "  not enabled. See the <a href=\"https://ipywidgets.readthedocs.io/en/stable/user_install.html\">Jupyter\n",
       "  Widgets Documentation</a> for setup instructions.\n",
       "</p>\n",
       "<p>\n",
       "  If you're reading this message in another frontend (for example, a static\n",
       "  rendering on GitHub or <a href=\"https://nbviewer.jupyter.org/\">NBViewer</a>),\n",
       "  it may mean that your frontend doesn't currently support widgets.\n",
       "</p>\n"
      ],
      "text/plain": [
       "HBox(children=(IntProgress(value=0, description='Epoch', max=5), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "epoch      trn_loss   val_loss   exp_rmspe                        \n",
      "    0      0.006969   1e-05      0.003207  \n",
      "    1      0.007287   0.00012    0.011032                         \n",
      "    2      0.006732   0.000544   0.023599                         \n",
      "    3      0.007267   0.000478   0.022109                         \n",
      "    4      0.007106   0.000533   0.023349                         \n",
      "\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[array([ 0.00053]), 0.02334926432063833]"
      ]
     },
     "execution_count": 96,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "m.fit(lr, 5, metrics=[exp_rmspe], cycle_len=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 97,
   "metadata": {
    "hidden": true
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "112ba6e363664fd8b444bc3752b5c80c",
       "version_major": 2,
       "version_minor": 0
      },
      "text/html": [
       "<p>Failed to display Jupyter Widget of type <code>HBox</code>.</p>\n",
       "<p>\n",
       "  If you're reading this message in the Jupyter Notebook or JupyterLab Notebook, it may mean\n",
       "  that the widgets JavaScript is still loading. If this message persists, it\n",
       "  likely means that the widgets JavaScript library is either not installed or\n",
       "  not enabled. See the <a href=\"https://ipywidgets.readthedocs.io/en/stable/user_install.html\">Jupyter\n",
       "  Widgets Documentation</a> for setup instructions.\n",
       "</p>\n",
       "<p>\n",
       "  If you're reading this message in another frontend (for example, a static\n",
       "  rendering on GitHub or <a href=\"https://nbviewer.jupyter.org/\">NBViewer</a>),\n",
       "  it may mean that your frontend doesn't currently support widgets.\n",
       "</p>\n"
      ],
      "text/plain": [
       "HBox(children=(IntProgress(value=0, description='Epoch', max=8), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "epoch      trn_loss   val_loss   exp_rmspe                        \n",
      "    0      0.00776    1e-05      0.003209  \n",
      "    1      0.007136   0.000492   0.022438                         \n",
      "    2      0.005963   3.2e-05    0.005661                         \n",
      "    3      0.005697   0.000341   0.018628                         \n",
      "    4      0.007375   0.000179   0.013285                         \n",
      "    5      0.006432   0.0        0.000582                         \n",
      "    6      0.005697   0.000734   0.027468                         \n",
      "    7      0.00545    0.001149   0.034479                         \n",
      "\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[array([ 0.00115]), 0.034479439570524843]"
      ]
     },
     "execution_count": 97,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "m.fit(lr, 2, metrics=[exp_rmspe], cycle_len=4)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "heading_collapsed": true
   },
   "source": [
    "### All"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 94,
   "metadata": {
    "collapsed": true,
    "hidden": true
   },
   "outputs": [],
   "source": [
    "m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars),\n",
    "                   0.04, 1, [1000,500], [0.001,0.01], y_range=y_range)\n",
    "lr = 1e-3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 95,
   "metadata": {
    "hidden": true
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "15d0c7d5e9634030a624fb91c90b081c",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "A Jupyter Widget"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[ 0.       0.01456  0.01544  0.1148 ]                            \n",
      "\n"
     ]
    }
   ],
   "source": [
    "m.fit(lr, 1, metrics=[exp_rmspe])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 93,
   "metadata": {
    "hidden": true
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "58bd52a856754c359eae9fa25f951dd6",
       "version_major": 2,
       "version_minor": 0
      },
      "text/html": [
       "<p>Failed to display Jupyter Widget of type <code>HBox</code>.</p>\n",
       "<p>\n",
       "  If you're reading this message in the Jupyter Notebook or JupyterLab Notebook, it may mean\n",
       "  that the widgets JavaScript is still loading. If this message persists, it\n",
       "  likely means that the widgets JavaScript library is either not installed or\n",
       "  not enabled. See the <a href=\"https://ipywidgets.readthedocs.io/en/stable/user_install.html\">Jupyter\n",
       "  Widgets Documentation</a> for setup instructions.\n",
       "</p>\n",
       "<p>\n",
       "  If you're reading this message in another frontend (for example, a static\n",
       "  rendering on GitHub or <a href=\"https://nbviewer.jupyter.org/\">NBViewer</a>),\n",
       "  it may mean that your frontend doesn't currently support widgets.\n",
       "</p>\n"
      ],
      "text/plain": [
       "HBox(children=(IntProgress(value=0, description='Epoch', max=3), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[ 0.       0.01418  0.02066  0.12765]                           \n",
      "[ 1.       0.01081  0.01276  0.11221]                           \n",
      "[ 2.       0.00976  0.01233  0.10987]                            \n",
      "\n"
     ]
    }
   ],
   "source": [
    "m.fit(lr, 3, metrics=[exp_rmspe])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 94,
   "metadata": {
    "hidden": true
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "738899f6b0574b50be6d3f6efff9e578",
       "version_major": 2,
       "version_minor": 0
      },
      "text/html": [
       "<p>Failed to display Jupyter Widget of type <code>HBox</code>.</p>\n",
       "<p>\n",
       "  If you're reading this message in the Jupyter Notebook or JupyterLab Notebook, it may mean\n",
       "  that the widgets JavaScript is still loading. If this message persists, it\n",
       "  likely means that the widgets JavaScript library is either not installed or\n",
       "  not enabled. See the <a href=\"https://ipywidgets.readthedocs.io/en/stable/user_install.html\">Jupyter\n",
       "  Widgets Documentation</a> for setup instructions.\n",
       "</p>\n",
       "<p>\n",
       "  If you're reading this message in another frontend (for example, a static\n",
       "  rendering on GitHub or <a href=\"https://nbviewer.jupyter.org/\">NBViewer</a>),\n",
       "  it may mean that your frontend doesn't currently support widgets.\n",
       "</p>\n"
      ],
      "text/plain": [
       "HBox(children=(IntProgress(value=0, description='Epoch', max=3), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[ 0.       0.00801  0.01081  0.09899]                            \n",
      "[ 1.       0.00714  0.01083  0.09846]                            \n",
      "[ 2.       0.00707  0.01088  0.09878]                            \n",
      "\n"
     ]
    }
   ],
   "source": [
    "m.fit(lr, 3, metrics=[exp_rmspe], cycle_len=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "heading_collapsed": true
   },
   "source": [
    "### Test"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 185,
   "metadata": {
    "collapsed": true,
    "hidden": true
   },
   "outputs": [],
   "source": [
    "m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars),\n",
    "                   0.04, 1, [1000,500], [0.001,0.01], y_range=y_range)\n",
    "lr = 1e-3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 102,
   "metadata": {
    "hidden": true
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "74827eff47794b5eb8d09982d30202fe",
       "version_major": 2,
       "version_minor": 0
      },
      "text/html": [
       "<p>Failed to display Jupyter Widget of type <code>HBox</code>.</p>\n",
       "<p>\n",
       "  If you're reading this message in the Jupyter Notebook or JupyterLab Notebook, it may mean\n",
       "  that the widgets JavaScript is still loading. If this message persists, it\n",
       "  likely means that the widgets JavaScript library is either not installed or\n",
       "  not enabled. See the <a href=\"https://ipywidgets.readthedocs.io/en/stable/user_install.html\">Jupyter\n",
       "  Widgets Documentation</a> for setup instructions.\n",
       "</p>\n",
       "<p>\n",
       "  If you're reading this message in another frontend (for example, a static\n",
       "  rendering on GitHub or <a href=\"https://nbviewer.jupyter.org/\">NBViewer</a>),\n",
       "  it may mean that your frontend doesn't currently support widgets.\n",
       "</p>\n"
      ],
      "text/plain": [
       "HBox(children=(IntProgress(value=0, description='Epoch', max=3), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[ 0.       0.01413  0.0063   0.07628]                           \n",
      "[ 1.       0.01022  0.00859  0.08851]                           \n",
      "[ 2.       0.00932  0.00001  0.00243]                            \n",
      "\n"
     ]
    }
   ],
   "source": [
    "m.fit(lr, 3, metrics=[exp_rmspe])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 103,
   "metadata": {
    "hidden": true
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "07f0b0fea31c4afa8c6770085d233230",
       "version_major": 2,
       "version_minor": 0
      },
      "text/html": [
       "<p>Failed to display Jupyter Widget of type <code>HBox</code>.</p>\n",
       "<p>\n",
       "  If you're reading this message in the Jupyter Notebook or JupyterLab Notebook, it may mean\n",
       "  that the widgets JavaScript is still loading. If this message persists, it\n",
       "  likely means that the widgets JavaScript library is either not installed or\n",
       "  not enabled. See the <a href=\"https://ipywidgets.readthedocs.io/en/stable/user_install.html\">Jupyter\n",
       "  Widgets Documentation</a> for setup instructions.\n",
       "</p>\n",
       "<p>\n",
       "  If you're reading this message in another frontend (for example, a static\n",
       "  rendering on GitHub or <a href=\"https://nbviewer.jupyter.org/\">NBViewer</a>),\n",
       "  it may mean that your frontend doesn't currently support widgets.\n",
       "</p>\n"
      ],
      "text/plain": [
       "HBox(children=(IntProgress(value=0, description='Epoch', max=3), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[ 0.       0.00748  0.       0.00167]                            \n",
      "[ 1.       0.00717  0.00009  0.00947]                            \n",
      "[ 2.       0.00643  0.00013  0.01147]                            \n",
      "\n"
     ]
    }
   ],
   "source": [
    "m.fit(lr, 3, metrics=[exp_rmspe], cycle_len=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 105,
   "metadata": {
    "collapsed": true,
    "hidden": true
   },
   "outputs": [],
   "source": [
    "m.save('val0')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 186,
   "metadata": {
    "collapsed": true,
    "hidden": true
   },
   "outputs": [],
   "source": [
    "m.load('val0')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 187,
   "metadata": {
    "collapsed": true,
    "hidden": true
   },
   "outputs": [],
   "source": [
    "x,y=m.predict_with_targs()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 188,
   "metadata": {
    "hidden": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.01147316926177568"
      ]
     },
     "execution_count": 188,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "exp_rmspe(x,y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 189,
   "metadata": {
    "collapsed": true,
    "hidden": true
   },
   "outputs": [],
   "source": [
    "pred_test=m.predict(True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 190,
   "metadata": {
    "collapsed": true,
    "hidden": true
   },
   "outputs": [],
   "source": [
    "pred_test = np.exp(pred_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 192,
   "metadata": {
    "collapsed": true,
    "hidden": true
   },
   "outputs": [],
   "source": [
    "joined_test['Sales']=pred_test"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 193,
   "metadata": {
    "collapsed": true,
    "hidden": true
   },
   "outputs": [],
   "source": [
    "csv_fn=f'{PATH}tmp/sub.csv'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 200,
   "metadata": {
    "collapsed": true,
    "hidden": true
   },
   "outputs": [],
   "source": [
    "joined_test[['Id','Sales']].to_csv(csv_fn, index=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 202,
   "metadata": {
    "hidden": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<a href='data/rossmann/tmp/sub.csv' target='_blank'>data/rossmann/tmp/sub.csv</a><br>"
      ],
      "text/plain": [
       "/home/ubuntu/fastai/courses/dl1/data/rossmann/tmp/sub.csv"
      ]
     },
     "execution_count": 202,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "FileLink(csv_fn)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "heading_collapsed": true
   },
   "source": [
    "## RF"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 101,
   "metadata": {
    "collapsed": true,
    "hidden": true
   },
   "outputs": [],
   "source": [
    "from sklearn.ensemble import RandomForestRegressor"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 102,
   "metadata": {
    "collapsed": true,
    "hidden": true
   },
   "outputs": [],
   "source": [
    "((val,trn), (y_val,y_trn)) = split_by_idx(val_idx, df.values, yl)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 103,
   "metadata": {
    "collapsed": true,
    "hidden": true
   },
   "outputs": [],
   "source": [
    "m = RandomForestRegressor(n_estimators=40, max_features=0.99, min_samples_leaf=2,\n",
    "                          n_jobs=-1, oob_score=True)\n",
    "m.fit(trn, y_trn);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 104,
   "metadata": {
    "hidden": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(0.98086411192483902,\n",
       " 0.92614447508562714,\n",
       " 0.9193358549649463,\n",
       " 0.11557443993375387)"
      ]
     },
     "execution_count": 104,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "preds = m.predict(val)\n",
    "m.score(trn, y_trn), m.score(val, y_val), m.oob_score_, exp_rmspe(preds, y_val)"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  },
  "toc": {
   "colors": {
    "hover_highlight": "#DAA520",
    "navigate_num": "#000000",
    "navigate_text": "#333333",
    "running_highlight": "#FF0000",
    "selected_highlight": "#FFD700",
    "sidebar_border": "#EEEEEE",
    "wrapper_background": "#FFFFFF"
   },
   "moveMenuLeft": true,
   "nav_menu": {
    "height": "173px",
    "width": "251px"
   },
   "navigate_menu": true,
   "number_sections": true,
   "sideBar": true,
   "threshold": 4,
   "toc_cell": false,
   "toc_section_display": "block",
   "toc_window_display": false,
   "widenNotebook": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}