{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Large Scale learning with ModelSelector\n", "Very often we have many different products, regions, countries, shops...for which we need to delivery forecast. This can be easily done with `ModelSelector`. `ModelSelector` though does not bind you to use multiple data partitions and can also serve as convenient layer for accessing relevant information quickly." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "plt.style.use('seaborn')\n", "plt.rcParams['figure.figsize'] = [12, 6]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from hcrystalball.model_selection import ModelSelector\n", "from hcrystalball.utils import get_sales_data\n", "from hcrystalball.wrappers import get_sklearn_wrapper, ThetaWrapper\n", "from sklearn.linear_model import LinearRegression\n", "from sklearn.ensemble import RandomForestRegressor" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get Dummy Data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = get_sales_data(n_dates=365*2, \n", " n_assortments=1, \n", " n_states=1, \n", " n_stores=2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# let's start simple\n", "df_minimal = df[['Store','Sales']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Get predefined sklearn models" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`ModelSelector` has already predefined large scale of hcrystalball models by their classes. To get this predefined gridsearch use `create_gridsearch` method. It will allow you to create hundereds of different models in a second. 
Here, for the sake of time, we will take advantage of the method's defaults for CV splits, the scorer etc. and just extend an empty grid with a few custom models" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Note on the hcb_verbose flag\n", "> To make the grid search run faster, we create an empty grid with `create_gridsearch` and later add just a few models to it. \n", "Each wrapper's default is `hcb_verbose=True` (so that one can see warnings etc. on the wrapper layer), \n", "while in grid search the default is `hcb_verbose=False` (so that the output of model selection stays reasonably concise). \n", "We therefore add models with `hcb_verbose=False` to the grid search, to better simulate the default settings of typical usage \n", "(where one selects e.g. `prophet_models=True`, `sklearn_models=True` in `create_gridsearch` and the `hcb_verbose` flag stays `False` by default)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ms_minimal = ModelSelector(horizon=10, frequency='D')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# see full default parameter grid in hands on exercise\n", "ms_minimal.create_gridsearch(\n", " sklearn_models=False,\n", " n_splits=2,\n", " between_split_lag=None,\n", " sklearn_models_optimize_for_horizon=False,\n", " autosarimax_models=False,\n", " prophet_models=False,\n", " theta_models=False,\n", " tbats_models=False,\n", " exp_smooth_models=False,\n", " average_ensembles=False,\n", " stacking_ensembles=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Extend with custom models" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ms_minimal.add_model_to_gridsearch(get_sklearn_wrapper(LinearRegression, hcb_verbose=False))\n", "ms_minimal.add_model_to_gridsearch(get_sklearn_wrapper(RandomForestRegressor, random_state=42, hcb_verbose=False))\n", 
"ms_minimal.add_model_to_gridsearch(ThetaWrapper(hcb_verbose=False))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Run model selection\n", "Method `select_model` is doing majority of the magic for you - it creates forecast for each combination of columns specified in `partition_columns` and for each of the time series it will run grid_search mentioned above. Optionally once can select list of columns over which the model selection will run in parallel using prefect (`parallel_over_columns`). \n", "\n", "Required format for data is Datetime index, unsuprisingly numerical column for `target_col_name` all other columns except `partition_columns` will be used as exogenous variables - as additional features for modeling." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ms_minimal.select_model(df=df_minimal, partition_columns=['Store'], target_col_name='Sales')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ms_minimal.plot_results(plot_from='2015-06');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Persist and Load\n", "`ModelSelector` stores multiple `ModelSelectorResults` in given folder as pickle files. As we only have 1 partition, only 1 file is written and loaded back." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ms_minimal" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ms_minimal.persist_results('results')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from hcrystalball.model_selection import load_model_selector" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ms_loaded = load_model_selector('results')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ms_loaded" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# cleanup\n", "import shutil\n", "try:\n", " shutil.rmtree('results')\n", "except:\n", " pass" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 4 }