{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Large Scale learning with ModelSelector\n", "Very often we have many different products, regions, countries, shops...for which we need to delivery forecast. This can be easily done with `ModelSelector`. `ModelSelector` though does not bind you to use multiple data partitions and can also serve as convenient layer for accessing relevant information quickly." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "plt.style.use('seaborn')\n", "plt.rcParams['figure.figsize'] = [12, 6]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from hcrystalball.model_selection import ModelSelector\n", "from hcrystalball.utils import get_sales_data\n", "from hcrystalball.wrappers import get_sklearn_wrapper, ThetaWrapper\n", "from sklearn.linear_model import LinearRegression\n", "from sklearn.ensemble import RandomForestRegressor" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get Dummy Data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = get_sales_data(n_dates=365*2, \n", " n_assortments=1, \n", " n_states=1, \n", " n_stores=2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# let's start simple\n", "df_minimal = df[['Store','Sales']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Get predefined sklearn models" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`ModelSelector` has already predefined large scale of hcrystalball models by their classes. To get this predefined gridsearch use `create_gridsearch` method. It will allow you to create hundereds of different models in a second. 
Here, for the sake of time, we will take advantage of the method's defaults for CV splits, the scorer etc. and just extend an empty grid with a few custom models" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Note on the hcb_verbose flag\n", "> To make the grid search run faster, we create an empty grid with `create_gridsearch` and later add just a few models to it. \n", "Each wrapper's default is `hcb_verbose=True` (so that one can see warnings etc. on the wrapper layer), \n", "while in grid search the default is `hcb_verbose=False` (so that the output of model selection stays reasonably concise). \n", "We therefore add models with `hcb_verbose=False` to the grid search, to better simulate the default settings of typical usage \n", "(where one selects e.g. `prophet_models=True`, `sklearn_models=True` in `create_gridsearch` and the `hcb_verbose` flag stays `False` by default)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ms_minimal = ModelSelector(horizon=10, frequency='D')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# see full default parameter grid in hands on exercise\n", "ms_minimal.create_gridsearch(\n", " sklearn_models=False,\n", " n_splits=2,\n", " between_split_lag=None,\n", " sklearn_models_optimize_for_horizon=False,\n", " autosarimax_models=False,\n", " prophet_models=False,\n", " theta_models=False,\n", " tbats_models=False,\n", " exp_smooth_models=False,\n", " average_ensembles=False,\n", " stacking_ensembles=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Extend with custom models" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ms_minimal.add_model_to_gridsearch(get_sklearn_wrapper(LinearRegression, hcb_verbose=False))\n", "ms_minimal.add_model_to_gridsearch(get_sklearn_wrapper(RandomForestRegressor, random_state=42, hcb_verbose=False))\n", 
"ms_minimal.add_model_to_gridsearch(ThetaWrapper(hcb_verbose=False))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Run model selection\n", "Method `select_model` is doing majority of the magic for you - it creates forecast for each combination of columns specified in `partition_columns` and for each of the time series it will run grid_search mentioned above. Optionally once can select list of columns over which the model selection will run in parallel using prefect (`parallel_over_columns`). \n", "\n", "Required format for data is Datetime index, unsuprisingly numerical column for `target_col_name` all other columns except `partition_columns` will be used as exogenous variables - as additional features for modeling." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ms_minimal.select_model(df=df_minimal, partition_columns=['Store'], target_col_name='Sales')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ms_minimal.plot_results(plot_from='2015-06');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Persist and Load\n", "`ModelSelector` stores multiple `ModelSelectorResults` in given folder as pickle files. As we only have 1 partition, only 1 file is written and loaded back." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ms_minimal" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ms_minimal.persist_results('results')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from hcrystalball.model_selection import load_model_selector" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ms_loaded = load_model_selector('results')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ms_loaded" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# cleanup\n", "import shutil\n", "try:\n", " shutil.rmtree('results')\n", "except:\n", " pass" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 4 }