{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Quickstart\n", "\n", "We are glad this package caught your attention, so let's briefly showcase its power." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data\n", "In this tutorial, we will use a historical Rossmann sales dataset that\n", "we load via the `hcrystalball.utils.get_sales_data` function.\n", "\n", "A description of the dataset and available columns is given in the docstring." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "plt.style.use('seaborn')\n", "plt.rcParams['figure.figsize'] = [12, 6]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from hcrystalball.utils import get_sales_data\n", "\n", "df = get_sales_data(n_dates=100, \n", " n_assortments=2, \n", " n_states=2, \n", " n_stores=2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Define search space\n", "The next step is to define a `ModelSelector`, specifying the frequency the data will be resampled to, how many steps ahead the forecast should run, and optionally the column that contains the ISO code of the country/region used to look up holiday information for the given days.\n", "\n", "Once that is done, create the grid search with the possible exogenous columns and extend it with custom models." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from hcrystalball.model_selection import ModelSelector\n", "\n", "ms = ModelSelector(horizon=10, \n", " frequency='D', \n", " country_code_column='HolidayCode', \n", " )\n", "\n", "ms.create_gridsearch(sklearn_models=True,\n", " n_splits=2,\n", " between_split_lag=None,\n", " sklearn_models_optimize_for_horizon=False,\n", " autosarimax_models=False,\n", " prophet_models=False,\n", " tbats_models=False,\n", " exp_smooth_models=False,\n", " 
average_ensembles=False,\n", " stacking_ensembles=False, \n", " exog_cols=['Open','Promo','SchoolHoliday','Promo2'],\n", "# holidays_days_before=2, \n", "# holidays_days_after=1, \n", "# holidays_bridge_days=True, \n", " )" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from hcrystalball.wrappers import get_sklearn_wrapper \n", "from sklearn.linear_model import LinearRegression\n", "\n", "ms.add_model_to_gridsearch(get_sklearn_wrapper(LinearRegression))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Run model selection\n", "By default, the run partitions the data by `partition_columns` and loops over all partitions.\n", "\n", "If your problem is large enough to justify the parallelization overhead, you can also use `parallel_over_columns`, a subset of `partition_columns` over which the parallel run (using prefect) is started.\n", "\n", "If you expect the run to take long, it might be good to store the results directly. Here the `output_path` and `persist_` arguments come in handy." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# from prefect.engine.executors import LocalDaskExecutor\n", "ms.select_model(df=df,\n", " target_col_name='Sales',\n", " partition_columns=['Assortment', 'State','Store'],\n", "# parallel_over_columns=['Assortment'],\n", "# persist_model_selector_results=False,\n", "# output_path='my_results',\n", "# executor = LocalDaskExecutor(), \n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Look at the results\n", "Naturally, we are interested in which models were chosen, so that we can strip the failing ones from our parameter grid and extend it with more sophisticated models from the most frequently selected classes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ms.plot_best_wrapper_classes();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There also 
exists a convenient method to plot the results over all (or a subset of) the data partitions to see how well our model fitted the data during cross validation." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ms.plot_results(plot_from='2015-06-01');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Accessing 1 Time-Series\n", "To get more information, it is advisable to go from the all-partitions level (`ModelSelector`) to the single-partition level (`ModelSelectorResult`).\n", "\n", "`ModelSelector` stores results as a list of `ModelSelectorResult` objects in `self.results`. Here we provide a rich `__repr__` that hints at what information is available.\n", "\n", "Another way to get the `ModelSelectorResult` is to use `ModelSelector.get_result_for_partition`, which returns the same result even after loading stored results. List access (`ModelSelector.results[0]`) is not reliable there, because each `ModelSelectorResult` is stored under its `partition_hash` name and a later load ingests these files in alphabetical order.\n", "\n", "Accessing the **training data** to see what is behind the model, checking `cv_results` for the **fitting time** or for how big a margin the best model had over the second-best one, and accessing the **model definition** to explore its **parameters** are all things we found useful." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ms.results[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "res = ms.get_result_for_partition(partition=ms.results[0].partition)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Plotting results and errors for 1 time series\n", "At this level we can also access the forecast plots: one that we already know, with cv_forecasts, and one that shows only the errors." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ 
"res.plot_result(plot_from='2015-06-01', title='forecasts');" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "res.plot_error(title='Errors');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Persist and load\n", "To enable later use of the results, there are plenty of methods that help store and load the results of model selection in a uniform way.\n", "\n", "Some methods and functions persist/load whole objects (`load_model_selector`, `load_model_selector_result`), while others provide direct access to only the parts we might care about when running in production with space limitations (`load_best_model`, `load_partition`, ...)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from hcrystalball.model_selection import load_model_selector\n", "from hcrystalball.model_selection import load_model_selector_result\n", "from hcrystalball.model_selection import load_best_model\n", "\n", "res.persist(path='tmp')\n", "res = load_model_selector_result(path='tmp', partition_label=ms.results[0].partition) \n", "\n", "ms.persist_results(folder_path='tmp/results')\n", "ms = load_model_selector(folder_path='tmp/results')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ms = load_model_selector(folder_path='tmp/results')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ms.plot_results(plot_from='2015-06-01');" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# cleanup: remove the temporary folder, ignoring errors if it does not exist\n", "import shutil\n", "shutil.rmtree('tmp', ignore_errors=True)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", 
"nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 4 }