{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Seasonalities and Holidays\n", "Although time-series modeling is not especially feature-rich, we can still leverage deterministic features that are known in advance: calendar patterns.\n", "\n", "The easiest way to model seasonality and leverage the power of ML models is to create a binary feature for each repeating calendar pattern (day of the week, month of the year, ...). This is implemented in the\n", "`SeasonalityTransformer`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "plt.style.use('seaborn')\n", "plt.rcParams['figure.figsize'] = [12, 6]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from hcrystalball.utils import get_sales_data\n", "\n", "# Load sample daily sales for a single store and keep data up to the end of 2014\n", "df = get_sales_data(n_dates=365*2,\n", " n_assortments=1,\n", " n_states=1,\n", " n_stores=1)[:\"2014-12-31\"]\n", "# X carries only the datetime index; y is the target series\n", "X, y = pd.DataFrame(index=df.index), df['Sales']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Seasonality Features" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from hcrystalball.feature_extraction import SeasonalityTransformer\n", "SeasonalityTransformer(freq='D').fit(X, y).transform(X).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Holiday Features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One important part of time-series modeling is accounting for holidays, which in most cases behave differently from ordinary days. hcrystalball implements `HolidayTransformer`, which returns a column with the string name of the holiday for each date in the dataset, based on the provided country ISO code (an empty string if the date is not a holiday). 
All hcrystalball wrappers accept the output of `HolidayTransformer` and convert it into the format each individual model expects. `HolidayTransformer` also supports some region-specific holidays, e.g. holidays of individual German states; in that case the provided string should have the form country-region, e.g. 'DE-HE'." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from hcrystalball.feature_extraction import HolidayTransformer\n", "HolidayTransformer(country_code='DE').fit(X, y).transform(X).tail(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Holidays often affect the target variable before or after the holiday itself (e.g. the whole Christmas period is affected by a few public holidays). To model such effects, `HolidayTransformer` provides three parameters: `days_before`, the number of days before a public holiday that should be taken into account; `days_after`, the number of days after a public holiday that should be taken into account; and the boolean `bridge_days`, which creates a variable for days where the `days_before` and `days_after` features overlap, mostly to model working days that fall between two public holidays." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from hcrystalball.feature_extraction import HolidayTransformer\n", "# Inspect the days around Easter 2014\n", "HolidayTransformer(country_code='DE', days_before=2, days_after=1, bridge_days=True).fit(X, y).transform(X)['2014-04-16':'2014-04-23']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## In Pipelines" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is possible to combine several countries/regions for the holidays. The behavior with multiple holiday columns is wrapper-specific."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from hcrystalball.compose import TSColumnTransformer\n", "from hcrystalball.wrappers import ExponentialSmoothingWrapper\n", "from hcrystalball.wrappers import get_sklearn_wrapper\n", "from sklearn.ensemble import RandomForestRegressor\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.preprocessing import OneHotEncoder" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from hcrystalball.wrappers import SarimaxWrapper" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Chain German and Belgian holidays, seasonality features, and a random forest model\n", "pipeline = Pipeline([\n", " ('holidays_de', HolidayTransformer(country_code='DE')),\n", " ('holidays_be', HolidayTransformer(country_code='BE')),\n", " ('seasonality', SeasonalityTransformer(freq='D')),\n", " ('model', get_sklearn_wrapper(RandomForestRegressor, random_state=42))\n", "])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Fit on everything except the last 10 days, predict those days, and join with the actuals\n", "preds = (pipeline.fit(X[:-10], y[:-10])\n", " .predict(X[-10:])\n", " .merge(y, left_index=True, right_index=True, how='outer')\n", " .tail(50))\n", "\n", "preds.plot(title=f\"MAE:{(preds['Sales']-preds['sklearn']).abs().mean().round(3)}\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### A country code column can also be used
\n", "*Note: This column is dropped in the `HolidayTransformer.transform` method so that it does not pollute further processing.*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Read the country code from a column of X instead of passing it directly\n", "pipeline_col = Pipeline([\n", " ('holidays_de', HolidayTransformer(country_code_column='germany')),\n", " ('holidays_be', HolidayTransformer(country_code_column='belgium')),\n", " ('seasonality', SeasonalityTransformer(freq='D')),\n", " ('model', get_sklearn_wrapper(RandomForestRegressor, random_state=42))\n", "])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_col = X.copy().assign(germany='DE').assign(belgium='BE')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "preds_col = (pipeline_col.fit(X_col[:-10], y[:-10])\n", " .predict(X_col[-10:])\n", " .merge(y, left_index=True, right_index=True, how='outer')\n", " .tail(50))\n", "\n", "preds_col.plot(title=f\"MAE:{(preds_col['Sales'] - preds_col['sklearn']).abs().mean().round(3)}\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## With exogenous variables" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Add a simple linear trend as an extra exogenous feature\n", "X_col['trend'] = np.arange(len(X))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "preds_col = (pipeline_col.fit(X_col[:-10], y[:-10])\n", " .predict(X_col[-10:])\n", " .merge(y, left_index=True, right_index=True, how='outer')\n", " .tail(50))\n", "\n", "preds_col.plot(title=f\"MAE:{(preds_col['Sales'] - preds_col['sklearn']).abs().mean().round(3)}\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Individual holidays as separate columns" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sklearns_native_transformers = TSColumnTransformer(\n", " transformers=[\n", " ('one_hot_encoder', OneHotEncoder(sparse=False, 
drop='first'), ['_holiday_CZ'])\n", " ]) \n", "\n", "pipeline_col = Pipeline([\n", " ('holidays', HolidayTransformer(country_code='CZ')), \n", " ('one_hot_encoder', sklearns_native_transformers)\n", "]) " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pipeline_col.fit_transform(X_col['2013-12-20':'2013-12-26'])" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 4 }