{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"# Overview \n",
"\n",
"In the 10x series of notebooks, we will look at Time Series modeling in pycaret using univariate data and no exogenous variables. We will use the famous airline dataset for illustration. Our plan of action is as follows:\n",
"\n",
"1. Perform EDA on the dataset to extract valuable insight about the process generating the time series. **(COMPLETED)**\n",
"2. Model the dataset based on exploratory analysis (univariable model without exogenous variables). **(Covered in this notebook)**\n",
"3. Use an automated approach (AutoML) to improve the performance.\n",
"4. User customizations, potential pitfalls and how to overcome them. "
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Only enable critical logging (Optional)\n",
"import os\n",
"os.environ[\"PYCARET_CUSTOM_LOGGING_LEVEL\"] = \"CRITICAL\""
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"System:\n",
" python: 3.8.13 (default, Mar 28 2022, 06:59:08) [MSC v.1916 64 bit (AMD64)]\n",
"executable: C:\\Users\\Nikhil\\.conda\\envs\\pycaret_dev_sktime_0p11_2\\python.exe\n",
" machine: Windows-10-10.0.19044-SP0\n",
"\n",
"PyCaret required dependencies:\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"C:\\Users\\Nikhil\\.conda\\envs\\pycaret_dev_sktime_0p11_2\\lib\\site-packages\\_distutils_hack\\__init__.py:30: UserWarning: Setuptools is replacing distutils.\n",
" warnings.warn(\"Setuptools is replacing distutils.\")\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
" pip: 21.2.2\n",
" setuptools: 61.2.0\n",
" pycaret: 3.0.0\n",
" ipython: Not installed\n",
" ipywidgets: 7.7.0\n",
" numpy: 1.21.6\n",
" pandas: 1.4.2\n",
" jinja2: 3.1.2\n",
" scipy: 1.8.0\n",
" joblib: 1.1.0\n",
" sklearn: 1.0.2\n",
" pyod: Installed but version unavailable\n",
" imblearn: 0.9.0\n",
" category_encoders: 2.4.1\n",
" lightgbm: 3.3.2\n",
" numba: 0.55.1\n",
" requests: 2.27.1\n",
" matplotlib: 3.5.2\n",
" scikitplot: 0.3.7\n",
" yellowbrick: 1.4\n",
" plotly: 5.8.0\n",
" kaleido: 0.2.1\n",
" statsmodels: 0.13.2\n",
" sktime: 0.11.4\n",
" tbats: Installed but version unavailable\n",
" pmdarima: 1.8.5\n",
"\n",
"PyCaret optional dependencies:\n",
" shap: Not installed\n",
" interpret: Not installed\n",
" umap: Not installed\n",
" pandas_profiling: Not installed\n",
" explainerdashboard: Not installed\n",
" autoviz: Not installed\n",
" fairlearn: Not installed\n",
" xgboost: Not installed\n",
" catboost: Not installed\n",
" kmodes: Not installed\n",
" mlxtend: Not installed\n",
" statsforecast: 0.5.5\n",
" tune_sklearn: Not installed\n",
" ray: Not installed\n",
" hyperopt: Not installed\n",
" optuna: Not installed\n",
" skopt: Not installed\n",
" mlflow: 1.25.1\n",
" gradio: Not installed\n",
" fastapi: Not installed\n",
" uvicorn: Not installed\n",
" m2cgen: Not installed\n",
" evidently: Not installed\n",
" nltk: Not installed\n",
" pyLDAvis: Not installed\n",
" gensim: Not installed\n",
" spacy: Not installed\n",
" wordcloud: Not installed\n",
" textblob: Not installed\n",
" psutil: 5.9.0\n",
" fugue: Not installed\n",
" streamlit: Not installed\n",
" prophet: Not installed\n"
]
}
],
"source": [
"def what_is_installed():\n",
" from pycaret import show_versions\n",
" show_versions()\n",
"\n",
"try:\n",
" what_is_installed()\n",
"except ModuleNotFoundError:\n",
" !pip install pycaret\n",
" what_is_installed()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"import time\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"from pycaret.datasets import get_data\n",
"from pycaret.time_series import TSForecastingExperiment"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"y = get_data('airline', verbose=False)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# We want to forecast the next 12 months of data and we will use 3 fold cross-validation to test the models.\n",
"fh = 12 # or alternately fh = np.arange(1,13)\n",
"fold = 3"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# Global Plot Settings\n",
"fig_kwargs={'renderer': 'notebook'}"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Description | \n",
" Value | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" session_id | \n",
" 8316 | \n",
"
\n",
" \n",
" 1 | \n",
" Target | \n",
" Number of airline passengers | \n",
"
\n",
" \n",
" 2 | \n",
" Approach | \n",
" Univariate | \n",
"
\n",
" \n",
" 3 | \n",
" Exogenous Variables | \n",
" Not Present | \n",
"
\n",
" \n",
" 4 | \n",
" Original data shape | \n",
" (144, 1) | \n",
"
\n",
" \n",
" 5 | \n",
" Transformed data shape | \n",
" (144, 1) | \n",
"
\n",
" \n",
" 6 | \n",
" Transformed train set shape | \n",
" (132, 1) | \n",
"
\n",
" \n",
" 7 | \n",
" Transformed test set shape | \n",
" (12, 1) | \n",
"
\n",
" \n",
" 8 | \n",
" Rows with missing values | \n",
" 0.0% | \n",
"
\n",
" \n",
" 9 | \n",
" Fold Generator | \n",
" ExpandingWindowSplitter | \n",
"
\n",
" \n",
" 10 | \n",
" Fold Number | \n",
" 3 | \n",
"
\n",
" \n",
" 11 | \n",
" Enforce Prediction Interval | \n",
" False | \n",
"
\n",
" \n",
" 12 | \n",
" Seasonal Period(s) Tested | \n",
" 12 | \n",
"
\n",
" \n",
" 13 | \n",
" Seasonality Present | \n",
" True | \n",
"
\n",
" \n",
" 14 | \n",
" Seasonalities Detected | \n",
" [12] | \n",
"
\n",
" \n",
" 15 | \n",
" Primary Seasonality | \n",
" 12 | \n",
"
\n",
" \n",
" 16 | \n",
" Target Strictly Positive | \n",
" True | \n",
"
\n",
" \n",
" 17 | \n",
" Target White Noise | \n",
" No | \n",
"
\n",
" \n",
" 18 | \n",
" Recommended d | \n",
" 1 | \n",
"
\n",
" \n",
" 19 | \n",
" Recommended Seasonal D | \n",
" 1 | \n",
"
\n",
" \n",
" 20 | \n",
" Preprocess | \n",
" False | \n",
"
\n",
" \n",
" 21 | \n",
" CPU Jobs | \n",
" -1 | \n",
"
\n",
" \n",
" 22 | \n",
" Use GPU | \n",
" False | \n",
"
\n",
" \n",
" 23 | \n",
" Log Experiment | \n",
" False | \n",
"
\n",
" \n",
" 24 | \n",
" Experiment Name | \n",
" ts-default-name | \n",
"
\n",
" \n",
" 25 | \n",
" USI | \n",
" 8630 | \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
""
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"exp = TSForecastingExperiment()\n",
"exp.setup(data=y, fh=fh, fig_kwargs=fig_kwargs)"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"# Available Models\n",
"\n",
"`pycaret` Time Series Forecasting module has a rich set of models ranging from traditional statistical models such as ARIMA, Exponential Smoothing, ETS, etc to [Reduced Regression Models](https://github.com/pycaret/pycaret/discussions/1760)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Name | \n",
" Reference | \n",
" Turbo | \n",
"
\n",
" \n",
" ID | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" naive | \n",
" Naive Forecaster | \n",
" sktime.forecasting.naive.NaiveForecaster | \n",
" True | \n",
"
\n",
" \n",
" grand_means | \n",
" Grand Means Forecaster | \n",
" sktime.forecasting.naive.NaiveForecaster | \n",
" True | \n",
"
\n",
" \n",
" snaive | \n",
" Seasonal Naive Forecaster | \n",
" sktime.forecasting.naive.NaiveForecaster | \n",
" True | \n",
"
\n",
" \n",
" polytrend | \n",
" Polynomial Trend Forecaster | \n",
" sktime.forecasting.trend.PolynomialTrendForeca... | \n",
" True | \n",
"
\n",
" \n",
" arima | \n",
" ARIMA | \n",
" sktime.forecasting.arima.ARIMA | \n",
" True | \n",
"
\n",
" \n",
" auto_arima | \n",
" Auto ARIMA | \n",
" sktime.forecasting.arima.AutoARIMA | \n",
" True | \n",
"
\n",
" \n",
" exp_smooth | \n",
" Exponential Smoothing | \n",
" sktime.forecasting.exp_smoothing.ExponentialSm... | \n",
" True | \n",
"
\n",
" \n",
" croston | \n",
" Croston | \n",
" sktime.forecasting.croston.Croston | \n",
" True | \n",
"
\n",
" \n",
" ets | \n",
" ETS | \n",
" sktime.forecasting.ets.AutoETS | \n",
" True | \n",
"
\n",
" \n",
" theta | \n",
" Theta Forecaster | \n",
" sktime.forecasting.theta.ThetaForecaster | \n",
" True | \n",
"
\n",
" \n",
" tbats | \n",
" TBATS | \n",
" sktime.forecasting.tbats.TBATS | \n",
" False | \n",
"
\n",
" \n",
" bats | \n",
" BATS | \n",
" sktime.forecasting.bats.BATS | \n",
" False | \n",
"
\n",
" \n",
" lr_cds_dt | \n",
" Linear w/ Cond. Deseasonalize & Detrending | \n",
" pycaret.containers.models.time_series.BaseCdsD... | \n",
" True | \n",
"
\n",
" \n",
" en_cds_dt | \n",
" Elastic Net w/ Cond. Deseasonalize & Detrending | \n",
" pycaret.containers.models.time_series.BaseCdsD... | \n",
" True | \n",
"
\n",
" \n",
" ridge_cds_dt | \n",
" Ridge w/ Cond. Deseasonalize & Detrending | \n",
" pycaret.containers.models.time_series.BaseCdsD... | \n",
" True | \n",
"
\n",
" \n",
" lasso_cds_dt | \n",
" Lasso w/ Cond. Deseasonalize & Detrending | \n",
" pycaret.containers.models.time_series.BaseCdsD... | \n",
" True | \n",
"
\n",
" \n",
" lar_cds_dt | \n",
" Least Angular Regressor w/ Cond. Deseasonalize... | \n",
" pycaret.containers.models.time_series.BaseCdsD... | \n",
" True | \n",
"
\n",
" \n",
" llar_cds_dt | \n",
" Lasso Least Angular Regressor w/ Cond. Deseaso... | \n",
" pycaret.containers.models.time_series.BaseCdsD... | \n",
" True | \n",
"
\n",
" \n",
" br_cds_dt | \n",
" Bayesian Ridge w/ Cond. Deseasonalize & Detren... | \n",
" pycaret.containers.models.time_series.BaseCdsD... | \n",
" True | \n",
"
\n",
" \n",
" huber_cds_dt | \n",
" Huber w/ Cond. Deseasonalize & Detrending | \n",
" pycaret.containers.models.time_series.BaseCdsD... | \n",
" True | \n",
"
\n",
" \n",
" par_cds_dt | \n",
" Passive Aggressive w/ Cond. Deseasonalize & De... | \n",
" pycaret.containers.models.time_series.BaseCdsD... | \n",
" True | \n",
"
\n",
" \n",
" omp_cds_dt | \n",
" Orthogonal Matching Pursuit w/ Cond. Deseasona... | \n",
" pycaret.containers.models.time_series.BaseCdsD... | \n",
" True | \n",
"
\n",
" \n",
" knn_cds_dt | \n",
" K Neighbors w/ Cond. Deseasonalize & Detrending | \n",
" pycaret.containers.models.time_series.BaseCdsD... | \n",
" True | \n",
"
\n",
" \n",
" dt_cds_dt | \n",
" Decision Tree w/ Cond. Deseasonalize & Detrending | \n",
" pycaret.containers.models.time_series.BaseCdsD... | \n",
" True | \n",
"
\n",
" \n",
" rf_cds_dt | \n",
" Random Forest w/ Cond. Deseasonalize & Detrending | \n",
" pycaret.containers.models.time_series.BaseCdsD... | \n",
" True | \n",
"
\n",
" \n",
" et_cds_dt | \n",
" Extra Trees w/ Cond. Deseasonalize & Detrending | \n",
" pycaret.containers.models.time_series.BaseCdsD... | \n",
" True | \n",
"
\n",
" \n",
" gbr_cds_dt | \n",
" Gradient Boosting w/ Cond. Deseasonalize & Det... | \n",
" pycaret.containers.models.time_series.BaseCdsD... | \n",
" True | \n",
"
\n",
" \n",
" ada_cds_dt | \n",
" AdaBoost w/ Cond. Deseasonalize & Detrending | \n",
" pycaret.containers.models.time_series.BaseCdsD... | \n",
" True | \n",
"
\n",
" \n",
" lightgbm_cds_dt | \n",
" Light Gradient Boosting w/ Cond. Deseasonalize... | \n",
" pycaret.containers.models.time_series.BaseCdsD... | \n",
" True | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Name \\\n",
"ID \n",
"naive Naive Forecaster \n",
"grand_means Grand Means Forecaster \n",
"snaive Seasonal Naive Forecaster \n",
"polytrend Polynomial Trend Forecaster \n",
"arima ARIMA \n",
"auto_arima Auto ARIMA \n",
"exp_smooth Exponential Smoothing \n",
"croston Croston \n",
"ets ETS \n",
"theta Theta Forecaster \n",
"tbats TBATS \n",
"bats BATS \n",
"lr_cds_dt Linear w/ Cond. Deseasonalize & Detrending \n",
"en_cds_dt Elastic Net w/ Cond. Deseasonalize & Detrending \n",
"ridge_cds_dt Ridge w/ Cond. Deseasonalize & Detrending \n",
"lasso_cds_dt Lasso w/ Cond. Deseasonalize & Detrending \n",
"lar_cds_dt Least Angular Regressor w/ Cond. Deseasonalize... \n",
"llar_cds_dt Lasso Least Angular Regressor w/ Cond. Deseaso... \n",
"br_cds_dt Bayesian Ridge w/ Cond. Deseasonalize & Detren... \n",
"huber_cds_dt Huber w/ Cond. Deseasonalize & Detrending \n",
"par_cds_dt Passive Aggressive w/ Cond. Deseasonalize & De... \n",
"omp_cds_dt Orthogonal Matching Pursuit w/ Cond. Deseasona... \n",
"knn_cds_dt K Neighbors w/ Cond. Deseasonalize & Detrending \n",
"dt_cds_dt Decision Tree w/ Cond. Deseasonalize & Detrending \n",
"rf_cds_dt Random Forest w/ Cond. Deseasonalize & Detrending \n",
"et_cds_dt Extra Trees w/ Cond. Deseasonalize & Detrending \n",
"gbr_cds_dt Gradient Boosting w/ Cond. Deseasonalize & Det... \n",
"ada_cds_dt AdaBoost w/ Cond. Deseasonalize & Detrending \n",
"lightgbm_cds_dt Light Gradient Boosting w/ Cond. Deseasonalize... \n",
"\n",
" Reference Turbo \n",
"ID \n",
"naive sktime.forecasting.naive.NaiveForecaster True \n",
"grand_means sktime.forecasting.naive.NaiveForecaster True \n",
"snaive sktime.forecasting.naive.NaiveForecaster True \n",
"polytrend sktime.forecasting.trend.PolynomialTrendForeca... True \n",
"arima sktime.forecasting.arima.ARIMA True \n",
"auto_arima sktime.forecasting.arima.AutoARIMA True \n",
"exp_smooth sktime.forecasting.exp_smoothing.ExponentialSm... True \n",
"croston sktime.forecasting.croston.Croston True \n",
"ets sktime.forecasting.ets.AutoETS True \n",
"theta sktime.forecasting.theta.ThetaForecaster True \n",
"tbats sktime.forecasting.tbats.TBATS False \n",
"bats sktime.forecasting.bats.BATS False \n",
"lr_cds_dt pycaret.containers.models.time_series.BaseCdsD... True \n",
"en_cds_dt pycaret.containers.models.time_series.BaseCdsD... True \n",
"ridge_cds_dt pycaret.containers.models.time_series.BaseCdsD... True \n",
"lasso_cds_dt pycaret.containers.models.time_series.BaseCdsD... True \n",
"lar_cds_dt pycaret.containers.models.time_series.BaseCdsD... True \n",
"llar_cds_dt pycaret.containers.models.time_series.BaseCdsD... True \n",
"br_cds_dt pycaret.containers.models.time_series.BaseCdsD... True \n",
"huber_cds_dt pycaret.containers.models.time_series.BaseCdsD... True \n",
"par_cds_dt pycaret.containers.models.time_series.BaseCdsD... True \n",
"omp_cds_dt pycaret.containers.models.time_series.BaseCdsD... True \n",
"knn_cds_dt pycaret.containers.models.time_series.BaseCdsD... True \n",
"dt_cds_dt pycaret.containers.models.time_series.BaseCdsD... True \n",
"rf_cds_dt pycaret.containers.models.time_series.BaseCdsD... True \n",
"et_cds_dt pycaret.containers.models.time_series.BaseCdsD... True \n",
"gbr_cds_dt pycaret.containers.models.time_series.BaseCdsD... True \n",
"ada_cds_dt pycaret.containers.models.time_series.BaseCdsD... True \n",
"lightgbm_cds_dt pycaret.containers.models.time_series.BaseCdsD... True "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"exp.models()"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"# Modeling (Manual)\n",
"\n",
"In our exploratory analysis, we found that the characteristics of the data meant that we need to difference the data once and take a seasonal difference with period = 12. We also concluded that some autoregressive properties still need to be taken care of after doing this. \n",
"\n",
"More specifically, if we were building an ARIMA model, we would start with **ARIMA(1,1,0)x(0,1,0,12)**. Let's build this next."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
" \n",
" \n",
" | \n",
" Description | \n",
" Value | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" session_id | \n",
" 42 | \n",
"
\n",
" \n",
" 1 | \n",
" Target | \n",
" Number of airline passengers | \n",
"
\n",
" \n",
" 2 | \n",
" Approach | \n",
" Univariate | \n",
"
\n",
" \n",
" 3 | \n",
" Exogenous Variables | \n",
" Not Present | \n",
"
\n",
" \n",
" 4 | \n",
" Original data shape | \n",
" (144, 1) | \n",
"
\n",
" \n",
" 5 | \n",
" Transformed data shape | \n",
" (144, 1) | \n",
"
\n",
" \n",
" 6 | \n",
" Transformed train set shape | \n",
" (132, 1) | \n",
"
\n",
" \n",
" 7 | \n",
" Transformed test set shape | \n",
" (12, 1) | \n",
"
\n",
" \n",
" 8 | \n",
" Rows with missing values | \n",
" 0.0% | \n",
"
\n",
" \n",
" 9 | \n",
" Fold Generator | \n",
" ExpandingWindowSplitter | \n",
"
\n",
" \n",
" 10 | \n",
" Fold Number | \n",
" 3 | \n",
"
\n",
" \n",
" 11 | \n",
" Enforce Prediction Interval | \n",
" False | \n",
"
\n",
" \n",
" 12 | \n",
" Seasonal Period(s) Tested | \n",
" 12 | \n",
"
\n",
" \n",
" 13 | \n",
" Seasonality Present | \n",
" True | \n",
"
\n",
" \n",
" 14 | \n",
" Seasonalities Detected | \n",
" [12] | \n",
"
\n",
" \n",
" 15 | \n",
" Primary Seasonality | \n",
" 12 | \n",
"
\n",
" \n",
" 16 | \n",
" Target Strictly Positive | \n",
" True | \n",
"
\n",
" \n",
" 17 | \n",
" Target White Noise | \n",
" No | \n",
"
\n",
" \n",
" 18 | \n",
" Recommended d | \n",
" 1 | \n",
"
\n",
" \n",
" 19 | \n",
" Recommended Seasonal D | \n",
" 1 | \n",
"
\n",
" \n",
" 20 | \n",
" Preprocess | \n",
" False | \n",
"
\n",
" \n",
" 21 | \n",
" CPU Jobs | \n",
" -1 | \n",
"
\n",
" \n",
" 22 | \n",
" Use GPU | \n",
" False | \n",
"
\n",
" \n",
" 23 | \n",
" Log Experiment | \n",
" False | \n",
"
\n",
" \n",
" 24 | \n",
" Experiment Name | \n",
" ts-default-name | \n",
"
\n",
" \n",
" 25 | \n",
" USI | \n",
" 56a3 | \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
""
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"exp = TSForecastingExperiment()\n",
"exp.setup(data=y, fh=fh, fold=fold, fig_kwargs=fig_kwargs, session_id=42)"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"**Note:**\n",
"\n",
"The `setup` provides some useful information out of the box.\n",
"\n",
"1. Seasonal period of 12 was tested and seasonality was detected at this period. This is what will be used in subsequent modeling automatically.\n",
"2. The data splits are also shown - 132 data points for the training dataset and 12 data points for the test dataset.\n",
"3. The training dataset will be cross validated using 3 folds. \n"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## ARIMA Model\n",
"\n",
"**NOTE**:\n",
"\n",
"1. Specific model hyperparameters can be passed as kwargs to the model. \n",
"2. All models in the time series module are are based on the `sktime` package.\n",
"2. More details about creating and customizing time series models in pycaret can be found here: https://github.com/pycaret/pycaret/discussions/1757\n"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
" \n",
" \n",
" | \n",
" cutoff | \n",
" MASE | \n",
" RMSSE | \n",
" MAE | \n",
" RMSE | \n",
" MAPE | \n",
" SMAPE | \n",
" R2 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1956-12 | \n",
" 0.3535 | \n",
" 0.4103 | \n",
" 10.3216 | \n",
" 13.4315 | \n",
" 0.0255 | \n",
" 0.0260 | \n",
" 0.9413 | \n",
"
\n",
" \n",
" 1 | \n",
" 1957-12 | \n",
" 0.6844 | \n",
" 0.6853 | \n",
" 20.9235 | \n",
" 23.2653 | \n",
" 0.0581 | \n",
" 0.0560 | \n",
" 0.8582 | \n",
"
\n",
" \n",
" 2 | \n",
" 1958-12 | \n",
" 1.5988 | \n",
" 1.4673 | \n",
" 45.6850 | \n",
" 47.6955 | \n",
" 0.1066 | \n",
" 0.1132 | \n",
" 0.4911 | \n",
"
\n",
" \n",
" Mean | \n",
" nan | \n",
" 0.8789 | \n",
" 0.8543 | \n",
" 25.6434 | \n",
" 28.1308 | \n",
" 0.0634 | \n",
" 0.0651 | \n",
" 0.7635 | \n",
"
\n",
" \n",
" SD | \n",
" nan | \n",
" 0.5267 | \n",
" 0.4477 | \n",
" 14.8178 | \n",
" 14.4051 | \n",
" 0.0333 | \n",
" 0.0362 | \n",
" 0.1956 | \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"model = exp.create_model(\"arima\", order=(1,1,0), seasonal_order=(0,1,0,12))"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"**NOTE:**\n",
"* `create_model` will highlight the cross validation scores across the folds. The time cutoff for each fold is also displayes for convenience. Users may wish to correlate this cutoff with what they get from `plot_model(plot=\"cv\")`.\n",
"\n",
"* `create_model` retrains the model on the entire dataset after performing cross validation. This allows us to check the performance of the model against the test set simply by using `predict_model`"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
" \n",
" \n",
" | \n",
" Model | \n",
" MASE | \n",
" RMSSE | \n",
" MAE | \n",
" RMSE | \n",
" MAPE | \n",
" SMAPE | \n",
" R2 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" ARIMA | \n",
" 0.6999 | \n",
" 0.7757 | \n",
" 21.3121 | \n",
" 26.7998 | \n",
" 0.0480 | \n",
" 0.0462 | \n",
" 0.8703 | \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" y_pred | \n",
"
\n",
" \n",
" \n",
" \n",
" 1960-01 | \n",
" 424.7154 | \n",
"
\n",
" \n",
" 1960-02 | \n",
" 408.1599 | \n",
"
\n",
" \n",
" 1960-03 | \n",
" 472.4447 | \n",
"
\n",
" \n",
" 1960-04 | \n",
" 463.0139 | \n",
"
\n",
" \n",
" 1960-05 | \n",
" 487.5134 | \n",
"
\n",
" \n",
" 1960-06 | \n",
" 540.0299 | \n",
"
\n",
" \n",
" 1960-07 | \n",
" 616.5423 | \n",
"
\n",
" \n",
" 1960-08 | \n",
" 628.0557 | \n",
"
\n",
" \n",
" 1960-09 | \n",
" 532.5688 | \n",
"
\n",
" \n",
" 1960-10 | \n",
" 477.0820 | \n",
"
\n",
" \n",
" 1960-11 | \n",
" 432.5952 | \n",
"
\n",
" \n",
" 1960-12 | \n",
" 476.1084 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" y_pred\n",
"1960-01 424.7154\n",
"1960-02 408.1599\n",
"1960-03 472.4447\n",
"1960-04 463.0139\n",
"1960-05 487.5134\n",
"1960-06 540.0299\n",
"1960-07 616.5423\n",
"1960-08 628.0557\n",
"1960-09 532.5688\n",
"1960-10 477.0820\n",
"1960-11 432.5952\n",
"1960-12 476.1084"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Out-of-sample Forecasts\n",
"y_predict = exp.predict_model(model)\n",
"y_predict"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"The scores listed above are for the test set. We can see that the metrics are actually slightly better than the mean Cross validation score which implies that we have not overfit the model. More details about this can be found in **[this article](https://towardsdatascience.com/bias-variance-tradeoff-in-time-series-8434f536387a)**."
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"In a previous notebook, we saw that `plot_model` without an estimator argument works on the original dataset. In addition, by passing the model (`estimator`) to the `plot_model` call, we can plot model diagnostics as well."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/html": [
" \n",
" "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Plot the out-of-sample forecasts\n",
"exp.plot_model(estimator=model)\n",
"\n",
"# # Alternately the following will plot the same thing.\n",
"# exp.plot_model(estimator=model, plot=\"forecast\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"**NOTE:** \n",
"* `predict_model` is intelligent enough to understand the current state of the model (i.e. it is only trained using the _train_ dataset).\n",
"* Since the model has only been trained on the _train_ set so far, the predictons are made for the _test_ set.\n",
"* Later, we will see that once the model is finalized (trained on the complete _train + test_ set), `predict_model` automatically makes the true future predictons automatically.\n",
"* Also note that if the model supports prediction intervals, they are plotted by default for convenience."
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Next, let's check the goodness of fit using both diagnostic plots as well as statistical tests. Similar to plot_model, passing an estimator to the `check_stats` call will perform the tests on the model residuals."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Test | \n",
" Test Name | \n",
" Data | \n",
" Property | \n",
" Setting | \n",
" Value | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Summary | \n",
" Statistics | \n",
" Residual | \n",
" Length | \n",
" | \n",
" 131.0 | \n",
"
\n",
" \n",
" 1 | \n",
" Summary | \n",
" Statistics | \n",
" Residual | \n",
" # Missing Values | \n",
" | \n",
" 0.0 | \n",
"
\n",
" \n",
" 2 | \n",
" Summary | \n",
" Statistics | \n",
" Residual | \n",
" Mean | \n",
" | \n",
" -0.445207 | \n",
"
\n",
" \n",
" 3 | \n",
" Summary | \n",
" Statistics | \n",
" Residual | \n",
" Median | \n",
" | \n",
" -0.9606 | \n",
"
\n",
" \n",
" 4 | \n",
" Summary | \n",
" Statistics | \n",
" Residual | \n",
" Standard Deviation | \n",
" | \n",
" 11.759243 | \n",
"
\n",
" \n",
" 5 | \n",
" Summary | \n",
" Statistics | \n",
" Residual | \n",
" Variance | \n",
" | \n",
" 138.27979 | \n",
"
\n",
" \n",
" 6 | \n",
" Summary | \n",
" Statistics | \n",
" Residual | \n",
" Kurtosis | \n",
" | \n",
" 4.244741 | \n",
"
\n",
" \n",
" 7 | \n",
" Summary | \n",
" Statistics | \n",
" Residual | \n",
" Skewness | \n",
" | \n",
" -0.938657 | \n",
"
\n",
" \n",
" 8 | \n",
" Summary | \n",
" Statistics | \n",
" Residual | \n",
" # Distinct Values | \n",
" | \n",
" 127.0 | \n",
"
\n",
" \n",
" 9 | \n",
" White Noise | \n",
" Ljung-Box | \n",
" Residual | \n",
" Test Statictic | \n",
" {'alpha': 0.05, 'K': 24} | \n",
" 21.29991 | \n",
"
\n",
" \n",
" 10 | \n",
" White Noise | \n",
" Ljung-Box | \n",
" Residual | \n",
" Test Statictic | \n",
" {'alpha': 0.05, 'K': 48} | \n",
" 43.239443 | \n",
"
\n",
" \n",
" 11 | \n",
" White Noise | \n",
" Ljung-Box | \n",
" Residual | \n",
" p-value | \n",
" {'alpha': 0.05, 'K': 24} | \n",
" 0.620976 | \n",
"
\n",
" \n",
" 12 | \n",
" White Noise | \n",
" Ljung-Box | \n",
" Residual | \n",
" p-value | \n",
" {'alpha': 0.05, 'K': 48} | \n",
" 0.667946 | \n",
"
\n",
" \n",
" 13 | \n",
" White Noise | \n",
" Ljung-Box | \n",
" Residual | \n",
" White Noise | \n",
" {'alpha': 0.05, 'K': 24} | \n",
" True | \n",
"
\n",
" \n",
" 14 | \n",
" White Noise | \n",
" Ljung-Box | \n",
" Residual | \n",
" White Noise | \n",
" {'alpha': 0.05, 'K': 48} | \n",
" True | \n",
"
\n",
" \n",
" 15 | \n",
" Stationarity | \n",
" ADF | \n",
" Residual | \n",
" Stationarity | \n",
" {'alpha': 0.05} | \n",
" True | \n",
"
\n",
" \n",
" 16 | \n",
" Stationarity | \n",
" ADF | \n",
" Residual | \n",
" p-value | \n",
" {'alpha': 0.05} | \n",
" 0.0 | \n",
"
\n",
" \n",
" 17 | \n",
" Stationarity | \n",
" ADF | \n",
" Residual | \n",
" Test Statistic | \n",
" {'alpha': 0.05} | \n",
" -11.577344 | \n",
"
\n",
" \n",
" 18 | \n",
" Stationarity | \n",
" ADF | \n",
" Residual | \n",
" Critical Value 1% | \n",
" {'alpha': 0.05} | \n",
" -3.481682 | \n",
"
\n",
" \n",
" 19 | \n",
" Stationarity | \n",
" ADF | \n",
" Residual | \n",
" Critical Value 5% | \n",
" {'alpha': 0.05} | \n",
" -2.884042 | \n",
"
\n",
" \n",
" 20 | \n",
" Stationarity | \n",
" ADF | \n",
" Residual | \n",
" Critical Value 10% | \n",
" {'alpha': 0.05} | \n",
" -2.57877 | \n",
"
\n",
" \n",
" 21 | \n",
" Stationarity | \n",
" KPSS | \n",
" Residual | \n",
" Trend Stationarity | \n",
" {'alpha': 0.05} | \n",
" True | \n",
"
\n",
" \n",
" 22 | \n",
" Stationarity | \n",
" KPSS | \n",
" Residual | \n",
" p-value | \n",
" {'alpha': 0.05} | \n",
" 0.1 | \n",
"
\n",
" \n",
" 23 | \n",
" Stationarity | \n",
" KPSS | \n",
" Residual | \n",
" Test Statistic | \n",
" {'alpha': 0.05} | \n",
" 0.031457 | \n",
"
\n",
" \n",
" 24 | \n",
" Stationarity | \n",
" KPSS | \n",
" Residual | \n",
" Critical Value 10% | \n",
" {'alpha': 0.05} | \n",
" 0.119 | \n",
"
\n",
" \n",
" 25 | \n",
" Stationarity | \n",
" KPSS | \n",
" Residual | \n",
" Critical Value 5% | \n",
" {'alpha': 0.05} | \n",
" 0.146 | \n",
"
\n",
" \n",
" 26 | \n",
" Stationarity | \n",
" KPSS | \n",
" Residual | \n",
" Critical Value 2.5% | \n",
" {'alpha': 0.05} | \n",
" 0.176 | \n",
"
\n",
" \n",
" 27 | \n",
" Stationarity | \n",
" KPSS | \n",
" Residual | \n",
" Critical Value 1% | \n",
" {'alpha': 0.05} | \n",
" 0.216 | \n",
"
\n",
" \n",
" 28 | \n",
" Normality | \n",
" Shapiro | \n",
" Residual | \n",
" Normality | \n",
" {'alpha': 0.05} | \n",
" False | \n",
"
\n",
" \n",
" 29 | \n",
" Normality | \n",
" Shapiro | \n",
" Residual | \n",
" p-value | \n",
" {'alpha': 0.05} | \n",
" 0.00003 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Test Test Name Data Property \\\n",
"0 Summary Statistics Residual Length \n",
"1 Summary Statistics Residual # Missing Values \n",
"2 Summary Statistics Residual Mean \n",
"3 Summary Statistics Residual Median \n",
"4 Summary Statistics Residual Standard Deviation \n",
"5 Summary Statistics Residual Variance \n",
"6 Summary Statistics Residual Kurtosis \n",
"7 Summary Statistics Residual Skewness \n",
"8 Summary Statistics Residual # Distinct Values \n",
"9 White Noise Ljung-Box Residual Test Statictic \n",
"10 White Noise Ljung-Box Residual Test Statictic \n",
"11 White Noise Ljung-Box Residual p-value \n",
"12 White Noise Ljung-Box Residual p-value \n",
"13 White Noise Ljung-Box Residual White Noise \n",
"14 White Noise Ljung-Box Residual White Noise \n",
"15 Stationarity ADF Residual Stationarity \n",
"16 Stationarity ADF Residual p-value \n",
"17 Stationarity ADF Residual Test Statistic \n",
"18 Stationarity ADF Residual Critical Value 1% \n",
"19 Stationarity ADF Residual Critical Value 5% \n",
"20 Stationarity ADF Residual Critical Value 10% \n",
"21 Stationarity KPSS Residual Trend Stationarity \n",
"22 Stationarity KPSS Residual p-value \n",
"23 Stationarity KPSS Residual Test Statistic \n",
"24 Stationarity KPSS Residual Critical Value 10% \n",
"25 Stationarity KPSS Residual Critical Value 5% \n",
"26 Stationarity KPSS Residual Critical Value 2.5% \n",
"27 Stationarity KPSS Residual Critical Value 1% \n",
"28 Normality Shapiro Residual Normality \n",
"29 Normality Shapiro Residual p-value \n",
"\n",
" Setting Value \n",
"0 131.0 \n",
"1 0.0 \n",
"2 -0.445207 \n",
"3 -0.9606 \n",
"4 11.759243 \n",
"5 138.27979 \n",
"6 4.244741 \n",
"7 -0.938657 \n",
"8 127.0 \n",
"9 {'alpha': 0.05, 'K': 24} 21.29991 \n",
"10 {'alpha': 0.05, 'K': 48} 43.239443 \n",
"11 {'alpha': 0.05, 'K': 24} 0.620976 \n",
"12 {'alpha': 0.05, 'K': 48} 0.667946 \n",
"13 {'alpha': 0.05, 'K': 24} True \n",
"14 {'alpha': 0.05, 'K': 48} True \n",
"15 {'alpha': 0.05} True \n",
"16 {'alpha': 0.05} 0.0 \n",
"17 {'alpha': 0.05} -11.577344 \n",
"18 {'alpha': 0.05} -3.481682 \n",
"19 {'alpha': 0.05} -2.884042 \n",
"20 {'alpha': 0.05} -2.57877 \n",
"21 {'alpha': 0.05} True \n",
"22 {'alpha': 0.05} 0.1 \n",
"23 {'alpha': 0.05} 0.031457 \n",
"24 {'alpha': 0.05} 0.119 \n",
"25 {'alpha': 0.05} 0.146 \n",
"26 {'alpha': 0.05} 0.176 \n",
"27 {'alpha': 0.05} 0.216 \n",
"28 {'alpha': 0.05} False \n",
"29 {'alpha': 0.05} 0.00003 "
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Check Goodness of Fit\n",
"exp.check_stats(model)"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"**Observations**\n",
"\n",
"1. Stationarity tests indicate that the residuals are stationary. \n",
"2. The white noise test indicates that the residuals are consistent with white noise. \n",
"\n",
"This indicates that we have done a good job of extracting most of the signal from the time series data.\n",
"\n",
"Next, we can plot the diagnostics on the residuals just like we did it on the original dataset."
]
},
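{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"To make the white noise check concrete, here is a minimal standalone sketch of the Ljung-Box statistic. This is an illustration using numpy/scipy only, not pycaret's internal implementation; the `k=24` setting mirrors the `K` value shown in the table above.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from scipy import stats\n",
"\n",
"def ljung_box(residuals, k=24, alpha=0.05):\n",
"    # Ljung-Box test: are the first k autocorrelations jointly zero?\n",
"    x = np.asarray(residuals, dtype=float)\n",
"    n = len(x)\n",
"    x = x - x.mean()\n",
"    denom = np.sum(x ** 2)\n",
"    acf = np.array([np.sum(x[lag:] * x[:-lag]) / denom for lag in range(1, k + 1)])\n",
"    q = n * (n + 2) * np.sum(acf ** 2 / (n - np.arange(1, k + 1)))\n",
"    p_value = stats.chi2.sf(q, df=k)\n",
"    return q, p_value, bool(p_value > alpha)  # True -> consistent with white noise\n",
"\n",
"rng = np.random.default_rng(42)\n",
"q, p, is_wn = ljung_box(rng.normal(size=131), k=24)\n",
"```\n",
"\n",
"A strongly autocorrelated series (for example, a pure seasonal sine wave) would fail this test, while iid noise should generally pass."
]
},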
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/html": [
" \n",
" "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"exp.plot_model(model, plot='diagnostics', fig_kwargs={\"height\": 800, \"width\": 1000})"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"**Observations**\n",
"\n",
"1. The ACF and PACF indicate that we have captured most of the autocorelation in the data. There is no serial autocorrelation left in the data to capture.\n",
"2. The histogram and QQ plot do indicate some left skewness, but overall the results are satisfactory.\n",
"\n",
"**NOTE:**\n",
"\n",
"These plots can be obtained individually as well if needed using the following calls\n",
"\n",
"* `exp.plot_model(model, plot='residuals')`\n",
"* `exp.plot_model(model, plot='acf')`\n",
"* `exp.plot_model(model, plot='pacf')`\n",
"* `exp.plot_model(model, plot='periodogram')`\n",
"\n",
"**Another useful plot is the `insample` plot** which shows the model fit to the actual data. "
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/html": [
" \n",
" "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"exp.plot_model(model, plot='insample')"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"We could also check the decomposition of the residuals to see if \n",
"\n",
"1. The residual in the decomposition the largest component?\n",
"2. there is any any visible trend or seasonality component that has not been captured in the model?"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/html": [
" \n",
" "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
" \n",
" "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"exp.plot_model(model, plot=\"decomp\")\n",
"exp.plot_model(model, plot=\"decomp_stl\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"### Reduced Regressors: LightGBM (with internal conditional deseasonalize and detrending)\n",
"\n",
"We noted above that we could use regression models for time series data as well by converting them into an appropriate format (reduced regression models). Let's see one of these in action. We will use the LightGBM regressor for this. Reduced regression models in pycaret will also detrend and conditionally deseasonalize the data to make it easier for the regression model to capture the autoregressive properties of the data."
]
},
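{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"The core idea behind reduction can be sketched in a few lines: slide a fixed-length window over the series so that each row holds the previous `window_length` values as features and the next value as the target. This is a simplified illustration with pandas, not pycaret's actual pipeline.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"def make_reduction_frame(y, window_length=12):\n",
"    # Each row: the last `window_length` values (features) plus the next value (target)\n",
"    rows = [\n",
"        list(y.iloc[i : i + window_length]) + [y.iloc[i + window_length]]\n",
"        for i in range(len(y) - window_length)\n",
"    ]\n",
"    cols = [f'lag_{window_length - j}' for j in range(window_length)] + ['target']\n",
"    return pd.DataFrame(rows, columns=cols)\n",
"\n",
"y = pd.Series(np.arange(24, dtype=float))  # stand-in for the airline series\n",
"frame = make_reduction_frame(y, window_length=12)\n",
"# 24 observations with a window of 12 yield 12 tabular training rows\n",
"```\n",
"\n",
"Any tabular regressor (here, LightGBM) can then be fit on such a frame; forecasting proceeds one step at a time, feeding predictions back into the window."
]
},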
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
" \n",
" \n",
" | \n",
" cutoff | \n",
" MASE | \n",
" RMSSE | \n",
" MAE | \n",
" RMSE | \n",
" MAPE | \n",
" SMAPE | \n",
" R2 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1956-12 | \n",
" 0.7545 | \n",
" 0.9171 | \n",
" 22.0339 | \n",
" 30.0192 | \n",
" 0.0538 | \n",
" 0.0562 | \n",
" 0.7067 | \n",
"
\n",
" \n",
" 1 | \n",
" 1957-12 | \n",
" 0.8044 | \n",
" 0.8105 | \n",
" 24.5938 | \n",
" 27.5155 | \n",
" 0.0646 | \n",
" 0.0634 | \n",
" 0.8017 | \n",
"
\n",
" \n",
" 2 | \n",
" 1958-12 | \n",
" 0.8880 | \n",
" 1.0076 | \n",
" 25.3731 | \n",
" 32.7521 | \n",
" 0.0543 | \n",
" 0.0565 | \n",
" 0.7600 | \n",
"
\n",
" \n",
" Mean | \n",
" nan | \n",
" 0.8156 | \n",
" 0.9117 | \n",
" 24.0002 | \n",
" 30.0956 | \n",
" 0.0575 | \n",
" 0.0587 | \n",
" 0.7561 | \n",
"
\n",
" \n",
" SD | \n",
" nan | \n",
" 0.0551 | \n",
" 0.0805 | \n",
" 1.4264 | \n",
" 2.1385 | \n",
" 0.0050 | \n",
" 0.0034 | \n",
" 0.0389 | \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
" \n",
" \n",
" | \n",
" Model | \n",
" MASE | \n",
" RMSSE | \n",
" MAE | \n",
" RMSE | \n",
" MAPE | \n",
" SMAPE | \n",
" R2 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" LGBMRegressor | \n",
" 0.8837 | \n",
" 0.9749 | \n",
" 26.9090 | \n",
" 33.6827 | \n",
" 0.0531 | \n",
" 0.0549 | \n",
" 0.7952 | \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
" \n",
" "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"model = exp.create_model(\"lightgbm_cds_dt\")\n",
"y_predict = exp.predict_model(model)\n",
"exp.plot_model(estimator=model)"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"**Observations:**\n",
" \n",
"1. The overall cross validation metrics are comparable to the ARIMA model, but the forecasts on test data are not good. \n",
"\n",
"We may wish to tune the hyperparameters of the model to see if we can improve the performance."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
" \n",
" \n",
" | \n",
" cutoff | \n",
" MASE | \n",
" RMSSE | \n",
" MAE | \n",
" RMSE | \n",
" MAPE | \n",
" SMAPE | \n",
" R2 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1956-12 | \n",
" 0.3577 | \n",
" 0.4269 | \n",
" 10.4450 | \n",
" 13.9748 | \n",
" 0.0263 | \n",
" 0.0265 | \n",
" 0.9364 | \n",
"
\n",
" \n",
" 1 | \n",
" 1957-12 | \n",
" 1.0518 | \n",
" 1.0513 | \n",
" 32.1555 | \n",
" 35.6915 | \n",
" 0.0903 | \n",
" 0.0854 | \n",
" 0.6663 | \n",
"
\n",
" \n",
" 2 | \n",
" 1958-12 | \n",
" 0.4808 | \n",
" 0.5409 | \n",
" 13.7386 | \n",
" 17.5819 | \n",
" 0.0322 | \n",
" 0.0319 | \n",
" 0.9308 | \n",
"
\n",
" \n",
" Mean | \n",
" nan | \n",
" 0.6301 | \n",
" 0.6730 | \n",
" 18.7797 | \n",
" 22.4161 | \n",
" 0.0496 | \n",
" 0.0479 | \n",
" 0.8445 | \n",
"
\n",
" \n",
" SD | \n",
" nan | \n",
" 0.3024 | \n",
" 0.2715 | \n",
" 9.5532 | \n",
" 9.5020 | \n",
" 0.0289 | \n",
" 0.0266 | \n",
" 0.1261 | \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
" \n",
" "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Random Grid Search\n",
"tuned_model = exp.tune_model(model)\n",
"exp.plot_model(estimator=tuned_model)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"BaseCdsDtForecaster(regressor=LGBMRegressor(random_state=42), sp=12,\n",
" window_length=12)\n",
"BaseCdsDtForecaster(degree=2, deseasonal_model='multiplicative',\n",
" regressor=LGBMRegressor(bagging_freq=5, colsample_bytree=1,\n",
" learning_rate=0.0025551361408324464,\n",
" max_depth=8, min_child_samples=78,\n",
" n_estimators=224, num_leaves=253,\n",
" random_state=42,\n",
" reg_alpha=3.144127709026421e-10,\n",
" reg_lambda=3.7899513783298346e-07,\n",
" subsample=1),\n",
" sp=12, window_length=13)\n"
]
}
],
"source": [
"print(model)\n",
"print(tuned_model)"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"This is much better than before in terms of metrics as well as the comparison to the test data. We can even compare the model performance visually."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
" \n",
" "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"exp.plot_model([model, tuned_model], data_kwargs={\"labels\": [\"Baseline\", \"Tuned\"]})"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## Getting Ready for Production\n",
"\n",
"So now we have built 2 models manually. We can not put one of them into production. Let's pick the Reduced regression model for this. \n",
"\n",
"Before, we can use this model for making future predictions, we need to finalize this. This step will take the model from the previous stage and without changing the model hyperparameters, train the model on the entire _train + test_ dataset so that we can make true future forecasts.\n",
"\n",
"### Finalizing Models"
]
},
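{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Conceptually, finalizing clones the tuned estimator (keeping its hyperparameters) and refits it on all available data. Below is a generic scikit-learn sketch of that idea; it is an illustration only, not what pycaret calls internally.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.base import clone\n",
"from sklearn.linear_model import Ridge\n",
"\n",
"X = np.arange(20, dtype=float).reshape(-1, 1)\n",
"y = 2 * X.ravel() + 1\n",
"\n",
"tuned = Ridge(alpha=0.5).fit(X[:15], y[:15])  # fit on the train split only\n",
"final = clone(tuned).fit(X, y)                # same hyperparameters, full data\n",
"# Hyperparameters are identical; only the fitted coefficients can differ\n",
"```"
]
},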
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/html": [
" \n",
" "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" y_pred | \n",
"
\n",
" \n",
" \n",
" \n",
" 1961-01 | \n",
" 452.2415 | \n",
"
\n",
" \n",
" 1961-02 | \n",
" 442.2790 | \n",
"
\n",
" \n",
" 1961-03 | \n",
" 507.9412 | \n",
"
\n",
" \n",
" 1961-04 | \n",
" 495.7020 | \n",
"
\n",
" \n",
" 1961-05 | \n",
" 502.1397 | \n",
"
\n",
" \n",
" 1961-06 | \n",
" 573.5355 | \n",
"
\n",
" \n",
" 1961-07 | \n",
" 636.7857 | \n",
"
\n",
" \n",
" 1961-08 | \n",
" 637.9355 | \n",
"
\n",
" \n",
" 1961-09 | \n",
" 558.5830 | \n",
"
\n",
" \n",
" 1961-10 | \n",
" 489.0101 | \n",
"
\n",
" \n",
" 1961-11 | \n",
" 428.0954 | \n",
"
\n",
" \n",
" 1961-12 | \n",
" 483.7110 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" y_pred\n",
"1961-01 452.2415\n",
"1961-02 442.2790\n",
"1961-03 507.9412\n",
"1961-04 495.7020\n",
"1961-05 502.1397\n",
"1961-06 573.5355\n",
"1961-07 636.7857\n",
"1961-08 637.9355\n",
"1961-09 558.5830\n",
"1961-10 489.0101\n",
"1961-11 428.0954\n",
"1961-12 483.7110"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Trains the model with the best hyperparameters on the entire dataset now\n",
"final_model = exp.finalize_model(tuned_model)\n",
"exp.plot_model(final_model)\n",
"exp.predict_model(final_model)"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"BaseCdsDtForecaster(degree=2, deseasonal_model='multiplicative',\n",
" regressor=LGBMRegressor(bagging_freq=5, colsample_bytree=1,\n",
" learning_rate=0.0025551361408324464,\n",
" max_depth=8, min_child_samples=78,\n",
" n_estimators=224, num_leaves=253,\n",
" random_state=42,\n",
" reg_alpha=3.144127709026421e-10,\n",
" reg_lambda=3.7899513783298346e-07,\n",
" subsample=1),\n",
" sp=12, window_length=13)\n",
"BaseCdsDtForecaster(degree=2, deseasonal_model='multiplicative',\n",
" regressor=LGBMRegressor(bagging_freq=5, colsample_bytree=1,\n",
" learning_rate=0.0025551361408324464,\n",
" max_depth=8, min_child_samples=78,\n",
" n_estimators=224, num_leaves=253,\n",
" random_state=42,\n",
" reg_alpha=3.144127709026421e-10,\n",
" reg_lambda=3.7899513783298346e-07,\n",
" subsample=1),\n",
" sp=12, window_length=13)\n"
]
}
],
"source": [
"print(tuned_model)\n",
"print(final_model)"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"**Observations:**\n",
"As we can see, the model hyperparameters are exactly the same. The only difference is that the `tuned_model` has been trained only using the training dataset, while the `final_model` has been trained using the full dataset.\n",
"\n",
"We can also plot the two models simultaneously to check the forecasts. Since the `tuned_model` has been trained on the train dataset only, it makes forecasts for the test dataset. Since the `final_model` has been trained on the entire dataset, it makes true futute predictions. "
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"### Save model pickle file\n",
"\n",
"Now, we can save this model as a pickle file. This model can then be loaded later for making predictions again."
]
},
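{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Under the hood this is an ordinary pickle round trip (pycaret's `save_model` additionally stores the transformation pipeline). A generic sketch of the round trip, using a scikit-learn estimator as a stand-in for the forecaster:\n",
"\n",
"```python\n",
"import pickle\n",
"import numpy as np\n",
"from sklearn.linear_model import LinearRegression\n",
"\n",
"X = np.arange(10, dtype=float).reshape(-1, 1)\n",
"model = LinearRegression().fit(X, 3 * X.ravel())\n",
"\n",
"blob = pickle.dumps(model)     # analogous to what save_model writes to disk\n",
"restored = pickle.loads(blob)  # analogous to what load_model reads back\n",
"# Predictions from the restored model match the original exactly\n",
"```"
]
},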
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Transformation Pipeline and Model Successfully Saved\n"
]
}
],
"source": [
"_ = exp.save_model(final_model, \"my_final_model\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## Load Model \n",
"\n",
"Now, let's say you closed your training session but want to make the predictons with the saved model. This can be done easily by loading the model again (usually done in another session)"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Transformation Pipeline and Model Successfully Loaded\n"
]
}
],
"source": [
"exp_load = TSForecastingExperiment()\n",
"loaded_model = exp_load.load_model(\"my_final_model\")"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" y_pred | \n",
"
\n",
" \n",
" \n",
" \n",
" 1961-01 | \n",
" 452.2415 | \n",
"
\n",
" \n",
" 1961-02 | \n",
" 442.2790 | \n",
"
\n",
" \n",
" 1961-03 | \n",
" 507.9412 | \n",
"
\n",
" \n",
" 1961-04 | \n",
" 495.7020 | \n",
"
\n",
" \n",
" 1961-05 | \n",
" 502.1397 | \n",
"
\n",
" \n",
" 1961-06 | \n",
" 573.5355 | \n",
"
\n",
" \n",
" 1961-07 | \n",
" 636.7857 | \n",
"
\n",
" \n",
" 1961-08 | \n",
" 637.9355 | \n",
"
\n",
" \n",
" 1961-09 | \n",
" 558.5830 | \n",
"
\n",
" \n",
" 1961-10 | \n",
" 489.0101 | \n",
"
\n",
" \n",
" 1961-11 | \n",
" 428.0954 | \n",
"
\n",
" \n",
" 1961-12 | \n",
" 483.7110 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" y_pred\n",
"1961-01 452.2415\n",
"1961-02 442.2790\n",
"1961-03 507.9412\n",
"1961-04 495.7020\n",
"1961-05 502.1397\n",
"1961-06 573.5355\n",
"1961-07 636.7857\n",
"1961-08 637.9355\n",
"1961-09 558.5830\n",
"1961-10 489.0101\n",
"1961-11 428.0954\n",
"1961-12 483.7110"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Should match predictions from before the save and load\n",
"exp_load.predict_model(loaded_model)"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"These predictions match with the ones we got before we saved the model.\n",
"\n",
"Another use case is that the user may want to forecast for a longer horizon than the one used for the original training. This can be achieved as follows"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/html": [
" \n",
" "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Example here shows forecasting out 36 months instead of the default of 12\n",
"exp.plot_model(estimator=final_model, data_kwargs={'fh': 36}) "
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Users may also be interested in learning about Multi-step forecasts. More details about this can be **[found here](https://github.com/pycaret/pycaret/discussions/1942)**."
]
},
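{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"For intuition, the recursive multi-step strategy can be sketched as follows: predict one step, append the prediction to the input window, and repeat. This is a toy illustration, not pycaret's internal code.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def recursive_forecast(history, one_step_model, fh=36):\n",
"    # Feed each one-step prediction back in as an input for the next step\n",
"    window = list(history[-12:])\n",
"    preds = []\n",
"    for _ in range(fh):\n",
"        next_val = one_step_model(np.array(window))\n",
"        preds.append(next_val)\n",
"        window = window[1:] + [next_val]  # slide the window forward\n",
"    return np.array(preds)\n",
"\n",
"# Toy one-step 'model': repeat the value from one season (12 steps) ago\n",
"seasonal_naive = lambda w: w[-12]\n",
"history = np.tile(np.arange(12, dtype=float), 10)  # ten seasons of a repeating pattern\n",
"fc = recursive_forecast(history, seasonal_naive, fh=36)\n",
"```\n",
"\n",
"With this strategy, errors can compound over the horizon, which is one reason direct multi-step strategies also exist."
]
},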
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"**That's it for this notebook. In the next notebook, we will see a more automate way to model this same data.**"
]
}
],
"metadata": {
"interpreter": {
"hash": "c161a91f6f4623a54f30c5492a42e7cf0592610fb90c8abd312086f09f8fbe0f"
},
"kernelspec": {
"display_name": "pycaret_sktime_0p11_2",
"language": "python",
"name": "pycaret_sktime_0p11_2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.13"
}
},
"nbformat": 4,
"nbformat_minor": 2
}