{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's go on to our modeling step. As a reminder, our plan of action was as follows:\n", "\n", "1. Perform EDA on the dataset to extract valuable insight about the process generating the time series **(COMPLETED)**.\n", "2. Build a baseline model (univariate model without exogenous variables) for benchmarking purposes. **(Covered in this notebook)**\n", "3. Build a univariate model with all exogenous variables to check the best possible performance. **(Covered in this notebook)**\n", "4. Evaluate the model with exogenous variables and discuss any potential issues. **(Covered in this notebook)**\n", "5. Overcome issues identified above. **(Covered in this notebook)**\n", "6. Make future predictions with the best model.\n", "7. Replicate flow with Automated Time Series Modeling (AutoML)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Only enable critical logging (Optional)\n", "import os\n", "os.environ[\"PYCARET_CUSTOM_LOGGING_LEVEL\"] = \"CRITICAL\"" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "System:\n", " python: 3.8.13 (default, Mar 28 2022, 06:59:08) [MSC v.1916 64 bit (AMD64)]\n", "executable: C:\\Users\\Nikhil\\.conda\\envs\\pycaret_dev_sktime_0p11_2\\python.exe\n", " machine: Windows-10-10.0.19044-SP0\n", "\n", "PyCaret required dependencies:\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\Nikhil\\.conda\\envs\\pycaret_dev_sktime_0p11_2\\lib\\site-packages\\_distutils_hack\\__init__.py:30: UserWarning: Setuptools is replacing distutils.\n", " warnings.warn(\"Setuptools is replacing distutils.\")\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ " pip: 21.2.2\n", " setuptools: 61.2.0\n", " pycaret: 3.0.0\n", " ipython: Not installed\n", " ipywidgets: 7.7.0\n", " numpy: 1.21.6\n", " pandas: 1.4.2\n", " jinja2: 
3.1.2\n", " scipy: 1.8.0\n", " joblib: 1.1.0\n", " sklearn: 1.0.2\n", " pyod: Installed but version unavailable\n", " imblearn: 0.9.0\n", " category_encoders: 2.4.1\n", " lightgbm: 3.3.2\n", " numba: 0.55.1\n", " requests: 2.27.1\n", " matplotlib: 3.5.2\n", " scikitplot: 0.3.7\n", " yellowbrick: 1.4\n", " plotly: 5.8.0\n", " kaleido: 0.2.1\n", " statsmodels: 0.13.2\n", " sktime: 0.11.4\n", " tbats: Installed but version unavailable\n", " pmdarima: 1.8.5\n", "\n", "PyCaret optional dependencies:\n", " shap: Not installed\n", " interpret: Not installed\n", " umap: Not installed\n", " pandas_profiling: Not installed\n", " explainerdashboard: Not installed\n", " autoviz: Not installed\n", " fairlearn: Not installed\n", " xgboost: Not installed\n", " catboost: Not installed\n", " kmodes: Not installed\n", " mlxtend: Not installed\n", " statsforecast: 0.5.5\n", " tune_sklearn: Not installed\n", " ray: Not installed\n", " hyperopt: Not installed\n", " optuna: Not installed\n", " skopt: Not installed\n", " mlflow: 1.25.1\n", " gradio: Not installed\n", " fastapi: Not installed\n", " uvicorn: Not installed\n", " m2cgen: Not installed\n", " evidently: Not installed\n", " nltk: Not installed\n", " pyLDAvis: Not installed\n", " gensim: Not installed\n", " spacy: Not installed\n", " wordcloud: Not installed\n", " textblob: Not installed\n", " psutil: 5.9.0\n", " fugue: Not installed\n", " streamlit: Not installed\n", " prophet: Not installed\n" ] } ], "source": [ "def what_is_installed():\n", "    from pycaret import show_versions\n", "    show_versions()\n", "\n", "try:\n", "    what_is_installed()\n", "except ModuleNotFoundError:\n", "    !pip install pycaret\n", "    what_is_installed()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "from pycaret.datasets import get_data\n", "from pycaret.time_series import TSForecastingExperiment" ] }, { "cell_type": "code", "execution_count": 4, "metadata": 
{}, "outputs": [], "source": [ "# Global Figure Settings for notebook ----\n", "global_fig_settings = {\"renderer\": \"notebook\", \"width\": 1000, \"height\": 600}" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Date Time CO(GT) PT08.S1(CO) NMHC(GT) C6H6(GT) \\\n", "0 2004-03-10 18:00:00 2.6 1360 150 11.9 \n", "1 2004-03-10 19:00:00 2.0 1292 112 9.4 \n", "2 2004-03-10 20:00:00 2.2 1402 88 9.0 \n", "3 2004-03-10 21:00:00 2.2 1376 80 9.2 \n", "4 2004-03-10 22:00:00 1.6 1272 51 6.5 \n", "\n", " PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) \\\n", "0 1046 166 1056 113 1692 1268 \n", "1 955 103 1174 92 1559 972 \n", "2 939 131 1140 114 1555 1074 \n", "3 948 172 1092 122 1584 1203 \n", "4 836 131 1205 116 1490 1110 \n", "\n", " T RH AH \n", "0 13.6 48.9 0.7578 \n", "1 13.3 47.7 0.7255 \n", "2 11.9 54.0 0.7502 \n", "3 11.0 60.0 0.7867 \n", "4 11.2 59.6 0.7888 " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
" ], "text/plain": [ " CO(GT) PT08.S1(CO) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) \\\n", "0 2.6 1360.0 11.9 1046.0 166.0 1056.0 \n", "1 2.0 1292.0 9.4 955.0 103.0 1174.0 \n", "2 2.2 1402.0 9.0 939.0 131.0 1140.0 \n", "3 2.2 1376.0 9.2 948.0 172.0 1092.0 \n", "4 1.6 1272.0 6.5 836.0 131.0 1205.0 \n", "\n", " NO2(GT) PT08.S4(NO2) PT08.S5(O3) T RH index \n", "0 113.0 1692.0 1268.0 13.6 48.9 2004-03-10 18:00:00 \n", "1 92.0 1559.0 972.0 13.3 47.7 2004-03-10 19:00:00 \n", "2 114.0 1555.0 1074.0 11.9 54.0 2004-03-10 20:00:00 \n", "3 122.0 1584.0 1203.0 11.0 60.0 2004-03-10 21:00:00 \n", "4 116.0 1490.0 1110.0 11.2 59.6 2004-03-10 22:00:00 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = get_data(\"airquality\")\n", "data[\"index\"] = pd.to_datetime(data[\"Date\"] + \" \" + data[\"Time\"])\n", "data.drop(columns=[\"Date\", \"Time\"], inplace=True)\n", "data.replace(-200, np.nan, inplace=True)\n", "target = \"CO(GT)\"\n", "\n", "exclude = ['NMHC(GT)', 'AH']\n", "data.drop(columns=exclude, inplace=True)\n", "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Step 2: Baseline Model - Univariate forecasting without exogenous variables" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
    Description                    Value
0   session_id                     42
1   Target                         CO(GT)
2   Approach                       Univariate
3   Exogenous Variables            Not Present
4   Original data shape            (9357, 1)
5   Transformed data shape         (9357, 1)
6   Transformed train set shape    (9309, 1)
7   Transformed test set shape     (48, 1)
8   Rows with missing values       18.0%
9   Fold Generator                 ExpandingWindowSplitter
10  Fold Number                    3
11  Enforce Prediction Interval    False
12  Seasonal Period(s) Tested      24
13  Seasonality Present            True
14  Seasonalities Detected         [24]
15  Primary Seasonality            24
16  Target Strictly Positive       True
17  Target White Noise             No
18  Recommended d                  1
19  Recommended Seasonal D         0
20  Preprocess                     True
21  Numerical Imputation (Target)  ffill
22  Transformation (Target)        None
23  Scaling (Target)               None
24  CPU Jobs                       -1
25  Use GPU                        False
26  Log Experiment                 False
27  Experiment Name                ts-default-name
28  USI                            968c
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_uni = data.copy()\n", "data_uni.set_index(\"index\", inplace=True)\n", "data_uni = data_uni[target]\n", "\n", "exp_uni = TSForecastingExperiment()\n", "exp_uni.setup(\n", " data=data_uni, fh=48,\n", " numeric_imputation_target=\"ffill\", numeric_imputation_exogenous=\"ffill\",\n", " fig_kwargs=global_fig_settings, session_id=42\n", ")" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
      cutoff            MASE    RMSSE   MAE     RMSE    MAPE    SMAPE   R2
0     2005-03-27 14:00  1.2101  1.1315  1.0340  1.4835  0.5751  0.9044  -1.3625
1     2005-03-29 14:00  2.2086  1.6277  1.8839  2.1316  1.5146  0.7456  -2.8068
2     2005-03-31 14:00  1.1332  0.8553  0.9652  1.1183  1.1796  1.2402  -5.1529
Mean  nan               1.5173  1.2048  1.2944  1.5778  1.0898  0.9634  -3.1074
SD    nan               0.4899  0.3196  0.4178  0.4190  0.3888  0.2062  1.5620
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "model = exp_uni.create_model(\"arima\", order=(0,1,0), seasonal_order=(0,1,0,24))" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ " \n", " " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "exp_uni.plot_model(model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On zooming in to the forecasts, we can see that the model is able to capture some of the trends (spikes) in the dataset, but not all. The performance of our baseline model indicates that mean MASE across the CV folds is 1.52 which is not that great. Any value > 1 indicates that the model is performing worse than even a naive model with one step ahead forecasts. This model needs more improvement. Let's see if adding exogenous variables can help improve the model performance." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Step 3: Improved Model - Univariate forecasting with exogenous variables" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", 
" \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
    Description                       Value
0   session_id                        42
1   Target                            CO(GT)
2   Approach                          Univariate
3   Exogenous Variables               Present
4   Original data shape               (9357, 11)
5   Transformed data shape            (9357, 11)
6   Transformed train set shape       (9309, 11)
7   Transformed test set shape        (48, 11)
8   Rows with missing values          25.8%
9   Fold Generator                    ExpandingWindowSplitter
10  Fold Number                       3
11  Enforce Prediction Interval       False
12  Seasonal Period(s) Tested         24
13  Seasonality Present               True
14  Seasonalities Detected            [24]
15  Primary Seasonality               24
16  Target Strictly Positive          True
17  Target White Noise                No
18  Recommended d                     1
19  Recommended Seasonal D            0
20  Preprocess                        True
21  Numerical Imputation (Target)     ffill
22  Transformation (Target)           None
23  Scaling (Target)                  None
24  Numerical Imputation (Exogenous)  ffill
25  Transformation (Exogenous)        None
26  Scaling (Exogenous)               None
27  CPU Jobs                          -1
28  Use GPU                           False
29  Log Experiment                    False
30  Experiment Name                   ts-default-name
31  USI                               5f62
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "exp_exo = TSForecastingExperiment()\n", "exp_exo.setup(\n", " data=data, target=target, index=\"index\", fh=48,\n", " numeric_imputation_target=\"ffill\", numeric_imputation_exogenous=\"ffill\",\n", " fig_kwargs=global_fig_settings, session_id=42\n", ")" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
      cutoff            MASE    RMSSE   MAE     RMSE    MAPE    SMAPE   R2
0     2005-03-27 14:00  0.1473  0.1281  0.1259  0.1680  0.0825  0.0824  0.9697
1     2005-03-29 14:00  0.1931  0.1628  0.1647  0.2132  0.1043  0.1112  0.9619
2     2005-03-31 14:00  0.3009  0.2474  0.2563  0.3235  0.2767  0.3375  0.4852
Mean  nan               0.2138  0.1794  0.1823  0.2349  0.1545  0.1770  0.8056
SD    nan               0.0644  0.0501  0.0547  0.0653  0.0869  0.1141  0.2266
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "model_exo = exp_exo.create_model(\"arima\", order=(0,1,0), seasonal_order=(0,1,0,24))" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ " \n", " " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "exp_exo.plot_model(model_exo)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Step 4: Evaluate Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Not bad, We have managed to improve MASE to ~ 0.21 which is much better than the univariate model and also a large improvement over a naive model. We should be happy with this improvement. Let's finalize the model by training it on the entire dataset so we can make true future forecasts." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "final_model_exo = exp_exo.finalize_model(model_exo)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model was trained with exogenous variables but you have not passed any for predictions. Please pass exogenous variables to make predictions.\n", "10 exogenous variables (X) needed in order to make future predictions:\n", "['PT08.S1(CO)', 'C6H6(GT)', 'PT08.S2(NMHC)', 'NOx(GT)', 'PT08.S3(NOx)', 'NO2(GT)', 'PT08.S4(NO2)', 'PT08.S5(O3)', 'T', 'RH']\n" ] } ], "source": [ "def safe_predict(exp, model):\n", " \"\"\"Prediction wrapper for demo purposes.\"\"\"\n", " try: \n", " exp.predict_model(model)\n", " except ValueError as exception:\n", " print(exception)\n", " exo_vars = exp.exogenous_variables\n", " print(f\"{len(exo_vars)} exogenous variables (X) needed in order to make future predictions:\\n{exo_vars}\")\n", "\n", "safe_predict(exp_exo, final_model_exo)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, this approach does not come without side effects. The problem is that we have 10 exogenous variables. Hence in order to get any unknown future values for CO concentration, we will need the future values for all these exogenous variables. This is generally obtained through some forecasting process itself. 
But each forecast will have errors, and these errors can compound when there are many exogenous variables.\n", "\n", "**Let's see if we can trim down these exogenous variables to a handful of useful variables without compromising on forecasting performance.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Step 5: Parsimonious Model - Univariate forecasting with limited exogenous variables\n", "\n", "From the CCF Analysis, we found that many of the exogenous variables show a very similar correlation structure to the CO concentration. E.g. `PT08.S1(CO)`, `NOx(GT)`, `C6H6(GT)`, `PT08.S2(NMHC)` values from 24 hours before (lag = 24) show a high positive correlation to CO concentration. Instead of keeping all of them, let's pick the one with the highest positive correlation at lag 24, which is `NOx(GT)`.\n", "\n", "Similarly, `PT08.S3(NOx)` values from 24 hours ago show the highest negative correlation to CO concentration. Let's keep this variable as well.\n", "\n", "Finally, in daily cycles, what happens 12 hours back can also impact the current value (e.g. values last night can impact the next day and vice versa). The variable with the highest correlation to CO concentration at lag = 12 is `RH`. We will keep this as well." 
] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
    Description                       Value
0   session_id                        42
1   Target                            CO(GT)
2   Approach                          Univariate
3   Exogenous Variables               Present
4   Original data shape               (9357, 4)
5   Transformed data shape            (9357, 4)
6   Transformed train set shape       (9309, 4)
7   Transformed test set shape        (48, 4)
8   Rows with missing values          25.8%
9   Fold Generator                    ExpandingWindowSplitter
10  Fold Number                       3
11  Enforce Prediction Interval       False
12  Seasonal Period(s) Tested         24
13  Seasonality Present               True
14  Seasonalities Detected            [24]
15  Primary Seasonality               24
16  Target Strictly Positive          True
17  Target White Noise                No
18  Recommended d                     1
19  Recommended Seasonal D            0
20  Preprocess                        True
21  Numerical Imputation (Target)     ffill
22  Transformation (Target)           None
23  Scaling (Target)                  None
24  Numerical Imputation (Exogenous)  ffill
25  Transformation (Exogenous)        None
26  Scaling (Exogenous)               None
27  CPU Jobs                          -1
28  Use GPU                           False
29  Log Experiment                    False
30  Experiment Name                   ts-default-name
31  USI                               e941
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "exp_slim = TSForecastingExperiment()\n", "keep = [target, \"index\", 'NOx(GT)', \"PT08.S3(NOx)\", \"RH\"]\n", "data_slim = data[keep]\n", "exp_slim.setup(\n", " data=data_slim, target=target, index=\"index\", fh=48,\n", " numeric_imputation_target=\"ffill\", numeric_imputation_exogenous=\"ffill\",\n", " fig_kwargs=global_fig_settings, session_id=42 \n", ")" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
      cutoff            MASE    RMSSE   MAE     RMSE    MAPE    SMAPE   R2
0     2005-03-27 14:00  0.2174  0.1891  0.1857  0.2479  0.1339  0.1230  0.9340
1     2005-03-29 14:00  0.2644  0.2209  0.2255  0.2893  0.1358  0.1524  0.9299
2     2005-03-31 14:00  0.2366  0.1972  0.2015  0.2579  0.2503  0.3316  0.6729
Mean  nan               0.2395  0.2024  0.2043  0.2650  0.1733  0.2023  0.8456
SD    nan               0.0193  0.0135  0.0164  0.0177  0.0544  0.0922  0.1221
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "model_slim = exp_slim.create_model(\"arima\", order=(0,1,0), seasonal_order=(0,1,0,24))" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ " \n", " " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "exp_slim.plot_model(model_slim)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Not bad. MASE has only increased from ~0.21 to ~0.24, but we have managed to cut our exogenous variables down from 13 to 3. This will help us when we make \"true\" unknonw future predictions since we will need the \"unknown\" future values of these exogenous variables to make the forecast for the CO concentration." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Finalize the model\n", "\n", "- Train the slim model on the entire dataset so we can make true future forecasts\n", "- Save the model as a pickle file for deployment " ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "final_slim_model = exp_slim.finalize_model(model_slim)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Transformation Pipeline and Model Successfully Saved\n" ] }, { "data": { "text/plain": [ "(ForecastingPipeline(steps=[('transformer_exogenous',\n", " TransformerPipeline(steps=[('numerical_imputer',\n", " Imputer(method='ffill',\n", " random_state=42))])),\n", " ('forecaster',\n", " TransformedTargetForecaster(steps=[('transformer_target',\n", " TransformerPipeline(steps=[('numerical_imputer',\n", " Imputer(method='ffill',\n", " random_state=42))])),\n", " ('model',\n", " ARIMA(order=(0,\n", " 1,\n", " 0),\n", " seasonal_order=(0,\n", " 1,\n", " 0,\n", " 24)))]))]),\n", " 'final_slim_model.pkl')" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "exp_slim.save_model(final_slim_model, \"final_slim_model\")" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model was trained with exogenous variables but you have not passed any for predictions. 
Please pass exogenous variables to make predictions.\n", "3 exogenous variables (X) needed in order to make future predictions:\n", "['NOx(GT)', 'PT08.S3(NOx)', 'RH']\n" ] } ], "source": [ "safe_predict(exp_slim, final_slim_model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So we still need future values for 3 exogenous variables. We will obtain these in the next part using forecasting techniques." ] } ], "metadata": { "interpreter": { "hash": "c161a91f6f4623a54f30c5492a42e7cf0592610fb90c8abd312086f09f8fbe0f" }, "kernelspec": { "display_name": "pycaret_sktime_0p11_2", "language": "python", "name": "pycaret_sktime_0p11_2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" } }, "nbformat": 4, "nbformat_minor": 2 }