{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## skpro introduction notebook" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Set-up instructions:** On binder, this should run out-of-the-box.\n", "\n", "To run locally instead, ensure that `skpro` with basic dependency requirements is installed in your Python environment." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`skpro` provides `scikit-learn`-like, `scikit-base` compatible interfaces to:\n", "\n", "* tabular **supervised regressors with probabilistic prediction modes** - interval, quantile and distribution predictions\n", "* **performance metrics to evaluate probabilistic predictions**, e.g., pinball loss, empirical coverage, CRPS\n", "* **reductions** to turn non-probabilistic, `scikit-learn` regressors into probabilistic `skpro` regressors, such as bootstrap or conformal\n", "* tools for building **pipelines and composite machine learning models**, including tuning via probabilistic performance metrics\n", "* symbolic and lazy **probability distributions** with a value domain of `pandas.DataFrame`-s and a `pandas`-like interface\n", "\n", "**Section 1** provides an overview of common **probabilistic supervised regression workflows** supported by `skpro`.\n", "\n", "**Section 2** gives a more detailed introduction to **prediction modes, performance metrics, and benchmarking tools**.\n", "\n", "**Section 3** discusses **advanced composition patterns**, including various ways to add probabilistic capability to any `sklearn` regressor, pipeline building, tuning, ensembling.\n", "\n", "**Section 4** gives an introduction to how to write **custom estimators** compliant with the `skpro` interface." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# hide warnings\n", "import warnings\n", "\n", "warnings.filterwarnings(\"ignore\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Basic probabilistic supervised regression workflows " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`skpro` revolves around supervised probabilistic regressors:\n", "\n", "* `fit(X, y)` with tabular features `X`, labels `y`, same rows, both `pd.DataFrame`\n", "* `predict_interval(X_test, coverage=0.90)` for interval predictions of labels\n", "* `predict_quantiles(X_test, alpha=[0.05, 0.95])` for quantile predictions of labels\n", "* `predict_var(X_test)` for variance predictions of labels\n", "* `predict(X_test)` for mean predictions\n", "* `predict_proba(X_test)` for distributional prediction" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### 1.1 basic deployment workflow" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`skpro` regressors are used via `fit` then `predict_proba` etc.\n", "\n", "Same as `sklearn` regressors - `X` and `y` should be `pd.DataFrame` (`numpy` is also ok but not recommended)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import load_diabetes\n", "from sklearn.ensemble import RandomForestRegressor\n", "from sklearn.linear_model import LinearRegression\n", "from sklearn.model_selection import train_test_split\n", "\n", "from skpro.regression.residual import ResidualDouble\n", "\n", "# step 1: data specification\n", "X, y = load_diabetes(return_X_y=True, as_frame=True)\n", "X_train, X_new, y_train, _ = train_test_split(X, y)\n", "\n", "# step 2: specifying the regressor\n", "# example - random forest for mean prediction\n", "# linear regression for variance prediction\n", "reg_mean = RandomForestRegressor()\n", "reg_resid = LinearRegression()\n", 
"reg_proba = ResidualDouble(reg_mean, reg_resid)\n", "\n", "# step 3: fitting the model to training data\n", "reg_proba.fit(X_train, y_train)\n", "\n", "# step 4: predicting labels on new data\n", "\n", "# probabilistic prediction modes - pick any or multiple\n", "# we show the return types in detail below\n", "\n", "# full distribution prediction\n", "y_pred_proba = reg_proba.predict_proba(X_new)\n", "\n", "# interval prediction\n", "y_pred_interval = reg_proba.predict_interval(X_new, coverage=0.9)\n", "\n", "# quantile prediction\n", "y_pred_quantiles = reg_proba.predict_quantiles(X_new, alpha=[0.05, 0.5, 0.95])\n", "\n", "# variance prediction\n", "y_pred_var = reg_proba.predict_var(X_new)\n", "\n", "# mean prediction is same as \"classical\" sklearn predict, also available\n", "y_pred_mean = reg_proba.predict(X_new)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.1.1 distribution predictions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`y_pred_proba` is an `skpro` distribution - it has index and columns like `pd.DataFrame`\n", "\n", "\"we predict that true labels are distributed according to `y_pred_proba`\"\n", "\n", "(here: distribution marginal by row/columns)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "Normal(columns=Index(['target'], dtype='object'),\n", " index=Index([287, 134, 213, 82, 251, 45, 350, 285, 22, 236,\n", " ...\n", " 218, 235, 166, 361, 272, 333, 219, 310, 178, 233],\n", " dtype='int64', length=111),\n", " mu=array([[158.75],\n", " [171.91],\n", " [ 86.32],\n", " [ 94.1 ],\n", " [251.72],\n", " [100.72],\n", " [262.36],\n", " [209.04],\n", " [ 85.69],\n", " [171.23],\n", " [182.31],\n", " [237.8 ],\n", " [179.06],\n", " [109.77],\n", " [ 90.58],\n", " [184.38],\n", " [135.62],\n", " [179.12],\n", " [ 70.68],\n", " [178.36]...\n", " [16.83771976],\n", " [19.10672381],\n", " [19.63878908],\n", " [16.46098141],\n", " [15.70910931],\n", " [12.90608099],\n", " [19.66465255],\n", " [20.89400588],\n", " [19.28096697],\n", " [22.39693358],\n", " [15.26815129],\n", " [18.49072135],\n", " [16.44625929],\n", " [14.43024188],\n", " [16.19206731],\n", " [21.27391581],\n", " [19.50839963],\n", " [12.64715474],\n", " [13.93531633],\n", " [17.53348762],\n", " [20.01785524],\n", " [19.57531732],\n", " [21.54329846],\n", " [13.07775327],\n", " [13.55384321]]))" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_pred_proba = reg_proba.predict_proba(X_new)\n", "y_pred_proba" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`skpro` distribution objects are pandas-like" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(111, 1)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_pred_proba.shape" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index([287, 134, 213, 82, 251, 45, 350, 285, 22, 236,\n", " ...\n", " 218, 235, 166, 361, 272, 333, 219, 310, 178, 233],\n", " dtype='int64', length=111)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_pred_proba.index # same index as X_new" ] }, { 
"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['target'], dtype='object')" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_pred_proba.columns # same columns as X_new" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "distribution objects have `sample` and methods such as `mean`, `var`:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " target\n", "287 176.508550\n", "134 161.825887\n", "213 94.371796\n", "82 107.464801\n", "251 247.966573" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_pred_proba.sample().head()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " target\n", "287 158.75\n", "134 171.91\n", "213 86.32\n", "82 94.10\n", "251 251.72" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_pred_proba.mean().head()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " target\n", "287 362.869338\n", "134 271.792914\n", "213 289.466582\n", "82 125.589059\n", "251 479.705452" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_pred_proba.var().head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "more details on `skpro` distributions in the \"distributions\" tutorial!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.1.2 interval predictions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "interval prediction `y_pred_interval` is a `pd.DataFrame`:\n", "\n", "* rows are the same as `X_new`\n", "* columns indicate variables, nominal coverage, and bottom/upper bound\n", "\n", "\"we predict that value in row falls between bottom/upper with 90% chance\"" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " target \n", " 0.9 \n", " lower upper\n", "287 127.416970 190.083030\n", "134 144.792708 199.027292\n", "213 58.334925 114.305075\n", "82 75.666697 112.533303\n", "251 215.694121 287.745879" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_pred_interval = reg_proba.predict_interval(X_new, coverage=0.9)\n", "y_pred_interval.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "multiple coverages can be passed:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " target \n", " 0.50 0.90 0.95 \n", " lower upper lower upper lower upper\n", "287 145.901557 171.598443 127.416970 190.083030 121.414392 196.085608\n", "134 160.790265 183.029735 144.792708 199.027292 139.597753 204.222247\n", "213 74.844422 97.795578 58.334925 114.305075 52.973726 119.666274\n", "82 86.541228 101.658772 75.666697 112.533303 72.135365 116.064635\n", "251 236.947205 266.492795 215.694121 287.745879 208.792518 294.647482\n", ".. ... ... ... ... ... ...\n", "333 176.038162 203.041838 156.613558 222.466442 150.305725 228.774275\n", "219 150.266649 176.673351 131.271468 195.668532 125.103083 201.836917\n", "310 190.349266 219.410734 169.444427 240.315573 162.655911 247.104089\n", "178 77.299189 94.940811 64.609010 107.630990 60.488075 111.751925\n", "233 155.038072 173.321928 141.885912 186.474088 137.614955 190.745045\n", "\n", "[111 rows x 6 columns]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "coverage = [0.5, 0.9, 0.95]\n", "y_pred_ints = reg_proba.predict_interval(X_new, coverage=coverage)\n", "y_pred_ints" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`predict_interval` output spec:\n", "\n", "`pandas.DataFrame`\\\n", "Row index is as for `X_new`\\\n", "Column has multi-index:\\\n", "1st level = variable names from `y` in fit\\\n", "2nd level = coverage fractions in `coverage`\\\n", "3rd level = string `\"lower\"` or `\"upper\"`\n", "\n", "Entries = interval prediction of lower/upper interval at nominal coverage in 2nd lvl, for var in 1st lvl, for data index in row" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.1.3 quantile predictions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "quantile prediction `y_pred_quantiles` is a `pd.DataFrame`:\n", "\n", "* rows are the same as `X_new`\n", "* columns indicate variables, quantile points\n", "\n", "\"we predict the 5%, 50%, 95% quantile points for the row to be here\"" ] }, { 
"cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " target \n", " 0.05 0.50 0.95\n", "287 127.416970 158.75 190.083030\n", "134 144.792708 171.91 199.027292\n", "213 58.334925 86.32 114.305075\n", "82 75.666697 94.10 112.533303\n", "251 215.694121 251.72 287.745879" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_pred_quantiles = reg_proba.predict_quantiles(X_new, alpha=[0.05, 0.5, 0.95])\n", "y_pred_quantiles.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "multiple quantiles:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " target \n", " 0.10 0.25 0.50 0.75 0.90\n", "287 134.337558 145.901557 158.75 171.598443 183.162442\n", "134 150.782158 160.790265 171.91 183.029735 193.037842\n", "213 64.516044 74.844422 86.32 97.795578 108.123956\n", "82 79.738097 86.541228 94.10 101.658772 108.461903\n", "251 223.651228 236.947205 251.72 266.492795 279.788772\n", ".. ... ... ... ... ...\n", "333 163.886086 176.038162 189.54 203.041838 215.193914\n", "219 138.383221 150.266649 163.47 176.673351 188.556779\n", "310 177.271152 190.349266 204.88 219.410734 232.488848\n", "178 69.360185 77.299189 86.12 94.940811 102.879815\n", "233 146.810051 155.038072 164.18 173.321928 181.549949\n", "\n", "[111 rows x 5 columns]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alpha = [0.1, 0.25, 0.5, 0.75, 0.9]\n", "y_pred_quantiles = reg_proba.predict_quantiles(X_new, alpha=alpha)\n", "y_pred_quantiles" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`predict_quantiles` output spec:\n", "\n", "`pandas.DataFrame`\\\n", "Row index is same as `X_new`\\\n", "Column has multi-index:\\\n", "1st level = variable names from `y` in `fit`\\\n", "2nd level = quantile points in `alpha`\n", "\n", "Entries = quantiles prediction at quantile point in 2nd lvl, for var in 1st lvl, for data index in row" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.1.4 mean and variance predictions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "mean and variance predictions `y_pred_mean`, `y_pred_var` are `pd.DataFrame`-s:\n", "\n", "* rows are the same as `X_new`\n", "* columns are the same as `X_new`\n", "\n", "entries are predictive mean and variance in row/column" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "y_pred_mean = reg_proba.predict(X_new)\n", "y_pred_var = reg_proba.predict_var(X_new)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { 
"data": { "text/html": [ "
" ], "text/plain": [ " target\n", "287 158.75\n", "134 171.91\n", "213 86.32\n", "82 94.10\n", "251 251.72" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_pred_mean.head()" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " target\n", "287 362.869338\n", "134 271.792914\n", "213 289.466582\n", "82 125.589059\n", "251 479.705452" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_pred_var.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "this is the same as taking the distribution prediction and taking mean/variance\n", "\n", "(for distribution objects that estimate these precisely)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " target\n", "287 158.75\n", "134 171.91\n", "213 86.32\n", "82 94.10\n", "251 251.72" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_pred_proba.mean().head()" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " target\n", "287 362.869338\n", "134 271.792914\n", "213 289.466582\n", "82 125.589059\n", "251 479.705452" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_pred_proba.var().head()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## 1.2 simple evaluation workflow for probabilistic predictions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "for simple evaluation:\n", "\n", "1. split the data into train/test set\n", "2. make predictions of either type for test features\n", "3. compute metric on test set, comparing test predictions to hend out test labels\n", "\n", "Note:\n", "\n", "* metrics will compare tabular ground truth to probabilistic prediction\n", "* the metric will needs to be of a compatible type, e.g., for proba predictions" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "29.341032127400844" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.datasets import load_diabetes\n", "from sklearn.ensemble import RandomForestRegressor\n", "from sklearn.linear_model import LinearRegression\n", "from sklearn.model_selection import train_test_split\n", "\n", "from skpro.metrics import CRPS\n", "from skpro.regression.residual import ResidualDouble\n", "\n", "# step 1: data specification\n", "X, y = load_diabetes(return_X_y=True, as_frame=True)\n", "X_train, X_test, y_train, y_test = train_test_split(X, y)\n", "\n", "# step 2: specifying the regressor\n", "# example - linear regression for mean prediction\n", "# random forest for variance prediction\n", "reg_mean = LinearRegression()\n", "reg_resid = RandomForestRegressor()\n", "reg_proba = ResidualDouble(reg_mean, reg_resid)\n", "\n", "# step 3: fitting the model to training data\n", "reg_proba.fit(X_train, y_train)\n", "\n", "# step 4: predicting labels on new data\n", "y_pred_proba = 
reg_proba.predict_proba(X_test)\n", "\n", "# step 5: specifying evaluation metric\n", "metric = CRPS()\n", "\n", "# step 6: evaluate metric, compare predictions to actuals\n", "metric(y_test, y_pred_proba)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "how do we know that the metric is of the right type? Via the `scitype:y_pred` tag" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'estimator_type': 'estimator',\n", " 'object_type': 'metric',\n", " 'reserved_params': ['multioutput', 'score_average'],\n", " 'scitype:y_pred': 'pred_proba',\n", " 'lower_is_better': True}" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "metric.get_tags()\n", "# scitype:y_pred is pred_proba - for proba predictions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "how do we find metrics for a prediction type?" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"  <thead>\n",
"    <tr><th></th><th>name</th><th>object</th><th>scitype:y_pred</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>CRPS</td><td>&lt;class 'skpro.metrics._classes.CRPS'&gt;</td><td>pred_proba</td></tr>\n",
"    <tr><th>1</th><td>ConstraintViolation</td><td>&lt;class 'skpro.metrics._classes.ConstraintViola...</td><td>pred_interval</td></tr>\n",
"    <tr><th>2</th><td>EmpiricalCoverage</td><td>&lt;class 'skpro.metrics._classes.EmpiricalCovera...</td><td>pred_interval</td></tr>\n",
"    <tr><th>3</th><td>LinearizedLogLoss</td><td>&lt;class 'skpro.metrics._classes.LinearizedLogLo...</td><td>pred_proba</td></tr>\n",
"    <tr><th>4</th><td>LogLoss</td><td>&lt;class 'skpro.metrics._classes.LogLoss'&gt;</td><td>pred_proba</td></tr>\n",
"    <tr><th>5</th><td>PinballLoss</td><td>&lt;class 'skpro.metrics._classes.PinballLoss'&gt;</td><td>pred_quantiles</td></tr>\n",
"    <tr><th>6</th><td>SquaredDistrLoss</td><td>&lt;class 'skpro.metrics._classes.SquaredDistrLoss'&gt;</td><td>pred_proba</td></tr>\n",
"  </tbody>\n",
"</table>
" ], "text/plain": [ " name object \\\n", "0 CRPS \n", "1 ConstraintViolation \n", "5 PinballLoss \n", "6 SquaredDistrLoss \n", "\n", " scitype:y_pred \n", "0 pred_proba \n", "1 pred_interval \n", "2 pred_interval \n", "3 pred_proba \n", "4 pred_proba \n", "5 pred_quantiles \n", "6 pred_proba " ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from skpro.registry import all_objects\n", "\n", "all_objects(\"metric\", as_dataframe=True, return_tags=\"scitype:y_pred\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "extra note: quantile metrics can be applied to interval predictions as well\n", "\n", "more details on metrics below" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1.3 diagnostic visualisations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "some useful diagnostic visualisations: variants of crossplots for probabilistic predictions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A. crossplot ground truth vs prediction intervals.\n", "\n", "Works with both proba and interval predictions.\n", "\n", "What to look for: intervals shouhld cut through the x = y line (green points)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from skpro.utils.plotting import plot_crossplot_interval\n", "\n", "plot_crossplot_interval(y_test, y_pred_proba, coverage=0.9)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from skpro.utils.plotting import plot_crossplot_interval\n", "\n", "y_pred_interval = reg_proba.predict_interval(X_test, coverage=0.9)\n", "plot_crossplot_interval(y_test, y_pred_interval)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "B. crossplot residuals vs predictive standard deviation\n", "\n", "Works with both proba and variance predictions.\n", "\n", "What to look for: should be close to a line, high linear correlation" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from skpro.utils.plotting import plot_crossplot_std\n", "\n", "plot_crossplot_std(y_test, y_pred_proba)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from skpro.utils.plotting import plot_crossplot_std\n", "\n", "y_pred_var = reg_proba.predict_var(X_test)\n", "plot_crossplot_std(y_test, y_pred_var)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "C. crossplot ground truth vs loss values\n", "\n", "Loss and prediction type should agree.\n", "\n", "What to look for: association between accuracy and ground truth value\n", "\n", "Diagnostic of which values we can predict more accurately,\n", "\n", "e.g., to inform modelling or identify unusual outliers" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from skpro.utils.plotting import plot_crossplot_loss\n", "\n", "crps_metric = CRPS()\n", "plot_crossplot_loss(y_test, y_pred_proba, crps_metric)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.4 `skpro` objects - `scikit-base` interface, searching for regressors and metrics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `skpro` objects - `skbase` interface points `get_tags`, `get_params`/`set_params`\n", "* searching estimators and metrics via `all_objects`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.4.1 primer on `skpro` object interface " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "metrics and estimators are first-class citizens in `skpro`, with a `scikit-base` compatible interface" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "# example object 1: CRPS metric\n", "from skpro.metrics import CRPS\n", "\n", "crps_metric = CRPS()\n", "\n", "# example object 2: ResidualDouble regressor\n", "from sklearn.ensemble import RandomForestRegressor\n", "from sklearn.linear_model import LinearRegression\n", "\n", "from skpro.regression.residual import ResidualDouble\n", "\n", "reg_mean = LinearRegression()\n", "reg_resid = RandomForestRegressor()\n", "reg_proba = ResidualDouble(reg_mean, reg_resid)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "e.g., all have `get_tags` interface" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'estimator_type': 'estimator',\n", " 'object_type': 'metric',\n", " 'reserved_params': ['multioutput', 'score_average'],\n", " 'scitype:y_pred': 'pred_proba',\n", " 'lower_is_better': True}" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "crps_metric.get_tags()" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": 
[ { "data": { "text/plain": [ "{'estimator_type': 'regressor_proba',\n", " 'object_type': 'regressor_proba',\n", " 'capability:multioutput': False,\n", " 'capability:missing': True}" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "reg_proba.get_tags()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "the tag `object_type` indicates the type of object, e.g., metric or proba regressor" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "all objects also have the `get_params`/`set_params` interface known from `scikit-learn`\n", "\n", "= reading or setting hyper-parameters\n", "\n", "`get_params` returns `dict` `{paramname: paramvalue}`; `set_params` writes it" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'multioutput': 'uniform_average', 'multivariate': False}" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "crps_metric.get_params()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "composite objects have the nested param interface, keys `componentname__paramname`" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
ResidualDouble(estimator=LinearRegression(),\n",
       "               estimator_resid=RandomForestRegressor())
Please rerun this cell to show the HTML repr or trust the notebook.
" ], "text/plain": [ "ResidualDouble(estimator=LinearRegression(),\n", " estimator_resid=RandomForestRegressor())" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# note that reg_proba has components LinearRegression and RandomForestaregressor\n", "# each with their own parameters\n", "reg_proba" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "so `reg_proba` will have parameters coming from itself and either component:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'cv': None,\n", " 'distr_loc_scale_name': None,\n", " 'distr_params': None,\n", " 'distr_type': 'Normal',\n", " 'estimator': LinearRegression(),\n", " 'estimator_resid': RandomForestRegressor(),\n", " 'min_scale': 1e-10,\n", " 'residual_trafo': 'absolute',\n", " 'use_y_pred': False,\n", " 'estimator__copy_X': True,\n", " 'estimator__fit_intercept': True,\n", " 'estimator__n_jobs': None,\n", " 'estimator__normalize': 'deprecated',\n", " 'estimator__positive': False,\n", " 'estimator_resid__bootstrap': True,\n", " 'estimator_resid__ccp_alpha': 0.0,\n", " 'estimator_resid__criterion': 'squared_error',\n", " 'estimator_resid__max_depth': None,\n", " 'estimator_resid__max_features': 1.0,\n", " 'estimator_resid__max_leaf_nodes': None,\n", " 'estimator_resid__max_samples': None,\n", " 'estimator_resid__min_impurity_decrease': 0.0,\n", " 'estimator_resid__min_samples_leaf': 1,\n", " 'estimator_resid__min_samples_split': 2,\n", " 'estimator_resid__min_weight_fraction_leaf': 0.0,\n", " 'estimator_resid__n_estimators': 100,\n", " 'estimator_resid__n_jobs': None,\n", " 'estimator_resid__oob_score': False,\n", " 'estimator_resid__random_state': None,\n", " 'estimator_resid__verbose': 0,\n", " 'estimator_resid__warm_start': False}" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "reg_proba.get_params()" ] }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "further common interface points are `get_config`, `set_config`, and `get_fitted_params` (only fittable estimators)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.4.2 searching for regressors and metrics " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "as first-class citizens, all objects in `skpro` are indexed via the `registry` utility `all_objects`.\n", "\n", "To find probabilistic supervised regressors, use `all_objects` with the type `regressor_proba`:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameobject
0BaggingRegressor<class 'skpro.regression.ensemble.BaggingRegre...
1BootstrapRegressor<class 'skpro.regression.bootstrap.BootstrapRe...
2GridSearchCV<class 'skpro.model_selection._tuning.GridSear...
3Pipeline<class 'skpro.regression.compose._pipeline.Pip...
4RandomizedSearchCV<class 'skpro.model_selection._tuning.Randomiz...
\n", "
" ], "text/plain": [ " name object\n", "0 BaggingRegressor \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameobjectscitype:y_pred
0CRPS<class 'skpro.metrics._classes.CRPS'>pred_proba
1ConstraintViolation<class 'skpro.metrics._classes.ConstraintViola...pred_interval
2EmpiricalCoverage<class 'skpro.metrics._classes.EmpiricalCovera...pred_interval
3LinearizedLogLoss<class 'skpro.metrics._classes.LinearizedLogLo...pred_proba
4LogLoss<class 'skpro.metrics._classes.LogLoss'>pred_proba
5PinballLoss<class 'skpro.metrics._classes.PinballLoss'>pred_quantiles
6SquaredDistrLoss<class 'skpro.metrics._classes.SquaredDistrLoss'>pred_proba
\n", "" ], "text/plain": [ " name object \\\n", "0 CRPS \n", "1 ConstraintViolation \n", "5 PinballLoss \n", "6 SquaredDistrLoss \n", "\n", " scitype:y_pred \n", "0 pred_proba \n", "1 pred_interval \n", "2 pred_interval \n", "3 pred_proba \n", "4 pred_proba \n", "5 pred_quantiles \n", "6 pred_proba " ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from skpro.registry import all_objects\n", "\n", "all_objects(\"metric\", as_dataframe=True, return_tags=\"scitype:y_pred\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "all tags can be printed by the `all_tags` utility:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namescitypetypedescription
0lower_is_bettermetricboolwhether lower (True) or higher (False) is better
1scitype:y_predmetricstrexpected input type for y_pred in performance ...
\n", "
" ], "text/plain": [ " name scitype type \\\n", "0 lower_is_better metric bool \n", "1 scitype:y_pred metric str \n", "\n", " description \n", "0 whether lower (True) or higher (False) is better \n", "1 expected input type for y_pred in performance ... " ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# all tags applicable to metrics\n", "from skpro.registry import all_tags\n", "\n", "all_tags(\"metric\", as_dataframe=True)" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namescitypetypedescription
0capability:missingregressor_probaboolwhether estimator supports missing values
1capability:multioutputregressor_probaboolwhether estimator supports multioutput regression
\n", "
" ], "text/plain": [ " name scitype type \\\n", "0 capability:missing regressor_proba bool \n", "1 capability:multioutput regressor_proba bool \n", "\n", " description \n", "0 whether estimator supports missing values \n", "1 whether estimator supports multioutput regression " ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# all tags applicable to probabilistic regressors\n", "from skpro.registry import all_tags\n", "\n", "all_tags(\"regressor_proba\", as_dataframe=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "filtering in search can be done with the `filter_tags` argument in `all_objects`, see docstring:" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameobject
0CRPS<class 'skpro.metrics._classes.CRPS'>
1LinearizedLogLoss<class 'skpro.metrics._classes.LinearizedLogLo...
2LogLoss<class 'skpro.metrics._classes.LogLoss'>
3SquaredDistrLoss<class 'skpro.metrics._classes.SquaredDistrLoss'>
\n", "
" ], "text/plain": [ " name object\n", "0 CRPS \n", "1 LinearizedLogLoss \n", "3 SquaredDistrLoss " ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from skpro.registry import all_objects\n", "\n", "# \"retrieve all genuinely probabilistic loss functions\"\n", "all_objects(\"metric\", as_dataframe=True, filter_tags={\"scitype:y_pred\": \"pred_proba\"})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Prediction types, metrics, benchmarking " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This section gives more details on:\n", "\n", "* different prediction types, including a methodological primer\n", "* the API of metrics to compare probabilistic predictions to non-probabilistic actuals\n", "* utilities for batch benchmarking of estimators and metrics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.1 Probabilistic predictions - methodological primer " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**readers familir with, or less interested in theory, may like to skip section 2.1**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In supervised learning - probabilistic or not:\n", "\n", "* we fit estimator to i.i.d samples $(X_1, Y_1), \\dots, (X_N, Y_N) \\sim (X_*, Y_*)$\n", "* and want to predict $y$ given $x$ accurately, for $(x, y) \\sim (X_*, Y_*)$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let $y$ be the (true) value, for an observed feature $x$\n", "\n", "(we consider $y$ a random variable)\n", "\n", "| Name | param | prediction/estimate of | `skpro` |\n", "| ---- | ----- | ---------------------- | -------- |\n", "| point prediction | | conditional expectation $\\mathbb{E}[y\\|x]$ | `predict` |\n", "| variance prediction | | conditional variance $Var[y\\|x]$ | `predict_var` |\n", "| quantile prediction | $\\alpha\\in (0,1)$ | $\\alpha$-quantile of $y\\|x$ | `predict_quantiles` |\n", "| interval prediction | $c\\in 
(0,1)$| $[a,b]$ s.t. $P(a\\le y \\le b\\| x) = c$ | `predict_interval` |\n", "| distribution prediction | | the law/distribution of $y\\|x$ | `predict_proba` |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### More formal details & intuition:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "let's consider the toy example again" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import load_diabetes\n", "from sklearn.model_selection import train_test_split\n", "\n", "X, y = load_diabetes(return_X_y=True, as_frame=True)\n", "X_train, X_new, y_train, _ = train_test_split(X, y)" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import load_diabetes\n", "from sklearn.ensemble import RandomForestRegressor\n", "from sklearn.model_selection import train_test_split\n", "\n", "from skpro.regression.residual import ResidualDouble\n", "\n", "X, y = load_diabetes(return_X_y=True, as_frame=True)\n", "X_train, X_new, y_train, _ = train_test_split(X, y)\n", "\n", "\n", "reg_mean = RandomForestRegressor()\n", "reg_proba = ResidualDouble(reg_mean)\n", "\n", "reg_proba.fit(X_train, y_train)\n", "y_pred_proba = reg_proba.predict_proba(X_new)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* a **\"point prediction\"** is a prediction/estimate of the conditional expectation $\\mathbb{E}[y|x]$.\\\n", " **Intuition**: \"out of many repetitions/worlds, this value is the arithmetic average of all observations\"." ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
target
421216.595619
81122.436986
326147.317343
10598.226950
416171.539410
\n", "
" ], "text/plain": [ " target\n", "421 216.595619\n", "81 122.436986\n", "326 147.317343\n", "105 98.226950\n", "416 171.539410" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# if y_pred_proba were *true*, here's how many repetitions would look like:\n", "\n", "# repeating this line is \"one repetition\"\n", "y_pred_proba.sample().head()" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
target
0421174.503233
81128.299773
326174.547102
10585.523671
416166.025376
.........
99399208.666436
109188.566325
2276.930673
174156.354496
10696.456699
\n", "

11100 rows × 1 columns

\n", "
" ], "text/plain": [ " target\n", "0 421 174.503233\n", " 81 128.299773\n", " 326 174.547102\n", " 105 85.523671\n", " 416 166.025376\n", "... ...\n", "99 399 208.666436\n", " 109 188.566325\n", " 22 76.930673\n", " 174 156.354496\n", " 106 96.456699\n", "\n", "[11100 rows x 1 columns]" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "many_samples = y_pred_proba.sample(100)\n", "many_samples" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
target
421213.432031
81126.286757
326173.065740
105100.126654
416178.876668
\n", "
" ], "text/plain": [ " target\n", "421 213.432031\n", "81 126.286757\n", "326 173.065740\n", "105 100.126654\n", "416 178.876668" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# \"doing many times and taking the mean\" -> usual point prediction\n", "mean_prediction = many_samples.groupby(level=1, sort=False).mean()\n", "mean_prediction.head()" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
target
421213.57
81122.14
326172.48
10598.97
416180.31
\n", "
" ], "text/plain": [ " target\n", "421 213.57\n", "81 122.14\n", "326 172.48\n", "105 98.97\n", "416 180.31" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# if we would do this infinity times instead of 100:\n", "y_pred_proba.mean().head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* a **\"variance prediction\"** is a prediction/estimate of the conditional expectation $Var[y|x]$.\\\n", " **Intuition:** \"out of many repetitions/worlds, this value is the average squared distance of the observation to the perfect point prediction\".\n" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
target
421237.798822
81262.881415
326275.523556
105401.240249
416267.851337
\n", "
" ], "text/plain": [ " target\n", "421 237.798822\n", "81 262.881415\n", "326 275.523556\n", "105 401.240249\n", "416 267.851337" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# same as above - take many samples, and then compute element-wise statistics\n", "var_prediction = many_samples.groupby(level=1, sort=False).var()\n", "var_prediction.head()" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
target
421289.154099
81289.154099
326289.154099
105289.154099
416289.154099
\n", "
" ], "text/plain": [ " target\n", "421 289.154099\n", "81 289.154099\n", "326 289.154099\n", "105 289.154099\n", "416 289.154099" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# e.g., predict_var should give the same result as infinite large sample's variance\n", "y_pred_proba.var().head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* a **\"quantile prediction\"**, at quantile point $\\alpha\\in (0,1)$ is a prediction/estimate of the $\\alpha$-quantile of $y'|y$, i.e., of $F^{-1}_{y|x}(\\alpha)$, where $F^{-1}$ is the (generalized) inverse cdf = quantile function of the random variable y|x.\\\n", " **Intuition**: \"out of many repetitions/worlds, a fraction of exactly $\\alpha$ will have equal or smaller than this value.\"\n", "* an **\"interval prediction\"** or \"predictive interval\" with (symmetric) coverage $c\\in (0,1)$ is a prediction/estimate pair of lower bound $a$ and upper bound $b$ such that $P(a\\le y \\le b| x) = c$ and $P(y \\gneq b| x) = P(y \\lneq a| x) = (1 - c) /2$.\\\n", " **Intuition**: \"out of many repetitions/worlds, a fraction of exactly $c$ will be contained in the interval $[a,b]$, and being above is equally likely as being below\"." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(similar - exercise left to the reader)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* a **\"distribution prediction\"** or \"full probabilistic prediction\" is a prediction/estimate of the distribution of $y|x$, e.g., \"it's a normal distribution with mean 42 and variance 1\".\\\n", "**Intuition**: exhaustive description of the generating mechanism of many repetitions/worlds." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "note: the true distribution is unknown, and not accessible easily!\n", "\n", "`y_pred_proba` is a distribution, but in general not equal to the true one!" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "that is, there are:\n", "\n", "* *true* distribution `y_pred_proba_true` - unknown and unknowable but estimable\n", "* `y_pred_proba` - our guess at `y_pred_proba_true`\n", "* the actual data `y_true` is *one* `y_pred_proba_true.sample()`\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `predict` produces guess of `y_pred_proba_true.mean()`\n", "* `predict_var` produces guess of `y_pred_proba_true.var()`\n", "* `predict_quantiles([0.05, 0.5, 0.95])` produces guess of `y_pred_proba_true.quantiles([0.05, 0.5, 0.95])`\n", "* `predict_proba` produces guess of `y_pred_proba_true`\n", "\n", "the guesses are algorithm specific, and some algorithms are more accurate than others, given data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2 probabilistic metrics - details " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "General usage pattern same as for `sklearn` metrics:\n", "\n", "1. get some actuals and predictions\n", "2. specify the metric - similar to estimator specs\n", "3. plug the actuals and predictions into metric to get metric values\n", "\n", "*but*: need to use dedicated metric for probabilistic predictions\n", "\n", "* ground truth: `y_true` samples\n", "* prediction e.g., `y_predict_proba`, `y_predict_interval`\n", "* so, match metric with type of prediction!\n", " * `metric(y_true: 2D pd.DataFrame, y_pred: proba_prediction_type) -> float`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recall methods available for all probabilistic regressors:\n", "\n", "- `predict_interval` produces interval predictions.\n", " Argument `coverage` (nominal interval coverage) must be provided.\n", "- `predict_quantiles` produces quantile predictions.\n", " Argument `alpha` (quantile values) must be provided.\n", "- `predict_var` produces variance predictions. Same args as `predict`.\n", "- `predict_proba` produces full distributional predictions. 
Same args as `predict`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "| Name | param | prediction/estimate of | `skpro` |\n", "| ---- | ----- | ---------------------- | -------- |\n", "| point prediction | | conditional expectation $\\mathbb{E}[y\\|x]$ | `predict` |\n", "| variance prediction | | conditional variance $Var[y\\|x]$ | `predict_var` |\n", "| quantile prediction | $\\alpha\\in (0,1)$ | $\\alpha$-quantile of $y\\|x$ | `predict_quantiles` |\n", "| interval prediction | $c\\in (0,1)$| $[a,b]$ s.t. $P(a\\le y \\le b\\| x) = c$ | `predict_interval` |\n", "| distribution prediction | | the law/distribution of $y\\|x$ | `predict_proba` |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "let's produce some probabilistic predictions!" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "# 1. get some actuals and predictions\n", "from sklearn.datasets import load_diabetes\n", "from sklearn.model_selection import train_test_split\n", "\n", "X, y = load_diabetes(return_X_y=True, as_frame=True)\n", "X_train, X_test, y_train, y_test = train_test_split(X, y)\n", "# actuals = y_test" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [ "from sklearn.ensemble import RandomForestRegressor\n", "\n", "from skpro.regression.residual import ResidualDouble\n", "\n", "reg_mean = RandomForestRegressor()\n", "reg_proba = ResidualDouble(reg_mean)\n", "\n", "reg_proba.fit(X_train, y_train)\n", "\n", "# use any of the probabilistic methods, we have seen this\n", "y_pred_int = reg_proba.predict_interval(X_test, coverage=0.95)\n", "y_pred_q = reg_proba.predict_quantiles(X_test, alpha=[0.05, 0.95])\n", "y_pred_proba = reg_proba.predict_proba(X_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "recall, all have their own output format:" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
target
0.95
lowerupper
104112.79411180.58589
33463.02411130.81589
271116.11411183.90589
342155.14411222.93589
24351.14411118.93589
.........
979.16411146.95589
4686.54411154.33589
154136.21411204.00589
402133.12411200.91589
284145.72411213.51589
\n", "

111 rows × 2 columns

\n", "
" ], "text/plain": [ " target \n", " 0.95 \n", " lower upper\n", "104 112.79411 180.58589\n", "334 63.02411 130.81589\n", "271 116.11411 183.90589\n", "342 155.14411 222.93589\n", "243 51.14411 118.93589\n", ".. ... ...\n", "9 79.16411 146.95589\n", "46 86.54411 154.33589\n", "154 136.21411 204.00589\n", "402 133.12411 200.91589\n", "284 145.72411 213.51589\n", "\n", "[111 rows x 2 columns]" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_pred_int # lower/upper intervals" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
target
0.050.95
104118.243673175.136327
33468.473673125.366327
271121.563673178.456327
342160.593673217.486327
24356.593673113.486327
.........
984.613673141.506327
4691.993673148.886327
154141.663673198.556327
402138.573673195.466327
284151.173673208.066327
\n", "

111 rows × 2 columns

\n", "
" ], "text/plain": [ " target \n", " 0.05 0.95\n", "104 118.243673 175.136327\n", "334 68.473673 125.366327\n", "271 121.563673 178.456327\n", "342 160.593673 217.486327\n", "243 56.593673 113.486327\n", ".. ... ...\n", "9 84.613673 141.506327\n", "46 91.993673 148.886327\n", "154 141.663673 198.556327\n", "402 138.573673 195.466327\n", "284 151.173673 208.066327\n", "\n", "[111 rows x 2 columns]" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_pred_q # quantiles" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Normal(columns=Index(['target'], dtype='object'),\n",
       "       index=Index([104, 334, 271, 342, 243, 120, 183, 333, 315, 357,\n",
       "       ...\n",
       "       313, 111, 137, 198, 413,   9,  46, 154, 402, 284],\n",
       "      dtype='int64', length=111),\n",
       "       mu=array([[146.69],\n",
       "       [ 96.92],\n",
       "       [150.01],\n",
       "       [189.04],\n",
       "       [ 85.04],\n",
       "       [ 91.04],\n",
       "       [202.31],\n",
       "       [226.03],\n",
       "       [ 90.7 ],\n",
       "       [153.92],\n",
       "       [263.72],\n",
       "       [139.12],\n",
       "       [ 81.82],\n",
       "       [138.16],\n",
       "       [144.98],\n",
       "       [ 90.28],\n",
       "       [229.32],\n",
       "       [200.46],\n",
       "       [ 83.39],\n",
       "       [196.02],...\n",
       "       [17.29413897],\n",
       "       [17.29413897],\n",
       "       [17.29413897],\n",
       "       [17.29413897],\n",
       "       [17.29413897],\n",
       "       [17.29413897],\n",
       "       [17.29413897],\n",
       "       [17.29413897],\n",
       "       [17.29413897],\n",
       "       [17.29413897],\n",
       "       [17.29413897],\n",
       "       [17.29413897],\n",
       "       [17.29413897],\n",
       "       [17.29413897],\n",
       "       [17.29413897],\n",
       "       [17.29413897],\n",
       "       [17.29413897],\n",
       "       [17.29413897],\n",
       "       [17.29413897],\n",
       "       [17.29413897],\n",
       "       [17.29413897],\n",
       "       [17.29413897],\n",
       "       [17.29413897],\n",
       "       [17.29413897],\n",
       "       [17.29413897]]))
Please rerun this cell to show the HTML repr or trust the notebook.
" ], "text/plain": [ "Normal(columns=Index(['target'], dtype='object'),\n", " index=Index([104, 334, 271, 342, 243, 120, 183, 333, 315, 357,\n", " ...\n", " 313, 111, 137, 198, 413, 9, 46, 154, 402, 284],\n", " dtype='int64', length=111),\n", " mu=array([[146.69],\n", " [ 96.92],\n", " [150.01],\n", " [189.04],\n", " [ 85.04],\n", " [ 91.04],\n", " [202.31],\n", " [226.03],\n", " [ 90.7 ],\n", " [153.92],\n", " [263.72],\n", " [139.12],\n", " [ 81.82],\n", " [138.16],\n", " [144.98],\n", " [ 90.28],\n", " [229.32],\n", " [200.46],\n", " [ 83.39],\n", " [196.02],...\n", " [17.29413897],\n", " [17.29413897],\n", " [17.29413897],\n", " [17.29413897],\n", " [17.29413897],\n", " [17.29413897],\n", " [17.29413897],\n", " [17.29413897],\n", " [17.29413897],\n", " [17.29413897],\n", " [17.29413897],\n", " [17.29413897],\n", " [17.29413897],\n", " [17.29413897],\n", " [17.29413897],\n", " [17.29413897],\n", " [17.29413897],\n", " [17.29413897],\n", " [17.29413897],\n", " [17.29413897],\n", " [17.29413897],\n", " [17.29413897],\n", " [17.29413897],\n", " [17.29413897],\n", " [17.29413897]]))" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_pred_proba # sktime/skpro BaseDistribution" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "we now need to apply a suitable metric, `metric(y_test, y_pred)`\n", "\n", "IMPORTANT: sequence matters, `y_test` first; `y_pred` has very different type!" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "40.363339883507926" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 2. specify metric\n", "# CRPS = continuous ranked probability score, for distribution predictions\n", "from skpro.metrics import CRPS\n", "\n", "crps = CRPS()\n", "\n", "# 3. 
evaluate metric\n", "crps(y_test, y_pred_proba)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "how do we find a metric that fits the prediction type?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "answer: metrics are tagged\n", "\n", "important tag: `scitype:y_pred`\n", "\n", "* `\"pred_proba\"` - distributional metric, can be applied to distributional predictions\n", "  * applicable to `predict_proba` outputs\n", "* `\"pred_quantiles\"` - quantile prediction metric, can be applied to quantile predictions, interval predictions, distributional predictions\n", "  * applicable to `predict_quantiles`, `predict_interval`, `predict_proba` outputs\n", "* `\"pred_interval\"` - interval prediction metric, can be applied to interval predictions, distributional predictions\n", "  * applicable to `predict_interval`, `predict_proba` outputs" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'estimator_type': 'estimator',\n", " 'object_type': 'metric',\n", " 'reserved_params': ['multioutput', 'score_average'],\n", " 'scitype:y_pred': 'pred_proba',\n", " 'lower_is_better': True}" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "crps.get_tags()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "listing metrics with the tag, filtering for probabilistic tags:\n", "\n", "(let's try to find a quantile prediction metric!)" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>name</th>\n",
"      <th>object</th>\n",
"      <th>scitype:y_pred</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>CRPS</td><td>&lt;class 'skpro.metrics._classes.CRPS'&gt;</td><td>pred_proba</td></tr>\n",
"    <tr><th>1</th><td>ConstraintViolation</td><td>&lt;class 'skpro.metrics._classes.ConstraintViola...</td><td>pred_interval</td></tr>\n",
"    <tr><th>2</th><td>EmpiricalCoverage</td><td>&lt;class 'skpro.metrics._classes.EmpiricalCovera...</td><td>pred_interval</td></tr>\n",
"    <tr><th>3</th><td>LinearizedLogLoss</td><td>&lt;class 'skpro.metrics._classes.LinearizedLogLo...</td><td>pred_proba</td></tr>\n",
"    <tr><th>4</th><td>LogLoss</td><td>&lt;class 'skpro.metrics._classes.LogLoss'&gt;</td><td>pred_proba</td></tr>\n",
"    <tr><th>5</th><td>PinballLoss</td><td>&lt;class 'skpro.metrics._classes.PinballLoss'&gt;</td><td>pred_quantiles</td></tr>\n",
"    <tr><th>6</th><td>SquaredDistrLoss</td><td>&lt;class 'skpro.metrics._classes.SquaredDistrLoss'&gt;</td><td>pred_proba</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " name object \\\n", "0 CRPS \n", "1 ConstraintViolation \n", "5 PinballLoss \n", "6 SquaredDistrLoss \n", "\n", " scitype:y_pred \n", "0 pred_proba \n", "1 pred_interval \n", "2 pred_interval \n", "3 pred_proba \n", "4 pred_proba \n", "5 pred_quantiles \n", "6 pred_proba " ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from skpro.registry import all_objects\n", "\n", "all_objects(\n", " \"metric\",\n", " as_dataframe=True,\n", " return_tags=\"scitype:y_pred\",\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`PinballLoss` is a quantile forecast metric:" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "13.648123260286354" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from skpro.metrics import PinballLoss\n", "\n", "pinball_loss = PinballLoss()\n", "\n", "pinball_loss(y_test, y_pred_q)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "... this is by default an average (grand average, float)\n", "\n", "* averages over samples in `y_pred` / `y_test` (rows)\n", "* averages over variables (columns)\n", "* average over `alpha` values, quantile points" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "what if we don't want these averages?" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* variable (column) averaging is controlled by the `multioutput` arg.\n", "    * `\"raw_values\"` prevents averaging, `\"uniform_average\"` computes the arithmetic mean.\n", "* averaging over quantile points (`alpha`) or coverages (`coverage`) is controlled by the `score_average` arg\n", "* row-wise evaluation, without averaging over samples, is available via the `evaluate_by_index` method\n", "    * can be useful for diagnostics or statistical tests" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.05    13.133967\n", "0.95    14.162279\n", "Name: 0, dtype: float64" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Example 1: Pinball loss by quantile point\n", "loss_multi = PinballLoss(score_average=False)\n", "loss_multi(y_test, y_pred_q)" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>target</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>104</th><td>41.946574</td></tr>\n",
"    <tr><th>334</th><td>16.320990</td></tr>\n",
"    <tr><th>271</th><td>14.728135</td></tr>\n",
"    <tr><th>342</th><td>6.761400</td></tr>\n",
"    <tr><th>243</th><td>28.452058</td></tr>\n",
"    <tr><th>...</th><td>...</td></tr>\n",
"    <tr><th>9</th><td>187.182827</td></tr>\n",
"    <tr><th>46</th><td>59.803051</td></tr>\n",
"    <tr><th>154</th><td>18.026285</td></tr>\n",
"    <tr><th>402</th><td>4.063702</td></tr>\n",
"    <tr><th>284</th><td>15.229778</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>111 rows × 1 columns</p>\n",
"</div>
" ], "text/plain": [ " target\n", "104 41.946574\n", "334 16.320990\n", "271 14.728135\n", "342 6.761400\n", "243 28.452058\n", ".. ...\n", "9 187.182827\n", "46 59.803051\n", "154 18.026285\n", "402 4.063702\n", "284 15.229778\n", "\n", "[111 rows x 1 columns]" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Example 2: CRPS by test sample index\n", "crps.evaluate_by_index(y_test, y_pred_proba)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Caveat: not every metric is an average over time points, e.g., RMSE\n", "\n", "In this case, `evaluate_by_index` computes jackknife pseudo-samples\n", "\n", "(for mean statistics, jackknife pseudo-samples are equal to individual samples)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.3 Benchmark evaluation of probabilistic regressors " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "for quick evaluation and benchmarking,\n", "\n", "the `benchmarking.evaluate` utility can be used:" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
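<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>test_CRPS</th>\n",
"      <th>fit_time</th>\n",
"      <th>pred_time</th>\n",
"      <th>len_y_train</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>31.783451</td><td>0.002778</td><td>0.001045</td><td>294</td></tr>\n",
"    <tr><th>1</th><td>33.574329</td><td>0.003086</td><td>0.001094</td><td>295</td></tr>\n",
"    <tr><th>2</th><td>29.909655</td><td>0.003807</td><td>0.001278</td><td>295</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>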
" ], "text/plain": [ " test_CRPS fit_time pred_time len_y_train\n", "0 31.783451 0.002778 0.001045 294\n", "1 33.574329 0.003086 0.001094 295\n", "2 29.909655 0.003807 0.001278 295" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.datasets import load_diabetes\n", "from sklearn.linear_model import LinearRegression\n", "from sklearn.model_selection import KFold\n", "\n", "from skpro.benchmarking.evaluate import evaluate\n", "from skpro.metrics import CRPS\n", "from skpro.regression.residual import ResidualDouble\n", "\n", "# 1. specify dataset\n", "X, y = load_diabetes(return_X_y=True, as_frame=True)\n", "\n", "# 2. specify estimator\n", "estimator = ResidualDouble(LinearRegression())\n", "\n", "# 3. specify cross-validation schema\n", "cv = KFold(n_splits=3)\n", "\n", "# 4. specify evaluation metric\n", "crps = CRPS()\n", "\n", "# 5. evaluate - run the benchmark\n", "results = evaluate(estimator=estimator, X=X, y=y, cv=cv, scoring=crps)\n", "\n", "# results are pd.DataFrame\n", "# each row is one repetition of the cross-validation on one fold fit/predict/evaluate\n", "# columns report performance, runtime, and other optional information (see docstring)\n", "results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. 
Advanced composition patterns " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "we introduce a number of composition patterns available in `skpro`:\n", "\n", "* reducer-wrappers that turn `sklearn` regressors into probabilistic ones\n", "* pipelines of `sklearn` transformers with `skpro` regressors\n", "* tuning `skpro` probabilistic regressors via grid/random search, minimizing a probabilistic metric\n", "* ensembling multiple `skpro` probabilistic regressors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "data used in this section:" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import load_diabetes\n", "from sklearn.model_selection import train_test_split\n", "\n", "X, y = load_diabetes(return_X_y=True, as_frame=True)\n", "X_train, X_test, y_train, y_test = train_test_split(X, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "evaluation metric used in this section:" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [], "source": [ "crps = CRPS()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.1 Reducers to turn `sklearn` regressors probabilistic " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "there are many common algorithms that turn a non-probabilistic tabular regressor into a probabilistic one\n", "\n", "formally, this is a type of \"reduction\" - of probabilistic supervised tabular regression to non-probabilistic supervised tabular regression\n", "\n", "Examples:\n", "\n", "* predicting variance equal to training residual variance - `ResidualDouble` with standard settings\n", "    * or other unconditional distribution estimate for residuals\n", "* \"squaring the residual\" two-step prediction - `ResidualDouble`\n", "* bootstrap prediction intervals - `BootstrapRegressor`\n", "* conformal prediction intervals - contributions appreciated :-)\n", "* natural gradient boosting aka NGBoost - contributions appreciated :-)" ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ "### 3.1.1 constant variance prediction " ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "36.25453698363302" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.ensemble import RandomForestRegressor\n", "from sklearn.model_selection import KFold\n", "\n", "# estimator specification - use any sklearn regressor for reg_mean\n", "reg_mean = RandomForestRegressor()\n", "reg_proba = ResidualDouble(reg_mean, cv=KFold(5))\n", "# cv is used to estimate out-of-sample residual variance via 5-fold CV\n", "# note - in-sample predictions will usually underestimate the variance!\n", "\n", "# fit and predict\n", "reg_proba.fit(X_train, y_train)\n", "y_pred_proba = reg_proba.predict_proba(X_test)\n", "\n", "# evaluate\n", "crps(y_test, y_pred_proba)" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from skpro.utils.plotting import plot_crossplot_interval\n", "\n", "plot_crossplot_interval(y_test, y_pred_proba, coverage=0.9)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.1.2 two-step residual prediction " ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "36.860503015336164" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.ensemble import RandomForestRegressor\n", "from sklearn.model_selection import KFold\n", "\n", "# estimator specification - use any sklearn regressor for reg_mean and reg_resid\n", "reg_mean = RandomForestRegressor()\n", "reg_resid = RandomForestRegressor()\n", "reg_proba = ResidualDouble(reg_mean, estimator_resid=reg_resid, cv=KFold(5))\n", "# cv is used to estimate out-of-sample residual variance via 5-fold CV\n", "\n", "# fit and predict\n", "reg_proba.fit(X_train, y_train)\n", "y_pred_proba = reg_proba.predict_proba(X_test)\n", "\n", "# evaluate\n", "crps(y_test, y_pred_proba)" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from skpro.utils.plotting import plot_crossplot_interval\n", "\n", "plot_crossplot_interval(y_test, y_pred_proba, coverage=0.9)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.1.3 bootstrap prediction intervals " ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "43.13652636195608" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.linear_model import LinearRegression\n", "\n", "from skpro.regression.bootstrap import BootstrapRegressor\n", "\n", "# estimator specification - use any sklearn regressor for reg_mean\n", "reg_mean = LinearRegression()\n", "reg_proba = BootstrapRegressor(reg_mean, n_bootstrap_samples=100)\n", "\n", "# fit and predict\n", "reg_proba.fit(X_train, y_train)\n", "y_pred_proba = reg_proba.predict_proba(X_test)\n", "\n", "# evaluate\n", "crps(y_test, y_pred_proba)" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from skpro.utils.plotting import plot_crossplot_interval\n", "\n", "plot_crossplot_interval(y_test, y_pred_proba, coverage=0.9)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.2 Pipelines of `skpro` regressor and `sklearn` transformers " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`skpro` regressors can be pipelined with `sklearn` transformers, using the `skpro` pipeline.\n", "\n", "This ensure presence of `predict_proba` etc in the pipeline object.\n", "\n", "The syntax is exactly the same as for `sklearn`'s pipeline." ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import load_diabetes\n", "from sklearn.model_selection import train_test_split\n", "\n", "X, y = load_diabetes(return_X_y=True, as_frame=True)\n", "X_train, X_test, y_train, y_test = train_test_split(X, y)" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [], "source": [ "from sklearn.impute import SimpleImputer as Imputer\n", "from sklearn.linear_model import LinearRegression\n", "from sklearn.preprocessing import MinMaxScaler\n", "\n", "from skpro.regression.compose import Pipeline\n", "from skpro.regression.residual import ResidualDouble\n", "\n", "# estimator specification\n", "reg_mean = LinearRegression()\n", "reg_proba = ResidualDouble(reg_mean)\n", "\n", "# pipeline is specified as a list of tuples (name, estimator)\n", "pipe = Pipeline(\n", " steps=[\n", " (\"imputer\", Imputer()), # an sklearn transformer\n", " (\"scaler\", MinMaxScaler()), # an sklearn transformer\n", " (\"regressor\", reg_proba), # an skpro regressor\n", " ]\n", ")" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<pre>Pipeline(steps=[('imputer', SimpleImputer()), ('scaler', MinMaxScaler()),\n",
"                ('regressor', ResidualDouble(estimator=LinearRegression()))])</pre>
" ], "text/plain": [ "Pipeline(steps=[('imputer', SimpleImputer()), ('scaler', MinMaxScaler()),\n", " ('regressor', ResidualDouble(estimator=LinearRegression()))])" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [], "source": [ "# the pipeline behaves as any skpro regressor\n", "pipe.fit(X_train, y_train)\n", "y_pred = pipe.predict(X=X_test)\n", "y_pred_proba = pipe.predict_proba(X=X_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "the pipeline provides the familiar nested `get_params`, `set_params` interface:\n", "\n", "nested parameters are keyed `componentname__parametername`" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'steps': [('imputer', SimpleImputer()),\n", " ('scaler', MinMaxScaler()),\n", " ('regressor', ResidualDouble(estimator=LinearRegression()))],\n", " 'imputer': SimpleImputer(),\n", " 'scaler': MinMaxScaler(),\n", " 'regressor': ResidualDouble(estimator=LinearRegression()),\n", " 'imputer__add_indicator': False,\n", " 'imputer__copy': True,\n", " 'imputer__fill_value': None,\n", " 'imputer__missing_values': nan,\n", " 'imputer__strategy': 'mean',\n", " 'imputer__verbose': 'deprecated',\n", " 'scaler__clip': False,\n", " 'scaler__copy': True,\n", " 'scaler__feature_range': (0, 1),\n", " 'regressor__cv': None,\n", " 'regressor__distr_loc_scale_name': None,\n", " 'regressor__distr_params': None,\n", " 'regressor__distr_type': 'Normal',\n", " 'regressor__estimator': LinearRegression(),\n", " 'regressor__estimator_resid': None,\n", " 'regressor__min_scale': 1e-10,\n", " 'regressor__residual_trafo': 'absolute',\n", " 'regressor__use_y_pred': False,\n", " 'regressor__estimator__copy_X': True,\n", " 'regressor__estimator__fit_intercept': True,\n", " 'regressor__estimator__n_jobs': None,\n", " 'regressor__estimator__normalize': 'deprecated',\n", 
" 'regressor__estimator__positive': False}" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe.get_params()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "pipelines can also be created via simple lists of estimators,\n", "\n", "in this case names are generated automatically:" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [], "source": [ "# pipeline is specified as a list of tuples (name, estimator)\n", "pipe = Pipeline(\n", " steps=[\n", " Imputer(), # an sklearn transformer\n", " MinMaxScaler(), # an sklearn transformer\n", " reg_proba, # an skpro regressor\n", " ]\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.3 Tuning of `skpro` regressors via grid and random search " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`skpro` provides grid and random search tuners to tune arbitrary probabilistic regressors,\n", "\n", "using probabilistic metrics. Besides this, they function as the `sklearn` tuners do." 
] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import load_diabetes\n", "from sklearn.model_selection import train_test_split\n", "\n", "X, y = load_diabetes(return_X_y=True, as_frame=True)\n", "X_train, X_test, y_train, y_test = train_test_split(X, y)" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LinearRegression\n", "from sklearn.model_selection import KFold\n", "\n", "from skpro.metrics import CRPS\n", "from skpro.model_selection import GridSearchCV\n", "from skpro.regression.residual import ResidualDouble\n", "\n", "# cross-validation specification for tuner\n", "cv = KFold(n_splits=3)\n", "\n", "# estimator to be tuned\n", "estimator = ResidualDouble(LinearRegression())\n", "\n", "# tuning grid - do we fit an intercept in the linear regression?\n", "param_grid = {\"estimator__fit_intercept\": [True, False]}\n", "\n", "# metric to be optimized\n", "crps_metric = CRPS()\n", "\n", "# specification of the grid search tuner\n", "gscv = GridSearchCV(\n", " estimator=estimator,\n", " param_grid=param_grid,\n", " cv=cv,\n", " scoring=crps_metric,\n", ")" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<pre>GridSearchCV(cv=KFold(n_splits=3, random_state=None, shuffle=False),\n",
"             estimator=ResidualDouble(estimator=LinearRegression()),\n",
"             param_grid={'estimator__fit_intercept': [True, False]},\n",
"             scoring=CRPS())</pre>
" ], "text/plain": [ "GridSearchCV(cv=KFold(n_splits=3, random_state=None, shuffle=False),\n", " estimator=ResidualDouble(estimator=LinearRegression()),\n", " param_grid={'estimator__fit_intercept': [True, False]},\n", " scoring=CRPS())" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gscv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "the grid search tuner behaves like any `skpro` probabilistic regressor:" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [], "source": [ "gscv.fit(X_train, y_train)\n", "y_pred = gscv.predict(X_test)\n", "y_pred_proba = gscv.predict_proba(X_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "random search is similar, except that instead of a grid a parameter sampler should be specified:" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [], "source": [ "from skpro.model_selection import RandomizedSearchCV\n", "\n", "# only difference to GridSearchCV is the param_distributions argument\n", "\n", "# specification of the random search parameter sampler\n", "param_distributions = {\"estimator__fit_intercept\": [True, False]}\n", "\n", "# specification of the random search tuner\n", "rscv = RandomizedSearchCV(\n", " estimator=estimator,\n", " param_distributions=param_distributions,\n", " cv=cv,\n", " scoring=crps_metric,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.4 Bagging/mixture ensemble of probabilistic regressors " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Classical bagging does the following, for a wrapped estimator:\n", "\n", "In `fit`:\n", "\n", "1. subsample rows and/or columns of `X`, `y` to `X_subs`, `y_subs`\n", "2. fit clone of wrapped estimator to `X_subs`, `y_subs`\n", "3. Repeat 1-2 `n_estimators` times, store that many fitted clones.\n", "\n", "In `predict`, for `X_test`:\n", "\n", "1. 
for all fitted clones, obtain predictions on `X_test` - these are distributions\n", "2. return the uniform mixture of these distributions, per test sample" ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import load_diabetes\n", "from sklearn.model_selection import train_test_split\n", "\n", "X, y = load_diabetes(return_X_y=True, as_frame=True)\n", "X_train, X_test, y_train, y_test = train_test_split(X, y)" ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LinearRegression\n", "\n", "from skpro.regression.ensemble import BaggingRegressor\n", "from skpro.regression.residual import ResidualDouble\n", "\n", "reg_mean = LinearRegression()\n", "reg_proba = ResidualDouble(reg_mean)\n", "\n", "ens = BaggingRegressor(reg_proba, n_estimators=10)\n", "ens.fit(X_train, y_train)\n", "\n", "y_pred = ens.predict_proba(X_test)" ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"Mixture(columns=Index(['target'], dtype='object'),\\n distributions=[Normal(columns=Index(['target'], dtype='object'),\\n index=Index([367, 248, 134, 148, 21, 310, 37, 305, 23, 373,\\n ...\\n 365, 204, 309, 379, 7, 440, 178, 160, 119, 441],\\n dtype='int64', length=111),\\n mu=array([[270.91496133],\\n [227.0209857 ],\\n [153.30255219],\\n [133.24435057],\\n [108.55506881],\\n [206.51166703],\\n [153.83348539],\\n [120.5...\\n [45.96784484],\\n [45.96784484],\\n [45.96784484],\\n [45.96784484],\\n [45.96784484],\\n [45.96784484],\\n [45.96784484],\\n [45.96784484],\\n [45.96784484],\\n [45.96784484],\\n [45.96784484],\\n [45.96784484],\\n [45.96784484],\\n [45.96784484],\\n [45.96784484],\\n [45.96784484],\\n [45.96784484]]))],\\n index=Index([367, 248, 134, 148, 21, 310, 37, 305, 23, 373,\\n ...\\n 365, 204, 309, 379, 7, 440, 178, 160, 119, 441],\\n dtype='int64', length=111))\"" ] }, 
"execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# y_pred is a mixture distribution!\n", "str(y_pred)" ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[skpro.distributions.normal.Normal,\n", " skpro.distributions.normal.Normal,\n", " skpro.distributions.normal.Normal,\n", " skpro.distributions.normal.Normal,\n", " skpro.distributions.normal.Normal,\n", " skpro.distributions.normal.Normal,\n", " skpro.distributions.normal.Normal,\n", " skpro.distributions.normal.Normal,\n", " skpro.distributions.normal.Normal,\n", " skpro.distributions.normal.Normal]" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[type(x) for x in y_pred.distributions]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Extension guide - implementing your own probabilistic regressor " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "`skpro` is meant to be easily extensible, for direct contribution to `skpro` as well as for local/private extension with custom methods.\n", "\n", "To get started:\n", "\n", "* Follow the [\"implementing estimator\" developer guide](https://skpro.readthedocs.io/en/stable/developer_guide/add_estimators.html)\n", "* Use the [probabilistic regressor template](https://github.com/sktime/skpro/blob/main/extension_templates/regression.py) to get started\n", "\n", "1. Read through the [probabilistic regression extension template](https://github.com/sktime/skpro/blob/main/extension_templates/regression.py) - this is a `python` file with `todo` blocks that mark the places in which changes need to be added.\n", "2. 
Copy the probabilistic regressor extension template to a local folder in your own repository (local/private extension), or to a suitable location in your clone of the `skpro` or affiliated repository (if contributed extension), inside `skpro.regression`; rename the file and update the file docstring appropriately.\n", "3. Address the \"todo\" parts. Usually, this means: changing the name of the class, setting the tag values, specifying hyper-parameters, filling in `__init__`, `_fit`, and at least one of the probabilistic prediction methods, preferably `_predict_proba`. You can add private methods as long as they do not override the default public interface. For more details, see the extension template.\n", "4. To test your estimator manually: import your estimator and run it in the workflows in Section 1; then use it in the compositors in Section 3.\n", "5. To test your estimator automatically: call `skpro.utils.check_estimator` on your estimator. You can call this on a class or object instance. Ensure you have specified test parameters in the `get_test_params` method, according to the extension template.\n", "\n", "In case of direct contribution to `skpro` or one of its affiliated packages, additionally:\n", "\n", "* Add yourself as an author to the code, and to the `CODEOWNERS` for the new estimator file(s).\n", "* Create a pull request that contains only the new estimators (and their inheritance tree, if it's not just one class), as well as the automated tests as described above.\n", "* In the pull request, describe the estimator and ideally provide a publication or other technical reference for the strategy it implements.\n", "* Before making the pull request, ensure that you have all necessary permissions to contribute the code to a permissive license (BSD-3) open source project." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. 
Summary\n", "\n", "* `skpro` is a unified interface toolbox for probabilistic supervised regression, that is, for prediction intervals, quantiles, and fully distributional predictions, in a tabular regression setting. The interface is fully interoperable with `scikit-learn` and `scikit-base` interface specifications.\n", "\n", "* `skpro` comes with rich composition functionality that makes it easy to build complex pipelines and to connect with other parts of the open source ecosystem, such as `scikit-learn` and individual algorithm libraries.\n", "\n", "* `skpro` is easy to extend, and comes with user-friendly tools to facilitate implementing and testing your own probabilistic regressors and composition principles." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "### Credits:\n", "\n", "notebook creation: fkiraly\n", "\n", "skpro: https://github.com/sktime/skpro/blob/main/CONTRIBUTORS.md" ] } ], "metadata": { "kernelspec": { "display_name": "skpro", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "3e631b8a076cc106144e9b132b7d31cae2f1e2660b47e5f9fcb0397caae5fbd5" } } }, "nbformat": 4, "nbformat_minor": 2 }