{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# \ud83d\udcc3 Solution for Exercise M3.02\n", "\n", "The goal is to find the best set of hyperparameters which maximize the\n", "generalization performance on a training set." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import fetch_california_housing\n", "from sklearn.model_selection import train_test_split\n", "\n", "data, target = fetch_california_housing(return_X_y=True, as_frame=True)\n", "target *= 100 # rescale the target in k$\n", "\n", "data_train, data_test, target_train, target_test = train_test_split(\n", " data, target, random_state=42\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this exercise, we progressively define the regression pipeline and later\n", "tune its hyperparameters.\n", "\n", "Start by defining a pipeline that:\n", "* uses a `StandardScaler` to normalize the numerical data;\n", "* uses a `sklearn.neighbors.KNeighborsRegressor` as a predictive model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# solution\n", "from sklearn.pipeline import make_pipeline\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.neighbors import KNeighborsRegressor\n", "\n", "scaler = StandardScaler()\n", "model = make_pipeline(scaler, KNeighborsRegressor())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use `RandomizedSearchCV` with `n_iter=20` to find the best set of\n", "hyperparameters by tuning the following parameters of the `model`:\n", "\n", "- the parameter `n_neighbors` of the `KNeighborsRegressor` with values\n", " `np.logspace(0, 3, num=10).astype(np.int32)`;\n", "- the parameter `with_mean` of the `StandardScaler` with possible values\n", " `True` or `False`;\n", "- the parameter `with_std` of the `StandardScaler` with possible values `True`\n", " or `False`.\n", "\n", "Notice that in the notebook \"Hyperparameter tuning by randomized-search\" we\n", "pass distributions to be sampled by the `RandomizedSearchCV`. In this case we\n", "define a fixed grid of hyperparameters to be explored. Using a `GridSearchCV`\n", "instead would explore all the possible combinations on the grid, which can be\n", "costly to compute for large grids, whereas the parameter `n_iter` of the\n", "`RandomizedSearchCV` controls the number of different random combination that\n", "are evaluated. Notice that setting `n_iter` larger than the number of possible\n", "combinations in a grid (in this case 10 x 2 x 2 = 40) would lead to repeating\n", "already-explored combinations.\n", "\n", "Once the computation has completed, print the best combination of parameters\n", "stored in the `best_params_` attribute." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# solution\n", "import numpy as np\n", "from sklearn.model_selection import RandomizedSearchCV\n", "\n", "param_distributions = {\n", " \"kneighborsregressor__n_neighbors\": np.logspace(0, 3, num=10).astype(\n", " np.int32\n", " ),\n", " \"standardscaler__with_mean\": [True, False],\n", " \"standardscaler__with_std\": [True, False],\n", "}\n", "\n", "model_random_search = RandomizedSearchCV(\n", " model,\n", " param_distributions=param_distributions,\n", " n_iter=20,\n", " n_jobs=2,\n", " verbose=1,\n", " random_state=1,\n", ")\n", "model_random_search.fit(data_train, target_train)\n", "model_random_search.best_params_" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "solution" ] }, "source": [ "So the best hyperparameters give a model where the features are scaled but not\n", "centered.\n", "\n", "Getting the best parameter combinations is the main outcome of the\n", "hyper-parameter optimization procedure. However it is also interesting to\n", "assess the sensitivity of the best models to the choice of those parameters.\n", "The following code, not required to answer the quiz question shows how to\n", "conduct such an interactive analysis for this this pipeline using a parallel\n", "coordinate plot using the `plotly` library.\n", "\n", "We could use `cv_results = model_random_search.cv_results_` to make a parallel\n", "coordinate plot as we did in the previous notebook (you are more than welcome\n", "to try!)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "solution" ] }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "cv_results = pd.DataFrame(model_random_search.cv_results_)" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "solution" ] }, "source": [ "To simplify the axis of the plot, we rename the column of the dataframe and\n", "only select the mean test score and the value of the hyperparameters." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "solution" ] }, "outputs": [], "source": [ "column_name_mapping = {\n", " \"param_kneighborsregressor__n_neighbors\": \"n_neighbors\",\n", " \"param_standardscaler__with_mean\": \"centering\",\n", " \"param_standardscaler__with_std\": \"scaling\",\n", " \"mean_test_score\": \"mean test score\",\n", "}\n", "\n", "cv_results = cv_results.rename(columns=column_name_mapping)\n", "cv_results = cv_results[column_name_mapping.values()].sort_values(\n", " \"mean test score\", ascending=False\n", ")" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "solution" ] }, "source": [ "In addition, the parallel coordinate plot from `plotly` expects all data to be\n", "numeric. Thus, we convert the boolean indicator informing whether or not the\n", "data were centered or scaled into an integer, where True is mapped to 1 and\n", "False is mapped to 0. As `n_neighbors` has `dtype=object`, we also convert it\n", "explicitly to an integer." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "solution" ] }, "outputs": [], "source": [ "column_scaler = [\"centering\", \"scaling\"]\n", "cv_results[column_scaler] = cv_results[column_scaler].astype(np.int64)\n", "cv_results[\"n_neighbors\"] = cv_results[\"n_neighbors\"].astype(np.int64)\n", "cv_results" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "solution" ] }, "outputs": [], "source": [ "import plotly.express as px\n", "\n", "fig = px.parallel_coordinates(\n", " cv_results,\n", " color=\"mean test score\",\n", " dimensions=[\"n_neighbors\", \"centering\", \"scaling\", \"mean test score\"],\n", " color_continuous_scale=px.colors.diverging.Tealrose,\n", ")\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "solution" ] }, "source": [ "We recall that it is possible to select a range of results by clicking and\n", "holding on any axis of the parallel coordinate plot. You can then slide (move)\n", "the range selection and cross two selections to see the intersections.\n", "\n", "Selecting the best performing models (i.e. above an accuracy of ~0.68), we\n", "observe that **in this case**:\n", "\n", "- scaling the data is important. All the best performing models use scaled\n", " features;\n", "- centering the data does not have a strong impact. Both approaches, centering\n", " and not centering, can lead to good models;\n", "- using some neighbors is fine but using too many is a problem. In particular\n", " no pipeline with `n_neighbors=1` can be found among the best models.\n", " However, scaling features has an even stronger impact than the choice of\n", " `n_neighbors` in this problem.\n", "\n", "The reason is that fitting scaled data leads to a completely different\n", "KNeighbors model: if you have two variables A and B where A has values which\n", "vary between 0 and 10,000 (e.g. the variable `\"Population\"`) and B is a\n", "feature that varies between 1 and 10 (e.g. the variable `\"AveRooms\"`), then\n", "distances between samples (rows of the dataframe) are mostly impacted by\n", "differences in values of the column A, while values of the column B are\n", "comparatively ignored. If one applies StandardScaler to such a database, both\n", "the values of A and B will be approximately between -3 and 3 and the neighbor\n", "structure will be impacted more or less equivalently by both variables.\n", "\n", "Note that **in this case** the models with scaled features perform better than\n", "the models with non-scaled features because all the variables are expected to\n", "be predictive and we rather avoid some of them being comparatively ignored.\n", "\n", "If the variables in lower scales were not predictive one may experience a\n", "decrease of the performance after scaling the features: noisy features would\n", "contribute more to the prediction after scaling and therefore scaling would\n", "increase overfitting." ] } ], "metadata": { "jupytext": { "main_language": "python" }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 5 }