{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# \ud83d\udcc3 Solution for Exercise M6.01\n", "\n", "The aim of this notebook is to investigate if we can tune the hyperparameters\n", "of a bagging regressor and evaluate the gain obtained.\n", "\n", "We will load the California housing dataset and split it into a training and a\n", "testing set." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import fetch_california_housing\n", "from sklearn.model_selection import train_test_split\n", "\n", "data, target = fetch_california_housing(as_frame=True, return_X_y=True)\n", "target *= 100 # rescale the target in k$\n", "data_train, data_test, target_train, target_test = train_test_split(\n", " data, target, random_state=0, test_size=0.5\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
<div class=\"admonition note alert alert-info\">\n",
"<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
"<p class=\"last\">If you want a deeper overview regarding this dataset, you can refer to the\n",
"Appendix - Datasets description section at the end of this MOOC.</p>\n",
"</div>
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a `BaggingRegressor` and provide a `DecisionTreeRegressor` to its\n", "parameter `estimator`. Train the regressor and evaluate its generalization\n", "performance on the testing set using the mean absolute error." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# solution\n", "from sklearn.metrics import mean_absolute_error\n", "from sklearn.tree import DecisionTreeRegressor\n", "from sklearn.ensemble import BaggingRegressor\n", "\n", "tree = DecisionTreeRegressor()\n", "bagging = BaggingRegressor(estimator=tree, n_jobs=2)\n", "bagging.fit(data_train, target_train)\n", "target_predicted = bagging.predict(data_test)\n", "print(\n", " \"Basic mean absolute error of the bagging regressor:\\n\"\n", " f\"{mean_absolute_error(target_test, target_predicted):.2f} k$\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, create a `RandomizedSearchCV` instance using the previous model and tune\n", "the important parameters of the bagging regressor. Find the best parameters\n", "and check if you are able to find a set of parameters that improve the default\n", "regressor still using the mean absolute error as a metric.\n", "\n", "
<div class=\"admonition tip alert alert-warning\">\n",
"<p class=\"first admonition-title\" style=\"font-weight: bold;\">Tip</p>\n",
"<p class=\"last\">You can list the bagging regressor's parameters using the <tt class=\"docutils literal\">get_params</tt> method.</p>\n",
"</div>
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# solution\n", "for param in bagging.get_params().keys():\n", " print(param)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "solution" ] }, "outputs": [], "source": [ "from scipy.stats import randint\n", "from sklearn.model_selection import RandomizedSearchCV\n", "\n", "param_grid = {\n", " \"n_estimators\": randint(10, 30),\n", " \"max_samples\": [0.5, 0.8, 1.0],\n", " \"max_features\": [0.5, 0.8, 1.0],\n", " \"estimator__max_depth\": randint(3, 10),\n", "}\n", "search = RandomizedSearchCV(\n", " bagging, param_grid, n_iter=20, scoring=\"neg_mean_absolute_error\"\n", ")\n", "_ = search.fit(data_train, target_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "solution" ] }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "columns = [f\"param_{name}\" for name in param_grid.keys()]\n", "columns += [\"mean_test_error\", \"std_test_error\"]\n", "cv_results = pd.DataFrame(search.cv_results_)\n", "cv_results[\"mean_test_error\"] = -cv_results[\"mean_test_score\"]\n", "cv_results[\"std_test_error\"] = cv_results[\"std_test_score\"]\n", "cv_results[columns].sort_values(by=\"mean_test_error\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "solution" ] }, "outputs": [], "source": [ "target_predicted = search.predict(data_test)\n", "print(\n", " \"Mean absolute error after tuning of the bagging regressor:\\n\"\n", " f\"{mean_absolute_error(target_test, target_predicted):.2f} k$\"\n", ")" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "solution" ] }, "source": [ "We see that the predictor provided by the bagging regressor does not need much\n", "hyperparameter tuning compared to a single decision tree." 
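, "\n",
"\n",
"If you are curious which configuration the search selected, the fitted\n",
"`RandomizedSearchCV` exposes it through its `best_params_` attribute, and the\n",
"corresponding cross-validated score through `best_score_`. Below is a minimal\n",
"self-contained sketch of this inspection on synthetic data (a toy setup for\n",
"illustration, not the California housing data; `search_demo` and the parameter\n",
"range are illustrative choices):\n",
"\n",
"```python\n",
"from scipy.stats import randint\n",
"from sklearn.datasets import make_regression\n",
"from sklearn.ensemble import BaggingRegressor\n",
"from sklearn.model_selection import RandomizedSearchCV\n",
"from sklearn.tree import DecisionTreeRegressor\n",
"\n",
"# Toy regression data and a deliberately small search so the sketch runs fast.\n",
"X, y = make_regression(n_samples=200, n_features=5, random_state=0)\n",
"search_demo = RandomizedSearchCV(\n",
"    BaggingRegressor(estimator=DecisionTreeRegressor(), random_state=0),\n",
"    param_distributions=dict(n_estimators=randint(5, 15)),\n",
"    n_iter=3,\n",
"    cv=3,\n",
"    scoring='neg_mean_absolute_error',\n",
"    random_state=0,\n",
").fit(X, y)\n",
"\n",
"# The selected configuration and its cross-validated (negated) MAE.\n",
"print(search_demo.best_params_)\n",
"print(search_demo.best_score_)\n",
"```"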
] } ], "metadata": { "jupytext": { "main_language": "python" }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 5 }