{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# \ud83d\udcc3 Solution for Exercise M6.03\n", "\n", "The aim of this exercise is to:\n", "\n", "* verifying if a random forest or a gradient-boosting decision tree overfit if\n", " the number of estimators is not properly chosen;\n", "* use the early-stopping strategy to avoid adding unnecessary trees, to get\n", " the best generalization performances.\n", "\n", "We use the California housing dataset to conduct our experiments." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import fetch_california_housing\n", "from sklearn.model_selection import train_test_split\n", "\n", "data, target = fetch_california_housing(return_X_y=True, as_frame=True)\n", "target *= 100 # rescale the target in k$\n", "data_train, data_test, target_train, target_test = train_test_split(\n", " data, target, random_state=0, test_size=0.5\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Note

\n", "

If you want a deeper overview regarding this dataset, you can refer to the\n", "Appendix - Datasets description section at the end of this MOOC.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a gradient boosting decision tree with `max_depth=5` and\n", "`learning_rate=0.5`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# solution\n", "from sklearn.ensemble import GradientBoostingRegressor\n", "\n", "gbdt = GradientBoostingRegressor(max_depth=5, learning_rate=0.5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Also create a random forest with fully grown trees by setting `max_depth=None`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# solution\n", "from sklearn.ensemble import RandomForestRegressor\n", "\n", "forest = RandomForestRegressor(max_depth=None)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "For both the gradient-boosting and random forest models, create a validation\n", "curve using the training set to assess the impact of the number of trees on\n", "the performance of each model. Evaluate the list of parameters `param_range =\n", "np.array([1, 2, 5, 10, 20, 50, 100, 200])` and score it using\n", "`neg_mean_absolute_error`. Remember to set `negate_score=True` to recover the\n", "right sign of the Mean Absolute Error." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# solution\n", "import numpy as np\n", "\n", "from sklearn.model_selection import ValidationCurveDisplay\n", "\n", "param_range = np.array([1, 2, 5, 10, 20, 50, 100, 200])\n", "disp = ValidationCurveDisplay.from_estimator(\n", " forest,\n", " data_train,\n", " target_train,\n", " param_name=\"n_estimators\",\n", " param_range=param_range,\n", " scoring=\"neg_mean_absolute_error\",\n", " negate_score=True,\n", " std_display_style=\"errorbar\",\n", " n_jobs=2,\n", ")\n", "\n", "_ = disp.ax_.set(\n", " xlabel=\"Number of trees in the forest\",\n", " ylabel=\"Mean absolute error (k$)\",\n", " title=\"Validation curve for random forest\",\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Random forest models improve when increasing the number of trees in the\n", "ensemble. However, the scores reach a plateau where adding new trees just\n", "makes fitting and scoring slower.\n", "\n", "Now repeat the analysis for the gradient boosting model." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "# solution\n", "disp = ValidationCurveDisplay.from_estimator(\n", " gbdt,\n", " data_train,\n", " target_train,\n", " param_name=\"n_estimators\",\n", " param_range=param_range,\n", " scoring=\"neg_mean_absolute_error\",\n", " negate_score=True,\n", " std_display_style=\"errorbar\",\n", " n_jobs=2,\n", ")\n", "\n", "_ = disp.ax_.set(\n", " xlabel=\"Number of trees in the gradient boosting model\",\n", " ylabel=\"Mean absolute error (k$)\",\n", " title=\"Validation curve for gradient boosting model\",\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Gradient boosting models overfit when the number of trees is too large. To\n", "avoid adding a new unnecessary tree, unlike random-forest gradient-boosting\n", "offers an early-stopping option. Internally, the algorithm uses an\n", "out-of-sample set to compute the generalization performance of the model at\n", "each addition of a tree. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "Gradient boosting models overfit when the number of trees is too large. Unlike\n", "random forests, gradient boosting offers an early-stopping option to avoid\n", "adding unnecessary trees. Internally, the algorithm uses an out-of-sample set\n", "to compute the generalization performance of the model each time a tree is\n", "added. Thus, if the generalization performance does not improve for several\n", "iterations, it stops adding trees.\n", "\n", "Now, create a gradient-boosting model with `n_estimators=1_000`. This number\n", "of trees is certainly too large, as we have seen above. Change the parameter\n", "`n_iter_no_change` such that the gradient boosting fitting stops after adding\n", "5 trees that do not improve the overall generalization performance." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# solution\n", "# With n_iter_no_change=5, an internal validation set (10% of the training\n", "# data by default) is used to stop boosting once 5 consecutive iterations\n", "# fail to improve the validation score.\n", "gbdt = GradientBoostingRegressor(n_estimators=1_000, n_iter_no_change=5)\n", "gbdt.fit(data_train, target_train)\n", "gbdt.n_estimators_" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "solution" ] }, "source": [ "We see that the number of trees used is far below 1000 with the current\n", "dataset. Training the gradient boosting model with all 1000 trees would have\n", "been detrimental.\n", "\n", "Please note that one should not tune the number of estimators as a\n", "hyperparameter for either random forest or gradient boosting models. In this\n", "exercise we only show model performance with varying `n_estimators` for\n", "educational purposes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Estimate the generalization performance of this model again using the\n", "`sklearn.metrics.mean_absolute_error` metric but this time using the test set\n", "that we held out at the beginning of the notebook. Compare the resulting value\n", "with the values observed in the validation curve." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# solution\n", "from sklearn.metrics import mean_absolute_error\n", "\n", "error = mean_absolute_error(target_test, gbdt.predict(data_test))\n", "print(f\"On average, our GBDT regressor makes an error of {error:.2f} k$\")" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "solution" ] }, "source": [ "We observe that the MAE measured on the held-out test set is close to the\n", "validation error at the right-hand side of the validation curve. This is\n", "reassuring, as it means that both the cross-validation procedure and the outer\n", "train-test split roughly agree as approximations of the true generalization\n", "performance of the model. We can observe that the final evaluation of the test\n", "error seems to be even slightly lower than the cross-validated test scores.\n", "This can be explained by the fact that the final model has been trained on the\n", "full training set while the cross-validation models have been trained on\n", "smaller subsets: in general, the larger the number of training points, the\n", "lower the test error." ] } ], "metadata": { "jupytext": { "main_language": "python" }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 5 }