{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# \ud83d\udcc3 Solution for Exercise M6.03\n", "\n", "The aim of this exercise is to:\n", "\n", "* verifying if a random forest or a gradient-boosting decision tree overfit if\n", " the number of estimators is not properly chosen;\n", "* use the early-stopping strategy to avoid adding unnecessary trees, to get\n", " the best generalization performances.\n", "\n", "We use the California housing dataset to conduct our experiments." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import fetch_california_housing\n", "from sklearn.model_selection import train_test_split\n", "\n", "data, target = fetch_california_housing(return_X_y=True, as_frame=True)\n", "target *= 100 # rescale the target in k$\n", "data_train, data_test, target_train, target_test = train_test_split(\n", " data, target, random_state=0, test_size=0.5\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Note

\n", "

If you want a deeper overview regarding this dataset, you can refer to the\n", "Appendix - Datasets description section at the end of this MOOC.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a gradient boosting decision tree with `max_depth=5` and\n", "`learning_rate=0.5`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# solution\n", "from sklearn.ensemble import GradientBoostingRegressor\n", "\n", "gbdt = GradientBoostingRegressor(max_depth=5, learning_rate=0.5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Also create a random forest with fully grown trees by setting `max_depth=None`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# solution\n", "from sklearn.ensemble import RandomForestRegressor\n", "\n", "forest = RandomForestRegressor(max_depth=None)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "For both the gradient-boosting and random forest models, create a validation\n", "curve using the training set to assess the impact of the number of trees on\n", "the performance of each model. Evaluate the list of parameters `param_range =\n", "np.array([1, 2, 5, 10, 20, 50, 100, 200])` and score it using\n", "`neg_mean_absolute_error`. Remember to set `negate_score=True` to recover the\n", "right sign of the Mean Absolute Error." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# solution\n", "import numpy as np\n", "\n", "from sklearn.model_selection import ValidationCurveDisplay\n", "\n", "param_range = np.array([1, 2, 5, 10, 20, 50, 100, 200])\n", "disp = ValidationCurveDisplay.from_estimator(\n", " forest,\n", " data_train,\n", " target_train,\n", " param_name=\"n_estimators\",\n", " param_range=param_range,\n", " scoring=\"neg_mean_absolute_error\",\n", " negate_score=True,\n", " std_display_style=\"errorbar\",\n", " n_jobs=2,\n", ")\n", "\n", "_ = disp.ax_.set(\n", " xlabel=\"Number of trees in the forest\",\n", " ylabel=\"Mean absolute error (k$)\",\n", " title=\"Validation curve for random forest\",\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Random forest models improve when increasing the number of trees in the\n", "ensemble. However, the scores reach a plateau where adding new trees just\n", "makes fitting and scoring slower.\n", "\n", "Now repeat the analysis for the gradient boosting model." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "# solution\n", "disp = ValidationCurveDisplay.from_estimator(\n", " gbdt,\n", " data_train,\n", " target_train,\n", " param_name=\"n_estimators\",\n", " param_range=param_range,\n", " scoring=\"neg_mean_absolute_error\",\n", " negate_score=True,\n", " std_display_style=\"errorbar\",\n", " n_jobs=2,\n", ")\n", "\n", "_ = disp.ax_.set(\n", " xlabel=\"Number of trees in the gradient boosting model\",\n", " ylabel=\"Mean absolute error (k$)\",\n", " title=\"Validation curve for gradient boosting model\",\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Gradient boosting models overfit when the number of trees is too large. To\n", "avoid adding a new unnecessary tree, unlike random-forest gradient-boosting\n", "offers an early-stopping option. Internally, the algorithm uses an\n", "out-of-sample set to compute the generalization performance of the model at\n", "each addition of a tree. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "Gradient boosting models overfit when the number of trees is too large. Unlike\n", "random forests, gradient boosting offers an early-stopping option to avoid\n", "adding unnecessary trees. Internally, the algorithm uses an out-of-sample set\n", "to compute the generalization performance of the model each time a tree is\n", "added. Thus, if the generalization performance does not improve for several\n", "iterations, it stops adding trees.\n", "\n", "Now, create a gradient-boosting model with `n_estimators=1_000`. This number\n", "of trees is certainly too large, as we have seen above. Change the parameter\n", "`n_iter_no_change` such that the gradient boosting fitting stops after adding\n", "5 trees that do not improve the overall generalization performance." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# solution\n", "# With n_iter_no_change=5, an internal validation set (10% of the training\n", "# data by default) is used to stop boosting once 5 consecutive iterations\n", "# fail to improve the validation score.\n", "gbdt = GradientBoostingRegressor(n_estimators=1_000, n_iter_no_change=5)\n", "gbdt.fit(data_train, target_train)\n", "gbdt.n_estimators_" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "solution" ] }, "source": [ "We see that the number of trees used is far below 1000 with the current\n", "dataset. Training the gradient boosting model with all 1000 trees would have\n", "been detrimental.\n", "\n", "Please note that one should not tune the number of estimators as a\n", "hyperparameter for either random forest or gradient boosting models. In this\n", "exercise we only show model performance with varying `n_estimators` for\n", "educational purposes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Estimate the generalization performance of this model again using the\n", "`sklearn.metrics.mean_absolute_error` metric but this time using the test set\n", "that we held out at the beginning of the notebook. Compare the resulting value\n", "with the values observed in the validation curve." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# solution\n", "from sklearn.metrics import mean_absolute_error\n", "\n", "error = mean_absolute_error(target_test, gbdt.predict(data_test))\n", "print(f\"On average, our GBDT regressor makes an error of {error:.2f} k$\")" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "solution" ] }, "source": [ "We observe that the MAE measured on the held-out test set is close to the\n", "validation error at the right-hand side of the validation curve. This is\n", "reassuring, as it means that both the cross-validation procedure and the outer\n", "train-test split roughly agree as approximations of the true generalization\n", "performance of the model. We can observe that the final evaluation of the test\n", "error seems to be even slightly lower than the cross-validated test scores.\n", "This can be explained by the fact that the final model has been trained on the\n", "full training set while the cross-validation models have been trained on\n", "smaller subsets: in general, the larger the number of training points, the\n", "lower the test error." ] } ], "metadata": { "jupytext": { "main_language": "python" }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 5 }