{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# \ud83d\udcc3 Solution for Exercise M6.03\n", "\n", "The aim of this exercise is to:\n", "\n", "* verify whether a random forest or a gradient-boosting decision tree overfits when\n", " the number of estimators is not properly chosen;\n", "* use the early-stopping strategy to avoid adding unnecessary trees, to get\n", " the best generalization performance.\n", "\n", "We use the California housing dataset to conduct our experiments." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import fetch_california_housing\n", "from sklearn.model_selection import train_test_split\n", "\n", "data, target = fetch_california_housing(return_X_y=True, as_frame=True)\n", "target *= 100 # rescale the target in k$\n", "data_train, data_test, target_train, target_test = train_test_split(\n", " data, target, random_state=0, test_size=0.5\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Note
\n", "For a more detailed overview of this dataset, you can refer to the\n", "Appendix - Datasets description section at the end of this MOOC.
\n", "