{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 2A.ml - Arbres, forêts aléatoires et extensions - corrigé- Gabriel ROMON\n", "\n", "Cet énoncé est une version modifiée de celui de Xavier Dupré." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import urllib.request\n", "import pickle\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "import numpy as np\n", "import xgboost as xgb" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Données" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Le package `pickle` permet de sauvegarder n'importe quel objet Python (liste, numpy array, ...) en un fichier afin que cet objet puisse être sauvegardé ou partagé. C'est très pratique lorsque l'objet en question est le fruit d'un calcul long (poids d'un réseau de neurones, hyperparamètres d'un modèle, ...).\n", "\n", "On télécharge le fichier `.pickle` depuis mon Github et on l'ouvre. C'est une liste de 3 éléments `[X, y, X_private]`. Le but est d'entraîner un modèle sur `X, y` et de prédire `y_private` (qui est une variable continue).\n", "\n", "Vous m'enverrez un fichier `.pickle` contenant vos prédictions et je vais évaluer le MSE correspondant: $$\\text{MSE}=\\frac 1n\\sum_{i=1}^n (y_i-\\hat{y}_i)^2$$\n", "\n", "C'est un mini-Kaggle !" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "url = 'https://github.com/gabsens/Python-for-Data-Scientists-ENSAE/raw/master/TD2/data.pickle'\n", "urllib.request.urlretrieve(url, './data.pickle')\n", "\n", "with open(\"data.pickle\", \"rb\") as f:\n", " X, y, X_private = pickle.load(f)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(15480, 8)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X.shape" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(5160, 8)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_private.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercice 1: Mini-Kaggle, tuning de modèles" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Afin de juger de la qualité des modèles entrainés, il faut couper le jeu de données en deux." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=95)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(11610, 8)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train.shape" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(3870, 8)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_test.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## a) Entraîner une régression linéaire" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On commence toujours par le modèle le plus simple: la régression linéaire. C'est un modèle peu complexe, mais au moins il ne risque pas d'overfitter." 
] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MSE: \n", "train: 0.515074118251781 \n", "test: 0.5375695523360391\n" ] } ], "source": [ "from sklearn.metrics import mean_squared_error\n", "from sklearn.linear_model import LinearRegression\n", "\n", "lin = LinearRegression().fit(X_train, y_train)\n", "\n", "print('MSE:', '\\ntrain:', mean_squared_error(y_train, lin.predict(X_train)),\n", " '\\ntest:', mean_squared_error(y_test, lin.predict(X_test))\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Vous devez obtenir à peu près le même MSE sur le train et le test. Ces scores constituent la baseline." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## b) Entraîner un arbre de décision" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ne touchez à aucun hyperparamètre dans un premier temps." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MSE: \n", "train: 5.653651066659692e-14 \n", "test: 0.5390846928573527\n" ] } ], "source": [ "from sklearn.tree import DecisionTreeRegressor\n", "\n", "tree = DecisionTreeRegressor(random_state=42)\n", "tree.fit(X_train, y_train)\n", "\n", "print('MSE:', '\\ntrain:', mean_squared_error(y_train, tree.predict(X_train)),\n", " '\\ntest:', mean_squared_error(y_test, tree.predict(X_test))\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Le MSE sur le train est quasi 0, alors que le MSE sur le test est proche de celui de la régression linéaire. Le modèle overfit terriblement ! \n", "\n", "Il va falloir jouer sur les hyperparamètres. Pour cela, il faut comprendre comment fonctionnent les arbres de décision...\n", "***\n", "\n", "Les arbres de décision de Scikit-Learn sont des arbres binaires (chaque noeud a exactement 2 fils). Ils sont construits récursivement en partant de la racine. Regardez l'arbre ci-dessous: étant donné un nouvel example $(x_1,\\ldots,x_8)$, on regarde d'abord si $x_1\\leq 5.148$. Si c'est le cas, on regarde ensuite si $x_1\\leq 3.071$. Si oui, le $\\hat y$ prédit est $1.352$. Sinon, on regarde si $x_6\\leq 2.344$. Si c'est le cas, on prédit $2.841$, et sinon on prédit $1.895$.\n", "\n", "A chaque noeud on peut associer les exemples de la base d'entrainement (les lignes de `X_train`) qui vérifient les conditions menant à ce noeud. Le nombre de telles lignes est donné par `samples=...`, la valeur $\\hat y$ que l'arbre leur assigne est donnée par `value=...`, et le MSE correspondant à cette prédiction est donné par `mse=...`. \n", "Par exemple, dans l'arbre ci-dessous, considérons la condition $$(x_1\\leq 5.148) \\text{ and } (x_1> 3.071) \\text{ and } (x_6>2.344)$$\n", "$3861$ lignes la vérifient et la prédiction associée à ces lignes est $1.895$.\n", "\n", "Décrivons succinctement l'algorithme de croissance des arbres. Un noeud de l'arbre est entièrement caractérisé par la variable $x_k$ et le seuil $t_k$ à partir desquels la condition $x_k\\leq t_k$ est créée. 
{ "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "from graphviz import Source\n", "from sklearn.tree import export_graphviz\n", "import os\n", "# make the Graphviz binaries visible (Windows-specific path, adjust to your install)\n", "os.environ[\"PATH\"] += os.pathsep + 'C:/Temp/release/bin'" ] },
{ "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<Graphviz rendering of the fitted tree (depth 3, 6 leaves): root split x1 <= 5.148; left subtree splits on x1 <= 3.071, then x6 <= 2.344; right subtree splits on x1 <= 6.877, then x6 <= 2.743>" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tree = DecisionTreeRegressor(max_depth=3, max_leaf_nodes=6)\n", "tree.fit(X_train, y_train)\n", "\n", "export_graphviz(\n", "        tree,\n", "        out_file=\"./regression_tree.dot\",\n", "        feature_names=[\"x\"+str(i) for i in range(1,9)],\n", "        rounded=True,\n", "        filled=True\n", "    )\n", "Source.from_file(\"./regression_tree.dot\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "\n", "## Key question: how do we choose the hyperparameters?" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We cannot simply try out hyperparameter values and check the performance on `X_test`. If we do that, we end up overfitting `X_test`!\n", "\n", "Hyperparameters must be chosen by $k$-fold cross-validation." ] },
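{ "cell_type": "markdown", "metadata": {}, "source": [ "Here is a minimal sketch of 10-fold cross-validation for a single hyperparameter setting (again assuming numpy arrays; `max_depth=5` is an arbitrary example). This is exactly the loop that `GridSearchCV` automates for every combination in a grid:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import KFold\n", "\n", "# Manual 10-fold cross-validation of one hyperparameter setting.\n", "kf = KFold(n_splits=10, shuffle=True, random_state=0)\n", "fold_mses = []\n", "for train_idx, val_idx in kf.split(X_train):\n", "    model = DecisionTreeRegressor(max_depth=5)\n", "    model.fit(X_train[train_idx], y_train[train_idx])\n", "    fold_mses.append(mean_squared_error(y_train[val_idx], model.predict(X_train[val_idx])))\n", "\n", "np.mean(fold_mses)  # cross-validated estimate of the test MSE for max_depth=5" ] },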
"5->10\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "\r\n" ], "text/plain": [ "" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tree = DecisionTreeRegressor(max_depth=3, max_leaf_nodes=6)\n", "tree.fit(X_train, y_train)\n", "\n", "export_graphviz(\n", " tree,\n", " out_file=\"./regression_tree.dot\",\n", " feature_names=[\"x\"+str(i) for i in range(1,9)],\n", " rounded=True,\n", " filled=True\n", " )\n", "Source.from_file(\"./regression_tree.dot\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "\n", "## Question importante: comment choisir les hyperparamètres ?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On ne peut pas simplement essayer différentes hyperparamètres et consulter la performance sur `X_test`. Si on fait ça, on va overfit le `X_test` !\n", "\n", "Il faut choisir les hyperparamètres par validation croisée ($k$-fold cross-validation)." ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fitting 10 folds for each of 576 candidates, totalling 5760 fits\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.\n", "[Parallel(n_jobs=-1)]: Done 552 tasks | elapsed: 2.3s\n", "[Parallel(n_jobs=-1)]: Done 2952 tasks | elapsed: 22.1s\n", "[Parallel(n_jobs=-1)]: Done 5704 tasks | elapsed: 1.1min\n", "[Parallel(n_jobs=-1)]: Done 5760 out of 5760 | elapsed: 1.1min finished\n" ] }, { "data": { "text/plain": [ "GridSearchCV(cv=10, error_score='raise-deprecating',\n", " estimator=DecisionTreeRegressor(criterion='mse', max_depth=None,\n", " max_features=None,\n", " max_leaf_nodes=None,\n", " min_impurity_decrease=0.0,\n", " min_impurity_split=None,\n", " min_samples_leaf=1,\n", " min_samples_split=2,\n", " min_weight_fraction_leaf=0.0,\n", " presort=False, random_state=None,\n", " splitter='best'),\n", " iid='warn', n_jobs=-1,\n", " param_grid={'max_depth': [2, 3, 4, 5, 7, 10, 15, 20],\n", " 'min_samples_leaf': range(1, 10),\n", " 'min_samples_split': range(2, 10)},\n", " pre_dispatch='2*n_jobs', refit=True, return_train_score=False,\n", " scoring='neg_mean_squared_error', verbose=1)" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.model_selection import GridSearchCV\n", "\n", "parameters = {'max_depth':[2,3,4,5,7,10,15,20], 'min_samples_leaf':range(1,10),\n", " 'min_samples_split':range(2,10)}\n", "tree = DecisionTreeRegressor() \n", "tree_search = GridSearchCV(tree, parameters, scoring='neg_mean_squared_error', cv=10, n_jobs=-1, verbose=1)\n", "tree_search.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "ci-dessous les perfs du meilleur modèle (sur chacun des 10 folds):" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[0.3967812850061739,\n", " 0.41345941139209036,\n", " 0.400720122410235,\n", " 0.47374972605831284,\n", " 0.38450011085565255,\n", " 0.4202797996579079,\n", " 0.4043485602541037,\n", " 0.422875591800359,\n", " 0.4075115015096375,\n", " 0.3906582167179846]" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[-v[tree_search.best_index_] for (k,v) in tree_search.cv_results_.items() if k.startswith('split')]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "La moyenne de ces MSE est directement accessible:" ] }, { "cell_type": 
"code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.41148843256624573" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "-tree_search.best_score_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Les meilleurs hyperparamètres:" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'max_depth': 10, 'min_samples_leaf': 8, 'min_samples_split': 2}" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tree_search.best_params_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "La fonction `GridSearchCV` prend les meilleurs hyperparamètres et les utilise pour entrainer le modèle sur tout `X_train`. On accède au modèle final avec `tree_search.best_estimator_`:" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MSE: \n", "train: 0.2581126516528247 \n", "test: 0.41364969822443814\n" ] } ], "source": [ "tree = tree_search.best_estimator_\n", "print('MSE:', '\\ntrain:', mean_squared_error(y_train, tree.predict(X_train)),\n", " '\\ntest:', mean_squared_error(y_test, tree.predict(X_test))\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Le modèle overfit certes, mais le MSE sur le test est très proche du MSE calculé par cross-validation. Donc on arrive bien à prévoir la capacité du modèle à généraliser." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## c) Entraîner un random forest" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ne touchez à aucun hyperparamètre dans un premier temps." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "L:\\WinPython\\python-3.6.6.amd64\\lib\\site-packages\\sklearn\\ensemble\\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.\n", " \"10 in version 0.20 to 100 in 0.22.\", FutureWarning)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "MSE: \n", "train: 0.05300230936779286 \n", "test: 0.2930152237163136\n" ] } ], "source": [ "from sklearn.ensemble import RandomForestRegressor\n", "\n", "rf = RandomForestRegressor()\n", "rf.fit(X_train, y_train)\n", "\n", "print('MSE:', '\\ntrain:', mean_squared_error(y_train, rf.predict(X_train)),\n", " '\\ntest:', mean_squared_error(y_test, rf.predict(X_test))\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Comme attendu, le modèle overfit beaucoup.\n", "\n", "Grossièrement, un random forest est une moyenne d'un grand nombre d'arbres de décision. Les hyperparamètres sont donc très similaires aux précédents (il faut en plus choisir le nombre d'arbres `n_estimators`)." 
] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fitting 10 folds for each of 210 candidates, totalling 2100 fits\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.\n", "[Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 3.8s\n", "[Parallel(n_jobs=-1)]: Done 192 tasks | elapsed: 16.0s\n", "[Parallel(n_jobs=-1)]: Done 442 tasks | elapsed: 41.0s\n", "[Parallel(n_jobs=-1)]: Done 792 tasks | elapsed: 1.5min\n", "[Parallel(n_jobs=-1)]: Done 1242 tasks | elapsed: 2.8min\n", "[Parallel(n_jobs=-1)]: Done 1792 tasks | elapsed: 4.8min\n", "[Parallel(n_jobs=-1)]: Done 2100 out of 2100 | elapsed: 6.1min finished\n" ] }, { "data": { "text/plain": [ "GridSearchCV(cv=10, error_score='raise-deprecating',\n", " estimator=RandomForestRegressor(bootstrap=True, criterion='mse',\n", " max_depth=None,\n", " max_features='auto',\n", " max_leaf_nodes=None,\n", " min_impurity_decrease=0.0,\n", " min_impurity_split=None,\n", " min_samples_leaf=1,\n", " min_samples_split=2,\n", " min_weight_fraction_leaf=0.0,\n", " n_estimators='warn', n_jobs=None,\n", " oob_score=False, random_state=None,\n", " verbose=0, warm_start=False),\n", " iid='warn', n_jobs=-1,\n", " param_grid={'max_depth': [2, 3, 4, 5, 6, 7, 8],\n", " 'max_features': ['auto'],\n", " 'min_samples_leaf': [1, 2, 3, 4, 5, 6],\n", " 'n_estimators': [10, 20, 30, 40, 50]},\n", " pre_dispatch='2*n_jobs', refit=True, return_train_score=False,\n", " scoring='neg_mean_squared_error', verbose=1)" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.ensemble import RandomForestRegressor\n", "\n", "parameters = {'max_depth':[2,3,4,5,6,7,8], 'min_samples_leaf':[1,2,3,4,5,6],\n", " 'max_features':[\"auto\"], 'n_estimators':[10,20,30,40,50]}\n", "rf = RandomForestRegressor()\n", "rf_search = GridSearchCV(rf, parameters, scoring='neg_mean_squared_error', cv=10, n_jobs=-1, verbose=1)\n", "rf_search.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[0.31412747224478743,\n", " 0.33240025878389157,\n", " 0.3205169171798233,\n", " 0.35747842327414614,\n", " 0.31595881523786473,\n", " 0.3494175731191536,\n", " 0.31592927905735,\n", " 0.3746973167349527,\n", " 0.30656018536020563,\n", " 0.31199181796390957]" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[-v[rf_search.best_index_] for (k,v) in rf_search.cv_results_.items() if k.startswith('split')]" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.3299078058956084" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "-rf_search.best_score_" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'max_depth': 8,\n", " 'max_features': 'auto',\n", " 'min_samples_leaf': 2,\n", " 'n_estimators': 50}" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rf_search.best_params_" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MSE: train: 0.24940160680121298 test: 0.33605346214901327\n" ] } ], "source": [ "rf = rf_search.best_estimator_\n", "print('MSE:', 'train:', mean_squared_error(y_train, 
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 2nd attempt: refining what we found\n", "\n", "The best values above (`max_depth=8`, `n_estimators=50`) sit on the edge of the previous grid, so we shift the grid upwards." ] },
{ "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fitting 10 folds for each of 60 candidates, totalling 600 fits\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.\n", "[Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 18.2s\n", "[Parallel(n_jobs=-1)]: Done 192 tasks | elapsed: 1.5min\n", "[Parallel(n_jobs=-1)]: Done 442 tasks | elapsed: 3.8min\n", "[Parallel(n_jobs=-1)]: Done 600 out of 600 | elapsed: 5.4min finished\n" ] }, { "data": { "text/plain": [ "GridSearchCV(cv=10, error_score='raise-deprecating',\n", " estimator=RandomForestRegressor(bootstrap=True, criterion='mse',\n", " max_depth=None,\n", " max_features='auto',\n", " max_leaf_nodes=None,\n", " min_impurity_decrease=0.0,\n", " min_impurity_split=None,\n", " min_samples_leaf=1,\n", " min_samples_split=2,\n", " min_weight_fraction_leaf=0.0,\n", " n_estimators='warn', n_jobs=None,\n", " oob_score=False, random_state=None,\n", " verbose=0, warm_start=False),\n", " iid='warn', n_jobs=-1,\n", " param_grid={'max_depth': [7, 8, 9, 10, 11],\n", " 'min_samples_leaf': [1, 2, 3],\n", " 'n_estimators': [40, 50, 60, 70]},\n", " pre_dispatch='2*n_jobs', refit=True, return_train_score=False,\n", " scoring='neg_mean_squared_error', verbose=1)" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "parameters = {'max_depth':[7,8,9,10,11], 'min_samples_leaf':[1,2,3],\n", "              'n_estimators':[40,50,60,70]}\n", "rf = RandomForestRegressor()\n", "rf_search = GridSearchCV(rf, parameters, scoring='neg_mean_squared_error', cv=10, n_jobs=-1, verbose=1)\n", "rf_search.fit(X_train, y_train)" ] },
{ "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[0.28197782081362294,\n", " 0.2916967889629732,\n", " 0.27348200022477076,\n", " 0.3094498074419222,\n", " 0.27303765168208033,\n", " 0.2978301724111948,\n", " 0.272525493710379,\n", " 0.3253057128965016,\n", " 0.2531161937171129,\n", " 0.2663949656702875]" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[-v[rf_search.best_index_] for (k,v) in rf_search.cv_results_.items() if k.startswith('split')]" ] },
{ "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.28448166075308456" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "-rf_search.best_score_" ] },
{ "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'max_depth': 11, 'min_samples_leaf': 3, 'n_estimators': 70}" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rf_search.best_params_" ] },
{ "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MSE: train: 0.14317293137433595 test: 0.28511666703503513\n" ] } ], "source": [ "rf = rf_search.best_estimator_\n", "print('MSE:', 'train:', mean_squared_error(y_train, rf.predict(X_train)),\n", "      'test:', mean_squared_error(y_test, rf.predict(X_test))\n", "     )" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The cross-validated MSE on the train set and the MSE on the test set are both close to $0.285$, which is good!\n", "\n", "I am satisfied with the model's performance, so this is the right moment to see how it really fares on new data (i.e. on `X_private`). Remember that you do not have access to `y_private`, but I do." ] },
{ "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [], "source": [ "y_pred = rf.predict(X_private)" ] },
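{ "cell_type": "markdown", "metadata": {}, "source": [ "This is the point where you would save your predictions and send them to me; a minimal sketch (the file name `ypred.pickle` is just an example):" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Save the predictions for submission; I will load this file and compute the MSE on my side.\n", "with open(\"ypred.pickle\", \"wb\") as f:\n", "    pickle.dump(y_pred, f)" ] },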
{ "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "with open(\"yprivate.pickle\", \"rb\") as f:\n", "    y_private = pickle.load(f)" ] },
{ "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.28245768089476747" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mean_squared_error(y_private, y_pred)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The model's ability to generalize is confirmed: we get the same performance on the private set as on the test set." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 3rd attempt" ] },
{ "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fitting 10 folds for each of 45 candidates, totalling 450 fits\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.\n", "[Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 46.4s\n", "[Parallel(n_jobs=-1)]: Done 192 tasks | elapsed: 3.9min\n", "[Parallel(n_jobs=-1)]: Done 442 tasks | elapsed: 9.7min\n", "[Parallel(n_jobs=-1)]: Done 450 out of 450 | elapsed: 9.9min finished\n" ] }, { "data": { "text/plain": [ "GridSearchCV(cv=10, error_score='raise-deprecating',\n", " estimator=RandomForestRegressor(bootstrap=True, criterion='mse',\n", " max_depth=None,\n", " max_features='auto',\n", " max_leaf_nodes=None,\n", " min_impurity_decrease=0.0,\n", " min_impurity_split=None,\n", " min_samples_leaf=1,\n", " min_samples_split=2,\n", " min_weight_fraction_leaf=0.0,\n", " n_estimators='warn', n_jobs=None,\n", " oob_score=False, random_state=None,\n", " verbose=0, warm_start=False),\n", " iid='warn', n_jobs=-1,\n", " param_grid={'max_depth': [11, 15, 20],\n", " 'min_samples_leaf': [1, 2, 3, 4, 5],\n", " 'n_estimators': [70, 100, 150]},\n", " pre_dispatch='2*n_jobs', refit=True, return_train_score=False,\n", " scoring='neg_mean_squared_error', verbose=1)" ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "parameters = {'max_depth':[11,15,20], 'min_samples_leaf':[1,2,3,4,5],\n", "              'n_estimators':[70,100,150]}\n", "rf = RandomForestRegressor()\n", "rf_search = GridSearchCV(rf, parameters, scoring='neg_mean_squared_error', cv=10, n_jobs=-1, verbose=1)\n", "rf_search.fit(X_train, y_train)" ] },
{ "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'max_depth': 20, 'min_samples_leaf': 2, 'n_estimators': 150}" ] }, "execution_count": 81, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rf_search.best_params_" ] },
{ "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MSE: \n", "train: 0.05885930059949651 \n", "test: 0.26632698653622233\n" ] } ], "source": [ "rf = RandomForestRegressor(max_depth=20, min_samples_leaf=2, n_estimators=150)\n", "rf.fit(X_train, y_train)\n", "print('MSE:', '\ntrain:', mean_squared_error(y_train, rf.predict(X_train)),\n", "      '\ntest:', mean_squared_error(y_test, rf.predict(X_test))\n", "     )" ] },
{ "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [], "source": [ "y_pred = rf.predict(X_private)" ] },
{ "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.2610297733844857" ] }, "execution_count": 87, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mean_squared_error(y_private, y_pred)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 4th attempt" ] },
{ "cell_type": "code", "execution_count": 88, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fitting 5 folds for each of 125 candidates, totalling 625 fits\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.\n", "[Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 1.5min\n", "[Parallel(n_jobs=-1)]: Done 192 tasks | elapsed: 6.5min\n", "[Parallel(n_jobs=-1)]: Done 442 tasks | elapsed: 14.7min\n", "[Parallel(n_jobs=-1)]: Done 625 out of 625 | elapsed: 20.7min finished\n" ] }, { "data": { "text/plain": [ "GridSearchCV(cv=5, error_score='raise-deprecating',\n", " estimator=RandomForestRegressor(bootstrap=True, criterion='mse',\n", " max_depth=None,\n", " max_features='auto',\n", " max_leaf_nodes=None,\n", " min_impurity_decrease=0.0,\n", " min_impurity_split=None,\n", " min_samples_leaf=1,\n", " min_samples_split=2,\n", " min_weight_fraction_leaf=0.0,\n", " n_estimators='warn', n_jobs=None,\n", " oob_score=False, random_state=None,\n", " verbose=0, warm_start=False),\n", " iid='warn', n_jobs=-1,\n", " param_grid={'max_depth': [15, 20, 25, 30, 35],\n", " 'min_samples_leaf': [1, 2, 3, 4, 5],\n", " 'n_estimators': [120, 150, 170, 200, 250]},\n", " pre_dispatch='2*n_jobs', refit=True, return_train_score=False,\n", " scoring='neg_mean_squared_error', verbose=1)" ] }, "execution_count": 88, "metadata": {}, "output_type": "execute_result" } ], "source": [ "parameters = {'max_depth':[15,20,25,30,35], 'min_samples_leaf':[1,2,3,4,5],\n", "              'n_estimators':[120,150,170,200,250]}\n", "rf = RandomForestRegressor()\n", "rf_search = GridSearchCV(rf, parameters, scoring='neg_mean_squared_error', cv=5, n_jobs=-1, verbose=1)\n", "rf_search.fit(X_train, y_train)" ] },
{ "cell_type": "code", "execution_count": 89, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'max_depth': 35, 'min_samples_leaf': 1, 'n_estimators': 200}" ] }, "execution_count": 89, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rf_search.best_params_" ] },
{ "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MSE: \n", "train: 0.03652540908006555 \n", "test: 0.26239096844368476\n" ] } ], "source": [ "rf = RandomForestRegressor(max_depth=35, min_samples_leaf=1, n_estimators=200)\n", "rf.fit(X_train, y_train)\n", "print('MSE:', '\ntrain:', mean_squared_error(y_train, rf.predict(X_train)),\n", "      '\ntest:', mean_squared_error(y_test, rf.predict(X_test))\n", "     )" ] },
{ "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "y_pred = rf.predict(X_private)" ] },
{ "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.25729530646987675" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mean_squared_error(y_private, y_pred)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We do not get a significantly better test score than in the previous attempt. We do get lucky on the private set, though, with an MSE of $0.257$.\n", "\n", "The grid search was rather long (20 minutes) for a negligible gain, so it is reasonable to move on from random forests to another model." ] },
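{ "cell_type": "markdown", "metadata": {}, "source": [ "When grids get this expensive, a cheaper option (a sketch, not something used in the runs above) is to sample a fixed budget of combinations with `RandomizedSearchCV`:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import RandomizedSearchCV\n", "\n", "# Try 20 random combinations: the cost is set by n_iter, not by the size of the grid.\n", "param_dist = {'max_depth': range(10, 36), 'min_samples_leaf': range(1, 6),\n", "              'n_estimators': range(100, 260, 10)}\n", "rf_rand = RandomizedSearchCV(RandomForestRegressor(), param_dist, n_iter=20,\n", "                             scoring='neg_mean_squared_error', cv=5, n_jobs=-1,\n", "                             random_state=0, verbose=1)\n", "rf_rand.fit(X_train, y_train)\n", "-rf_rand.best_score_" ] },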
y_private)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On obtient pas significativement mieux sur le test par rapport à la tentative précédente. En revanche, on a de la chance sur le private vu qu'on obtient $0.257$ en MSE.\n", "\n", "Le gridsearch était plutôt long (20 minutes) et le gain négligeable, il est raisonnable de passer à un autre modèle que les random forests." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## d) Entraîner un xgboost" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On ne touche pas trop aux hyperparamètres dans un premier temps." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MSE: \n", "train: 0.3072673651550215 \n", "test: 0.35606410551823686\n" ] } ], "source": [ "#Si vous avez une version récente de xgboost il faut remplacer reg:linear par reg:squarederror\n", "param = {'max_depth':3, 'eta':1, 'silent':1, 'objective':'reg:linear'}\n", "\n", "bst = xgb.train(params=param, dtrain=xgb.DMatrix(X_train, label=y_train))\n", "\n", "print('MSE:', '\\ntrain:', mean_squared_error(y_train, bst.predict(xgb.DMatrix(X_train))),\n", " '\\ntest:', mean_squared_error(y_test, bst.predict(xgb.DMatrix(X_test)))\n", " )" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fitting 5 folds for each of 42 candidates, totalling 210 fits\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.\n", "[Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 37.4s\n", "[Parallel(n_jobs=-1)]: Done 192 tasks | elapsed: 2.6min\n", "[Parallel(n_jobs=-1)]: Done 210 out of 210 | elapsed: 2.9min finished\n" ] }, { "data": { "text/plain": [ "GridSearchCV(cv=5, error_score='raise-deprecating',\n", " estimator=XGBRegressor(base_score=0.5, booster='gbtree',\n", " colsample_bylevel=1, colsample_bynode=1,\n", " colsample_bytree=1, early_stopping_rounds=5,\n", " gamma=0, importance_type='gain',\n", " learning_rate=0.1, max_delta_step=0,\n", " max_depth=3, min_child_weight=1,\n", " missing=None, n_estimators=100, n_jobs=1,\n", " nfold=10, nthread=None, num_b...\n", " objective='reg:squarederror',\n", " random_state=0, reg_alpha=0, reg_lambda=1,\n", " scale_pos_weight=1, seed=None, silent=None,\n", " subsample=1, verbosity=1),\n", " iid='warn', n_jobs=-1,\n", " param_grid={'eta': array([0.1, 0.4, 0.7, 1. 
{ "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fitting 5 folds for each of 42 candidates, totalling 210 fits\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.\n", "[Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 37.4s\n", "[Parallel(n_jobs=-1)]: Done 192 tasks | elapsed: 2.6min\n", "[Parallel(n_jobs=-1)]: Done 210 out of 210 | elapsed: 2.9min finished\n" ] }, { "data": { "text/plain": [ "GridSearchCV(cv=5, error_score='raise-deprecating',\n", " estimator=XGBRegressor(base_score=0.5, booster='gbtree',\n", " colsample_bylevel=1, colsample_bynode=1,\n", " colsample_bytree=1, early_stopping_rounds=5,\n", " gamma=0, importance_type='gain',\n", " learning_rate=0.1, max_delta_step=0,\n", " max_depth=3, min_child_weight=1,\n", " missing=None, n_estimators=100, n_jobs=1,\n", " nfold=10, nthread=None, num_b...\n", " objective='reg:squarederror',\n", " random_state=0, reg_alpha=0, reg_lambda=1,\n", " scale_pos_weight=1, seed=None, silent=None,\n", " subsample=1, verbosity=1),\n", " iid='warn', n_jobs=-1,\n", " param_grid={'eta': array([0.1, 0.4, 0.7, 1. , 1.3, 1.6, 1.9]),\n", " 'max_depth': [3, 5, 7, 10, 15, 20],\n", " 'objective': ['reg:squarederror']},\n", " pre_dispatch='2*n_jobs', refit=True, return_train_score=False,\n", " scoring='neg_mean_squared_error', verbose=1)" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.model_selection import GridSearchCV\n", "\n", "parameters = {'max_depth':[3,5,7,10,15,20], 'eta':np.arange(0.1,2,0.3),\n", "              'objective':['reg:squarederror']}  # with an older version of xgboost, use reg:linear\n", "\n", "bst = xgb.XGBRegressor(num_boost_round=30, early_stopping_rounds=5)\n", "bst_search = GridSearchCV(bst, parameters, scoring='neg_mean_squared_error', cv=5, n_jobs=-1, verbose=1)\n", "bst_search.fit(X_train, y_train)" ] },
{ "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.23320549287225628" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "-bst_search.best_score_" ] },
{ "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'eta': 0.1, 'max_depth': 7, 'objective': 'reg:squarederror'}" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bst_search.best_params_" ] },
{ "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MSE: train: 0.09091648209189733 test: 0.23192689346104273\n" ] } ], "source": [ "bst = bst_search.best_estimator_\n", "print('MSE:', 'train:', mean_squared_error(y_train, bst.predict(X_train)),\n", "      'test:', mean_squared_error(y_test, bst.predict(X_test))\n", "     )" ] },
{ "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.22086412567310826" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "with open(\"yprivate.pickle\", \"rb\") as f:\n", "    y_private = pickle.load(f)\n", "    \n", "y_pred = bst.predict(X_private)\n", "\n", "mean_squared_error(y_private, y_pred)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Without much tuning we get a cross-validated MSE of 0.23 on the train set, 0.23 on the test set, and 0.22 on the private set (a stroke of luck).\n", "\n", "Boosted trees are very powerful models, and they are fast to train. One must however watch out for overfitting." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.2" } }, "nbformat": 4, "nbformat_minor": 2 }