{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Model Tuning\n", "> A Summary of lecture \"Machine Learning with Tree-Based Models in Python\", via datacamp\n", "\n", "- toc: true \n", "- badges: true\n", "- comments: true\n", "- author: Chanseok Kang\n", "- categories: [Python, Datacamp, Machine_Learning]\n", "- image: " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tuning a CART's Hyperparameters\n", "- Hyperparameters\n", " - Machine learning model:\n", " - parameters: learned from data\n", " - CART example: split-point of a node, split-feature of a node, ...\n", " - hyperparameters: not learned from data, set prior to training\n", " - CART example: ```max_depth```, ```min_samples_leaf```, splitting criterion, ...\n", "- What is hyperparameter tuning?\n", " - Problem: search for a set of optimal hyperparameters for a learning algorithm.\n", " - Solution: find a set of optimal hyperparameters that results in an optimal model.\n", " - Optimal model: yields an optimal score\n", " - Score : defaults to accuracy (classification) and $R^2$ (regression)\n", " - Cross-validation is used to estimate the generalization performance.\n", "- Approaches to hyperparameter tuning\n", " - Grid Search\n", " - Random Search\n", " - Bayesian Optimization\n", " - Genetic Algorithm\n", " - ...\n", "- Grid search cross validation\n", " - Manually set a grid of discrete hyperparameter values.\n", " - Set a metric for scoring model performance.\n", " - Search exhaustively through the grid.\n", " - For each set of hyperparameters, evaluate each model's CV score\n", " - The optimal hyperparameters are those of the model achieving the best CV score." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tree hyperparameters\n", "In the following exercises you'll revisit the [Indian Liver Patient](https://www.kaggle.com/uciml/indian-liver-patient-records) dataset which was introduced in a previous chapter.\n", "\n", "Your task is to tune the hyperparameters of a classification tree. Given that this dataset is imbalanced, you'll be using the ROC AUC score as a metric instead of accuracy." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Preprocess" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Age_stdTotal_Bilirubin_stdDirect_Bilirubin_stdAlkaline_Phosphotase_stdAlamine_Aminotransferase_stdAspartate_Aminotransferase_stdTotal_Protiens_stdAlbumin_stdAlbumin_and_Globulin_Ratio_stdIs_male_stdLiver_disease
01.247403-0.420320-0.495414-0.428870-0.355832-0.3191110.2937220.203446-0.14739001
11.0623061.2189361.4235181.675083-0.093573-0.0359620.9396550.077462-0.64846111
21.0623060.6403750.9260170.816243-0.115428-0.1464590.4782740.203446-0.17870711
30.815511-0.372106-0.388807-0.449416-0.366760-0.3122050.2937220.3294310.16578011
41.6792940.0939560.179766-0.395996-0.295731-0.1775370.755102-0.930414-1.71323711
\n", "
" ], "text/plain": [ " Age_std Total_Bilirubin_std Direct_Bilirubin_std \\\n", "0 1.247403 -0.420320 -0.495414 \n", "1 1.062306 1.218936 1.423518 \n", "2 1.062306 0.640375 0.926017 \n", "3 0.815511 -0.372106 -0.388807 \n", "4 1.679294 0.093956 0.179766 \n", "\n", " Alkaline_Phosphotase_std Alamine_Aminotransferase_std \\\n", "0 -0.428870 -0.355832 \n", "1 1.675083 -0.093573 \n", "2 0.816243 -0.115428 \n", "3 -0.449416 -0.366760 \n", "4 -0.395996 -0.295731 \n", "\n", " Aspartate_Aminotransferase_std Total_Protiens_std Albumin_std \\\n", "0 -0.319111 0.293722 0.203446 \n", "1 -0.035962 0.939655 0.077462 \n", "2 -0.146459 0.478274 0.203446 \n", "3 -0.312205 0.293722 0.329431 \n", "4 -0.177537 0.755102 -0.930414 \n", "\n", " Albumin_and_Globulin_Ratio_std Is_male_std Liver_disease \n", "0 -0.147390 0 1 \n", "1 -0.648461 1 1 \n", "2 -0.178707 1 1 \n", "3 0.165780 1 1 \n", "4 -1.713237 1 1 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "indian = pd.read_csv('./datasets/indian_liver_patient_preprocessed.csv', index_col=0)\n", "indian.head()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "X = indian.drop('Liver_disease', axis='columns')\n", "y = indian['Liver_disease']" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'ccp_alpha': 0.0,\n", " 'class_weight': None,\n", " 'criterion': 'gini',\n", " 'max_depth': None,\n", " 'max_features': None,\n", " 'max_leaf_nodes': None,\n", " 'min_impurity_decrease': 0.0,\n", " 'min_impurity_split': None,\n", " 'min_samples_leaf': 1,\n", " 'min_samples_split': 2,\n", " 'min_weight_fraction_leaf': 0.0,\n", " 'presort': 'deprecated',\n", " 
'random_state': None,\n", " 'splitter': 'best'}" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.tree import DecisionTreeClassifier\n", "\n", "# Instantiate dt\n", "dt = DecisionTreeClassifier()\n", "\n", "# Check default hyperparameters\n", "dt.get_params()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Set the tree's hyperparameter grid\n", "In this exercise, you'll manually set the grid of hyperparameters that will be used to tune the classification tree ```dt``` and find the optimal classifier in the next exercise.\n", "\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# Define params_dt\n", "params_dt = {\n", " 'max_depth': [2, 3, 4],\n", " 'min_samples_leaf': [0.12, 0.14, 0.16, 0.18],\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Search for the optimal tree\n", "In this exercise, you'll perform grid search using 5-fold cross-validation to find ```dt```'s optimal hyperparameters. Note that because grid search is an exhaustive process, it may take a lot of time to train the model. Here you'll instantiate the ```GridSearchCV``` object and then fit it to the training set. 
As discussed in the video, you can train such an object just like any other scikit-learn estimator by using the ```.fit()``` method:\n", "```python\n", "grid_object.fit(X_train, y_train)\n", "```\n" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "GridSearchCV(cv=5, error_score=nan,\n", " estimator=DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None,\n", " criterion='gini', max_depth=None,\n", " max_features=None,\n", " max_leaf_nodes=None,\n", " min_impurity_decrease=0.0,\n", " min_impurity_split=None,\n", " min_samples_leaf=1,\n", " min_samples_split=2,\n", " min_weight_fraction_leaf=0.0,\n", " presort='deprecated',\n", " random_state=None,\n", " splitter='best'),\n", " iid='deprecated', n_jobs=-1,\n", " param_grid={'max_depth': [2, 3, 4],\n", " 'min_samples_leaf': [0.12, 0.14, 0.16, 0.18]},\n", " pre_dispatch='2*n_jobs', refit=True, return_train_score=False,\n", " scoring='roc_auc', verbose=0)" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.model_selection import GridSearchCV\n", "\n", "# Instantiate grid_dt\n", "grid_dt = GridSearchCV(estimator=dt, param_grid=params_dt, scoring='roc_auc', cv=5, n_jobs=-1)\n", "\n", "# Fit grid_dt to the training set\n", "grid_dt.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Evaluate the optimal tree\n", "In this exercise, you'll evaluate the test set ROC AUC score of ```grid_dt```'s optimal model.\n", "\n", "In order to do so, you will first determine the probability of obtaining the positive label for each test set observation. You can use the ```predict_proba()``` method of an sklearn classifier to compute a 2D array whose columns contain, respectively, the probabilities of the negative and positive class labels."
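, "\n", "As a quick illustration of that 2D array's layout (a toy sketch on synthetic data, separate from the exercise):\n", "\n", "```python\n", "from sklearn.datasets import make_classification\n", "from sklearn.tree import DecisionTreeClassifier\n", "\n", "X, y = make_classification(n_samples=100, random_state=0)\n", "clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)\n", "\n", "proba = clf.predict_proba(X)\n", "print(proba.shape)      # (100, 2): one column per class, each row sums to 1\n", "positive = proba[:, 1]  # probability of the positive label\n", "```"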
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test set ROC AUC score: 0.681\n" ] } ], "source": [ "from sklearn.metrics import roc_auc_score\n", "\n", "# Extract the best estimator\n", "best_model = grid_dt.best_estimator_\n", "\n", "# Predict the test set probabilities of the positive class\n", "y_pred_proba = best_model.predict_proba(X_test)[:, 1]\n", "\n", "# Compute test_roc_auc\n", "test_roc_auc = roc_auc_score(y_test, y_pred_proba)\n", "\n", "# Print test_roc_auc\n", "print(\"Test set ROC AUC score: {:.3f}\".format(test_roc_auc))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tuning a RF's Hyperparameters\n", "- Random Forest Hyperparameters\n", " - CART hyperparameters\n", " - number of estimators\n", " - Whether it uses bootstrapping or not\n", " - ...\n", "- Tuning is expensive\n", " - Hyperparameter tuning:\n", " - Computationally expensive,\n", " - sometimes leads to very slight improvement\n", " - Weight the impact of tuning on the whole project" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Random forests hyperparameters\n", "In the following exercises, you'll be revisiting the [Bike Sharing Demand](https://www.kaggle.com/c/bike-sharing-demand) dataset that was introduced in a previous chapter. Recall that your task is to predict the bike rental demand using historical weather data from the Capital Bikeshare program in Washington, D.C.. For this purpose, you'll be tuning the hyperparameters of a Random Forests regressor." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Preprocess" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
hrholidayworkingdaytemphumwindspeedcntinstantmnthyrClear to partly cloudyLight PrecipitationMisty
00000.760.660.00001491300471100
11000.740.700.1343931300571100
22000.720.740.0896901300671100
33000.720.840.1343331300771100
44000.700.790.194041300871100
\n", "
" ], "text/plain": [ " hr holiday workingday temp hum windspeed cnt instant mnth yr \\\n", "0 0 0 0 0.76 0.66 0.0000 149 13004 7 1 \n", "1 1 0 0 0.74 0.70 0.1343 93 13005 7 1 \n", "2 2 0 0 0.72 0.74 0.0896 90 13006 7 1 \n", "3 3 0 0 0.72 0.84 0.1343 33 13007 7 1 \n", "4 4 0 0 0.70 0.79 0.1940 4 13008 7 1 \n", "\n", " Clear to partly cloudy Light Precipitation Misty \n", "0 1 0 0 \n", "1 1 0 0 \n", "2 1 0 0 \n", "3 1 0 0 \n", "4 1 0 0 " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bike = pd.read_csv('./datasets/bikes.csv')\n", "bike.head()" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "X = bike.drop('cnt', axis='columns')\n", "y = bike['cnt']" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'bootstrap': True,\n", " 'ccp_alpha': 0.0,\n", " 'criterion': 'mse',\n", " 'max_depth': None,\n", " 'max_features': 'auto',\n", " 'max_leaf_nodes': None,\n", " 'max_samples': None,\n", " 'min_impurity_decrease': 0.0,\n", " 'min_impurity_split': None,\n", " 'min_samples_leaf': 1,\n", " 'min_samples_split': 2,\n", " 'min_weight_fraction_leaf': 0.0,\n", " 'n_estimators': 100,\n", " 'n_jobs': None,\n", " 'oob_score': False,\n", " 'random_state': None,\n", " 'verbose': 0,\n", " 'warm_start': False}" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.ensemble import RandomForestRegressor\n", "\n", "# Instantiate rf\n", "rf = RandomForestRegressor()\n", "\n", "# Get hyperparameters\n", "rf.get_params()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Set the hyperparameter grid of RF\n", "In this exercise, you'll manually set the grid of hyperparameters that will be used to 
tune ```rf```'s hyperparameters and find the optimal regressor. For this purpose, you will construct a grid of hyperparameters and tune the number of estimators, the maximum number of features considered at each split, and the minimum number of samples (or fraction) per leaf." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "# Define the dictionary 'params_rf'\n", "params_rf = {\n", " 'n_estimators': [100, 350, 500],\n", " 'max_features': ['log2', 'auto', 'sqrt'],\n", " 'min_samples_leaf': [2, 10, 30],\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Search for the optimal forest\n", "In this exercise, you'll perform grid search using 3-fold cross-validation to find ```rf```'s optimal hyperparameters. To evaluate each model in the grid, you'll be using the negative mean squared error metric.\n", "\n", "Note that because grid search is an exhaustive search process, it may take a lot of time to train the model. Here you'll instantiate the ```GridSearchCV``` object and then fit it to the training set. 
As discussed in the video, you can train such an object just like any other scikit-learn estimator by using the ```.fit()``` method:\n", "```python\n", "grid_object.fit(X_train, y_train)\n", "```" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fitting 3 folds for each of 27 candidates, totalling 81 fits\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=-1)]: Using backend LokyBackend with 24 concurrent workers.\n", "[Parallel(n_jobs=-1)]: Done 2 tasks | elapsed: 1.6s\n", "[Parallel(n_jobs=-1)]: Done 81 out of 81 | elapsed: 3.7s finished\n" ] }, { "data": { "text/plain": [ "GridSearchCV(cv=3, error_score=nan,\n", " estimator=RandomForestRegressor(bootstrap=True, ccp_alpha=0.0,\n", " criterion='mse', max_depth=None,\n", " max_features='auto',\n", " max_leaf_nodes=None,\n", " max_samples=None,\n", " min_impurity_decrease=0.0,\n", " min_impurity_split=None,\n", " min_samples_leaf=1,\n", " min_samples_split=2,\n", " min_weight_fraction_leaf=0.0,\n", " n_estimators=100, n_jobs=None,\n", " oob_score=False, random_state=None,\n", " verbose=0, warm_start=False),\n", " iid='deprecated', n_jobs=-1,\n", " param_grid={'max_features': ['log2', 'auto', 'sqrt'],\n", " 'min_samples_leaf': [2, 10, 30],\n", " 'n_estimators': [100, 350, 500]},\n", " pre_dispatch='2*n_jobs', refit=True, return_train_score=False,\n", " scoring='neg_mean_squared_error', verbose=1)" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.model_selection import GridSearchCV\n", "\n", "# Instantiate grid_rf\n", "grid_rf = GridSearchCV(estimator=rf, param_grid=params_rf, scoring='neg_mean_squared_error', cv=3,\n", " verbose=1, n_jobs=-1)\n", "\n", "# Fit grid_rf to the training set\n", "grid_rf.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Evaluate the optimal forest\n", "In this last exercise of the course, you'll evaluate 
the test set RMSE of ```grid_rf```'s optimal model." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test RMSE of best model: 54.358\n" ] } ], "source": [ "from sklearn.metrics import mean_squared_error as MSE\n", "\n", "# Extract the best estimator\n", "best_model = grid_rf.best_estimator_\n", "\n", "# Predict test set labels\n", "y_pred = best_model.predict(X_test)\n", "\n", "# Compute rmse_test\n", "rmse_test = MSE(y_test, y_pred) ** 0.5\n", "\n", "# Print rmse_test\n", "print('Test RMSE of best model: {:.3f}'.format(rmse_test))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }