{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Fine-tuning your XGBoost model\n", "> This chapter will teach you how to make your XGBoost models as performant as possible. You'll learn about the variety of parameters that can be adjusted to alter the behavior of XGBoost and how to tune them efficiently so that you can supercharge the performance of your models. This is the Summary of lecture \"Extreme Gradient Boosting with XGBoost\", via datacamp.\n", "\n", "- toc: true \n", "- badges: true\n", "- comments: true\n", "- author: Chanseok Kang\n", "- categories: [Python, Datacamp, Machine_Learning]\n", "- image: " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import xgboost as xgb" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Why tune your model?\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tuning the number of boosting rounds\n", "Let's start with parameter tuning by seeing how the number of boosting rounds (number of trees you build) impacts the out-of-sample performance of your XGBoost model. You'll use `xgb.cv()` inside a for loop and build one model per `num_boost_round` parameter.\n", "\n", "Here, you'll continue working with the Ames housing dataset. The features are available in the array `X`, and the target vector is contained in `y`." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('./dataset/ames_housing_trimmed_processed.csv')\n", "X, y = df.iloc[:, :-1], df.iloc[:, -1]" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " num_boosting_rounds rmse\n", "0 5 50903.299479\n", "1 10 34774.194010\n", "2 15 32895.097656\n" ] } ], "source": [ "# Create the DMatrix: housing_dmatrix\n", "housing_dmatrix = xgb.DMatrix(data=X, label=y)\n", "\n", "# Creata the parameter dictionary for each tree: params\n", "params = {\"objective\":\"reg:squarederror\", \"max_depth\":3}\n", "\n", "# Create list of number of boosting rounds\n", "num_rounds = [5, 10, 15]\n", "\n", "# Empty list to store final round rmse per XGBoost model\n", "final_rmse_per_round = []\n", "\n", "# Interate over num_rounds and build one model per num_boost_round parameter\n", "for curr_num_rounds in num_rounds:\n", " # Perform cross-validation: cv_results\n", " cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=3, \n", " num_boost_round=curr_num_rounds, metrics='rmse', \n", " as_pandas=True, seed=123)\n", " \n", " # Append final round RMSE\n", " final_rmse_per_round.append(cv_results['test-rmse-mean'].tail().values[-1])\n", " \n", "# Print the result DataFrame\n", "num_rounds_rmses = list(zip(num_rounds, final_rmse_per_round))\n", "print(pd.DataFrame(num_rounds_rmses, columns=['num_boosting_rounds', 'rmse']))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Automated boosting round selection using early_stopping\n", "Now, instead of attempting to cherry pick the best possible number of boosting rounds, you can very easily have XGBoost automatically select the number of boosting rounds for you within `xgb.cv()`. 
This is done using a technique called **early stopping**.\n", "\n", "Early stopping works by testing the XGBoost model after every boosting round against a hold-out dataset and stopping the creation of additional boosting rounds (thereby finishing training of the model early) if the hold-out metric (`\"rmse\"` in our case) does not improve for a given number of rounds. Here you will use the `early_stopping_rounds` parameter in `xgb.cv()` with a large possible number of boosting rounds (50). Bear in mind that if the holdout metric continuously improves up through when `num_boost_rounds` is reached, then early stopping does not occur." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " train-rmse-mean train-rmse-std test-rmse-mean test-rmse-std\n", "0 141871.635417 403.636200 142640.651042 705.559164\n", "1 103057.033854 73.769960 104907.664063 111.119966\n", "2 75975.966146 253.726099 79262.054687 563.764349\n", "3 57420.529948 521.658354 61620.136719 1087.694282\n", "4 44552.955729 544.170190 50437.561198 1846.446330\n", "5 35763.947917 681.797248 43035.660156 2034.472841\n", "6 29861.464193 769.572153 38600.881510 2169.798065\n", "7 25994.675781 756.519243 36071.817708 2109.795430\n", "8 23306.836588 759.238254 34383.184896 1934.546688\n", "9 21459.770833 745.624687 33509.141276 1887.375284\n", "10 20148.720703 749.612103 32916.807292 1850.894702\n", "11 19215.382162 641.387014 32197.832682 1734.456935\n", "12 18627.389323 716.256596 31770.852214 1802.156241\n", "13 17960.694661 557.043073 31482.781901 1779.124534\n", "14 17559.736979 631.412969 31389.990234 1892.319927\n", "15 17205.712891 590.171393 31302.881511 1955.166733\n", "16 16876.572591 703.632148 31234.059896 1880.707172\n", "17 16597.663086 703.677647 31318.348308 1828.860391\n", "18 16330.460937 607.274494 31323.634766 1775.910706\n", "19 16005.972656 520.471365 31204.135417 1739.076156\n", "20 15814.300456 518.605216 31089.863281 1756.020773\n", "21 15493.405599 505.616447 31047.996094 1624.673955\n", "22 15270.734375 502.018453 31056.916015 1668.042812\n", "23 15086.382161 503.912447 31024.984375 1548.985605\n", "24 14917.607747 486.205730 30983.686198 1663.131107\n", "25 14709.589518 449.668010 30989.477865 1686.668378\n", "26 14457.286458 376.787206 30952.113281 1613.172049\n", "27 14185.567383 383.102234 31066.901042 1648.534606\n", "28 13934.067057 473.464991 31095.641927 1709.225000\n", "29 13749.645182 473.671021 31103.887370 1778.880069\n", "30 13549.836263 454.898488 30976.085938 1744.515079\n", "31 13413.485026 399.603470 30938.469401 1746.054047\n", "32 13275.916016 415.408786 30930.999349 1772.470428\n", "33 13085.878255 493.792509 30929.056641 1765.541578\n", "34 12947.181641 517.790106 30890.629557 1786.510976\n", "35 12846.027344 547.732372 30884.492839 1769.730062\n", "36 12702.378906 505.523126 30833.541667 1691.002487\n", "37 12532.244140 508.298516 30856.688151 1771.445059\n", "38 12384.055013 536.225042 30818.016927 1782.786053\n", "39 12198.444010 545.165502 30839.392578 1847.326928\n", "40 12054.583333 508.841412 30776.964844 1912.779587\n", "41 11897.036784 477.177937 30794.702474 1919.674832\n", "42 11756.221354 502.992395 30780.957031 1906.820066\n", "43 11618.846680 519.837469 30783.753906 1951.260704\n", "44 11484.080404 578.428621 30776.731771 1953.446772\n", "45 11356.552734 565.368794 30758.542969 1947.454481\n", "46 11193.557943 552.299272 30729.972005 1985.699338\n", "47 11071.315104 604.089876 30732.663411 
1966.999196\n", "48 10950.778320 574.862779 30712.240885 1957.751118\n", "49 10824.865885 576.665756 30720.853516 1950.511977\n" ] } ], "source": [ "# Create your housing DMatrix: housing_dmatrix\n", "housing_dmatrix = xgb.DMatrix(data=X, label=y)\n", "\n", "# Create the parameter dictionary for each tree: params\n", "params = {\"objective\":\"reg:squarederror\", \"max_depth\":4}\n", "\n", "# Perform cross-validation with early stopping: cv_results\n", "cv_results = xgb.cv(dtrain=housing_dmatrix, nfold=3, params=params, metrics=\"rmse\", \n", " early_stopping_rounds=10, num_boost_round=50, as_pandas=True, seed=123)\n", "\n", "# Print cv_results\n", "print(cv_results)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview of XGBoost's hyperparameters\n", "- Common tree tunable parameters\n", " - eta: learning rate (step-size shrinkage applied to each new tree)\n", " - gamma: min loss reduction to create new tree split\n", " - lambda: L2 regularization on leaf weights\n", " - alpha: L1 regularization on leaf weights\n", " - max_depth: max depth per tree\n", " - subsample: % samples used per tree\n", " - colsample_bytree: % features used per tree\n", "- Linear tunable parameters\n", " - lambda: L2 reg on weights\n", " - alpha: L1 reg on weights\n", " - lambda_bias: L2 reg term on bias\n", "- You can also tune the number of estimators used for both base model types!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tuning eta\n", "It's time to practice tuning other XGBoost hyperparameters in earnest and observing their effect on model performance! You'll begin by tuning `\"eta\"`, also known as the learning rate.\n", "\n", "The learning rate in XGBoost is a parameter that can range between 0 and 1. It scales the leaf weights of each new tree before the tree is added to the ensemble, so lower values of `\"eta\"` shrink each tree's contribution more strongly, giving stronger regularization at the cost of needing more boosting rounds." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " eta best_rmse\n", "0 0.001 195736.401042\n", "1 0.010 179932.187500\n", "2 0.100 79759.408854\n" ] } ], "source": [ "# Create your housing DMatrix: housing_dmatrix\n", "housing_dmatrix = xgb.DMatrix(data=X, label=y)\n", "\n", "# Create the parameter dictionary for each tree (boosting round)\n", "params = {\"objective\":\"reg:squarederror\", \"max_depth\":3}\n", "\n", "# Create list of eta values and empty list to store final round rmse per xgboost model\n", "eta_vals = [0.001, 0.01, 0.1]\n", "best_rmse = []\n", "\n", "# Systematically vary the eta\n", "for curr_val in eta_vals:\n", " params['eta'] = curr_val\n", " \n", " # Perform cross-validation: cv_results\n", " cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=3,\n", " early_stopping_rounds=5, num_boost_round=10, metrics='rmse', seed=123, \n", " as_pandas=True)\n", " \n", " # Append the final round rmse to best_rmse\n", " best_rmse.append(cv_results['test-rmse-mean'].tail().values[-1])\n", " \n", "# Print the result DataFrame\n", "print(pd.DataFrame(list(zip(eta_vals, best_rmse)), columns=['eta', 'best_rmse']))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tuning max_depth\n", "In this exercise, your job is to tune `max_depth`, which is the parameter that dictates the maximum depth that each tree in a boosting round can grow to. Smaller values will lead to shallower trees, and larger values to deeper trees."
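, "\n", "\n", "The exercise below compares only the cross-validated test RMSE. As a small aside (not part of the original exercise), you can also keep the training RMSE that `xgb.cv()` reports and watch the train/test gap widen as trees get deeper, which is a quick way to spot overfitting. A minimal sketch, reusing the `housing_dmatrix` from the cells above:\n", "\n", "```python\n", "# Sketch: compare final-round train vs. test RMSE as max_depth grows\n", "gap_rows = []\n", "for depth in [2, 5, 10, 20]:\n", "    params = {\"objective\": \"reg:squarederror\", \"max_depth\": depth}\n", "    cv = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=3,\n", "                num_boost_round=10, metrics=\"rmse\", as_pandas=True, seed=123)\n", "    # The last row of the CV table corresponds to the final boosting round\n", "    gap_rows.append((depth, cv[\"train-rmse-mean\"].iloc[-1], cv[\"test-rmse-mean\"].iloc[-1]))\n", "\n", "print(pd.DataFrame(gap_rows, columns=[\"max_depth\", \"train_rmse\", \"test_rmse\"]))\n", "```"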
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " max_depth best_rmse\n", "0 2 37957.468750\n", "1 5 35596.599610\n", "2 10 36065.546875\n", "3 20 36739.576172\n" ] } ], "source": [ "# Create your housing DMatrix\n", "housing_dmatrix = xgb.DMatrix(data=X, label=y)\n", "\n", "# Create the parameter dictionary\n", "params = {\"objective\":\"reg:squarederror\"}\n", "\n", "# Create list of max_depth values\n", "max_depths = [2, 5, 10, 20]\n", "best_rmse = []\n", "\n", "for curr_val in max_depths:\n", " params['max_depth'] = curr_val\n", " \n", " # Perform cross-validation\n", " cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=2, \n", " early_stopping_rounds=5, num_boost_round=10, metrics='rmse', seed=123,\n", " as_pandas=True)\n", " \n", " # Append the final round rmse to best_rmse\n", " best_rmse.append(cv_results['test-rmse-mean'].tail().values[-1])\n", " \n", "# Print the result DataFrame\n", "print(pd.DataFrame(list(zip(max_depths, best_rmse)), columns=['max_depth', 'best_rmse']))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tuning colsample_bytree\n", "Now, it's time to tune `\"colsample_bytree\"`. You've already seen a similar idea if you've ever worked with scikit-learn's `RandomForestClassifier` or `RandomForestRegressor`, where it is called `max_features`. Both control column subsampling: `max_features` limits the features considered at each split, while `colsample_bytree` specifies the fraction of features randomly sampled for each tree. In xgboost, `colsample_bytree` must be specified as a float between 0 and 1." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " colsample_bytree best_rmse\n", "0 0.1 48193.453125\n", "1 0.5 36013.544922\n", "2 0.8 35932.962891\n", "3 1.0 35836.042968\n" ] } ], "source": [ "# Create your housing DMatrix\n", "housing_dmatrix = xgb.DMatrix(data=X,label=y)\n", "\n", "# Create the parameter dictionary\n", "params={\"objective\":\"reg:squarederror\", \"max_depth\":3}\n", "\n", "# Create list of hyperparameter values: colsample_bytree_vals\n", "colsample_bytree_vals = [0.1, 0.5, 0.8, 1]\n", "best_rmse = []\n", "\n", "# Systematically vary the hyperparameter value \n", "for curr_val in colsample_bytree_vals:\n", " params['colsample_bytree'] = curr_val\n", " \n", " # Perform cross-validation\n", " cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=2,\n", " num_boost_round=10, early_stopping_rounds=5,\n", " metrics=\"rmse\", as_pandas=True, seed=123)\n", " \n", " # Append the final round rmse to best_rmse\n", " best_rmse.append(cv_results[\"test-rmse-mean\"].tail().values[-1])\n", "\n", "# Print the resultant DataFrame\n", "print(pd.DataFrame(list(zip(colsample_bytree_vals, best_rmse)), \n", " columns=[\"colsample_bytree\",\"best_rmse\"]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are several other individual parameters that you can tune, such as `\"subsample\"`, which dictates the fraction of the training data that is used during any given boosting round (see the quick sketch below). Next up: Grid Search and Random Search to tune XGBoost hyperparameters more efficiently!"
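, "\n", "\n", "Here is that quick sketch: `\"subsample\"` swept with the same loop pattern used above. This is not one of the course exercises, just a minimal illustration that reuses the `housing_dmatrix` from earlier cells; the candidate values are arbitrary:\n", "\n", "```python\n", "# Sketch: vary the fraction of rows sampled for each boosting round\n", "params = {\"objective\": \"reg:squarederror\", \"max_depth\": 3}\n", "subsample_vals = [0.3, 0.6, 1.0]\n", "best_rmse = []\n", "\n", "for curr_val in subsample_vals:\n", "    params[\"subsample\"] = curr_val\n", "    cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=2,\n", "                        num_boost_round=10, early_stopping_rounds=5,\n", "                        metrics=\"rmse\", as_pandas=True, seed=123)\n", "    best_rmse.append(cv_results[\"test-rmse-mean\"].iloc[-1])\n", "\n", "print(pd.DataFrame(list(zip(subsample_vals, best_rmse)), columns=[\"subsample\", \"best_rmse\"]))\n", "```"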
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Review of grid search and random search\n", "- Grid search: review\n", " - Search exhaustively over a given set of hyperparameter values, building one model per combination\n", " - Number of models = number of distinct values per hyperparameter multiplied across each hyperparameter\n", " - Pick final model hyperparameter values that give best cross-validated evaluation metric value\n", "- Random search: review\n", " - Create a (possibly infinite) range of hyperparameter values per hyperparameter that you would like to search over\n", " - Set the number of iterations you would like for the random search to continue\n", " - During each iteration, randomly draw a value in the range of specified values for each hyperparameter searched over and train/evaluate a model with those hyperparameters\n", " - After you've reached the maximum number of iterations, select the hyperparameter configuration with the best evaluated score" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Grid search with XGBoost\n", "Now that you've learned how to tune parameters individually with XGBoost, let's take your parameter tuning to the next level by using scikit-learn's `GridSearchCV` and `RandomizedSearchCV`, which run the search with internal cross-validation. You will use these to find the best model exhaustively from a collection of possible parameter values across multiple parameters simultaneously. Let's get to work, starting with `GridSearchCV`!" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fitting 4 folds for each of 4 candidates, totalling 16 fits\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Best parameters found: {'colsample_bytree': 0.7, 'max_depth': 5, 'n_estimators': 50}\n", "Lowest RMSE found: 30744.105707685176\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Done 16 out of 16 | elapsed: 0.4s finished\n" ] } ], "source": [ "from sklearn.model_selection import GridSearchCV\n", "\n", "# Create the parameter grid: gbm_param_grid\n", "gbm_param_grid = {\n", " 'colsample_bytree': [0.3, 0.7],\n", " 'n_estimators': [50],\n", " 'max_depth': [2, 5]\n", "}\n", "\n", "# Instantiate the regressor: gbm\n", "gbm = xgb.XGBRegressor()\n", "\n", "# Perform grid search: grid_mse\n", "grid_mse = GridSearchCV(param_grid=gbm_param_grid, estimator=gbm, \n", " scoring='neg_mean_squared_error', cv=4, verbose=1)\n", "\n", "# Fit grid_mse to the data\n", "grid_mse.fit(X, y)\n", "\n", "# Print the best parameters and lowest RMSE\n", "print(\"Best parameters found: \", grid_mse.best_params_)\n", "print(\"Lowest RMSE found: \", np.sqrt(np.abs(grid_mse.best_score_)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Random search with XGBoost\n", "Often, `GridSearchCV` can be really time-consuming, so in practice you may want to use `RandomizedSearchCV` instead, as you will do in this exercise. The good news is you only have to make a few modifications to your `GridSearchCV` code to do `RandomizedSearchCV`. The key difference is you have to specify a `param_distributions` parameter instead of a `param_grid` parameter."
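, "\n", "\n", "One detail the exercise below does not show: `param_distributions` also accepts scipy.stats distribution objects (anything with an `rvs` method), so the search can draw from a continuous range rather than a fixed list. A minimal sketch, assuming scipy is available; the parameter ranges are illustrative, not taken from the course:\n", "\n", "```python\n", "from scipy.stats import randint, uniform\n", "from sklearn.model_selection import RandomizedSearchCV\n", "\n", "# Distributions instead of fixed lists: each iteration samples fresh values\n", "gbm_param_dist = {\n", "    \"learning_rate\": uniform(0.01, 0.3),  # continuous draws from [0.01, 0.31]\n", "    \"max_depth\": randint(2, 12),          # integers 2 through 11\n", "    \"n_estimators\": randint(20, 60)\n", "}\n", "\n", "randomized_mse = RandomizedSearchCV(estimator=xgb.XGBRegressor(objective=\"reg:squarederror\"),\n", "                                    param_distributions=gbm_param_dist,\n", "                                    scoring=\"neg_mean_squared_error\",\n", "                                    n_iter=5, cv=4, random_state=123)\n", "# randomized_mse.fit(X, y) would then run the search, as in the exercise below\n", "```"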
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fitting 4 folds for each of 5 candidates, totalling 20 fits\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Best parameters found: {'n_estimators': 25, 'max_depth': 4}\n", "Lowest RMSE found: 29998.4522530019\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Done 20 out of 20 | elapsed: 0.5s finished\n" ] } ], "source": [ "from sklearn.model_selection import RandomizedSearchCV\n", "\n", "# Create the parameter distributions: gbm_param_grid\n", "gbm_param_grid = {\n", " 'n_estimators': [25],\n", " 'max_depth': range(2, 12)\n", "}\n", "\n", "# Instantiate the regressor: gbm\n", "gbm = xgb.XGBRegressor(n_estimators=10)\n", "\n", "# Perform random search: randomized_mse\n", "randomized_mse = RandomizedSearchCV(param_distributions=gbm_param_grid, estimator=gbm, \n", " scoring='neg_mean_squared_error', n_iter=5, cv=4, \n", " verbose=1)\n", "\n", "# Fit randomized_mse to the data\n", "randomized_mse.fit(X, y)\n", "\n", "# Print the best parameters and lowest RMSE\n", "print(\"Best parameters found: \", randomized_mse.best_params_)\n", "print(\"Lowest RMSE found: \", np.sqrt(np.abs(randomized_mse.best_score_)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Limits of grid search and random search\n", "- Limitations\n", " - Grid Search\n", " - Number of models you must build grows very quickly with every additional parameter\n", " - Random Search\n", " - Parameter space to explore can be massive\n", " - Randomly jumping throughout the space looking for the \"best\" result becomes a waiting game" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }