{ "cells": [ { "cell_type": "code", "execution_count": null, "id": "69436ff0-cff5-49c9-993c-e9d7086bb11b", "metadata": { "tags": [ "hide-input" ] }, "outputs": [], "source": [ "# Install the necessary dependencies\n", "\n", "import os\n", "import sys\n", "!{sys.executable} -m pip install --quiet pandas scikit-learn numpy matplotlib jupyterlab_myst ipython xgboost" ] }, { "cell_type": "markdown", "id": "755e087a", "metadata": { "tags": [ "remove-cell" ] }, "source": [ "---\n", "license:\n", " code: MIT\n", " content: CC-BY-4.0\n", "github: https://github.com/ocademy-ai/machine-learning\n", "venue: By Ocademy\n", "open_access: true\n", "bibliography:\n", " - https://raw.githubusercontent.com/ocademy-ai/machine-learning/main/open-machine-learning-jupyter-book/references.bib\n", "---" ] }, { "cell_type": "markdown", "id": "54581f48", "metadata": {}, "source": [ "# XGBoost" ] }, { "cell_type": "markdown", "id": "570d1449", "metadata": {}, "source": [ "## What is XGBoost\n", "\n", "**XGBoost** is the leading model for working with standard tabular data (the type of data you store in Pandas DataFrames, as opposed to more exotic types of data like images and videos). XGBoost models dominate many famous competitions.\n", "\n", "To reach peak accuracy, XGBoost models require more knowledge and _model tuning_ than techniques like Random Forest. After this section, you'll be able to:\n", "- Follow the full modeling workflow with XGBoost\n", "- Fine-tune XGBoost models for optimal performance\n", "\n", "XGBoost is an implementation of the **Gradient Boosted Decision Trees** algorithm (scikit-learn has its own version of this algorithm, but XGBoost has some technical advantages). What are **Gradient Boosted Decision Trees**? We'll walk through a diagram."
] }, { "cell_type": "markdown", "id": "9af20398", "metadata": {}, "source": [ ":::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/ml-advanced/xgboost/Gradient_boosted_decision_trees.png\n", "---\n", "name: 'Gradient Boosted Decision Trees'\n", "width: 90%\n", "---\n", "Gradient Boosted Decision Trees\n", ":::" ] }, { "cell_type": "markdown", "id": "5ffa9d7b", "metadata": {}, "source": [ "We go through cycles that repeatedly build new models and combine them into an **ensemble** model. We start each cycle by calculating the errors of the current ensemble for each observation in the dataset. We then build a new model to predict those errors, and add the predictions from this error-predicting model to the \"ensemble of models.\"\n", "\n", "To make a prediction, we add up the predictions from all previous models. We can use these predictions to calculate new errors, build the next model, and add it to the ensemble.\n", "\n", "There's one piece outside that cycle: we need some base prediction to start it. In practice, the initial prediction can be pretty naive. Even if its predictions are wildly inaccurate, subsequent additions to the ensemble will address those errors.\n", "\n", "This process may sound complicated, but the code to use it is straightforward. We'll fill in some additional explanatory details in the **model tuning** section below." ] }, { "cell_type": "markdown", "id": "cc345152", "metadata": {}, "source": [ "## Example\n", "\n", "We will start with the data pre-loaded into **train_X**, **test_X**, **train_y**, **test_y**."
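The cycle described above can be sketched with plain scikit-learn decision trees. This is an illustrative toy, not XGBoost's actual implementation; the synthetic data and the names `n_rounds` and `learning_rate` are assumptions made for the example.

```python
# Toy sketch of the gradient-boosting cycle: each round fits a small tree
# to the current errors (residuals) and adds its scaled predictions to the
# ensemble. Data and hyperparameters here are illustrative assumptions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

n_rounds, learning_rate = 50, 0.1
prediction = np.full_like(y, y.mean())   # naive base prediction to start the cycle
ensemble = []
for _ in range(n_rounds):
    residuals = y - prediction                       # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)    # add new model's predictions
    ensemble.append(tree)

# The ensemble's training error shrinks as rounds are added
base_mse = np.mean((y - y.mean()) ** 2)
final_mse = np.mean((y - prediction) ** 2)
```

Even though the base prediction (the mean) is wildly inaccurate on its own, the accumulated residual-predicting trees drive the error down round by round.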
] }, { "cell_type": "code", "execution_count": 2, "id": "d796e87b", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [], "source": [ "import pandas as pd\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.impute import SimpleImputer\n", "import warnings\n", "\n", "warnings.filterwarnings('ignore')\n", "\n", "# Load the house-price data and drop rows with a missing target\n", "data = pd.read_csv('https://static-1300131294.cos.ap-shanghai.myqcloud.com/data/house_price_train.csv')\n", "data.dropna(axis=0, subset=['SalePrice'], inplace=True)\n", "y = data.SalePrice\n", "# Keep only the numeric predictors\n", "X = data.drop(['SalePrice'], axis=1).select_dtypes(exclude=['object'])\n", "train_X, test_X, train_y, test_y = train_test_split(X.values, y.values, test_size=0.25)\n", "\n", "# Fill the remaining missing values with column means\n", "my_imputer = SimpleImputer()\n", "train_X = my_imputer.fit_transform(train_X)\n", "test_X = my_imputer.transform(test_X)" ] }, { "cell_type": "markdown", "id": "1857ed52", "metadata": {}, "source": [ "We build and fit a model just as we would in scikit-learn." ] }, { "cell_type": "code", "execution_count": 3, "id": "fdf00e58", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,\n",
"             colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,\n",
"             early_stopping_rounds=None, enable_categorical=False,\n",
"             eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',\n",
"             importance_type=None, interaction_constraints='',\n",
"             learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4,\n",
"             max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=1,\n",
"             missing=nan, monotone_constraints='()', n_estimators=100, n_jobs=0,\n",
"             num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,\n",
"             reg_lambda=1, ...)\n",
"XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,\n",
"             colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,\n",
"             early_stopping_rounds=None, enable_categorical=False,\n",
"             eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',\n",
"             importance_type=None, interaction_constraints='',\n",
"             learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4,\n",
"             max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=1,\n",
"             missing=nan, monotone_constraints='()', n_estimators=1000,\n",
"             n_jobs=0, num_parallel_tree=1, predictor='auto', random_state=0,\n",
"             reg_alpha=0, reg_lambda=1, ...)\n",
"XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,\n",
"             colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,\n",
"             early_stopping_rounds=None, enable_categorical=False,\n",
"             eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',\n",
"             importance_type=None, interaction_constraints='',\n",
"             learning_rate=0.05, max_bin=256, max_cat_to_onehot=4,\n",
"             max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=1,\n",
"             missing=nan, monotone_constraints='()', n_estimators=1000,\n",
"             n_jobs=0, num_parallel_tree=1, predictor='auto', random_state=0,\n",
"             reg_alpha=0, reg_lambda=1, ...)