{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Fine-tuning your XGBoost model\n", "> This chapter will teach you how to make your XGBoost models as performant as possible. You'll learn about the variety of parameters that can be adjusted to alter the behavior of XGBoost and how to tune them efficiently so that you can supercharge the performance of your models. This is the Summary of lecture \"Extreme Gradient Boosting with XGBoost\", via datacamp.\n", "\n", "- toc: true \n", "- badges: true\n", "- comments: true\n", "- author: Chanseok Kang\n", "- categories: [Python, Datacamp, Machine_Learning]\n", "- image: " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import xgboost as xgb" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Why tune your model?\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tuning the number of boosting rounds\n", "Let's start with parameter tuning by seeing how the number of boosting rounds (number of trees you build) impacts the out-of-sample performance of your XGBoost model. You'll use `xgb.cv()` inside a for loop and build one model per `num_boost_round` parameter.\n", "\n", "Here, you'll continue working with the Ames housing dataset. The features are available in the array `X`, and the target vector is contained in `y`." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('./dataset/ames_housing_trimmed_processed.csv')\n", "X, y = df.iloc[:, :-1], df.iloc[:, -1]" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " num_boosting_rounds rmse\n", "0 5 50903.299479\n", "1 10 34774.194010\n", "2 15 32895.097656\n" ] } ], "source": [ "# Create the DMatrix: housing_dmatrix\n", "housing_dmatrix = xgb.DMatrix(data=X, label=y)\n", "\n", "# Creata the parameter dictionary for each tree: params\n", "params = {\"objective\":\"reg:squarederror\", \"max_depth\":3}\n", "\n", "# Create list of number of boosting rounds\n", "num_rounds = [5, 10, 15]\n", "\n", "# Empty list to store final round rmse per XGBoost model\n", "final_rmse_per_round = []\n", "\n", "# Interate over num_rounds and build one model per num_boost_round parameter\n", "for curr_num_rounds in num_rounds:\n", " # Perform cross-validation: cv_results\n", " cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=3, \n", " num_boost_round=curr_num_rounds, metrics='rmse', \n", " as_pandas=True, seed=123)\n", " \n", " # Append final round RMSE\n", " final_rmse_per_round.append(cv_results['test-rmse-mean'].tail().values[-1])\n", " \n", "# Print the result DataFrame\n", "num_rounds_rmses = list(zip(num_rounds, final_rmse_per_round))\n", "print(pd.DataFrame(num_rounds_rmses, columns=['num_boosting_rounds', 'rmse']))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Automated boosting round selection using early_stopping\n", "Now, instead of attempting to cherry pick the best possible number of boosting rounds, you can very easily have XGBoost automatically select the number of boosting rounds for you within `xgb.cv()`. 
This is done using a technique called **early stopping**.\n", "\n", "Early stopping works by testing the XGBoost model after every boosting round against a hold-out dataset and stopping the creation of additional boosting rounds (thereby finishing training of the model early) if the hold-out metric (`\"rmse\"` in our case) does not improve for a given number of rounds. Here you will use the `early_stopping_rounds` parameter in `xgb.cv()` with a large possible number of boosting rounds (50). Bear in mind that if the holdout metric continuously improves up through when `num_boost_rounds` is reached, then early stopping does not occur." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " train-rmse-mean train-rmse-std test-rmse-mean test-rmse-std\n", "0 141871.635417 403.636200 142640.651042 705.559164\n", "1 103057.033854 73.769960 104907.664063 111.119966\n", "2 75975.966146 253.726099 79262.054687 563.764349\n", "3 57420.529948 521.658354 61620.136719 1087.694282\n", "4 44552.955729 544.170190 50437.561198 1846.446330\n", "5 35763.947917 681.797248 43035.660156 2034.472841\n", "6 29861.464193 769.572153 38600.881510 2169.798065\n", "7 25994.675781 756.519243 36071.817708 2109.795430\n", "8 23306.836588 759.238254 34383.184896 1934.546688\n", "9 21459.770833 745.624687 33509.141276 1887.375284\n", "10 20148.720703 749.612103 32916.807292 1850.894702\n", "11 19215.382162 641.387014 32197.832682 1734.456935\n", "12 18627.389323 716.256596 31770.852214 1802.156241\n", "13 17960.694661 557.043073 31482.781901 1779.124534\n", "14 17559.736979 631.412969 31389.990234 1892.319927\n", "15 17205.712891 590.171393 31302.881511 1955.166733\n", "16 16876.572591 703.632148 31234.059896 1880.707172\n", "17 16597.663086 703.677647 31318.348308 1828.860391\n", "18 16330.460937 607.274494 31323.634766 1775.910706\n", "19 16005.972656 520.471365 31204.135417 1739.076156\n", "20 15814.300456 518.605216 31089.863281 1756.020773\n", "21 15493.405599 505.616447 31047.996094 1624.673955\n", "22 15270.734375 502.018453 31056.916015 1668.042812\n", "23 15086.382161 503.912447 31024.984375 1548.985605\n", "24 14917.607747 486.205730 30983.686198 1663.131107\n", "25 14709.589518 449.668010 30989.477865 1686.668378\n", "26 14457.286458 376.787206 30952.113281 1613.172049\n", "27 14185.567383 383.102234 31066.901042 1648.534606\n", "28 13934.067057 473.464991 31095.641927 1709.225000\n", "29 13749.645182 473.671021 31103.887370 1778.880069\n", "30 13549.836263 454.898488 30976.085938 1744.515079\n", "31 13413.485026 399.603470 30938.469401 1746.054047\n", "32 13275.916016 415.408786 30930.999349 1772.470428\n", "33 13085.878255 493.792509 30929.056641 1765.541578\n", "34 12947.181641 517.790106 30890.629557 1786.510976\n", "35 12846.027344 547.732372 30884.492839 1769.730062\n", "36 12702.378906 505.523126 30833.541667 1691.002487\n", "37 12532.244140 508.298516 30856.688151 1771.445059\n", "38 12384.055013 536.225042 30818.016927 1782.786053\n", "39 12198.444010 545.165502 30839.392578 1847.326928\n", "40 12054.583333 508.841412 30776.964844 1912.779587\n", "41 11897.036784 477.177937 30794.702474 1919.674832\n", "42 11756.221354 502.992395 30780.957031 1906.820066\n", "43 11618.846680 519.837469 30783.753906 1951.260704\n", "44 11484.080404 578.428621 30776.731771 1953.446772\n", "45 11356.552734 565.368794 30758.542969 1947.454481\n", "46 11193.557943 552.299272 30729.972005 1985.699338\n", "47 11071.315104 604.089876 30732.663411 
1966.999196\n", "48 10950.778320 574.862779 30712.240885 1957.751118\n", "49 10824.865885 576.665756 30720.853516 1950.511977\n" ] } ], "source": [ "# Create your housing DMatrix: housing_dmatrix\n", "housing_dmatrix = xgb.DMatrix(data=X, label=y)\n", "\n", "# Create the parameter dictionary for each tree: params\n", "params = {\"objective\":\"reg:squarederror\", \"max_depth\":4}\n", "\n", "# Perform cross-validation with early stopping: cv_results\n", "cv_results = xgb.cv(dtrain=housing_dmatrix, nfold=3, params=params, metrics=\"rmse\", \n", " early_stopping_rounds=10, num_boost_round=50, as_pandas=True, seed=123)\n", "\n", "# Print cv_results\n", "print(cv_results)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview of XGBoost's hyperparameters\n", "- Common tree tunable parameters\n", " - eta: learning rate (step-size shrinkage applied to each new tree)\n", " - gamma: min loss reduction to create new tree split\n", " - lambda: L2 regularization on leaf weights\n", " - alpha: L1 regularization on leaf weights\n", " - max_depth: max depth per tree\n", " - subsample: % samples used per tree\n", " - colsample_bytree: % features used per tree\n", "- Linear tunable parameters\n", " - lambda: L2 reg on weights\n", " - alpha: L1 reg on weights\n", " - lambda_bias: L2 reg term on bias\n", "- You can also tune the number of estimators used for both base model types!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tuning eta\n", "It's time to practice tuning other XGBoost hyperparameters in earnest and observing their effect on model performance! You'll begin by tuning `\"eta\"`, also known as the learning rate.\n", "\n", "The learning rate in XGBoost is a parameter that can range between 0 and 1. It scales the leaf weights of each new tree before the tree is added to the ensemble, so lower values of `\"eta\"` shrink each tree's contribution more strongly, giving stronger regularization at the cost of needing more boosting rounds." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " eta best_rmse\n", "0 0.001 195736.401042\n", "1 0.010 179932.187500\n", "2 0.100 79759.408854\n" ] } ], "source": [ "# Create your housing DMatrix: housing_dmatrix\n", "housing_dmatrix = xgb.DMatrix(data=X, label=y)\n", "\n", "# Create the parameter dictionary for each tree (boosting round)\n", "params = {\"objective\":\"reg:squarederror\", \"max_depth\":3}\n", "\n", "# Create list of eta values and empty list to store final round rmse per xgboost model\n", "eta_vals = [0.001, 0.01, 0.1]\n", "best_rmse = []\n", "\n", "# Systematically vary the eta\n", "for curr_val in eta_vals:\n", " params['eta'] = curr_val\n", " \n", " # Perform cross-validation: cv_results\n", " cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=3,\n", " early_stopping_rounds=5, num_boost_round=10, metrics='rmse', seed=123, \n", " as_pandas=True)\n", " \n", " # Append the final round rmse to best_rmse\n", " best_rmse.append(cv_results['test-rmse-mean'].tail().values[-1])\n", " \n", "# Print the result DataFrame\n", "print(pd.DataFrame(list(zip(eta_vals, best_rmse)), columns=['eta', 'best_rmse']))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tuning max_depth\n", "In this exercise, your job is to tune `max_depth`, which is the parameter that dictates the maximum depth that each tree in a boosting round can grow to. Smaller values will lead to shallower trees, and larger values to deeper trees."
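, "\n", "\n", "The exercise below compares only the cross-validated test RMSE. As a small aside (not part of the original exercise), you can also keep the training RMSE that `xgb.cv()` reports and watch the train/test gap widen as trees get deeper, which is a quick way to spot overfitting. A minimal sketch, reusing the `housing_dmatrix` from the cells above:\n", "\n", "```python\n", "# Sketch: compare final-round train vs. test RMSE as max_depth grows\n", "gap_rows = []\n", "for depth in [2, 5, 10, 20]:\n", "    params = {\"objective\": \"reg:squarederror\", \"max_depth\": depth}\n", "    cv = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=3,\n", "                num_boost_round=10, metrics=\"rmse\", as_pandas=True, seed=123)\n", "    # The last row of the CV table corresponds to the final boosting round\n", "    gap_rows.append((depth, cv[\"train-rmse-mean\"].iloc[-1], cv[\"test-rmse-mean\"].iloc[-1]))\n", "\n", "print(pd.DataFrame(gap_rows, columns=[\"max_depth\", \"train_rmse\", \"test_rmse\"]))\n", "```"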
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " max_depth best_rmse\n", "0 2 37957.468750\n", "1 5 35596.599610\n", "2 10 36065.546875\n", "3 20 36739.576172\n" ] } ], "source": [ "# Create your housing DMatrix\n", "housing_dmatrix = xgb.DMatrix(data=X, label=y)\n", "\n", "# Create the parameter dictionary\n", "params = {\"objective\":\"reg:squarederror\"}\n", "\n", "# Create list of max_depth values\n", "max_depths = [2, 5, 10, 20]\n", "best_rmse = []\n", "\n", "for curr_val in max_depths:\n", " params['max_depth'] = curr_val\n", " \n", " # Perform cross-validation\n", " cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=2, \n", " early_stopping_rounds=5, num_boost_round=10, metrics='rmse', seed=123,\n", " as_pandas=True)\n", " \n", " # Append the final round rmse to best_rmse\n", " best_rmse.append(cv_results['test-rmse-mean'].tail().values[-1])\n", " \n", "# Print the result DataFrame\n", "print(pd.DataFrame(list(zip(max_depths, best_rmse)), columns=['max_depth', 'best_rmse']))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tuning colsample_bytree\n", "Now, it's time to tune `\"colsample_bytree\"`. You've already seen a similar idea if you've ever worked with scikit-learn's `RandomForestClassifier` or `RandomForestRegressor`, where it is called `max_features`. Both control column subsampling: `max_features` limits the features considered at each split, while `colsample_bytree` specifies the fraction of features randomly sampled for each tree. In xgboost, `colsample_bytree` must be specified as a float between 0 and 1." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " colsample_bytree best_rmse\n", "0 0.1 48193.453125\n", "1 0.5 36013.544922\n", "2 0.8 35932.962891\n", "3 1.0 35836.042968\n" ] } ], "source": [ "# Create your housing DMatrix\n", "housing_dmatrix = xgb.DMatrix(data=X,label=y)\n", "\n", "# Create the parameter dictionary\n", "params={\"objective\":\"reg:squarederror\", \"max_depth\":3}\n", "\n", "# Create list of hyperparameter values: colsample_bytree_vals\n", "colsample_bytree_vals = [0.1, 0.5, 0.8, 1]\n", "best_rmse = []\n", "\n", "# Systematically vary the hyperparameter value \n", "for curr_val in colsample_bytree_vals:\n", " params['colsample_bytree'] = curr_val\n", " \n", " # Perform cross-validation\n", " cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=2,\n", " num_boost_round=10, early_stopping_rounds=5,\n", " metrics=\"rmse\", as_pandas=True, seed=123)\n", " \n", " # Append the final round rmse to best_rmse\n", " best_rmse.append(cv_results[\"test-rmse-mean\"].tail().values[-1])\n", "\n", "# Print the resultant DataFrame\n", "print(pd.DataFrame(list(zip(colsample_bytree_vals, best_rmse)), \n", " columns=[\"colsample_bytree\",\"best_rmse\"]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are several other individual parameters that you can tune, such as `\"subsample\"`, which dictates the fraction of the training data that is used during any given boosting round (see the quick sketch below). Next up: Grid Search and Random Search to tune XGBoost hyperparameters more efficiently!"
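, "\n", "\n", "Here is that quick sketch: `\"subsample\"` swept with the same loop pattern used above. This is not one of the course exercises, just a minimal illustration that reuses the `housing_dmatrix` from earlier cells; the candidate values are arbitrary:\n", "\n", "```python\n", "# Sketch: vary the fraction of rows sampled for each boosting round\n", "params = {\"objective\": \"reg:squarederror\", \"max_depth\": 3}\n", "subsample_vals = [0.3, 0.6, 1.0]\n", "best_rmse = []\n", "\n", "for curr_val in subsample_vals:\n", "    params[\"subsample\"] = curr_val\n", "    cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=2,\n", "                        num_boost_round=10, early_stopping_rounds=5,\n", "                        metrics=\"rmse\", as_pandas=True, seed=123)\n", "    best_rmse.append(cv_results[\"test-rmse-mean\"].iloc[-1])\n", "\n", "print(pd.DataFrame(list(zip(subsample_vals, best_rmse)), columns=[\"subsample\", \"best_rmse\"]))\n", "```"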
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Review of grid search and random search\n", "- Grid search: review\n", " - Search exhaustively over a given set of hyperparameter values, building one model per combination\n", " - Number of models = number of distinct values per hyperparameter multiplied across each hyperparameter\n", " - Pick final model hyperparameter values that give best cross-validated evaluation metric value\n", "- Random search: review\n", " - Create a (possibly infinite) range of hyperparameter values per hyperparameter that you would like to search over\n", " - Set the number of iterations you would like for the random search to continue\n", " - During each iteration, randomly draw a value in the range of specified values for each hyperparameter searched over and train/evaluate a model with those hyperparameters\n", " - After you've reached the maximum number of iterations, select the hyperparameter configuration with the best evaluated score" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Grid search with XGBoost\n", "Now that you've learned how to tune parameters individually with XGBoost, let's take your parameter tuning to the next level by using scikit-learn's `GridSearchCV` and `RandomizedSearchCV`, which run the search with internal cross-validation. You will use these to find the best model exhaustively from a collection of possible parameter values across multiple parameters simultaneously. Let's get to work, starting with `GridSearchCV`!" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fitting 4 folds for each of 4 candidates, totalling 16 fits\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Best parameters found: {'colsample_bytree': 0.7, 'max_depth': 5, 'n_estimators': 50}\n", "Lowest RMSE found: 30744.105707685176\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Done 16 out of 16 | elapsed: 0.4s finished\n" ] } ], "source": [ "from sklearn.model_selection import GridSearchCV\n", "\n", "# Create the parameter grid: gbm_param_grid\n", "gbm_param_grid = {\n", " 'colsample_bytree': [0.3, 0.7],\n", " 'n_estimators': [50],\n", " 'max_depth': [2, 5]\n", "}\n", "\n", "# Instantiate the regressor: gbm\n", "gbm = xgb.XGBRegressor()\n", "\n", "# Perform grid search: grid_mse\n", "grid_mse = GridSearchCV(param_grid=gbm_param_grid, estimator=gbm, \n", " scoring='neg_mean_squared_error', cv=4, verbose=1)\n", "\n", "# Fit grid_mse to the data\n", "grid_mse.fit(X, y)\n", "\n", "# Print the best parameters and lowest RMSE\n", "print(\"Best parameters found: \", grid_mse.best_params_)\n", "print(\"Lowest RMSE found: \", np.sqrt(np.abs(grid_mse.best_score_)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Random search with XGBoost\n", "Often, `GridSearchCV` can be really time-consuming, so in practice you may want to use `RandomizedSearchCV` instead, as you will do in this exercise. The good news is you only have to make a few modifications to your `GridSearchCV` code to do `RandomizedSearchCV`. The key difference is you have to specify a `param_distributions` parameter instead of a `param_grid` parameter."
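, "\n", "\n", "One detail the exercise below does not show: `param_distributions` also accepts scipy.stats distribution objects (anything with an `rvs` method), so the search can draw from a continuous range rather than a fixed list. A minimal sketch, assuming scipy is available; the parameter ranges are illustrative, not taken from the course:\n", "\n", "```python\n", "from scipy.stats import randint, uniform\n", "from sklearn.model_selection import RandomizedSearchCV\n", "\n", "# Distributions instead of fixed lists: each iteration samples fresh values\n", "gbm_param_dist = {\n", "    \"learning_rate\": uniform(0.01, 0.3),  # continuous draws from [0.01, 0.31]\n", "    \"max_depth\": randint(2, 12),          # integers 2 through 11\n", "    \"n_estimators\": randint(20, 60)\n", "}\n", "\n", "randomized_mse = RandomizedSearchCV(estimator=xgb.XGBRegressor(objective=\"reg:squarederror\"),\n", "                                    param_distributions=gbm_param_dist,\n", "                                    scoring=\"neg_mean_squared_error\",\n", "                                    n_iter=5, cv=4, random_state=123)\n", "# randomized_mse.fit(X, y) would then run the search, as in the exercise below\n", "```"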
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fitting 4 folds for each of 5 candidates, totalling 20 fits\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Best parameters found: {'n_estimators': 25, 'max_depth': 4}\n", "Lowest RMSE found: 29998.4522530019\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Done 20 out of 20 | elapsed: 0.5s finished\n" ] } ], "source": [ "from sklearn.model_selection import RandomizedSearchCV\n", "\n", "# Create the parameter distributions: gbm_param_grid\n", "gbm_param_grid = {\n", " 'n_estimators': [25],\n", " 'max_depth': range(2, 12)\n", "}\n", "\n", "# Instantiate the regressor: gbm\n", "gbm = xgb.XGBRegressor(n_estimators=10)\n", "\n", "# Perform random search: randomized_mse\n", "randomized_mse = RandomizedSearchCV(param_distributions=gbm_param_grid, estimator=gbm, \n", " scoring='neg_mean_squared_error', n_iter=5, cv=4, \n", " verbose=1)\n", "\n", "# Fit randomized_mse to the data\n", "randomized_mse.fit(X, y)\n", "\n", "# Print the best parameters and lowest RMSE\n", "print(\"Best parameters found: \", randomized_mse.best_params_)\n", "print(\"Lowest RMSE found: \", np.sqrt(np.abs(randomized_mse.best_score_)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Limits of grid search and random search\n", "- Limitations\n", " - Grid Search\n", " - Number of models you must build grows very quickly with every additional parameter\n", " - Random Search\n", " - Parameter space to explore can be massive\n", " - Randomly jumping throughout the space looking for the \"best\" result becomes a waiting game" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }