{ "cells": [ { "cell_type": "code", "execution_count": null, "id": "69436ff0-cff5-49c9-993c-e9d7086bb11b", "metadata": { "tags": [ "hide-input" ] }, "outputs": [], "source": [ "# Install the necessary dependencies\n", "\n", "import os\n", "import sys\n", "!{sys.executable} -m pip install --quiet pandas scikit-learn numpy matplotlib jupyterlab_myst ipython xgboost" ] }, { "cell_type": "markdown", "id": "755e087a", "metadata": { "tags": [ "remove-cell" ] }, "source": [ "---\n", "license:\n", " code: MIT\n", " content: CC-BY-4.0\n", "github: https://github.com/ocademy-ai/machine-learning\n", "venue: By Ocademy\n", "open_access: true\n", "bibliography:\n", " - https://raw.githubusercontent.com/ocademy-ai/machine-learning/main/open-machine-learning-jupyter-book/references.bib\n", "---" ] }, { "cell_type": "markdown", "id": "54581f48", "metadata": {}, "source": [ "# XGBoost" ] }, { "cell_type": "markdown", "id": "570d1449", "metadata": {}, "source": [ "## What is XGBoost\n", "\n", "**XGBoost** is the leading model for working with standard tabular data (the type of data you store in Pandas DataFrames, as opposed to more exotic types of data like images and videos). XGBoost models dominate many famous competitions.\n", "\n", "To reach peak accuracy, XGBoost models require more knowledge and _model tuning_ than techniques like Random Forest. After this section, you'll be able to:\n", "- Follow the full modeling workflow with XGBoost\n", "- Fine-tune XGBoost models for optimal performance\n", "\n", "XGBoost is an implementation of the **Gradient Boosted Decision Trees** algorithm (scikit-learn has its own version of this algorithm, but XGBoost has some technical advantages). What are **Gradient Boosted Decision Trees**? We'll walk through a diagram."
] }, { "cell_type": "markdown", "id": "9af20398", "metadata": {}, "source": [ ":::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/ml-advanced/xgboost/Gradient_boosted_decision_trees.png\n", "---\n", "name: 'Gradient Boosted Decision Trees'\n", "width: 90%\n", "---\n", "Gradient Boosted Decision Trees\n", ":::" ] }, { "cell_type": "markdown", "id": "5ffa9d7b", "metadata": {}, "source": [ "We go through cycles that repeatedly build new models and combine them into an **ensemble** model. We start each cycle by calculating the errors of the current ensemble for each observation in the dataset. We then build a new model to predict those errors, and add the predictions from this error-predicting model to the \"ensemble of models.\"\n", "\n", "To make a prediction, we add up the predictions from all previous models. We can use these predictions to calculate new errors, build the next model, and add it to the ensemble.\n", "\n", "There's one piece outside that cycle: we need some base prediction to start it. In practice, the initial prediction can be pretty naive. Even if its predictions are wildly inaccurate, subsequent additions to the ensemble will address those errors.\n", "\n", "This process may sound complicated, but the code to use it is straightforward. We'll fill in some additional explanatory details in the **model tuning** section below." ] }, { "cell_type": "markdown", "id": "cc345152", "metadata": {}, "source": [ "## Example\n", "\n", "We will start with the data pre-loaded into **train_X**, **test_X**, **train_y**, **test_y**."
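The cycle described above can be sketched with plain scikit-learn decision trees. This is an illustrative toy, not XGBoost's actual implementation; the synthetic data and the names `n_rounds` and `learning_rate` are assumptions made for the example.

```python
# Toy sketch of the gradient-boosting cycle: each round fits a small tree
# to the current errors (residuals) and adds its scaled predictions to the
# ensemble. Data and hyperparameters here are illustrative assumptions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

n_rounds, learning_rate = 50, 0.1
prediction = np.full_like(y, y.mean())   # naive base prediction to start the cycle
ensemble = []
for _ in range(n_rounds):
    residuals = y - prediction                       # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)    # add new model's predictions
    ensemble.append(tree)

# The ensemble's training error shrinks as rounds are added
base_mse = np.mean((y - y.mean()) ** 2)
final_mse = np.mean((y - prediction) ** 2)
```

Even though the base prediction (the mean) is wildly inaccurate on its own, the accumulated residual-predicting trees drive the error down round by round.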
] }, { "cell_type": "code", "execution_count": 2, "id": "d796e87b", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [], "source": [ "import pandas as pd\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.impute import SimpleImputer\n", "import warnings\n", "\n", "warnings.filterwarnings('ignore')\n", "\n", "# Load the house-price data and drop rows with a missing target\n", "data = pd.read_csv('https://static-1300131294.cos.ap-shanghai.myqcloud.com/data/house_price_train.csv')\n", "data.dropna(axis=0, subset=['SalePrice'], inplace=True)\n", "y = data.SalePrice\n", "# Keep only the numeric predictors\n", "X = data.drop(['SalePrice'], axis=1).select_dtypes(exclude=['object'])\n", "train_X, test_X, train_y, test_y = train_test_split(X.values, y.values, test_size=0.25)\n", "\n", "# Fill the remaining missing values with column means\n", "my_imputer = SimpleImputer()\n", "train_X = my_imputer.fit_transform(train_X)\n", "test_X = my_imputer.transform(test_X)" ] }, { "cell_type": "markdown", "id": "1857ed52", "metadata": {}, "source": [ "We build and fit a model just as we would in scikit-learn." ] }, { "cell_type": "code", "execution_count": 3, "id": "fdf00e58", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,\n",
"             colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,\n",
"             early_stopping_rounds=None, enable_categorical=False,\n",
"             eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',\n",
"             importance_type=None, interaction_constraints='',\n",
"             learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4,\n",
"             max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=1,\n",
"             missing=nan, monotone_constraints='()', n_estimators=100, n_jobs=0,\n",
"             num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,\n",
"             reg_lambda=1, ...)\n",
"XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,\n",
"             colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,\n",
"             early_stopping_rounds=None, enable_categorical=False,\n",
"             eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',\n",
"             importance_type=None, interaction_constraints='',\n",
"             learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4,\n",
"             max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=1,\n",
"             missing=nan, monotone_constraints='()', n_estimators=1000,\n",
"             n_jobs=0, num_parallel_tree=1, predictor='auto', random_state=0,\n",
"             reg_alpha=0, reg_lambda=1, ...)\n",
"XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,\n",
"             colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,\n",
"             early_stopping_rounds=None, enable_categorical=False,\n",
"             eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',\n",
"             importance_type=None, interaction_constraints='',\n",
"             learning_rate=0.05, max_bin=256, max_cat_to_onehot=4,\n",
"             max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=1,\n",
"             missing=nan, monotone_constraints='()', n_estimators=1000,\n",
"             n_jobs=0, num_parallel_tree=1, predictor='auto', random_state=0,\n",
"             reg_alpha=0, reg_lambda=1, ...)