{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" }, "colab": { "name": "2020-06-04-02-Boosting.ipynb", "provenance": [], "toc_visible": true } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "5mXDAjLZLEsR" }, "source": [ "# Boosting\n", "> A summary of the lecture \"Machine Learning with Tree-Based Models in Python\", via DataCamp\n", "\n", "- toc: true \n", "- badges: true\n", "- comments: true\n", "- author: Chanseok Kang\n", "- categories: [Python, Datacamp, Machine Learning]\n", "- image: images/sgb_train.png" ] }, { "cell_type": "code", "metadata": { "id": "lxKzBC5iLEsa" }, "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns" ], "execution_count": 59, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "1x-MW5ypLEsb" }, "source": [ "## AdaBoost\n", "- Boosting: Ensemble method combining several weak learners to form a strong learner.\n", " - Weak learner: Model doing slightly better than random guessing\n", " - E.g., Decision stump (CART whose maximum depth is 1)\n", " - Train an ensemble of predictors sequentially.\n", " - Each predictor tries to correct its predecessor\n", " - Most popular boosting methods:\n", " - AdaBoost\n", " - Gradient Boosting\n", "- AdaBoost\n", " - Stands for **Ada**ptive **Boost**ing\n", " - Each predictor pays more attention to the instances wrongly predicted by its predecessor.\n", " - Achieved by changing the weights of training instances.\n", " - Each predictor is assigned a coefficient $\\alpha$ that depends on the predictor's training error\n", "- AdaBoost: Training\n", 
"![adaboost_train](https://github.com/goodboychan/chans_jupyter/blob/master/_notebooks/image/adaboost_train.png?raw=1)\n", " - Learning rate: $0 < \\eta < 1$" ] }, { "cell_type": "markdown", "metadata": { "id": "RKHgRHAnLEsc" }, "source": [ "### Define the AdaBoost classifier\n", "In the following exercises you'll revisit the [Indian Liver Patient](https://www.kaggle.com/uciml/indian-liver-patient-records) dataset which was introduced in a previous chapter. Your task is to predict whether a patient suffers from a liver disease using 10 features including Albumin, age and gender. However, this time, you'll be training an AdaBoost ensemble to perform the classification task. In addition, given that this dataset is imbalanced, you'll be using the ROC AUC score as a metric instead of accuracy.\n", "\n", "As a first step, you'll start by instantiating an AdaBoost classifier." ] }, { "cell_type": "code", "metadata": { "id": "XeldclBSLP8P" }, "source": [ "" ], "execution_count": 59, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "hbcrrikLLRLr", "outputId": "833d4537-be40-41a0-9552-2a69ab552012", "colab": { "base_uri": "https://localhost:8080/" } }, "source": [ "from google.colab import drive\n", "drive.mount('/content/drive')" ], "execution_count": 60, "outputs": [ { "output_type": "stream", "text": [ "Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount(\"/content/drive\", force_remount=True).\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "jNjzJ0kBLEsc" }, "source": [ "- Preprocess" ] }, { "cell_type": "code", "metadata": { "id": "5qkiIBDvLEsd", "outputId": "9437c378-7bee-440d-bf36-d978c03a520c", "colab": { "base_uri": "https://localhost:8080/", "height": 241 } }, "source": [ "!pwd\n", "indian = pd.read_csv('/content/drive/MyDrive/colab-notebooks/indian_liver_preprocessed.csv', index_col=0)\n", "indian.head()" ], "execution_count": 61, "outputs": [ { "output_type": "stream", "text": [ 
"/content\n" ], "name": "stdout" }, { "output_type": "execute_result", "data": { "text/html": [ "
<table>\n", "  <thead>\n", "    <tr><th></th><th>Age_std</th><th>Total_Bilirubin_std</th><th>Direct_Bilirubin_std</th><th>Alkaline_Phosphotase_std</th><th>Alamine_Aminotransferase_std</th><th>Aspartate_Aminotransferase_std</th><th>Total_Protiens_std</th><th>Albumin_std</th><th>Albumin_and_Globulin_Ratio_std</th><th>Is_male_std</th><th>Liver_disease</th></tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr><th>0</th><td>1.247403</td><td>-0.420320</td><td>-0.495414</td><td>-0.428870</td><td>-0.355832</td><td>-0.319111</td><td>0.293722</td><td>0.203446</td><td>-0.147390</td><td>0</td><td>1</td></tr>\n",
"    <tr><th>1</th><td>1.062306</td><td>1.218936</td><td>1.423518</td><td>1.675083</td><td>-0.093573</td><td>-0.035962</td><td>0.939655</td><td>0.077462</td><td>-0.648461</td><td>1</td><td>1</td></tr>\n",
"    <tr><th>2</th><td>1.062306</td><td>0.640375</td><td>0.926017</td><td>0.816243</td><td>-0.115428</td><td>-0.146459</td><td>0.478274</td><td>0.203446</td><td>-0.178707</td><td>1</td><td>1</td></tr>\n",
"    <tr><th>3</th><td>0.815511</td><td>-0.372106</td><td>-0.388807</td><td>-0.449416</td><td>-0.366760</td><td>-0.312205</td><td>0.293722</td><td>0.329431</td><td>0.165780</td><td>1</td><td>1</td></tr>\n",
"    <tr><th>4</th><td>1.679294</td><td>0.093956</td><td>0.179766</td><td>-0.395996</td><td>-0.295731</td><td>-0.177537</td><td>0.755102</td><td>-0.930414</td><td>-1.713237</td><td>1</td><td>1</td></tr>\n",
"  </tbody>\n", "</table>" ], "text/plain": [ " Age_std Total_Bilirubin_std ... Is_male_std Liver_disease\n", "0 1.247403 -0.420320 ... 0 1\n", "1 1.062306 1.218936 ... 1 1\n", "2 1.062306 0.640375 ... 1 1\n", "3 0.815511 -0.372106 ... 1 1\n", "4 1.679294 0.093956 ... 1 1\n", "\n", "[5 rows x 11 columns]" ] }, "metadata": { "tags": [] }, "execution_count": 61 } ] }, { "cell_type": "code", "metadata": { "id": "yEZD0IwtLEsf" }, "source": [ "X = indian.drop('Liver_disease', axis='columns')\n", "y = indian['Liver_disease']" ], "execution_count": 62, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "q6vqFm8zLEsf" }, "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)" ], "execution_count": 63, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "Frg7OHObLEsg" }, "source": [ "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.ensemble import AdaBoostClassifier\n", "\n", "# Instantiate dt\n", "dt = DecisionTreeClassifier(max_depth=2, random_state=1)\n", "\n", "# Instantiate ada\n", "# Note: scikit-learn 1.2+ renames base_estimator to estimator\n", "ada = AdaBoostClassifier(base_estimator=dt, n_estimators=180, random_state=1)" ], "execution_count": 64, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "J6BW6jw3LEsg" }, "source": [ "### Train the AdaBoost classifier\n", "Now that you've instantiated the AdaBoost classifier ```ada```, it's time to train it. You will also predict the probabilities of obtaining the positive class in the test set.
This can be done as follows:\n", "\n", "Once the classifier ```ada``` is trained, call its ```.predict_proba()``` method with ```X_test``` as the argument, then extract the positive-class probabilities by selecting every row of the second column:\n", "```python\n", "ada.predict_proba(X_test)[:,1]\n", "```\n", "\n" ] }, { "cell_type": "code", "metadata": { "id": "aw7aAYpTLEsh" }, "source": [ "# Fit ada to the training set\n", "ada.fit(X_train, y_train)\n", "\n", "# Compute the probabilities of obtaining the positive class\n", "y_pred_proba = ada.predict_proba(X_test)[:, 1]" ], "execution_count": 65, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "Sm5V_5byLEsh" }, "source": [ "### Evaluate the AdaBoost classifier\n", "Now that you're done training ```ada``` and predicting the probabilities of obtaining the positive class in the test set, it's time to evaluate ```ada```'s ROC AUC score. Recall that the ROC AUC score of a binary classifier can be determined using the ```roc_auc_score()``` function from ```sklearn.metrics```."
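, "\n", "\n", "As a minimal sketch of what the metric measures, the toy example below scores four predictions with `roc_auc_score()`. The labels and scores are hypothetical values chosen for illustration, not taken from the liver dataset. ROC AUC is the fraction of (positive, negative) pairs in which the positive instance receives the higher score, so 0.5 corresponds to a random ranking and 1.0 to a perfect one:\n", "\n", "
```python
from sklearn.metrics import roc_auc_score

# Hypothetical toy data for illustration only
y_true = [0, 0, 1, 1]             # true class labels
y_score = [0.1, 0.4, 0.35, 0.8]   # predicted positive-class probabilities

# Score the ranking induced by the predicted probabilities
toy_auc = roc_auc_score(y_true, y_score)
print('Toy ROC AUC: {:.2f}'.format(toy_auc))
```
\n", "\n", "Three of the four positive-negative pairs are ranked correctly here, so the score is 0.75."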
] }, { "cell_type": "code", "metadata": { "id": "aXrqdSnsLEsi", "outputId": "2285fb3a-df3a-408e-97ef-eed74f51c2d7", "colab": { "base_uri": "https://localhost:8080/" } }, "source": [ "from sklearn.metrics import roc_auc_score\n", "\n", "# Evaluate test-set roc_auc_score\n", "ada_roc_auc = roc_auc_score(y_test, y_pred_proba)\n", "\n", "# Print roc_auc_score\n", "print('ROC AUC score: {:.2f}'.format(ada_roc_auc))" ], "execution_count": 66, "outputs": [ { "output_type": "stream", "text": [ "ROC AUC score: 0.64\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "eUysbPtmLEsi" }, "source": [ "## Gradient Boosting (GB)\n", "- Gradient Boosted Trees\n", " - Sequential correction of predecessor's errors\n", " - Does not tweak the weights of training instances\n", " - Each predictor is trained using its predecessor's residual errors as labels\n", " - Gradient Boosted Trees: a CART is used as a base learner.\n", "- Gradient Boosted Trees for Regression: Training\n", "![gb_train](https://github.com/goodboychan/chans_jupyter/blob/master/_notebooks/image/gb_train.png?raw=1)\n", " - $\\eta$ (shrinkage)\n", " - The prediction of each tree in the ensemble is shrunk by multiplying it by the learning rate $\\eta$" ] }, { "cell_type": "markdown", "metadata": { "id": "nJkEUMUuLEsj" }, "source": [ "### Define the GB regressor\n", "You'll now revisit the [Bike Sharing Demand](https://www.kaggle.com/c/bike-sharing-demand) dataset that was introduced in the previous chapter. Recall that your task is to predict the bike rental demand using historical weather data from the Capital Bikeshare program in Washington, D.C. For this purpose, you'll be using a gradient boosting regressor.\n", "\n", "As a first step, you'll start by instantiating a gradient boosting regressor which you will train in the next exercise."
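, "\n", "\n", "For intuition about what the regressor does internally, the residual-fitting loop from the Gradient Boosting section can be sketched by hand. The block below is a minimal sketch on hypothetical synthetic data (`X_toy`, `y_toy`, `eta` and `n_trees` are illustrative names and values, not part of the exercise), and it is not scikit-learn's actual implementation:\n", "\n", "
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical synthetic data, used only for illustration
rng = np.random.RandomState(0)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=200)

eta = 0.1      # learning rate (shrinkage), assumed value
n_trees = 50   # number of boosting rounds, assumed value

# Start from a constant prediction: the mean of the targets
pred = np.full_like(y_toy, y_toy.mean())
mse_start = np.mean((y_toy - pred) ** 2)

trees = []
for _ in range(n_trees):
    residuals = y_toy - pred                 # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_toy, residuals)               # residuals serve as the labels
    pred += eta * tree.predict(X_toy)        # shrink each tree's contribution
    trees.append(tree)

mse_end = np.mean((y_toy - pred) ** 2)
print('Training MSE: {:.4f} -> {:.4f}'.format(mse_start, mse_end))
```
\n", "\n", "Each round fits a shallow tree to the current residuals and adds only a fraction `eta` of its prediction, so the training error shrinks gradually rather than in one step."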
] }, { "cell_type": "markdown", "metadata": { "id": "8-20Mw5jLEsk" }, "source": [ "- Preprocess" ] }, { "cell_type": "code", "metadata": { "id": "vZ7_ugmdLEsk", "outputId": "2896c645-54c2-4b65-db88-1daa08e8b3aa", "colab": { "base_uri": "https://localhost:8080/", "height": 204 } }, "source": [ "bike = pd.read_csv('/content/drive/MyDrive/colab-notebooks/bikes.csv')\n", "bike.head()" ], "execution_count": 71, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
<table>\n", "  <thead>\n", "    <tr><th></th><th>hr</th><th>holiday</th><th>workingday</th><th>temp</th><th>hum</th><th>windspeed</th><th>cnt</th><th>instant</th><th>mnth</th><th>yr</th><th>Clear to partly cloudy</th><th>Light Precipitation</th><th>Misty</th></tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr><th>0</th><td>0</td><td>0</td><td>0</td><td>0.76</td><td>0.66</td><td>0.0000</td><td>149</td><td>13004</td><td>7</td><td>1</td><td>1</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>1</th><td>1</td><td>0</td><td>0</td><td>0.74</td><td>0.70</td><td>0.1343</td><td>93</td><td>13005</td><td>7</td><td>1</td><td>1</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>2</th><td>2</td><td>0</td><td>0</td><td>0.72</td><td>0.74</td><td>0.0896</td><td>90</td><td>13006</td><td>7</td><td>1</td><td>1</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>3</th><td>3</td><td>0</td><td>0</td><td>0.72</td><td>0.84</td><td>0.1343</td><td>33</td><td>13007</td><td>7</td><td>1</td><td>1</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>4</th><td>4</td><td>0</td><td>0</td><td>0.70</td><td>0.79</td><td>0.1940</td><td>4</td><td>13008</td><td>7</td><td>1</td><td>1</td><td>0</td><td>0</td></tr>\n",
"  </tbody>\n", "</table>
" ], "text/plain": [ " hr holiday workingday ... Clear to partly cloudy Light Precipitation Misty\n", "0 0 0 0 ... 1 0 0\n", "1 1 0 0 ... 1 0 0\n", "2 2 0 0 ... 1 0 0\n", "3 3 0 0 ... 1 0 0\n", "4 4 0 0 ... 1 0 0\n", "\n", "[5 rows x 13 columns]" ] }, "metadata": { "tags": [] }, "execution_count": 71 } ] }, { "cell_type": "code", "metadata": { "id": "QVw26CNsLEsk" }, "source": [ "X = bike.drop('cnt', axis='columns')\n", "y = bike['cnt']" ], "execution_count": 72, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "8_zbzT1OLEsl" }, "source": [ "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)" ], "execution_count": 73, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "ZvQ7sLi_LEsl" }, "source": [ "from sklearn.ensemble import GradientBoostingRegressor\n", "\n", "# Instantiate gb\n", "gb = GradientBoostingRegressor(max_depth=4, n_estimators=200, random_state=2)" ], "execution_count": 74, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "kope2sNXLEsl" }, "source": [ "### Train the GB regressor\n", "You'll now train the gradient boosting regressor ```gb``` that you instantiated in the previous exercise and predict test set labels." ] }, { "cell_type": "code", "metadata": { "id": "-wI0q7SCLEsm" }, "source": [ "# Fit gb to the training set\n", "gb.fit(X_train, y_train)\n", "\n", "# Predict test set labels\n", "y_pred = gb.predict(X_test)" ], "execution_count": 75, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "eiBy4_j0LEsm" }, "source": [ "### Evaluate the GB regressor\n", "Now that the test set predictions are available, you can use them to evaluate the test set Root Mean Squared Error (RMSE) of ```gb```." 
] }, { "cell_type": "code", "metadata": { "id": "2nEm_Yn9LEsm", "outputId": "f3e6d840-b971-419d-b39d-c33b272ea40b", "colab": { "base_uri": "https://localhost:8080/" } }, "source": [ "from sklearn.metrics import mean_squared_error as MSE\n", "\n", "# Compute MSE\n", "mse_test = MSE(y_test, y_pred)\n", "\n", "# Compute RMSE\n", "rmse_test = mse_test ** 0.5\n", "\n", "# Print RMSE\n", "print(\"Test set RMSE of gb: {:.3f}\".format(rmse_test))" ], "execution_count": 76, "outputs": [ { "output_type": "stream", "text": [ "Test set RMSE of gb: 49.537\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "RroLpJW9LEsn" }, "source": [ "## Stochastic Gradient Boosting (SGB)\n", "- Gradient Boosting: Cons\n", " - GB involves an exhaustive search procedure\n", " - Each CART is trained to find the best split points and features.\n", " - May lead to CARTs using the same split points and maybe the same features.\n", "- Stochastic Gradient Boosting\n", " - Each tree is trained on a random subset of rows of the training data.\n", " - The rows (40%-80% of the training set) are sampled without replacement.\n", " - Features are sampled (without replacement) when choosing split points\n", " - Result: further ensemble diversity.\n", " - Effect: adding further variance to the ensemble of trees.\n", "- Stochastic Gradient Boosting: Training\n", "![sgb_train](https://github.com/goodboychan/chans_jupyter/blob/master/_notebooks/image/sgb_train.png?raw=1)\n", " - Residual errors are multiplied by the learning rate $\\eta$ and are fed to the next tree in the ensemble.\n", " - The process is repeated sequentially until all the trees in the ensemble are trained." ] }, { "cell_type": "markdown", "metadata": { "id": "KfgmL5NmLEso" }, "source": [ "### Regression with SGB\n", "As in the exercises from the previous lesson, you'll be working with the [Bike Sharing Demand](https://www.kaggle.com/c/bike-sharing-demand) dataset.
In the following set of exercises, you'll solve this bike count regression problem using stochastic gradient boosting." ] }, { "cell_type": "code", "metadata": { "id": "kDWwiVNMLEsp" }, "source": [ "from sklearn.ensemble import GradientBoostingRegressor\n", "\n", "# Instantiate sgbr\n", "sgbr = GradientBoostingRegressor(max_depth=4, n_estimators=200, subsample=0.9, \n", " max_features=0.75, random_state=2)" ], "execution_count": 77, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "4o9znM-xLEsp" }, "source": [ "### Train the SGB regressor\n", "In this exercise, you'll train the SGBR ```sgbr``` instantiated in the previous exercise and predict the test set labels." ] }, { "cell_type": "code", "metadata": { "id": "mzM1GHeALEsq" }, "source": [ "# Fit sgbr to the training set\n", "sgbr.fit(X_train, y_train)\n", "\n", "# Predict test set labels\n", "y_pred = sgbr.predict(X_test)" ], "execution_count": 79, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "LdfTwZncLEsq" }, "source": [ "### Evaluate the SGB regressor\n", "You have prepared the ground to determine the test set RMSE of ```sgbr``` which you shall evaluate in this exercise." ] }, { "cell_type": "code", "metadata": { "id": "p-9ol17TLEsq", "outputId": "3bb0cc5a-460d-4b5f-9e91-0be26aba66c7", "colab": { "base_uri": "https://localhost:8080/" } }, "source": [ "from sklearn.metrics import mean_squared_error as MSE\n", "\n", "# Compute test set MSE\n", "mse_test = MSE(y_test, y_pred)\n", "\n", "# Compute test set RMSE\n", "rmse_test = mse_test ** 0.5\n", "\n", "# Print rmse_test\n", "print('Test set RMSE of sgbr: {:.3f}'.format(rmse_test))" ], "execution_count": 80, "outputs": [ { "output_type": "stream", "text": [ "Test set RMSE of sgbr: 47.260\n" ], "name": "stdout" } ] } ] }