{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Implementation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This section demonstrates how to fit bagging, random forest, and boosting models using `scikit-learn`. We will again use the {doc}`penguins ` dataset for classification and the {doc}`tips ` dataset for regression." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "## Import packages\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Bagging and Random Forests" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recall that bagging and random forests can handle both classification and regression tasks. For this example, we will classify species in the `penguins` dataset. Because `scikit-learn` trees do not currently support categorical predictors, we must first convert those to dummy variables." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "## Load penguins data\n", "penguins = sns.load_dataset('penguins')\n", "penguins = penguins.dropna().reset_index(drop = True)\n", "X = penguins.drop(columns = 'species')\n", "y = penguins['species']\n", "\n", "## Train-test split\n", "np.random.seed(1)\n", "test_frac = 0.25\n", "test_size = int(len(y)*test_frac)\n", "test_idxs = np.random.choice(np.arange(len(y)), test_size, replace = False)\n", "X_train = X.drop(test_idxs)\n", "y_train = y.drop(test_idxs)\n", "X_test = X.loc[test_idxs]\n", "y_test = y.loc[test_idxs]\n", "\n", "## Get dummies (done per split; assumes each split contains every category level)\n", "X_train = pd.get_dummies(X_train, drop_first = True)\n", "X_test = pd.get_dummies(X_test, drop_first = True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Bagging" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A simple bagging classifier is fit below. 
The most important arguments are `n_estimators` and `base_estimator` (renamed `estimator` in scikit-learn 1.2 and later), which determine the number and type of weak learners the bagging model should use. The default `base_estimator` is a decision tree, though this can be changed as in the second example below, which uses Naive Bayes estimators. " ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.963855421686747\n", "0.9156626506024096\n" ] } ], "source": [ "from sklearn.ensemble import BaggingClassifier\n", "from sklearn.naive_bayes import GaussianNB\n", "\n", "## Decision Tree bagger\n", "bagger1 = BaggingClassifier(n_estimators = 50, random_state = 123)\n", "bagger1.fit(X_train, y_train)\n", "\n", "## Naive Bayes bagger\n", "bagger2 = BaggingClassifier(base_estimator = GaussianNB(), random_state = 123)\n", "bagger2.fit(X_train, y_train)\n", "\n", "## Evaluate (test-set accuracy)\n", "print(np.mean(bagger1.predict(X_test) == y_test))\n", "print(np.mean(bagger2.predict(X_test) == y_test))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Random Forests" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An example of a random forest in `scikit-learn` is given below. The most important arguments to the random forest are the number of estimators (decision trees), `max_features` (the number of predictors to consider at each split), and any chosen parameters for the decision trees (such as the maximum depth). Guidelines for setting each of these parameters are given below. \n", "\n", "- `n_estimators`: In general, the more base estimators the better, though there are diminishing marginal returns. While increasing the number of base estimators does not risk overfitting, it eventually provides no benefit. \n", "- `max_features`: This argument is set by default to the square root of the number of total features (which is made explicit in the example below). 
If this value equals the number of total features, we are left with a bagging model. Lowering this value lowers the amount of correlation between trees but also prevents the base estimators from learning potentially valuable information. \n", "- Decision tree parameters: These parameters are generally left untouched. This allows the individual decision trees to grow deep, increasing variance but decreasing bias. The variance is then reduced by aggregating the individual trees in the ensemble.\n" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.9879518072289156\n" ] } ], "source": [ "from sklearn.ensemble import RandomForestClassifier\n", "## max_features is set explicitly to the square root of the number of training features\n", "rf = RandomForestClassifier(n_estimators = 100, max_features = int(np.sqrt(X_train.shape[1])), random_state = 123)\n", "rf.fit(X_train, y_train)\n", "print(np.mean(rf.predict(X_test) == y_test))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Boosting" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{note}\n", "Note that the `AdaBoostClassifier` from `scikit-learn` uses a slightly different algorithm than the one introduced in the {doc}`concept section `, though results should be similar. The `AdaBoostRegressor` class in `scikit-learn` uses the same algorithm we introduced: *AdaBoost.R2*.\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### AdaBoost Classification" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `AdaBoostClassifier` in `scikit-learn` is actually able to handle multiclass target variables, but for consistency, let's use the same binary target we did in our AdaBoost construction: whether the penguin's species is *Adelie*."
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "## Make binary\n", "y_train = (y_train == 'Adelie')\n", "y_test = (y_test == 'Adelie')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can then fit the classifier with the `AdaBoostClassifier` class as below. Again, we first convert categorical predictors to dummy variables. By default, the classifier uses 50 decision trees, each with a max depth of 1, as the weak learners. \n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.9759036144578314" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.ensemble import AdaBoostClassifier\n", "\n", "## Get dummies\n", "X_train = pd.get_dummies(X_train, drop_first = True)\n", "X_test = pd.get_dummies(X_test, drop_first = True)\n", "\n", "## Build model\n", "abc = AdaBoostClassifier(n_estimators = 50)\n", "abc.fit(X_train, y_train)\n", "y_test_hat = abc.predict(X_test)\n", "\n", "## Evaluate (test-set accuracy)\n", "np.mean(y_test_hat == y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A different weak learner can easily be used in place of a decision tree. The example below uses logistic regression. " ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LogisticRegression\n", "abc = AdaBoostClassifier(base_estimator = LogisticRegression(max_iter = 1000))\n", "abc.fit(X_train, y_train);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### AdaBoost Regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "AdaBoost regression is implemented almost identically in `scikit-learn`. An example with the `tips` dataset is shown below."
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "## Load tips data\n", "tips = sns.load_dataset('tips')\n", "tips = tips.dropna().reset_index(drop = True)\n", "X = tips.drop(columns = 'tip')\n", "y = tips['tip']\n", "\n", "## Train-test split\n", "np.random.seed(1)\n", "test_frac = 0.25\n", "test_size = int(len(y)*test_frac)\n", "test_idxs = np.random.choice(np.arange(len(y)), test_size, replace = False)\n", "X_train = X.drop(test_idxs)\n", "y_train = y.drop(test_idxs)\n", "X_test = X.loc[test_idxs]\n", "y_test = y.loc[test_idxs]" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "from sklearn.ensemble import AdaBoostRegressor\n", "\n", "## Get dummies\n", "X_train = pd.get_dummies(X_train, drop_first = True)\n", "X_test = pd.get_dummies(X_test, drop_first = True)\n", "\n", "## Build model\n", "abr = AdaBoostRegressor(n_estimators = 50)\n", "abr.fit(X_train, y_train)\n", "y_test_hat = abr.predict(X_test)\n", "\n", "## Visualize predictions (keyword arguments, since newer seaborn versions reject positional x and y)\n", "fig, ax = plt.subplots(figsize = (7, 5))\n", "sns.scatterplot(x = y_test, y = y_test_hat, ax = ax)\n", "ax.set(xlabel = r'$y$', ylabel = r'$\\hat{y}$', title = r'Test Sample $y$ vs. $\\hat{y}$')\n", "sns.despine()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }