{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Adaptive Boosting (AdaBoost)\n", "\n", "In this notebook, we present the Adaptive Boosting (AdaBoost) algorithm. The\n", "aim is to get intuitions regarding the internal machinery of AdaBoost and\n", "boosting in general.\n", "\n", "We will load the \"penguin\" dataset. We will predict penguin species from the\n", "culmen length and depth features." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "penguins = pd.read_csv(\"../datasets/penguins_classification.csv\")\n", "culmen_columns = [\"Culmen Length (mm)\", \"Culmen Depth (mm)\"]\n", "target_column = \"Species\"\n", "\n", "data, target = penguins[culmen_columns], penguins[target_column]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Note

\n", "

If you want a deeper overview regarding this dataset, you can refer to the\n", "Appendix - Datasets description section at the end of this MOOC.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will purposefully train a shallow decision tree. Since it is shallow, it is\n", "unlikely to overfit and some of the training examples will even be\n", "misclassified." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import seaborn as sns\n", "from sklearn.tree import DecisionTreeClassifier\n", "\n", "palette = [\"tab:red\", \"tab:blue\", \"black\"]\n", "\n", "tree = DecisionTreeClassifier(max_depth=2, random_state=0)\n", "tree.fit(data, target)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can predict on the same dataset and check which samples are misclassified." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "target_predicted = tree.predict(data)\n", "misclassified_samples_idx = np.flatnonzero(target != target_predicted)\n", "data_misclassified = data.iloc[misclassified_samples_idx]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "from sklearn.inspection import DecisionBoundaryDisplay\n", "\n", "DecisionBoundaryDisplay.from_estimator(\n", " tree, data, response_method=\"predict\", cmap=\"RdBu\", alpha=0.5\n", ")\n", "\n", "# plot the original dataset\n", "sns.scatterplot(\n", " data=penguins,\n", " x=culmen_columns[0],\n", " y=culmen_columns[1],\n", " hue=target_column,\n", " palette=palette,\n", ")\n", "# plot the misclassified samples\n", "sns.scatterplot(\n", " data=data_misclassified,\n", " x=culmen_columns[0],\n", " y=culmen_columns[1],\n", " label=\"Misclassified samples\",\n", " marker=\"+\",\n", " s=150,\n", " color=\"k\",\n", ")\n", "\n", "plt.legend(bbox_to_anchor=(1.04, 0.5), loc=\"center left\")\n", "_ = plt.title(\n", " \"Decision tree predictions \\nwith misclassified samples highlighted\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We observe that several samples have been misclassified by the classifier.\n", "\n", "We mentioned that boosting relies on creating a new classifier which tries to\n", "correct these misclassifications. In scikit-learn, learners have a parameter\n", "`sample_weight` which forces it to pay more attention to samples with higher\n", "weights during the training.\n", "\n", "This parameter is set when calling `classifier.fit(X, y,\n", "sample_weight=weights)`. We will use this trick to create a new classifier by\n", "'discarding' all correctly classified samples and only considering the\n", "misclassified samples. Thus, misclassified samples will be assigned a weight\n", "of 1 and well classified samples will be assigned a weight of 0." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sample_weight = np.zeros_like(target, dtype=int)\n", "sample_weight[misclassified_samples_idx] = 1\n", "\n", "tree = DecisionTreeClassifier(max_depth=2, random_state=0)\n", "tree.fit(data, target, sample_weight=sample_weight)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "DecisionBoundaryDisplay.from_estimator(\n", " tree, data, response_method=\"predict\", cmap=\"RdBu\", alpha=0.5\n", ")\n", "sns.scatterplot(\n", " data=penguins,\n", " x=culmen_columns[0],\n", " y=culmen_columns[1],\n", " hue=target_column,\n", " palette=palette,\n", ")\n", "sns.scatterplot(\n", " data=data_misclassified,\n", " x=culmen_columns[0],\n", " y=culmen_columns[1],\n", " label=\"Previously misclassified samples\",\n", " marker=\"+\",\n", " s=150,\n", " color=\"k\",\n", ")\n", "\n", "plt.legend(bbox_to_anchor=(1.04, 0.5), loc=\"center left\")\n", "_ = plt.title(\"Decision tree by changing sample weights\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that the decision function drastically changed. Qualitatively, we see\n", "that the previously misclassified samples are now correctly classified." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "target_predicted = tree.predict(data)\n", "newly_misclassified_samples_idx = np.flatnonzero(target != target_predicted)\n", "remaining_misclassified_samples_idx = np.intersect1d(\n", " misclassified_samples_idx, newly_misclassified_samples_idx\n", ")\n", "\n", "print(\n", " \"Number of samples previously misclassified and \"\n", " f\"still misclassified: {len(remaining_misclassified_samples_idx)}\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, we are making mistakes on previously well classified samples. Thus,\n", "we get the intuition that we should weight the predictions of each classifier\n", "differently, most probably by using the number of mistakes each classifier is\n", "making.\n", "\n", "So we could use the classification error to combine both trees." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ensemble_weight = [\n", " (target.shape[0] - len(misclassified_samples_idx)) / target.shape[0],\n", " (target.shape[0] - len(newly_misclassified_samples_idx)) / target.shape[0],\n", "]\n", "ensemble_weight" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first classifier was 94% accurate and the second one 69% accurate.\n", "Therefore, when predicting a class, we should trust the first classifier\n", "slightly more than the second one. We could use these accuracy values to\n", "weight the predictions of each learner.\n", "\n", "To summarize, boosting learns several classifiers, each of which will focus\n", "more or less on specific samples of the dataset. Boosting is thus different\n", "from bagging: here we never resample our dataset, we just assign different\n", "weights to the original dataset.\n", "\n", "Boosting requires some strategy to combine the learners together:\n", "\n", "* one needs to define a way to compute the weights to be assigned to samples;\n", "* one needs to assign a weight to each learner when making predictions.\n", "\n", "Indeed, we defined a really simple scheme to assign sample weights and learner\n", "weights. 
However, there are statistical theories (like in AdaBoost) for how\n", "these sample and learner weights can be optimally calculated.\n", "\n", "We will use the AdaBoost classifier implemented in scikit-learn and look at\n", "the underlying decision tree classifiers trained." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.ensemble import AdaBoostClassifier\n", "\n", "estimator = DecisionTreeClassifier(max_depth=3, random_state=0)\n", "adaboost = AdaBoostClassifier(\n", " estimator=estimator, n_estimators=3, algorithm=\"SAMME\", random_state=0\n", ")\n", "adaboost.fit(data, target)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for boosting_round, tree in enumerate(adaboost.estimators_):\n", " plt.figure()\n", " # we convert `data` into a NumPy array to avoid a warning raised in scikit-learn\n", " DecisionBoundaryDisplay.from_estimator(\n", " tree,\n", " data.to_numpy(),\n", " response_method=\"predict\",\n", " cmap=\"RdBu\",\n", " alpha=0.5,\n", " )\n", " sns.scatterplot(\n", " x=culmen_columns[0],\n", " y=culmen_columns[1],\n", " hue=target_column,\n", " data=penguins,\n", " palette=palette,\n", " )\n", " plt.legend(bbox_to_anchor=(1.04, 0.5), loc=\"center left\")\n", " _ = plt.title(f\"Decision tree trained at round {boosting_round}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(f\"Weight of each classifier: {adaboost.estimator_weights_}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(f\"Error of each classifier: {adaboost.estimator_errors_}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that AdaBoost learned three different classifiers, each of which\n", "focuses on different samples. Looking at the weights of each learner, we see\n", "that the ensemble gives the highest weight to the first classifier. This\n", "indeed makes sense when we look at the errors of each classifier. The first\n", "classifier also has the highest classification generalization performance.\n", "\n", "While AdaBoost is a nice algorithm to demonstrate the internal machinery of\n", "boosting algorithms, it is not the most efficient. This title is handed to the\n", "gradient-boosting decision tree (GBDT) algorithm, which we will discuss in the\n", "next unit." ] } ], "metadata": { "jupytext": { "main_language": "python" }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 5 }