{ "cells": [ { "cell_type": "code", "execution_count": null, "id": "8aec88d2-8c79-452c-927b-17253668a176", "metadata": { "tags": [ "hide-input" ] }, "outputs": [], "source": [ "# Install the necessary dependencies\n", "\n", "import os\n", "import sys\n", "!{sys.executable} -m pip install --quiet seaborn pandas scikit-learn numpy matplotlib jupyterlab_myst ipython " ] }, { "cell_type": "markdown", "id": "28bfa3a1", "metadata": { "tags": [ "remove-cell" ] }, "source": [ "---\n", "license:\n", " code: MIT\n", " content: CC-BY-4.0\n", "github: https://github.com/ocademy-ai/machine-learning\n", "venue: By Ocademy\n", "open_access: true\n", "bibliography:\n", " - https://raw.githubusercontent.com/ocademy-ai/machine-learning/main/open-machine-learning-jupyter-book/references.bib\n", "---" ] }, { "cell_type": "markdown", "id": "f6ca5fc4", "metadata": {}, "source": [ "# Bagging" ] }, { "cell_type": "markdown", "id": "0164eba3", "metadata": {}, "source": [ "## Bootstrapping\n", "\n", "*Bagging* (also known as [bootstrap aggregation](https://en.wikipedia.org/wiki/Bootstrap_aggregating)) is one of the first and most basic ensemble techniques. It was proposed by [Leo Breiman](https://en.wikipedia.org/wiki/Leo_Breiman) in 1994. Bagging is based on the statistical method of [bootstrapping](https://en.wikipedia.org/wiki/Bootstrapping_%28statistics%29), which makes it feasible to estimate many statistics of complex models.\n", "\n", "The bootstrap method goes as follows. Let there be a sample $\\large X$ of size $\\large N$. We can make a new sample from the original sample by drawing $\\large N$ elements from the latter randomly and uniformly, with replacement. In other words, we select a random element from the original sample of size $\\large N$ and do this $\\large N$ times. All elements are equally likely to be selected, so on each draw every element is chosen with equal probability $\\large \\frac{1}{N}$.\n", "\n", "Let's say we are drawing balls from a bag one at a time. 
At each step, the selected ball is put back into the bag, so the next ball is again drawn with equal probability from the same $\\large N$ balls. Note that, because we put the balls back, there may be duplicates in the new sample. Let's call this new sample $\\large X_1$.\n", "\n", "By repeating this procedure $\\large M$ times, we create $\\large M$ *bootstrap samples* $\\large X_1, \\dots, X_M$. In the end, we have enough samples to estimate various statistics of the original distribution." ] }, { "cell_type": "markdown", "id": "f878eb84", "metadata": {}, "source": [ ":::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/ml-advanced/ensemble-learning/bagging/Process_of_bootstrap.png\n", "---\n", "name: 'process of bootstrap'\n", "width: 90%\n", "---\n", "Process of bootstrap\n", ":::" ] }, { "cell_type": "markdown", "id": "44186e16", "metadata": {}, "source": [ "For our example, we'll use the familiar `telecom_churn` dataset. Previously, when we discussed feature importance, we saw that one of the most important features in this dataset is the number of calls to customer service. Let's visualize the data and look at the distribution of this feature." ] }, { "cell_type": "code", "execution_count": 2, "id": "92d03822", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import seaborn as sns\n", "from matplotlib import pyplot as plt" ] }, { "cell_type": "code", "execution_count": 3, "id": "65b1b554", "metadata": {}, "outputs": [], "source": [ "# Graphics in retina format are sharper and more legible\n", "%config InlineBackend.figure_format = 'retina'" ] }, { "cell_type": "code", "execution_count": 4, "id": "c336331c", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" }, "tags": [ "output-scoll" ] }, "outputs": [ { "data": { "text/html": [ "
|  | State | Account length | Area code | International plan | Voice mail plan | Number vmail messages | Total day minutes | Total day calls | Total day charge | Total eve minutes | Total eve calls | Total eve charge | Total night minutes | Total night calls | Total night charge | Total intl minutes | Total intl calls | Total intl charge | Customer service calls | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KS | 128 | 415 | No | Yes | 25 | 265.1 | 110 | 45.07 | 197.4 | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.70 | 1 | False |
| 1 | OH | 107 | 415 | No | Yes | 26 | 161.6 | 123 | 27.47 | 195.5 | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.70 | 1 | False |
| 2 | NJ | 137 | 415 | No | No | 0 | 243.4 | 114 | 41.38 | 121.2 | 110 | 10.30 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | False |
| 3 | OH | 84 | 408 | Yes | No | 0 | 299.4 | 71 | 50.90 | 61.9 | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | False |
| 4 | OK | 75 | 415 | Yes | No | 0 | 166.7 | 113 | 28.34 | 148.3 | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | False |