{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 12 - Ensemble Methods - Boosting\n",
    "\n",
    "\n",
    "by [Alejandro Correa Bahnsen](albahnsen.com/) and [Jesus Solano](https://github.com/jesugome)\n",
    "\n",
    "version 1.5, February 2019\n",
    "\n",
    "## Part of the class [Practical Machine Learning](https://github.com/albahnsen/PracticalMachineLearningClass)\n",
    "\n",
    "\n",
    "\n",
    "This notebook is licensed under a [Creative Commons Attribution-ShareAlike 3.0 Unported License](http://creativecommons.org/licenses/by-sa/3.0/deed.en_US). Special thanks goes to [Kevin Markham](https://github.com/justmarkham))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Why are we learning about ensembling?\n",
    "\n",
    "- Very popular method for improving the predictive performance of machine learning models\n",
    "- Provides a foundation for understanding more sophisticated models"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Part 5: Boosting\n",
    "\n",
    "While boosting is not algorithmically constrained, most boosting algorithms consist of iteratively learning weak classifiers with respect to a distribution and adding them to a final strong classifier. When they are added, they are typically weighted in some way that is usually related to the weak learners' accuracy. After a weak learner is added, the data is reweighted: examples that are misclassified gain weight and examples that are classified correctly lose weight (some boosting algorithms actually decrease the weight of repeatedly misclassified examples, e.g., boost by majority and BrownBoost). Thus, future weak learners focus more on the examples that previous weak learners misclassified. (Wikipedia)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<img src=\"http://vision.cs.chubu.ac.jp/wp/wp-content/uploads/2013/07/OurMethodv81.png\" width=\"900\"/>"
      ],
      "text/plain": [
       "<IPython.core.display.Image object>"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from IPython.display import Image\n",
    "Image(url= \"http://vision.cs.chubu.ac.jp/wp/wp-content/uploads/2013/07/OurMethodv81.png\", width=900)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Adaboost\n",
    "\n",
    "AdaBoost (adaptive boosting) is an ensemble learning algorithm that can be used for classification or regression. Although AdaBoost is more resistant to overfitting than many machine learning algorithms, it is often sensitive to noisy data and outliers.\n",
    "\n",
    "AdaBoost is called adaptive because it uses multiple iterations to generate a single composite strong learner. AdaBoost creates the strong learner (a classifier that is well-correlated to the true classifier) by iteratively adding weak learners (a classifier that is only slightly correlated to the true classifier). During each round of training, a new weak learner is added to the ensemble and a weighting vector is adjusted to focus on examples that were misclassified in previous rounds. The result is a classifier that has higher accuracy than the weak learners’ classifiers."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Algorithm:\n",
    "\n",
    "* Initialize all weights ($w_i$) to 1 / n_samples\n",
    "* Train a classifier $h_t$ using weights\n",
    "* Estimate training error $e_t$\n",
    "* set $alpha_t = log\\left(\\frac{1-e_t}{e_t}\\right)$\n",
    "* Update weights \n",
    "$$w_i^{t+1} = w_i^{t}e^{\\left(\\alpha_t \\mathbf{I}\\left(y_i \\ne h_t(x_t)\\right)\\right)}$$\n",
    "* Repeat while $e_t<0.5$ and $t<T$\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [],
   "source": [
    "# read in and prepare the churn data\n",
    "# Download the dataset\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "url = 'https://raw.githubusercontent.com/albahnsen/PracticalMachineLearningClass/master/datasets/churn.csv'\n",
    "data = pd.read_csv(url)\n",
    "\n",
    "# Create X and y\n",
    "\n",
    "# Select only the numeric features\n",
    "X = data.iloc[:, [1,2,6,7,8,9,10]].astype(np.float)\n",
    "# Convert bools to floats\n",
    "X = X.join((data.iloc[:, [4,5]] == 'no').astype(np.float))\n",
    "\n",
    "y = (data.iloc[:, -1] == 'True.').astype(np.int)\n",
    "\n",
    "from sklearn.model_selection import train_test_split\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)\n",
    "n_samples = X_train.shape[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [],
   "source": [
    "n_estimators = 10\n",
    "weights = pd.DataFrame(index=X_train.index, columns=list(range(n_estimators)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [],
   "source": [
    "t = 0\n",
    "weights[t] = 1 / n_samples"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Train the classifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=1,\n",
       "            max_features=None, max_leaf_nodes=None,\n",
       "            min_impurity_decrease=0.0, min_impurity_split=None,\n",
       "            min_samples_leaf=1, min_samples_split=2,\n",
       "            min_weight_fraction_leaf=0.0, presort=False, random_state=None,\n",
       "            splitter='best')"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.tree import DecisionTreeClassifier\n",
    "trees = []\n",
    "trees.append(DecisionTreeClassifier(max_depth=1))\n",
    "trees[t].fit(X_train, y_train, sample_weight=weights[t].values)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Estimate error"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.13613972234661886"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y_pred_ = trees[t].predict(X_train)\n",
    "error = []\n",
    "error.append(1 - metrics.accuracy_score(y_pred_, y_train))\n",
    "error[t]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1.8477293114995077"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "alpha = []\n",
    "alpha.append(np.log((1 - error[t]) / error[t]))\n",
    "alpha[t]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Update weights"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [],
   "source": [
    "weights[t + 1] = weights[t]\n",
    "filter_ = y_pred_ != y_train"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [],
   "source": [
    "weights.loc[filter_, t + 1] = weights.loc[filter_, t] * np.exp(alpha[t])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Normalize weights"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [],
   "source": [
    "weights[t + 1] = weights[t + 1] / weights[t + 1].sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Iteration 2 - n_estimators**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [],
   "source": [
    "for t in range(1, n_estimators):\n",
    "    trees.append(DecisionTreeClassifier(max_depth=1))\n",
    "    trees[t].fit(X_train, y_train, sample_weight=weights[t].values)\n",
    "    y_pred_ = trees[t].predict(X_train)\n",
    "    error.append(1 - metrics.accuracy_score(y_pred_, y_train))\n",
    "    alpha.append(np.log((1 - error[t]) / error[t]))\n",
    "    weights[t + 1] = weights[t]\n",
    "    filter_ = y_pred_ != y_train\n",
    "    weights.loc[filter_, t + 1] = weights.loc[filter_, t] * np.exp(alpha[t])\n",
    "    weights[t + 1] = weights[t + 1] / weights[t + 1].sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[0.13613972234661886,\n",
       " 0.15629198387819077,\n",
       " 0.8437080161218092,\n",
       " 0.8437080161218092,\n",
       " 0.8437080161218092,\n",
       " 0.8437080161218092,\n",
       " 0.8437080161218092,\n",
       " 0.8437080161218092,\n",
       " 0.8437080161218092,\n",
       " 0.8437080161218092]"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "error"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create classification\n",
    "\n",
    "Only classifiers when error < 0.5"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [],
   "source": [
    "new_n_estimators = np.sum([x<0.5 for x in error])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [],
   "source": [
    "y_pred_all = np.zeros((X_test.shape[0], new_n_estimators))\n",
    "for t in range(new_n_estimators):\n",
    "    y_pred_all[:, t] = trees[t].predict(X_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [],
   "source": [
    "y_pred = (np.sum(y_pred_all * alpha[:new_n_estimators], axis=1) >= 1).astype(np.int)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(0.5105105105105104, 0.8518181818181818)"
      ]
     },
     "execution_count": 47,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "metrics.f1_score(y_pred, y_test.values), metrics.accuracy_score(y_pred, y_test.values)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Using sklearn"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.ensemble import AdaBoostClassifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,\n",
       "          learning_rate=1.0, n_estimators=50, random_state=None)"
      ]
     },
     "execution_count": 49,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clf = AdaBoostClassifier()\n",
    "clf"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(0.29107981220657275, 0.8627272727272727)"
      ]
     },
     "execution_count": 50,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clf.fit(X_train, y_train)\n",
    "y_pred = clf.predict(X_test)\n",
    "metrics.f1_score(y_pred, y_test.values), metrics.accuracy_score(y_pred, y_test.values)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Gradient Boosting"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "GradientBoostingClassifier(criterion='friedman_mse', init=None,\n",
       "              learning_rate=0.1, loss='deviance', max_depth=3,\n",
       "              max_features=None, max_leaf_nodes=None,\n",
       "              min_impurity_decrease=0.0, min_impurity_split=None,\n",
       "              min_samples_leaf=1, min_samples_split=2,\n",
       "              min_weight_fraction_leaf=0.0, n_estimators=100,\n",
       "              n_iter_no_change=None, presort='auto', random_state=None,\n",
       "              subsample=1.0, tol=0.0001, validation_fraction=0.1,\n",
       "              verbose=0, warm_start=False)"
      ]
     },
     "execution_count": 51,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.ensemble import GradientBoostingClassifier\n",
    "\n",
    "clf = GradientBoostingClassifier()\n",
    "clf"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(0.5289256198347108, 0.8963636363636364)"
      ]
     },
     "execution_count": 52,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clf.fit(X_train, y_train)\n",
    "y_pred = clf.predict(X_test)\n",
    "metrics.f1_score(y_pred, y_test.values), metrics.accuracy_score(y_pred, y_test.values)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}