{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# The Bias-Variance Tradeoff\n", "> A Summary of lecture \"Machine Learning with Tree-Based Models in Python\n", "\", via datacamp\n", "\n", "- toc: true \n", "- badges: true\n", "- comments: true\n", "- author: Chanseok Kang\n", "- categories: [Python, Datacamp, Machine_Learning]\n", "- image: images/ensemble.png" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generalization Error\n", "- Supervised Learning - Under the Hood\n", " - supervised learning: $y = f(x)$, $f$ is unknown.\n", " - Goals\n", " - Find a model $\\hat{f}$ that best approximates: $f: \\hat{f} \\approx f$\n", " - $\\hat{f}$ can be Logistic Regression, Decision Tree, Neural Network,...\n", " - Discard noise as much as possible\n", " - **End goal**: $\\hat{f}$ should achieve a low predictive error on unseen datasets.\n", "- Difficulties in Approximating $f$\n", " - Overfitting: $\\hat{f}(x)$ fits the training set noise.\n", " - Underfitting: $\\hat{f}$ is not flexible enough to approximate $f$.\n", "- Generalization Error\n", " - Generalization Error of $\\hat{f}$: Does $\\hat{f}$ generalize well on unseen data?\n", " - It can be decomposed as follows:\n", " $$ \\hat{f} = \\text{bias}^2 + \\text{variance} +\\text{irreducible error} $$\n", " - Bias: error term that tells you, on average, how much $\\hat{f} \\neq f$.\n", "![bias](image/bias.png)\n", " - Variance: tells you how much $\\hat{f}$ is inconsistent over different training sets\n", "![variance](image/variance.png)\n", "- Model Complexity\n", " - Model Complexity: sets the flexibility of $\\hat{f}$\n", " - Example: Maximum tree depth, Minimum samples per leaf, ...\n", "- Bias-Variance Tradeoff\n", "![](image/ge.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Diagnose bias and variance problems\n", "- Estimating the Generalization Error\n", " - How do we estimate the generalization error of a model?\n", " - Cannot be done directly because:\n", " - $f$ is unknown\n", " - usually you only have one dataset\n", " - noise is unpredictable.\n", " - Solution\n", " - Split the data to training and test sets\n", " - fit $\\hat{f}$ to the training set\n", " - evaluate the error of $\\hat{f}$ on the unseen test set\n", " - generalization error of $\\hat{f} \\approx$ test set error of $\\hat{f}$\n", "- Better model Evaluation with Cross-Validation\n", " - Test set should not be touched until we are confident about $\\hat{f}$'s performance\n", " - Evaluating $\\hat{f}$ on training set: biased estimate, $\\hat{f}$ has already seen all training points\n", " - Solution: Cross-Validation (CV)\n", " - K-Fold CV\n", " - Hold-Out CV\n", "- K-Fold CV\n", " $$\\text{CV error} = \\dfrac{E_1 + \\cdots + E_{10}}{10} $$\n", "- Diagnose Variance Problems\n", " - If $\\hat{f}$ suffers from **high variance**: CV error of $\\hat{f} >$ training set error of $\\hat{f}$\n", " - $\\hat{f}$ is said to overfit the training set. To remedy overfitting:\n", " - Decrease model complexity\n", " - Gather more data, ...\n", "- Diagnose Bias Problems\n", " - if $\\hat{f}$ suffers from high bias: CV error of $\\hat{f} \\approx$ training set error of $\\hat{f} >>$ desired error.\n", " - $\\hat{f}$ is said to underfit the training set. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Instantiate the model\n", "In the following set of exercises, you'll diagnose the bias and variance problems of a regression tree. The regression tree you'll define in this exercise will be used to predict the mpg consumption of cars from the auto dataset using all available features.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Preprocess" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " mpg displ hp weight accel origin size\n", "0 18.0 250.0 88 3139 14.5 US 15.0\n", "1 9.0 304.0 193 4732 18.5 US 20.0\n", "2 36.1 91.0 60 1800 16.4 Asia 10.0\n", "3 18.5 250.0 98 3525 19.0 US 15.0\n", "4 34.3 97.0 78 2188 15.8 Europe 10.0" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mpg = pd.read_csv('./dataset/auto.csv')\n", "mpg.head()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
mpgdisplhpweightaccelsizeorigin_Asiaorigin_Europeorigin_US
018.0250.088313914.515.0001
19.0304.0193473218.520.0001
236.191.060180016.410.0100
318.5250.098352519.015.0001
434.397.078218815.810.0010
\n", "
" ], "text/plain": [ " mpg displ hp weight accel size origin_Asia origin_Europe \\\n", "0 18.0 250.0 88 3139 14.5 15.0 0 0 \n", "1 9.0 304.0 193 4732 18.5 20.0 0 0 \n", "2 36.1 91.0 60 1800 16.4 10.0 1 0 \n", "3 18.5 250.0 98 3525 19.0 15.0 0 0 \n", "4 34.3 97.0 78 2188 15.8 10.0 0 1 \n", "\n", " origin_US \n", "0 1 \n", "1 1 \n", "2 0 \n", "3 1 \n", "4 0 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mpg = pd.get_dummies(mpg)\n", "mpg.head()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "X = mpg.drop('mpg', axis='columns')\n", "y = mpg['mpg']" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "from sklearn.tree import DecisionTreeRegressor\n", "\n", "# Set SEED for reproducibility\n", "SEED = 1\n", "\n", "# Split the data into 70% train and 30% test\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=SEED)\n", "\n", "# Instantiate a DecisionTreeRegressor dt\n", "dt = DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.26, random_state=SEED)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Evaluate the 10-fold CV error\n", "In this exercise, you'll evaluate the 10-fold CV Root Mean Squared Error (RMSE) achieved by the regression tree ```dt``` that you instantiated in the previous exercise.\n", "\n", "Note that since ```cross_val_score``` has only the option of evaluating the negative MSEs, its output should be multiplied by negative one to obtain the MSEs. The CV RMSE can then be obtained by computing the square root of the average MSE." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CV RMSE: 5.14\n" ] } ], "source": [ "from sklearn.model_selection import cross_val_score\n", "\n", "# Compute the array containing the 10-folds CV MSEs\n", "MSE_CV_scores = - cross_val_score(dt, X_train, y_train, cv=10, \n", " scoring='neg_mean_squared_error', n_jobs=-1)\n", "\n", "# Compute the 10-folds CV RMSE\n", "RMSE_CV = (MSE_CV_scores.mean()) ** 0.5\n", "\n", "# Print RMSE_CV\n", "print('CV RMSE: {:.2f}'.format(RMSE_CV))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Evaluate the training error\n", "You'll now evaluate the training set RMSE achieved by the regression tree dt that you instantiated in a previous exercise.\n", "\n", "Note that in scikit-learn, the MSE of a model can be computed as follows:\n", "```python\n", "MSE_model = mean_squared_error(y_true, y_predicted)\n", "```\n", "where we use the function mean_squared_error from the ```metrics``` module and pass it the true labels ```y_true``` as a first argument, and the predicted labels from the model ```y_predicted``` as a second argument." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train RMSE: 5.15\n" ] } ], "source": [ "from sklearn.metrics import mean_squared_error as MSE\n", "\n", "# Fit dt to the training set\n", "dt.fit(X_train, y_train)\n", "\n", "# Predict the labels of the training set\n", "y_pred_train = dt.predict(X_train)\n", "\n", "# Evaluate the training set RMSE of dt\n", "RMSE_train = (MSE(y_train, y_pred_train)) ** 0.5\n", "\n", "# Print RMSE_train\n", "print(\"Train RMSE: {:.2f}\".format(RMSE_train))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Ensemble Learning\n", "- Advantages of CARTs\n", " - Simple to understand\n", " - Simple to interpret\n", " - Easy to use\n", " - Flexibility: ability to describe non-linear dependencies.\n", " - Preprocessing: no need to standardize or normalize features.\n", "- Limitations of CARTs\n", " - Classification: can only produce orthogonal decision boundaries\n", " - Sensitive to small variations in the training set\n", " - High variance: unconstrained CARTs may overfit the training set\n", " - Solution: **ensemble learning**\n", "- Ensemble Learning\n", " - Train different models on the same dataset.\n", " - Let each model make its predictions\n", " - Meta-Model: aggregates predictionsof individual models\n", " - Final prediction: more robust and less prone to errors\n", " - Best results: models are skillfull in different ways.\n", "![ensemble](image/ensemble.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Define the ensemble\n", "In the following set of exercises, you'll work with the [Indian Liver Patient Dataset](https://www.kaggle.com/jeevannagaraj/indian-liver-patient-dataset) from the UCI Machine learning repository.\n", "\n", "In this exercise, you'll instantiate three classifiers to predict whether a patient suffers from a liver disease using all the features present in the dataset." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Preprocess" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Age_stdTotal_Bilirubin_stdDirect_Bilirubin_stdAlkaline_Phosphotase_stdAlamine_Aminotransferase_stdAspartate_Aminotransferase_stdTotal_Protiens_stdAlbumin_stdAlbumin_and_Globulin_Ratio_stdIs_male_stdLiver_disease
01.247403-0.420320-0.495414-0.428870-0.355832-0.3191110.2937220.203446-0.14739001
11.0623061.2189361.4235181.675083-0.093573-0.0359620.9396550.077462-0.64846111
21.0623060.6403750.9260170.816243-0.115428-0.1464590.4782740.203446-0.17870711
30.815511-0.372106-0.388807-0.449416-0.366760-0.3122050.2937220.3294310.16578011
41.6792940.0939560.179766-0.395996-0.295731-0.1775370.755102-0.930414-1.71323711
\n", "
" ], "text/plain": [ " Age_std Total_Bilirubin_std Direct_Bilirubin_std \\\n", "0 1.247403 -0.420320 -0.495414 \n", "1 1.062306 1.218936 1.423518 \n", "2 1.062306 0.640375 0.926017 \n", "3 0.815511 -0.372106 -0.388807 \n", "4 1.679294 0.093956 0.179766 \n", "\n", " Alkaline_Phosphotase_std Alamine_Aminotransferase_std \\\n", "0 -0.428870 -0.355832 \n", "1 1.675083 -0.093573 \n", "2 0.816243 -0.115428 \n", "3 -0.449416 -0.366760 \n", "4 -0.395996 -0.295731 \n", "\n", " Aspartate_Aminotransferase_std Total_Protiens_std Albumin_std \\\n", "0 -0.319111 0.293722 0.203446 \n", "1 -0.035962 0.939655 0.077462 \n", "2 -0.146459 0.478274 0.203446 \n", "3 -0.312205 0.293722 0.329431 \n", "4 -0.177537 0.755102 -0.930414 \n", "\n", " Albumin_and_Globulin_Ratio_std Is_male_std Liver_disease \n", "0 -0.147390 0 1 \n", "1 -0.648461 1 1 \n", "2 -0.178707 1 1 \n", "3 0.165780 1 1 \n", "4 -1.713237 1 1 " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "indian = pd.read_csv('./dataset/indian_liver_patient_preprocessed.csv', index_col=0)\n", "indian.head()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "X = indian.drop('Liver_disease', axis='columns')\n", "y = indian['Liver_disease']" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=SEED)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LogisticRegression\n", "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.neighbors import KNeighborsClassifier as KNN\n", "\n", "# Set seed for reproducibility\n", "SEED = 1\n", "\n", "# Instantiate lr\n", "lr = LogisticRegression(random_state=SEED)\n", "\n", "# Instantiate knn\n", "knn = KNN(n_neighbors=27)\n", "\n", "# Instantiate dt\n", "dt = DecisionTreeClassifier(min_samples_leaf=0.13, random_state=SEED)\n", "\n", "# Define the list classifiers\n", "classifiers = [\n", " ('Logistic Regression', lr),\n", " ('K Nearest Neighbors', knn),\n", " ('Classification Tree', dt)\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Evaluate individual classifiers\n", "In this exercise you'll evaluate the performance of the models in the list classifiers that we defined in the previous exercise. You'll do so by fitting each classifier on the training set and evaluating its test set accuracy." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Logistic Regression : 0.759\n", "K Nearest Neighbors : 0.701\n", "Classification Tree : 0.730\n" ] } ], "source": [ "from sklearn.metrics import accuracy_score\n", "\n", "# Iterate over the pre-defined list of classifiers\n", "for clf_name, clf in classifiers:\n", " # Fit clf to the training set\n", " clf.fit(X_train, y_train)\n", " \n", " # Predict y_pred\n", " y_pred = clf.predict(X_test)\n", " \n", " # Calculate accuracy\n", " accuracy = accuracy_score(y_test, y_pred)\n", " \n", " # Evaluate clf's accuracy on the test set\n", " print('{:s} : {:.3f}'.format(clf_name, accuracy))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Better performance with a Voting Classifier\n", "Finally, you'll evaluate the performance of a voting classifier that takes the outputs of the models defined in the list classifiers and assigns labels by majority voting." 
] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Voting Classifier: 0.770\n" ] } ], "source": [ "from sklearn.ensemble import VotingClassifier\n", "\n", "# Instantiate a VotingClassifier vc\n", "vc = VotingClassifier(estimators=classifiers)\n", "\n", "# Fit vs to the training set\n", "vc.fit(X_train, y_train)\n", "\n", "# Evaluate the test set predictions\n", "y_pred = vc.predict(X_test)\n", "\n", "# Calculate accuracy score\n", "accuracy = accuracy_score(y_test, y_pred)\n", "print('Voting Classifier: {:.3f}'.format(accuracy))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }