{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 5. Resampling Methods – Applied\n",
    "\n",
    "Excercises from **Chapter 5** of [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/) by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani.\n",
    "\n",
    "I've elected to use Python instead of R."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.metrics import confusion_matrix\n",
    "from IPython.display import display, HTML\n",
    "import statsmodels.api as sm\n",
    "import statsmodels.formula.api as smf\n",
    "from sklearn import datasets\n",
    "from scipy import stats"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "def confusion_table(confusion_mtx):\n",
    "    \"\"\"Renders a nice confusion table with labels\"\"\"\n",
    "    confusion_df = pd.DataFrame({'y_pred=0': np.append(confusion_mtx[:, 0], confusion_mtx.sum(axis=0)[0]),\n",
    "                                 'y_pred=1': np.append(confusion_mtx[:, 1], confusion_mtx.sum(axis=0)[1]),\n",
    "                                 'Total': np.append(confusion_mtx.sum(axis=1), ''),\n",
    "                                 '': ['y=0', 'y=1', 'Total']}).set_index('')\n",
    "    return confusion_df\n",
    "\n",
    "def total_error_rate(confusion_matrix):\n",
    "    \"\"\"Derive total error rate from confusion matrix\"\"\"\n",
    "    return 1 - np.trace(confusion_mtx) / np.sum(confusion_mtx)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5. In Chapter 4, we used logistic regression to predict the probability of default using income and balance on the Default data set. We will now estimate the test error of this logistic regression model using the validation set approach. Do not forget to set a random seed before beginning your analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>balance</th>\n",
       "      <th>income</th>\n",
       "      <th>default_Yes</th>\n",
       "      <th>student_Yes</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>729.526495</td>\n",
       "      <td>44361.625074</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>817.180407</td>\n",
       "      <td>12106.134700</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1073.549164</td>\n",
       "      <td>31767.138947</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>529.250605</td>\n",
       "      <td>35704.493935</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>785.655883</td>\n",
       "      <td>38463.495879</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       balance        income  default_Yes  student_Yes\n",
       "0   729.526495  44361.625074          0.0          0.0\n",
       "1   817.180407  12106.134700          0.0          1.0\n",
       "2  1073.549164  31767.138947          0.0          0.0\n",
       "3   529.250605  35704.493935          0.0          0.0\n",
       "4   785.655883  38463.495879          0.0          0.0"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "default_df = pd.read_csv('./data/Default.csv', index_col='Unnamed: 0')\n",
    "default_df = default_df.reset_index().drop('index', axis=1)\n",
    "\n",
    "# Check for missing\n",
    "assert default_df.isna().sum().sum() == 0\n",
    "\n",
    "# Rationalise types\n",
    "default_df = pd.get_dummies(default_df, dtype=np.float64).drop(['default_No', 'student_No'], axis=1)\n",
    "\n",
    "display(default_df.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### (a) Fit a logistic regression model that uses income and balance to predict default.\n",
    "### (b) Using the validation set approach, estimate the test error of this model. In order to do this, you must perform the following steps:\n",
    "\n",
    "- i. Split the sample set into a training set and a validation set.\n",
    "- ii. Fit a multiple logistic regression model using only the training observations.\n",
    "- iii. Obtain a prediction of default status for each individual in the validation set by computing the posterior probability of default for that individual, and classifying the individual to the default category if the posterior probability is greater than 0.5.\n",
    "- iv. Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified.\n",
    "\n",
    "### (c) Repeat the process in (b) three times, using three different splits of the observations into a training set and a validation set. Com- ment on the results obtained."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<h3>Random seed = 1</h3>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>y_pred=0</th>\n",
       "      <th>y_pred=1</th>\n",
       "      <th>Total</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>y=0</th>\n",
       "      <td>4846</td>\n",
       "      <td>0</td>\n",
       "      <td>4846</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>y=1</th>\n",
       "      <td>164</td>\n",
       "      <td>0</td>\n",
       "      <td>164</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Total</th>\n",
       "      <td>5010</td>\n",
       "      <td>0</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       y_pred=0  y_pred=1 Total\n",
       "                               \n",
       "y=0        4846         0  4846\n",
       "y=1         164         0   164\n",
       "Total      5010         0      "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "total_error_rate: 3.2735%\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<h3>Random seed = 2</h3>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>y_pred=0</th>\n",
       "      <th>y_pred=1</th>\n",
       "      <th>Total</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>y=0</th>\n",
       "      <td>4691</td>\n",
       "      <td>1</td>\n",
       "      <td>4692</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>y=1</th>\n",
       "      <td>176</td>\n",
       "      <td>0</td>\n",
       "      <td>176</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Total</th>\n",
       "      <td>4867</td>\n",
       "      <td>1</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       y_pred=0  y_pred=1 Total\n",
       "                               \n",
       "y=0        4691         1  4692\n",
       "y=1         176         0   176\n",
       "Total      4867         1      "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "total_error_rate: 3.636%\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<h3>Random seed = 3</h3>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>y_pred=0</th>\n",
       "      <th>y_pred=1</th>\n",
       "      <th>Total</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>y=0</th>\n",
       "      <td>4773</td>\n",
       "      <td>1</td>\n",
       "      <td>4774</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>y=1</th>\n",
       "      <td>163</td>\n",
       "      <td>0</td>\n",
       "      <td>163</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Total</th>\n",
       "      <td>4936</td>\n",
       "      <td>1</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       y_pred=0  y_pred=1 Total\n",
       "                               \n",
       "y=0        4773         1  4774\n",
       "y=1         163         0   163\n",
       "Total      4936         1      "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "total_error_rate: 3.3219%\n"
     ]
    }
   ],
   "source": [
    "for s in range(1,4):\n",
    "    display(HTML('<h3>Random seed = {}</h3>'.format(s)))\n",
    "    # Create index for 50% holdout set\n",
    "    np.random.seed(s)\n",
    "    train = np.random.rand(len(default_df)) < 0.5\n",
    "    \n",
    "    response   = 'default_Yes'\n",
    "    predictors = ['income', 'balance']\n",
    "    \n",
    "    X_train = np.array(default_df[train][predictors])\n",
    "    X_test  = np.array(default_df[~train][predictors])\n",
    "    y_train = np.array(default_df[train][response])\n",
    "    y_test  = np.array(default_df[~train][response])\n",
    "    \n",
    "    # Logistic regression\n",
    "    logit       = LogisticRegression()\n",
    "    model_logit = logit.fit(X_train, y_train)\n",
    "    \n",
    "    # Predict\n",
    "    y_pred = model_logit.predict(X_test)\n",
    "    \n",
    "    # Analysis\n",
    "    confusion_mtx = confusion_matrix(y_test, y_pred)\n",
    "    display(confusion_table(confusion_mtx))\n",
    "    \n",
    "    total_error_rate_pct = np.around(total_error_rate(confusion_mtx) * 100, 4)\n",
    "    print('total_error_rate: {}%'.format(total_error_rate_pct))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### (d) Now consider a logistic regression model that predicts the probability of default using income, balance, and a dummy variable for student. Estimate the test error for this model using the validation set approach. Comment on whether or not including a dummy variable for student leads to a reduction in the test error rate."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<h3>Random seed = 1</h3>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>y_pred=0</th>\n",
       "      <th>y_pred=1</th>\n",
       "      <th>Total</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>y=0</th>\n",
       "      <td>4846</td>\n",
       "      <td>0</td>\n",
       "      <td>4846</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>y=1</th>\n",
       "      <td>164</td>\n",
       "      <td>0</td>\n",
       "      <td>164</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Total</th>\n",
       "      <td>5010</td>\n",
       "      <td>0</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       y_pred=0  y_pred=1 Total\n",
       "                               \n",
       "y=0        4846         0  4846\n",
       "y=1         164         0   164\n",
       "Total      5010         0      "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "total_error_rate: 3.2735%\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<h3>Random seed = 2</h3>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>y_pred=0</th>\n",
       "      <th>y_pred=1</th>\n",
       "      <th>Total</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>y=0</th>\n",
       "      <td>4688</td>\n",
       "      <td>4</td>\n",
       "      <td>4692</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>y=1</th>\n",
       "      <td>172</td>\n",
       "      <td>4</td>\n",
       "      <td>176</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Total</th>\n",
       "      <td>4860</td>\n",
       "      <td>8</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       y_pred=0  y_pred=1 Total\n",
       "                               \n",
       "y=0        4688         4  4692\n",
       "y=1         172         4   176\n",
       "Total      4860         8      "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "total_error_rate: 3.6154%\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<h3>Random seed = 3</h3>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>y_pred=0</th>\n",
       "      <th>y_pred=1</th>\n",
       "      <th>Total</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>y=0</th>\n",
       "      <td>4773</td>\n",
       "      <td>1</td>\n",
       "      <td>4774</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>y=1</th>\n",
       "      <td>163</td>\n",
       "      <td>0</td>\n",
       "      <td>163</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Total</th>\n",
       "      <td>4936</td>\n",
       "      <td>1</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       y_pred=0  y_pred=1 Total\n",
       "                               \n",
       "y=0        4773         1  4774\n",
       "y=1         163         0   163\n",
       "Total      4936         1      "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "total_error_rate: 3.3219%\n"
     ]
    }
   ],
   "source": [
    "for s in range(1,4):\n",
    "    display(HTML('<h3>Random seed = {}</h3>'.format(s)))\n",
    "    # Create index for 50% holdout set\n",
    "    np.random.seed(s)\n",
    "    train = np.random.rand(len(default_df)) < 0.5\n",
    "    \n",
    "    response   = 'default_Yes'\n",
    "    predictors = ['income', 'balance', 'student_Yes']\n",
    "    \n",
    "    X_train = np.array(default_df[train][predictors])\n",
    "    X_test  = np.array(default_df[~train][predictors])\n",
    "    y_train = np.array(default_df[train][response])\n",
    "    y_test  = np.array(default_df[~train][response])\n",
    "    \n",
    "    # Logistic regression\n",
    "    logit       = LogisticRegression()\n",
    "    model_logit = logit.fit(X_train, y_train)\n",
    "    \n",
    "    # Predict\n",
    "    y_pred = model_logit.predict(X_test)\n",
    "    \n",
    "    # Analysis\n",
    "    confusion_mtx = confusion_matrix(y_test, y_pred)\n",
    "    display(confusion_table(confusion_mtx))\n",
    "    \n",
    "    total_error_rate_pct = np.around(total_error_rate(confusion_mtx) * 100, 4)\n",
    "    print('total_error_rate: {}%'.format(total_error_rate_pct))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Comment**\n",
    "\n",
    "It is difficult to discern if the student predictor has improved the model because of the variation in results.\n",
    "\n",
    "\n",
    "### 6. We continue to consider the use of a logistic regression model to predict the probability of default using income and balance on the Default data set. In particular, we will now compute estimates for the standard errors of the income and balance logistic regression coefficients in two different ways: (1) using the bootstrap, and (2) using the standard formula for computing the standard errors in the glm() function. Do not forget to set a random seed before beginning your analysis.\n",
    "\n",
    "### (a) Using the summary() and glm() functions, determine the estimated standard errors for the coefficients associated with income and balance in a multiple logistic regression model that uses both predictors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                           Logit Regression Results                           \n",
      "==============================================================================\n",
      "Dep. Variable:                      y   No. Observations:                10000\n",
      "Model:                          Logit   Df Residuals:                     9997\n",
      "Method:                           MLE   Df Model:                            2\n",
      "Date:                Fri, 21 Sep 2018   Pseudo R-squ.:                  0.4594\n",
      "Time:                        19:43:26   Log-Likelihood:                -789.48\n",
      "converged:                       True   LL-Null:                       -1460.3\n",
      "                                        LLR p-value:                4.541e-292\n",
      "==============================================================================\n",
      "                 coef    std err          z      P>|z|      [0.025      0.975]\n",
      "------------------------------------------------------------------------------\n",
      "const        -11.5405      0.435    -26.544      0.000     -12.393     -10.688\n",
      "x1          2.081e-05   4.99e-06      4.174      0.000     1.1e-05    3.06e-05\n",
      "x2             0.0056      0.000     24.835      0.000       0.005       0.006\n",
      "==============================================================================\n",
      "\n",
      "Possibly complete quasi-separation: A fraction 0.14 of observations can be\n",
      "perfectly predicted. This might indicate that there is complete\n",
      "quasi-separation. In this case some parameters will not be identified.\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>coef_sm</th>\n",
       "      <th>SE_sm</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>-11.540468</td>\n",
       "      <td>0.434772</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.000021</td>\n",
       "      <td>0.000005</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.005647</td>\n",
       "      <td>0.000227</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     coef_sm     SE_sm\n",
       "0 -11.540468  0.434772\n",
       "1   0.000021  0.000005\n",
       "2   0.005647  0.000227"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "response   = 'default_Yes'\n",
    "predictors = ['income', 'balance']\n",
    "\n",
    "X_all = sm.add_constant(np.array(default_df[predictors]))\n",
    "y_all = np.array(default_df[response])\n",
    "\n",
    "## Logistic regression\n",
    "model_logit = smf.Logit(y_all, X_all).fit(disp=False);    \n",
    "\n",
    "# Summary\n",
    "print(model_logit.summary())\n",
    "\n",
    "statsmodels_est = pd.DataFrame({'coef_sm': model_logit.params, 'SE_sm': model_logit.bse})\n",
    "display(statsmodels_est)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**NOTE:** Apparent bug in statsmodels.discrete.discrete_model.LogitResults.summary. Std error x2 is misrepresented in summary table, it is easy to misread this as lower than standard error for x1 when in fact it is not (see table above)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### (b) Write a function, boot.fn(), that takes as input the Default data set as well as an index of the observations, and that outputs the coefficient estimates for income and balance in the multiple logistic regression model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "def boot_fn(df, idx):\n",
    "    response   = 'default_Yes'\n",
    "    predictors = ['income', 'balance']\n",
    "    \n",
    "    X = sm.add_constant(np.array(df[predictors].loc[idx]));\n",
    "    y = np.array(df[response].loc[idx]) \n",
    "       \n",
    "    # Logistic regression\n",
    "    model_logit = smf.Logit(y, X).fit(disp=False);  \n",
    "    return model_logit.params;\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### (c) Use the boot() function together with your boot.fn() function to estimate the standard errors of the logistic regression coefficients for income and balance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>coef_boot</th>\n",
       "      <th>SE_boot</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>-11.565762</td>\n",
       "      <td>0.434150</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.000021</td>\n",
       "      <td>0.000005</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.005660</td>\n",
       "      <td>0.000228</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   coef_boot   SE_boot\n",
       "0 -11.565762  0.434150\n",
       "1   0.000021  0.000005\n",
       "2   0.005660  0.000228"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "def boot_idx(n):\n",
    "    \"\"\"Return index for bootstrap sample of size n\n",
    "    e.g. generate array in range 0 to n, with replacement\"\"\"\n",
    "    return np.random.randint(low=0, high=n, size=n)\n",
    "\n",
    "def boot(fn, data_df, samples):\n",
    "    \"\"\"Perform bootstrap for B number of samples\"\"\"\n",
    "    results = []\n",
    "    for s in range(samples):\n",
    "        Z = fn(data_df, boot_idx(data_df.shape[0]))\n",
    "        results += [Z]\n",
    "    return np.array(results)\n",
    "\n",
    "def standard_deviation(X):\n",
    "    \"\"\"Compute deviation error for jth element in matrix X\n",
    "    equivalent to np.std(X, axis=0)\"\"\"\n",
    "    X_bar = np.mean(X, axis=0)\n",
    "    SE = np.sqrt((np.sum(np.square(X - X_bar), axis=0)) / (len(X)))\n",
    "    return SE\n",
    "\n",
    "B = 10000\n",
    "coef_preds    = boot(boot_fn, default_df, samples=B)\n",
    "coef_pred     = np.mean(coef_preds, axis=0)\n",
    "standard_errs = standard_deviation(coef_preds)\n",
    "\n",
    "bootstrap_est = pd.DataFrame({'coef_boot': coef_pred, 'SE_boot': standard_errs})\n",
    "display(bootstrap_est)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### (d) Comment on the estimated standard errors obtained using the glm() function and using your bootstrap function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>coef_sm</th>\n",
       "      <th>SE_sm</th>\n",
       "      <th>coef_boot</th>\n",
       "      <th>SE_boot</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>-11.540468</td>\n",
       "      <td>0.434772</td>\n",
       "      <td>-11.565762</td>\n",
       "      <td>0.434150</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.000021</td>\n",
       "      <td>0.000005</td>\n",
       "      <td>0.000021</td>\n",
       "      <td>0.000005</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.005647</td>\n",
       "      <td>0.000227</td>\n",
       "      <td>0.005660</td>\n",
       "      <td>0.000228</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     coef_sm     SE_sm  coef_boot   SE_boot\n",
       "0 -11.540468  0.434772 -11.565762  0.434150\n",
       "1   0.000021  0.000005   0.000021  0.000005\n",
       "2   0.005647  0.000227   0.005660  0.000228"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.concat([statsmodels_est, bootstrap_est], axis=1)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's compare the standard errors estimated by the statsmodels (_sm) summary() function with estimates obtained by bootstrap (_boot) in the table above. The standard errors for x1 and x2 (rows 1 and 2) are indistinguishable to 6 decimal places. The coefficient for x2 and the statistics for the intercept x0 vary slightly.\n",
    "\n",
    "Note that the disparity is slightly more significant when fewer bootstrap samples are used. Here 10,000 were used, but the estimates were alike to within the same order of magnitude with only 10 bootstrap samples.\n",
    "\n",
    "**QUESTION:** Why are the standard errors provided by statsmodels equivalent to the standard deviations by my calculations, when $SE = \\frac{σ}{\\sqrt{n}}$ ? "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 7. In Sections 5.3.2 and 5.3.3, we saw that the cv.glm() function can be used in order to compute the LOOCV test error estimate. Alternatively, one could compute those quantities using just the glm() and predict.glm() functions, and a for loop. You will now take this approach in order to compute the LOOCV error for a simple logistic regression model on the Weekly data set. Recall that in the context of classification problems, the LOOCV error is given in (5.4).\n",
    "\n",
    "### (a) Fit a logistic regression model that predicts Direction using Lag1 and Lag2."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Year</th>\n",
       "      <th>Lag1</th>\n",
       "      <th>Lag2</th>\n",
       "      <th>Lag3</th>\n",
       "      <th>Lag4</th>\n",
       "      <th>Lag5</th>\n",
       "      <th>Volume</th>\n",
       "      <th>Today</th>\n",
       "      <th>Direction_Up</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1990</td>\n",
       "      <td>0.816</td>\n",
       "      <td>1.572</td>\n",
       "      <td>-3.936</td>\n",
       "      <td>-0.229</td>\n",
       "      <td>-3.484</td>\n",
       "      <td>0.154976</td>\n",
       "      <td>-0.270</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1990</td>\n",
       "      <td>-0.270</td>\n",
       "      <td>0.816</td>\n",
       "      <td>1.572</td>\n",
       "      <td>-3.936</td>\n",
       "      <td>-0.229</td>\n",
       "      <td>0.148574</td>\n",
       "      <td>-2.576</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1990</td>\n",
       "      <td>-2.576</td>\n",
       "      <td>-0.270</td>\n",
       "      <td>0.816</td>\n",
       "      <td>1.572</td>\n",
       "      <td>-3.936</td>\n",
       "      <td>0.159837</td>\n",
       "      <td>3.514</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1990</td>\n",
       "      <td>3.514</td>\n",
       "      <td>-2.576</td>\n",
       "      <td>-0.270</td>\n",
       "      <td>0.816</td>\n",
       "      <td>1.572</td>\n",
       "      <td>0.161630</td>\n",
       "      <td>0.712</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1990</td>\n",
       "      <td>0.712</td>\n",
       "      <td>3.514</td>\n",
       "      <td>-2.576</td>\n",
       "      <td>-0.270</td>\n",
       "      <td>0.816</td>\n",
       "      <td>0.153728</td>\n",
       "      <td>1.178</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Year   Lag1   Lag2   Lag3   Lag4   Lag5    Volume  Today  Direction_Up\n",
       "0  1990  0.816  1.572 -3.936 -0.229 -3.484  0.154976 -0.270             0\n",
       "1  1990 -0.270  0.816  1.572 -3.936 -0.229  0.148574 -2.576             0\n",
       "2  1990 -2.576 -0.270  0.816  1.572 -3.936  0.159837  3.514             1\n",
       "3  1990  3.514 -2.576 -0.270  0.816  1.572  0.161630  0.712             1\n",
       "4  1990  0.712  3.514 -2.576 -0.270  0.816  0.153728  1.178             1"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Load data\n",
    "weekly_df = pd.read_csv('./data/Weekly.csv')\n",
    "\n",
    "# Check for missing data\n",
    "assert weekly_df.isnull().sum().sum() == 0\n",
    "\n",
    "# Pre-processing\n",
    "weekly_df = pd.get_dummies(weekly_df).drop('Direction_Down', axis=1)\n",
    "weekly_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>y_pred=0</th>\n",
       "      <th>y_pred=1</th>\n",
       "      <th>Total</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>y=0</th>\n",
       "      <td>21</td>\n",
       "      <td>221</td>\n",
       "      <td>242</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>y=1</th>\n",
       "      <td>12</td>\n",
       "      <td>303</td>\n",
       "      <td>315</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Total</th>\n",
       "      <td>33</td>\n",
       "      <td>524</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       y_pred=0  y_pred=1 Total\n",
       "                               \n",
       "y=0          21       221   242\n",
       "y=1          12       303   315\n",
       "Total        33       524      "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "total_error_rate: 41.8312%\n"
     ]
    }
   ],
   "source": [
    "# Create index for 50% holdout set\n",
    "np.random.seed(s)\n",
    "train = np.random.rand(len(weekly_df)) < 0.5\n",
    "\n",
    "response   = 'Direction_Up'\n",
    "predictors = ['Lag1', 'Lag2']\n",
    "\n",
    "X_train = sm.add_constant(np.array(weekly_df[train][predictors]))\n",
    "X_test  = sm.add_constant(np.array(weekly_df[~train][predictors]))\n",
    "y_train = np.array(weekly_df[train][response])\n",
    "y_test  = np.array(weekly_df[~train][response])\n",
    "\n",
    "# Logistic regression\n",
    "logit       = LogisticRegression()\n",
    "model_logit = logit.fit(X_train, y_train)\n",
    "\n",
    "# Predict\n",
    "y_pred = model_logit.predict(X_test)\n",
    "\n",
    "# Analysis\n",
    "confusion_mtx = confusion_matrix(y_test, y_pred)\n",
    "display(confusion_table(confusion_mtx))\n",
    "\n",
    "total_error_rate_pct = np.around(total_error_rate(confusion_mtx) * 100, 4)\n",
    "print('total_error_rate: {}%'.format(total_error_rate_pct))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### (b) Fit a logistic regression model that predicts Direction using Lag1 and Lag2 using all but the first observation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create index for LOOCV\n",
    "train = weekly_df.index > 0\n",
    "\n",
    "response   = 'Direction_Up'\n",
    "predictors = ['Lag1', 'Lag2']\n",
    "\n",
    "X_train = np.array(weekly_df[train][predictors])\n",
    "X_test  = np.array(weekly_df[~train][predictors])\n",
    "y_train = np.array(weekly_df[train][response])\n",
    "y_test  = np.array(weekly_df[~train][response])\n",
    "\n",
    "# Logistic regression\n",
    "logit       = LogisticRegression(fit_intercept=True)\n",
    "model_logit = logit.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### (c) Use the model from (b) to predict the direction of the first observation. You can do this by predicting that the first observation will go up if P(Direction=\"Up\"|Lag1, Lag2) > 0.5. Was this observation correctly classified?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>y_pred=0</th>\n",
       "      <th>y_pred=1</th>\n",
       "      <th>Total</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>y=0</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>y=1</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Total</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       y_pred=0  y_pred=1 Total\n",
       "                               \n",
       "y=0           0         1     1\n",
       "y=1           0         0     0\n",
       "Total         0         1      "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "total_error_rate: 100.0%\n"
     ]
    }
   ],
   "source": [
    "# Predict\n",
    "y_pred = model_logit.predict(X_test)\n",
    "\n",
    "# Analysis\n",
    "confusion_mtx = confusion_matrix(y_test, y_pred)\n",
    "display(confusion_table(confusion_mtx))\n",
    "\n",
    "total_error_rate_pct = np.around(total_error_rate(confusion_mtx) * 100, 4)\n",
    "print('total_error_rate: {}%'.format(total_error_rate_pct))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The observation was incrorectly classified."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### (d) Write a for loop from i=1 to i=n, where n is the number of observations in the data set, that performs each of the following steps:\n",
    "\n",
    "- i. Fit a logistic regression model using all but the ith observation to predict Direction using Lag1 and Lag2.\n",
    "- ii. Compute the posterior probability of the market moving up for the ith observation.\n",
    "- iii. Use the posterior probability for the ith observation in order to predict whether or not the market moves up.\n",
    "- iv. Determine whether or not an error was made in predicting the direction for the ith observation. If an error was made, then indicate this as a 1, and otherwise indicate it as a 0."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "response   = 'Direction_Up'\n",
    "predictors = ['Lag1', 'Lag2']\n",
    "\n",
    "y_pred = []\n",
    "for i in range(weekly_df.shape[0]):\n",
    "    # Create index for LOOCV\n",
    "    train = weekly_df.index != i\n",
    "    \n",
    "    X_train = np.array(weekly_df[train][predictors])\n",
    "    X_test  = np.array(weekly_df[~train][predictors])\n",
    "    y_train = np.array(weekly_df[train][response])\n",
    "    \n",
    "    # Logistic regression\n",
    "    logit       = LogisticRegression()\n",
    "    model_logit = logit.fit(X_train, y_train)\n",
    "    \n",
    "    # Predict\n",
    "    y_pred += [model_logit.predict(X_test)]\n",
    "    \n",
    "y_pred = np.array(y_pred)\n",
    "y_test = weekly_df[response]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### (e) Take the average of the n numbers obtained in (d)iv in order to obtain the LOOCV estimate for the test error. Comment on the results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>y_pred=0</th>\n",
       "      <th>y_pred=1</th>\n",
       "      <th>Total</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>y=0</th>\n",
       "      <td>34</td>\n",
       "      <td>450</td>\n",
       "      <td>484</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>y=1</th>\n",
       "      <td>40</td>\n",
       "      <td>565</td>\n",
       "      <td>605</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Total</th>\n",
       "      <td>74</td>\n",
       "      <td>1015</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       y_pred=0  y_pred=1 Total\n",
       "                               \n",
       "y=0          34       450   484\n",
       "y=1          40       565   605\n",
       "Total        74      1015      "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "total_error_rate: 44.9954%\n"
     ]
    }
   ],
   "source": [
    "# Analysis\n",
    "confusion_mtx = confusion_matrix(y_test, y_pred)\n",
    "display(confusion_table(confusion_mtx))\n",
    "\n",
    "total_error_rate_pct = np.around(total_error_rate(confusion_mtx) * 100, 4)\n",
    "print('total_error_rate: {}%'.format(total_error_rate_pct))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "LOOCV yields an estimated test error rate of 45.0%, higher than the 41.8% error rate observed with a 50% holdout set.\n",
    "\n",
    "The LOOCV approach allows the model to train on a 2x - 1 larger training set than 50% holdout, which could be significant if we were to experiment with more flexible models that require more observations to mitigate over-fitting. \n",
    "\n",
    "LOOCV also yields a 2x increase in the effective test set over 50% holdout. \n",
    "\n",
    "The above means that there is less bias in the training and test sets exposed to our model by LOOCV than for 50% holdout. \n",
    "\n",
    "This suggests that the lower error observed by 50% holdout is dues to increased bias in the datasets, or model overfitting. \n",
    "\n",
    "We expect the LOOCV result to exhibit higher variance than the hold-out approach, that is we expect to observe a different error score for some other sample of observations."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 8. We will now perform cross-validation on a simulated data set.\n",
    "\n",
    "### (a) Generate a simulated data set as follows:\n",
    "\n",
    "```\n",
    "> set.seed(1)\n",
    "> x=rnorm(100)\n",
    "> y=x-2*x^2+rnorm(100)\n",
    "```\n",
    "\n",
    "### In this data set, what is n and what is p? Write out the model used to generate the data in equation form."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.random.seed(1)\n",
    "mu, sigma = 0, 1 # mean and standard deviation\n",
    "x = np.random.normal(mu, sigma, 100)\n",
    "y = ((x-2) * (x**2)) + np.random.normal(mu, sigma, 100)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$y = (x-2)x^2 + ϵ$  \n",
    "$y = x^3 + (-2x^2) + ϵ$  \n",
    "$y = -2x^2 + x^3 + ϵ$  \n",
    "\n",
    "$y = β_0 + β_1 x^2 + β_2 x^3 + ϵ$  \n",
    "\n",
    "Where:  \n",
    "$n = 100$  \n",
    "$p = 2$  \n",
    "$β_0 = 0$  \n",
    "$β_1 = -2$  \n",
    "$β_2 = 1$  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### (b) Create a scatterplot of X against Y . Comment on what you find."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Text(0,0.5,'y')"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYoAAAEKCAYAAAAMzhLIAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAHmBJREFUeJzt3X9w2/Wd5/HXW7LlKI5LjHFCGycll6XZGi4tjUtLMrcLS7dk25RMCG1ZFkLhlsCmTG97/ZH2aI7upFw3QK+3HTZNQpcyoWULJeRg4MrPhe1MWlqSDaQhrAm027EhTRxjg+MolmV97g9bimxLX0uypO/X1vMxw4wtydLbX+Lv6/v5+TXnnAAAyCXkdwEAgGAjKAAAnggKAIAnggIA4ImgAAB4IigAAJ4ICgCAJ4ICAOCJoAAAeKrxu4BSOOOMM9xZZ53ldxkAMKXs3bv3mHOueaLXTYugOOuss7Rnzx6/ywCAKcXMfp/P6+h6AgB4IigAAJ4ICgCAJ4ICAOCJoAAAeJoWs54AoNokk07d/XHFE0OK1ITVVB9RKGRl+SyCAgCmmGTSqf1In67fsUedPTG1NEZ119o2LZ7bUJawoOsJAKaY7v54OiQkqbMnput37FF3f7wsn0dQAMAUE08MpUMipbMnpnhiqCyfR1AAwBQTqQmrpTE66rGWxqgiNeGyfB5BAQRQMunU1TegN3pOqKtvQMmk87skBEhTfUR3rW1Lh0VqjKKpPlKWz2MwGwiYSg9UYuoJhUyL5zZo1/rlFZn1RIsCCJhKD1RiagqFTM0NdZrXOFPNDXVlvYggKICAqfRAJTARggIYEZRxgUoPVAITISgAnRoXWL1lt5Zvflart+xW+5E+X8Ki0gOVwETMuak/m6Ktrc1x4yJMRlffgFZv2T2qy6elMapd65eruaGu4vVUcnsGVC8z2+uca5vodcx6QtVLJp3iiSF959MfUG9sUFufe137OnoLGhco9Yk9NVBZDoQQCkVQoKplm4q6ec0S3fFEu7qOD+Q1LjCVprNOpVoRHIxRoKplm4q6Yed+feHis/MeF5hK01mnUq0IDloUqGq5pqL+0ZxZqqsN6fDbsQm7Z4qZzupX9w9Tb1EMggJVLTUVdewgtnNOl235RV7dM7neI1e3VSW6f3IFUaG1wn9BGFOi6wlVLdtU1G1XL9W3HjuYd/dModNZy9394zXVl6m3U0tQpm0zPRZVb+wVWzKZ1Ee+/S/jXrd7w0Wa1zgzr/fwuup7o+eElm9+tqD3L8REU32DcIWK/JR72jbTY4E8jZ2KerTvZMHdM4VMZy2m+6eQk/tE4xDlnHqL0grKmBJBAWRIJp2On0zo9suX6CsP7h81hlCq7pmm+oh2XHe+ft99QjMjYZ2ID+m9TTNzvn+hYxr5BhEti+ALypgSQQGMSCad/vDOSb3VH9dQ0un2y5coZKYT8SHNfdfw7pyVPLmmPis2mNAf3j6p5ll16uyJpcc0cnU/pMYhvvtUu9Ysna+m+ojmNNSpMVo76r0LCZ9if2/CaHJS/y/H/n+q9JgSYxSY8rxORvmeqLwW3u3r6NXuDRfp3adF1X6kb9wJ+D2nRVVTk/+8kLf6B9T+h75RLZbbL1+i1vc0qH8gqcGhpCLhkBLJpF472p9udTTW1+rvHjmofR29krzHNBKJpNqP9umGe/dmDYJC+r6LnaXF4r7SKGfY5jtGEdhZT2a2wszazew1M/ua3/UgmLxmhRQyYyTXwrsbL1yUbup398f13afadc2yhdr06EFdvvWXuvIHv1L70cJmocTiQ+mQSH3WVx7cr94TCX1m2y/1p7c/p1seOaCeE4Pa+PABfXb789r48AHF4kP66orFkibufuiJDaZDIvUZ1+/Yo2P9A+rqG9CJeEIbV7bqvPmz0z+T6vseu4vusf6BomZpsbivNCp534mcNVT8E/NgZmFJ/yjpLyS1SvpLM2v1tyoEQeZJ7K3+AR1556T6B06d9DJPRoWcqHINGqaa/k31EcUTQ1qzdL427Bx9kr/h3r0FnfyGnMv6WV19A+nH1yydr/U//rdxYXLmu2Z4dj+kjs+JeCLrZ5wYGNLqLbv1p7c/p02PHtSXL1mcDouWxqhqQqZX/vDOqHA9MVDcgGo5BmKDshV80GsqtaCOUZwv6TXn3G8lycx+ImmVpIO+VgVfZXZlNM+q01dXLNZXHtyv5ll1+sLFZ+uOz3xAh3tj+s6Tr6ZPRvmeqHINGr5ndlRnvmuGJMnM1FQfyfqescHhK/F8urxm1Gb/rMywmR2tzfo54ZDpofXLdHo0ou7+uJLJpIac5JxTNBLWkXeGr/43rmzN+hm/O9Y/rtX095f9Z33tod/o9suX6NDR4/raQ78Z9ZrfHesvakB1sgOxY49hY7RWh7qO+9aVlaon85jX1oR0/GRCa+/+9bTuXgtki0LSPEkdGd93jjyGKpbZQrjxwkXpkPjyJYu18eEDuvg7/6qvPfQbfXXFYkUjYdXWhPK+AVCuhWipkGg/0qdvPnJAp9dHsr7nm72xvLu8zqivG7/I76ql2rn31D/53thg1s8Zck6RsOlQ13HdvGu/Xuvq12e2/VLLNz+rlzreTh+frc+9rs1rloz7jO89c2jUe3b2xDSvcaa++5kP6rbH21UbDo0LqO89c0jbrlpa8CK9ySzuy3YM33w75ltXVqqescf8si2/0JF3hicaVLqmSgpqiyJbFI9qz5nZOknrJGnBggWVqAk+S3VlnDd/thY116uzJ6aNK1vHdQV95cH9euhvlun4QP7TXL1uVt/Vd6qPfnY0oi1/9aF0t1BqIHpGbUjffapdt65eIklZT2ipgeJsnzV7Ro2+cPH7dPBwnzp7Ytq5t0Pfv2qp/uZHpwajN69Zom//v1d0y6fOSbcaMn/3mZFw+ut9Hb2644l2bVzZqvef2aBopEbhkNR1fGDU793SGNV/HOtXfCipfR296YDKDIuu4wN69+wZWY+NF69jOpFs3YZHM7rmUiq1piBVT65/bxtXtuqGe/dWtKZKCmpQdEqan/F9i6Q3M1/gnNsuabs0POupcqXBL5GasD7eOkfXLFuojreGT565umhOJpJae/ev1TyrThtXtmp2tHbUNNdsMheiZXZ7SEpPTX1gb6dWf2ieNq06VzMjYfXGBnXb48Nbkm9c2Zp3l1e2RW9/PLdBD9xwgRJDSdWEQ6oJK117b2wwPQPr5k+2joTW6N997El+X0evNj16cNSK7G1XLdUNY8Lnjifa9c1Lz9G2q5dqTkOddlx3vv7+Z6/oyYNH0+E6O1rcTJtiF/dlG9/o7o/7tqYgVU+uf2+zM6YeT8e9s4IaFC9IOtvMFkp6Q9IVkq70tyT4IfOEXVsT0t9deo5ueeRlXbt8ob7/Vx/SsePZTx5hU3rNQepKTxqeUqr6iT9z7LTO2y9fotseHz5Rh8x07T0vjPu5pvpI+gSRrSYz0xs9J3JeWYdCptpwaLjvOxxSKOS06dGD496nNjzcpTY2GLY+97pnCyoUMr179oxRIZe678bsmbX6/H2nWklbr1qqTavOVSgUGldrJdZGZBvf2Lm3Q9uuXjpuym8l1hSk6snW4mppjOpEfCj99XTcOyuw6yjM7BOS/o+ksKS7nXO35not6yimp2wn7Huu/bC6j8f1pZ++pOZZdfofn3i/opGwbvzR6JPH3HfV6dI7i9sjJ9cag02rztW197ygH37uw9r48IFxz9/31x9Ry8i6hrF1b71qqb73zKujrtIzBzxzrTmoqwmNGyg9u3mW2o/06R+eeVXXLFuY7gppaYzqh59rU32kRk7KehLP9jnbrl6qf3h6uLaJjlWl1kbkqvN9zbPUezJR8QV8qXpS06Mzj3nq31ssPvUWFea7jiKwQVEIgmJ6ynbCznaS/njrHN3yqXMknTo5SuNP1tlOaNmujg+/Hcu6ad/Pv3qRwqZRs4syT2KL5zSkF95lvq+Z6ZuPHPA8EecKp4fWL5PJxp0Y3+of0Esdb+uMWRHNqA3r+EBCvScGNauuRmedUe8ZhpPZBLGSGw4mEkm9+fbw2ER3f1w793boi3++uCwzivKpO9usp6kWDGOxKSCmvGz91JkDtilPHjyqdX+ySPV1NVo8N5r+o51oIDXX1XHTrEjW7oVobTh9Ap4djXi+d2bf/Bs9J0aFhDR+zCLXmoPBRDLr6uvZ0YjOPG3GuO6xplmRCbs9xo4bdPUN5N33nzmh4MYLF6XHT5LJ5KgT+8nBIdXVhNU/kNDMurBOj0bUExssKEB6YoO68ge/GlXXwcN9Jds5NSXfVlI1b6ZIUCCwsvVTn4gP5VyD8Lf3vzjqJDLRH3auBXkP37Rswv11Sr1bbD6vGXvVe3bzLD20fplODibTLZ1iBp0L2U8oc0JBZvfLtquXqm8goc/98IVRwfXF+19U1/GBCbveskkmk6MG87c+97r2dfSWfEZRrn8HpQ6kqSyo6yiArPPw39s0c9xjm9cs0dbnXi94WmKuq/i+2JDObp6lXeuXa/eGi/TADRfo9Jm1I90OhXfV5rOeYKLXZFtXcKjruM6or9OC02dqXuNMnV5f3PYOmdNYd2+4SLvWL895Em+qj+gbnxw/RfSGe/eq463YuGmjN164aHjdy4/2as3S+ennJlprkEw6HeuPa9OjB/XZ7c+nV5F/vHVOyWcUBWUr7yCjRQHf5eofzjYPvzFaq3cGBnX/uo/q8Nsn1d0fT08bLXRaYq6r+N8d61d9XY2a6iPjuiR2XHe+Zs2o0WAiWdL1BBO9ptxXvfm2kEIhUzhkWU+sMyPhcY+lpo2OnUI60Ym4uz8+bq+qDTv3676//kjJZxRN1JpjB1xaFPDZRKuYMzdEa6qP6FDXcV16527ddN8+DQ4ltenRg+mQKHRaYlN9ZNyK481rluh7zxxSPDE07uTcPKtOR945qcu2/KLg21Lms7Gb12uCdNWbOrFmypwimvlYb2xw3Nep771CPdfvGx65gCglr9ZcUG5F6jdaFPBVIVfKma/t7InptsfbtWnVuVo0Z5aitYVf6XmtK4jUhMedrFLbhpTyqj7fq9Wg3MBGyj2mUTeyZcrYtSeZ04NTdU8U6pX8ffNdlS9V7/gFQQFfFXKlPPa1+zp6de09L2j3houK/qPNNnsodRIbuxI416rcYq/qC1mTEJQb2Ei5T6yS0o/V1oRUEzLdeeV56S7DW1cv0S2fyq/7ptK/b66utyC15PxEUMBXhVw5luMq0+tqcuzJKteMq2I/v5DW1GT2TSqHXCfWzLUUg4mkzIbXgPTEVFC9qd937KyuSgtSS85PjFHAV/nM9knt9e/ktOO684vajdRLrrGBsbOBPjD/tKJ3Q82m0KvVINzAZiK5dlkttm+/+3hcV971vJZvflaX3ln58YHJ7IA7nbAyGxWTqz/e6/FsXTN+bpdQyhkwhdyOdKpI/U4bV7Zm3aeqkN8tn+NTiRlJ03nWEyuzEShjT/ofb52jb3yyVeGQ5fzj8+qayXWv6HIr5ercII07lMpEu6yWYp1L6j0qte9UNa/ITiEoUBGZJ/3z5s/WNcsWprdnyPUHnutEERsc8tyFdaoI2rhDKUy0y2op1rmk3mMya0umcyuhHBijQEVknvRvvHDRuJW92Vbq5rpD3etHj0+bOe1TYdyhEKlW0s69HePusFfMOhev8YFiZySxNqJwtChQEZlXh/l0SySTTsdPjr9DXWpufupnSjmnnavMyUu1km5dvUTJZFI/veECDSaTCpsVPGtpohZXsTOS2NupcAQFKiKzPz6fbonu/vi4O9TNeVed/vv9L2lfR2/6daWa016p/u5qkGolleKYeo0PFDvGw9qIwtH1hIJlTlnt6htQMumyPpYp8+rwgy2nadvVSz27JVJ/zPs6enXDvXv12e3P69Ujx7Pe87kUc9pzXWV6bVwHb+U+poVsZpgp1xYk1bY2ohC0KFCQQu7E5rWff3PDDM9B3ErfCpOrzNKrxDEtZkbSdJxtVm4EBQqS6ypx06pzC+rznegPPNsf8xf/fHF6++9SjyOwArf0gnpMp+Nss3IjKFCQXFeJ80+P6rz5s9PjB5O9cvT6Yy7HgCNXmaUX5GPK2ojCEBQoSK6rxI63YvryJYuLvjdENpX8Y+Yqs/Q4ptMHg9koSLa57al7OGzYOXxHsyBdORZiuq1pCAKO6fRAiwIFSV0l3r/uo+rsiaXv4ZDqcnr/mcNXkPlcObJuAZgaCAoULDSyP9OXfvrSuC4oKb/tpFm3AEwddD2hKE31kXFrITavWaJvPXYwr3nyrFsApg5aFChKKGQ6oz6SXjWd2QV1y6cmnu3EugVg6iAoULRQKJT1ngP5zHYK6hx7AOPR9YSiTebuX9w5DJg6uMMdJmUyM5eY9QT4izvcoSImsyiO1bHA1EDXEwDAU+CCwsy+aWZvmNmLI/99wu+aAKCaBbXr6bvOuTv8LgITY5wBmP6CGhSYAlhdDVSHwHU9jbjJzPab2d1m1pjtBWa2zsz2mNmerq6uStcHsboaqBa+BIWZPW1mB7L8t0rS9yUtkvRBSYclfSfbezjntjvn2pxzbc3NzRWsHimsrgaqgy9dT865j+XzOjO7S9KjZS4HRWJ1NVAdAtf1ZGbvzvh2taQDftUCb6yuBqpDEAezbzOzD0pykv5D0g3+loNcuIMZUB0CFxTOuav9rgH5Y3U1MP0FrusJABAsBAUAwBNBAQDwRFAAADwRFAAATwQFAMATQQEA8ERQAAA8ERQAAE8EBQDAE0EBAPBEUAAAPBEUAABPgds9FqWXTDp198fZChxAUQiKaS6ZdGo/0pe+t3Xq5kKL5zYQFgDyQtfTNNfdH0+HhDR8T+vrd+xRd3/c58oATBUExTQXTwyNuqe1NBwW8cSQTxUBmGoIimkuUhNO39M6paUxqkhN2KeKAEw1BMU011Qf0V1r29Jh0dIY1barlyqZTKqrb0DJpPO5QgBBx2D2NBcKmRbPbdCu9csVTwxpKOn0rccO6smDRxnYBpAXWhRVIBQyNTfUKVIT1pU/+JWePHhUEgPbAPJDUFQRBrYBFIOgqCIMbAMoBkFRRbINbN+1tk1N9RGfKwMQZAxmV5GxA9ts5wEgHwRFlUkNbANAvuh6AgB4IigAAJ4ICgCApwmDwsxuMrPGUn6omX3azF42s6SZtY157utm9pqZtZvZJaX8XABA4fJpUZwp6QUze8DMVphZKabIHJB0maSfZz5oZq2SrpB0jqQVkraYGZP8AcBHEwaFc+4bks6W9E+SPifpkJn9LzNbVOyHOudecc61Z3lqlaSfOOcGnHO/k/SapPOL/RwAwOTlNUbhnHOS/jDyX0JSo6QHzey2EtczT1JHxvedI48BAHwy4ToKM/uCpGskHZP0A0lfcc4NmllI0iFJX83xc09ruNtqrJudcw/n+rgsj2XdB9vM1klaJ0kLFizw/B0AAMXLZ8HdGZIuc879PvNB51zSzFbm+iHn3MeKqKdT0vyM71skvZnj/bdL2i5JbW1t3FQBAMoknzGK/zk2JDKee6XE9Twi6QozqzOzhRoeG/l1iT8DAFAAX9ZRmNlqM+uUdIGkx8zsCUlyzr0s6QFJByU9Lunzzjn2wAYAH/my15NzbpekXTmeu1XSrZWtCACQCyuzAQCeCAoAgCeCAgDgiaAAAHgiKAAAnggKAIAnggIA4ImgAAB4IigAAJ4ICgCAJ4ICAOCJoAAAeCIoAACeCAoAgCeCAgDgiaAAAHgiKAAAnggKAIAnggIA4ImgAAB4IigAAJ4ICgCAJ4ICAOCJoAAAeCIoAACeCAoAgCeCAgDgiaAAAHgiKAAAnggKAIAnX4LCzD5tZi+bWdLM2jIeP8vMYmb24sh/W/2oDwBwSo1Pn3tA0mWStmV57nXn3AcrXA8AIAdfgsI594okmZkfHw8AKEAQxygWmtk+M/tXM/svuV5kZuvMbI+Z7enq6qpkfQBQVcrWojCzpyWdmeWpm51zD+f4scOSFjjnus1sqaT/a2bnOOfeGftC59x2Sdslqa2tzZWqbgDAaGULCufcx4r4mQFJAyNf7zWz1yW9T9KeEpcHAMhToLqezKzZzMIjX/8nSWdL+q2/VQFAdfNreuxqM+uUdIGkx8zsiZGn/kTSfjN7SdKDkm50zr3lR40AgGF+zXraJWlXlsd3StpZ+YoAALkEqusJABA8BAUAwBNBAQDwRFAAADz5tdfTtJNMOnX3xxVPDClSE1ZTfUShEFuUAJj6CIoSSCad2o/06fode9TZE1NLY1R3rW3T4rkNhAWAKY+up0lIJp26+gZ0+O1YOiQkqbNn+Pvu/rjPFQLA5BEURUq1IlZv2a3Onlg6JFI6e2KKJ4Z8qg4ASoegKFJ3fzzdiuiNDaqlMTrq+ZbGqCI1YZ+qA4DSISiKFE8MpVsRW597XZvXLEmHRWqMoqk+4meJAFASDGYXKVITVktjVJ09Me3r6NUdT7Rr06pztWjOLEVrmfUEYPqgRVGkpvqI7lrblm5FdB0f0JmnzVDL7KiaG+oICQDTBi2KIoVCpsVzG7Rr/XLWTgCY1giKSQiFTM0NdX6XAQBlRdcTAMATQQEA8ERQAAA8ERQAAE8EBQDAE0EBAPBEUAAAPBEUAABPBAUAwBNBAQDwRFAAADwRFAAATwQFAMATQQEA8ERQAAA8+RIUZna7mf27me03s11mNjvjua+b2Wtm1m5ml/hRHwDgFL9aFE9JOtc5t0TSq5K+Lklm1irpCknnSFohaYuZhX2qEQAgn4LCOfekcy4x8u3zklpGvl4l6SfOuQHn3O8kvSbpfD9qBAAMC8IYxXWSfjby9TxJHRnPdY48BgDwSdnumW1mT0s6M8tTNzvnHh55zc2SEpJ+nPqxLK93Od5/naR1krRgwYJJ1wsAyK5sQeGc+5jX82Z2jaSVki52zqXCoFPS/IyXtUh6M8f7b5e0XZLa2tqyhgkAYPL8mvW0QtIGSZc6505kPPWIpCvMrM7MFko6W9Kv/agRADCsbC2KCdwpqU7SU2YmSc875250zr1sZg9IOqjhLqnPO+eGfKoRACCfgsI590cez90q6dYKlgMA8BCEWU8AgAAjKAAAnggKAIAnggIA4ImgAAB4IigAAJ4ICgCAJ4ICAOCJoAAAeCIoAACeCAoAgCeCAgDgya/dYwMhmXTq7o8rnhhSpCaspvqIQqFs904CgOpVtUGRTDq1H+nT9Tv2qLMnppbGqO5a26bFcxsICwDIULVdT9398XRISFJnT0zX79ij7v64z5UBQLBUbVDEE0PpkEjp7IkpnuA+SQCQqWqDIlITVktjdNRjLY1RRWrCPlUEAMFUtUHRVB/RXWvb0mGRGqNoqo/4XBkABEvVDmaHQqbFcxu0a/1yZj0BgIeqDQppOCyaG+r8LgMAAq1qu54AAPkhKAAAnggKAIAnggIA4ImgAAB4Muec3zVMmpl1Sfq933VkOEPSMb+LCBCOx2gcj1M4FqNV+ni81znXPNGLpkVQBI2Z7XHOtfldR1BwPEbjeJzCsRgtqMeDricAgCeCAgDgiaAoj+1+FxAwHI/ROB6ncCxGC+TxYIwCAOCJFgUAwBNBUSZmdruZ/buZ7TezXWY22++a/GRmnzazl80saWaBm9VRCWa2wszazew1M/ua3/X4yczuNrOjZnbA71qCwMzmm9mzZvbKyN/Jf/O7pkwERfk8Jelc59wSSa9K+rrP9fjtgKTLJP3c70L8YGZhSf8o6S8ktUr6SzNr9bcqX90jaYXfRQRIQtKXnHPvl/RRSZ8P0r8PgqJMnHNPOucSI98+L6nFz3r85px7xTnX7ncdPjpf0mvOud865+KSfiJplc81+cY593NJb/ldR1A45w475/5t5Os+Sa9ImudvVacQFJVxnaSf+V0EfDVPUkfG950K0IkAwWFmZ0k6T9Kv/K3klKq+cdFkmdnTks7M8tTNzrmHR15zs4ablT+uZG1+yOd4VLFst05kyiFGMbNZknZK+lvn3Dt+15NCUEyCc+5jXs+b2TWSVkq62FXBPOSJjkeV65Q0P+P7Fklv+lQLAsjMajUcEj92zj3kdz2Z6HoqEzNbIWmDpEudcyf8rge+e0HS2Wa20Mwikq6Q9IjPNSEgzMwk/ZOkV5xz/9vvesYiKMrnTkkNkp4ysxfNbKvfBfnJzFabWaekCyQ9ZmZP+F1TJY1MbLhJ0hMaHqh8wDn3sr9V+cfM/lnSLyUtNrNOM/uvftfks+WSrpb0ZyPnixfN7BN+F5XCymwAgCdaFAAATwQFAMATQQEA8ERQAAA8ERQAAE8EBQDAE0EBAPBEUABlYGYfHrkXyQwzqx+5x8C5ftcFFIMFd0CZmNm3JM2QFJXU6Zz7ts8lAUUhKIAyGdnT6QVJJyUtc84N+VwSUBS6noDyOV3SLA3v+TXD51qAotGiAMrEzB7R8J3sFkp6t3PuJp9LAorC/SiAMjCztZISzrn7Ru6X/Qsz+zPn3L/4XRtQKFoUAABPjFEAADwRFAAATwQFAMATQQEA8ERQAAA8ERQAAE8EBQDAE0EBAPD0/wEnEnskC0kHHgAAAABJRU5ErkJggg==\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "ax = sns.scatterplot(x=x, y=y)\n",
    "plt.xlabel('x')\n",
    "plt.ylabel('y')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The above plot shows x plotted against y. It shows a non-linear relationship with some variance. The shape of the plot is has two maxima/minima, typical of a 3rd order polynomial."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### (c) Set a random seed, and then compute the LOOCV errors that result from fitting the following four models using least squares:\n",
    "\n",
    "- i.  $Y = β_0 + β_1 X + ϵ$  \n",
    "- ii. $Y = β_0 + β_1 X + β_2 X^2 + ϵ$  \n",
    "- iii. $Y = β_0 + β_1 X + β_2 X^2 + β_3 X^3 + ϵ$  \n",
    "- iv. $Y = β_0 + β_1 X + β_2 X^2 + β_3 X^3 + β_4 X^4 + ϵ$  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<h3>MSE</h3>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "{'x': 9.710113037214064,\n",
       " 'x^2': 4.16996357963943,\n",
       " 'x^3': 0.9268768781648805,\n",
       " 'x^4': 0.8669116865881077}"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "def mse(y_pred, y):\n",
    "    \"\"\"Calculate mean squared error\"\"\"\n",
    "    return np.sum(np.square(y_pred - y)) / y.size\n",
    "\n",
    "\n",
    "def sim_loocv(seed):\n",
    "    \"\"\"Run loocv on simulated data generated with random seed provided\"\"\"\n",
    "    # Generate simulated data\n",
    "    np.random.seed(seed)\n",
    "    mu, sigma = 0, 1 # mean and standard deviation\n",
    "    x  = np.random.normal(mu, sigma, 100)\n",
    "    y = ((x-2) * (x**2)) + np.random.normal(mu, sigma, 100)\n",
    "    sim_df = pd.DataFrame({'x': x, 'y': y})\n",
    "    \n",
    "    formulae = {'x'   : 'y ~ x', \n",
    "                'x^2' : 'y ~ x + np.power(x, 2)',\n",
    "                'x^3' : 'y ~ x + np.power(x, 2) + np.power(x, 3)',\n",
    "                'x^4' : 'y ~ x + np.power(x, 2) + np.power(x, 3) + np.power(x, 4)'}\n",
    "    \n",
    "    errors = {}\n",
    "    for f in formulae:\n",
    "        # predictions state\n",
    "        y_pred = pd.Series({})\n",
    "        for i in range(sim_df.shape[0]):\n",
    "            # Create index for LOOCV\n",
    "            train = sim_df.index != i\n",
    "            \n",
    "            # Linear regression\n",
    "            model_ols = smf.ols(formula=formulae[f], data=sim_df[train]).fit()\n",
    "            \n",
    "            ## Predict\n",
    "            y_hat   = model_ols.predict(exog=sim_df[~train])\n",
    "            y_pred  = pd.concat([y_pred, y_hat])\n",
    "        errors[f] = mse(np.array(y_pred), y)\n",
    "    \n",
    "    display(HTML('<h3>MSE</h3>'))\n",
    "    display(errors)\n",
    "    \n",
    "sim_loocv(1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### (d) Repeat (c) using another random seed, and report your results. Are your results the same as what you got in (c)? Why?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<h3>MSE</h3>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "{'x': 21.444406970083293,\n",
       " 'x^2': 5.635014149446473,\n",
       " 'x^3': 1.2820418215169618,\n",
       " 'x^4': 1.3165915804276835}"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "sim_loocv(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Changing the random seed that is used to simulate the observations has a large effect on the observed mean squared error. \n",
    "\n",
    "In changing the random seed we have changed the sample of observations taken from the population, so this change in error is due to variance in our sample. In this case the sample size in is small n=100, and so we might expect quite hi high variance between succesive samples. This would lead to variability between the mse errors observed for different samples of X, Y.\n",
    "\n",
    "### (e) Which of the models in (c) had the smallest LOOCV error? Is this what you expected? Explain your answer.\n",
    "\n",
    "In test c), model iv with an $x^4$ term exhibited the lowest error, marginally lower than model iii which also performed well. I expected the lowest error to be observed for model iii) because this is simulated data and we know the true function f(x) in this case is a 3rd order polynomial. \n",
    "\n",
    "Interestingly, the second test with a different seed yields the expected result – lowest error for x^3 model.\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "### (f) Comment on the statistical significance of the coefficient estimates that results from fitting each of the models in (c) using least squares. Do these results agree with the conclusions drawn based on the cross-validation results?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table class=\"simpletable\">\n",
       "<caption>OLS Regression Results</caption>\n",
       "<tr>\n",
       "  <th>Dep. Variable:</th>            <td>y</td>        <th>  R-squared:         </th> <td>   0.937</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Model:</th>                   <td>OLS</td>       <th>  Adj. R-squared:    </th> <td>   0.935</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Method:</th>             <td>Least Squares</td>  <th>  F-statistic:       </th> <td>   350.8</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Date:</th>             <td>Fri, 21 Sep 2018</td> <th>  Prob (F-statistic):</th> <td>1.42e-55</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Time:</th>                 <td>19:46:55</td>     <th>  Log-Likelihood:    </th> <td> -129.19</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>No. Observations:</th>      <td>    99</td>      <th>  AIC:               </th> <td>   268.4</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Df Residuals:</th>          <td>    94</td>      <th>  BIC:               </th> <td>   281.4</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Df Model:</th>              <td>     4</td>      <th>                     </th>     <td> </td>   \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Covariance Type:</th>      <td>nonrobust</td>    <th>                     </th>     <td> </td>   \n",
       "</tr>\n",
       "</table>\n",
       "<table class=\"simpletable\">\n",
       "<tr>\n",
       "         <td></td>           <th>coef</th>     <th>std err</th>      <th>t</th>      <th>P>|t|</th>  <th>[0.025</th>    <th>0.975]</th>  \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Intercept</th>      <td>    0.3031</td> <td>    0.136</td> <td>    2.227</td> <td> 0.028</td> <td>    0.033</td> <td>    0.573</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>x</th>              <td>   -0.1113</td> <td>    0.184</td> <td>   -0.606</td> <td> 0.546</td> <td>   -0.476</td> <td>    0.253</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>np.power(x, 2)</th> <td>   -2.5461</td> <td>    0.248</td> <td>  -10.281</td> <td> 0.000</td> <td>   -3.038</td> <td>   -2.054</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>np.power(x, 3)</th> <td>    1.1059</td> <td>    0.064</td> <td>   17.283</td> <td> 0.000</td> <td>    0.979</td> <td>    1.233</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>np.power(x, 4)</th> <td>    0.1407</td> <td>    0.057</td> <td>    2.462</td> <td> 0.016</td> <td>    0.027</td> <td>    0.254</td>\n",
       "</tr>\n",
       "</table>\n",
       "<table class=\"simpletable\">\n",
       "<tr>\n",
       "  <th>Omnibus:</th>       <td> 1.519</td> <th>  Durbin-Watson:     </th> <td>   2.121</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Prob(Omnibus):</th> <td> 0.468</td> <th>  Jarque-Bera (JB):  </th> <td>   1.037</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Skew:</th>          <td>-0.225</td> <th>  Prob(JB):          </th> <td>   0.595</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Kurtosis:</th>      <td> 3.221</td> <th>  Cond. No.          </th> <td>    15.9</td>\n",
       "</tr>\n",
       "</table><br/><br/>Warnings:<br/>[1] Standard Errors assume that the covariance matrix of the errors is correctly specified."
      ],
      "text/plain": [
       "<class 'statsmodels.iolib.summary.Summary'>\n",
       "\"\"\"\n",
       "                            OLS Regression Results                            \n",
       "==============================================================================\n",
       "Dep. Variable:                      y   R-squared:                       0.937\n",
       "Model:                            OLS   Adj. R-squared:                  0.935\n",
       "Method:                 Least Squares   F-statistic:                     350.8\n",
       "Date:                Fri, 21 Sep 2018   Prob (F-statistic):           1.42e-55\n",
       "Time:                        19:46:55   Log-Likelihood:                -129.19\n",
       "No. Observations:                  99   AIC:                             268.4\n",
       "Df Residuals:                      94   BIC:                             281.4\n",
       "Df Model:                           4                                         \n",
       "Covariance Type:            nonrobust                                         \n",
       "==================================================================================\n",
       "                     coef    std err          t      P>|t|      [0.025      0.975]\n",
       "----------------------------------------------------------------------------------\n",
       "Intercept          0.3031      0.136      2.227      0.028       0.033       0.573\n",
       "x                 -0.1113      0.184     -0.606      0.546      -0.476       0.253\n",
       "np.power(x, 2)    -2.5461      0.248    -10.281      0.000      -3.038      -2.054\n",
       "np.power(x, 3)     1.1059      0.064     17.283      0.000       0.979       1.233\n",
       "np.power(x, 4)     0.1407      0.057      2.462      0.016       0.027       0.254\n",
       "==============================================================================\n",
       "Omnibus:                        1.519   Durbin-Watson:                   2.121\n",
       "Prob(Omnibus):                  0.468   Jarque-Bera (JB):                1.037\n",
       "Skew:                          -0.225   Prob(JB):                        0.595\n",
       "Kurtosis:                       3.221   Cond. No.                         15.9\n",
       "==============================================================================\n",
       "\n",
       "Warnings:\n",
       "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
       "\"\"\""
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Generate simulated data\n",
    "np.random.seed(1)\n",
    "mu, sigma = 0, 1 # mean and standard deviation\n",
    "x  = np.random.normal(mu, sigma, 100)\n",
    "y = ((x-2) * (x**2)) + np.random.normal(mu, sigma, 100)\n",
    "sim_df = pd.DataFrame({'x': x, 'y': y})\n",
    "\n",
    "formulae = {'x'  : 'y ~ x', \n",
    "            'x^2' : 'y ~ x + np.power(x, 2)',\n",
    "            'x^3': 'y ~ x + np.power(x, 2) + np.power(x, 3)',\n",
    "            'x^4' : 'y ~ x + np.power(x, 2) + np.power(x, 3) + np.power(x, 4)'}\n",
    "\n",
    "errors = {}\n",
    "for f in formulae:\n",
    "    # predictions state\n",
    "    y_pred = pd.Series({})\n",
    "    for i in range(sim_df.shape[0]):\n",
    "        # Create index for LOOCV\n",
    "        train = sim_df.index != i\n",
    "        \n",
    "        # Linear regression\n",
    "        model_ols = smf.ols(formula=formulae[f], data=sim_df[train]).fit()\n",
    "        \n",
    "        ## Predict\n",
    "        y_hat   = model_ols.predict(exog=sim_df[~train])\n",
    "        y_pred  = pd.concat([y_pred, y_hat])\n",
    "    errors[f] = mse(np.array(y_pred), y)\n",
    "\n",
    "display(model_ols.summary())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Comment**\n",
    "\n",
    "The p-values associated with feature-wise t-statsitics for the model show that p < 0.05 for features $x^0$, $x^1$, $x^2$, and $x^4$. Note that there is not enough evidence in this model to reject the null hypothesis for $x^3$, that $H_0: β_3 = 0$.\n",
    "\n",
    "These statistics suggest that the the strongest model will include features $x^0$, $x^1$, $x^2$, and $x^4$, which supports the conclusions drawn from cross-validations that models iii and iv perfroma best."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 9. We will now consider the Boston housing data set, from the MASS library."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>CRIM</th>\n",
       "      <th>ZN</th>\n",
       "      <th>INDUS</th>\n",
       "      <th>CHAS</th>\n",
       "      <th>NOX</th>\n",
       "      <th>RM</th>\n",
       "      <th>AGE</th>\n",
       "      <th>DIS</th>\n",
       "      <th>RAD</th>\n",
       "      <th>TAX</th>\n",
       "      <th>PTRATIO</th>\n",
       "      <th>B</th>\n",
       "      <th>LSTAT</th>\n",
       "      <th>medv</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.00632</td>\n",
       "      <td>18.0</td>\n",
       "      <td>2.31</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.538</td>\n",
       "      <td>6.575</td>\n",
       "      <td>65.2</td>\n",
       "      <td>4.0900</td>\n",
       "      <td>1.0</td>\n",
       "      <td>296.0</td>\n",
       "      <td>15.3</td>\n",
       "      <td>396.90</td>\n",
       "      <td>4.98</td>\n",
       "      <td>24.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.02731</td>\n",
       "      <td>0.0</td>\n",
       "      <td>7.07</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.469</td>\n",
       "      <td>6.421</td>\n",
       "      <td>78.9</td>\n",
       "      <td>4.9671</td>\n",
       "      <td>2.0</td>\n",
       "      <td>242.0</td>\n",
       "      <td>17.8</td>\n",
       "      <td>396.90</td>\n",
       "      <td>9.14</td>\n",
       "      <td>21.6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.02729</td>\n",
       "      <td>0.0</td>\n",
       "      <td>7.07</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.469</td>\n",
       "      <td>7.185</td>\n",
       "      <td>61.1</td>\n",
       "      <td>4.9671</td>\n",
       "      <td>2.0</td>\n",
       "      <td>242.0</td>\n",
       "      <td>17.8</td>\n",
       "      <td>392.83</td>\n",
       "      <td>4.03</td>\n",
       "      <td>34.7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.03237</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2.18</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.458</td>\n",
       "      <td>6.998</td>\n",
       "      <td>45.8</td>\n",
       "      <td>6.0622</td>\n",
       "      <td>3.0</td>\n",
       "      <td>222.0</td>\n",
       "      <td>18.7</td>\n",
       "      <td>394.63</td>\n",
       "      <td>2.94</td>\n",
       "      <td>33.4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.06905</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2.18</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.458</td>\n",
       "      <td>7.147</td>\n",
       "      <td>54.2</td>\n",
       "      <td>6.0622</td>\n",
       "      <td>3.0</td>\n",
       "      <td>222.0</td>\n",
       "      <td>18.7</td>\n",
       "      <td>396.90</td>\n",
       "      <td>5.33</td>\n",
       "      <td>36.2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  \\\n",
       "0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0   \n",
       "1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0   \n",
       "2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0   \n",
       "3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0   \n",
       "4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0   \n",
       "\n",
       "   PTRATIO       B  LSTAT  medv  \n",
       "0     15.3  396.90   4.98  24.0  \n",
       "1     17.8  396.90   9.14  21.6  \n",
       "2     17.8  392.83   4.03  34.7  \n",
       "3     18.7  394.63   2.94  33.4  \n",
       "4     18.7  396.90   5.33  36.2  "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "boston = datasets.load_boston()\n",
    "boston_feat = pd.DataFrame(boston.data, columns=boston.feature_names)\n",
    "boston_resp = pd.Series(boston.target).rename('medv')\n",
    "\n",
    "boston_df = pd.concat([boston_feat, boston_resp], axis=1)\n",
    "\n",
    "# Check for missing values\n",
    "#assert boston_df.isnull().sum().sum() == 0\n",
    "\n",
    "boston_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### (a) Based on this data set, provide an estimate for the population mean of medv. Call this estimate μˆ."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "22.532806324110698"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "mu_hat = boston_df['medv'].mean()\n",
    "display(mu_hat)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### (b) Provide an estimate of the standard error of μˆ. Interpret this result.\n",
    "Hint: We can compute the standard error of the sample mean by dividing the sample standard deviation by the square root of the number of observations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.4084569346972866"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "def standard_error_mean(df):\n",
    "    \"\"\"Compute estimated standard error of the mean\n",
    "    with analytic approach\"\"\"\n",
    "    medv = np.array(df)\n",
    "    SE   = np.std(medv) / np.sqrt(len(medv))\n",
    "    return SE\n",
    "display(standard_error_mean(boston_df['medv']))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### (c) Now estimate the standard error of μˆ using the bootstrap. How does this compare to your answer from (b)?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "SE: 0.4059561352313032\n"
     ]
    }
   ],
   "source": [
    "# Compute standard error of the mean with the bootstrap approach\n",
    "\n",
    "def mean_boot(df, idx):\n",
    "    Z = np.array(df.loc[idx])\n",
    "    return np.mean(Z)\n",
    "\n",
    "def boot_idx(n):\n",
    "    \"\"\"Return index for bootstrap sample of size n\n",
    "    e.g. generate array in range 0 to n, with replacement\"\"\"\n",
    "    return np.random.randint(low=0, high=n, size=n)\n",
    "\n",
    "def boot(fn, data_df, samples):\n",
    "    \"\"\"Perform bootstrap for B number of samples\"\"\"\n",
    "    results = []\n",
    "    for s in range(samples):\n",
    "        Z = fn(data_df, boot_idx(data_df.shape[0]))\n",
    "        results += [Z]\n",
    "    return np.array(results)\n",
    "\n",
    "B = 10000\n",
    "mean_boot  = boot(mean_boot, boston_df['medv'], samples=B)\n",
    "SE_pred    = np.std(mean_boot) \n",
    "\n",
    "print('SE: ' + str(SE_pred))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The bootstrap method gives a remarkably good estimate of the standard error, when compared to the same estimate derived analytically. The bootstrap approach is computationally more expensive, but has the advantage that no analytic derivation of the standard error for the statistic is required, making it much more general for application to other statistics.\n",
    "\n",
    "### (d) Based on your bootstrap estimate from (c), provide a 95 % confidence interval for the mean of medv. Compare it to the results obtained using \n",
    "\n",
    "```\n",
    "t.test(Boston$medv)\n",
    "```\n",
    "\n",
    "Hint: You can approximate a 95% confidence interval using the formula [μˆ − 2SE(μˆ), μˆ + 2SE(μˆ)].\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "mu        22.532806\n",
       "SE         0.405956\n",
       "[0.025    21.720894\n",
       "0.975]    23.344719\n",
       "dtype: float64"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "mu_hat   = np.mean(boston_df['medv'])\n",
    "conf_low = mu_hat - (2*SE_pred)\n",
    "conf_hi  = mu_hat + (2*SE_pred)\n",
    "\n",
    "pd.Series({'mu': mu_hat, \n",
    "           'SE': SE_pred,\n",
    "           '[0.025': conf_low,\n",
    "           '0.975]': conf_hi})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### (e) Based on this dataset, provide an estimate, μˆmed, for the median value of medv in the population."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "median: 21.2\n"
     ]
    }
   ],
   "source": [
    "median_hat = np.median(boston_df['medv'])\n",
    "print('median: ' + str(median_hat))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### (f) We now would like to estimate the standard error of μˆmed. Unfortunately, there is no simple formula for computing the standard error of the median. Instead, estimate the standard error of the median using the bootstrap. Comment on your findings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "ename": "TypeError",
     "evalue": "'numpy.ndarray' object is not callable",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mTypeError\u001b[0m                                 Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-13-498521db254b>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m     19\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     20\u001b[0m \u001b[0mB\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;36m10000\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 21\u001b[0;31m \u001b[0mboot_obs\u001b[0m   \u001b[0;34m=\u001b[0m \u001b[0mboot\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmean_boot\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mboston_df\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'medv'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0msamples\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mB\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m     22\u001b[0m \u001b[0mSE_pred\u001b[0m    \u001b[0;34m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstd\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mboot_obs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     23\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m<ipython-input-13-498521db254b>\u001b[0m in \u001b[0;36mboot\u001b[0;34m(fn, data_df, samples)\u001b[0m\n\u001b[1;32m     14\u001b[0m     \u001b[0mresults\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     15\u001b[0m     \u001b[0;32mfor\u001b[0m \u001b[0ms\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mrange\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msamples\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 16\u001b[0;31m         \u001b[0mZ\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mfn\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdata_df\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mboot_idx\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdata_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mshape\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m     17\u001b[0m         \u001b[0mresults\u001b[0m \u001b[0;34m+=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0mZ\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     18\u001b[0m     \u001b[0;32mreturn\u001b[0m \u001b[0mresults\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;31mTypeError\u001b[0m: 'numpy.ndarray' object is not callable"
     ]
    }
   ],
   "source": [
    "# Compute standard error of the median with the bootstrap approach\n",
    "\n",
    "def median_boot(df, idx):\n",
    "    Z = np.array(df.loc[idx])\n",
    "    return np.median(Z)\n",
    "\n",
    "def boot_idx(n):\n",
    "    \"\"\"Return index for bootstrap sample of size n\n",
    "    e.g. generate array in range 0 to n, with replacement\"\"\"\n",
    "    return np.random.randint(low=0, high=n, size=n)\n",
    "\n",
    "def boot(fn, data_df, samples):\n",
    "    \"\"\"Perform bootstrap for B number of samples\"\"\"\n",
    "    results = []\n",
    "    for s in range(samples):\n",
    "        Z = fn(data_df, boot_idx(data_df.shape[0]))\n",
    "        results += [Z]\n",
    "    return np.array(results)\n",
    "\n",
    "B = 10000\n",
    "boot_obs   = boot(mean_boot, boston_df['medv'], samples=B)\n",
    "SE_pred    = np.std(boot_obs) \n",
    "\n",
    "print('SE: ' + str(SE_pred))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The estimated standard error for the median value of medv is similar to the estimated standard error for the mean, which is not suprising we would expect the precision of median and and mean to be related."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### g) Based on this data set, provide an estimate for the tenth percentile of medv in Boston suburbs. Call this quantity μˆ0.1. (You can use the quantile() function.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tenth_percentile = np.percentile(boston_df['medv'], 10)\n",
    "print('tenth_percentile: ' + str(tenth_percentile))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### (h) Use the bootstrap to estimate the standard error of μˆ0.1. Comment on your findings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compute standard error of the tenth percentile with the bootstrap approach\n",
    "\n",
    "def tenth_percentile(df, idx):\n",
    "    Z = np.array(df.loc[idx])\n",
    "    return np.percentile(Z, 10)\n",
    "\n",
    "def boot_idx(n):\n",
    "    \"\"\"Return index for bootstrap sample of size n\n",
    "    e.g. generate array in range 0 to n, with replacement\"\"\"\n",
    "    return np.random.randint(low=0, high=n, size=n)\n",
    "\n",
    "def boot(fn, data_df, samples):\n",
    "    \"\"\"Perform bootstrap for B number of samples\"\"\"\n",
    "    results = []\n",
    "    for s in range(samples):\n",
    "        Z = fn(data_df, boot_idx(data_df.shape[0]))\n",
    "        results += [Z]\n",
    "    return np.array(results)\n",
    "\n",
    "B = 10000\n",
    "boot_obs   = boot(tenth_percentile, boston_df['medv'], samples=B)\n",
    "SE_pred    = np.std(boot_obs) \n",
    "\n",
    "print('SE: ' + str(SE_pred))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We observe a standard error for the tenth percentile of 0.5. This error is similar to that observed for median and mean, which we would expect as these stats are related, for example a bootstrap sample with higher mean would likely exhibit a higher tenth percentile as the distribution shifts. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}