{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lecture 15: Overfitting" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In last lecture we applied linear regression on looking for the relation between $x$ and $y$ using a set of data $\\{(x_i,y_i)\\}_{i=1}^{N}$. Our linear regression only \"fits\" the data roughly, not precisely. Yet it is a good model. To know why, we need to learn interpolation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What is interpolation:\n", "Suppose we know $n+1$ distinct grid points\n", "$x_0, x_1, x_2, \\dots, x_n$, and the values the values at each of these\n", "points as $f_k = f(x_k)$, but we have no idea of what $f$'s analytical expression is. Then the problem of interpolation is to find an approximation of $h(x)$ that is defined at any point $x \\in [a, b]$ that **coincides** with $f(x)$ at $x_k$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us borrow `scikit-learn` package again.\n", "\n", "Reference: Adapted from [https://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html](https://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html) to be more readable." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "from sklearn.pipeline import Pipeline # for easier fitting using high degree polynomials testing\n", "from sklearn.preprocessing import PolynomialFeatures # evaluating polynomials at points\n", "from sklearn.linear_model import LinearRegression # we have used this before" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model:\n", "Now we are gonna use a built-in \"linear\" regression model (which is a class) `LinearRegression` in `scikit-learn `package to fit not just a linear function but a polynomial function of any degree, e.g. $h(x) = w_{10} x^{10} + w_9 x^9 + \\dots + w_1 x + b$, to the data. \n", "\n", "Remark: if you are interested, we are using the Vandermonde matrix by adding $x^p$ as features. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Training:\n", "`X_train`, `y_train` are our training data. In the following example, we have 10 of them." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.random.seed(42)\n", "X_train = np.linspace(0,2,10)\n", "# true function is x^2, adding some noise\n", "true_function = lambda x: x**2\n", "y_train = true_function(X_train) + np.random.normal(0,0.5, size=10)\n", "plt.scatter(X_train, y_train, s=40, alpha=0.8);" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# linear regression\n", "polynomial_features = PolynomialFeatures(degree=1,include_bias=False)\n", "linear_regression = LinearRegression()\n", "pipeline = Pipeline([(\"polynomial_features\", polynomial_features),\n", " (\"linear_regression\", linear_regression)])\n", "# pipeline combines both classes" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# pipeline.fit(X_train[:, np.newaxis], y_train) # \"training\"\n", "pipeline.fit(X_train.reshape(-1,1), y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cross-validation (Testing)\n", "Basically we choose a bunch of testing points, see if our model (built from only 10 noisy samples) approximates our true function $x^2$ to a reasonable level." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "num_samples = 100\n", "X_test = np.linspace(0, 2, num_samples) # this the the testing points\n", "y_pred = pipeline.predict(X_test.reshape(-1,1)) \n", "y_true = true_function(X_test)\n", "error = np.mean((y_pred - y_true)**2)\n", "\n", "plt.figure(figsize=(8,6))\n", "plt.plot(X_test, y_pred, linewidth = 2, label=\"Model's prediction\")\n", "plt.plot(X_test, y_true, linewidth = 2, label=\"True function\")\n", "plt.scatter(X_train, y_train, edgecolor='b', s=40, label=\"Training samples\")\n", "plt.legend(loc='best', fontsize = 'x-large')\n", "plt.title(\"Mean Square Error = {:.2e}\".format(error), fontsize = 'xx-large');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## In-class exercise: What if we increase the degree?\n", "Try increasing the degree gradually in `PolynomialFeatures()` (since we have packed `PolynomialFeatures()` and `LinearRegression()` into one class, we can use pipeline). " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# now we use pipeline to change the polynomial_features directly w/o redefine it\n", "# better than the scikit-learn's example's clumsy usage of pipeline\n", "pipeline.set_params(polynomial_features__degree=9)\n", "pipeline.fit(X_train.reshape(-1,1), y_train)\n", "\n", "## Cross-validation\n", "num_samples = 100\n", "X_test = np.linspace(0, 2, num_samples) # this the the testing points\n", "y_pred = pipeline.predict(X_test.reshape(-1,1)) # this the value predicted by the model\n", "y_true = true_function(X_test)\n", "error = np.mean((y_pred - y_true)**2)\n", "\n", "plt.figure(figsize=(8,6))\n", "plt.plot(X_test, y_pred, linewidth = 2, label=\"Model's prediction\")\n", "plt.plot(X_test, y_true, linewidth = 2, label=\"True function\")\n", "plt.scatter(X_train, y_train, edgecolor='b', s=40, label=\"Training samples\")\n", "plt.legend(loc='best', fontsize = 'x-large')\n", "plt.title(\"Mean Square Error = {:.2e}\".format(error), fontsize = 'xx-large');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## When the degree = 9, it is interpolation. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## When the degree = 9, it is interpolation. But is it doing us any favor?\n", "\n", "### Metric: Mean squared error\n", "Check the mean square error we had at the testing points:\n", "$$\n", "\\operatorname{MSE} = \\frac{1}{n_{\\text{test}}}\n", "\\sum_{i=1}^{n_{\\text{test}}}\\left(y^{\\text{True}}_{i}- y^{\\text{Pred}}_{i} \\right)^{2}.\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## In-class exercise: Underfitting vs Overfitting\n", "* Underfitting: $k = ?$\n", "* Good fitting: $k = ?$\n", "* Overfitting: $k = ?$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Multivariate linear regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let us look at the [wine data set on Kaggle](https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009) again. The first 11 columns are \"objective\" measurements, and the last column \"quality\" is a score that is subjectively assigned to the wines. \n", "\n", "In the last few lectures, we fit the data $(x_i, y_i)$ using the explicit least-squares formula and gradient descent. There is a similar solution when you are trying to learn a linear approximation from a given data set:\n", "\n", "$$\\{ (x^{(i)}_1,...,x^{(i)}_n,y^{(i)}) \\}_{i=1}^N$$\n", "\n", "The $i$-th data point has $\\mathbf{x}^{(i)} = (x^{(i)}_1,...,x^{(i)}_n)$ as its features ($n$ features in total, a.k.a. input values, training data), and $y^{(i)}$ as its label (target value, training label).\n", "\n", "We are trying to learn a function that tells the quality of a red wine from the inputs in the first 11 columns\n", "\n", "$$h(\\mathbf{x}) = h(x_1,\\dots,x_n) = \\sum_{i=1}^n w_i x_i + w_0 = [1, \\;\\mathbf{x}]^{\\top} \\mathbf{w}$$\n", "\n", "We hope that we have $y^{(i)} \\approx h(\\mathbf{x}^{(i)})$ for the $i$-th training example. If we succeed in finding such a function $h(\\mathbf{x})$, and we have seen enough examples of wines and their quality scores, we hope that the function $h(\\mathbf{x})$ will also be a good predictor of the wine quality score even when we are given the features of a new wine whose quality score is not known.\n", "\n", "Mathematically speaking, starting from a vector of input values $(x^{(j)}_1,...,x^{(j)}_n)$, where $j \\neq 1,2,\\dots, N$ is not among our training samples, this function should be able to generate an output (target value) $y^{(j)}$ which is a good prediction. For example, given this input $(x^{(j)}_1,...,x^{(j)}_n)$, including the acidities, the sulphur concentration, etc. (the features), we can use our model to predict a $y^{(j)}$ that represents the quality of this wine." ] },
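{ "cell_type": "markdown", "metadata": {}, "source": [ "To make the notation concrete, here is a small sketch with made-up numbers (`X_toy` and `w_toy` are invented for illustration): prepending a $1$ to each feature vector lets one matrix-vector product evaluate $h$ at every sample, which is exactly the matrix form used in the next section." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# a sketch of h(x) = [1, x]^T w evaluated on all samples at once; numbers are made up\n", "X_toy = np.array([[7.4, 0.70], [7.8, 0.88], [11.2, 0.28]]) # 3 made-up samples, n = 2 features\n", "w_toy = np.array([5.0, 0.1, -1.0]) # [w_0 (bias), w_1, w_2]\n", "X_design = np.hstack([np.ones((X_toy.shape[0], 1)), X_toy]) # prepend the column of ones\n", "print(X_design @ w_toy) # h(x^(i)) for each sample" ] },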
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reading: How to achieve this goal?\n", "\n", "----\n", "\n", "### Loss function:\n", "$$\n", "L(\\mathbf{w}) = \\frac{1}{N}\\sum_{i=1}^N \\left( h(\\mathbf{x}^{(i)}) - y^{(i)} \\right)^2 = \n", "\\frac{1}{N}\\sum_{i=1}^N \n", "\\left( [1, \\;\\mathbf{x}^{(i)}]^{\\top} \\mathbf{w} - y^{(i)} \\right)^2, \\tag{$\\ast$}\n", "$$\n", "which in the matrix form, can be written as:\n", "$$\n", "L(\\mathbf{w}) = \\frac{1}{N}\\left\\| \\mathbf{X} \\mathbf{w} - \\mathbf{y}\\right\\|^2\n", "= \\frac{1}{N}\\left\\| \n", "\\text{ }\n", "\\begin{pmatrix} \n", "1 & x^{(1)}_1 & \\dots & x^{(1)}_n\n", "\\\\\n", "1 & x^{(2)}_1 & \\dots & x^{(2)}_n\n", "\\\\\n", "1 & \\ddots & \\ddots & \\vdots\n", "\\\\\n", "1 & x^{(N)}_1 & \\dots & x^{(N)}_n\n", "\\end{pmatrix}\n", "\\begin{pmatrix}\n", "w_0 \\\\ w_1 \\\\ \\vdots \\\\ w_n\n", "\\end{pmatrix}\n", "- \\begin{pmatrix}\n", "y^{(1)} \\\\ y^{(2)} \\\\ \\vdots \\\\ y^{(N)}\n", "\\end{pmatrix}\n", "\\text{ } \n", "\\right\\|^2.\n", "$$\n", "The matrix $\\mathbf{X}$ is an $N\\times (n+1)$ matrix, where the $i$-th row ($1\\leq i \\leq N$) corresponds to the $i$-th data point, and each column corresponds to a feature that is associated with a weight $w_k$ ($k=1,\\dots, n$) or the bias $w_0$.\n", "\n", "----\n", "\n", "### Least-square\n", "We want to minimize the \"loss\". Solve the linear system generated by $\\displaystyle \\frac{\\partial L}{\\partial w_i} = 0$ for the weights and bias. After a long computation (we should be able to compute on paper for $n=2$ case), we will have:\n", "$$\n", "\\min_{\\mathbf{w} \\in \\mathbb{R}^{n+1}} L(\\mathbf{w}) \\Longleftrightarrow \\frac{\\partial L}{\\partial w_m} = 0 \\; \\text{ for } k= 0,1,\\dots n \\Longleftrightarrow \n", "\\text{Solve for } \\;\\mathbf{w} \\;\\text{ in }\\; (\\mathbf{X}^{\\top} \\mathbf{X})\\mathbf{w} = \\mathbf{X}^\\top \\mathbf{y}\n", "$$\n", "\n", "----\n", "\n", "### Gradient descent\n", "\n", "\n", "Taking derivative with respect to the $k$-th weight based on equation $(\\ast)$ yields: for $k=0,1,\\dots, n$\n", "$$\n", "\\frac{\\partial L}{\\partial w_k} = \\frac{2}{N}\\sum_{i=1}^N \\frac{\\partial h}{\\partial w_k} \\left(h(\\mathbf{x}^{(i)}) - y^{(i)}\\right)\n", "= \\frac{2}{N}\\sum_{i=1}^N x^{(i)}_k \\left( [1, \\;\\mathbf{x}^{(i)}]^{\\top} \\mathbf{w} - y^{(i)}\\right),\n", "$$\n", "which is the sum of the product of $k$-th feature and the residual \n", "$[1, \\;\\mathbf{x}^{(i)}]^{\\top} \\mathbf{w} - y^{(i)}$. \n", "\n", "Now if we take the gradient with respect to all features:\n", "$$\n", "\\frac{\\partial L}{\\partial \\mathbf{w}} = \\frac{2}{N}\\sum_{i=1}^N \\frac{\\partial h}{\\partial \\mathbf{w}} \\left(h(\\mathbf{x}^{(i)}) - y^{(i)}\\right)\n", "= \\frac{2}{N}\\sum_{i=1}^N[1, \\;\\mathbf{x}^{(i)}]^{\\top} \\left( [1, \\;\\mathbf{x}^{(i)}]^{\\top} \\mathbf{w} - y^{(i)}\\right),\n", "$$\n", "\n", "Copying the routine from Lecture 14 and 16 and adapt it here: $w_{k,m}$ stands for the weight for the $k$-th feature in the $m$-th iteration of the gradient descent.\n", "\n", "> Choose initial guess $\\mathbf{w}_0 := (w_{0,0}, w_{1,0}, w_{2,0}, \\dots, w_{n,0})$ and step size (learning rate) $\\eta$

\n", "> For $m=0,1,2, \\cdots, M$

\n", ">      $\\mathbf{w}_{m+1} = \\mathbf{w}_m - \\eta\\,\\nabla L\\left(\\mathbf{w}_m\\right) $\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tool: scikit-learn\n", "Reading: [https://scikit-learn.org/stable/modules/linear_model.html](https://scikit-learn.org/stable/modules/linear_model.html)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wine_data = pd.read_csv('winequality-red.csv')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# choose train and test data\n", "X_train = wine_data.values[:-200,:-1] # no last column, last column is the quality score which is y\n", "y_train = wine_data.values[:-200,-1]\n", "\n", "X_test = wine_data.values[-200:,:-1]\n", "y_test = wine_data.values[-200:,-1]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LinearRegression\n", "wine_reg = LinearRegression()\n", "wine_reg.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Make predictions using the testing set\n", "y_pred = wine_reg.predict(X_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cross-validation\n", "\n", "### Metric 1: Mean squared error\n", "$$\n", "\\text{MSE}(\\mathbf{y}^{\\text{Actual}}, \\mathbf{y}^{\\text{Pred}}) = \\frac{1}{N} \n", "\\sum_{i=1}^{N} (y^{(i),\\text{Actual}} - y^{(i),\\text{Pred}})^2.\n", "$$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y_train_pred = wine_reg.predict(X_train)\n", "train_MSE = np.mean((y_train - y_train_pred)**2)\n", "print(\"Training mean squared error is: %.5f\" % train_MSE)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_MSE = np.mean((y_test- y_pred)**2)\n", "print(\"Testing mean squared error is: %.5f\" % test_MSE)\n", "# kinda small...kinda big..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cross-validation\n", "\n", "### Metric 2: Coefficient of determination $R^2$\n", "$$\n", "R^2\\Big(\\mathbf{y}^{\\text{Actual}}, \\mathbf{y}^{\\text{Pred}}\\Big) = 1 - \\frac{\\displaystyle\\sum_{i=1}^{n_{\\text{test}}} \\left(y^{(i),\\text{Actual}} - y^{(i),\\text{Pred}}\\right)^2}{\\displaystyle\\sum_{i=1}^{n_\\text{test}} (y^{(i),\\text{Actual}} - \\bar{y}^{\\text{Actual}})^2}\n", "\\quad \n", "\\text{ where }\\; \\bar{y}^{\\text{Actual}} = \\displaystyle\\frac{1}{n_{\\text{test}}} \n", "\\sum_{i=1}^{n_\\text{test}} y^{(i),\\text{Actual}}\n", "$$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "R_squared = 1 - (np.sum((y_test- y_pred)**2))/(np.sum((y_test- y_test.mean())**2))\n", "print(\"R squared is: %.5f\" % R_squared)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### How to improve?\n", "We will see next time." 
] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.2" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": true, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }