{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lecture 15: Overfitting" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In last lecture we applied linear regression on looking for the relation between $x$ and $y$ using a set of data $\\{(x_i,y_i)\\}_{i=1}^{N}$. Our linear regression only \"fits\" the data roughly, not precisely. Yet it is a good model. To know why, we need to learn interpolation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What is interpolation:\n", "Suppose we know $n+1$ distinct grid points\n", "$x_0, x_1, x_2, \\dots, x_n$, and the values the values at each of these\n", "points as $f_k = f(x_k)$, but we have no idea of what $f$'s analytical expression is. Then the problem of interpolation is to find an approximation of $h(x)$ that is defined at any point $x \\in [a, b]$ that **coincides** with $f(x)$ at $x_k$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us borrow `scikit-learn` package again.\n", "\n", "Reference: Adapted from [https://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html](https://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html) to be more readable." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "from sklearn.pipeline import Pipeline # for easier fitting using high degree polynomials testing\n", "from sklearn.preprocessing import PolynomialFeatures # evaluating polynomials at points\n", "from sklearn.linear_model import LinearRegression # we have used this before" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model:\n", "Now we are gonna use a built-in \"linear\" regression model (which is a class) `LinearRegression` in `scikit-learn `package to fit not just a linear function but a polynomial function of any degree, e.g. $h(x) = w_{10} x^{10} + w_9 x^9 + \\dots + w_1 x + b$, to the data. \n", "\n", "Remark: if you are interested, we are using the Vandermonde matrix by adding $x^p$ as features. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Training:\n", "`X_train`, `y_train` are our training data. In the following example, we have 10 of them." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.random.seed(42)\n", "X_train = np.linspace(0,2,10)\n", "# true function is x^2, adding some noise\n", "true_function = lambda x: x**2\n", "y_train = true_function(X_train) + np.random.normal(0,0.5, size=10)\n", "plt.scatter(X_train, y_train, s=40, alpha=0.8);" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# linear regression\n", "polynomial_features = PolynomialFeatures(degree=1,include_bias=False)\n", "linear_regression = LinearRegression()\n", "pipeline = Pipeline([(\"polynomial_features\", polynomial_features),\n", " (\"linear_regression\", linear_regression)])\n", "# pipeline combines both classes" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# pipeline.fit(X_train[:, np.newaxis], y_train) # \"training\"\n", "pipeline.fit(X_train.reshape(-1,1), y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cross-validation (Testing)\n", "Basically we choose a bunch of testing points, see if our model (built from only 10 noisy samples) approximates our true function $x^2$ to a reasonable level." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "num_samples = 100\n", "X_test = np.linspace(0, 2, num_samples) # this the the testing points\n", "y_pred = pipeline.predict(X_test.reshape(-1,1)) \n", "y_true = true_function(X_test)\n", "error = np.mean((y_pred - y_true)**2)\n", "\n", "plt.figure(figsize=(8,6))\n", "plt.plot(X_test, y_pred, linewidth = 2, label=\"Model's prediction\")\n", "plt.plot(X_test, y_true, linewidth = 2, label=\"True function\")\n", "plt.scatter(X_train, y_train, edgecolor='b', s=40, label=\"Training samples\")\n", "plt.legend(loc='best', fontsize = 'x-large')\n", "plt.title(\"Mean Square Error = {:.2e}\".format(error), fontsize = 'xx-large');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## In-class exercise: What if we increase the degree?\n", "Try increasing the degree gradually in `PolynomialFeatures()` (since we have packed `PolynomialFeatures()` and `LinearRegression()` into one class, we can use pipeline). " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# now we use pipeline to change the polynomial_features directly w/o redefine it\n", "# better than the scikit-learn's example's clumsy usage of pipeline\n", "pipeline.set_params(polynomial_features__degree=9)\n", "pipeline.fit(X_train.reshape(-1,1), y_train)\n", "\n", "## Cross-validation\n", "num_samples = 100\n", "X_test = np.linspace(0, 2, num_samples) # this the the testing points\n", "y_pred = pipeline.predict(X_test.reshape(-1,1)) # this the value predicted by the model\n", "y_true = true_function(X_test)\n", "error = np.mean((y_pred - y_true)**2)\n", "\n", "plt.figure(figsize=(8,6))\n", "plt.plot(X_test, y_pred, linewidth = 2, label=\"Model's prediction\")\n", "plt.plot(X_test, y_true, linewidth = 2, label=\"True function\")\n", "plt.scatter(X_train, y_train, edgecolor='b', s=40, label=\"Training samples\")\n", "plt.legend(loc='best', fontsize = 'x-large')\n", "plt.title(\"Mean Square Error = {:.2e}\".format(error), fontsize = 'xx-large');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## When the degree = 9, it is interpolation. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## When the degree = 9, it is interpolation. But is it doing us any favor?\n", "\n", "### Metric: Mean squared error\n", "Check the mean square error we had at the testing points:\n", "$$\n", "\\operatorname{MSE} = \\frac{1}{n_{\\text{test}}}\n", "\\sum_{i=1}^{n_{\\text{test}}}\\left(y^{\\text{True}}_{i}- y^{\\text{Pred}}_{i} \\right)^{2}.\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## In-class exercise: Underfitting vs Overfitting\n", "* Underfitting: $k = ?$\n", "* Good fitting: $k = ?$\n", "* Overfitting: $k = ?$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Multivariate linear regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let us look at the [wine data set on Kaggle](https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009) again. The first 11 columns are \"objective\" measurements, and the last column \"quality\" is a score that is subjectively assigned to the wines. \n", "\n", "In the last few lectures, we fit the data $(x_i, y_i)$ using the explicit least-squares formula and gradient descent. There is a similar solution when you are trying to learn a linear approximation from a given data set:\n", "\n", "$$\\{ (x^{(i)}_1,...,x^{(i)}_n,y^{(i)}) \\}_{i=1}^N$$\n", "\n", "The $i$-th data point has $\\mathbf{x}^{(i)} = (x^{(i)}_1,...,x^{(i)}_n)$ as its features ($n$ features in total, a.k.a. input values, training data), and $y^{(i)}$ as its label (target value, training label).\n", "\n", "We are trying to learn a function that tells the quality of a red wine from the inputs in the first 11 columns\n", "\n", "$$h(\\mathbf{x}) = h(x_1,\\dots,x_n) = \\sum_{i=1}^n w_i x_i + w_0 = [1, \\;\\mathbf{x}]^{\\top} \\mathbf{w}$$\n", "\n", "We hope that we have $y^{(i)} \\approx h(\\mathbf{x}^{(i)})$ for the $i$-th training example. If we succeed in finding such a function $h(\\mathbf{x})$, and we have seen enough examples of wines and their quality scores, we hope that the function $h(\\mathbf{x})$ will also be a good predictor of the wine quality score even when we are given the features of a new wine whose quality score is not known.\n", "\n", "Mathematically speaking, starting from a vector of input values $(x^{(j)}_1,...,x^{(j)}_n)$, where $j \\neq 1,2,\\dots, N$ is not among our training samples, this function should be able to generate an output (target value) $y^{(j)}$ which is a good prediction. For example, given this input $(x^{(j)}_1,...,x^{(j)}_n)$, including the acidities, the sulphur concentration, etc. (the features), we can use our model to predict a $y^{(j)}$ that represents the quality of this wine." ] },
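{ "cell_type": "markdown", "metadata": {}, "source": [ "To make the notation concrete, here is a small sketch with made-up numbers (`X_toy` and `w_toy` are invented for illustration): prepending a $1$ to each feature vector lets one matrix-vector product evaluate $h$ at every sample, which is exactly the matrix form used in the next section." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# a sketch of h(x) = [1, x]^T w evaluated on all samples at once; numbers are made up\n", "X_toy = np.array([[7.4, 0.70], [7.8, 0.88], [11.2, 0.28]]) # 3 made-up samples, n = 2 features\n", "w_toy = np.array([5.0, 0.1, -1.0]) # [w_0 (bias), w_1, w_2]\n", "X_design = np.hstack([np.ones((X_toy.shape[0], 1)), X_toy]) # prepend the column of ones\n", "print(X_design @ w_toy) # h(x^(i)) for each sample" ] },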
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reading: How to achieve this goal?\n", "\n", "----\n", "\n", "### Loss function:\n", "$$\n", "L(\\mathbf{w}) = \\frac{1}{N}\\sum_{i=1}^N \\left( h(\\mathbf{x}^{(i)}) - y^{(i)} \\right)^2 = \n", "\\frac{1}{N}\\sum_{i=1}^N \n", "\\left( [1, \\;\\mathbf{x}^{(i)}]^{\\top} \\mathbf{w} - y^{(i)} \\right)^2, \\tag{$\\ast$}\n", "$$\n", "which in the matrix form, can be written as:\n", "$$\n", "L(\\mathbf{w}) = \\frac{1}{N}\\left\\| \\mathbf{X} \\mathbf{w} - \\mathbf{y}\\right\\|^2\n", "= \\frac{1}{N}\\left\\| \n", "\\text{ }\n", "\\begin{pmatrix} \n", "1 & x^{(1)}_1 & \\dots & x^{(1)}_n\n", "\\\\\n", "1 & x^{(2)}_1 & \\dots & x^{(2)}_n\n", "\\\\\n", "1 & \\ddots & \\ddots & \\vdots\n", "\\\\\n", "1 & x^{(N)}_1 & \\dots & x^{(N)}_n\n", "\\end{pmatrix}\n", "\\begin{pmatrix}\n", "w_0 \\\\ w_1 \\\\ \\vdots \\\\ w_n\n", "\\end{pmatrix}\n", "- \\begin{pmatrix}\n", "y^{(1)} \\\\ y^{(2)} \\\\ \\vdots \\\\ y^{(N)}\n", "\\end{pmatrix}\n", "\\text{ } \n", "\\right\\|^2.\n", "$$\n", "The matrix $\\mathbf{X}$ is an $N\\times (n+1)$ matrix, where the $i$-th row ($1\\leq i \\leq N$) corresponds to the $i$-th data point, and each column corresponds to a feature that is associated with a weight $w_k$ ($k=1,\\dots, n$) or the bias $w_0$.\n", "\n", "----\n", "\n", "### Least-square\n", "We want to minimize the \"loss\". Solve the linear system generated by $\\displaystyle \\frac{\\partial L}{\\partial w_i} = 0$ for the weights and bias. After a long computation (we should be able to compute on paper for $n=2$ case), we will have:\n", "$$\n", "\\min_{\\mathbf{w} \\in \\mathbb{R}^{n+1}} L(\\mathbf{w}) \\Longleftrightarrow \\frac{\\partial L}{\\partial w_m} = 0 \\; \\text{ for } k= 0,1,\\dots n \\Longleftrightarrow \n", "\\text{Solve for } \\;\\mathbf{w} \\;\\text{ in }\\; (\\mathbf{X}^{\\top} \\mathbf{X})\\mathbf{w} = \\mathbf{X}^\\top \\mathbf{y}\n", "$$\n", "\n", "----\n", "\n", "### Gradient descent\n", "\n", "\n", "Taking derivative with respect to the $k$-th weight based on equation $(\\ast)$ yields: for $k=0,1,\\dots, n$\n", "$$\n", "\\frac{\\partial L}{\\partial w_k} = \\frac{2}{N}\\sum_{i=1}^N \\frac{\\partial h}{\\partial w_k} \\left(h(\\mathbf{x}^{(i)}) - y^{(i)}\\right)\n", "= \\frac{2}{N}\\sum_{i=1}^N x^{(i)}_k \\left( [1, \\;\\mathbf{x}^{(i)}]^{\\top} \\mathbf{w} - y^{(i)}\\right),\n", "$$\n", "which is the sum of the product of $k$-th feature and the residual \n", "$[1, \\;\\mathbf{x}^{(i)}]^{\\top} \\mathbf{w} - y^{(i)}$. \n", "\n", "Now if we take the gradient with respect to all features:\n", "$$\n", "\\frac{\\partial L}{\\partial \\mathbf{w}} = \\frac{2}{N}\\sum_{i=1}^N \\frac{\\partial h}{\\partial \\mathbf{w}} \\left(h(\\mathbf{x}^{(i)}) - y^{(i)}\\right)\n", "= \\frac{2}{N}\\sum_{i=1}^N[1, \\;\\mathbf{x}^{(i)}]^{\\top} \\left( [1, \\;\\mathbf{x}^{(i)}]^{\\top} \\mathbf{w} - y^{(i)}\\right),\n", "$$\n", "\n", "Copying the routine from Lecture 14 and 16 and adapt it here: $w_{k,m}$ stands for the weight for the $k$-th feature in the $m$-th iteration of the gradient descent.\n", "\n", "> Choose initial guess $\\mathbf{w}_0 := (w_{0,0}, w_{1,0}, w_{2,0}, \\dots, w_{n,0})$ and step size (learning rate) $\\eta$

\n", "> For $m=0,1,2, \\cdots, M$

\n", ">      $\\mathbf{w}_{m+1} = \\mathbf{w}_m - \\eta\\,\\nabla L\\left(\\mathbf{w}_m\\right) $\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tool: scikit-learn\n", "Reading: [https://scikit-learn.org/stable/modules/linear_model.html](https://scikit-learn.org/stable/modules/linear_model.html)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wine_data = pd.read_csv('winequality-red.csv')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# choose train and test data\n", "X_train = wine_data.values[:-200,:-1] # no last column, last column is the quality score which is y\n", "y_train = wine_data.values[:-200,-1]\n", "\n", "X_test = wine_data.values[-200:,:-1]\n", "y_test = wine_data.values[-200:,-1]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LinearRegression\n", "wine_reg = LinearRegression()\n", "wine_reg.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Make predictions using the testing set\n", "y_pred = wine_reg.predict(X_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cross-validation\n", "\n", "### Metric 1: Mean squared error\n", "$$\n", "\\text{MSE}(\\mathbf{y}^{\\text{Actual}}, \\mathbf{y}^{\\text{Pred}}) = \\frac{1}{N} \n", "\\sum_{i=1}^{N} (y^{(i),\\text{Actual}} - y^{(i),\\text{Pred}})^2.\n", "$$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y_train_pred = wine_reg.predict(X_train)\n", "train_MSE = np.mean((y_train - y_train_pred)**2)\n", "print(\"Training mean squared error is: %.5f\" % train_MSE)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_MSE = np.mean((y_test- y_pred)**2)\n", "print(\"Testing mean squared error is: %.5f\" % test_MSE)\n", "# kinda small...kinda big..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cross-validation\n", "\n", "### Metric 2: Coefficient of determination $R^2$\n", "$$\n", "R^2\\Big(\\mathbf{y}^{\\text{Actual}}, \\mathbf{y}^{\\text{Pred}}\\Big) = 1 - \\frac{\\displaystyle\\sum_{i=1}^{n_{\\text{test}}} \\left(y^{(i),\\text{Actual}} - y^{(i),\\text{Pred}}\\right)^2}{\\displaystyle\\sum_{i=1}^{n_\\text{test}} (y^{(i),\\text{Actual}} - \\bar{y}^{\\text{Actual}})^2}\n", "\\quad \n", "\\text{ where }\\; \\bar{y}^{\\text{Actual}} = \\displaystyle\\frac{1}{n_{\\text{test}}} \n", "\\sum_{i=1}^{n_\\text{test}} y^{(i),\\text{Actual}}\n", "$$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "R_squared = 1 - (np.sum((y_test- y_pred)**2))/(np.sum((y_test- y_test.mean())**2))\n", "print(\"R squared is: %.5f\" % R_squared)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### How to improve?\n", "We will see next time." 
] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.2" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": true, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }