{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Lecture 15: Overfitting"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In last lecture we applied linear regression on looking for the relation between $x$ and $y$ using a set of data $\\{(x_i,y_i)\\}_{i=1}^{N}$. Our linear regression only \"fits\" the data roughly, not precisely. Yet it is a good model. To know why, we need to learn interpolation."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What is interpolation:\n",
"Suppose we know $n+1$ distinct grid points\n",
"$x_0, x_1, x_2, \\dots, x_n$, and the values the values at each of these\n",
"points as $f_k = f(x_k)$, but we have no idea of what $f$'s analytical expression is. Then the problem of interpolation is to find an approximation of $h(x)$ that is defined at any point $x \\in [a, b]$ that **coincides** with $f(x)$ at $x_k$."
]
},
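{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sketch of this definition (the function $f$ below is a made-up example, not part of the lecture data): take $n+1 = 4$ grid points, fit the unique degree-$n$ polynomial through them with `np.polyfit`, and verify that it coincides with $f$ at each $x_k$."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"# a minimal interpolation sketch; f is an assumed example\n",
"f = lambda x: np.sin(x)\n",
"x_grid = np.linspace(0, np.pi, 4)  # n+1 = 4 distinct grid points\n",
"h = np.poly1d(np.polyfit(x_grid, f(x_grid), deg=3))  # degree n = 3 polynomial\n",
"print(np.allclose(h(x_grid), f(x_grid)))  # True: h coincides with f at the x_k\n",
"print(h(0.5), f(0.5))  # between grid points, h only approximates f"
]
},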
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us borrow `scikit-learn` package again.\n",
"\n",
"Reference: Adapted from [https://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html](https://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html) to be more readable."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.pipeline import Pipeline # for easier fitting using high degree polynomials testing\n",
"from sklearn.preprocessing import PolynomialFeatures # evaluating polynomials at points\n",
"from sklearn.linear_model import LinearRegression # we have used this before"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Model:\n",
"Now we are gonna use a built-in \"linear\" regression model (which is a class) `LinearRegression` in `scikit-learn `package to fit not just a linear function but a polynomial function of any degree, e.g. $h(x) = w_{10} x^{10} + w_9 x^9 + \\dots + w_1 x + b$, to the data. \n",
"\n",
"Remark: if you are interested, we are using the Vandermonde matrix by adding $x^p$ as features. "
]
},
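{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the remark above (the sample inputs are made up for illustration): `PolynomialFeatures` maps each $x$ to the row $(x, x^2, \\dots, x^p)$, so stacking the rows gives a Vandermonde-style matrix."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"x_demo = np.array([[1.0], [2.0], [3.0]])  # illustrative inputs, shape (n_samples, 1)\n",
"poly_demo = PolynomialFeatures(degree=3, include_bias=False)\n",
"print(poly_demo.fit_transform(x_demo))    # each row is (x, x^2, x^3)"
]
},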
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Training:\n",
"`X_train`, `y_train` are our training data. In the following example, we have 10 of them."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"np.random.seed(42)\n",
"X_train = np.linspace(0,2,10)\n",
"# true function is x^2, adding some noise\n",
"true_function = lambda x: x**2\n",
"y_train = true_function(X_train) + np.random.normal(0,0.5, size=10)\n",
"plt.scatter(X_train, y_train, s=40, alpha=0.8);"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# linear regression\n",
"polynomial_features = PolynomialFeatures(degree=1,include_bias=False)\n",
"linear_regression = LinearRegression()\n",
"pipeline = Pipeline([(\"polynomial_features\", polynomial_features),\n",
" (\"linear_regression\", linear_regression)])\n",
"# pipeline combines both classes"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# pipeline.fit(X_train[:, np.newaxis], y_train) # \"training\"\n",
"pipeline.fit(X_train.reshape(-1,1), y_train)"
]
},
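{
"cell_type": "markdown",
"metadata": {},
"source": [
"After fitting we can peek at the learned parameters: `named_steps` reaches into the pipeline, and for degree 1 `coef_` holds a single slope alongside the intercept."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fitted_lr = pipeline.named_steps['linear_regression']\n",
"print(\"weights:\", fitted_lr.coef_)\n",
"print(\"bias:\", fitted_lr.intercept_)"
]
},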
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Cross-validation (Testing)\n",
"Basically we choose a bunch of testing points, see if our model (built from only 10 noisy samples) approximates our true function $x^2$ to a reasonable level."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"num_samples = 100\n",
"X_test = np.linspace(0, 2, num_samples) # this the the testing points\n",
"y_pred = pipeline.predict(X_test.reshape(-1,1)) \n",
"y_true = true_function(X_test)\n",
"error = np.mean((y_pred - y_true)**2)\n",
"\n",
"plt.figure(figsize=(8,6))\n",
"plt.plot(X_test, y_pred, linewidth = 2, label=\"Model's prediction\")\n",
"plt.plot(X_test, y_true, linewidth = 2, label=\"True function\")\n",
"plt.scatter(X_train, y_train, edgecolor='b', s=40, label=\"Training samples\")\n",
"plt.legend(loc='best', fontsize = 'x-large')\n",
"plt.title(\"Mean Square Error = {:.2e}\".format(error), fontsize = 'xx-large');"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## In-class exercise: What if we increase the degree?\n",
"Try increasing the degree gradually in `PolynomialFeatures()` (since we have packed `PolynomialFeatures()` and `LinearRegression()` into one class, we can use pipeline). "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# now we use pipeline to change the polynomial_features directly w/o redefine it\n",
"# better than the scikit-learn's example's clumsy usage of pipeline\n",
"pipeline.set_params(polynomial_features__degree=9)\n",
"pipeline.fit(X_train.reshape(-1,1), y_train)\n",
"\n",
"## Cross-validation\n",
"num_samples = 100\n",
"X_test = np.linspace(0, 2, num_samples) # this the the testing points\n",
"y_pred = pipeline.predict(X_test.reshape(-1,1)) # this the value predicted by the model\n",
"y_true = true_function(X_test)\n",
"error = np.mean((y_pred - y_true)**2)\n",
"\n",
"plt.figure(figsize=(8,6))\n",
"plt.plot(X_test, y_pred, linewidth = 2, label=\"Model's prediction\")\n",
"plt.plot(X_test, y_true, linewidth = 2, label=\"True function\")\n",
"plt.scatter(X_train, y_train, edgecolor='b', s=40, label=\"Training samples\")\n",
"plt.legend(loc='best', fontsize = 'x-large')\n",
"plt.title(\"Mean Square Error = {:.2e}\".format(error), fontsize = 'xx-large');"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## When the degree = 9, it is interpolation. But is it doing us any favor?\n",
"\n",
"### Metric: Mean squared error\n",
"Check the mean square error we had at the testing points:\n",
"$$\n",
"\\operatorname {MSE} ={\\frac {1}{n_{\\text{test}}}}\n",
"\\sum_{i=1}^{n_{\\text{test}}}\\left(y^{\\text{True}}_{i}- y^{\\text{Pred}}_{i} \\right)^{2}.\n",
"$$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## In-class exercise: Underfitting vs Overfitting\n",
"* Underfitting: $k = ?$\n",
"* Good fitting: $k = ?$\n",
"* Overfitting: $k = ?$"
]
},
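{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to explore this is to sweep the degree and compare training and testing errors; below is a sketch (the degree list is an arbitrary choice). Underfitting shows up as both errors being large; overfitting as a tiny training error paired with a large testing error."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for degree in [1, 2, 3, 5, 7, 9]:\n",
"    pipeline.set_params(polynomial_features__degree=degree)\n",
"    pipeline.fit(X_train.reshape(-1,1), y_train)\n",
"    train_error = np.mean((pipeline.predict(X_train.reshape(-1,1)) - y_train)**2)\n",
"    test_error = np.mean((pipeline.predict(X_test.reshape(-1,1)) - y_true)**2)\n",
"    print(\"degree {}: train MSE = {:.2e}, test MSE = {:.2e}\".format(degree, train_error, test_error))"
]
},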
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Multivariate linear regression"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let us look at the [wine data set on Kaggle](https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009) again. The first 11 columns are \"objective\" measurements, the last column \"quality\" is a score that is subjectively assigned to the wines. \n",
"\n",
"In last few lectures, we fit the data $(x_i, y_i)$ using explicit-formula in least-square and gradient descent. There is a similar solution for when you are trying to learn a linear approximation from a given data-set:\n",
"\n",
"$$\\{ (x^{(i)}_1,...,x^{(i)}_n,y^{(i)}) \\}_{i=1}^N$$\n",
"\n",
"The $i$-th data set has $\\mathbf{x}^{(i)} = (x^{(i)}_1,...,x^{(i)}_n)$ as features (totally $n$-features, a.k.a. input values, training data), and $y^{(i)}$ as the label (target values, training label).\n",
"\n",
"We are trying to learn a function to tell the quality of a red wine from all previous 11 columns' inputs\n",
"\n",
"$$h(\\mathbf{x}) = h(x_1,....,x_n) = \\sum_{i=1}^n w_i x_i + w_0 = [1, \\;\\mathbf{x}]^{\\top} \\mathbf{w}$$\n",
"\n",
"We hope that we have $y^{(i)} \\approx h(\\mathbf{x}^{(i)})$ for the $i$-th training example. If we succeed in finding a function $h(\\mathbf{x})$ like this, and we have seen enough examples of wines and their quality scores, we hope that the function $h(\\mathbf{x})$ will also be a good predictor of the wine quality score even when we are given the features for a new wine where the quality score is not known.\n",
"\n",
"Mathematically speaking, starting from a vector of input values $(x^{(j)}_1,...,x^{(j)}_n)$ where $j\\neq 1,2,\\dots, N$ is not in our training samples, this function should be able to generate an output (target value) $y^{(j)}$ which is a good predictor. For example, given this input $(x^{(j)}_1,...,x^{(j)}_n)$, including the acidities, sulphur concentration, etc (which represents features), we can use our model to predict a $y^{(j)}$ that represents the quality of this wine. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reading: How to achieve this goal?\n",
"\n",
"----\n",
"\n",
"### Loss function:\n",
"$$\n",
"L(\\mathbf{w}) = \\frac{1}{N}\\sum_{i=1}^N \\left( h(\\mathbf{x}^{(i)}) - y^{(i)} \\right)^2 = \n",
"\\frac{1}{N}\\sum_{i=1}^N \n",
"\\left( [1, \\;\\mathbf{x}^{(i)}]^{\\top} \\mathbf{w} - y^{(i)} \\right)^2, \\tag{$\\ast$}\n",
"$$\n",
"which in the matrix form, can be written as:\n",
"$$\n",
"L(\\mathbf{w}) = \\frac{1}{N}\\left\\| \\mathbf{X} \\mathbf{w} - \\mathbf{y}\\right\\|^2\n",
"= \\frac{1}{N}\\left\\| \n",
"\\text{ }\n",
"\\begin{pmatrix} \n",
"1 & x^{(1)}_1 & \\dots & x^{(1)}_n\n",
"\\\\\n",
"1 & x^{(2)}_1 & \\dots & x^{(2)}_n\n",
"\\\\\n",
"1 & \\ddots & \\ddots & \\vdots\n",
"\\\\\n",
"1 & x^{(N)}_1 & \\dots & x^{(N)}_n\n",
"\\end{pmatrix}\n",
"\\begin{pmatrix}\n",
"w_0 \\\\ w_1 \\\\ \\vdots \\\\ w_n\n",
"\\end{pmatrix}\n",
"- \\begin{pmatrix}\n",
"y^{(1)} \\\\ y^{(2)} \\\\ \\vdots \\\\ y^{(N)}\n",
"\\end{pmatrix}\n",
"\\text{ } \n",
"\\right\\|^2.\n",
"$$\n",
"The matrix $\\mathbf{X}$ is an $N\\times (n+1)$ matrix, where the $i$-th row ($1\\leq i \\leq N$) corresponds to the $i$-th data point, and each column corresponds to a feature that is associated with a weight $w_k$ ($k=1,\\dots, n$) or the bias $w_0$.\n",
"\n",
"----\n",
"\n",
"### Least-square\n",
"We want to minimize the \"loss\". Solve the linear system generated by $\\displaystyle \\frac{\\partial L}{\\partial w_i} = 0$ for the weights and bias. After a long computation (we should be able to compute on paper for $n=2$ case), we will have:\n",
"$$\n",
"\\min_{\\mathbf{w} \\in \\mathbb{R}^{n+1}} L(\\mathbf{w}) \\Longleftrightarrow \\frac{\\partial L}{\\partial w_m} = 0 \\; \\text{ for } k= 0,1,\\dots n \\Longleftrightarrow \n",
"\\text{Solve for } \\;\\mathbf{w} \\;\\text{ in }\\; (\\mathbf{X}^{\\top} \\mathbf{X})\\mathbf{w} = \\mathbf{X}^\\top \\mathbf{y}\n",
"$$\n",
"\n",
"----\n",
"\n",
"### Gradient descent\n",
"\n",
"\n",
"Taking derivative with respect to the $k$-th weight based on equation $(\\ast)$ yields: for $k=0,1,\\dots, n$\n",
"$$\n",
"\\frac{\\partial L}{\\partial w_k} = \\frac{2}{N}\\sum_{i=1}^N \\frac{\\partial h}{\\partial w_k} \\left(h(\\mathbf{x}^{(i)}) - y^{(i)}\\right)\n",
"= \\frac{2}{N}\\sum_{i=1}^N x^{(i)}_k \\left( [1, \\;\\mathbf{x}^{(i)}]^{\\top} \\mathbf{w} - y^{(i)}\\right),\n",
"$$\n",
"which is the sum of the product of $k$-th feature and the residual \n",
"$[1, \\;\\mathbf{x}^{(i)}]^{\\top} \\mathbf{w} - y^{(i)}$. \n",
"\n",
"Now if we take the gradient with respect to all features:\n",
"$$\n",
"\\frac{\\partial L}{\\partial \\mathbf{w}} = \\frac{2}{N}\\sum_{i=1}^N \\frac{\\partial h}{\\partial \\mathbf{w}} \\left(h(\\mathbf{x}^{(i)}) - y^{(i)}\\right)\n",
"= \\frac{2}{N}\\sum_{i=1}^N[1, \\;\\mathbf{x}^{(i)}]^{\\top} \\left( [1, \\;\\mathbf{x}^{(i)}]^{\\top} \\mathbf{w} - y^{(i)}\\right),\n",
"$$\n",
"\n",
"Copying the routine from Lecture 14 and 16 and adapt it here: $w_{k,m}$ stands for the weight for the $k$-th feature in the $m$-th iteration of the gradient descent.\n",
"\n",
"> Choose initial guess $\\mathbf{w}_0 := (w_{0,0}, w_{1,0}, w_{2,0}, \\dots, w_{n,0})$ and step size (learning rate) $\\eta$
\n",
"> For $m=0,1,2, \\cdots, M$
\n",
"> $\\mathbf{w}_{m+1} = \\mathbf{w}_m - \\eta\\,\\nabla L\\left(\\mathbf{w}_m\\right) $\n"
]
},
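{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below is a small numerical sketch of both approaches on made-up data (the sizes, true weights, learning rate, and iteration count are all illustrative choices): solve the normal equations $(\\mathbf{X}^{\\top} \\mathbf{X})\\mathbf{w} = \\mathbf{X}^\\top \\mathbf{y}$ directly, then run gradient descent and check that the two agree."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"np.random.seed(0)\n",
"N, n = 50, 2  # illustrative sizes\n",
"features = np.random.normal(size=(N, n))\n",
"X_mat = np.hstack([np.ones((N, 1)), features])  # prepend the bias column of ones\n",
"w_true = np.array([1.0, 2.0, -3.0])  # made-up ground-truth weights\n",
"y_vec = X_mat @ w_true + 0.1*np.random.normal(size=N)\n",
"\n",
"# least-squares: solve the normal equations (X^T X) w = X^T y\n",
"w_ls = np.linalg.solve(X_mat.T @ X_mat, X_mat.T @ y_vec)\n",
"\n",
"# gradient descent on L(w) = (1/N)||Xw - y||^2\n",
"w_gd = np.zeros(n + 1)\n",
"eta = 0.1  # illustrative learning rate\n",
"for m in range(2000):\n",
"    grad = (2/N) * X_mat.T @ (X_mat @ w_gd - y_vec)\n",
"    w_gd = w_gd - eta*grad\n",
"\n",
"print(\"least-squares w:   \", w_ls)\n",
"print(\"gradient descent w:\", w_gd)"
]
},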
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Tool: scikit-learn\n",
"Reading: [https://scikit-learn.org/stable/modules/linear_model.html](https://scikit-learn.org/stable/modules/linear_model.html)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"wine_data = pd.read_csv('winequality-red.csv')"
]
},
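{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick look at the table before splitting: per the description above, 11 measurement columns followed by `quality`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(wine_data.shape)\n",
"wine_data.head()"
]
},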
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# choose train and test data\n",
"X_train = wine_data.values[:-200,:-1] # no last column, last column is the quality score which is y\n",
"y_train = wine_data.values[:-200,-1]\n",
"\n",
"X_test = wine_data.values[-200:,:-1]\n",
"y_test = wine_data.values[-200:,-1]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.linear_model import LinearRegression\n",
"wine_reg = LinearRegression()\n",
"wine_reg.fit(X_train, y_train)"
]
},
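{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can pair each learned weight with its column name to see which measurements the model leans on (sign and magnitude, keeping in mind the features live on very different scales)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for name, w in zip(wine_data.columns[:-1], wine_reg.coef_):\n",
"    print(\"{:>22s}: {:+.4f}\".format(name, w))\n",
"print(\"{:>22s}: {:+.4f}\".format(\"bias\", wine_reg.intercept_))"
]
},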
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Make predictions using the testing set\n",
"y_pred = wine_reg.predict(X_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Cross-validation\n",
"\n",
"### Metric 1: Mean squared error\n",
"$$\n",
"\\text{MSE}(\\mathbf{y}^{\\text{Actual}}, \\mathbf{y}^{\\text{Pred}}) = \\frac{1}{N} \n",
"\\sum_{i=1}^{N} (y^{(i),\\text{Actual}} - y^{(i),\\text{Pred}})^2.\n",
"$$"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"y_train_pred = wine_reg.predict(X_train)\n",
"train_MSE = np.mean((y_train - y_train_pred)**2)\n",
"print(\"Training mean squared error is: %.5f\" % train_MSE)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"test_MSE = np.mean((y_test- y_pred)**2)\n",
"print(\"Testing mean squared error is: %.5f\" % test_MSE)\n",
"# kinda small...kinda big..."
]
},
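{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a cross-check, `scikit-learn` ships this metric as `mean_squared_error`; it should match our hand-rolled computation above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import mean_squared_error\n",
"print(mean_squared_error(y_test, y_pred))  # same as test_MSE above"
]
},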
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Cross-validation\n",
"\n",
"### Metric 2: Coefficient of determination $R^2$\n",
"$$\n",
"R^2\\Big(\\mathbf{y}^{\\text{Actual}}, \\mathbf{y}^{\\text{Pred}}\\Big) = 1 - \\frac{\\displaystyle\\sum_{i=1}^{n_{\\text{test}}} \\left(y^{(i),\\text{Actual}} - y^{(i),\\text{Pred}}\\right)^2}{\\displaystyle\\sum_{i=1}^{n_\\text{test}} (y^{(i),\\text{Actual}} - \\bar{y}^{\\text{Actual}})^2}\n",
"\\quad \n",
"\\text{ where }\\; \\bar{y}^{\\text{Actual}} = \\displaystyle\\frac{1}{n_{\\text{test}}} \n",
"\\sum_{i=1}^{n_\\text{test}} y^{(i),\\text{Actual}}\n",
"$$"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"R_squared = 1 - (np.sum((y_test- y_pred)**2))/(np.sum((y_test- y_test.mean())**2))\n",
"print(\"R squared is: %.5f\" % R_squared)"
]
},
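{
"cell_type": "markdown",
"metadata": {},
"source": [
"Again we can cross-check against the library: `r2_score` computes the same quantity, and `wine_reg.score(X_test, y_test)` returns it directly."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import r2_score\n",
"print(r2_score(y_test, y_pred))\n",
"print(wine_reg.score(X_test, y_test))"
]
},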
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### How to improve?\n",
"We will see next time."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.2"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autoclose": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
},
"varInspector": {
"cols": {
"lenName": 16,
"lenType": 16,
"lenVar": 40
},
"kernels_config": {
"python": {
"delete_cmd_postfix": "",
"delete_cmd_prefix": "del ",
"library": "var_list.py",
"varRefreshCmd": "print(var_dic_list())"
},
"r": {
"delete_cmd_postfix": ") ",
"delete_cmd_prefix": "rm(",
"library": "var_list.r",
"varRefreshCmd": "cat(var_dic_list()) "
}
},
"types_to_exclude": [
"module",
"function",
"builtin_function_or_method",
"instance",
"_Feature"
],
"window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 2
}