{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lecture 14: Linear Regression by Gradient Descent" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Today we will learn how to minimize the loss function of linear regression using gradient descent, and walk through the general machine learning workflow:\n", "* Choosing and setting up the model.\n", "* Training the model (gradient descent).\n", "* Cross-validation / testing (MSE, $R^2$).\n", "\n", "\n", "\n", "Again we will import the wine quality data from the [UCI machine learning dataset repo on Kaggle](https://www.kaggle.com/uciml/datasets) as we did in the last lecture.\n", "\n", "* Download `winequality-red.csv` from [https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009](https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009/), unzip it, and put it in the same directory as this notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# this will help us plot things nicely\n", "def plot_data_and_line(weight, bias, X, Y):\n", "    # this function plots the data against our model\n", "    # weight and bias are our regression model's parameters\n", "    # conventionally X is the training data and Y the labels\n", "    plot_max, plot_min = max(X), min(X)\n", "    XX = np.linspace(plot_min, plot_max, 200)\n", "    YY = weight*XX + bias\n", "\n", "    plt.scatter(X, Y, alpha=0.1)\n", "    plt.plot(XX, YY, color='red', linewidth=4, alpha=0.4)\n", "    plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wine_data = pd.read_csv('winequality-red.csv')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fix_acid = wine_data['fixed acidity'].values\n", "vol_acid = wine_data['volatile acidity'].values\n", "ctr_acid = wine_data['citric acid'].values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Step 1: Model and loss\n", "\n", "## Model/Hypothesis\n", "Here we want to investigate the (empirical) relation between citric acid and fixed acidity.\n", "There are $N$ samples; the $i$-th sample has a feature $x_i$ (citric acid) and a target value $y_i$ (fixed acidity), and we want to use the feature(s) to predict the target value(s). Here we use a linear model:\n", "$$h(x; w,b) = wx + b.\n", "$$\n", "Specifically, in this example $h(x_i; w, b) = w x_i + b$, where the citric acid value is the input $x_i$ and $w, b$ are the parameters; we want $h(x_i; w, b)$ to be as close as possible to $y_i$ (the actual target value).\n", "\n", "## Loss function\n", "$$L(w,b; X, y) = \frac{1}{N}\sum_{i=1}^{N} \Big((w x_i + b) - y_i\Big)^2,$$\n", "where the $1/N$ factor is added so that, statistically, we are minimizing the expected value of the squared error under the empirical distribution."
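] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before training, it can help to see what the hypothesis looks like for a hand-picked $(w,b)$. The next cell is a small illustrative sketch (not part of the required workflow): the guessed weight and bias values below are arbitrary, chosen only for visualization." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# illustrative sketch: plot the data against the line h(x) = w*x + b\n", "# for an arbitrary, hand-picked guess of (w, b); these values are not trained\n", "w_guess, b_guess = 2.0, 7.0\n", "plot_data_and_line(w_guess, b_guess, ctr_acid, fix_acid)"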
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x = ctr_acid\n", "y = fix_acid\n", "N = len(x)\n", "\n", "# be aware that x and y here are both arrays\n", "# model\n", "def h(x, w, b):\n", "    return w * x + b\n", "\n", "# loss function = mean squared error over the whole data set\n", "def loss(w, b, x, y):\n", "    error = h(x, w, b) - y\n", "    return np.mean(error**2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Step 2: Training\n", "\n", "## Gradient descent to minimize the loss\n", "\n", "> Choose an initial guess $(w_0,b_0)$ and a step size (learning rate) $\eta$.<br>\n", "> For $k=0,1,2, \cdots, M$:<br>\n",
"> &emsp;&emsp; $(w_{k+1},b_{k+1}) = (w_k,b_k) - \eta \nabla L(w_k,b_k)$\n", "\n", "From Lecture 13:\n", "$$\frac{\partial}{\partial b} L(w,b) = \frac{2}{N} \sum\limits_{i=1}^N \big(w x_i + b-y_i\big) \n", "= \frac{2}{N} \sum\limits_{i=1}^N \big(h(x_i)-y_i\big)$$\n", "\n", "$$\frac{\partial}{\partial w} L(w,b) = \frac{2}{N} \sum\limits_{i=1}^N \big((w x_i + b-y_i) \cdot x_i\big)\n", "= \frac{2}{N} \sum\limits_{i=1}^N \big((h(x_i)-y_i) \cdot x_i\big)$$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def gradient_loss(w, b, x, y):\n", "    # partial derivatives of the loss with respect to w and b\n", "    dw = 2*np.mean((h(x, w, b) - y)*x)\n", "    db = 2*np.mean(h(x, w, b) - y)\n", "    return dw, db" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "eta = 0.05\n", "num_steps = 500\n", "\n", "# initial guess\n", "w = 0.0\n", "b = 0.0\n", "loss_at_eachstep = np.zeros(num_steps) # record the change of the loss function\n", "for i in range(num_steps):\n", "    loss_at_eachstep[i] = loss(w, b, x, y)\n", "    dw, db = gradient_loss(w, b, x, y)\n", "    w = w - eta * dw\n", "    b = b - eta * db\n", "    if i in (1, 2, 5, 10, 20, 50, 100, 200, 400):\n", "        plot_data_and_line(w, b, x, y)\n", "\n", "print(\"final loss after\", num_steps, \"iterations is:\", loss(w, b, x, y))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# plot how the loss decreases over the gradient descent iterations\n", "plt.plot(range(num_steps), loss_at_eachstep)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Step 3: Testing/cross-validation\n", "\n", "To make sure our model is able to predict on unseen data, we normally reserve a certain portion of all the samples as a testing/CV set. The training is done on the training samples, and we then test the trained model on the testing samples.\n", "\n", "We can split the data manually using slicing, or use the `train_test_split` function from `scikit-learn`'s `model_selection` submodule.\n", "\n", "### Metric 1: Mean squared error\n", "$$\n", "\text{MSE}(\mathbf{y}^{\text{Actual}}, \mathbf{y}^{\text{Pred}}) = \frac{1}{n_\text{test}} \n", "\sum_{i=1}^{n_\text{test}} (y^{\text{Actual}}_i - y^{\text{Pred}}_i)^2.\n", "$$\n", "\n", "Remark: the training MSE is exactly the loss function we minimize." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Metric 2: Coefficient of determination $R^2$\n", "$$\n", "R^2\Big(\mathbf{y}^{\text{Actual}}, \mathbf{y}^{\text{Pred}}\Big) = 1 - \frac{\displaystyle\sum_{i=1}^{n_{\text{test}}} \left(y^{\text{Actual}}_i - y^{\text{Pred}}_i\right)^2}{\displaystyle\sum_{i=1}^{n_\text{test}} (y^{\text{Actual}}_i - \bar{y}^{\text{Actual}})^2}\n", "\quad \n", "\text{ where }\; \bar{y}^{\text{Actual}} = \displaystyle\frac{1}{n_{\text{test}}} \n", "\sum_{i=1}^{n_\text{test}} y^{\text{Actual}}_i\n", "$$\n", "Let us use Wikipedia's illustration to explain this metric:\n", "[Image: Wikipedia's \"Rsquared\" figure, comparing the residual sum of squares with the total sum of squares]\n", "$$\n", "R^2 = 1-{\frac {\color {blue}{\text{residual sum of squares }} }{\color {red}{\text{total sum of squares}}}}\n", "$$\n", "In regression, the $R^2$ coefficient of determination measures how well the regression predictions approximate the real data points. An $R^2$ of 1 indicates that the regression predictions perfectly fit the test data."
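] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below is a minimal sketch of how Step 3 could be carried out with the pieces defined above (it reuses `x`, `y`, `h`, `gradient_loss`, `eta`, and `num_steps` from the earlier cells). The 80/20 split ratio and the `random_state` value are arbitrary choices for illustration, not prescribed values." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# minimal sketch of a train/test evaluation; the split ratio and random_state are arbitrary\n", "x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)\n", "\n", "# re-run gradient descent on the training samples only\n", "w, b = 0.0, 0.0\n", "for _ in range(num_steps):\n", "    dw, db = gradient_loss(w, b, x_train, y_train)\n", "    w = w - eta * dw\n", "    b = b - eta * db\n", "\n", "# Metric 1: test MSE (same formula as the loss, evaluated on the held-out samples)\n", "y_pred = h(x_test, w, b)\n", "mse_test = np.mean((y_test - y_pred)**2)\n", "\n", "# Metric 2: R^2 = 1 - residual sum of squares / total sum of squares\n", "r2_test = 1 - np.sum((y_test - y_pred)**2)/np.sum((y_test - np.mean(y_test))**2)\n", "\n", "print(\"test MSE:\", mse_test)\n", "print(\"test R^2:\", r2_test)"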
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# In-class exercise\n", "* Try the same procedure by letting `x = vol_acid` and `y = fix_acid`.\n", "* Try varying the step size (learning rate) $\eta$ from 1, 1e-1, 1e-2, to 1e-3. What do you observe?\n", "* A regularization term can be added to the loss function to make the convergence smoother: we choose a small $\epsilon < 1$ so that the loss is \"regularized\" by the sum of the squared weight(s) (bias not included):\n", "$$L(w,b) = \frac{1}{N}\sum_{i=1}^{N} \Big((w x_i + b) - y_i\Big)^2 + \epsilon w^2$$\n", "Try this with $\epsilon = 10^{-3}$.\n", "(Reading: this is called [Ridge regression](https://scikit-learn.org/stable/modules/linear_model.html#ridge-regression)).\n", "\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.2" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": true, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }