{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Sebastian Raschka, 2015 \n", "`mlxtend`, a library of extension and helper modules for Python's data analysis and machine learning libraries\n", "\n", "- GitHub repository: https://github.com/rasbt/mlxtend\n", "- Documentation: http://rasbt.github.io/mlxtend/\n", "\n", "View this page in [jupyter nbviewer](http://nbviewer.ipython.org/github/rasbt/mlxtend/blob/master/docs/sources/_ipynb_templates/regressor/linear_regression.ipynb)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sebastian Raschka \n", "last updated: 2016-01-30 \n", "\n", "CPython 3.5.1\n", "IPython 4.0.3\n", "\n", "matplotlib 1.5.1\n", "numpy 1.10.2\n", "scipy 0.16.1\n" ] } ], "source": [ "%load_ext watermark\n", "%watermark -a 'Sebastian Raschka' -u -d -v -p matplotlib,numpy,scipy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Deriving the Gradient Descent Rule for Linear Regression and Adaline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Linear Regression and Adaptive Linear Neurons (Adalines) are closely related to each other. In fact, the Adaline algorithm is a identical to linear regression except for a threshold function $\\phi(\\cdot)_T$ that converts the continuous output into a categorical class label\n", "\n", "$$\\phi(z)_T = \\begin{cases} \n", " 1 & if \\; z \\geq 0 \\\\\n", " 0 & if \\; z < 0 \n", " \\end{cases},$$\n", " \n", "where $z$ is the net input, which is computed as the sum of the input features $\\mathbf{x}$ multiplied by the model weights $\\mathbf{w}$:\n", "\n", "$$z = w_0x_0 + w_1x_1 \\dots w_mx_m = \\sum_{j=0}^{m} x_j w_j = \\mathbf{w}^T \\mathbf{x}$$\n", "\n", "(Note that $x_0$ refers to the bias unit so that $x_0=1$.)\n", "\n", "In the case of linear regression and Adaline, the activation function $\\phi(\\cdot)_A$ is simply the identity function so that $\\phi(z)_A = z$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](./linear-gradient-derivative_files/regression-vs-adaline.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, in order to learn the optimal model weights $\\mathbf{w}$, we need to define a cost function that we can optimize. Here, our cost function $J({\\cdot})$ is the sum of squared errors (SSE), which we multiply by $\\frac{1}{2}$ to make the derivation easier:\n", "\n", "$$J({\\mathbf{w}}) = \\frac{1}{2} \\sum_i \\big(y^{(i)} - \\phi(z)_{A}^{(i)}\\big)^2,$$\n", "\n", "where $y^{(i)}$ is the label or target label of the $i$th training point $x^{(i)}$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(Note that the SSE cost function is convex and therefore differentiable.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In simple words, we can summarize the gradient descent learning as follows:\n", "\n", "1. Initialize the weights to 0 or small random numbers.\n", "2. For $k$ epochs (passes over the training set)\n", " 2. For each training sample $x^{(i)}$\n", " 1. Compute the predicted output value $\\hat{y}^{(i)}$\n", " 2. Compare $\\hat{y}^{(i)}$ to the actual output $y^{(i)}$ and Compute the \"weight update\" value\n", " 3. Update the \"weight update\" value\n", " 3. Update the weight coefficients by the accumulated \"weight update\" values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Which we can translate into a more mathematical notation:\n", " \n", "1. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "Performing this global weight update\n", "\n", "$$\\mathbf{w} := \\mathbf{w} + \\Delta \\mathbf{w},$$\n", "\n", "can be understood as \"updating the model weights by taking a step in the opposite direction of the cost gradient, scaled by the learning rate $\\eta$\":\n", "\n", "$$\\Delta \\mathbf{w} = - \\eta \\nabla J(\\mathbf{w}),$$\n", "\n", "where the partial derivative with respect to each $w_j$ can be written as\n", "\n", "$$\\frac{\\partial J}{\\partial w_j} = - \\sum_i \\big(y^{(i)} - \\phi(z)_{A}^{(i)}\\big) x_{j}^{(i)}.$$\n", "\n", "Plugging this back in gives $\\Delta w_j = -\\eta \\frac{\\partial J}{\\partial w_j} = \\eta \\sum_i \\big(y^{(i)} - \\phi(z)_{A}^{(i)}\\big) x_{j}^{(i)}$, which is exactly the accumulated per-sample update from the algorithm above." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "To summarize: in order to use gradient descent to learn the model coefficients, we simply update the weights $\\mathbf{w}$ by taking a step in the opposite direction of the gradient for each pass over the training set -- that's basically it. But how do we get to the equation\n", "\n", "$$\\frac{\\partial J}{\\partial w_j} = - \\sum_i \\big(y^{(i)} - \\phi(z)_{A}^{(i)}\\big) x_{j}^{(i)}?$$" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let's walk through the derivation step by step." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "$$\\begin{aligned}\n", "& \\frac{\\partial J}{\\partial w_j} \\\\\n", "& = \\frac{\\partial}{\\partial w_j} \\frac{1}{2} \\sum_i \\big(y^{(i)} - \\phi(z)_{A}^{(i)}\\big)^2 \\\\\n", "& = \\frac{1}{2} \\frac{\\partial}{\\partial w_j} \\sum_i \\big(y^{(i)} - \\phi(z)_{A}^{(i)}\\big)^2 \\\\\n", "& = \\frac{1}{2} \\sum_i 2 \\big(y^{(i)} - \\phi(z)_{A}^{(i)}\\big) \\frac{\\partial}{\\partial w_j} \\big(y^{(i)} - \\phi(z)_{A}^{(i)}\\big) \\\\\n", "& = \\sum_i \\big(y^{(i)} - \\phi(z)_{A}^{(i)}\\big) \\frac{\\partial}{\\partial w_j} \\bigg(y^{(i)} - \\sum_k \\big(w_{k} x_{k}^{(i)} \\big) \\bigg) \\\\\n", "& = \\sum_i \\big(y^{(i)} - \\phi(z)_{A}^{(i)}\\big)(-x_{j}^{(i)}) \\\\\n", "& = - \\sum_i \\big(y^{(i)} - \\phi(z)_{A}^{(i)}\\big)x_{j}^{(i)} \n", "\\end{aligned}$$" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.1" } }, "nbformat": 4, "nbformat_minor": 0 }