{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Lecture 12: Gradient Descent"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br><br>\n",
    "Learning goal: how do we get to (approximately) the minimum for a function $f(x,y)$. e.g., $f(x,y)$ could be the cost of a I will be putting to market and $x$ and $y$ could be certain dimensions of the product, or one could be price. Or $f(x,y)$ could the error with which my model with parameters $x$, $y$ is predicting the stock-market. \n",
    "\n",
    "Please think of one other situation when we would want to minimize a multi-variable function. \n",
    "\n",
    "<br>\n",
    "\n",
    "Say, for example:\n",
    "\n",
    "$$f(x,y) = x^2 + 4y^2 + 4x + y + 6$$\n",
    "\n",
    "And we want to find the $(x,y)$ that minimizes $f(x,y)$ to a precisionwe prescribe. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "f = lambda x, y: x*x + 2*y*y + 4*x + y + 6"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Some of the following code are from matplotlib documentation and modified it to graph f.\n",
    "# You do not need to know how to make 3d graphs\n",
    "import matplotlib.pyplot as plt\n",
    "from mpl_toolkits.mplot3d import axes3d # plotting surfaces\n",
    "from matplotlib.colors import LogNorm # for later use, display colormap in log scale\n",
    "from matplotlib import animation # for later use, animate the gradient descending process\n",
    "from IPython.display import HTML # needed for rendering the video to html5 format\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig = plt.figure(figsize=(10, 10))\n",
    "ax = plt.axes(projection='3d', elev=30, azim=-50)\n",
    "\n",
    "X = np.linspace(-20,20,300)\n",
    "Y = np.linspace(-20,20,300)\n",
    "X, Y = np.meshgrid(X,Y)\n",
    "Z = f(X, Y)\n",
    "\n",
    "# Plot a basic surface.\n",
    "ax.plot_surface(X, Y, Z, cmap = 'jet', rstride=10, cstride=10, alpha=0.8)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>\n",
    "\n",
    "So how do we find the $(x,y)$ that minimizes the function $f$? Here are three ideas, all of which work for our case. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Idea 1: Monte Carlo\n",
    "Try a bunch of random values of $x$ and $y$. Keep the $x,y$ with the smallest values seen.\n",
    "\n",
    "Not a bad idea in 2-dimensions. But will it work if $f$ were $f(x_1,x_2,x_3,..., x_{100})$ with lots of parameters instead of just two variables? No, because there are too many points to look at."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Idea 2:  Math 2D\n",
    "Look at the partial derivatives of $f$ and solve $\\displaystyle\\frac{\\partial f}{\\partial x}(x,y) = 0$ and $\\displaystyle\\frac{\\partial f}{\\partial y}(x,y) = 0$. The idea here is that at a local mininum, the partial derivatives will all be 0. \n",
    "\n",
    "This is great in our case because the partial derivatives are solved easily. But again, if we had lots and lots of parameters, e.g. $f(x_1,x_2,x_3,..., x_{100})$, we would have to solve 100 equations, this would be fine if all the partial derivatives were linear, but in general it's very hard to solve lots of non-linear equations. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Idea 3:  \n",
    "Start at a point $(x,y)$ and \"go downhill\". \n",
    "\n",
    "**Fact from calculus:** At any point $(x,y)$, the gradient $\\nabla f (x,y) = \\displaystyle\\left(\\frac{\\partial f}{\\partial x}, \\frac{\\partial f}{\\partial y}\\right)$ will give the direction that makes $f$ increase the most. \n",
    "\n",
    "If $\\nabla f (x,y)$ will give the direction of steepest ascent, then $- \\nabla f (x,y)$ will give the direction of steepest descent (refer to Homework of Week 5). So I can re-write:\n",
    "\n",
    "> Start at a a point and iteratively (in a loop) change $(x,y)$ to \n",
    "\n",
    "$$(x,y) - \\eta \\nabla f (x,y),$$\n",
    "where $\\eta$ is a small number; and this iterative process can be formulated as:\n",
    "\n",
    "> Choose initial guess $(x_0,y_0)$ and step size (learning rate) $\\eta$<br><br>\n",
    ">    For $k=0,1,2, \\cdots, M$<br><br>\n",
    ">    &nbsp;&nbsp;&nbsp;&nbsp;    $(x_{k+1},y_{k+1}) =  (x_k,y_k) - \\eta\\nabla f(x_k,y_k) $\n",
    "\n",
    "Here for simplicity we only consider a fixed number of steps."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "f = lambda x, y: x*x + 2*y*y + 4*x + y + 6\n",
    "partialfx = lambda x, y: 2*x + 4  # partial derivative of our function f with respect to x\n",
    "partialfy = lambda x, y: 4*y + 1\n",
    "\n",
    "# starting at (5,5) for no reason\n",
    "x, y = 5, 5\n",
    "\n",
    "# this is the rate at which we'll move to the opposite of the gradient\n",
    "# called \"learning rate\" in machine learning\n",
    "eta = 0.5\n",
    "\n",
    "# total number of steps we will perform this descent\n",
    "num_steps = 200\n",
    "\n",
    "# we'll store all the intermediate values during the descent:\n",
    "x_vals = np.zeros(num_steps)\n",
    "y_vals = np.zeros(num_steps)\n",
    "f_vals = np.zeros(num_steps)\n",
    "\n",
    "for i in range(num_steps):\n",
    "    # update x and y\n",
    "    dx = partialfx(x, y)\n",
    "    dy = partialfy(x, y)\n",
    "    x = x - eta*dx\n",
    "    y = y - eta*dy\n",
    "    \n",
    "    # let's store the x, y and f(x,y) values for later use\n",
    "    x_vals[i] = x\n",
    "    y_vals[i] = y\n",
    "    f_vals[i] = f(x, y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([-5.5,  5. , -5.5,  5. , -5.5,  5. , -5.5,  5. , -5.5,  5. , -5.5,\n",
       "        5. , -5.5,  5. , -5.5,  5. , -5.5,  5. , -5.5,  5. ])"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "x_vals.shape\n",
    "y_vals[:20]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "-21.0"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "partialfy(-2,-5.5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Local minimum of f(x,y): \", f(x,y), \"at point\", (x,y))\n",
    "# let's see what the f(x,y) values were    \n",
    "plt.plot(range(num_steps), f_vals) # the changes of f(x,y) over these 200 iters\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's also visualize what happens on a contour graph. \n",
    "\n",
    "This is the contour graph of $f$. The curves are the solutions to $f(x,y) = c$ where $c$ is the labeled number on each curve. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "X = np.linspace(-20,20,300)\n",
    "Y = np.linspace(-20,20,300)\n",
    "X, Y = np.meshgrid(X,Y)\n",
    "Z = f(X, Y)\n",
    "\n",
    "plt.figure(figsize=(8, 6))\n",
    "CS = plt.contour(X, Y, Z, [1, 4, 9, 16, 25, 36, 49, 64, 81, \n",
    "                           100, 121, 144, 169, 196], cmap='jet')\n",
    "# the contour plot is when f(x,y) = the values above\n",
    "plt.axis([-6,6,-10,10])\n",
    "plt.clabel(CS, inline=True, fontsize=10)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's include the arrows for how gradient descent moves us. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "plt.figure(figsize=(8, 6))\n",
    "CS = plt.contour(X, Y, Z, [1, 4, 9, 16, 25, 36, 49, 64, 81, \n",
    "                           100, 121, 144, 169, 196], cmap='jet')\n",
    "plt.axis([-8,8,-8,8])\n",
    "plt.clabel(CS, inline=True, fontsize=10)\n",
    "# let's plot every few times to avoid congestion of arrows in the picture\n",
    "delta_n = 4\n",
    "for i in range(0,99,delta_n):\n",
    "#     plt.scatter(x_vals[i], y_vals[i])\n",
    "    plt.arrow(x_vals[i], y_vals[i], (x_vals[i+delta_n] - x_vals[i]), \n",
    "              (y_vals[i+delta_n] - y_vals[i]), \n",
    "              head_width=0.3, head_length=0.2, linewidth = 1.5, color='red')\n",
    "\n",
    "# plt.plot(x_vals, y_vals)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So our point $(x,y)$ is really going downhill towards the minimum!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## In-class exercise: \n",
    "What happens if you take $\\eta$ (`eta`) to be too big? Try guessing what will happen and then go and try `eta = 0.5` in the code. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Numerical gradient (read after class):\n",
    "\n",
    "The gradient descent method works really well not just for polynomial functions but all kinds of functions. How do we take derivatives in general though? Two ways:\n",
    "\n",
    "* **Symbolic**: presumable we know the formula $f$ is, so we can take the derivative using our knowledge of derivatives. (that's what we did above)\n",
    "* **Numerical**: we can numerically approximate using the definition of derivative:\n",
    "\n",
    "$$\\frac{d f}{d x}(x) = \\lim_{h \\rightarrow 0}\\frac{f(x + h) - f(x)}{h} \\approx \\frac{f(x + h) - f(x)}{h}$$\n",
    "when $h$ is small.\n",
    " \n",
    "So, to numerically approximate the derivative, we could take $h = 0.00001$ for example.\n",
    "\n",
    "Reference: a starter would be the [Wikipedia entry](https://en.wikipedia.org/wiki/Numerical_differentiation), a more advanced introduction would be a little piece written by my former Purdue colleague whom I took machine learning class from: [Finite Calculus:\n",
    "A Tutorial for Solving Nasty Sums](https://www.cs.purdue.edu/homes/dgleich/publications/Gleich%202005%20-%20finite%20calculus.pdf)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def num_derivative(f):\n",
    "    h = 0.00001\n",
    "    return (lambda x: (f(x+h)-f(x))/h) # return value is function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_f = lambda x: x*x\n",
    "num_derivative(test_f)(1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can of course do it for partial derivatives:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def numpartialx(f):\n",
    "    h = 0.00001\n",
    "    return (lambda x, y: (f(x+h, y)-f(x, y))/h)\n",
    "\n",
    "def numpartialy(f):\n",
    "    h = 0.00001\n",
    "    return (lambda x, y: (f(x, y+h)-f(x, y))/h)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "We can then do gradient descent for any function without taking the partial derivatives by hand:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# this is now a function\n",
    "def grad_descent(f, x0 = (0,0), eta=0.01, num_steps=200):\n",
    "    # if we do not give x0 any value, then by default it is (0,0)\n",
    "    # starting at some point\n",
    "    x, y = x0[0], x0[1]\n",
    "\n",
    "    x_vals = np.zeros(num_steps)\n",
    "    y_vals = np.zeros(num_steps)\n",
    "    f_vals = np.zeros(num_steps)\n",
    "\n",
    "    for i in range(num_steps):\n",
    "        # update x and y\n",
    "        # numerical gradient\n",
    "        dx, dy = numpartialx(f)(x,y), numpartialy(f)(x,y)\n",
    "        x = x - eta*dx\n",
    "        y = y - eta*dy\n",
    "\n",
    "        # let's store the x, y and f(x,y) values for later use\n",
    "        x_vals[i] = x\n",
    "        y_vals[i] = y\n",
    "        f_vals[i] = f(x, y)\n",
    "    \n",
    "    return x_vals, y_vals, f_vals"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x_vals, y_vals, f_vals = grad_descent(f, (-1,2), 0.1, 50)\n",
    "\n",
    "print(\"Local minimum of f(x,y): \", f(x_vals[-1],y_vals[-1]), \n",
    "      \"at point\", (x_vals[-1],y_vals[-1]))\n",
    "# let's see what the f(x,y) values were    \n",
    "plt.plot(range(50), f_vals) # the changes of f(x,y) over these 200 iters\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# An (in)famous benchmark: Beale function\n",
    "\n",
    "Beale function is one of the [benchmark for testing your optimization algorithm](https://en.wikipedia.org/wiki/Test_functions_for_optimization):\n",
    "$$\\displaystyle f(x,y)=\\left(1.5-x+xy\\right)^{2}+\\left(2.25-x+xy^{2}\\right)^{2} \n",
    "+\\left(2.625-x+xy^{3}\\right)^{2}$$\n",
    "We know that this function has the global minimum is achieved at $(3,0.5)$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# beale function\n",
    "f  = lambda x, y: (1.5 - x + x*y)**2 + (2.25 - x + x*y**2)**2 + (2.625 - x + x*y**3)**2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "num_steps = 2000\n",
    "x_vals, y_vals, f_vals = grad_descent(f, x0 = (-2,-2), eta = 1e-4, num_steps=num_steps)\n",
    "print(\"The value of f(x,y): \", f(x_vals[-1],y_vals[-1]), \"after\", \n",
    "      num_steps, \"iterations at point\", (x_vals[-1],y_vals[-1]))\n",
    "# let's see what the f(x,y) values were    \n",
    "plt.title(\"f(x,y) over gradient descent steps\")\n",
    "plt.plot(range(num_steps), f_vals)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's see the contour too."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X = np.linspace(-4.5,4.5,300)\n",
    "Y = np.linspace(-4.5,4.5,300)\n",
    "X, Y = np.meshgrid(X,Y)\n",
    "Z = f(X, Y)\n",
    "\n",
    "# these 4 lines of code is plotting the contour\n",
    "fig, ax = plt.subplots(figsize=(12, 8))\n",
    "CS = ax.contour(X, Y, Z, levels=np.logspace(0, 5, 35), cmap='jet', norm = LogNorm())\n",
    "plt.axis('tight')\n",
    "ax.clabel(CS, inline=True, fontsize=8)\n",
    "\n",
    "# let's plot the arrow every few times to avoid congestion of arrows in the picture\n",
    "delta_n = 100\n",
    "for i in range(0,num_steps-delta_n,delta_n):\n",
    "    # plt.scatter(x_vals[i], y_vals[i])\n",
    "    plt.arrow(x_vals[i], y_vals[i], (x_vals[i+delta_n] - x_vals[i]), \n",
    "              (y_vals[i+delta_n] - y_vals[i]), \n",
    "              head_width=0.3, head_length=0.2, linewidth = 1.5, color='red')\n",
    "\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Animation of the gradient descent\n",
    "Please run the following cell to animate this process. You do not need to understand what I did here...\n",
    "\n",
    "If Python complains about `RuntimeError: Requested MovieWriter (ffmpeg) not available`, please install FFmpeg package using Anaconda prompt by the following command either on your own computer and on the lab computer:\n",
    "> conda install -c conda-forge ffmpeg\n",
    "\n",
    "After you have done this, restart the kernel and run every cell."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "## the gradient descent path\n",
    "path = np.array([x_vals, y_vals])\n",
    "path = path[:,::10] # we dont plot too often\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(8, 6))\n",
    "CS = ax.contour(X, Y, Z, levels=np.logspace(0, 5, 35), cmap='jet', norm = LogNorm())\n",
    "plt.axis('tight')\n",
    "ax.clabel(CS, inline=True, fontsize=8)\n",
    "line, = ax.plot([], [], 'b', label='Gradient Descent', linewidth=3.0)\n",
    "point, = ax.plot([], [], 'bo')\n",
    "\n",
    "def init():\n",
    "    line.set_data([], [])\n",
    "    point.set_data([], [])\n",
    "    return line, point\n",
    "\n",
    "def animate(i):\n",
    "    line.set_data(*path[::,:i])\n",
    "    point.set_data(*path[::,i-1:i])\n",
    "    return line, point\n",
    "\n",
    "anim = animation.FuncAnimation(fig, animate, init_func=init,\n",
    "                               frames=path.shape[1], interval=50, \n",
    "                               repeat_delay=20, blit=True)\n",
    "HTML(anim.to_html5_video())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercise 1\n",
    "Try using gradient descent method on the following function with different starting point. Do you get the same result? Use the code above to plot the contour of this function, do you see why it is the case?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "f = lambda x, y: 2*np.sin(x) + x*x + y*y + 5*np.cos(y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 2\n",
    "Try replace the numerical gradient formula by\n",
    "$$\\frac{f(x + h) - f(x - h)}{2h}$$ \n",
    "and repeat the gradient for Beale function above.\n",
    "This will give you a much better estimate of the derivative with the same $h$. Why is it so?"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.2"
  },
  "latex_envs": {
   "LaTeX_envs_menu_present": true,
   "autoclose": true,
   "autocomplete": true,
   "bibliofile": "biblio.bib",
   "cite_by": "apalike",
   "current_citInitial": 1,
   "eqLabelWithNumbers": true,
   "eqNumInitial": 1,
   "hotkeys": {
    "equation": "Ctrl-E",
    "itemize": "Ctrl-I"
   },
   "labels_anchors": false,
   "latex_user_defs": false,
   "report_style_numbering": false,
   "user_envs_cfg": false
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}