{ "cells": [ { "cell_type": "markdown", "id": "4ff5888c", "metadata": {}, "source": [ "Sascha Spors,\n", "Professorship Signal Theory and Digital Signal Processing,\n", "Institute of Communications Engineering (INT),\n", "Faculty of Computer Science and Electrical Engineering (IEF),\n", "University of Rostock,\n", "Germany\n", "\n", "# Data Driven Audio Signal Processing - A Tutorial with Computational Examples\n", "\n", "Winter Semester 2023/24 (Master Course #24512)\n", "\n", "- lecture: https://github.com/spatialaudio/data-driven-audio-signal-processing-lecture\n", "- tutorial: https://github.com/spatialaudio/data-driven-audio-signal-processing-exercise\n", "\n", "Feel free to contact lecturer frank.schultz@uni-rostock.de" ] }, { "cell_type": "markdown", "id": "dcd5fabd", "metadata": {}, "source": [ "# Gradient Descent (GD)" ] }, { "cell_type": "markdown", "id": "dbb4aa25", "metadata": {}, "source": [ "## Analytical Loss Function in 2D\n", "\n", "Suppose, that the (made up) loss function of a model with two parameters $\\beta_1$ and $\\beta_2$ is analytically given as\n", "\n", "$$\\mathcal{L}(\\beta_1, \\beta_2) = (\\beta_1 - 2)^2 + (\\beta_2 - 1)^4 - (\\beta_2 -1)^2$$\n", "\n", "In this toy example there is no data dependency involved, which is not how things work in practice, but it is good to understand the essence of finding a minimum numerically.\n", "\n", "In order to find **potential minima**, and thereby the **optimum model parameters** $\\hat{\\beta_1}$ and $\\hat{\\beta_2}$, we need to solve gradient for zero\n", "\n", "$$\\nabla \\mathcal{L} = \n", "\\begin{bmatrix}\n", "\\frac{\\partial \\mathcal{L}}{\\partial \\beta_1}\\\\\n", "\\frac{\\partial \\mathcal{L}}{\\partial \\beta_2}\n", "\\end{bmatrix}=\n", "\\mathbf{0}$$\n", "\n", "The required partial derivatives of first order are \n", "\n", "$$\\frac{\\partial \\mathcal{L}}{\\partial \\beta_1} = 2 (\\beta_1 - 2)^1$$\n", "\n", "$$\\frac{\\partial \\mathcal{L}}{\\partial \\beta_2} = 4 (\\beta_2 - 1)^3 - 2(\\beta_2 -1)^1$$\n", "\n", "A check with the Hessian of $\\mathcal{L}(\\beta_1, \\beta_2)$ yields whether we deal with a minimum, maximum, saddle point or neither of them for each of the zero gradient conditions.\n", "\n", "We get **first minimum** at\n", "$$\\beta_{1,min1} = 2\\qquad \\beta_{2,min1} = 1+\\frac{1}{\\sqrt{2}}$$\n", "\n", "We get **second minimum** at\n", "$$\\beta_{1,min2} = 2\\qquad \\beta_{2,min2} = 1-\\frac{1}{\\sqrt{2}}$$\n", "\n", "Both minima yield the same function value\n", "$$\\mathcal{L}(\\beta_{1,min}, \\beta_{2,min}) = -\\frac{1}{4},$$\n", "so we deal actually with **two optimum models**, as there is **no global minimum** with only one lowest function value.\n", "\n", "We have **one saddle point**, that actually separates the two minima at\n", "$$\\beta_{1,saddle} = 2\\qquad \\beta_{2,saddle} = 1$$\n", "\n", "with function value\n", "$$\\mathcal{L}(\\beta_{1,saddle}, \\beta_{2,saddle}) = 0$$" ] }, { "cell_type": "markdown", "id": "17187740", "metadata": {}, "source": [ "## Gradient Descent Update Rule \n", "We could (and in ML problems we often need) to find the minima numerically.\n", "The most straightforward and simple numerical solver is the so called **gradient descent** (GD), a first order method.\n", "\n", "It uses the (analytically known) gradient $\\nabla\\mathcal{L}$, evaluates it for an actual $\\beta_{actual}$ and updates subsequently into direction of negative gradient, i.e. the (or rather a?!?) minimum that we want to find.\n", "\n", "This iterative procedure can be written as\n", "$$(1):\\quad \\beta_{new} = \\beta_{actual} - \\mathrm{step size} \\cdot \\nabla\\mathcal{L}\\bigg|_{\\beta_{actual}}\\quad(2):\\quad\\beta_{new} \\rightarrow \\beta_{actual}\\quad(3): \\mathrm{go to}\\,(1)$$\n", "repeated until we hopefully converged to the $\\beta$ that represents the minimum.\n", "\n", "In practice GD is not often used, as it is not very robust and many things can go wrong. Let us check this with some illustrative examples below." ] }, { "cell_type": "code", "execution_count": null, "id": "d5752a87", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib as mpl\n", "import matplotlib.pyplot as plt\n", "from mpl_toolkits import mplot3d\n", "\n", "matplotlib_widget_flag = False" ] }, { "cell_type": "code", "execution_count": null, "id": "d8b0d840", "metadata": {}, "outputs": [], "source": [ "# analytical solutions from above\n", "minimum1, fminimum1 = np.array([2, 1 + 1 / np.sqrt(2)]), -1 / 4\n", "minimum2, fminimum2 = np.array([2, 1 - 1 / np.sqrt(2)]), -1 / 4\n", "saddle, fsaddle = np.array([2, 1]), 0\n", "\n", "\n", "def get_gradient(beta):\n", " beta_gradient = np.array(\n", " [2 * (beta[0] - 2), 4 * (beta[1] - 1) ** 3 - 2 * (beta[1] - 1)]\n", " )\n", " return beta_gradient" ] }, { "cell_type": "markdown", "id": "67bc3812", "metadata": {}, "source": [ "## Plot the Loss Function" ] }, { "cell_type": "code", "execution_count": null, "id": "1f54b6d5", "metadata": {}, "outputs": [], "source": [ "def f(x, y):\n", " # our loss function from above\n", " # only with nicer to read variables x,y\n", " # instead of beta1, beta2\n", " return (x - 2) ** 2 + (y - 1) ** 4 - (y - 1) ** 2\n", "\n", "\n", "x, y = np.linspace(0, 4, 2**7), np.linspace(-1 / 2, 5 / 2, 2**7)\n", "X, Y = np.meshgrid(x, y)\n", "Z = f(X, Y)" ] }, { "cell_type": "code", "execution_count": null, "id": "a13e8731", "metadata": {}, "outputs": [], "source": [ "col_min, col_max, no_col = -1 / 2, 5, 12\n", "col_tick = np.linspace(col_min, col_max, no_col, endpoint=True)\n", "cmap = plt.cm.magma_r\n", "norm = mpl.colors.BoundaryNorm(col_tick, cmap.N)\n", "\n", "if matplotlib_widget_flag:\n", " %matplotlib widget\n", "fig = plt.figure()\n", "ax = plt.axes(projection=\"3d\")\n", "c = ax.plot_surface(\n", " X, Y, Z, rstride=1, cstride=1, cmap=cmap, norm=norm, edgecolor=\"none\"\n", ")\n", "ax.plot(minimum1[0], minimum1[1], fminimum1, \"C0o\")\n", "ax.plot(minimum2[0], minimum2[1], fminimum2, \"C0o\")\n", "ax.plot(saddle[0], saddle[1], fsaddle, \"C0o\")\n", "cbar = fig.colorbar(\n", " c, ax=ax, ticks=col_tick[:: no_col // 10], label=r\"$\\mathcal{L}$\"\n", ")\n", "ax.set_xlim(x[0], x[-1])\n", "ax.set_ylim(y[0], y[-1])\n", "ax.set_zlim(-1 / 4, 5)\n", "ax.set_xlabel(r\"$\\beta_1$\")\n", "ax.set_ylabel(r\"$\\beta_2$\")\n", "ax.set_zlabel(r\"$\\mathcal{L}$\")\n", "ax.view_init(elev=60, azim=-40)" ] }, { "cell_type": "markdown", "id": "2cb567ee", "metadata": {}, "source": [ "## Examples for Finding Minimum with Gradient Descent (GD)" ] }, { "cell_type": "code", "execution_count": null, "id": "5646a5a1", "metadata": {}, "outputs": [], "source": [ "def my_gradient_descent(beta, beta_gradient, step_size):\n", " # simple update rule for GD\n", " beta -= beta_gradient * step_size\n", " return beta" ] }, { "cell_type": "code", "execution_count": null, "id": "dd0e4077", "metadata": {}, "outputs": [], "source": [ "def plot_contour_of_my_gradient_descent():\n", " fig, ax = plt.subplots()\n", " ax.contour(X, Y, Z, cmap=\"magma_r\")\n", " ax.plot(beta_path[0], beta_path[1], \"C0.-\", lw=0.3)\n", " ax.plot(minimum1[0], minimum1[1], \"kx\", ms=10)\n", " ax.plot(minimum2[0], minimum2[1], \"kx\", ms=10)\n", " ax.plot(saddle[0], saddle[1], \"kx\", ms=10)\n", " ax.set_xlabel(r\"$\\beta_1$\")\n", " ax.set_ylabel(r\"$\\beta_2$\")\n", " ax.set_title(\n", " r\"last calculated beta1 = {f1:5.4f}, beta2 = {f2:5.4f}\".format(\n", " f1=beta_path[0, -1], f2=beta_path[1, -1]\n", " )\n", " )\n", " ax.axis(\"equal\")\n", " ax.grid(\"True\")" ] }, { "cell_type": "markdown", "id": "16da4461", "metadata": {}, "source": [ "### Gradient Descent Towards Minimum 1\n", "\n", "Let us start with an example giving a nice result.\n", "\n", "The parameters that are inherent to the straightforward GD algorithm\n", "\n", "- chosen number of steps and\n", "- chosen step size and\n", "- chosen starting point \n", "\n", "work nicely out to find the minimum 1 at\n", "$$\\beta_{1,min1} = 2\\qquad \\beta_{2,min1} = 1+\\frac{1}{\\sqrt{2}}$$" ] }, { "cell_type": "code", "execution_count": null, "id": "cf675363", "metadata": {}, "outputs": [], "source": [ "steps = 2**5\n", "beta = np.array([4, 1.075])\n", "step_size = 1 / 5\n", "\n", "beta_path = np.zeros((2, steps + 1))\n", "beta_path[:, 0] = beta # store chosen init beta\n", "for step in range(steps):\n", " # this is the GD\n", " beta_gradient = get_gradient(beta)\n", " beta = my_gradient_descent(beta, beta_gradient, step_size)\n", " # store our GD path\n", " beta_path[:, step + 1] = beta\n", "# and plot\n", "plot_contour_of_my_gradient_descent()" ] }, { "cell_type": "markdown", "id": "57c29171", "metadata": {}, "source": [ "### Gradient Descent Towards Minimum 2\n", "\n", "The chosen number of steps, the chosen step size and the chosen starting point work out to approach the minimum 2 at\n", "\n", "$$\\beta_{1,min2} = 2\\qquad \\beta_{2,min2} = 1-\\frac{1}{\\sqrt{2}},$$\n", "\n", "which we however not reached yet. Increasing the number of steps and/or slightly decreasing the step size should safely bring us to the minimum 2. " ] }, { "cell_type": "code", "execution_count": null, "id": "ab382881", "metadata": {}, "outputs": [], "source": [ "steps = 2**6\n", "beta = np.array([0.0, -0.5])\n", "step_size = 1 / 50\n", "print(step_size)\n", "\n", "\n", "beta_path = np.zeros((2, steps + 1))\n", "beta_path[:, 0] = beta # store chosen init beta\n", "for step in range(steps):\n", " # this is the GD\n", " beta_gradient = get_gradient(beta)\n", " beta = my_gradient_descent(beta, beta_gradient, step_size)\n", " # store our GD path\n", " beta_path[:, step + 1] = beta\n", "# and plot\n", "plot_contour_of_my_gradient_descent()" ] }, { "cell_type": "markdown", "id": "655dde16", "metadata": {}, "source": [ "### Gradient Descent Towards Minimum 2 Along the Saddle Point Ridge\n", "\n", "This GD path is special: We start very close to the saddle point ridge, i.e. we start with a chosen starting value $\\beta_2=0.9999$, which is very close to $\\beta_{2,saddle} = 1$.\n", "The GD then initially moves along the saddle point ridge, but fortunately drops of to find its way to minimum 2.\n", "The chosen values for number of steps and step size are much more critical here compared to above examples.\n", "\n", "What happens if we choose `beta = np.array([4., 1.0001])`? Make first an expectation, then try..." ] }, { "cell_type": "code", "execution_count": null, "id": "2fb8c43f", "metadata": {}, "outputs": [], "source": [ "steps = 2**6\n", "beta = np.array([4.0, 0.9999])\n", "step_size = 1 / 10\n", "\n", "\n", "beta_path = np.zeros((2, steps + 1))\n", "beta_path[:, 0] = beta # store chosen init beta\n", "for step in range(steps):\n", " # this is the GD\n", " beta_gradient = get_gradient(beta)\n", " beta = my_gradient_descent(beta, beta_gradient, step_size)\n", " # store our GD path\n", " beta_path[:, step + 1] = beta\n", "# and plot\n", "plot_contour_of_my_gradient_descent()" ] }, { "cell_type": "markdown", "id": "5132abe9", "metadata": {}, "source": [ "### Gradient Descent Towards Saddle Point\n", "\n", "In the examples above fortunately the GDs approached or even finally found a minimum, which is what we desire when learning optimum model parameters.\n", "\n", "Things could go wrong, if we set up things badly. We should realize, that ML libraries such as TensorFlow are nowadays very sophisticated, mature, highly double-checked and debugged and it is thus more likely that things go wrong **because of us** using these ML algorithms. So, human intelligence and know how is still needed to build the best models.\n", "\n", "One prominent and illustrative example for bad things that happen, is given below. Here, we never reach one of the two minima, but rather we *die* on the saddle point. This is because the GD can by its gradient concept never drop of the ridge of the saddle point.\n", "Here in this **toy example**, we know **that and why** this happens. But what about practical, very complicated loss functions?!?! *Dying* on a saddle point is probably not a nicely learned model." ] }, { "cell_type": "code", "execution_count": null, "id": "0a469708", "metadata": {}, "outputs": [], "source": [ "steps = 2**6\n", "beta = np.array([4.0, 1.0])\n", "step_size = 1 / 10\n", "\n", "beta_path = np.zeros((2, steps + 1))\n", "beta_path[:, 0] = beta # store chosen init beta\n", "for step in range(steps):\n", " # this is the GD\n", " beta_gradient = get_gradient(beta)\n", " beta = my_gradient_descent(beta, beta_gradient, step_size)\n", " # store our GD path\n", " beta_path[:, step + 1] = beta\n", "# and plot\n", "plot_contour_of_my_gradient_descent()" ] }, { "cell_type": "markdown", "id": "89f51480", "metadata": {}, "source": [ "### Gradient Descent Ends in Oscillation\n", "\n", "Another bad thing occurs in this example.\n", "The chosen large step size of $1$ and the chosen $\\beta_2=1$ on the saddle point ridge yield a GD that oscillates between\n", "$$[\\beta_1, \\beta_2]^\\mathrm{T} = [4, 1]^\\mathrm{T}$$\n", "and\n", "$$[\\beta_1, \\beta_2]^\\mathrm{T} = [0, 1]^\\mathrm{T}$$\n", "\n", "We shortly should check the analytical equations, these numbers are not by accident. The gradient is\n", "\n", "$$\\frac{\\partial \\mathcal{L}}{\\partial \\beta_1} = 2 (\\beta_1 - 2)^1$$\n", "\n", "$$\\frac{\\partial \\mathcal{L}}{\\partial \\beta_2} = 4 (\\beta_2 - 1)^3 - 2(\\beta_2 -1)^1$$" ] }, { "cell_type": "markdown", "id": "d7fb1874", "metadata": {}, "source": [ "We start with initial $$\\beta_{actual} = [\\beta_1, \\beta_2]^\\mathrm{T} = [4, 1]^\\mathrm{T},$$\n", "hence\n", "$$\\frac{\\partial \\mathcal{L}}{\\partial \\beta_1}\\bigg|_{\\beta_{actual}} = 2 (4 - 2)^1 = 4$$\n", "$$\\frac{\\partial \\mathcal{L}}{\\partial \\beta_2}\\bigg|_{\\beta_{actual}} = 4 (1 - 1)^3 - 2(1 -1)^1 = 0$$\n", "The update rule in GD is\n", "$$\\beta_{new} = \\beta_{actual} - \\mathrm{stepsize} \\cdot \\nabla\\mathcal{L}\\bigg|_{\\beta_{actual}},$$\n", "with inserted numbers\n", "$$\\beta_{new} = \n", "\\begin{bmatrix}4\\\\1\n", "\\end{bmatrix}\n", "- 1 \\cdot\n", "\\begin{bmatrix}4\\\\0\n", "\\end{bmatrix}=\n", "\\begin{bmatrix}0\\\\1\n", "\\end{bmatrix}\n", "$$\n", "GD moves to the next step, so the model parameter vector which was newly calculated becomes the actual vector $\\beta_{new} \\rightarrow \\beta_{actual}$." ] }, { "cell_type": "markdown", "id": "bbcda374", "metadata": {}, "source": [ "So, next we deal with $$\\beta_{actual} = [\\beta_1, \\beta_2]^\\mathrm{T} = [0, 1]^\\mathrm{T},$$\n", "hence\n", "$$\\frac{\\partial \\mathcal{L}}{\\partial \\beta_1}\\bigg|_{\\beta_{actual}} = 2 (0 - 2)^1 = -4$$\n", "$$\\frac{\\partial \\mathcal{L}}{\\partial \\beta_2}\\bigg|_{\\beta_{actual}} = 4 (1 - 1)^3 - 2(1 -1)^1 = 0$$\n", "The update rule in GD is\n", "$$\\beta_{new} = \\beta_{actual} - \\mathrm{stepsize} \\cdot \\nabla\\mathcal{L}\\bigg|_{\\beta_{actual}},$$\n", "with inserted numbers\n", "$$\\beta_{new} = \n", "\\begin{bmatrix}0\\\\1\n", "\\end{bmatrix}\n", "- 1 \\cdot\n", "\\begin{bmatrix}-4\\\\0\n", "\\end{bmatrix}=\n", "\\begin{bmatrix}4\\\\1\n", "\\end{bmatrix}\n", "$$\n", "In the next step, we again get\n", "$$\\beta_{new} =\n", "\\begin{bmatrix}0\\\\1\n", "\\end{bmatrix}\n", "$$, and so on...we see the oscillation.\n", "In this toy example the oscillation is very obvious and easy to check by the simple equations.\n", "In practice, oscillations could be more complex (more than two states) and hard to trace.\n", "We should be aware of such phenomenon as a minimum is never reached." ] }, { "cell_type": "code", "execution_count": null, "id": "16e66a5d", "metadata": {}, "outputs": [], "source": [ "steps = 2**3\n", "beta = np.array([4.0, 1])\n", "step_size = 1\n", "\n", "\n", "beta_path = np.zeros((2, steps + 1))\n", "beta_path[:, 0] = beta # store chosen init beta\n", "for step in range(steps):\n", " # this is the GD\n", " beta_gradient = get_gradient(beta)\n", " beta = my_gradient_descent(beta, beta_gradient, step_size)\n", " # store our GD path\n", " beta_path[:, step + 1] = beta\n", "# and plot\n", "plot_contour_of_my_gradient_descent()\n", "beta_path" ] }, { "cell_type": "markdown", "id": "7522f5e2", "metadata": {}, "source": [ "### Gradient Descent Exploding\n", "\n", "We have seen a **stable GD oscillation** in the **example** directly **above**. Things could get even worse.\n", "In this example we deal with a dangerous phenomenon often referred to as **exploding gradient descent** in textbooks.\n", "Instead of approaching one minimum the GD starts to **oscillate** and with successively **climbing up** out of the valley.\n", "In the example here we have chosen the numbers, such that double precision still can handle the numbers.\n", "Very often such exploding gradients end in Nan, Inf for the gradients and the model parameters. Obviously, that's not useful at all.\n", "\n", "We should be aware, that changing `step_size = 0.29` to `step_size = 0.28` the GD does not explode, but finds its way to minimum 1. So, a tiny change in one number rules over success vs. failure.\n", "This indicates the importance of being cautious against chosen parameters and actually understanding what and why algorithms do the things they do." ] }, { "cell_type": "code", "execution_count": null, "id": "97d002c8", "metadata": {}, "outputs": [], "source": [ "steps = 2**3\n", "beta = np.array([4.0, 2.5])\n", "\n", "if True:\n", " step_size = 0.29 # GD explodes\n", " flag = True\n", "else:\n", " step_size = 0.28 # GD approaches minimum1\n", " flag = False\n", "\n", "beta_path = np.zeros((2, steps + 1))\n", "beta_path[:, 0] = beta # store chosen init beta\n", "for step in range(steps):\n", " # this is the GD\n", " beta_gradient = get_gradient(beta)\n", " print(\"step\", step, \"gradient\", beta_gradient)\n", " beta = my_gradient_descent(beta, beta_gradient, step_size)\n", " # store our GD path\n", " beta_path[:, step + 1] = beta\n", "# and plot\n", "plot_contour_of_my_gradient_descent()\n", "if flag:\n", " plt.title(\"\") # get rid of the super long title for larbe beta2\n", "print(\"beta vector along the GD\")\n", "beta_path" ] }, { "cell_type": "markdown", "id": "43140591", "metadata": {}, "source": [ "### Gradient Descent Ends in More Complex Oscillation But Does Not Explode\n", "\n", "In this example, we deal with a more complex oscillation which however is stable, i.e. the GD does not explode. It is nevertheless a meaningless solution to our problem.\n", "\n", "We should take some time to figure out how to change the parameters `steps` and `step_size`\n", "- making the GD explode\n", "- making the GD find a minimum\n", "\n", "Which minimum is found in the latter case? Why must this always be the case?" ] }, { "cell_type": "code", "execution_count": null, "id": "1ef5c69a", "metadata": {}, "outputs": [], "source": [ "steps = 2**7 # default: 2**7\n", "beta = np.array([4.0, 1.25])\n", "step_size = 1 # default: 1\n", "\n", "beta_path = np.zeros((2, steps + 1))\n", "beta_path[:, 0] = beta # store chosen init beta\n", "for step in range(steps):\n", " # this is the GD\n", " beta_gradient = get_gradient(beta)\n", " beta = my_gradient_descent(beta, beta_gradient, step_size)\n", " # store our GD path\n", " beta_path[:, step + 1] = beta\n", "# and plot\n", "plot_contour_of_my_gradient_descent()" ] }, { "cell_type": "markdown", "id": "651b1eff", "metadata": {}, "source": [ "## Copyright\n", "\n", "- the notebooks are provided as [Open Educational Resources](https://en.wikipedia.org/wiki/Open_educational_resources)\n", "- feel free to use the notebooks for your own purposes\n", "- the text is licensed under [Creative Commons Attribution 4.0](https://creativecommons.org/licenses/by/4.0/)\n", "- the code of the IPython examples is licensed under the [MIT license](https://opensource.org/licenses/MIT)\n", "- please attribute the work as follows: *Frank Schultz, Data Driven Audio Signal Processing - A Tutorial Featuring Computational Examples, University of Rostock* ideally with relevant file(s), github URL https://github.com/spatialaudio/data-driven-audio-signal-processing-exercise, commit number and/or version tag, year." ] } ], "metadata": { "kernelspec": { "display_name": "myddasp", "language": "python", "name": "myddasp" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" } }, "nbformat": 4, "nbformat_minor": 5 }