{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "

Derivation of the Backpropagation Algorithm

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Assume we are given a neural network $\\operatorname{NN}$ with feed forward $F = F_{\\Theta}:\\mathbb{R}^{n_i} \\to \\mathbb{R}^{n_o}$, where $\\Theta$ is the collection of the weights in all the layers. If we want to train this network using [gradient descent](https://nbviewer.jupyter.org/github/niknow/machine-learning-examples/blob/master/newton_gradient_backprop/gradient_descent.ipynb), we need to calculate the derivative $\\nabla_{\\Theta} F_{\\Theta}$. Because $F = F_L \\circ \\ldots F_1$ is a composition of the various feed forwards of the layers and each layer has some weights, computing this derivative is not entirely trivial.\n", "\n", "Backpropagation is an algorithm that is based on a clever computation of the derivative $\\nabla_{\\Theta}F_{\\Theta}$, which - as the name might suggest - starts from the back of the network, i.e. the output layer $F_L$ and then works its way backwards to the first layer.\n", "\n", "In this notebook, we provide the mathematical foundations of backpropagations and derive the key equations." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Recall Definition & Notation for Neural Networks" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to pin down the precise equations for backpropagation, we first have to pin down the definition of neural network. Even for multilayer perceptrons (MLPs), there are various formulations of them in the literature. We will use the following:\n", "\n", "**Definition (neural network):** A *neural network* $\\operatorname{NN}$ is a tuple $\\operatorname{NN}=(A_l, b_l, \\sigma_l)_{1 \\leq l \\leq L}$ defined by\n", "* a numer $n_i$ of *inputs*,\n", "* a number $n_o$ of *outputs*\n", "* a number $L$ of *layers* and\n", "* for each layer $1 \\leq l \\leq L$ \n", " * a number $n_l$ of *neurons* (or *units*),\n", " * a matrix $A_l \\in \\mathbb{R}^{n_{l} \\times n_{l-1}}$ and a vector $b_l \\in \\mathbb{R}^{n_l}$ of *weights* such that $n_0 = n_i$, $n_{L}=n_o$ and\n", " * an *activation function* $\\sigma_l:\\mathbb{R} \\to \\mathbb{R}$.\n", "\n", "For any $1 \\leq l \\leq L$, the tuple $(A_l, b_l, \\sigma_l)$ is called a *layer*. For $l=L$, the layer is called *output layer* and for $1 \\leq l< L$, the layer is called *hidden layer*. We denote by $\\Theta_l := (b_l, A_l) \\in \\mathbb{R}^{(n_l+1) \\times n_{l-1}}$ the total weights of layer $l$ and set $\\Theta := (\\Theta_1, \\ldots, \\Theta_L)$.\n", "\n", "A graphical representation of the layers can be found in the [introduction to MLPs](https://nbviewer.jupyter.org/github/niknow/machine-learning-examples/blob/master/neural_network_intro/neural_network_intro_model_setup.ipynb). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Feed Forward\n", "The *feed forward* of a neural network is the process of feeding an input data sample into the network and computing the output." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following notation will be convenient:\n", "\n", "**Definition (affine linear map):** Let $A \\in \\mathbb{R}^{m \\times n}$ be a matrix and $b \\in \\mathbb{R}^{m}$ be a vector. Then we denote by\n", "\\begin{align*}\n", " f_{A,b}:\\mathbb{R}^{n} \\to \\mathbb{R}^m, && v \\mapsto Av + b\n", "\\end{align*}\n", "the *affine linear map with parameters $A$ and $b$*." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Definition (feed forward function):** Let $\\operatorname{NN}=(A_l, b_l, \\sigma_l)_{1 \\leq l \\leq L}$ be a neural network. Then for each $1 \\leq l \\leq L$, we define a function \n", "\\begin{align*}\n", "F_l := \\sigma_l \\circ f_{A_l, b_l}: \\mathbb{R}^{n_{l-1}} \\to \\mathbb{R}^{n_l}, && v \\mapsto \\sigma_l(A_lv + b_l),\n", "\\end{align*}\n", "\n", "where we employ the convention that $\\sigma_l$ is applied in every component.\n", "The composition $F:= F_{\\Theta}:\\mathbb{R}^{n_i} \\to \\mathbb{R}^{n_o}$, $F_{\\Theta} := F_L \\circ \\ldots \\circ F_2 \\circ F_1$ is called the *feed forward function* of $\\operatorname{NN}$. Any set of inputs $x \\in \\mathbb{R}^{n_i}$ is called an *input layer*." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Algorithm (feed forward):** The feed forward of a neural network on an input $x \\in \\mathbb{R}^{n_i}$ is simply the evaluation of the feed forward function $F$ on $x$, i.e. the computation of $y=F(x)$. As $F$ is a composition of the various $F_l$, this evaluation is computed by evaluating the $F_l$ one by one feeding the input foward through the network as follows:\n", "\n", "\\begin{align}\n", " a_0 &:= x \\in \\mathbb{R}^{n_i} \\\\\n", " z_1 & := f_{A_1, b_1}(a_0) = A_1 a_0 + b_1 \\in \\mathbb{R}^{n_1} \\\\\n", " a_1 & := \\sigma_1(z_1) \\in \\mathbb{R}^{n_1} \\\\\n", " z_2 & := f_{A_2, b_2}(a_1) = A_2 a_1 + b_2 \\in \\mathbb{R}^{n_2} \\\\\n", " a_2 & := \\sigma_2(z_2) \\in \\mathbb{R}^{n_2} \\\\\n", " & \\vdots \\\\\n", " z_l &:= f_{A_l, b_l}(a_{l-1}) = A_l a_{l-1} + b_l \\in \\mathbb{R}^{n_l}\\\\\n", " a_l &:= \\sigma_l(z_{l}) \\in \\mathbb{R}^{n_l}\\\\\n", " & \\vdots \\\\\n", " z_L &:= f_{A_L, b_L}(a_{L-1}) = A_L a_{L-1} + b_L \\in \\mathbb{R}^{n_L} \\\\\n", " a_L &:= \\sigma_L(z_L) \\in \\mathbb{R}^{n_L} \\\\\n", " y &:= a_L \\in \\mathbb{R}^{n_o}\n", "\\end{align}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Backpropagation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cost Functions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The final result $y=F(x)$ of the feed-forward depends on all the weights in all the layers. In supervised learning, we are typically given a labeled training set $(x_1, y_1), \\ldots, (x_N, y_N)$, $x_k \\in \\mathbb{R}^{n_i}$, $y_k \\in \\mathbb{R}^{n_o}$, and we are interested in how well the network fits the data set, i.e. how close the $F(x_k)$ are to the given $y_k$. In order to measure this, we need a *cost function* $J$ that measures the distance between the vector of vectors $(F(x_1), \\ldots, F(x_N))$ and $(y_1, \\ldots, y_N)$. While in theory, this function can have arbitrary shape, the most common way to chose it, is to choose a cost function $C_k$, which only measures the distance between $F(x_k)$ and $y_k$, and then aggregate these to the total cost via\n", "\\begin{align*}\n", " J_{\\Theta}(x_1, \\ldots, x_N, y_1, \\ldots, y_N) = \\frac{1}{N} \\sum_{k=1}^{N}{C_k(F_{\\Theta}(x_k))}\n", "\\end{align*}\n", "One of the most common choices for the cost function is $C_k(y) := \\|y - y_k\\|^2$, i.e. to choose the least squares. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When training the neural network we want to minimize the cost function $J_{\\Theta}$ by changing the parameters $\\Theta$ - usually via gradient descent. Obviously, gradient descent requires the gradient of the function it is trying to minimize. The big advantage of assuming that the cost function $J_{\\Theta}$ can be written as a sum of cost functions $C_k$ is that instead of having to compute the gradient $\\nabla_{\\Theta} J_{\\Theta}(x_1, \\ldots, x_N, y_1, \\ldots, y_N)$, we can compute the gradients $\\nabla_{\\Theta}C_k(F_{\\Theta}(x_k))$ separately. Thus, instead of working on the whole training set, we will restrict our attention to a single sample $(x,y)$ with $x \\in \\mathbb{R}^{n_i}$ and $y \\in \\mathbb{R}^{n_o}$. Our aim is to compute the gradient of a single cost function $C$ on that sample, i.e. to compute\n", "\\begin{align*}\n", " \\nabla_{\\Theta}(C \\circ F_{\\Theta})(x)).\n", "\\end{align*}\n", "This means, we assume that\n", "\\begin{align*}\n", " C:\\mathbb{R}^{n_o} \\to \\mathbb{R}, && a \\mapsto C(a)\n", "\\end{align*}\n", "is a differentiable function." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reminder of Calculus: Nabla, Grad and Chain Rule" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To derive the backpropagation algorithm, we employ the following notation from calculus: \n", "\n", "**Nabla:** For any differentiable function $g:\\mathbb{R}^{n} \\to \\mathbb{R}^{m}$ and any $x \\in \\mathbb{R}^n$, we denote by $\\nabla g(x) \\in \\mathbb{R}^{m \\times n}$ the matrix of partial derivatives, i.e. \n", "\\begin{align*}\n", " (\\nabla g(x))_{ij}) = \\partial_{x_j} g_i\n", "\\end{align*}\n", "In particular, for a function $g: \\mathbb{R}^n \\to \\mathbb{R}$, we denote by $\\nabla g(x) \\in \\mathbb{R}^{1 \\times n}$ the row vector of partial derivatives.\n", "\n", "**Gradient:** For a differentiable function $g: \\mathbb{R}^n \\to \\mathbb{R}$ and an $x \\in \\mathbb{R}^n$, we denote by $\\operatorname{grad}(x) \\in \\mathbb{R}^{n \\times 1}$ the column vector of partial derivatives, i.e.\n", "\\begin{align*}\n", " \\operatorname{grad} g(x) = \\nabla g(x)^{\\top}\n", "\\end{align*}\n", "\n", "We generally regard $\\mathbb{R}^n$ as a space of column vectors.\n", "\n", "**Transpose:** For any matrix $A \\in \\mathbb{R}^{m \\times n}$, we denote its transpose by $A^{\\top} \\in \\mathbb{R}^{n \\times m}$.\n", "\n", "**Chain Rule:** For two differentiable functions $g:\\mathbb{R}^n \\to \\mathbb{R}^m$ and $h:\\mathbb{R}^m \\to \\mathbb{R}^{k}$, the derivative of the composition $h \\circ g$ is related to the derivative of the components via\n", "\\begin{align*}\n", " \\forall x \\in \\mathbb{R}^n: \\nabla(h \\circ g)(x) = \\nabla h(g(x)) \\bullet \\nabla g(x),\n", "\\end{align*}\n", "where $\\bullet$ denotes the matrix product." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plan of Attack\n", "In order to compute the gradient $\\nabla_{\\Theta}(C(F_\\Theta(x))$ , we will proceed in two steps:\n", "1. Compute $\\nabla_x (C(F_{\\Theta}(x))$ step by step working backwards through the network\n", "2. Relate the result to $\\nabla_\\Theta (C(F_{\\Theta}(x))$ " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Backwards Recursion\n", "The key idea to execute the first step is the following insight: The function $F_{\\Theta} = F_L \\circ \\ldots \\circ F_1$ is a complex composition of many functions $F_l$. Thus, computing $\\nabla F$ requires a lot of applications of the chain rule. However, computing only the last gradient $F_L$ is easy. Therefore, the idea is to work backwards by computing the derivatives of increasingly comples compositions. To that end, the following definition is helpful.\n", "\n", "**Definition:** Let $\\operatorname{NN}$ be a neural network with feed-forward function $F = F_{\\Theta} = F_L \\circ \\ldots \\circ F_1$ and $C$ be a const function for a single sample. We define the functions\n", "\\begin{align*}\n", " G_l := C \\circ F_L \\circ \\ldots \\circ F_{l+1} \\circ \\sigma_l : \\mathbb{R}^{n_l} \\to \\mathbb{R}\n", "\\end{align*}\n", "for $1 \\leq l \\leq L$.\n", "\n", "The main insight into these function is the following\n", "\n", "**Lemma:** Let $G_l$ be as a above and assume that $z_l$ are computed via feed-forward as above. Then the sequence of error terms\n", "\\begin{align*}\n", " \\varepsilon_l := \\operatorname{grad} G_l(z_l) \\in \\mathbb{R}^{n_l}\n", "\\end{align*}\n", "satisfies the backward recusion\n", "\\begin{align*}\n", " \\varepsilon_L = \\nabla \\sigma_L (z_L) \\bullet \\operatorname{grad} C(a_L), && \\varepsilon_l = \\nabla \\sigma_l (z_l) \\bullet A_{l+1}^{\\top} \\bullet \\varepsilon_{l+1}.\n", "\\end{align*}\n", "\n", "**Proof:** For $l=L$, this follows from the definitions and the chain rule as\n", "\\begin{align*}\n", " \\nabla G_L(z_L)\n", " = \\nabla (C \\circ \\sigma_L)(z_L)\n", " = \\nabla C(\\sigma_L(z_L)) \\bullet \\nabla \\sigma_L(z_L)\n", "\\end{align*}\n", "and thus\n", "\\begin{align*}\n", " \\varepsilon_L = \\operatorname{grad} G_L (z_L) = (\\nabla G_L(z_L))^{\\top} = \\nabla \\sigma_L(z_L) \\bullet \\operatorname{grad} C(\\sigma_L(z_L)).\n", "\\end{align*}\n", "Here, we use the above mentioned convention that we identify the scalar function $\\sigma_l:\\mathbb{R} \\to \\mathbb{R}$ with the vector valued function $\\sigma_l:\\mathbb{R}^{n_l} \\to \\mathbb{R}^{n_l}$, $v \\mapsto (\\sigma(v_1), \\ldots, \\sigma(v_{n_l}))$. Thus, the derivative of this vector valued function is given as a diagonal matrix $\\nabla \\sigma_l (v)$, where the diagonal is given by $\\sigma'(v_1), \\ldots, \\sigma'(v_{n_l})$. Thus, this matrix is symmetric, i.e. $\\nabla \\sigma_l (v) = \\nabla \\sigma_l (v)^{\\top}$.\n", "\n", "For $l+1 \\to l$, notice that by definition, the funtions $G_l$ satisfy\n", "\\begin{align*}\n", " G_l &= C \\circ F_L \\circ \\ldots \\circ F_{l+2} \\circ F_{l+1} \\circ \\sigma_l \\\\\n", " &= C \\circ F_L \\circ \\ldots \\circ F_{l+2} \\circ \\sigma_{l+1} \\circ f_{A_{l+1},b_{l+1}} \\circ \\sigma_l \\\\\n", " &= G_{l+1} \\circ f_{A_{l+1},b_{l+1}} \\circ \\sigma_l \\\\\n", "\\end{align*}\n", "\n", "Thus,\n", "\\begin{align*}\n", " \\nabla G_l(z_l) & = \\nabla G_{l+1}(f_{A_{l+1},b_{l+1}}(\\sigma_l(z_l))) \\bullet \\nabla f_{A_{l+1},b_{l+1}}(\\sigma_l(z_l)) \\bullet \\nabla \\sigma_l(z_l) \\\\\n", " &= \\nabla G_{l+1}(z_{l+1})) \\bullet A_{l+1} \\bullet \\nabla \\sigma_l(z_l),\n", "\\end{align*}\n", "which implies\n", "\\begin{align*}\n", " \\varepsilon_l \n", " = \\operatorname{grad} G_l(z_l)\n", " = \\nabla G_l(z_l)^{\\top} \n", " = \\nabla \\sigma_l(z_l) \\bullet A_{l+1}^{\\top} \\bullet \\varepsilon_{l+1}.\n", "\\end{align*}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Backwards Gradient Computation\n", "\n", "Finally, we use the result of the previous lemma to compute the derivative $\\nabla_{\\Theta}(F_{\\Theta}(x))$.\n", "\n", "**Theorem (backpropagation):** Let $\\operatorname{NN} = (\\Theta_l \\sigma_l)_{1 \\leq l \\leq L}$, $\\Theta_l=(A_l, b_l)$, be an MLP and $x \\in \\mathbb{R}^{n_i}$ be an input. Let $C:\\mathbb{R}^{n_o} \\to \\mathbb{R}$ be a differentiable cost function. Let $(\\varepsilon_l)_{1 \\leq l \\leq L}$ be the sequence of error terms of the previous lemma. Then\n", "\\begin{align*}\n", " \\operatorname{grad}_{b_l}(C(F_{\\Theta}(x))) &= \\varepsilon_{l} \\\\\n", " \\operatorname{grad}_{A_l}(C(F_{\\Theta}(x))) &= a_{l-1} \\varepsilon_{l}^{\\top}\n", "\\end{align*}\n", "where $a_l$ is defined as above (feed forward).\n", "\n", "**Proof:** Analogously to the previous lemma, we define the functions\n", "\\begin{align*}\n", " G_{A,b}^l := C \\circ F_L \\circ \\ldots \\circ F_{l+1} \\circ \\sigma_l \\circ f_{A,b}: \\mathbb{R}^{n_{l-1}} \\to \\mathbb{R}\n", "\\end{align*}\n", "By construction $G_{A,b}^l = G_l \\circ f_{A,b}$. Therefore,\n", "\\begin{align*}\n", " \\nabla_b (G_{A_l,b}^l(a_{l-1}))(b_l) = \\nabla G_l (f_{A_l,b_l}(a_{l-1})) \\bullet \\nabla_b f_{A_l, b}(b_l) = \\nabla G_l(z_l),\n", "\\end{align*}\n", "as $\\nabla b f_{A,b}$ is the identity matrix. Therefore, \n", "\\begin{align*}\n", " \\nabla_{b_l} C(F_{\\Theta}(x)) \n", " & = \\nabla _b(C \\circ F_L \\circ \\ldots \\circ F_1(x))(b_l) \\\\\n", " & = \\nabla_b(G_{A_l,b} \\circ F_{l-1} \\circ \\ldots \\circ F_1(x))(b_l)\\\\\n", " &= \\nabla G_l(z_l) = \\varepsilon_l,\n", "\\end{align*}\n", "which implies the first claim. \n", "\n", "To see the second, notice that as a function of $A$, we have $f_{\\_,b_l}(a_{l-1}):\\mathbb{R}^{n_l \\times n_{l-1}} \\to \\mathbb{R}^{n_l}$ and hence analogously, $G_{\\_,b}^l(a_{l-1}) = (G_l \\circ f_{\\_,b})(a_{l-1}):\\mathbb{R}^{n_l \\times n_{l-1}} \\to \\mathbb{R}$. Thence, we can calculate in coordinates using the chain rule\n", "\n", "\\begin{align*}\n", " \\frac{\\partial( G_{\\_,b}^l(a_{l-1}))(A_l)}{\\partial A_{\\nu \\mu}} \n", " =\\sum_{k=1}^{n_l}{\\nabla G_l}(f_{A_l,b_l}(a_{l-1}))_k \\frac{\\partial (A a_{l-1}+b)(A_l)_k}{\\partial A_{\\nu \\mu}}\n", " =\\sum_{k=1}^{n_l}{ \\nabla G_l(z_l)_k \\delta_{\\nu k} a_{l-1;\\mu} }\n", " =\\varepsilon_{l;\\nu} a_{l-1;\\mu}\n", " =(\\varepsilon_{l} a_{l-1}^{\\top})_{\\nu \\mu}.\n", "\\end{align*}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Algorithm\n", "Putting everything together, the backpropagation algorithm works as follows:\n", "\n", "**Algorithm (backpropagation):**\n", "\n", "**Inputs:**\n", "* A neural network $\\operatorname{NN} = (A_l, b_l, \\sigma_l)_{1 \\leq l \\leq L}$,\n", "* a single input $x \\in \\mathbb{R}^{n_i}$,\n", "* a cost function $C:\\mathbb{R}^{n_o} \\to \\mathbb{R}$ for that input.\n", "\n", "**Outputs:**\n", "The gradients\n", "* $\\nabla_{b_l}(C(F_{\\Theta}(x))$ and \n", "* $\\nabla_{A_l}(C(F_{\\Theta}(x))$.\n", "\n", "**Steps:**\n", "1. Compute the feed forward $F_{\\Theta}(x)$\n", " * Initialize: $a_0 := x$\n", " * For $l=1, \\ldots, L$:\n", " * $a_l := f_{A_l,b_l}(a_{l-1})$\n", " * $z_l := \\sigma_l(a_l)$\n", "2. Compute the errors $\\varepsilon_l$:\n", " * Initialize: $\\varepsilon_L := \\nabla \\sigma_L(z_L) \\operatorname{grad}C(a_L)$\n", " * For $L=l-1, \\ldots, 1$: $\\varepsilon_l := \\nabla \\sigma_l(z_l) A_{l+1}^{\\top} \\varepsilon_{l+1}$.\n", "3. Compute the gradients: For $l=1, \\ldots, L$ (or in any order):\n", " * $\\operatorname{grad}_{b_l}(C(F_{\\Theta}(x)) = \\varepsilon_l$\n", " * $\\operatorname{grad}_{A_l}(C(F_{\\Theta}(x)) = a_{l-1} \\varepsilon_l^{\\top}$\n", "\n", "In case, we have multiple training samples $x_i$ (which we usually have), the above is repeated on every training sample and then the gradient of the total cost function $J$ is given as the average of the gradients over the samples. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# References\n", "There are various other sources on backpropagations you might find helpful (list not exhaustive):\n", "\n", "* http://neuralnetworksanddeeplearning.com/chap2.html\n", "* https://brilliant.org/wiki/backpropagation/\n", "* https://datascience.stackexchange.com/questions/44703/how-does-gradient-descent-and-backpropagation-work-together\n", "* https://stackoverflow.com/questions/47416861/backward-propagation-in-keras" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.18" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }