{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "***\n", "# Lecture 4 - Convergence of Iterative Methods\n", "***\n", "$\\newcommand{\\vct}[1]{\\mathbf{#1}}$\n", "$\\newcommand{\\mtx}[1]{\\mathbf{#1}}$\n", "$\\newcommand{\\e}{\\varepsilon}$\n", "$\\newcommand{\\norm}[1]{\\|#1\\|}$\n", "$\\newcommand{\\minimize}{\\text{minimize}\\quad}$\n", "$\\newcommand{\\maximize}{\\text{maximize}\\quad}$\n", "$\\newcommand{\\subjto}{\\quad\\text{subject to}\\quad}$\n", "$\\newcommand{\\R}{\\mathbb{R}}$\n", "$\\newcommand{\\trans}{T}$\n", "$\\newcommand{\\ip}[2]{\\langle {#1}, {#2} \\rangle}$\n", "$\\newcommand{\\zerovct}{\\vct{0}}$\n", "$\\newcommand{\\diff}[1]{\\mathrm{d}{#1}}$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Iterative algorithms for solving a problem of the form\n", "\n", "\\begin{equation*}\n", " \\minimize f(\\vct{x})\n", "\\end{equation*}\n", "\n", "on $\\R^n$ generate a sequence of vectors $\\vct{x}_0,\\vct{x}_1,\\dots$ in the hope that this sequence converges to a (local or global) minimizer $\\vct{x}^*$. In this lecture we first study step length selection procedures and then study what it means for a sequence to converge, and how to quantify the speed of convergence." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Step length selection\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When moving in a descent direction $\\vct{p}_k$ (not necessarily the steepest descent direction), \n", "\n", "\\begin{equation*}\n", " \\vct{x}_{k+1} = \\vct{x}_k+\\alpha_k \\vct{p}_k\n", "\\end{equation*}\n", "\n", "we can only guarantee that the function value will decrease if the step length $\\alpha_k$ is small enough. If we move too far in a descent direction, we might even land at a point where $f$ is larger than where we started! It is therefore important to choose a step length that \n", "\n", "* is not too small (so that the algorithm does not take too long);\n", "* is not too large (so that we don't end up at a point with larger function value);\n", "* is easy to compute.\n", "\n", "An optimal step $\\alpha$ for a descent method would be the minimizer of the function\n", "\n", "\\begin{equation*}\n", " \\alpha \\mapsto \\varphi(\\alpha) := f(\\vct{x}_k+\\alpha \\vct{p}_k).\n", "\\end{equation*}\n", "\n", "In practice, minimizing this function is not always the most efficient thing to do (or even possible). One would rather choose a step length that satisfies some criteria that ensure that the sequence $\\vct{x}_k$ converges to a minimizer $\\vct{x}^*$ under suitable conditions on a function $f$. One such condition is a **sufficient decrease condition**,\n", "\n", "\\begin{equation*}\n", " f(\\vct{x}_k+\\alpha \\vct{p}_k) \\leq f(\\vct{x}_k) + c \\alpha\\ip{\\nabla f(\\vct{x}_k)}{\\vct{p}_k} =: \\ell(\\alpha),\n", "\\end{equation*}\n", "\n", "with $c\\in (0,1)$. Note that $\\varphi'(0) = \\ip{\\nabla f(\\vct{x}_k)}{\\vct{p}_k} <0$, because $\\vct{p}_k$ is a descent direction. The function $\\ell(\\alpha)$ is therefore a line through $f(\\vct{x}_k)$ with a slope $c \\ip{\\nabla f(\\vct{x}_k)}{\\vct{p}_k} = c\\varphi'(0)>\\varphi'(0)$. To see why this condition is necessary, consider the function $f(x)=x^2-1$ and the sequence $x_k = \\sqrt{1+1/k}$ for $k\\geq 1$. Clearly, the sequence $f(\\vct{x}_k)=1/k$ decreases, but fails to converge to the minimizer $f(0)=-1$.\n", "\n", "![Armijo](images/armillo.png)\n", "\n", "The sufficient decrease condition (also called Armijo condition) can always be satisfied if $\\alpha$ is chosen small enough, but the algorithm may become very slow. It is therefore common to supplement the sufficient descent condition with other criteria that guarantee that sufficient progress is made. Two of the commonly used criteria are:\n", "\n", "* the Wolfe conditions, which add a *curvature condition*\n", "\n", "\\begin{equation*}\n", "\\varphi'(\\alpha)\\geq \\tilde{c} \\varphi'(0)\n", "\\end{equation*}\n", "\n", "for some $\\tilde{c}\\in (c,1)$, which gives a lower bound on the slope of the new point;\n", "\n", "* the Armijo-Goldstein conditions, which state that a step length $\\alpha_k$ should additionally satisfy bound \n", "\n", "\\begin{equation*}\n", " f(\\vct{x}_k)+(1-c)\\alpha_k \\ip{\\nabla f(\\vct{x}_k)}{\\vct{p}_k} \\leq f(\\vct{x}_k+\\alpha_k \\vct{p}_k),\n", "\\end{equation*}\n", "\n", "which gives a lower bound on the step size.\n", "\n", "Another common approach is **backtracking**: in this method one uses a high initial value of $\\alpha$ (for example, $\\alpha=1$), and then decreases it until the sufficient descent condition is satisfied." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Example.** Consider the function $f\\colon \\R^2\\to \\R$ defined by $f(\\vct{x}) = x_1^2+x_2^2$. The gradient is $\\nabla f(\\vct{x}) = 2\\vct{x}$, and the $\\phi$ function at $\\vct{x}_k=(1,1)^{\\trans}$\n", " \n", " \\begin{equation*}\n", " \\phi(\\alpha) = f(\\vct{x}_k-\\alpha \\nabla f(\\vct{x}_k)) = 2(1-2\\alpha)^2, \\quad \\phi'(\\alpha) = -8(1-2\\alpha).\n", " \\end{equation*}\n", "\n", "The Armijo-Goldstein conditions then state that we can choose $\\alpha$ such that\n", "\n", "\\begin{equation*}\n", " 2(1-4(1-c)\\alpha) \\leq 2(1-2\\alpha)^2 \\leq 2(1-4c\\alpha).\n", "\\end{equation*}\n", "\n", "The optimal step length in this case would be $\\alpha=0.5$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Convergence of iterative methods\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A sequence of vectors $\\{\\vct{x}_k\\}$ in $\\R^n$, $k\\geq 0$, {\\em converges} to a vector $\\vct{x}^*$ with respect to a norm $\\norm{\\cdot}$ as $k\\to \\infty$, written $\\vct{x}_k\\to \\vct{x}$, if the sequence of numbers $\\norm{\\vct{x}_k-\\vct{x}^*}$ converges to zero. More formally, if for every $\\e>0$ there exists an index $N$ such that for all $n\\geq N$,\n", "\n", "\\begin{equation*}\n", " \\norm{\\vct{x}_n-\\vct{x}^*} < \\e.\n", "\\end{equation*}\n", "\n", "Iterative algorithms will rarely find the exact solution to a minimization problem, so we will usually be happy to find a solution that differs from the true one by at most some specified accuracy.\n", "\n", "**Definition.** A sequence of vectors $\\{\\vct{x}_k\\}$, $k\\geq 0$, is said to converge to $\\vct{x}^*$ \n", "\n", "(a) **linearly** (or Q-linear, Q for quotient), if there exist an $r\\in (0,1)$ such that for sufficiently large $k$,\n", " \n", " \\begin{equation*}\n", " \\norm{\\vct{x}_{k+1}-\\vct{x}^*}\\leq r\\norm{\\vct{x}_k-\\vct{x}^*}.\n", " \\end{equation*}\n", " \n", "(b) **superlinearly**, if \n", "\n", "\\begin{equation*}\n", " \\lim_{k\\to \\infty} \\frac{\\norm{\\vct{x}_{k+1}-\\vct{x}^*}}{\\norm{\\vct{x}_k-\\vct{x}^*}} = 0,\n", " \\end{equation*}\n", "\n", "(c) with order $p$, if there exists a constant $M>0$, such that for sufficiently large $k$,\n", "\n", "\\begin{equation*}\n", " \\norm{\\vct{x}_{k+1}-\\vct{x}^*}\\leq M\\norm{\\vct{x}_k-\\vct{x}^*}^p.\n", " \\end{equation*}\n", "\n", "The case $p=2$ is called **quadratic convergence**.\n", "\n", "Of course, as mentioned earlier, these definitions depend on the choice of a norm. It can be shown that quadratic convergence implies superlinear convergence, and superlinear convergence implies linear convergence.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Example.** Consider the sequence of numbers $x_k = 1/2^{r^k}$ for some $r>1$. Clearly, $x_k\\to x^*=0$ as $k\\to \\infty$. Moreover,\n", "\n", " \\begin{equation*}\n", " x_{k+1}=\\frac{1}{2^{r^{k+1}}}=\\frac{1}{2^{r^{k}r}}=\\left(\\frac{1}{2^{r^{k}}}\\right)^r=x_k^r,\n", " \\end{equation*}\n", "\n", "which shows that the sequence has rate of convergence $r$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Convergence of gradient descent\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this section, a norm $\\norm{\\cdot}$ will refere to the $2$-norm, unless otherwise stated.\n", "We now study the convergence of gradient descent for the least squares problem\n", "\n", "\\begin{equation*}\n", " \\minimize f(\\vct{x}) = \\frac{1}{2}\\norm{\\mtx{A}\\vct{x}-\\vct{b}}^2,\n", "\\end{equation*}\n", "\n", "where $\\mtx{A}\\in \\R^{m\\times n}$ with $m\\geq n$ of full rank.\n", "As we have seen in Lecture 3, the gradient descent method is the procedure\n", "\n", "\\begin{equation*}\n", " \\vct{x}_{k+1}=\\vct{x}_{k} + \\alpha_k\\vct{r}_k,\n", "\\end{equation*}\n", "\n", "where the step length and the *residual* are given by\n", "\n", "\\begin{equation*}\n", " \\alpha_k = \\frac{\\norm{\\vct{r}_k}^2}{\\norm{\\mtx{A}\\vct{r}_k}^2}, \\quad \\vct{r}_k = \\mtx{A}^{\\trans}(\\vct{b}-\\mtx{A}\\vct{x}_k)=-\\nabla f(\\vct{x}_k).\n", "\\end{equation*}\n", "\n", "At the minimizer, the residual is \n", "\n", "\\begin{equation*}\n", " \\vct{r} = -\\nabla f(\\vct{x}^*)=\\mtx{A}^{\\trans}(\\vct{b}-\\mtx{A}\\vct{x}^*) = 0,\n", "\\end{equation*}\n", "\n", "and as the sequence $\\vct{x}_k$ converges to $\\vct{x}^*$, the norms of the residuals converge to $0$. Conversely, the residual is related to the difference $\\vct{x}_k-\\vct{x}^*$ by\n", "\n", "\\begin{equation*}\n", " \\vct{r}_k = \\mtx{A}^{\\trans}(\\vct{b}-\\mtx{A}\\vct{x}_k) = \\mtx{A}^{\\trans}(\\vct{b}-\\mtx{A}\\vct{x}_k-(\\vct{b}-\\mtx{A}\\vct{x}^*)) = \\mtx{A}^{\\trans}\\mtx{A}(\\vct{x_k}-\\vct{x}^*),\n", "\\end{equation*}\n", "\n", "Therefore\n", "\\begin{equation*}\n", " \\norm{\\vct{x}_k-\\vct{x}^*} = \\norm{(\\mtx{A}^{\\trans}\\mtx{A})^{-1}\\vct{r}}\\leq \\norm{(\\mtx{A}^{\\trans}\\mtx{A})^{-1}}\\norm{\\vct{r}_k},\n", "\\end{equation*}\n", "\n", "where $\\norm{\\mtx{B}} = \\max_{\\vct{x}\\neq \\zerovct}\\norm{\\mtx{B}\\vct{x}}/\\norm{\\vct{x}}$ is the operator norm of a matrix $\\mtx{B}$ with respect to the $2$-norm. Consequently, if the sequence $\\norm{\\vct{r}_k}$ converges to zero, so does the sequence $\\norm{\\vct{x}_k-\\vct{x}^*}$. A reasonable criterium is to stop the algorithm is therefore when the residual norm $\\norm{\\vct{r}_k}$ is below a predefined tolerance $\\e$. \n", "\n", "The following theorem (whose proof we omit) shows that the gradient descent method for linear least squares converges linearly with respect to the $\\mtx{A}$ norm. The statement involves the **condition number** of $\\mtx{A}$ (the concept of condition number, introduced by Alan Turing while in Manchester, is one of the most important ideas in numerical analysis, as it is indispensable in studying the performance of numerical algorithms.). This quantity is defined as\n", "\n", "\\begin{equation*}\n", " \\kappa(\\mtx{A}) = \\norm{\\mtx{A}}\\cdot \\norm{\\mtx{A}^{\\dagger}}.\n", "\\end{equation*}\n", "\n", "**Theorem.** The error in the $k+1$-th iterate is bounded by\n", " \\begin{equation*}\n", " \\norm{\\vct{x}_{k+1}-\\vct{x}^*} \\leq \\left(\\frac{\\kappa^2(\\mtx{A})-1}{\\kappa^2(\\mtx{A})+1}\\right)\\norm{\\vct{x}_{k}-\\vct{x}^*}.\n", " \\end{equation*}\n", "\n", "In particular, the gradient descent algorithm converges linearly. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python [py27]", "language": "python", "name": "Python [py27]" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.12" } }, "nbformat": 4, "nbformat_minor": 0 }