{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# EART97012 \n", " \n", "# Geophysical Inversion \n", " \n", "## Lecture 6 " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Today's learning objectives \n", " \n", " \n", "1. To more formally define equi, under, over and mixed determined systems\n", " \n", " \n", "2. To derive the least squares and minimum norm solution methods we've already seen\n", " \n", " \n", "3. To understand in what situations we can use these solution approaches\n", " \n", " \n", "4. To consider additional solution approaches where the \"simple\" least squares and minimum norm solution formulae will not work" ] }, { "cell_type": "markdown", "metadata": { "toc": true }, "source": [ "

Table of Contents
" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "%precision 6\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import scipy.linalg as sl\n", "from pprint import pprint" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Introductory comments \n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the previous lecture we saw how the rank and the null space were important concepts in the solvability of both square and non-square linear systems $A\\boldsymbol{x} = \\boldsymbol{b}$ or $G\\boldsymbol{m} = \\boldsymbol{d}$.\n", "\n", "\n", "
\n", "\n", "We saw that in the ***over-determined case***, where we can't in general satisfy all equations/constraints, that the ***least squares solution*** gave us one useful solution, and could be found by solving the square linear system:\n", "\n", "$$A^TA\\boldsymbol{x} = A^T\\boldsymbol{b}$$\n", "\n", "or assuming the inverse can be taken:\n", "\n", "$$\\boldsymbol{x} = (A^TA)^{-1}A^T\\boldsymbol{b}$$\n", "\n", "
\n", "\n", "\n", "While in the ***under-determined case***, assuming there is no inconsistency between equations, we have infinitely many solutions. The solution being closest to the origin, i.e. the ***minimum-norm solution***, could be obtained using the formula\n", "\n", "$$\\boldsymbol{x} = A^T(AA^T)^{-1}\\boldsymbol{b},$$\n", "\n", "in this case assuming that $AA^T$ is invertible.\n", "\n", "\n", "
\n", "\n", "We ended the previous lecture by considering some small examples of non-square problems.\n", "\n", "In particular, by considering multiplication by a matrix $A\\in\\mathbb{R}^{m\\times n}$ in terms of tranforming vectors or points from $\\mathbb{R}^n$ into $\\mathbb{R}^m$ we plotted examples of how this transformatiuon maps clouds of points, and also our (least squares, minimum norm or exact-unique) solution.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Over-, under-, equi- and mixed-determined problems\n", "\n", "\n", "We'll quickly review the different cases, and then return to each in a little more detail.\n", "\n", "\n", "## Introduction\n", "\n", "Consider the linear inversion problem represented by the equation \n", "\n", "$$A\\, \\boldsymbol{x} = \\boldsymbol{b}\n", " \\quad\\text{or}\\quad\n", "G\\, \\boldsymbol{m} = \\boldsymbol{d} $$\n", "\n", "where $A$ or $G$ is an $m \\times n$ matrix, we can formally identify several possible circumstances: \n", "\n", "\n", "\n", "### The equi-determined case\n", "\n", "\n", "\n", "There are exactly as many data points as model parameters. \n", "\n", "The matrix $A$ will be square so that $m = n$. \n", "\n", "There will be as many equations as unknowns. \n", "\n", "
\n", "\n", "If these equations are both ***independent***\n", "\n", "(i.e. the \"lines\" we saw previously in $2\\times 2$ cases corresponding to each equation/constraint are not parallel) \n", "\n", "are and ***consistent***\n", "\n", "(which means that the lines cross at least once; inconsistency is where we can derive a contradiction from the equations, e.g. using row operations to arrive at the equation $0=1$, i.e. the rank of the augmented matrix is greater than the rank of the coefficient matrix - see next cell), \n", "\n", "then the problem is called ***equi-determined***. \n", "\n", "There will be one and only one solution that will fit the data (the RHS vector) exactly. \n", "\n", "

\n", "\n", "For further discussion of this see \n", "\n", "also ..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Aside - the Rouché–Capelli theorem\n", "\n", "The [Rouché–Capelli theorem](https://en.wikipedia.org/wiki/Rouch%C3%A9%E2%80%93Capelli_theorem) tells us that \n", "\n", "\"any system of equations ... is inconsistent if the rank of the augmented matrix is greater than the rank of the coefficient matrix. If, on the other hand, the ranks of these two matrices are equal, the system must have at least one solution. The solution is unique if and only if the rank equals the number of variables. Otherwise the general solution has k free parameters where k is the difference between the number of variables and the rank; hence in such a case there are an infinitude of solutions.\"\n", "\n", "Above quoted from \n", "\n", "\n", "

\n", "\n", "We saw a simple example of an inconsistent system in the previous lecture:\n", "\n", "$$ \n", "\\begin{align*}\n", "x + y + z = 0 \\\\[5pt]\n", "x + y + z = 1\n", "\\end{align*}\n", "$$\n", "\n", "if you form the augmented matrix here and perform row operations, the rank of the augmented matrix will clearly be larger than the rank of $A$ alone. Just drop the $z$ from these equations and we have an inconsistent square system." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The over-determined case\n", "\n", "\n", "\n", "\n", "In this case there are more independent data points than there are model parameters.\n", "\n", "The matrix $A$ will be a \"tall\" matrix with more rows than columns so that $m > n$. \n", "\n", "Equivalently, there are more equations (or constraints) than unknowns. \n", "\n", "If $A$ is full rank (recall that this means that $\\text{rank}(A) = \\min(m,n)= n$), then the inversion problem is called ***purely over-determined***. \n", "\n", "\n", "Typically there will be no exact solution to an over-determined problem. \n", "\n", "In the purely over-determined case we can use our formula above for the ***least squares solution*** \n", "\n", "\n", "$$\\boldsymbol{x} = (A^TA)^{-1}A^T\\boldsymbol{b}$$\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### The under-determined case\n", "\n", "\n", "\n", "\n", "\n", "In this case there are fewer data points than model parameters. \n", "\n", "The matrix $A$ will be a \"fat\" matrix with fewer rows than columns so that $m < n$. \n", "\n", "Equivalently, there are less equations (or constraints) than unknowns. \n", "\n", "If $A$ is full rank (recall that this means that $\\text{rank}(A) = \\min(m,n) = m$), then the problem is called ***purely under-determined***. \n", "\n", "There will be infinitely many solutions that are able to fit the data exactly, including the ***minimum-norm solution***\n", "\n", "$$\\boldsymbol{x} = A^T(AA^T)^{-1}\\boldsymbol{b}$$\n", "\n", "

\n", "\n", "If an under-determined case is **not** purely under-determined (i.e. $\\text{rank}(A) < m$), then this means that we can't find solutions that exactly fit the data.\n", "\n", "We've seen simple examples of purely under-determined problems in the previous lecture.\n", "\n", "The following is an example that is **not** purely under-determined, as it is indeed under-determined but has no exact solution(s) as the equations are inconsistent:\n", "\n", "$$\n", "\\begin{pmatrix}\n", "1 & 2 & 3 \\\\\n", "2 & 4 & 6\n", "\\end{pmatrix}\n", "\\begin{pmatrix}\n", "m_1\\\\\n", "m_2\\\\\n", "m_3\n", "\\end{pmatrix}\n", "=\n", "\\begin{pmatrix}\n", "1\\\\\n", "1\n", "\\end{pmatrix}.\n", "$$\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### The mixed-determined case\n", "\n", "\n", "Many real problems will actually be partly over-determined and partly under-determined. \n", "\n", "These problems are called ***mixed-determined***. \n", "\n", "Problems in which some but not all of the parameters are equi-determined are also mixed determined. \n", "\n", "In mixed-determined problems, some of the model parameters (i.e. some components of the unknown $\\boldsymbol{x}$) will typically be described by more independent equations than there are parameters and there will be no model that can satisfy this set of equations, \n", "\n", "while some other parameters will be described by fewer independent equations than there are parameters and there may be be infinitely many models that can satisfy this second set of equations. \n", "\n", "A mixed-determined problem can occur with any shaped matrix $A$. \n", "\n", "There will typically be no exact solution to this problem. In addition, there will be infinitely many solutions that can fit the data equally accurately. \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Rank etc - the non-square case [repeated from L3]\n", "\n", "When $A$ is of size $m\\times n$ we now think of $A$ as mapping vectors in $\\mathbb{R}^n$ into $\\mathbb{R}^m$ - matrix-multiplication can be interpreted as moving in a linear combination of directions given by the columns of the matrix in the space $\\mathbb{R}^m$. \n", "\n", "\n", "In the non-square case $A$ is said to have *full rank* if $\\text{rank}(A)=\\min(m,n)$ and is said to be *rank deficient* if $\\text{rank}(A)<\\min(m,n)$. \n", "\n", "\n", "Note that as the column and row rank are also equal for non-square matrices, i.e. the number of linearly independent rows is the same as the number of linearly independent columns, then of course the maximum the rank of the matrix can ever be is $\\min(m,n)$\n", " \n", "
\n", "\n", "Assuming $m\\ne n$ then there are two possibilities:\n", "\n", "\n", "1. In the \"tall\" [over-determined](https://en.wikipedia.org/wiki/Overdetermined_system) case ($m>n$), the range (or column space) of $A$ doesn't (cannot) cover all of the space that our data $\\boldsymbol{b}$ lives in, so we cannot find exact solutions for all data. But if $\\boldsymbol{b}$ does live in the range then we can find a solution (it may or may not be unique).\n", "\n", "\n", "2. In the \"fat\" [under-determined](https://en.wikipedia.org/wiki/Underdetermined_system) case ($m\n", "\n", "You could check this with an example - see homework.\n", "\n", "
\n", "\n", "We have several important results:\n", "\n", "\n", "- A square $(m = n)$ matrix is full rank if and only if it is non-singular. \n", "\n", "\n", "- For a tall $(m > n)$ matrix $A$ that is full rank, the matrix formed by $A^TA$ (which has dimension $n \\times n$) will be non-singular. \n", "\n", "\n", "- For a fat $(m < n)$ matrix $A$ that is full rank, the matrix formed by $AA^T$ (which has dimension $m \\times m$) will be non-singular. \n", " \n", " \n", "- If $A$ is not full rank, then whatever the shape of $A$, both these matrices will be singular. \n", "\n", "\n", "The first three cases correspond respectively to equi-, purely over-, and purely under- determined inversion problems. The mixed-determined case will fall into category 4." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Why is this important?\n", "\n", "\n", "
\n", "\n", "- Recall in the under-determined case we used the following expression to give us the *minimum norm solution*\n", "\n", "$$\\boldsymbol{x} = A^T(AA^T)^{-1}\\boldsymbol{b}$$\n", "\n", "\n", "\n", "- In the over-determined case we used the following to give us the *least squares solution*\n", "\n", "$$\\boldsymbol{x} = (A^TA)^{-1} A^T\\boldsymbol{b}$$\n", "\n", "\n", "
\n", "\n", "Establishing that $A$ has full rank, no matter its shape or size, allows us to use these two explicit formulae as appropriate.\n", "\n", "And of course in the equi-determined case means we have an inverse matrix.\n", "\n", "But we have a problem in what to do for problems in case 4 which includes mixed-determined problems." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Some examples\n", "\n", "### Example 1\n", "\n", "Consider the following problem:\n", "\n", "$$\n", "\\begin{pmatrix}\n", "1 & 0 & 0 \\\\\n", "1 & 0 & 0 \\\\\n", "0 & 2 & 2 \\\\\n", "0 & 3 & 3\n", "\\end{pmatrix}\n", "\\begin{pmatrix}\n", "m_1\\\\\n", "m_2\\\\\n", "m_3\n", "\\end{pmatrix}\n", "=\n", "\\begin{pmatrix}\n", "1\\\\\n", "2\\\\\n", "2\\\\\n", "3\n", "\\end{pmatrix}.\n", "$$\n", "\n", "For this problem, $m_1$ is over-determined since there is no possible value of $m_1$ that can exactly fit both of the first two equations in this system - the first two equations/constraints are inconsistent. \n", "\n", "The problem is also under-determined because the last two equations are not independent. In the particular case as written here the final two equations actually boil down to one equation for two unknowns. Thus, there are infinitely many solutions for $m_2$ and $m_3$ that can fit the last two equations exactly. \n", "\n", "This problem is therefore clearly ***mixed-determined***. \n", "\n", "A homework exercise asks you to come up with a sensible/reasonable \"solution\" to this problem.\n", "\n", "
\n", "\n", "\n", "Most, large, real-world, inversion problems are likely to be mixed-determined. In practice it can also be very difficult to discover if a large system is mixed determined or not, and so in practical problems we will often proceed by assuming that the problem is mixed-determined, employing an iterative numerical approach. \n", "\n", "\n", "

\n", "\n", "### Example 2\n", "\n", "The following figure illustrates these four problem types for a simple tomographic experiment. The \"s\" values plotted just indicate that some measurement has been taken, and we want to use this to establish values in the two blocks.\n", "\n", "\n", "\n", "\n", "\n", "

\n", "\n", "- In the first case we have exactly enough information to solve our inversion problem exactly.\n", "\n", "\n", "- In the second case we can only describe the mean of the parameter values, but without additional information there is an infinite family of possible individual parameter values that agree with our data - we have an under-determined problem.\n", "\n", "\n", "- In the third we are given the parameters in each rectangle and additionally the mean - we have an over-determined problem; we may or may not be able to find a solution that satsifies all observations/equations.\n", "\n", "\n", "- In the fourth we are only told the mean (so are under-determined as far as the individual block values, or their difference, are concerned - so these quantities are under-determined) but we are told that mean value three times (so we are over-determined in terms of the mean - an issue if the three data values are different). We could transform this problem into one for two new parameters - the mean and the difference - where we have three equations for the first parameter and none for the second." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example 3 - A tomography inspired mixed-determined problem\n", "\n", "A relatively simple example that demonstrates a mixed-determined problem that cannot be solved using the methods above is that of estimating the spatial distribution of slowness (the reciprocal of velocity) from measurements of travel times along several straight ray paths through a solid body. \n", "\n", "We saw a $4\\times 4$ version of this problem in the cell on tomography in an earlier lecture - this cell is now repeated for completeness.\n", "\n", "**Straight-ray tomography (repeated from previous lecture)**\n", "\n", "[\"Tomography is imaging by sections or sectioning, through the use of any kind of penetrating wave\"](https://en.wikipedia.org/wiki/Tomography).\n", "\n", "Most acoustic and seismic inverse problems are non-linear (since changing the model parameters - the structure of the body being imaged will change the ray path (e.g. due to refraction or reflections) the wave travels along, and hence change the arrival times (the data) in a nonlinear manner). \n", "\n", "However, if we assume ray that paths do not vary with the model (which we could justify if we assume that we don't change the $\\boldsymbol{m}$ values very much), and if the data are travel times and the model is composed of [slowness](https://en.wikipedia.org/wiki/Slowness_(seismology)) values, then the problem is linear. \n", "\n", "Consider the following example\n", "\n", "\n", "\n", "\n", "Suppose that a wall is assembled from an array of bricks as above, and that each brick is composed of a different type of clay. If the acoustic velocities of the different clays differ, then one might attempt to distinguish the different kinds of bricks by measuring the travel time of sound across the various rows and columns of bricks, in the wall. The data in this problem are $n = 8$ measurements of travel time (based on there being 4 rows and 4 columns) \n", "\n", "$$ \\boldsymbol{d} = [T_1 , T_2 , T_3 , \\ldots, T_8]^T$$\n", "\n", "We will assume that each brick is composed of a uniform material and that the travel time of sound across each brick is proportional to the width and height of the brick $h$. 
The proportionality factor is the brick’s slowness $s_i$, giving 16 model parameters \n", "\n", "$$ \\boldsymbol{s} = [s_1 , s_2 , s_3 , \\ldots, s_{16}]^T$$\n", "\n", "where the ordering of the model parameters is according to the numbering scheme in the schematic above. The travel time of acoustic waves through the rows and columns of a square array of bricks is measured with the acoustic source and receiver placed on the edges of the square. \n", "\n", "The set of linear simultaneous equations that we must now solve are: \n", "\n", "$$\n", "\\begin{align*}\n", "hs_1+hs_2 + hs_3 + hs_4 &= T_1\\\\\n", "hs_5+hs_6 + hs_7 + hs_8 &= T_2\\\\\n", "&\\vdots \\\\\n", "hs_4+hs_8 + hs_{12} + hs_{16} &= T_8\n", "\\end{align*}\n", "$$\n", "\n", "where we order our firing of waves from left to right, starting at the top moving down, and then from top to bottom starting at the left and moving right.\n", "\n", "These algebraic equations can be arranged as a matrix equation $G\\, \\boldsymbol{m} = \\boldsymbol{d}$ of the form\n", "\n", "$$\n", "\\begin{pmatrix}\n", "h & h & h & h & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\\\\n", "0 & 0 & 0 & 0 & h & h & h & h & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\\\\n", "\\vdots & \\vdots & \\vdots & \\vdots & \\vdots & \\vdots & \\vdots & \\vdots & \\vdots & \\vdots & \\vdots & \\vdots & \\vdots & \\vdots & \\vdots & \\vdots \\\\\n", "0 & 0 & 0 & h & 0 & 0 & 0 & h & 0 & 0 & 0 & h & 0 & 0 & 0 &h \\\\\n", "\\end{pmatrix}\n", "\\begin{pmatrix}\n", "s_1\\\\\n", "s_2\\\\\n", "\\vdots \\\\\n", "s_{16}\n", "\\end{pmatrix}\n", "=\n", "\\begin{pmatrix}\n", "T_1\\\\\n", "T_2\\\\\n", "\\vdots \\\\\n", "T_8\n", "\\end{pmatrix}.\n", "$$\n", "\n", "This tomography problem is linear because the assumptions we have made were designed to make the travel times depend linearly upon the model parameters. In addition, slowness rather than velocity is used for the model parameters, and slowness has a linear relationship to travel time while velocity does not. \n", "\n", "X-ray tomography used in medical imaging, and radar tomography used in atmospheric physics, typically also assume ray paths that are model independent and employ model parameterisations that are related linearly to the data, and so are able to use linear inversion methods. In seismic tomography in the interior of the Earth, the independence of the ray path and the velocity model is generally a poor assumption - in reality ray paths change significantly when seismic models change, so that linear inversion is not normally adequate for seismic tomography. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**A $3\\times 3$ case**\n", "\n", "In a homework exercise you are asked to consider a $3\\times 3$ case where our equations turn out to be of the form\n", "\n", "$$\\begin{align*}\n", "T_1 &= x_1 + x_2 + x_3\\\\\n", "T_2 &= x_4 + x_5 + x_6\\\\\n", "& \\vdots \\\\\n", "T_6 &= x_3 + x_6 + x_9\n", "\\end{align*}\n", "$$\n", "\n", "and hence our matrix system is\n", "\n", "$$\n", "\\begin{pmatrix}\n", "1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\\\\n", "0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 \\\\\n", "\\vdots & \\vdots & \\vdots & \\vdots & \\vdots & \\vdots & \\vdots & \\vdots & \\vdots \\\\\n", "0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 1\n", "\\end{pmatrix}\n", "\\begin{pmatrix}\n", "x_1\\\\\n", "x_2\\\\\n", "\\vdots \\\\\n", "x_9\n", "\\end{pmatrix}\n", "=\n", "\\begin{pmatrix}\n", "T_1\\\\\n", "T_2\\\\\n", "\\vdots \\\\\n", "T_6\n", "\\end{pmatrix}\n", "$$\n", "\n", "This problem is taken up below and in the homework exercises." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.0\n", "-7.6438855245442e-14\n" ] } ], "source": [ "# just confirm that G.T@G and G@G.T are singluar for the 3x3 tomographyt example:\n", "A = np.array([\n", " [1,1,1,0,0,0,0,0,0],\n", " [0,0,0,1,1,1,0,0,0],\n", " [0,0,0,0,0,0,1,1,1],\n", " [1,0,0,1,0,0,1,0,0],\n", " [0,1,0,0,1,0,0,1,0],\n", " [0,0,1,0,0,1,0,0,1]])\n", "\n", "print(sl.det(A.T@A))\n", "print(sl.det(A@A.T))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Equi-determined problems - square linear systems\n", "\n", "For a linear inversion problem that has the same number of data points as model parameters, the equation \n", "\n", "$$A\\, \\boldsymbol{x} = \\boldsymbol{b} \\quad\\text{or}\\quad\n", "G\\, \\boldsymbol{m} = \\boldsymbol{d}$$\n", "\n", "contains a square matrix $A$ or $G$, and the equation represents a set of linear simultaneous equations that has the same number of equations as unknowns. \n", "\n", "\n", "\n", "Recall: If $G$ is not singular, then equivalently these linear equations will all be independent, and this system will have only one solution; that solution will provide an exact fit to the data. This solution is given by \n", "\n", "$$\\boldsymbol{x} = A^{-1}\\,\\boldsymbol{b}$$\n", "\n", "
\n", "\n", "**CAUTION:** The statement that the solution is exact is a mathematical not a physical statement. That is, if the data contains uncertainties, which they always will, then those uncertainties will influence the solution. A model that provides an exact fit to an observed noisy dataset may not provide a good estimate of the true model even though the estimated model exactly predicts these observations. \n", "\n", "That is, the issue of small perturbations in the data potentially leading to large errors in the model - this is a concept termed ***ill-conditioning***. We don't have time to cover this, but for the case of a linear system it results when the [***condition number***](https://en.wikipedia.org/wiki/Condition_number) of the matrix is large.\n", "\n", "
\n", "\n", "If the matrix $A$ is singular, then the inverse of $A$ does not exist, and we cannot use the above equation to find a solution. \n", "\n", "In these circumstances, there are two possibilities: either the equations are not independent in which case there will be infinitely many solutions that exactly fit the data, or the equations are incompatible in which case there will be no solution at all that exactly fits the data. \n", "\n", "\n", "
\n", "\n", "Recall that we saw this case in previous weeks where we had a $2\\times 2$ and the two singular cases could be interpreted as two lines that are on top of one another, or as two lines that never cross.\n", "\n", "In the former case we could find the minimum norm solution out of the infinitely many solutions.\n", "\n", "In the latter case there are no exact solutions, but there are infinitely many solutions that are equally valid in terms of a least squares approach, we could select from these the one that is also the minimum norm solution - this perhaps gives a hint for how we will generally deal with mixed-determined problems! See a homework exercise.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Over-determined problems\n", "\n", "In over-determined problems, there are more independent equations than unknowns. That is, in the system \n", "\n", "$$A\\, \\boldsymbol{x} = \\boldsymbol{b} \\quad\\text{or}\\quad\n", "G\\, \\boldsymbol{m} = \\boldsymbol{d}$$\n", "\n", "$m>n$ where $A$ or $G$ is an $m\\times n$ matrix.\n", "\n", "\n", "\n", "
\n", "\n", "The inverse of $A$ is not defined, and in general there is no solution $\\boldsymbol{x}$ that will exactly satisfy this relation \n", "\n", "BUT we can find an $\\boldsymbol{x}$ if the data $\\boldsymbol{b}$ lies in the range of $A$ but this won't generally be the case. \n", "\n", "
\n", "\n", "A useful solution can still be found in the general case, and we've already seen how:\n", "\n", "Instead of solving our original problem, we instead solve the related equation\n", "\n", "$$A^T \\,A\\, \\boldsymbol{m} = A^T \\,\\boldsymbol{b}$$\n", "\n", "This relation if called the **normal equation**.\n", "\n", "\n", "The $n\\times n$ matrix $A^T \\,A$ is now square and symmetric, and provided that it is not singular, the solution to the normal equation will be \n", " \n", "$$\\boldsymbol{x} = (A^T \\,A)^{-1}A^T \\,\\boldsymbol{b}$$\n", "\n", "This approach generates the **least squares solution** to the problem.\n", "\n", "We've seen examples of this already in both the linear system case as well as in terms of linear regression (fitting a polynomial to data).\n", "\n", "Let's now establish where this expression for the least squares solution comes from:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The least squares solution - derivation [I'll skim over the math details]\n", "\n", "We are in the over-determined situation. Since (in general) there is no vector of model parameters that exactly fits the data, we will proceed by trying to find a model that minimises the difference between the model predictions and the observed data. For a good model, we would like this difference to be less than the errors in the data. \n", "\n", "There are many ways to compare a predicted and an observed dataset. The one that we will use here is to try to minimise the sum of the squares of the differences between the predicted and observed data. That is, we will minimise the square of the $L^2$-norm of the data residuals. \n", "\n", "[A homework exercise walks you through this example in the case of linear regression fitting data, and how the result responds to the presence of an outlier in the data in this and other choices of norm].\n", "\n", "Thus, if the predicted data are $\\boldsymbol{p}:=A\\boldsymbol{x}$ and the observed data are $\\boldsymbol{b}$, then we will try to minimise the function \n", "\n", "$$\n", "\\begin{align*}\n", "f &= \\frac{1}{2}(\\boldsymbol{p} - \\boldsymbol{b})^T(\\boldsymbol{p} - \\boldsymbol{b})\\\\\n", "&=\\frac{1}{2}(A\\boldsymbol{x} - \\boldsymbol{b})^T(A\\boldsymbol{x} - \\boldsymbol{b})\n", "\\end{align*}\n", "$$\n", "\n", "
\n", "\n", "[NB. The factor of a half has just been added here so that later results simplify.]\n", "\n", "
\n", "\n", "equivalently using norms\n", "\n", "$$\n", "\\begin{align*}\n", "f&=\\frac{1}{2}\\|\\boldsymbol{p} - \\boldsymbol{b}\\|_2^2\\\\\n", "&=\\frac{1}{2}\\|A\\boldsymbol{x} - \\boldsymbol{b}\\|_2^2\n", "\\end{align*}\n", "$$\n", "\n", "or in terms of scalar components:\n", "\n", "$$\n", "\\begin{align*}\n", "f&=\\frac{1}{2}\\sum_{j=1}^m (p_j - b_j)^2\\\\\n", "&=\\frac{1}{2}\\sum_{j=1}^m \\left( \\sum_{k=1}^n A_{jk}x_k - b_j \\right)^2\n", "\\end{align*}\n", "$$\n", "\n", "
\n", "\n", "Here, $f$ is just a non-negative number that will vary as the model varies. The function $f$ is usually called the ***objective function***. It is also often called the ***misfit function*** for obvious reasons. In an optimisation based approach to solve inversion problems, we often use iterative approaches to minimise the objective/misfit function.\n", "\n", "\n", "\n", "\n", "As we've seen several times in this module already, to minimise $f$, we need to differentiate $f$ with respect to each of the solution vector components $x_k$, set these differentials to zero, and solve the resulting set of equations for these model parameters (which is done in a homework exercise). \n", "\n", "Now we have $n$ model parameters and $n$ equations (one for each differential), so that we have the same number of equations as unknowns. Now we can get a unique solution - at least we can provided that these $n$ equations are independent and consistent. \n", "\n", "The solution that this will generate will not exactly explain the data - there is no solution that can do that for arbitrary data. It is instead the least-squares solution, a solution that gives the best fit to the observed data in a least-squares sense. It minimises the sum of the squares of the residual data (the difference between observed and predicted data). It minimises the Euclidean distance between the observed and predicted data vectors. \n", "\n", "Some theory does exist to establish that the least squares solution is optimal assuming certain assumptions on the errors in the data apply - see ." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To progress we can make use of a vector calculus result:\n", "\n", "$$\\frac{\\partial}{\\partial \\boldsymbol{x}} \\left(\\boldsymbol{a}^T\\boldsymbol{b}\\right) \n", "=\\left(\\frac{\\partial \\boldsymbol{a}}{\\partial \\boldsymbol{x}}\\right)^T\\boldsymbol{b} +\n", "\\left(\\frac{\\partial \\boldsymbol{b}}{\\partial \\boldsymbol{x}}\\right)^T\\boldsymbol{a}$$\n", "\n", "You're asked to think about this in a homework example." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Consider now a function of the form \n", "\n", "$$f = \\frac{1}{2}\\|\\boldsymbol{r}\\|_2^2 = \n", "\\frac{1}{2}\\boldsymbol{r}^T \\boldsymbol{r}\n", "$$\n", "\n", "where both $\\boldsymbol{r}$ and $f$ are functions of the solution $\\boldsymbol{x}$, and $\\boldsymbol{r}$ is some measure of the mismatch between the data that is observed and the data that would be generated by using the solution $\\boldsymbol{x}$. Now we will see why the factor of a half is added to simplify the final result. \n", "\n", "The classic least-squares problem then involves discovering the model parameters that minimises the value of the objective function $f$, and this typically involves differentiating $f$ with respect to $\\boldsymbol{x}$. 
\n", "\n", "Using the result above for the differential of an inner product, we obtain \n", "\n", "$$\\frac{\\partial f}{\\partial \\boldsymbol{x}}\n", "= \\frac{\\partial }{\\partial \\boldsymbol{x}}\\left(\\frac{1}{2}\\boldsymbol{r}^T \\boldsymbol{r}\\right)\n", "= \\frac{1}{2}\\left(\\left(\\frac{\\partial \\boldsymbol{r}}{\\partial \\boldsymbol{x}}\\right)^T\\boldsymbol{r} +\n", "\\left(\\frac{\\partial \\boldsymbol{r}}{\\partial \\boldsymbol{x}}\\right)^T\\boldsymbol{r}\\right)\n", "= \\left(\\frac{\\partial \\boldsymbol{r}}{\\partial \\boldsymbol{x}}\\right)^T\\boldsymbol{r}\n", "$$\n", "\n", "Let's now apply this to our situation:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recall we had\n", "\n", "$$\n", "\\begin{align*}\n", "f &= \\frac{1}{2}(\\boldsymbol{p} - \\boldsymbol{b})^T(\\boldsymbol{p} - \\boldsymbol{d})\\\\\n", "&=\\frac{1}{2}(A\\boldsymbol{x} - \\boldsymbol{b})^T(A\\boldsymbol{x} - \\boldsymbol{d})\n", "\\end{align*}\n", "$$\n", "\n", "Lets differentiate with respect to $\\boldsymbol{x}$ and use our vector calculus result from just above\n", "\n", "$$ \\frac{\\partial f}{\\partial \\boldsymbol{x}} = \\frac{\\partial (A\\boldsymbol{x} - \\boldsymbol{b})^T}{\\partial \\boldsymbol{x}} (A\\boldsymbol{x} - \\boldsymbol{b})$$\n", "\n", "which, since $\\boldsymbol{b}$ and $G$ do not depend on $\\boldsymbol{x}$, gives \n", "\n", "$$ \\frac{\\partial f}{\\partial \\boldsymbol{x}} = G^T(G\\boldsymbol{x} - \\boldsymbol{b})$$\n", "\n", "For $f$ to be a minimum, this must be equal to zero, so we are looking for the $\\boldsymbol{x}$ where\n", "\n", "$$A^T A \\boldsymbol{x} - A^T \\boldsymbol{b}=0 \\qquad \\iff \\qquad\n", "A^T A \\boldsymbol{x} = A^T \\boldsymbol{b}\n", "$$\n", "\n", "or \n", "\n", "$$ \\boldsymbol{x} = (A^T A)^{-1} A^T \\boldsymbol{b} $$\n", "\n", "provided that $(A^T A)^{-1}$ exists.\n", "\n", "Note that this is the normal equation we defined above!\n", "\n", "Here $A$ is a rectangular $m\\times n$ matrix where for our \"tall\" over-determined case $m>n$. However, $A^TA$ is a square matrix of size $n\\times n$. Its inverse will exist provided that $A$ is of full column rank, that is provided that the problem is purely over-determined." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The left inverse (or the least-squares inverse)\n", "\n", "Note that the expression $(A^T A)^{-1} A^T$ has similarities to an inverse of the rectangular matrix $A$ since\n", "\n", "$$\\left( (A^T A)^{-1} A^T \\right)A = I$$\n", "\n", "However it is not a true inverse since\n", "\n", "$$A\\left( (A^T A)^{-1} A^T \\right) \\ne I$$\n", "\n", "It is often called the [**left inverse**](https://en.wikipedia.org/wiki/Inverse_element#Matrices) (or the least-squares inverse) of $A$.\n", "\n", "In many problems $A$ may not be full rank, so that $A^TA$ is singular. Such problems are normally mixed-determined. We can solve such problems using the method of *damped least squares* (see below)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Under-determined problems\n", "\n", "In under-determined problems, there are fewer equations than unknowns, and so $m$ is less than $n$. \n", "\n", "\n", "\n", "Now the equations do not uniquely define a solution. [See the L3 homework exercise on matrix rank and RREF where we consider a \"fat\" example.]\n", "\n", "However, in this case too there is still a useful solution to be found. 
\n", "\n", "When $m\n", "\n", "The latter condition makes the model parameter vector as \"short\" as possible (thinking of $\\boldsymbol{x}$ as a vector) given that it must also match the data. This is a minimum model parameter vector that has nothing within it that can be left out without degrading the fit to the data. However, we can choose to add to it any linear combination from the null space and it will still explain the data exactly. \n", "\n", "In many circumstances, it is appropriate to parametrise the model so that the parameters that we are attempting to obtain are not defined in an absolute sense, but are defined as changes to some *a priori* model $\\boldsymbol{x}_0$. \n", "\n", "The *a priori* model is our best guess of what the answer should be in the absence of the data. If we parametrise in this way, then a solution using the approach above will ensure that we find the model that best fits the data and that is also as close to the *a priori* model as possible. \n", "\n", "In this case, the problem is often set up to solve for $\\delta \\boldsymbol{x}:=\\boldsymbol{x} - \\boldsymbol{x}_0$, and takes the form\n", "\n", "$$\\boldsymbol{x} - \\boldsymbol{x}_0 = A^T(AA^T)^{-1}(\\boldsymbol{b} - A \\boldsymbol{x}_0)$$\n", "\n", "or\n", "\n", "$$\\delta \\boldsymbol{x} = A^T(AA^T)^{-1}\\delta \\boldsymbol{b}$$\n", "\n", "where\n", "\n", "$$\\delta \\boldsymbol{b}=\\boldsymbol{b} - A \\boldsymbol{x}_0$$\n", "\n", "is the difference between the observed data and that predicted using the starting model $\\boldsymbol{x}_0$.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Derivation of the minimum norm solution [again, I'll skim over the math]\n", "\n", "Mathematically we can state that we want to find the solution $\\boldsymbol{m}$ that satisfies our problem $A\\boldsymbol{x} = \\boldsymbol{b}$, subject to that solution being as small as possible, i.e. such that $\\boldsymbol{x}^T\\boldsymbol{x}$ is minimsed.\n", "\n", "As a constrained optimisation problem we would write\n", "\n", "\\begin{align}\n", "\\textrm{minimise} \\quad & \\boldsymbol{x}^T\\boldsymbol{x}\\\\\n", "\\textrm{subject to} \\quad & A\\boldsymbol{x} = \\boldsymbol{b}\n", "\\end{align}\n", "\n", "A standard approach to solve problems such as this is via [*Lagrange multipliers*](https://en.wikipedia.org/wiki/Lagrange_multiplier):\n", "\n", "define the so-called *Lagrangian function*\n", "\n", "$$\n", "\\mathcal{L}(\\boldsymbol{x}, \\boldsymbol{\\lambda}) := \\boldsymbol{x}^T\\boldsymbol{x} - \\boldsymbol{\\lambda}^T (A\\boldsymbol{x} - \\boldsymbol{b})\n", "$$\n", "\n", "where $\\boldsymbol{\\lambda}$ is the Lagrange multiplier that is introduced to enforce the constraint - here the constraint is vector valued and so $\\boldsymbol{\\lambda}$ is a vector of Lagrange multipliers. \n", "\n", "We now look for the stationary points of $\\mathcal{L}$, i.e. those for which the derivative w.r.t. $\\boldsymbol{x}$ (here a gradient vector) is zero, and where the derivative w.r.t. 
$\\boldsymbol{\\lambda}$ is zero (again a gradient vector; the latter being satsified just means that each of the components of our vector constraint are satisfied of course, and so the stationary points of $\\mathcal{L}$ are always points in $(\\boldsymbol{x}, \\boldsymbol{\\lambda})$ space where our equation is exactly satsified.\n", "\n", "In this case this means that we want\n", "\n", "$$\\boldsymbol{0}=\\nabla_{\\boldsymbol{x}}\\mathcal{L} = 2\\boldsymbol{x} - A^T\\boldsymbol{\\lambda}$$\n", "\n", "\n", "[demonstrating this result is a homework exercise]\n", "\n", "and\n", "\n", "$$\\boldsymbol{0}=\\nabla_{\\boldsymbol{\\lambda}}\\mathcal{L} = A\\boldsymbol{x} - \\boldsymbol{b}$$\n", "\n", "Substituting the first into the second we have\n", "\n", "\n", "\\begin{align}\n", "\\boldsymbol{0} &= A\\boldsymbol{x} - \\boldsymbol{b} = A\\left(\\frac{1}{2}A^T\\boldsymbol{\\lambda}\\right) - \\boldsymbol{b}\\\\\n", "\\Rightarrow\n", "\\boldsymbol{\\lambda} &= 2 (A A^T)^{-1}\\boldsymbol{b}\n", "\\end{align}\n", "\n", "and substituting back into the first:\n", "\n", "$$\\boldsymbol{x} = \\frac{1}{2} A^T\\boldsymbol{\\lambda} = A^T (A A^T)^{-1}\\boldsymbol{b}. $$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Mixed-determined problems\n", "\n", "A simple example that demonstrates a mixed-determined problem that cannot be solved using \n", "the methods above is that of estimating the spatial distribution of slowness (the reciprocal of \n", "velocity) from measurements of travel times along several straight ray paths through a solid \n", "body. \n", "\n", "We saw a $4\\times 4$ version of this problem in the cell on tomography above.\n", "\n", "In a homework exercise you are asked to consider a $3\\times 3$ case where our equations turn out to be of the form\n", "\n", "$$\\begin{align*}\n", "T_1 &= x_1 + x_2 + x_3\\\\\n", "T_2 &= x_4 + x_5 + x_6\\\\\n", "& \\vdots \\\\\n", "T_6 &= x_3 + x_6 + x_9\n", "\\end{align*}\n", "$$\n", "\n", "and hence our matrix system is\n", "\n", "$$\n", "\\begin{pmatrix}\n", "1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\\\\n", "0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 \\\\\n", "\\vdots & \\vdots & \\vdots & \\vdots & \\vdots & \\vdots & \\vdots & \\vdots & \\vdots \\\\\n", "0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 1\n", "\\end{pmatrix}\n", "\\begin{pmatrix}\n", "x_1\\\\\n", "x_2\\\\\n", "\\vdots \\\\\n", "x_9\n", "\\end{pmatrix}\n", "=\n", "\\begin{pmatrix}\n", "T_1\\\\\n", "T_2\\\\\n", "\\vdots \\\\\n", "T_6\n", "\\end{pmatrix}\n", "$$\n", "\n", "\n", "A homework exercise goes through this example in more detail and discussed the corresponding null space." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Imperfect data\n", "\n", "If all the measurements were perfect and returned the value 6, then clearly the solution with all values 2 is a possible model parameter solution.\n", "\n", "However, consider the case where the measurements are noisy, let us assume that the following measurements were made for the travel times \n", "\n", "$$ \\boldsymbol{T} = (6.07, 6.07, 5.77, 5.93, 5.93, 6.03)^T $$\n", "\n", "Now, even though there are fewer data than model parameters, there is no model that fits the \n", "data exactly. We can see this because it should be the case from $G$ that\n", "\n", "$$ T_1 + T_2 + T_3 = T_4 + T_5 + T_6$$ \n", "\n", "whereas for this data\n", "\n", "$$ T_1 + T_2 + T_3 = 17.91, \\qquad T_4 + T_5 + T_6 = 17.89 $$\n", "\n", "so that, with these data, there can be no solution. \n", "\n", "See homework for more on this problem.\n", "\n", "\n", "

\n", "\n", "**How to proceed?**\n", "\n", "In this problem, $A$ is not square so that $A^{-1}$ does not exist, and both $A^TA$ and $AA^T$ are singular matrices, so that none of the methods that we have used so far will work. \n", "\n", "How then can we proceed? \n", "\n", "There are two principal options: \n", "\n", "
\n", "\n", "1. we can use the generalised inverse $A^+$, also known as the pseudo-inverse or the [Moore-Penrose inverse](https://en.wikipedia.org/wiki/Moore%E2%80%93Penrose_inverse), or\n", "\n", "\n", "2. we can use some form of regularisation to the model of which damped least-squares is the most straightforward. \n", "\n", "
\n", "\n", "Some general guidance over the methods to be presented below:\n", "\n", "The generalised inverse is preferable in small problems, especially when we would like to analyse the quality of the results carefully, while regularised least-squares and related methods are preferable for large problems when the generalised inverse is prohibitively expensive, or when linearised inversion is being used in order to solve a non-linear problem by iteration. \n", "\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.0\n", "-7.6438855245442e-14\n" ] } ], "source": [ "# just confirm that A.T@A and A@A.T are singluar:\n", "A = np.array([\n", " [1,1,1,0,0,0,0,0,0],\n", " [0,0,0,1,1,1,0,0,0],\n", " [0,0,0,0,0,0,1,1,1],\n", " [1,0,0,1,0,0,1,0,0],\n", " [0,1,0,0,1,0,0,1,0],\n", " [0,0,1,0,0,1,0,0,1]])\n", "\n", "print(sl.det(A.T@A))\n", "print(sl.det(A@A.T))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# The generalised/pseudo/Moore-Penrose inverse\n", "\n", "Let's suppose we are in a situation where we cannot use either \n", "\n", "the ***least squares solution*** \n", "\n", "$$\\boldsymbol{x} = (A^TA)^{-1}A^T\\boldsymbol{b}$$\n", "\n", "or the ***minimum-norm solution***\n", "\n", "$$\\boldsymbol{x} = A^T(AA^T)^{-1}\\boldsymbol{b}.$$\n", "\n", "
\n", "\n", "Is there anything we can do analytically?\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Eigenvalues and the eigen-decomposition\n", "\n", "We don't have time to go into the details, but recall that for square matrices if we can find a vector (direction) $\\boldsymbol{v}$ such that\n", "\n", "$$ A \\boldsymbol{v} = \\lambda \\boldsymbol{v} $$\n", "\n", "for some non-zero scalar $\\lambda$, i.e. there is a special direction such that pre-multiplication by $A$ simply scales the vector but does not change its direction, then this $\\boldsymbol{v}$ is called an *eigenvector* and $\\lambda$ is the corresponding eigenvalue.\n", "\n", "Eigen-values/vectors are incredibly useful, but of course by their definition (consider the words as well as the maths above) are only defined for square matrices.\n", "\n", "One way they are useful is that they allow us to find a diagonalisation of a matrix:\n", "\n", "$$A = P\\Lambda P^{-1}$$\n", "\n", "NB. $P$ is a matrix made up of the eigenvectors as colums, and $\\Lambda$ is a diagonal matrix with the eigenvalues as the diagonal entries. This is why this diapgonalisation is also known as an ***eigen-decomposition***. \n", "\n", "If we have this decomposition or diagonalisation then it's trivial to find the inverse matrix, and also to better understand how $A$ maps points or vectors from one space into another.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Singular values and the singular value decomposition\n", "\n", "So what do we do if we want to do something similar for non-square matrices, i.e. to easily find an \"inverse\", to better understand the matrix as a mapping?\n", "\n", "The answer is ***singular values*** which allow us to write **any** matrix as a [***singular value decomposition***](https://en.wikipedia.org/wiki/Singular_value_decomposition):\n", "\n", "\n", "
\n", "\n", "Consider an arbitrary $m \\times n$ real matrix $A$ - very similar to the square case it can be decomposed into a product of three matrices:\n", "\n", "$$A = U\\Sigma V^{T}$$\n", "\n", "where\n", "\n", "\n", "- $U$ is an $m\\times m$ [orthogonal matrix](https://en.wikipedia.org/wiki/Orthogonal_matrix) whose columns are the eigenvectors of the matrix $AA^T$,\n", "\n", "\n", "- $V$ is an $n\\times n$ orthogonal matrix whose columns are the eigenvectors of the matrix $A^TA$,\n", "\n", "\n", "- $\\Sigma$ is an $m\\times n$ diagonal matrix whose diagonal entries, $\\sigma_1, \\sigma_2,\\ldots $ are the ***singular values*** of $A$\n", "\n", "\n", "
\n", "\n", "The singular values of $A$ are positive, and the convention is to number them/place down the diagonal of $\\Sigma$ in order of their magnitude: $\\sigma_1\\geq\\sigma_2\\geq\\ldots\\geq 0$.\n", "\n", "
\n", "\n", "The singular values are the square roots of the eigenvalues of the square matrix $A^TA$. They are also the square roots of the eigenvalues of the square matrix $AA^T$. \n", "\n", "But hang on ..... if $A$ is non-square then $A^TA$ and $AA^T$ are of different size and thus have different numbers of eigenvalues - how to reconcile this apparent contradiction? \n", "Well it turns out that the extra eigenvalues are always zero, i.e. the maximum number of non-zero singular values is the smaller of $m$ and $n$.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Solving linear inversion problems using the generalised/pseudo/Moore-Penrose inverse\n", "\n", "Once we have the singular value decomposition it is trivial to construct\n", "the ***generalised (or pseudo, or Moore-Penrose) inverse*** of $A$, written $A^+$.\n", "\n", "\n", "The generalised inverse has the following properties all of which make it behave in ways that are similar to a true inverse: \n", "\n", "$$\\begin{align*}\n", "AA^+A &= A\\\\\n", "A^+AA^+ &= A^+\\\\\n", "(A^+A)^T &= A^+A\\\\\n", "(AA^+)^T &= AA^+\\\\\n", "\\end{align*}\n", "$$\n", "\n", "\n", "
\n", "\n", "Note that the generalised inverse defined exists for any matrix other than the zero matrix, but when the matrix is of full rank $\\text{rank}(A)=\\min(m,n)$) then it can be expressed via simple algebraic formulae that we've seen before!\n", "\n", "\n", "
\n", "\n", "\n", "$A^+$ always exists for any non-null (non-zero) matrix $A$ so that the generalised inverse solution $\\boldsymbol{x}^+$ to the equation $A\\boldsymbol{x} = \\boldsymbol{b}$ also always exists and is given by \n", "\n", "\n", "$$\\boldsymbol{x}^+ = A^+\\boldsymbol{b}$$\n", "\n", "\n", "It has the properties that\n", "\n", "\n", "- If there is any model parameter vector that can provide an exact fit to the data $\\boldsymbol{b}$, then $\\boldsymbol{x}^+$ will be such a vector. \n", "\n", "\n", "\n", "- If there is no model parameter vector that can provide an exact fit to the data $\\boldsymbol{b}$, then $\\boldsymbol{x}^+$ will provide a best least-squares fit to the data. \n", "\n", "\n", " \n", "- If there is more than one exact or best-fitting parameter vector to the data $\\boldsymbol{b}$, then $\\boldsymbol{x}^+$ will be the exact or best-fitting model that also has the smallest norm. \n", "\n", "\n", "
\n", "\n", "\n", "Thus, we can apply the generalised inverse to provide a solution to ***any*** linear inverse, and we will get a result that has various sensible and desirable properties, cf. the descriptions and justifications of these properties given in previous lectures. \n", "\n", "
\n", "\n", "However, this approach has a number of issues, in particular it is expensive and it is also not robust to there being noise or errors in our data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Regularisation\n", "\n", "\n", "If we have a mixed-determined problem for which the generalised inverse is not affordable, or for which small singular values and small eigenvalues lead to a noisy solution/model, then we will need a different approach. \n", "\n", "In the case of a square problem, the matrix may not be singular, but it may still have a very small determinant and at least one very small eigenvalue. In such a case, although the inverse may exist, the problem is unstable - large spurious values will likely appear in the solution. The solution will depend upon the values of the smallest eigenvalues, which themselves will likely be almost entirely related to the noise in the system. Effectively, in any inversion problem, we face the issue of potentially dividing by small numbers that are noise dominated. This is related to the brief mention of ill-conditioning from earlier as well as in a previous lecture.\n", "\n", "There are several related methods for dealing with this situation. Here we will explore damped least-squares and from there introduce general schemes for stabilisation and regularisation of the general inverse problem. \n", "\n", "\n", "## Damped least squares\n", "\n", "In damped least-squares, instead of minimising the objective function of the form\n", "\n", "$$\n", "f=\\|\\boldsymbol{p} - \\boldsymbol{b}\\|_2^2\n", "= \\|A\\boldsymbol{x} - \\boldsymbol{b}\\|_2^2\n", "$$\n", "\n", "\n", "instead we minimise the objective function\n", "\n", "$$\n", "f = \\|A\\boldsymbol{x} - \\boldsymbol{b}\\|_2^2 + \\mu \\|\\boldsymbol{x}\\|_2^2\n", "$$\n", "\n", "where $\\mu$ is a positive scalar known in various contexts as the damping factor, trade-off, regularisation or pre-whitening parameter. \n", "\n", "
\n", "\n", "Now when we invert, if you follow through the steps taken in L4 to derive least squares, the equation that we must solve takes the form \n", "\n", "$$ \\boldsymbol{x} = (A^TA + \\mu{I})^{-1} A^T \\boldsymbol{b}$$\n", "\n", "For a positive $\\mu$, note that the matrix $(A^TA + \\mu{I})$ is always non-singular and thus we are able to use this expression - it can be used to solve equi-, under-, over- and mixed-determined systems even if $A^TA$ and $AA^T$ are singular.\n", "\n", "
\n", "\n", "By damping the least squares system, we have protected the solution against zero or small eigenvalues by adding a constant value, controlled by $\\mu$, to the diagonal of $A^TA$. If $\\mu$ is very small, then we will have something very close to regular least squares (cf. the under-determined case above); if $\\mu$ is very large, then we will simply minimise the norm of the model parameters (the second term dominating the first). Choosing the value of $\\mu$ therefore involves a trade-off between fitting the data and stabilising the result. \n", "\n", "We saw earlier that the minimal-norm inverse involved first finding an exact or best least-squares fit to the data, then, subject to that constraint, finding the solution that minimised the $L^2$-norm of the model/solution vector (cf. an earlier homework exercise - \"minimal-norm solution to under-determined problem\").\n", " \n", "Damped least-squares does something similar, except here the parameter $\\mu$ is adjusted to control the degree to which the solution fits the data (which is favoured by decreasing the value of $\\mu$), and the degree to which it minimises the norm of the model (which is favoured by increasing the value of $\\mu$). Recall that in practice a problem may well be formulated such that it is the perturbation to an *a priori* model rather than the model itself that is minimised. \n", "\n", "The key idea here is to set the value of $\\mu$ so that it removes the influence of small eigenvalues in $A^TA$ from the problem, but leaves the larger eigenvalues alone. The larger eigenvalues then control the fit to the data (the smaller eigenvalues would contribute almost nothing to this fit anyway because they are small), and the resulting under-determined model is then fully determined by minimising its norm. \n", "\n", "Including only the largest eigenvalues in the solution will effectively reduce the resolution of the recovered model parameters, whereas including the smaller eigenvalues will increase the noise within the recovered model parameters. We are therefore trading resolution against noise in the model, and the value of $\\mu$ controls that trade-off. Yet another way to regard $\\mu$ is that we should choose it so that the fit to the data is only as good as is required by the level of noise in the data, but no better. If we insist on lowering $\\mu$ beyond this, then we will be [***over-fitting***](https://en.wikipedia.org/wiki/Overfitting) the data - the increased apparent match to the observed data will then merely serve to map noise from the data into the model without additional benefit to the model. \n", "\n", "In practice, it can be difficult to choose $\\mu$ to achieve these outcomes. One way to do this, at least in principle, is to invert the data using a range of values for $\\mu$, and then plot the model norm against the data-residual norm as in \n", "\n", "\n", "\n", "\n", "If we are lucky, there may be a corner or elbow in this plot that indicates a clear break and an optimal value for $\\mu$; in many problems though no obvious corner appears. If we know the data errors, then we can compute the minimum useful value of the data-residual norm, and select a value of $\\mu$ that is consistent with this." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## More general regularisation\n", "\n", "Use of damped least-squares assumes implicitly that we know something about the model parameters that is independent of the data - that is we know it *a priori*, before we undertake the experiment or observations that generate the data. \n", "In this case, our *a priori* knowledge is simply that the model should have a smal norm, ideally zero. \n", "This assumption is most appropriate when, in the absence of data, we would assume that the model was zero - typically this will be the case when the model for which we are inverting is the perturbation to some best-guess starting model.\n", "\n", "However, assuming a minimum-norm is not the only a priori assumption we can make. \n", "We might instead assume for example that the norm of the first or second derivatives of the model parameters should be small. \n", "The first of these is the assumption that the model is \"flat\" - all the model parameters are the same - and the second is the assumption that the model is \"smooth\" - the variation between adjacent parameters is everywhere the same. \n", "\n", "These assumptions, and the assumption of minimal model norm as made in the damped least-squares approach, are all forms of [***model regularisation***](https://en.wikipedia.org/wiki/Regularization_(mathematics)). \n", "In this case, they are a type of regularisation called [***Tikhonov regularisation***](https://en.wikipedia.org/wiki/Tikhonov_regularization). \n", "\n", "Minimising the model norm is sometimes called zero-order Tikhonov regularisation, and minimising the norms of the first and second differentials of the model are called first and second-order Tikhonov regularisation. \n", "\n", "There are many other forms of regularisation possible, but these three illustrate the general idea. \n", "\n", "Regularisation in general refers to the process of introducing additional a priori information about the properties of the model in order to solve ill-posed problems without over-fitting the data and without introducing noise, instability and spurious structure into the resultant model. \n", "\n", "
\n", "\n", "Many regularisation schemes can be implemented by minimising an objective function of the form \n", "\n", "$$\n", "f = \\|A\\boldsymbol{x} - \\boldsymbol{b}\\|_2^2 + \\mu \\|L\\boldsymbol{x}\\|_2^2,\n", "$$\n", "\n", "where $L$ is a matrix that acts on $\\boldsymbol{x}$ to generate the property that we seek to minimise. \n", "For damped least-squares, $L$ is simply the identity matrix $I$. \n", "For first-order Tikhonov regularisation $L$ needs to operate of $\\boldsymbol{x}$ to generate a first order approximate derivative, e.g. using an upwind finite difference. \n", "In second-order Tikhonov regularisation, $L$ needs to act to generate an approximation to the second derivative of the model, e.g. a second-order central difference. \n", "\n", "Of course in 1D it's easy to write down these operators as matrices. \n", "In higher dimensions things are more complicated, and depend on the ordering assumed for $\\boldsymbol{x}$.\n", "\n", "Note that through the additional of extra terms it's possible to incorporate multiple regularisation terms." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Nonlinear problems\n", "\n", "So far in the theory part of this module we have considered solution methods appropriate for small linear problems.\n", "\n", "Of course for real world applications we need to be prepared to deal with very large and very nonlinear problems. We potentially also need to incorporate various additional types of constraints on our solutions.\n", "\n", "In the applied lectures of this module you will see some examples in more complex settings, including seismic inversion. \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Summary\n", "\n", "\n", "- We've formally introduces the concepts of equi-, under-, over-, and mixed-determined problems and considered some examples\n", "\n", "\n", "- We've derived the least squares and the minimum norm solution\n", "\n", "\n", "- and established that these expressions for the solutions can be used in cases where the matrix $A$ is full rank\n", "\n", "\n", "- We have considered approaches that can be used when $A$ is not full rank, including when it is mixed determined, i.e. the pseudo-inverse available via the singular value decomposition and damped least squares." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.13" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": true, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": true }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": true, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 1 }