{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Linear Algebra and Linear Regression\n", "\n", "### 2022-01-27" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Abstract**: In this session we combine the objective function\n", "perspective and the probabilistic perspective on *linear regression*. We\n", "motivate the importance of *linear algebra* by showing how much faster\n", "we can complete a linear regression using linear algebra." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "::: {.cell .markdown}\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup\n", "\n", "\\[edit\\]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "plt.rcParams.update({'font.size': 22})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## notutils\n", "\n", "\\[edit\\]\n", "\n", "This small package is a helper package for various notebook utilities\n", "used\n", "\n", "The software can be installed using" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install notutils" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "from the command prompt where you can access your python installation.\n", "\n", "The code is also available on GitHub:\n", "\n", "\n", "Once `notutils` is installed, it can be imported in the usual manner." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import notutils" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## mlai\n", "\n", "\\[edit\\]\n", "\n", "The `mlai` software is a suite of helper functions for teaching and\n", "demonstrating machine learning algorithms. It was first used in the\n", "Machine Learning and Adaptive Intelligence course in Sheffield in 2013.\n", "\n", "The software can be installed using" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install mlai" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "from the command prompt where you can access your python installation.\n", "\n", "The code is also available on GitHub: \n", "\n", "Once `mlai` is installed, it can be imported in the usual manner." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import mlai" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Regression Examples\n", "\n", "\\[edit\\]\n", "\n", "Regression involves predicting a real value, $y_i$, given an input\n", "vector, $\\mathbf{ x}_i$. For example, the Tecator data involves\n", "predicting the quality of meat given spectral measurements. Or in\n", "radiocarbon dating, the C14 calibration curve maps from radiocarbon age\n", "to age measured through a back-trace of tree rings. Regression has also\n", "been used to predict the quality of board game moves given expert rated\n", "training data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Olympic 100m Data\n", "\n", "\\[edit\\]\n", "\n", "\n", "\n", "` `{=html} \n", "\n", "
\n", "\n", "- Gold medal times for Olympic 100 m runners since 1896.\n", "- One of a number of Olypmic data sets collected by Rogers and\n", " Girolami (2011). `{=html} `\n", "\n", "\n", "\n", "Figure: Start of the 2012 London 100m race. *Image from Wikimedia\n", "Commons* \n", "\n", "
\n", "\n", "The first thing we will do is load a standard data set for regression\n", "modelling. The data consists of the pace of Olympic Gold Medal 100m\n", "winners for the Olympics from 1896 to present. First we load in the data\n", "and plot." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install pods" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pods" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = pods.datasets.olympic_100m_men()\n", "x = data['X']\n", "y = data['Y']\n", "\n", "offset = y.mean()\n", "scale = np.sqrt(y.var())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import teaching_plots as plot\n", "import mlai" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods\n", "from matplotlib import pyplot as plt\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "xlim = (1875,2030)\n", "ylim = (9, 12)\n", "yhat = (y-offset)/scale\n", "\n", "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n", "_ = ax.plot(x, y, 'r.',markersize=10)\n", "ax.set_xlabel('year', fontsize=20)\n", "ax.set_ylabel('pace min/km', fontsize=20)\n", "ax.set_xlim(xlim)\n", "ax.set_ylim(ylim)\n", "\n", "mlai.write_figure(filename='olympic-100m.svg', \n", " directory='./datasets')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Olympic 100m wining times since 1896." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Olympic Marathon Data\n", "\n", "\\[edit\\]\n", "\n", "\n", "\n", "\n", "\n", "\n", "
\n", "\n", "- Gold medal times for Olympic Marathon since 1896.\n", "- Marathons before 1924 didn’t have a standardised distance.\n", "- Present results using pace per km.\n", "- In 1904 Marathon was badly organised leading to very slow times.\n", "\n", "\n", "\n", "\n", "Image from Wikimedia Commons \n", "\n", "
\n", "\n", "The first thing we will do is load a standard data set for regression\n", "modelling. The data consists of the pace of Olympic Gold Medal Marathon\n", "winners for the Olympics from 1896 to present. First we load in the data\n", "and plot." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pods" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = pods.datasets.olympic_marathon_men()\n", "x = data['X']\n", "y = data['Y']\n", "\n", "offset = y.mean()\n", "scale = np.sqrt(y.var())\n", "yhat = (y - offset)/scale" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import mlai.plot as plot\n", "import mlai" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "xlim = (1875,2030)\n", "ylim = (2.5, 6.5)\n", "\n", "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n", "_ = ax.plot(x, y, 'r.',markersize=10)\n", "ax.set_xlabel('year', fontsize=20)\n", "ax.set_ylabel('pace min/km', fontsize=20)\n", "ax.set_xlim(xlim)\n", "ax.set_ylim(ylim)\n", "\n", "mlai.write_figure(filename='olympic-marathon.svg', \n", " directory='./datasets')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Olympic marathon pace times since 1896.\n", "\n", "Things to notice about the data include the outlier in 1904, in this\n", "year, the olympics was in St Louis, USA. Organizational problems and\n", "challenges with dust kicked up by the cars following the race meant that\n", "participants got lost, and only very few participants completed.\n", "\n", "More recent years see more consistently quick marathons." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# What is Machine Learning?\n", "\n", "\\[edit\\]\n", "\n", "What is machine learning? At its most basic level machine learning is a\n", "combination of\n", "\n", "$$\\text{data} + \\text{model} \\stackrel{\\text{compute}}{\\rightarrow} \\text{prediction}$$\n", "\n", "where *data* is our observations. They can be actively or passively\n", "acquired (meta-data). The *model* contains our assumptions, based on\n", "previous experience. That experience can be other data, it can come from\n", "transfer learning, or it can merely be our beliefs about the\n", "regularities of the universe. In humans our models include our inductive\n", "biases. The *prediction* is an action to be taken or a categorization or\n", "a quality score. The reason that machine learning has become a mainstay\n", "of artificial intelligence is the importance of predictions in\n", "artificial intelligence. The data and the model are combined through\n", "computation.\n", "\n", "In practice we normally perform machine learning using two functions. To\n", "combine data with a model we typically make use of:\n", "\n", "**a prediction function** a function which is used to make the\n", "predictions. It includes our beliefs about the regularities of the\n", "universe, our assumptions about how the world works, e.g. smoothness,\n", "spatial similarities, temporal similarities.\n", "\n", "**an objective function** a function which defines the cost of\n", "misprediction. 
Typically it includes knowledge about the world’s\n", "generating processes (probabilistic objectives) or the costs we pay for\n", "mispredictions (empirical risk minimization).\n", "\n", "The combination of data and model through the prediction function and\n", "the objective function leads to a *learning algorithm*. The class of\n", "prediction functions and objective functions we can make use of is\n", "restricted by the algorithms they lead to. If the prediction function or\n", "the objective function is too complex, then it can be difficult to find\n", "an appropriate learning algorithm. Much of the academic field of machine\n", "learning is the quest for new learning algorithms that allow us to bring\n", "different types of models and data together.\n", "\n", "A useful reference for the state of the art in machine learning is the UK\n", "Royal Society Report, [Machine Learning: Power and Promise of Computers\n", "that Learn by\n", "Example](https://royalsociety.org/~/media/policy/projects/machine-learning/publications/machine-learning-report.pdf).\n", "\n", "You can also check my blog post on [What is Machine\n", "Learning?](http://inverseprobability.com/2017/07/17/what-is-machine-learning)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Laplace’s Idea\n", "\n", "\\[edit\\]\n", "\n", "Laplace had the idea to augment the observations with noise, which is\n", "equivalent to considering a probability density whose mean is given by\n", "the *prediction function*,\n", "$$p\\left(y_i|x_i\\right)=\\frac{1}{\\sqrt{2\\pi\\sigma^2}}\\exp\\left(-\\frac{\\left(y_i-f\\left(x_i\\right)\\right)^{2}}{2\\sigma^2}\\right).$$\n", "\n", "This is known as a *stochastic process*: a function that is\n", "corrupted by noise. Laplace didn’t suggest the Gaussian density for that\n", "purpose; that was an innovation from Carl Friedrich Gauss, which is\n", "what gives the Gaussian density its name." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Height as a Function of Weight\n", "\n", "Start with the standard Gaussian, parameterized by its mean and variance.\n", "\n", "Make the mean a linear function of an *input*.\n", "\n", "This leads to a regression model, $$\n", "\\begin{align*}\n", "  y_i=&f\\left(x_i\\right)+\\epsilon_i,\\\\\n", "  \\epsilon_i \\sim & \\mathcal{N}\\left(0,\\sigma^2\\right).\n", "  \\end{align*}\n", "$$\n", "\n", "Assume $y_i$ is height and $x_i$ is weight." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Sum of Squares Error\n", "\n", "\\[edit\\]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Legendre\n", "\n", "Minimizing the sum of squares error was first proposed by\n", "[Legendre](http://en.wikipedia.org/wiki/Adrien-Marie_Legendre) in 1805\n", "(Legendre, 1805). His book, which was on the determination of the orbits of comets, is\n", "available on Google Books, and we can take a look at the relevant page by\n", "calling the code below."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import notutils as nu\n", "nu.display_google_book(id='spcAAAAAMAAJ', page='PA72')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Figure: Legendre’s book was on the determination of orbits of comets.\n", "This page describes the formulation of least squares\n", "\n", "Of course, the main text is in French, but the key part we are\n", "interested in can be roughly translated as\n", "\n", "> In most matters where we take measures data through observation, the\n", "> most accurate results they can offer, it is almost always leads to a\n", "> system of equations of the form $$E = a + bx + cy + fz + etc .$$ where\n", "> $a$, $b$, $c$, $f$ etc are the known coefficients and $x$, $y$, $z$\n", "> etc are unknown and must be determined by the condition that the value\n", "> of E is reduced, for each equation, to an amount or zero or very\n", "> small.\n", "\n", "He continues\n", "\n", "> Of all the principles that we can offer for this item, I think it is\n", "> not broader, more accurate, nor easier than the one we have used in\n", "> previous research application, and that is to make the minimum sum of\n", "> the squares of the errors. By this means, it is between the errors a\n", "> kind of balance that prevents extreme to prevail, is very specific to\n", "> make known the state of the closest to the truth system. The sum of\n", "> the squares of the errors\n", "> $E^2 + \\left.E^\\prime\\right.^2 + \\left.E^{\\prime\\prime}\\right.^2 + etc$\n", "> being if we wanted a minimum, by varying x alone, we will have the\n", "> equation …\n", "\n", "This is the earliest know printed version of the problem of least\n", "squares. The notation, however, is a little awkward for mordern eyes. In\n", "particular Legendre doesn’t make use of the sum sign, $$\n", "\\sum_{i=1}^3 z_i = z_1 + z_2 + z_3\n", "$$ nor does he make use of the inner product.\n", "\n", "In our notation, if we were to do linear regression, we would need to\n", "subsititue: $$\\begin{align*}\n", "a &\\leftarrow y_1-c, \\\\ a^\\prime &\\leftarrow y_2-c,\\\\ a^{\\prime\\prime} &\\leftarrow\n", "y_3 -c,\\\\ \n", "\\text{etc.} \n", "\\end{align*}$$ to introduce the data observations $\\{y_i\\}_{i=1}^{n}$\n", "alongside $c$, the offset. We would then introduce the input locations\n", "$$\\begin{align*}\n", "b & \\leftarrow x_1,\\\\\n", "b^\\prime & \\leftarrow x_2,\\\\\n", "b^{\\prime\\prime} & \\leftarrow x_3\\\\\n", "\\text{etc.}\n", "\\end{align*}$$ and finally the gradient of the function\n", "$$x \\leftarrow -m.$$ The remaining coefficients ($c$ and $f$) would then\n", "be zero. That would give us $$\\begin{align*} &(y_1 -\n", "(mx_1+c))^2 \\\\ + &(y_2 -(mx_2 + c))^2\\\\ + &(y_3 -(mx_3 + c))^2 \\\\ + & \\text{etc.}\n", "\\end{align*}$$ which we would write in the modern notation for sums as\n", "$$\n", "\\sum_{i=1}^n(y_i-(mx_i + c))^2\n", "$$ which is recognised as the sum of squares error for a linear\n", "regression.\n", "\n", "This shows the advantage of modern [summation\n", "operator](http://en.wikipedia.org/wiki/Summation), $\\sum$, in keeping\n", "our mathematical notation compact. Whilst it may look more complicated\n", "the first time you see it, understanding the mathematical rules that go\n", "around it, allows us to go much further with the notation.\n", "\n", "Inner products (or [dot\n", "products](http://en.wikipedia.org/wiki/Dot_product)) are similar. 
They\n", "allow us to write $$\n", "\\sum_{i=1}^q u_i v_i\n", "$$ in a more compact notation, $\\mathbf{u}\\cdot\\mathbf{v}.$\n", "\n", "Here we are using bold face to represent vectors, and we assume that the\n", "individual elements of a vector $\\mathbf{z}$ are given as a series of\n", "scalars $$\n", "\\mathbf{z} = \\begin{bmatrix} z_1\\\\ z_2\\\\ \\vdots\\\\ z_n\n", "\\end{bmatrix}\n", "$$ which are each indexed by their position in the vector." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Running Example: Olympic Marathons\n", "\n", "\\[edit\\]\n", "\n", "Note that `x` and `y` are not `pandas` data frames for this example,\n", "they are just arrays of dimensionality $n\\times 1$, where $n$ is the\n", "number of data.\n", "\n", "The aim of this lab is to have you coding linear regression in python.\n", "We will do it in two ways, once using iterative updates (coordinate\n", "ascent) and then using linear algebra. The linear algebra approach will\n", "not only work much better, it is easy to extend to multiple input linear\n", "regression and *non-linear* regression using basis functions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Maximum Likelihood: Iterative Solution\n", "\n", "Now we will take the maximum likelihood approach we derived in the\n", "lecture to fit a line, $y_i=mx_i + c$, to the data you’ve plotted. We\n", "are trying to minimize the error function: $$\n", "E(m, c) = \\sum_{i=1}^n(y_i-mx_i-c)^2\n", "$$ with respect to $m$, $c$ and $\\sigma^2$. We can start with an initial\n", "guess for $m$," ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "m = -0.4\n", "c = 80" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we use the maximum likelihood update to find an estimate for the\n", "offset, $c$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Objective Functions and Regression\n", "\n", "\\[edit\\]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import mlai" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x = np.random.normal(size=(4, 1))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "m_true = 1.4\n", "c_true = -3.1" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y = m_true*x+c_true" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.plot(x, y, 'r.', markersize=10) # plot data as red dots\n", "plt.xlim([-3, 3])\n", "mlai.write_figure(filename='regression.svg', directory='./ml', transparent=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: A simple linear regression." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Noise Corrupted Plot" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "noise = np.random.normal(scale=0.5, size=(4, 1)) # standard deviation of the noise is 0.5\n", "y = m_true*x + c_true + noise\n", "plt.plot(x, y, 'r.', markersize=10)\n", "plt.xlim([-3, 3])\n", "mlai.write_figure(filename='regression_noise.svg', directory='./ml', transparent=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: A simple linear regression with noise." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Contour Plot of Error Function\n", "\n", "- Visualise the error function surface, create vectors of values." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# create an array of linearly separated values around m_true\n", "m_vals = np.linspace(m_true-3, m_true+3, 100) \n", "# create an array of linearly separated values ae\n", "c_vals = np.linspace(c_true-3, c_true+3, 100)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- create a grid of values to evaluate the error function in 2D." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "m_grid, c_grid = np.meshgrid(m_vals, c_vals)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- compute the error function at each combination of $c$ and $m$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "E_grid = np.zeros((100, 100))\n", "for i in range(100):\n", " for j in range(100):\n", " E_grid[i, j] = ((y - m_grid[i, j]*x - c_grid[i, j])**2).sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Contour Plot of Error" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import teaching_plots as plot\n", "import mlai" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "f, ax = plt.subplots(figsize=(5,5))\n", "plot.regression_contour(f, ax, m_vals, c_vals, E_grid)\n", "mlai.write_figure(filename='regression_contour.svg', directory='./ml')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Contours of the objective function for linear regression by\n", "minimizing least squares." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Steepest Descent" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Algorithm\n", "\n", "- We start with a guess for $m$ and $c$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "m_star = 0.0\n", "c_star = -5.0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Offset Gradient\n", "\n", "- Now we need to compute the gradient of the error function, firstly\n", " with respect to $c$, $$\n", " \\frac{\\text{d}E(m, c)}{\\text{d} c} = -2\\sum_{i=1}^n(y_i - mx_i - c)\n", " $$\n", "\n", "- This is computed in python as follows" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "c_grad = -2*(y-m_star*x - c_star).sum()\n", "print(\"Gradient with respect to c is \", c_grad)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Deriving the Gradient\n", "\n", "To see how the gradient was derived, first note that the $c$ appears in\n", "every term in the sum. So we are just differentiating\n", "$(y_i - mx_i - c)^2$ for each term in the sum. 
The gradient of this term\n", "with respect to $c$ is simply the gradient of the outer quadratic,\n", "multiplied by the gradient with respect to $c$ of the part inside the\n", "quadratic. The gradient of a quadratic is two times the argument of the\n", "quadratic, and the gradient of the inside linear term is just minus one.\n", "This is true for all terms in the sum, so we are left with the sum in\n", "the gradient." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Slope Gradient\n", "\n", "The gradient with respect tom $m$ is similar, but now the gradient of\n", "the quadratic’s argument is $-x_i$ so the gradient with respect to $m$\n", "is\n", "\n", "$$\\frac{\\text{d}E(m, c)}{\\text{d} m} = -2\\sum_{i=1}^nx_i(y_i - mx_i -\n", "c)$$\n", "\n", "which can be implemented in python (numpy) as" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "m_grad = -2*(x*(y-m_star*x - c_star)).sum()\n", "print(\"Gradient with respect to m is \", m_grad)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Update Equations\n", "\n", "- Now we have gradients with respect to $m$ and $c$.\n", "- Can update our inital guesses for $m$ and $c$ using the gradient.\n", "- We don’t want to just subtract the gradient from $m$ and $c$,\n", "- We need to take a *small* step in the gradient direction.\n", "- Otherwise we might overshoot the minimum.\n", "- We want to follow the gradient to get to the minimum, the gradient\n", " changes all the time." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Move in Direction of Gradient" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import teaching_plots as plot" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "f, ax = plt.subplots(figsize=plot.big_figsize)\n", "plot.regression_contour(f, ax, m_vals, c_vals, E_grid)\n", "ax.plot(m_star, c_star, 'g*', markersize=20)\n", "ax.arrow(m_star, c_star, -m_grad*0.1, -c_grad*0.1, head_width=0.2)\n", "mlai.write_figure(filename='regression_contour_step001.svg', directory='./ml/', transparent=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Single update descending the contours of the error surface\n", "for regression." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Update Equations\n", "\n", "- The step size has already been introduced, it’s again known as the\n", " learning rate and is denoted by $\\eta$. $$\n", " c_\\text{new}\\leftarrow c_{\\text{old}} - \\eta\\frac{\\text{d}E(m, c)}{\\text{d}c}\n", " $$\n", "\n", "- gives us an update for our estimate of $c$ (which in the code we’ve\n", " been calling `c_star` to represent a common way of writing a\n", " parameter estimate, $c^*$) and $$\n", " m_\\text{new} \\leftarrow m_{\\text{old}} - \\eta\\frac{\\text{d}E(m, c)}{\\text{d}m}\n", " $$\n", "\n", "- Giving us an update for $m$." 
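 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before taking a step, we can sanity check the analytic gradients against a finite\n", "difference approximation (an added illustration, not part of the original notes; the\n", "helper `error_function` and the step size `eps` are introduced purely for this check,\n", "everything else reuses the variables defined above)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Added check: compare the analytic gradients with finite differences\n", "def error_function(m, c):\n", "    return ((y - m*x - c)**2).sum()\n", "\n", "eps = 1e-6\n", "c_grad_fd = (error_function(m_star, c_star + eps) - error_function(m_star, c_star - eps))/(2*eps)\n", "m_grad_fd = (error_function(m_star + eps, c_star) - error_function(m_star - eps, c_star))/(2*eps)\n", "print(\"Analytic gradients:\", m_grad, c_grad)\n", "print(\"Finite difference gradients:\", m_grad_fd, c_grad_fd)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The two estimates should agree to several decimal places."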
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Update Code\n", "\n", "- These updates can be coded as" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Original m was\", m_star, \"and original c was\", c_star)\n", "learn_rate = 0.01\n", "c_star = c_star - learn_rate*c_grad\n", "m_star = m_star - learn_rate*m_grad\n", "print(\"New m is\", m_star, \"and new c is\", c_star)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Iterating Updates\n", "\n", "- Fit model by descending gradient." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Gradient Descent Algorithm" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "num_plots = plot.regression_contour_fit(x, y, diagrams='./ml')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods\n", "from ipywidgets import IntSlider" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pods.notebook.display_plots('regression_contour_fit{num:0>3}.svg', directory='./ml', num=IntSlider(0, 0, num_plots, 1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Batch gradient descent for linear regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Stochastic Gradient Descent\n", "\n", "- If $n$ is small, gradient descent is fine.\n", "- But sometimes (e.g. on the internet $n$ could be a billion.\n", "- Stochastic gradient descent is more similar to perceptron.\n", "- Look at gradient of one data point at a time rather than summing\n", " across *all* data points)\n", "- This gives a stochastic estimate of gradient." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Stochastic Gradient Descent\n", "\n", "- The real gradient with respect to $m$ is given by\n", "\n", " $$\\frac{\\text{d}E(m, c)}{\\text{d} m} = -2\\sum_{i=1}^nx_i(y_i -\n", " mx_i - c)$$\n", "\n", " but it has $n$ terms in the sum. Substituting in the gradient we can\n", " see that the full update is of the form\n", "\n", " $$m_\\text{new} \\leftarrow\n", " m_\\text{old} + 2\\eta\\left[x_1 (y_1 - m_\\text{old}x_1 - c_\\text{old}) + (x_2 (y_2 - m_\\text{old}x_2 - c_\\text{old}) + \\dots + (x_n (y_n - m_\\text{old}x_n - c_\\text{old})\\right]$$\n", "\n", " This could be split up into lots of individual updates\n", " $$m_1 \\leftarrow m_\\text{old} + 2\\eta\\left[x_1 (y_1 - m_\\text{old}x_1 -\n", " c_\\text{old})\\right]$$ $$m_2 \\leftarrow m_1 + 2\\eta\\left[x_2 (y_2 -\n", " m_\\text{old}x_2 - c_\\text{old})\\right]$$\n", " $$m_3 \\leftarrow m_2 + 2\\eta\n", " \\left[\\dots\\right]$$\n", " $$m_n \\leftarrow m_{n-1} + 2\\eta\\left[x_n (y_n -\n", " m_\\text{old}x_n - c_\\text{old})\\right]$$\n", "\n", "which would lead to the same final update." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Updating $c$ and $m$\n", "\n", "- In the sum we don’t $m$ and $c$ we use for computing the gradient\n", " term at each update.\n", "- In stochastic gradient descent we *do* change them.\n", "- This means it’s not quite the same as steepest desceint.\n", "- But we can present each data point in a random order, like we did\n", " for the perceptron.\n", "- This makes the algorithm suitable for large scale web use (recently\n", " this domain is know as ‘Big Data’) and algorithms like this are\n", " widely used by Google, Microsoft, Amazon, Twitter and Facebook." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Stochastic Gradient Descent\n", "\n", "- Or more accurate, since the data is normally presented in a random\n", " order we just can write $$\n", " m_\\text{new} = m_\\text{old} + 2\\eta\\left[x_i (y_i - m_\\text{old}x_i - c_\\text{old})\\right]\n", " $$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# choose a random point for the update \n", "i = np.random.randint(x.shape[0]-1)\n", "# update m\n", "m_star = m_star + 2*learn_rate*(x[i]*(y[i]-m_star*x[i] - c_star))\n", "# update c\n", "c_star = c_star + 2*learn_rate*(y[i]-m_star*x[i] - c_star)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## SGD for Linear Regression\n", "\n", "Putting it all together in an algorithm, we can do stochastic gradient\n", "descent for our regression data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "num_plots = plot.regression_contour_sgd(x, y, diagrams='./ml')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pods\n", "from ipywidgets import IntSlider" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pods.notebook.display_plots('regression_sgd_contour_fit{num:0>3}.svg', \n", " directory='./ml', num=IntSlider(0, 0, num_plots, 1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Stochastic gradient descent for linear regression." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reflection on Linear Regression and Supervised Learning\n", "\n", "Think about:\n", "\n", "1. What effect does the learning rate have in the optimization? What’s\n", " the effect of making it too small, what’s the effect of making it\n", " too big? Do you get the same result for both stochastic and steepest\n", " gradient descent?\n", "\n", "2. The stochastic gradient descent doesn’t help very much for such a\n", " small data set. It’s real advantage comes when there are many,\n", " you’ll see this in the lab." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Quadratic Loss\n", "\n", "\\[edit\\]\n", "\n", "Now we’ve identified the empirical risk with the loss, we’ll use\n", "$E(\\mathbf{ w})$ to represent our objective function. $$\n", "E(\\mathbf{ w}) = \\sum_{i=1}^n\\left(y_i - f(\\mathbf{ x}_i, \\mathbf{ w})\\right)^2\n", "$$ gives us our objective.\n", "\n", "In the case of the linear prediction function we can substitute\n", "$f(\\mathbf{ x}_i, \\mathbf{ w}) = \\mathbf{ w}^\\top \\mathbf{ x}_i$. $$\n", "E(\\mathbf{ w}) = \\sum_{i=1}^n\\left(y_i - \\mathbf{ w}^\\top \\mathbf{ x}_i\\right)^2\n", "$$ To compute the gradient of the objective, we first of all expand the\n", "brackets." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bracket Expansion\n", "\n", "$$\n", "\\begin{align*}\n", " E(\\mathbf{ w},\\sigma^2) = &\n", "\\frac{n}{2}\\log \\sigma^2 + \\frac{1}{2\\sigma^2}\\sum\n", "_{i=1}^{n}y_i^{2}-\\frac{1}{\\sigma^2}\\sum\n", "_{i=1}^{n}y_i\\mathbf{ w}^{\\top}\\mathbf{ x}_i\\\\&+\\frac{1}{2\\sigma^2}\\sum\n", "_{i=1}^{n}\\mathbf{ w}^{\\top}\\mathbf{ x}_i\\mathbf{ x}_i^{\\top}\\mathbf{ w}\n", "+\\text{const}.\\\\\n", " = & \\frac{n}{2}\\log \\sigma^2 + \\frac{1}{2\\sigma^2}\\sum\n", "_{i=1}^{n}y_i^{2}-\\frac{1}{\\sigma^2}\n", "\\mathbf{ w}^\\top\\sum_{i=1}^{n}\\mathbf{ x}_iy_i\\\\&+\\frac{1}{2\\sigma^2}\n", "\\mathbf{ w}^{\\top}\\left[\\sum\n", "_{i=1}^{n}\\mathbf{ x}_i\\mathbf{ x}_i^{\\top}\\right]\\mathbf{ w}+\\text{const}.\n", "\\end{align*}\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Solution with Linear Algebra\n", "\n", "In this section we’re going compute the minimum of the quadratic loss\n", "with respect to the parameters. When we do this we’ll also review\n", "*linear algebra*. We will represent all our errors and functions in the\n", "form of matrices and vectors.\n", "\n", "Linear algebra is just a shorthand for performing lots of\n", "multiplications and additions simultaneously. What does it have to do\n", "with our system then? Well the first thing to note is that the classic\n", "linear function we fit for a one dimensional regression has the form: $$\n", "f(x) = mx + c\n", "$$ the classical form for a straight line. From a linear algebraic\n", "perspective we are looking for multiplications and additions. We are\n", "also looking to separate our parameters from our data. The data is the\n", "*givens*. In French the word is données literally translated means\n", "*givens* that’s great, because we don’t need to change the data, what we\n", "need to change are the parameters (or variables) of the model. In this\n", "function the data comes in through $x$, and the parameters are $m$ and\n", "$c$.\n", "\n", "What we’d like to create is a vector of parameters and a vector of data.\n", "Then we could represent the system with vectors that represent the data,\n", "and vectors that represent the parameters.\n", "\n", "We look to turn the multiplications and additions into a linear\n", "algebraic form, we have one multiplication ($m\\times c$) and one\n", "addition ($mx + c$). But we can turn this into a inner product by\n", "writing it in the following way, $$\n", "f(x) = m \\times x +\n", "c \\times 1,\n", "$$ in other words we’ve extracted the unit value, from the offset, $c$.\n", "We can think of this unit value like an extra item of data, because it\n", "is always given to us, and it is always set to 1 (unlike regular data,\n", "which is likely to vary!). We can therefore write each input data\n", "location, $\\mathbf{ x}$, as a vector $$\n", "\\mathbf{ x}= \\begin{bmatrix} 1\\\\ x\\end{bmatrix}.\n", "$$\n", "\n", "Now we choose to also turn our parameters into a vector. 
The parameter\n", "vector will be defined to contain $$\n", "\\mathbf{ w}= \\begin{bmatrix} c \\\\ m\\end{bmatrix}\n", "$$ because if we now take the inner product between these to vectors we\n", "recover $$\n", "\\mathbf{ x}\\cdot\\mathbf{ w}= 1 \\times c + x \\times m = mx + c\n", "$$ In `numpy` we can define this vector as follows" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# define the vector w\n", "w = np.zeros(shape=(2, 1))\n", "w[0] = m\n", "w[1] = c" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This gives us the equivalence between original operation and an\n", "operation in vector space. Whilst the notation here isn’t a lot shorter,\n", "the beauty is that we will be able to add as many features as we like\n", "and still keep the seame representation. In general, we are now moving\n", "to a system where each of our predictions is given by an inner product.\n", "When we want to represent a linear product in linear algebra, we tend to\n", "do it with the transpose operation, so since we have\n", "$\\mathbf{a}\\cdot\\mathbf{b} = \\mathbf{a}^\\top\\mathbf{b}$ we can write $$\n", "f(\\mathbf{ x}_i) = \\mathbf{ x}_i^\\top\\mathbf{ w}.\n", "$$ Where we’ve assumed that each data point, $\\mathbf{ x}_i$, is now\n", "written by appending a 1 onto the original vector $$\n", "\\mathbf{ x}_i = \\begin{bmatrix} \n", "1 \\\\\n", "x_i\n", "\\end{bmatrix}\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Design Matrix\n", "\n", "We can do this for the entire data set to form a [*design\n", "matrix*](http://en.wikipedia.org/wiki/Design_matrix) $\\mathbf{X}$, $$\n", "\\mathbf{X}\n", "= \\begin{bmatrix} \n", "\\mathbf{ x}_1^\\top \\\\\\ \n", "\\mathbf{ x}_2^\\top \\\\\\ \n", "\\vdots \\\\\\\n", "\\mathbf{ x}_n^\\top\n", "\\end{bmatrix} = \\begin{bmatrix}\n", "1 & x_1 \\\\\\\n", "1 & x_2 \\\\\\\n", "\\vdots\n", "& \\vdots \\\\\\\n", "1 & x_n\n", "\\end{bmatrix},\n", "$$ which in `numpy` can be done with the following commands:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X = np.hstack((np.ones_like(x), x))\n", "print(X)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Writing the Objective with Linear Algebra\n", "\n", "When we think of the objective function, we can think of it as the\n", "errors where the error is defined in a similar way to what it was in\n", "Legendre’s day $y_i - f(\\mathbf{ x}_i)$, in statistics these errors are\n", "also sometimes called\n", "[*residuals*](http://en.wikipedia.org/wiki/Errors_and_residuals_in_statistics).\n", "So we can think as the objective and the prediction function as two\n", "separate parts, first we have, $$\n", "E(\\mathbf{ w}) = \\sum_{i=1}^n(y_i - f(\\mathbf{ x}_i; \\mathbf{ w}))^2,\n", "$$ where we’ve made the function $f(\\cdot)$’s dependence on the\n", "parameters $\\mathbf{ w}$ explicit in this equation. Then we have the\n", "definition of the function itself, $$\n", "f(\\mathbf{ x}_i; \\mathbf{ w}) = \\mathbf{ x}_i^\\top \\mathbf{ w}.\n", "$$ Let’s look again at these two equations and see if we can identify\n", "any inner products. The first equation is a sum of squares, which is\n", "promising. 
Any sum of squares can be represented by an inner product, $$\n", "a = \\sum_{i=1}^{k} b^2_i = \\mathbf{b}^\\top\\mathbf{b},\n", "$$ so if we wish to represent $E(\\mathbf{ w})$ in this way, all we need\n", "to do is convert the sum operator to an inner product. We can get a\n", "vector from that sum operator by placing both $y_i$ and\n", "$f(\\mathbf{ x}_i; \\mathbf{ w})$ into vectors, which we do by defining $$\n", "\\mathbf{ y}= \\begin{bmatrix}y_1\\\\ y_2\\\\ \\vdots \\\\ y_n\\end{bmatrix}\n", "$$ and defining $$\n", "\\mathbf{ f}(\\mathbf{ x}_1; \\mathbf{ w}) = \\begin{bmatrix}f(\\mathbf{ x}_1; \\mathbf{ w})\\\\ f(\\mathbf{ x}_2; \\mathbf{ w})\\\\ \\vdots \\\\ f(\\mathbf{ x}_n; \\mathbf{ w})\\end{bmatrix}.\n", "$$ The second of these is actually a vector-valued function. This term\n", "may appear intimidating, but the idea is straightforward. A vector\n", "valued function is simply a vector whose elements are themselves defined\n", "as *functions*, i.e. it is a vector of functions, rather than a vector\n", "of scalars. The idea is so straightforward, that we are going to ignore\n", "it for the moment, and barely use it in the derivation. But it will\n", "reappear later when we introduce *basis functions*. So we will, for the\n", "moment, ignore the dependence of $\\mathbf{ f}$ on $\\mathbf{ w}$ and\n", "$\\mathbf{X}$ and simply summarise it by a vector of numbers $$\n", "\\mathbf{ f}= \\begin{bmatrix}f_1\\\\f_2\\\\\n", "\\vdots \\\\ f_n\\end{bmatrix}.\n", "$$ This allows us to write our objective in the folowing, linear\n", "algebraic form, $$\n", "E(\\mathbf{ w}) = (\\mathbf{ y}- \\mathbf{ f})^\\top(\\mathbf{ y}- \\mathbf{ f})\n", "$$ from the rules of inner products. But what of our matrix $\\mathbf{X}$\n", "of input data? At this point, we need to dust off [*matrix-vector\n", "multiplication*](http://en.wikipedia.org/wiki/Matrix_multiplication).\n", "Matrix multiplication is simply a convenient way of performing many\n", "inner products together, and it’s exactly what we need to summarise the\n", "operation $$\n", "f_i = \\mathbf{ x}_i^\\top\\mathbf{ w}.\n", "$$ This operation tells us that each element of the vector $\\mathbf{ f}$\n", "(our vector valued function) is given by an inner product between\n", "$\\mathbf{ x}_i$ and $\\mathbf{ w}$. In other words it is a series of\n", "inner products. 
Let’s look at the definition of matrix multiplication;\n", "it takes the form $$\n", "\\mathbf{c} = \\mathbf{B}\\mathbf{a},\n", "$$ where $\\mathbf{c}$ might be a $k$ dimensional vector (which we can\n", "interpret as a $k\\times 1$ dimensional matrix), $\\mathbf{B}$ is a\n", "$k\\times k$ dimensional matrix and $\\mathbf{a}$ is a $k$ dimensional\n", "vector ($k\\times 1$ dimensional matrix).\n", "\n", "The result of this multiplication is of the form $$\n", "\\begin{bmatrix}c_1\\\\c_2 \\\\ \\vdots \\\\\n", "c_k\\end{bmatrix} = \n", "\\begin{bmatrix} b_{1,1} & b_{1, 2} & \\dots & b_{1, k} \\\\\n", "b_{2, 1} & b_{2, 2} & \\dots & b_{2, k} \\\\\n", "\\vdots & \\vdots & \\ddots & \\vdots \\\\\n", "b_{k, 1} & b_{k, 2} & \\dots & b_{k, k} \\end{bmatrix} \\begin{bmatrix}a_1\\\\a_2 \\\\\n", "\\vdots\\\\ a_k\\end{bmatrix} = \\begin{bmatrix} b_{1, 1}a_1 + b_{1, 2}a_2 + \\dots +\n", "b_{1, k}a_k\\\\\n", "b_{2, 1}a_1 + b_{2, 2}a_2 + \\dots + b_{2, k}a_k \\\\ \n", "\\vdots\\\\\n", "b_{k, 1}a_1 + b_{k, 2}a_2 + \\dots + b_{k, k}a_k\\end{bmatrix}\n", "$$ so we see that each element of the result, $\\mathbf{c}$, is simply the\n", "inner product between the corresponding *row* of $\\mathbf{B}$ and the vector\n", "$\\mathbf{a}$. Because we have defined each element of $\\mathbf{ f}$ to\n", "be given by the inner product between each *row* of the design matrix\n", "and the vector $\\mathbf{ w}$, we can now write the full operation in one\n", "matrix multiplication,\n", "\n", "$$\n", "\\mathbf{ f}= \\mathbf{X}\\mathbf{ w}.\n", "$$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "f = X@w # The @ sign performs matrix multiplication" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Combining this result with our objective function, $$\n", "E(\\mathbf{ w}) = (\\mathbf{ y}- \\mathbf{ f})^\\top(\\mathbf{ y}- \\mathbf{ f}),\n", "$$ we find we have defined the *model* with two equations. One equation\n", "tells us the form of our predictive function and how it depends on its\n", "parameters, the other tells us the form of our objective function." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "resid = (y-f)\n", "E = np.dot(resid.T, resid) # matrix multiplication on a single vector is equivalent to a dot product\n", "print(\"Error function is:\", E)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Solution with QR Decomposition\n", "\n", "\\[edit\\]\n", "\n", "Performing a solve instead of a matrix inverse is the more numerically\n", "stable approach, but we can do even better. A\n", "[QR-decomposition](http://en.wikipedia.org/wiki/QR_decomposition) of a\n", "matrix factorises it into two matrices: $\\mathbf{Q}$, whose columns are\n", "orthonormal so that $\\mathbf{Q}^\\top \\mathbf{Q} = \\mathbf{I}$, and\n", "$\\mathbf{R}$, which is upper triangular. 
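A quick numerical check of these two\n", "properties on our design matrix is below (an added illustration, not part of the original\n", "notes; it reuses the `X` computed earlier):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Added illustration: check the properties of the QR factorisation numerically\n", "Q, R = np.linalg.qr(X)\n", "print(Q.T@Q) # close to the identity matrix\n", "print(R)     # upper triangular" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "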
$$\n", "\\mathbf{X}^\\top \\mathbf{X}\\boldsymbol{\\beta} =\n", "\\mathbf{X}^\\top \\mathbf{ y}\n", "$$ $$\n", "(\\mathbf{Q}\\mathbf{R})^\\top\n", "(\\mathbf{Q}\\mathbf{R})\\boldsymbol{\\beta} = (\\mathbf{Q}\\mathbf{R})^\\top\n", "\\mathbf{ y}\n", "$$ $$\n", "\\mathbf{R}^\\top (\\mathbf{Q}^\\top \\mathbf{Q}) \\mathbf{R}\n", "\\boldsymbol{\\beta} = \\mathbf{R}^\\top \\mathbf{Q}^\\top \\mathbf{ y}\n", "$$\n", "\n", "$$\n", "\\mathbf{R}^\\top \\mathbf{R} \\boldsymbol{\\beta} = \\mathbf{R}^\\top \\mathbf{Q}^\\top\n", "\\mathbf{ y}\n", "$$ $$\n", "\\mathbf{R} \\boldsymbol{\\beta} = \\mathbf{Q}^\\top \\mathbf{ y}\n", "$$\n", "\n", "This is a more numerically stable solution because it removes the need\n", "to compute $\\mathbf{X}^\\top\\mathbf{X}$ as an intermediate. Computing\n", "$\\mathbf{X}^\\top\\mathbf{X}$ is a bad idea because it involves squaring\n", "all the elements of $\\mathbf{X}$ and thereby potentially reducing the\n", "numerical precision with which we can represent the solution. Operating\n", "on $\\mathbf{X}$ directly preserves the numerical precision of the model.\n", "\n", "This can be more particularly seen when we begin to work with *basis\n", "functions* in the next session. Some systems that can be resolved with\n", "the QR decomposition can not be resolved by using solve directly." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import scipy as sp" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Q, R = np.linalg.qr(X)\n", "w = sp.linalg.solve_triangular(R, Q.T@y) \n", "w = pd.DataFrame(w, index=X.columns)\n", "w" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Further Reading\n", "\n", "- For fitting linear models: Section 1.1-1.2 of Rogers and\n", " Girolami (2011)\n", "\n", "- Section 1.2.5 up to equation 1.65 of Bishop (2006)\n", "\n", "- Section 1.3 for Matrix & Vector Review of Rogers and Girolami (2011)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Thanks!\n", "\n", "For more information on these subjects and more you might want to check\n", "the following resources.\n", "\n", "- twitter: [@lawrennd](https://twitter.com/lawrennd)\n", "- podcast: [The Talking Machines](http://thetalkingmachines.com)\n", "- newspaper: [Guardian Profile\n", " Page](http://www.theguardian.com/profile/neil-lawrence)\n", "- blog:\n", " [http://inverseprobability.com](http://inverseprobability.com/blog.html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## References" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Bishop, C.M., 2006. Pattern recognition and machine learning. springer.\n", "\n", "Legendre, A.-M., 1805. Nouvelles méthodes pour la détermination des\n", "orbites des comètes. F. Didot.\n", "\n", "Rogers, S., Girolami, M., 2011. A first course in machine learning. CRC\n", "Press." ] } ], "nbformat": 4, "nbformat_minor": 5, "metadata": {} }