{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Regression" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Preliminaries\n", "\n", "- Goal \n", "  - Introduction to Bayesian (Linear) Regression\n", "- Materials \n", "  - Mandatory\n", "    - These lecture notes\n", "  - Optional\n", "    - Bishop pp. 152-158 \n", "    - In this and forthcoming lectures, we will make use of some elementary matrix calculus. The most important formulas are summarized at the bottom of this notebook in an [OPTIONAL SLIDE on matrix calculus](#matrix-calculus).\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Regression - Illustration\n", "\n", "Given a set of (noisy) data measurements, find the 'best' relation between an input variable $x \\in \\mathbb{R}^M$ and input-dependent outcomes $y \\in \\mathbb{R}$.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Regression vs Density Estimation\n", "\n", "- Observe $N$ IID data **pairs** $D=\\{(x_1,y_1),\\dotsc,(x_N,y_N)\\}$ with $x_n \\in \\mathbb{R}^M$ and $y_n \\in \\mathbb{R}$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Assume that we are interested in (a model for) the responses $y_n$ for **given inputs** $x_n$. In other words, we are interested in building a model for the conditional distribution $p(y|x)$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Note that, since $p(x,y)=p(y|x)\\, p(x)$, building a model $p(y|x)$ is similar to density estimation with the assumption that $x$ is drawn from a uniform distribution." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Bayesian Linear Regression\n", "\n", "- Next, we discuss (1) model specification, (2) inference, and (3) a prediction application for a Bayesian linear regression problem." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "##### 1. Model Specification\n", "\n", "- In a _regression_ model, we try to 'explain the data' by a purely deterministic term $f(x_n,w)$, plus a purely random term $\\epsilon_n$ for 'unexplained noise',\n", "\n", "$$\n", "y_n = f(x_n,w) + \\epsilon_n\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- In a _linear regression_ model, i.e., a model that is linear w.r.t. the parameters $w$, we assume that\n", "$$f(x_n,w)= \\sum_{j=0}^{M-1} w_j \\phi_j(x_n) = w^T \\phi(x_n)$$\n", "where the $\\phi_j(x)$ are called basis functions.\n", "  - For notational simplicity, from now on we will assume $f(x_n,w) = w^T x_n$, with $x_n \\in \\mathbb{R}^M$."
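] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- The construction of $f(x_n,w)$ from basis functions is easy to make concrete in code. Below is a minimal Python/NumPy sketch (illustration only, not part of the course materials); the polynomial basis and the values of `x` and `w` are made-up choices.\n", "\n", "```python\n", "import numpy as np\n", "\n", "def phi(x_n, M=4):\n", "    # hypothetical polynomial basis: phi_j(x) = x**j for j = 0, ..., M-1\n", "    return np.array([x_n**j for j in range(M)])\n", "\n", "x = np.array([0.1, 0.5, 0.9])          # made-up 1-D inputs\n", "X = np.stack([phi(x_n) for x_n in x])  # design matrix with rows phi(x_n)^T\n", "w = np.array([0.5, -1.0, 2.0, 0.3])    # example weight vector\n", "f = X @ w                              # f(x_n, w) = w^T phi(x_n) for all n at once\n", "```\n"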
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- In _ordinary linear regression_, the noise process $\\epsilon_n$ is zero-mean Gaussian with constant variance, i.e.,\n", "$$\n", "y_n = w^T x_n + \\epsilon_n\\,, \\quad \\epsilon_n \\sim \\mathcal{N}(0,\\beta^{-1}) \\,.\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Hence, given a data set $D=\\{(x_1,y_1),\\dotsc,(x_N,y_N)\\}$, the likelihood for an ordinary linear regression model is \n", "\n", "$$\\begin{align*}\n", "p(y\\,|\\,\\mathbf{X},w,\\beta) &= \\mathcal{N}(y\\,|\\,\\mathbf{X} w,\\beta^{-1} \\mathbf{I}) \\\\\n", " &= \\prod_n \\mathcal{N}(y_n\\,|\\,w^T x_n,\\beta^{-1}) \\tag{B-3.10}\n", "\\end{align*}$$\n", "\n", "where $w = \\left(\\begin{matrix} w_1 \\\\ w_2 \\\\ \\vdots \\\\ w_{M} \\end{matrix} \\right)$, the $(N\\times M)$-dim matrix $\\mathbf{X} = \\left(\\begin{matrix}x_1^T \\\\ x_2^T \\\\ \\vdots \\\\ x_N^T \\end{matrix} \\right) = \\left(\\begin{matrix}x_{11},x_{12},\\dots,x_{1M}\\\\ x_{21},x_{22},\\dots,x_{2M} \\\\ \\vdots \\\\ x_{N1},x_{N2},\\dots,x_{NM} \\end{matrix} \\right) $ and $y = \\left(\\begin{matrix} y_1 \\\\ y_2 \\\\ \\vdots \\\\ y_N \\end{matrix} \\right)$.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- For full Bayesian learning we should also choose a prior $p(w)$:\n", "\n", "$$\\begin{equation}\n", "p(w\\,|\\,\\alpha) = \\mathcal{N}(w\\,|\\,0,\\alpha^{-1}\\mathbf{I}) \\tag{B-3.52}\n", "\\end{equation}$$\n", "\n", "- For simplicity, we will assume that $\\alpha$ and $\\beta$ are known.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "##### 2. Inference\n", "\n", "- We'll do Bayesian inference for the parameters $w$. 
\n", "\n", "$$\\begin{align*}\n", "p(w|D) &\\propto p(D|w)\\cdot p(w) \\\\\n", " &= \\mathcal{N}(y\\,|\\,\\mathbf{X} w,\\beta^{-1} \\mathbf{I}) \\cdot \\mathcal{N}(w\\,|\\,0,\\alpha^{-1}\\mathbf{I}) \\\\\n", " &\\propto \\exp \\big( -\\frac{\\beta}{2} \\big( {y - \\mathbf{X}w } \\big)^T \\big( {y - \\mathbf{X}w } \\big) - \\frac{\\alpha}{2}w^T w \\big)\\tag{B-3.55} \\\\\n", " &= \\exp\\big( -\\frac{1}{2} w^T\\big(\\underbrace{\\beta \\mathbf{X}^T\\mathbf{X} + \\alpha \\mathbf{I}}_{\\Lambda_N}\\big)w + \\big(\\underbrace{\\beta \\mathbf{X}^Ty}_{\\eta_N}\\big)^T w - \\frac{\\beta}{2}y^T y \\big) \\\\\n", " &\\propto \\mathcal{N}_c\\left(w\\,|\\,\\eta_N,\\Lambda_N \\right)\n", "\\end{align*}$$\n", "\n", "with natural parameters (see the [natural parameterization of the Gaussian](https://nbviewer.jupyter.org/github/bertdv/BMLIP/blob/master/lessons/notebooks/The-Gaussian-Distribution.ipynb#natural-parameterization)):\n", "\n", "$$\\begin{align*}\n", "\\eta_N &= \\beta\\mathbf{X}^Ty \\\\\n", "\\Lambda_N &= \\beta \\mathbf{X}^T\\mathbf{X} + \\alpha \\mathbf{I}\n", "\\end{align*}$$\n", "\n", "- Or equivalently (in the moment parameterization of the Gaussian):\n", "$$\\begin{align*}\n", "p(w|D) &= \\mathcal{N}\\left(w\\,|\\,m_N,S_N \\right) \\tag{B-3.49} \\\\\n", "m_N &= \\beta S_N \\mathbf{X}^T y \\tag{B-3.53}\\\\\n", "S_N &= \\left(\\alpha \\mathbf{I} + \\beta \\mathbf{X}^T \\mathbf{X}\\right)^{-1} \\tag{B-3.54}\n", "\\end{align*}$$\n", "\n", "- Note that B-3.53 and B-3.54 combine to\n", "$$\n", "m_N = \\left(\\frac{\\alpha}{\\beta}\\mathbf{I} + \\mathbf{X}^T \\mathbf{X} \\right)^{-1} \\mathbf{X}^T y\\,.\n", "$$\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- Illustration of sequential Bayesian learning for a simple linear model of the form $y(x, w) = w_0 + w_1 x$ (Bishop Fig.3.7; a detailed description is given on pg.154 of Bishop).\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "##### 3. Application: predictive distribution\n", "\n", "- Assume we are interested in the distribution $p(y_\\bullet \\,|\\, x_\\bullet, D)$ for a new input $x_\\bullet$. This can be worked out as follows (exercise B-3.10):\n", "\n", "$$\\begin{align*}\n", "p(y_\\bullet \\,|\\, x_\\bullet, D) &= \\int p(y_\\bullet \\,|\\, x_\\bullet, w) p(w\\,|\\,D)\\,\\mathrm{d}w \\\\\n", "&= \\int \\mathcal{N}(y_\\bullet \\,|\\, w^T x_\\bullet, \\beta^{-1}) \\mathcal{N}(w\\,|\\,m_N,S_N)\\,\\mathrm{d}w \\\\\n", "&= \\int \\mathcal{N}(y_\\bullet \\,|\\, z, \\beta^{-1}) \\mathcal{N}(z\\,|\\,x_\\bullet^T m_N,x_\\bullet^T S_N x_\\bullet)\\,\\mathrm{d}z \\quad \\text{(sub. 
}z=x_\\bullet^T w \\text{)} \\\\\n", "&= \\int \\underbrace{\\mathcal{N}(z \\,|\\, y_\\bullet, \\beta^{-1}) \\mathcal{N}(z\\,|\\,x_\\bullet^T m_N,x_\\bullet^T S_N x_\\bullet)}_{\\text{Use Gaussian product formula SRG-6}}\\,\\mathrm{d}z \\\\\n", "&= \\int \\mathcal{N}\\left(y_\\bullet\\,|\\, m_N^T x_\\bullet, \\sigma_N^2(x_\\bullet) \\right) \\underbrace{\\mathcal{N}\\left(z\\,|\\, m_z, S_z\\right)}_{\\text{integrate this out}} \\mathrm{d}z\\\\\n", "&= \\mathcal{N}\\left(y_\\bullet\\,|\\, m_N^T x_\\bullet, \\sigma_N^2(x_\\bullet) \\right)\n", "\\end{align*}$$\n", "with\n", "$$\\begin{align*}\n", "\\sigma_N^2(x_\\bullet) = \\beta^{-1} + x^T_\\bullet S_N x_\\bullet \\tag{B-3.59}\n", "\\end{align*}$$\n", "\n", "- So, the uncertainty $\\sigma_N^2(x_\\bullet)$ about the output comprises both uncertainty about the process ($\\beta^{-1}$) and about the model parameters ($x^T_\\bullet S_N x_\\bullet$)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Example Predictive Distribution\n", "\n", "- As an example, let's do Bayesian linear regression for a synthetic sinusoidal data set and a model with 9 Gaussian basis functions\n", "$$\\begin{align*}\n", "y_n &= \\sum_{m=1}^9 w_m \\phi_m(x_n) + \\epsilon_n\\,, \\quad \\epsilon_n \\sim \\mathcal{N}(0,\\beta^{-1}) \\\\\n", "\\phi_m(x_n) &= \\exp\\left( -\\frac{(x_n-\\mu_m)^2}{2\\sigma^2}\\right)\n", "\\end{align*}$$\n", "\n", "- The predictive distributions for $y$ are shown in Bishop, Fig.3.8.\n", "
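\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- The posterior (B-3.53, B-3.54) and the predictive distribution (B-3.59) translate directly into code. Below is a minimal Python/NumPy sketch (illustration only, not part of the course materials; `alpha`, `beta` and the data are made-up values).\n", "\n", "```python\n", "import numpy as np\n", "\n", "alpha, beta = 2.0, 25.0                             # assumed known precisions\n", "X = np.array([[1.0, 0.1], [1.0, 0.5], [1.0, 0.9]])  # (N=3, M=2) design matrix\n", "y = np.array([0.8, 0.4, -0.7])                      # observed outputs\n", "\n", "M = X.shape[1]\n", "S_N = np.linalg.inv(alpha * np.eye(M) + beta * X.T @ X)  # B-3.54\n", "m_N = beta * S_N @ X.T @ y                               # B-3.53\n", "\n", "# predictive mean and variance for a new input x_bullet (B-3.59)\n", "x_bullet = np.array([1.0, 0.7])\n", "y_mean = m_N @ x_bullet\n", "y_var = 1.0 / beta + x_bullet @ S_N @ x_bullet\n", "```\n", "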
\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- Plots of functions $w^T \\phi(x)$, with $w$ drawn from the posterior, are shown in Bishop, Fig.3.9.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Maximum Likelihood Estimation for the Linear Regression Model\n", "\n", "- Recall the posterior mean for the weight vector\n", "$$\n", "m_N = \\left(\\frac{\\alpha}{\\beta}\\mathbf{I} + \\mathbf{X}^T \\mathbf{X} \\right)^{-1} \\mathbf{X}^T y\n", "$$\n", "where $\\alpha$ is the prior precision for the weights." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- The maximum likelihood solution for $w$ is obtained by letting $\\alpha \\rightarrow 0$, which leads to \n", "\n", "$$\\begin{equation*}\n", "\\hat w_{\\text{ML}} = (\\mathbf{X}^T \\mathbf{X})^{-1} \\mathbf{X}^T y\n", "\\end{equation*}$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- The matrix $\\mathbf{X}^\\dagger \\equiv (\\mathbf{X}^T \\mathbf{X})^{-1}\\mathbf{X}^T$ is also known as the **Moore-Penrose pseudo-inverse** (a sort-of-inverse for non-square matrices)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Note that if we have fewer training samples than input dimensions, i.e., if $N