{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "

Gaussian Process Regression

\n", "
(C) Nikolai Nowaczyk, Jörg Kienitz 2021
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Gaussian process regression (GPR) is a powerful alternative to linear regression, which allows us to make predictions on a test data set $X^*$ given some labelled training data set $(X,Y)$. A big advantage is that we do not only obtain the predictions themselves, but also confidence regions around them. In this notebook we introduce Gaussian processes and how to use them for regression." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%matplotlib widget" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import ipywidgets as wdg" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# uncomment to support dark themes\n", "#plt.style.use(['dark_background'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Gaussian Processes\n", "A key idea of Gaussian process regression is to think of an input data point $x_i$ not as a point, but as a mean of some distribution. Gaussian processes formalize that intution, which we introduce in this section and illustrate with an example." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Definition" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Definition (Gaussian Process):** Let $(\\Omega, \\mathcal{F}, \\mathbb{P})$ be a probability space, let $L^2(\\Omega)$ denote the real-valued random variables on it and let $\\mathcal{X}$ be any set. A *Gaussian Process* is a map $F:\\mathcal{X} \\times \\Omega \\to \\mathbb{R}$ such that\n", "* for any $x \\in \\mathcal{X}$, the map $F(x) := F(x, \\_): \\Omega \\to \\mathbb{R}$ is an $L^2$ random variable,\n", "* for any finite number $n$ and any $X=(x_1, \\ldots, x_n)$, $x_i \\in \\mathcal{X}$, the random variable $(F(x_1), \\ldots, F(x_n))$ is multi-variate Gaussian on $\\mathbb{R}^{n}$. \n", "The functions\n", "\\begin{align*}\n", " m: & \\mathcal{X} \\to \\mathbb{R}, \\quad x \\mapsto \\mathbb{E}[F(x)], \\\\\n", " k:& \\mathcal{X} \\times \\mathcal{X} \\to \\mathbb{R}, \\quad (x, x') \\mapsto \\operatorname{Cov}[F(x), F(x')] = \\mathbb{E}[(F(x)-m(x))((F(y)-m(y)))]\n", "\\end{align*}\n", "are called the associated *mean* respectively *covariance function*. We also write\n", "\\begin{align*}\n", " F \\sim \\mathcal{GP}(m, k).\n", "\\end{align*}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Remarks**: \n", "* In most practical applications, the mean function is zero, i.e. $m \\equiv 0$.\n", "* The requirement that $F(x)$ is square integrable ensures that the covariance function is well-defined.\n", "* In most practical applications, the set $\\mathcal{X}$ is given by $\\mathcal{X} = \\mathbb{R}^d$. Thus, while $\\mathcal{X} = \\mathbb{R}_{\\geq 0}$ is also a valid set, Gaussian processes are more general than stochastic processes as their parameter set $\\mathcal{X}$ can be bigger. 
They are more special than stochastic processes due to the requirement of always being normally distributed.\n", "* Analogously to stochastic processes, for any $\\omega \\in \\Omega$, the map $F(\\_, \\omega): \\mathcal{X} \\to \\mathbb{R}$ is called a *path*.\n", "* By slight abuse of notation, for any vector $X=(x_1, \\ldots, x_n)$, we set $F(X) = (F(x_1), \\ldots, F(x_n))$.\n", "* By slight abuse of notation, for any two vectors $X$, $X'$, we denote by $K(X,X') := (k(x_i, x'_j))_{ij}$ the evaluation of the covariance function on all the entries." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Lemma:** The distribution of a Gaussian process is uniquely determined by its mean and its covariance function.\n", "\n", "**Proof:** This simply follows from the fact that a Gaussian distribution is uniquely determined by its mean and its covariance." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We assume that $\\mathcal{X}=\\mathbb{R}$, $m\\equiv 0$ and that\n", "\\begin{align*}\n", " k(x,x') := c^2 \\exp(-\\tfrac{1}{2 \\ell^2}|x-x'|^2),\n", "\\end{align*}\n", "for some $c, \\ell \\in \\mathbb{R}$. For any given data $x^* = (x^*_1, \\ldots, x^*_M)$, the resulting random variables are distributed as\n", "\\begin{align*}\n", " F(x^*) \\sim \\mathcal{N}(0, K(x^*, x^*)).\n", "\\end{align*}\n", "We plot a few realizations below." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "M = 20\n", "num_paths=5\n", "xs = np.linspace(0, 1, M)\n", "Ks = np.array([np.array([10 * np.exp(-(x1-x2)**2/0.2) for x1 in xs]) for x2 in xs])\n", "np.random.seed(1)\n", "fs = np.random.multivariate_normal(mean=np.zeros(M), cov=Ks, size=num_paths) " ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "scrolled": false }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "57fecdfeaeb4413888618b4f4dd7dd7c", "version_major": 2, "version_minor": 0 }, "text/plain": [ "FigureCanvasNbAgg()" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fig_gp_rel, ax_gp_rel = plt.subplots()\n", "fig_gp_rel.suptitle('Realizations of a Gaussian Process')\n", "ax_gp_rel.clear()\n", "for path in range(num_paths):\n", "    ax_gp_rel.plot(xs, fs[path], label='path %i' % path, marker='x')\n", "ax_gp_rel.set_xlim([0, 1])\n", "ax_gp_rel.set_ylim([-8, 8])\n", "ax_gp_rel.set_xlabel('x')\n", "ax_gp_rel.set_ylabel('f')\n", "ax_gp_rel.legend(loc='lower left', prop={'size': 6})\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Gaussian Process Regression (GPR)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Problem Formulation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Gaussian processes can be used for regression as follows: we make the assumption that our training labels $y_i$ are related to the training data $x_i$ as possibly noisy observations of some unknown Gaussian process $F:\\mathcal{X} \\times \\Omega \\to \\mathbb{R}$, $\\mathcal{X}=\\mathbb{R}^d$, $F \\sim \\mathcal{GP}(0, k)$ on some probability space $(\\Omega, \\mathcal{F}, \\mathbb{P})$. We assume that there exist random variables $\\varepsilon_i$ on $\\Omega$, which are iid and Gaussian with zero mean and variance $\\sigma^2$. We also assume that each $\\varepsilon_i$ is independent of $F(x_i)$. Finally, we assume that our input data are a realization of that setup, i.e.
there exists $\\omega \\in \\Omega$ such that\n", "\\begin{align*}\n", " y_i = F(x_i, \\omega) + \\varepsilon_i(\\omega)\n", "\\end{align*}\n", "for all $i=1, \\ldots, N$.\n", "\n", "Technically, we assume that we are given the following inputs:\n", "* training data points $X = (x_1, \\ldots, x_N)$, $x_i \\in \\mathbb{R}^d$,\n", "* corresponding data labels $y = (y_1, \\ldots, y_N)$, $y_i \\in \\mathbb{R}$,\n", "* the covariance kernel function $k$ and the noise variance $\\sigma^2$ (see [hyperparameter optimization](#sect_hyperparameter_optimization) for a discussion),\n", "* test data points $x^* = (x^*_1, \\ldots, x^*_M)$, $x_i^* \\in \\mathbb{R}^d$.\n", "\n", "The $\\omega$ and the noise $\\varepsilon_i$ themselves are not given as inputs. The case $\\sigma=0$, i.e. $\\varepsilon_i \\equiv 0$, is allowed and called *noise-free*. \n", "\n", "What we want to produce as an output is:\n", "* the conditional distribution of $F(X^*)$ given the input data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Theoretical Solution" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At the heart of GPR lies the following theorem that describes the predictive distribution.\n", "\n", "**Theorem (predictive distribution):** Let $F:\\mathcal{X} \\times \\Omega \\to \\mathbb{R}$, $\\mathcal{X} = \\mathbb{R}^d$, be a Gaussian process with zero mean and covariance function $k:\\mathcal{X} \\times \\mathcal{X} \\to \\mathbb{R}$. Let $X=(x_1, \\ldots, x_N)$ be a set of training data, $y=(y_1, \\ldots, y_N)$ be the corresponding labels and let $\\varepsilon_i$ be Gaussian iid noise with zero mean and variance $\\sigma^2$, which is independent of the $F(x_i)$, and set $\\varepsilon:=(\\varepsilon_1, \\ldots, \\varepsilon_N)$. Define $Y := F(X) + \\varepsilon$. For any test data $X^* = (x^*_1, \\ldots, x^*_M)$ the following holds: If $K(X, X) + \\sigma^2 I$ is invertible, then the conditional distribution of $F(X^*)$ given $Y=y$ satisfies \n", "\\begin{align*}\n", " F(X^*) \\mid Y = y \\sim \\mathcal{N}(m^*, k^*),\n", "\\end{align*}\n", "where\n", "\\begin{align*}\n", " m^* &:= K(X^*, X) (K(X, X) + \\sigma^2 I)^{-1} y\\\\\n", " k^* &:= K(X^*, X^*) - K(X^*, X)(K(X,X) + \\sigma^2 I)^{-1}K(X,X^*).\n", "\\end{align*}\n", "\n", "\n", "**Proof:** By definition of a Gaussian process, $F(X) \\sim \\mathcal{N}(0, K(X,X))$ and by assumption $\\varepsilon \\sim \\mathcal{N}(0, \\sigma^2 I)$. Therefore, $Y$ is Gaussian as well with mean zero and the independence assumptions imply\n", "\\begin{align*}\n", " \\operatorname{Cov}[Y_i, Y_j] \n", " = \\operatorname{Cov}[F(x_i) + \\varepsilon_i, F(x_j) + \\varepsilon_j]\n", " = k(x_i, x_j) + \\delta_{ij} \\sigma^2,\n", "\\end{align*}\n", "thus, the covariance matrix of $Y$ is given by $K(X,X) + \\sigma^2 I$. This implies that \n", "\\begin{align*}\n", " \\begin{pmatrix}\n", " Y \\\\\n", " F(X^*)\n", " \\end{pmatrix} \\sim \\mathcal{N}\\Big( 0, \\begin{pmatrix}\n", " K(X,X) + \\sigma^2 I & K(X, X^*) \\\\\n", " K(X^*, X) & K(X^*, X^*)\n", " \\end{pmatrix} \\Big).\n", "\\end{align*}\n", "Now the claim follows from the [Gaussian conditioning theorem](#lem_cond_part_gaussian)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Lemma (noise-free):** In case the input data is noise-free, i.e.
$\\sigma=0$, hence $\\varepsilon \\equiv 0$, the mean and covariance of the predictive distribution simplify to\n", "\\begin{align*}\n", " m^* &:= K(X^*, X) K(X, X)^{-1} f \\\\\n", " k^* &:= K(X^*, X^*) - K(X^*, X)K(X,X)^{-1}K(X,X^*),\n", "\\end{align*}\n", "where $f=(f_1, \\ldots, f_N)$, $f_i := F(x_i, \\omega) = y_i$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Practical Examples" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us illustrate these theoretical results with some examples. We again use the Gaussian process from the previous example, i.e. we assume that $\\mathcal{X}=\\mathbb{R}$, $m\\equiv 0$ and that\n", "\\begin{align*}\n", " k(x,x') := c^2 \\exp(-\\tfrac{1}{2 \\ell^2}|x-x'|^2),\n", "\\end{align*}\n", "for some $c, \\ell \\in \\mathbb{R}$. The calculation of the conditional mean and covariance is easily performed as follows:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "def gpr_pred(x_train, y_train, x_pred, cov_fun, sigma=0):\n", "    # evaluate the covariance kernel on training and prediction points\n", "    k_xt_xt = np.matrix([[cov_fun(x1, x2) for x1 in x_train] for x2 in x_train]).T\n", "    # add the noise variance on the diagonal, i.e. K(X,X) + sigma^2 I\n", "    k_xt_xt += sigma**2 * np.eye(len(x_train))\n", "    k_xt_xs = np.matrix([[cov_fun(x1, x2) for x1 in x_train] for x2 in x_pred]).T\n", "    k_xs_xt = np.matrix([[cov_fun(x1, x2) for x1 in x_pred] for x2 in x_train]).T\n", "    k_xs_xs = np.matrix([[cov_fun(x1, x2) for x1 in x_pred] for x2 in x_pred]).T\n", "    # predictive mean and covariance from the theorem above\n", "    m = k_xs_xt @ k_xt_xt.I @ y_train\n", "    c = k_xs_xs - k_xs_xt @ k_xt_xt.I @ k_xt_xs\n", "    return np.asarray(m).squeeze(), c" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Noise-free Case" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It can be instructive to consider the easier noise-free case first, so let us also assume that $\\varepsilon \\equiv 0$."
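 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As an aside before the examples: inverting the kernel matrix directly, as in `gpr_pred` above, can become numerically fragile for larger or ill-conditioned kernel matrices. The following is a minimal sketch of a Cholesky-based variant; the helper name `gpr_pred_chol` and the use of `scipy.linalg` are our own choices here and not part of the toy implementation." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from scipy.linalg import cho_factor, cho_solve\n", "\n", "def gpr_pred_chol(x_train, y_train, x_pred, cov_fun, sigma=0.0):\n", "    # sketch only: same interface as gpr_pred, but solves linear systems via a\n", "    # Cholesky factorization instead of forming a matrix inverse explicitly\n", "    k_tt = np.array([[cov_fun(x1, x2) for x2 in x_train] for x1 in x_train])\n", "    k_st = np.array([[cov_fun(xs, xt) for xt in x_train] for xs in x_pred])\n", "    k_ss = np.array([[cov_fun(x1, x2) for x2 in x_pred] for x1 in x_pred])\n", "    # factorize K(X,X) + sigma^2 I once and reuse it\n", "    L = cho_factor(k_tt + sigma**2 * np.eye(len(x_train)), lower=True)\n", "    alpha = cho_solve(L, y_train)  # (K + sigma^2 I)^{-1} y\n", "    v = cho_solve(L, k_st.T)       # (K + sigma^2 I)^{-1} K(X, X*)\n", "    m = k_st @ alpha               # predictive mean\n", "    c = k_ss - k_st @ v            # predictive covariance\n", "    return m, c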
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# training set\n", "x_train = np.array([0.1, 0.2, 0.5, 0.8])\n", "y_train = np.array([-0.1, 0.3, 0.8, 0.1])\n", "\n", "# test set\n", "M = 61\n", "x_pred = np.linspace(0, 1, M)\n", "\n", "# covariance kernel\n", "cov_fun = lambda x, y : np.exp(-(x-y)**2/0.04)\n", "\n", "# perform GPR\n", "y_pred, y_cov = gpr_pred(x_train, y_train, x_pred, cov_fun)\n", "y_var = np.sqrt(np.maximum(np.asarray(y_cov).diagonal(),0))\n", "\n", "# compute some paths\n", "np.random.seed(2)\n", "num_paths=3\n", "fs = np.random.multivariate_normal(mean=y_pred, cov=y_cov, size=num_paths) " ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "7761ad7a52b14d8397cd3fa9e57f05e7", "version_major": 2, "version_minor": 0 }, "text/plain": [ "FigureCanvasNbAgg()" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fig_gpr_pred, ax_gpr_pred = plt.subplots()\n", "fig_gpr_pred.suptitle('GPR Prediction without noise')\n", "ax_gpr_pred.scatter(x_train, y_train, marker='o', label='train', c='r')\n", "ax_gpr_pred.plot(x_pred, y_pred, marker='x', label='predict', c='b')\n", "ax_gpr_pred.fill(np.concatenate([x_pred, x_pred[::-1]]),\n", " np.concatenate([y_pred - 1.96 * y_var,\n", " (y_pred + 1.96 * y_var)[::-1]]), label='95% confidence', \n", " alpha=.2, fc='c', ec='None' )\n", "for i in range(num_paths):\n", " ax_gpr_pred.plot(x_pred, fs[i], label='path %i' % i, linestyle='--', alpha=0.5)\n", "ax_gpr_pred.legend(loc='upper left', prop={'size': 6})\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Remarks:**\n", "* We see that the prediction exactly interpolates the input data points. That is because we are considering the noise-free case.\n", "* We can see how the paths mostly traverse the inner regions of the $95\\%$ confidence zone.\n", "* Unlike for linear regression, we now have a clear picture of uncertainty between the data points and also how quickly uncertainty grows if one extrapolates." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Noisy Case" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we consider the same example as above, but we add some noise to the observations." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "sigma = 0.4\n", "y_train += np.random.normal(0, sigma, size=y_train.shape)\n", "y_pred, y_cov = gpr_pred(x_train, y_train, x_pred, lambda x, y : np.exp(-(x-y)**2/0.04), sigma)\n", "y_var = np.sqrt(np.maximum(np.asarray(y_cov).diagonal(), 0))\n", "\n", "np.random.seed(2)\n", "num_paths=3\n", "fs = np.random.multivariate_normal(mean=y_pred, cov=y_cov, size=num_paths) " ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "d0e0595a87024546a1b3a931b09a00e0", "version_major": 2, "version_minor": 0 }, "text/plain": [ "FigureCanvasNbAgg()" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fig_gpr_pred_noise, ax_gpr_pred_noise = plt.subplots()\n", "fig_gpr_pred_noise.suptitle('GPR Prediction with noise')\n", "ax_gpr_pred_noise.scatter(x_train, y_train, marker='o', label='train', c='r')\n", "ax_gpr_pred_noise.plot(x_pred, y_pred, marker='x', label='predict', c='b')\n", "ax_gpr_pred_noise.fill(np.concatenate([x_pred, x_pred[::-1]]),\n", " np.concatenate([y_pred - 1.96 * y_var,\n", " (y_pred + 1.96 * y_var)[::-1]]), label='confidence', \n", " alpha=.2, fc='c', ec='None' )\n", "for i in range(num_paths):\n", " ax_gpr_pred_noise.plot(x_pred, fs[i], label='path %i' % i, linestyle='--', alpha=0.5)\n", "ax_gpr_pred_noise.legend(loc='upper left', prop={'size': 6})\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Remarks:**\n", "* Due to the noise the prediction no longer interpolates the input data points. Instead, even at those points we now have a non-trivial confidence region.\n", "* However, the confidence region is still smaller around the input data point than at the point between or beyond them." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Hyperparameter Optimization\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the above example, we have used the kernel\n", "\\begin{align*}\n", " k(x,x') := c^2 \\exp(-\\tfrac{1}{2 \\ell^2}|x-x'|^2),\n", "\\end{align*}\n", "for some $c, \\ell \\in \\mathbb{R}$. The function $k$ is called the *exponential kernel*, $\\ell$ is called the *length scale* and $c$ is called the *signal variance*. We also used the *noise variance* $\\sigma^2$. Thus, by using these our predictions depend on the three *hyperparameters* $(\\sigma, c, \\ell)$ and we have assumed that these are known in advance. While this is useful to derive the theory and quickly look into examples, this assumption is unrealistic. In most practical cases, these parameters will not be known and thus have to be determined. Before we delve into the method of doing this, let us first study the impact of these parameters on the prediction." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example: Impact of Hyperparameters" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "974e39ef602141f980fddf50effeef29", "version_major": 2, "version_minor": 0 }, "text/plain": [ "FigureCanvasNbAgg()" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "681c796536424b0eb038f481b3578a78", "version_major": 2, "version_minor": 0 }, "text/plain": [ "interactive(children=(FloatSlider(value=0.4, description='sigma', max=1.0), FloatSlider(value=1.0, description…" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fig_hyp, ax_hyp = plt.subplots()\n", "fig_hyp.suptitle('Impact of Hyperparameters')\n", "\n", "x_train = np.array([0.1, 0.2, 0.5, 0.8])\n", "y_train = np.array([-0.1, 0.3, 0.8, 0.1])\n", "M = 61\n", "x_pred = np.linspace(0, 1, M)\n", "\n", "@wdg.interact(sigma=wdg.FloatSlider(min=0., max=1, value=0.4, steps=10),\n", " c=wdg.FloatSlider(min=1, max=5, value=1, steps=10),\n", " l=wdg.FloatSlider(min=0.01, max=1, value=0.2, steps=10))\n", "def plot_prediction(sigma, c, l):\n", "\n", " cov_fun = lambda x, y : c**2 * np.exp(-(x-y)**2/(2 * l**2))\n", " np.random.seed(1)\n", " y_train_noisy = y_train + np.random.normal(0, sigma, size=y_train.shape)\n", " y_pred, y_cov = gpr_pred(x_train, y_train, x_pred, cov_fun, sigma)\n", " y_var = np.sqrt(np.maximum(np.asarray(y_cov).diagonal(),0))\n", " num_paths=3\n", " fs = np.random.multivariate_normal(mean=y_pred, cov=y_cov, size=num_paths) \n", "\n", " ax_hyp.clear()\n", " ax_hyp.scatter(x_train, y_train, marker='o', label='train', c='r')\n", " ax_hyp.plot(x_pred, y_pred, marker='x', label='predict', c='b')\n", " ax_hyp.fill(np.concatenate([x_pred, x_pred[::-1]]),\n", " np.concatenate([y_pred - 1.96 * y_var,\n", " (y_pred + 1.96 * y_var)[::-1]]), label='confidence', \n", " alpha=.2, fc='c', ec='None' )\n", " for i in range(num_paths):\n", " ax_hyp.plot(x_pred, fs[i], label='path %i' % i, linestyle='--', alpha=0.5)\n", " ax_hyp.legend(loc='lower center', prop={'size': 6})\n", " ax_hyp.set_ylim([-3, 3])\n", " plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Optimize via Marginal Log-Likelihood" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For any continuous random variable, its *log-likelihood* function is simply the logarithm of its density. Thus, for GPR we can consider the variable $Y = F(X) + \\varepsilon$ from the predictive distribution theorem above. This random variable is Gaussian with zero mean and covariance matrix $K(X,X) + \\sigma^2 I $. Notice that the hyperparameters $c, \\ell$ are implicit in the $K$. As the density $p(y)$ of a multivariate Gaussian $Y$ is well known, the log-likelihood of $Y$ is given by\n", "\\begin{align*}\n", " L(y;\\sigma, c, \\ell)\n", " = \\log(p(y; \\sigma, c, \\ell)) \n", " = - \\frac{1}{2} ( \\log(\\det(K(X,X) + \\sigma^2 I) - y^{\\top}(K(X,X) + \\sigma^2 I)^{-1}) y + N \\log(2 \\pi)).\n", "\\end{align*}\n", "\n", "One typical method of chosing the hyperparameters $(\\sigma, c, \\ell)$ is to maximize the log-marginal likelihood at the training data $y$. This is a multivariate non-linear optimization problem, hence, one should use a library function for that, see [tech stack choices](#sect_tech_stack)." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Covariance Kernels" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One should highlight that not only the parameters of the kernel $k$, but also the kernel function $k$ itself is a matter of choice and various alternatives are popular. Most of those are readily available in standard libraries. Notice that kernels can also be combined." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Squared Exponential Kernel (SE)\n", "The *squared exponential kernel* (SE) is the one we have already been using, i.e.\n", "\\begin{align*}\n", " k(x,x') = c^2 \\exp(-\\tfrac{\\|x-x'\\|^2}{2 \\ell}),\n", "\\end{align*}\n", "where $c$ is the signal variance and $\\ell$ is the length scale.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Automatic Relevance Determination Squared Exponential (ARDSE)\n", "This kernel is a slight refinement of the SE kernel and defined by\n", "\\begin{align*}\n", " k(x, x') = c^2 \\exp\\Big( - \\frac{1}{2} \\sum_{i=1}^{d}{\\Big( \\frac{x_i - x'_i}{\\ell_i} \\Big)} \\Big),\n", "\\end{align*}\n", "where $c$ is still the* signal variance* the $\\ell_i$ are celled *individual (characteristic) length scales*." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Matérn Class" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The *Matérn class* is defined by\n", "\\begin{align*}\n", " k(x,x') = c^2 \\frac{2^{1-\\nu}}{\\Gamma(\\nu)} \\Big( \\frac{\\sqrt{2 \\nu} \\|x-x'\\|^2}{\\ell} \\Big) K(\\nu) \\Big( \\frac{\\sqrt{2 \\nu} \\|x-x'\\|^2}{\\ell} \\Big),\n", "\\end{align*}\n", "where $\\Gamma$ is [Euler's Gamma function](https://en.wikipedia.org/wiki/Gamma_function) and $K_\\nu$ is the [modified Bessel function](https://en.wikipedia.org/wiki/Bessel_function#Modified_Bessel_functions:_Iα,_Kα)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Tech Stack\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The toy implementation used in the examples is not recommended for production use. For bigger problem instances it is advisable to use a library that handles the data, the kernel functions and the numerical tasks such as hyperparameter optimization. We present the two very common libraries [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.GaussianProcessRegressor.html) and [GPy](https://sheffieldml.github.io/GPy/) and re-compute the above examples with these libraries." 
] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "# example data\n", "x_train = np.array([0.1, 0.2, 0.5, 0.8])\n", "y_train = np.array([-0.1, 0.3, 1.8, 0.1])\n", "np.random.seed(1)\n", "y_train_noisy = y_train + np.random.normal(0, sigma, size=y_train.shape)\n", "M = 61\n", "x_pred = np.linspace(0, 1, M)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## GPy" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "import GPy" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Optimization restart 1/10, f = 5.047986574363724\n", "Optimization restart 2/10, f = 4.92214006226174\n", "Optimization restart 3/10, f = 5.047986828828963\n", "Optimization restart 4/10, f = 4.922134776292944\n", "Optimization restart 5/10, f = 5.047986597397468\n", "Optimization restart 6/10, f = 4.922134768704536\n", "Optimization restart 7/10, f = 4.922135309524814\n", "Optimization restart 8/10, f = 5.047986793712312\n", "Optimization restart 9/10, f = 5.047986585252078\n", "Optimization restart 10/10, f = 5.04798663150114\n" ] } ], "source": [ "# specify the desired kernel\n", "kernel_gpy = GPy.kern.RBF(input_dim=1)\n", "\n", "# instantiate the GPR class with training data\n", "# (all data must be of format (num_samples, num_dimensions))\n", "gpy = GPy.models.GPRegression(np.atleast_2d(x_train).T,\n", " np.atleast_2d(y_train_noisy).T,\n", " kernel_gpy)\n", "\n", "# fit the data\n", "#gpy.optimize(messages=True) # this performs only one restart of the optimizer\n", "gpy.optimize_restarts(num_restarts = 10) # often better to perform multiple\n", "\n", "# perform the prediction\n", "y_pred_mean_gpy, y_pred_cov_gpy = gpy.predict(np.atleast_2d(x_pred).T, full_cov=True)\n", "y_pred_mean_gpy = y_pred_mean_gpy.squeeze()\n", "y_pred_std_gpy = np.sqrt(y_pred_cov_gpy.diagonal()).squeeze()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "

\n", "Model: GP regression
\n", "Objective: 4.922134768704536
\n", "Number of Parameters: 3
\n", "Number of Optimization Parameters: 3
\n", "Updates: True
\n", "

\n", "\n", "\n", "\n", "\n", "\n", "
GP_regression. valueconstraintspriors
rbf.variance 0.7846753171664994 +ve
rbf.lengthscale 0.10664893213350811 +ve
Gaussian_noise.variance3.009352837717333e-08 +ve
" ], "text/plain": [ "" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gpy # let's take a look at the result" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "# notice how GPy has optimized all 3 parameters including the noise" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "5d7f5de67e71497b940c543e56e5e9d4", "version_major": 2, "version_minor": 0 }, "text/plain": [ "FigureCanvasNbAgg()" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fig_gpy, ax_gpy = plt.subplots()\n", "gpy.plot(ax=ax_gpy) # very easy plotting of results\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'name': 'GP regression',\n", " 'class': 'GPy.models.GPRegression',\n", " 'X': [[0.1], [0.2], [0.5], [0.8]],\n", " 'Y': [[0.5497381454652968],\n", " [0.055297434539969825],\n", " [1.5887312990946176],\n", " [-0.3291874488624682]],\n", " 'kernel': {'input_dim': 1,\n", " 'active_dims': [0],\n", " 'name': 'rbf',\n", " 'useGPU': False,\n", " 'variance': [0.7846753171664994],\n", " 'lengthscale': [0.10664893213350811],\n", " 'ARD': False,\n", " 'class': 'GPy.kern.RBF',\n", " 'inv_l': False},\n", " 'likelihood': {'name': 'Gaussian_noise',\n", " 'gp_link_dict': {'class': 'GPy.likelihoods.link_functions.Identity'},\n", " 'class': 'GPy.likelihoods.Gaussian',\n", " 'variance': [3.009352837717333e-08]},\n", " 'inference_method': {'class': 'GPy.inference.latent_function_inference.exact_gaussian_inference.ExactGaussianInference'}}" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gpy.to_dict() # transforms all the relevant data of the model into a dict" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "noise_var_gpy = gpy.to_dict()['likelihood']['variance'][0] # save for later\n", "signal_var_gpy = gpy.to_dict()['kernel']['variance'][0]\n", "lengthscale_gpy = gpy.to_dict()['kernel']['lengthscale'][0]" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(3.009352837717333e-08, 0.7846753171664994, 0.10664893213350811)" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "noise_var_gpy, signal_var_gpy, lengthscale_gpy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## scikit-learn" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import PolynomialFeatures\n", "from sklearn.gaussian_process.kernels import RBF, ConstantKernel\n", "from sklearn.gaussian_process import GaussianProcessRegressor" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "# specify kernel with initial parameters and the bounds within to optimize\n", "kernel = ConstantKernel() * RBF()\n", "\n", "# instantiate the GPR with pre-specified noise parameter alpha\n", "# (we use the noise (alpha parameter) from GPy here for comparison but of course any value is possible)\n", "gpr = GaussianProcessRegressor(kernel=kernel, alpha=noise_var_gpy, n_restarts_optimizer=50)\n", "\n", "# fit the GPR to the data\n", "# (this class assumes that all arrays are of the format (num_samples, num_dimensions))\n", "gpr.fit(np.atleast_2d(x_train).T, np.atleast_2d(y_train_noisy).T)\n", "\n", "# 
run the prediction\n", "y_pred_mean_sk, y_pred_std_sk = gpr.predict(np.atleast_2d(x_pred).T, return_std=True)\n", "y_pred_mean_sk = y_pred_mean_sk.squeeze() # for univariate data the second dimension is spurious\n", "y_pred_std = y_pred_std_sk.squeeze()" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'k1': 0.886**2,\n", " 'k2': RBF(length_scale=0.107),\n", " 'k1__constant_value': 0.7846749475148752,\n", " 'k1__constant_value_bounds': (1e-05, 100000.0),\n", " 'k2__length_scale': 0.10664884328209365,\n", " 'k2__length_scale_bounds': (1e-05, 100000.0)}" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gpr.kernel_.get_params()" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "signal_var_sk = gpr.kernel_.get_params()['k1__constant_value']\n", "lengthscale_sk = gpr.kernel_.get_params()['k2__length_scale']" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "d71a2c3261e94f819c3d2812852d273c", "version_major": 2, "version_minor": 0 }, "text/plain": [ "FigureCanvasNbAgg()" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fig_gpr, ax_gpr = plt.subplots()\n", "fig_gpr.suptitle('GPR in scikit-learn')\n", "ax_gpr.scatter(x_train, y_train_noisy, marker='o', label='train', c='r')\n", "ax_gpr.plot(x_pred, y_pred_mean_sk, marker='x', label='predict', c='b')\n", "ax_gpr.fill(np.concatenate([x_pred, x_pred[::-1]]),\n", " np.concatenate([y_pred_mean_sk - 1.96 * y_pred_std_sk,\n", " (y_pred_mean_sk + 1.96 * y_pred_std_sk)[::-1]]), label='confidence', \n", " alpha=.2, fc='c', ec='None' )\n", "\n", "ax_gpr.legend(loc='lower center', prop={'size': 6})\n", "ax_gpr.set_ylim([-3, 3])\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sanity Check" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The GPy and the scikit-learn implementations should give the same result as the ad-hoc implementation from the earlier example. We validate that this is the case." 
] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "cov_fun = lambda x, y : signal_var_sk * np.exp(-(x-y)**2/(2 * lengthscale_sk**2))\n", "y_pred_mean_adh, y_pred_cov_adh = gpr_pred(x_train, y_train_noisy, x_pred, cov_fun, np.sqrt(noise_var_gpy))\n", "y_pred_std_adh = np.sqrt(np.maximum(np.asarray(y_pred_cov_adh).diagonal(),0))" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "c91624bce6034b31bdfd612f3a09cadc", "version_major": 2, "version_minor": 0 }, "text/plain": [ "FigureCanvasNbAgg()" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fig_check, ax_check = plt.subplots()\n", "\n", "ax_check.set_title('Predictions')\n", "ax_check.scatter(x_train, y_train_noisy, marker='o', label='train', c='r')\n", "ax_check.plot(x_pred, y_pred_mean_gpy, marker='x', label='predict GPy', c='g')\n", "ax_check.plot(x_pred, y_pred_mean_sk, marker='x', label='predict scikit', c='b')\n", "ax_check.plot(x_pred, y_pred_mean_adh, marker='x', label='predict ad-hoc', c='y')\n", "ax_check.fill(np.concatenate([x_pred, x_pred[::-1]]),\n", " np.concatenate([y_pred_mean_gpy - 1.96 * y_pred_std_gpy,\n", " (y_pred_mean_gpy + 1.96 * y_pred_std_gpy)[::-1]]), label='confidence GPy',\n", " alpha=.2, fc='c', ec='None' )\n", "ax_check.fill(np.concatenate([x_pred, x_pred[::-1]]),\n", " np.concatenate([y_pred_mean_sk - 1.96 * y_pred_std_sk,\n", " (y_pred_mean_sk + 1.96 * y_pred_std_sk)[::-1]]), label='confidence scikit',\n", " alpha=.2, fc='m', ec='None' )\n", "ax_check.fill(np.concatenate([x_pred, x_pred[::-1]]),\n", " np.concatenate([y_pred_mean_adh - 1.96 * y_pred_std_adh,\n", " (y_pred_mean_adh + 1.96 * y_pred_std_adh)[::-1]]), label='confidence ad-hoc',\n", " alpha=.2, fc='y', ec='None' )\n", "ax_check.legend(loc='lower left', prop={'size': 6})\n", "ax_check.set_ylim([-3, 3])\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, all the predictions and confidence bounds of the three methods agree." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Appendix A: GPR as a Bayesian View of Linear Regression\n", "\n", "The setup of Gaussian process regression is similar to that of linear regression. In fact, one can show that a Bayesian view of linear regression is equivalent to Gaussian process regression with a bilinear kernel. We derive the details in the following section." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reminder of Linear Regression (LR)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In linear regression, the labelled input data set is the same as in GPR, i.e. we are given $x_i \\in \\mathbb{R}^d$, $y_i \\in \\mathbb{R}$, $i=1, \\ldots, N$. 
\n", "\n", "**Feature Extraction:** We usually do not work on the $x_i$ directly, but first extract some features via a *feature map* $\\phi:\\mathbb{R}^{d} \\to \\mathbb{R}^m$, for example the popular *polynomial features map of degree $m-1$:\n", "\\begin{align*}\n", " x \\mapsto \\begin{pmatrix}\n", " 1 & x & x^2 \\ldots & x^m\n", " \\end{pmatrix}^{\\top}.\n", "\\end{align*}\n", "\n", "**Linear Hypothesis:** We want to fit a linear hypothesis \n", "\\begin{align*}\n", " f=f_{\\beta}:\\mathbb{R}^d \\to \\mathbb{R}, \\quad\n", " x \\mapsto \\phi(x)^{\\top} \\beta\n", "\\end{align*}\n", "to that data by finding an optimal $\\beta \\in \\mathbb{R}^m$.\n", "\n", "**Assumtions:** We assume that for any $i=1, \\ldots, N$,\n", "\\begin{align*}\n", " y_i = f_{\\beta}(x_i) + \\varepsilon_i,\n", "\\end{align*}\n", "where $\\varepsilon_i$ are realizations of an iid random variable $\\varepsilon$ with zero mean called *noise*. \n", "\n", "**Cost Functional:** The problem of finding $\\beta$ is typically achieved by solving the *least squares problem*\n", "\\begin{align*}\n", " \\hat \\beta = \\operatorname{argmin}_{\\beta}{\\sum_{i=1}^{N}{(f_{\\beta}(x_i)-y_i)^2}}.\n", "\\end{align*}\n", "\n", "**Design Matrix:** To solve the least squares problem, we collect all the inputs in the *design matrix* (written as column matrix)\n", "\\begin{align*}\n", " Phi(X) := \\begin{pmatrix}\n", " \\varphi(x_1) &\n", " \\ldots &\n", " \\varphi(x_N)\n", " \\end{pmatrix} \\in \\mathbb{R}^{m \\times N}\n", "\\end{align*}\n", "\n", "**Normal Equations:** The solution $\\hat \\beta $ to the least squares problem is given by the *normal equation*\n", "\\begin{align*}\n", " \\hat \\beta = (\\Phi(X) \\Phi(X)^{\\top})^{-1} \\Phi y \\in \\mathbb{R}^{m}\n", "\\end{align*}\n", "under the assumption that $\\Phi \\Phi^{\\top} \\in \\mathbb{R}^{m \\times m}$ is invertible. For practical applications with $N \\gg m$, this is typically the case.\n", "\n", "**Predictions:** We can then make predictions on new data $x^* \\in \\mathbb{R}^m$ via \n", "\\begin{align*}\n", " f_{\\hat \\beta}(x^*) = \\phi(x^*)^{\\top} \\beta \\in \\mathbb{R} \n", "\\end{align*}\n", "In case of multiple predictions, we can collect them in a matrix $X^* \\in \\mathbb{R}^{m \\times l}$ and apply the predictor $f_{\\hat \\beta}$ to every component resulting in\n", "\\begin{align*}\n", " F(X^*) := \\Phi(X^*)^{\\top} \\hat \\beta = \\Phi(X^*)^{\\top} (\\Phi(X) \\Phi(X)^{\\top})^{-1} \\Phi(X) y \\in \\mathbb{R}^l\n", "\\end{align*}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bayesian View of Linear Regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Setup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Model Assumptions:** For the Bayesian view of linear regression, let\n", "* $(\\Omega, \\mathcal{F}, \\mathbb{P})$ be a probability space\n", "* $\\mathscr{x}_i$ be $\\mathbb{R}^d$-valued iid random variables\n", "* $\\varepsilon_i$ are $\\mathbb{R}$-valued iid and Gaussian, i.e. 
$\\varepsilon_i \\sim \\mathcal{N}(0, \\sigma^2)$,\n", "* $b$ be an $\\mathbb{R}^m$-valued random variable,\n", "* $\\mathscr{x}_i$ and $\\varepsilon_i$ be independent, $i=1, \\ldots, N$,\n", "* $\\mathscr{y}_i = \\phi(\\mathscr{x}_i)^{\\top} b + \\varepsilon_i$.\n", "\n", "We set $\\mathscr{Y} = (\\mathscr{y}_1, \\ldots, \\mathscr{y}_N)$, $\\mathscr{X}=(\\mathscr{x}_1, \\ldots, \\mathscr{x}_N)$ and $\\varepsilon = (\\varepsilon_1, \\ldots, \\varepsilon_N)$, thus we can write\n", "\\begin{align*}\n", " \\mathscr{Y} = \\Phi(\\mathscr{X})^{\\top} b + \\varepsilon.\n", "\\end{align*}\n", "\n", "**Input Data:** We still assume that we are given the same input data, i.e.\n", "* $X=(x_1, \\ldots, x_N)$, $x_i \\in \\mathbb{R}^d$,\n", "* $Y=(y_1, \\ldots, y_N)$, $y_i \\in \\mathbb{R}$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Predictive Distribution" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now describe how to fit the model to the data. The definitions immediately imply the following.\n", "\n", "**Lemma**: The conditional distribution of $\\mathscr{Y}$ given $\\mathscr{X}=X$ and $b=\\beta$ is\n", "\\begin{align*}\n", " \\mathscr{Y} \\mid \\mathscr{X}=X, b=\\beta \\sim \\mathcal{N}(\\Phi(X)^{\\top} \\beta, \\sigma^2 I).\n", "\\end{align*}\n", "\n", "Using Bayes' theorem for densities one can prove the following:\n", "\n", "**Theorem**: If $b \\sim \\mathcal{N}(0, \\Sigma)$, then\n", "\\begin{align*}\n", " b \\mid \\mathscr{X}=X, \\mathscr{Y}=Y \\sim \\mathcal{N} (\\bar \\beta, A^{-1}),\n", "\\end{align*}\n", "where\n", "\\begin{align*}\n", " A:= \\sigma^{-2} \\Phi(X) \\Phi(X)^{\\top} + \\Sigma^{-1} \\in \\mathbb{R}^{m \\times m}, && \\bar \\beta := \\sigma^{-2} A^{-1}\\Phi(X) Y \\in \\mathbb{R}^{m},\n", "\\end{align*}\n", "under the assumption that $A$ is invertible.\n", "\n", "**Remark**: Notice that $\\bar \\beta$ is thus given by\n", "\\begin{align*}\n", " \\bar \\beta = (\\Phi(X) \\Phi(X)^{\\top} + \\sigma^2 \\Sigma^{-1})^{-1} \\Phi(X) Y,\n", "\\end{align*}\n", "which is the same as the $\\hat \\beta$ in LR above, but with the $\\sigma^2 \\Sigma^{-1}$ term added.\n", "\n", "**Lemma**: Let $\\mathscr{x}^*$ be an $\\mathbb{R}^d$-valued random variable, $x^* \\in \\mathbb{R}^d$, and $f(\\mathscr{x}^*) := f_{\\bar \\beta}(\\mathscr{x}^*)$. Then\n", "\\begin{align*}\n", " f(\\mathscr{x}^*) \\mid \\mathscr{x}^*=x^*, \\mathscr{X}=X, \\mathscr{Y}=Y \\sim \\mathcal{N}(\\phi(x^*)^{\\top} \\bar \\beta , \\phi(x^*)^{\\top} A^{-1} \\phi(x^*)).\n", "\\end{align*}\n", "This is called the *predictive distribution*. \n", "\n", "In case of multiple predictions $X^*$, which are $\\mathbb{R}^{d \\times l}$-valued, we obtain analogously \n", "\\begin{align*}\n", " F(\\mathscr{X}^*) \\mid \\mathscr{X}^* = X^*, \\mathscr{X}=X, \\mathscr{Y}=Y \\sim \\mathcal{N}(\\Phi(X^*)^{\\top} \\bar \\beta , \\Phi(X^*)^{\\top} A^{-1} \\Phi(X^*)).\n", "\\end{align*}\n", "\n", "Predictions are then made by taking the mean of the predictive distribution."
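 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As an illustration (not part of the original derivation), here is a minimal NumPy sketch of these formulas for the polynomial feature map, reusing the example data from the tech stack section above; the prior covariance $\\Sigma = I$, the noise level and the number of features are arbitrary choices." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# minimal sketch of the Bayesian linear regression formulas, assuming Sigma = I\n", "sigma_blr = 0.1   # assumed noise standard deviation\n", "m_deg = 3         # number of polynomial features: 1, x, x^2\n", "Phi = np.vander(x_train, m_deg, increasing=True).T   # design matrix, shape (m, N)\n", "Sigma_prior = np.eye(m_deg)\n", "\n", "A = Phi @ Phi.T / sigma_blr**2 + np.linalg.inv(Sigma_prior)\n", "beta_bar = np.linalg.solve(A, Phi @ y_train_noisy) / sigma_blr**2\n", "\n", "Phi_pred = np.vander(x_pred, m_deg, increasing=True).T\n", "mean_blr = Phi_pred.T @ beta_bar                              # predictive mean\n", "var_blr = np.diag(Phi_pred.T @ np.linalg.solve(A, Phi_pred))  # predictive variances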
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Bilinear Kernel Function\n", "One can re-write the mean and the variance of the predictive distribution via the *kernel function*\n", "\\begin{align*}\n", " k:\\mathbb{R}^{m} \\times \\mathbb{R}^m \\to \\mathbb{R}, &&\n", " (x, x') \\mapsto \\phi(x)^{\\intercal} \\Sigma \\phi(x')\n", "\\end{align*}\n", "and as usual we employ the convention that for any two (column) matrices (of vectors) $V \\in \\mathbb{R}^{m \\times r}$, $W \\in \\mathbb{R}^{m \\times s}$\n", "\\begin{align*}\n", " K(V,W) := (k(v_{i},w_j))_{1 \\leq i \\leq r, 1 \\leq j \\leq s} \\in \\mathbb{R}^{r \\times s}\n", "\\end{align*}\n", "\n", "**Theorem (predictive distribution with bilinear kernel)**: For any $\\mathbb{R}^{d \\times l}$-valued $\\mathscr{X}^*$ and any data $X^* \\in \\mathbb{R}^{d \\times l}$, the following holds: If $K(X,X) + \\sigma^2 I \\in \\mathbb{R}^{N \\times N}$ is invertible, then the predictive distribution is given as\n", "\\begin{align*}\n", " F(\\mathscr{X}^*) \\mid \\mathscr{X}^*=X^*, \\mathscr{X}=X, \\mathscr{Y}=Y \\sim \\mathcal{N}(m^*, k^*),\n", "\\end{align*}\n", "where\n", "\\begin{align*}\n", " m^* &= K(X^*,X) (K(X,X) + \\sigma^2 I)^{-1}Y \\\\\n", " k^* &= K(X^*,X^*) - K(X^*,X)(K(X,X) + \\sigma^2 I)^{-1}K(X,X^*)).\n", "\\end{align*}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Proof**: To see the claim about the mean, notice that\n", "\\begin{align*}\n", " A_{\\Phi} \\Sigma \\Phi(X)\n", " & = (\\sigma^{-2}\\Phi(X)\\Phi(X)^{\\top} + \\Sigma^{-1}) \\Sigma \\Phi(X) \\\\\n", " &= \\sigma^{-2}(\\Phi(X)\\Phi(X)^{\\top}\\Sigma \\Phi(X)^{\\top} + \\sigma^{2} \\Phi(X)) \\\\\n", " &= \\sigma^{-2} \\Phi(X) (\\Phi(X)^{\\top}\\Sigma \\Phi(X) + \\sigma^{2} I) \\\\\n", " &= \\sigma^{-2} \\Phi(X) (K(X,X) + \\sigma^{2} I),\n", "\\end{align*}\n", "thus\n", "\\begin{align*}\n", " \\sigma^{-2} A_{\\Phi}^{-1} \\Phi(X) & = \\Sigma \\Phi(X)(K(X,X) + \\sigma^{2} I)^{-1} \\\\\n", " \\Longrightarrow \\bar w_{\\Phi} &= \\sigma^{-2} A_{\\Phi}^{-1} \\Phi(X)Y = \\Sigma \\Phi(X)(K(X,X) + \\sigma^{2} I)^{-1}Y \\\\\n", " \\Longrightarrow \\Phi(X^*)^{\\top} \\bar w_{\\Phi}& = \\Phi(X^*)^{\\top} \\Sigma \\Phi(X)(K + \\sigma^{2} I)^{-1}Y = K(X^*,X) (K(X,X) + \\sigma^{2} I)^{-1}Y.\n", "\\end{align*}\n", "To see the claim about the variance, we apply the matrix inversion lemma (see appendix below) to $Z^{-1} := \\Sigma$, $W^{-1} := \\sigma^{2} I$, $V := U := \\Phi(X)$. This results in \n", "\\begin{align*}\n", " A^{-1} = (\\sigma^{-2} \\Phi(X) \\Phi(X)^{\\top} + \\Sigma)^{-1} = \\Sigma - \\Sigma \\Phi(X)(\\sigma^{2}I + K(X,X))^{-1} \\Phi(X)^{\\top} \\Sigma,\n", "\\end{align*}\n", "which implies the claim." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Remark (Kernel Trick):**\n", "* Notice that in the above expressions for $m^*$ and $k^*$, we only need to know expressions of the form \n", "\\begin{align}\n", " \\Phi(x)^\\top \\Sigma_p \\Phi(x), && \\Phi(x^*)^\\top \\Sigma_p \\Phi(x), && \\Phi(x^*)^\\top \\Sigma_p \\Phi(x^*)\n", "\\end{align}\n", "to evaluate the kernel $k$ on $X$ and $X^*$.\n", "* Lifting from input space $x$ to feature space $\\Phi(x)$ is called the *kernel trick*." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### GPR vs LR\n", "If we compare the predictions $F_*^{\\operatorname{LR}}$ made by linear regression LR and $F_*^{\\operatorname{GPR}}$ made by Gaussian Process regression with noise $\\sigma$, covariance prior $\\Sigma:=$, and $\\Phi(X)=X$, we obtain\n", "\\begin{align*}\n", " F_*^{\\operatorname{LR}} &= \\Phi(X^*)^{\\top} (\\Phi(X) \\Phi(X)^{\\top})^{-1} \\Phi(X)Y \\\\\n", " F_*^{\\operatorname{GPR}} &= \\Phi(X^*)^{\\top}(\\Phi(X)\\Phi(X)^{\\top} + \\sigma^2 I)^{-1} XY = K(X^*,X) (K(X,X) + \\sigma^2 I)^{-1}Y \\\\\n", " &=\\Phi(X^*)^{\\top} \\Phi(X)(\\Phi(X)^{\\top}\\Phi(X) + \\sigma^2 I)^{-1} Y\\\\\n", "\\end{align*}\n", "under the assumption that both $\\Phi(X)\\Phi(X)^{\\top} + \\sigma^2 I \\in \\mathbb{R}^{m \\times m}$ and $\\Phi(X)^{\\top}\\Phi(X) + \\sigma^2 I \\in \\mathbb{R}^{N \\times N}$ are invertible. As typically, $N \\gg m$, $\\Phi(X)\\Phi(X)^{\\top}\\in \\mathbb{R}^{m \\times m}$ is typically invertible and thus $\\Phi(X)\\Phi(X)^{\\top} + \\sigma^2 I$ will also be invertible even for $\\sigma=0$. The matrix $\\Phi(X)^{\\top}\\Phi(X) \\in \\mathbb{R}^{N \\times N}$ will never be invertible, however, $\\Phi(X)^{\\top}\\Phi(X) + \\sigma^2 I \\in \\mathbb{R}^{N \\times N}$ may or may not be invertible depending on $\\sigma$. For $\\sigma$ large enough it is invertible, but for $\\sigma \\searrow 0$ it will become singular at some point. The fact that the bilinear kernel without noise can get singular motivates the choice of alternative kernel functions such as the exponential kernel.\n", "\n", "One can illustrate this comparison numerically as follows:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "b0327091d5fb4ddbb50d249ecb5dada5", "version_major": 2, "version_minor": 0 }, "text/plain": [ "FigureCanvasNbAgg()" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "f4abbb3049054e07a72d0f26321d2da4", "version_major": 2, "version_minor": 0 }, "text/plain": [ "interactive(children=(FloatSlider(value=10.0, description='sigma', max=10.0), Output()), _dom_classes=('widget…" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from sklearn.linear_model import LinearRegression\n", "\n", "fig, ax = plt.subplots()\n", "\n", "@wdg.interact(sigma=wdg.FloatSlider(min=0, max=10, value=10, step=0.1))\n", "def plot_comparison(sigma):\n", " N = 100\n", " x = np.linspace(-1,1,N)\n", " #e = np.random.normal(0, 0.1, N)\n", " y = x**3\n", " # GPR\n", " #kernel = DotProduct(sigma_0=0, sigma_0_bounds='fixed')\n", " kernel = ConstantKernel(1.0, (1e-3, 1e10)) * RBF(100, (1e-3, 1e10))\n", " gp = GaussianProcessRegressor(kernel=kernel, alpha=sigma, n_restarts_optimizer=10)\n", " gp.fit(np.atleast_2d(x).T, np.atleast_2d(y).T)\n", " y_pred_gpr, gp_sigma = gp.predict(np.atleast_2d(x).T, return_std=True)\n", " #LR\n", " d=2\n", " poly = PolynomialFeatures(degree=d)\n", " X_ = poly.fit_transform(np.atleast_2d(x).T)\n", " reg = LinearRegression()\n", " reg.fit(X_, np.atleast_2d(y).T)\n", " y_pred_cv_reg = reg.predict(poly.fit_transform(np.atleast_2d(x).T))\n", " ax.clear()\n", " ax.scatter(x, y, marker='o', s=1, label='truth')\n", " ax.plot(x, y_pred_gpr, label=\"GPR\", color='g')\n", " ax.plot(x, y_pred_cv_reg, label=\"LR deg=2\", color='r')\n", " ax.legend(prop={'size': 6}) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Appendix B: 
Mathematical Background" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Multivariate Normal Distribution" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Definition (univariate normal)**: A real-valued random variable $X$ is *Gaussian* or *normal* if it has a density $p$ with respect to the Lebesgue measure and there are $\\mu, \\sigma \\in \\mathbb{R}$, $\\sigma >0$, such that\n", "\\begin{align*}\n", " \\forall x \\in \\mathbb{R}: p(x) = \\frac{1}{\\sqrt{2 \\pi \\sigma^2}} \\exp\\Big(-\\frac{(x-\\mu)^2}{2 \\sigma^2}\\Big).\n", "\\end{align*}\n", "In this case, we write $X \\sim \\mathcal{N}(\\mu, \\sigma^2)$. We call $\\mu$ the *mean* and $\\sigma^2$ the *variance*. \n", "In case $\\mathbb{P}[X=\\mu]=1$, we also write $X \\sim \\mathcal{N}(\\mu, 0)$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Definition (multivariate normal)**: An $\\mathbb{R}^d$-valued random variable $X$ is *Gaussian* if there exists $\\mu \\in \\mathbb{R}^d$ and a positive semi-definite $\\Sigma \\in \\mathbb{R}^{d \\times d}$ such that\n", "\\begin{align*}\n", " \\forall v \\in \\mathbb{R}^d: v^{\\intercal} X \\sim \\mathcal{N}(v^{\\intercal} \\mu, v^{\\intercal} \\Sigma v).\n", "\\end{align*}\n", "In this case, we write $X \\sim \\mathcal{N}(\\mu, \\Sigma)$ and $\\mu$ is called the *mean vector* and $\\Sigma$ is called the *covariance matrix*.\n", "\n", "Notice that this implies that the components satisfy $X_i \\sim \\mathcal{N}(\\mu_i, \\Sigma_{ii})$ and $\\operatorname{Cov}[X_i, X_j] = \\Sigma_{ij}$. Conversely, if $X_1, \\ldots, X_d$ are independent and $X_i \\sim \\mathcal{N}(\\mu_i, \\sigma_i^2)$, then $X=(X_1, \\ldots, X_d) \\sim \\mathcal{N}(\\mu, \\operatorname{diag}(\\sigma_i^2))$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Lemma (density)**: If $X$ is multivariate Gaussian and $\\Sigma$ is positive definite, then $X$ has a density satisfying\n", "\\begin{align*}\n", " \\forall x \\in \\mathbb{R}^d: p(x) = (2 \\pi)^{-\\tfrac{d}{2}} \\det(\\Sigma)^{-\\tfrac{1}{2}} \\exp(-\\tfrac{1}{2}(x-\\mu)^{\\intercal} \\Sigma^{-1}(x-\\mu)).\n", "\\end{align*}\n", "The matrix $K := \\Sigma^{-1}$ is called the *concentration* of the distribution." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Lemma (linear transformations)**: Let $X \\sim \\mathcal{N}(\\mu, \\Sigma)$ of dimension $d$ and $A \\in \\mathbb{R}^{r \\times d}$ and $b \\in \\mathbb{R}^r$. Then $Y := AX + b \\sim \\mathcal{N}(A \\mu + b, A \\Sigma A^{\\intercal})$ of dimension $r$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "**Theorem (partition and condition):** Let $X=(X_1, X_2)$, where $X_i$ are $\\mathbb{R}^{d_i}$-valued. If $X \\sim \\mathcal{N}(\\mu, \\Sigma)$ of dimension $d:=d_1 + d_2$, then $X_i \\sim \\mathcal{N}(\\mu_i, \\Sigma_{ii})$ of dimension $d_i$, where\n", "\\begin{align*}\n", " \\mu = \\begin{pmatrix}\n", " \\mu_1 \\\\\n", " \\mu_2\n", " \\end{pmatrix}, \n", " && \\Sigma = \\begin{pmatrix}\n", " \\Sigma_{11} & \\Sigma_{12} \\\\\n", " \\Sigma_{21} & \\Sigma_{22}\n", " \\end{pmatrix} \n", "\\end{align*}\n", "with $\\mu_i \\in \\mathbb{R}^{d_i}$, $\\Sigma_{ii} \\in \\mathbb{R}^{d_i \\times d_i}$, $i=1,2$. 
\n", "\n", "In case $\\Sigma_{22}$ is regular, then the conditional distribution of $X_1$ given $X_2=x_2$ satisfies\n", "\\begin{align*}\n", " \\forall x_2 \\in \\mathbb{R}^{d_2}: X_1 \\mid X_2=x_2 \\sim \\mathcal{N}(\\mu_{1 \\mid 2}, \\Sigma_{1 \\mid 2}),\n", "\\end{align*}\n", "where\n", "\\begin{align*}\n", " \\mu_{1 \\mid 2} := \\mu_1 + \\Sigma_{12} \\Sigma_{22}^{-1}(x_2 - \\mu_2), && \\Sigma_{1 \\mid 2} := \\Sigma_{11} - \\Sigma_{12} \\Sigma_{22}^{-1} \\Sigma_{21}.\n", "\\end{align*}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Matrix Inversion Lemma\n", "\n", "**Lemma**: Let $Z \\in \\mathbb{R}^{n \\times n}$, $W \\in \\mathbb{R}^{m \\times m}$, $U,V \\in \\mathbb{R}^{n \\times m}$. If $Z$, $W$ and $Z+UWV^{\\top}$ are invertible, then\n", "\\begin{align*}\n", " (Z+ UWV^{\\top})^{-1}\n", " &= Z^{-1} + Z^{-1} U (W^{-1} + V^{\\top} Z^{-1} U) V^{\\top} Z^{-1}\n", "\\end{align*}" ] } ], "metadata": { "kernelspec": { "display_name": "heroku", "language": "python", "name": "heroku" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.12" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 2 }