{ "cells": [ { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# From Gaussian Algebra to Gaussian Processes (Part 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the previous post, we covered the following topics:\n", "\n", "- A Gaussian process (GP) defines a distribution over functions (i.e. function evaluations). √\n", "- Marginalizing a Gaussian over a subset of its elements gives another Gaussian (just pluck out the pieces of interest). √\n", "- Conditioning a subset of the elements of a Gaussian on another subset gives another Gaussian (a simple algebraic formula). √\n", "- The posterior over functions is the linear map of the posterior over weights onto some matrix $A = \\phi(X_{*})^T$. √\n", "- Covariances are the second thing we need in order to specify a multivariate Gaussian. √\n", "\n", "**If any of the above is still unclear, please read no further; instead, revisit the [previous post]().**\n", "\n", "However, we did not directly cover:\n", "\n", "- Kernels\n", "- Squared-exponentials\n", "\n", "Here, we'll explain these two." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# The more features we use, the more expressive our model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We concluded the previous post by plotting posteriors over function evaluations given various `phi_func`s, i.e. functions that create \"features\" $\\phi(X)$ given an input $X$. 
\n", "\n", "For example:\n", "\n", "```python\n", "X_train = np.array([-3, -5, 6, 2, 1]) # 5 inputs\n", "y_train = np.array([1, 4, 2, 9, 4]) # 5 corresponding outputs, which we'll use below\n", "\n", "def phi_func(x):\n", "    return np.array([3 * np.cos(x), np.abs(x - np.abs(x - 3))]) # makes D=2 features for each input\n", "\n", "\n", ">>> phi_func(X_train).shape\n", "(2, 5)\n", "```\n", "\n", "One common set of such features is given by \"radial basis functions\", a.k.a. the \"squared exponential\" function, defined as:\n", "\n", "```python\n", "def phi_func(x, D):\n", "    return np.array([np.exp(-.5 * (x - d)**2) for d in range(int(-D / 2), int(D / 2))]) # phi_x.shape: (D, len(x))\n", "```\n", "\n", "Again, the choice of which features to use is ultimately arbitrary, i.e. a choice left to the modeler.\n", "\n", "Throughout the exercise, we saw that the larger the dimensionality $D$ of our feature function `phi_func`, the more expressive, i.e. the less prone to underfitting, our model became.\n", "\n", "**So, how far can we take this?**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Computing features is expensive\n", "\n", "Ideally, we'd compute as many features as possible for each input element, i.e. employ `phi_func(x, D=some_huge_number)`. Tragically, the cost of doing so adds up, and ultimately becomes intractable past meaningfully large values of $D$.\n", "\n", "**Perhaps there's a better way?**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# How are these things used?\n", "\n", "Let's bring back our GP equations, and prepare ourselves to *squint*! In the previous post, we outlined the following modeling process:\n", "\n", "1. Define a prior distribution over weights and function evaluations, $P(w, y)$.\n", "2. Given some observed function evaluations $y$, condition on them to compute the posterior distribution over weights, $P(w\\vert y)$.\n", "3. Linear-mapping $P(w\\vert y)$ onto the features of some new test input, $\\phi(X_*)^T$, compute the posterior distribution over function evaluations, $P(y_*\\ \\vert\\ y) = P(\\phi(X_{*})^Tw\\ \\vert\\ y)$.\n", "\n", "Now, let's unpack #2 and #3.\n", "\n", "### $P(w\\vert y)$\n", "\n", "- First, the mathematical equation:\n", "\n", "$$\n", "\\begin{align*}\n", "P(w\\vert y)\n", " &= \\mathcal{N}(\\mu_w + \\Sigma_{wy}\\Sigma_y^{-1}(y - \\mu_y), \\Sigma_w - \\Sigma_{wy}\\Sigma_y^{-1}\\Sigma_{wy}^T)\\\\\n", " \\\\\n", " &= \\mathcal{N}(\\mu_w + \\Sigma_w\\phi(X)(\\phi(X)^T\\Sigma_w \\phi(X))^{-1}(y - \\phi(X)^T\\mu_w), \\Sigma_w - \\Sigma_w\\phi(X)(\\phi(X)^T\\Sigma_w \\phi(X))^{-1}\\phi(X)^T\\Sigma_w)\n", "\\end{align*}\n", "$$\n", "\n", "where we've substituted $\\Sigma_{wy} = \\Sigma_w\\phi(X)$, $\\Sigma_y = \\phi(X)^T\\Sigma_w\\phi(X)$ and $\\mu_y = \\phi(X)^T\\mu_w$.\n", "\n", "- Next, this equation in code:\n", "\n", "```python\n", "# Define initial parameters\n", "D = ... # dimensionality of `phi_func`\n", "\n", "mu_w = np.zeros(D) # often a vector of zeros, though it doesn't have to be\n", "cov_w = np.eye(D) # often the identity matrix, though it doesn't have to be\n", "\n", "# Featurize `X_train`\n", "phi_x = phi_func(X_train, D=D)\n", "\n", "# Params of prior distribution over function evals\n", "mu_y = phi_x.T @ mu_w # equals np.zeros(len(X_train)), since mu_w is all zeros\n", "cov_y = phi_x.T @ cov_w @ phi_x\n", "\n", "# Params of posterior distribution over weights\n", "mu_w_post = mu_w + cov_w @ phi_x @ np.linalg.inv(cov_y) @ (y_train - mu_y)\n", "# ... which, since mu_w and mu_y are zero here, simplifies to:\n", "#     cov_w @ phi_x @ np.linalg.inv(cov_y) @ y_train\n", "cov_w_post = cov_w - cov_w @ phi_x @ np.linalg.inv(cov_y) @ phi_x.T @ cov_w\n", "# ... where cov_y = phi_x.T @ cov_w @ phi_x\n", "```\n", "\n", "### $P(y_*\\ \\vert\\ y) = P(\\phi(X_{*})^Tw\\ \\vert\\ y)$\n", "\n", "Here, $X_*$ is a set of test points, e.g. `np.linspace(-10, 10, 200)`.\n", "\n", "In addition, let's call $X_* \\rightarrow$ `X_test` and $y_* \\rightarrow$ `y_test`."
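,
"\n",
"Before squinting at these quantities, let's assemble #2 and #3 end-to-end in code. The following is a sketch, assuming the squared-exponential `phi_func` from earlier and an arbitrary choice of `D = 10`:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Running example data\n",
"X_train = np.array([-3, -5, 6, 2, 1])\n",
"y_train = np.array([1, 4, 2, 9, 4])\n",
"X_test = np.linspace(-10, 10, 200)\n",
"\n",
"D = 10 # arbitrary; the bigger, the more expressive our model\n",
"\n",
"def phi_func(x, D):\n",
"    # squared-exponential features centered at integers in [-D/2, D/2)\n",
"    return np.array([np.exp(-.5 * (x - d)**2) for d in range(int(-D / 2), int(D / 2))])\n",
"\n",
"mu_w = np.zeros(D)\n",
"cov_w = np.eye(D)\n",
"\n",
"phi_x = phi_func(X_train, D=D) # (D, 5)\n",
"phi_x_test = phi_func(X_test, D=D) # (D, 200)\n",
"\n",
"# Params of posterior distribution over weights (#2)\n",
"mu_y = phi_x.T @ mu_w\n",
"cov_y = phi_x.T @ cov_w @ phi_x\n",
"mu_w_post = mu_w + cov_w @ phi_x @ np.linalg.inv(cov_y) @ (y_train - mu_y)\n",
"cov_w_post = cov_w - cov_w @ phi_x @ np.linalg.inv(cov_y) @ phi_x.T @ cov_w\n",
"\n",
"# Params of posterior distribution over function evals at `X_test` (#3)\n",
"mu_y_test_post = phi_x_test.T @ mu_w_post # shape: (200,)\n",
"cov_y_test_post = phi_x_test.T @ cov_w_post @ phi_x_test # shape: (200, 200)\n",
"```\n",
"\n",
"With no observation noise, this posterior mean interpolates the training data exactly: `phi_x.T @ mu_w_post` recovers `y_train`."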
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Never alone\n", "\n", "In code, the posterior over test-point function evaluations has parameters `mu_y_test_post = phi_x_test.T @ mu_w_post` and `cov_y_test_post = phi_x_test.T @ cov_w_post @ phi_x_test`. Expanding `mu_w_post` and `cov_w_post` inside these expressions, then squinting, we see that `phi_x` and `phi_x_test` appear **only in the presence of another `phi_x`, or `phi_x_test`.**\n", "\n", "The four distinct such terms are:\n", "\n", "```python\n", "phi_x_test.T @ cov_w @ phi_x_test\n", "phi_x_test.T @ cov_w @ phi_x\n", "phi_x.T @ cov_w @ phi_x\n", "phi_x.T @ cov_w @ phi_x_test\n", "```\n", "\n", "In mathematical notation, they are (respectively):\n", "\n", "- $\\phi(X_*)^T\\Sigma_w \\phi(X_*)$\n", "- $\\phi(X_*)^T\\Sigma_w \\phi(X)$\n", "- $\\phi(X)^T\\Sigma_w \\phi(X)$\n", "- $\\phi(X)^T\\Sigma_w \\phi(X_*)$\n", "\n", "# Simplifying further\n", "\n", "These are nothing more than *scaled* (via the $\\Sigma_w$ term) dot products in some expanded feature space $\\phi$.\n", "\n", "*Until now, we've explicitly chosen what this $\\phi$ function is.*\n", "\n", "If the scaling matrix $\\Sigma_w$ is [positive definite](https://en.wikipedia.org/wiki/Positive-definite_matrix), it admits a symmetric matrix square root $\\sqrt{\\Sigma_w}$:\n", "\n", "$$\n", "\\begin{align*}\n", "\\Sigma_w = \\sqrt{\\Sigma_w}^T\\sqrt{\\Sigma_w}\n", "\\end{align*}\n", "$$\n", "\n", "Using $\\phi(X)^T\\Sigma_w \\phi(X)$, i.e. `phi_x.T @ cov_w @ phi_x`, as an example:\n", "\n", "$$\n", "\\begin{align*}\n", "\\phi(X)^T \\Sigma_w \\phi(X) \n", " &= \\big(\\sqrt{\\Sigma_w}\\phi(X)\\big)^T\\big(\\sqrt{\\Sigma_w}\\phi(X)\\big)\\\\\n", " &= \\varphi(X)^T\\varphi(X)\\\\\n", " &= \\varphi(X) \\cdot \\varphi(X)\\\\\n", "\\end{align*}\n", "$$\n", "\n", "where we've defined $\\varphi(X) = \\sqrt{\\Sigma_w}\\phi(X)$.\n", "\n", "As such, our four distinct scaled-dot-product terms can be rewritten as:\n", "\n", "- $\\phi(X_*)^T\\Sigma_w \\phi(X_*) = \\varphi(X_*) \\cdot \\varphi(X_*)$\n", "- $\\phi(X_*)^T\\Sigma_w \\phi(X) = \\varphi(X_*) \\cdot \\varphi(X)$\n", "- $\\phi(X)^T\\Sigma_w \\phi(X) = \\varphi(X) \\cdot \\varphi(X)$\n", "- $\\phi(X)^T\\Sigma_w \\phi(X_*) = \\varphi(X) \\cdot \\varphi(X_*)$\n", "\n", "**In other words, these terms can be equivalently written as dot-products in some space $\\varphi$.**\n", "\n", "*We have **not** explicitly chosen what this $\\varphi$ function is.*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Kernels\n", "\n", "A \"kernel\" is a function that gives the similarity between individual elements of two sets; evaluating it on every pair of elements yields a \"Gram matrix.\"\n", "\n", "For instance, imagine we have two sets of countries, $\\{\\text{France}, \\text{Germany}, \\text{Iceland}\\}$ and $\\{\\text{Morocco}, \\text{Denmark}\\}$, and that similarity is given by an integer value in $[1, 5]$, where 1 is the least similar, and 5 is the most. Applying a kernel to these sets might give a Gram matrix such as:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n",
"|         | France | Germany | Iceland |\n",
"|---------|--------|---------|---------|\n",
"| Morocco | 4      | 2       | 1       |\n",
"| Denmark | 3      | 3       | 4       |\n",
"