{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lab 1: Gaussian Process Regression\n", "### Gaussian Process Summer School 2019\n", "\n", "This lab is designed to introduce Gaussian processes in a practical way, illustrating the concepts introduced in the first two lectures. The key aspects of Gaussian process regression are covered: the covariance function (aka kernels); sampling a Gaussian process; and the regression model. The notebook will introduce the Python library `GPy`$^\\dagger$ which handles the kernels, regression and optimisation of hyperparameter, allowing us to easily access the results we want.\n", "\n", "The level of this notebook is aimed at introductory, as the background of attendees is diverse, and so cover a wide range of basic GP concepts. There are seven exercises to complete, the difficulty of which varies, but you should aim to complete all during the lab session. The notebook will not be marked and we will provide answers to the exercises after the lab session.\n", "\n", "In addition, there is a second _optional_ notebook with extra work. The content of this is more advanced, so completion is at your discretion.\n", "\n", "$^\\dagger$`GPy`: A Gaussian process framework in Python (since 2012). Available from http://github.com/SheffieldML/GPy\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Getting started\n", "\n", "**[Binder]** If you are using Binder, all libraries required are installed on the Binder image.\n", "\n", "**[Local]** If you are running this notebook on a local machine, make sure that `GPy` is already installed on your machine. You should be using Python 3.5 or 3.6. A set of instructions for setting up your environment are [available from the GPSS site](http://gpss.cc/gpss19/getting_started).\n", "\n", "First, we need to setup our notebook with the libraries we are going to use. We will use `numpy` for maths functionality, `pyplot` for plotting, and of course `GPy` for Gaussian processes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Support for maths\n", "import numpy as np\n", "# Plotting tools\n", "from matplotlib import pyplot as plt\n", "# we use the following for plotting figures in jupyter\n", "%matplotlib inline\n", "\n", "import warnings\n", "warnings.filterwarnings('ignore')\n", "\n", "# GPy: Gaussian processes library\n", "import GPy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The documentation for `GPy` is available at [gpy.readthedocs.io](http://gpy.readthedocs.io/en/deploy/). We will be using GPy to define our kernels, and regression. Note that `GPy` also contains plotting utilities, but we will not use these in this lab." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Covariance functions, aka kernels\n", "\n", "We will define a covariance function, from hereon referred to as a kernel, using `GPy`. The most commonly used kernel in machine learning is the Gaussian-form radial basis function (RBF) kernel. 
It is also commonly referred to as the exponentiated quadratic or squared exponential kernel – all are equivalent.\n", "\n", "The definition of the (1-dimensional) RBF kernel has a Gaussian-form, defined as:\n", "\n", "$$\n", " \\kappa_\\mathrm{rbf}(x,x') = \\sigma^2\\exp\\left(-\\frac{(x-x')^2}{2\\mathscr{l}^2}\\right)\n", "$$\n", "\n", "It has two parameters, described as the variance, $\\sigma^2$ and the lengthscale $\\mathscr{l}$.\n", "\n", "In GPy, we define our kernels using the input dimension as the first argument, in the simplest case `input_dim=1` for 1-dimensional regression. We can also explicitly define the parameters, but for now we will use the default values:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create a 1-D RBF kernel with default parameters\n", "k = GPy.kern.RBF(1)\n", "# Preview the kernel's parameters\n", "k" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see from the above table that our kernel has two parameters, `variance` and `lengthscale`, both with value `1.0`. There is also information on the constraints and priors on each parameter, but we will look at this later." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Visualising the kernel\n", "\n", "We can visualise our kernel in a few different ways. We can plot the _shape_ of the kernel by plotting $k(x,0)$ over some sample space $x$ which, looking at the equation above, clearly has a Gaussian shape. This describes the covariance between each sample location and $0$.\n", "\n", "Alternatively, we can construct a full covariance matrix, $\\mathbf{K}_{xx} \\triangleq k(x,x')$ with samples $x = x'$. The resulting GP prior is a multivariate normal distribution over the space of samples $x$: $\\mathcal{N}(\\mathbf{0}, \\mathbf{K}_{xx})$. It should be evident then that the elements of the matrix represents the covariance between respective points in $x$ and $x'$, and that it is exactly $\\sigma^2[=1]$ in the diagonal.\n", "\n", "We can show this using `pyplot` to plot the vector $k(x,0)$ and the matrix $k(x,x')$ using `k.K(`$\\cdot$ `,` $\\cdot$`)`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Our sample space: 100 samples in the interval [-4,4]\n", "X = np.linspace(-4.,4.,100)[:, None] # we need [:, None] to reshape X into a column vector for use in GPy\n", "\n", "# Set up the plotting environment\n", "plt.figure(figsize=(18,5))\n", "\n", "# ==== k(x,0)\n", "\n", "plt.subplot(121) # left plot\n", "\n", "# First, sample kernel at x' = 0\n", "K = k.K(X, np.array([[0.]])) # k(x,0)\n", "\n", "# Plot covariance vector\n", "plt.plot(X,K)\n", "\n", "# Annotate plot\n", "plt.xlabel(\"x\"), plt.ylabel(\"$\\kappa$\")\n", "plt.title(\"$\\kappa_{rbf}(x,0)$\")\n", "\n", "# ==== k(x,x')\n", "\n", "plt.subplot(122) # right plot\n", "\n", "# The kernel takes two inputs, and outputs the covariance between each respective point in the two inputs\n", "K = k.K(X,X)\n", "\n", "# Plot the covariance of the sample space\n", "plt.pcolor(X.T, X, K)\n", "\n", "# Format and annotate plot\n", "plt.gca().invert_yaxis(), plt.gca().axis(\"image\")\n", "plt.xlabel(\"x\"), plt.ylabel(\"x'\"), plt.colorbar()\n", "plt.title(\"$\\kappa_{rbf}(x,x')$\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setting the kernel parameters\n", "\n", "Looking at the above definition of the RBF kernel, we can see that the parameters, i.e. 
variance and lengthscale, control the shape of the covariance function and therefore the value of the covariance between points $x$ and $x'$.\n", "\n", "We can access the value of the kernel parameters in `GPy` and manually set them by calling `k.param_name`, e.g. `k.lengthscale` or `k.variance` for the RBF kernel. The following example demonstrates how the value of the lengthscale affects the RBF kernel:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Our sample space : 100 samples in the interval [-4,4] \n", "X = np.linspace(-4.,4.,250)[:, None] # we use more samples to get a smoother plot at low lengthscales\n", "\n", "# Create a 1-D RBF kernel with default parameters\n", "k = GPy.kern.RBF(1)\n", "\n", "# Set up the plotting environment\n", "plt.figure(figsize=(18, 7))\n", "\n", "# Set up our list of different lengthscales\n", "ls = [0.25, 0.5, 1., 2., 4.]\n", "\n", "# Loop over the lengthscale values\n", "for l in ls:\n", " # Set the lengthscale to be l\n", " k.lengthscale = l\n", " # Calculate the new covariance function at k(x,0)\n", " C = k.K(X, np.array([[0.]]))\n", " # Plot the resulting covariance vector\n", " plt.plot(X,C)\n", "\n", "# Annotate plot\n", "plt.xlabel(\"x\"), plt.ylabel(\"$\\kappa(x,0)$\") \n", "plt.title(\"Effects of different lengthscales on the Gaussian RBF kernel\")\n", "plt.legend(labels=ls);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 1\n", "\n", "(a) What is the effect of the lengthscale parameter on the covariance function?" ] }, { "cell_type": "raw", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(b) Change the code used above to plot the covariance function showing the effects of the variance on the covariance function. Comment on the effect." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "raw", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Types of covariance function\n", "\n", "There are many different covariance functions already implemented in `GPy`. Aside from the `RBF` kernel, there are others such as the following:\n", "- `Exponential`\n", "- `Matern32`\n", "- `Matern52`\n", "- `Brownian`\n", "- `Bias`\n", "- `Linear`\n", "- `StdPeriodic`\n", "- `Cosine`\n", "- `PeriodicMatern32`, \n", "\n", "Note: when defining these, all are preceded by `GPy.kern.` The following are some examples of the [Matérn 5/2](https://en.wikipedia.org/wiki/Mat%C3%A9rn_covariance_function) and Cosine kernels, compared with the RBF kernel:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Our sample space : 100 samples in the interval [-4,4] \n", "X = np.linspace(-4.,4.,250)[:, None]\n", "\n", "# RBF kernel\n", "k_R = GPy.kern.RBF(1)\n", "C_R = k_R.K(X, np.array([[0.]]))\n", "\n", "# Matern 5/2\n", "k_M = GPy.kern.Matern52(1)\n", "C_M = k_M.K(X, np.array([[0.]]))\n", "\n", "# Cosine \n", "k_C = GPy.kern.Cosine(1)\n", "C_C = k_C.K(X, np.array([[0.]]))\n", "\n", "plt.figure(figsize=(18,7))\n", "plt.plot(X, C_R, X, C_M, X, C_C);\n", "plt.xlabel(\"x\"), plt.ylabel(\"$\\kappa$\") \n", "plt.legend(labels=[\"Gaussian RBF\", \"Matérn 5/2\", \"Cosine\"]);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Not every kernel has the same set of parameters. 
Some kernels are not parameterised by a lengthscale, for example, like the `Linear` kernel which only has a list of variances corresponding to each linear component" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "GPy.kern.Linear(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Likewise, not every kernel is stationary. In the case of the Gaussian RBF, or Matérn kernels, the kernel can be written $\\kappa(x,x') = f(x-x')$, however this is not true for, e.g., the Brownian motion covariance function, which is defined as $k(x,x') = \\min(x,x')$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Our sample space : 100 samples in the interval [-2,8] \n", "X = np.linspace(-2., 8., 100)[:,None]\n", "\n", "# Note that the Brownian kernel is defined:\n", "# k = min(abs(x),abs(x')) if sgn(x) = sgn(x')\n", "# k = 0 if sgn(x) ≠ sgn(x')\n", "\n", "# We define our Brownian kernel\n", "k_B = GPy.kern.Brownian(1)\n", "\n", "plt.figure(figsize=(18,7))\n", "\n", "x_s = [0., 2., 4., 6.] # values of x'\n", "# Loop through values of x'\n", "for x_ in x_s:\n", " # Evaluate kernel at k(x,x')\n", " K_B = k_B.K(X, np.array([[x_]])) \n", " # Plot covariance vector\n", " plt.plot(X, K_B)\n", " \n", "# Annotate plot\n", "plt.xlabel(\"x\"), plt.ylabel(\"$\\kappa$\")\n", "plt.title(\"Effects of different inputs on a non-stationary, Brownian kernel\")\n", "plt.legend(labels=[\"$\\kappa(x,0)$\", \"$\\kappa(x,2)$\", \"$\\kappa(x,4)$\", \"$\\kappa(x,6)$\"]);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Combining covariance functions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 2\n", "\n", "(a) A matrix, $\\mathbf{K}$, is positive semi-definite if the matrix inner product is greater than or equal to zero, $\\mathbf{x}^\\text{T}\\mathbf{K}\\mathbf{x} \\geq 0$, _regardless of the values in $\\mathbf{x}$_. Given this, it should be easy to see that the sum of two positive semi-definite matrices is also positive semi-definite. In the context of Gaussian processes, this is the sum of two covariance functions. What does this mean from a modelling perspective?" ] }, { "cell_type": "raw", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(b) What about the element-wise product of two covariance functions? If we define $k(\\mathbf{x}, \\mathbf{x}') = k_1(\\mathbf{x},\\mathbf{x}')k_2(\\mathbf{x},\\mathbf{x}')$, then is $k(\\mathbf{x},\\mathbf{x}')$ a valid covariance function?" ] }, { "cell_type": "raw", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Combining kernels in GPy\n", "\n", "We can easily combine kernels using `GPy` using the `+` and `*` operators, respectively denoting addition and product of kernels." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Summing kernels\n", "An example of adding kernels is shown here. We create a new kernel that is the `sum` of an RBF and a Matern 5/2 kernel." 
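, "\n", "Once the composite kernel `k_sum` has been created in the next cell, the parameters of each part should remain accessible through the names we gave the constituent kernels (the same naming pattern we will rely on again in Section 7). For example, as a quick illustrative sketch:\n", "```\n", "k_sum.RBF.lengthscale = 1.5    # adjust only the RBF part of the sum\n", "k_sum.Matern52.variance        # inspect the Matern 5/2 part\n", "```"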
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create the first kernel: a 1-D RBF with lengthscale 2.0\n", "k_R = GPy.kern.RBF(1, lengthscale=2., name=\"RBF\")\n", "# Create the second kernel: a 1-D Matern52 with variance 2.0 and lengthscale 4.0\n", "k_M = GPy.kern.Matern52(1, variance=2., lengthscale=4., name=\"Matern52\")\n", "\n", "# Add the kernels together\n", "k_sum = k_R + k_M\n", "# Preview the properties of the composite kernel\n", "k_sum" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can visualise our kernel sum to see the resulting effect. It should be fairly clear that the result is simply the sum of evaluations of the respective kernels for each sample point." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Our sample space : 100 samples in the interval [-10,10] \n", "X = np.linspace(-10., 10., 100)[:,None]\n", "\n", "# Set up the plotting environment\n", "plt.figure(figsize=(18,7))\n", "\n", "# Here we sample from the consituent and composite kernels\n", "K_R = k_R.K(X, np.array([[0.]])) # RBF\n", "K_M = k_M.K(X, np.array([[0.]])) # Matern 5/2\n", "K_sum = k_sum.K(X, np.array([[0.]])) # RBF + Matern\n", "\n", "# Plot each of our covariance vectors\n", "plt.plot(X, K_R, X, K_M, X, K_sum)\n", " \n", "# Annotate plot\n", "plt.xlabel(\"x\"), plt.ylabel(\"$\\kappa$\")\n", "plt.legend(labels=[\"Gaussian RBF\", \"Matérn 5/2\", \"RBF + Matérn\"]);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Multiplying two kernels\n", "\n", "We also demonstrate here the effect of multiplying two kernels. Here, we multiply an RBF and Periodic kernel, effectively encapsulating the periodicity into an RBF window:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create the first kernel: a 1-D RBF with lengthscale 5.0\n", "k_R = GPy.kern.RBF(1, lengthscale=5., name=\"RBF\")\n", "# Create the second kernel: a 1-D StdPeriodic with period 5.0\n", "k_P = GPy.kern.StdPeriodic(1, period=5., name=\"Periodic\")\n", "\n", "# Multiply the kernels together\n", "k_mul = k_R * k_P\n", "\n", "# Preview the properties of the composite kernel\n", "k_mul" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Our sample space : 100 samples in the interval [-10,10] \n", "X = np.linspace(-10., 10., 100)[:,None]\n", "\n", "# Set up the plotting environment\n", "plt.figure(figsize=(18,7))\n", "\n", "# Here we sample from the consituent and composite kernels\n", "K_R = k_R.K(X, np.array([[0.]])) # RBF\n", "K_P = k_P.K(X, np.array([[0.]])) # StdPeriodic\n", "K_mul = k_mul.K(X, np.array([[0.]])) # RBF * StdPeriodic\n", "\n", "# Plot each of our covariance vectors\n", "plt.plot(X, K_R, X, K_P, X, K_mul)\n", " \n", "# Annotate plot\n", "plt.xlabel(\"x\"), plt.ylabel(\"$\\kappa$\")\n", "plt.legend(labels=[\"Gaussian RBF\", \"Periodic\", \"RBF x Periodic\"]);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. 
Sampling from a Gaussian Process\n", "\n", "A Gaussian process provides a prior over some infinite-dimensional function, defined by a mean function and a covariance function:\n", "\n", "$$ f(x) \sim \mathcal{GP}(m(x), k(x,x'))$$\n", "\n", "When we sample from the covariance function, $k$, to create a matrix over some sample space, we are creating a matrix of values that describe the covariance between sample points. Since it is not possible to sample every single point of an infinite-dimensional function, we have to sample a finite subset of the input domain. If we let $\mathbf{X}$ denote some sample inputs, and $\mathbf{K}$ the covariance matrix with elements $K_{ij} = k(\mathbf{X}_i,\mathbf{X}_j)$, then we can describe the prior over $f(\mathbf{X})$ as a (finite-dimensional) normal distribution with covariance $\mathbf{K}$. As such, we can easily create samples of $f$ which, for a good choice of $\mathbf{X}$, are representative of the true function.\n", "\n", "We can also sample from the kernel prior by creating a covariance matrix over a sample space and sampling from a zero-mean multivariate normal distribution with covariance $\mathbf{K}$. Below are examples of different kernels with different parameters, including composite kernels." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "ks = [ # List of example kernels\n", "    GPy.kern.RBF(1, lengthscale=1.),\n", "    GPy.kern.RBF(1, lengthscale=0.5),\n", "    GPy.kern.RBF(1, lengthscale=0.25, variance=2.),\n", "    GPy.kern.Exponential(1),\n", "    GPy.kern.Matern32(1),\n", "    GPy.kern.Matern52(1),\n", "    GPy.kern.StdPeriodic(1, period=2.),\n", "    GPy.kern.Cosine(1),\n", "    GPy.kern.Brownian(1),\n", "    GPy.kern.Linear(1),\n", "    GPy.kern.Bias(1),\n", "    GPy.kern.White(1),\n", "    GPy.kern.StdPeriodic(1)*GPy.kern.RBF(1),\n", "    GPy.kern.Linear(1) + GPy.kern.Exponential(1)\n", "]\n", "# The name of our kernels (for the legend)\n", "kernel_name = [\"RBF ls=1\", \"RBF ls=0.5\", \"RBF ls=0.25, var=2\", \"Exponential\", \"Matern 3/2\", \n", "               \"Matern 5/2\", \"Periodic\", \"Cosine\", \"Brownian\", \"Linear\", \"Bias\", \"White\", \"Periodic x RBF\", \"Linear + Exponential\"]\n", "\n", "# Our sample space\n", "X = np.linspace(-5., 5., 250)[:, None]\n", "\n", "print(\"The following plots demonstrate samples from a Gaussian process prior and the corresponding covariance matrix\")\n", "\n", "# Loop through our kernels\n", "for i,k in enumerate(ks):\n", "    # The mean function is set to 0\n", "    mu = np.zeros((250)) # we have 250 sample inputs\n", "    # Get the covariance matrix\n", "    if i != 11:\n", "        C = k.K(X,X)\n", "    else: # We have to sample the White noise kernel differently\n", "        C = k.K(X)\n", "    \n", "    # Sample 5 times from a multivariate Gaussian distribution with mean 0 and covariance k(X,X)\n", "    Z = np.random.multivariate_normal(mu, C, 5)\n", "    \n", "    # Setup figure environment\n", "    plt.figure(figsize=(18, 7))\n", "    \n", "    # Show samples on left hand side\n", "    plt.subplot(121)\n", "    for j in range(5 if i < 11 else 2): # Loop through samples\n", "        plt.plot(X[:],Z[j,:])\n", "    plt.title(kernel_name[i])\n", "    \n", "    # Visualise covariance matrix on right hand side\n", "    plt.subplot(122)\n", "    plt.pcolor(X.T, X, C)\n", "    # Annotate plot\n", "    plt.gca().invert_yaxis(), plt.gca().axis(\"image\")\n", "    plt.colorbar()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These samples are from the Gaussian process prior made up of the covariance function and a zero mean. 
After GP regression, the fitted posterior can also be sampled in this manner, to get samples of the fitted function.\n", "\n", "### Exercise 3\n", "\n", "Can you identify the covariance function used to generate the following samples?\n", "\n", "![exercise_3](https://github.com/gpschool/labs/raw/2019/.resources/lab1_kernel_samples_example.png)" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "(a)\n", "\n", "(b)\n", "\n", "(c)\n", "\n", "(d)\n", "\n", "(e)\n", "\n", "(f)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Gaussian Process Regression\n", "\n", "We will now use our Gaussian process prior with some observed data to form a GP regression model. \n", "\n", "Suppose we have a data model for which we only have noisy observations, $y = f(x) + \\epsilon$ at some small number of sample locations, $\\mathbf{X}$. Here, we set up an example function\n", "\n", "$$\n", " f(x) = -\\cos(2\\pi x) + \\frac{1}{2}\\sin(6\\pi x)\n", "$$\n", "$$\n", " \\mathbf{y} = f(\\mathbf{X}) + \\epsilon, \\quad \\epsilon \\sim \\mathcal{N}(0, 0.01)\n", "$$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# lambda function, call f(x) to generate data\n", "f = lambda x: -np.cos(2*np.pi*x) + 0.5*np.sin(6*np.pi*x)\n", "\n", "# 10 equally spaced sample locations \n", "X = np.linspace(0.05, 0.95, 10)[:,None]\n", "\n", "# y = f(X) + epsilon\n", "Y = f(X) + np.random.normal(0., 0.1, (10,1)) # note that np.random.normal takes mean and s.d. (not variance), 0.1^2 = 0.01\n", "\n", "# Setup our figure environment\n", "plt.figure(figsize=(18, 7))\n", "\n", "# Plot observations\n", "plt.plot(X, Y, \"kx\", mew=2)\n", "\n", "# Annotate plot\n", "plt.xlabel(\"x\"), plt.ylabel(\"f\")\n", "plt.legend(labels=[\"sample points\"]);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will first fit a Gaussian process using the exact equations.\n", "\n", "A Gaussian process regression model using a Gaussian RBF covariance function can be defined first by setting up the kernel:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "k = GPy.kern.RBF(1, variance=1., lengthscale=0.1, name=\"rbf\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And then combining it with the data to form a Gaussian process regression model, with $\\mathbf{X}^*$ representing _any_ new inputs (imagine $\\mathbf{f}^*$ approximates $f(\\mathbf{X}^*)$):\n", "\n", "$$\n", "\\left.\\mathbf{f}^*\\,\\right|\\,\\mathbf{X}^*,\\mathbf{X},\\mathbf{y} \\sim \\mathcal{N}\\left(\\mathbf{m}, \\mathbf{C}\\right),\n", "$$\n", "\n", "where $\n", "\\mathbf{m} = \\mathbf{K}_{*x}(\\mathbf{K}_{xx} + \\sigma^2\\mathbf{I})^{-1}\\mathbf{y}$ and $\\mathbf{C} = \\mathbf{K}_{**} - \\mathbf{K}_{*x}(\\mathbf{K}_{xx} + \\sigma^2\\mathbf{I})^{-1}\\mathbf{K}_{*x}^\\text{T}\n", "$ and covariance matrices are defined by evaluations of the kernel functions: $\\mathbf{K}_{xx} = k(\\mathbf{X}, \\mathbf{X})$; $\\mathbf{K}_{*x} = k(\\mathbf{X}^*, \\mathbf{X})$; and $\\mathbf{K}_{**} = k(\\mathbf{X}^*,\\mathbf{X}^*)$." 
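, "\n", "The cell below follows these equations directly, using `np.linalg.inv` for clarity. As a numerical aside (not needed for the lab): explicitly inverting $\mathbf{K}_{xx} + \sigma^2\mathbf{I}$ can be unstable, and a sketch of the same computation using a Cholesky factorisation instead would look something like:\n", "```\n", "L = np.linalg.cholesky(Kxx)                           # Kxx here already includes the noise term\n", "alpha = np.linalg.solve(L.T, np.linalg.solve(L, Y))   # (Kxx)^{-1} @ Y, solved in two steps\n", "mean = Ksx @ alpha                                    # posterior mean\n", "V = np.linalg.solve(L, Ksx.T)\n", "Cov = Kss - V.T @ V                                   # posterior covariance\n", "```\n", "Both give the same result in exact arithmetic; we use the direct inversion below only because it mirrors the equations above."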
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# New test points to sample function from\n", "Xnew = np.linspace(-0.05, 1.05, 100)[:, None]\n", "\n", "# Covariance between training sample points (+ Gaussian noise)\n", "Kxx = k.K(X,X) + 1 * np.eye(10)\n", "\n", "# Covariance between training and test points\n", "Ksx = k.K(Xnew, X)\n", "\n", "# Covariance between test points\n", "Kss = k.K(Xnew,Xnew)\n", "\n", "# The mean of the GP fit (note that @ is matrix multiplcation: A @ B is equivalent to np.matmul(A,B))\n", "mean = Ksx @ np.linalg.inv(Kxx) @ Y\n", "# The covariance matrix of the GP fit\n", "Cov = Kss - Ksx @ np.linalg.inv(Kxx) @ Ksx.T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we define a quick plotting utility function for our GP fits. There are a number of plotting options available in GPy, but we will use the below method, which plots the mean and 95% confidence fit of a GP for a given input $\\mathbf{X}^*$. Optionally, we will allow it to plot the initial training points." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def plot_gp(X, m, C, training_points=None):\n", " \"\"\" Plotting utility to plot a GP fit with 95% confidence interval \"\"\"\n", " # Plot 95% confidence interval \n", " plt.fill_between(X[:,0],\n", " m[:,0] - 1.96*np.sqrt(np.diag(C)),\n", " m[:,0] + 1.96*np.sqrt(np.diag(C)),\n", " alpha=0.5)\n", " # Plot GP mean and initial training points\n", " plt.plot(X, m, \"-\")\n", " plt.legend(labels=[\"GP fit\"])\n", " \n", " plt.xlabel(\"x\"), plt.ylabel(\"f\")\n", " \n", " # Plot training points if included\n", " if training_points is not None:\n", " X_, Y_ = training_points\n", " plt.plot(X_, Y_, \"kx\", mew=2)\n", " plt.legend(labels=[\"GP fit\", \"sample points\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(14, 8))\n", "\n", "# Plot the GP fit mean and covariance\n", "plot_gp(Xnew, mean, Cov, training_points=(X,Y))\n", "plt.title(\"Explicit (homemade) regression model fit\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also save effort and time by to do Gaussian process regression using `GPy`, by creating a GP regression model with sample points $(\\mathbf{X}, \\mathbf{Y})$ and the Gaussian RBF kernel:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "m = GPy.models.GPRegression(X, Y, k)\n", "m " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use GPy's regression and prediction tools, which _should_ give the same result as our basic implementation:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Use GPy model to calculate the mean and covariance of the fit at Xnew\n", "mean, Cov = m.predict_noiseless(Xnew, full_cov=True)\n", "\n", "plt.figure(figsize=(14, 8))\n", "\n", "# Plot the GP fit mean and covariance\n", "plot_gp(Xnew, mean, Cov, training_points=(X,Y))\n", "plt.title(\"GPy regression model fit\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It can be clearly seen that this *is* the same fit as our above model. However, using GPy gives flexibility and ease of use for extending the capabilities of the fitting, including use of different kernels, optimising parameters and solving more complicated problems, including classification. 
We also don't need to write out the explicit equations or construct the covariance matrices by hand." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 4\n", "\n", "(a) What do you think of this initial fit? Does the prior given by the GP seem well adapted to the data?" ] }, { "cell_type": "raw", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(b) The parameters of the model can be edited much like those of the kernel. For example, \n", "```\n", "m.Gaussian_noise = 0.1\n", "```\n", "or\n", "```\n", "m.rbf.variance = 2.0\n", "```\n", "Change the values of the parameters to try and obtain a better fit of the GP. You can recalculate the updated mean and covariance after changing the values by calling `m.predict_noiseless` as above.\n", "\n", "*Note: changing the original kernel `k` will also affect the model parameters, because the model holds a reference to the same kernel object. This is not a reliable way of setting the parameters, so you should adjust the kernel parameters via the model `m` as described.*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(c) Given that we can obtain the mean and covariance of the GP fit, we can also sample the GP posterior as a multivariate Gaussian distribution. This can be done as in Section 4, where we sampled the priors defined by the kernels, i.e. with `np.random.multivariate_normal`. Obtain 10 samples from the GP posterior and plot them alongside the data. Try to simulate noisy measurements using `m.predict` (rather than `m.predict_noiseless`).\n", "\n", "*Remember to get the full covariance matrix, using `full_cov=True`, and note that to make the mean vector 1-D (for sampling a multivariate normal), you need `np.random.multivariate_normal(mean[:,0], Cov)`.*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. Covariance Function Parameter Estimation\n", "\n", "As discussed in the lectures, the values of the kernel parameters can be estimated by maximising the likelihood of the observations. This lets us optimise our estimate of the underlying function without eye-balling the parameters to get a good fit.\n", "\n", "In `GPy`, model objects such as `GPRegression` have built-in parameter optimisation functionality. We can call this as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "m.optimize()\n", "m" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see the selected parameters in the model table above. The regression fit with the optimised parameters can be plotted:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get mean and covariance of optimised GP\n", "mean, Cov = m.predict_noiseless(Xnew, full_cov=True)\n", "\n", "# Setup the figure environment\n", "plt.figure(figsize=(14, 8))\n", "\n", "# Plot the GP fit mean and covariance\n", "plot_gp(Xnew, mean, Cov, training_points=(X,Y))\n", "plt.plot(Xnew, f(Xnew), \"r:\", lw=3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Parameter constraints\n", "\n", "We can see in the above model that the regression fit passes very close to the data, as the optimiser has effectively removed the noise from the model, with `Gaussian_noise.variance`$ = 4.804\times10^{-8}$ here. If we *know*, or can reasonably approximate, the variance of the observation noise $\epsilon$, we can fix this parameter for the optimiser using `fix`; in the example above, the noise variance is $0.01$. We can also limit the values that the parameters take by adding constraints. For example, the variance and lengthscale can only be positive, so we can enforce this by calling `constrain_positive` (note that this is the default constraint for GP regression anyway)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Constrain the regression parameters to be positive only\n", "m.constrain_positive()\n", "\n", "# Fix the Gaussian noise variance at 0.01 \n", "m.Gaussian_noise.variance = 0.01 # (Reset the parameter first)\n", "m.Gaussian_noise.variance.fix()\n", "\n", "# Reoptimise\n", "m.optimize()\n", "m" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see our constraints in the corresponding column in the above table, where \"`+ve`\" means we are constrained to positive values, and `fixed` means the optimiser will not try to optimise this parameter. We can see here that the variance of the noise in the model is unchanged by the optimiser. Looking at the resulting plot, we can see that we have a much more reasonable confidence in our estimate, and that the true function is hard to distinguish from samples drawn from our fit, indicating that we have a reasonable approximation of the true function given noisy observations." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get mean and covariance of optimised GP\n", "mean, Cov = m.predict_noiseless(Xnew, full_cov=True)\n", "\n", "# Setup our figure environment\n", "plt.figure(figsize=(18, 14))\n", "\n", "# The top plot shows our mean regression fit and 95% confidence intervals \n", "plt.subplot(211)\n", "# Plot the GP fit mean and covariance\n", "plot_gp(Xnew, mean, Cov, training_points=(X,Y))\n", "plt.title(\"GP posterior\")\n", "plt.subplot(212)\n", "\n", "plt.plot(Xnew, f(Xnew),\"r:\", lw=3)\n", "\n", "Z = np.random.multivariate_normal(mean[:,0], Cov, 20)\n", "for z in Z:\n", "    plt.plot(Xnew,z, \"g-\", alpha=0.2)\n", "    \n", "plt.xlabel(\"x\"), plt.ylabel(\"f\")\n", "plt.legend(labels=[\"true $f(x)$\", \"samples from GP\"]);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the prior knowledge of the noise in the data has given us a reasonably good approximation of the true function. The samples from the GP provide fits to the observed data and roughly follow the shape of the true function $f$, particularly in the range for which we have samples.\n", "\n", "### Exercise 5\n", "\n", "The function we used is a sum of sinusoids, and therefore has inherent periodicity. However, we only have samples for a single period, so this is not directly seen in the domain of $[0, 1]$.\n", "\n", "(a) Try predicting the function on the range of $[0, 2]$ instead (`Xnew = np.linspace(0., 2., 100)[:,None]`), and compare the fitted GP posterior with the true function $f$.\n", "\n", "*Note: you shouldn't need to retrain the model or adjust the hyperparameters.*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(b) Comment on the fit of the GP, and the uncertainty in regions where we have no observations. Is the GP still a good fit? How might we produce a better fit, for example, if we knew $f(x)$ had a periodic nature?"
] }, { "cell_type": "raw", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7. Real World Example\n", "\n", "We'll consider now a classic real world example using atmospheric CO$_2$ observations from the Mauna Loa Observatory, Hawaii. The dataset is a good demonstration of the predictive power of Gaussian processes, and we will use it to show how we can encode our prior beliefs to combine kernels.\n", "\n", "First, we need to download the dataset, which can be done through a utility provided by `GPy`. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pickle\n", "\n", "# Load in the data\n", "with open(\"../.resources/mauna_loa\", \"rb\") as fid:\n", " data = pickle.load(fid)\n", " \n", "print(\"\\nData keys:\")\n", "print(data.keys())\n", "\n", "print(\"\\nCitation:\")\n", "print(data['citation'])\n", "\n", "print(\"\\nInfo:\")\n", "print(data['info'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Training data (X = input, Y = observation)\n", "X, Y = data['X'], data['Y']\n", "\n", "# Test data (Xtest = input, Ytest = observations)\n", "Xtest, Ytest = data['Xtest'], data['Ytest']\n", "\n", "# Set up our plotting environment\n", "plt.figure(figsize=(14, 8))\n", "\n", "# Plot the training data in blue and the test data in red\n", "plt.plot(X, Y, \"b.\", Xtest, Ytest, \"r.\")\n", "\n", "# Annotate plot\n", "plt.legend(labels=[\"training data\", \"test data\"])\n", "plt.xlabel(\"year\"), plt.ylabel(\"CO$_2$ (PPM)\"), plt.title(\"Monthly mean CO$_2$ at the Mauna Loa Observatory, Hawaii\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**You may want to subsample the data to save time during the labs**\n", "\n", "Run the following to reduce the datasize by only using every other training point:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X = X[::2, :]\n", "Y = Y[::2, :]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Naive GP regression\n", "\n", "First, we will try to fit a basic RBF to our data, as we have used in previous examples" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "k = GPy.kern.RBF(1, name=\"rbf\")\n", "\n", "m = GPy.models.GPRegression(X, Y, k)\n", "m.optimize()\n", "\n", "m" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Xnew = np.vstack([X, Xtest])\n", "\n", "mean, Cov = m.predict(Xnew, full_cov=True)\n", "\n", "plt.figure(figsize=(14, 8))\n", "plot_gp(Xnew, mean, Cov)\n", "plt.plot(X, Y, \"b.\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is possible to make the model fit the data near perfectly by minimising the variance of the Gaussian noise in the likelihood and fixing the kernel variance." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Effectively remove noise parameter (needs to be >0, so select value that is very low)\n", "m.Gaussian_noise.variance = 0.00001\n", "m.Gaussian_noise.variance.fix()\n", "\n", "# We will fix the variance as well, so that only the lengthscale is optimised\n", "m.rbf.variance = 10.\n", "m.rbf.variance.fix()\n", "\n", "# This should minimize the lengthscale to fit closely to the training points\n", "m.optimize()\n", "m" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**But** this has no predictive power, and we have really just overfitted to the training data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mean, Cov = m.predict(Xnew, full_cov=True)\n", "\n", "plt.figure(figsize=(18, 5))\n", "\n", "# The left plot shows the GP fit to a subsample of our training set\n", "plt.subplot(121)\n", "plot_gp(Xnew, mean, Cov)\n", "plt.plot(X, Y, \"b.\");\n", "plt.gca().set_xlim([1960,1980]), plt.gca().set_ylim([310, 340])\n", "\n", "# The right plot shows that the GP has no predictive power and reverts to 0\n", "plt.subplot(122)\n", "plot_gp(Xnew, mean, Cov)\n", "plt.plot(X, Y, \"b.\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### GP Regression with combined covariance functions\n", "\n", "Taking a look at the training data, we can see a number of features that occur in the data. There is a clear periodic trend that is yearly, and an approximately linear trend. We can use this prior information in our choice of kernel to give some meaning to the GP fit.\n", "\n", "First, we will look at the linear trend. It should be obvious that the overall trend (ignoring the periodicity) can be described approximately by $f(x) \\approx a + bx$. To embed this as a covariance function, we can use the `Linear` covariance functions, which adds a linear trend to the covariance.\n", "\n", "### Exercise 6\n", "\n", "(a) Create a `Linear` kernel with reasonable estimates of the parameters that represent the trend?\n", "\n", "(b) How might we encode this trend in the mean estimate of the GP ?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(c) Create a GP regression model using the kernels to fit the data. Comment on good is the fit and the predictive power of the model? " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "raw", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Periodicity\n", "\n", "It's there is a seasonal trend over the year, and that a simple linear fit cannot capture this information. However, we can add this using our choice of kernel by adding a `StdPeriodic` kernel to our regression model. 
It's evident to the data that the period is yearly, so a period of $\\omega=1$ is a sensible choice for our initial parameter:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ks = [ # Our kernels\n", " GPy.kern.Bias(1, variance=10000.), # Constant offset\n", " GPy.kern.Linear(1), # Linear trend\n", " GPy.kern.StdPeriodic(1, period=1) # Periodicity\n", "]\n", "\n", "# Create a regression model with an additive kernel (bias + linear + periodic)\n", "m = GPy.models.GPRegression(X, Y, ks[0] + ks[1] + ks[2])\n", "\n", "# Optimise hyperparameters to maximise likelihood\n", "m.optimize()\n", "\n", "# Predict the value of CO2 using our GP fit\n", "mean, Cov = m.predict(Xnew)\n", "\n", "# Set up plotting\n", "plt.figure(figsize=(14,8))\n", "\n", "# Plot GP fit\n", "plot_gp(Xnew, mean, Cov)\n", "# Show our training and test points\n", "plt.plot(X, Y, \".b\", Xtest, Ytest, \".r\", alpha=0.4)\n", "# Annotate plot\n", "plt.legend(labels=[\"GP fit\",\"training points\", \"test points\"])\n", "\n", "# Preview the parameters of our model\n", "m" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the plot we can see that while we have maintained the periodicity in our prediction, there is some deviation in the amplitude of each period. Likewise, the data is not strictly linear. We can embed model these non-linearities and deviations by adding an `Exponential` kernel. First, in the summative kernel, but also by multiplying a Gaussian RBF by the periodic kernel. For some intiution on how this allows for deviation of the periodicity, see the corresponding sample plot in Section 4.\n", "\n", "Note that the `Exponential` kernel describes an exponential decay in covariance with distance between points, and does **not** model an exponential trend in the overall data. The `RBF` kernel describes a smoother (Gaussian) decay, which may be a better model for changes in the periodic signal." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ks = [ # Our kernels\n", " GPy.kern.Exponential(1), # Non-linearity in overall trend\n", " GPy.kern.Bias(1, variance=10000.), # Constant offset\n", " GPy.kern.Linear(1), # Linear trend\n", " GPy.kern.StdPeriodic(1, period=1.), # Periodicity (short term trend)\n", " GPy.kern.RBF(1) # Amplitude modulator (long term trend)\n", "]\n", "\n", "# Composite kernel: exponential + bias + linear + (periodic * rbf)\n", "k = ks[0] + ks[1] + ks[2] + ks[3]*ks[4]\n", "k" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create our GP regression model with the composite kernel\n", "m = GPy.models.GPRegression(X, Y, k)\n", "\n", "# Predict the CO2 at the training and test locations\n", "mean, Cov = m.predict(Xnew)\n", "\n", "# Setup figure \n", "plt.figure(figsize=(14,8))\n", "\n", "# Plot the GP fit and training points\n", "plot_gp(Xnew, mean, Cov)\n", "plt.plot(X, Y, \".b\", Xtest, Ytest, \".r\", alpha=0.4);\n", "\n", "# Annotate plot\n", "plt.legend(labels=[\"GP fit\",\"training points\", \"test points\"]);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that with even with our initial (mostly default) parameter choices, the GP fit to the training data is representative of the trends we are observing. However, the prediction of the test data is relatively poor (though we still have an upward trend). 
The obvious step now is to optimise our kernel parameters.\n", "\n", "Optimisation of the GP in itself is imperfect, often because the likelihood we are maximising can be multimodal, or flat, and so it can get stuck in local-maxima. It is always important to sanity-check the GP fit when optimising the parameters, to mitigate problems that could occur as a result (such as the minimised noise example we saw earlier). One of the ways to avoid the problem of local maxima is to reinitialise the optimiser with different starting locations, and take the maximum of these outputs. This is possible in `GPy` using `optimize_restarts(n)`, which will optimise the parameters $n$ times and take the best estimate." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optimise hyperparameters 5 times to get a good estimate \n", "m.optimize_restarts(5, robust=True) # We could do this more times, but it can be quite time consuming\n", "\n", "# Get the moments of our model fit\n", "mean, Cov = m.predict(Xnew)\n", "\n", "# Set up plot environment\n", "plt.figure(figsize=(14,8))\n", "\n", "# Plot our optimised GP and training/test points\n", "plot_gp(Xnew, mean, Cov)\n", "plt.plot(X, Y, \".b\", Xtest, Ytest, \".r\", alpha=0.4);\n", "# Annotate plot\n", "plt.legend(labels=[\"GP fit\",\"training points\", \"test points\"])\n", "# Preview model parameters\n", "m" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see from the output of the optimiser that there are a number of different minima in the negative log-likelihood (corresponding to maxima in the likelihood), implying that there are several modes. However, it's possible to get a good fit of the function with a GP with great predictive power. You might try allowing the optimiser to run for more iterations when you have more time.\n", "\n", "We can also look at the effects of each kernel on the GP model, which will show the features that it represents in the fit. For example, we expect the `Bias` to capture the offset, and the `Exponential` kernel to extract most of the non-linearity (that's not explained by the periodicity). We can view the effects of each of the components and the combination of them to see how they deconstruct the training data. This shows what features are being learned by which kernel." 
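, "\n", "A note on how the decomposition below works: the composite kernel appears inside the fitted model as `m.sum`, with each constituent accessible by its name, e.g. `m.sum.bias`, `m.sum.linear`, `m.sum.Exponential`, or `m.sum.mul` for the product term. Passing one of these (or a sum of them) as the `kern` argument of `predict_noiseless` should return the posterior contribution of just that component, for example:\n", "```\n", "mean_lin, Cov_lin = m.predict_noiseless(X, kern=m.sum.linear)   # contribution of the Linear kernel alone\n", "```"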
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Setup figure environment (4x2 grid)\n", "plt.figure(figsize=(18,18))\n", "\n", "# Show Bias kernel effect\n", "plt.subplot(421)\n", "mean, Cov = m.predict_noiseless(X, kern=m.sum.bias)\n", "plot_gp(X, mean, Cov)\n", "plt.title(\"Bias\")\n", "\n", "# Show Linear kernel effect\n", "plt.subplot(423)\n", "mean, Cov = m.predict_noiseless(X, kern=m.sum.linear)\n", "plot_gp(X, mean, Cov)\n", "plt.title(\"Linear\")\n", "\n", "# Show combination of Bias and Linear kernels\n", "plt.subplot(424)\n", "mean, Cov = m.predict_noiseless(X, kern=m.sum.linear + m.sum.bias)\n", "plot_gp(X, mean, Cov)\n", "plt.plot(X, Y, \".b\", ms=1.)\n", "plt.title(\"Linear + Bias\")\n", "\n", "# Show modulated Periodic x RBF kernel\n", "plt.subplot(425)\n", "mean, Cov = m.predict_noiseless(X, kern=m.sum.mul)\n", "plot_gp(X, mean, Cov)\n", "plt.title(\"Periodic x RBF\")\n", "\n", "# Show combination of Periodic, Bias and Linear kernels\n", "plt.subplot(426)\n", "mean, Cov = m.predict_noiseless(X, kern=m.sum.mul + m.sum.linear + m.sum.bias)\n", "plot_gp(X, mean, Cov)\n", "plt.plot(X, Y, \".b\", ms=1.)\n", "plt.title(\"(Periodic x RBF) + Linear + Bias\")\n", "\n", "# Show Exponential kernel effect\n", "plt.subplot(427)\n", "mean, Cov = m.predict_noiseless(X, kern=m.sum.Exponential)\n", "plot_gp(X, mean, Cov)\n", "plt.title(\"Exponential\")\n", "\n", "# Show combination of Periodic, Bias, Linear and Exponential kernel (this is our full GP fit !)\n", "plt.subplot(428)\n", "mean, Cov = m.predict_noiseless(X, kern=m.sum.Exponential + m.sum.mul + m.sum.linear + m.sum.bias)\n", "plot_gp(X, mean, Cov)\n", "plt.plot(X, Y, \".b\", ms=1.)\n", "plt.title(\"Exponential + (Periodic x RBF) + Linear + Bias\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It's clear that while our final fit is very good, some of the kernel effects are not as clear cut as we expected. For example, much of the upward trend is captured by the periodic combination, likely due to its ability to vary non-linearly. However, we can see that, for example, the `Exponential` kernel has a clear purpose in adjusting the GP fit to the data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 7\n", "\n", "What improvements might be made to the GP model, for example in terms combinations of kernels and parameters, to get a better fit, especially for the prediction of our testing data?" ] }, { "cell_type": "raw", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 8. Multiple Inputs\n", "\n", "Typically, we will have data that is more than one-dimensional. Gaussian processes regression is scalable in dimension, and everything used above can be used with multiple inputs. The following problem is reasonably straight-forward, a simple extension of the regression problem we've used to 2-dimensional input.\n", "\n", "Consider the toy example:\n", "$$\n", " f = (x_1, x_2) \\mapsto \\sin(x_1)\\sin(x_2)\n", "$$\n", "$$\n", " y = f(\\mathbf{X}) + \\epsilon, \\quad \\epsilon \\sim \\mathcal{N}(0, \\sigma^2)\n", "$$\n", "\n", "Now, our input space is made up of two coordinates, $(x_1, x_2)$. 
Suppose we have some random noisy observations of the function, which are 1-dimensional in nature:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Lambda function, f, the unknown function we are trying to predict\n", "f = lambda xi,xj: np.sin(xi) * np.sin(xj)\n", "\n", "# Our test grid\n", "[Xi, Xj] = np.meshgrid(np.linspace(-np.pi, np.pi, 50), np.linspace(-np.pi, np.pi, 50))\n", "\n", "# Number of samples [YOU CAN PLAY AROUND WITH THE NUMBER OF RANDOM SAMPLES TO SEE HOW THE FIT IS AFFECTED]\n", "num_measurements = 50\n", "\n", "# Random sample locations (2-D)\n", "X2 = np.random.uniform(-np.pi, np.pi, (num_measurements, 2))\n", "\n", "# Setup plot enviornment\n", "plt.figure(figsize=(14, 6))\n", "\n", "plt.subplot(121)\n", "# Show true function\n", "plt.contour(Xi, Xj, f(Xi,Xj))\n", "# Annotate plot\n", "plt.xlabel(\"$x_1$\"), plt.ylabel(\"$x_2$\")\n", "plt.title(\"$f(x_1,x_2)$\")\n", "\n", "plt.subplot(122)\n", "# Show sample locations\n", "plt.plot(X2[:,0],X2[:,1],'o')\n", "# Annotate plot\n", "plt.xlabel(\"$x_1$\"), plt.ylabel(\"$x_2$\")\n", "plt.title(\"Sample locations\"); " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our observations then are simply our sample locations propagated through $f$ with some Gaussian noise" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Y2 = np.array([f(x1,x2) for (x1,x2) in zip(X2[:,0], X2[:,1])])[:,None] + 0.05 * np.random.randn(X2.shape[0], 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can fit a Gaussian process over the observations by selecting a 2-D kernel. For example, the `RBF` kernel has a 2-D (really N-D) form, so we can use this as our kernel choice. Here we need to set both the `input_dim` to `2`, and specify the dimensions of the data we want the kernel to act over. In this case, it is the first 2 (of 2) dimensions, which in Python are indexed by `0` and `1`." 
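, "\n", "As in the 1-D examples, `GPy` expects 2-D arrays: the inputs `X2` have one row per observation and one column per input dimension, and the observations `Y2` form a column vector. A quick sanity check of the shapes before building the model can save confusion, e.g.\n", "```\n", "print(X2.shape, Y2.shape)   # expected: (num_measurements, 2) and (num_measurements, 1)\n", "```"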
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create a 2-D RBF kernel active over both input dimensions\n", "k = GPy.kern.RBF(2, active_dims=[0,1])\n", "\n", "# Create a GP Regression model with the sample locations and observations using the RBF kernel\n", "m = GPy.models.GPRegression(X2, Y2, k)\n", "\n", "# Fix the Gaussian noise variance, which we know\n", "m.Gaussian_noise.variance = 0.05 \n", "m.Gaussian_noise.variance.fix()\n", "\n", "# Optimise the kernel parameters\n", "m.optimize()\n", "\n", "# We need to augument our test space to be a list of coordinates for input to the GP\n", "Xnew2 = np.vstack((Xi.ravel(), Xj.ravel())).T # Change our input grid to list of coordinates\n", "\n", "# Predict the mean and covariance of the GP fit at the test locations\n", "mean2, Cov2 = m.predict_noiseless(Xnew2, full_cov=False)\n", "\n", "# Setup plot environment\n", "plt.figure(figsize=(14, 6))\n", "\n", "# Left plot shows mean of GP fit\n", "plt.subplot(121)\n", "\n", "# Plot mean surface\n", "plt.contour(Xi, Xj, mean2.reshape(Xi.shape))\n", "# Show sample locations\n", "plt.plot(X2[:,0],X2[:,1],'o'), plt.axis(\"square\")\n", "# Annotate plot\n", "plt.xlabel(\"$x_1$\"), plt.ylabel(\"$x_2$\")\n", "plt.title(\"Mean of GP fit\"), plt.colorbar()\n", "\n", "# Right plot shows the variance of the GP\n", "plt.subplot(122) \n", "# Plot variance surface\n", "plt.pcolor(Xi, Xj, Cov2.reshape(Xi.shape))\n", "# Show sample locations\n", "plt.plot(X2[:,0],X2[:,1],'o'), plt.axis(\"square\")\n", "# Annotate plot\n", "plt.xlabel(\"$x_1$\"), plt.ylabel(\"$x_2$\")\n", "plt.title(\"Variance of GP fit\"), plt.colorbar()\n", "\n", "# Preview GP model parameters\n", "m" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see here that the fit with the GP is generally good (this is quite a simple example, after all), and that functionally, the setup is almost identical to the 1-D case using `GPy`. We can also observe from the variance surface that we have the most uncertainty where there are no samples, which matches our expectations and the observation of the power of GPs in 1-D.\n", "\n", "Notice that in the model preview, we only have a single lengthscale that is based on both dimensions. This is because by default, the RBF kernel is isotropic, which in our case is fine. However, in reality, it is often desirable to have different parameters corresponding to different dimensions. There are two approaches to doing this in `GPy`. The first is to enable different anisotropy within the kernel itself, by using the `ARD` keyword, e.g.\n", "```\n", "k = GPy.kern.RBF(2, ARD=True)\n", "```\n", "\n", "Alternatively, we can use kernel composition to create two kernels acting seperately on different dimensions. An alternative construction of the anisotropic RBF is to construct a kernel from two 1-D `RBF` kernels active on $x_1$ and $x_2$ respectively:\n", "```\n", "k = GPy.kern.RBF(1, active_dims=[0]) * GPy.kern.RBF(1, active_dims=[1])\n", "```\n", "\n", "We can also use this construction to allow for different covariance assumptions in different dimensions. 
In the following example, we use an RBF to model $x_1$ and a Matérn 3/2 kernel for $x_2$:\n", "```\n", "k = GPy.kern.RBF(1, active_dims=[0]) * GPy.kern.Matern32(1, active_dims=[1])\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Construct a 2-D kernel with two 1-D kernels: RBF and Matern32\n", "k = GPy.kern.RBF(1, active_dims=[0]) * GPy.kern.Matern32(1, active_dims=[1])\n", "\n", "# Create a GP regression model with our 2-D composite kernel\n", "m = GPy.models.GPRegression(X2, Y2, k)\n", "\n", "# Fix the Gaussian noise variance \n", "m.Gaussian_noise.variance = 0.05 \n", "m.Gaussian_noise.variance.fix()\n", "\n", "# Optimise the kernel parameters\n", "m.optimize()\n", "\n", "# Change our input grid to list of coordinates\n", "Xnew2 = np.vstack((Xi.ravel(), Xj.ravel())).T\n", "\n", "# Predict the mean and covariance function\n", "mean2, Cov2 = m.predict_noiseless(Xnew2, full_cov=False)\n", "\n", "# Setup figure environment\n", "plt.figure(figsize=(14, 6))\n", "\n", "# Plot mean on left\n", "plt.subplot(121) \n", "\n", "# Plot mean surface\n", "plt.contour(Xi, Xj, mean2.reshape(Xi.shape))\n", "# Plot sample locations\n", "plt.plot(X2[:,0],X2[:,1],'o'), plt.axis(\"square\")\n", "# Annotate plot\n", "plt.xlabel(\"$x_1$\"), plt.ylabel(\"$x_2$\")\n", "plt.title(\"Mean of GP fit\"), plt.colorbar()\n", " \n", "# Show variance on right\n", "plt.subplot(122)\n", "\n", "# Plot variance surface\n", "plt.pcolor(Xi, Xj, Cov2.reshape(Xi.shape))\n", "# Plot sample locations\n", "plt.plot(X2[:,0],X2[:,1],'o'), plt.axis(\"square\")\n", "# Annotate plot\n", "plt.xlabel(\"$x_1$\"), plt.ylabel(\"$x_2$\")\n", "plt.title(\"Variance of GP fit\"), plt.colorbar()\n", "\n", "# Preview model parameters\n", "m" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Footnote\n", "\n", "Using Gaussian processes for machine learning gives you a powerful tool for learning latent functions in a Bayesian fashion, especially with a low number of observations. However, full GP regression can be quite time consuming, given it scales $\\mathscr{O}(n^3)$ with the number of training points, $n$. There are methods for dealing with this: for example, sparse approximation or representing the regression problem as a state-space model. Each of these comes with their own assumptions and drawbacks. In tomorrow's session, there will be a lecture on computational efficiency in GPs and the next lab will have examples of some of these approaches.\n", "\n", "When using GPs in a computational setting, there are a wide range of tools that can be used depending on your preference of programming language and libraries. `GPy`, the library we use in these labs, is a well established Python library that was developed here at the University of Sheffield, and is maintained by members here and at Amazon Research Cambridge. 
Some other libraries that are commonly used for Gaussian processes include:\n", "\n", "| Name | Language | Comments |\n", "|---------|-----------|----------|\n", "| `GPy` | Python | One of the earliest Python frameworks for Gaussian processes |\n", "| `GPML` | MATLAB | Examples and code used in the Rasmussen & Williams GPML book |\n", "| `GPstuff` | MATLAB, Octave, R | A toolbox with a wide range of inference methods, including HMC |\n", "| `GPflow` | Python | GP library built on `TensorFlow`, with similar notation to `GPy` |\n", "| `GPyTorch` | Python | GP library built on `PyTorch` |\n", "| `Stan` | R, Python, others | Probabilistic programming using MCMC that can easily be used to model GPs |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Credit\n", "\n", "This notebook was written by Wil Ward. It was adapted from notebooks by [Rich Wilkinson](https://rich-d-wilkinson.github.io/) and [Neil Lawrence](http://inverseprobability.com/). Additional writing and feedback were provided by Maria Skoularidou and Chunchao Ma." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }