{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Binary classification with the crabs data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The package is designed to handle models of the following general form\n", "$$\begin{aligned}\n", "\mathbf{y} \ |\ \mathbf{f}, \theta &\sim \prod_{i=1}^n p(y_i \ | \ f_i,\theta), \\\n", " \mathbf{f} \ | \ \theta &\sim \mathcal{GP}\left(m_{\theta}(\mathbf{x}), k_{\theta}(\mathbf{x}, \mathbf{x}')\right),\\\n", " \theta &\sim p(\theta), \n", "\end{aligned}$$\n", "where $\mathbf{y}=(y_1,y_2,\ldots,y_n) \in \mathcal{Y}$ and $\mathbf{x} \in \mathcal{X}$ are the observations and covariates, respectively, and $f_i:=f(\mathbf{x}_i)$ is the latent function value at $\mathbf{x}_i$, which we model with a Gaussian process prior. We assume that the responses $\mathbf{y}$ are conditionally independent given the latent function $\mathbf{f}$, so the likelihood $p(\mathbf{y} \ | \ \mathbf{f}, \theta)$ factorizes over the observations. \n", "\n", "In the case where the observations are Gaussian distributed, the marginal likelihood and predictive distribution can be derived analytically. See the [Regression notebook](https://github.com/STOR-i/GaussianProcesses.jl/blob/master/notebooks/Regression.ipynb) for an illustration." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this example we show how the GP **Monte Carlo** function can be used for **supervised classification**. We use the crabs dataset from the R package MASS, where the task is to predict whether a crab has the blue or orange colour form. Our aim is to perform a Bayesian analysis and calculate the posterior distribution of the latent GP function $\mathbf{f}$ and the model parameters $\theta$ from the training data $\{\mathbf{X}, \mathbf{y}\}$." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "using GaussianProcesses, RDatasets\n", "import Distributions:Normal\n", "\n", "srand(113355) # fix the random seed for reproducibility\n", "\n", "crabs = dataset(\"MASS\",\"crabs\"); # load the data \n", "crabs = crabs[shuffle(1:size(crabs)[1]), :]; # shuffle the data\n", "\n", "train = crabs[1:div(end,2),:]; # use the first half of the data for training\n", "\n", "y = Array{Bool}(size(train)[1]); # response\n", "y[train[:Sp].==\"B\"]=0; # convert the species labels to booleans\n", "y[train[:Sp].==\"O\"]=1;\n", "\n", "X = convert(Array,train[:,4:end]); # predictors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We assume a zero-mean GP with a Matérn 3/2 kernel. We use the automatic relevance determination (ARD) version of the kernel to allow each dimension of the predictor variables to have a different length scale. As this is binary classification, we use the Bernoulli likelihood, \n", "\n", "$$\n", "y_i \sim \mathrm{Bernoulli}(\Phi(f_i)),\n", "$$\n", "where $\Phi: \mathbb{R} \rightarrow [0,1]$ is the cumulative distribution function of a standard Gaussian. It acts as a squashing function that maps the latent GP function to the interval $[0,1]$, giving the probability that $y_i=1$. \n", "\n", "**Note** that `BernLik` requires the observations to be of type `Bool` and, unlike some likelihood functions (e.g. Student-t), does not contain any parameters to be set at initialisation."
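, "\n", "\n", "To make the squashing concrete, the short sketch below (an illustration added for exposition, not part of the original analysis) pushes a few example latent values through $\Phi$ using `cdf` from the `Distributions` package:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Illustration: squash example latent values through the standard normal CDF\n", "import Distributions: Normal, cdf\n", "\n", "fs = [-3.0, -1.0, 0.0, 1.0, 3.0]; # example latent function values\n", "probs = [cdf(Normal(0.0, 1.0), f) for f in fs] # Phi(f) in [0,1]: the probability that y_i = 1"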
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Select the mean, kernel and likelihood functions\n", "mZero = MeanZero(); # Zero mean function\n", "kern = Matern(3/2,zeros(5),0.0); # Matern 3/2 ARD kernel (note that hyperparameters are on the log scale)\n", "lik = BernLik(); # Bernoulli likelihood for binary data {0,1}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We fit the GP using the general `GP` function. This function is shorthand for the `GPMC` function, which is used to generate **Monte Carlo approximations** of the latent function when the **likelihood is non-Gaussian**. " ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "GP Monte Carlo object:\n", " Dim = 5\n", " Number of observations = 100\n", " Mean function:\n", " Type: GaussianProcesses.MeanZero, Params: Float64[]\n", " Kernel:\n", " Type: GaussianProcesses.Mat32Ard, Params: [-0.0, -0.0, -0.0, -0.0, -0.0, 0.0]\n", " Likelihood:\n", " Type: GaussianProcesses.BernLik, Params: Any[]\n", " Input observations = \n", "[16.2 11.2 … 11.6 18.5; 13.3 10.0 … 9.1 14.6; … ; 41.7 26.9 … 28.4 42.0; 15.4 9.4 … 10.4 16.6]\n", " Output observations = Bool[false, false, false, false, true, true, false, true, true, true … false, true, false, false, false, true, false, false, false, true]\n", " Log-posterior = -161.209" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gp = GP(X',y,mZero,kern,lik) # Fit the Gaussian process model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We assign `Normal` priors from the `Distributions` package to each of the Matérn kernel parameters. If the mean and likelihood functions also contained parameters, then we could set priors for these in the same way, using `gp.m` and `gp.lik` in place of `gp.k`, respectively." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "6-element Array{Distributions.Normal{Float64},1}:\n", " Distributions.Normal{Float64}(μ=0.0, σ=2.0)\n", " Distributions.Normal{Float64}(μ=0.0, σ=2.0)\n", " Distributions.Normal{Float64}(μ=0.0, σ=2.0)\n", " Distributions.Normal{Float64}(μ=0.0, σ=2.0)\n", " Distributions.Normal{Float64}(μ=0.0, σ=2.0)\n", " Distributions.Normal{Float64}(μ=0.0, σ=2.0)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "set_priors!(gp.k,[Normal(0.0,2.0) for i in 1:6])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Samples of the latent function and parameters are drawn from the joint posterior $p(\mathbf{f},\theta \ | \ \mathbf{X},\mathbf{y})$ using MCMC. The MCMC routine uses the `Klara` package. By default, the `mcmc` function uses the Hamiltonian Monte Carlo algorithm. Alternative samplers, such as `Klara.MALA()` and `Klara.NUTS()`, can also be used (see the [Klara](https://github.com/JuliaStats/Klara.jl) documentation for more details). Unless otherwise specified, the sampler defaults to 1,000 iterations with no burn-in phase or thinning of the Markov chain.
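", "\n", "\n", "As an illustration of these defaults, the hypothetical calls below contrast the default sampler with the `Klara.NUTS()` alternative mentioned above. This cell is a sketch added for exposition, not part of the original notebook, and the calls are left commented out so that they do not advance the random number generator before the analysis that follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Sketch: default settings (HMC sampler, 1,000 iterations, no burn-in or thinning)\n", "# samples_hmc = mcmc(gp)\n", "\n", "# Sketch: NUTS sampler from Klara with explicit chain settings\n", "# samples_nuts = mcmc(gp; sampler=Klara.NUTS(), nIter=5000, burnin=1000, thin=5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we sample with MALA, using a drift step of 0.1: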
" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "BasicMCJob:\n", " Variable [1]: p (BasicContMuvParameter)\n", " GenericModel: 1 variables, 0 dependencies (directed graph)\n", " MALA sampler: drift step = 0.1\n", " VanillaMCTuner: period = 100, verbose = false\n", " BasicMCRange: number of steps = 4996, burnin = 1000, thinning = 5" ] } ], "source": [ "samples = mcmc(gp;sampler=Klara.MALA(0.1),nIter=5000,burnin=1000,thin=5);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We test the predictive accuracy of the fitted model against a hold-out dataset" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "test = crabs[div(end,2)+1:end,:]; # select test data\n", "\n", "yTest = Array{Bool}(size(test)[1]); # test response data\n", "yTest[test[:Sp].==\"B\"]=0; # convert characters to booleans\n", "yTest[test[:Sp].==\"O\"]=1;\n", "\n", "xTest = convert(Array,test[:,4:end]);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the posterior samples $\\{f^{(i)},\\theta^{(i)}\\}_{i=1}^N$ from $p(f,\\theta \\ | \\ X,y)$ we can make predictions $\\hat{y}$ using the `predict_y` function to sample predictions conditional on the MCMC samples." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "ymean = Array{Float64}(size(samples,2),size(xTest,1));\n", "\n", "for i in 1:size(samples,2)\n", " set_params!(gp,samples[:,i])\n", " update_target!(gp)\n", " ymean[i,:] = predict_y(gp,xTest')[1]\n", "end" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For each of the posterior samples we plot the predicted observation $\\hat{y}$ (given as lines) and overlay the true observations from the held-out data (circles)." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "using Plots\n", "gr()\n", "\n", "plot(ymean',leg=false,html_output_format=:png)\n", "scatter!(yTest)" ] } ], "metadata": { "kernelspec": { "display_name": "Julia 0.6.0", "language": "julia", "name": "julia-0.6" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "0.6.0" } }, "nbformat": 4, "nbformat_minor": 2 }