{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Populating the interactive namespace from numpy and matplotlib\n"
     ]
    }
   ],
   "source": [
    "%pylab inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this notebook I'll show probabilistic interpretation of the nearest neighbours algorith as a mixture of Gaussians. Following Barber 2012. First I'll give an example of ... Next, I'll show how to reformulate ... using \n",
    "\n",
    "## Theory\n",
    "\n",
    "This pretty much follows from [Barber, 2012](http://web4.cs.ucl.ac.uk/staff/D.Barber/textbook/090310.pdf)\n",
    "\n",
    "kNN is simple to understand and implement, and often used as a baseline.\n",
    "\n",
    "Some limitations of this approach.\n",
    "* In metric based methods, how do we measure distance? euclidean distance does not account for how data is distributed\n",
    "* The whole dataset needs to be stored to make a classification since the novel point must be compared to all of the train points.\n",
    "* Each distance calculation can be expensive if the datapoints are high dimensional"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can reformulate the kNN as a class conditional mixture of Gaussians. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Probabilistic NN"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In [Barber 2012](http://web4.cs.ucl.ac.uk/staff/D.Barber/textbook/090310.pdf) shows a \n",
    "an of the nearest neighbour method as the limiting case of a mixture model."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What follows is a solution for **Exercise 158** from [Barber 2012]() in python."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Write a routine\n",
    "SoftNearNeigh(xtrain,xtest,trainlabels,sigma)\n",
    "to implement soft\n",
    "nearest  neighbours,  analogous  to\n",
    "nearNeigh.m\n",
    ".  Here\n",
    "sigma\n",
    "is  the  variance\n",
    "σ\n",
    "2\n",
    "in  equation  (14.3.1).  As\n",
    "above,  the  file\n",
    "NNdata.mat\n",
    "contains  training  and  test  data  for  the  handwritten  digits  5  and  9.   Using\n",
    "leave  one  out  cross-validation,  find  the  optimal\n",
    "σ\n",
    "2\n",
    "and  use  this  to  compute  the  classification  accuracy  of\n",
    "the  method  on  the  test  data.   Hint:  you  may  have  numerical  difficulty  with  this  method.   To  avoid  this,\n",
    "consider  using  the  logarithm,  and  how  to  numerically  compute\n",
    "log\n",
    "(\n",
    "e\n",
    "a\n",
    "+\n",
    "e\n",
    "b\n",
    ")\n",
    "for  large  (negative)\n",
    "a\n",
    "and\n",
    "b\n",
    ".\n",
    "See also\n",
    "logsumexp.m\n",
    "."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For this exercise we'll be using a subsent of the MNIST dataset provided in BRMLtoolkit at [Barber 2012](http://web4.cs.ucl.ac.uk/staff/D.Barber/pmwiki/pmwiki.php?n=Brml.Software)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import scipy.io as sio"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "nndata = sio.loadmat('/Users/gm/Downloads/BRMLtoolkit/data/NNdata.mat')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'__globals__': [],\n",
       " '__header__': b'MATLAB 5.0 MAT-file, Platform: PCWIN, Created on: Sat Aug 01 14:24:46 2009',\n",
       " '__version__': '1.0',\n",
       " 'test5': array([[0, 0, 0, ..., 0, 0, 0],\n",
       "        [0, 0, 0, ..., 0, 0, 0],\n",
       "        [0, 0, 0, ..., 0, 0, 0],\n",
       "        ..., \n",
       "        [0, 0, 0, ..., 0, 0, 0],\n",
       "        [0, 0, 0, ..., 0, 0, 0],\n",
       "        [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),\n",
       " 'test9': array([[0, 0, 0, ..., 0, 0, 0],\n",
       "        [0, 0, 0, ..., 0, 0, 0],\n",
       "        [0, 0, 0, ..., 0, 0, 0],\n",
       "        ..., \n",
       "        [0, 0, 0, ..., 0, 0, 0],\n",
       "        [0, 0, 0, ..., 0, 0, 0],\n",
       "        [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),\n",
       " 'train5': array([[0, 0, 0, ..., 0, 0, 0],\n",
       "        [0, 0, 0, ..., 0, 0, 0],\n",
       "        [0, 0, 0, ..., 0, 0, 0],\n",
       "        ..., \n",
       "        [0, 0, 0, ..., 0, 0, 0],\n",
       "        [0, 0, 0, ..., 0, 0, 0],\n",
       "        [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),\n",
       " 'train9': array([[0, 0, 0, ..., 0, 0, 0],\n",
       "        [0, 0, 0, ..., 0, 0, 0],\n",
       "        [0, 0, 0, ..., 0, 0, 0],\n",
       "        ..., \n",
       "        [0, 0, 0, ..., 0, 0, 0],\n",
       "        [0, 0, 0, ..., 0, 0, 0],\n",
       "        [0, 0, 0, ..., 0, 0, 0]], dtype=uint8)}"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nndata"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We want to solve a binary classification problem."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "class0_train = nndata['train5']\n",
    "class0_test = nndata['test5']\n",
    "\n",
    "class1_train = nndata['train9']\n",
    "class1_test = nndata['test9']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Barber follows a generative approach and uses kernel density estimation\n",
    "(Parzen estimator) to interpret kNN as the limiting case of a mixture of Gaussians. An isotropic Gaussian of width $\\sigma^2$ is placed at each data point, and a mixture is used to model each class.\n",
    "\n",
    "### The Parzen estimator\n",
    "\n",
    "With kernel density estimation we want to approximate a PDF with a mixture of continuos probability distributions. A Parzen estimator centers a probability distribution at each data point $\\textbf{x}_n$ as\n",
    "\n",
    "$P(\\textbf{x}) = \\frac{1}{N} \\sum_{n=1}^{N} P(\\textbf{x}|\\textbf{x}_{n})$\n",
    "\n",
    "For a D dimensional $\\textbf{x}$ we choose an isotropic Gaussian $P(\\textbf{x}|\\textbf{x}_{n}) = \\mathcal{N} (\\textbf{x}|\\textbf{x}_{n}, \\sigma^2 \\textbf{I}_{D})$, which gives the mixture\n",
    "\n",
    "$P(x) = \\frac{1}{N} \\sum_{n=1}^{N} \\frac{1}{(2 \\pi \\sigma^2)^{D/2}}\\mathcal{e}^{- (\\textbf{x} - \\textbf{x}_n)^2 / 2\\sigma^2} $"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Nearest Neighbour classification\n",
    "\n",
    "Given classes $c = {0, 1}$, \n",
    "we consider the following mixture model\n",
    "\n",
    "$P(\\textbf{x}|c=0) = \\frac{1}{N_0} \\sum_{n \\in \\textit{class 0}} \\mathcal{N}(\\textbf{x}| \\textbf{x}_n, \\sigma^2\\textbf{I})  =  \\frac{1}{N_0} \\frac{1}{(2 \\pi \\sigma^2)^{\\frac{D}{2}}} \\sum_{n \\in \\textit{class 0}} e^{-(\\textbf{x} - \\textbf{x}_n)^2 / (2 \\sigma^2) } $\n",
    "\n",
    "$P(\\textbf{x}|c=1) = \\frac{1}{N_1} \\sum_{n \\in \\textit{class 1}} \\mathcal{N}(\\textbf{x}| \\textbf{x}_n, \\sigma^2\\textbf{I}) = \\frac{1}{N_1} \\frac{1}{(2 \\pi \\sigma^2)^{\\frac{D}{2}}} \\sum_{n \\in \\textit{class 1}} e^{-(\\textbf{x} - \\textbf{x}_n)^2 / (2 \\sigma^2) } $"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To classify a new instance $\\textbf{x}_{*}$ we calculate the posterior for both classes and take the ratio\n",
    "\n",
    "$\\frac{P(c=0|\\textbf{x}_{\\star})}{P(c=1|\\textbf{x}_{\\star})} = \\frac{P(\\textbf{x}_{\\star}|c=0) P(c=0)}{P(\\textbf{x}_{\\star}|c=1) P(c=1)}$. If this ratio is $\\gt 1$ than we classify $\\textbf{n}_{\\star}$ as class 0, otherwise as class 1. The class probabilities can be determined by maximum likelihood with the following setting $P(c) = \\sum_{i=0}^N \\frac{N_{i}}{N}$.\n",
    "\n",
    "TODO: proof!\n",
    "\n",
    "To understand how this relates to the nearest neighbour method, we need to consider the case $\\sigma^2 \\rightarrow 0$. \n",
    "\n",
    "Note that both nominator and denominator are sums of exponentials.\n",
    "Intuitively, if the veriance is small, the nominator will be dominated by the term for which point $x_{n_0} \\in class 0$ is closest to $\\textbf{x}_{\\star}$. The same holds for the denominator and points in class 1.\n",
    "\n",
    "$\\frac{1}{(2 \\pi \\sigma^2)^{\\frac{D}{2}}}$ cancels out, and for vanishingly small values of $\\sigma$ we have\n",
    "\n",
    "$\\frac{P(c=0|\\textbf{x}_{\\star})}{P(c=1|\\textbf{x}_{\\star})} \\approx $\n",
    "\n",
    "$\\frac{ e^{-(\\textbf{x}_{\\star} - x_{n_0})^2 / (2 \\sigma^2)} P(c=0)/N_{0}}  {e^{-(\\textbf{x}_{\\star} - x_{n_1})^2 / (2 \\sigma^2)}P(c=1)/N_{1}} $\n",
    "\n",
    "$\\frac{ e^{-(\\textbf{x}_{\\star} - x_{n_0})^2 / (2 \\sigma^2)}}  {e^{-(\\textbf{x}_{\\star} - x_{n_1})^2 / (2 \\sigma^2)}} $ \n",
    "\n",
    "In the limit $\\sigma^2 \\rightarrow 0$ we have \n",
    "\n",
    "$\\frac{ e^{-(\\textbf{x}_{\\star} - x_{n_0})^2}}{e^{-(\\textbf{x}_{\\star} - x_{n_1})^2}} $ \n",
    "\n",
    "\n",
    "so we classify $\\textbf{x}_{\\star}$ as class 0 if $\\textbf{x}_{\\star}$ has a point in class 0 than the closest point in class 1.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Implementation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 428,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def log_sum_exp(x):\n",
    "    \"\"\"\n",
    "    Log likelihood of a Parzen estimator\n",
    "    \"\"\"  \n",
    "    a = np.max(x)\n",
    "    return a + np.log(np.sum(np.exp(x-a)))\n",
    "\n",
    "def log_mean_exp(x):\n",
    "    \"\"\"\n",
    "    Log likelihood of a Parzen estimator\n",
    "    \"\"\"  \n",
    "    a = np.max(x)\n",
    "    return a + np.log(np.mean(np.exp(x-a)))\n",
    "\n",
    "def parzen(x, mu, sigma=1.0):\n",
    "    \"\"\"\n",
    "    Makes a function that allows the evalution of a Parzen \n",
    "    estimator where the Kernel is a normal distribution with \n",
    "    stddev sigma and with points at mu.\n",
    "    \n",
    "    Parameters\n",
    "    -----------\n",
    "    x : numpy array\n",
    "        Classification input\n",
    "    mu : numpy matrix\n",
    "        Contains the data points over which this distribution is based.\n",
    "    sigma : scalar\n",
    "        The standard deviation of the normal distribution around each data \\\n",
    "        point.\n",
    "    Returns\n",
    "    -------\n",
    "    lpdf : callable\n",
    "        Estimator of the log of the probability density under a point.\n",
    "    \"\"\"\n",
    "    # z = (1 / mu.shape[0]) * (1 / np.sqrt(sigma * np.pi * 2.0))\n",
    "    z = mu.shape[0] * np.sqrt(sigma * np.pi * 2.0)\n",
    "    e = (-(x - mu)**2.0) / (2.0 * sigma)\n",
    "    log_p = log_mean_exp(e) \n",
    "    return log_p - z"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 445,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1e-13 231 292 0.791095890410959\n",
      "1e-13 214 292 0.7328767123287672\n"
     ]
    }
   ],
   "source": [
    "sigmas = [1e-13]\n",
    "\n",
    "priorC0 = class0_train.T.shape[0] / (class0_train.T.shape[0] + class1_train.T.shape[0])\n",
    "priorC1 = class1_train.T.shape[0] / (class1_train.T.shape[0] + class1_train.T.shape[0])\n",
    "\n",
    "\n",
    "for sigma in sigmas:\n",
    "    correct = 0\n",
    "    for x in class0_test.T:\n",
    "        c0p = priorC0 * parzen(x, class0_train.T, sigma=sigma) \n",
    "        c1p = priorC1 * parzen(x, class1_train.T, sigma=sigma) \n",
    "\n",
    "        if (c0p / c1p) > 1:\n",
    "            correct += 1\n",
    "\n",
    "    print(sigma, correct, class1_test.shape[1], correct / class0_test.shape[1])\n",
    "\n",
    "    correct = 0\n",
    "    for x in class1_test.T:\n",
    "        c0p = priorC0 * parzen(x, class0_train.T, sigma=sigma) \n",
    "        c1p = priorC1 * parzen(x, class1_train.T, sigma=sigma)\n",
    "\n",
    "        if (c0p / c1p) <= 1:\n",
    "            correct += 1\n",
    "\n",
    "    print(sigma, correct, class0_test.shape[1], correct / (class1_test.shape[1]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Problem; when (c0p / c1p) we have a tie! How should we interpret that? My assumption is to assing ties to class 1!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##  kNN"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 110,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "X_train = np.append(class0_train.T, class1_train.T, axis=0)\n",
    "y_train = ['a'] * class0_train.shape[1] + ['b'] * class1_train.shape[1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 112,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(1200, 784)"
      ]
     },
     "execution_count": 112,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_train.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 120,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "X_test = np.append(class0_test.T, class1_test.T, axis=0)\n",
    "y_test = ['a'] * class0_test.shape[1] + ['b'] * class1_test.shape[1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 296,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "((584, 784), 584)"
      ]
     },
     "execution_count": 296,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_test.shape, len(y_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 402,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.neighbors import  KNeighborsClassifier\n",
    "knn = KNeighborsClassifier(n_neighbors=10, \n",
    "                           n_jobs=4, \n",
    "                           metric='euclidean', \n",
    "                           algorithm='ball_tree')\n",
    "fitted = knn.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 291,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "?KNeighborsClassifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 404,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "y_pred = fitted.predict(X_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 405,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy: {} 0.98801369863\n",
      "[[292   0]\n",
      " [  7 285]]\n"
     ]
    }
   ],
   "source": [
    "from sklearn.metrics import accuracy_score, confusion_matrix\n",
    "\n",
    "print(\"Accuracy: {}\", accuracy_score(y_test, y_pred))\n",
    "print(confusion_matrix(y_test, y_pred))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}