{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "d642ce77",
   "metadata": {},
   "source": [
    "Title: Introduction to Classification\n",
    "Author: Thomas Breuel\n",
    "Institution: UniKL"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "a269b49c",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "\n",
    "import numpy,scipy,scipy.ndimage,zlib\n",
    "import random as pyrandom\n",
    "\n",
    "from pylab import *\n",
    "from pylab import random as arandom\n",
    "\n",
    "def method(cls):\n",
    "    # decorator: attach the decorated function to cls as a method\n",
    "    def _wrap(f):\n",
    "        setattr(cls, f.__name__, f)\n",
    "        return None\n",
    "    return _wrap"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b79c0edb",
   "metadata": {},
   "source": [
    "# An Object-Oriented View of Classification"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "564455b7",
   "metadata": {},
   "source": [
    "In regular software engineering, people think of software \n",
    "development as starting from a specification.  In pattern \n",
    "recognition, however, you don't know the \"specification\".  Instead, \n",
    "you have a source of data that generates instances.  Let's \n",
    "look at how this works.\n",
    "\n",
    "First, we generate a problem instance. In this case, the problem instance is actually the \n",
    "instantiation of a class, but in the real world, the problem instance \n",
    "is usually some particular physical instance: a device, \n",
    "a web site, a person, a book, etc.\n",
    "\n",
    "We may call the problem instance nature, \n",
    "in order to emphasize that it is usually \n",
    "given by physics and not another software system.\n",
    "However, \"nature\" may well be another software system."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "e96f3c4d",
   "metadata": {
    "collapsed": false
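   },
   "outputs": [],
   "source": [
    "# Nature: the interface that every problem instance implements\n",
    "class Nature:\n",
    "    def training_sample(self):\n",
    "        # returns a pair (challenge, correct response)\n",
    "        raise NotImplementedError\n",
    "    def challenge(self):\n",
    "        # returns a challenge without the correct response\n",
    "        raise NotImplementedError\n",
    "    def response(self,r):\n",
    "        # returns the reward for answering the last challenge with r\n",
    "        raise NotImplementedError"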
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "ebe9e99c",
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# An instance of \"nature\"\n",
    "class SevenSegments(Nature):\n",
    "    def __init__(self):\n",
    "        self.vs = [None] * 10\n",
    "        self.vs[0] = array((1,1,1,1,1,1,0))\n",
    "        self.vs[1] = array((0,1,1,0,0,0,0))\n",
    "        self.vs[2] = array((1,1,0,1,1,0,1))\n",
    "        self.vs[3] = array((1,1,1,1,0,0,1))\n",
    "        self.vs[4] = array((0,1,1,0,0,1,1))\n",
    "        self.vs[5] = array((1,0,1,1,0,1,1))\n",
    "        self.vs[6] = array((1,0,1,1,1,1,1))\n",
    "        self.vs[7] = array((1,1,1,0,0,0,0))\n",
    "        self.vs[8] = array((1,1,1,1,1,1,1))\n",
    "        self.vs[9] = array((1,1,1,1,0,1,1))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "712a8381",
   "metadata": {},
   "source": [
    "Nature gives us _training samples_ consisting of measurements and correct responses.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "3855135d",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "@method(SevenSegments)\n",
    "def training_sample(self):\n",
    "    c = pyrandom.randint(0,9)\n",
    "    v = self.vs[c]\n",
    "    return (v,c)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4301aa58",
   "metadata": {},
   "source": [
    "Based on these training samples, we need to build a model that returns correct classifications for _challenges_. A challenge is a training sample without its classification.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "6370e213",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "@method(SevenSegments)\n",
    "def challenge(self):\n",
    "    v,self.c = self.training_sample()\n",
    "    return v"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "48508c74",
   "metadata": {},
   "source": [
    "Our classifier needs to respond with a class, and nature evaluates our response and gives us feedback.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9518a38e",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "@method(SevenSegments)\n",
    "def response(self,c):\n",
    "    return (c==self.c)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "5aa34c99",
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(array([1, 1, 1, 1, 1, 1, 0]), 0)"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nature = SevenSegments()\n",
    "nature.training_sample()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dabf0858",
   "metadata": {},
   "source": [
    "We are trying to build a _model_ of nature.\n",
    "A model responds to each challenge by predicting the correct response.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "d46afd31",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Models\n",
    "class Model:\n",
    "    def __init__(self,dataset):\n",
    "        self.dataset = dataset\n",
    "    def predict(self,v):\n",
    "        return 0"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f69daeb3",
   "metadata": {},
   "source": [
    "Nature gives us training examples.  These training examples consist of some kind of measurement vector v and a corresponding classification c.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "b7fc36e0",
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "measurement vector [1 0 1 1 0 1 1]\n",
      "class 5\n"
     ]
    }
   ],
   "source": [
    "v,c = nature.training_sample()\n",
    "print(\"measurement vector %s\" % (v,))\n",
    "print(\"class %s\" % c)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f869f2c8",
   "metadata": {},
   "source": [
    "At training time, we can request training samples.\n",
    "Usually, generating labeled training samples costs money, so we only obtain a limited number of them.  \n",
    "These are usually collected in a dataset.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "da1fb60d",
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(array([1, 1, 0, 1, 1, 0, 1]), 2),\n",
       " (array([1, 0, 1, 1, 1, 1, 1]), 6),\n",
       " (array([1, 1, 1, 1, 1, 1, 0]), 0),\n",
       " (array([1, 1, 1, 1, 1, 1, 0]), 0),\n",
       " (array([1, 1, 1, 1, 0, 0, 1]), 3)]"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "trainingset = [nature.training_sample() for i in range(100)]\n",
    "trainingset[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "04b3580d",
   "metadata": {},
   "source": [
    "We usually use the dataset to create a model. \n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "3b31eed4",
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "model = Model(trainingset)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bd43aaa3",
   "metadata": {},
   "source": [
    "After training, nature presents us with challenges or test samples.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "37ca2c8e",
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1, 1, 1, 1, 0, 1, 1])"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nature.challenge()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "afaab8b6",
   "metadata": {},
   "source": [
    "The task of the model is to predict a value based on the challenge.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "d201f147",
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0\n"
     ]
    }
   ],
   "source": [
    "prediction = model.predict(nature.challenge())\n",
    "print(prediction)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "84b57735",
   "metadata": {},
   "source": [
    "Our prediction is then handed back to nature for evaluation.  We may or may not see this evaluation, but eventually, our model will be judged on the quality of its predictions.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "a4a665fb",
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nature.response(prediction)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9494acc4",
   "metadata": {},
   "source": [
    "A second important question is how we validate a model. In standard software engineering, \n",
    "the assumption is (rightly or wrongly) that all we need to show is \n",
    "conformance to the specification. However, pattern recognition and \n",
    "machine learning methods do not have a specification. They do, however, \n",
    "have lots of data, and that lets us make good empirical predictions about performance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "356b6722",
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "93"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# evaluating the model\n",
    "def count_errors(nature,model,trials=100):\n",
    "    errors = 0\n",
    "    for i in range(trials):\n",
    "        v = nature.challenge()\n",
    "        c = model.predict(v)\n",
    "        if not nature.response(c):\n",
    "            errors += 1\n",
    "    return errors\n",
    "count_errors(nature,model,trials=100)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0fc4217f",
   "metadata": {},
   "source": [
    "OK, that's not very good... the model is wrong about 90 percent of the time, which is chance level for ten equally likely classes.  Let's try for something better.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "83e261ab",
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "class MemoryModel:\n",
    "    def __init__(self,dataset):\n",
    "        self.memory = {}\n",
    "        for v,c in dataset:\n",
    "            self.memory[tuple(v)] = c\n",
    "    def predict(self,v):\n",
    "        return self.memory.get(tuple(v),0)\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "e8e07326",
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "model = MemoryModel(trainingset)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "id": "f68736d8",
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "count_errors(nature,model,trials=100)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3247437f",
   "metadata": {},
   "source": [
    "In the noise-free case, memorizing the samples can give good results.\n",
    "\n",
    "However, the memory model still lacks _generalization_: that is, it can't make good predictions for previously unseen samples that are similar to, but different from, existing samples."
   ]
  },
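  {
   "cell_type": "markdown",
   "id": "gr-sketch-memory-lookup",
   "metadata": {},
   "source": [
    "As a minimal sketch of this limitation, consider a pure lookup table. The two stored patterns and the helper name `memory_predict` are illustrative assumptions, not part of the notebook's own code. A vector that differs from a stored pattern in a single segment is simply not found, so the lookup can only fall back to a default class:\n",
    "\n",
    "```python\n",
    "# Hypothetical lookup table holding two memorized seven-segment patterns.\n",
    "memory = {\n",
    "    (1, 1, 1, 1, 1, 1, 1): 8,\n",
    "    (0, 1, 1, 0, 0, 0, 0): 1,\n",
    "}\n",
    "\n",
    "def memory_predict(v, default=0):\n",
    "    # Exact-match lookup: any vector not stored verbatim gets the default.\n",
    "    return memory.get(tuple(v), default)\n",
    "\n",
    "seen = (1, 1, 1, 1, 1, 1, 1)\n",
    "unseen = (0, 1, 1, 0, 0, 0, 1)  # the stored 1 with its last segment flipped\n",
    "\n",
    "print(memory_predict(seen))    # found in the table\n",
    "print(memory_predict(unseen))  # not found, falls back to the default\n",
    "```\n",
    "\n",
    "The unseen vector is only one segment away from the stored pattern for class 1, but exact matching cannot make use of that similarity."
   ]
  },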
  {
   "cell_type": "markdown",
   "id": "fb24cbad",
   "metadata": {},
   "source": [
    "# Noisy Samples"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "95b18e36",
   "metadata": {},
   "source": [
    "In the above example, nature just generated fixed patterns\n",
    "with a one-to-one correspondence to classes.\n",
    "\n",
    "In practice, there is usually noise.\n",
    "\n",
    "Noise means:\n",
    "\n",
    "- multiple patterns correspond to each class\n",
    "- each pattern may correspond to multiple classes\n",
    "- the correspondences are non-deterministic\n",
    "\n",
    "The overall goal is to minimize the *error rate*."
   ]
  },
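  {
   "cell_type": "markdown",
   "id": "gr-sketch-confusable",
   "metadata": {},
   "source": [
    "One plausible kind of noise for this problem is a segment that is read incorrectly. The following sketch (an illustration under that assumption; the names `prototypes`, `hamming`, and `confusable` are hypothetical) lists the pairs of digits whose noise-free patterns from `SevenSegments` differ in exactly one segment. For such pairs, a single flipped segment produces a pattern that is the exact prototype of another class, which is how one pattern can come to correspond to multiple classes:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "# The ten noise-free patterns, copied from the SevenSegments class above.\n",
    "prototypes = np.array([\n",
    "    (1, 1, 1, 1, 1, 1, 0), (0, 1, 1, 0, 0, 0, 0), (1, 1, 0, 1, 1, 0, 1),\n",
    "    (1, 1, 1, 1, 0, 0, 1), (0, 1, 1, 0, 0, 1, 1), (1, 0, 1, 1, 0, 1, 1),\n",
    "    (1, 0, 1, 1, 1, 1, 1), (1, 1, 1, 0, 0, 0, 0), (1, 1, 1, 1, 1, 1, 1),\n",
    "    (1, 1, 1, 1, 0, 1, 1),\n",
    "])\n",
    "\n",
    "def hamming(a, b):\n",
    "    # Number of segments in which two patterns differ.\n",
    "    return int(np.sum(a != b))\n",
    "\n",
    "# Digit pairs (i, j) with i < j that one flipped segment turns into each other.\n",
    "confusable = [(i, j) for i in range(10) for j in range(i + 1, 10)\n",
    "              if hamming(prototypes[i], prototypes[j]) == 1]\n",
    "print(confusable)\n",
    "```\n",
    "\n",
    "For example, the patterns for 0 and 8 differ only in the last segment, so the pair `(0, 8)` appears in the list."
   ]
  },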
  {
   "cell_type": "code",
   "execution_count": 62,
   "id": "ae9e20ee",
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from pylab import random as arandom\n",
    "class NoisySevenSegments(SevenSegments):\n",
    "    def __init__(self):\n",
    "        # reuse the ten prototype vectors defined by SevenSegments\n",
    "        SevenSegments.__init__(self)\n",
    "        # probability that any single segment is flipped\n",
    "        self.p_noise = 0.07"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0dc948f1",
   "metadata": {},
   "source": [
    "Now the training samples consist of the true targets, but with some of the segments flipped.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "id": "8b7113e4",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "@method(NoisySevenSegments)\n",
    "def training_sample(self):\n",
    "    c = pyrandom.randint(0,len(self.vs)-1)\n",
    "    v = self.vs[c]\n",
    "    flip = 1.0*(arandom(size=7)<self.p_noise)\n",
    "    v = 1.0*(v!=flip)       \n",
    "    return (v,c)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3c3416e4",
   "metadata": {},
   "source": [
    "Let's see how well the memory model works.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "id": "7691a888",
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "3425\n"
     ]
    }
   ],
   "source": [
    "nature = NoisySevenSegments()\n",
    "trainingset = [nature.training_sample() for i in range(100)]\n",
    "model = MemoryModel(trainingset)\n",
    "print(count_errors(nature,model,trials=10000))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a1f2ebcd",
   "metadata": {},
   "source": [
    "Let's use \"similarity\" to help improve performance.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "id": "138fb20e",
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 3.8478801 ,  4.20201828,  3.70312897,  5.37046302,  3.23728652,\n",
       "         4.94973151,  3.44314282,  2.96195443,  3.27705394,  3.53697688]])"
      ]
     },
     "execution_count": 65,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from scipy.spatial import distance\n",
    "distance.cdist(randn(1,10),randn(10,10))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "id": "5b03baa6",
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# nearest neighbor models\n",
    "class NearestNeighborModel:\n",
    "    def __init__(self,dataset):\n",
    "        self.centers = array([v for v,c in dataset],'f')\n",
    "        self.classes = array([c for v,c in dataset],'i')\n",
    "    def predict(self,v):\n",
    "        ds = distance.cdist(array([v],'f'),self.centers)\n",
    "        i = argmin(ds[0])\n",
    "        return self.classes[i]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "id": "c6e3e7ba",
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "model = NearestNeighborModel(trainingset)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "id": "e1eaa10f",
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2859"
      ]
     },
     "execution_count": 68,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "count_errors(nature,model,trials=10000)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1bc32b56",
   "metadata": {},
   "source": [
    "We often determine the *error rate*,\n",
    "the fraction of samples that are *misclassified*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "id": "3a2273ce",
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.2785"
      ]
     },
     "execution_count": 69,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "count_errors(nature,model,trials=10000)/10000.0"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9c0586fe",
   "metadata": {},
   "source": [
    "# Test Sets\n",
    "\n",
    "Above, we evaluated the model by trying it out in the real world\n",
    "and getting feedback from nature.  However, usually, we use\n",
    "a *test set* instead, a set of samples obtained like training\n",
    "samples.\n",
    "\n",
    "We can use this to estimate the error rate as well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "id": "15660a1e",
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "26"
      ]
     },
     "execution_count": 70,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "testset = [nature.training_sample() for i in range(100)]\n",
    "truth = [c for v,c in testset]\n",
    "predictions = [model.predict(v) for v,c in testset]\n",
    "sum(array(predictions)!=array(truth))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b3906f65",
   "metadata": {},
   "source": [
    "We can wrap this up as a function\n",
    "that evaluates a model on a test set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "id": "d1a08d39",
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def dataset_eval(testset,model):\n",
    "    truth = [c for v,c in testset]\n",
    "    predictions = [model.predict(v) for v,c in testset]\n",
    "    return sum(array(predictions)!=array(truth))*1.0/len(testset)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "id": "2c4c292e",
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.26000000000000001"
      ]
     },
     "execution_count": 72,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dataset_eval(testset,model)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "042f894f",
   "metadata": {},
   "source": [
    "Note that we can also evaluate the performance of the\n",
    "model on the training set itself.\n",
    "This will, in general, give an error rate that is\n",
    "too low.\n",
    "For nearest neighbor classifiers, it is easy to see why: every training sample has itself as a nearest neighbor at distance zero, so errors arise only when identical vectors carry different labels."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "id": "1674d651",
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.13"
      ]
     },
     "execution_count": 73,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dataset_eval(trainingset,model)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "26bb88e9",
   "metadata": {},
   "source": [
    "Two important things to remember about test sets are:\n",
    "\n",
    "- the test set must be drawn from the same distribution as the training set, so that it is statistically representative of it\n",
    "- the error rate on the training set itself is often lower than the error rate on the test set"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e8a3c22d",
   "metadata": {},
   "source": [
    "# Manual Creation of Test/Training Sets"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "32f1940f",
   "metadata": {},
   "source": [
    "In fact, often we just get a single set of labeled samples.  In that case, we need to\n",
    "manually split the data into training and test sets.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "id": "9c294b55",
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import random as pyrandom\n",
    "all_indexes = set(range(len(trainingset)))\n",
    "test_indexes = set(pyrandom.sample(sorted(all_indexes),int(0.1*len(all_indexes))))\n",
    "training_indexes = set(all_indexes-test_indexes)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "id": "c7d7d133",
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "my_testset = [trainingset[i] for i in test_indexes]\n",
    "my_trainingset = [trainingset[i] for i in training_indexes]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "20cc2218",
   "metadata": {},
   "source": [
    "In fact, often we split labeled data into three sets:\n",
    "\n",
    "- a *training set* that we use to train one or more models\n",
    "- a *validation set* that we use to optimize parameter choices\n",
    "- a *test set* that we use for final evaluation of error rates\n",
    "\n",
    "Note that you should not evaluate on a test set too often: if you try many models,\n",
    "one of them may seem to work well on the test set purely by chance.\n",
    "\n",
    "We will talk about ways of dealing with this problem later."
   ]
  },
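  {
   "cell_type": "markdown",
   "id": "gr-sketch-three-way-split",
   "metadata": {},
   "source": [
    "A three-way split along these lines could be sketched as follows. The function name `three_way_split`, the 10% validation and test fractions, and the fixed seed are assumptions chosen for illustration; the right proportions depend on how much labeled data is available.\n",
    "\n",
    "```python\n",
    "import random\n",
    "\n",
    "def three_way_split(samples, valid_frac=0.1, test_frac=0.1, seed=0):\n",
    "    # Shuffle a copy so that the caller's data is left untouched.\n",
    "    rng = random.Random(seed)\n",
    "    shuffled = list(samples)\n",
    "    rng.shuffle(shuffled)\n",
    "    n_test = int(test_frac * len(shuffled))\n",
    "    n_valid = int(valid_frac * len(shuffled))\n",
    "    # The three slices are disjoint and together cover all samples.\n",
    "    test = shuffled[:n_test]\n",
    "    valid = shuffled[n_test:n_test + n_valid]\n",
    "    train = shuffled[n_test + n_valid:]\n",
    "    return train, valid, test\n",
    "\n",
    "train, valid, test = three_way_split(range(100))\n",
    "print([len(train), len(valid), len(test)])\n",
    "```\n",
    "\n",
    "Because the slices are taken from one shuffled list, no sample can end up in more than one of the three sets."
   ]
  },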
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9eb9284d",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 5
}