{ "cells": [ { "cell_type": "markdown", "id": "d642ce77", "metadata": {}, "source": [ "Title: Introduction to Classification\n", "Author: Thomas Breuel\n", "Institution: UniKL" ] }, { "cell_type": "code", "execution_count": 18, "id": "a269b49c", "metadata": { "collapsed": false }, "outputs": [], "source": [ "\n", "import numpy,scipy,scipy.ndimage,zlib\n", "import random as pyrandom\n", "\n", "from pylab import *\n", "from pylab import random as arandom\n", "\n", "# decorator that attaches the decorated function to cls as a method\n", "# (relies on the Python 2 `new` module)\n", "def method(cls):\n", "    import new\n", "    def _wrap(f):\n", "        cls.__dict__[f.func_name] = new.instancemethod(f,None,cls)\n", "        return None\n", "    return _wrap" ] }, { "cell_type": "markdown", "id": "b79c0edb", "metadata": {}, "source": [ "# An Object-Oriented View of Classification" ] }, { "cell_type": "markdown", "id": "564455b7", "metadata": {}, "source": [ "In regular software engineering, people think of software \n", "development as starting from a specification. In pattern \n", "recognition, however, you don't know the \"specification\". Instead, \n", "you have a source of data that generates instances. Let's \n", "look at how this works.\n", "\n", "First, we generate a problem instance. In this case, the problem instance is actually the \n", "instantiation of a class, but in the real world, the problem instance \n", "is usually some particular physical instance: a device, \n", "a web site, a person, a book, etc.\n", "\n", "We may call the problem instance nature, \n", "in order to emphasize that it is usually \n", "given by physics and not another software system.\n", "However, \"nature\" may well be another software system."
] }, { "cell_type": "code", "execution_count": 19, "id": "e96f3c4d", "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Nature: the interface that every problem instance implements\n", "class Nature:\n", "    def training_sample(self):\n", "        # returns a pair (c,r): a challenge and its correct response\n", "        raise NotImplementedError\n", "    def challenge(self):\n", "        # returns a challenge c without its correct response\n", "        raise NotImplementedError\n", "    def response(self,r):\n", "        # takes our response r and returns a reward\n", "        raise NotImplementedError" ] }, { "cell_type": "code", "execution_count": 20, "id": "ebe9e99c", "metadata": { "collapsed": true }, "outputs": [], "source": [ "# An instance of \"nature\"\n", "class SevenSegments(Nature):\n", "    def __init__(self):\n", "        # vs[d] is the on/off pattern of the seven segments for digit d\n", "        self.vs = [None] * 10\n", "        self.vs[0] = array((1,1,1,1,1,1,0))\n", "        self.vs[1] = array((0,1,1,0,0,0,0))\n", "        self.vs[2] = array((1,1,0,1,1,0,1))\n", "        self.vs[3] = array((1,1,1,1,0,0,1))\n", "        self.vs[4] = array((0,1,1,0,0,1,1))\n", "        self.vs[5] = array((1,0,1,1,0,1,1))\n", "        self.vs[6] = array((1,0,1,1,1,1,1))\n", "        self.vs[7] = array((1,1,1,0,0,0,0))\n", "        self.vs[8] = array((1,1,1,1,1,1,1))\n", "        self.vs[9] = array((1,1,1,1,0,1,1))" ] }, { "cell_type": "markdown", "id": "712a8381", "metadata": {}, "source": [ "Nature gives us _training samples_ consisting of measurements and correct responses.\n" ] }, { "cell_type": "code", "execution_count": 21, "id": "3855135d", "metadata": { "collapsed": false }, "outputs": [], "source": [ "@method(SevenSegments)\n", "def training_sample(self):\n", "    c = pyrandom.randint(0,9)\n", "    v = self.vs[c]\n", "    return (v,c)" ] }, { "cell_type": "markdown", "id": "4301aa58", "metadata": {}, "source": [ "Based on these training samples, we need to build a model that then returns correct classifications. 
Nature then tests the model with challenges, which are training samples without the classification.\n", "\n" ] }, { "cell_type": "code", "execution_count": 22, "id": "6370e213", "metadata": { "collapsed": false }, "outputs": [], "source": [ "@method(SevenSegments)\n", "def challenge(self):\n", "    # remember the correct class so that response() can check the answer\n", "    v,self.c = self.training_sample()\n", "    return v" ] }, { "cell_type": "markdown", "id": "48508c74", "metadata": {}, "source": [ "Our classifier needs to respond with a class, and nature evaluates our response and gives us feedback.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "9518a38e", "metadata": { "collapsed": false }, "outputs": [], "source": [ "@method(SevenSegments)\n", "def response(self,c):\n", "    return (c==self.c)" ] }, { "cell_type": "code", "execution_count": 23, "id": "5aa34c99", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(array([1, 1, 1, 1, 1, 1, 0]), 0)" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nature = SevenSegments()\n", "nature.training_sample()" ] }, { "cell_type": "markdown", "id": "dabf0858", "metadata": {}, "source": [ "We are trying to build a _model_ of nature.\n", "A model responds to a challenge by predicting a response.\n", "\n" ] }, { "cell_type": "code", "execution_count": 24, "id": "d46afd31", "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Models\n", "# baseline model: ignores the dataset and always predicts class 0\n", "class Model:\n", "    def __init__(self,dataset):\n", "        self.dataset = dataset\n", "    def predict(self,v):\n", "        return 0" ] }, { "cell_type": "markdown", "id": "f69daeb3", "metadata": {}, "source": [ "Nature gives us training examples. 
These training examples consist of some kind of measurement vector v and a corresponding classification c.\n", "\n" ] }, { "cell_type": "code", "execution_count": 25, "id": "b7fc36e0", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "measurement vector [1 0 1 1 0 1 1]\n", "class 5\n" ] } ], "source": [ "v,c = nature.training_sample()\n", "print \"measurement vector\",v\n", "print \"class\",c" ] }, { "cell_type": "markdown", "id": "f869f2c8", "metadata": {}, "source": [ "At training time, we can request training samples.\n", "Usually, generating labeled training samples costs money, so we only obtain a limited number of them. \n", "These are usually collected in a dataset.\n", "\n" ] }, { "cell_type": "code", "execution_count": 26, "id": "da1fb60d", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[(array([1, 1, 0, 1, 1, 0, 1]), 2),\n", " (array([1, 0, 1, 1, 1, 1, 1]), 6),\n", " (array([1, 1, 1, 1, 1, 1, 0]), 0),\n", " (array([1, 1, 1, 1, 1, 1, 0]), 0),\n", " (array([1, 1, 1, 1, 0, 0, 1]), 3)]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "trainingset = [nature.training_sample() for i in range(100)]\n", "trainingset[:5]" ] }, { "cell_type": "markdown", "id": "04b3580d", "metadata": {}, "source": [ "We usually use the dataset to create a model. 
\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 27, "id": "3b31eed4", "metadata": { "collapsed": true }, "outputs": [], "source": [ "model = Model(trainingset)" ] }, { "cell_type": "markdown", "id": "bd43aaa3", "metadata": {}, "source": [ "After training, nature presents us with challenges or test samples.\n", "\n" ] }, { "cell_type": "code", "execution_count": 28, "id": "37ca2c8e", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([1, 1, 1, 1, 0, 1, 1])" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nature.challenge()" ] }, { "cell_type": "markdown", "id": "afaab8b6", "metadata": {}, "source": [ "The task of the model is to predict a value based on the challenge.\n", "\n" ] }, { "cell_type": "code", "execution_count": 29, "id": "d201f147", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0\n" ] } ], "source": [ "prediction = model.predict(nature.challenge())\n", "print prediction" ] }, { "cell_type": "markdown", "id": "84b57735", "metadata": {}, "source": [ "Our prediction is then handed back to nature for evaluation. We may or may not see this evaluation, but eventually, our model will be judged on the quality of its predictions.\n", "\n" ] }, { "cell_type": "code", "execution_count": 30, "id": "a4a665fb", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nature.response(prediction)" ] }, { "cell_type": "markdown", "id": "9494acc4", "metadata": {}, "source": [ "A second important question is how we validate a model. In standard software engineering, \n", "the assumption is (rightly or wrongly) that all we need to show is \n", "conformance to the specification. However, pattern recognition and \n", "machine learning methods do not have a specification. 
They do, however, \n", "have lots of data, and that lets us make good empirical predictions about performance." ] }, { "cell_type": "code", "execution_count": 31, "id": "356b6722", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "93" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# evaluating the model\n", "def count_errors(nature,model,trials=100):\n", "    # count how many of the challenges the model answers incorrectly\n", "    errors = 0\n", "    for i in range(trials):\n", "        v = nature.challenge()\n", "        c = model.predict(v)\n", "        if not nature.response(c):\n", "            errors += 1\n", "    return errors\n", "count_errors(nature,model,trials=100)" ] }, { "cell_type": "markdown", "id": "0fc4217f", "metadata": {}, "source": [ "OK, that's not very good: the model is wrong about 90 percent of the time, which is chance level for ten equally likely classes. Let's try for something better.\n", "\n" ] }, { "cell_type": "code", "execution_count": 32, "id": "83e261ab", "metadata": { "collapsed": true }, "outputs": [], "source": [ "class MemoryModel:\n", "    def __init__(self,dataset):\n", "        # remember the class seen for each measurement vector\n", "        self.memory = {}\n", "        for v,c in dataset:\n", "            self.memory[tuple(v)] = c\n", "    def predict(self,v):\n", "        # look up the vector; fall back to class 0 if it was never seen\n", "        return self.memory.get(tuple(v),0)\n", "    " ] }, { "cell_type": "code", "execution_count": 33, "id": "e8e07326", "metadata": { "collapsed": true }, "outputs": [], "source": [ "model = MemoryModel(trainingset)\n" ] }, { "cell_type": "code", "execution_count": 34, "id": "f68736d8", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "count_errors(nature,model,trials=100)\n" ] }, { "cell_type": "markdown", "id": "3247437f", "metadata": {}, "source": [ "In the noise-free case, memorizing the samples can give good responses.\n", "\n", "However, memorization still lacks _generalization_: that is, it can't make good predictions for previously unseen samples that are similar to, but different from, existing samples."
] }, { "cell_type": "markdown", "id": "fb24cbad", "metadata": {}, "source": [ "# Noisy Samples" ] }, { "cell_type": "markdown", "id": "95b18e36", "metadata": {}, "source": [ "In the above example, nature just generated fixed patterns\n", "with a one-to-one correspondence to classes.\n", "\n", "In practice, there is usually noise.\n", "\n", "Noise means:\n", "\n", "- multiple patterns correspond to each class\n", "- each pattern may correspond to multiple classes\n", "- the correspondences are non-deterministic\n", "\n", "The overall goal is to minimize the *error rate*." ] }, { "cell_type": "code", "execution_count": 62, "id": "ae9e20ee", "metadata": { "collapsed": true }, "outputs": [], "source": [ "from pylab import random as arandom\n", "class NoisySevenSegments(SevenSegments):\n", "    def __init__(self):\n", "        self.p_noise = 0.07\n", "        self.vs = [None] * 10\n", "        self.vs[0] = array((1,1,1,1,1,1,0))\n", "        self.vs[1] = array((0,1,1,0,0,0,0))\n", "        self.vs[2] = array((1,1,0,1,1,0,1))\n", "        self.vs[3] = array((1,1,1,1,0,0,1))\n", "        self.vs[4] = array((0,1,1,0,0,1,1))\n", "        self.vs[5] = array((1,0,1,1,0,1,1))\n", "        self.vs[6] = array((1,0,1,1,1,1,1))\n", "        self.vs[7] = array((1,1,1,0,0,0,0))\n", "        self.vs[8] = array((1,1,1,1,1,1,1))\n", "        self.vs[9] = array((1,1,1,1,0,1,1))" ] }, { "cell_type": "markdown", "id": "0dc948f1", "metadata": {}, "source": [ "Now the training samples consist of the true targets, but with some of the segments flipped.\n", "\n" ] }, { "cell_type": "code", "execution_count": 63, "id": "8b7113e4", "metadata": { "collapsed": false }, "outputs": [], "source": [ "@method(NoisySevenSegments)\n", "def training_sample(self):\n", "    c = pyrandom.randint(0,len(self.vs)-1)\n", "    v = self.vs[c]\n", "    flip = 1.0*(arandom(size=7)