{ "cells": [ { "cell_type": "markdown", "id": "96c5dc26", "metadata": {}, "source": [ "# Generative Classification\n", "\n", "- **[1]** You have a machine that measures property $x$, the \"orangeness\" of liquids. You wish to discriminate between $C_1 = \\text{Fanta'}$ and $C_2 = \\text{Orangina'}$. It is known that\n", "\n", "\\begin{align*}\n", "p(x|C_1) &= \\begin{cases} 10 & 1.0 \\leq x \\leq 1.1\\\\\n", " 0 & \\text{otherwise}\n", " \\end{cases}\\\\\n", "p(x|C_2) &= \\begin{cases} 200(x - 1) & 1.0 \\leq x \\leq 1.1\\\\\n", "0 & \\text{otherwise}\n", "\\end{cases}\n", "\\end{align*}\n", "\n", "The prior probabilities $p(C_1) = 0.6$ and $p(C_2) = 0.4$ are also known from experience. \n", " (a) (##) A \"Bayes Classifier\" is given by\n", " \n", "$$\\text{Decision} = \\begin{cases} C_1 & \\text{if } p(C_1|x)>p(C_2|x) \\\\\n", " C_2 & \\text{otherwise}\n", " \\end{cases}\n", "$$\n", "\n", "Derive the optimal Bayes classifier. \n", " (b) (###) The probability of making the wrong decision, given $x$, is\n", " \n", "$$\n", "p(\\text{error}|x)= \\begin{cases} p(C_1|x) & \\text{if we decide C_2}\\\\\n", " p(C_2|x) & \\text{if we decide C_1}\n", "\\end{cases}\n", "$$\n", "\n", "Compute the **total** error probability $p(\\text{error})$ for the Bayes classifier in this example.\n", "\n", "> (a) We choose $C_1$ if $p(C_1|x)/p(C_2|x) > 1$. This condition can be worked out as\n", "$$\n", "\\frac{p(C_1|x)}{p(C_2|x)} = \\frac{p(x|C_1)p(C_1)}{p(x|C_2)p(C_2)} = \\frac{10 \\times 0.6}{200(x-1)\\times 0.4}>1\n", "$$\n", "which evaluates to choosing\n", "\\begin{align*}\n", "C_1 &\\quad \\text{ if 1.0\\leq x < 1.075}\\\\ \n", "C_2 &\\quad \\text{ if 1.075 \\leq x \\leq 1.1 }\n", "\\end{align*}\n", "The probability that $x$ falls outside the interval $[1.0,1.1]$ is zero. \n", "> (b) The total probability of error $p(\\text{error})=\\int_x p(\\text{error}|x)p(x) \\mathrm{d}{x}$. We can work this out as\n", "\n", "\\begin{align*}\n", "p(\\text{error}) &= \\int_x p(\\text{error}|x)p(x)\\mathrm{d}{x}\\\\\n", "&= \\int_{1.0}^{1.075} p(C_2|x)p(x) \\mathrm{d}{x} + \\int_{1.075}^{1.1} p(C_1|x)p(x) \\mathrm{d}{x}\\\\\n", "&= \\int_{1.0}^{1.075} p(x|C_2)p(C_2) \\mathrm{d}{x} + \\int_{1.075}^{1.1} p(x|C_1)p(C_1) \\mathrm{d}{x}\\\\\n", "&= \\int_{1.0}^{1.075}0.4\\cdot 200(x-1) \\mathrm{d}{x} + \\int_{1.075}^{1.1} 0.6\\cdot 10 \\mathrm{d}{x}\\\\\n", "&=80\\cdot[x^2/2-x]_{1.0}^{1.075} + 6\\cdot[x]_{1.075}^{1.1}\\\\\n", "&=0.225 + 0.15\\\\\n", "&=0.375\n", "\\end{align*}\n", "\n", "\n", "\n", "- **[2]** (#) (see Bishop exercise 4.8): Using (4.57) and (4.58) (from Bishop's book), derive the result (4.65) for the posterior class probability in the two-class generative model with Gaussian densities, and verify the results (4.66) and (4.67) for the parameters $w$ and $w0$. \n", "\n", "> Substitute 4.64 into 4.58 to get \n", "\\begin{align*}\n", "a &= \\log \\left( \\frac{ \\frac{1}{(2\\pi)^{D/2}} \\cdot \\frac{1}{|\\Sigma|^{1/2}} \\cdot \\exp\\left( -\\frac{1}{2}(x-\\mu_1)^T \\Sigma^{-1} (x-\\mu_1)\\right) \\cdot p(C_1)}{\\frac{1}{(2\\pi)^{D/2}} \\cdot \\frac{1}{|\\Sigma|^{1/2}}\\cdot \\exp\\left( -\\frac{1}{2}(x-\\mu_2)^T \\Sigma^{-1} (x-\\mu_2)\\right) \\cdot p(C_2)}\\right) \\\\\n", "&= \\log \\left( \\exp\\left(-\\frac{1}{2}(x-\\mu_1)^T \\Sigma^{-1} (x-\\mu_1) + \\frac{1}{2}(x-\\mu_2)^T \\Sigma^{-1} (x-\\mu_2) \\right) \\right) + \\log \\frac{p(C_1)}{p(C_2)} \\\\\n", "&= ... \\\\\n", "&=( \\mu_1-\\mu_2)^T\\Sigma^{-1}x - 0.5\\left(\\mu_1^T\\Sigma^{-1}\\mu_1 - \\mu_2^T\\Sigma^{-1} \\mu_2\\right)+ \\log \\frac{p(C_1)}{p(C_2)} \n", "\\end{align*} \n", "Substituting this into the right-most form of (4.57) we obtain (4.65), with $w$ and $w0$ given by (4.66) and (4.67), respectively.\n", "\n", "- **[3]** (###) (see Bishop exercise 4.9). \n", "> The Log-likelihood is given by\n", "$$\\log p(\\{\\phi_n,t,n\\} | \\{\\pi_k\\}) = \\sum_n \\sum_k t_{nk}\\left(\\log p(\\phi_n|C_k) + \\log \\pi_k\\right)\\,.$$ Using the method of Lagrange multipliers (Bishop app.E), we augment the log-likelihood with the constraint and obtain the Lagrangian $$\\log p(\\{\\phi_n,t_{nk}\\} | \\{\\pi_k\\})+\\lambda \\left(\\sum_k \\pi_k -1 \\right)\\,.$$ In order to maximize, we set the derivative with respect to $\\pi_k$ equal to zero and obtain \n", "\\begin{align*}\n", "\\sum_n \\frac{t_{nk}}{\\pi_k}+\\lambda &=0 \\\\\n", "-\\pi_k\\lambda &=\\sum_n t_{nk} = N_k \\\\\n", "-\\lambda \\sum_k \\pi_k &= \\sum_k \\sum_n t_{nk} \\\\\n", "\\lambda &= -N \n", "\\end{align*}\n", "\n", "- **[4]** (##) (see Bishop exercise 4.10). \n", "> We can write the log-likelihood as\n", "\\begin{align*}\n", "\\log p(\\{\\phi_n,t_n\\}|\\{\\pi_k\\}) \\propto -0.5\\sum_n\\sum_kt_{nk}\\left(\\log |\\Sigma|+(\\phi_n-\\mu_k)^T\\Sigma^{-1}(\\phi-\\mu)\\right)\n", "\\end{align*}\n", "The derivatives of the likelihood with respect to mean and shared covariance are respectively\n", "\\begin{align*}\n", "\\nabla_{\\mu_k}\\log p(\\{\\phi_n,t_n\\}|\\{\\pi_k\\}) &= \\sum_n\\sum_k t_{nk}\\Sigma^{-1}\\left(\\phi_n-\\mu_k\\right) = 0 \\\\\n", "\\sum_n t_{nk}\\left(\\phi_n-\\mu_k\\right))&=0 \\\\\n", "\\mu_k &= \\frac{1}{N_k}\\sum_n t_{nk}\\phi_n \\\\\n", "\\nabla_{\\Sigma}\\log p(\\{\\phi_n,t_n\\}|\\{\\pi_k\\})&=\\sum_n\\sum_k t_{nk}\\left(\\Sigma - (\\phi_n-\\mu_k)(\\phi_n-\\mu_k)^T\\right) = 0 \\\\\n", "\\sum_n\\sum_k t_{nk}\\Sigma &= \\sum_n\\sum_k t_{nk}(\\phi_n-\\mu_k)(\\phi_n-\\mu_k)^T \\\\\n", "\\Sigma &= \\frac{1}{N}\\sum_k\\sum_n t_{nk}(\\phi_n-\\mu_k)(\\phi_n-\\mu_k)^T\n", "\\end{align*}\n", "\n", "" ] }, { "cell_type": "markdown", "id": "dac56bf0", "metadata": {}, "source": [ "# Discriminative Classification\n", "\n", "- **[1]** Given a data set $D=\\{(x_1,y_1),\\ldots,(x_N,y_N)\\}$, where $x_n \\in \\mathbb{R}^M$ and $y_n \\in \\{0,1\\}$. The probabilistic classification method known as *logistic regression* attempts to model these data as\n", "$$p(y_n=1|x_n) = \\sigma(\\theta^T x_n + b)$$\n", "where $\\sigma(x) = 1/(1+e^{-x})$ is the *logistic function*. Let's introduce shorthand notation $\\mu_n=\\sigma(\\theta^T x_n + b)$. So, for every input $x_n$, we have a model output $\\mu_n$ and an actual data output $y_n$. \n", " (a) Express $p(y_n|x_n)$ as a Bernoulli distribution in terms of $\\mu_n$ and $y_n$. \n", " (b) If furthermore is given that the data set is IID, show that the log-likelihood is given by\n", "$$\n", "L(\\theta) \\triangleq \\log p(D|\\theta) = \\sum_n \\left\\{y_n \\log \\mu_n + (1-y_n)\\log(1-\\mu_n)\\right\\}\n", "$$ \n", " (c) Prove that the derivative of the logistic function is given by\n", "$$\n", "\\sigma^\\prime(\\xi) = \\sigma(\\xi)\\cdot\\left(1-\\sigma(\\xi)\\right)\n", "$$ \n", " (d) Show that the derivative of the log-likelihood is\n", "$$\n", "\\nabla_\\theta L(\\theta) = \\sum_{n=1}^N \\left( y_n - \\sigma(\\theta^T x_n +b)\\right)x_n\n", "$$ \n", " (e) Design a gradient-ascent algorithm for maximizing $L(\\theta)$ with respect to $\\theta$. \n", "\n", " > (a) $p(y_n|x_n) = p(y_n=1|x_n)^{y_n} p(y_n=0|x_n)^{1-y_n} = \\mu_n^{y_n}(1-\\mu_n)^{1-y_n}$ \n", " > (b) The log-likelihood is given by\n", "\\begin{align*} L(\\theta) &= \\log p(D|\\theta) = \\sum_n \\log p(y_n|x_n,\\theta)\\\\\n", "&= \\sum_n \\left\\{y_n \\log \\mu_n + (1-y_n)\\log(1-\\mu_n)\\right\\}\n", "\\end{align*} \n", " > (c) \\begin{align*}\n", "\\frac{d{}}{d{x}}\\left( \\frac{1}{1+e^{-x}}\\right) &= \\frac{(1+e^{-x})\\cdot 0 - (-e^{-x}\\cdot 1)}{(1+e^{-x})^2}\\\\\n", "&= \\frac{e^{-x}}{(1+e^{-x})^2} = \\frac{1}{1+e^{-x}}\\cdot \\frac{e^{-x}}{1+e^{-x}}\\\\\n", "&=\\sigma(x)\\left( 1-\\sigma(x)\\right)\n", "\\end{align*} \n", " > (d)\n", "\\begin{align*}\n", "\\nabla_\\theta L(\\theta) &= \\sum_n \\frac{\\partial{L}}{\\partial{\\mu_n}}\\cdot \\frac{\\partial{\\mu_n}}{\\partial{(\\theta^T x_n +b)}} \\cdot \\frac{\\partial{(\\theta^T x_n +b)}}{\\partial{\\theta}}\\\\\n", "&= \\sum_n \\left(\\frac{y_n}{\\mu_n} - \\frac{1-y_n}{1-\\mu_n} \\right) \\cdot \\mu_n(1-\\mu_n) \\cdot x_n\\\\\n", "&= \\sum_n \\frac{y_n - \\mu_n}{\\mu_n(1-\\mu_n)} \\cdot \\mu_n(1-\\mu_n) \\cdot x_n\\\\\n", "&= \\sum_n (y_n - \\mu_n) \\cdot x_n\n", "\\end{align*} \n", " > (e)\n", "$$\\theta^{(t+1)} = \\theta^{(t)} + \\rho \\sum_n (y_n - \\mu_n^{(t)})x_n$$\n", "\n", "- **[2]** Describe shortly the similarities and differences between the discriminative and generative approach to classification.\n", " > Both aim to build an algorithm for $p(y|x)$ where $y$ is a discrete class label and $x$ is a vector of real (or possibly discretely valued) variables. In the discriminative approach, we propose a model $p(y|x,\\theta)$ and use a training data set $D=\\{(x_1,y_1),(x_2,y_2),\\ldots,(x_N,y_N)\\}$ to infer good values for the parameters. For instance, in a maximum likelihood setting, we choose the parameters $\\hat{\\theta}$ that maximize $p(D|\\theta)$. The classification algorithm is now given by $$p(y|x) = p(y|x,\\hat{\\theta})\\,.$$ In the generative approach, we also aim to design an algorithm $p(y|x)$ through a parametric model that is now given by $p(y,x|\\theta) = p(x|y,\\theta)p(y|\\theta)$. Again, we use the data set to train the parameters, eg, $\\hat{\\theta} = \\arg\\max_\\theta p(D|\\theta)$, and the classification algorithm is now given by Bayes rule: \n", " $$\n", " p(y|x) \\propto p(x|y,\\hat{\\theta})\\cdot p(y|\\hat{\\theta})\n", "$$\n", "\n", "- **[3]** (Bishop ex.4.7) (#) Show that the logistic sigmoid function $\\sigma(a) = \\frac{1}{1+\\exp(-a)}$ satisfies the property $\\sigma(-a) = 1-\\sigma(a)$ and that its inverse is given by $\\sigma^{-1}(y) = \\log\\{y/(1-y)\\}$.\n", " > \\begin{align*} \n", " 1- \\sigma(a) &= 1 - \\frac{1}{1 + \\exp(-a)} = \\frac{1+\\exp(-a) - 1}{1+\\exp(-a)} \\\\\n", " &= \\frac{\\exp(-a)}{1+\\exp(-a)} = \\frac{1}{\\exp(a)+1} = \\sigma(-a)\\end{align*}\n", " \n", " > Regarding the inverse, \n", " \\begin{align*} \n", " y = \\sigma(a) &= \\frac{1}{1+\\exp(-a)} \\\\\n", " \\Rightarrow \\frac1y - 1 &= \\exp(-a) \\\\\n", " \\Rightarrow \\log\\left( \\frac{1-y}{y}\\right) &= -a \\\\\n", " \\Rightarrow \\log\\left( \\frac{y}{1-y}\\right) &= a = \\sigma^{-1}(y)\n", "\\end{align*}\n", " \n", "- **[4]** (Bishop ex.4.16) (###) Consider a binary classification problem in which each observation $x_n$ is known to belong to one of two classes, corresponding to $y_n = 0$ and $y_n = 1$. Suppose that the procedure for collecting training data is imperfect, so that training points are sometimes mislabelled. For every data point $x_n$, instead of having a value $y_n$ for the class label, we have instead a value $\\pi_n$ representing the probability that $y_n = 1$. Given a probabilistic model $p(y_n = 1|x_n,\\theta)$, write down the log-likelihood function appropriate to such a data set.\n", " > If the values of the $\\{y_n\\}$ were known then each data point for which $y_n = 1$ would contribute $\\log p(y_n = 1|x_n,\\theta)$ to the log likelihood, and each point for which $y_n = 0$ would contribute $\\log p(y_n=0|x_n,\\theta) = \\log(1 − p(y_n = 1|x_n,\\theta))$ to the log likelihood. A data point whose probability of having $y_n = 1$ is given by $\\pi_n$ will therefore contribute\n", " $$\\pi_n \\log p(y_n=1|x_n,\\theta) + (1 - \\pi_n) \\log p(y_n=0|x_n,\\theta)$$\n", " and so the overall log-likelihood given the data set is \n", "$$\\sum_{n=1}^N \\pi_n \\log p(y_n=1|x_n,\\theta) + (1 - \\pi_n) \\log p(y_n=0|x_n,\\theta)$$\n", "\n", "\n", "\n", "- **[5]** (###) Let $X$ be a real valued random variable with probability density\n", "$$\n", "p_X(x) = \\frac{e^{-x^2/2}}{\\sqrt{2\\pi}},\\quad\\text{for all x}.\n", "$$\n", "Also $Y$ is a real valued random variable with conditional density\n", "$$\n", "p_{Y|X}(y|x) = \\frac{e^{-(y-x)^2/2}}{\\sqrt{2\\pi}},\\quad\\text{for all x and y}. \n", "$$\n", " (a) Give an (integral) expression for $p_Y(y)$. Do not try to evaluate the integral. \n", " (b) Approximate $p_Y(y)$ using the Laplace approximation.\n", " Give the detailed derivation, not just the answer.\n", "Hint: You may use the following results.\n", "Let \n", "$$g(x) = \\frac{e^{-x^2/2}}{\\sqrt{2\\pi}}$$\n", "and\n", "$$\n", "h(x) = \\frac{e^{-(y-x)^2/2}}{\\sqrt{2\\pi}}$$\n", "for some real value $y$. Then:\n", "\\begin{align*}\n", "\\frac{\\partial}{\\partial x} g(x) &= -xg(x) \\\\\n", "\\frac{\\partial^2}{\\partial x^2} g(x) &= (x^2-1)g(x) \\\\\n", "\\frac{\\partial}{\\partial x} h(x) &= (y-x)h(x) \\\\\n", "\\frac{\\partial^2}{\\partial x^2} h(x) &= ((y-x)^2-1)h(x) \n", "\\end{align*} \n", "> (a) \n", "$$\n", "p_Y(y) = \\int_{-\\infty}^{\\infty} p_X(x)p_{Y|X}(y|x)\\,dx =\n", " \\int_{-\\infty}^{\\infty}\n", " \\frac{e^{-\\frac12(x^2+(y-x)^2)}}{2\\pi}\\,dx\n", "$$\n", "> (b) Using the hint we determine the first derivative of\n", " \\begin{align*}\n", " f(x) &= g(x)h(x), \\\\\n", " \\frac{\\partial}{\\partial x} f(x) &= \\frac{\\partial}{\\partial x} g(x)\\cdot h(x) = -xg(x)h(x)+g(x)(y-x)h(x) = (y-2x)f(x)\n", "\\end{align*} \n", "Setting this to zero gives \n", "\\begin{align*}\n", " y-2x&= 0; \\quad \\text{so}\\quad x=\\frac12y. \\\\\n", " \\frac{\\partial}{\\partial x} \\ln f(x) &= \\frac{\\frac{\\partial}{\\partial x} f(x)}{f(x)} = (y-2x) \\\\\n", " \\frac{\\partial^2}{\\partial x^2} \\ln f(x) &= \\frac{\\partial}{\\partial x} (y-2x) = -2.\n", "\\end{align*}\n", "So, we find $A=2$, see lecture notes, and thus\n", " \\begin{align*}\n", " p_Y(y) &= \\int_{-\\infty}^{\\infty}f(x)\\,dx\\approx f(\\frac{y}{2})\\sqrt{\\frac{2\\pi}{A}} \\\\\n", " &= g(\\frac{y}{2})h(\\frac{y}{2})\\sqrt{\\frac{2\\pi}{A}} \\\\\n", " &= \\frac{1}{\\sqrt{2\\pi\\cdot2}}e^{-y^2/4}.\n", " \\end{align*}\n", "So $Y$ is a Gaussian with mean $m=0$ and variance $\\sigma^2=2$.\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "7dc10e41", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Julia 1.5.2", "language": "julia", "name": "julia-1.5" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "1.5.2" } }, "nbformat": 4, "nbformat_minor": 5 }