{ "cells": [ { "cell_type": "markdown", "id": "d3027ced", "metadata": {}, "source": [ "# Bayesian Machine Learning\n", "\n", "\n", "- **[1]** (#) (a) Explain shortly the relation between machine learning and Bayes rule. \n", " (b) How are Maximum a Posteriori (MAP) and Maximum Likelihood (ML) estimation related to Bayes rule and machine learning?\n", "> (a) Machine learning is inference over models (hypotheses, parameters, etc.) from a given data set. *Bayes rule* makes this statement precise. Let $\\theta \\in \\Theta$ and $D$ represent a model parameter vector and the given data set, respectively. Then, Bayes rule,\n", "$$\n", "p(\\theta|D) = \\frac{p(D|\\theta)}{p(D)} p(\\theta)\n", "$$\n", "relates the information that we have about $\\theta$ before we saw the data (i.e., the distribution $p(\\theta)$) to what we know after having seen the data, $p(\\theta|D)$. \n", "> (b) The *Maximum a Posteriori* (MAP) estimate picks a value $\\hat\\theta$ for which the posterior distribution $p(\\theta|D)$ is maximal, i.e.,\n", "$$\\hat\\theta_{MAP} = \\arg\\max_\\theta p(\\theta|D)$$\n", "In a sense, MAP estimation approximates Bayesian learning, since we approximated $p(\\theta|D)$ by $\\delta(\\theta-\\hat\\theta_{\\text{MAP}})$. Note that, by Bayes rule, $$\\arg\\max_\\theta p(\\theta|D) = \\arg\\max_\\theta p(D|\\theta)p(\\theta)$$\n", "If we further assume that prior to seeing the data all values for $\\theta$ are equally likely (i.e., $p(\\theta)=\\text{const.}$), then the MAP estimate reduces to the *Maximum Likelihood* estimate,\n", "$$\\hat\\theta_{ML} = \\arg\\max_\\theta p(D|\\theta)$$\n", "\n", "- **[2]** (#) What are the four stages of the Bayesian design approach?\n", "> (1) Model specification, (2) parameter estimation, (3) model evaluation and (4) application of the model to tasks.\n", "\n", "- **[3]** (##) The Bayes estimate is a summary of a posterior distribution by a delta distribution on its mean, i.e., \n", "$$\n", "\\hat \\theta_{bayes} = \\int \\theta \\, p\\left( \\theta |D \\right)\n", "\\,\\mathrm{d}{\\theta}\n", "$$\n", "Proof that the Bayes estimate minimizes the expected mean-squared error, i.e., proof that\n", "$$\n", "\\hat \\theta_{bayes} = \\arg\\min_{\\hat \\theta} \\int_\\theta (\\hat \\theta -\\theta)^2 p \\left( \\theta |D \\right) \\,\\mathrm{d}{\\theta}\n", "$$\n", ">To minimize the expected mean-squared error we will look for $\\hat{\\theta}$ that makes the gradient of the integral with respect to $\\hat{\\theta}$ vanish.\n", "\\begin{align*}\n", " \\nabla_{\\hat{\\theta}} \\int_\\theta (\\hat \\theta -\\theta)^2 p \\left( \\theta |D \\right) \\,\\mathrm{d}{\\theta} &= 0 \\\\\n", " \\int_\\theta \\nabla_{\\hat{\\theta}} (\\hat \\theta -\\theta)^2 p \\left( \\theta |D \\right) \\,\\mathrm{d}{\\theta} &= 0 \\\\\n", " \\int_\\theta 2(\\hat \\theta -\\theta) p \\left( \\theta |D \\right) \\,\\mathrm{d}{\\theta} &= 0 \\\\\n", " \\int_\\theta \\hat \\theta p \\left( \\theta |D \\right) \\,\\mathrm{d}{\\theta} &= \\int_\\theta \\theta p \\left( \\theta |D \\right) \\,\\mathrm{d}{\\theta} \\\\\n", " \\hat \\theta \\underbrace{\\int_\\theta p \\left( \\theta |D \\right) \\,\\mathrm{d}{\\theta}}_{1} &= \\int_\\theta \\theta p \\left( \\theta |D \\right) \\,\\mathrm{d}{\\theta} \\\\\n", " \\hat \\theta &= \\int_\\theta \\theta p \\left( \\theta |D \\right) \\,\\mathrm{d}{\\theta}\n", " \\end{align*}\n", "\n", "- **[4]** (###) We make $N$ IID observations $D=\\{x_1 \\dots x_N\\}$ and assume the following model\n", "$$\n", "x_k = A + \\epsilon_k \n", "$$\n", " where 
$\\epsilon_k = \\mathcal{N}(\\epsilon_k | 0,\\sigma^2)$ with known $\\sigma^2=1$. We are interested in deriving an estimator for $A$. \n", " (a) Make a reasonable assumption for a prior on $A$ and derive a Bayesian (posterior) estimate. \n", " (b) (##) Derive the Maximum Likelihood estimate for $A$. \n", " (c) Derive the MAP estimates for $A$. \n", " (d) Now assume that we do not know the variance of the noise term? Describe the procedure for Bayesian estimation of both $A$ and $\\sigma^2$ (No need to fully work out to closed-form estimates). \n", "> (a) Since there is no assumption on the values A can take it makes sense to assume a distribution that has support over the reals. A Gaussian prior is a good candidate. Let us assume $p(A) = \\mathcal{N}(A|m_A,v_A)$. Since $p(D|A) = \\prod_k \\mathcal{N}(x_k|A,\\sigma^2)$ is a Gaussian likelihood and $p(A)$ is a Gaussian prior, their multiplication is proportional to a Gaussian. We will work this out with the canonical parameterization of the Gaussian since it is easier to multiply Gaussians in that domain. This means the posterior $p(A|D)$ is\n", "\\begin{align*}\n", " p(A|D) &\\propto p(A) p(D|A) \\\\\n", " &= \\mathcal{N}(A|m_A,v_A) \\prod_{k=1}^N \\mathcal{N}(x_k|A,\\sigma^2) \\\\\n", " &= \\mathcal{N}(A|m_A,v_A) \\prod_{k=1}^N \\mathcal{N}(A|x_k,\\sigma^2) \\\\\n", " &= \\mathcal{N}_c\\big(A \\Bigm|\\frac{m_A}{v_A},\\frac{1}{v_A}\\big)\\prod_{k=1}^N \\mathcal{N}_c\\big(A\\Bigm| \\frac{x_k}{\\sigma^2},\\frac{1}{\\sigma^2}\\big) \\\\\n", " &\\propto \\mathcal{N}_c\\big(A \\Bigm| \\frac{m_A}{v_A} + \\frac{1}{\\sigma^2} \\sum_k x_k , \\frac{1}{v_A} + \\frac{N}{\\sigma^2} \\big) \\,, \n", "\\end{align*}\n", "where we have made use of the fact that precision-weighted means and precisions add when multiplying Gaussians. In principle this description of the posterior completes the answer. \n", "> (b) The ML estimate can be found by\n", " \\begin{align*}\n", " \\nabla \\log p(D|A) &=0\\\\\n", " \\nabla \\sum_k \\log \\mathcal{N}(x_k|A,\\sigma^2) &= 0 \\\\\n", " \\nabla \\frac{-1}{2}\\sum_k \\frac{(x_k-A)^2}{\\sigma^2} &=0\\\\\n", " \\sum_k(x_k-A) &= 0 \\\\\n", " \\hat{A}_{ML} = \\frac{1}{N}\\sum_{k=1}^N x_k\n", " \\end{align*} \n", "> (c) The MAP is simply the location where the posterior has its maximum value, which for a Gaussian posterior is its mean value. We computed in (a) the precision-weighted mean, so we need to divide by precision (or multiply by variance) to get the location of the mean: \n", "\\begin{align*} \n", "\\hat{A}_{MAP} &= \\left( \\frac{m_A}{v_A} + \\frac{1}{\\sigma^2} \\sum_k x_k\\right)\\cdot \\left( \\frac{1}{v_A} + \\frac{N}{\\sigma^2} \\right)^{-1} \\\\\n", "&= \\frac{v_A \\sum_k x_k + \\sigma^2 m_A}{N v_A + \\sigma^2}\n", "\\end{align*} \n", "> (d) A Bayesian treatment requires putting a prior on the unknown variance. The variance is constrained to be positive hence the support of the prior distribution needs to be on the positive reals. (In a multivariate case positivity needs to be extended to symmetric positive definiteness.) Choosing a conjugate prior will simplify matters greatly. In this scenerio the inverse Gamma distribution is the conjugate prior for the unknown variance. In the literature this model is called a Normal-Gamma distribution. See https://www.seas.harvard.edu/courses/cs281/papers/murphy-2007.pdf for the analytical treatment. \n", " \n", "- **[5]** (##) We consider the coin toss example from the notebook and use a conjugate prior for a Bernoulli likelihood function. 
\n", " (a) Derive the Maximum Likelihood estimate. \n", " (b) Derive the MAP estimate. \n", " (c) Do these two estimates ever coincide (if so under what circumstances)? \n", "> (a) \\begin{align*}\n", " \\nabla \\log p(D|\\mu) &= 0 \\\\\n", " \\nabla \\left( n\\log \\mu + (N-n)\\log(1-\\mu)\\right) &= 0\\\\\n", " \\frac{n}{\\mu} - \\frac{N-n}{1-\\mu} &= 0 \\\\\n", " \\hat{\\mu}_{\\text{ML}} = \\frac{n}{N}\n", " \\end{align*} \n", "> (b) Assuming a beta prior $\\mathcal{B}(\\mu|\\alpha,\\beta)$, we can write the posterior as as\n", " \\begin{align*}\n", " p(\\mu|D) &\\propto p(D|\\mu)p(\\mu) \\\\\n", " &\\propto \\mu^n (1-\\mu)^{N-n} \\mu^{\\alpha-1} (1-\\mu)^{\\beta-1} \\\\\n", " &\\propto \\mathcal{B}(\\mu|n+\\alpha,N-n+\\beta)\n", " \\end{align*} \n", " The MAP estimate for a beta distribution $\\mathcal{B}(a,b)$ is located at $\\frac{a - 1}{a+b-2}$, see [wikipedia](https://en.wikipedia.org/wiki/Beta_distribution). Hence, \n", " \\begin{align*}\n", "\\hat{\\mu}_{\\text{MAP}} &= \\frac{(n+\\alpha)-1}{(n+\\alpha) + (N-n+\\beta) -2} \\\\\n", " &= \\frac{n+\\alpha-1}{N + \\alpha +\\beta -2}\n", "\\end{align*} \n", "> (c) As $N$ gets larger, the MAP estimate approaches the ML estimate. In the limit the MAP solution converges to the ML solution." ] }, { "cell_type": "code", "execution_count": null, "id": "27db69a5", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "fb682af7", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Julia 1.6.3", "language": "julia", "name": "julia-1.6" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "1.6.3" } }, "nbformat": 4, "nbformat_minor": 5 }