{ "cells": [ { "cell_type": "markdown", "id": "d3027ced", "metadata": {}, "source": [ "# Bayesian Machine Learning\n", "\n", "\n", "- **[1]** (#) (a) Explain shortly the relation between machine learning and Bayes rule. \n", " (b) How are Maximum a Posteriori (MAP) and Maximum Likelihood (ML) estimation related to Bayes rule and machine learning?\n", "> (a) Machine learning is inference over models (hypotheses, parameters, etc.) from a given data set. *Bayes rule* makes this statement precise. Let $\\theta \\in \\Theta$ and $D$ represent a model parameter vector and the given data set, respectively. Then, Bayes rule,\n", "$$\n", "p(\\theta|D) = \\frac{p(D|\\theta)}{p(D)} p(\\theta)\n", "$$\n", "relates the information that we have about $\\theta$ before we saw the data (i.e., the distribution $p(\\theta)$) to what we know after having seen the data, $p(\\theta|D)$. \n", "> (b) The *Maximum a Posteriori* (MAP) estimate picks a value $\\hat\\theta$ for which the posterior distribution $p(\\theta|D)$ is maximal, i.e.,\n", "$$ \\hat\\theta_{MAP} = \\arg\\max_\\theta p(\\theta|D)$$\n", "In a sense, MAP estimation approximates Bayesian learning, since we approximated $p(\\theta|D)$ by $\\delta(\\theta-\\hat\\theta_{\\text{MAP}})$. 
Note that, by Bayes rule, $$\\arg\\max_\\theta p(\\theta|D) = \\arg\\max_\\theta p(D|\\theta)p(\\theta)$$\n", "If we further assume that, prior to seeing the data, all values for $\\theta$ are equally likely (i.e., $p(\\theta)=\\text{const.}$), then the MAP estimate reduces to the *Maximum Likelihood* estimate,\n", "$$ \\hat\\theta_{\\text{ML}} = \\arg\\max_\\theta p(D|\\theta)$$\n" ] }, { "cell_type": "markdown", "id": "3a217d6b", "metadata": {}, "source": [ "\n", "- **[2]** (#) What are the four stages of the Bayesian design approach?\n", "> (1) Model specification, (2) parameter estimation, (3) model evaluation and (4) application of the model to tasks.\n" ] }, { "attachments": {}, "cell_type": "markdown", "id": "cd9ec40a", "metadata": {}, "source": [ "\n", "- **[3]** (##) The Bayes estimate summarizes a posterior distribution by a delta distribution located at its mean, i.e., \n", "$$\n", "\\hat \\theta_{\\text{Bayes}} = \\int \\theta \\, p\\left( \\theta |D \\right)\n", "\\,\\mathrm{d}{\\theta}\n", "$$\n", "Prove that the Bayes estimate minimizes the mean-squared error, i.e., prove that\n", "$$\n", "\\hat \\theta_{\\text{Bayes}} = \\arg\\min_{\\hat \\theta} \\int_\\theta (\\hat \\theta -\\theta)^2 p \\left( \\theta |D \\right) \\,\\mathrm{d}{\\theta}\n", "$$\n", "> To minimize the mean-squared error, we look for the $\\hat{\\theta}$ that makes the gradient of the integral with respect to $\\hat{\\theta}$ vanish.\n", "$$\\begin{align*}\n", "    \\nabla_{\\hat{\\theta}} \\int_\\theta (\\hat \\theta -\\theta)^2 p \\left( \\theta |D \\right) \\,\\mathrm{d}{\\theta} &= 0 \\\\\n", "    \\int_\\theta \\nabla_{\\hat{\\theta}} (\\hat \\theta -\\theta)^2 p \\left( \\theta |D \\right) \\,\\mathrm{d}{\\theta} &= 0 \\\\\n", "    \\int_\\theta 2(\\hat \\theta -\\theta) p \\left( \\theta |D \\right) \\,\\mathrm{d}{\\theta} &= 0 \\\\\n", "    \\int_\\theta \\hat \\theta p \\left( \\theta |D \\right) \\,\\mathrm{d}{\\theta} &= \\int_\\theta \\theta p \\left( \\theta |D \\right) \\,\\mathrm{d}{\\theta} 
\\\\\n", " \\hat \\theta \\underbrace{\\int_\\theta p \\left( \\theta |D \\right) \\,\\mathrm{d}{\\theta}}_{1} &= \\int_\\theta \\theta p \\left( \\theta |D \\right) \\,\\mathrm{d}{\\theta} \\\\\n", " \\Rightarrow \\hat \\theta &= \\int_\\theta \\theta p \\left( \\theta |D \\right) \\,\\mathrm{d}{\\theta}\n", "\\end{align*}$$\n", " \n" ] }, { "attachments": {}, "cell_type": "markdown", "id": "b75af0a8", "metadata": {}, "source": [ "- **[4]** (##) We consider the coin toss example from the notebook and use a conjugate prior for a Bernoulli likelihood function. \n", " (a) Derive the Maximum Likelihood estimate. \n", " (b) Derive the MAP estimate. \n", " (c) Do these two estimates ever coincide (if so under what circumstances)? \n", "> (a) The likelihood is given by $p(D|\\mu) = \\mu^n\\cdot (1-\\mu)^{(N-n)}$. It follows that\n", "$$\\begin{align*}\n", " \\nabla \\log p(D|\\mu) &= 0 \\\\\n", " \\nabla \\left( n\\log \\mu + (N-n)\\log(1-\\mu)\\right) &= 0\\\\\n", " \\frac{n}{\\mu} - \\frac{N-n}{1-\\mu} &= 0 \\\\\n", " \\rightarrow \\hat{\\mu}_{\\text{ML}} &= \\frac{n}{N}\n", " \\end{align*}$$ \n", "> (b) Assuming a beta prior $\\mathcal{B}(\\mu|\\alpha,\\beta)$, we can write the posterior as as\n", "$$\\begin{align*}\n", " p(\\mu|D) &\\propto p(D|\\mu)p(\\mu) \\\\\n", " &\\propto \\mu^n (1-\\mu)^{N-n} \\mu^{\\alpha-1} (1-\\mu)^{\\beta-1} \\\\\n", " &\\propto \\mathcal{B}(\\mu|n+\\alpha,N-n+\\beta)\n", " \\end{align*}$$ \n", "> The MAP estimate for a beta distribution $\\mathcal{B}(a,b)$ is located at $\\frac{a - 1}{a+b-2}$, see [wikipedia](https://en.wikipedia.org/wiki/Beta_distribution). Hence, \n", "$$\\begin{align*}\n", "\\hat{\\mu}_{\\text{MAP}} &= \\frac{(n+\\alpha)-1}{(n+\\alpha) + (N-n+\\beta) -2} \\\\\n", " &= \\frac{n+\\alpha-1}{N + \\alpha +\\beta -2}\n", "\\end{align*}$$ \n", "> (c) As $N$ gets larger, the MAP estimate approaches the ML estimate. 
In the limit $N \\to \\infty$, the influence of the prior vanishes and the MAP solution converges to the ML solution. Moreover, the two estimates coincide exactly, for any $N$, when the prior is uniform ($\\alpha=\\beta=1$), since then $\\hat{\\mu}_{\\text{MAP}} = \\frac{n}{N} = \\hat{\\mu}_{\\text{ML}}$.\n" ] }, { "attachments": {}, "cell_type": "markdown", "id": "ffee53ca", "metadata": {}, "source": [ "\n", "\n", "- **[5]** (##) A model $m_1$ is described by a single parameter $\\theta$, with $0 \\leq \\theta \\leq 1$. The system can produce data $x \\in \\{0,1\\}$. The sampling distribution and prior are given by\n", "$$\\begin{align*}\n", "p(x|\\theta,m_1) &= \\theta^x (1-\\theta)^{(1-x)} \\\\\n", "p(\\theta|m_1) &= 6\\theta(1-\\theta)\n", "\\end{align*}$$ \n", " (a) Work out the probability $p(x=1|m_1)$. \n", "$$\\begin{align*}\n", "  p(x=1|m_1) &= \\int_0^1 p(x=1|\\theta,m_1) p(\\theta|m_1) \\mathrm{d}\\theta \\\\\n", "  &= \\int_0^1 \\theta \\cdot 6\\theta (1-\\theta) \\mathrm{d}\\theta \\\\\n", "  &= 6 \\cdot \\left(\\frac{1}{3}\\theta^3 - \\frac{1}{4}\\theta^4\\right) \\bigg|_0^1 \\\\\n", "  &= 6 \\cdot \\left(\\frac{1}{3} - \\frac{1}{4}\\right) = \\frac{1}{2}\n", "\\end{align*}$$ \n", "\n", " (b) Determine the posterior $p(\\theta|x=1,m_1)$. \n", "$$\\begin{align*}\n", "  p(\\theta|x=1,m_1) &= \\frac{p(x=1|\\theta) p(\\theta|m_1)}{p(x=1|m_1)} \\\\\n", "  &= 2\\cdot \\theta \\cdot 6\\theta (1-\\theta) \\\\\n", "  &= \\begin{cases} 12 \\theta^2 (1-\\theta) & \\text{if }0 \\leq \\theta \\leq 1 \\\\\n", "  0 & \\text{otherwise} \\end{cases}\n", "  \\end{align*}$$ \n", "\n", "Now consider a second model $m_2$ with the following sampling distribution and prior on $0 \\leq \\theta \\leq 1$:\n", "$$\\begin{align*}\n", "p(x|\\theta,m_2) &= (1-\\theta)^x \\theta^{(1-x)} \\\\\n", "p(\\theta|m_2) &= 2\\theta\n", "\\end{align*}$$\n", " (c) Determine the probability $p(x=1|m_2)$. 
\n", "$$\\begin{align*}\n", " p(x=1|m_2) &= \\int_0^1 p(x=1|\\theta,m_2) p(\\theta|m_2) \\mathrm{d}\\theta \\\\\n", " &= \\int (1-\\theta) \\cdot 2\\theta \\mathrm{d}\\theta \\\\\n", " &= 2 \\cdot \\left( \\frac{1}{2}\\theta^2 - \\frac{1}{3}\\theta^3 \\right) \\bigg|_0^1 \\\\\n", " &= 2 \\cdot (\\frac{1}{2} - \\frac{1}{3}) = \\frac{1}{3}\n", " \\end{align*}$$ \n", " \n", "Now assume that the model priors are given by\n", " $$\\begin{align*}\n", " p(m_1) &= 1/3 \\\\\n", " p(m_2) &= 2/3\n", " \\end{align*}$$ \n", " (d) Compute the probability $p(x=1)$ by \"Bayesian model averaging\", i.e., by weighing the predictions of both models appropriately. \n", "$$\\begin{align*}\n", " p(x=1) &= \\sum_{k=1}^2 p(x=1|m_k) p(m_k) \\\\\n", " &= \\frac{1}{2} \\cdot \\frac{1}{3} + \\frac{1}{3} \\cdot \\frac{2}{3} = \\frac{7}{18} \n", " \\end{align*}$$ \n", " (e) Compute the fraction of posterior model probabilities $\\frac{p(m_1|x=1)}{p(m_2|x=1)}$. \n", "$$\\frac{p(m_1|x=1)}{p(m_2|x=1)} = \\frac{p(x=1|m_1) p(m_1)}{p(x=1|m_2) p(m_2)} = \\frac{\\frac{1}{2} \\cdot \\frac{1}{3}}{\\frac{1}{3} \\cdot \\frac{2}{3}} =\\frac{3}{4}$$ \n", " (f) Which model do you prefer after observation $x=1$?\n", " > In principle, the observation $x=1$ favors model $m_2$, since $p(m_2|x=1) = \\frac{4}{3} \\times p(m_1|x=1)$. However, note that $\\log_{10} \\frac{3}{4} \\approx -0.125$, so the extra evidence for $m_2$ relative to $m_1$ is very low. At this point, after 1 observation, we have no preference for a model yet. \n", "\n", "​\n", " \n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "fb682af7", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Julia 1.8.2", "language": "julia", "name": "julia-1.8" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "1.8.2" } }, "nbformat": 4, "nbformat_minor": 5 }