{ "cells": [ { "cell_type": "markdown", "id": "d3027ced", "metadata": {}, "source": [ "# Bayesian Machine Learning\n", "\n", "\n", "- **[1]** (#) (a) Explain shortly the relation between machine learning and Bayes rule. \n", " (b) How are Maximum a Posteriori (MAP) and Maximum Likelihood (ML) estimation related to Bayes rule and machine learning?\n", "> (a) Machine learning is inference over models (hypotheses, parameters, etc.) from a given data set. *Bayes rule* makes this statement precise. Let $\\theta \\in \\Theta$ and $D$ represent a model parameter vector and the given data set, respectively. Then, Bayes rule,\n", "$$\n", "p(\\theta|D) = \\frac{p(D|\\theta)}{p(D)} p(\\theta)\n", "$$\n", "relates the information that we have about $\\theta$ before we saw the data (i.e., the distribution $p(\\theta)$) to what we know after having seen the data, $p(\\theta|D)$. \n", "> (b) The *Maximum a Posteriori* (MAP) estimate picks a value $\\hat\\theta$ for which the posterior distribution $p(\\theta|D)$ is maximal, i.e.,\n", "$$ \\hat\\theta_{MAP} = \\arg\\max_\\theta p(\\theta|D)$$\n", "In a sense, MAP estimation approximates Bayesian learning, since we approximated $p(\\theta|D)$ by $\\delta(\\theta-\\hat\\theta_{\\text{MAP}})$. 
Note that, by Bayes rule, $$\\arg\\max_\\theta p(\\theta|D) = \\arg\\max_\\theta p(D|\\theta)p(\\theta)$$\n", "If we further assume that, prior to seeing the data, all values for $\\theta$ are equally likely (i.e., $p(\\theta)=\\text{const.}$), then the MAP estimate reduces to the *Maximum Likelihood* estimate,\n", "$$ \\hat\\theta_{\\text{ML}} = \\arg\\max_\\theta p(D|\\theta)$$\n" ] }, { "cell_type": "markdown", "id": "3a217d6b", "metadata": {}, "source": [ "\n", "- **[2]** (#) What are the four stages of the Bayesian design approach?\n", "> (1) Model specification, (2) parameter estimation, (3) model evaluation and (4) application of the model to tasks.\n" ] }, { "attachments": {}, "cell_type": "markdown", "id": "cd9ec40a", "metadata": {}, "source": [ "\n", "- **[3]** (##) The Bayes estimate summarizes a posterior distribution by a delta distribution located at its mean, i.e., \n", "$$\n", "\\hat \\theta_{\\text{Bayes}} = \\int \\theta \\, p\\left( \\theta |D \\right)\n", "\\,\\mathrm{d}{\\theta}\n", "$$\n", "Prove that the Bayes estimate minimizes the mean-squared error, i.e., prove that\n", "$$\n", "\\hat \\theta_{\\text{Bayes}} = \\arg\\min_{\\hat \\theta} \\int_\\theta (\\hat \\theta -\\theta)^2 p \\left( \\theta |D \\right) \\,\\mathrm{d}{\\theta}\n", "$$\n", "> To minimize the mean-squared error, we look for the $\\hat{\\theta}$ that makes the gradient of the integral with respect to $\\hat{\\theta}$ vanish.\n", "$$\\begin{align*}\n", "    \\nabla_{\\hat{\\theta}} \\int_\\theta (\\hat \\theta -\\theta)^2 p \\left( \\theta |D \\right) \\,\\mathrm{d}{\\theta} &= 0 \\\\\n", "    \\int_\\theta \\nabla_{\\hat{\\theta}} (\\hat \\theta -\\theta)^2 p \\left( \\theta |D \\right) \\,\\mathrm{d}{\\theta} &= 0 \\\\\n", "    \\int_\\theta 2(\\hat \\theta -\\theta) p \\left( \\theta |D \\right) \\,\\mathrm{d}{\\theta} &= 0 \\\\\n", "    \\int_\\theta \\hat \\theta p \\left( \\theta |D \\right) \\,\\mathrm{d}{\\theta} &= \\int_\\theta \\theta p \\left( \\theta |D \\right) \\,\\mathrm{d}{\\theta} 
\\\\\n", " \\hat \\theta \\underbrace{\\int_\\theta p \\left( \\theta |D \\right) \\,\\mathrm{d}{\\theta}}_{1} &= \\int_\\theta \\theta p \\left( \\theta |D \\right) \\,\\mathrm{d}{\\theta} \\\\\n", " \\Rightarrow \\hat \\theta &= \\int_\\theta \\theta p \\left( \\theta |D \\right) \\,\\mathrm{d}{\\theta}\n", "\\end{align*}$$\n", " \n" ] }, { "attachments": {}, "cell_type": "markdown", "id": "b75af0a8", "metadata": {}, "source": [ "- **[4]** (##) We consider the coin toss example from the notebook and use a conjugate prior for a Bernoulli likelihood function. \n", " (a) Derive the Maximum Likelihood estimate. \n", " (b) Derive the MAP estimate. \n", " (c) Do these two estimates ever coincide (if so under what circumstances)? \n", "> (a) The likelihood is given by $p(D|\\mu) = \\mu^n\\cdot (1-\\mu)^{(N-n)}$. It follows that\n", "$$\\begin{align*}\n", " \\nabla \\log p(D|\\mu) &= 0 \\\\\n", " \\nabla \\left( n\\log \\mu + (N-n)\\log(1-\\mu)\\right) &= 0\\\\\n", " \\frac{n}{\\mu} - \\frac{N-n}{1-\\mu} &= 0 \\\\\n", " \\rightarrow \\hat{\\mu}_{\\text{ML}} &= \\frac{n}{N}\n", " \\end{align*}$$ \n", "> (b) Assuming a beta prior $\\mathcal{B}(\\mu|\\alpha,\\beta)$, we can write the posterior as as\n", "$$\\begin{align*}\n", " p(\\mu|D) &\\propto p(D|\\mu)p(\\mu) \\\\\n", " &\\propto \\mu^n (1-\\mu)^{N-n} \\mu^{\\alpha-1} (1-\\mu)^{\\beta-1} \\\\\n", " &\\propto \\mathcal{B}(\\mu|n+\\alpha,N-n+\\beta)\n", " \\end{align*}$$ \n", "> The MAP estimate for a beta distribution $\\mathcal{B}(a,b)$ is located at $\\frac{a - 1}{a+b-2}$, see [wikipedia](https://en.wikipedia.org/wiki/Beta_distribution). Hence, \n", "$$\\begin{align*}\n", "\\hat{\\mu}_{\\text{MAP}} &= \\frac{(n+\\alpha)-1}{(n+\\alpha) + (N-n+\\beta) -2} \\\\\n", " &= \\frac{n+\\alpha-1}{N + \\alpha +\\beta -2}\n", "\\end{align*}$$ \n", "> (c) As $N$ gets larger, the MAP estimate approaches the ML estimate. 
In the limit $N \\to \\infty$, the influence of the prior vanishes and the MAP solution converges to the ML solution. Moreover, the two estimates coincide exactly, for any $N$, when the prior is uniform ($\\alpha=\\beta=1$), since then $\\hat{\\mu}_{\\text{MAP}} = \\frac{n}{N} = \\hat{\\mu}_{\\text{ML}}$.\n" ] }, { "attachments": {}, "cell_type": "markdown", "id": "ffee53ca", "metadata": {}, "source": [ "\n", "\n", "- **[5]** (##) A model $m_1$ is described by a single parameter $\\theta$, with $0 \\leq \\theta \\leq 1$. The system can produce data $x \\in \\{0,1\\}$. The sampling distribution and prior are given by\n", "$$\\begin{align*}\n", "p(x|\\theta,m_1) &= \\theta^x (1-\\theta)^{(1-x)} \\\\\n", "p(\\theta|m_1) &= 6\\theta(1-\\theta)\n", "\\end{align*}$$ \n", " (a) Work out the probability $p(x=1|m_1)$. \n", "$$\\begin{align*}\n", "  p(x=1|m_1) &= \\int_0^1 p(x=1|\\theta,m_1) p(\\theta|m_1) \\mathrm{d}\\theta \\\\\n", "  &= \\int_0^1 \\theta \\cdot 6\\theta (1-\\theta) \\mathrm{d}\\theta \\\\\n", "  &= 6 \\cdot \\left(\\frac{1}{3}\\theta^3 - \\frac{1}{4}\\theta^4\\right) \\bigg|_0^1 \\\\\n", "  &= 6 \\cdot \\left(\\frac{1}{3} - \\frac{1}{4}\\right) = \\frac{1}{2}\n", "\\end{align*}$$ \n", "\n", " (b) Determine the posterior $p(\\theta|x=1,m_1)$. \n", "$$\\begin{align*}\n", "  p(\\theta|x=1,m_1) &= \\frac{p(x=1|\\theta) p(\\theta|m_1)}{p(x=1|m_1)} \\\\\n", "  &= 2\\cdot \\theta \\cdot 6\\theta (1-\\theta) \\\\\n", "  &= \\begin{cases} 12 \\theta^2 (1-\\theta) & \\text{if }0 \\leq \\theta \\leq 1 \\\\\n", "  0 & \\text{otherwise} \\end{cases}\n", "  \\end{align*}$$ \n", "\n", "Now consider a second model $m_2$ with the following sampling distribution and prior on $0 \\leq \\theta \\leq 1$:\n", "$$\\begin{align*}\n", "p(x|\\theta,m_2) &= (1-\\theta)^x \\theta^{(1-x)} \\\\\n", "p(\\theta|m_2) &= 2\\theta\n", "\\end{align*}$$\n", " (c) Determine the probability $p(x=1|m_2)$. 
\n", "$$\\begin{align*}\n", " p(x=1|m_2) &= \\int_0^1 p(x=1|\\theta,m_2) p(\\theta|m_2) \\mathrm{d}\\theta \\\\\n", " &= \\int (1-\\theta) \\cdot 2\\theta \\mathrm{d}\\theta \\\\\n", " &= 2 \\cdot \\left( \\frac{1}{2}\\theta^2 - \\frac{1}{3}\\theta^3 \\right) \\bigg|_0^1 \\\\\n", " &= 2 \\cdot (\\frac{1}{2} - \\frac{1}{3}) = \\frac{1}{3}\n", " \\end{align*}$$ \n", " \n", "Now assume that the model priors are given by\n", " $$\\begin{align*}\n", " p(m_1) &= 1/3 \\\\\n", " p(m_2) &= 2/3\n", " \\end{align*}$$ \n", " (d) Compute the probability $p(x=1)$ by \"Bayesian model averaging\", i.e., by weighing the predictions of both models appropriately. \n", "$$\\begin{align*}\n", " p(x=1) &= \\sum_{k=1}^2 p(x=1|m_k) p(m_k) \\\\\n", " &= \\frac{1}{2} \\cdot \\frac{1}{3} + \\frac{1}{3} \\cdot \\frac{2}{3} = \\frac{7}{18} \n", " \\end{align*}$$ \n", " (e) Compute the fraction of posterior model probabilities $\\frac{p(m_1|x=1)}{p(m_2|x=1)}$. \n", "$$\\frac{p(m_1|x=1)}{p(m_2|x=1)} = \\frac{p(x=1|m_1) p(m_1)}{p(x=1|m_2) p(m_2)} = \\frac{\\frac{1}{2} \\cdot \\frac{1}{3}}{\\frac{1}{3} \\cdot \\frac{2}{3}} =\\frac{3}{4}$$ \n", " (f) Which model do you prefer after observation $x=1$?\n", " > In principle, the observation $x=1$ favors model $m_2$, since $p(m_2|x=1) = \\frac{4}{3} \\times p(m_1|x=1)$. However, note that $\\log_{10} \\frac{3}{4} \\approx -0.125$, so the extra evidence for $m_2$ relative to $m_1$ is very low. At this point, after 1 observation, we have no preference for a model yet. \n", "\n", "​\n", " \n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "fb682af7", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Julia 1.8.2", "language": "julia", "name": "julia-1.8" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "1.8.2" } }, "nbformat": 4, "nbformat_minor": 5 }