{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "d3027ced",
   "metadata": {},
   "source": [
    "# Bayesian Machine Learning\n",
    "\n",
    "\n",
    "- **[1]** (#) (a) Briefly explain the relation between machine learning and Bayes rule.     \n",
    "   (b) How are Maximum a Posteriori (MAP) and Maximum Likelihood (ML) estimation related to Bayes rule and machine learning?\n",
    "> (a) Machine learning is inference over models (hypotheses, parameters, etc.) from a given data set. *Bayes rule* makes this statement precise. Let $\\theta \\in \\Theta$ and $D$ represent a model parameter vector and the given data set, respectively. Then, Bayes rule,\n",
    "$$\n",
    "p(\\theta|D) = \\frac{p(D|\\theta)}{p(D)} p(\\theta)\n",
    "$$\n",
    "relates the information that we have about $\\theta$ before we saw the data (i.e., the distribution $p(\\theta)$) to what we know after having seen the data, $p(\\theta|D)$.      \n",
    "> (b) The *Maximum a Posteriori* (MAP) estimate picks a value $\\hat\\theta$ for which the posterior distribution $p(\\theta|D)$ is maximal, i.e.,\n",
    "$$ \\hat\\theta_{MAP} = \\arg\\max_\\theta p(\\theta|D)$$\n",
    "In a sense, MAP estimation approximates Bayesian learning, since it replaces the posterior $p(\\theta|D)$ by the point mass $\\delta(\\theta-\\hat\\theta_{\\text{MAP}})$. Note that, by Bayes rule and because $p(D)$ does not depend on $\\theta$, $$\\arg\\max_\\theta p(\\theta|D) = \\arg\\max_\\theta p(D|\\theta)p(\\theta)$$\n",
    "If we further assume that prior to seeing the data all values for $\\theta$ are equally likely (i.e., $p(\\theta)=\\text{const.}$), then the MAP estimate reduces to the *Maximum Likelihood* estimate,\n",
    "$$ \\hat\\theta_{ML} = \\arg\\max_\\theta p(D|\\theta)$$\n",
    "\n",
    "- **[2]** (#) What are the four stages of the Bayesian design approach?\n",
    "> (1) Model specification, (2) parameter estimation, (3) model evaluation and (4) application of the model to tasks.\n",
    "\n",
    "- **[3]** (##) The Bayes estimate is a summary of a posterior distribution by a delta distribution on its mean, i.e., \n",
    "$$\n",
    "\\hat \\theta_{bayes}  = \\int \\theta \\, p\\left( \\theta |D \\right)\n",
    "\\,\\mathrm{d}{\\theta}\n",
    "$$\n",
    "Prove that the Bayes estimate minimizes the expected mean-squared error, i.e., prove that\n",
    "$$\n",
    "\\hat \\theta_{bayes} = \\arg\\min_{\\hat \\theta} \\int_\\theta (\\hat \\theta -\\theta)^2 p \\left( \\theta |D \\right) \\,\\mathrm{d}{\\theta}\n",
    "$$\n",
    "> To minimize the expected mean-squared error we look for the $\\hat{\\theta}$ that makes the gradient of the integral with respect to $\\hat{\\theta}$ vanish:\n",
    "$$\\begin{align*}\n",
    "  \\nabla_{\\hat{\\theta}}  \\int_\\theta (\\hat \\theta -\\theta)^2 p \\left( \\theta |D \\right) \\,\\mathrm{d}{\\theta} &= 0 \\\\\n",
    "  \\int_\\theta \\nabla_{\\hat{\\theta}}  (\\hat \\theta -\\theta)^2 p \\left( \\theta |D \\right) \\,\\mathrm{d}{\\theta} &= 0 \\\\\n",
    "  \\int_\\theta  2(\\hat \\theta -\\theta) p \\left( \\theta |D \\right) \\,\\mathrm{d}{\\theta} &= 0 \\\\\n",
    "  \\int_\\theta  \\hat \\theta p \\left( \\theta |D \\right) \\,\\mathrm{d}{\\theta} &= \\int_\\theta  \\theta p \\left( \\theta |D \\right) \\,\\mathrm{d}{\\theta} \\\\\n",
    "  \\hat \\theta \\underbrace{\\int_\\theta p \\left( \\theta |D \\right) \\,\\mathrm{d}{\\theta}}_{1} &= \\int_\\theta  \\theta p \\left( \\theta |D \\right) \\,\\mathrm{d}{\\theta} \\\\\n",
    "   \\hat \\theta &= \\int_\\theta  \\theta p \\left( \\theta |D \\right) \\,\\mathrm{d}{\\theta}\n",
    "   \\end{align*}$$\n",
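    "> As a cross-check that avoids differentiation and confirms that the stationary point is a minimum, the objective can be expanded around the posterior mean. This is a sketch under the assumption that the posterior mean and variance are finite. The shorthand $m = \\int_\\theta \\theta \\, p(\\theta|D)\\,\\mathrm{d}\\theta$ and $v = \\int_\\theta (\\theta-m)^2 p(\\theta|D)\\,\\mathrm{d}\\theta$ is introduced here for convenience only.\n",
    "$$\\begin{align*}\n",
    "  \\int_\\theta (\\hat\\theta-\\theta)^2 p(\\theta|D)\\,\\mathrm{d}\\theta &= \\int_\\theta \\big((\\hat\\theta-m)-(\\theta-m)\\big)^2 p(\\theta|D)\\,\\mathrm{d}\\theta \\\\\n",
    "  &= (\\hat\\theta-m)^2 - 2(\\hat\\theta-m)\\underbrace{\\int_\\theta (\\theta-m)\\, p(\\theta|D)\\,\\mathrm{d}\\theta}_{0} + v \\\\\n",
    "  &= (\\hat\\theta-m)^2 + v\n",
    "\\end{align*}$$\n",
    "The first term is non-negative and vanishes only at $\\hat\\theta = m$, so the posterior mean is the unique minimizer, and the minimal expected squared error equals the posterior variance $v$.\n",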
    "\n",
    "- **[4]** (###) We make $N$ IID observations $D=\\{x_1 \\dots x_N\\}$ and assume the following model\n",
    "$$\n",
    "x_k = A + \\epsilon_k \n",
    "$$\n",
    "  where $\\epsilon_k \\sim \\mathcal{N}(0,\\sigma^2)$ with known $\\sigma^2=1$. We are interested in deriving an estimator for $A$.   \n",
    "  (a) Make a reasonable assumption for a prior on $A$ and derive a Bayesian (posterior) estimate.     \n",
    "  (b) (##) Derive the Maximum Likelihood estimate for $A$.      \n",
    "  (c) Derive the MAP estimate for $A$.    \n",
    "  (d) Now assume that we do not know the variance of the noise term. Describe the procedure for Bayesian estimation of both $A$ and $\\sigma^2$ (no need to fully work out closed-form estimates). \n",
    "> (a) Since there is no restriction on the values $A$ can take, it makes sense to choose a prior with support over the reals. A Gaussian prior is a good candidate. Let us assume $p(A) = \\mathcal{N}(A|m_A,v_A)$. Since $p(D|A) = \\prod_k \\mathcal{N}(x_k|A,\\sigma^2)$ is a Gaussian likelihood and $p(A)$ is a Gaussian prior, their product is proportional to a Gaussian. We work this out in the canonical parameterization of the Gaussian, $\\mathcal{N}_c$ (precision-weighted mean, precision), since Gaussians are easier to multiply in that domain. The posterior $p(A|D)$ is then\n",
    "$$\\begin{align*}\n",
    "   p(A|D) &\\propto p(A) p(D|A) \\\\\n",
    "   &= \\mathcal{N}(A|m_A,v_A) \\prod_{k=1}^N \\mathcal{N}(x_k|A,\\sigma^2) \\\\\n",
    "   &= \\mathcal{N}(A|m_A,v_A) \\prod_{k=1}^N \\mathcal{N}(A|x_k,\\sigma^2) \\\\\n",
    "   &= \\mathcal{N}_c\\big(A \\Bigm|\\frac{m_A}{v_A},\\frac{1}{v_A}\\big)\\prod_{k=1}^N \\mathcal{N}_c\\big(A\\Bigm| \\frac{x_k}{\\sigma^2},\\frac{1}{\\sigma^2}\\big) \\\\\n",
    "       &\\propto \\mathcal{N}_c\\big(A \\Bigm| \\frac{m_A}{v_A} + \\frac{1}{\\sigma^2} \\sum_k x_k , \\frac{1}{v_A} + \\frac{N}{\\sigma^2}  \\big)      \\,, \n",
    "\\end{align*}$$\n",
    "where we have made use of the fact that precision-weighted means and precisions add when multiplying Gaussians. In principle this description of the posterior completes the answer.  \n",
    "> (b) The ML estimate can be found by\n",
    "    $$\\begin{align*}\n",
    "    \\nabla \\log p(D|A) &=0\\\\\n",
    "    \\nabla \\sum_k \\log \\mathcal{N}(x_k|A,\\sigma^2) &= 0 \\\\\n",
    "      \\nabla \\frac{-1}{2}\\sum_k \\frac{(x_k-A)^2}{\\sigma^2} &=0\\\\\n",
    "    \\sum_k(x_k-A) &= 0 \\\\\n",
    "    \\hat{A}_{ML} &= \\frac{1}{N}\\sum_{k=1}^N x_k\n",
    "    \\end{align*}$$    \n",
    "> (c) The MAP is simply the location where the posterior has its maximum value, which for a Gaussian posterior is its mean value. We computed in (a) the precision-weighted mean, so we need to divide by precision (or multiply by variance) to get the location of the mean:  \n",
    "$$\\begin{align*}   \n",
    "\\hat{A}_{MAP}  &= \\left( \\frac{m_A}{v_A} + \\frac{1}{\\sigma^2} \\sum_k x_k\\right)\\cdot \\left(  \\frac{1}{v_A} + \\frac{N}{\\sigma^2} \\right)^{-1} \\\\\n",
    "&= \\frac{v_A \\sum_k x_k + \\sigma^2 m_A}{N v_A + \\sigma^2}\n",
    "\\end{align*}$$    \n",
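    "> (d) A Bayesian treatment requires putting a prior on the unknown variance. The variance is constrained to be positive, hence the support of the prior distribution needs to be the positive reals. (In a multivariate case positivity needs to be extended to symmetric positive definiteness.) Choosing a conjugate prior simplifies matters greatly. In this scenario the inverse Gamma distribution is the conjugate prior for the unknown variance (equivalently, a Gamma prior on the precision $1/\\sigma^2$). With both $A$ and $\\sigma^2$ unknown, the joint conjugate prior is known in the literature as the Normal-inverse-Gamma distribution (or the Normal-Gamma distribution when parameterized by the precision). The joint posterior $p(A,\\sigma^2|D) \\propto p(D|A,\\sigma^2)\\,p(A,\\sigma^2)$ then has the same form, and estimates for $A$ and $\\sigma^2$ follow from its marginals. See https://www.seas.harvard.edu/courses/cs281/papers/murphy-2007.pdf for the analytical treatment. \n",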
    "   \n",
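    "> To make (a)-(c) concrete, here is a small numerical illustration. All numbers are made up for this sketch and do not come from the exercise: $N=4$ observations with $\\sum_k x_k = 8$ and a prior with $m_A=0$, $v_A=1$. Only $\\sigma^2=1$ is taken from the problem statement.\n",
    "$$\\begin{align*}\n",
    "\\hat{A}_{ML} &= \\frac{1}{N}\\sum_k x_k = \\frac{8}{4} = 2 \\\\\n",
    "\\hat{A}_{MAP} &= \\frac{v_A \\sum_k x_k + \\sigma^2 m_A}{N v_A + \\sigma^2} = \\frac{1\\cdot 8 + 1\\cdot 0}{4\\cdot 1 + 1} = 1.6 \\\\\n",
    "\\left(\\frac{1}{v_A} + \\frac{N}{\\sigma^2}\\right)^{-1} &= \\frac{1}{1+4} = 0.2 \\quad \\text{(posterior variance)}\n",
    "\\end{align*}$$\n",
    "The MAP estimate is pulled from the sample mean toward the prior mean $m_A$. For a very broad prior ($v_A \\to \\infty$) the MAP expression tends to $\\frac{1}{N}\\sum_k x_k$, i.e., the ML estimate.\n",
    "\n",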
    "- **[5]** (##) We consider the coin toss example from the notebook and use a conjugate prior for a Bernoulli likelihood function.    \n",
    "  (a) Derive the Maximum Likelihood estimate.    \n",
    "  (b) Derive the MAP estimate.          \n",
    "  (c) Do these two estimates ever coincide (if so under what circumstances)?    \n",
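    "> (a) Let $n$ denote the number of heads among the $N$ tosses, so that $p(D|\\mu) = \\mu^n (1-\\mu)^{N-n}$. Then\n",
    "$$\\begin{align*}\n",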
    "        \\nabla \\log p(D|\\mu) &= 0 \\\\\n",
    "        \\nabla \\left( n\\log \\mu + (N-n)\\log(1-\\mu)\\right) &= 0\\\\\n",
    "        \\frac{n}{\\mu} - \\frac{N-n}{1-\\mu} &= 0 \\\\\n",
    "        \\hat{\\mu}_{\\text{ML}} &= \\frac{n}{N}\n",
    "       \\end{align*}$$         \n",
    "> (b) Assuming a beta prior $\\mathcal{B}(\\mu|\\alpha,\\beta)$, we can write the posterior as\n",
    "   $$\\begin{align*}\n",
    "   p(\\mu|D) &\\propto p(D|\\mu)p(\\mu) \\\\\n",
    "            &\\propto \\mu^n (1-\\mu)^{N-n} \\mu^{\\alpha-1} (1-\\mu)^{\\beta-1} \\\\\n",
    "            &\\propto \\mathcal{B}(\\mu|n+\\alpha,N-n+\\beta)\n",
    "   \\end{align*}$$   \n",
    "   The mode of a beta distribution $\\mathcal{B}(a,b)$ with $a,b>1$ is located at $\\frac{a - 1}{a+b-2}$, see [Wikipedia](https://en.wikipedia.org/wiki/Beta_distribution). Hence, \n",
    "   $$\\begin{align*}\n",
    "\\hat{\\mu}_{\\text{MAP}} &= \\frac{(n+\\alpha)-1}{(n+\\alpha) + (N-n+\\beta) -2} \\\\\n",
    "  &= \\frac{n+\\alpha-1}{N + \\alpha +\\beta -2}\n",
    "\\end{align*}$$ \n",
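    "> (c) Yes. The two estimates coincide exactly for a uniform prior ($\\alpha=\\beta=1$), since the posterior is then proportional to the likelihood. More generally, for fixed $\\alpha$ and $\\beta$, the MAP estimate approaches the ML estimate as $N$ gets larger, and in the limit $N \\to \\infty$ the MAP solution converges to the ML solution."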
   ]
  },
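  {
   "cell_type": "markdown",
   "id": "c5e1a0b7",
   "metadata": {},
   "source": [
    "A short numerical illustration of exercise 5, as a sketch only. The numbers are made up for this example and are not from the notebook: $n=7$ heads in $N=10$ tosses, compared under two assumed beta priors.\n",
    "$$\\begin{align*}\n",
    "\\hat{\\mu}_{\\text{ML}} &= \\frac{n}{N} = \\frac{7}{10} = 0.7 \\\\\n",
    "\\hat{\\mu}_{\\text{MAP}}\\big|_{\\alpha=\\beta=2} &= \\frac{7+2-1}{10+2+2-2} = \\frac{8}{12} \\approx 0.667 \\\\\n",
    "\\hat{\\mu}_{\\text{MAP}}\\big|_{\\alpha=\\beta=1} &= \\frac{7+1-1}{10+1+1-2} = \\frac{7}{10} = 0.7\n",
    "\\end{align*}$$\n",
    "With the uniform prior ($\\alpha=\\beta=1$) the MAP and ML estimates agree exactly. With $\\alpha=\\beta=2$ the prior pulls the estimate toward $0.5$, but the effect fades with more data: for the same ratio at $N=1000$, $n=700$, the MAP estimate is $\\frac{701}{1002} \\approx 0.6996$."
   ]
  },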
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "27db69a5",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fb682af7",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Julia 1.6.3",
   "language": "julia",
   "name": "julia-1.6"
  },
  "language_info": {
   "file_extension": ".jl",
   "mimetype": "application/julia",
   "name": "julia",
   "version": "1.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}