{ "cells": [ { "cell_type": "markdown", "id": "87601599", "metadata": {}, "source": [ "# Continuous Data and the Gaussian Distribution\n", "\n", "\n",
"- **[1]** (##) We are given an IID data set $D = \\{x_1,x_2,\\ldots,x_N\\}$, where $x_n \\in \\mathbb{R}^M$. Let's assume that the data were drawn from a multivariate Gaussian (MVG),\n",
"$$\\begin{align*}\n",
"p(x_n|\\theta) = \\mathcal{N}(x_n|\\,\\mu,\\Sigma) = |2 \\pi \\Sigma|^{-\\frac{1}{2}} \\exp\\left\\{-\\frac{1}{2}(x_n-\\mu)^T\n",
"\\Sigma^{-1} (x_n-\\mu) \\right\\}\n",
"\\end{align*}$$ \n",
" (a) Derive the log-likelihood of the parameters for these data. \n",
" (b) Derive the maximum likelihood estimates for the mean $\\mu$ and covariance $\\Sigma$ by setting the derivative of the log-likelihood to zero.\n",
"\n",
"> (a) Let $\\theta = \\{\\mu,\\Sigma\\}$. Then the log-likelihood can be worked out as \n",
"$$\\begin{align*}\n",
"\\log p(D|\\theta) &= \\log \\prod_n p(x_n|\\theta) \\\\\n",
" &= \\log \\prod_n \\mathcal{N}(x_n|\\mu, \\Sigma) \\\\\n",
"&= \\log \\prod_n (2\\pi)^{-M/2} |\\Sigma|^{-1/2} \\exp\\left( -\\frac{1}{2}(x_n-\\mu)^T \\Sigma^{-1}(x_n-\\mu)\\right) \\\\\n",
"&= \\sum_n \\left( \\log (2\\pi)^{-M/2} + \\log |\\Sigma|^{-1/2} -\\frac{1}{2}(x_n-\\mu)^T \\Sigma^{-1}(x_n-\\mu)\\right) \\\\\n",
"&\\propto \\frac{N}{2}\\log |\\Sigma|^{-1} - \\frac{1}{2}\\sum_n (x_n-\\mu)^T \\Sigma^{-1}(x_n-\\mu)\n",
"\\end{align*}$$ \n",
"> (b) First we take the derivative with respect to the mean.\n",
"$$\\begin{align*}\n",
"\\nabla_{\\mu} \\log p(D|\\theta) &\\propto - \\sum_n \\nabla_{\\mu} \\left(x_n-\\mu \\right)^T\\Sigma^{-1}\\left(x_n-\\mu \\right) \\\\\n",
"&= - \\sum_n \\nabla_{\\mu} \\mathrm{Tr}\\left[-2 \\mu^T\\Sigma^{-1}x_n + \\mu^T \\Sigma^{-1}\\mu \\right] \\\\\n",
"&= - \\sum_n \\left(-2 \\Sigma^{-1}x_n + 2\\Sigma^{-1}\\mu \\right) \\\\\n",
"&= 2\\Sigma^{-1} \\sum_n (x_n - \\mu)\n",
"\\end{align*}$$\n",
"Setting the derivative to zero leads to $\\hat{\\mu} = \\frac{1}{N}\\sum_n x_n$.\n",
"The derivative with respect to the covariance is a bit more involved. It is actually easier to compute it by taking the derivative with respect to the precision matrix $\\Sigma^{-1}$:\n",
"$$\\begin{align*}\n",
"\\nabla_{\\Sigma^{-1}} \\log p(D|\\theta) &= \\nabla_{\\Sigma^{-1}} \\left( \\frac{N}{2} \\log |\\Sigma| ^{-1} -\\frac{1}{2}\\sum_n (x_n-\\mu)^T\n",
"\\Sigma^{-1} (x_n-\\mu)\\right) \\\\\n",
"&= \\nabla_{\\Sigma^{-1}} \\left( \\frac{N}{2} \\log |\\Sigma| ^{-1} - \\frac{1}{2}\\sum_n \\mathrm{Tr}\\left[(x_n-\\mu)\n",
"(x_n-\\mu)^T \\Sigma^{-1} \\right]\\right) \\\\\n",
"&=\\frac{N}{2}\\Sigma - \\frac{1}{2}\\sum_n (x_n-\\mu)\n",
"(x_n-\\mu)^T\n",
"\\end{align*}$$\n",
"Setting the derivative to zero leads to $\\hat{\\Sigma} = \\frac{1}{N}\\sum_n (x_n-\\hat{\\mu})\n",
"(x_n-\\hat{\\mu})^T$.\n",
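"\n",
"As a quick numerical sanity check of these ML estimators, the following minimal Julia sketch (with arbitrary, assumed example values for $\\mu$ and $\\Sigma$) draws samples from a known MVG and compares the estimates with the true parameters:\n",
"\n",
"```julia\n",
"using LinearAlgebra, Random\n",
"\n",
"Random.seed!(42)\n",
"mu_true    = [1.0, -2.0]            # arbitrary example mean\n",
"Sigma_true = [2.0 0.5; 0.5 1.0]     # arbitrary example covariance\n",
"N = 10_000\n",
"\n",
"# draw N samples from N(mu_true, Sigma_true) via the Cholesky factor of Sigma_true\n",
"L = cholesky(Sigma_true).L\n",
"X = [mu_true + L * randn(2) for _ in 1:N]\n",
"\n",
"# closed-form ML estimates derived above\n",
"mu_hat    = sum(X) / N\n",
"Sigma_hat = sum((x - mu_hat) * (x - mu_hat)' for x in X) / N\n",
"\n",
"println(\"mu_hat    = \", mu_hat)     # should be close to mu_true\n",
"println(\"Sigma_hat = \", Sigma_hat)  # should be close to Sigma_true\n",
"```\n",
"For large $N$ the estimates approach the true parameters; note that $\\hat{\\Sigma}$ uses the factor $1/N$ and is therefore a (slightly) biased estimator of $\\Sigma$.\n",
"\n",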
"- **[2]** (#) Briefly explain why the Gaussian distribution is often preferred as a prior distribution over other distributions with the same support.\n",
"> You can get this answer straight from the lesson notebook. Aside from the computational advantages (operations on distributions tend to make them more Gaussian, and Gaussians tend to remain Gaussian under computational manipulations), the Gaussian distribution is also the maximum-entropy distribution among all distributions over the real numbers with a given mean and variance. This means that there is no distribution with the same variance that assumes less information about its argument. \n",
"\n",
"- **[3]** (###) Prove that the Gaussian distribution is the maximum entropy distribution over the reals with specified mean and variance.\n",
"> This is a challenging question (e.g., too difficult for a written exam :) which requires calculus of variations to solve rigorously. We will show how to maximize the entropy functional $-\\int q(x) \\log q(x) \\mathrm{d}x$ subject to the specified constraints. We have three constraints: (1) we require $q(x)$ to be normalized, (2) $\\mathbb{E}[x] = m$ and (3) $\\mathbb{E}[x^2] = m^2+\\sigma^2$, where $m \\in \\mathbb{R}$ and $\\sigma^2 \\in \\mathbb{R}^{+}$ are arbitrary. Let us write the entropy together with the constraints and undetermined (Lagrange) multipliers as a Lagrangian\n",
"$$\\begin{align*}\n",
"L[q] = -\\int q(x)\\log q(x)\\mathrm{d}x + \\lambda \\left(\\int xq(x)\\mathrm{d}x - m\\right) + \\gamma \\left(\\int x^2q(x)\\mathrm{d}x - (\\sigma^2+m^2)\\right) + \\psi \\left(\\int q(x)\\mathrm{d}x -1 \\right).\n",
"\\end{align*}$$\n",
"We are searching for a distribution, in a space of functions, that renders the above Lagrangian stationary. This is a functional optimization problem: it is defined over a function space rather than over an ordinary vector space such as $\\mathbb{R}^N$. Even though the computational mechanics are somewhat different, the idea is the same as in ordinary optimization problems. We look at the functional derivative, which has an interpretation similar to that of a gradient (it can be thought of as the derivative of a functional with respect to a function). We want to solve\n",
"$$\\begin{align*}\n",
"\\frac{\\delta L[q]}{\\delta q} &= 0 \\\\\n",
"-\\log q(x) - 1 + \\psi + \\lambda x + \\gamma x^2 &= 0 \\\\\n",
"q(x) &= \\exp(\\psi +\\lambda x + \\gamma x^2)\n",
"\\end{align*}$$\n",
"where $\\frac{\\delta L[q]}{\\delta q}$ is the functional derivative and the constant $-1$ has been absorbed into $\\psi$ in the last step. We can plug $q(x)$ back into the constraints and solve for the multipliers. Doing that we obtain $\\lambda=\\frac{m}{\\sigma^2}$, $\\gamma = -\\frac{1}{2\\sigma^2}$, and $\\psi = -\\frac{m^2}{2\\sigma^2}-\\log \\sqrt{2\\pi \\sigma^2}$. This means that the entropy-maximizing distribution $q(x)$ is a Gaussian distribution, $q(x) = \\mathcal{N}(x\\,|\\,m,\\sigma^2)$. \n",
"\n",
"- **[4]** (##) Prove that a linear transformation $z=Ax+b$ of a Gaussian variable $\\mathcal{N}(x|\\mu,\\Sigma)$ is Gaussian distributed as\n",
"$$\n",
"p(z) = \\mathcal{N} \\left(z \\,|\\, A\\mu+b, A\\Sigma A^T \\right) \n",
"$$ \n",
"\n",
"> First, we show that a linear transformation of a Gaussian is again a Gaussian. In general, for an invertible transformation $z=g(x)$, the transformed distribution is given by\n",
"$$ p_Z(z) = \\frac{p_X(g^{-1}(z))}{\\left|\\mathrm{det}[J_g]\\right|}\\,,$$ where $J_g$ is the Jacobian of $g$ evaluated at $g^{-1}(z)$. Since the transformation is linear, $J_g = A$ everywhere, so $\\mathrm{det}[J_g] = \\mathrm{det}[A]$, which is independent of $z$. Moreover, $g^{-1}(z) = A^{-1}(z-b)$ is linear in $z$, and consequently $p_Z(z)$ has the same functional form as $p_X(x)$, i.e. $p_Z(z)$ is also a Gaussian. The mean and covariance can easily be determined by the calculation that we used in [question 8 of the Probability Theory exercises](https://nbviewer.org/github/bertdv/BMLIP/blob/master/lessons/exercises/Solutions-Probability-Theory-Review.ipynb#distribution-of-sum). This results in\n",
"$p(z) = \\mathcal{N} \\left(z \\,|\\, A\\mu+b, A\\Sigma A^T \\right)$.\n",
"\n",
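"This result can also be checked numerically. The following minimal Monte Carlo sketch (with arbitrary, assumed example values for $A$, $b$, $\\mu$ and $\\Sigma$) samples $x \\sim \\mathcal{N}(\\mu,\\Sigma)$, applies $z = Ax + b$, and compares the sample mean and covariance of $z$ with $A\\mu+b$ and $A\\Sigma A^T$:\n",
"\n",
"```julia\n",
"using LinearAlgebra, Random\n",
"\n",
"Random.seed!(1)\n",
"mu    = [0.5, -1.0]              # arbitrary example mean\n",
"Sigma = [1.0 0.3; 0.3 2.0]       # arbitrary example covariance\n",
"A     = [2.0 1.0; 0.0 1.0]       # arbitrary example transformation matrix\n",
"b     = [1.0, -3.0]              # arbitrary example offset\n",
"N = 100_000\n",
"\n",
"# sample x ~ N(mu, Sigma) via the Cholesky factor and transform to z = A*x + b\n",
"L = cholesky(Sigma).L\n",
"Z = [A * (mu + L * randn(2)) + b for _ in 1:N]\n",
"\n",
"z_mean = sum(Z) / N\n",
"z_cov  = sum((z - z_mean) * (z - z_mean)' for z in Z) / N\n",
"\n",
"println(\"sample mean = \", z_mean, \"   theory: \", A * mu + b)\n",
"println(\"sample cov  = \", z_cov,  \"   theory: \", A * Sigma * A')\n",
"```\n",
"Up to Monte Carlo error, the empirical moments should match $A\\mu+b$ and $A\\Sigma A^T$. The same check applies to exercise [5] below, since $z = A\\cdot(x-y)+b$ is a linear transformation of the jointly Gaussian pair $(x,y)$.\n",
"\n",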
"- **[5]** (#) Given independent variables\n",
"$x \\sim \\mathcal{N}(\\mu_x,\\sigma_x^2)$ and $y \\sim \\mathcal{N}(\\mu_y,\\sigma_y^2)$, what is the PDF for $z = A\\cdot(x - y) + b$?\n",
"> $z$ is also Gaussian with \n",
"$$\n",
"p_z(z) = \\mathcal{N}(z \\,|\\, A(\\mu_x-\\mu_y)+b, \\, A (\\sigma_x^2 + \\sigma_y^2) A^T)\n",
"$$\n",
"\n",
"\n",
"- **[6]** (###) Compute\n",
"\n",
"\\begin{equation*}\n",
" \\int_{-\\infty}^{\\infty} \\exp(-x^2)\\mathrm{d}x \\,.\n",
" \\end{equation*}\n",
" \n",
"> For a Gaussian with zero mean and variance equal to $1$ we have\n",
"$$\n",
"\\int \\frac{1}{\\sqrt{2\\pi}}\\exp(-\\frac{1}{2}x^2) \\mathrm{d}x = 1\n",
"$$\n",
"Substitution of $x = \\sqrt{2}y$ with $\\mathrm{d}x=\\sqrt{2}\\mathrm{d}y$ will simply lead you to $ \\int_{-\\infty}^{\\infty} \\exp(-y^2)\\mathrm{d}y=\\sqrt{\\pi}$.\n",
"If you don't want to use the result of the Gaussian integral, you can still do this integral, see [youtube clip](https://www.youtube.com/watch?v=FYNHt4AMxc0). \n",
"\n", "\n", "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "663a1d04", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "@webio": { "lastCommId": null, "lastKernelId": null }, "kernelspec": { "display_name": "Julia 1.6.3", "language": "julia", "name": "julia-1.6" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "1.6.3" } }, "nbformat": 4, "nbformat_minor": 5 }