{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "87601599",
   "metadata": {},
   "source": [
    "# Continuous Data and the Gaussian Distribution\n",
    "\n",
    "\n",
    "- **[1]** (##) We are given an IID data set $D = \\{x_1,x_2,\\ldots,x_N\\}$, where $x_n \\in \\mathbb{R}^M$. Let's assume that the data were drawn from a multivariate Gaussian (MVG),\n",
    "$$\\begin{align*}\n",
    "p(x_n|\\theta) = \\mathcal{N}(x_n|\\,\\mu,\\Sigma) = \\frac{1}{\\sqrt{(2 \\pi)^{M} |\\Sigma|}} \\exp\\left\\{-\\frac{1}{2}(x_n-\\mu)^T\n",
    "\\Sigma^{-1} (x_n-\\mu) \\right\\}\n",
    "\\end{align*}$$      \n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "5e471e11",
   "metadata": {},
   "source": [
    "\n",
    "  (a) Derive the log-likelihood of the parameters for these data.  \n",
    "> (a) Let $\\theta = \\{\\mu,\\Sigma\\}$. Then the log-likelihood can be worked out as \n",
    "\n",
    "$$\\begin{align*}\n",
    "\\log p(D|\\theta) &= \\log \\prod_n p(x_n|\\theta) \\\\\n",
    " &= \\log \\prod_n \\mathcal{N}(x_n|\\mu, \\Sigma) \\\\\n",
    "&= \\log \\prod_n (2\\pi)^{-M/2} |\\Sigma|^{-1/2} \\exp\\left\\{ -\\frac{1}{2}(x_n-\\mu)^T \\Sigma^{-1}(x_n-\\mu)\\right\\} \\\\\n",
    "&= \\sum_n \\left( \\log (2\\pi)^{-M/2} + \\log  |\\Sigma|^{-1/2} -\\frac{1}{2}(x_n-\\mu)^T \\Sigma^{-1}(x_n-\\mu)\\right) \\\\\n",
    "&= \\frac{N}{2}\\log  |\\Sigma|^{-1} - \\frac{1}{2}\\sum_n (x_n-\\mu)^T \\Sigma^{-1}(x_n-\\mu) + \\text{const.}\n",
    "\\end{align*}$$  \n",
    "\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "f5f0a208",
   "metadata": {},
   "source": [
    "\n",
    "  (b) Derive the maximum likelihood estimates for the mean $\\mu$ and covariance $\\Sigma$ by setting the derivative of the log-likelihood to zero.\n",
    "> (b) First we take the derivative with respect to the mean.\n",
    "$$\\begin{align*}\n",
    "\\nabla_{\\mu} \\log p(D|\\theta) &\\propto - \\sum_n \\nabla_{\\mu} \\left(x_n-\\mu \\right)^T\\Sigma^{-1}\\left(x_n-\\mu \\right)  \\\\\n",
    "&= - \\sum_n \\nabla_{\\mu} \\left(-2 \\mu^T\\Sigma^{-1}x_n + \\mu^T \\Sigma^{-1}\\mu \\right) \\\\\n",
    "&= - \\sum_n \\left(-2 \\Sigma^{-1}x_n + 2\\Sigma^{-1}\\mu \\right) \\\\\n",
    "&= 2 \\Sigma^{-1} \\sum_n (x_n - \\mu)\n",
    "\\end{align*}$$\n",
    ">  Setting the derivative to zero leads to $\\hat{\\mu} = \\frac{1}{N}\\sum_n x_n$.\n",
    "> The derivative with respect to the covariance is a bit more involved. It is actually easier to take the derivative with respect to the precision $\\Sigma^{-1}$, using the identities $\\nabla_A \\log |A| = A^{-T}$ and $\\nabla_A \\mathrm{Tr}[BA] = B^T$:\n",
    "$$\\begin{align*}\n",
    "\\nabla_{\\Sigma^{-1}} \\log p(D|\\theta) &= \\nabla_{\\Sigma^{-1}} \\left( \\frac{N}{2} \\log |\\Sigma| ^{-1} -\\frac{1}{2}\\sum_n (x_n-\\mu)^T\n",
    "\\Sigma^{-1} (x_n-\\mu)\\right)  \\\\\n",
    "&= \\nabla_{\\Sigma^{-1}} \\left( \\frac{N}{2} \\log |\\Sigma| ^{-1} - \\frac{1}{2}\\sum_n \\mathrm{Tr}\\left[(x_n-\\mu)\n",
    "(x_n-\\mu)^T \\Sigma^{-1} \\right]\\right) \\\\\n",
    "&=\\frac{N}{2}\\Sigma - \\frac{1}{2}\\sum_n (x_n-\\mu)\n",
    "(x_n-\\mu)^T\n",
    "\\end{align*}$$    \n",
    ">  Setting the derivative to zero leads to $\\hat{\\Sigma} = \\frac{1}{N}\\sum_n (x_n-\\hat{\\mu})\n",
    "(x_n-\\hat{\\mu})^T$.\n"
   ]
  },
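  {
   "cell_type": "markdown",
   "id": "c3a91f20",
   "metadata": {},
   "source": [
    "> As a quick sanity check of the estimators in (b), here is a small worked example. The numbers are made up for illustration and are not part of the original exercise. Take $M=1$ and $N=3$ observations $D=\\{1, 2, 6\\}$. Then\n",
    "$$\\begin{align*}\n",
    "\\hat{\\mu} &= \\frac{1}{3}(1 + 2 + 6) = 3 \\\\\n",
    "\\hat{\\Sigma} &= \\frac{1}{3}\\left( (1-3)^2 + (2-3)^2 + (6-3)^2 \\right) = \\frac{14}{3}\n",
    "\\end{align*}$$\n",
    "> Note that $\\hat{\\Sigma}$ divides by $N$ rather than $N-1$, so it is the (biased) maximum likelihood estimate and not the unbiased sample variance, which would be $14/2 = 7$ here.\n"
   ]
  },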
  {
   "cell_type": "markdown",
   "id": "cd83659a",
   "metadata": {},
   "source": [
    "\n",
    "\n",
    "- **[2]** (#) Briefly explain why the Gaussian distribution is often preferred as a prior distribution over other distributions with the same support.\n",
    ">  You can get this answer straight from the lesson notebook. Aside from the computational advantages (operations on distributions tend to make them more Gaussian, and Gaussians tend to remain Gaussian under computational manipulations), the Gaussian distribution is also the maximum-entropy distribution among all distributions over the real numbers with a given mean and variance. This means that there is no distribution with the same variance that assumes less information about its argument. \n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "33c31507",
   "metadata": {},
   "source": [
    "\n",
    "- **[3]** (###) We make $N$ IID observations $D=\\{x_1 \\dots x_N\\}$ and assume the following model\n",
    "$$\\begin{aligned}\n",
    "x_k &= A + \\epsilon_k \\\\\n",
    "A &\\sim \\mathcal{N}(m_A,v_A) \\\\\n",
    "\\epsilon_k &\\sim \\mathcal{N}(0,\\sigma^2) \\,.\n",
    "\\end{aligned}$$\n",
    "We assume that $\\sigma$ has a known value and are interested in deriving an estimator for $A$.\n",
    "\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "efccd30b",
   "metadata": {},
   "source": [
    "  (a) Derive the Bayesian (posterior) estimate $p(A|D)$.   \n",
    "> Since $p(D|A) = \\prod_k \\mathcal{N}(x_k|A,\\sigma^2)$ is a Gaussian likelihood and $p(A)$ is a Gaussian prior, their multiplication is proportional to a Gaussian. We will work this out with the canonical parameterization of the Gaussian since it is easier to multiply Gaussians in that domain. This means the posterior $p(A|D)$ is    \n",
    "$$\\begin{align*}\n",
    "   p(A|D) &\\propto p(A) p(D|A) \\\\\n",
    "   &= \\mathcal{N}(A|m_A,v_A) \\prod_{k=1}^N \\mathcal{N}(x_k|A,\\sigma^2) \\\\\n",
    "   &= \\mathcal{N}(A|m_A,v_A) \\prod_{k=1}^N \\mathcal{N}(A|x_k,\\sigma^2) \\\\\n",
    "   &= \\mathcal{N}_c\\big(A \\Bigm|\\frac{m_A}{v_A},\\frac{1}{v_A}\\big)\\prod_{k=1}^N \\mathcal{N}_c\\big(A\\Bigm| \\frac{x_k}{\\sigma^2},\\frac{1}{\\sigma^2}\\big) \\\\\n",
    "       &\\propto \\mathcal{N}_c\\big(A \\Bigm| \\frac{m_A}{v_A} + \\frac{1}{\\sigma^2} \\sum_k x_k , \\frac{1}{v_A} + \\frac{N}{\\sigma^2}  \\big)      \\,, \n",
    "  \\end{align*}$$\n",
    "> where we have made use of the fact that precision-weighted means and precisions add when multiplying Gaussians. In principle this description of the posterior completes the answer. \n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "d4155662",
   "metadata": {},
   "source": [
    "\n",
    "\n",
    "  (b) (##) Derive the Maximum Likelihood estimate for $A$.\n",
    ">  The ML estimate can be found by   \n",
    "$$\\begin{align*}\n",
    "  \\nabla \\log p(D|A) &=0\\\\\n",
    "  \\nabla \\sum_k \\log \\mathcal{N}(x_k|A,\\sigma^2) &= 0 \\\\\n",
    "  \\nabla \\frac{-1}{2}\\sum_k \\frac{(x_k-A)^2}{\\sigma^2} &=0\\\\\n",
    "  \\sum_k(x_k-A) &= 0 \\\\\n",
    "  \\Rightarrow \\hat{A}_{ML} = \\frac{1}{N}\\sum_{k=1}^N x_k\n",
    "\\end{align*}$$ \n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "c7b85f5e",
   "metadata": {},
   "source": [
    "\n",
    "  (c) Derive the MAP estimate for $A$.  \n",
    ">  The MAP is simply the location where the posterior has its maximum value, which for a Gaussian posterior is its mean value. We computed in (a) the precision-weighted mean, so we need to divide by precision (or multiply by variance) to get the location of the mean:  \n",
    "$$\\begin{align*}   \n",
    "\\hat{A}_{MAP}  &= \\left( \\frac{m_A}{v_A} + \\frac{1}{\\sigma^2} \\sum_k x_k\\right)\\cdot \\left(  \\frac{1}{v_A} + \\frac{N}{\\sigma^2} \\right)^{-1} \\\\\n",
    "&= \\frac{v_A \\sum_k x_k + \\sigma^2 m_A}{N v_A + \\sigma^2}\n",
    "\\end{align*}$$    \n"
   ]
  },
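  {
   "cell_type": "markdown",
   "id": "e48b7d15",
   "metadata": {},
   "source": [
    "> It can help to rewrite the MAP estimate from (c) as a weighted average of the prior mean and the ML estimate from (b). The following is only a rearrangement of the expression above; the numerical values at the end are illustrative assumptions:\n",
    "$$\\begin{align*}\n",
    "\\hat{A}_{MAP} &= \\frac{N v_A}{N v_A + \\sigma^2}\\, \\hat{A}_{ML} + \\frac{\\sigma^2}{N v_A + \\sigma^2}\\, m_A\n",
    "\\end{align*}$$\n",
    "> For example, with $m_A = 0$, $v_A = 1$, $\\sigma^2 = 4$, $N=4$ and $\\sum_k x_k = 8$, we get $\\hat{A}_{ML} = 2$ and $\\hat{A}_{MAP} = \\frac{1\\cdot 8 + 4 \\cdot 0}{4\\cdot 1 + 4} = 1$, i.e., the estimate is pulled halfway towards the prior mean. For $N \\rightarrow \\infty$ (or $v_A \\rightarrow \\infty$) the weight on the prior vanishes and $\\hat{A}_{MAP} \\rightarrow \\hat{A}_{ML}$.\n"
   ]
  },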
  {
   "cell_type": "markdown",
   "id": "91ce0f6c",
   "metadata": {},
   "source": [
    "\n",
    "  (d) Now assume that we do not know the variance of the noise term. Describe the procedure for Bayesian estimation of both $A$ and $\\sigma^2$ (no need to fully work out closed-form estimates). \n",
    ">  A Bayesian treatment requires putting a prior on the unknown variance. The variance is constrained to be positive, hence the support of the prior distribution needs to be on the positive reals. (In a multivariate case positivity needs to be extended to symmetric positive definiteness.) Choosing a conjugate prior will simplify matters greatly. In this scenario the inverse-Gamma distribution is the conjugate prior for the unknown variance. When the mean and the variance are both unknown, the joint conjugate prior is known in the literature as the Normal-inverse-Gamma distribution (or, when parameterized by the precision, the Normal-Gamma distribution). See https://www.seas.harvard.edu/courses/cs281/papers/murphy-2007.pdf for the analytical treatment. \n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6543b2f4",
   "metadata": {},
   "source": [
    "<!---\n",
    "- **[4]** (###) Prove that the Gaussian distribution is the maximum entropy distribution over the reals with specified mean and variance. \n",
    "> This is a challenging question (e.g., too difficult for a written exam:) which requires calculus of variations to solve rigorously. We will show how to maximize the entropy functional which is $-\\int q(x) \\log q(x) \\mathrm{d}x$ with the specified constraints. We have three constraints: (1) we require $q(x)$ to be normalized, (2) $\\mathbb{E}[x] = m$ and (3) $\\mathbb{E}[x^2] = m^2+\\sigma^2$, where $m \\in \\mathbb{R}$ and $\\sigma^2 \\in \\mathbb{R}^{+}$ are arbitrary. Let us write entropy with the given constraints with undetermined multipliers as a Lagrangian\n",
    "$$\\begin{align*}\n",
    "L[q] = -\\int q(x)\\log q(x)\\mathrm{d}x + \\lambda \\left(\\int xq(x)\\mathrm{d}x - m\\right) + \\gamma \\left(\\int x^2q(x)\\mathrm{d}x - (\\sigma^2+m^2)\\right) + \\psi \\left(\\int q(x)\\mathrm{d}x -1 \\right).\n",
    "\\end{align*}$$\n",
    "We are searching for a distribution in a space of functions that maximizes the above Lagrangian. This is a functional optimization problem that is defined over a function space, as opposed to an ordinary optimization problem over $\\mathbb{R}^N$. Even though the computational mechanics are somewhat different, the idea is the same as for ordinary optimization problems. We look at the functional derivative, which has a similar interpretation as a gradient (it can be thought of as the derivative of a functional with respect to a function). We want to solve\n",
    "$$\\begin{align*}\n",
    "\\frac{\\delta L[q]}{\\delta q} &= 0 \\\\\n",
    "-\\log q(x) + \\psi + \\lambda x + \\gamma x^2 &= 0 \\\\\n",
    "q(x) &= \\exp(+\\psi +\\lambda x + \\gamma x^2)\n",
    "\\end{align*}$$\n",
    "where $\\frac{\\delta L[q]}{\\delta q} $ is the functional derivative. We can plug $q(x)$ back into the constraints and solve for the multipliers. Doing that we obtain $\\lambda=\\frac{m}{\\sigma^2}$,$\\gamma = -\\frac{1}{2\\sigma^2}$ and $\\psi = -\\frac{m^2}{2\\sigma^2}-\\log \\sqrt{2\\pi \\sigma^2}$. This means the distribution that maximizes entropy, $q(x)$, is a Gaussian distribution. \n",
    "--->\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "a3eeb763",
   "metadata": {},
   "source": [
    "\n",
    "\n",
    "- **[4]** (##) Prove that a linear transformation $z=Ax+b$ of a Gaussian variable $\\mathcal{N}(x|\\mu,\\Sigma)$ is Gaussian distributed as\n",
    "$$\n",
    "p(z) = \\mathcal{N} \\left(z \\,|\\, A\\mu+b, A\\Sigma A^T \\right) \n",
    "$$    \n",
    ">  First, we show that a linear transformation of a Gaussian is a Gaussian. In general, the transformed distribution of $z=g(x)$, for an invertible mapping $g$, is given by   \n",
    ">  $$ p_Z(z) = \\frac{p_X(g^{-1}(z))}{\\left|\\mathrm{det}\\left[\\partial g/\\partial x\\right]\\right|}\\,.$$    \n",
    "\n",
    ">  Since the transformation is linear, the Jacobian is $\\partial g/\\partial x = A$, so the denominator equals $|\\mathrm{det}[A]|$, which is independent of $z$. Consequently, $p_Z(z)$ has the same functional form as $p_X(x)$, i.e., $p_Z(z)$ is also a Gaussian. The mean and variance can easily be determined by the calculation that we used in [question 8 of the Probability Theory exercises](https://nbviewer.org/github/bertdv/BMLIP/blob/master/lessons/exercises/Solutions-Probability-Theory-Review.ipynb#distribution-of-sum). This results in    \n",
    "$$\n",
    "p(z) = \\mathcal{N}\\left( z \\,|\\, A\\mu+b, A\\Sigma A^T \\right) \\,.\n",
    "$$\n"
   ]
  },
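  {
   "cell_type": "markdown",
   "id": "f2d06a7b",
   "metadata": {},
   "source": [
    "> For completeness, the moments of $z$ can also be worked out directly from the linearity of expectation. This is a short sketch that only uses $E[x]=\\mu$ and $E[(x-\\mu)(x-\\mu)^T]=\\Sigma$; the symbol $\\Sigma_z$ for the covariance of $z$ is notation introduced here:\n",
    "$$\\begin{align*}\n",
    "E[z] &= E[Ax+b] = A\\,E[x] + b = A\\mu + b \\\\\n",
    "\\Sigma_z &= E\\left[(z-E[z])(z-E[z])^T\\right] \\\\\n",
    "&= E\\left[A(x-\\mu)(x-\\mu)^T A^T\\right] \\\\\n",
    "&= A\\,E\\left[(x-\\mu)(x-\\mu)^T\\right] A^T = A\\Sigma A^T\n",
    "\\end{align*}$$\n"
   ]
  },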
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "a8af1a2e",
   "metadata": {},
   "source": [
    "\n",
    "- **[5]** (#) Given independent variables\n",
    "$x \\sim \\mathcal{N}(\\mu_x,\\sigma_x^2)$ and $y \\sim \\mathcal{N}(\\mu_y,\\sigma_y^2)$, what is the PDF for $z = A\\cdot(x -y) + b$?    \n",
    ">  $z$ is also Gaussian with \n",
    "$$\n",
    "p_z(z) = \\mathcal{N}(z \\,|\\, A(\\mu_x-\\mu_y)+b, \\, A (\\sigma_x^2 + \\sigma_y^2) A^T)\n",
    "$$\n"
   ]
  },
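  {
   "cell_type": "markdown",
   "id": "9b5c13e8",
   "metadata": {},
   "source": [
    "> A small numerical illustration (the values are assumptions chosen for this example): for scalar $A=2$, $b=1$, $\\mu_x=1$, $\\sigma_x^2=2$, $\\mu_y=3$ and $\\sigma_y^2=1$ we get\n",
    "$$\\begin{align*}\n",
    "E[z] &= 2\\cdot(1-3)+1 = -3 \\\\\n",
    "\\mathrm{Var}[z] &= 2 \\cdot (2+1) \\cdot 2 = 12\n",
    "\\end{align*}$$\n",
    "> so $p_z(z) = \\mathcal{N}(z\\,|\\,-3, 12)$. Note that the variances add even though $y$ is subtracted, since $\\mathrm{Var}[-y] = \\mathrm{Var}[y]$.\n"
   ]
  },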
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "0376db7d",
   "metadata": {},
   "source": [
    "\n",
    "- **[6]** (###) Compute\n",
    "\n",
    "\\begin{equation*}\n",
    "        \\int_{-\\infty}^{\\infty} \\exp(-x^2)\\mathrm{d}x \\,.\n",
    "    \\end{equation*}\n",
    "   \n",
    ">  For a Gaussian with zero mean and variance equal to $1$ we have\n",
    "$$\n",
    "\\int \\frac{1}{\\sqrt{2\\pi}}\\exp(-\\frac{1}{2}x^2) \\mathrm{d}x = 1 $$\n",
    ">  Substitution of $x = \\sqrt{2}y$ with $\\mathrm{d}x=\\sqrt{2}\\mathrm{d}y$ turns this into $\\frac{1}{\\sqrt{\\pi}}\\int_{-\\infty}^{\\infty} \\exp(-y^2)\\mathrm{d}y = 1$, and hence $\\int_{-\\infty}^{\\infty} \\exp(-y^2)\\mathrm{d}y=\\sqrt{\\pi}$. If you don't want to use the result of the Gaussian integral, you can still do this integral, see [youtube clip](https://www.youtube.com/watch?v=FYNHt4AMxc0). \n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d5222b9a",
   "metadata": {},
   "source": [
    "\n",
    "- **[7]** (##) Show that the system\n",
    "$$\\begin{align*}\n",
    "p(x\\,|\\,\\theta) &= \\mathcal{N}(x\\,|\\,\\theta,\\sigma^2) \\\\\n",
    "p(\\theta) &= \\mathcal{N}(\\theta\\,|\\,\\mu_0,\\sigma_0^2)\n",
    "\\end{align*}$$\n",
    "can be written as\n",
    "$$\n",
    "p(z) = p\\left(\\begin{bmatrix} x \\\\ \\theta \\end{bmatrix}\\right) = \\mathcal{N} \\left( \\begin{bmatrix} x\\\\ \n",
    "  \\theta  \\end{bmatrix} \n",
    "  \\,\\left|\\, \\begin{bmatrix} \\mu_0\\\\ \n",
    "  \\mu_0\\end{bmatrix}, \n",
    "         \\begin{bmatrix} \\sigma_0^2+\\sigma^2  & \\sigma_0^2\\\\ \n",
    "         \\sigma_0^2 &\\sigma_0^2 \n",
    "  \\end{bmatrix} \n",
    "  \\right. \\right)\n",
    "$$\n",
    "\n",
    ">  Let's first compute the moments for the marginals $p(x)$ and $p(\\theta)$:\n",
    "$$\\begin{align*}\n",
    "p(x) &= \\int p(x|\\theta) p(\\theta) \\mathrm{d}\\theta \\\\\n",
    "  &= \\int \\mathcal{N}(x|\\theta,\\sigma^2) \\mathcal{N}(\\theta|\\mu_0,\\sigma_0^2) \\mathrm{d}\\theta \\\\\n",
    "  &= \\int \\mathcal{N}(\\theta|x,\\sigma^2) \\mathcal{N}(\\theta|\\mu_0,\\sigma_0^2) \\mathrm{d}\\theta \\\\\n",
    "  &= \\mathcal{N}(x|\\mu_0,\\sigma^2+\\sigma_0^2) \\underbrace{\\int \\mathcal{N}(\\theta| \\cdot,\\cdot) \\mathrm{d}\\theta}_{=1} \\\\\n",
    "  &= \\mathcal{N}(x|\\mu_0,\\sigma^2+\\sigma_0^2)\n",
    "\\end{align*}$$\n",
    "\n",
    "> and for $p(\\theta)$:\n",
    "$$\\begin{align*}\n",
    "p(\\theta) &= \\int p(x|\\theta) p(\\theta) \\mathrm{d}x \\\\\n",
    "  &= \\mathcal{N}(\\theta|\\mu_0,\\sigma_0^2) \\underbrace{\\int \\mathcal{N}(x|\\theta,\\sigma^2)  \\mathrm{d}x}_{=1} \\\\\n",
    "  &= \\mathcal{N}(\\theta|\\mu_0,\\sigma_0^2)\n",
    "\\end{align*}$$\n",
    "\n",
    "> With this information, and since the joint $p(x,\\theta) = p(x|\\theta)p(\\theta)$ is the exponential of a quadratic form in $x$ and $\\theta$ (and hence Gaussian), we have\n",
    "$$\n",
    "p(z) = p\\left(\\begin{bmatrix} x \\\\ \\theta \\end{bmatrix}\\right) = \\mathcal{N} \\left( \\begin{bmatrix} x\\\\ \n",
    "  \\theta  \\end{bmatrix} \n",
    "  \\,\\left|\\, \\begin{bmatrix} \\mu_0\\\\ \n",
    "  \\mu_0\\end{bmatrix}, \n",
    "         \\begin{bmatrix} \\sigma_0^2+\\sigma^2  & \\cdot \\\\ \n",
    "         \\cdot &\\sigma_0^2 \n",
    "  \\end{bmatrix} \n",
    "  \\right. \\right)\n",
    "$$\n",
    "> So, we only need to compute $\\Sigma_{x\\theta} = \\Sigma_{\\theta x}^T$. It helps here to also write the system as\n",
    "$$\\begin{align*}\n",
    "x &= \\theta + \\epsilon \\\\\n",
    "\\theta &\\sim \\mathcal{N}(\\mu_0,\\sigma_0^2) \\\\\n",
    "\\epsilon &\\sim \\mathcal{N}(0,\\sigma^2)\n",
    "\\end{align*}$$\n",
    "> Now we work out $\\Sigma_{x\\theta}$: \n",
    "$$\\begin{align*}\n",
    "\\Sigma_{x\\theta} &= E[(x-E[x])(\\theta-E[\\theta])^T] \\\\\n",
    "&= E[(x-\\mu_0)(\\theta-\\mu_0)^T] \\\\\n",
    "&= E[x\\theta^T] - \\mu_0 E[\\theta^T] - E[x]\\mu_0^T + \\mu_0 \\mu_0^T \\\\\n",
    "&= E[x\\theta^T] - \\mu_0 \\mu_0^T  \\\\\n",
    "&= E[(\\theta + \\epsilon)\\theta^T] - \\mu_0 \\mu_0^T  \\\\\n",
    "&= E[\\theta \\theta^T] + \\underbrace{E[\\epsilon]}_{=0} E[\\theta^T] - \\mu_0 \\mu_0^T \\\\\n",
    "&= Var[\\theta] + E[\\theta] E[\\theta]^T  - \\mu_0 \\mu_0^T \\\\\n",
    "&= \\sigma_0^2 + \\mu_0 \\mu_0^T - \\mu_0 \\mu_0^T \\\\\n",
    "&= \\sigma_0^2\n",
    "\\end{align*}$$\n",
    "\n",
    "<!---\n",
    "The following computation is due to Poisson. First we note that the integrand is an even function so we can write\n",
    "$$ \\int_{-\\infty}^{\\infty} \\exp(-x^2)\\mathrm{d}x = 2\\int_{0}^{\\infty} \\exp(-x^2)\\mathrm{d}x \\,.$$ Let $J = \\int_{0}^{\\infty} \\exp(-x^2)\\mathrm{d}x$ denote the integral on the right-hand side. Then we can write\n",
    "$$\\begin{align*}\n",
    "J^2 &= \\int_{0}^{\\infty} \\int_{0}^{\\infty}\\exp\\left(-(x^2+y^2)\\right)\\mathrm{d}x\\mathrm{d}y \\\\\n",
    "&= \\int_0^{\\infty}r\\exp(-r^2)\\mathrm{d}r\\int_0^{\\frac{\\pi}{2}}\\mathrm{d}\\theta \\\\\n",
    "&= -\\frac{\\pi}{2}\\frac{1}{2}\\left.\\exp(-r^2)\\right\\vert_{0}^{\\infty} \\\\\n",
    "&= \\frac{\\pi}{4}\\end{align*}$$    and consequently\n",
    "$$\\begin{equation*}\n",
    "\\int_{-\\infty}^{\\infty} \\exp(-x^2)\\mathrm{d}x = 2J = \\sqrt{\\pi}\n",
    "\\end{equation*}$$\n",
    "where the polar coordinate transformation is used and the 2d integral is taken over the first quadrant.\n",
    "\n",
    "- Derive the conditional distribution $p(x_a|x_b)$ and the marginal distribution $p(x_a)$ given that \n",
    "$$\n",
    "\\begin{bmatrix} x_a \\\\ x_b \\end{bmatrix} \\sim \\mathcal{N}\\left(\\begin{bmatrix} m_a \\\\m_b \\end{bmatrix} , \\begin{bmatrix} \\Sigma_a ~ \\Sigma_{ab} \\\\ \\Sigma_{ba} ~  \\Sigma_b\\end{bmatrix}\\right),\n",
    "$$\n",
    "where $x_a$ and $x_b$ are vectors. You may make use of \n",
    "$$\n",
    "       \\begin{bmatrix} A ~ B \\\\ C ~ D \\end{bmatrix}^{-1} = \\begin{bmatrix} M \\qquad -MBD^{-1} \\\\ -D^{-1}CM \\qquad D^{-1}+D^{-1}CMBD^{-1} \\end{bmatrix}\n",
    "$$\n",
    "where $M = (A - BD^{-1}C)^{-1}$ is the inverse of the Schur complement of $D$.\n",
    "\n",
    "    \n",
    "- In the notebook, for the model \n",
    "$$\\begin{align*}\n",
    "p(x_t |\\theta) &= \\mathcal{N}(x_t\\,|\\,\\theta,\\sigma^2) \\\\\n",
    "p(\\theta) &= \\mathcal{N}(\\theta\\,|\\,\\mu_0,\\sigma_0^2)\n",
    "\\end{align*}$$\n",
    "we found the following posterior estimator for the hidden states:\n",
    "$$\\begin{align*}\n",
    "p(\\theta|x) &= \\mathcal{N} \\left( \\theta\\,|\\,\\mu_1, \\sigma_1^2 \\right)\\,,\n",
    "\\end{align*}$$\n",
    "with\n",
    "$$\\begin{align*}\n",
    "K &= \\frac{\\sigma_0^2}{\\sigma_0^2+\\sigma^2} \\qquad \\text{($K$ is called: Kalman gain)}\\\\\n",
    "\\mu_1 &= \\mu_0 + K \\cdot (x-\\mu_0)\\\\\n",
    "\\sigma_1^2 &= \\left( 1-K \\right) \\sigma_0^2  \n",
    "\\end{align*}$$\n",
    " \n",
    "\n",
    "\n",
    "- Show that Eq.SRG-8 is a special case of Eq.SRG-4a. \n",
    "\n",
    "- Prove\n",
    "$$\n",
    "p(x,\\theta) = \\mathcal{N} \\left( \\begin{bmatrix} x\\\\ \n",
    "  \\theta  \\end{bmatrix} \n",
    "  \\,\\left|\\, \\begin{bmatrix} \\mu_0\\\\ \n",
    "  \\mu_0\\end{bmatrix}, \n",
    "         \\begin{bmatrix} \\sigma_0^2+\\sigma^2  & \\sigma_0^2\\\\ \n",
    "         \\sigma_0^2 &\\sigma_0^2 \n",
    "  \\end{bmatrix} \n",
    "  \\right. \\right)\n",
    "$$\n",
    "- Look up conditioning and marginalization in canonical coordinates and compare to the formulas for the moment parameterization of the Gaussian. Any conclusions?\n",
    "--->\n"
   ]
  },
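  {
   "cell_type": "markdown",
   "id": "7a10d4c6",
   "metadata": {},
   "source": [
    "> As a consistency check on the joint distribution in exercise 7, we can apply the standard conditioning rule for a jointly Gaussian vector, $p(\\theta|x) = \\mathcal{N}\\left(\\theta \\,|\\, \\mu_\\theta + \\Sigma_{\\theta x}\\Sigma_{xx}^{-1}(x-\\mu_x),\\, \\Sigma_{\\theta\\theta} - \\Sigma_{\\theta x}\\Sigma_{xx}^{-1}\\Sigma_{x\\theta}\\right)$. This sketch is an addition to the exercise, and the block names $\\mu_x$, $\\mu_\\theta$, $\\Sigma_{xx}$ and $\\Sigma_{\\theta\\theta}$ are notation introduced here. Substituting the entries of the mean vector and covariance matrix gives\n",
    "$$\\begin{align*}\n",
    "E[\\theta|x] &= \\mu_0 + \\frac{\\sigma_0^2}{\\sigma_0^2+\\sigma^2}(x-\\mu_0) \\\\\n",
    "Var[\\theta|x] &= \\sigma_0^2 - \\frac{\\sigma_0^4}{\\sigma_0^2+\\sigma^2} = \\frac{\\sigma_0^2 \\sigma^2}{\\sigma_0^2+\\sigma^2}\n",
    "\\end{align*}$$\n",
    "> The posterior precision is therefore $1/\\sigma_0^2 + 1/\\sigma^2$, which agrees with the rule used in exercise 3(a) that precisions add when multiplying Gaussians.\n"
   ]
  },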
  {
   "cell_type": "markdown",
   "id": "d434e691",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "@webio": {
   "lastCommId": null,
   "lastKernelId": null
  },
  "kernelspec": {
   "display_name": "Julia 1.10.4",
   "language": "julia",
   "name": "julia-1.10"
  },
  "language_info": {
   "file_extension": ".jl",
   "mimetype": "application/julia",
   "name": "julia",
   "version": "1.8.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}