{ "cells": [ { "cell_type": "markdown", "id": "27c5d444", "metadata": {}, "source": [ "# Latent Variable Models and Variational Bayes\n", "\n", "\n", "- **[1]** (##) For a Gaussian mixture model, given by generative equations\n", "\n", "$$\n", "p(x,z) = \\prod_{k=1}^K (\\underbrace{\\pi_k \\cdot \\mathcal{N}\\left( x | \\mu_k, \\Sigma_k\\right) }_{p(x,z_{k}=1)})^{z_{k}} \n", "$$\n", "\n", "proof that the marginal distribution for observations $x_n$ evaluates to \n", "\n", "$$\n", "p(x) = \\sum_{j=1}^K \\pi_k \\cdot \\mathcal{N}\\left( x | \\mu_j, \\Sigma_j \\right) \n", "$$\n", "\n", "\n", "- **[2]** (#) Given the free energy functional $F[q] = \\sum_z q(z) \\log \\frac{q(z)}{p(x,z)}$, proof the [EE, DE and AC decompositions](https://nbviewer.jupyter.org/github/bertdv/BMLIP/blob/master/lessons/notebooks/Latent-Variable-Models-and-VB.ipynb#fe-decompositions). \n", " \n", "\n", "- **[3]** (#) The Free energy functional $\\mathrm{F}[q] = -\\sum_z q(z) \\log p(x,z) - \\sum_z q(z) \\log \\frac{1}{q(z)}$ decomposes into \"Energy minus Entropy\". So apparently the entropy of the posterior $q(z)$ is maximized. This entropy maximization may seem puzzling at first because inference should intuitively lead to *more* informed posteriors, i.e., posterior distributions whose entropy is smaller than the entropy of the prior. Explain why entropy maximization is still a reasonable objective. \n", " \n", "\n", "- **[4]** (#) Explain the following update rule for the mean of the Gaussian cluster-conditional data distribution (from the example about mean-field updating of a Gaussian mixture model):\n", "\n", "$$\n", "m_k = \\frac{1}{\\beta_k} \\left( \\beta_0 m_0 + N_k \\bar{x}_k \\right) \\tag{B-10.61} \n", "$$\n", " \n", "- **[5]** (##) Consider a model $p(x,z|\\theta)$, where $D=\\{x_1,x_2,\\ldots,x_N\\}$ is observed, $z$ are unobserved variables and $\\theta$ are parameters. The EM algorithm estimates the parameters by iterating over the following two equations ($i$ is the iteration index):\n", "\n", "$$\\begin{align*}\n", "q^{(i)}(z) &= p(z|D,\\theta^{(i-1)}) \\\\\n", "\\theta^{(i)} &= \\arg\\max_\\theta \\sum_z q^{(i)}(z) \\cdot \\log p(D,z|\\theta)\n", "\\end{align*}$$\n", "\n", "Proof that this algorithm minimizes the Free Energy functional \n", "$$\\begin{align*}\n", "F[q,\\theta] = \\sum_z q(z) \\log \\frac{q(z)}{p(D,z|\\theta)} \n", "\\end{align*}$$\n", " \n", "\n", "- **[6]** (###) Consult the internet on what *overfitting* and *underfitting* is and then explain how FE minimization finds a balance between these two (unwanted) extremes.\n", " \n", "\n", "- **[7]** (##) Consider a model $p(x,z|\\theta) = p(x|z,\\theta) p(z|\\theta)$ where $x$ and $z$ relate to observed and unobserved variables, respectively. Also available is an observed data set $D=\\left\\{x_1,x_2,\\ldots,x_N\\right\\}$. One iteration of the EM-algorithm for estimating the parameters $\\theta$ is described by ($m$ is the iteration counter)\n", "$$\n", "\\hat{\\theta}^{(m+1)} := \\arg \\max_\\theta \\left(\\sum_z p(z|x=D,\\hat{\\theta}^{(m)}) \\log p(x=D,z|\\theta) \\right) \\,.\n", "$$\n", "\n", " (a) Apparently, in order to execute EM, we need to work out an expression for the 'responsibility' $p(z|x=D,\\hat{\\theta}^{(m)})$. Use Bayes rule to show how we can compute the responsibility that allows us to execute an EM step. \n", "\n", " (b) Why do we need multiple iterations in the EM algorithm? 
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9e4b3855",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Julia 1.5.2",
   "language": "julia",
   "name": "julia-1.5"
  },
  "language_info": {
   "file_extension": ".jl",
   "mimetype": "application/julia",
   "name": "julia",
   "version": "1.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}