{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Latent Variable Models and Variational Bayes" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\n", "### Preliminaries\n", "\n", "- Goal \n", " - Introduction to latent variable models and variational inference by Free energy minimization \n", "- Materials\n", " - Mandatory\n", " - These lecture notes\n", " - Ariel Caticha (2010), [Entropic Inference](https://arxiv.org/abs/1011.0723)\n", " - tutorial on entropic inference, which is a generalization to Bayes rule and provides a foundation for variational inference.\n", " - Optional \n", " - Bishop (2016), pp. 461-486 (sections 10.1, 10.2 and 10.3) \n", " - references \n", " - Blei et al. (2017), [Variational Inference: A Review for Statisticians](https://www.tandfonline.com/doi/full/10.1080/01621459.2017.1285773) \n", " - Lanczos (1961), [The variational principles of mechanics](https://www.amazon.com/Variational-Principles-Mechanics-Dover-Physics/dp/0486650677)\n", " - Zhang et al. (2017), [Unifying Message Passing Algorithms Under the Framework of Constrained Bethe Free Energy Minimization](https://github.com/bertdv/BMLIP/blob/master/lessons/notebooks/files/Zhang-2017-Unifying-Message-Passing-Algorithms.pdf)\n", " - Dauwels (2007), [On variational message passing on factor graphs](https://github.com/bertdv/BMLIP/blob/master/lessons/notebooks/files/Dauwels-2007-on-variational-message-passing-on-factor-graphs)\n", " - Caticha (2010), [Entropic Inference](https://arxiv.org/abs/1011.0723)\n", " - Shore and Johnson (1980), [Axiomatic Derivation of the Principle of Maximum Entropy and the Principle of Minimum Cross-Entropy](https://github.com/bertdv/BMLIP/blob/master/lessons/notebooks/files/ShoreJohnson-1980-Axiomatic-Derivation-of-the-Principle-of-Maximum-Entropy.pdf)\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Illustrative Example : Density Modeling for the Old Faithful Data Set\n", "\n", "- You're now asked to build a density model for a data set ([Old Faithful](https://en.wikipedia.org/wiki/Old_Faithful), Bishop pg. 681) that clearly is not distributed as a single Gaussian:\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Unobserved Classes\n", "\n", "Consider again a set of observed data $D=\\{x_1,\\dotsc,x_N\\}$\n", "\n", "- This time we suspect that there are _unobserved_ class labels that would help explain (or predict) the data, e.g.,\n", " - the observed data are the color of living things; the unobserved classes are animals and plants.\n", " - observed are wheel sizes; unobserved categories are trucks and personal cars.\n", " - observed is an audio signal; unobserved classes include speech, music, traffic noise, etc.\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ " \n", "- Classification problems with unobserved classes are called **Clustering** problems. The learning algorithm needs to **discover the classes from the observed data**." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### The Gaussian Mixture Model\n", "\n", "- The spread of the data in the illustrative example looks like it could be modeled by two Gaussians. Let's develop a model for this data set. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Let $D=\\{x_n\\}$ be a set of observations. We associate a one-hot coded hidden class label $z_n$ with each observation:\n", "\n", "$$\\begin{equation*}\n", "z_{nk} = \\begin{cases} 1 & \\text{if } x_n \\in \\mathcal{C}_k \\text{ (the k-th class)}\\\\\n", " 0 & \\text{otherwise} \\end{cases}\n", "\\end{equation*}$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- We consider the same model as we did in the [generative classification lesson](https://nbviewer.jupyter.org/github/bertdv/BMLIP/blob/master/lessons/notebooks/Generative-Classification.ipynb#GDA):\n", "$$\\begin{align*}\n", "p(x_n | z_{nk}=1) &= \\mathcal{N}\\left( x_n | \\mu_k, \\Sigma_k\\right)\\\\\n", "p(z_{nk}=1) &= \\pi_k\n", "\\end{align*}$$\n", "which can be summarized with the selection variables $z_{nk}$ as\n", "$$\\begin{align*}\n", "p(x_n,z_n) &= p(x_n | z_n) p(z_n) = \\prod_{k=1}^K (\\underbrace{\\pi_k \\cdot \\mathcal{N}\\left( x_n | \\mu_k, \\Sigma_k\\right) }_{p(x_n,z_{nk}=1)})^{z_{nk}} \n", "\\end{align*}$$\n", "\n", "- *Again*, this is the same model as we defined for the generative classification model: A Gaussian-Categorical model but now with unobserved classes). \n", "\n", "- This model (with **unobserved class labels**) is known as a **Gaussian Mixture Model** (GMM)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### The Marginal Distribution for the GMM\n", "\n", "- In the literature, the GMM is often introduced the marginal distribution for an _observed_ data point $x_n$, given by\n", "\n", "$$\\begin{align*}{}\n", "p(x_n) &= \\sum_{z_n} p(x_n,z_n) \\\\\n", " &= \\sum_{k=1}^K \\pi_k \\cdot \\mathcal{N}\\left( x_n | \\mu_k, \\Sigma_k \\right) \\tag{B-9.12}\n", "\\end{align*}$$\n", "\n", "- Full proof as an [exercise](https://nbviewer.org/github/bertdv/BMLIP/blob/master/lessons/exercises/Exercises-Latent-Variable-Models-and-VB.ipynb). \n", "\n", "- Eq. B-9.12 reveals the link to the name Gaussian *mixture model*. The priors $\\pi_k$ are also called **mixture coefficients**. \n", " \n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### GMM is a very flexible model\n", "\n", "- GMMs are very popular models. They have decent computational properties and are **universal approximators of densities** (as long as there are enough Gaussians of course)\n", "\n", "\n", "\n", "- (In the above figure, the Gaussian components are shown in red and the pdf of the mixture models in blue)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Latent Variable Models\n", "\n", "- The GMM contains both _observed_ variables $\\{x_n\\}$, (unobserved) _parameters_ $\\theta= \\{\\pi_k,\\mu_k, \\Sigma_k\\}$ _and_ unobserved (synonym: latent, hidden) variables $\\{z_{nk}\\}$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- From a Bayesian viewpoint, both latent variables $\\{z_{nk}\\}$ and parameters $\\theta$ are just unobserved variables for which we can set a prior and compute a posterior by Bayes rule. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Note that $z_{nk}$ has a subscript $n$, hence its value depends not only on the cluster ($k$) but also on the $n$-th observation (in constrast to parameters). These observation-dependent variables are generally a useful tool to encode structure in the model. Here (in the GMM), the latent variables $\\{z_{nk}\\}$ encode (unobserved) class membership. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- By adding model structure through (equations among) latent variables, we can build very complex models. Unfortunately, inference in latent variable models can also be much more complex than for fully observed models." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Inference for GMM is Difficult\n", "\n", "\n", "- We recall here the log-likelihood for the Gaussian-Categorial Model, see [generative classification lesson)](https://nbviewer.jupyter.org/github/bertdv/BMLIP/blob/master/lessons/notebooks/Generative-Classification.ipynb)\n", "\n", "$$\n", "\\log\\, p(D|\\theta) = \\sum_{n,k} y_{nk} \\underbrace{ \\log\\mathcal{N}(x_n|\\mu_k,\\Sigma) }_{ \\text{Gaussian} } + \\underbrace{ \\sum_{n,k} y_{nk} \\log \\pi_k }_{ \\text{multinomial} } \\,.\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Since the class labels $y_{nk} \\in \\{0,1\\}$ were given, this expression decomposed into a set of simple updates for the Gaussian and multinomial distributions. For both distributions, we have conjugate priors, so inference is easily solved. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- However, for the Gaussian mixture model (same log-likelihood function with $z_{nk}$ replacing $y_{nk}$), the class labels $\\{z_{nk}\\}$ are _unobserved_ and need to estimated alongside with the parameters." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- There is no conjugate prior on the latent variables for the GMM likelihood function $p(\\underbrace{D}_{\\{x_n\\}}\\,|\\,\\underbrace{\\{z_{nk}\\},\\{\\mu_k,\\Sigma_k,\\pi_k\\}}_{\\text{all latent variables}})$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- In this lesson, we introduce an approximate Bayesian inference method known as **Variational Bayes** (VB) (also known as **Variational Inference**) that can be used for Bayesian inference in models with latent variables. Later in this lesson, we will use VB to do inference in the GMM. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- As a motivation for variational inference, we first discuss inference by the **Method of Maximum Relative Entropy**, [(Caticha, 2010)](https://arxiv.org/abs/1011.0723). " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Inference When Information is in the Form of Constraints \n", "\n", "- In the [probability theory lesson](https://nbviewer.jupyter.org/github/bertdv/BMLIP/blob/master/lessons/notebooks/Probability-Theory-Review.ipynb#Bayes-rule), we recognized Bayes rule as the fundamental rule for learning from data." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- We will now generalize this notion and consider learning as a process that updates a prior into a posterior distribution whenever new information becomes available." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- In this context, *new information* is not necessarily a new observation, but could (for instance) also relate to *constraints* on the posterior distribution." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- For example, consider a model $\\prod_n p(x_n,z) = p(z) \\prod_n p(x_n|z)$. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Our prior beliefs about $z$ are represented by $p(z)$. In the following, we will write $q(z)$ to denote a posterior distribution for $z$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- We might be interested in obtaining rational posterior beliefs $q(z)$ in consideration of the following additional constraints:\n", " 1. *Data Constraints*: e.g., two observed data points $x_1=5$ and $x_2=3$, which lead to constraints $q(x_1) = \\delta(x_1-5)$ and $q(x_2)=\\delta(x_2-3)$.\n", " 2. *Form Constraints*: e.g., we only consider the Gamma distribution for $q(z) = \\mathrm{Gam}(z|\\alpha,\\beta)$.\n", " 3. *Factorization Constraints*:, e.g., we consider independent marginals for the posterior distribution: $q(z) = \\prod_k q(z_k)$. \n", " 3. *Moment Constraints*: e.g., the first moment of the posterior is given by $\\int z q(z) \\mathrm{d}z = 3$.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "- Note that this is not \"just\" applying Bayes rule to compute a posterior, which can only deal with data constraints as specified by the likelihood function." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Note also that observations _can_ be rephrased as constraints on the posterior, e.g., observation $x_1=5$ can be phrased as a constraint $q(x_1)=\\delta(x_1-5)$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- $\\Rightarrow$ Updating a prior to a posterior on the basis of constraints on the posterior is more general than updating based on observations alone." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### The Method of Maximum Relative Entropy" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- [Caticha (2010)](https://arxiv.org/abs/1011.0723) (based on earlier work by [Shore and Johnson (1980)](https://github.com/bertdv/BMLIP/blob/master/lessons/notebooks/files/ShoreJohnson-1980-Axiomatic-Derivation-of-the-Principle-of-Maximum-Entropy.pdf)) developed the **method of maximum (relative) entropy** for rational updating of priors to posteriors when faced with new information in the form of constraints." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- Consider prior beliefs $p(z)$ about $z$. New information in the form of constraints is obtained and we are interested in the \"best update\" to a posterior $q(z)$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- In order to define what \"best update\" means, we assume a ranking function $S[q,p]$ that generates a preference score for each candidate posterior $q$ for a given $p$. The best update from $p$ to $q$ is identified as $q^* = \\arg\\max_q S[q,p]$. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Similarly to [Cox' method to deriving probability theory](https://en.wikipedia.org/wiki/Cox%27s_theorem), Caticha introduced the following mild criteria based on a rational principle (the **principle of minimal updating**, see [Caticha 2010](https://arxiv.org/abs/1011.0723)) that the ranking function needs to adhere to: \n", " 1. *Locality*: local information has local effects.\n", " 2. *Coordinate invariance*: the system of coordinates carries no information. \n", " 3. *Independence*: When systems are known to be independent, it should not matter whether they are treated separately or jointly. \n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "- These three criteria **uniquely identify the Relative Entropy** (RE) as the proper ranking function: \n", "$$\\begin{align*}\n", "S[q,p] = - \\sum_z q(z) \\log \\frac{q(z)}{p(z)}\n", "\\end{align*}$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- $\\Rightarrow$ When information is supplied in the form of constaints on the posterior, we *should* select the posterior that maximizes the Relative Entropy, subject to the constraints. This is the **Method of Maximum (Relative) Entropy** (MRE). " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Relative Entropy, KL Divergence and Free Energy\n", "\n", "- Note that the Relative Entropy is technically a *functional*, i.e., a function of function(s) (since it is a function of probability distributions). The calculus of functionals is also called **variational calculus**.\n", " - See [Lanczos, The variational principles of mechanics (1961)](https://www.amazon.com/Variational-Principles-Mechanics-Dover-Physics/dp/0486650677) for a great book on variational calculus. For a summary, see Bishop (2016), app.D.\n", " - It is customary to use square brackets (like $S[q,p]$) for functionals rather than round brackets, which we use for functions. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Also note the complimentary relation between the Maximum Relative Entropy method and Probability Theory: \n", " - PT describes how to _represent_ beliefs about events and how to _calculate_ beliefs about joint and conditional events. \n", " - In contrast, the MRE method describes how to _update_ beliefs when new information becomes available.\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- PT and the MRE method are both needed to describe the theory of optimal information processing. (As we will see below, Bayes rule is a special case of the MRE method.)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- In principle, entropies can always be considered as a *relative* score against a reference distribution. For instance, the score $-\\sum_{z_i} q(z_i) \\log q(z_i)$ can be interpreted as a score against the uniform distribution, i.e., $-\\sum_{z_i} q(z_i) \\log q(z_i) \\propto -\\sum_{z_i} q(z_i) \\log \\frac{q(z_i)}{\\mathrm{Uniform(z_i)}}$. Therefore, the \"method of maximum relative entropy\" is often simply known as the \"method of maximum entropy\". " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "- The negative relative entropy is known as the **Kullback-Leibler** (KL) divergence:\n", "$$\n", "\\mathrm{KL}[q,p] \\triangleq \\sum_z q(z) \\log \\frac{q(z)}{p(z)} \\tag{B-1.113}\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- The [Gibbs inequality](https://en.wikipedia.org/wiki/Gibbs%27_inequality) ([proof](https://en.wikipedia.org/wiki/Gibbs%27_inequality#Proof)), is a famous theorem in information theory that states that \n", "$$\\mathrm{KL}[q,p] \\geq 0 $$\n", "with equality only iff $p=q$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ " - The KL divergence can be interpreted as a \"distance\" between two probability distributions. Note however that the KL divergence is an asymmetric distance measure, i.e., in general $\\mathrm{KL}[q,p] \\neq \\mathrm{KL}[p,q]$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- We also introduce here the notion of a (variational) **Free Energy** (FE) functional, which is just a generalization of the KL-divergence that allows the prior to be unnormalized. Let $f(z) = Z \\cdot p(z)$ be an unnormalized distribution, then the FE is defined as\n", "$$\\begin{align*}\n", "F[q,p] &\\triangleq \\sum_z q(z) \\log \\frac{q(z)}{f(z)} \\\\\n", "&= \\sum_z q(z) \\log \\frac{q(z)}{Z\\cdot p(z)} \\\\\n", "&= \\underbrace{\\sum_z q(z) \\log \\frac{q(z)}{p(z)}}_{\\text{KL divergence }\\geq 0} - \\log Z\n", "\\end{align*}$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Since the second term ($\\log Z$) is constant, FE minimization (subject to constraints) with respect to $q$ leads to the same posteriors as KL minimization and RE maximization." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- If we are only interested in minimizing FE with respect to $q$, then we'll write $F[q]$ (rather than $F[q,p]$) fo brevity. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- In this class, we prefer to discuss inference in terms of minimizing Free Energy (subject to the constraints) rather than maximizing Relative Entropy, but note that these two concepts are equivalent. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Example KL divergence for Gaussians\n", "\n", "\n", "\n", "source: By Mundhenk at English Wikipedia, CC BY-SA 3.0, Link\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### The Free Energy Functional and Variational Bayes\n", "\n", "- Let's get back to the issue of doing inference for models with latent variables (such as the GMM). " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Consider a generative model specified by $p(x,z)$ where $x$ and $z$ represent the observed and unobserved variables, respectively." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Assume further that $x$ has been observed as $x=\\hat{x}$ and we are interested in performing inference, i.e., we want to compute the posterior $p(z|x=\\hat{x})$ for the latent variables and we want to compute the evidence $p(x=\\hat{x})$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- According to the Method of Maximum Relative Entropy, in order to find the correct posterior $q(x,z)$, we should minimize\n", "\n", "$$\n", "\\mathrm{F}[q] = \\sum_{x,z} q(x,z) \\log \\frac{q(x,z)}{p(x,z)}\\,, \\quad \\text{subject to data constraint }x=\\hat{x}\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- The data constraint $x=\\hat{x}$ fixes posterior $q(x) = \\delta(x-\\hat{x})$, so the Free Energy evaluates to \n", "\n", "$$\\begin{align*}\n", "F[q] &= \\sum_{z} \\sum_{x}q(z|x)q(x) \\log \\frac{q(z|x) q(x)}{p(z|x) p(x)} \\\\\n", " &= \\sum_{z} \\sum_{x} q(z|x)\\delta(x-\\hat{x}) \\log \\frac{q(z|x)\\delta(x-\\hat{x})}{p(z|x) p(x)} \\\\\n", " &= \\sum_{z} q(z|\\hat{x}) \\log \\frac{q(z|\\hat{x})}{p(z|\\hat{x}) p(\\hat{x}) } \\\\\n", " &= \\underbrace{\\sum_{z}q(z|\\hat{x}) \\log \\frac{q(z|\\hat{x})}{p(z|\\hat{x})}}_{\\text{KL divergence }\\geq 0} - \\underbrace{\\log p(\\hat{x}) }\\tag{B-10.2\n", "}_{\\text{log-evidence}}\\end{align*}$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- The log-evidence term does not depend on $q$. Hence, the global minimum $q(z|\\hat{x})^* \\triangleq \\arg\\min_q F[q]$ is obtained for $$q^*(z|\\hat{x}) = p(z|\\hat{x})\\,,$$ which is the correct **Bayesian posterior**." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Furthermore, the minimal free energy $$F^* \\triangleq F[q^*] = -\\log p(\\hat{x})$$ equals minus **log-evidence**. (Or, equivalently, the evidence is given by $p(\\hat{x}) = \\exp\\left(-F[q^*] \\right)$.) " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "