{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Variational Inference\n",
"\n",
"[Open in Colab](https://colab.research.google.com/github/googlecolab/colabtools/blob/master/notebooks/colab-github-demo.ipynb)\n",
"[View in nbviewer](https://nbviewer.jupyter.org/github/Sayam753/Sayam753.github.io/blob/website/docs/gsoc/variational-inference.ipynb)\n",
"\n",
"Variational Inference is a powerful algorithm for fitting Bayesian networks. In this blog, you will learn about the math and intuition behind Variational Inference and the Mean Field approximation, and their implementation in TensorFlow Probability.\n",
"\n",
"## Intro to Bayesian Networks\n",
"\n",
"### Random Variables\n",
"\n",
"Random Variables are simply variables whose values are uncertain. E.g. -\n",
"\n",
"1. When flipping a coin $n$ times, a random variable $X$ can be the number of heads that show up.\n",
"\n",
"2. In the COVID-19 pandemic, a random variable can be the number of patients testing positive for the virus each day.\n",
"\n",
"### Probability Distributions\n",
"\n",
"Probability Distributions govern the amount of uncertainty in random variables. Each distribution has an associated math function that assigns probabilities to the different values taken by a random variable, called the probability density function (pdf). For simplicity, let's denote any random variable as $X$ and its corresponding pdf as $P\\left (X\\right )$. E.g. the following figure shows the probability distribution for the number of heads when an unbiased coin is flipped 5 times.\n",
"\n",
"\n",
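"The figure's distribution can be reproduced in plain Python. The following minimal sketch (an illustration, not part of the original notebook) computes the Binomial$(5, 0.5)$ pmf, i.e. the probability of each possible number of heads in 5 flips of an unbiased coin.\n",
"\n",
"```python\n",
"from math import comb\n",
"\n",
"n, p = 5, 0.5  # 5 flips of an unbiased coin\n",
"# P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)\n",
"pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]\n",
"print(pmf)  # [0.03125, 0.15625, 0.3125, 0.3125, 0.15625, 0.03125]\n",
"```\n",
"\n",
"The probabilities sum to 1, and the pmf is symmetric because the coin is unbiased.\n",
"\n",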
"### Bayesian Networks\n",
"\n",
"Bayesian Networks are graph-based representations to account for randomness while modelling our data. The nodes of the graph are random variables and the connections between nodes denote the direct influence from parent to child.\n",
"\n",
"### Bayesian Network Example\n",
"\n",
"\n",
"\n",
"Let's say a student is taking a class at school. The `difficulty` of the class and the `intelligence` of the student together directly influence the student's `grades`. And the `grades` affect his/her acceptance to the university. Also, the `intelligence` factor influences the student's `SAT` score. Keep this example in mind.\n",
"\n",
"More formally, Bayesian Networks represent joint probability distribution over all the nodes of graph -\n",
"$P\\left (X_1, X_2, X_3, ..., X_n\\right )$ or $P\\left (\\bigcap_{i=1}^{n}X_i\\right )$ where $X_i$ is a random variable. Also, Bayesian Networks follow the local Markov property, by which every node in the graph is independent of its **non-descendants** given its **parents**. In this way, the joint probability distribution can be decomposed as -\n",
"\n",
"$$\n",
"P\\left (X_1, X_2, X_3, ..., X_n\\right ) = \\prod_{i=1}^{n} P\\left (X_i | Par\\left (X_i\\right )\\right )\n",
"$$\n",
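"To make the factorization concrete, here is a small sketch using the student example, with made-up conditional probability tables (the numbers are illustrative assumptions, not from the original). The joint distribution is built as a product of local conditionals, and a valid joint must sum to 1 over all assignments.\n",
"\n",
"```python\n",
"from itertools import product\n",
"\n",
"# Hypothetical CPDs for the student network; all numbers are made up.\n",
"# Binary variables: D (difficulty), I (intelligence), G (grade),\n",
"# S (SAT), L (letter). Structure: D, I -> G; I -> S; G -> L.\n",
"P_D = {0: 0.6, 1: 0.4}\n",
"P_I = {0: 0.7, 1: 0.3}\n",
"P_G = {(0, 0): {0: 0.3, 1: 0.7}, (0, 1): {0: 0.1, 1: 0.9},\n",
"       (1, 0): {0: 0.8, 1: 0.2}, (1, 1): {0: 0.4, 1: 0.6}}\n",
"P_S = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}\n",
"P_L = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}}\n",
"\n",
"def joint(d, i, g, s, l):\n",
"    # P(D, I, G, S, L) = P(D) P(I) P(G|D,I) P(S|I) P(L|G)\n",
"    return P_D[d] * P_I[i] * P_G[(d, i)][g] * P_S[i][s] * P_L[g][l]\n",
"\n",
"# A valid joint distribution must sum to 1 over all 2^5 assignments.\n",
"total = sum(joint(*x) for x in product([0, 1], repeat=5))\n",
"print(round(total, 10))  # 1.0\n",
"```\n",
"\n",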
"\n",
"**Extra: Proof of decomposition**\n",
"\n",
" First, let's recall conditional probability -\n",
" $$P\\left (A|B\\right ) = \\frac{P\\left (A, B\\right )}{P\\left (B\\right )}$$\n",
" The above equation follows from the reduction of the sample space of $A$ when $B$ has already occurred.\n",
" Now, rearranging the terms -\n",
" $$P\\left (A, B\\right ) = P\\left (A|B\\right )*P\\left (B\\right )$$\n",
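" A quick numerical sanity check of the chain rule (using an arbitrary joint table chosen for illustration, not from the original) -\n",
"\n",
"```python\n",
"# Arbitrary joint distribution over two binary variables A and B.\n",
"P_AB = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}\n",
"\n",
"# Derive the marginal P(B) and the conditional P(A|B) from the joint.\n",
"P_B = {b: P_AB[(0, b)] + P_AB[(1, b)] for b in (0, 1)}\n",
"P_A_given_B = {(a, b): P_AB[(a, b)] / P_B[b] for a in (0, 1) for b in (0, 1)}\n",
"\n",
"# Chain rule: P(A, B) = P(A|B) * P(B) holds for every assignment.\n",
"for (a, b), p in P_AB.items():\n",
"    assert abs(p - P_A_given_B[(a, b)] * P_B[b]) < 1e-12\n",
"print('chain rule verified')\n",
"```\n",
"\n",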
" This equation is called the chain rule of probability. Let's generalize this rule to Bayesian Networks. The nodes are ordered such that the parent(s) of each node lie above it (Breadth First Ordering).\n",
" $$P\\left (X_1, X_2, X_3, ..., X_n\\right ) = P\\left (X_n, X_{n-1}, X_{n-2}, ..., X_1\\right )\\\\\n",
" = P\\left (X_n|X_{n-1}, X_{n-2}, X_{n-3}, ..., X_1\\right ) * P \\left (X_{n-1}, X_{n-2}, X_{n-3}, ..., X_1\\right ) \\left (\\text{Chain Rule}\\right )\\\\ \n",
" = P\\left (X_n|X_{n-1}, X_{n-2}, X_{n-3}, ..., X_1\\right ) * P \\left (X_{n-1}|X_{n-2}, X_{n-3}, X_{n-4}, ..., X_1\\right ) * P \\left (X_{n-2}, X_{n-3}, X_{n-4}, ..., X_1\\right )$$\n",
" Applying the chain rule repeatedly, we get the following equation -\n",
" $$P\\left (\\bigcap_{i=1}^{n}X_i\\right ) = \\prod_{i=1}^{n} P\\left (X_i \\Big| \\bigcap_{j=1}^{i-1}X_j\\right )$$\n",
" Keep the above equation in mind, and let's bring back the Markov property. To build some intuition behind it, let's reuse the Bayesian Network Example. If the student scored very good grades, then it is highly likely the student gets an acceptance letter from the university - no matter how difficult the class was, how intelligent the student was, or what his/her SAT score was. The key thing to note is that by observing a node's parent(s), the influence of the non-descendants on the node is eliminated. Now, the equation becomes -\n",
" $$P\\left (\\bigcap_{i=1}^{n}X_i\\right ) = \\prod_{i=1}^{n} P\\left (X_i | Par\\left (X_i\\right )\\right )$$\n",
" Bingo, with the above equation, we have proved the Factorization Theorem of Probability.\n",
"**Note**\n",
"\n",
" If two distributions are similar, then their entropies are similar, which implies the KL divergence between the two distributions will be small, and vice versa. In Variational Inference, the whole idea is to minimize the KL divergence so that our approximating distribution $q(\\theta)$ can be made similar to $p(\\theta|D)$.\n",
"\n",
" If you explore any paper on Variational Inference, then most certainly, the paper mentions latent variables instead of parameters. The parameters are fixed quantities of the model, whereas latent variables are unobserved quantities of the model, conditioned on parameters. Also, we model parameters by probability distributions. For simplicity, let's keep the running terminology of parameters only.\n",
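" To build some intuition for the KL divergence, here is a minimal sketch (an illustration, not from the original notebook) using the closed-form KL divergence between two univariate Gaussians - it is zero exactly when the two distributions coincide and grows as they move apart.\n",
"\n",
"```python\n",
"from math import log\n",
"\n",
"def kl_normal(m1, s1, m2, s2):\n",
"    # Closed-form KL( N(m1, s1^2) || N(m2, s2^2) ) for univariate Gaussians.\n",
"    return log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5\n",
"\n",
"print(kl_normal(0.0, 1.0, 0.0, 1.0))  # 0.0  (identical distributions)\n",
"print(kl_normal(0.0, 1.0, 1.0, 1.0))  # 0.5  (means one unit apart)\n",
"```\n",
"\n",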
"**Observe**\n",
"\n",
" Minimizing the KL divergence is equivalent to maximizing the ELBO. Also, the ELBO does not depend on the posterior distribution.\n",
"\n",
" To simplify notation, let's use $Y=T(X)$ instead of $\\zeta=T(\\theta)$. After reaching the results, we will substitute the values back. Also, let's denote the cumulative distribution function (cdf) as $F$. There are two cases, with respect to the properties of the function $T$.\n",
"\n",
" Case 1 - When $T$ is an increasing function\n",
"\n",
" $$F_Y(y) = P(Y \\leq y) = P(T(X) \\leq y) = P\\left(X \\leq T^{-1}(y) \\right) = F_X\\left(T^{-1}(y) \\right)$$\n",
"\n",
" Differentiating both sides with respect to $y$ -\n",
"\n",
" $$\\frac{\\mathrm{d} (F_Y(y))}{\\mathrm{d} y} = \\frac{\\mathrm{d} (F_X\\left(T^{-1}(y) \\right))}{\\mathrm{d} y}\\\\\n",
" P_Y(y) = P_X\\left(T^{-1}(y) \\right) \\frac{\\mathrm{d} (T^{-1}(y))}{\\mathrm{d} y}$$\n",
"\n",
" Case 2 - When $T$ is a decreasing function\n",
"\n",
" $$F_Y(y) = P(Y \\leq y) = P(T(X) \\leq y) = P\\left(X \\geq T^{-1}(y) \\right)\\\\\n",
" = 1-P\\left(X < T^{-1}(y) \\right) = 1-P\\left(X \\leq T^{-1}(y) \\right) = 1-F_X\\left(T^{-1}(y) \\right)$$\n",
"\n",
" Differentiating both sides with respect to $y$, and noting that $\\frac{\\mathrm{d} (T^{-1}(y))}{\\mathrm{d} y}$ is negative since $T$ is decreasing, so the two minus signs cancel -\n",
"\n",
" $$\\frac{\\mathrm{d} (F_Y(y))}{\\mathrm{d} y} = \\frac{\\mathrm{d} (1-F_X\\left(T^{-1}(y) \\right))}{\\mathrm{d} y}\\\\\n",
" P_Y(y) = - P_X\\left(T^{-1}(y) \\right) \\frac{\\mathrm{d} (T^{-1}(y))}{\\mathrm{d} y} = P_X\\left(T^{-1}(y) \\right) \\left | \\frac{\\mathrm{d} (T^{-1}(y))}{\\mathrm{d} y} \\right |$$\n",
"\n",
" Combining both results -\n",
"\n",
" $$P_Y(y) = P_X\\left(T^{-1}(y) \\right) \\left | \\frac{\\mathrm{d} (T^{-1}(y))}{\\mathrm{d} y} \\right |$$\n",
"\n",
" Now come the Jacobians, to deal with multivariate parameters $X$ and $Y$ -\n",
"\n",
" $$J_{T^{-1}}(Y) = \\begin{bmatrix}\n",
" \\frac{\\partial (T_1^{-1})}{\\partial y_1} & ... & \\frac{\\partial (T_1^{-1})}{\\partial y_k}\\\\\n",
" \\vdots & \\ddots & \\vdots\\\\\n",
" \\frac{\\partial (T_k^{-1})}{\\partial y_1} & ... & \\frac{\\partial (T_k^{-1})}{\\partial y_k}\n",
" \\end{bmatrix}$$\n",
"\n",
" Concluding -\n",
"\n",
" $$P(Y) = P(T^{-1}(Y)) \\left |\\det J_{T^{-1}}(Y) \\right |\\\\\n",
" P(Y) = P(X) \\left |\\det J_{T^{-1}}(Y) \\right |$$\n",
"\n",
" Substituting $X$ as $\\theta$ and $Y$ as $\\zeta$, we get -\n",
"\n",
" $$P(\\zeta) = P(T^{-1}(\\zeta)) \\left |\\det J_{T^{-1}}(\\zeta) \\right |$$\n",
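" The change-of-variables result can be checked numerically. The sketch below (an illustration, not from the original notebook) takes $X \\sim N(0, 1)$ and $T(x) = e^x$, so $T^{-1}(y) = \\ln y$ and $\\left | \\mathrm{d} (T^{-1}(y))/\\mathrm{d} y \\right | = 1/y$; the resulting density $P_Y$ (the lognormal) should integrate to approximately 1.\n",
"\n",
"```python\n",
"from math import exp, log, pi, sqrt\n",
"\n",
"def p_x(x):\n",
"    # Standard normal density P_X\n",
"    return exp(-x**2 / 2) / sqrt(2 * pi)\n",
"\n",
"def p_y(y):\n",
"    # P_Y(y) = P_X(T^{-1}(y)) * |d(T^{-1}(y))/dy| with T^{-1}(y) = log(y)\n",
"    return p_x(log(y)) / y\n",
"\n",
"# Trapezoidal integration over (0, 100]; total mass should be close to 1.\n",
"ys = [1e-6 + k * 1e-3 for k in range(100001)]\n",
"area = sum((p_y(a) + p_y(b)) / 2 * (b - a) for a, b in zip(ys, ys[1:]))\n",
"print(round(area, 3))  # close to 1.0\n",
"```\n",
"\n",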
"**Success**\n",
"\n",
" The above ELBO equation is the final one which needs to be optimized.\n",
"