{
 "metadata": {
  "name": "06_Graphical_Models"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h1>Graphical Models</h1>\n",
      "Probabilistic graph models provide an intuitive way to express joint probability relationships. The nodes in a graph represent random variables and the edges between nodes represent a probabilistic relationship between variables. \n",
      "\n",
      "<h2>Bayesian Networks</h2>\n",
      "Here we consider **acyclic directed** graphical models that express the joint probability relationship between some set of variables, known as *Bayesian Networks* because of the prominant use of Bayes' Theorem. To begin, consider the joint probability distribution $p\\left( x_1, \\ldots, x_K \\right)$ over $K$ variables. Using repeated application of the product rule, this can be expressed as \n",
      "\n",
      "$$p \\left(x_1, \\ldots, x_K \\right) = p \\left(x_K \\mid x_1, \\ldots, x_{K-1} \\right) \\ldots p\\left(x_2 \\mid x_1 \\right) p\\left(x_1\\right) $$\n",
      "\n",
      "This can be represented graphically as a **fully connected** directed graph having $K$ nodes, with each node having **incoming** links from all **lower** numbered nodes. \n",
      "\n",
      "Now consider the general case where the graph is **not necessarily** fully connected. Let the a given node $x_k$ have **incoming** links from the set of nodes $pa_k$ known as the *parent nodes* of $x_k$. Then the joint distribution defined by the directed acyclic graph over all nodes is given by the product of a conditional distribution for *each node conditioned on its parent nodes*. A graph with $K$ nodes has the joint distribution defined as\n",
      "\n",
      "$$ p\\left(\\mathbf{x}\\right) = \\prod_{k=1}^K p\\left(x_k \\mid pa_k \\right) $$\n",
      "\n",
      "<h2>Graphical Notation</h2>\n",
      "There are several conventions used to convey information in graphical models. They are briefly summarized here - see Bishop 363-365 for more detail.\n",
      "\n",
      "+ Represent multiple nodes of the form $t_1, \\ldots, t_N$ as a single node with a box, known as a *plate*, drawn around it and annotated with $N$, the number of repeated nodes.\n",
      "\n",
      "+ Deterministic model parameters, e.g. the mean of a distribution, are drawn as solid circles, and random variables as open circles\n",
      "\n",
      "+ Observed variables are drawn as shaded open circles\n",
      "\n"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h2>Conditional Independence</h2>\n",
      "Consider three random variables $a,b,c$, we use the following notation to indicate that $a$ is *conditionally idependent* of $b$ given $c$\n",
      "\n",
      "$$a \\perp\\\\!\\\\!\\\\!\\perp b \\mid c$$\n",
      "\n",
      "When this condition holds, using the product rule, we may write the joint distribution of $a$ conditioned on $b$ and $c$ as\n",
      "\n",
      "$$p\\left(a, b \\mid c \\right) = p\\left(a\\mid b,c\\right)p\\left(b\\mid c\\right) = p\\left(a\\mid c\\right) p\\left(b \\mid c\\right)$$\n",
      "\n",
      "<h3>D-separation</h3>\n",
      "Given the description of conditional independence above, we now consider a general directed acyclic graph. Let $A$, $B$, and $C$ be arbitrary sets of *nonintersecting* nodes. We wish to determine if the graph structure implies the conditional independence $A \\perp\\\\!\\\\!\\\\!\\perp B \\mid C$. It turns out this conidition is satisfied if **all** possible paths from **any** node in $A$ to **any** node in $B$ are *blocked*, in which case $A$ is said to be *d-separated* from $B$. A path is considered blocked if either of the following hold:\n",
      "\n",
      "**(a)** There is a node, $n \\in C$, in the path such that the arrows on the path meet head-to-tail or tail-to-tail at n\n",
      "\n",
      "**(b)** There is a node, $n \\notin C$ with none of its descendants in $C$, in the path such that the arrows meet head-to-head at the node. A node $d$ is considered to a descendant of a node $p$ if there is a path from $p$ to $d$ in which each path step is in the direction of the arrows connecting the nodes on th path.\n",
      "\n",
      "Note that model parameters (e.g. the mean of a distribution) will always be tail-to-tail and therefore play no role in determining d-separation. "
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "-------------------------------------------\n",
      "\n",
      "<h2>Training a Bayes Net</h2>\n",
      "MLE for training a Bayes Net, estimating the theta parameters for the nodes, is the equivalent to maximizing the likelihood of the data. So for each node, probability of the node being equal to $s$ be conditioned on say $f$ and $a$, then we have - TODO - generalize this to N conditional quantities\n",
      "\n",
      "$$\\theta_{s|ij} = \\frac{\\sum_{k=1}^K \\delta \\left(f_k = i, a_k =j, s_k =1 \\right)}{\\sum_{k=1}^K \\delta \\left(f_k=i, a_k=j \\right)}$$\n",
      "\n",
      "where\n",
      "$\\delta(x) = 1$ \n",
      "\n",
      "if $x$ is true, 0 otherwise"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h3>Partially Observed Data</h3>\n",
      "Can't use the MLE approach above due to lack of data\n",
      "Let $X$ be all *observed* variables values\n",
      "Let $Z$ be all *unobserved* variables\n",
      "Assume we always observe/unobserve the same variables - it is possible to generalize this\n",
      "\n",
      "Approach:\n",
      "\n",
      "$ argmax_{\\theta} E_{P(Z|X,\\theta)}\\left[log P(x,z|\\theta) \\right]$\n",
      "\n",
      "using some probability distribution on $Z$, namely, $P(Z|X,\\theta)$ - do we need a model for this distribution?\n",
      "\n"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h3>Expectation Maximization (EM) Algorithim</h3>\n",
      "guaranteed to find local maximum of expected log likelihood above\n",
      "\n",
      "$ argmax_{\\theta} E_{P(Z|X,\\theta)}\\left[log P(x,z|\\theta) \\right]$\n",
      "\n",
      "see slide at 29:59\n",
      "\n",
      "EM - general procedure for learning from partly observed data. Given observed varialbes, $X$, and unobserved variables, $Z$, define\n",
      "\n",
      "$Q\\left(\\theta' | \\theta \\right) = E_{E_{P(Z|X,\\theta)}} \\left[\\log P(X,Z | \\theta') \\right]$\n",
      "\n",
      "Iterate until convergence:\n",
      "\n",
      "* E Step: Use $X$ and current $\\theta$ to calculate $P(Z|X,\\theta)$ - done for every variable in $Z$ for each training example, \n",
      "\n",
      "* M Step: Replace current $\\theta$ by \n",
      "\n",
      "$\\theta \\leftarrow arg max_{\\theta'} Q $\n",
      "\n",
      "using step 1, plug into into last equation on slide 29:59 and pick max theta\n",
      "\n",
      "Example: see slide 38:59, 50:59\n",
      "\n",
      "More generally: Given observed variables $X$ and unobserved variables $Z$ all of which are boolean:\n",
      "\n",
      "* E Step: Calculate for each training example, $k$, the expected value of each unobserved variable in $Z$\n",
      "\n",
      "* M Step: Calculate estimates similar to MLE, but replacing each count (i.e. observed data proportions) by its expected count\n",
      "\n",
      "$\\delta(Y=1) \\rightarrow E_{Z|X,\\theta}[Y]$\n",
      "\n",
      "$\\delta(Y=0) \\rightarrow 1 - \\delta(Y=1)$\n",
      "\n"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h2>Example: Linear Gaussian Model</h2>\n",
      "*TODO* A useful example may be a Linear-Gaussian model - see Bishop pages 370-372 - "
     ]
    }
   ],
   "metadata": {}
  }
 ]
}