{ "metadata": { "name": "06_Graphical_Models" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "

Graphical Models

\n", "Probabilistic graph models provide an intuitive way to express joint probability relationships. The nodes in a graph represent random variables and the edges between nodes represent a probabilistic relationship between variables. \n", "\n", "

Bayesian Networks

\n", "Here we consider **acyclic directed** graphical models that express the joint probability relationship between some set of variables, known as *Bayesian Networks* because of the prominant use of Bayes' Theorem. To begin, consider the joint probability distribution $p\\left( x_1, \\ldots, x_K \\right)$ over $K$ variables. Using repeated application of the product rule, this can be expressed as \n", "\n", "$$p \\left(x_1, \\ldots, x_K \\right) = p \\left(x_K \\mid x_1, \\ldots, x_{K-1} \\right) \\ldots p\\left(x_2 \\mid x_1 \\right) p\\left(x_1\\right) $$\n", "\n", "This can be represented graphically as a **fully connected** directed graph having $K$ nodes, with each node having **incoming** links from all **lower** numbered nodes. \n", "\n", "Now consider the general case where the graph is **not necessarily** fully connected. Let the a given node $x_k$ have **incoming** links from the set of nodes $pa_k$ known as the *parent nodes* of $x_k$. Then the joint distribution defined by the directed acyclic graph over all nodes is given by the product of a conditional distribution for *each node conditioned on its parent nodes*. A graph with $K$ nodes has the joint distribution defined as\n", "\n", "$$ p\\left(\\mathbf{x}\\right) = \\prod_{k=1}^K p\\left(x_k \\mid pa_k \\right) $$\n", "\n", "

Graphical Notation

\n", "There are several conventions used to convey information in graphical models. They are briefly summarized here - see Bishop 363-365 for more detail.\n", "\n", "+ Represent multiple nodes of the form $t_1, \\ldots, t_N$ as a single node with a box, known as a *plate*, drawn around it and annotated with $N$, the number of repeated nodes.\n", "\n", "+ Deterministic model parameters, e.g. the mean of a distribution, are drawn as solid circles, and random variables as open circles\n", "\n", "+ Observed variables are drawn as shaded open circles\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Conditional Independence

\n", "Consider three random variables $a,b,c$, we use the following notation to indicate that $a$ is *conditionally idependent* of $b$ given $c$\n", "\n", "$$a \\perp\\\\!\\\\!\\\\!\\perp b \\mid c$$\n", "\n", "When this condition holds, using the product rule, we may write the joint distribution of $a$ conditioned on $b$ and $c$ as\n", "\n", "$$p\\left(a, b \\mid c \\right) = p\\left(a\\mid b,c\\right)p\\left(b\\mid c\\right) = p\\left(a\\mid c\\right) p\\left(b \\mid c\\right)$$\n", "\n", "

D-separation

\n", "Given the description of conditional independence above, we now consider a general directed acyclic graph. Let $A$, $B$, and $C$ be arbitrary sets of *nonintersecting* nodes. We wish to determine if the graph structure implies the conditional independence $A \\perp\\\\!\\\\!\\\\!\\perp B \\mid C$. It turns out this conidition is satisfied if **all** possible paths from **any** node in $A$ to **any** node in $B$ are *blocked*, in which case $A$ is said to be *d-separated* from $B$. A path is considered blocked if either of the following hold:\n", "\n", "**(a)** There is a node, $n \\in C$, in the path such that the arrows on the path meet head-to-tail or tail-to-tail at n\n", "\n", "**(b)** There is a node, $n \\notin C$ with none of its descendants in $C$, in the path such that the arrows meet head-to-head at the node. A node $d$ is considered to a descendant of a node $p$ if there is a path from $p$ to $d$ in which each path step is in the direction of the arrows connecting the nodes on th path.\n", "\n", "Note that model parameters (e.g. the mean of a distribution) will always be tail-to-tail and therefore play no role in determining d-separation. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "-------------------------------------------\n", "\n", "

Training a Bayes Net

\n", "MLE for training a Bayes Net, estimating the theta parameters for the nodes, is the equivalent to maximizing the likelihood of the data. So for each node, probability of the node being equal to $s$ be conditioned on say $f$ and $a$, then we have - TODO - generalize this to N conditional quantities\n", "\n", "$$\\theta_{s|ij} = \\frac{\\sum_{k=1}^K \\delta \\left(f_k = i, a_k =j, s_k =1 \\right)}{\\sum_{k=1}^K \\delta \\left(f_k=i, a_k=j \\right)}$$\n", "\n", "where\n", "$\\delta(x) = 1$ \n", "\n", "if $x$ is true, 0 otherwise" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Partially Observed Data

\n", "Can't use the MLE approach above due to lack of data\n", "Let $X$ be all *observed* variables values\n", "Let $Z$ be all *unobserved* variables\n", "Assume we always observe/unobserve the same variables - it is possible to generalize this\n", "\n", "Approach:\n", "\n", "$ argmax_{\\theta} E_{P(Z|X,\\theta)}\\left[log P(x,z|\\theta) \\right]$\n", "\n", "using some probability distribution on $Z$, namely, $P(Z|X,\\theta)$ - do we need a model for this distribution?\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Expectation Maximization (EM) Algorithm

\n", "guaranteed to find local maximum of expected log likelihood above\n", "\n", "$ argmax_{\\theta} E_{P(Z|X,\\theta)}\\left[log P(x,z|\\theta) \\right]$\n", "\n", "see slide at 29:59\n", "\n", "EM - general procedure for learning from partly observed data. Given observed varialbes, $X$, and unobserved variables, $Z$, define\n", "\n", "$Q\\left(\\theta' | \\theta \\right) = E_{E_{P(Z|X,\\theta)}} \\left[\\log P(X,Z | \\theta') \\right]$\n", "\n", "Iterate until convergence:\n", "\n", "* E Step: Use $X$ and current $\\theta$ to calculate $P(Z|X,\\theta)$ - done for every variable in $Z$ for each training example, \n", "\n", "* M Step: Replace current $\\theta$ by \n", "\n", "$\\theta \\leftarrow arg max_{\\theta'} Q $\n", "\n", "using step 1, plug into into last equation on slide 29:59 and pick max theta\n", "\n", "Example: see slide 38:59, 50:59\n", "\n", "More generally: Given observed variables $X$ and unobserved variables $Z$ all of which are boolean:\n", "\n", "* E Step: Calculate for each training example, $k$, the expected value of each unobserved variable in $Z$\n", "\n", "* M Step: Calculate estimates similar to MLE, but replacing each count (i.e. observed data proportions) by its expected count\n", "\n", "$\\delta(Y=1) \\rightarrow E_{Z|X,\\theta}[Y]$\n", "\n", "$\\delta(Y=0) \\rightarrow 1 - \\delta(Y=1)$\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Example: Linear Gaussian Model

\n", "*TODO* A useful example may be a Linear-Gaussian model - see Bishop pages 370-372 - " ] } ], "metadata": {} } ] }