\n",
"Probabilistic graph models provide an intuitive way to express joint probability relationships. The nodes in a graph represent random variables and the edges between nodes represent a probabilistic relationship between variables. \n",
"\n",
"
Bayesian Networks
\n",
"Here we consider **acyclic directed** graphical models that express the joint probability relationship between some set of variables, known as *Bayesian Networks* because of the prominant use of Bayes' Theorem. To begin, consider the joint probability distribution $p\\left( x_1, \\ldots, x_K \\right)$ over $K$ variables. Using repeated application of the product rule, this can be expressed as \n",
"\n",
"$$p \\left(x_1, \\ldots, x_K \\right) = p \\left(x_K \\mid x_1, \\ldots, x_{K-1} \\right) \\ldots p\\left(x_2 \\mid x_1 \\right) p\\left(x_1\\right) $$\n",
"\n",
"This can be represented graphically as a **fully connected** directed graph having $K$ nodes, with each node having **incoming** links from all **lower** numbered nodes. \n",
"\n",
"Now consider the general case where the graph is **not necessarily** fully connected. Let the a given node $x_k$ have **incoming** links from the set of nodes $pa_k$ known as the *parent nodes* of $x_k$. Then the joint distribution defined by the directed acyclic graph over all nodes is given by the product of a conditional distribution for *each node conditioned on its parent nodes*. A graph with $K$ nodes has the joint distribution defined as\n",
"\n",
"$$ p\\left(\\mathbf{x}\\right) = \\prod_{k=1}^K p\\left(x_k \\mid pa_k \\right) $$\n",
"\n",
"
Graphical Notation
\n",
"There are several conventions used to convey information in graphical models. They are briefly summarized here - see Bishop 363-365 for more detail.\n",
"\n",
"+ Represent multiple nodes of the form $t_1, \\ldots, t_N$ as a single node with a box, known as a *plate*, drawn around it and annotated with $N$, the number of repeated nodes.\n",
"\n",
"+ Deterministic model parameters, e.g. the mean of a distribution, are drawn as solid circles, and random variables as open circles\n",
"\n",
"+ Observed variables are drawn as shaded open circles\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
Conditional Independence
\n",
"Consider three random variables $a,b,c$, we use the following notation to indicate that $a$ is *conditionally idependent* of $b$ given $c$\n",
"\n",
"$$a \\perp\\\\!\\\\!\\\\!\\perp b \\mid c$$\n",
"\n",
"When this condition holds, using the product rule, we may write the joint distribution of $a$ conditioned on $b$ and $c$ as\n",
"\n",
"$$p\\left(a, b \\mid c \\right) = p\\left(a\\mid b,c\\right)p\\left(b\\mid c\\right) = p\\left(a\\mid c\\right) p\\left(b \\mid c\\right)$$\n",
"\n",
"
D-separation
\n",
"Given the description of conditional independence above, we now consider a general directed acyclic graph. Let $A$, $B$, and $C$ be arbitrary sets of *nonintersecting* nodes. We wish to determine if the graph structure implies the conditional independence $A \\perp\\\\!\\\\!\\\\!\\perp B \\mid C$. It turns out this conidition is satisfied if **all** possible paths from **any** node in $A$ to **any** node in $B$ are *blocked*, in which case $A$ is said to be *d-separated* from $B$. A path is considered blocked if either of the following hold:\n",
"\n",
"**(a)** There is a node, $n \\in C$, in the path such that the arrows on the path meet head-to-tail or tail-to-tail at n\n",
"\n",
"**(b)** There is a node, $n \\notin C$ with none of its descendants in $C$, in the path such that the arrows meet head-to-head at the node. A node $d$ is considered to a descendant of a node $p$ if there is a path from $p$ to $d$ in which each path step is in the direction of the arrows connecting the nodes on th path.\n",
"\n",
"Note that model parameters (e.g. the mean of a distribution) will always be tail-to-tail and therefore play no role in determining d-separation. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"-------------------------------------------\n",
"\n",
"
Training a Bayes Net
\n",
"MLE for training a Bayes Net, estimating the theta parameters for the nodes, is the equivalent to maximizing the likelihood of the data. So for each node, probability of the node being equal to $s$ be conditioned on say $f$ and $a$, then we have - TODO - generalize this to N conditional quantities\n",
"\n",
"$$\\theta_{s|ij} = \\frac{\\sum_{k=1}^K \\delta \\left(f_k = i, a_k =j, s_k =1 \\right)}{\\sum_{k=1}^K \\delta \\left(f_k=i, a_k=j \\right)}$$\n",
"\n",
"where\n",
"$\\delta(x) = 1$ \n",
"\n",
"if $x$ is true, 0 otherwise"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
Partially Observed Data
\n",
"Can't use the MLE approach above due to lack of data\n",
"Let $X$ be all *observed* variables values\n",
"Let $Z$ be all *unobserved* variables\n",
"Assume we always observe/unobserve the same variables - it is possible to generalize this\n",
"\n",
"Approach:\n",
"\n",
"$ argmax_{\\theta} E_{P(Z|X,\\theta)}\\left[log P(x,z|\\theta) \\right]$\n",
"\n",
"using some probability distribution on $Z$, namely, $P(Z|X,\\theta)$ - do we need a model for this distribution?\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
Expectation Maximization (EM) Algorithim
\n",
"guaranteed to find local maximum of expected log likelihood above\n",
"\n",
"$ argmax_{\\theta} E_{P(Z|X,\\theta)}\\left[log P(x,z|\\theta) \\right]$\n",
"\n",
"see slide at 29:59\n",
"\n",
"EM - general procedure for learning from partly observed data. Given observed varialbes, $X$, and unobserved variables, $Z$, define\n",
"\n",
"$Q\\left(\\theta' | \\theta \\right) = E_{E_{P(Z|X,\\theta)}} \\left[\\log P(X,Z | \\theta') \\right]$\n",
"\n",
"Iterate until convergence:\n",
"\n",
"* E Step: Use $X$ and current $\\theta$ to calculate $P(Z|X,\\theta)$ - done for every variable in $Z$ for each training example, \n",
"\n",
"* M Step: Replace current $\\theta$ by \n",
"\n",
"$\\theta \\leftarrow arg max_{\\theta'} Q $\n",
"\n",
"using step 1, plug into into last equation on slide 29:59 and pick max theta\n",
"\n",
"Example: see slide 38:59, 50:59\n",
"\n",
"More generally: Given observed variables $X$ and unobserved variables $Z$ all of which are boolean:\n",
"\n",
"* E Step: Calculate for each training example, $k$, the expected value of each unobserved variable in $Z$\n",
"\n",
"* M Step: Calculate estimates similar to MLE, but replacing each count (i.e. observed data proportions) by its expected count\n",
"\n",
"$\\delta(Y=1) \\rightarrow E_{Z|X,\\theta}[Y]$\n",
"\n",
"$\\delta(Y=0) \\rightarrow 1 - \\delta(Y=1)$\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
Example: Linear Gaussian Model
\n",
"*TODO* A useful example may be a Linear-Gaussian model - see Bishop pages 370-372 - "
]
}
],
"metadata": {}
}
]
}