{ "metadata": { "language": "Julia", "name": "", "signature": "sha256:7df367b9e847424b7214761851b51463377f6e8a0cc629dbc920f44ecd38e474" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Model Learning\n", "==============\n", "\n", "This documentation covers the process of model learning." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Model learning is the process of learning the Bayesian network from data. We show in the paper that we can restrict ourselves to the inverted tree structure shown above, where $f_{1:k}$ are the observed features which can be extracted from a given scene, and $a^\\text{fut}$ and $\\dot{\\psi}^\\text{fut}$ are the target variables, the constant acceleration and turnrate over the next time step, respectively. (we *actually* use $\\phi_\\text{des}$ and use that to compute the turnrate, see the paper.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Likelihood Maximization\n", "\n", "Bayesian structure learning seeks to find the structure $G$ that maximizes $P(G\\mid D)$, where $D$ is the discrete dataset.\n", "\n", "$$\n", "G \\leftarrow {arg\\,max}_G P(G\\mid D)\n", "$$\n", "\n", "The notation used in this work is consistent with that used in *Decision Making Under Uncertainty: Principles and Techniques*, written by Dr. Mykel Kochenderfer. Given $n$ discrete variables $X_{1:n}$, where $r_i$ is the number of instantiations of $X_i$ (number of bins), and $q_i$ is the number of instantiations of the parents of $X_i$ ($\\prod_{j \\in Pa(X_i)} r_j$). The $j$th instantiation of the parents of $X_i$ is denoted $\\pi_{ij}$.\n", "\n", "The number of sufficient statistics in a Bayesian network, once the structure is set, is $\\sum_{i=1}^{n} r_i q_i$. We write $\\theta_{ijk}$ to signify the probability $P(X_i = k\\mid \\pi_{ij})$. Note that only $\\sum_{i=1}^{n} (r_i-1)q_i$ are independent.\n", "\n", "It has been shown that the posterior log likelihood assuming a uniform Dirichlet prior over model parameters, $Dir(\\alpha_1,\\ldots)$ is given by the *Bayesian score*. The structure that maximizes the Bayesian score is the same as the structure maximizing the likelihood.\n", "\n", "$$\n", "\\log P(G\\mid D) = \\log P(G) + \\sum^{n}_{i=1} \\sum^{q_i}_{j=1} \\log \\left(\\frac{ \\Gamma(\\alpha_{ij0}) }{ \\Gamma(\\alpha_{ij0} + m_{ij0}) }\\right) + \\sum^{r_i}_{k=1} \\log \\left( \\frac{ \\Gamma(\\alpha_{ijk} + m_{ijk}) }{ \\Gamma(\\alpha_{ijk}) } \\right)\n", "$$\n", "\n", "where \n", "\n", "$$\n", "\\alpha_{ij0} = \\sum_{k=1}^{r_i} \\alpha_{ijk},\n", "$$\n", "\n", "$$\n", "m_{ij0} = \\sum_{k=1}^{r_i} m_{ijk}.\n", "$$\n", "\n", "The Bayesian score factorizes over variables, so at first glance Bayesian structure learning is simply the act of finding the optimal set of parents for each variable. What makes Bayesian structure learning NP-complete is that the structure must be acyclic. 
"\n", "However, if we restrict the structure search to the inverted tree shown above, acyclicity is always maintained (we only need to ensure that the two target variables are not simultaneously parents of one another).\n", "\n", "The candidate edges are thus:\n", "- edges from observed variables to the target variables\n", "- edges between the target variables\n", "\n", "Edges *not* considered:\n", "- edges between observed variables\n", "- edges from the target variables to the observed variables\n", "\n", "Once a structure is obtained, the parameters are computed efficiently via maximum likelihood from the sufficient statistics (the counts $m_{ijk}$)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Learning\n", "\n", "Model learning was conducted using three heuristic methods.\n", "\n", "| Method | Actions |\n", "| ------------- |:-------------|\n", "| Forward Search | add the next-best edge to maximize the Bayesian score |\n", "| Graph Search | add, remove, or reverse an edge to maximize the Bayesian score |\n", "| Graph Search w/ Random Edge Initialization | graph search starting with two random parents for each target variable |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The methods are guaranteed to converge to a local maximum of the Bayesian score. The restricted search space makes structure search quick and efficient. As discussed in the paper, graph search with random edge initialization tends to achieve the highest Bayesian score but requires more computation than the first two methods. Since learning takes on the order of minutes on a desktop computer, this is not an issue.\n", "\n", "The code for model learning is in *model_learning.jl*. It uses several packages, including `Smile` and `SmileExtra`, which provide functionality for Bayesian networks and for computing the sufficient statistics. Credit goes to Edward Schmerling for his efficient implementation of computing the sufficient statistics from a dataset and obtaining the Bayesian score. Note that `Smile` was *not* used for structure learning, in order to take advantage of the problem-specific structure." ] } ], "metadata": {} } ] }