{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Dynamic Programming for NLP\n", "\n", "The strategy of _dynamic programming_ reduces the complexity of a search problem which decomposes into frequently-reused instances of the same problem.\n", "\n", "You encountered dynamic programming for n-gram segmentation in HW4. We have also discussed two more dynamic programming algorithms in lecture: _Viterbi_ and _forward-backward_. For HW5, you will need to implement the Viterbi algorithm. The forward-backward algorithm is an optional extension.\n", "\n", "In this notebook, we sketch an efficient implementation of the Viterbi algorithm. We will review the forward-backward algorithm in section.\n", "\n", "### Viterbi Algorithm\n", "\n", "Recall our definition of a _first-order Markov model_ (a kind of _history-based model_) in lecture: $$f(\\mathbf{x}, c_{1:n}) = \\sum_{i = 1}^n \\hat{\\mathbf{y}}(c_{i - 1})_{c_i}.$$\n", "\n", "We fix some observation sequence $\\mathbf{x}$ of length $n.$ We refer to _classes_ and _states_ interchangeably.\n", "\n", "We refer to $\\hat{\\mathbf{y}}$ as the _edge scores_. This is what varies across applications of the Viterbi algorithm. For example, in a Hidden Markov Model (HMM), the edge scores are the joint probability of transition and emission, given the observation. In a Maximum-Entropy Markov Model (MEMM), the edge scores are a softmax over a linear model on the features.\n", "\n", "The Viterbi algorithm relies on the fact that a maximal path consists of a maximal path to the penultimate state followed by a maximizing transition and emission to the final state. The next timestep reuses each maximal path in the previous timestep. This combination of _optimal substructure_ and _overlapping subproblems_ is why dynamic programming works here.\n", "\n", "Next, we consider some practical aspects of implementing the Viterbi algorithm.\n", "\n", "#### Log Scores\n", "\n", "We often use log scores for two reasons:\n", "1. avoid numerical underflow as scores shrink\n", "2. turn multiplications for joint probabilities into cheaper additions\n", "\n", "#### Boundary States\n", "\n", "You will need to think about how to denote \"start\" and \"end\" states to demarcate sequences of interest. The sequence model has no natural notion of a \"start\" or \"end\" state. (Note that the \"start\" state is distinct from the earliest timestep in the Viterbi trellis, which is sometimes also called a \"start\" state.)\n", "\n", "We can introduce these boundary states while preprocessing or training. (In the dataset for HW5, blank rows delimit sentences.)\n", "\n", "For example, as described in lecture, we often denote the start state for sentences as `` or `BOS` (beginning-of-sentence), and the end state as `` or `EOS` (end-of-sentence). These boundary states let a language model recognize patterns at the start and end of a sentence, which can sometimes be an informative feature in NER. Without these boundary states, count-based initial state probabilities would contain no information about being at the start or end of a sentence.\n", "\n", "#### Precomputing Edge Scores and Efficient Matrix Operations\n", "\n", "The edge scores depend on the observed sequence, and so we must compute them anew every time we run the algorithm. \n", "\n", "We described the \"simple\" algorithm in lecture using elementwise operations and computing all the edge scores at once. We then mentioned in lecture the idea of _precomputing_ the edge scores as needed. 
We can reformulate this precomputation as matrix operations, by analogy with applying `:forward` to a batch.\n", "\n", "Required matrices:\n", "- **Maximal Path Scores.** We denoted the log-scores of the most likely paths as $\\pi \\in \\mathbb{R}^{n \\times |\\mathcal{C}|}.$ Then $\\pi[i, c_i]$ is the log-score of the highest-scoring path that ends in state $c_i$ at timestep $i.$\n", "- **Backpointers.** We denoted the backpointers as $bp \\in \\mathcal{C}^{n \\times |\\mathcal{C}|}.$ These allow us to recover the maximal paths by following pointers from the last timestep to the first.\n", "- **Forwarding.** We can \"forward\" the max scores through each timestep in $O(|\\mathcal{C}|^2)$ space, compared to the naive approach of computing all of the edge scores across all timesteps in advance. (Here, \"forward\" is in the sense of forwarding a batch, and not the forward algorithm of forward-backward, which we review in section.)\n", "\n", "The key idea per timestep is to precompute the edge scores, then compute the maxes and backpointers simultaneously by adding the `:expand`ed maxes from the previous timestep and using `:max`.\n", "\n", "#### Example\n", "\n", "Below is an example of the Viterbi algorithm in Torch: a simple HMM POS tagger for a toy sentence, \"deal talks fail.\" Compare to the neural network POS tagger you developed in HW2 based on [Collobert et al. 2011]." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ " 1\n", " 2\n", " 2\n", " 3\n", "[torch.DoubleTensor of size 4]\n", "\n" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "-- Simple HMM Part-of-Speech (POS) tagger to demonstrate the Viterbi algorithm\n", "--\n", "-- \"deal talks fail\" phrase from John Longley's lecture notes at University of\n", "-- Edinburgh (Informatics 2A: Lecture 16)\n", "\n", "-- Note that we have a START boundary state but not an END one in this example.\n", "-- In the transition matrix, the column is the \"from\" state and the row is the \"to\" state.\n", "initial = torch.Tensor({\n", " -- rows: states (START, N, V)\n", " {1.0},\n", " {0.0},\n", " {0.0},\n", "}):log() -- precompute the log-probabilities once\n", "\n", "transition = torch.Tensor({\n", " {0.0, 0.0, 0.0},\n", " {0.8, 0.4, 0.6},\n", " {0.2, 0.6, 0.4}\n", "}):log()\n", "\n", "emission = torch.Tensor({\n", " -- rows: observations (START, deal, fail, talks); columns: states (START, N, V)\n", " {1.0, 0.0, 0.0},\n", " {0.0, 0.45, 0.4},\n", " {0.0, 0.1, 0.4},\n", " {0.0, 0.45, 0.2}\n", "}):log()\n", "\n", "-- number of classes\n", "C = initial:size(1)\n", "\n", "-- log-scores of transition and emission\n", "-- corresponds to the edge scores (y-hat) in the lecture notes\n", "-- i: timestep for the computed score\n", "function score_hmm(observations, i)\n", " local observation_emission = emission[observations[i]]:view(C, 1):expand(C, C)\n", " -- NOTE: allocates a new Tensor\n", " return observation_emission + transition\n", "end\n", "\n", "-- Viterbi algorithm.\n", "-- observations: a sequence of observations, represented as integers\n", "-- logscore: the edge scoring function over classes and observations in a history-based model\n", "function viterbi(observations, logscore)\n", " local n = observations:size(1)\n", " local max_table = torch.Tensor(n, C)\n", " local backpointer_table = torch.Tensor(n, C)\n", "\n", " -- first timestep\n", " -- the score of each length-one path is its initial state log-probability plus the first emission log-probability\n", " -- NOTE: another unnecessary Tensor allocation here\n", " local maxes, backpointers = (initial + emission[observations[1]]):max(2)\n", " max_table[1] 
= maxes\n", "\n", " -- remaining timesteps (\"forwarding\" the maxes)\n", " for i=2,n do\n", " -- precompute edge scores\n", " local y = logscore(observations, i)\n", " local scores = y + maxes:view(1, C):expand(C, C)\n", "\n", " -- compute new maxes (NOTE: another unnecessary Tensor allocation here)\n", " maxes, backpointers = scores:max(2)\n", "\n", " -- record\n", " max_table[i] = maxes\n", " backpointer_table[i] = backpointers\n", " end\n", "\n", " -- follow backpointers to recover max path\n", " local classes = torch.Tensor(n)\n", " maxes, classes[n] = maxes:max(1)\n", " for i=n,2,-1 do\n", " classes[i-1] = backpointer_table[{i, classes[i]}]\n", " end\n", "\n", " return classes\n", "end\n", "\n", "-- \"START deal talks fail\" = 1 2 4 3\n", "-- => START N N V = 1 2 2 3\n", "-- the tagger resolves the ambiguity: \"fail\" is tagged V\n", "-- because N is unlikely to both follow N and emit \"fail\"\n", "print(viterbi(torch.Tensor({1, 2, 4, 3}), score_hmm))" ] } ], "metadata": { "kernelspec": { "display_name": "iTorch", "language": "lua", "name": "itorch" }, "language_info": { "name": "lua", "version": "5.1" } }, "nbformat": 4, "nbformat_minor": 0 }