{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Populating the interactive namespace from numpy and matplotlib\n" ] } ], "source": [ "%pylab inline" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "lines_to_next_cell": 2, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "import torch\n", "from torch import nn\n", "from torchmore import layers, flex" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# MOTIVATION" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Sequence to Sequence Models\n", "\n", "- recognition: sequence of feature vectors to outputs (e.g., OCR, handwriting, speech)\n", "- improving / correcting OCR output via language models (discarding unlikely hypotheses)\n", "- reading order determination (which text line follows which other text line)\n", "- ground truth alignment and semi-supervised training\n", "- semi-supervised training\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# CLASSICAL THEORY\n", "\n", "# Languages\n", "\n", "- _languages_ are sets of strings $L\\subseteq\\Sigma^*$ over some alphabet $\\Sigma$\n", "- alphabets are commonly either _characters_ or _words_\n", "- language come in different classes (Chomsky hierarchy):\n", " - regular languages = finite automaton\n", " - context free languages = pushdown automaton\n", " - context dependent languages = linear bounded automaton\n", " - recursively enumerable = Turing machine" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Statistical Language Models\n", "\n", "- statistical language models assign probabilities to strings\n", "- $P(\\hbox{Dog bites man.}) > P(\\hbox{Man bites dog.}) > P(\\hbox{Bites man dog.})$\n", "- statistical language models are used to correct ambiguous OCR / speech results\n", "- statistical language models include prior information about language\n", "\n", "OCR: \"The ca? drives down the road.\" ? = r or n\n", "\n", "Correct: \"The car drives down the road.\"" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# \"Classical\" Statistical Language Models\n", "\n", "- $n$-gram models with smoothing\n", "- either in terms of characters or words\n", "- statistical version of regular languages\n", "- do not capture context, semantics, or other phenomena\n", "\n", "$P(\\hbox{next character} | \\hbox{previous characters})$\n", "\n", "E.g.,\n", "\n", "$$P(c | \\hbox{\"neur\"}) = ...$$\n", "\n", "High for $c \\in \\{\"a\", \"o\"\\}$, low for $c \\in \\{\"t\", \"x\"\\}$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# $n$-Grams\n", "\n", "For $n$-gram models, we simply tabulate $P(s_n | s_{s-1} ... s_{s-n})$ for all valid strings $s$." 
] },
{ "cell_type": "code", "execution_count": 9, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "282695\n", "['a', 'amd', 'aol', 'aws', 'aachen', 'aalborg', 'aalesund', 'aaliyah', 'aalst', 'aalto']\n" ] } ], "source": [ "from collections import Counter\n", "import re\n", "# use the system dictionary as a simple training corpus\n", "training_set = [s.strip().lower() for s in open(\"/usr/share/dict/words\").readlines()]\n", "# keep only words consisting entirely of the letters a-z\n", "training_set = [s for s in training_set if not re.search(r\"[^a-z]\", s)]\n", "print(len(training_set))\n", "print(training_set[:10])" ] },
{ "cell_type": "code", "execution_count": 14, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "n = 3\n", "# contexts counts (n-1)-character histories; ngrams counts full n-grams\n", "contexts, ngrams = Counter(), Counter()\n", "for s in training_set:\n", "    s = \"_\"*(n-1) + s  # pad so the first characters have a context\n", "    for i in range(n, len(s)+1):  # include the final character of each word\n", "        contexts[s[i-n:i-1]] += 1\n", "        ngrams[s[i-n:i]] += 1" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Now we can estimate probabilities easily:" ] },
{ "cell_type": "code", "execution_count": 19, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "[(0.2603550295857988, 'aal'),\n", " (0.1893491124260355, 'aar'),\n", " (0.08875739644970414, 'aan'),\n", " (0.08284023668639054, 'aat'),\n", " (0.08284023668639054, 'aas'),\n", " (0.07100591715976332, 'aam'),\n", " (0.03550295857988166, 'aai'),\n", " (0.029585798816568046, 'aag'),\n", " (0.029585798816568046, 'aab'),\n", " (0.023668639053254437, 'aaf')]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def prob(s):\n", "    # maximum likelihood estimate of P(last character | preceding characters)\n", "    return ngrams[s] / contexts[s[:-1]]\n", "\n", "probabilities = sorted([(prob(\"aa\"+chr(i)), \"aa\"+chr(i)) for i in range(ord(\"a\"), ord(\"a\")+26)], reverse=True)\n", "probabilities[:10]" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# OCR / SPEECH RECOGNITION" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Classical OCR / Speech Recognition\n", "\n", "- segment the input image into a sequence of character images $x_i$\n", "- classify each character with a neural network giving posterior probabilities $P(s_i | x_i)$\n", "- combine the posterior probabilities with a language model (e.g., $n$-gram)\n", "\n", "Under suitable independence assumptions, the optimal solution is given by:\n", "\n", "$$D(x) = \\arg\\max_s P(s) \\prod P(s_i | x_i)$$\n", "\n", "$P(s)$: language model, $P(s_i | x_i)$: \"acoustic\" model" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# First Neural Network Models for OCR / Speech\n", "\n", "Basis:\n", "\n", "- acoustic model: $P(s_i | x_i)$\n", " - single character neural network model or scanning neural network model (e.g., LeCun 1995)\n", "- language model: $P(s)$\n", " - $n$-gram language model: $\\prod P(s_i | s_{i-1}...s_{i-n})$" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# First Neural Network Models for OCR / Speech\n", "\n", "Complications:\n", "\n", "- multiple alternative segmentations of the input\n", " - requires maximizing over segmentations $S$\n", " - $D(x) = \\arg\\max_{s,S} P(S) P(s)\\prod P(s_i|x_i, S)$\n", " - requires a complicated dynamic programming algorithm to solve\n", "- language model smoothing (see the sketch below)" ] },
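{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Language Model Smoothing: A Sketch\n", "\n", "To make the smoothing point concrete, here is a minimal sketch of additive (add-$\\alpha$) smoothing for the trigram model built earlier. It reuses the `ngrams` and `contexts` counters from above; the constant `alpha` and the alphabet size of 27 (the letters a-z plus the `_` pad symbol) are assumptions of this sketch, not part of the original model." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "# additive smoothing: every possible ngram receives a pseudo-count of alpha,\n", "# so unseen ngrams get a small but nonzero probability\n", "alpha = 0.1  # assumed smoothing constant\n", "nsymbols = 27  # assumed alphabet: a-z plus the _ pad symbol\n", "\n", "def smoothed_prob(s):\n", "    return (ngrams[s] + alpha) / (contexts[s[:-1]] + alpha * nsymbols)\n", "\n", "# the trigram aax never occurs in the dictionary, yet gets nonzero probability\n", "print(prob(\"aal\"), smoothed_prob(\"aal\"))\n", "print(smoothed_prob(\"aax\"))" ] },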
"source": [ "# HMM Models for OCR / Speech\n", "\n", "- treat OCR/Speech as a sequence to sequence model\n", "- usually continuous for better performance, but think of it with VQ:\n", " - speech: compute windowed FT and perform $k$-means clustering\n", " - OCR: take vertical slices through the input image and perform $k$-means clustering\n", "- now have sequence to sequence problem:\n", " - input: sequence of cluster center identifiers $1...k$\n", " - output: sequence of characters / words" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Recurrent Neural Networks\n", "\n", "- feature vector sequence to output string is a sequence-to-sequence transformation\n", "- we can learn this directly with recurrent neural networks\n", "- example models:\n", " - TDNN (time delayed neural networks)\n", " - simple recurrent neural networks\n", " - LSTM and GRU networks" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Sequence to Sequence Models\n", "\n", "- TDNN, recurrent, and LSTM networks are limited\n", " - similar to HMM models / regular languages\n", " - can take into account contextual information for probabilities\n", " - provide black-box input-to-output mappings\n", "- machine translation and similar tasks require...\n", " - movements and inversions of strings\n", " - recognition lattices that can be processed further\n", " - sequence-to-sequence models fill this gap" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Recurrent Models, LSTM+CTC\n", "\n", "- already covered in previous slides\n", "- note that the \"CTC\" portion of LSTM training is the same algorithm as the forward-backward algorithm used in HMM training" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Simple Convolutional Models\n", "\n", "- we can replace the probability estimates $P(s_i | s_{i-1}...s_{i-n})$ with neural network estimates\n", "- these are simple convolutional models\n", " - replace the character/word sequence with a sequence of one-hot encoded character/word vectors\n", " - now $P(s_i | s_{i-1}...s_{i-n}) \\approx$ model mapping $n-1 \\times r$ input to $r$ probabilities\n", " - this can be either a fully connected network or even a convolutional network\n", "- like $n$-grams: finite history, smoothed estimates" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Grave's Neural Transducer\n", "\n", "Reasoning:\n", "\n", "- LSTM+CTC only contains \"acoustic model\", no language model\n", "- for the language model, we need current output conditioned on actual previous outputs\n", "- encoder: bidirectional LSTM operating over input\n", "- prediction network: forward LSTM predicting $s_i$ given $s_{i-1}...s_0$\n", "- mirrors the dynamic programming / beam search used with HMMs + language models" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Graves's Neural Transducer\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Graves's Neural Transducer\n", "\n", "\n", "\n", "Figure: Prabhavalkar, Rohit, et al. \"A Comparison of Sequence-to-Sequence Models for Speech Recognition.\" Interspeech. 2017." 
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Bahdanau Attention\n", "\n", "\n", "\n", "Chorowski, Jan, et al. \"End-to-end continuous speech recognition using attention-based recurrent NN: First results.\" arXiv preprint arXiv:1412.1602 (2014).\n", "\n", "Figure: Prabhavalkar, Rohit, et al. \"A Comparison of Sequence-to-Sequence Models for Speech Recognition.\" Interspeech. 2017." ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Sequence to Sequence\n", "\n", "\n", "\n", "Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. \"Sequence to sequence learning with neural networks.\" Advances in neural information processing systems. 2014." ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Note on Seq2Seq\n", "\n", "- the distinction between acoustic and linguistic analysis is likely spurious\n", "- the real issue is that LSTM etc. only output posteriors at time $t$\n", "- i.e., the output at time $t$ is an average over competing hypotheses\n", "- but that average already contains linguistic information\n", "- beam search disentangles those averages; it does not add linguistic modeling" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# When are Seq2Seq Models Useful?\n", "\n", "- when there are many possible segmentation alternatives\n", "- when you need to add separately constructed language models\n", "- when there are substantial reorderings of the input (this also requires attention or global state vectors)" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# SEQ2SEQ FOR DIGIT RECOGNITION" ] },
{ "cell_type": "code", "execution_count": 23, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "from torch import nn\n", "class Seq2Seq(nn.Module):\n", "    def __init__(self, ninput, noutput, nhidden, pre=None, nlayers=1):\n", "        super().__init__()\n", "        # encoder consumes the input feature sequence\n", "        self.encoder = nn.LSTM(ninput, nhidden, num_layers=nlayers)\n", "        # decoder is fed one-hot encoded output symbols\n", "        self.decoder = nn.LSTM(noutput, nhidden, num_layers=nlayers)\n", "        # output map from hidden state to symbol log-probabilities\n", "        self.out = nn.Linear(nhidden, noutput)\n", "        self.logsoftmax = nn.LogSoftmax(dim=0)" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "For a seq2seq model, we need an encoder, a decoder, and an output map; a runnable usage sketch follows after the output slide." ] },
{ "cell_type": "code", "execution_count": 24, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "def forward(self, inputs, targets, forcing=0.5):\n", "    _, state = self.encoder(inputs)\n", "    decoder_input = hotone(0)  # start symbol; hotone = one-hot helper\n", "    outputs = []\n", "    for i in range(len(targets)):\n", "        decoder_output, state = self.decoder(decoder_input.view(1, 1, -1), state)\n", "        logprobs = self.logsoftmax(self.out(decoder_output)[0, 0])\n", "        outputs.append(logprobs)\n", "        _, pred = logprobs.topk(1)\n", "        # teacher forcing: feed the ground truth with probability forcing\n", "        decoder_input = targets[i] if torch.rand(1).item() < forcing else hotone(pred)\n", "    return torch.stack(outputs)" ] },
{ "cell_type": "code", "execution_count": 25, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "def greedy_predict(self, inputs):\n", "    _, state = self.encoder(inputs)\n", "    decoder_input = hotone(0)  # start symbol\n", "    result = []\n", "    for i in range(MAXLEN):\n", "        decoder_output, state = self.decoder(decoder_input.view(1, 1, -1), state)\n", "        _, pred = self.logsoftmax(self.out(decoder_output)[0, 0]).topk(1)\n", "        decoder_input = hotone(pred)\n", "        result.append(int(pred))\n", "        if int(pred) == 0: break  # stop symbol\n", "    return result" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Sequence-to-Sequence Output" ] },
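{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Usage Sketch\n", "\n", "The following wires the pieces above together on random data, as promised earlier. Everything here is assumed for illustration: `hotone` and `MAXLEN` are left undefined on the slides, so minimal versions are supplied, and the dimensions and the fake \"digit\" transcript are arbitrary." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "import torch\n", "\n", "MAXLEN = 20   # assumed cap on decoded sequence length\n", "NOUTPUT = 11  # assumed: stop symbol 0 plus ten digit classes\n", "\n", "def hotone(i):\n", "    # minimal one-hot helper; the slides assume this exists\n", "    v = torch.zeros(NOUTPUT)\n", "    v[int(i)] = 1.0\n", "    return v\n", "\n", "# attach the slide-level functions as methods of the class above\n", "Seq2Seq.forward = forward\n", "Seq2Seq.greedy_predict = greedy_predict\n", "\n", "model = Seq2Seq(ninput=48, noutput=NOUTPUT, nhidden=100)\n", "inputs = torch.randn(73, 1, 48)  # fake feature sequence (T, B, D)\n", "targets = [hotone(c) for c in [3, 7, 1, 0]]  # fake transcript 371 + stop\n", "print(model(inputs, targets, forcing=1.0).shape)  # torch.Size([4, 11])\n", "with torch.no_grad():\n", "    print(model.greedy_predict(inputs))  # untrained, so arbitrary symbols" ] },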
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# 2D Scanning\n", "\n", "\n", "\n", " Bluche, Théodore, Jérôome Louradour, and Ronaldo Messina. \"Scan, attend and read: End-to-end handwritten paragraph recognition with mdlstm attention.\" 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR). Vol. 1. IEEE, 2017." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# 2D Scanning\n", "\n", "\n", "\n", "Wigington, Curtis, et al. \"Start, follow, read: End-to-end full-page handwriting recognition.\" Proceedings of the European Conference on Computer Vision (ECCV). 2018." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# 2D Scanning Approaches\n", "\n", "- straightforward idea -- analogous to human reading\n", "- a kind of seq2seq with attention\n", "- use MDLSTM or VGG to predict start point\n", "- then use MDLSTM or VGG to predict next fixation point\n", "- can follow curved lines, deal with layout issues directly" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# 2D Scanning Approaches\n", "\n", "Questions:\n", "\n", "- competitive computational costs?\n", "- competitive performance?\n", "- hard to train?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Summary: Sequence to Sequence\n", "\n", "- in addition to LSTM+CTC, there are more complex sequence-to-sequence architectures\n", "- such architectures explicitly model dependencies within the output\n", "- they also allow inputs/outputs of very different lengths and rearrangements of sequence elements" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Summary: Sequence to Sequence\n", "\n", "\n", "Applications:\n", "\n", "- potential seq2seq OCR applications (some found in the literature)\n", "- general purpose language modeling and OCR correction" ] } ], "metadata": { "celltoolbar": "Slideshow", "jupytext": { "cell_metadata_filter": "-all", "formats": "ipynb", "main_language": "python" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }