{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Populating the interactive namespace from numpy and matplotlib\n" ] } ], "source": [ "%pylab inline" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "lines_to_next_cell": 2, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "import torch\n", "from torch import nn\n", "from torchmore import layers, flex" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# MOTIVATION" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Sequence to Sequence Models\n", "\n", "- recognition: sequence of feature vectors to outputs (e.g., OCR, handwriting, speech)\n", "- improving / correcting OCR output via language models (discarding unlikely hypotheses)\n", "- reading order determination (which text line follows which other text line)\n", "- ground truth alignment and semi-supervised training\n", "- semi-supervised training\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# CLASSICAL THEORY\n", "\n", "# Languages\n", "\n", "- _languages_ are sets of strings $L\\subseteq\\Sigma^*$ over some alphabet $\\Sigma$\n", "- alphabets are commonly either _characters_ or _words_\n", "- language come in different classes (Chomsky hierarchy):\n", " - regular languages = finite automaton\n", " - context free languages = pushdown automaton\n", " - context dependent languages = linear bounded automaton\n", " - recursively enumerable = Turing machine" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Statistical Language Models\n", "\n", "- statistical language models assign probabilities to strings\n", "- $P(\\hbox{Dog bites man.}) > P(\\hbox{Man bites dog.}) > P(\\hbox{Bites man dog.})$\n", "- statistical language models are used to correct ambiguous OCR / speech results\n", "- statistical language models include prior information about language\n", "\n", "OCR: \"The ca? drives down the road.\" ? = r or n\n", "\n", "Correct: \"The car drives down the road.\"" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# \"Classical\" Statistical Language Models\n", "\n", "- $n$-gram models with smoothing\n", "- either in terms of characters or words\n", "- statistical version of regular languages\n", "- do not capture context, semantics, or other phenomena\n", "\n", "$P(\\hbox{next character} | \\hbox{previous characters})$\n", "\n", "E.g.,\n", "\n", "$$P(c | \\hbox{\"neur\"}) = ...$$\n", "\n", "High for $c \\in \\{\"a\", \"o\"\\}$, low for $c \\in \\{\"t\", \"x\"\\}$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# $n$-Grams\n", "\n", "For $n$-gram models, we simply tabulate $P(s_n | s_{s-1} ... s_{s-n})$ for all valid strings $s$." 
] },
{ "cell_type": "code", "execution_count": 9, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "282695\n", "['a', 'amd', 'aol', 'aws', 'aachen', 'aalborg', 'aalesund', 'aaliyah', 'aalst', 'aalto']\n" ] } ], "source": [ "from collections import Counter\n", "import re\n", "# use the system dictionary as a simple training corpus\n", "training_set = [s.strip().lower() for s in open(\"/usr/share/dict/words\").readlines()]\n", "# keep only words consisting entirely of the letters a-z\n", "training_set = [s for s in training_set if not re.search(r\"[^a-z]\", s)]\n", "print(len(training_set))\n", "print(training_set[:10])" ] },
{ "cell_type": "code", "execution_count": 14, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "n = 3\n", "# contexts counts (n-1)-character histories; ngrams counts full n-grams\n", "contexts, ngrams = Counter(), Counter()\n", "for s in training_set:\n", "    s = \"_\"*(n-1) + s  # pad so the first characters have a context\n", "    for i in range(n, len(s)+1):  # include the final character of each word\n", "        contexts[s[i-n:i-1]] += 1\n", "        ngrams[s[i-n:i]] += 1" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Now we can estimate probabilities easily:" ] },
{ "cell_type": "code", "execution_count": 19, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "[(0.2603550295857988, 'aal'),\n", " (0.1893491124260355, 'aar'),\n", " (0.08875739644970414, 'aan'),\n", " (0.08284023668639054, 'aat'),\n", " (0.08284023668639054, 'aas'),\n", " (0.07100591715976332, 'aam'),\n", " (0.03550295857988166, 'aai'),\n", " (0.029585798816568046, 'aag'),\n", " (0.029585798816568046, 'aab'),\n", " (0.023668639053254437, 'aaf')]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def prob(s):\n", "    # maximum likelihood estimate of P(last character | preceding characters)\n", "    return ngrams[s] / contexts[s[:-1]]\n", "\n", "probabilities = sorted([(prob(\"aa\"+chr(i)), \"aa\"+chr(i)) for i in range(ord(\"a\"), ord(\"a\")+26)], reverse=True)\n", "probabilities[:10]" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# OCR / SPEECH RECOGNITION" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Classical OCR / Speech Recognition\n", "\n", "- segment the input image into a sequence of character images $x_i$\n", "- classify each character with a neural network giving posterior probabilities $P(s_i | x_i)$\n", "- combine the posterior probabilities with a language model (e.g., $n$-gram)\n", "\n", "Under suitable independence assumptions, the optimal solution is given by:\n", "\n", "$$D(x) = \\arg\\max_s P(s) \\prod P(s_i | x_i)$$\n", "\n", "$P(s)$: language model, $P(s_i | x_i)$: \"acoustic\" model" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# First Neural Network Models for OCR / Speech\n", "\n", "Basis:\n", "\n", "- acoustic model: $P(s_i | x_i)$\n", " - single character neural network model or scanning neural network model (e.g., LeCun 1995)\n", "- language model: $P(s)$\n", " - $n$-gram language model: $\\prod P(s_i | s_{i-1}...s_{i-n})$" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# First Neural Network Models for OCR / Speech\n", "\n", "Complications:\n", "\n", "- multiple alternative segmentations of the input\n", " - requires maximizing over segmentations $S$\n", " - $D(x) = \\arg\\max_{s,S} P(S) P(s)\\prod P(s_i|x_i, S)$\n", " - requires a complicated dynamic programming algorithm to solve\n", "- language model smoothing (see the sketch below)" ] },
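{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Language Model Smoothing: A Sketch\n", "\n", "To make the smoothing point concrete, here is a minimal sketch of additive (add-$\\alpha$) smoothing for the trigram model built earlier. It reuses the `ngrams` and `contexts` counters from above; the constant `alpha` and the alphabet size of 27 (the letters a-z plus the `_` pad symbol) are assumptions of this sketch, not part of the original model." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "# additive smoothing: every possible ngram receives a pseudo-count of alpha,\n", "# so unseen ngrams get a small but nonzero probability\n", "alpha = 0.1  # assumed smoothing constant\n", "nsymbols = 27  # assumed alphabet: a-z plus the _ pad symbol\n", "\n", "def smoothed_prob(s):\n", "    return (ngrams[s] + alpha) / (contexts[s[:-1]] + alpha * nsymbols)\n", "\n", "# the trigram aax never occurs in the dictionary, yet gets nonzero probability\n", "print(prob(\"aal\"), smoothed_prob(\"aal\"))\n", "print(smoothed_prob(\"aax\"))" ] },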
"source": [ "# HMM Models for OCR / Speech\n", "\n", "- treat OCR/Speech as a sequence to sequence model\n", "- usually continuous for better performance, but think of it with VQ:\n", " - speech: compute windowed FT and perform $k$-means clustering\n", " - OCR: take vertical slices through the input image and perform $k$-means clustering\n", "- now have sequence to sequence problem:\n", " - input: sequence of cluster center identifiers $1...k$\n", " - output: sequence of characters / words" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Recurrent Neural Networks\n", "\n", "- feature vector sequence to output string is a sequence-to-sequence transformation\n", "- we can learn this directly with recurrent neural networks\n", "- example models:\n", " - TDNN (time delayed neural networks)\n", " - simple recurrent neural networks\n", " - LSTM and GRU networks" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Sequence to Sequence Models\n", "\n", "- TDNN, recurrent, and LSTM networks are limited\n", " - similar to HMM models / regular languages\n", " - can take into account contextual information for probabilities\n", " - provide black-box input-to-output mappings\n", "- machine translation and similar tasks require...\n", " - movements and inversions of strings\n", " - recognition lattices that can be processed further\n", " - sequence-to-sequence models fill this gap" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Recurrent Models, LSTM+CTC\n", "\n", "- already covered in previous slides\n", "- note that the \"CTC\" portion of LSTM training is the same algorithm as the forward-backward algorithm used in HMM training" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Simple Convolutional Models\n", "\n", "- we can replace the probability estimates $P(s_i | s_{i-1}...s_{i-n})$ with neural network estimates\n", "- these are simple convolutional models\n", " - replace the character/word sequence with a sequence of one-hot encoded character/word vectors\n", " - now $P(s_i | s_{i-1}...s_{i-n}) \\approx$ model mapping $n-1 \\times r$ input to $r$ probabilities\n", " - this can be either a fully connected network or even a convolutional network\n", "- like $n$-grams: finite history, smoothed estimates" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Grave's Neural Transducer\n", "\n", "Reasoning:\n", "\n", "- LSTM+CTC only contains \"acoustic model\", no language model\n", "- for the language model, we need current output conditioned on actual previous outputs\n", "- encoder: bidirectional LSTM operating over input\n", "- prediction network: forward LSTM predicting $s_i$ given $s_{i-1}...s_0$\n", "- mirrors the dynamic programming / beam search used with HMMs + language models" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Graves's Neural Transducer\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Graves's Neural Transducer\n", "\n", "\n", "\n", "Figure: Prabhavalkar, Rohit, et al. \"A Comparison of Sequence-to-Sequence Models for Speech Recognition.\" Interspeech. 2017." 
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Bahdanau Attention\n", "\n", "\n", "\n", "Chorowski, Jan, et al. \"End-to-end continuous speech recognition using attention-based recurrent NN: First results.\" arXiv preprint arXiv:1412.1602 (2014).\n", "\n", "Figure: Prabhavalkar, Rohit, et al. \"A Comparison of Sequence-to-Sequence Models for Speech Recognition.\" Interspeech. 2017." ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Sequence to Sequence\n", "\n", "\n", "\n", "Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. \"Sequence to sequence learning with neural networks.\" Advances in neural information processing systems. 2014." ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Note on Seq2Seq\n", "\n", "- the distinction between acoustic and linguistic analysis is likely spurious\n", "- the real issue is that LSTM etc. only output posteriors at time $t$\n", "- i.e., the output at time $t$ is an average over competing hypotheses\n", "- but that average already contains linguistic information\n", "- beam search disentangles those averages; it does not add linguistic modeling" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# When are Seq2Seq Models Useful?\n", "\n", "- when there are many possible segmentation alternatives\n", "- when you need to add separately constructed language models\n", "- when there are substantial reorderings of the input (this also requires attention or global state vectors)" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# SEQ2SEQ FOR DIGIT RECOGNITION" ] },
{ "cell_type": "code", "execution_count": 23, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "from torch import nn\n", "class Seq2Seq(nn.Module):\n", "    def __init__(self, ninput, noutput, nhidden, pre=None, nlayers=1):\n", "        super().__init__()\n", "        # encoder consumes the input feature sequence\n", "        self.encoder = nn.LSTM(ninput, nhidden, num_layers=nlayers)\n", "        # decoder is fed one-hot encoded output symbols\n", "        self.decoder = nn.LSTM(noutput, nhidden, num_layers=nlayers)\n", "        # output map from hidden state to symbol log-probabilities\n", "        self.out = nn.Linear(nhidden, noutput)\n", "        self.logsoftmax = nn.LogSoftmax(dim=0)" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "For a seq2seq model, we need an encoder, a decoder, and an output map; a runnable usage sketch follows after the output slide." ] },
{ "cell_type": "code", "execution_count": 24, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "def forward(self, inputs, targets, forcing=0.5):\n", "    _, state = self.encoder(inputs)\n", "    decoder_input = hotone(0)  # start symbol; hotone = one-hot helper\n", "    outputs = []\n", "    for i in range(len(targets)):\n", "        decoder_output, state = self.decoder(decoder_input.view(1, 1, -1), state)\n", "        logprobs = self.logsoftmax(self.out(decoder_output)[0, 0])\n", "        outputs.append(logprobs)\n", "        _, pred = logprobs.topk(1)\n", "        # teacher forcing: feed the ground truth with probability forcing\n", "        decoder_input = targets[i] if torch.rand(1).item() < forcing else hotone(pred)\n", "    return torch.stack(outputs)" ] },
{ "cell_type": "code", "execution_count": 25, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "def greedy_predict(self, inputs):\n", "    _, state = self.encoder(inputs)\n", "    decoder_input = hotone(0)  # start symbol\n", "    result = []\n", "    for i in range(MAXLEN):\n", "        decoder_output, state = self.decoder(decoder_input.view(1, 1, -1), state)\n", "        _, pred = self.logsoftmax(self.out(decoder_output)[0, 0]).topk(1)\n", "        decoder_input = hotone(pred)\n", "        result.append(int(pred))\n", "        if int(pred) == 0: break  # stop symbol\n", "    return result" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Sequence-to-Sequence Output" ] },
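{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Usage Sketch\n", "\n", "The following wires the pieces above together on random data, as promised earlier. Everything here is assumed for illustration: `hotone` and `MAXLEN` are left undefined on the slides, so minimal versions are supplied, and the dimensions and the fake \"digit\" transcript are arbitrary." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "import torch\n", "\n", "MAXLEN = 20   # assumed cap on decoded sequence length\n", "NOUTPUT = 11  # assumed: stop symbol 0 plus ten digit classes\n", "\n", "def hotone(i):\n", "    # minimal one-hot helper; the slides assume this exists\n", "    v = torch.zeros(NOUTPUT)\n", "    v[int(i)] = 1.0\n", "    return v\n", "\n", "# attach the slide-level functions as methods of the class above\n", "Seq2Seq.forward = forward\n", "Seq2Seq.greedy_predict = greedy_predict\n", "\n", "model = Seq2Seq(ninput=48, noutput=NOUTPUT, nhidden=100)\n", "inputs = torch.randn(73, 1, 48)  # fake feature sequence (T, B, D)\n", "targets = [hotone(c) for c in [3, 7, 1, 0]]  # fake transcript 371 + stop\n", "print(model(inputs, targets, forcing=1.0).shape)  # torch.Size([4, 11])\n", "with torch.no_grad():\n", "    print(model.greedy_predict(inputs))  # untrained, so arbitrary symbols" ] },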
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# 2D Scanning\n", "\n", "\n", "\n", " Bluche, Théodore, Jérôome Louradour, and Ronaldo Messina. \"Scan, attend and read: End-to-end handwritten paragraph recognition with mdlstm attention.\" 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR). Vol. 1. IEEE, 2017." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# 2D Scanning\n", "\n", "\n", "\n", "Wigington, Curtis, et al. \"Start, follow, read: End-to-end full-page handwriting recognition.\" Proceedings of the European Conference on Computer Vision (ECCV). 2018." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# 2D Scanning Approaches\n", "\n", "- straightforward idea -- analogous to human reading\n", "- a kind of seq2seq with attention\n", "- use MDLSTM or VGG to predict start point\n", "- then use MDLSTM or VGG to predict next fixation point\n", "- can follow curved lines, deal with layout issues directly" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# 2D Scanning Approaches\n", "\n", "Questions:\n", "\n", "- competitive computational costs?\n", "- competitive performance?\n", "- hard to train?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Summary: Sequence to Sequence\n", "\n", "- in addition to LSTM+CTC, there are more complex sequence-to-sequence architectures\n", "- such architectures explicitly model dependencies within the output\n", "- they also allow inputs/outputs of very different lengths and rearrangements of sequence elements" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Summary: Sequence to Sequence\n", "\n", "\n", "Applications:\n", "\n", "- potential seq2seq OCR applications (some found in the literature)\n", "- general purpose language modeling and OCR correction" ] } ], "metadata": { "celltoolbar": "Slideshow", "jupytext": { "cell_metadata_filter": "-all", "formats": "ipynb", "main_language": "python" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }