{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Populating the interactive namespace from numpy and matplotlib\n" ] } ], "source": [ "%pylab inline" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "lines_to_next_cell": 2, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "import torch\n", "from torch import nn\n", "from torchmore import flex, layers" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0, "slideshow": { "slide_type": "slide" } }, "source": [ "# APPLICATIONS TO OCR" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Character Recognition\n", "\n", "- assuming you have a character segmentation\n", " - extract each character\n", " - feed to any of these architectures as if it were an object recognition problem\n", "\n", "Goodfellow, Ian J., et al. \"Multi-digit number recognition from street view imagery using deep convolutional neural networks.\" arXiv preprint arXiv:1312.6082 (2013)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Whole Word Recognition\n", "\n", "- perform word localization using Faster RCNN\n", "- perform whole word recognition as if it were a large object reconition problem\n", "\n", "![word recognition](figs/word-recognition.png)\n", "\n", "Jaderberg, Max, et al. \"Deep structured output learning for unconstrained text recognition.\" arXiv preprint arXiv:1412.5903 (2014)." ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0, "slideshow": { "slide_type": "slide" } }, "source": [ "# Better Techniques\n", "\n", "- above techniques are applications of computer vision localization\n", "- Faster RCNN and similar techniques are ad hoc and limited\n", "- often require pre-segmented text for training\n", "- better approaches:\n", " - use markers for localizing/bounding text (later)\n", " - use sequence learning techniques and CTC" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Convolutional Networks for OCR\n", "\n", "Historically, LSTM came first, but we're going to start off with convolutional networks analogous to object recognition networks.\n", "\n", "Structure:\n", "\n", "- perform 2D convolutions over the entire image\n", "- assume that that has extracted features that correspond to characters\n", "- project those features into a 1D sequence\n", "- classify the projected 1D feature sequence into characters\n", "- perform training using EM alignment (CTC, Viterbi, etc.)" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0, "slideshow": { "slide_type": "slide" } }, "source": [ "# Convolutional Networks for OCR" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "def make_model():\n", " return nn.Sequential(\n", " # BDHW\n", " *convolutional_layers(),\n", " # BDHW, now reduce along the vertical\n", " layers.Fun(lambda x: x.sum(2)),\n", " # BDW\n", " layers.Conv1d(num_classes, 1)\n", " )" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0, "slideshow": { "slide_type": "slide" } }, "source": [ "# Viterbi Training\n", "\n", "- ground truth: text string = sequence of classes\n", "- ground truth `\"ABC\"` is replaced by regular expression `/_+A+_+B+_+C+_+/`\n", "- network outputs $P(c|i)$, a probability 
of each class $c$ at each position $i$\n", "- find the best possible alignment between network outputs and ground truth regex\n", "- that alignment gives an output for each time step\n", "- treat that alignment as if it were the ground truth and backpropagate\n", "- this is an example of an EM algorithm" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0, "slideshow": { "slide_type": "slide" } }, "source": [ "# CTC Training\n", "\n", "- like Viterbi training, but instead of committing to the single best alignment, it averages over all possible alignments\n", "\n", "Identical to traditional HMM training in speech recognition:\n", "\n", "- Viterbi training = Viterbi training\n", "- CTC training = forward-backward algorithm" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0, "slideshow": { "slide_type": "slide" } }, "source": [ "# CTC Training with cctc2\n", "\n", "- with the `cctc2` library, we can make the alignment explicit" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "def train_batch(input, target):\n", " optimizer.zero_grad()\n", " output = model(input)\n", " # compute the explicit EM alignment, then train the network towards it\n", " aligned = cctc2.align(output, target)\n", " loss = nn.functional.mse_loss(aligned, output)\n", " loss.backward()\n", " optimizer.step()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Explicit CTC\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0, "slideshow": { "slide_type": "slide" } }, "source": [ "# CTC in PyTorch\n", "\n", "- in PyTorch, CTC is implemented as a loss function\n", "- `CTCLoss` in PyTorch obscures what's going on\n", "- all you get is the loss output, not the EM alignment\n", "- sequences are packed in a special way into batches" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "ctc_loss = nn.CTCLoss()\n", "\n", "def train_batch(input, target, input_lengths, target_lengths):\n", " optimizer.zero_grad()\n", " output = model(input)\n", " # CTCLoss expects (T, B, C) log-probabilities plus the length of\n", " # each output and target sequence in the batch\n", " loss = ctc_loss(output, target, input_lengths, target_lengths)\n", " loss.backward()\n", " optimizer.step()" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0, "slideshow": { "slide_type": "slide" } }, "source": [ "# Word / Text Line Recognition " ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "def make_model():\n", " return nn.Sequential(\n", " *convolutional_layers(),\n", " layers.Fun(lambda x: x.sum(2)),\n", " layers.Conv1d(num_classes, 1)\n", " )\n", "\n", "def train_batch(input, target, input_lengths, target_lengths):\n", " optimizer.zero_grad()\n", " output = model(input)\n", " loss = ctc_loss(output, target, input_lengths, target_lengths)\n", " loss.backward()\n", " optimizer.step() " ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0, "slideshow": { "slide_type": "slide" } }, "source": [ "# VGG-Like Model" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "def conv2d(d, r=3, stride=1, repeat=1):\n", " \"\"\"Generate a conv-batchnorm-ReLU block, optionally repeated.\"\"\"\n", " result = []\n", " for i in range(repeat):\n", " result += [\n", " flex.Conv2d(d, r, padding=(r//2, r//2), stride=stride),\n", " flex.BatchNorm2d(),\n", " nn.ReLU()\n", " ]\n", " return result\n", "\n", "def conv2mp(d, r=3, mp=2, repeat=1):\n", " \"\"\"Generate conv blocks with batchnorm, followed by an optional maxpool.\"\"\"\n", " result = conv2d(d, r, repeat=repeat)\n", " if mp is not None:\n", " result += 
[nn.MaxPool2d(mp)]\n", " return result\n", "\n", "def project_and_conv1d(d, noutput, r=5):\n", " return [\n", " layers.Fun(\"lambda x: x.max(2)[0]\"),\n", " flex.Conv1d(d, r, padding=r//2),\n", " flex.BatchNorm1d(),\n", " nn.ReLU(),\n", " flex.Conv1d(noutput, 1),\n", " layers.Reorder(\"BDL\", \"BLD\")\n", " ]\n", "\n", "class Additive(nn.Module):\n", " def __init__(self, *args, post=None):\n", " super().__init__()\n", " self.sub = nn.ModuleList(args)\n", " self.post = post\n", " def forward(self, x):\n", " y = self.sub[0](x)\n", " for f in self.sub[1:]:\n", " y = y + f(x)\n", " if self.post is not None:\n", " y = self.post(y)\n", " return y" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "data": { "text/plain": [ "torch.Size([1, 50, 53])" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def make_vgg_model(noutput=53):\n", " return nn.Sequential(\n", " layers.Input(\"BDHW\", sizes=[None, 1, None, None]),\n", " *conv2mp(100, 3, 2, repeat=2),\n", " *conv2mp(200, 3, 2, repeat=2),\n", " *conv2mp(300, 3, 2, repeat=2),\n", " *conv2d(400, 3, repeat=2),\n", " *project_and_conv1d(800, noutput)\n", " )\n", "make_vgg_model()(torch.rand(1, 1, 60, 400)).shape" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0, "slideshow": { "slide_type": "slide" } }, "source": [ "# Resnet-Block\n", "\n", "- NB: we can easily define Resnet etc. in an object-oriented fashion" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "def ResnetBlock(d, r=3):\n", " return Additive(\n", " nn.Identity(),\n", " nn.Sequential(\n", " nn.Conv2d(d, d, r, padding=r//2), nn.BatchNorm2d(d), nn.ReLU(),\n", " nn.Conv2d(d, d, r, padding=r//2), nn.BatchNorm2d(d)\n", " )\n", " )\n", "\n", "def resnet_blocks(n, d, r=3):\n", " return [ResnetBlock(d, r) for _ in range(n)]" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0, "slideshow": { "slide_type": "slide" } }, "source": [ "# Resnet-like Model" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "data": { "text/plain": [ "torch.Size([1, 50, 53])" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def make_resnet_model(noutput=53): \n", " return nn.Sequential(\n", " layers.Input(\"BDHW\", sizes=[None, 1, None, None]),\n", " *conv2mp(64, 3, (2, 1)),\n", " *resnet_blocks(5, 64), *conv2mp(128, 3, 2),\n", " *resnet_blocks(5, 128), *conv2mp(256, 3, 2),\n", " *resnet_blocks(5, 256), *conv2mp(512, 3, 2),\n", " *resnet_blocks(5, 512),\n", " *project_and_conv1d(800, noutput)\n", " )\n", "make_resnet_model()(torch.rand(1, 1, 60, 400)).shape" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0, "slideshow": { "slide_type": "slide" } }, "source": [ "# Footprints\n", "\n", "- even with projection/1D convolution, a character is first recognized in 2D by the 2D convolutional network\n", "- character recognition with 2D convolutional networks is really a kind of deformable template matching\n", "- in order to recognize a character, each pixel at the output of the 2D convolutional network needs to have a footprint large enough to cover the character to be recognized\n", "- footprint calculation:\n", " - 3x3 convolutions, three maxpool operations = roughly a 24x24 footprint (see the sketch below)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Conv-Only Training\n", "\n", "" ] }, { "cell_type": "markdown", 
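"metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "# Footprint Calculation (Sketch)\n", "\n", "- a minimal sketch of the receptive-field (\"footprint\") recursion behind the estimate on the Footprints slide\n", "- the `footprint` helper and the layer stack below (3x3 convolutions, each followed by a 2x2 maxpool) are assumptions for illustration, not part of the models defined in this notebook" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "def footprint(stack):\n", " \"\"\"Receptive field of a stack of (kernel_size, stride) layers.\"\"\"\n", " rf, jump = 1, 1\n", " for k, s in stack:\n", " rf += (k - 1) * jump # widen by (k-1) steps of the current feature spacing\n", " jump *= s # pooling/striding increases the feature spacing\n", " return rf\n", "\n", "# three 3x3 convolutions, each followed by a 2x2 maxpool with stride 2\n", "footprint([(3, 1), (2, 2)] * 3) # 22 pixels, close to the rough 3 * 2**3 = 24 estimate" ] }, { "cell_type": "markdown", 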
"metadata": { "lines_to_next_cell": 0, "slideshow": { "slide_type": "slide" } }, "source": [ "# Problems with VGG/Resnet+Conv1d\n", "\n", "Problem:\n", "- reduces output to H/8, W/8\n", "- CTC alignment needs two pixels for each character\n", "- result: models trouble with narrow characters\n", "\n", "Solutions:\n", "- use fractional max pooling\n", "- use upscaling\n", "- use transposed convolutions" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0, "slideshow": { "slide_type": "slide" } }, "source": [ "# Less Downscaling using `FractionalMaxPool2d`\n", "\n", "- permits more max pooling steps without making image too small\n", "- can be performed anisotropically\n", "- necessary non-uniform spacing may have additional benefits" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "data": { "text/plain": [ "torch.Size([1, 210, 53])" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def conv2fmp(d, r=3, fmp=(0.7, 0.85), repeat=1):\n", " result = conv2d(d, r, repeat=repeat)\n", " if fmp is not None:\n", " result += [nn.FractionalMaxPool2d(3, output_ratio=fmp)]\n", " return result\n", "\n", "def make_fmp_model(noutput=53):\n", " return nn.Sequential(\n", " layers.Input(\"BDHW\", sizes=[None, 1, None, None]),\n", " *[l for d in [50, 100, 150, 200, 250, 300] for l in conv2fmp(d, 3, (0.7, 0.9))],\n", " *project_and_conv1d(800, noutput)\n", " )\n", "make_fmp_model()(torch.rand(1, 1, 60, 400)).shape" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0, "slideshow": { "slide_type": "slide" } }, "source": [ "# Upscaling using `interpolate`\n", "\n", "- `interpolate` scales an image, has `backward()`\n", "- `MaxPool2d...interpolate` is a simple multiscale analysis\n", "- can be combined with loss functions at each level" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "torch.Size([1, 400, 53])" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import torch.nn.functional as F\n", "def make_interpolating_model(noutput=53):\n", " return nn.Sequential(\n", " layers.Input(\"BDHW\", sizes=[None, 1, None, None]),\n", " *conv2mp(50, 3), *conv2mp(100, 3), *conv2mp(150, 3), *conv2mp(200, 3),\n", " layers.Fun_(lambda x: F.interpolate(x, scale_factor=16)),\n", " *project_and_conv1d(800, noutput)\n", " )\n", "make_interpolating_model()(torch.rand(1, 1, 60, 400)).shape" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Upscaling with `interpolate`\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0, "slideshow": { "slide_type": "slide" } }, "source": [ "\n", "# Upscaling using `ConvTranspose1d`\n", "\n", "\n", "- `ConvTranspose2d` fills in higher resolutions with \"templates\"\n", "- commonly used in image segmentation and superresolution\n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "torch.Size([1, 53, 49])\n", "torch.Size([1, 53, 25])\n" ] } ], "source": [ "def make_ct_model(noutput=53, ct=1):\n", " return nn.Sequential(\n", " layers.Input(\"BDHW\", sizes=[None, 1, None, None]),\n", " *conv2mp(50, 3), \n", " *conv2mp(100, 3),\n", " *conv2mp(150, 3),\n", " *conv2mp(200, 3),\n", " layers.Fun(\"lambda x: x.sum(2)\"), # BDHW -> BDW\n", " *[flex.ConvTranspose1d(800, 1, 
stride=2)]*ct,\n", " flex.Conv1d(noutput, 7, padding=3)\n", " )\n", "print(make_ct_model()(torch.rand(1, 1, 60, 400)).shape)\n", "print(make_ct_model(ct=0)(torch.rand(1, 1, 60, 400)).shape)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# How well do these work?\n", "\n", "- These models work for both word and text line recognition.\n", "- All these models only require that characters are arranged left to right.\n", "- Input images can be rotated up to around 30 degrees and scaled.\n", "- Input images can be grayscale.\n", "- Great for scene text and degraded documents.\n", "\n", "But:\n", "- You pay a price for translation/scale invariance: lower performance.\n", "- These don't use any recurrent networks, so all information flow is strictly limited to the footprint.\n" ] } ], "metadata": { "celltoolbar": "Slideshow", "jupytext": { "cell_metadata_filter": "-all", "formats": "ipynb", "main_language": "python" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }