{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Populating the interactive namespace from numpy and matplotlib\n" ] } ], "source": [ "%pylab inline" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "lines_to_next_cell": 2, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "import torch\n", "from torch import nn\n", "from torchmore import flex, layers" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0, "slideshow": { "slide_type": "slide" } }, "source": [ "# APPLICATIONS TO OCR" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Character Recognition\n", "\n", "- assuming you have a character segmentation\n", " - extract each character\n", " - feed to any of these architectures as if it were an object recognition problem\n", "\n", "Goodfellow, Ian J., et al. \"Multi-digit number recognition from street view imagery using deep convolutional neural networks.\" arXiv preprint arXiv:1312.6082 (2013)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Whole Word Recognition\n", "\n", "- perform word localization using Faster RCNN\n", "- perform whole word recognition as if it were a large object reconition problem\n", "\n", "![word recognition](figs/word-recognition.png)\n", "\n", "Jaderberg, Max, et al. \"Deep structured output learning for unconstrained text recognition.\" arXiv preprint arXiv:1412.5903 (2014)." ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0, "slideshow": { "slide_type": "slide" } }, "source": [ "# Better Techniques\n", "\n", "- above techniques are applications of computer vision localization\n", "- Faster RCNN and similar techniques are ad hoc and limited\n", "- often require pre-segmented text for training\n", "- better approaches:\n", " - use markers for localizing/bounding text (later)\n", " - use sequence learning techniques and CTC" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Convolutional Networks for OCR\n", "\n", "Historically, LSTM came first, but we're going to start off with convolutional networks analogous to object recognition networks.\n", "\n", "Structure:\n", "\n", "- perform 2D convolutions over the entire image\n", "- assume that that has extracted features that correspond to characters\n", "- project those features into a 1D sequence\n", "- classify the projected 1D feature sequence into characters\n", "- perform training using EM alignment (CTC, Viterbi, etc.)" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0, "slideshow": { "slide_type": "slide" } }, "source": [ "# Convolutional Networks for OCR" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "def make_model():\n", " return nn.Sequential(\n", " # BDHW\n", " *convolutional_layers(),\n", " # BDHW, now reduce along the vertical\n", " layers.Fun(lambda x: x.sum(2)),\n", " # BDW\n", " layers.Conv1d(num_classes, 1)\n", " )" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0, "slideshow": { "slide_type": "slide" } }, "source": [ "# Viterbi Training\n", "\n", "- ground truth: text string = sequence of classes\n", "- ground truth `\"ABC\"` is replaced by regular expression `/_+A+_+B+_+C+_+/`\n", "- network outputs $P(c|i)$, a probability 
of each class $c$ at each position $i$\n", "- find the best possible alignment between network outputs and ground truth regex\n", "- that alignment gives an output for each time step\n", "- treat that alignment as if it were the ground truth and backpropagate\n", "- this is an example of an EM algorithm" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0, "slideshow": { "slide_type": "slide" } }, "source": [ "# CTC Training\n", "\n", "- like Viterbi training, but instead of committing to the single best alignment, it averages over all possible alignments\n", "\n", "Identical to traditional HMM training in speech recognition:\n", "\n", "- Viterbi training = Viterbi training\n", "- CTC training = forward-backward algorithm" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0, "slideshow": { "slide_type": "slide" } }, "source": [ "# CTC Training with cctc2\n", "\n", "- with the `cctc2` library, we can make the alignment explicit" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "def train_batch(input, target):\n", " optimizer.zero_grad()\n", " output = model(input)\n", " # compute the explicit EM alignment, then train the network towards it\n", " aligned = cctc2.align(output, target)\n", " loss = nn.functional.mse_loss(aligned, output)\n", " loss.backward()\n", " optimizer.step()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Explicit CTC\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0, "slideshow": { "slide_type": "slide" } }, "source": [ "# CTC in PyTorch\n", "\n", "- in PyTorch, CTC is implemented as a loss function\n", "- `CTCLoss` in PyTorch obscures what's going on\n", "- all you get is the loss output, not the EM alignment\n", "- sequences are packed in a special way into batches" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "ctc_loss = nn.CTCLoss()\n", "\n", "def train_batch(input, target, input_lengths, target_lengths):\n", " optimizer.zero_grad()\n", " output = model(input)\n", " # CTCLoss expects (T, B, C) log-probabilities plus the length of\n", " # each output and target sequence in the batch\n", " loss = ctc_loss(output, target, input_lengths, target_lengths)\n", " loss.backward()\n", " optimizer.step()" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0, "slideshow": { "slide_type": "slide" } }, "source": [ "# Word / Text Line Recognition " ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "def make_model():\n", " return nn.Sequential(\n", " *convolutional_layers(),\n", " layers.Fun(lambda x: x.sum(2)),\n", " layers.Conv1d(num_classes, 1)\n", " )\n", "\n", "def train_batch(input, target, input_lengths, target_lengths):\n", " optimizer.zero_grad()\n", " output = model(input)\n", " loss = ctc_loss(output, target, input_lengths, target_lengths)\n", " loss.backward()\n", " optimizer.step() " ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0, "slideshow": { "slide_type": "slide" } }, "source": [ "# VGG-Like Model" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "def conv2d(d, r=3, stride=1, repeat=1):\n", " \"\"\"Generate a conv-batchnorm-ReLU block, optionally repeated.\"\"\"\n", " result = []\n", " for i in range(repeat):\n", " result += [\n", " flex.Conv2d(d, r, padding=(r//2, r//2), stride=stride),\n", " flex.BatchNorm2d(),\n", " nn.ReLU()\n", " ]\n", " return result\n", "\n", "def conv2mp(d, r=3, mp=2, repeat=1):\n", " \"\"\"Generate conv blocks with batchnorm, followed by an optional maxpool.\"\"\"\n", " result = conv2d(d, r, repeat=repeat)\n", " if mp is not None:\n", " result += 
[nn.MaxPool2d(mp)]\n", " return result\n", "\n", "def project_and_conv1d(d, noutput, r=5):\n", " return [\n", " layers.Fun(\"lambda x: x.max(2)[0]\"),\n", " flex.Conv1d(d, r, padding=r//2),\n", " flex.BatchNorm1d(),\n", " nn.ReLU(),\n", " flex.Conv1d(noutput, 1),\n", " layers.Reorder(\"BDL\", \"BLD\")\n", " ]\n", "\n", "class Additive(nn.Module):\n", " def __init__(self, *args, post=None):\n", " super().__init__()\n", " self.sub = nn.ModuleList(args)\n", " self.post = post\n", " def forward(self, x):\n", " y = self.sub[0](x)\n", " for f in self.sub[1:]:\n", " y = y + f(x)\n", " if self.post is not None:\n", " y = self.post(y)\n", " return y" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "data": { "text/plain": [ "torch.Size([1, 50, 53])" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def make_vgg_model(noutput=53):\n", " return nn.Sequential(\n", " layers.Input(\"BDHW\", sizes=[None, 1, None, None]),\n", " *conv2mp(100, 3, 2, repeat=2),\n", " *conv2mp(200, 3, 2, repeat=2),\n", " *conv2mp(300, 3, 2, repeat=2),\n", " *conv2d(400, 3, repeat=2),\n", " *project_and_conv1d(800, noutput)\n", " )\n", "make_vgg_model()(torch.rand(1, 1, 60, 400)).shape" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0, "slideshow": { "slide_type": "slide" } }, "source": [ "# Resnet-Block\n", "\n", "- NB: we can easily define Resnet etc. in an object-oriented fashion" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "def ResnetBlock(d, r=3):\n", " return Additive(\n", " nn.Identity(),\n", " nn.Sequential(\n", " nn.Conv2d(d, d, r, padding=r//2), nn.BatchNorm2d(d), nn.ReLU(),\n", " nn.Conv2d(d, d, r, padding=r//2), nn.BatchNorm2d(d)\n", " )\n", " )\n", "\n", "def resnet_blocks(n, d, r=3):\n", " return [ResnetBlock(d, r) for _ in range(n)]" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0, "slideshow": { "slide_type": "slide" } }, "source": [ "# Resnet-like Model" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "data": { "text/plain": [ "torch.Size([1, 50, 53])" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def make_resnet_model(noutput=53): \n", " return nn.Sequential(\n", " layers.Input(\"BDHW\", sizes=[None, 1, None, None]),\n", " *conv2mp(64, 3, (2, 1)),\n", " *resnet_blocks(5, 64), *conv2mp(128, 3, 2),\n", " *resnet_blocks(5, 128), *conv2mp(256, 3, 2),\n", " *resnet_blocks(5, 256), *conv2mp(512, 3, 2),\n", " *resnet_blocks(5, 512),\n", " *project_and_conv1d(800, noutput)\n", " )\n", "make_resnet_model()(torch.rand(1, 1, 60, 400)).shape" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0, "slideshow": { "slide_type": "slide" } }, "source": [ "# Footprints\n", "\n", "- even with projection/1D convolution, a character is first recognized in 2D by the 2D convolutional network\n", "- character recognition with 2D convolutional networks is really a kind of deformable template matching\n", "- in order to recognize a character, each pixel at the output of the 2D convolutional network needs to have a footprint large enough to cover the character to be recognized\n", "- footprint calculation:\n", " - 3x3 convolutions, three maxpool operations = roughly a 24x24 footprint (see the sketch below)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Conv-Only Training\n", "\n", "" ] }, { "cell_type": "markdown", 
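"metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "# Footprint Calculation (Sketch)\n", "\n", "- a minimal sketch of the receptive-field (\"footprint\") recursion behind the estimate on the Footprints slide\n", "- the `footprint` helper and the layer stack below (3x3 convolutions, each followed by a 2x2 maxpool) are assumptions for illustration, not part of the models defined in this notebook" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "def footprint(stack):\n", " \"\"\"Receptive field of a stack of (kernel_size, stride) layers.\"\"\"\n", " rf, jump = 1, 1\n", " for k, s in stack:\n", " rf += (k - 1) * jump # widen by (k-1) steps of the current feature spacing\n", " jump *= s # pooling/striding increases the feature spacing\n", " return rf\n", "\n", "# three 3x3 convolutions, each followed by a 2x2 maxpool with stride 2\n", "footprint([(3, 1), (2, 2)] * 3) # 22 pixels, close to the rough 3 * 2**3 = 24 estimate" ] }, { "cell_type": "markdown", 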
"metadata": { "lines_to_next_cell": 0, "slideshow": { "slide_type": "slide" } }, "source": [ "# Problems with VGG/Resnet+Conv1d\n", "\n", "Problem:\n", "- reduces output to H/8, W/8\n", "- CTC alignment needs two pixels for each character\n", "- result: models trouble with narrow characters\n", "\n", "Solutions:\n", "- use fractional max pooling\n", "- use upscaling\n", "- use transposed convolutions" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0, "slideshow": { "slide_type": "slide" } }, "source": [ "# Less Downscaling using `FractionalMaxPool2d`\n", "\n", "- permits more max pooling steps without making image too small\n", "- can be performed anisotropically\n", "- necessary non-uniform spacing may have additional benefits" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "data": { "text/plain": [ "torch.Size([1, 210, 53])" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def conv2fmp(d, r=3, fmp=(0.7, 0.85), repeat=1):\n", " result = conv2d(d, r, repeat=repeat)\n", " if fmp is not None:\n", " result += [nn.FractionalMaxPool2d(3, output_ratio=fmp)]\n", " return result\n", "\n", "def make_fmp_model(noutput=53):\n", " return nn.Sequential(\n", " layers.Input(\"BDHW\", sizes=[None, 1, None, None]),\n", " *[l for d in [50, 100, 150, 200, 250, 300] for l in conv2fmp(d, 3, (0.7, 0.9))],\n", " *project_and_conv1d(800, noutput)\n", " )\n", "make_fmp_model()(torch.rand(1, 1, 60, 400)).shape" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0, "slideshow": { "slide_type": "slide" } }, "source": [ "# Upscaling using `interpolate`\n", "\n", "- `interpolate` scales an image, has `backward()`\n", "- `MaxPool2d...interpolate` is a simple multiscale analysis\n", "- can be combined with loss functions at each level" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "torch.Size([1, 400, 53])" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import torch.nn.functional as F\n", "def make_interpolating_model(noutput=53):\n", " return nn.Sequential(\n", " layers.Input(\"BDHW\", sizes=[None, 1, None, None]),\n", " *conv2mp(50, 3), *conv2mp(100, 3), *conv2mp(150, 3), *conv2mp(200, 3),\n", " layers.Fun_(lambda x: F.interpolate(x, scale_factor=16)),\n", " *project_and_conv1d(800, noutput)\n", " )\n", "make_interpolating_model()(torch.rand(1, 1, 60, 400)).shape" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Upscaling with `interpolate`\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0, "slideshow": { "slide_type": "slide" } }, "source": [ "\n", "# Upscaling using `ConvTranspose1d`\n", "\n", "\n", "- `ConvTranspose2d` fills in higher resolutions with \"templates\"\n", "- commonly used in image segmentation and superresolution\n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "torch.Size([1, 53, 49])\n", "torch.Size([1, 53, 25])\n" ] } ], "source": [ "def make_ct_model(noutput=53, ct=1):\n", " return nn.Sequential(\n", " layers.Input(\"BDHW\", sizes=[None, 1, None, None]),\n", " *conv2mp(50, 3), \n", " *conv2mp(100, 3),\n", " *conv2mp(150, 3),\n", " *conv2mp(200, 3),\n", " layers.Fun(\"lambda x: x.sum(2)\"), # BDHW -> BDW\n", " *[flex.ConvTranspose1d(800, 1, 
stride=2)]*ct,\n", " flex.Conv1d(noutput, 7, padding=3)\n", " )\n", "print(make_ct_model()(torch.rand(1, 1, 60, 400)).shape)\n", "print(make_ct_model(ct=0)(torch.rand(1, 1, 60, 400)).shape)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# How well do these work?\n", "\n", "- These models work for both word and text line recognition.\n", "- All these models only require that characters are arranged left to right.\n", "- Input images can be rotated up to around 30 degrees and scaled.\n", "- Input images can be grayscale.\n", "- Great for scene text and degraded documents.\n", "\n", "But:\n", "- You pay a price for translation/scale invariance: lower performance.\n", "- These don't use any recurrent networks, so all information flow is strictly limited to the footprint.\n" ] } ], "metadata": { "celltoolbar": "Slideshow", "jupytext": { "cell_metadata_filter": "-all", "formats": "ipynb", "main_language": "python" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }