{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Populating the interactive namespace from numpy and matplotlib\n"
]
}
],
"source": [
"%pylab inline"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"lines_to_next_cell": 2,
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"import torch\n",
"from torch import nn\n",
"from torchmore import flex, layers"
]
},
{
"cell_type": "markdown",
"metadata": {
"lines_to_next_cell": 0,
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# APPLICATIONS TO OCR"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Character Recognition\n",
"\n",
"- assuming you have a character segmentation\n",
" - extract each character\n",
" - feed to any of these architectures as if it were an object recognition problem\n",
"\n",
"Goodfellow, Ian J., et al. \"Multi-digit number recognition from street view imagery using deep convolutional neural networks.\" arXiv preprint arXiv:1312.6082 (2013)."
]
},
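{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# A Per-Character Classifier (Sketch)\n",
"\n",
"A minimal sketch of such a classifier, assuming pre-segmented character crops (e.g. 32x32 grayscale) and some number of classes; `make_char_classifier` and `nclasses` are illustrative names, not part of any library."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def make_char_classifier(nclasses=100):\n",
"    # minimal convnet for, e.g., 32x32 grayscale character crops\n",
"    return nn.Sequential(\n",
"        flex.Conv2d(64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),\n",
"        flex.Conv2d(128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),\n",
"        layers.Fun(\"lambda x: x.view(x.size(0), -1)\"),\n",
"        flex.Linear(nclasses)\n",
"    )"
]
},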
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Whole Word Recognition\n",
"\n",
"- perform word localization using Faster RCNN\n",
"- perform whole word recognition as if it were a large object reconition problem\n",
"\n",
"![word recognition](figs/word-recognition.png)\n",
"\n",
"Jaderberg, Max, et al. \"Deep structured output learning for unconstrained text recognition.\" arXiv preprint arXiv:1412.5903 (2014)."
]
},
{
"cell_type": "markdown",
"metadata": {
"lines_to_next_cell": 0,
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Better Techniques\n",
"\n",
"- above techniques are applications of computer vision localization\n",
"- Faster RCNN and similar techniques are ad hoc and limited\n",
"- often require pre-segmented text for training\n",
"- better approaches:\n",
" - use markers for localizing/bounding text (later)\n",
" - use sequence learning techniques and CTC"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Convolutional Networks for OCR\n",
"\n",
"Historically, LSTM came first, but we're going to start off with convolutional networks analogous to object recognition networks.\n",
"\n",
"Structure:\n",
"\n",
"- perform 2D convolutions over the entire image\n",
"- assume that that has extracted features that correspond to characters\n",
"- project those features into a 1D sequence\n",
"- classify the projected 1D feature sequence into characters\n",
"- perform training using EM alignment (CTC, Viterbi, etc.)"
]
},
{
"cell_type": "markdown",
"metadata": {
"lines_to_next_cell": 0,
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Convolutional Networks for OCR"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"lines_to_next_cell": 2
},
"outputs": [],
"source": [
"def make_model():\n",
" return nn.Sequential(\n",
" # BDHW\n",
" *convolutional_layers(),\n",
" # BDHW, now reduce along the vertical\n",
" layers.Fun(lambda x: x.sum(2)),\n",
" # BDW\n",
" layers.Conv1d(num_classes, 1)\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {
"lines_to_next_cell": 0,
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Viterbi Training\n",
"\n",
"- ground truth: text string = sequence of classes\n",
"- ground truth `\"ABC\"` is replaced by regular expression `/_+A+_+B+_+C+_+/`\n",
"- network outputs $P(c|i)$, a probability of each class $c$ at each position $i$\n",
"- find the best possible alignment between network outputs and ground truth regex\n",
"- that alignment gives an output for each time step\n",
"- treat that alignment as if it were the ground truth and backpropagate\n",
"- this is an example of an EM algorithm"
]
},
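{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Viterbi Alignment: A Sketch\n",
"\n",
"A minimal, purely illustrative implementation of the best-path alignment described above (not the actual training code; `viterbi_align` is a hypothetical helper). It assumes `logprobs` is a `T x C` tensor of per-timestep log-probabilities and that blanks between characters are mandatory, as in the regex `/_+A+_+B+_+C+_+/`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def viterbi_align(logprobs, targets, blank=0):\n",
"    \"\"\"Best-path alignment of a T x C log-prob matrix to the state\n",
"    sequence blank, t1, blank, ..., tn, blank (mandatory blanks).\"\"\"\n",
"    T, C = logprobs.shape\n",
"    states = [blank]\n",
"    for t in targets:\n",
"        states += [int(t), blank]\n",
"    S = len(states)\n",
"    delta = torch.full((T, S), float(\"-inf\"))\n",
"    back = torch.zeros((T, S), dtype=torch.long)\n",
"    delta[0, 0] = logprobs[0, states[0]]\n",
"    for t in range(1, T):\n",
"        for s in range(S):\n",
"            # either stay in state s or advance from state s-1\n",
"            best, arg = delta[t-1, s], s\n",
"            if s > 0 and delta[t-1, s-1] > best:\n",
"                best, arg = delta[t-1, s-1], s-1\n",
"            delta[t, s] = best + logprobs[t, states[s]]\n",
"            back[t, s] = arg\n",
"    # backtrace from the final (trailing blank) state\n",
"    path = [S - 1]\n",
"    for t in range(T - 1, 0, -1):\n",
"        path.append(int(back[t, path[-1]]))\n",
"    path.reverse()\n",
"    return [states[s] for s in path]  # one class per time step"
]
},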
{
"cell_type": "markdown",
"metadata": {
"lines_to_next_cell": 0,
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# CTC Training\n",
"\n",
"- like Viterbi training, but instead of finding the best alignment uses an average alignment\n",
"\n",
"Identical to traditional HMM training in speech recognition:\n",
"\n",
"- Viterbi training = Viterbi training\n",
"- CTC training = forward-backward algorithm"
]
},
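{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# CTC Alignment: A Sketch\n",
"\n",
"Again purely illustrative (`soft_align` is a hypothetical helper): the same state machine as in the Viterbi sketch, but with `max` replaced by `logsumexp`. The result is a posterior over alignment states at each time step, i.e. the averaged alignment that CTC trains against."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def soft_align(logprobs, targets, blank=0):\n",
"    \"\"\"Forward-backward posteriors over the alignment states used\n",
"    in viterbi_align; max is replaced by logsumexp.\"\"\"\n",
"    T, C = logprobs.shape\n",
"    states = [blank]\n",
"    for t in targets:\n",
"        states += [int(t), blank]\n",
"    S = len(states)\n",
"    alpha = torch.full((T, S), -1e30)\n",
"    beta = torch.full((T, S), -1e30)\n",
"    alpha[0, 0] = logprobs[0, states[0]]\n",
"    for t in range(1, T):\n",
"        for s in range(S):\n",
"            prev = alpha[t-1, max(s-1, 0):s+1]\n",
"            alpha[t, s] = torch.logsumexp(prev, 0) + logprobs[t, states[s]]\n",
"    beta[T-1, S-1] = 0.0\n",
"    for t in range(T-2, -1, -1):\n",
"        for s in range(S):\n",
"            hi = min(s+2, S)\n",
"            emit = logprobs[t+1, [states[i] for i in range(s, hi)]]\n",
"            beta[t, s] = torch.logsumexp(beta[t+1, s:hi] + emit, 0)\n",
"    gamma = alpha + beta\n",
"    gamma = gamma - torch.logsumexp(gamma, 1, keepdim=True)\n",
"    return gamma.exp()  # T x S posteriors over alignment states"
]
},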
{
"cell_type": "markdown",
"metadata": {
"lines_to_next_cell": 0,
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# CTC training with cctc2\n",
"\n",
"- with the `cctc2` library, we can make the alignment explicit"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"lines_to_next_cell": 2
},
"outputs": [],
"source": [
"def train_batch(input, target):\n",
" optimizer.zero_grad()\n",
" output = model(input)\n",
" aligned = cctc2.align(output, target)\n",
" loss = mse_loss(aligned, output)\n",
" loss.backward()\n",
" optimizer.step()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Explicit CTC\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {
"lines_to_next_cell": 0,
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# CTC in PyTorch\n",
"\n",
"- in PyTorch, CTC is implemented as a loss function\n",
"- `CTCLoss` in PyTorch obscures what's going on\n",
"- all you get is the loss output, not the EM alignment\n",
"- sequences are packed in a special way into batches"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"lines_to_next_cell": 2
},
"outputs": [],
"source": [
"ctc_loss = nn.CTCLoss()\n",
"\n",
"def train_batch(input, target):\n",
" optimizer.zero_grad()\n",
" output = model(input)\n",
" loss = ctc_loss(output, target)\n",
" loss.backward()\n",
" optimizer.step()"
]
},
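{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Packing CTC Targets\n",
"\n",
"The packing mentioned above, sketched with made-up class indices: `CTCLoss` accepts all target sequences concatenated into a single 1D tensor, together with a tensor of per-sequence lengths."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# two transcripts as (made-up) class-index sequences\n",
"t1 = torch.tensor([8, 5, 12, 12, 15])   # first transcript\n",
"t2 = torch.tensor([23, 15, 18, 12, 4])  # second transcript\n",
"targets = torch.cat([t1, t2])           # packed 1D target batch\n",
"target_lengths = torch.tensor([len(t1), len(t2)])"
]
},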
{
"cell_type": "markdown",
"metadata": {
"lines_to_next_cell": 0,
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Word / Text Line Recognition "
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"lines_to_next_cell": 2
},
"outputs": [],
"source": [
"def make_model():\n",
" return nn.Sequential(\n",
" *convolutional_layers(),\n",
" layers.Fun(lambda x: x.sum(2)),\n",
" layers.Conv1d(num_classes, 1)\n",
" )\n",
"\n",
"def train_batch(input, target):\n",
" optimizer.zero_grad()\n",
" output = model(input)\n",
" loss = ctc_loss(output, target)\n",
" loss.backward()\n",
" optimizer.step() "
]
},
{
"cell_type": "markdown",
"metadata": {
"lines_to_next_cell": 0,
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# VGG-Like Model"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"def conv2d(d, r=3, stride=1, repeat=1):\n",
" \"\"\"Generate a conv layer with batchnorm and optional maxpool.\"\"\"\n",
" result = []\n",
" for i in range(repeat):\n",
" result += [\n",
" flex.Conv2d(d, r, padding=(r//2, r//2), stride=stride),\n",
" flex.BatchNorm2d(),\n",
" nn.ReLU()\n",
" ]\n",
" return result\n",
"\n",
"def conv2mp(d, r=3, mp=2, repeat=1):\n",
" \"\"\"Generate a conv layer with batchnorm and optional maxpool.\"\"\"\n",
" result = conv2d(d, r, repeat=repeat)\n",
" if mp is not None:\n",
" result += [nn.MaxPool2d(mp)]\n",
" return result\n",
"\n",
"def project_and_conv1d(d, noutput, r=5):\n",
" return [\n",
" layers.Fun(\"lambda x: x.max(2)[0]\"),\n",
" flex.Conv1d(d, r, padding=r//2),\n",
" flex.BatchNorm1d(),\n",
" nn.ReLU(),\n",
" flex.Conv1d(noutput, 1),\n",
" layers.Reorder(\"BDL\", \"BLD\")\n",
" ]\n",
"\n",
"class Additive(nn.Module):\n",
" def __init__(self, *args, post=None):\n",
" super().__init__()\n",
" self.sub = nn.ModuleList(args)\n",
" self.post = None\n",
" def forward(self, x):\n",
" y = self.sub[0](x)\n",
" for f in self.sub[1:]:\n",
" y = y + f(x)\n",
" if self.post is not None:\n",
" y = self.post(y)\n",
" return y"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"lines_to_next_cell": 2
},
"outputs": [
{
"data": {
"text/plain": [
"torch.Size([1, 50, 53])"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def make_vgg_model(noutput=53):\n",
" return nn.Sequential(\n",
" layers.Input(\"BDHW\", sizes=[None, 1, None, None]),\n",
" *conv2mp(100, 3, 2, repeat=2),\n",
" *conv2mp(200, 3, 2, repeat=2),\n",
" *conv2mp(300, 3, 2, repeat=2),\n",
" *conv2d(400, 3, repeat=2),\n",
" *project_and_conv1d(800, noutput)\n",
" )\n",
"make_vgg_model()(torch.rand(1, 1, 60, 400)).shape"
]
},
{
"cell_type": "markdown",
"metadata": {
"lines_to_next_cell": 0,
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Resnet-Block\n",
"\n",
"- NB: we can easily define Resnet etc. in an object-oriented fashion"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"lines_to_next_cell": 2
},
"outputs": [],
"source": [
"def ResnetBlock(d, r=3):\n",
" return Additive(\n",
" nn.Identity(),\n",
" nn.Sequential(\n",
" nn.Conv2d(d, d, r, padding=r//2), nn.BatchNorm2d(d), nn.ReLU(),\n",
" nn.Conv2d(d, d, r, padding=r//2), nn.BatchNorm2d(d)\n",
" )\n",
" )\n",
"\n",
"def resnet_blocks(n, d, r=3):\n",
" return [ResnetBlock(d, r) for _ in range(n)]"
]
},
{
"cell_type": "markdown",
"metadata": {
"lines_to_next_cell": 0,
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Resnet-like Model"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"lines_to_next_cell": 2
},
"outputs": [
{
"data": {
"text/plain": [
"torch.Size([1, 50, 53])"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def make_resnet_model(noutput=53): \n",
" return nn.Sequential(\n",
" layers.Input(\"BDHW\", sizes=[None, 1, None, None]),\n",
" *conv2mp(64, 3, (2, 1)),\n",
" *resnet_blocks(5, 64), *conv2mp(128, 3, 2),\n",
" *resnet_blocks(5, 128), *conv2mp(256, 3, 2),\n",
" *resnet_blocks(5, 256), *conv2mp(512, 3, 2),\n",
" *resnet_blocks(5, 512),\n",
" *project_and_conv1d(800, noutput)\n",
" )\n",
"make_resnet_model()(torch.rand(1, 1, 60, 400)).shape"
]
},
{
"cell_type": "markdown",
"metadata": {
"lines_to_next_cell": 0,
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Footprints\n",
"\n",
"- even with projection/1D convolution, a character is first recognized in 2D by the 2D convolutional network\n",
"- character recognition with 2D convolutional networks really a kind of deformable template matching\n",
"- in order to recognize a character, each pixel at the output of the 2D convolutional network needs to have a footprint large enough to cover the character to be recognized\n",
"- footprint calculation:\n",
" - 3x3 convolution, three maxpool operations = 24x24 footprint"
]
},
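{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Checking the Footprint Arithmetic\n",
"\n",
"A back-of-the-envelope check of the slide's arithmetic (`footprint` is an illustrative helper, and this is a rough estimate rather than an exact receptive-field computation): each 2x2 maxpool doubles the input region covered by one output pixel, so an `r x r` convolution applied after `n` pooling steps spans roughly `r * 2**n` input pixels."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def footprint(r, num_pools, pool=2):\n",
"    \"\"\"Rough footprint of an r x r conv after num_pools pooling steps.\"\"\"\n",
"    return r * pool ** num_pools\n",
"\n",
"footprint(3, 3)  # 24, i.e. the 24x24 footprint from the slide"
]
},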
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Conv-Only Training\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {
"lines_to_next_cell": 0,
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Problems with VGG/Resnet+Conv1d\n",
"\n",
"Problem:\n",
"- reduces output to H/8, W/8\n",
"- CTC alignment needs two pixels for each character\n",
"- result: models trouble with narrow characters\n",
"\n",
"Solutions:\n",
"- use fractional max pooling\n",
"- use upscaling\n",
"- use transposed convolutions"
]
},
{
"cell_type": "markdown",
"metadata": {
"lines_to_next_cell": 0,
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Less Downscaling using `FractionalMaxPool2d`\n",
"\n",
"- permits more max pooling steps without making image too small\n",
"- can be performed anisotropically\n",
"- necessary non-uniform spacing may have additional benefits"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"lines_to_next_cell": 2
},
"outputs": [
{
"data": {
"text/plain": [
"torch.Size([1, 210, 53])"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def conv2fmp(d, r=3, fmp=(0.7, 0.85), repeat=1):\n",
" result = conv2d(d, r, repeat=repeat)\n",
" if fmp is not None:\n",
" result += [nn.FractionalMaxPool2d(3, output_ratio=fmp)]\n",
" return result\n",
"\n",
"def make_fmp_model(noutput=53):\n",
" return nn.Sequential(\n",
" layers.Input(\"BDHW\", sizes=[None, 1, None, None]),\n",
" *[l for d in [50, 100, 150, 200, 250, 300] for l in conv2fmp(d, 3, (0.7, 0.9))],\n",
" *project_and_conv1d(800, noutput)\n",
" )\n",
"make_fmp_model()(torch.rand(1, 1, 60, 400)).shape"
]
},
{
"cell_type": "markdown",
"metadata": {
"lines_to_next_cell": 0,
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Upscaling using `interpolate`\n",
"\n",
"- `interpolate` scales an image, has `backward()`\n",
"- `MaxPool2d...interpolate` is a simple multiscale analysis\n",
"- can be combined with loss functions at each level"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"torch.Size([1, 400, 53])"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import torch.nn.functional as F\n",
"def make_interpolating_model(noutput=53):\n",
" return nn.Sequential(\n",
" layers.Input(\"BDHW\", sizes=[None, 1, None, None]),\n",
" *conv2mp(50, 3), *conv2mp(100, 3), *conv2mp(150, 3), *conv2mp(200, 3),\n",
" layers.Fun_(lambda x: F.interpolate(x, scale_factor=16)),\n",
" *project_and_conv1d(800, noutput)\n",
" )\n",
"make_interpolating_model()(torch.rand(1, 1, 60, 400)).shape"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Upscaling with `interpolate`\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {
"lines_to_next_cell": 0,
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"\n",
"# Upscaling using `ConvTranspose1d`\n",
"\n",
"\n",
"- `ConvTranspose2d` fills in higher resolutions with \"templates\"\n",
"- commonly used in image segmentation and superresolution\n"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"lines_to_next_cell": 2
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"torch.Size([1, 53, 49])\n",
"torch.Size([1, 53, 25])\n"
]
}
],
"source": [
"def make_ct_model(noutput=53, ct=1):\n",
" return nn.Sequential(\n",
" layers.Input(\"BDHW\", sizes=[None, 1, None, None]),\n",
" *conv2mp(50, 3), \n",
" *conv2mp(100, 3),\n",
" *conv2mp(150, 3),\n",
" *conv2mp(200, 3),\n",
" layers.Fun(\"lambda x: x.sum(2)\"), # BDHW -> BDW\n",
" *[flex.ConvTranspose1d(800, 1, stride=2)]*ct,\n",
" flex.Conv1d(noutput, 7, padding=3)\n",
" )\n",
"print(make_ct_model()(torch.rand(1, 1, 60, 400)).shape)\n",
"print(make_ct_model(ct=0)(torch.rand(1, 1, 60, 400)).shape)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# How well do these work?\n",
"\n",
"- Works for word or text line recognition.\n",
"- All these models only require that characters are arranged left to right.\n",
"- Input images can be rotated up to around 30 degrees and scaled.\n",
"- Input images can be grayscale.\n",
"- Great for scene text and degraded documents.\n",
"\n",
"But:\n",
"- You pay a price for translation/scale inv: lower performance.\n",
"- These don't use any recurrent networks, so all information flow is strictly limited to the footprint.\n"
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"jupytext": {
"cell_metadata_filter": "-all",
"formats": "ipynb",
"main_language": "python"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}