{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "lines_to_next_cell": 0,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "\n",
    "# MODERN DEEP LEARNING ERA"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lines_to_next_cell": 0,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Breakthrough Paper\n",
    "\n",
    "\"ImageNet Classification with Deep Convolutional Neural Networks\", Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, 2013\n",
    "\n",
    "- substantially better performance than previous methods\n",
    "- effective use of GPUs for convolutional neural networks\n",
    "- ReLU nonlinearity\n",
    "- max pooling\n",
    "- data augmentation\n",
    "- dropout, local response normalization\n",
    "- much bigger than previous networks\n",
    "\n",
    "None of these were new ideas, but this paper brought it all together."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# This took a lot of planning\n",
    "\n",
    "Deep Learning didn't come out of nothing.\n",
    "\n",
    "The major people in the field today tried to push it for 20 years compared to other approaches.\n",
    "\n",
    "Finally succeeded in the 2010's due to Alexnet."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lines_to_next_cell": 0,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Alexnet Architecture\n",
    "\n",
    "![alexnet](figs/alexnet.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "outputs": [],
   "source": [
    "from torch import nn\n",
    "from torchmore import flex, layers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "# Alexnet Architecture (approximately)\n",
    "\n",
    "model = nn.Sequential(\n",
    "    layers.Input(\"BDHW\", sizes=(None, 3, 224, 224)),\n",
    "    flex.Conv2d(2*48, (11, 11)),\n",
    "    nn.ReLU(),\n",
    "    nn.MaxPool2d((2, 2)),\n",
    "    flex.Conv2d(2*192, (3, 3)),\n",
    "    nn.ReLU(),\n",
    "    flex.Conv2d(2*192, (3, 3)),\n",
    "    nn.ReLU(),\n",
    "    flex.Conv2d(2*128, (3, 3)),\n",
    "    layers.Reshape(0, [1, 2, 3]),\n",
    "    flex.Linear(4096),\n",
    "    nn.ReLU(),\n",
    "    flex.Linear(4096),\n",
    "    nn.ReLU(),\n",
    "    flex.Linear(1000)\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "\n",
    "# Alexnet Learned Features\n",
    "\n",
    "![alexnet features](figs/alexnet-features.png)\n",
    "\n",
    "NB: these are standard PCA/Gabor-jet style features from vision>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lines_to_next_cell": 0,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# ReLU\n",
    "\n",
    "<img src=\"figs/relu-nonlinearity.png\" style=\"height: 4in\" />\n",
    "\n",
    "$\\sigma(x) = (1 + e^{-x})^{-1}$, $\\rho(x) = \\max(0, x)$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lines_to_next_cell": 0,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# ReLU Derivatives\n",
    "\n",
    "<img src=\"figs/relu-deriv.png\" style=\"height: 4in\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lines_to_next_cell": 0,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Nonlinearity Properties\n",
    "\n",
    "<table style=\"font-size: 44px\">\n",
    "    <tr><th>property</th><th>sigmoid</th><th>ReLU</th></tr>\n",
    "    <tr><td>derivatives</td><td>infinite</td><td>f': discontinuous, f'': zero</td></tr>\n",
    "    <tr><td>monotonicity></td><td>monotonic</td><td>monotonic</td></tr>\n",
    "    <tr><td>range</td><td>$(0, 1)$</td><td>$(0, \\infty)$</td></tr>\n",
    "    <tr><td>zero derivative</td><td>none</td><td>$(-\\infty, 0)$</td></tr>\n",
    "</table>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lines_to_next_cell": 0,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# ReLU\n",
    "\n",
    "- much faster to compute\n",
    "- converges faster\n",
    "- scale independent\n",
    "- existence of zero-derivative regions causes \"no deltas\"\n",
    "- units may \"go dead\"\n",
    "- positive output only\n",
    "- results in piecewise linear approximations to functions\n",
    "- results in classifiers based on linear arrangements"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lines_to_next_cell": 0,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Max Pooling\n",
    "\n",
    "<img src=\"figs/maxpool.png\" style=\"height: 3in\" />\n",
    "\n",
    "- replaces average pooling, reduces resolution\n",
    "- performed per channel\n",
    "- nonlinear operation, somewhat similar to morphological operations"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lines_to_next_cell": 0,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Local Response Normalization\n",
    "\n",
    "$$ y = x \\cdot (k + \\alpha (K * |x|^\\gamma) ^ \\beta)^-1 $$\n",
    "\n",
    "- Here, $*$ is a convolution operation.\n",
    "- That is, we normalize the image with an average of the local response. \n",
    "- In Alexnet, $k=2$, K is a 5x5 pillbox, $\\gamma=2$, $\\beta=0.75$\n",
    "- A simple variance normalization would use $k=0$, $\\gamma=2$, and $\\beta=0.5$\n",
    "\n",
    "In later models, this is effectively replaced by batch normalization."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lines_to_next_cell": 0,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Dropout\n",
    "\n",
    "- randomly turn off units during training\n",
    "- motivated by an approximation to an ensemble of networks\n",
    "- intended to lead to better generalization from limited samples"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lines_to_next_cell": 0,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Data Augmentation\n",
    "\n",
    "- generate additional training samples by modifying the original images\n",
    "- long history in OCR\n",
    "- for image training:\n",
    "  - random geometric transformation\n",
    "  - random distortions\n",
    "  - random photometric transforms\n",
    "  - addition of noise, distractors, masking"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lines_to_next_cell": 0,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# FURTHER DEVELOPMENTS"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lines_to_next_cell": 0,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Batch Normalization\n",
    "\n",
    "\"Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift\", S. Ioffe and C. Szegedy, 2015.\n",
    "\n",
    "- attributes slower learning to \"internal covariate shift\"\n",
    "- suggests that ideally, each layer should \"whiten\" the data\n",
    "- instead normalizes mean and variance for each neuron\n",
    "- normalization based on batch statistics (and running statistics)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lines_to_next_cell": 0,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Inception\n",
    "\n",
    "Szegedy, Christian, et al. \"Rethinking the inception architecture for computer vision.\" Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.\n",
    "\n",
    "- very deep architecture built up from complex modules\n",
    "- separable convolutions for large footprints\n",
    "- \"label smoothing\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lines_to_next_cell": 0,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# VGG Networks\n",
    "\n",
    "Simonyan, Karen, and Andrew Zisserman. \"Very deep convolutional networks for large-scale image recognition.\" arXiv preprint arXiv:1409.1556 (2014).\n",
    "\n",
    "- very deep networks with fairly regular structure\n",
    "- multiple convolutions + max pooling\n",
    "- combined with batch normalization in later systems"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lines_to_next_cell": 0,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Residual Networks\n",
    "\n",
    "He, Kaiming, et al. \"Deep residual learning for image recognition.\" Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.\n",
    "\n",
    "- novel architecture composed of modules\n",
    "- each module consists of convolutional layers\n",
    "- the output of the convolutional layers is added to the input\n",
    "- ReLU and batch normalization is used throughout"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lines_to_next_cell": 0,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# LOCALIZATION"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lines_to_next_cell": 0,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Localization of Objects\n",
    "\n",
    "- objects occur at different locations in scenes/images\n",
    "- different strategies with recognizing objects:\n",
    "  - global classification\n",
    "  - moving/scanning window\n",
    "  - region proposals (RCNN etc.)\n",
    "  - learning dense markers / segmentation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lines_to_next_cell": 0,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Global Classification"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "lines_to_next_cell": 2
   },
   "outputs": [],
   "source": [
    "def conv2d_block(d):\n",
    "    return nn.Sequential(\n",
    "        flex.Conv2d(d, 3, padding=1), flex.BatchNorm2d(), nn.ReLU(),\n",
    "        flex.Conv2d(d, 3, padding=1), flex.BatchNorm2d(), nn.ReLU(),\n",
    "        flex.MaxPool2d(),\n",
    "    )\n",
    "\n",
    "def make_vgg_like_model():\n",
    "    return nn.Sequential(\n",
    "        *conv2d_block(64), *conv2d_block(128), *conv2d_block(256), \n",
    "        *conv2d_block(512), *conv2d_block(1024), *conv2d_block(2048),\n",
    "        # we have a (None, 2048, 4, 4) batch at this point\n",
    "        # switch to global classification\n",
    "        nn.Flatten(),\n",
    "        flex.Linear(4096), flex.BatchNorm(), nn.ReLU(),\n",
    "        flex.Linear(4096), flex.BatchNorm(), nn.ReLU(),\n",
    "        flex.Linear(1000)\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lines_to_next_cell": 0,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Sliding Windows\n",
    "\n",
    "![overfeat](figs/overfeat.png)\n",
    "\n",
    "Sermanet, Pierre, et al. \"Overfeat: Integrated recognition, localization and detection using convolutional networks.\" arXiv preprint arXiv:1312.6229 (2013).\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Region Proposal Network\n",
    "\n",
    "![region proposal](figs/region-proposal.png)\n",
    "\n",
    "Ren, Shaoqing, et al. \"Faster r-cnn: Towards real-time object detection with region proposal networks.\" Advances in neural information processing systems. 2015.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Summary - New Style Object Recognition Networks\n",
    "\n",
    "- tricks: ReLU, batch norm, max pool, softmax, cross-entropy\n",
    "- hacks: ad-hoc collections of kernels, R-CNN\n",
    "- working at the limits of current GPU hardware\n",
    "- little theoretical foundation, \"Wild West\"\n",
    "- odd effects: deep dreaming, adversarial samples, etc."
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "jupytext": {
   "cell_metadata_filter": "-all",
   "formats": "ipynb",
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}