{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "Analysis of Fast RCNN.ipynb",
      "version": "0.3.2",
      "views": {},
      "default_view": {},
      "provenance": [],
      "collapsed_sections": []
    }
  },
  "cells": [
    {
      "metadata": {
        "id": "v3JItYGHRZI3",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "cell_type": "code",
      "source": [
        ""
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "NnvCuOG6SKKy",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "\n",
        "# Preliminaties \n",
        "\n",
        "## Notation \n",
        "\n",
        "### Spaces \n",
        "\n",
        "- $ \\mathcal{I} $ : Raw Image Space \n",
        "  - it can be represented as a 3D Tensor $ I \\in \\mathbb{R}^{w \\times h \\times c} $ with Channels Dimenions $ c = 3 $ assuming an RGB Input \n",
        "- $ \\mathcal{B} $ : Bounding Box Space \n",
        "  - each element is $ b \\in \\mathcal{B} $ so that $ b = (u,v,w,h) $ which identies a group of pixels in $ I $ image \n",
        "  - $ B = \\{ b_{i} \\}_{i=1,...,N} $ Group of Bounding Boxes \n",
        "- $ I \\in \\mathcal{I} $ represents a generic image \n",
        "- $ I_{B} \\subset I $ represents a Bounding Box applied to $ I $ according to $ B $\n",
        "- $ \\mathcal{S} $ : Latent Image Space \n",
        "  - it results from the CNN Processing and it typically identifies the 2D Spatial Tensor, with a certain Channel Depth, after all the Convolutive Processing (Convolutions + NonLin e.g. ReLU + Spatial Reduction Operators e.g. MaxPooling) just before it gets transformed into the $ d^{(bottleneck)} $ Bottleneck Feature Descriptor \n",
        "- $ \\mathcal{D} $ : Bottleneck Feature Space \n",
        "  - it is typically a $ d $ Dimensional Space $ \\mathbb{R}^{d} $ \n",
        "- $ \\mathcal{L} $ : Label Space \n",
        "  - it is typically a finite set of semantic labels \n",
        "  \n",
        "### Functions \n",
        "\n",
        "- $ f^{(ROI)} : \\mathbb{R}^{w \\times h \\times c} \\times \\mathcal{B} \\rightarrow \\mathbb{R}^{w' \\times h' \\times c} \\qquad w' < w \\quad h' < h $ : Gets a Spatial ROI from a Tensor Space (the Channel Dimension is kept the same) \n",
        "  - with $ f^{(ROI)}(I; b) $ it applies to the $ I \\subset \\mathbb{R}^{w \\times h \\times c} $ Input Tensor the ROI identified by $ b \\in \\mathcal{B} $ so that $ b = ( u_{0}, v_{0}, w, h ) $ \n",
        "- $ f^{CNN} : \\mathcal{I} \\rightarrow \\mathcal{S} $ : Performs the CNN Processing to compute the Latent Representation \n",
        "- $ f^{Cl} : \\mathcal{D} \\rightarrow \\mathcal{L} $ : Performs the Classification starting from some Latent Representation, typically consisting of 1D Tensor of some fixed lenght (Bottleneck Feature)\n",
        "- $ f^{MaxPooling} : \\mathbb{R}^{w \\times h} \\rightarrow \\mathbb{R} $ : Represents the Spatial Max Pooling Operator \n",
        "  - it is defined as \n",
        "  \n",
        "$$ f^{MaxPooling}(R) = \\max_{i=1,...,w \\quad j=1,...,h} R(i,j) $$\n",
        "\n",
        "\n"
      ]
    },
    {
      "metadata": {
        "id": "D8_Op7Z0RgFN",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "\n",
        "# RCNN \n",
        "\n",
        "## Main Idea \n",
        "\n",
        "The RCNN consists of 3 main blocks running sequentially \n",
        "\n",
        "1. the Region Proposal $ f^{(RP)} : \\mathcal{I} \\rightarrow \\mathcal{B} $ which in RCNN original formulation relies on Selective Search Algorithm \n",
        "2. the Feature Computation $ f^{(CNN)} : \\mathcal{I_{B}} \\rightarrow \\mathcal{S} $ relying on some CNN Backend (e.g. VGG)\n",
        "3. the Classificator $ f^{(Cl)} : \\mathcal{D} \\rightarrow \\mathcal{L} $ in its original implementation it relis on a Shallow Classifier like SVM \n",
        "\n",
        "## Implementation \n",
        "\n",
        "1. Compute $ B = \\{ b_{i} \\} $ Region Proposal Set \n",
        "2. Compute $ S_{i} = f^{(CNN)}(B_{i}) $ Latent Representation for each selected BBox \n",
        "3. Assign Semantic Label $ L_{i} = f^{(Cl)}(I_{B_{i}}) $ \n",
        "4. The final result is $ \\{ (B, L)_{i} \\}_{i=1,...,N} $ \n",
        "\n"
      ]
    },
    {
      "metadata": {
        "id": "Ep_WsZKYWrI3",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "\n",
        "# Fast RCNN \n",
        "\n",
        "## Overview \n",
        "\n",
        "- Focused on improving RCNN on the speed performance side \n",
        "\n",
        "- Introduces ROI Pooling Network (Ross Girschick, Apr 2015)\n",
        "\n",
        "\n",
        "\n",
        "## Main Ideas \n",
        "\n",
        "- Achieve speed up by sharing the computationally expensive CNN Processing \n",
        "\n",
        "- Introduce $ f^{(RP)} $ ROI Pooling Network which is responsible for \n",
        "  - mapping from Image Space Bounding Box $ I_{b} $ to Latent Space Bounding Box $ S_{b^{(s)}} $ \n",
        "    - it means computing $ b^{(s)} = (u,v,w,h)^{(s)} $ in Latent Space from $ b = (u,v,w,h) $ in Input Space \n",
        "      - it can be performed in a deterministic way, considering all the spatial reductions performed by Convolutive Processing (Convolutions + Spatial Pooling) \n",
        "      - it allows to compute a $ S_{b^{(s)}} $ ConvMap corresponding to a $ I_{b} $ Input Region Proposal \n",
        "      - however it is not possibly to apply $ f^{Cl} $ directly to $ S_{b^{s}} $ because the latter can have a generic size while the former requires a fixed size input (as the classification is internally performed with fully connected layers), this is managed by the following second function performed by ROI Pooling Network \n",
        "  - mapping the variable size $ S_{b^{(s)}} $ into a fixed size $ S^{(p)} $ ConvMap by means of further spatial pooling \n",
        "\n",
        "- By making the \"Feature Computation Path\" Mol start from the \"Full Image PP\" Mol instead of from a \"Fixed Size ROI PP\" Mol \n",
        "\n",
        "\n",
        "\n",
        "## Implementation \n",
        "\n",
        "1. Region Proposal \n",
        "  - The $ f^{(RP)} = f^{(SS)} $ : Region Proposal still implemented with Selective Search (non trainable driven approach)\n",
        "  \n",
        "2. CNN Processing \n",
        "  - Change $ S_{B} = f^{(CNN)}(I_{B}) $ with $ S=f^{(CNN)}(I) $ and use \n",
        "  \n",
        "\n",
        "\n"
      ]
    },
    {
      "metadata": {
        "id": "qBcDuN5ZB9ro",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "\n",
        "## Details \n",
        "\n",
        "### ROI Pooling \n",
        "\n",
        "The ROI Pooling Network computes $ S^{(p)} $ Fixed Size Region Proposal Latent Representation which can be easily transformed into the fixed size vector which can be passed to $ f^{(Cl)} $ for Classification, starting from $ S_{ b^{(s)} } $ according to the following Algo \n",
        "\n",
        "- Assumptions \n",
        "  - Sizes: $ S_{b^{(s)}} $ ConvMap has size $ w^{(s)} \\times h^{(s)} \\times c^{(s)} $ while the $ S^{(p)} $ Pooled ConvMap has size $ w^{(p)} \\times h^{(p)} \\times c^{(p)} $ with $ w^{(p)} < w^{(s)} $ and $ h^{(p)} < h^{(s)} $ \n",
        "  - Typically $ S^{(p)} $ is square \n",
        "- The $ \\{w', h'\\} $ are computed as the result of an integer division between $ \\{w,h\\}^{(s)} $ and $ \\{w,h\\}^{(p)} $ respectively so the Input ConvMap gets divided into a set of $ \\{ R_{i,j} \\}_{i=1,...,w^{(p)}, j=1,...,h^{(p)}} $ elements of mostly equally sized subregions (up to the integer division approximation) so that $ S_{b^{(s)}} = \\bigcup_{i,j}^{i=1,...,w^{(p)}, j=1,...,h^{(p)}} R_{i,j} $ and there is a one-to-one relationship between $ R_{i,j} $ and the $ i,j $ element in $ S^{(p)} $ ConvMap \n",
        "- Finally Max Pooling is performed setting the corresponding element in the Pooled ConvMap \n",
        "\n",
        "$$ S^{(p)}(i,j) = f^{(MaxPooling)} R_{i,j} \\quad \\forall i = 1,...,w^{(p)}, j=1,...,h^{(p)} $$\n",
        "\n",
        "![ROI Pooling1](https://blog.deepsense.ai/wp-content/uploads/2017/02/1.jpg)\n",
        "- The $ S $ Full Latent ConvMap\n",
        "\n",
        "![ROI Pooling3](https://blog.deepsense.ai/wp-content/uploads/2017/02/2.jpg)\n",
        "- The $ S_{b^{(s)}} $ Region Proposal ConvMap with size $ w^{(s)}=7, h^{(s)}=5 $ \n",
        "\n",
        "![ROI Pooling5](https://blog.deepsense.ai/wp-content/uploads/2017/02/3.jpg)\n",
        "- Considering $ S^{(p)} $ has $ w^{(p)}=2, h^{(p)}=2 $ then 4 $ R_{i,j} $ Regions are needed and considering $ w^{(s)} / w_{(p)} = 3 $ and $ h^{(s)} / h^{(p)} = 2 $ the association is \n",
        "  - $ R_{0,0} = f^{(ROI)}( S_{b^{(s)}}; 0,0,3,2 ) $\n",
        "  - $ R_{1,0} = f^{(ROI)}( S_{b^{(s)}}; 3,0,4,2 ) $\n",
        "  - $ R_{1,0} = f^{(ROI)}( S_{b^{(s)}}; 0,2,3,3 ) $\n",
        "  - $ R_{1,1} = f^{(ROI)}( S_{b^{(s)}}; 3,2,4,3 ) $\n",
        "\n"
      ]
    },
    {
      "metadata": {
        "id": "bGmU_8miKAme",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "cell_type": "code",
      "source": [
        ""
      ],
      "execution_count": 0,
      "outputs": []
    }
  ]
}