{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "name": "Analysis of Fast RCNN.ipynb", "version": "0.3.2", "views": {}, "default_view": {}, "provenance": [], "collapsed_sections": [] } }, "cells": [ { "metadata": { "id": "v3JItYGHRZI3", "colab_type": "code", "colab": { "autoexec": { "startup": false, "wait_interval": 0 } } }, "cell_type": "code", "source": [ "" ], "execution_count": 0, "outputs": [] }, { "metadata": { "id": "NnvCuOG6SKKy", "colab_type": "text" }, "cell_type": "markdown", "source": [ "\n", "# Preliminaties \n", "\n", "## Notation \n", "\n", "### Spaces \n", "\n", "- $ \\mathcal{I} $ : Raw Image Space \n", " - it can be represented as a 3D Tensor $ I \\in \\mathbb{R}^{w \\times h \\times c} $ with Channels Dimenions $ c = 3 $ assuming an RGB Input \n", "- $ \\mathcal{B} $ : Bounding Box Space \n", " - each element is $ b \\in \\mathcal{B} $ so that $ b = (u,v,w,h) $ which identies a group of pixels in $ I $ image \n", " - $ B = \\{ b_{i} \\}_{i=1,...,N} $ Group of Bounding Boxes \n", "- $ I \\in \\mathcal{I} $ represents a generic image \n", "- $ I_{B} \\subset I $ represents a Bounding Box applied to $ I $ according to $ B $\n", "- $ \\mathcal{S} $ : Latent Image Space \n", " - it results from the CNN Processing and it typically identifies the 2D Spatial Tensor, with a certain Channel Depth, after all the Convolutive Processing (Convolutions + NonLin e.g. ReLU + Spatial Reduction Operators e.g. MaxPooling) just before it gets transformed into the $ d^{(bottleneck)} $ Bottleneck Feature Descriptor \n", "- $ \\mathcal{D} $ : Bottleneck Feature Space \n", " - it is typically a $ d $ Dimensional Space $ \\mathbb{R}^{d} $ \n", "- $ \\mathcal{L} $ : Label Space \n", " - it is typically a finite set of semantic labels \n", " \n", "### Functions \n", "\n", "- $ f^{(ROI)} : \\mathbb{R}^{w \\times h \\times c} \\times \\mathcal{B} \\rightarrow \\mathbb{R}^{w' \\times h' \\times c} \\qquad w' < w \\quad h' < h $ : Gets a Spatial ROI from a Tensor Space (the Channel Dimension is kept the same) \n", " - with $ f^{(ROI)}(I; b) $ it applies to the $ I \\subset \\mathbb{R}^{w \\times h \\times c} $ Input Tensor the ROI identified by $ b \\in \\mathcal{B} $ so that $ b = ( u_{0}, v_{0}, w, h ) $ \n", "- $ f^{CNN} : \\mathcal{I} \\rightarrow \\mathcal{S} $ : Performs the CNN Processing to compute the Latent Representation \n", "- $ f^{Cl} : \\mathcal{D} \\rightarrow \\mathcal{L} $ : Performs the Classification starting from some Latent Representation, typically consisting of 1D Tensor of some fixed lenght (Bottleneck Feature)\n", "- $ f^{MaxPooling} : \\mathbb{R}^{w \\times h} \\rightarrow \\mathbb{R} $ : Represents the Spatial Max Pooling Operator \n", " - it is defined as \n", " \n", "$$ f^{MaxPooling}(R) = \\max_{i=1,...,w \\quad j=1,...,h} R(i,j) $$\n", "\n", "\n" ] }, { "metadata": { "id": "D8_Op7Z0RgFN", "colab_type": "text" }, "cell_type": "markdown", "source": [ "\n", "# RCNN \n", "\n", "## Main Idea \n", "\n", "The RCNN consists of 3 main blocks running sequentially \n", "\n", "1. the Region Proposal $ f^{(RP)} : \\mathcal{I} \\rightarrow \\mathcal{B} $ which in RCNN original formulation relies on Selective Search Algorithm \n", "2. the Feature Computation $ f^{(CNN)} : \\mathcal{I_{B}} \\rightarrow \\mathcal{S} $ relying on some CNN Backend (e.g. VGG)\n", "3. the Classificator $ f^{(Cl)} : \\mathcal{D} \\rightarrow \\mathcal{L} $ in its original implementation it relis on a Shallow Classifier like SVM \n", "\n", "## Implementation \n", "\n", "1. Compute $ B = \\{ b_{i} \\} $ Region Proposal Set \n", "2. Compute $ S_{i} = f^{(CNN)}(B_{i}) $ Latent Representation for each selected BBox \n", "3. Assign Semantic Label $ L_{i} = f^{(Cl)}(I_{B_{i}}) $ \n", "4. The final result is $ \\{ (B, L)_{i} \\}_{i=1,...,N} $ \n", "\n" ] }, { "metadata": { "id": "Ep_WsZKYWrI3", "colab_type": "text" }, "cell_type": "markdown", "source": [ "\n", "# Fast RCNN \n", "\n", "## Overview \n", "\n", "- Focused on improving RCNN on the speed performance side \n", "\n", "- Introduces ROI Pooling Network (Ross Girschick, Apr 2015)\n", "\n", "\n", "\n", "## Main Ideas \n", "\n", "- Achieve speed up by sharing the computationally expensive CNN Processing \n", "\n", "- Introduce $ f^{(RP)} $ ROI Pooling Network which is responsible for \n", " - mapping from Image Space Bounding Box $ I_{b} $ to Latent Space Bounding Box $ S_{b^{(s)}} $ \n", " - it means computing $ b^{(s)} = (u,v,w,h)^{(s)} $ in Latent Space from $ b = (u,v,w,h) $ in Input Space \n", " - it can be performed in a deterministic way, considering all the spatial reductions performed by Convolutive Processing (Convolutions + Spatial Pooling) \n", " - it allows to compute a $ S_{b^{(s)}} $ ConvMap corresponding to a $ I_{b} $ Input Region Proposal \n", " - however it is not possibly to apply $ f^{Cl} $ directly to $ S_{b^{s}} $ because the latter can have a generic size while the former requires a fixed size input (as the classification is internally performed with fully connected layers), this is managed by the following second function performed by ROI Pooling Network \n", " - mapping the variable size $ S_{b^{(s)}} $ into a fixed size $ S^{(p)} $ ConvMap by means of further spatial pooling \n", "\n", "- By making the \"Feature Computation Path\" Mol start from the \"Full Image PP\" Mol instead of from a \"Fixed Size ROI PP\" Mol \n", "\n", "\n", "\n", "## Implementation \n", "\n", "1. Region Proposal \n", " - The $ f^{(RP)} = f^{(SS)} $ : Region Proposal still implemented with Selective Search (non trainable driven approach)\n", " \n", "2. CNN Processing \n", " - Change $ S_{B} = f^{(CNN)}(I_{B}) $ with $ S=f^{(CNN)}(I) $ and use \n", " \n", "\n", "\n" ] }, { "metadata": { "id": "qBcDuN5ZB9ro", "colab_type": "text" }, "cell_type": "markdown", "source": [ "\n", "## Details \n", "\n", "### ROI Pooling \n", "\n", "The ROI Pooling Network computes $ S^{(p)} $ Fixed Size Region Proposal Latent Representation which can be easily transformed into the fixed size vector which can be passed to $ f^{(Cl)} $ for Classification, starting from $ S_{ b^{(s)} } $ according to the following Algo \n", "\n", "- Assumptions \n", " - Sizes: $ S_{b^{(s)}} $ ConvMap has size $ w^{(s)} \\times h^{(s)} \\times c^{(s)} $ while the $ S^{(p)} $ Pooled ConvMap has size $ w^{(p)} \\times h^{(p)} \\times c^{(p)} $ with $ w^{(p)} < w^{(s)} $ and $ h^{(p)} < h^{(s)} $ \n", " - Typically $ S^{(p)} $ is square \n", "- The $ \\{w', h'\\} $ are computed as the result of an integer division between $ \\{w,h\\}^{(s)} $ and $ \\{w,h\\}^{(p)} $ respectively so the Input ConvMap gets divided into a set of $ \\{ R_{i,j} \\}_{i=1,...,w^{(p)}, j=1,...,h^{(p)}} $ elements of mostly equally sized subregions (up to the integer division approximation) so that $ S_{b^{(s)}} = \\bigcup_{i,j}^{i=1,...,w^{(p)}, j=1,...,h^{(p)}} R_{i,j} $ and there is a one-to-one relationship between $ R_{i,j} $ and the $ i,j $ element in $ S^{(p)} $ ConvMap \n", "- Finally Max Pooling is performed setting the corresponding element in the Pooled ConvMap \n", "\n", "$$ S^{(p)}(i,j) = f^{(MaxPooling)} R_{i,j} \\quad \\forall i = 1,...,w^{(p)}, j=1,...,h^{(p)} $$\n", "\n", "![ROI Pooling1](https://blog.deepsense.ai/wp-content/uploads/2017/02/1.jpg)\n", "- The $ S $ Full Latent ConvMap\n", "\n", "![ROI Pooling3](https://blog.deepsense.ai/wp-content/uploads/2017/02/2.jpg)\n", "- The $ S_{b^{(s)}} $ Region Proposal ConvMap with size $ w^{(s)}=7, h^{(s)}=5 $ \n", "\n", "![ROI Pooling5](https://blog.deepsense.ai/wp-content/uploads/2017/02/3.jpg)\n", "- Considering $ S^{(p)} $ has $ w^{(p)}=2, h^{(p)}=2 $ then 4 $ R_{i,j} $ Regions are needed and considering $ w^{(s)} / w_{(p)} = 3 $ and $ h^{(s)} / h^{(p)} = 2 $ the association is \n", " - $ R_{0,0} = f^{(ROI)}( S_{b^{(s)}}; 0,0,3,2 ) $\n", " - $ R_{1,0} = f^{(ROI)}( S_{b^{(s)}}; 3,0,4,2 ) $\n", " - $ R_{1,0} = f^{(ROI)}( S_{b^{(s)}}; 0,2,3,3 ) $\n", " - $ R_{1,1} = f^{(ROI)}( S_{b^{(s)}}; 3,2,4,3 ) $\n", "\n" ] }, { "metadata": { "id": "bGmU_8miKAme", "colab_type": "code", "colab": { "autoexec": { "startup": false, "wait_interval": 0 } } }, "cell_type": "code", "source": [ "" ], "execution_count": 0, "outputs": [] } ] }