{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "hide-input" ] }, "outputs": [], "source": [ "# Install the necessary dependencies\n", "\n", "import os\n", "import sys\n", "!{sys.executable} -m pip install --quiet pandas scikit-learn numpy matplotlib jupyterlab_myst ipython tensorflow tensorpack typing" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "remove-cell" ] }, "source": [ "---\n", "license:\n", " code: MIT\n", " content: CC-BY-4.0\n", "github: https://github.com/ocademy-ai/machine-learning\n", "venue: By Ocademy\n", "open_access: true\n", "bibliography:\n", " - https://raw.githubusercontent.com/ocademy-ai/machine-learning/main/open-machine-learning-jupyter-book/references.bib\n", "---" ] }, { "cell_type": "markdown", "id": "fd2354a3", "metadata": {}, "source": [ "# Object detection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What is object detection?\n", "\n", "In this chapter we will introduce the object detection problem which can be described in this way: given an image or a video stream, an object detection model can identify which of a known set of objects might be present and provide information about their positions within the image.\n", "\n", "For example, this screenshot of the example application shows how three objects have been recognized and their positions annotated:\n", "\n", ":::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/deep-learning/objdet/01_example.png\n", "---\n", "name: 'Example of the object detection task'\n", "width: 90%\n", "---\n", "Example of the object detection task\n", ":::\n", "\n", "Object detection has now been widely used in many real-world applications, such as autonomous driving, robot vision, video surveillance, etc. The following image shows the growing number of publications that are associated with “object detection” over the past two decades.\n", "\n", ":::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/deep-learning/objdet/02_number_of_pub.png\n", "---\n", "name: 'number of pub'\n", "width: 90%\n", "---\n", "The increasing number of publications in object detection from 1998 to 2021\n", ":::" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Challenges\n", "\n", "In addition to some common challenges in other computer vision tasks such as objects under different viewpoints, illuminations,and intra-class variations, the challenges in object detection include but are not limited to the following aspects: \n", "\n", "- object rotation and scale changes (e.g., small objects);\n", "- accurate object localization;\n", "- dense and occluded object detection;\n", "- speed up of detection, etc." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## History & classic models\n", "\n", "Since image classification is a classic task for computer vision, there are several models that are well-performed in the past. We can list them as follows: Faster R-CNN, Mask R-CNN, YOLO, FCOS, DETR. In this part, we will introduce them in order." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Faster R-CNN\n", "\n", "Faster R-CNN is a single, unified network for object detection that utilizes a region proposal network (RPN) with the CNN model2. The RPN shares full-image convolutional features with the detection network, enabling nearly cost-free region proposals2. It is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position.\n", "\n", ":::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/deep-learning/objdet/03_faster_rcnn.png\n", "---\n", "name: 'The structure of Faster RCNN'\n", "width: 90%\n", "---\n", "The structure of Faster RCNN {cite}`faster_structure`\n", ":::\n", "\n", ":::{note}\n", "We can see the paper [here](https://arxiv.org/pdf/1506.01497.pdf).\n", ":::" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Code\n", "\n", "Due to the complexity, here we just introduce the key parts of the model, Region Proposal Network(RPN)." ] }, { "cell_type": "code", "execution_count": 2, "id": "01d44ff2", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [], "source": [ "import tensorflow as tf\n", "from tensorpack.models import Conv2D\n", "from tensorpack.tfutils.argscope import argscope\n", "\n", "def rpn_head(featuremap, channel, num_anchors):\n", " \"\"\"\n", " Returns:\n", " label_logits: fHxfWxNA\n", " box_logits: fHxfWxNAx4\n", " \"\"\"\n", " with argscope(Conv2D, data_format='channels_first',\n", " kernel_initializer=tf.random_normal_initializer(stddev=0.01)):\n", " hidden = Conv2D('conv0', featuremap, channel, 3, activation=tf.nn.relu)\n", "\n", " label_logits = Conv2D('class', hidden, num_anchors, 1)\n", " box_logits = Conv2D('box', hidden, 4 * num_anchors, 1)\n", " # 1, NA(*4), im/16, im/16 (NCHW)\n", "\n", " label_logits = tf.transpose(label_logits, [0, 2, 3, 1]) # 1xfHxfWxNA\n", " label_logits = tf.squeeze(label_logits, 0) # fHxfWxNA\n", "\n", " shp = tf.shape(box_logits) # 1x(NAx4)xfHxfW\n", " box_logits = tf.transpose(box_logits, [0, 2, 3, 1]) # 1xfHxfWx(NAx4)\n", " box_logits = tf.reshape(box_logits, tf.stack([shp[2], shp[3], num_anchors, 4])) # fHxfWxNAx4\n", " return label_logits, box_logits" ] }, { "cell_type": "markdown", "id": "341325e3", "metadata": {}, "source": [ ":::{seealso}\n", "The complete version of Faster R-CNN can be found [here](https://github.com/tensorpack/tensorpack/tree/master/examples/FasterRCNN).\n", ":::" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Mask R-CNN\n", "\n", "Mask R-CNN is a deep neural network aimed to solve both object detection and instance segmentation problems. In other words, it can separate different objects in an image. You give it an image, it gives you the object bounding boxes, classes and masks. Mask R-CNN was developed on top of Faster R-CNN, a Region-Based Convolutional Neural Network. It is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps.\n", "\n", ":::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/deep-learning/objdet/04_mask_rcnn.png\n", "---\n", "name: 'The structure of Mask RCNN'\n", "width: 90%\n", "---\n", "The structure of Mask RCNN {cite}`mask_structure`\n", ":::\n", "\n", ":::{note}\n", "We can see the paper [here](https://arxiv.org/pdf/1703.06870v3.pdf).\n", ":::" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Code\n", "\n", "The key parts in Mask RCNN are:\n", "\n", "- RoIAlign" ] }, { "cell_type": "code", "execution_count": 3, "id": "9bb5d3b8", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [], "source": [ "import tensorflow as tf\n", "from tensorpack.tfutils.scope_utils import under_name_scope\n", "\n", "def crop_and_resize(image, boxes, box_ind, crop_size, pad_border=True):\n", " \"\"\"\n", " Aligned version of tf.image.crop_and_resize, following our definition of floating point boxes.\n", "\n", " Args:\n", " image: NCHW\n", " boxes: nx4, x1y1x2y2\n", " box_ind: (n,)\n", " crop_size (int):\n", " Returns:\n", " n,C,size,size\n", " \"\"\"\n", " assert isinstance(crop_size, int), crop_size\n", " boxes = tf.stop_gradient(boxes)\n", "\n", " # TF's crop_and_resize produces zeros on border\n", " if pad_border:\n", " # this can be quite slow\n", " image = tf.pad(image, [[0, 0], [0, 0], [1, 1], [1, 1]], mode='SYMMETRIC')\n", " boxes = boxes + 1\n", "\n", " @under_name_scope()\n", " def transform_fpcoor_for_tf(boxes, image_shape, crop_shape):\n", " \"\"\"\n", " The way tf.image.crop_and_resize works (with normalized box):\n", " Initial point (the value of output[0]): x0_box * (W_img - 1)\n", " Spacing: w_box * (W_img - 1) / (W_crop - 1)\n", " Use the above grid to bilinear sample.\n", "\n", " However, what we want is (with fpcoor box):\n", " Spacing: w_box / W_crop\n", " Initial point: x0_box + spacing/2 - 0.5\n", " (-0.5 because bilinear sample (in my definition) assumes floating point coordinate\n", " (0.0, 0.0) is the same as pixel value (0, 0))\n", "\n", " This function transform fpcoor boxes to a format to be used by tf.image.crop_and_resize\n", "\n", " Returns:\n", " y1x1y2x2\n", " \"\"\"\n", " x0, y0, x1, y1 = tf.split(boxes, 4, axis=1)\n", "\n", " spacing_w = (x1 - x0) / tf.cast(crop_shape[1], tf.float32)\n", " spacing_h = (y1 - y0) / tf.cast(crop_shape[0], tf.float32)\n", "\n", " imshape = [tf.cast(image_shape[0] - 1, tf.float32), tf.cast(image_shape[1] - 1, tf.float32)]\n", " nx0 = (x0 + spacing_w / 2 - 0.5) / imshape[1]\n", " ny0 = (y0 + spacing_h / 2 - 0.5) / imshape[0]\n", "\n", " nw = spacing_w * tf.cast(crop_shape[1] - 1, tf.float32) / imshape[1]\n", " nh = spacing_h * tf.cast(crop_shape[0] - 1, tf.float32) / imshape[0]\n", "\n", " return tf.concat([ny0, nx0, ny0 + nh, nx0 + nw], axis=1)\n", "\n", " image_shape = tf.shape(image)[2:]\n", "\n", " boxes = transform_fpcoor_for_tf(boxes, image_shape, [crop_size, crop_size])\n", " image = tf.transpose(image, [0, 2, 3, 1]) # nhwc\n", " ret = tf.image.crop_and_resize(\n", " image, boxes, tf.cast(box_ind, tf.int32),\n", " crop_size=[crop_size, crop_size])\n", " ret = tf.transpose(ret, [0, 3, 1, 2]) # ncss\n", " return ret\n", "\n", "\n", "def roi_align(featuremap, boxes, resolution):\n", " \"\"\"\n", " Args:\n", " featuremap: 1xCxHxW\n", " boxes: Nx4 floatbox\n", " resolution: output spatial resolution\n", "\n", " Returns:\n", " NxCx res x res\n", " \"\"\"\n", " # sample 4 locations per roi bin\n", " ret = crop_and_resize(\n", " featuremap, boxes,\n", " tf.zeros(tf.shape(boxes)[0], dtype=tf.int32),\n", " resolution * 2)\n", " try:\n", " avgpool = tf.nn.avg_pool2d\n", " except AttributeError:\n", " avgpool = tf.nn.avg_pool\n", " ret = avgpool(ret, [1, 1, 2, 2], [1, 1, 2, 2], padding='SAME', data_format='NCHW')\n", " return ret" ] }, { "cell_type": "markdown", "id": "dd525756", "metadata": {}, "source": [ "- Mask" ] }, { "cell_type": "code", "execution_count": 5, "id": "45612c0d", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [], "source": [ "import tensorflow as tf\n", "from tensorpack import tfv1\n", "from tensorpack.models import Conv2D, Conv2DTranspose,layer_register\n", "from tensorpack.tfutils.argscope import argscope\n", "\n", "def maskrcnn_upXconv_head(feature, num_category, num_convs, norm=None):\n", " \"\"\"\n", " Args:\n", " feature (NxCx s x s): size is 7 in C4 models and 14 in FPN models.\n", " num_category(int):\n", " num_convs (int): number of convolution layers\n", " norm (str or None): either None or 'GN'\n", "\n", " Returns:\n", " mask_logits (N x num_category x 2s x 2s):\n", " \"\"\"\n", " assert norm in [None, 'GN'], norm\n", " l = feature\n", " with argscope([Conv2D, Conv2DTranspose], data_format='channels_first',\n", " kernel_initializer=tfv1.variance_scaling_initializer(\n", " scale=2.0, mode='fan_out',\n", " distribution='untruncated_normal')):\n", " # c2's MSRAFill is fan_out\n", " for k in range(num_convs):\n", " l = Conv2D('fcn{}'.format(k), l, 256, 3, activation=tf.nn.relu)\n", " if norm is not None:\n", " l = GroupNorm('gn{}'.format(k), l)\n", " l = Conv2DTranspose('deconv', l, 256, 2, strides=2, activation=tf.nn.relu)\n", " l = Conv2D('conv', l, num_category, 1, kernel_initializer=tf.random_normal_initializer(stddev=0.001))\n", " return l\n", "\n", "@layer_register(log_shape=True)\n", "def GroupNorm(x, group=32, gamma_initializer=tf.constant_initializer(1.)):\n", " \"\"\"\n", " More code that reproduces the paper can be found at https://github.com/ppwwyyxx/GroupNorm-reproduce/.\n", " \"\"\"\n", " shape = x.get_shape().as_list()\n", " ndims = len(shape)\n", " assert ndims == 4, shape\n", " chan = shape[1]\n", " assert chan % group == 0, chan\n", " group_size = chan // group\n", "\n", " orig_shape = tf.shape(x)\n", " h, w = orig_shape[2], orig_shape[3]\n", "\n", " x = tf.reshape(x, tf.stack([-1, group, group_size, h, w]))\n", "\n", " mean, var = tf.nn.moments(x, [2, 3, 4], keep_dims=True)\n", "\n", " new_shape = [1, group, group_size, 1, 1]\n", "\n", " beta = tf.get_variable('beta', [chan], initializer=tf.constant_initializer())\n", " beta = tf.reshape(beta, new_shape)\n", "\n", " gamma = tf.get_variable('gamma', [chan], initializer=gamma_initializer)\n", " gamma = tf.reshape(gamma, new_shape)\n", "\n", " out = tf.nn.batch_normalization(x, mean, var, beta, gamma, 1e-5, name='output')\n", " return tf.reshape(out, orig_shape, name='output')" ] }, { "cell_type": "markdown", "id": "9eba4748", "metadata": {}, "source": [ "### FCOS\n", "\n", "FCOS (Fully Convolutional One-Stage Object Detection) is a one-stage object detection algorithm that uses a fully convolutional architecture to detect objects. It is a simple and effective method for object detection that does not require region proposal networks (RPNs) or anchor boxes.\n", "\n", "\n", "\n", ":::{note}\n", "We can see the paper [here](https://arxiv.org/pdf/1904.01355v5.pdf).\n", ":::" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Code\n", "\n", "The key point in FCOS is Center-ness, which is a novel way to calculate target: $centerness = \\sqrt{\\frac{min(l,r)}{max(l,r)} \\frac{min(t,b)}{max(t,b)}}$\n", "\n", ":::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/deep-learning/objdet/06_centerness.png\n", "---\n", "name: 'Illustration of centerness'\n", "width: 90%\n", "---\n", "Illustration of centerness {cite}`fcos_structure`\n", ":::" ] }, { "cell_type": "code", "execution_count": 6, "id": "b44ac0a8", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [], "source": [ "import tensorflow as tf\n", "\n", "class CenternessNet(tf.keras.layers.Layer):\n", " def __init__(self, n_anchor, use_bias = None, concat = False, convolution = tf.keras.layers.Conv2D, normalize = None, activation = tf.keras.activations.sigmoid, **kwargs):\n", " super(CenternessNet, self).__init__(**kwargs)\n", " self.n_anchor = n_anchor\n", " self.use_bias = (normalize is None) if use_bias is None else use_bias\n", " self.concat = concat\n", " self.activation = activation\n", " self.convolution = convolution\n", " self.normalize = normalize\n", "\n", " def build(self, input_shape):\n", " if not isinstance(input_shape, (tuple, list)):\n", " input_shape = [input_shape]\n", " \n", " self.layers = [self.convolution(self.n_anchor, 3, padding = \"same\", use_bias = self.use_bias, name = \"head\")]\n", " if self.normalize is not None:\n", " self.layers.append(self.normalize(name = \"norm\"))\n", " self.layers.append(tf.keras.layers.Reshape([-1, 1], name = \"reshape\"))\n", " if self.concat and 1 < len(input_shape):\n", " self.post = tf.keras.layers.Concatenate(axis = -2, name = \"logits_concat\")\n", " self.act = tf.keras.layers.Activation(self.activation, dtype = tf.float32, name = \"logits\")\n", "\n", " def call(self, inputs):\n", " if not isinstance(inputs, (tuple, list)):\n", " inputs = [inputs]\n", " out = []\n", " for x in inputs:\n", " for l in self.layers:\n", " x = l(x)\n", " out.append(x)\n", " if len(out) == 1:\n", " out = out[0]\n", " elif self.concat:\n", " out = self.post(out)\n", " if isinstance(out, (tuple, list)):\n", " out = [self.act(o) for o in out]\n", " else:\n", " out = self.act(out)\n", " return out" ] }, { "cell_type": "markdown", "id": "ad0ac0ae", "metadata": {}, "source": [ "### DETR\n", "\n", "Similar to image classification, Transformer is also used in object detection task, and the classical one is DEtection TRansformer(DETR).\n", "\n", "\n", "\n", ":::{note}\n", "We can see the paper [here](https://arxiv.org/pdf/2005.12872v3.pdf).\n", ":::" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Code" ] }, { "cell_type": "code", "execution_count": 7, "id": "53deb9a2", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [], "source": [ "import tensorflow as tf\n", "from keras.layers import Dense\n", "from keras.layers import Conv2D\n", "from keras.layers import Embedding\n", "from typing import Dict\n", "\n", "\n", "class DETR(tf.keras.Model):\n", " \"\"\" This is the DETR module that performs object detection \"\"\"\n", " def __init__(self, \n", " backbone: tf.keras.Model, \n", " transformer: tf.keras.Model, \n", " num_classes: int, \n", " num_queries: int, \n", " aux_loss: bool = False, \n", " **kwargs):\n", " \n", " super(DETR, self).__init__(**kwargs)\n", " self.num_queries = num_queries\n", " self.transformer = transformer\n", " hidden_dim = transformer.d_model\n", " self.class_embed = Dense(num_classes+1, name='class_embed')\n", " self.bbox_embed = MLP(hidden_dim, 4, 3, name='bbox_embed')\n", " self.query_embed = Embedding(num_queries, hidden_dim, name='query_embed')\n", " self.query_embed.build((num_queries, hidden_dim))\n", " self.input_proj = Conv2D(hidden_dim, 1, name='input_proj')\n", " self.backbone = backbone\n", " self.aux_loss = aux_loss\n", "\n", " def call(self, samples: Dict):\n", " features, pos = self.backbone(samples)\n", " src, mask = features[-1][1]['img'], features[-1][1]['mask']\n", " assert mask is not None\n", " hs = self.transformer(self.input_proj(src), mask, self.query_embed.weights[0], pos[-1][1])\n", "\n", "class MLP(tf.keras.layers.Layer):\n", " def __init__(self, \n", " hidden_dim: int, \n", " output_dim: int, \n", " num_layers: int, \n", " **kwargs):\n", " \n", " self(MLP, self).__init__(**kwargs)\n", " self.num_layers = num_layers\n", " h = [hidden_dim] * (num_layers - 1)\n", " self.layers = [Dense(k) for k in h+[output_dim]]\n", "\n", " def call(self, x):\n", " for i, layer in enumerate(self.layers):\n", " x = tf.nn.relu(layer(x)) if i < self.num_layers - 1 else layer(x)\n", " return x" ] }, { "cell_type": "markdown", "id": "f27d5dfa", "metadata": {}, "source": [ ":::{seealso}\n", "The complete version of DETR can be found [here](https://github.com/PaperCodeReview/DETR-TF).\n", ":::" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Your turn! 🚀\n", "\n", "Assignment - [Car object detection](../assignments/deep-learning/object-detection/car-object-detection.ipynb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Acknowledgments\n", "\n", "Thanks to [Tensorpack](https://github.com/tensorpack) for creating the open-source project [tensorpack](https://github.com/tensorpack/tensorpack), [Hyungjin Kim](https://github.com/Burf) for creating the open-source project [TFDetection](https://github.com/Burf/TFDetection) and [PaperCodeReview](https://github.com/PaperCodeReview) for creating the open-source project [DETR-TF](https://github.com/PaperCodeReview/DETR-TF). They inspire the majority of the content in this chapter.\n", "\n", "

\n", "\n", "A demo of object detection. [source]\n", "

\n", "\n", "---" ] }, { "cell_type": "markdown", "id": "55d876c5", "metadata": {}, "source": [ ":::{bibliography}\n", ":filter: docname in docnames\n", ":::" ] } ], "metadata": { "kernelspec": { "display_name": "py39", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.18" } }, "nbformat": 4, "nbformat_minor": 5 }