{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# My Lecture Notes on the [CNN Course](https://www.coursera.org/learn/convolutional-neural-networks/)\n", "A short overview of what I learned in the CNN course of the [Deep Learning specialization](deeplearning.net) on Coursera, taught by Andrew Ng." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Simple Convolution Matrices\n", "Vertical edge detection kernel:\n", "$$\n", "K =\n", " \\left[ {\\begin{array}{ccc}\n", " -1 & 0 & 1 \\\\\n", " -1 & 0 & 1 \\\\\n", " -1 & 0 & 1 \\\\\n", " \\end{array} } \\right]\n", "$$\n", "\n", "#### In Python" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# TODO: apply the convolution kernels to an image and show the result" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## CNN Techniques\n", "### Padding \n", "_Adds zeros around the image to avoid shrinking it and losing information at the edges._\n", "For a \"same\" convolution (output size equals input size, stride 1) the padding is `p = (f - 1) / 2`, where `f` is the (odd) filter size.\n", "\n", "### Strided convolution\n", "**params:**\n", "\n", "- $f^{[l]}$ - filter size\n", "- $p^{[l]}$ - padding\n", "- $s^{[l]}$ - stride\n", "- $n_{c}^{[l]}$ - number of filters (output channels)\n", "\n", "**input**: \n", "$n_{h}^{[l-1]} \\times n_{w}^{[l-1]} \\times n_{c}^{[l-1]}$. \n", "\n", "**output**: \n", "$n_{h}^{[l]} \\times n_{w}^{[l]} \\times n_{c}^{[l]}$, where each spatial dimension (height or width) is\n", "\n", "$n^{[l]} = \\left\\lfloor \\frac{n^{[l-1]} + 2p^{[l]} - f^{[l]}}{s^{[l]}} \\right\\rfloor + 1$\n", "\n", "**each filter has shape:**\n", "$f^{[l]} \\times f^{[l]} \\times n_{c}^{[l-1]}$\n", "\n", "**activations:**\n", "$a^{[l]} \\rightarrow n_{h}^{[l]} \\times n_{w}^{[l]} \\times n_{c}^{[l]}$.\n", "\n", "### Pooling layer\n", "Either max pooling or average pooling. It splits the matrix into regions, takes the max/avg value of each region, and stores it in a new matrix.\n", "\n", "**params:**\n", "- $f$ - size of the region\n", "- $s$ - stride\n", "\n", "**properties**\n", "- has no parameters to learn\n", "- common usage pattern: \n", "\n", "`conv -> pool -> conv -> pool -> … -> fc -> ... 
-> fc -> softmax.`\n", "\n", "- the activation size tends to shrink gradually while the number of channels increases\n", "- although it has no learnable parameters, pooling still affects backpropagation, because gradients flow through it\n", "\n", "\n", "## CNN Backpropagation\n", "It is not covered in the course lectures, but there is a quick overview in the exercises.\n", "\n", "## CNN Features\n", "- very good at capturing local (area) features\n", "- parameter sharing - a feature detector that’s useful in one part of the image is probably useful in another part of the image\n", "- sparsity of connections - in each layer, each output value depends only on a small number of inputs (far fewer than in a fully connected layer)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# How to Use Available (Third-Party) Nets\n", "## Classic Nets (LeNet-5, AlexNet, VGG)\n", "- good to start with AlexNet, because it is much easier to understand.\n", "- **LeNet-5** 1998\n", "  - articles: http://yann.lecun.com/exdb/lenet/\n", "  - http://deeplearning.net/tutorial/lenet.html\n", "- **AlexNet** 2012\n", "  - article: http://vision.stanford.edu/teaching/cs231b_spring1415/slides/alexnet_tugce_kyunghee.pdf ImageNet Classification with Deep Convolutional Neural Networks Alex Krizhevsky, Ilya Sutskever, Geoffrey E. 
Hinton\n", "  - https://en.wikipedia.org/wiki/AlexNet \n", "  - contains only 8 layers: the first 5 are convolutional layers, followed by 3 fully connected layers\n", "- **VGG** 2014\n", "  - https://arxiv.org/abs/1409.1556 Very Deep Convolutional Networks for Large-Scale Image Recognition Karen Simonyan, Andrew Zisserman\n", "  - uses only very small 3x3 convolution filters\n", "  - 16-19 layers\n", "\n", "## ResNet (152 layers), 2015\n", "- article: https://arxiv.org/abs/1512.03385 _Deep Residual Learning for Image Recognition Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun_\n", "- **residual block** (\"shortcut\" or \"skip connection\")\n", "  - in a \"plain\" network (without residual blocks), adding layers beyond a certain depth should in theory decrease the training error, but in practice it increases it, because the gradients vanish (or, in rare cases, grow exponentially and \"explode\" to very large values). \n", "  - ResBlocks allow the gradient to be backpropagated directly to earlier layers, which solves this problem of the “plain” network\n", "  - ResBlocks make it very easy for one of the blocks to learn an identity function. 
Thus we can stack on additional ResNet blocks with little risk of harming training set performance\n", "  - if the input and output volumes have different dimensions, we use a CONV2D layer in the “shortcut” path to resize the input \n", "- we can apply batch norm along the channel axis after each CONV2D layer to speed up training.\n", " \n", "## Inception Network, 2014\n", "- article: https://arxiv.org/abs/1409.4842 _Going Deeper with Convolutions Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich_\n", "- **features**:\n", "  - in one layer apply multiple filters in parallel (1x1, 3x3, 5x5, max-pool) and concatenate the outputs along the channel axis => 28x28x(64+128+32+32)\n", "  - bottleneck layer - use a 1x1 convolution to shrink the number of channels before applying a big batch of expensive filters (example: 28x28x192 input, apply 32 filters of 5x5).\n", "  - side branches with a few fully connected layers and a softmax, which help check that the intermediate layers are on the right track\n", "  - also known as GoogLeNet\n", "\n", "- **1x1 convolution** features:\n", "  - computes a weighted sum over the channels at each position, once for each of the n filters\n", "  - can shrink the number of channels \n", " \n", "## Transfer Learning\n", " - get a NN + weights from the internet\n", " - replace its last layer(s) with your own softmax layer\n", " - freeze all layers (or only some of them, if you have a big data set) except the softmax layer\n", " - precompute the activations of the last frozen layer -- pass X through all frozen layers and save the result to disk (this avoids passing through all those layers multiple times)\n", " - train\n", "\n", "## Data Augmentation\n", " - should be implemented on the fly and in parallel\n", " - mirroring\n", " - random cropping\n", " - rotation\n", " - shearing\n", " - local warping\n", " - color shifting\n", " - PCA color augmentation http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf\n", " - Kaggle Galaxy Zoo challenge http://benanne.github.io/2014/04/05/galaxy-zoo.html\n", " - examples 
https://github.com/facebook/fb.resnet.torch/blob/e8fb31378fd8dc188836cf1a7c62b609eb4fd50a/datasets/transforms.lua\n", "\n", "## Ensembling \n", " - train several networks independently and average their outputs ($\\hat{y}$)\n", " - popular in competitions but not so useful in production - _because it costs a lot of compute for a small improvement_\n", "\n", "## Multi-crop at test time \n", " - run the classifier (predict) on 10 crops of the image (center, top-left, top-right, …, mirrored, etc.) and then average the outputs ($\\hat{y}$)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Object Classification < Object Localization < Object Detection\n", "Landmark detection\n", "\n", "- **YOLO (You Only Look Once) / object detection**\n", "  - Redmon et al., 2016 (https://arxiv.org/abs/1506.02640) and Redmon and Farhadi, 2016 (https://arxiv.org/abs/1612.08242).\n", "  - split the image into an n x n grid\n", "  - in each cell predict an object (x, y, width, height, class)\n", "\n", "- **Validation - Intersection over Union (IoU)**\n", "  - IoU = area of intersection / area of union of the predicted and ground-truth boxes. \n", "  - IoU >= 0.5 is conventionally counted as a correct detection\n", "\n", "- **Non-max Suppression**\n", "  - pick the box with the highest probability p\n", "  - discard the other boxes whose IoU with it is > 0.5\n", "  - repeat with the remaining boxes\n", "\n", "- **Anchor Boxes**\n", "  - purpose - overlapping objects\n", "  - problem - one cell could contain more than one object\n", "  - make a prediction for each anchor box (and store several anchor boxes in one y - “output”) \n", "  - assign each object to the anchor box with the highest IoU\n", "  - _TODO: what about the case when there are several pedestrians in a single cell? They should have similar anchor boxes. And why don’t we store the anchor id in y? 
That way we could have several objects with similar anchors._\n", " \n", "- **Alternative**\n", "  - **Region Proposals (R-CNN)**\n", "    - article: https://arxiv.org/abs/1311.2524\n", "    - segment the image into regions\n", "    - try to classify these regions (label + bounding box)\n", "    - classifies one region at a time\n", "  - **Fast R-CNN**\n", "    - article: https://arxiv.org/abs/1504.08083\n", "    - uses a CNN to classify all regions at once \n", "  - **Faster R-CNN**\n", "    - article: https://arxiv.org/abs/1506.01497\n", "    - uses a CNN to propose regions \n", "    - but it's still slower than YOLO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Face recognition\n", "\n", "- **Problem:** _One Shot Learning_:\n", "  - we have an extendable database of persons with very few pictures each (maybe just one per person)\n", "  - we can't use a NN with n (number of persons) outputs because:\n", "    - we would have to re-train the NN every time a person joins or leaves\n", "    - there is far too little data to train a good NN\n", "\n", "- **Solution:**\n", "\n", "train a function that measures how different two face images are: \n", "$$\n", "d(face\\_image^{(i)}, face\\_image^{(j)}).\n", "$$\n", "\n", "- **Siamese Network**\n", "  - $f(image^{(i)})$ - encoding of the picture by the NN as a vector of 128 numbers (the last layer).\n", "  - $d(image^{(i)}, image^{(j)}) = ||f(image^{(i)}) - f(image^{(j)})||_{2}^{2}$ - distance between $image^{(i)}$ and $image^{(j)}$, small for the same person and large for different persons\n", "  - article: https://www.cs.toronto.edu/~ranzato/publications/taigman_cvpr14.pdf DeepFace: Closing the Gap to Human-Level Performance in Face\n", "- **Triplet Loss**\n", "\n", "  - takes anchor (A), positive (P) and negative (N) images; we want\n", "$$\n", "||f(A) - f(P)||^{2} \\leq ||f(A) - f(N)||^{2}\n", "$$\n", "  - to exclude the trivial solution where all encodings are equal (so both distances are 0), we add a small margin ($\\alpha$):\n", "$$\n", "||f(A) - f(P)||^{2} + \\alpha \\leq ||f(A) - f(N)||^{2}\n", "$$\n", "  - _ME: why not just replace $\\le$ with 
$<$?_\n", " \n", "- **Loss function:**\n", "$$\n", "L(A,P,N) = \\max(||f(A) - f(P)||^{2} - ||f(A) - f(N)||^{2} + \\alpha, 0)\n", "$$\n", "  - Training: \n", "    - we need several images per person to build triplets (A,P,N) for training; a single image per person is not enough\n", "    - random triplets (A,P,N) are not enough, because the constraint is then very easy to satisfy. Choose triplets that are hard to train on, i.e. where $d(A,P) \\approx d(A,N)$.\n", "  - Prediction: \n", "    - a single image per person can be enough.\n", "    - precompute the encodings f(x) for the known persons in the database\n", "  - FaceNet: A Unified Embedding for Face Recognition and Clustering Florian Schroff, Dmitry Kalenichenko, James Philbin https://arxiv.org/abs/1503.03832\n", "  - Binary classification for face verification\n", "    - add one extra last layer that acts as a binary classifier:\n", "    $\\hat{y} = \\sigma\\left(\\sum_{k=1}^{128} w_{k} |f(x^{(i)})_{k} - f(x^{(j)})_{k}| + b\\right)$\n", "\n", "### Other applications\n", "We can use this approach to check the similarity of other things - for example photos of artists, tweets, etc.\n", "\n", "https://keras.io/getting-started/functional-api-guide/#shared-layers \n", "loss function based on: \n", "```\n", "p(tweet_1 and tweet_2 belong to one person) = log_regression(f(tweet_1) + f(tweet_2))\n", "log_regression == Dense(1, activation=\"sigmoid\")(merged_vector)\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Neural Style Transfer (NST)\n", "## To Understand NST, Look at Visualizations of CNN Layers\n", "- article: https://arxiv.org/abs/1311.2901 Visualizing and Understanding Convolutional Networks Matthew D Zeiler, Rob Fergus\n", "- pick a unit in layer 1 of a deep CNN and find the nine image patches that maximize the unit’s activation\n", "_ME: How should we search for these images? 
brute force?_\n", "- repeat for other units\n", "- repeat for other layers\n", "- this way we find that each channel of each layer focuses on some subset of features, such as a circle in the middle, cats, people, etc.\n", "\n", "\n", "## NST algorithm\n", "- `Content (C) + Style (S) => Generated image (G)`\n", "- Start with a random image G.\n", " For example: G of shape 100x100x3\n", "\n", "- Use gradient descent to minimize J(G) (here $\\alpha$ is the learning rate, not the content weight used in the cost function)\n", "\n", "$$\n", "G := G - \\alpha \\frac{\\partial J(G)}{\\partial G}\n", "$$\n", "\n", "- Cost function \n", "$$\n", "J(G) = \\alpha \\cdot J_{content}(C, G) + \\beta \\cdot J_{style}(S, G)\n", "$$\n", "\n", "\n", "### Content cost\n", "\n", "Use a hidden layer `l` to compute the content cost. `l` should be somewhere in the middle of the network (too shallow - \"pixel-2-pixel\" match, too deep - only \"feature somewhere\" in the image).\n", "\n", "$$\n", "J_{content}(C, G) = \\frac{1}{2}||a^{[l](C)} - a^{[l](G)}||^{2}\n", "$$\n", "\n", "i.e. the squared difference between the layer-`l` activations of C and G\n", "\n", "### Style cost\n", "\n", "Use the style matrix -- the correlation (strictly, the unnormalized cross-covariance) between different channels of one layer\n", "\n", "$$\n", "G^{[l]}_{kk'} = \\sum_{i=1}^{n^{[l]}_{h}} \\sum_{j=1}^{n^{[l]}_{w}} a^{[l]}_{ijk}a^{[l]}_{ijk'}\n", "$$\n", "\n", "For one layer:\n", "\n", "$$\n", "J^{[l]}_{style}(S,G) = \\frac{1}{(2n^{[l]}_{h}n^{[l]}_{w}n^{[l]}_{c})^{2}} ||G^{[l](S)} - G^{[l](G)}||^{2}_{F} = \\frac{1}{(2n^{[l]}_{h}n^{[l]}_{w}n^{[l]}_{c})^{2}} \\sum_{k} \\sum_{k'} \\left(G^{[l](S)}_{kk'} - G^{[l](G)}_{kk'}\\right)^{2}\n", "$$\n", "\n", "We get better results by summing the style cost over several layers:\n", "\n", "$$\n", "J_{style}(S,G) = \\sum_{l} \\lambda^{[l]} J^{[l]}_{style}(S,G)\n", "$$\n", "\n", "- the diagonal elements $G_{ii}$ show how common feature i is.\n", "- https://arxiv.org/abs/1508.06576 A Neural Algorithm of Artistic Style Leon A. Gatys, Alexander S. 
Ecker, Matthias Bethge" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 2 }