{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from IPython.display import HTML\n", "style = \"\"\"\n", "\n", "\"\"\"\n", "HTML(style)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# GANs\n", "\n", "## What are they, what makes them work, and what is their future.\n", "\n", "**Seth Weidman, ODSC East 2018**\n", "\n", "May 3, 2018" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Agenda\n", "\n", "### 1. What are GANs and how do they work?\n", "1. Neural Networks Review\n", "2. How GANs work\n", "3. GAN architectures and results\n", "4. GAN \"tricks\": deep convolutional architectures and batch normalization" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### 2. What are the latest and greatest cutting edge results?\n", "1. Pose Generation\n", "2. Semi Supervised Learning\n", "3. Progressive GANs\n", "\n", "Special topic: Inception Score for scoring GANs" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### 3. What is their future of GANs and Deep Learning in general?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# What are GANs?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "You may not know what a GAN is: when the conversation turns to GANs, you may feel like [Homer Simpson](https://www.youtube.com/watch?v=PGLzm-Gy0dQ)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "\"GAN\" stands for \"Generative Adversarial Network\". " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "They are a method of training neural networks to generate images similar to those in the data the neural network is trained on. This training is done via an adversarial process." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Basic example (Goodfellow et. al., 2014)\n", "\n", "![](img/mnist_gan_8s.png)\n", "\n", "Not digits written by a human. Generated by a neural network." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Cutting edge example: (NVIDIA Research, October 2017)\n", "\n", "![](img/progressive_gan_example.png)\n", "\n", "Not real people: images generated by a neural network." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> \"[GANs], and the variations that are now being proposed, are the most interesting idea in the last 10 years in ML [machine learning], in my opinion.\" \n", "\n", "-- Yann LeCun, Director of AI Research at Facebook, [in 2016 on Quora](https://www.quora.com/What-are-some-recent-and-potentially-upcoming-breakthroughs-in-deep-learning/answer/Yann-LeCun)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## What are neural networks?\n", "\n", "### Neural network review" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We've all seen diagrams like this when trying to understand neural nets:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "![](img/neural_network_diagram.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "But what are they _really_? There are many different ways of explaining what a neural net is. _Mathematically_, they are:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* **Nested functions** (like $f(g(x))$)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* **Universal function approximators** (if we nest enough of them, we can approximate any function, no matter how complex)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* **Differentiable** (this allows us to \"train\" them to actually accomplish things)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "This means that you can think of a neural net as being a mathematical function that takes in:\n", "\n", "* An input image (or batch) (that we'll call $X$)\n", "* Several weight matrices (that we'll denote $W$)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The net itself is just some big differentiable function $N$. Like $N(X, W) = X^3 + 3W^2 - 5$, but way more complicated." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Every time we feed a set of inputs and weights through this network, we get a **\"prediction vector\" $P$**; we compare this prediction vector to the **actual vector of correct responses $Y$** to get a **loss $L$**." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "These facts mean we can train a neural network using the following procedure:\n", "\n", "1. Feed a bunch of data points through the neural network.\n", "2. Compute the loss $L$ - how much the network \"missed\" by on these points.\n", "3. Compute for every single weight $w$ in the network: $$ \\frac{\\partial L}{\\partial w} $$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "And then we can update the weights according to the equation:\n", "\n", "$$ w = w - \\alpha \\frac{\\partial L}{\\partial w} $$\n", "\n", "(Here, $\\alpha$ is a small \"learning rate\"; the many variants of gradient descent modify this update in different ways.)"
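] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "In PyTorch, this whole procedure fits in a few lines. Here is a minimal sketch - the network, data, and learning rate are toy stand-ins, not code from any particular paper:\n", "\n", "```python\n", "import torch\n", "from torch import nn\n", "\n", "# A tiny stand-in network N(X, W); the weights W live inside the layers\n", "net = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))\n", "loss_fn = nn.CrossEntropyLoss()\n", "alpha = 0.01  # learning rate\n", "\n", "X = torch.randn(64, 784)         # 1. a batch of data points...\n", "Y = torch.randint(0, 10, (64,))  #    ...and their correct responses\n", "P = net(X)                       #    fed through the network\n", "L = loss_fn(P, Y)                # 2. compute the loss L\n", "L.backward()                     # 3. compute dL/dw for every weight w\n", "\n", "with torch.no_grad():            # 4. update: w = w - alpha * dL/dw\n", "    for w in net.parameters():\n", "        w -= alpha * w.grad\n", "        w.grad.zero_()\n", "```"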
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**In addition**, differentiability means we can compute, for every _pixel_ $x$ in the input image:\n", "\n", "$$ \\frac{\\partial L}{\\partial x} $$\n", "\n", "In other words, how much the loss would change if this pixel in the _input_ image changed." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "(_This_ turns out to be the key fact that allows GANs to work.)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# How were GANs invented?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "![](img/ian_goodfellow_beer.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "In 2013, Ian Goodfellow (the inventor of GANs, then a grad student at the University of Montreal) and Yoshua Bengio (one of the world's leading researchers on neural networks) were preparing to run a speech synthesis contest." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Their idea was to have a \"discriminator network\" that could listen to artificially generated speech and decide whether it was real or not. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "They decided not to run the contest, concluding that entrants would simply game the system by generating examples that fooled _this particular_ discriminator network, rather than trying to produce _generally_ good speech." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Then, one night in a bar, Goodfellow asked the question: **could this be fixed by letting the _discriminator network_ itself learn?**\n", "\n", "This led him to develop what ultimately became the GAN framework. Let's dive in and see how it works:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### How GANs work\n", "\n", "## What he came up with" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Part 1" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "First: randomly generate a feature vector, and feed it through a randomly initialized neural network to produce an output image." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "$$ \\begin{bmatrix}z_1 \\\\\n", " z_2 \\\\\n", " \\vdots \\\\\n", " z_{100}\n", " \\end{bmatrix} $$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "![](img/gan_1.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Denote the matrix of pixels in this image - generated by the first neural network - $X$."
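] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "In code, this first step might look like the following sketch (the generator here is just a stand-in for any network that maps 100 random numbers to an image):\n", "\n", "```python\n", "import torch\n", "from torch import nn\n", "\n", "# A stand-in generator: 100 noise values in, a 28x28 image out\n", "generator = nn.Sequential(nn.Linear(100, 784), nn.Tanh())\n", "\n", "z = torch.randn(1, 100)              # the random feature vector z_1..z_100\n", "X = generator(z).view(1, 1, 28, 28)  # X: the generated matrix of pixels\n", "```"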
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Then, feed this image (matrix of pixels $X$) into a second network and get a prediction:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "![](img/gan_2.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Compare this prediction to the correct answer - the image is fake - to get a loss $L$, and use this loss to train this second network, called the \"**discriminator**\". " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Critically, also compute $$ \\frac{\\partial L}{\\partial X} $$ - how much each of the _generated pixels_ affects the loss." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Then, update the first network, called the **generator**, with $$ -\\frac{\\partial L}{\\partial X} $$\n", "\n", "The sign is negative because we want the generator to keep making the discriminator _more_ likely to say that the images it generates are real." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "![](img/gan_3.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Finally, generate a _new_ random noise vector $Z$, and repeat the process, so that the generator will learn to turn _any_ random noise vector into an image that the discriminator thinks is real." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### What's missing?\n", "\n", "This will train the generator to generate good fake images, but it will likely leave the discriminator a poor classifier, since we only ever gave it one of the two classes it is trying to classify - that is, only fake images, and no real images. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "So, we'll have to give it real images as well." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Part 2" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "![](img/gans_4.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Quote from the original paper on GANs:\n", "\n", "> \"The generative model can be thought of as analogous to a team of counterfeiters, trying to produce fake currency and use it without detection, while the discriminative model is analogous to the police, trying to detect the counterfeit currency. Competition in this game drives both teams to improve their methods until the counterfeits are indistinguishable from the genuine articles.\" \n", "\n", "-- Goodfellow et al., \"Generative Adversarial Networks\" (2014)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Let's code one up" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Let's check [one](DCGAN_Illustration_PyTorch) out!"
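] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Before jumping to that notebook, here is a minimal, self-contained sketch of the full two-player loop described above (the networks, optimizers, and `real_batches` iterable are illustrative stand-ins, not the notebook's actual code):\n", "\n", "```python\n", "import torch\n", "from torch import nn\n", "\n", "G = nn.Sequential(nn.Linear(100, 784), nn.Tanh())   # generator\n", "D = nn.Sequential(nn.Linear(784, 1), nn.Sigmoid())  # discriminator\n", "loss_fn = nn.BCELoss()\n", "opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)\n", "opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)\n", "\n", "for real in real_batches:  # assumed iterable of (64, 784) real-image batches\n", "    # Part 2: train the discriminator on real AND fake images\n", "    fake = G(torch.randn(64, 100))\n", "    d_loss = (loss_fn(D(real), torch.ones(64, 1))\n", "              + loss_fn(D(fake.detach()), torch.zeros(64, 1)))\n", "    opt_D.zero_grad()\n", "    d_loss.backward()\n", "    opt_D.step()\n", "\n", "    # Part 1: train the generator; gradients flow through D back to the\n", "    # generated pixels X, and from there into G's weights\n", "    g_loss = loss_fn(D(fake), torch.ones(64, 1))  # reward fooling D\n", "    opt_G.zero_grad()\n", "    g_loss.backward()\n", "    opt_G.step()\n", "```"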
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### GAN Architectures and Results\n", "\n", "# Famous architectures" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Code of the first GAN ever:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "This is the [Original GitHub repo with Ian Goodfellow's code](https://github.com/goodfeli/galatea/commit/d960968919b0856ba6753198a0e035228d7c03e6) that he used to generate MNIST digits back in 2014." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## DCGAN (January 2016)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "[This paper](https://arxiv.org/abs/1511.06434) introduced a couple of key concepts that pushed GANs forward:\n", "\n", "* A deep convolutional/deconvolutional architecture\n", "* The first application to GANs of batch normalization (invented earlier in 2015)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Crazy fact: the lead author of the DCGAN paper (Alec Radford) was still in college when it was published." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## DCGAN Results" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Which of these five bedrooms do you think are real?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "(they're all fake!)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Smooth transitions in the latent space (the 100-dimensional input to the generator):" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "![](img/dcgan_smooth_transition.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Filters learned by the last layer of the discriminator:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Arithmetic in the latent space:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "![](img/dcgan_arithmetic.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### GAN Tricks \n", "\n", "## DCGAN Trick #1: Deep Convolutional/Deconvolutional Architecture" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "### Discriminator\n", "\n", "![](img/dcgan_discriminator.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "```python\n", "import torch\n", "from torch import nn\n", "\n", "\n", "class Discriminator(torch.nn.Module):\n", "    # Maps a 1x64x64 image to a single probability that it is real\n", "\n", "    def __init__(self):\n", "        super(Discriminator, self).__init__()\n", "\n", "        self.conv1 = nn.Sequential(\n", "            nn.Conv2d(\n", "                in_channels=1, out_channels=128, kernel_size=4,\n", "                stride=2, padding=1, bias=False\n", "            ),\n", "            nn.LeakyReLU(0.2, inplace=True)\n", "        )\n", "        self.conv2 = nn.Sequential(\n", "            nn.Conv2d(\n",
" in_channels=128, out_channels=256, kernel_size=4,\n", " stride=2, padding=1, bias=False\n", " ),\n", " nn.BatchNorm2d(256),\n", " nn.LeakyReLU(0.2, inplace=True)\n", " )\n", " self.conv3 = nn.Sequential(\n", " nn.Conv2d(\n", " in_channels=256, out_channels=512, kernel_size=4,\n", " stride=2, padding=1, bias=False\n", " ),\n", " nn.BatchNorm2d(512),\n", " nn.LeakyReLU(0.2, inplace=True)\n", " )\n", " self.conv4 = nn.Sequential(\n", " nn.Conv2d(\n", " in_channels=512, out_channels=1024, kernel_size=4,\n", " stride=2, padding=1, bias=False\n", " ),\n", " nn.BatchNorm2d(1024),\n", " nn.LeakyReLU(0.2, inplace=True)\n", " )\n", " self.out = nn.Sequential(\n", " nn.Linear(1024*4*4, 1),\n", " nn.Sigmoid(),\n", " )\n", "\n", " def forward(self, x):\n", " # Convolutional layers\n", " x = self.conv1(x)\n", " x = self.conv2(x)\n", " x = self.conv3(x)\n", " x = self.conv4(x)\n", " # Flatten and apply sigmoid\n", " x = x.view(-1, 1024*4*4)\n", " x = self.out(x)\n", " return x\n", "``` " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Generator\n", "\n", "![](img/dcgan_generator.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "```python\n", "class Generator(torch.nn.Module):\n", " \n", " def __init__(self):\n", " super(Generator, self).__init__()\n", " \n", " self.linear = torch.nn.Linear(100, 1024*4*4)\n", " \n", " self.conv1 = nn.Sequential(\n", " nn.ConvTranspose2d(\n", " in_channels=1024, out_channels=512, kernel_size=4,\n", " stride=2, padding=1, bias=False\n", " ),\n", " nn.BatchNorm2d(512),\n", " nn.ReLU(inplace=True)\n", " )\n", " self.conv2 = nn.Sequential(\n", " nn.ConvTranspose2d(\n", " in_channels=512, out_channels=256, kernel_size=4,\n", " stride=2, padding=1, bias=False\n", " ),\n", " nn.BatchNorm2d(256),\n", " nn.ReLU(inplace=True)\n", " )\n", " self.conv3 = nn.Sequential(\n", " nn.ConvTranspose2d(\n", " in_channels=256, out_channels=128, kernel_size=4,\n", " stride=2, padding=1, bias=False\n", " ),\n", " nn.BatchNorm2d(128),\n", " nn.ReLU(inplace=True)\n", " )\n", " self.conv4 = nn.Sequential(\n", " nn.ConvTranspose2d(\n", " in_channels=128, out_channels=1, kernel_size=4,\n", " stride=2, padding=1, bias=False\n", " )\n", " )\n", " self.out = torch.nn.Tanh()\n", "\n", " def forward(self, x):\n", " # Project and reshape\n", " x = self.linear(x)\n", " x = x.view(x.shape[0], 1024, 4, 4)\n", " # Convolutional layers\n", " x = self.conv1(x)\n", " x = self.conv2(x)\n", " x = self.conv3(x)\n", " x = self.conv4(x)\n", " # Apply Tanh\n", " return self.out(x)\n", "``` " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**What's going on in these convolutions anyway?**" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Convolutions deep dive\n", "\n", "We've all seen diagrams like this in the context of convolutional neural nets:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "![](img/AlexNet_0.jpg)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "This is the famous [AlexNet](https://en.wikipedia.org/wiki/AlexNet) architecture." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "What's really going on here?\n", "\n", "Let's say we have an input layer of size $224 \\times 224 \\times 3$, as we do in the ImageNet dataset that AlexNet was trained on. This next layer seems to be $96$ deep. What does that mean?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Review of convolutions\n", "\n", "\"_Filters_\" are slid over images using the convolution operation. \n", "\n", "In theory, these filters can act as _feature detectors_, and the images that result from convolving these filters with the original image can be thought of as versions of that image in which the detected features have been \"highlighted.\"" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "\n", "\n", "* Blue = original image\n", "* Gray = convolutional \"filter\"\n", "* Green = output image" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "In practice, the neural network _learns_ filters that are useful for solving the particular problem it has been given." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We _typically_ visualize the _results_ of applying these filters to the images. However, in certain cases we visualize the filters themselves." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Let's return to the concrete example of the AlexNet architecture:\n", "\n", "For each of 96 _filters_, the following happens:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "For each of the 3 _input channels_ - usually **red, green, and blue** for color images - one of these _filters_, which happens to be of dimension $11 \\times 11$ in this case, is slid over the image, \"detecting the presence of different features\" at each location. \n", "\n", "So, there are actually a total of $96 \\times 3$ convolution operations that take place, resulting in 96 filters, each of which has a red, green, and blue component." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We can combine the red, green, and blue filters together and visualize them as if they were a mini $11 \\times 11$ image:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### The 96 AlexNet filters:\n", "\n", "![](img/AlexNet_filt1.png)"
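] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "These shapes are easy to verify in code. A small sketch (the image tensor is a random stand-in, and the layer is AlexNet-_style_ rather than an exact reproduction):\n", "\n", "```python\n", "import torch\n", "from torch import nn\n", "\n", "image = torch.randn(1, 3, 224, 224)  # one RGB image: 3 channels, 224x224\n", "\n", "# 96 filters, each 11x11 and spanning all 3 input channels\n", "conv = nn.Conv2d(in_channels=3, out_channels=96, kernel_size=11, stride=4)\n", "\n", "print(conv.weight.shape)  # torch.Size([96, 3, 11, 11]): 96 x 3 filters\n", "print(conv(image).shape)  # torch.Size([1, 96, 54, 54]): a layer 96 deep\n", "```"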
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## DCGAN Trick #2: Batch Normalization" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Batch normalization is one of the simplest and most powerful tricks to come along in the training of deep neural networks. It was [introduced](https://arxiv.org/abs/1502.03167) by two researchers from Google in March 2015, just nine months before the [DCGAN paper](https://arxiv.org/abs/1511.06434) came out." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Regular neural network:\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We know that normalizing the input to a neural network helps with training: the network doesn't have to \"learn\" that one feature is on a scale from 0-1000 and another is on a scale from 0-1 and change its weights accordingly, for example." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The same thing applies further down in the network:\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Intuitively, batch normalization works for the same reasons that normalizing data before feeding it into a neural network works." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**How is it actually done?**" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "When passing data through a neural network, we do so in batches - say, 64 or 128 images at a time.\n", "\n", "Thus, at every step of the neural network, each neuron has a value _for each observation that is being passed through_.\n", "\n", "We normalize _across these observations_, so that _for each batch_, each neuron will have mean 0 and standard deviation 1. Specifically, we replace the value of the neuron $N$ with:\n", "\n", "$$N' = \\frac{N - \\mu}{\\sigma}$$\n", "\n", "(The full batch normalization operation then applies a learned scale and shift, $\\gamma N' + \\beta$.)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "**What's wrong with this in _convolutional neural networks specifically_**?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Hint: convolutional neural networks work by learning groups of neurons that are really filters:\n", "\n", "![](img/activations_mnist.png)\n", "\n", "This is one image, convolved with 10 different filters in a CNN." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "For convolutional networks, the \"neurons\" are pixels in output images that have been convolved with a filter. These images are important - they contain spatial information about what is present in the images. If we modified these pixels each by a different amount, we would distort that spatial information.\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "So, instead of calculating means and standard deviations for each _neuron_ in each batch, we calculate one mean and standard deviation per _output image_ (channel) for a given batch, so that **within a given output image, every pixel is modified in the same way**.\n", "\n", "![](img/activations_mnist.png)"
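] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "A quick sketch of this per-channel behavior, using PyTorch's `BatchNorm2d` (the feature maps are random stand-ins):\n", "\n", "```python\n", "import torch\n", "from torch import nn\n", "\n", "x = torch.randn(64, 10, 28, 28)  # a batch of 64 images, 10 channels each\n", "\n", "bn = nn.BatchNorm2d(10)  # one mean and std per channel, not per pixel\n", "y = bn(x)\n", "\n", "# The same statistics by hand: normalize over batch, height, and width\n", "mu = x.mean(dim=(0, 2, 3), keepdim=True)\n", "sigma = x.std(dim=(0, 2, 3), keepdim=True, unbiased=False)\n", "print(torch.allclose(y, (x - mu) / sigma, atol=1e-4))  # True\n", "```"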
\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "So, instead of calculating means and standard deviations for each _neuron_ in each batch, we calculate means and standard deviations for all the output images for a given batch, so that for **a given image, each pixel will be modified by the same amount**.\n", "\n", "![](img/activations_mnist.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "**Enough theory!**" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Latest and Greatest Results" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### GAN Result #1\n", "\n", "# \"Pose Guided Person Image Generation\"\n", "\n", "[NIPS 2017 paper](https://papers.nips.cc/paper/6644-pose-guided-person-image-generation.pdf)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "> This paper proposes the novel Pose Guided Person Generation Network (PG2) that allows to synthesize person images in arbitrary poses, based on an image of that person and a novel pose." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Based on the [DeepFashion Dataset](http://mmlab.ie.cuhk.edu.hk/projects/DeepFashion.html)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "\n", "\n", "### DeepFashion dataset description\n", "\n", "> \"“It contains over 800,000 images, which are richly annotated with massive attributes, clothing landmarks, and correspondence of images taken under different scenarios including store, street snapshot, and consumer. Such rich annotations enable the development of powerful algorithms in clothes recognition and facilitating future researches.”" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Example data\n", "\n", "![](img/deep_fashion_clothing_locations.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Generated poses" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "![](img/pose_generation_1.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "![](img/pose_generation_2.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### GAN Result #2\n", "\n", "# Semi-Supervised Learning\n", "\n", "![](img/semi-supervised_gans.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Semi-supervised learning made it into Jeff Bezos' most recent letter to Amazon's shareholders!\n", "\n", "> \"...in the U.S., U.K., and Germany, we’ve improved Alexa’s spoken language understanding by more than 25% over the last 12 months through enhancements in Alexa’s machine learning components and the use of semi-supervised learning techniques. 
(These semi-supervised learning techniques reduced the amount of labeled data needed to achieve the same accuracy improvement by 40 times!)\"" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Semi-supervised learning is a third type of machine learning, in addition to supervised learning and unsupervised learning.\n", "\n", "At a high level:\n", "\n", "* The goal of supervised learning is to learn from _labeled_ data.\n", "* The goal of unsupervised learning is to learn from _unlabeled_ data." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Semi-supervised learning asks the question: can you learn from a _combination_ of both labeled and unlabeled data? " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "With GANs, the answer turns out to be yes! The paper that introduced this idea was [Improved Techniques for Training GANs](https://papers.nips.cc/paper/6125-improved-techniques-for-training-gans.pdf)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "How does it work? The basic idea: " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Let's say we're trying to classify MNIST digits. The discriminator outputs a vector of probabilities of an image belonging to each of the ten classes: \n", "\n", "![](img/ssl_discriminator_1.png)\n", "\n", "This is compared with the real values, turned into a loss, and backpropagated through the network to train it." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "With semi-supervised learning, we simply add another class to this output:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "![](img/ssl_discriminator_2.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Then, data points are fed through this classifier, with the following labels (see the sketch below):\n", "\n", "* _Real_, _labeled_ examples are given labels of 0 for all the digits they are not, 1 for the digit they are, and 1 for $P(real)$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* _Fake_ examples generated by the generator are given labels of 0 across the board, including for $P(real)$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* _Real_, *un*-labeled examples are given labels of 0 for all the classes and 1 for the probability of the image being real." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "This allows the classifier to learn from real, labeled examples, as well as from fake examples and from real, unlabeled examples! " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "In practice, using fake examples is more effective than using real, unlabeled examples."
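] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "As a sketch, the three kinds of label vectors might be constructed like this (purely illustrative; here index 10 holds $P(real)$):\n", "\n", "```python\n", "import torch\n", "\n", "NUM_CLASSES = 10  # digits 0-9; index 10 holds P(real)\n", "\n", "def labeled_target(digit):\n", "    target = torch.zeros(NUM_CLASSES + 1)\n", "    target[digit] = 1.0  # the digit this real image shows\n", "    target[10] = 1.0     # ...and it is real\n", "    return target\n", "\n", "fake_target = torch.zeros(NUM_CLASSES + 1)  # 0 across the board\n", "\n", "unlabeled_target = torch.zeros(NUM_CLASSES + 1)\n", "unlabeled_target[10] = 1.0  # real, but the digit is unknown\n", "```"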
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "In this framework, how do we train the generator? " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The classifier in this case is acting like a discriminator at the same time as it is acting like a classifier. For each image it sees, it is outputting *both*:\n", "\n", "* The probability that the image is a 0, 1, 2, etc.\n", "* The probability that the image is real. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "However, as the researchers in the paper point out:\n", "\n", "> This approach introduces an interaction between G and our classifier that we do not fully understand yet" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "This led them to one of the key innovations that allowed this procedure to work: **feature matching**." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Semi-supervised learning trick: feature matching" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Feature matching, a technique for training GANs, was proposed in the same paper that proposed using GANs for semi-supervised learning: [Improved Techniques for Training GANs](https://papers.nips.cc/paper/6125-improved-techniques-for-training-gans.pdf), by Salimans et al. from OpenAI." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Idea\n", "\n", "The last layer of a convolutional neural network, before the values get fed through a fully connected layer, is typically a layer containing the many features that the network has detected: " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "![](img/gan_layer.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "For example, in a convolutional discriminator like the one described above, the last layer might be $2 \\times 2 \\times 128$ - the result of 128 \"features of features of features\" that the network has learned. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "This is then \"flattened\" to a single layer of $2 \\times 2 \\times 128 = 512$ neurons, and these 512 neurons are then fed through a fully connected layer to produce an output of length 10." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Their idea was to train the generator, not simply by using the discriminator's prediction of whether the image was real or fake, but on **how similar this 512-dimensional vector was between _real_ images fed through the discriminator and _fake_ images fed through the discriminator.** \n", "\n", "The delta between these two sets of _features_ is the loss used to train the generator."
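] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "A sketch of that loss (here `discriminator_features` is a hypothetical function returning the 512-dimensional feature vector for each image in a batch):\n", "\n", "```python\n", "import torch\n", "\n", "def feature_matching_loss(discriminator_features, real_images, fake_images):\n", "    # Average the 512-dimensional feature vectors over each batch\n", "    real_feats = discriminator_features(real_images).mean(dim=0)\n", "    fake_feats = discriminator_features(fake_images).mean(dim=0)\n", "    # The generator is trained to shrink the delta between the two\n", "    return torch.sum((real_feats - fake_feats) ** 2)\n", "```"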
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Aside: why does this work? Even the authors of the paper don't fully understand it:\n", "\n", " > \"This approach introduces an interaction between G and [the hybrid classifier-discriminator] that we do not fully understand yet, but empirically we find that optimizing G using feature matching GAN works very well for semi-supervised learning, while training G using GAN with minibatch discrimination does not work at all. Here we present our empirical results using this approach; developing a full theoretical understanding of the interaction between D and G using this approach is left for future work.\"" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Nevertheless, feature matching was the trick that led to breakthrough performance using semi-supervised learning to build powerful classifiers: " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Salimans et al. from OpenAI in mid-2016 used this approach to get just under a **6%** error rate on the [Street View House Numbers dataset](http://ufldl.stanford.edu/housenumbers/) _with just 1,000 labeled images_. Prior approaches achieved roughly **16%** error. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "State-of-the-art error, using the entire dataset of roughly 600,000 images and simply doing supervised learning with very deep convolutional networks, is roughly **2%**. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Semi-supervised learning is perhaps the most important _application_ of GANs - what is the cutting edge of building GANs themselves?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### GAN Result #3\n", "\n", "# Progressive GANs\n", "\n", "People have been trying to improve the resolution of GAN-generated images since their invention. Progressive GANs, published by researchers at NVIDIA in October 2017, are a huge step forward in doing so:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "[Here](http://research.nvidia.com/sites/default/files/publications/karras2017gan-paper.pdf) is the Progressive GAN paper, describing how they generated high-quality 1024x1024 images mimicking those from the CelebA dataset. The findings even made the New York Times!\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "What is the main idea behind Progressive GANs?\n", "\n", "1. Begin by downsampling the images to be simply _4x4_.\n", "2. Train a GAN to generate \"high quality\" 4x4 images. \n", "3. Then, keeping the weights already learned in the initial layers, add a layer to the end of the generator and to the beginning of the discriminator, so that the GAN now generates _8x8_ images, and so on.\n", "![](img/progressive_gans_technique.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "**But how do we _know_ these GANs are any good?**" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## An aside: how do we score GANs?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "How do we know that these samples are \"good\"? They \"look good\", but how can we quantify this?"
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "> \"Generative Adversarial Networks are generally regarded as producing the best samples [compared to other generative methods such as variational autoencoders] but there is no good way to quantify this.\"\n", "\n", "--Ian Goodfellow, NIPS tutorial 2016" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Since then, several methods have been proposed, the most prominent of which is the **Inception Score**:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Inception Score" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The same paper that introduced feature matching also introduced a technique for scoring GANs: the **Inception Score**." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Consider a GAN that was intended to generate images that come from one of a finite number of classes, such as MNIST digits." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Inception score (cont.)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Let's say that the generator generated an image, that image was then fed through a pre-trained neural network, and the resulting probability distribution over classes was:\n", "\n", "`[0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]`\n", "\n", "In other words, the pre-trained model has no idea which class this image should belong to. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "In this case, we conclude that, all else equal, this likely isn't a very good generator." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The way we formalize this is that this resulting vector should have _low entropy_ - that is, _not_ an even distribution over class labels. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Inception score (cont.)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "There is another way we can use this pre-trained neural network. Let's say that for every image generated, we recorded the \"most likely class\" that the pre-trained network was predicting. And let's say that 91% of the time, the pre-trained network was classifying the images that our model was generating as zeros, so that the vector of \"most likely class\" frequencies looked like:\n", "\n", "`[0.91, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01]`" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Again, this would not be a very good GAN - it has collapsed to generating almost nothing but zeros!" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The way we formalize this is that we want the vector of the frequencies of the predictions to have _high entropy_: that is, we *do* want the classes to be balanced." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "\"Inception\" simply refers to the neural network architecture used to score these generated images."
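] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Putting these two requirements together: the Inception Score exponentiates the average KL divergence between each image's class distribution $p(y|x)$ and the marginal distribution $p(y)$. A sketch (`probs` is assumed to hold one softmax output per generated image):\n", "\n", "```python\n", "import torch\n", "\n", "def inception_score(probs):\n", "    # probs: one row p(y|x) per generated image, from the pre-trained net\n", "    p_y = probs.mean(dim=0, keepdim=True)  # marginal p(y): want high entropy\n", "    kl = (probs * (probs.log() - p_y.log())).sum(dim=1)  # want p(y|x) peaked\n", "    return kl.mean().exp()  # higher is better\n", "\n", "probs = torch.softmax(torch.randn(1000, 10), dim=1)  # stand-in predictions\n", "print(inception_score(probs))\n", "```"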
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Progressive GANs did indeed show a record Inception score on the CIFAR-10 dataset:\n", "\n", "![](img/progressive_gans_inception.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "However, we can't do this on the CelebA dataset: there are no classes!" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Patch similarity" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The authors propose a new way of assessing their GANs to identify improvement:\n", "\n", "They randomly sample 7x7 patches from the 16x16 versions of the images, the 32x32 versions, etc., up to the 1024x1024 version. They then use a metric called the \"Wasserstein distance\" to compute the similarity between generated patches and the corresponding real patches.\n", "\n", "> \"...the distance between the patch sets extracted from the lowest-resolution 16 × 16 images indicate similarity in large-scale image structures, while the finest-level patches encode information about pixel-level attributes such as sharpness of edges and noise.\"" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Using this metric, their method does indeed outperform other GANs that have come before." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# The future" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "What is the future of GANs? More generally, what is the future of Deep Learning? Can we predict it?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "I asked Ian Goodfellow in a LinkedIn message if he was surprised by how quickly Progressive GANs were able to get close to photorealistic image quality on 1024x1024 images. He replied:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "> I'm actually surprised at how slow it's been. Back in 2015 I thought that getting to photorealistic video was mostly going to be an engineering effort of scaling the model up and training on more data.\n", "\n", "--Ian Goodfellow, in a LinkedIn message to me" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Ian Goodfellow's background" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Knowledge over time" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## What is the future of GANs?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "![](img/question_mark.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Nobody knows!"
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "**Thanks!**\n", "\n", "![](img/professional_headshot.png)\n", "\n", "[Website](https://www.sethweidman.com) | [Medium](https://medium.com/@sethweidman) | [GitHub](https://github.com/sethHWeidman/) | [Twitter](https://twitter.com/SethHWeidman) | [LinkedIn](https://www.linkedin.com/in/sethhweidman/)\n", "\n", "seth@sethweidman.com if you have any questions." ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" } }, "nbformat": 4, "nbformat_minor": 2 }