{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Convolutional Neural Network\n", "\n", "In this tutorial we will implement a simple Convolutional Neural Network in TensorFlow which has a classification accuracy of about 99%, or more if you work through some of the suggested exercises.\n", "\n", "Convolutional Networks work by moving small filters across the input image. This means the filters are re-used for recognizing patterns throughout the entire input image. This makes the Convolutional Networks much more powerful than Fully-Connected networks with the same number of variables. This in turn makes the Convolutional Networks faster to train.\n", "\n", "A discrete convolution (or cross-correlation) is defined as:\n", "\n", "$$y[i] = (\\mathbf{x} \\star \\mathbf{w})[i] = \\sum_{k=-\\infty}^{\\infty} x[i-k] w[k]$$\n", "\n", "where $\\mathbf{x}$ is the input and $\\mathbf{w}$ is the filter, both written here as one-dimensional sequences for simplicity. The index $i$ runs through each element of the output vector $\\mathbf{y}$.\n", "\n", "In practice, all but a finite number of these terms are zero, so we do not have to calculate them. 
If $\\mathbf{x}$ is an image, then we will simply pad the sides with zeros.\n", "\n", "![](images/padding.png)\n", "\n", "If the input $\\mathbf{x}$ and filter $\\mathbf{w}$ have $n$ and $m$ elements, respectively, then the convolution formula becomes:\n", "\n", "$$y[i] = (\\mathbf{x} \\star \\mathbf{w})[i] = \\sum_{k=0}^{m-1} x^{(p)}[i+m-k] w[k]$$\n", "\n", "where $\\mathbf{x}^{(p)}$ denotes the zero-padded input. This sum is free of infinite indices.\n", "\n", "However, $\\mathbf{x}^{(p)}$ and $\\mathbf{w}$ are indexed in opposite directions (as $k$ increases, the index into the input decreases), so we can flip the filter $\\mathbf{w}$ and perform a dot product to get an element $y[i]$:\n", "\n", "$$\\mathbf{x}[i:i+m] \\centerdot \\mathbf{w}^{T}$$\n", "\n", "This operation is repeated as a sliding window to obtain all of the output elements.\n", "\n", "![](images/sliding_window.png)\n", "\n", "Padding allows the size of the output to be controlled:\n", "\n", "![](images/padding_types.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Flowchart" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following chart shows roughly how the data flows in the Convolutional Neural Network that is implemented below.\n", "\n", "![Flowchart](images/02_network_flowchart.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The input image is processed in the first convolutional layer using the filter-weights. This results in 16 new images, one for each filter in the convolutional layer. The images are also down-sampled so the image resolution is decreased from 28x28 to 14x14.\n", "\n", "These 16 smaller images are then processed in the second convolutional layer. We need filter-weights for each of these 16 channels, and we need filter-weights for each output channel of this layer. There are 36 output channels so there are a total of 16 x 36 = 576 filters in the second convolutional layer. The resulting images are down-sampled again to 7x7 pixels.\n", "\n", "The output of the second convolutional layer is 36 images of 7x7 pixels each. 
These are then flattened to a single vector of length 7 x 7 x 36 = 1764, which is used as the input to a fully-connected layer with 128 neurons (or elements). This feeds into another fully-connected layer with 10 neurons, one for each of the classes, which is used to determine the class of the image, that is, which number is depicted in the image.\n", "\n", "The convolutional filters are initially chosen at random, so the classification is done randomly. The error between the predicted and true class of the input image is measured as the so-called cross-entropy. The optimizer then automatically propagates this error back through the Convolutional Network using the chain-rule of differentiation and updates the filter-weights so as to improve the classification error. This is done iteratively thousands of times until the classification error is sufficiently low.\n", "\n", "These particular filter-weights and intermediate images are the results of one optimization run and may look different if you re-run this Notebook.\n", "\n", "Note that the computation in TensorFlow is actually done on a batch of images instead of a single image, which makes the computation more efficient. This means the flowchart actually has one more data-dimension when implemented in TensorFlow." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Convolutional Layer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following chart shows the basic idea of processing an image in the first convolutional layer. The input image depicts the number 7 and four copies of the image are shown here, so we can see more clearly how the filter is being moved to different positions of the image. For each position of the filter, the dot-product is being calculated between the filter and the image pixels under the filter, which results in a single pixel in the output image. 
So moving the filter across the entire input image results in a new image being generated.\n", "\n", "Red filter-weights mean that the filter has a positive reaction to black pixels in the input image, while blue filter-weights mean that it has a negative reaction to black pixels.\n", "\n", "In this case it appears that the filter recognizes the horizontal line of the digit 7, as can be seen from its stronger reaction to that line in the output image.\n", "\n", "![Convolution example](images/02_convolution.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The step-size for moving the filter across the input is called the stride. There is a stride for moving the filter horizontally (x-axis) and another stride for moving vertically (y-axis).\n", "\n", "In the source-code below, the stride is set to 1 in both directions, which means the filter starts in the upper left corner of the input image and is moved 1 pixel to the right in each step. When the filter reaches the right-hand end of the image, it is moved back to the left side and 1 pixel down the image. This continues until the filter has reached the lower right corner of the input image and the entire output image has been generated.\n", "\n", "When the filter extends past the right-hand edge or the bottom of the input image, the image can be padded with zeroes (white pixels). This causes the output image to have exactly the same dimensions as the input image.\n", "\n", "Furthermore, the output of the convolution may be passed through a so-called Rectified Linear Unit (ReLU), which merely ensures that the output is non-negative because negative values are set to zero. The output may also be down-sampled by so-called max-pooling, which considers small windows of 2x2 pixels and only keeps the largest of those pixels. This halves the resolution of the image in each dimension, e.g. 
from 28x28 to 14x14 pixels.\n", "\n", "Note that the second convolutional layer is more complicated because it takes 16 input channels. We want a separate filter for each input channel, so we need 16 filters instead of just one. Furthermore, we want 36 output channels from the second convolutional layer, so in total we need 16 x 36 = 576 filters for the second convolutional layer. Each output channel is obtained by convolving every one of the 16 input channels with its own filter and summing the 16 results." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import warnings\n", "warnings.filterwarnings(\"ignore\", category=RuntimeWarning)\n", "\n", "import matplotlib.pyplot as plt\n", "import tensorflow as tf\n", "import numpy as np\n", "from sklearn.metrics import confusion_matrix\n", "import time\n", "from datetime import timedelta\n", "import math" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Configuration of Neural Network\n", "\n", "The configuration of the Convolutional Neural Network is defined here for convenience, so you can easily find and change these numbers and re-run the Notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Convolutional Layer 1.\n", "filter_size1 = 5 # Convolution filters are 5 x 5 pixels.\n", "num_filters1 = 16 # There are 16 of these filters.\n", "\n", "# Convolutional Layer 2.\n", "filter_size2 = 5 # Convolution filters are 5 x 5 pixels.\n", "num_filters2 = 36 # There are 36 of these filters.\n", "\n", "# Fully-connected layer.\n", "fc_size = 128 # Number of neurons in fully-connected layer." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To illustrate convolutional neural networks, we will use Fashion-MNIST, a dataset of 70,000 grayscale images (28x28 pixels each) of fashion products from 10 categories, with 7,000 images per category. 
The training set has 60,000 images and the test set has 10,000 images. Fashion-MNIST is intended to serve as a direct (more challenging) drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms, as it shares the same image size, data format and the structure of training and testing splits." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load the fashion-mnist pre-shuffled train data and test data\n", "(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()\n", "\n", "print(\"x_train shape:\", x_train.shape, \"y_train shape:\", y_train.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Copy some of the data-dimensions for convenience." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sample_image = x_train[0]\n", "\n", "# The number of pixels in each dimension of an image.\n", "img_size = sample_image.shape[0]\n", "\n", "# Length of an image once it is flattened to a one-dimensional array.\n", "img_size_flat = sample_image.ravel().shape[0]\n", "\n", "# Tuple with height and width of images used to reshape arrays.\n", "img_shape = sample_image.shape\n", "\n", "# Number of classes: 10, one for each clothing category.\n", "num_classes = len(set(y_train))\n", "\n", "# Number of colour channels for the images: 1 channel for gray-scale.\n", "num_channels = 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plot an image to see if data is correct" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The cell below prints the dataset shapes, looks up the text label of one training image, and displays that image."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Print training set shape: 60,000 training images of size 28x28 and 60,000 training labels\n", "print(\"x_train shape:\", x_train.shape, \"y_train shape:\", y_train.shape)\n", "\n", "# Print the number of training and test examples\n", "print(x_train.shape[0], 'train set')\n", "print(x_test.shape[0], 'test set')\n", "\n", "# Define the text labels\n", "fashion_mnist_labels = [\"T-shirt/top\", # index 0\n", " \"Trouser\", # index 1\n", " \"Pullover\", # index 2\n", " \"Dress\", # index 3\n", " \"Coat\", # index 4\n", " \"Sandal\", # index 5\n", " \"Shirt\", # index 6\n", " \"Sneaker\", # index 7\n", " \"Bag\", # index 8\n", " \"Ankle boot\"] # index 9\n", "\n", "# Image index, you can pick any number between 0 and 59,999\n", "img_index = 5\n", "# y_train contains the labels, ranging from 0 to 9\n", "label_index = y_train[img_index]\n", "# Print the label, for example 2 Pullover\n", "print(\"y = \" + str(label_index) + \" \" + fashion_mnist_labels[label_index])\n", "# Show one of the images from the training dataset\n", "plt.imshow(x_train[img_index])" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "CFlNHktHBtru" }, "source": [ "## Split the data into train/validation/test data sets\n", "\n", "\n", "* Training data - used for training the model\n", "* Validation data - used for tuning the hyperparameters and evaluating candidate models\n", "* Test data - used to test the model after the model has gone through initial vetting by the validation set.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 86 }, "colab_type": "code", "id": "1ShU787gZZg0", "outputId": "2016e59c-a88b-4ecd-c9fa-1177011877d1" }, "outputs": [], "source": [ "# Further break training data into train / validation sets (# put 5000 into validation set and keep remaining 55,000 
for train)\n", "(x_train, x_valid) = x_train[5000:], x_train[:5000]\n", "(y_train, y_valid) = y_train[5000:], y_train[:5000]\n", "\n", "# Flatten each image from shape (28, 28) to a vector of length w*h = 784\n", "w, h = 28, 28\n", "x_train = x_train.reshape(x_train.shape[0], w*h)\n", "x_valid = x_valid.reshape(x_valid.shape[0], w*h)\n", "x_test = x_test.reshape(x_test.shape[0], w*h)\n", "\n", "# One-hot encode the labels\n", "y_train = tf.keras.utils.to_categorical(y_train, 10)\n", "y_valid = tf.keras.utils.to_categorical(y_valid, 10)\n", "y_test = tf.keras.utils.to_categorical(y_test, 10)\n", "\n", "# Print training set shape\n", "print(\"x_train shape:\", x_train.shape, \"y_train shape:\", y_train.shape)\n", "\n", "# Print the number of training, validation, and test examples\n", "print(x_train.shape[0], 'train set')\n", "print(x_valid.shape[0], 'validation set')\n", "print(x_test.shape[0], 'test set')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Helper-functions for creating new variables" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Functions for creating new TensorFlow variables in the given shape and initializing them: the weights with small random values and the biases with a small constant. Note that the initialization is not actually done at this point; it is merely defined in the TensorFlow graph." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def new_weights(shape):\n", "    return tf.Variable(tf.truncated_normal(shape, stddev=0.05))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def new_biases(length):\n", "    return tf.Variable(tf.constant(0.05, shape=[length]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Helper-function for creating a new Convolutional Layer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This function creates a new convolutional layer in the computational graph for TensorFlow. 
Nothing is actually calculated here; we are just adding the mathematical formulas to the TensorFlow graph.\n", "\n", "It is assumed that the input is a 4-dim tensor with the following dimensions:\n", "\n", "1. Image number.\n", "2. Y-axis of each image.\n", "3. X-axis of each image.\n", "4. Channels of each image.\n", "\n", "Note that the input channels may be either colour-channels or filter-channels, the latter if the input is produced from a previous convolutional layer.\n", "\n", "The output is another 4-dim tensor with the following dimensions:\n", "\n", "1. Image number, same as input.\n", "2. Y-axis of each image. If 2x2 pooling is used, then the height and width of the input images are divided by 2.\n", "3. X-axis of each image. Ditto.\n", "4. Channels produced by the convolutional filters." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def new_conv_layer(input,               # The previous layer.\n", "                   num_input_channels,  # Num. channels in prev. layer.\n", "                   filter_size,         # Width and height of each filter.\n", "                   num_filters,         # Number of filters.\n", "                   use_pooling=True):   # Use 2x2 max-pooling.\n", "\n", "    # Shape of the filter-weights for the convolution.\n", "    # This format is determined by the TensorFlow API.\n", "    shape = [filter_size, filter_size, num_input_channels, num_filters]\n", "\n", "    # Create new weights aka. filters with the given shape.\n", "    weights = new_weights(shape=shape)\n", "\n", "    # Create new biases, one for each filter.\n", "    biases = new_biases(length=num_filters)\n", "\n", "    # Create the TensorFlow operation for convolution.\n", "    # Note the strides are set to 1 in all dimensions.\n", "    # The first and last stride must always be 1,\n", "    # because the first is for the image-number and\n", "    # the last is for the input-channel.\n", "    # But e.g. 
strides=[1, 2, 2, 1] would mean that the filter\n", "    # is moved 2 pixels across the x- and y-axis of the image.\n", "    # The padding is set to 'SAME' which means the input image\n", "    # is padded with zeroes so the size of the output is the same.\n", "    layer = tf.nn.conv2d(input=input,\n", "                         filter=weights,\n", "                         strides=[1, 1, 1, 1],\n", "                         padding='SAME')\n", "\n", "    # Add the biases to the results of the convolution.\n", "    # A bias-value is added to each filter-channel.\n", "    layer += biases\n", "\n", "    # Use pooling to down-sample the image resolution?\n", "    if use_pooling:\n", "        # This is 2x2 max-pooling, which means that we\n", "        # consider 2x2 windows and select the largest value\n", "        # in each window. Then we move 2 pixels to the next window.\n", "        layer = tf.nn.max_pool(value=layer,\n", "                               ksize=[1, 2, 2, 1],\n", "                               strides=[1, 2, 2, 1],\n", "                               padding='SAME')\n", "\n", "    # Rectified Linear Unit (ReLU).\n", "    # It calculates max(x, 0) for each input pixel x.\n", "    # This adds some non-linearity to the formula and allows us\n", "    # to learn more complicated functions.\n", "    layer = tf.nn.relu(layer)\n", "\n", "    # Note that ReLU is normally executed before the pooling,\n", "    # but since relu(max_pool(x)) == max_pool(relu(x)) we can\n", "    # save 75% of the relu-operations by max-pooling first.\n", "\n", "    # We return both the resulting layer and the filter-weights\n", "    # because we will plot the weights later.\n", "    return layer, weights" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Helper-function for flattening a layer\n", "\n", "A convolutional layer produces an output tensor with 4 dimensions. We will add fully-connected layers after the convolution layers, so we need to reduce the 4-dim tensor to 2-dim which can be used as input to the fully-connected layer."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def flatten_layer(layer):\n", " # Get the shape of the input layer.\n", " layer_shape = layer.get_shape()\n", "\n", " # The shape of the input layer is assumed to be:\n", " # layer_shape == [num_images, img_height, img_width, num_channels]\n", "\n", " # The number of features is: img_height * img_width * num_channels\n", " # We can use a function from TensorFlow to calculate this.\n", " num_features = layer_shape[1:4].num_elements()\n", " \n", " # Reshape the layer to [num_images, num_features].\n", " # Note that we just set the size of the second dimension\n", " # to num_features and the size of the first dimension to -1\n", " # which means the size in that dimension is calculated\n", " # so the total size of the tensor is unchanged from the reshaping.\n", " layer_flat = tf.reshape(layer, [-1, num_features])\n", "\n", " # The shape of the flattened layer is now:\n", " # [num_images, img_height * img_width * num_channels]\n", "\n", " # Return both the flattened layer and the number of features.\n", " return layer_flat, num_features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Helper-function for creating a new Fully-Connected Layer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This function creates a new fully-connected layer in the computational graph for TensorFlow. Nothing is actually calculated here, we are just adding the mathematical formulas to the TensorFlow graph.\n", "\n", "It is assumed that the input is a 2-dim tensor of shape [num_images, num_inputs]. The output is a 2-dim tensor of shape [num_images, num_outputs]." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def new_fc_layer(input, # The previous layer.\n", " num_inputs, # Num. inputs from prev. layer.\n", " num_outputs, # Num. 
outputs.\n", "                 use_relu=True): # Use Rectified Linear Unit (ReLU)?\n", "\n", "    # Create new weights and biases.\n", "    weights = new_weights(shape=[num_inputs, num_outputs])\n", "    biases = new_biases(length=num_outputs)\n", "\n", "    # Calculate the layer as the matrix multiplication of\n", "    # the input and weights, and then add the bias-values.\n", "    layer = tf.matmul(input, weights) + biases\n", "\n", "    # Use ReLU?\n", "    if use_relu:\n", "        layer = tf.nn.relu(layer)\n", "\n", "    return layer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Placeholder variables" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Placeholder variables serve as the input to the TensorFlow computational graph that we may change each time we execute the graph. We call this feeding the placeholder variables and it is demonstrated further below.\n", "\n", "First we define the placeholder variable for the input images. This allows us to change the images that are input to the TensorFlow graph. This is a so-called tensor, which just means that it is a multi-dimensional vector or matrix. The data-type is set to float32 and the shape is set to [None, img_size_flat], where None means that the tensor may hold an arbitrary number of images with each image being a vector of length img_size_flat." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x = tf.placeholder(tf.float32, shape=[None, img_size_flat], name='x')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The convolutional layers expect x to be encoded as a 4-dim tensor so we have to reshape it so its shape is instead [num_images, img_height, img_width, num_channels]. Note that img_height == img_width == img_size and num_images can be inferred automatically by using -1 for the size of the first dimension. 
So the reshape operation is:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x_image = tf.reshape(x, [-1, img_size, img_size, num_channels])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we have the placeholder variable for the true labels associated with the images that were input in the placeholder variable x. The shape of this placeholder variable is [None, num_classes] which means it may hold an arbitrary number of labels and each label is a vector of length num_classes which is 10 in this case." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y_true = tf.placeholder(tf.float32, shape=[None, num_classes], name='y_true')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We could also have a placeholder variable for the class-number, but we will instead calculate it using argmax. Note that this is a TensorFlow operator so nothing is calculated at this point." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y_true_cls = tf.argmax(y_true, axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Convolutional Layer 1\n", "\n", "Create the first convolutional layer. It takes x_image as input and creates num_filters1 different filters, each having width and height equal to filter_size1. Finally we wish to down-sample the image so it is half the size by using 2x2 max-pooling." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "layer_conv1, weights_conv1 = \\\n", " new_conv_layer(input=x_image,\n", " num_input_channels=num_channels,\n", " filter_size=filter_size1,\n", " num_filters=num_filters1,\n", " use_pooling=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check the shape of the tensor that will be output by the convolutional layer. 
It is (?, 14, 14, 16) which means that there is an arbitrary number of images (this is the ?), each image is 14 pixels wide and 14 pixels high, and there are 16 different channels, one channel for each of the filters." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "layer_conv1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Convolutional Layer 2\n", "\n", "Create the second convolutional layer, which takes as input the output from the first convolutional layer. The number of input channels corresponds to the number of filters in the first convolutional layer." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "layer_conv2, weights_conv2 = \\\n", " new_conv_layer(input=layer_conv1,\n", " num_input_channels=num_filters1,\n", " filter_size=filter_size2,\n", " num_filters=num_filters2,\n", " use_pooling=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check the shape of the tensor that will be output from this convolutional layer. The shape is (?, 7, 7, 36) where the ? again means that there is an arbitrary number of images, with each image having width and height of 7 pixels, and there are 36 channels, one for each filter." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "layer_conv2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Flatten Layer\n", "\n", "The convolutional layers output 4-dim tensors. We now wish to use these as input in a fully-connected network, which requires for the tensors to be reshaped or flattened to 2-dim tensors." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "layer_flat, num_features = flatten_layer(layer_conv2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check that the tensors now have shape (?, 1764) which means there's an arbitrary number of images which have been flattened to vectors of length 1764 each. Note that 1764 = 7 x 7 x 36." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "layer_flat" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "num_features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Fully-Connected Layer 1\n", "\n", "Add a fully-connected layer to the network. The input is the flattened layer from the previous convolution. The number of neurons or nodes in the fully-connected layer is fc_size. ReLU is used so we can learn non-linear relations." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "layer_fc1 = new_fc_layer(input=layer_flat,\n", " num_inputs=num_features,\n", " num_outputs=fc_size,\n", " use_relu=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check that the output of the fully-connected layer is a tensor with shape (?, 128) where the ? means there is an arbitrary number of images and fc_size == 128." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "layer_fc1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Fully-Connected Layer 2\n", "\n", "Add another fully-connected layer that outputs vectors of length 10 for determining which of the 10 classes the input image belongs to. Note that ReLU is not used in this layer." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "layer_fc2 = new_fc_layer(input=layer_fc1,\n", " num_inputs=fc_size,\n", " num_outputs=num_classes,\n", " use_relu=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "layer_fc2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Predicted Class" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The second fully-connected layer estimates how likely it is that the input image belongs to each of the 10 classes. However, these estimates are a bit rough and difficult to interpret because the numbers may be very small or large, so we want to normalize them so that each element is limited between zero and one and the 10 elements sum to one. This is calculated using the so-called softmax function and the result is stored in y_pred." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y_pred = tf.nn.softmax(layer_fc2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The class-number is the index of the largest element." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y_pred_cls = tf.argmax(y_pred, axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cost-function to be optimized" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make the model better at classifying the input images, we must somehow change the variables for all the network layers. To do this we first need to know how well the model currently performs by comparing the predicted output of the model y_pred to the desired output y_true.\n", "\n", "The cross-entropy is a performance measure used in classification. The cross-entropy is a continuous function that is always positive and if the predicted output of the model exactly matches the desired output then the cross-entropy equals zero. 
The goal of optimization is therefore to minimize the cross-entropy so it gets as close to zero as possible by changing the variables of the network layers.\n", "\n", "TensorFlow has a built-in function for calculating the cross-entropy. Note that the function calculates the softmax internally so we must use the output of layer_fc2 directly rather than y_pred which has already had the softmax applied." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cross_entropy = tf.losses.softmax_cross_entropy(onehot_labels=y_true, logits=layer_fc2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To guide the optimization of the model's variables we need a single scalar value that summarizes the cross-entropy over all the image classifications in a batch. With its default reduction, tf.losses.softmax_cross_entropy already returns this batch average as a scalar, so the tf.reduce_mean below leaves the value unchanged; it simply makes explicit that cost is a scalar." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cost = tf.reduce_mean(cross_entropy)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Optimization Method" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have a cost measure that must be minimized, we can create an optimizer. In this case it is the AdamOptimizer which is an advanced form of Gradient Descent.\n", "\n", "Note that optimization is not performed at this point. In fact, nothing is calculated at all; we just add the optimizer-object to the TensorFlow graph for later execution."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "optimizer = tf.train.AdamOptimizer(learning_rate=1e-4).minimize(cost)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Performance Measures" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We need a few more performance measures to display the progress to the user.\n", "\n", "This is a vector of booleans whether the predicted class equals the true class of each image." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "correct_prediction = tf.equal(y_pred_cls, y_true_cls)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This calculates the classification accuracy by first type-casting the vector of booleans to floats, so that False becomes 0 and True becomes 1, and then calculating the average of these numbers." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## TensorFlow Run" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create TensorFlow session\n", "\n", "Once the TensorFlow graph has been created, we have to create a TensorFlow session which is used to execute the graph." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "session = tf.Session()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Initialize variables\n", "\n", "The variables for weights and biases must be initialized before we start optimizing them." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "session.run(tf.global_variables_initializer())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Helper-function to perform optimization iterations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are 55,000 images in the training-set. It takes a long time to calculate the gradient of the model using all these images. We therefore only use a small batch of images in each iteration of the optimizer.\n", "\n", "If your computer crashes or becomes very slow because you run out of RAM, then you may try and lower this number, but you may then need to perform more optimization iterations." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_batch_size = 64" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Function for performing a number of optimization iterations so as to gradually improve the variables of the network layers. In each iteration, a new batch of data is selected from the training-set and then TensorFlow executes the optimizer using those training samples. The progress is printed every 100 iterations." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.random.seed(42)\n", "\n", "def random_batch(X, y, batch_size=64):\n", "    # Draw batch_size distinct row indices uniformly at random.\n", "    idx = np.random.choice(X.shape[0], size=batch_size, replace=False)\n", "\n", "    # Return the selected images together with their labels.\n", "    return X[idx], y[idx]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "random_batch" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Counter for total number of iterations performed so far.\n", "total_iterations = 0\n", "\n", "def optimize(num_iterations):\n", "    # Ensure we update the global variable rather than a local copy.\n", "    global total_iterations\n", "\n", "    # Start-time used for printing time-usage below.\n", "    start_time = time.time()\n", "\n", "    for i in range(total_iterations,\n", "                   total_iterations + num_iterations):\n", "\n", "        # Get a batch of training examples.\n", "        # x_batch now holds a batch of images and\n", "        # y_true_batch are the true labels for those images.\n", "        x_batch, y_true_batch = random_batch(x_train, y_train, batch_size=train_batch_size)\n", "\n", "        # Put the batch into a dict with the proper names\n", "        # for placeholder variables in the TensorFlow graph.\n", "        feed_dict_train = {x: x_batch,\n", "                           y_true: y_true_batch}\n", "\n", "        # Run the optimizer using this batch of training data.\n", "        # TensorFlow assigns the variables in feed_dict_train\n", "        # to the placeholder variables and then runs the optimizer.\n", "        session.run(optimizer, feed_dict=feed_dict_train)\n", "\n", "        # Print status every 100 iterations.\n", "        if i % 100 == 0:\n", "            # Calculate the accuracy on the current training batch.\n", "            acc = session.run(accuracy, feed_dict=feed_dict_train)\n", "\n", "            # Message for printing.\n", "            msg = \"Optimization Iteration: {0:>6}, Training Accuracy: {1:>6.1%}\"\n", "\n", "            # Print it.\n", "            print(msg.format(i + 1, acc))\n", "\n", "    # Update the total number of iterations performed.\n", "    total_iterations += num_iterations\n", "\n", "    # Ending time.\n", "    end_time = time.time()\n", "\n", "    # Difference between start and end-times.\n", "    time_dif = end_time - start_time\n", "\n", "    # Print the time-usage.\n", "    print(\"Time usage: \" + str(timedelta(seconds=int(round(time_dif)))))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Helper-function to plot example errors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Function for plotting examples of images from the test-set that have been mis-classified." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def plot_example_errors(cls_pred, correct):\n", "    # This function is called from print_test_accuracy() below.\n", "\n", "    # cls_pred is an array of the predicted class-number for\n", "    # all images in the test-set.\n", "\n", "    # correct is a boolean array whether the predicted class\n", "    # is equal to the true class for each image in the test-set.\n", "\n", "    # Negate the boolean array.\n", "    incorrect = (correct == False)\n", "\n", "    # Get the images from the test-set that have been\n", "    # incorrectly classified.\n", "    images = x_test[incorrect]\n", "\n", "    # Get the predicted classes for those images.\n", "    cls_pred = cls_pred[incorrect]\n", "\n", "    # Get the true class-numbers for those images.\n", "    # Assumes y_test is one-hot encoded, as in print_test_accuracy().\n", "    cls_true = np.argmax(y_test, axis=1)[incorrect]\n", "\n", "    # Plot the first 9 images.\n", "    plot_images(images=images[0:9],\n", "                cls_true=cls_true[0:9],\n", "                cls_pred=cls_pred[0:9])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Helper-function to plot confusion matrix" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def plot_confusion_matrix(cls_pred):\n", "    # This is called from print_test_accuracy() below.\n", "\n", "    # cls_pred is an array of the predicted class-number for\n", "    # all images in the test-set.\n", "\n", "    # Get the true class-numbers for the test-set.\n", "    # Assumes y_test is one-hot encoded, as in print_test_accuracy().\n", "    cls_true = np.argmax(y_test, axis=1)\n", "\n", "    # Get the confusion matrix using sklearn.\n", "    cm = confusion_matrix(y_true=cls_true,\n", "                          y_pred=cls_pred)\n", "\n", "    # Print the confusion matrix as text.\n", "    print(cm)\n", "\n", "    # Plot the confusion matrix as an image.\n", "    plt.matshow(cm)\n", "\n", "    # Make various adjustments to the plot.\n", "    plt.colorbar()\n", "    tick_marks = np.arange(num_classes)\n", "    plt.xticks(tick_marks, range(num_classes))\n", "    plt.yticks(tick_marks, range(num_classes))\n", "    plt.xlabel('Predicted')\n", "    plt.ylabel('True')\n", "\n", "    # Ensure the plot is shown correctly with multiple plots\n", "    # in a single Notebook cell.\n", "    plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Helper-function for showing the performance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Function for printing the classification accuracy on the test-set.\n", "\n", "It takes a while to compute the classification for all the images in the test-set, which is why the results are re-used by calling the above functions directly from this function, so the classifications don't have to be recalculated by each function.\n", "\n", "Note that this function can use a lot of computer memory, which is why the test-set is split into smaller batches. If you have little RAM in your computer and it crashes, then you can try lowering the batch-size."
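, "\n", "As a small worked example of the batching arithmetic: for a test-set of $N$ images and a batch-size of $B$, the loop needs $\\lceil N / B \\rceil$ batches, and the last batch holds the remaining $N - B (\\lceil N / B \\rceil - 1)$ images. Assuming $N = 10000$ (the usual MNIST test-set size, which is an assumption here because it is not stated in this section) and $B = 256$:\n", "\n", "$$\\left\\lceil \\frac{10000}{256} \\right\\rceil = 40, \\qquad 10000 - 256 \\cdot 39 = 16$$\n", "\n", "so there would be 39 full batches followed by one final batch of 16 images."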
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Split the test-set into smaller batches of this size.\n", "test_batch_size = 256\n", "\n", "def print_test_accuracy(show_example_errors=False,\n", "                        show_confusion_matrix=False):\n", "\n", "    # Number of images in the test-set.\n", "    num_test = x_test.shape[0]\n", "\n", "    # Allocate an array for the predicted classes which\n", "    # will be calculated in batches and filled into this array.\n", "    cls_pred = np.zeros(shape=num_test, dtype=int)\n", "\n", "    # Now calculate the predicted classes for the batches.\n", "    # We will just iterate through all the batches.\n", "    # There might be a more clever and Pythonic way of doing this.\n", "\n", "    # The starting index for the next batch is denoted i.\n", "    i = 0\n", "\n", "    while i < num_test:\n", "        # The ending index for the next batch is denoted j.\n", "        j = min(i + test_batch_size, num_test)\n", "\n", "        # Get the images from the test-set between index i and j.\n", "        images = x_test[i:j, :]\n", "\n", "        # Get the associated labels.\n", "        labels = y_test[i:j, :]\n", "\n", "        # Create a feed-dict with these images and labels.\n", "        feed_dict = {x: images,\n", "                     y_true: labels}\n", "\n", "        # Calculate the predicted class using TensorFlow.\n", "        cls_pred[i:j] = session.run(y_pred_cls, feed_dict=feed_dict)\n", "\n", "        # Set the start-index for the next batch to the\n", "        # end-index of the current batch.\n", "        i = j\n", "\n", "    # Convenience variable for the true class-numbers of the test-set.\n", "    cls_true = np.where(y_test)[1]\n", "\n", "    # Create a boolean array whether each image is correctly classified.\n", "    correct = (cls_true == cls_pred)\n", "\n", "    # Calculate the number of correctly classified images.\n", "    # When summing a boolean array, False means 0 and True means 1.\n", "    correct_sum = correct.sum()\n", "\n", "    # Classification accuracy is the number of correctly classified\n", "    # images divided by the total number of images in the test-set.\n", "    acc = float(correct_sum) / num_test\n", "\n", "    # Print the accuracy.\n", "    msg = \"Accuracy on Test-Set: {0:.1%} ({1} / {2})\"\n", "    print(msg.format(acc, correct_sum, num_test))\n", "\n", "    # Plot some examples of mis-classifications, if desired.\n", "    if show_example_errors:\n", "        print(\"Example errors:\")\n", "        plot_example_errors(cls_pred=cls_pred, correct=correct)\n", "\n", "    # Plot the confusion matrix, if desired.\n", "    if show_confusion_matrix:\n", "        print(\"Confusion Matrix:\")\n", "        plot_confusion_matrix(cls_pred=cls_pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Performance before any optimization\n", "\n", "The accuracy on the test-set is very low because the model variables have only been initialized and not optimized at all, so it just classifies the images randomly." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print_test_accuracy()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Performance after 100 optimization iterations" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "optimize(num_iterations=100)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print_test_accuracy()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Performance after 3000 optimization iterations\n", "\n", "After 3000 optimization iterations, the model has greatly increased its accuracy on the test-set to more than 90%." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "optimize(num_iterations=2900) # We performed 100 iterations above."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "print_test_accuracy()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Visualization of Weights and Layers\n", "\n", "In trying to understand why the convolutional neural network can recognize images, we will now visualize the weights of the convolutional filters and the resulting output images." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Helper-function for plotting convolutional weights" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def plot_conv_weights(weights, input_channel=0):\n", "    # Assume weights are TensorFlow ops for 4-dim variables\n", "    # e.g. weights_conv1 or weights_conv2.\n", "\n", "    # Retrieve the values of the weight-variables from TensorFlow.\n", "    # A feed-dict is not necessary because nothing is calculated.\n", "    w = session.run(weights)\n", "\n", "    # Get the lowest and highest values for the weights.\n", "    # This is used to correct the colour intensity across\n", "    # the images so they can be compared with each other.\n", "    w_min = np.min(w)\n", "    w_max = np.max(w)\n", "\n", "    # Number of filters used in the conv. layer.\n", "    num_filters = w.shape[3]\n", "\n", "    # Number of grids to plot.\n", "    # Rounded-up, square-root of the number of filters.\n", "    num_grids = math.ceil(math.sqrt(num_filters))\n", "\n", "    # Create figure with a grid of sub-plots.\n", "    fig, axes = plt.subplots(num_grids, num_grids)\n", "\n", "    # Plot all the filter-weights.\n", "    for i, ax in enumerate(axes.flat):\n", "        # Only plot the valid filter-weights.\n", "        if i