{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Networks Using Blocks (VGG)\n", "\n", ":label:`sec_vgg`\n", "\n", "\n", "While AlexNet proved that deep convolutional neural networks\n", "can achieve good results, it did not offer a general template\n", "to guide subsequent researchers in designing new networks.\n", "In the following sections, we will introduce several heuristic concepts\n", "commonly used to design deep networks.\n", "\n", "Progress in this field mirrors that in chip design\n", "where engineers went from placing transistors\n", "to logical elements to logic blocks.\n", "Similarly, the design of neural network architectures\n", "had grown progressively more abstract,\n", "with researchers moving from thinking in terms of\n", "individual neurons to whole layers,\n", "and now to blocks, repeating patterns of layers.\n", "\n", "The idea of using blocks first emerged from the\n", "[Visual Geometry Group](http://www.robots.ox.ac.uk/~vgg/) (VGG)\n", "at Oxford University,\n", "in their eponymously-named VGG network.\n", "It is easy to implement these repeated structures in code\n", "with any modern deep learning framework by using loops and subroutines.\n", "\n", "\n", "## VGG Blocks\n", "\n", "The basic building block of classic convolutional networks\n", "is a sequence of the following layers:\n", "(i) a convolutional layer\n", "(with padding to maintain the resolution),\n", "(ii) a nonlinearity such as a ReLU, (iii) a pooling layer such \n", "as a max pooling layer. \n", "One VGG block consists of a sequence of convolutional layers,\n", "followed by a max pooling layer for spatial downsampling.\n", "In the original VGG paper :cite:`Simonyan.Zisserman.2014`,\n", "the authors \n", "employed convolutions with $3\\times3$ kernels\n", "and $2 \\times 2$ max pooling with stride of $2$\n", "(halving the resolution after each block).\n", "In the code below, we define a function called `vggBlock`\n", "to implement one VGG block.\n", "The function takes two arguments\n", "corresponding to the number of convolutional layers `numConvs`\n", "and the number of output channels `numChannels`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%load ../utils/djl-imports\n", "%load ../utils/plot-utils\n", "%load ../utils/Training.java\n", "%load ../utils/Accumulator.java" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import ai.djl.basicdataset.cv.classification.*;\n", "import org.apache.commons.lang3.ArrayUtils;" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "attributes": { "classes": [], "id": "", "n": "1" } }, "outputs": [], "source": [ "public SequentialBlock vggBlock(int numConvs, int numChannels) {\n", "\n", " SequentialBlock tempBlock = new SequentialBlock();\n", " for (int i = 0; i < numConvs; i++) {\n", " // DJL has default stride of 1x1, so don't need to set it explicitly.\n", " tempBlock\n", " .add(Conv2d.builder()\n", " .setFilters(numChannels)\n", " .setKernelShape(new Shape(3, 3))\n", " .optPadding(new Shape(1, 1))\n", " .build()\n", " )\n", " .add(Activation::relu);\n", " }\n", " tempBlock.add(Pool.maxPool2dBlock(new Shape(2, 2), new Shape(2, 2)));\n", " return tempBlock;\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## VGG Network\n", "\n", "Like AlexNet and LeNet,\n", "the VGG Network can be partitioned into two parts:\n", "the first consisting mostly of convolutional and pooling layers\n", "and a second consisting of fully-connected layers.\n", "The convolutional portion of the net connects several `vggBlock` modules\n", "in succession.\n", "In :numref:`fig_vgg`, the variable `convArch` consists of a list of tuples (one per block),\n", "where each contains two values: the number of convolutional layers\n", "and the number of output channels,\n", "which are precisely the arguments requires to call\n", "the `vggBlock` function.\n", "The fully-connected module is identical to that covered in AlexNet.\n", "\n", "![Designing a network from building blocks](https://raw.githubusercontent.com/d2l-ai/d2l-en/master/img/vgg.svg)\n", "\n", ":width:`400px`\n", "\n", "\n", ":label:`fig_vgg`\n", "\n", "\n", "The original VGG network had 5 convolutional blocks,\n", "among which the first two have one convolutional layer each\n", "and the latter three contain two convolutional layers each.\n", "The first block has 64 output channels\n", "and each subsequent block doubles the number of output channels,\n", "until that number reaches $512$.\n", "Since this network uses $8$ convolutional layers\n", "and $3$ fully-connected layers, it is often called VGG-11." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "attributes": { "classes": [], "id": "", "n": "2" } }, "outputs": [], "source": [ "int[][] convArch = {{1, 64}, {1, 128}, {2, 256}, {2, 512}, {2, 512}};" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following code implements VGG-11. This is a simple matter of executing a for loop over `convArch`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "attributes": { "classes": [], "id": "", "n": "3" } }, "outputs": [], "source": [ "public SequentialBlock VGG(int[][] convArch) {\n", "\n", " SequentialBlock block = new SequentialBlock();\n", " // The convolutional layer part\n", " for (int i = 0; i < convArch.length; i++) {\n", " block.add(vggBlock(convArch[i][0], convArch[i][1]));\n", " }\n", "\n", " // The fully connected layer part\n", " block\n", " .add(Blocks.batchFlattenBlock())\n", " .add(Linear\n", " .builder()\n", " .setUnits(4096)\n", " .build())\n", " .add(Activation::relu)\n", " .add(Dropout\n", " .builder()\n", " .optRate(0.5f)\n", " .build())\n", " .add(Linear\n", " .builder()\n", " .setUnits(4096)\n", " .build())\n", " .add(Activation::relu)\n", " .add(Dropout\n", " .builder()\n", " .optRate(0.5f)\n", " .build())\n", " .add(Linear.builder().setUnits(10).build());\n", " \n", " return block;\n", "}\n", "\n", "SequentialBlock block = VGG(convArch);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we will construct a single-channel data example\n", "with a height and width of 224 to observe the output shape of each layer." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "attributes": { "classes": [], "id": "", "n": "4" } }, "outputs": [], "source": [ "float lr = 0.05f;\n", "Model model = Model.newInstance(\"vgg-display\");\n", "model.setBlock(block);\n", "\n", "Loss loss = Loss.softmaxCrossEntropyLoss();\n", "\n", "Tracker lrt = Tracker.fixed(lr);\n", "Optimizer sgd = Optimizer.sgd().setLearningRateTracker(lrt).build();\n", "\n", "DefaultTrainingConfig config = new DefaultTrainingConfig(loss).optOptimizer(sgd) // Optimizer (loss function)\n", " .optDevices(Engine.getInstance().getDevices(1)) // single GPU\n", " .addEvaluator(new Accuracy()) // Model Accuracy\n", " .addTrainingListeners(TrainingListener.Defaults.logging()); // Logging\n", "\n", "Trainer trainer = model.newTrainer(config);\n", "\n", "Shape inputShape = new Shape(1, 1, 224, 224);\n", "\n", "try(NDManager manager = NDManager.newBaseManager()) {\n", " NDArray X = manager.randomUniform(0f, 1.0f, inputShape);\n", " trainer.initialize(inputShape);\n", "\n", " Shape currentShape = X.getShape();\n", "\n", " for (int i = 0; i < block.getChildren().size(); i++) {\n", " Shape[] newShape = block.getChildren().get(i).getValue().getOutputShapes(new Shape[]{currentShape});\n", " currentShape = newShape[0];\n", " System.out.println(block.getChildren().get(i).getKey() + \" layer output : \" + currentShape);\n", " }\n", "}\n", "// save memory on VGG params\n", "model.close();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, we halve height and width at each block,\n", "finally reaching a height and width of 7\n", "before flattening the representations\n", "for processing by the fully-connected layer.\n", "\n", "## Model Training\n", "\n", "Since VGG-11 is more computationally-heavy than AlexNet\n", "we construct a network with a smaller number of channels.\n", "This is more than sufficient for training on Fashion-MNIST." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "attributes": { "classes": [], "id": "", "n": "5" } }, "outputs": [], "source": [ "int ratio = 4;\n", "\n", "for(int i=0; i < convArch.length; i++){\n", " convArch[i][1] = convArch[i][1] / ratio;\n", "}\n", "\n", "inputShape = new Shape(1, 1, 96, 96); // resize the input shape to save memory\n", "\n", "Model model = Model.newInstance(\"vgg-tiny\");\n", "SequentialBlock newBlock = VGG(convArch);\n", "model.setBlock(newBlock);\n", "Loss loss = Loss.softmaxCrossEntropyLoss();\n", "\n", "Tracker lrt = Tracker.fixed(lr);\n", "Optimizer sgd = Optimizer.sgd().setLearningRateTracker(lrt).build();\n", "\n", "DefaultTrainingConfig config = new DefaultTrainingConfig(loss).optOptimizer(sgd) // Optimizer (loss function)\n", " .optDevices(Engine.getInstance().getDevices(1)) // single GPU\n", " .addEvaluator(new Accuracy()) // Model Accuracy\n", " .addTrainingListeners(TrainingListener.Defaults.logging()); // Logging\n", "\n", "trainer = model.newTrainer(config);\n", "trainer.initialize(inputShape);" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "int batchSize = 128;\n", "int numEpochs = Integer.getInteger(\"MAX_EPOCH\", 10);\n", "\n", "double[] trainLoss;\n", "double[] testAccuracy;\n", "double[] epochCount;\n", "double[] trainAccuracy;\n", "\n", "epochCount = new double[numEpochs];\n", "\n", "for (int i = 0; i < epochCount.length; i++) {\n", " epochCount[i] = i+1;\n", "}\n", "\n", "FashionMnist trainIter = FashionMnist.builder()\n", " .addTransform(new Resize(96))\n", " .addTransform(new ToTensor())\n", " .optUsage(Dataset.Usage.TRAIN)\n", " .setSampling(batchSize, true)\n", " .optLimit(Long.getLong(\"DATASET_LIMIT\", Long.MAX_VALUE))\n", " .build();\n", "\n", "FashionMnist testIter = FashionMnist.builder()\n", " .addTransform(new Resize(96))\n", " .addTransform(new ToTensor())\n", " .optUsage(Dataset.Usage.TEST)\n", " .setSampling(batchSize, true)\n", " .optLimit(Long.getLong(\"DATASET_LIMIT\", Long.MAX_VALUE))\n", " .build();\n", "\n", "trainIter.prepare();\n", "testIter.prepare();\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Apart from using a slightly larger learning rate,\n", "the model training process is similar to that of AlexNet in the last section." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Map evaluatorMetrics = new HashMap<>();\n", "double avgTrainTimePerEpoch = Training.trainingChapter6(trainIter, testIter, numEpochs, trainer, evaluatorMetrics);" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "trainLoss = evaluatorMetrics.get(\"train_epoch_SoftmaxCrossEntropyLoss\");\n", "trainAccuracy = evaluatorMetrics.get(\"train_epoch_Accuracy\");\n", "testAccuracy = evaluatorMetrics.get(\"validate_epoch_Accuracy\");\n", "\n", "System.out.printf(\"loss %.3f,\", trainLoss[numEpochs - 1]);\n", "System.out.printf(\" train acc %.3f,\", trainAccuracy[numEpochs - 1]);\n", "System.out.printf(\" test acc %.3f\\n\", testAccuracy[numEpochs - 1]);\n", "System.out.printf(\"%.1f examples/sec\", trainIter.size() / (avgTrainTimePerEpoch / Math.pow(10, 9)));\n", "System.out.println();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![Contour Gradient Descent.](https://d2l-java-resources.s3.amazonaws.com/img/chapter_convolution-modern-cnn-VGG.png)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "String[] lossLabel = new String[trainLoss.length + testAccuracy.length + trainAccuracy.length];\n", "\n", "Arrays.fill(lossLabel, 0, trainLoss.length, \"train loss\");\n", "Arrays.fill(lossLabel, trainAccuracy.length, trainLoss.length + trainAccuracy.length, \"train acc\");\n", "Arrays.fill(lossLabel, trainLoss.length + trainAccuracy.length,\n", " trainLoss.length + testAccuracy.length + trainAccuracy.length, \"test acc\");\n", "\n", "Table data = Table.create(\"Data\").addColumns(\n", " DoubleColumn.create(\"epoch\", ArrayUtils.addAll(epochCount, ArrayUtils.addAll(epochCount, epochCount))),\n", " DoubleColumn.create(\"metrics\", ArrayUtils.addAll(trainLoss, ArrayUtils.addAll(trainAccuracy, testAccuracy))),\n", " StringColumn.create(\"lossLabel\", lossLabel)\n", ");\n", "\n", "render(LinePlot.create(\"\", data, \"epoch\", \"metrics\", \"lossLabel\"), \"text/html\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\n", "\n", "* VGG-11 constructs a network using reusable convolutional blocks. Different VGG models can be defined by the differences in the number of convolutional layers and output channels in each block.\n", "* The use of blocks leads to very compact representations of the network definition. It allows for efficient design of complex networks.\n", "* In their work Simonyan and Ziserman experimented with various architectures. In particular, they found that several layers of deep and narrow convolutions (i.e., $3 \\times 3$) were more effective than fewer layers of wider convolutions.\n", "\n", "## Exercises\n", "\n", "1. When printing out the dimensions of the layers we only saw 8 results rather than 11. Where did the remaining 3 layer informations go?\n", "1. Compared with AlexNet, VGG is much slower in terms of computation, and it also needs more GPU memory. Try to analyze the reasons for this.\n", "1. Try to change the height and width of the images in Fashion-MNIST from 224 to 96. What influence does this have on the experiments?\n", "1. Refer to Table 1 in :cite:`Simonyan.Zisserman.2014` to construct other common models, such as VGG-16 or VGG-19." ] } ], "metadata": { "kernelspec": { "display_name": "Java", "language": "java", "name": "java" }, "language_info": { "codemirror_mode": "java", "file_extension": ".jshell", "mimetype": "text/x-java-source", "name": "Java", "pygments_lexer": "java", "version": "14.0.2+12" } }, "nbformat": 4, "nbformat_minor": 2 }