\n", "- Batch Normalization for Convolutional Layers\n", " - Batch normalization occurs after the convolution computation and before the application of the activation function. \n", " - If the convolution computation outputs multiple channels, we need to carry out batch normalization for each of the outputs of these channels, and each channel has an independent scale parameter and shift parameter. \n", " - Assume that there are $m$ examples in the mini-batch. \n", " - On a single channel, we assume that the height and width of the convolution computation output are $p$ and $q$, respectively. \n", " - We need to carry out batch normalization for $m \\times p \\times q$ elements in this channel simultaneously. \n", " - While carrying out the standardization computation for these elements, we use the same mean and variance. \n", " - In other words, we use the means and variances of the $m \\times p \\times q$ elements in this channel rather than one per pixel.\n", "

\n", "- Batch Normalization During Prediction\n", " - At prediction time we might be required to make one prediction at a time. \n", " - $\\mathbf{\\mu}$ and $\\mathbf{\\sigma}$ arising from a minibatch are highly undesirable once we've trained the model. \n", " - One way to mitigate this is to compute more stable estimates on a larger set for once (e.g. via a moving average) and then fix them at prediction time.\n", " - Consequently, Batch Normalization behaves differently during training and test time." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.11.2 Implementation Starting from Scratch" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import gluonbook as gb\n", "from mxnet import autograd, gluon, init, nd\n", "from mxnet.gluon import nn\n", "\n", "def batch_norm(X, gamma, beta, moving_mean, moving_var, eps, momentum):\n", " # Use autograd to determine whether the current mode is training mode or prediction mode.\n", " if not autograd.is_training():\n", " # If it is the prediction mode, directly use the mean and variance obtained\n", " # from the incoming moving average.\n", " X_hat = (X - moving_mean) / nd.sqrt(moving_var + eps)\n", " else:\n", " assert len(X.shape) in (2, 4)\n", " if len(X.shape) == 2:\n", " # When using a fully connected layer, calculate the mean and variance on the feature dimension.\n", " mean = X.mean(axis=0)\n", " var = ((X - mean) ** 2).mean(axis=0)\n", " else:\n", " # When using a two-dimensional convolutional layer, calculate the mean\n", " # and variance on the channel dimension (axis=1). 
Here we need to maintain\n", " # the shape of X, so that the broadcast operation can be carried out later.\n", " mean = X.mean(axis=(0, 2, 3), keepdims=True)\n", " var = ((X - mean) ** 2).mean(axis=(0, 2, 3), keepdims=True)\n", " \n", " # In training mode, the current mean and variance are used for the standardization.\n", " X_hat = (X - mean) / nd.sqrt(var + eps)\n", "\n", " # Update the mean and variance of the moving average.\n", " moving_mean = momentum * moving_mean + (1.0 - momentum) * mean\n", " moving_var = momentum * moving_var + (1.0 - momentum) * var\n", "\n", " Y = gamma * X_hat + beta # Scale and shift.\n", " return Y, moving_mean, moving_var" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- `BatchNorm` retains the scale parameter `gamma` and the shift parameter `beta` involved in gradient finding and iteration\n", "- It also maintains the mean and variance obtained from the moving average, so that they can be used during model prediction. " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "class BatchNorm(nn.Block):\n", " def __init__(self, num_features, num_dims, **kwargs):\n", " super(BatchNorm, self).__init__(**kwargs)\n", " \n", " if num_dims == 2:\n", " shape = (1, num_features)\n", " else:\n", " shape = (1, num_features, 1, 1)\n", " \n", " # The scale parameter and the shift parameter involved in gradient finding and iteration are initialized to 0 and 1 respectively.\n", " self.gamma = self.params.get('gamma', shape=shape, init=init.One())\n", " self.beta = self.params.get('beta', shape=shape, init=init.Zero())\n", "\n", " # All the variables not involved in gradient finding and iteration are initialized to 0 on the CPU.\n", " self.moving_mean = nd.zeros(shape)\n", " self.moving_var = nd.zeros(shape)\n", "\n", " def forward(self, X):\n", " # If X is not on the CPU, copy moving_mean and moving_var to the device where X is located.\n", " if self.moving_mean.context != X.context:\n", " 
self.moving_mean = self.moving_mean.copyto(X.context)\n", " self.moving_var = self.moving_var.copyto(X.context)\n", " \n", " # Save the updated moving_mean and moving_var.\n", " Y, self.moving_mean, self.moving_var = batch_norm(\n", " X, \n", " self.gamma.data(), \n", " self.beta.data(), \n", " self.moving_mean,\n", " self.moving_var, \n", " eps=1e-5, \n", " momentum=0.9\n", " )\n", " return Y" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- The `num_features` parameter required by the `BatchNorm` instance is the number of outputs for a fully connected layer and the number of output channels for a convolutional layer. \n", "- The `num_dims` parameter also required by this instance is 2 for a fully connected layer and 4 for a convolutional layer." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.11.3 Use a Batch Normalization LeNet" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "net = nn.Sequential()\n", "net.add(\n", " nn.Conv2D(6, kernel_size=5),\n", " BatchNorm(6, num_dims=4),\n", " nn.Activation('sigmoid'),\n", " nn.MaxPool2D(pool_size=2, strides=2),\n", " nn.Conv2D(16, kernel_size=5),\n", " BatchNorm(16, num_dims=4),\n", " nn.Activation('sigmoid'),\n", " nn.MaxPool2D(pool_size=2, strides=2),\n", " nn.Dense(120),\n", " BatchNorm(120, num_dims=2),\n", " nn.Activation('sigmoid'),\n", " nn.Dense(84),\n", " BatchNorm(84, num_dims=2),\n", " nn.Activation('sigmoid'),\n", " nn.Dense(10)\n", ")" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "training on cpu(0)\n", "epoch 1, loss 0.6653, train acc 0.761, test acc 0.806, time 142.5 sec\n", "epoch 2, loss 0.3917, train acc 0.859, test acc 0.847, time 141.8 sec\n", "epoch 3, loss 0.3493, train acc 0.874, test acc 0.854, time 142.9 sec\n", "epoch 4, loss 0.3212, train acc 0.884, test acc 0.872, time 138.8 sec\n", "epoch 5, loss 0.3033, train acc 0.889, test acc 
0.861, time 136.9 sec\n" ] } ], "source": [ "lr = 1.0\n", "num_epochs = 5\n", "batch_size = 256\n", "\n", "ctx = gb.try_gpu()\n", "\n", "net.initialize(ctx=ctx, init=init.Xavier(), force_reinit=True)\n", "\n", "trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': lr})\n", "\n", "train_iter, test_iter = gb.load_data_fashion_mnist(batch_size)\n", "\n", "gb.train_ch5(net, train_iter, test_iter, batch_size, trainer, ctx, num_epochs)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(\n", " [1.3520054 1.3801662 1.8764832 1.4937813 0.93755937 1.8829043 ]\n", "
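The momentum update used in `batch_norm` can be checked in isolation. A small plain-Python sketch (the per-batch means below are made-up numbers) shows the moving estimate converging toward the typical batch statistic:

```python
momentum = 0.9
moving_mean = 0.0

# Suppose successive minibatches report these (hypothetical) per-batch means.
batch_means = [0.52, 0.48, 0.51, 0.49, 0.50] * 20

for mean in batch_means:
    # Same update rule as in batch_norm: keep 90% of the old estimate,
    # blend in 10% of the current batch statistic.
    moving_mean = momentum * moving_mean + (1.0 - momentum) * mean

print(moving_mean)  # close to the average batch mean of about 0.50
```

This is why the fixed `moving_mean` and `moving_var` are safe to use at prediction time: after enough training batches, they track the dataset-wide statistics rather than any single minibatch.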