{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Mini-batch Training from Foundations\n", "\n", "#### Last Time\n", "[Most recently](http://nbviewer.jupyter.org/github/jamesdellinger/fastai_deep_learning_course_part2_v3/blob/master/02_fully_connected_my_reimplementation.ipynb?flush_cache=true) we saw how to implement from scratch both the forward and backward passes of a neural network.\n", "\n", "After an extended focus on weight initialization, during which we saw how to derive the basic principles that underpin the now widely-used [Kaiming weight init](02_fully_connected_my_reimplementation.ipynb), we spent some time refactoring the code of our forward/backward passes. \n", "\n", "We learned that organizing this code into classes was more concise and interpretable (by human readers) than leaving the logic inside sundry, scattered methods. Finally, we wrapped things up by creating our own `Module()` class that's similar to PyTorch's `nn.Module`, and let our custom loss/linear layer/ReLU classes inherit from it.\n", "\n", "According to the rules we set for ourselves at the beginning of this course, we're free to use the PyTorch versions of all classes/functionalities we've thus far implemented from scratch.\n", "\n", "#### Minibatch Training\n", "Today we'll implement a model that supports another must-have feature of any deep learning model: the ability to train using mini-batches.\n", "\n", "Mini-batches allow us to update our model weights by leveraging the parallel processing capability of Nvidia GPUs to train on several inputs *at the same time*. 
\n", "\n", "This allows our model to complete a single pass through all the training samples in our dataset in *a much shorter amount of time* than if it had to train and update weights for each and every input in the training set, one at a time!\n", "\n", "#### What Components We Implement Here\n", "In building up to a model that can successfully train on mini-batches, we implement from scratch several other crucial components below. These include:\n", "* Cross entropy loss\n", "* Updating and registering model parameters\n", "* Optimizer classes\n", "* Dataset and Dataloader classes\n", "* Random Sampling\n", "* Setting aside a Validation Set\n", "\n", "#### Attribution\n", "Virtually all the code that appears in this notebook is the creation of [Sylvain Gugger](https://www.fast.ai/about/#sylvain) and [Jeremy Howard](https://www.fast.ai/about/#jeremy). The original version of this notebook that they made for the course lecture can be found [here](https://github.com/fastai/course-v3/blob/master/nbs/dl2/03_minibatch_training.ipynb). I simply re-typed, line-by-line, the pieces of logic necessary to implement the functionality that their notebook demonstrated. In some cases I changed the order of code cells and/or variable names so as to fit an organization and style that seemed more intuitive to me. Any and all mistakes are my own.\n", "\n", "On the other hand, all long-form text explanations in this notebook are solely my own creation. Writing extensive descriptions of the concepts and code in plain and simple English forces me to make sure that I actually understand how they work." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2\n", "%matplotlib inline\n", "%config InlineBackend.figure_format = 'retina'" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "#export\n", "from exports.nb_02 import *\n", "import torch.nn.functional as F" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Preparing the Data\n", "We continue to use the [MNIST](https://en.wikipedia.org/wiki/MNIST_database) dataset as a baseline to test the functionality and performance of all the classes we create from scratch." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "mpl.rcParams['image.cmap'] = 'gray'" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "x_train, y_train, x_valid, y_valid = get_data()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "n,m = x_train.shape # 50,000 images x 784 pixels per image\n", "c = int(y_train.max()) + 1 # number of classes in dataset\n", "nh = 50 # size of hidden layers" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "class Model(nn.Module):\n", "    def __init__(self, n_in, nh, n_out):\n", "        super().__init__()\n", "        self.layers = [nn.Linear(n_in,nh), nn.ReLU(), nn.Linear(nh, n_out)]\n", " \n", "    def __call__(self, x):\n", "        for l in self.layers: x = l(x)\n", "        return x" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "model = Model(m, nh, c)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "pred = model(x_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cross Entropy Loss\n", "\n", "In the previous notebook we used a quick-and-dirty mean squared error loss function just so we could have a simple loss function to 
use to test whether our model was correctly calculating weight gradients. Now, however, it's time to implement a loss function which is better tailored to the MNIST task, which entails predicting one class (out of ten total) to which a handwriting sample of a single-digit number most likely belongs.\n", "\n", "#### Log softmax\n", "To build cross entropy loss, we first calculate the softmax of our activations, $$\\textrm{softmax}(x)_{i} = \\frac{e^{x_{i}}}{e^{x_{0}} + e^{x_{1}} + ... + e^{x_{n-1}}}$$ or more concisely, $$\\textrm{softmax}(x)_{i} = \\frac{e^{x_{i}}}{\\sum_{0\\leq{j}\\leq{n-1}}e^{x_{j}}}$$\n", "\n", "Note that in practice, we need to take the log of the softmax in order to calculate the loss." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "def log_softmax(x): return (x.exp()/x.exp().sum(-1,keepdim=True)).log()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "tensor([[ 0.0887, -0.0349, -0.0814, ..., 0.0308, -0.0196, -0.0004],\n", " [ 0.0576, -0.0033, 0.0132, ..., 0.0507, -0.0592, 0.0061],\n", " [ 0.0516, -0.0581, 0.0250, ..., 0.0023, -0.0396, -0.0962],\n", " ...,\n", " [ 0.0841, -0.1409, -0.0611, ..., 0.0254, -0.0366, 0.0731],\n", " [-0.0209, -0.0425, 0.0053, ..., -0.0104, 0.0560, 0.1751],\n", " [ 0.0554, -0.1532, 0.0325, ..., 0.0359, -0.1168, 0.0629]],\n", " grad_fn=)" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pred" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "torch.Size([50000, 10])" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pred.shape" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "tensor([ 0.0887, -0.0349, -0.0814, -0.0429, 0.0312, 0.1822, 0.0856, 0.0308,\n", " -0.0196, -0.0004], grad_fn=)" ] }, "execution_count": 
12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pred[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our model generates, as predictions, a list of length 10 (the number of categories in MNIST) for each of the 50,000 input images. The problem is, if we just use these lists of ten predictions, we really have no standardized way of ascertaining and comparing the degree to which the model believes that a target image belongs to each of the ten categories.\n", "\n", "The softmax function we introduced just above, however, thankfully gives us a way to do this. Softmax will take the list of ten predictions for each image, and turn it into a list of ten probabilities that all sum to 1." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "softmax_pred = (pred.exp()/pred.exp().sum(-1,keepdim=True))" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "tensor([0.1064, 0.0940, 0.0898, 0.0933, 0.1004, 0.1168, 0.1061, 0.1004, 0.0955,\n", " 0.0973], grad_fn=)" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "softmax_pred[0]" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "tensor(1., grad_fn=)" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "softmax_pred[0].sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, in practice we use the log of the softmax instead of just softmax. Why? We like using logarithms because they have a nice property that lets us use subtraction instead of division. Avoiding division is one surefire way to make our loss calculation more numerically stable. [The answer here](https://discuss.pytorch.org/t/logsoftmax-vs-softmax/21386/4) gives a nice explanation." 
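, "\n", "\n", "To sketch how this plays out for softmax (this simply applies the identity $\\log{\\left(\\frac{a}{b}\\right)} = \\log{\\left(a\\right)} - \\log{\\left(b\\right)}$, with $\\log$ being the natural log, to the softmax formula above), the division turns into a subtraction: $$\\log{\\left(\\textrm{softmax}(x)_{i}\\right)} = x_{i} - \\log{\\left(\\sum_{0\\leq{j}\\leq{n-1}}e^{x_{j}}\\right)}$$\n", "\n", "As a rough sanity check on the numbers above, the softmax value at index 5 of the first prediction is 0.1168, and $\\log{\\left(0.1168\\right)} \\approx -2.147$."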
] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "logsoftmax_pred = log_softmax(pred)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "tensor([-2.2406, -2.3642, -2.4107, -2.3722, -2.2981, -2.1470, -2.2436, -2.2985,\n", " -2.3489, -2.3297], grad_fn=)" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "logsoftmax_pred[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### To one-hot encode or not to one-hot encode\n", "The cross entropy loss between a one-hot encoded target $x$ and a prediction $p(x)$ generated by a model is $$-\\sum_{i=1}^{n}{x_{i}\\log{\\left(p_{i}(x)\\right)}}$$ where each $i$ is one of the one-hot encoded label's $n$ total categories.\n", "\n", "In other words, it's the negative of the sum of the products of the values at all indices in the label list $x$ with the logs of the prediction probabilities at corresponding indices in the prediction list $p(x)$. Recall that each value in $x$ will be one or zero since the label is one-hot encoded. \n", "\n", "Here's a very intuitive [explanation](https://youtu.be/AcA8HAYh7IE?t=1925) using Excel.\n", "\n", "Here's a nice, concrete example. Remember that the first training sample's ground truth label is the '5' digit:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "tensor(5)" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "image/png": { "height": 250, "width": 253 }, "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.imshow(x_train[0].view(28,28))\n", "y_train[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The one-hot encoded label, $x$, for this image is:" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "tensor([0., 0., 0., 0., 0., 1., 0., 0., 0., 0.])" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "label = tensor([0, 0, 0, 0, 0, 1, 0, 0, 0, 0]).float()\n", "label" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we just saw above, the log softmax predictions $p(x)$ are:" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "tensor([-2.2406, -2.3642, -2.4107, -2.3722, -2.2981, -2.1470, -2.2436, -2.2985,\n", " -2.3489, -2.3297])" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "prediction = logsoftmax_pred[0].detach()\n", "prediction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "According to the formula for cross entropy loss, here's how we calculate the loss for the model's prediction for this image:" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "tensor(2.1470)" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "-(label * prediction).sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now you've probably already noticed that our training labels *aren't* actually one-hot encoded:" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "tensor([5, 0, 4, ..., 8, 4, 8])" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_train" ] }, { "cell_type": "code", 
"execution_count": 62, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "torch.Size([50000])" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_train.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Indeed, each image's label is just a single integer that indicates the correct label for the digit depicted in each image. It turns out that not only do these integers represent actual digit names (which is a nice side-effect of MNIST only having 10 categories), but they also represent the *index* of the correct digit in a one-hot encoded label. i.e. our first training image is a '5' and so there is a one at index 5 in its one-hot encoded label:\n", "\n", "```\n", "[0, 0, 0, 0, 0, 1, 0, 0, 0, 0]\n", "```\n", "\n", "When we summed our products in our first attempt at calculating cross entropy loss for the first training image above, you probably noticed that the results of nine of the ten products were *zero*. At this point it should be clear to see that we might as well not waste time computing products that are gonna be zero anyhow.\n", "\n", "Indeed, why don't we just calculate the one product that *we know* will return a non-zero value, and not bother with the nine other products and also not bother adding a bunch of zeros to the only non-zero product? \n", "\n", "We can use [numpy-style integer array indexing](https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.indexing.html) to accomplish this. 
It's as simple as indexing into the prediction list for a single image and grabbing the log softmax value that sits at the index that we know represents the ground truth category of the image.\n", "\n", "For example, once again, here's the ground truth label of the first training sample:" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "tensor(5)" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "correct_label_index = y_train[0]\n", "correct_label_index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And here's the log softmax value at index 5 of our model's prediction for this training image, which is the index that corresponds to the correct ground truth label (the number 5, out of numbers 0 through 9, inclusive)." ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "tensor(-2.1470, grad_fn=)" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "logsoftmax_pred[0][correct_label_index]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we take this approach, we can rewrite cross-entropy loss in a much simpler way. It's just $$-\\log{\\left(p_{i}(x)\\right)}$$ where $i$ is the index belonging to the target's ground truth class.\n", "\n", "Using this formula, here's the categorical cross entropy loss for the first training image:" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "tensor(2.1470, grad_fn=)" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "-logsoftmax_pred[0][correct_label_index]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How wonderful is that! It's the exact same value we got when we summed the ten products of the prediction and its one-hot encoded label, but without all the needless arithmetic. 
When you're training for hundreds of epochs over potentially millions of images, any optimizations that speed things up on a per-image basis can substantially decrease overall training time!\n", "\n", "Note that it's also easy to do this for several images at once. Here's how we'd index into our predictions to find the log softmax values at the appropriate indices for the first three training images:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "tensor([5, 0, 4])" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_train[:3]" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "tensor([-2.3642, -2.2576, -2.2754], grad_fn=)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "logsoftmax_pred[[0,1,2], [5,0,4]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above indexing mechanism grabs the first three predictions, or lists of length ten, from our tensor that contains these lists of training image category predictions. Then, we go into each of these three lists to grab the log softmax value that is sitting at the index that corresponds to the ground truth category of the image. The first training image is a '5' digit, so we want the log softmax value that's sitting at index 5. The second training image is a '0' digit, so we want to get the log softmax value that is sitting at index 0 of its prediction list of ten log softmax values. And so on and so forth.\n", "\n", "#### Negative log-likelihood\n", "Now that we've obtained the log softmax of our predictions we're ready to compute the actual cross-entropy loss.\n", "\n", "The negative log-likelihood function is how we do that." 
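, "\n", "\n", "As a rough sketch of what that function should compute, for a batch of $N$ images where $y_{k}$ is the ground truth class index of image $k$ and $p^{(k)}$ is its list of softmax probabilities (the symbols $N$, $y_{k}$ and $p^{(k)}$ are notation assumed here for illustration), the loss is the mean of the per-image losses: $$-\\frac{1}{N}\\sum_{k=0}^{N-1}\\log{\\left(p^{(k)}_{y_{k}}\\right)}$$"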
] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "def nll(input, target): return -input[range(target.shape[0]), target].mean()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "tensor(2.3182, grad_fn=)" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "loss = nll(logsoftmax_pred, y_train)\n", "loss" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before we proceed further, let's pause and remember that thanks to a property of logarithms, we can rewrite the cross entropy function in a more computationally efficient structure (we get rid of the division operation): $$\\log{\\left(\\frac{a}{b}\\right)} = \\log{\\left(a\\right)} - \\log{\\left(b\\right)}$$ \n", "\n", "Let's write a new version of `log_softmax()` that takes advantage of this." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "def log_softmax(x): return x - x.exp().sum(-1,keepdim=True).log()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's ensure that this refactoring is computationally accurate:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "test_near(nll(log_softmax(pred), y_train), loss)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We're almost done, but there's one more really helpful tweak that we should build in. 
\n", "\n", "The [LogSumExp trick](https://en.wikipedia.org/wiki/LogSumExp) lets us use the following formula to compute the sum of exponentials in a more stable manner: $$\\log{\\left(\\sum_{j=1}^{n}e^{x_{j}}\\right)} = \\log{\\left(e^{a}\\sum_{j=1}^{n}e^{x_{j}-a}\\right)} = a + \\log{\\left(\\sum_{j=1}^{n}e^{x_{j}-a}\\right)}$$ where $a$ is the maximum of all $x_{j}$.\n", "\n", "Given that to calculate cross entropy we have to take a sum of exponential terms (as evidenced by the \n", "```\n", "x.exp().sum(-1,keepdim=True)\n", "``` \n", "portion of the above `log_softmax()` function), implementing a revised version of our `log_softmax()` function that uses this trick will ensure we avoid an overflow if we have to take the exponential of a big activation." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "def logsumexp(x):\n", "    m = x.max(-1)[0] # take the max along the last dimension of the tensor\n", "    return m + (x - m[:,None]).exp().sum(-1).log()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "PyTorch has `logsumexp` as a built-in method so let's compare it to ours:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "test_near(logsumexp(pred), # ours\n", " pred.logsumexp(-1) # PyTorch's\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's our final refactored `log_softmax()` with `logsumexp`:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "def log_softmax(x): return x - x.logsumexp(-1,keepdim=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Verify that our latest log_softmax refactoring is still correct:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "test_near(nll(log_softmax(pred), y_train), loss)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Verify the same for PyTorch's own functions:" ] }, { 
"cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "test_near(F.nll_loss(F.log_softmax(pred, -1), y_train), loss)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "PyTorch combines `F.nll_loss` and `F.log_softmax` into one optimized function called `F.cross_entropy`. " ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "test_near(F.cross_entropy(pred, y_train), loss)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creating a Basic Training Loop\n", "\n", "To complete one iteration of the training loop, we must be able to perform the following:\n", "1. Get the output of the model on **a batch** of inputs.\n", "2. Compare the output to the labels and compute a loss.\n", "3. Calculate the gradients of the loss with respect to every parameter in the model.\n", "4. Update model parameters with their gradients in order to make the parameters a little bit better.\n", "\n", "Below we implement each of these steps in successive lines of code. Further down in this notebook we will see how to refactor into specific classes that manage tasks like storing a dataset, loading the data, etc." 
] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "loss_func = F.cross_entropy" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "#export\n", "def accuracy(out, yb): return (torch.argmax(out, dim=1)==yb).float().mean()" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(tensor([ 0.0942, 0.0491, -0.0652, -0.1084, 0.0840, -0.0479, -0.0238, 0.0033,\n", " 0.0379, 0.0908], grad_fn=), torch.Size([64, 10]))" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bs = 64 # batch size\n", " \n", "xb = x_train[0:bs] # a mini-batch from inputs x \n", "preds = model(xb) # predictions on items in the mini-batch\n", "preds[0], preds.shape" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "tensor(2.3140, grad_fn=)" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "yb = y_train[0:bs]\n", "loss_func(preds, yb)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "tensor(0.1406)" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "accuracy(preds, yb)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "lr = 0.5 # learning rate\n", "epochs = 1 # number of epochs to train for" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "for epoch in range(epochs):\n", "    for i in range((n-1)//bs + 1):\n", "        start_i = i*bs\n", "        end_i = start_i+bs\n", "        xb = x_train[start_i:end_i]\n", "        yb = y_train[start_i:end_i]\n", "        loss = loss_func(model(xb), yb)\n", " \n", "        loss.backward()\n", "        with torch.no_grad():\n", "            for l in model.layers:\n", "                if hasattr(l, 'weight'):\n", "                    l.weight -= l.weight.grad * lr\n", "                    l.bias -= l.bias.grad * lr\n", "                    l.weight.grad.zero_()\n", "                    l.bias.grad.zero_()" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(tensor(0.2697, grad_fn=), tensor(0.9375))" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "loss_func(model(xb), yb), accuracy(model(xb), yb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Updating model.parameters\n", "In the training loop that we wrote above, layer weights and biases were manually updated and then zeroed out. Instead, we can write our model class so that its layers are stored as attributes (`self.l1` and `self.l2`), which lets us update all the model's trainable parameters after each backward pass by looping over `model.parameters()`." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "class Model(nn.Module):\n", "    def __init__(self, n_in, nh, n_out):\n", "        super().__init__()\n", "        self.l1 = nn.Linear(n_in, nh)\n", "        self.l2 = nn.Linear(nh, n_out)\n", " \n", "    def __call__(self, x): return self.l2(F.relu(self.l1(x)))" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "model = Model(m,nh,10)" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "l1: Linear(in_features=784, out_features=50, bias=True)\n", "l2: Linear(in_features=50, out_features=10, bias=True)\n" ] } ], "source": [ "for name, l in model.named_children(): print(f'{name}: {l}')" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Model(\n", " (l1): Linear(in_features=784, out_features=50, bias=True)\n", " (l2): Linear(in_features=50, out_features=10, bias=True)\n", ")" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { 
"data": { "text/plain": [ "Linear(in_features=784, out_features=50, bias=True)" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.l1" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "def fit():\n", "    for epoch in range(epochs):\n", "        for i in range((n-1)//bs + 1):\n", "            start_i = i*bs\n", "            end_i = start_i+bs\n", "            xb = x_train[start_i:end_i]\n", "            yb = y_train[start_i:end_i]\n", "            loss = loss_func(model(xb), yb)\n", " \n", "            loss.backward()\n", "            with torch.no_grad():\n", "                for p in model.parameters(): p -= p.grad*lr\n", "                model.zero_grad()" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(tensor(0.1160, grad_fn=), tensor(0.9375))" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fit()\n", "loss_func(model(xb), yb), accuracy(model(xb), yb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How does PyTorch know what the model's parameters are? 
It overrides the `__setattr__` function inside the `nn.Module` class in order to register as model parameters the weights and biases inside the submodules (`self.l1`, `self.l2`) that were defined in the model's class.\n", "\n", "Here's a sample dummy module that mocks up what's going on in `nn.Module`:" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "class DummyModule():\n", "    def __init__(self, n_in, nh, n_out):\n", "        self._modules = {}\n", "        self.l1 = nn.Linear(n_in,nh)\n", "        self.l2 = nn.Linear(nh,n_out)\n", " \n", "    def __setattr__(self,k,v):\n", "        if not k.startswith(\"_\"): self._modules[k] = v\n", "        super().__setattr__(k,v)\n", " \n", "    def __repr__(self): return f'{self._modules}'\n", " \n", "    def parameters(self):\n", "        for l in self._modules.values():\n", "            for p in l.parameters(): yield p" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'l1': Linear(in_features=784, out_features=50, bias=True), 'l2': Linear(in_features=50, out_features=10, bias=True)}" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dummy_mdl = DummyModule(m,nh,10)\n", "dummy_mdl" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[torch.Size([50, 784]),\n", " torch.Size([50]),\n", " torch.Size([10, 50]),\n", " torch.Size([10])]" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[o.shape for o in dummy_mdl.parameters()]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Registering Modules\n", "\n", "For deeper models, it's obviously going to be a hassle to declare a `self.` variable for each and every layer in the model. It's probably more convenient to just pass in a list that contains all the layers. E.g. 
something like\n", "```\n", "layers = [nn.Linear(m,nh), nn.ReLU(), nn.Linear(nh,10)]\n", "self.layers = layers\n", "```\n", "\n", "However, in order to do this we have to manually register the modules because `nn.Module` won't automatically do so for layers stored in a plain Python list." ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "layers = [nn.Linear(m,nh), nn.ReLU(), nn.Linear(nh,10)]" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "class Model(nn.Module):\n", "    def __init__(self, layers):\n", "        super().__init__()\n", "        self.layers = layers\n", "        for i,l in enumerate(self.layers): self.add_module(f'layer_{i}', l)\n", " \n", "    def __call__(self,x):\n", "        for l in self.layers: x = l(x)\n", "        return x" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "model = Model(layers)" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Model(\n", " (layer_0): Linear(in_features=784, out_features=50, bias=True)\n", " (layer_1): ReLU()\n", " (layer_2): Linear(in_features=50, out_features=10, bias=True)\n", ")" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### nn.ModuleList and nn.Sequential\n", "\n", "Thankfully both the `nn.ModuleList` and `nn.Sequential` classes can help us do this.\n", "\n", "`nn.ModuleList` holds the layers in a list-like container. 
This object automatically registers all layers.\n", "\n", "Here's a home-grown clone of `nn.Sequential` that depicts how `nn.ModuleList` is used:" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "class SequentialModel(nn.Module):\n", "    def __init__(self, layers):\n", "        super().__init__()\n", "        self.layers = nn.ModuleList(layers)\n", " \n", "    def __call__(self, x):\n", "        for l in self.layers: x = l(x)\n", "        return x" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [ "model = SequentialModel(layers)" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "SequentialModel(\n", " (layers): ModuleList(\n", " (0): Linear(in_features=784, out_features=50, bias=True)\n", " (1): ReLU()\n", " (2): Linear(in_features=50, out_features=10, bias=True)\n", " )\n", ")" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(tensor(0.0609, grad_fn=), tensor(1.))" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fit()\n", "loss_func(model(xb), yb), accuracy(model(xb), yb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since `nn.Sequential` already does all of the above on its own, we can just use it going forward:" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [], "source": [ "model = nn.Sequential(nn.Linear(m,nh), nn.ReLU(), nn.Linear(nh,10))" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(tensor(0.2535, grad_fn=), tensor(0.9375))" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fit()\n", "loss_func(model(xb), yb), accuracy(model(xb), yb)" ] }, { "cell_type": "code", "execution_count": 52, 
"metadata": {}, "outputs": [ { "data": { "text/plain": [ "Sequential(\n", " (0): Linear(in_features=784, out_features=50, bias=True)\n", " (1): ReLU()\n", " (2): Linear(in_features=50, out_features=10, bias=True)\n", ")" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(tensor(0.0282, grad_fn=), tensor(1.))" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fit()\n", "loss_func(model(xb), yb), accuracy(model(xb), yb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Refactoring the Optimizer\n", "\n", "In our training loops above we manually coded the optimization step\n", "```\n", "with torch.no_grad():\n", "    for p in model.parameters(): p -= p.grad*lr\n", "    model.zero_grad()\n", "```\n", "\n", "We can refactor this logic into our own `Optimizer` class, which can be much more concisely called from our training loop:\n", "```\n", "opt.step()\n", "opt.zero_grad()\n", "```" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [], "source": [ "class Optimizer():\n", "    def __init__(self, params, lr=0.5):\n", "        self.params, self.lr = list(params), lr\n", " \n", "    def step(self):\n", "        with torch.no_grad():\n", "            for p in self.params: p -= p.grad*self.lr\n", " \n", "    def zero_grad(self):\n", "        for p in self.params: p.grad.data.zero_()" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [], "source": [ "model = nn.Sequential(nn.Linear(m,nh), nn.ReLU(), nn.Linear(nh,10))" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [], "source": [ "opt = Optimizer(model.parameters())" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [], "source": [ "for epoch in range(epochs):\n", "    for i in range((n-1)//bs + 1):\n", "        start_i = i*bs\n", "        end_i = start_i+bs\n", " 
xb = x_train[start_i:end_i]\n", "        yb = y_train[start_i:end_i]\n", "        pred = model(xb)\n", "        loss = loss_func(pred,yb)\n", "\n", "        loss.backward()\n", "        opt.step()\n", "        opt.zero_grad()" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(tensor(0.1486, grad_fn=), tensor(0.9375))" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "loss, acc = loss_func(model(xb), yb), accuracy(model(xb), yb)\n", "loss, acc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "PyTorch's own `optim.SGD` class performs the same basic update as our home-grown `Optimizer` class, with the exception that `optim.SGD` also supports extras such as momentum and weight decay." ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [], "source": [ "#export\n", "from torch import optim" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [], "source": [ "def get_model():\n", "    model = nn.Sequential(nn.Linear(m,nh), nn.ReLU(), nn.Linear(nh,10))\n", "    return model, optim.SGD(model.parameters(), lr=lr)" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "tensor(2.3366, grad_fn=)" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model, opt = get_model()\n", "loss_func(model(xb), yb)" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [], "source": [ "for epoch in range(epochs):\n", "    for i in range((n-1)//bs + 1):\n", "        start_i = i*bs\n", "        end_i = start_i+bs\n", "        xb = x_train[start_i:end_i]\n", "        yb = y_train[start_i:end_i]\n", "        pred = model(xb)\n", "        loss = loss_func(pred, yb)\n", "\n", "        loss.backward()\n", "        opt.step()\n", "        opt.zero_grad()" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(tensor(0.1586, grad_fn=), tensor(0.9375))" ] }, "execution_count": 63, "metadata": {}, "output_type": 
"execute_result" } ], "source": [ "loss, acc = loss_func(model(xb), yb), accuracy(model(xb), yb)\n", "loss, acc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Don't be afraid to include random tests such as this one right below. Although there may well be times when accuracy would dip below `0.7` (due to randomness), having checks like this interspersed throughout your code does much more good than harm, on the whole." ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [], "source": [ "assert acc>0.7" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dataset\n", "\n", "In our early crack at coding up a training loop, we iterated through minibatches of `x` and `y` values separately:\n", "```\n", "xb = x_train[start_i:end_i]\n", "yb = y_train[start_i:end_i]\n", "```\n", "If, however, we create a `Dataset` class to hold our inputs and labels, we can accomplish those steps at once:" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [], "source": [ "#export\n", "class Dataset():\n", " def __init__(self, x, y): self.x, self.y = x,y\n", " def __len__(self): return len(self.x)\n", " def __getitem__(self, i): return self.x[i], self.y[i]" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [], "source": [ "train_ds, valid_ds = Dataset(x_train, y_train), Dataset(x_valid, y_valid)\n", "assert len(train_ds)==len(x_train)\n", "assert len(valid_ds)==len(x_valid)" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(tensor([[0., 0., 0., ..., 0., 0., 0.],\n", " [0., 0., 0., ..., 0., 0., 0.],\n", " [0., 0., 0., ..., 0., 0., 0.],\n", " [0., 0., 0., ..., 0., 0., 0.],\n", " [0., 0., 0., ..., 0., 0., 0.]]), tensor([5, 0, 4, 1, 9]))" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "xb, yb = train_ds[0:5]\n", "assert xb.shape==(5,28*28)\n", "assert yb.shape==(5,)\n", "xb, yb" ] 
}, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [], "source": [ "model, opt = get_model()" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [], "source": [ "for epoch in range(epochs):\n", "    for i in range((n-1)//bs + 1):\n", "        xb,yb = train_ds[i*bs: i*bs+bs]\n", "        pred = model(xb)\n", "        loss = loss_func(pred, yb)\n", "\n", "        loss.backward()\n", "        opt.step()\n", "        opt.zero_grad()" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(tensor(0.2065, grad_fn=), tensor(0.9375))" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "loss,acc = loss_func(model(xb), yb), accuracy(model(xb), yb)\n", "assert acc>0.7\n", "loss, acc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### DataLoader\n", "\n", "Our first crack at coding a training loop explicitly iterated over batches using a for-loop that kept track of specific indices. \n", "```\n", "for i in range((n-1)//bs + 1):\n", "    xb,yb = train_ds[i*bs: i*bs+bs]\n", "```\n", "\n", "Creating a `DataLoader` class will allow for a more concise implementation thanks to the inclusion of a generator that automatically yields the next batch as soon as needed." 
] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [], "source": [ "class DataLoader():\n", "    def __init__(self,ds,bs): self.ds, self.bs = ds,bs\n", "    def __iter__(self):\n", "        for i in range(0, len(self.ds), self.bs): yield self.ds[i:i+self.bs]" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [], "source": [ "train_dl = DataLoader(train_ds, bs)\n", "valid_dl = DataLoader(valid_ds, bs)" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [], "source": [ "xb,yb = next(iter(valid_dl))\n", "assert xb.shape==(bs,28*28)\n", "assert yb.shape==(bs,)" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "tensor(3)" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "image/png": { "height": 250, "width": 253 }, "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.imshow(xb[0].view(28,28))\n", "yb[0]" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [], "source": [ "model,opt = get_model()" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [], "source": [ "def fit():\n", " for epoch in range(epochs):\n", " for xb,yb in train_dl:\n", " pred = model(xb)\n", " loss = loss_func(pred, yb)\n", " loss.backward()\n", " opt.step()\n", " opt.zero_grad()" ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(tensor(0.0822, grad_fn=), tensor(1.))" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fit()\n", "loss,acc = loss_func(model(xb), yb), accuracy(model(xb), yb)\n", "assert acc>0.7\n", "loss,acc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sampling Should be Random\n", "\n", "When training:\n", "* training set should be in *random* order\n", "* that order should differ on each iteration\n", "\n", "However for validation:\n", "* validation set should *never* be randomized" ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [], "source": [ "class Sampler():\n", " def __init__(self, ds, bs, shuffle=False):\n", " self.n, self.bs, self.shuffle = len(ds), bs, shuffle\n", " \n", " def __iter__(self):\n", " self.idxs = torch.randperm(self.n) if self.shuffle else torch.arange(self.n)\n", " for i in range(0, self.n, self.bs): yield self.idxs[i: i+self.bs]" ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[tensor([0, 1, 2]), tensor([3, 4, 5]), tensor([6, 7, 8]), tensor([9])]" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "small_ds = Dataset(*train_ds[:10])\n", "s = Sampler(small_ds, 3, 
False)\n", "[o for o in s]" ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[tensor([1, 9, 0]), tensor([8, 2, 3]), tensor([5, 7, 6]), tensor([4])]" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = Sampler(small_ds, 3, True)\n", "[o for o in s]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looks pretty random. Good.\n", "\n", "Let's rewrite our `DataLoader` to take advantage of it." ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [], "source": [ "def collate(b):\n", "    xs,ys = zip(*b)\n", "    return torch.stack(xs), torch.stack(ys)\n", "\n", "class DataLoader():\n", "    def __init__(self, ds, sampler, collate_fn=collate):\n", "        self.ds, self.sampler, self.collate_fn = ds, sampler, collate_fn\n", "\n", "    def __iter__(self):\n", "        for s in self.sampler: yield self.collate_fn([self.ds[i] for i in s])" ] }, { "cell_type": "code", "execution_count": 82, "metadata": {}, "outputs": [], "source": [ "train_samp = Sampler(train_ds, bs, shuffle=True)\n", "valid_samp = Sampler(valid_ds, bs, shuffle=False)" ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [], "source": [ "train_dl = DataLoader(train_ds, sampler=train_samp, collate_fn=collate)\n", "valid_dl = DataLoader(valid_ds, sampler=valid_samp, collate_fn=collate)" ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "tensor(3)" ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "image/png": { "height": 250, "width": 253 }, "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "xb, yb = next(iter(valid_dl))\n", "plt.imshow(xb[0].view(28,28))\n", "yb[0]" ] }, { "cell_type": "code", "execution_count": 85, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "tensor(5)" ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "image/png": { "height": 250, "width": 253 }, "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "xb, yb = next(iter(train_dl))\n", "plt.imshow(xb[0].view(28,28))\n", "yb[0]" ] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "tensor(9)" ] }, "execution_count": 86, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "image/png": { "height": 250, "width": 253 }, "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "xb, yb = next(iter(train_dl))\n", "plt.imshow(xb[0].view(28,28))\n", "yb[0]" ] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(tensor(0.1490, grad_fn=), tensor(0.9375))" ] }, "execution_count": 87, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model, opt = get_model()\n", "fit()\n", "loss, acc = loss_func(model(xb), yb), accuracy(model(xb), yb)\n", "assert acc>0.7\n", "loss, acc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### PyTorch's DataLoader\n", "\n", "PyTorch has its own `DataLoader`, `RandomSampler` (for training), and `SequentialSampler` (for validation) classes and we can use them to create our train/valid dataloaders like so:" ] }, { "cell_type": "code", "execution_count": 88, "metadata": {}, "outputs": [], "source": [ "#export\n", "from torch.utils.data import DataLoader, SequentialSampler, RandomSampler" ] }, { "cell_type": "code", "execution_count": 89, "metadata": {}, "outputs": [], "source": [ "train_dl = DataLoader(train_ds, bs, sampler=RandomSampler(train_ds), collate_fn=collate)\n", "valid_dl = DataLoader(valid_ds, bs, sampler=SequentialSampler(valid_ds), collate_fn=collate)" ] }, { "cell_type": "code", "execution_count": 90, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(tensor(0.1246, grad_fn=), tensor(0.9531))" ] }, "execution_count": 90, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model,opt = get_model()\n", "fit()\n", "loss_func(model(xb), yb), accuracy(model(xb), yb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "PyTorch's defaults work fine in most cases. Note that if we pass `num_workers` to PyTorch's `DataLoader`, PyTorch will use multiple threads to call the `Dataset`." 
] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [], "source": [ "train_dl = DataLoader(train_ds, bs, shuffle=True, drop_last=True)\n", "valid_dl = DataLoader(valid_ds, bs, shuffle=False)" ] }, { "cell_type": "code", "execution_count": 92, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(tensor(0.1928, grad_fn=), tensor(0.9531))" ] }, "execution_count": 92, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model, opt = get_model()\n", "fit()\n", "\n", "loss, acc = loss_func(model(xb), yb), accuracy(model(xb), yb)\n", "assert acc>0.7\n", "loss, acc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Setting aside a Validation Set\n", "\n", "We should **always** have a [validation set](http://www.fast.ai/2017/11/13/validation-sets/) in order to identify whether or not at some point during training our model begins to overfit.\n", "\n", "We'll write a training loop once more below, and print out the validation loss at the end of each epoch.\n", "\n", "Note that with PyTorch, you should be sure to call `model.train()` *before* training and then call `model.eval()` *before* inference. The reason is that the `nn.BatchNorm2d` and `nn.Dropout` layers' behavior is different depending on whether training or inference is being performed!" 
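, "\n",
"\n",
"To see the difference between the two modes, here is a minimal sketch that uses a standalone `nn.Dropout` layer. The names `drop`, `x`, `out_train`, and `out_eval` are hypothetical and are not part of this notebook's model; the only point is that the same layer produces different outputs depending on the mode it is in:\n",
"```python\n",
"import torch\n",
"from torch import nn\n",
"\n",
"torch.manual_seed(0)\n",
"drop = nn.Dropout(p=0.5)  # hypothetical standalone layer, for illustration only\n",
"x = torch.ones(1000)\n",
"\n",
"drop.train()              # training mode: each element is zeroed with probability p\n",
"out_train = drop(x)       # surviving elements are scaled by 1/(1-p) = 2\n",
"\n",
"drop.eval()               # inference mode: dropout passes the input through unchanged\n",
"out_eval = drop(x)\n",
"\n",
"print(out_train.unique(), torch.equal(out_eval, x))\n",
"```\n",
"\n",
"Calling `model.train()` or `model.eval()` on a whole model flips this same switch on every registered submodule, which is why the `fit` function below calls both."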
] }, { "cell_type": "code", "execution_count": 93, "metadata": {}, "outputs": [], "source": [ "def fit(epochs, model, loss_func, opt, train_dl, valid_dl):\n", "    for epoch in range(epochs):\n", "        model.train() # Handle proper execution of bn and dropout at training.\n", "        for xb, yb in train_dl:\n", "            loss = loss_func(model(xb), yb)\n", "            loss.backward()\n", "            opt.step()\n", "            opt.zero_grad()\n", "\n", "        model.eval() # Handle proper execution of bn and dropout at inference.\n", "        with torch.no_grad():\n", "            tot_loss, tot_acc = 0., 0.\n", "            for xb,yb in valid_dl:\n", "                pred = model(xb)\n", "                tot_loss += loss_func(pred, yb)\n", "                tot_acc += accuracy(pred, yb)\n", "        nv = len(valid_dl)\n", "        print(epoch, tot_loss/nv, tot_acc/nv)\n", "    return tot_loss/nv, tot_acc/nv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A question to think about: Will the validation metrics printed out here still be correct if batch size varies?\n", "\n", "The answer is no. Because of the way the validation loss/accuracy are calculated, the metrics will be incorrect if batch size varies. `tot_loss` and `tot_acc` are incremented by each batch's average loss/accuracy. After all batches, they are divided by the number of batches to get the average val loss/accuracy for the entire epoch.\n", "```\n", "for xb,yb in valid_dl:\n", "    pred = model(xb)\n", "    tot_loss += loss_func(pred, yb)\n", "    tot_acc += accuracy(pred, yb)\n", "    .\n", "    .\n", "    .\n", "return tot_loss/nv, tot_acc/nv\n", "```\n", "If batch size varies, and say the final batch is smaller than all the others, the loss/acc of the final batch will be *over*-weighted in the epoch's loss/acc metrics.\n", "\n", "Why? Say all batches are size 64 except for the final batch, which is 16. Every batch before the final one has an avg loss/acc (for that batch) that's divided by 64. The final batch's avg loss/acc is only divided by 16. 
\n", "\n", "Now, by averaging the average loss/acc of *each* of the individual batches over the total number of batches, our epoch loss/acc is essentially assuming that each batch's avg loss/acc is calculated using the same sized denominator (batch size). In other words, `tot_loss`/`tot_acc` assumes that each pred/label pair (or each batch's average loss/acc) contributes equally to the overall epoch average loss/acc. However, we know that for the final batch, this isn't true. The denominator (batch size) is only 16 and because this isn't compensated for, the pred/label pairs in the final batch disproportionately sway the overall average epoch loss/acc.\n", "\n", "Here's a simple example to illustrate what's going on.\n", "\n", "Say we have three batches, with the first two of size 10 and the last of size 5, and that they have the following accuracies:\n", "$$\\frac{5}{10}, \\frac{5}{10}, \\frac{2}{5}$$ We would expect that the total combined accuracy of all samples would be: $$\\frac{5+5+2}{10+10+5} = \\frac{12}{25}$$ However, if we calculate their combined average accuracy using the approach we used to calculate `tot_acc`, we'd calculate a combined average accuracy of $$\\frac{\\frac{5}{10} + \\frac{5}{10} + \\frac{2}{5}}{3} = \\frac{\\frac{14}{10}}{3} = \\frac{14}{30}$$ Immediately we notice that $\\frac{14}{30}$ is *less than* $\\frac{12}{25}$. \n", "\n", "In other words, the 3rd batch's lower accuracy is exerting *too large* an effect on the overall average accuracy calculation -- the result is slightly lower than it ought to be. 
The batch's accuracy of $\\frac{2}{5}$ should only account for $\\frac{5}{25}$ of the epoch's average accuracy (since the batch has only 5 of the 25 total samples), yet our misguided calculation has it influencing $\\frac{1}{3}$ of the epoch's average accuracy.\n", "\n", "The proper way to calculate the epoch's average accuracy that takes into account the 3rd batch's smaller size relative to the first two batches would be to calculate a weighted average: $$\\frac{5}{10}*\\frac{10}{25} + \\frac{5}{10}*\\frac{10}{25} + \\frac{2}{5}*\\frac{5}{25} = \\frac{10}{25} + \\frac{2}{25} = \\frac{12}{25}$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`get_dls` will return dataloaders for the training and validation sets:" ] }, { "cell_type": "code", "execution_count": 94, "metadata": {}, "outputs": [], "source": [ "#export \n", "def get_dls(train_ds, valid_ds, bs, **kwargs):\n", "    return (DataLoader(train_ds, batch_size=bs, shuffle=True, **kwargs),\n", "            DataLoader(valid_ds, batch_size=bs*2, **kwargs))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because the validation set has `10,000` items and because we're using a batch size of 128 for running inference on our validation set, if we don't explicitly set `drop_last=True` in our validation data loader, our calculations for validation loss/accuracy will be slightly skewed. \n", "\n", "The last batch will have a size of only `16`. And as explained above, the last batch's loss/acc metrics will thus sway the overall totals more than they should. For the instructional purposes of this notebook, we won't worry about accounting for this in our validation loss/acc calculations at this point." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `get_dls()` function now allows us to create dataloaders and fit the model in only *three lines of code*." 
] }, { "cell_type": "code", "execution_count": 95, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 tensor(0.2808) tensor(0.9006)\n", "1 tensor(0.1229) tensor(0.9634)\n", "2 tensor(0.1326) tensor(0.9616)\n", "3 tensor(0.1114) tensor(0.9670)\n", "4 tensor(0.1046) tensor(0.9691)\n" ] } ], "source": [ "train_dl, valid_dl = get_dls(train_ds, valid_ds, bs)\n", "model, opt = get_model()\n", "loss, acc = fit(5, model, loss_func, opt, train_dl, valid_dl)" ] }, { "cell_type": "code", "execution_count": 96, "metadata": {}, "outputs": [], "source": [ "assert acc>0.9" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Export" ] }, { "cell_type": "code", "execution_count": 97, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Converted 03_minibatch_training_my_reimplementation.ipynb to nb_03.py\r\n" ] } ], "source": [ "!python notebook2script_my_reimplementation.py 03_minibatch_training_my_reimplementation.ipynb" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 2 }