{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Pytorch tutorial" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The goal of this tutorial is to very quickly present pytorch, the main deep learning framework nowadays, to students with already some experience in tensorflow/keras. \n", "\n", "We will train a small fully-connected network on MNIST and observe what happens when the inputs or outputs are correlated, by training successively on the 0 digits, then the 1s, etc. This will explain why correlated inputs are a problem for neural networks." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Installing Pytorch\n", "\n", "Like tensorflow / keras, Pytorch provides a lot of ready-made layer types, activation functions, optimizers and so on. Do not hesitate to read its documentation on .\n", "\n", "The first step is to install pytorch if you are not on colab (where it is installed by default). The easiest way is to use pip:\n", " \n", "```bash\n", "pip install torch torchvision\n", "```\n", "\n", "`torchvision` is necessary if you want to deal with images, such as the MNIST dataset.\n", "\n", "`torch` is now available for importing. There is quite a lot to import, so let's just copy and paste:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Imports\n", "import torch\n", "import torch.nn.functional as F\n", "from torch.utils.data import TensorDataset, DataLoader, Subset\n", "from torchvision import datasets, transforms\n", "from sklearn.model_selection import train_test_split\n", "\n", "# Select hardware: \n", "if torch.cuda.is_available(): # GPU\n", " device = torch.device(\"cuda\")\n", "elif torch.backends.mps.is_available(): # Metal (Macos)\n", " device = torch.device(\"mps\")\n", "else: # CPU\n", " device = torch.device(\"cpu\")\n", "print(f\"Device: {device}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Random data\n", "\n", "Let's train a MLP on some dummy data. To show the (overfitting) power of deep neural networks, we will try to learn noise by heart. The following cell creates 1000 random samples of dimension 10, artificially ordered in 3 classes:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "N = 1000\n", "nb_features = 10\n", "nb_classes = 3\n", "\n", "X = np.random.uniform(-1.0, 1.0, (N, nb_features))\n", "t = np.random.randint(0, nb_classes, (N, ))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's start by splitting this data in training / validation sets using scikit-learn." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "X_train, X_test, t_train, t_test = train_test_split(X, t, test_size=0.1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The four numpy arrays have to be converted to torch tensors. The inputs are `float32` (32 bits) numbers, while the classes are integers (`long`)." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "X_train_tensor = torch.tensor(X_train, dtype=torch.float32)\n", "t_train_tensor = torch.tensor(t_train, dtype=torch.long)\n", "X_test_tensor = torch.tensor(X_test, dtype=torch.float32)\n", "t_test_tensor = torch.tensor(t_test, dtype=torch.long)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using these four tensors, we can now create datasets (`TensorDataset`) and data loaders (`DataLoader`) allowing to sample minibatches very easily. Note that you have to define the batch size at the time of creation of the data loaders. It cannot be changed later unless you create new ones." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# Create TensorDatasets for both train and test sets\n", "train_dataset = TensorDataset(X_train_tensor, t_train_tensor)\n", "test_dataset = TensorDataset(X_test_tensor, t_test_tensor)\n", "\n", "# Create DataLoaders for train and test sets\n", "batch_size = 32\n", "train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)\n", "test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now is time to create a MLP in pytorch. The standard way of creating a neural network is to create a class inheriting from `torch.nn.module`. \n", "\n", "There are only two methods that need to be defined:\n", "\n", "1. The constructor `__init__(self, ...)` that instantiates the layers. Do not forget the first line with `super()`, it is super important. The first argument `MLP` should be replaced with the name of the class if you change it. The layers are created in the constructor and saved as attributes to the class (`self.fc1`). The order does not matter.\n", "2. The forward method `forward(self, x)`that defines how the input `x` will be processed, and in which order the layers will be called.\n", "\n", "The following class defines a MLP with two hidden layers, and the ReLU activation function:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "class MLP(torch.nn.Module):\n", " \"MLP with two hidden layers.\"\n", "\n", " def __init__(self, nb_features, nb_layer1, nb_layer2, nb_classes):\n", " super(MLP, self).__init__() # Obligatory, do not forget\n", "\n", " # Layers\n", " self.fc1 = torch.nn.Linear(nb_features, nb_layer1)\n", " self.fc2 = torch.nn.Linear(nb_layer1, nb_layer2)\n", " self.output = torch.nn.Linear(nb_layer2, nb_classes)\n", "\n", " def forward(self, x):\n", " x = F.relu(self.fc1(x))\n", " x = F.relu(self.fc2(x))\n", " x = self.output(x)\n", " return x" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`torch.nn.Linear(N, M)` creates a fully-connected layer between a layer of size `N` and a layer of size `M`.\n", "\n", "There is more than one way to create this network. 
For example, the ReLU non-linearity does not have to come from the functional module `F`, but could be a layer of its own:\n", "\n", "```python\n", "class MLP(torch.nn.Module):\n", " \"MLP with two hidden layers.\"\n", "\n", " def __init__(self, nb_features, nb_layer1, nb_layer2, nb_classes):\n", " super(MLP, self).__init__() # Obligatory, do not forget\n", "\n", " # Layers\n", " self.fc1 = torch.nn.Linear(nb_features, nb_layer1)\n", " self.fc2 = torch.nn.Linear(nb_layer1, nb_layer2)\n", " self.output = torch.nn.Linear(nb_layer2, nb_classes)\n", "\n", " # Activations\n", " self.relu = torch.nn.ReLU()\n", "\n", " def forward(self, x):\n", " x = self.fc1(x)\n", " x = self.relu(x)\n", " x = self.fc2(x)\n", " x = self.relu(x)\n", " x = self.output(x)\n", " return x\n", "```\n", "\n", "Note that it would be much closer to `keras` and much shorter to use the `Sequential` model, but for some reason (reusability, etc) it is not recommended:\n", "\n", "```python\n", "model = torch.nn.Sequential(\n", " torch.nn.Linear(nb_features, nb_layer1), \n", " torch.nn.ReLU(), \n", " torch.nn.Linear(nb_layer1, nb_layer2), \n", " torch.nn.ReLU(), \n", " torch.nn.Linear(nb_layer2, nb_classes)\n", ")\n", "```\n", "\n", "Contrary to tensorflow/keras, there is no need to create an input layer explicitly, as the input tensor is passed as the argument `x` in `def forward(self, x)`.\n", "\n", "Note that the output layer does not use a softmax activation function, although we are doing a classification. The cross-entropy loss function that we will define later expects logits as an input, not probabilities, so we just keep the numbers as they are." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another big difference with `keras`is that there is no `model.fit()` method that trains the model for you. You have to define the whole training procedure by yourself. The equivalent of high-level keras API would be **pytorch lightning** () or **pytorch ignite** ().\n", "\n", "Here, we will define a `train()` method applying backpropagation and the optimizer on each minibatch. Skipping some details, the pseudo-algorithm would be something like this:\n", "\n", "```python\n", "# Create the neural network\n", "model = MLP(nb_features, nb_layer1, nb_layer2, nb_classes)\n", "\n", "# Select the optimizer, e.g. Adam\n", "optimizer = torch.optim.Adam(model.parameters(), lr=0.01)\n", "\n", "# Select the loss function, here cross-entropy as we do a classification\n", "loss_function = torch.nn.CrossEntropyLoss()\n", "\n", "# Iterate over the minibatches contained in the loader\n", "for batch_idx, (data, target) in enumerate(train_loader):\n", "\n", " # Reinitialize the gradients (important!)\n", " optimizer.zero_grad()\n", "\n", " # Forward pass\n", " y = model(X)\n", "\n", " # Compute the loss function on the minibatch\n", " loss = loss_function(y, t)\n", " \n", " # Backpropagate the gradients\n", " loss.backward()\n", " \n", " # Apply the optimizer on the gradients\n", " optimizer.step()\n", "```\n", "\n", "We also need to send the data to the GPU if needed, compute the metrics (loss and accuracy), etc. The following function is quite generic and can be reused in many networks:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "def train(model, device, train_loader, optimizer, loss_function):\n", " \n", " # Tell pytorch to start training, i.e. 
to remember gradients, enable dropout, etc.\n", " model.train()\n", "\n", " # Initialize metrics\n", " training_loss = 0\n", " correct = 0 ; total = 0\n", " \n", " # Iterate over the minibatches\n", " for batch_idx, (data, target) in enumerate(train_loader):\n", " \n", " # Send the data to the device\n", " data, target = data.to(device), target.to(device)\n", " \n", " # Reinitialize the gradients\n", " optimizer.zero_grad()\n", " \n", " # Make the forward pass\n", " output = model(data)\n", " \n", " # Compute the loss function on the minibatch\n", " loss = loss_function(output, target)\n", "\n", " # Accumulate training loss. data.size(0) is the batch size.\n", " training_loss += loss.item() * data.size(0)\n", " \n", " # Backpropagate the gradients\n", " loss.backward()\n", " \n", " # Apply the optimizer on the gradients\n", " optimizer.step()\n", " \n", " # Compute metrics\n", " pred = output.argmax(dim=1, keepdim=True) # get the index of the max log-probability\n", " total += target.size(0)\n", " correct += pred.eq(target.view_as(pred)).sum().item()\n", " \n", " # Info\n", " training_loss /= len(train_loader.dataset)\n", " accuracy = 100 * correct / total\n", " print(f'Training loss {training_loss:.4f}, accuracy {accuracy:.4f}')\n", "\n", " return training_loss, accuracy\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following function does a similar function on the validation set, but does NOT apply backpropagation. `model.eval()` and `with torch.no_grad():` make sure that the gradients are not computed, speeding up the computations (also dropout and batchnorm are switched off)." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "def validate(model, device, test_loader, loss_function):\n", "\n", " # Evaluation mode, without the gradients\n", " model.eval()\n", "\n", " # Initialize metrics\n", " test_loss = 0\n", " correct = 0; total = 0\n", "\n", " # Important! No backpropagation when testing.\n", " with torch.no_grad():\n", "\n", " # Iterate over the minibatches\n", " for data, target in test_loader:\n", "\n", " # Send the data to the device\n", " data, target = data.to(device), target.to(device)\n", "\n", " # Make the forward pass\n", " output = model(data)\n", "\n", " # Compute the loss function on the minibatch\n", " test_loss += loss_function(output, target).item() * data.size(0)\n", "\n", " # Compute metrics\n", " pred = output.argmax(dim=1, keepdim=True) # get the index of the max log-probability\n", " total += target.size(0)\n", " correct += pred.eq(target.view_as(pred)).sum().item()\n", " \n", " # Info\n", " test_loss /= len(test_loader.dataset)\n", " accuracy = 100 * correct / total\n", " print(f'Validation loss: {test_loss:.4f}, accuracy: {accuracy:.4f}')\n", "\n", " return test_loss, accuracy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can create the neural network (with two hidden layers of 100 neurons each), the Adam optimizer with a fixed learning rate and the cross-entropy loss. Note again that `torch.nn.CrossEntropyLoss()` expects the network to output logits, not probabilities.\n", "\n", "It is important to **send** the network to the device (GPU, TPU, etc) after creating an instance of the class." 
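,
 "\n",
 "\n",
 "If actual probabilities are needed (e.g. to inspect the predictions once the model has been trained), they can always be recovered from the logits by applying a softmax outside the network. A minimal sketch:\n",
 "\n",
 "```python\n",
 "with torch.no_grad():\n",
 "    logits = model(X_test_tensor.to(device)) # raw scores (logits)\n",
 "    probs = F.softmax(logits, dim=1)         # probabilities summing to 1\n",
 "    pred = probs.argmax(dim=1)               # same classes as argmax on the logits\n",
 "```"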
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# Create the model\n", "model = MLP(nb_features, 100, 100, nb_classes).to(device)\n", "\n", "# Optimizer\n", "optimizer = torch.optim.Adam(model.parameters(), lr=0.01)\n", "\n", "# Loss function\n", "loss_function = torch.nn.CrossEntropyLoss()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q:** Train the model for 50 epochs, by calling repeatedly the `train()` and `validate()` methods. Record and plot the training/validation losses and accuracies, and plot them. Comment on the final accuracies on the training and test sets, and what this means in terms of overfitting." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Training a CNN on MNIST\n", "\n", "Let's now try to learn something a bit more serious, the MNIST dataset. The following cell load the MNIST data (training set 60000 of 28x28x1 monochrome images, test set of 10000 images), and normalizes it (values betwen 0 and 1 for each pixel)." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# torchvision.transform allows to normalize the images. The mean=0.1307 and std=0.3081 are common practice.\n", "transform = transforms.Compose([\n", " transforms.ToTensor(), # Convert the image to a tensor (scaling to [0, 1])\n", " transforms.Normalize((0.1307,), (0.3081,)) # Normalize with mean and std\n", "])\n", "\n", "# Download the data if needed.\n", "dataset_train = datasets.MNIST('./data', train=True, download=True, transform=transform)\n", "dataset_test = datasets.MNIST('./data', train=False, download=True, transform=transform)\n", "\n", "# Create the data loaders \n", "batch_size=128\n", "train_loader = torch.utils.data.DataLoader(dataset_train, batch_size=batch_size)\n", "test_loader = torch.utils.data.DataLoader(dataset_test, batch_size=batch_size)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q:** Create a convolutional neural network with two convolutional layers (AlexNet-like) that can reach around 99% validation accuracy after **10 epochs**. Feel free to translate what you did in the course Neurocomputing, search the web, ask ChatGPT, etc. Some ingredients/tips you might need:\n", "\n", "* Convolutional layers, obviously: . You need to define the number of channels / features in the previous layer and in the next one. The first convolutional layer works on the image directly, so the number of input channels is 1 on MNIST (because the MNIST images are monochrome, it would be 3 for RGB images). Keep the kernel size at 3 (i.e. 
3x3 filters) and define the padding as `'same'` or `'valid'`, as you prefer.\n", "\n", "```python\n", "self.conv1 = torch.nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3, padding='same')\n", "```\n", "\n", "* Max-pooling layers can be defined using the functional module `F`:\n", "\n", "```python\n", "def forward(self, x):\n", " x = F.relu(self.conv1(x)) # Conv layer + ReLU\n", " x = F.max_pool2d(x, 2) # 2x2 max pooling\n", "```\n", "\n", "or as a reusable layer:\n", "\n", "```python\n", "def __init__(self):\n", " self.max_pooling = torch.nn.MaxPool2d(2)\n", "def forward(self, x):\n", " x = F.relu(self.conv1(x)) # Conv layer + ReLU\n", " x = self.max_pooling(x) # 2x2 max pooling\n", "```\n", "\n", "The functional approach is usually preferred, but they are equivalent, pick the approach you prefer.\n", "\n", "* After the last convolutional block, you need to **flatten** the tensor into a vector, before feeding it to the next fully-connected layer. It must be defined as a layer: \n", "\n", "```python\n", "def __init__(self):\n", " self.flatten = torch.nn.Flatten()\n", "def forward(self, x):\n", " x = self.flatten(x) # flatten\n", "```\n", "\n", "* The caveat is the size of that flattened vector, which will be the first argument of the next FC layer. For convolutional layers, the (width, height) dimensions of the input image do not matter, but FC layers require fixed numbers of inputs. The size of the vector will depend on 1) the image size, 2) the number of conv and pooling layers, 3) the padding method, etc. \n", "\n", "The trick to find the size of that layer is to create the network up until the flatten layer, pass a single image to the `forward()` method and print the shape of the returned tensor:\n", "\n", "```python\n", "class CNN(torch.nn.Module):\n", " def forward(self, x):\n", " ...\n", " x = self.flatten(x)\n", " return x\n", "\n", "model = CNN().to(device)\n", "\n", "# Random image (batch_size, channels, width, height)\n", "img = torch.randn(1, 1, 28, 28).to(device)\n", "\n", "# Forward pass\n", "res = model.forward(img)\n", "print(res.shape)\n", "```\n", "\n", "For an AlexNet-like network on MNIST with 2 convolutional layers, `padding=same` and 32 features in the last convolutional layer, you should get `torch.Size([1, 1568])` or `torch.Size([1, 32, 7, 7])` depending on whether you print before or after `flatten()`. This means that you have one tensor (the first dimension is always the batch size) of size 7x7 with 32 features, or 1568 elements when flattened. This is the input size for the next FC layer. Of course, you have to adapt this to your network.\n", "\n", "* You will likely observe overfitting if you only have conv, pooling and fc layers in your network. It never hurts to use a bit of dropout after each conv and fc layer. If you use the same level of dropout everywhere, you can define a single layer in `__init__` and use it in forward:\n", "\n", "```python\n", "def __init__(self):\n", " self.dropout = torch.nn.Dropout(p=0.5)\n", "def forward(self, x):\n", " x = self.dropout(x)\n", "```\n", "\n", "If you want different dropout levels, create as many layers as needed, or use the functional `F.dropout(x, 0.5)`." 
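,
 "\n",
 "\n",
 "Once your CNN class is defined (whatever you call it), the surrounding code is the same as for the MLP above. A minimal sketch of the outer loop, with purely indicative hyperparameters:\n",
 "\n",
 "```python\n",
 "# Sketch: re-create the model, optimizer and loss, then alternate train/validate\n",
 "model = CNN().to(device)\n",
 "optimizer = torch.optim.Adam(model.parameters(), lr=0.001)\n",
 "loss_function = torch.nn.CrossEntropyLoss()\n",
 "\n",
 "train_losses, val_losses = [], []\n",
 "for epoch in range(10):\n",
 "    print(f'Epoch {epoch+1}')\n",
 "    train_loss, train_acc = train(model, device, train_loader, optimizer, loss_function)\n",
 "    val_loss, val_acc = validate(model, device, test_loader, loss_function)\n",
 "    train_losses.append(train_loss)\n",
 "    val_losses.append(val_loss)\n",
 "```"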
] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Correlated inputs\n", "\n", "Now that we have a basic CNN working on MNIST, let's investigate why deep NN hate sequentially correlated inputs (which is the main justification for the experience replay memory in DQN). Is that really true, or is just some mathematical assumption that does not matter in practice?\n", "\n", "The idea of this section is the following: we will train the same network as before for 10 epochs, but each epoch will train the network on all the zeros first, then all the ones, etc. Each epoch will contain the same number of training examples as before, but the order of presentation will be different (correlated instead of i.i.d).\n", "\n", "To do so, we only need to sort the datasets according to their targets, and tell the Pytorch DataLoaders not to shuffle the data when sampling minibatches. \n", "\n", "The following function sorts the datasets `dataset_train` and `dataset_test` (generated earlier when downloading MNIST), so that the data loaders can iterative deterministically over them (the flag `shuffle=False` is important)." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "# Function to sort dataset by labels\n", "def sort_dataset_by_labels(dataset):\n", " sorted_indices = np.argsort(dataset.targets.numpy())\n", " sorted_dataset = Subset(dataset, sorted_indices)\n", " return sorted_dataset\n", "\n", "# Load and sort the MNIST dataset\n", "train_loader_sorted = DataLoader(\n", " sort_dataset_by_labels(dataset_train), \n", " batch_size=batch_size, shuffle=False\n", ")\n", "test_loader_sorted = DataLoader(\n", " sort_dataset_by_labels(dataset_test), \n", " batch_size=batch_size, shuffle=False\n", ")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q:** Using these new data loaders, train the same CNN as before (after reinitializing it, of course) for 10 epochs. What do you observe? Why?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q:** To better understand what happened, modify the `validate()` method so that it returns a list of accuracies on each minibatch of the sorted test set, and plot these accuracies.\n", "\n", "As the test set is also sorted, the first minibatches will only have zeros in them, the following only ones, and so on. If you want, you can figure out which digits are in a minibatch using the list `np.argsort(dataset_test.targets.numpy())` and the batch size." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Optional Q:** Increase and decrease the learning rate of the optimizer. What do you observe? Is there a solution to this problem? " ] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.7" } }, "nbformat": 4, "nbformat_minor": 4 }