# Pytorch tutorial

The goal of this tutorial is to quickly present pytorch, one of the most widely used deep learning frameworks, to students who already have some experience with tensorflow/keras.

We will first train a small fully-connected network on random data, then a convolutional network on MNIST, and observe what happens when the inputs or outputs are correlated, by training successively on the 0 digits, then the 1s, etc. This will illustrate why correlated inputs are a problem for neural networks.

In [1]:
import numpy as np
import matplotlib.pyplot as plt

## Installing Pytorch

Like tensorflow / keras, Pytorch provides a lot of ready-made layer types, activation functions, optimizers and so on. Do not hesitate to read its documentation.

The first step is to install pytorch if you are not on colab (where it is installed by default). The easiest way is to use pip:
 
```bash
pip install torch torchvision
```

`torchvision` is necessary if you want to deal with images, such as the MNIST dataset.

`torch` is now available for importing. There is quite a lot to import, so let's just copy and paste:


In [None]:
# Imports
import torch
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader, Subset
from torchvision import datasets, transforms
from sklearn.model_selection import train_test_split

# Select hardware:
if torch.cuda.is_available(): # GPU
    device = torch.device("cuda")
elif torch.backends.mps.is_available(): # Metal (macOS)
    device = torch.device("mps")
else: # CPU
    device = torch.device("cpu")
print(f"Device: {device}")

## Random data

Let's train an MLP on some dummy data. To show the (overfitting) power of deep neural networks, we will try to learn noise by heart. The following cell creates 1000 random samples of dimension 10, randomly assigned to 3 classes:

In [3]:
N = 1000
nb_features = 10
nb_classes = 3

X = np.random.uniform(-1.0, 1.0, (N, nb_features))
t = np.random.randint(0, nb_classes, (N, ))

Let's start by splitting this data in training / validation sets using scikit-learn.

In [4]:
X_train, X_test, t_train, t_test = train_test_split(X, t, test_size=0.1)

The four numpy arrays have to be converted to torch tensors. The inputs are `float32` (32 bits) numbers, while the classes are integers (`long`).

In [5]:
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
t_train_tensor = torch.tensor(t_train, dtype=torch.long)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
t_test_tensor = torch.tensor(t_test, dtype=torch.long)
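
The explicit `dtype` arguments matter because numpy creates `float64` arrays by default, while the weights of pytorch layers are `float32` by default, and mixing the two raises a dtype mismatch error in the forward pass. A minimal sketch, using small made-up arrays (`X_np` and `t_np` are hypothetical stand-ins for the arrays above):

```python
import numpy as np
import torch

# Toy arrays standing in for X_train and t_train (hypothetical values)
X_np = np.random.uniform(-1.0, 1.0, (4, 10))  # float64 by default
t_np = np.random.randint(0, 3, (4,))

# Without an explicit dtype, the tensor inherits the numpy dtype
X_default = torch.tensor(X_np)
print(X_default.dtype)  # torch.float64

# With explicit dtypes, as in the cell above
X_tensor = torch.tensor(X_np, dtype=torch.float32)
t_tensor = torch.tensor(t_np, dtype=torch.long)
print(X_tensor.dtype, t_tensor.dtype)  # torch.float32 torch.int64
```

`torch.long` is simply an alias for `torch.int64`, which is the integer type expected for class indices by the cross-entropy loss.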

Using these four tensors, we can now create datasets (`TensorDataset`) and data loaders (`DataLoader`), which make it very easy to sample minibatches. Note that you have to define the batch size when creating the data loaders: it cannot be changed later unless you create new ones.

In [6]:
# Create TensorDatasets for both train and test sets
train_dataset = TensorDataset(X_train_tensor, t_train_tensor)
test_dataset = TensorDataset(X_test_tensor, t_test_tensor)

# Create DataLoaders for train and test sets
batch_size = 32
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
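
To get a feeling for what a data loader returns, here is a small sketch that iterates over one. It uses dummy tensors with the same shapes as the training set (900 samples after the 90/10 split); `X_demo`, `t_demo` and `demo_loader` are hypothetical names used only for this illustration.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Dummy tensors with the same shapes as the training set (900 samples, 10 features)
X_demo = torch.randn(900, 10)
t_demo = torch.randint(0, 3, (900,))

demo_loader = DataLoader(TensorDataset(X_demo, t_demo), batch_size=32, shuffle=True)

# len() of a loader is the number of minibatches, not the number of samples
print(len(demo_loader))  # 29, i.e. ceil(900 / 32)

# Each iteration returns one (inputs, targets) pair
batch_sizes = [data.shape[0] for data, target in demo_loader]
print(batch_sizes[0], batch_sizes[-1])  # 32 4
```

The last minibatch is smaller (4 samples) because 900 is not a multiple of 32. This is why the training function further down weights each minibatch loss by `data.size(0)`.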

It is now time to create an MLP in pytorch. The standard way of creating a neural network is to create a class inheriting from `torch.nn.Module`.

There are only two methods that need to be defined:

1. The constructor `__init__(self, ...)`, which instantiates the layers. Do not forget the first line with `super()`, it is super important: without it, assigning layers as attributes raises an error. The first argument `MLP` should be replaced with the name of the class if you change it. The layers are created in the constructor and saved as attributes of the object (`self.fc1`). The order in which they are created does not matter.
2. The forward method `forward(self, x)`, which defines how the input `x` is processed and in which order the layers are called.

The following class defines a MLP with two hidden layers, and the ReLU activation function:

In [7]:
class MLP(torch.nn.Module):
    "MLP with two hidden layers."

    def __init__(self, nb_features, nb_layer1, nb_layer2, nb_classes):
        super(MLP, self).__init__() # Obligatory, do not forget

        # Layers
        self.fc1 = torch.nn.Linear(nb_features, nb_layer1)
        self.fc2 = torch.nn.Linear(nb_layer1, nb_layer2)
        self.output = torch.nn.Linear(nb_layer2, nb_classes)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.output(x)
        return x

`torch.nn.Linear(N, M)` creates a fully-connected layer between a layer of size `N` and a layer of size `M`.
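
As a quick illustration of what such a layer contains, the following sketch creates a single `Linear` layer with the sizes of the first hidden layer used later (10 inputs, 100 neurons) and inspects its parameters. The batch of 32 random inputs is only an assumption for the example.

```python
import torch

# Fully-connected layer from 10 inputs to 100 neurons
layer = torch.nn.Linear(10, 100)

# The weight matrix is stored as (out_features, in_features)
print(layer.weight.shape)  # torch.Size([100, 10])
print(layer.bias.shape)    # torch.Size([100])

# The layer is applied on the last dimension; the first one is the batch
x = torch.randn(32, 10)
y = layer(x)
print(y.shape)             # torch.Size([32, 100])

# Number of trainable parameters: 10 * 100 weights + 100 biases
n_params = sum(p.numel() for p in layer.parameters())
print(n_params)            # 1100
```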

There is more than one way to create this network. For example, the ReLU non-linearity does not have to come from the functional module `F`, but could be a layer of its own:

```python
class MLP(torch.nn.Module):
    "MLP with two hidden layers."

    def __init__(self, nb_features, nb_layer1, nb_layer2, nb_classes):
        super(MLP, self).__init__() # Obligatory, do not forget

        # Layers
        self.fc1 = torch.nn.Linear(nb_features, nb_layer1)
        self.fc2 = torch.nn.Linear(nb_layer1, nb_layer2)
        self.output = torch.nn.Linear(nb_layer2, nb_classes)

        # Activations
        self.relu = torch.nn.ReLU()

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        x = self.relu(x)
        x = self.output(x)
        return x
```

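Note that it would be much closer to `keras` and much shorter to use the `Sequential` model shown below. It is fine for a plain stack of layers, but custom classes are generally preferred because they are easier to reuse and extend (e.g. with skip connections or several inputs):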

```python
model = torch.nn.Sequential(
    torch.nn.Linear(nb_features, nb_layer1),
    torch.nn.ReLU(),
    torch.nn.Linear(nb_layer1, nb_layer2),
    torch.nn.ReLU(),
    torch.nn.Linear(nb_layer2, nb_classes)
)
```

Contrary to tensorflow/keras, there is no need to create an input layer explicitly, as the input tensor is passed as the argument `x` in `def forward(self, x)`.

Note that the output layer does not use a softmax activation function, although we are doing a classification. The cross-entropy loss function that we will define later expects logits as an input, not probabilities, so we just keep the numbers as they are.
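
To make this concrete, the sketch below checks on made-up logits (random values, hypothetical targets) that `torch.nn.CrossEntropyLoss` applied to raw logits gives the same value as a log-softmax followed by the negative log-likelihood loss. In other words, the softmax is already included in the loss function.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Made-up logits for 4 samples and 3 classes, with arbitrary targets
logits = torch.randn(4, 3)
targets = torch.tensor([0, 2, 1, 2])

# Cross-entropy applied directly on the logits
loss_ce = torch.nn.CrossEntropyLoss()(logits, targets)

# Same computation done by hand: log-softmax, then negative log-likelihood
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)
print(torch.allclose(loss_ce, loss_manual))  # True

# Probabilities are only needed for interpretation, not for training
probs = F.softmax(logits, dim=1)
print(probs.sum(dim=1))  # each row sums to 1
```

Since the softmax is monotonic, the predicted class (`argmax`) is the same whether it is computed on the logits or on the probabilities.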

Another big difference with `keras` is that there is no `model.fit()` method that trains the model for you. You have to define the whole training procedure yourself. The equivalent of the high-level keras API would be **pytorch lightning** or **pytorch ignite**.

Here, we will define a `train()` method applying backpropagation and the optimizer on each minibatch. Skipping some details, the pseudo-algorithm would be something like this:

```python
# Create the neural network
model = MLP(nb_features, nb_layer1, nb_layer2, nb_classes)

# Select the optimizer, e.g. Adam
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# Select the loss function, here cross-entropy as we do a classification
loss_function = torch.nn.CrossEntropyLoss()

# Iterate over the minibatches contained in the loader
for batch_idx, (data, target) in enumerate(train_loader):

    # Reinitialize the gradients (important!)
    optimizer.zero_grad()

    # Forward pass on the inputs of the minibatch
    y = model(data)

    # Compute the loss function on the minibatch
    loss = loss_function(y, target)

    # Backpropagate the gradients
    loss.backward()

    # Apply the optimizer on the gradients
    optimizer.step()
```

We also need to send the data to the GPU if needed, compute the metrics (loss and accuracy), etc. The following function is quite generic and can be reused in many networks:

In [8]:
def train(model, device, train_loader, optimizer, loss_function):

    # Put the model in training mode (enables dropout, batchnorm updates, etc.)
    model.train()

    # Initialize metrics
    training_loss = 0
    correct = 0
    total = 0

    # Iterate over the minibatches
    for batch_idx, (data, target) in enumerate(train_loader):

        # Send the data to the device
        data, target = data.to(device), target.to(device)

        # Reinitialize the gradients
        optimizer.zero_grad()

        # Make the forward pass
        output = model(data)

        # Compute the loss function on the minibatch
        loss = loss_function(output, target)

        # Accumulate training loss. data.size(0) is the batch size.
        training_loss += loss.item() * data.size(0)

        # Backpropagate the gradients
        loss.backward()

        # Apply the optimizer on the gradients
        optimizer.step()

        # Compute metrics
        pred = output.argmax(dim=1, keepdim=True) # get the index of the largest logit
        total += target.size(0)
        correct += pred.eq(target.view_as(pred)).sum().item()

    # Info
    training_loss /= len(train_loader.dataset)
    accuracy = 100 * correct / total
    print(f'Training loss {training_loss:.4f}, accuracy {accuracy:.4f}')

    return training_loss, accuracy
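
The accuracy bookkeeping used in this function can be checked in isolation. The sketch below uses made-up logits and targets (hypothetical values, not taken from the tutorial's data) for a minibatch of 4 samples and 3 classes.

```python
import torch

# Made-up network outputs (logits) and targets for a minibatch of 4 samples
output = torch.tensor([[2.0, 0.1, -1.0],
                       [0.3, 0.2, 1.5],
                       [0.0, 3.0, 0.5],
                       [1.2, 0.4, 0.1]])
target = torch.tensor([0, 2, 2, 0])

# keepdim=True keeps a column vector of shape (4, 1)
pred = output.argmax(dim=1, keepdim=True)
print(pred.shape)  # torch.Size([4, 1])

# view_as reshapes the targets to (4, 1) so that the comparison is element-wise
correct = pred.eq(target.view_as(pred)).sum().item()
print(correct)     # 3 (the third sample is misclassified)

# Equivalent, shorter formulation without keepdim
correct_short = (output.argmax(dim=1) == target).sum().item()
print(100 * correct_short / target.size(0))  # 75.0
```

Without `view_as`, comparing a `(4, 1)` tensor with a `(4,)` tensor would broadcast to a `(4, 4)` matrix and silently give a wrong count.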


The following function does a similar job on the validation set, but does NOT apply backpropagation. `model.eval()` switches the layers to evaluation mode (dropout is disabled and batchnorm uses its running statistics), while `with torch.no_grad():` makes sure that the gradients are not computed, which speeds up the computations.

In [9]:
def validate(model, device, test_loader, loss_function):

    # Evaluation mode (dropout disabled, batchnorm uses its running statistics)
    model.eval()

    # Initialize metrics
    test_loss = 0
    correct = 0
    total = 0

    # Important! No gradient computation when testing.
    with torch.no_grad():

        # Iterate over the minibatches
        for data, target in test_loader:

            # Send the data to the device
            data, target = data.to(device), target.to(device)

            # Make the forward pass
            output = model(data)

            # Compute the loss function on the minibatch
            test_loss += loss_function(output, target).item() * data.size(0)

            # Compute metrics
            pred = output.argmax(dim=1, keepdim=True) # get the index of the largest logit
            total += target.size(0)
            correct += pred.eq(target.view_as(pred)).sum().item()

    # Info
    test_loss /= len(test_loader.dataset)
    accuracy = 100 * correct / total
    print(f'Validation loss: {test_loss:.4f}, accuracy: {accuracy:.4f}')

    return test_loss, accuracy
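
The two mechanisms used in this function are independent, which the following sketch tries to illustrate on toy layers (a standalone dropout layer and a small linear layer, both chosen only for the example). Evaluation mode changes the behavior of some layers, while `torch.no_grad()` only disables the tracking of gradients.

```python
import torch

# 1) train() / eval() change the behavior of layers such as dropout
dropout = torch.nn.Dropout(p=0.5)
x = torch.ones(1000)

dropout.train()
out_train = dropout(x)  # about half of the values are zeroed, the rest is scaled to 2.0

dropout.eval()
out_eval = dropout(x)   # identity: nothing is dropped
print(torch.equal(out_eval, x))  # True

# 2) torch.no_grad() disables gradient tracking, independently of eval()
layer = torch.nn.Linear(10, 3)
layer.eval()
inp = torch.randn(5, 10)

out_tracked = layer(inp)        # eval() alone still builds the computation graph
with torch.no_grad():
    out_untracked = layer(inp)  # no graph is built here

print(out_tracked.requires_grad, out_untracked.requires_grad)  # True False
```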

Now we can create the neural network (with two hidden layers of 100 neurons each), the Adam optimizer with a fixed learning rate and the cross-entropy loss. Note again that `torch.nn.CrossEntropyLoss()` expects the network to output logits, not probabilities.

It is important to **send** the network to the device (GPU, TPU, etc) after creating an instance of the class.

In [10]:
# Create the model
model = MLP(nb_features, 100, 100, nb_classes).to(device)

# Optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# Loss function
loss_function = torch.nn.CrossEntropyLoss()

**Q:** Train the model for 50 epochs by repeatedly calling the `train()` and `validate()` functions. Record the training/validation losses and accuracies and plot them. Comment on the final accuracies on the training and test sets, and what this means in terms of overfitting.

## Training a CNN on MNIST

Let's now try to learn something a bit more serious, the MNIST dataset. The following cell loads the MNIST data (training set of 60000 monochrome images of size 28x28x1, test set of 10000 images), scales each pixel to values between 0 and 1, and then standardizes the images with a fixed mean and standard deviation.

In [13]:
# torchvision.transforms is used to normalize the images. mean=0.1307 and std=0.3081 are the commonly used statistics of the MNIST training set.
transform = transforms.Compose([
    transforms.ToTensor(), # Convert the image to a tensor (scaling to [0, 1])
    transforms.Normalize((0.1307,), (0.3081,)) # Normalize with mean and std
])

# Download the data if needed.
dataset_train = datasets.MNIST('./data', train=True, download=True, transform=transform)
dataset_test = datasets.MNIST('./data', train=False, download=True, transform=transform)

# Create the data loaders (the training set is shuffled at each epoch)
batch_size=128
train_loader = torch.utils.data.DataLoader(dataset_train, batch_size=batch_size, shuffle=True)
test_loader = torch.utils.data.DataLoader(dataset_test, batch_size=batch_size)
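
For reference, `transforms.Normalize(mean, std)` computes `(x - mean) / std` for each channel. The following sketch redoes this arithmetic by hand on three example pixel values (black, mid-gray and white after `ToTensor()`), without torchvision, to show the range of values the network actually receives.

```python
import torch

mean, std = 0.1307, 0.3081

# Pixel values after ToTensor(): 0.0 is black, 1.0 is white
pixels = torch.tensor([0.0, 0.5, 1.0])

# Same arithmetic as transforms.Normalize((0.1307,), (0.3081,))
normalized = (pixels - mean) / std
print(normalized)  # approximately [-0.4242, 1.1986, 2.8215]
```

The inputs are therefore no longer restricted to [0, 1] after normalization: the black background maps to about -0.42 and white pixels to about 2.82.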

**Q:** Create a convolutional neural network with two convolutional layers (AlexNet-like) that can reach around 99% validation accuracy after **10 epochs**. Feel free to translate what you did in the course Neurocomputing, search the web, ask ChatGPT, etc. Some ingredients/tips you might need:

* Convolutional layers, obviously: `torch.nn.Conv2d`. You need to define the number of channels / features in the previous layer and in the next one. The first convolutional layer works on the image directly, so the number of input channels is 1 on MNIST (because the MNIST images are monochrome, it would be 3 for RGB images). Keep the kernel size at 3 (i.e. 3x3 filters) and define the padding as `'same'` or `'valid'`, as you prefer.

```python
self.conv1 = torch.nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3, padding='same')
```

* Max-pooling layers can be defined using the functional module `F`:

```python
def forward(self, x):
    x = F.relu(self.conv1(x)) # Conv layer + ReLU
    x = F.max_pool2d(x, 2) # 2x2 max pooling
```

or as a reusable layer:

```python
def __init__(self):
    self.max_pooling = torch.nn.MaxPool2d(2)

def forward(self, x):
    x = F.relu(self.conv1(x)) # Conv layer + ReLU
    x = self.max_pooling(x) # 2x2 max pooling
```

The functional approach is usually preferred, but both are equivalent: pick the one you prefer.

* After the last convolutional block, you need to **flatten** the tensor into a vector before feeding it to the next fully-connected layer. This can be done with a dedicated layer (or, equivalently, by calling `torch.flatten(x, 1)` in `forward()`):

```python
def __init__(self):
    self.flatten = torch.nn.Flatten()

def forward(self, x):
    x = self.flatten(x) # flatten
```

* The caveat is the size of that flattened vector, which will be the first argument of the next FC layer. For convolutional layers, the (width, height) dimensions of the input image do not matter, but FC layers require fixed numbers of inputs. The size of the vector will depend on 1) the image size, 2) the number of conv and pooling layers, 3) the padding method, etc. 

The trick to find the size of that layer is to create the network up until the flatten layer, pass a single image to the `forward()` method and print the shape of the returned tensor:

```python
class CNN(torch.nn.Module):
    def forward(self, x):
        ...
        x = self.flatten(x)
        return x

model = CNN().to(device)

# Random image (batch_size, channels, height, width)
img = torch.randn(1, 1, 28, 28).to(device)

# Forward pass (calling the model invokes forward())
res = model(img)
print(res.shape)
```

For an AlexNet-like network on MNIST with 2 convolutional layers, each followed by 2x2 max-pooling, `padding='same'` and 32 features in the last convolutional layer, you should get `torch.Size([1, 1568])` or `torch.Size([1, 32, 7, 7])` depending on whether you print before or after `flatten()`. This means that you have one tensor (the first dimension is always the batch size) of size 7x7 with 32 features, or 1568 elements when flattened. This is the input size for the next FC layer. Of course, you have to adapt this to your network.
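
The numbers above can be reproduced without writing the full network. The sketch below chains two standalone convolutional layers and 2x2 max-pooling on a random image; the 16 and 32 channels follow the values mentioned in this section, and the second half (with `padding='valid'`) is only an assumption added to show how the flattened size depends on the padding.

```python
import torch
import torch.nn.functional as F

# Random image (batch_size, channels, height, width)
img = torch.randn(1, 1, 28, 28)

# Two conv layers with padding='same': the spatial size only changes when pooling
conv1 = torch.nn.Conv2d(1, 16, kernel_size=3, padding='same')
conv2 = torch.nn.Conv2d(16, 32, kernel_size=3, padding='same')

x = F.max_pool2d(F.relu(conv1(img)), 2)  # 28x28 -> 14x14
x = F.max_pool2d(F.relu(conv2(x)), 2)    # 14x14 -> 7x7
print(x.shape)                            # torch.Size([1, 32, 7, 7])

flat = torch.flatten(x, start_dim=1)      # keep the batch dimension
print(flat.shape)                         # torch.Size([1, 1568])

# Same layers with padding='valid': each 3x3 convolution removes 2 pixels
conv1_valid = torch.nn.Conv2d(1, 16, kernel_size=3, padding='valid')
conv2_valid = torch.nn.Conv2d(16, 32, kernel_size=3, padding='valid')

v = F.max_pool2d(F.relu(conv1_valid(img)), 2)  # 28 -> 26 -> 13
v = F.max_pool2d(F.relu(conv2_valid(v)), 2)    # 13 -> 11 -> 5
flat_valid = torch.flatten(v, start_dim=1)
print(flat_valid.shape)                         # torch.Size([1, 800])
```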

* You will likely observe overfitting if you only have conv, pooling and fc layers in your network. A bit of dropout after each conv and fc layer usually helps. If you use the same level of dropout everywhere, you can define a single layer in `__init__` and use it in forward:

```python
def __init__(self):
    self.dropout = torch.nn.Dropout(p=0.5)

def forward(self, x):
    x = self.dropout(x)
```

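If you want different dropout levels, create as many layers as needed, or use the functional `F.dropout(x, p=0.5, training=self.training)`. The `training` flag matters: the functional version is always active by default and would otherwise keep dropping units during validation, even after `model.eval()`.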

## Correlated inputs

Now that we have a basic CNN working on MNIST, let's investigate why deep NN hate sequentially correlated inputs (which is the main justification for the experience replay memory in DQN). Is that really true, or is it just some mathematical assumption that does not matter in practice?

The idea of this section is the following: we will train the same network as before for 10 epochs, but each epoch will train the network on all the zeros first, then all the ones, etc. Each epoch will contain the same number of training examples as before, but the order of presentation will be different (correlated instead of i.i.d).

To do so, we only need to sort the datasets according to their targets, and tell the Pytorch DataLoaders not to shuffle the data when sampling minibatches. 

The following cell sorts the datasets `dataset_train` and `dataset_test` (generated earlier when downloading MNIST), so that the data loaders iterate deterministically over them (the flag `shuffle=False` is important).

In [17]:
# Function to sort dataset by labels
def sort_dataset_by_labels(dataset):
    sorted_indices = np.argsort(dataset.targets.numpy())
    sorted_dataset = Subset(dataset, sorted_indices)
    return sorted_dataset

# Load and sort the MNIST dataset
train_loader_sorted = DataLoader(
    sort_dataset_by_labels(dataset_train),
    batch_size=batch_size, shuffle=False
)
test_loader_sorted = DataLoader(
    sort_dataset_by_labels(dataset_test),
    batch_size=batch_size, shuffle=False
)


**Q:** Using these new data loaders, train the same CNN as before (after reinitializing it, of course) for 10 epochs. What do you observe? Why?

**Q:** To better understand what happened, modify the `validate()` function so that it returns a list of accuracies on each minibatch of the sorted test set, and plot these accuracies.

As the test set is also sorted, the first minibatches will only have zeros in them, the following only ones, and so on. If you want, you can figure out which digits are in a minibatch using the list `np.argsort(dataset_test.targets.numpy())` and the batch size.
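
One possible way to recover the digits contained in each minibatch is sketched below on a toy example. The `labels` array is a hypothetical stand-in for `dataset_test.targets.numpy()` and `batch_size_demo` for the real batch size; the slicing logic mirrors what an unshuffled data loader does on the sorted dataset.

```python
import numpy as np

# Toy labels standing in for dataset_test.targets.numpy() (hypothetical values)
labels = np.array([2, 0, 1, 0, 2, 1, 0, 1, 2, 0])
batch_size_demo = 4

# Same sorting as in sort_dataset_by_labels()
sorted_indices = np.argsort(labels)
sorted_labels = labels[sorted_indices]
print(sorted_labels)  # [0 0 0 0 1 1 1 2 2 2]

# An unshuffled loader cuts the sorted dataset into consecutive slices
batches = [
    sorted_labels[i:i + batch_size_demo]
    for i in range(0, len(sorted_labels), batch_size_demo)
]

# Digits present in each minibatch
digits_per_batch = [np.unique(b).tolist() for b in batches]
print(digits_per_batch)  # [[0], [1, 2], [2]]
```

Minibatches at the boundary between two digits contain a mix of both classes, which is worth keeping in mind when reading the per-minibatch accuracies.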

**Optional Q:** Increase and decrease the learning rate of the optimizer. What do you observe? Is there a solution to this problem? 