{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "Nlt8A57W7UiN"
   },
   "source": [
    "Credit cs231n.stanford.edu"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "AhdzVgvI7UiN"
   },
   "source": [
    "# Task 0. Part PyTorch"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "QBkAP9zd7UiO"
   },
   "source": [
    "Do not worry, if you do not understand everything here. Your task now is to use documentation and to understand basics of PyTorch syntax.\n",
    "\n",
    " * It is OK **not to know** these: batch, convolution, fully-connected network, Dense layer (you will learn about these at the course).\n",
    " * You **should know** these: dataset, model, training, weights, accuracy, gradient. **In this case**, we reccomend you to familiarize yourself with machine learning basics ([ODS course](mlcourse.ai), [Andrew Ng course](deeplearning.ai), [fast.ai course](course.fast.ai/ml), [Yandex Coursera Specialization](/www.coursera.org/specializations/machine-learning-data-analysis)) or at any other place you might like.\n",
    "\n",
    "We use image classification example for it to be simplier in terms of preprocessing and numerical representation compared to texts (that we will learn at the course)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "6ZQR0nww7UiP"
   },
   "source": [
    "## What is PyTorch?\n",
    "\n",
    "PyTorch is a system for executing dynamic computational graphs over Tensor objects that behave similarly as numpy ndarray. It comes with a powerful automatic differentiation engine that removes the need for manual back-propagation."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## How to install PyTorch?\n",
    "\n",
    "Follow the detailed PyTorch installation instructions on the [official site](https://pytorch.org/get-started/locally/)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "1vWjnC927UiP"
   },
   "source": [
    "## How will I learn PyTorch?\n",
    "\n",
    "Justin Johnson has made an excellent [tutorial](https://github.com/jcjohnson/pytorch-examples) for PyTorch. \n",
    "\n",
    "You can also find the detailed [API doc](http://pytorch.org/docs/stable/index.html) here. If you have other questions that are not addressed by the API docs, the [PyTorch forum](https://discuss.pytorch.org/) is a much better place to ask than StackOverflow, however [PyTorch.org](https://pytorch.org/) might be blocked by some Internet provides.\n",
    "\n",
    "\n",
    "# Table of Contents\n",
    "\n",
    "This assignment has 4 parts. You will learn PyTorch on different levels of abstractions. \n",
    "\n",
    "1. Preparation: we will use CIFAR-10 dataset.\n",
    "2. Barebones PyTorch: we will work directly with the lowest-level PyTorch Tensors. \n",
    "3. PyTorch Module API: we will use `nn.Module` to define arbitrary neural network architecture. \n",
    "4. PyTorch Sequential API: we will use `nn.Sequential` to define a linear feed-forward network very conveniently. \n",
    "\n",
    "Here is a table of comparison:\n",
    "\n",
    "| API           | Flexibility | Convenience |\n",
    "|---------------|-------------|-------------|\n",
    "| Barebone      | High        | Low         |\n",
    "| `nn.Module`     | High        | Medium      |\n",
    "| `nn.Sequential` | Low         | High        |"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "GtsfX7V67UiQ"
   },
   "source": [
    "# Part I. Preparation\n",
    "\n",
    "First, we load the CIFAR-10 dataset. This might take a couple minutes the first time you do it, but the files should stay cached after that."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "T6Uo8HYh7UiQ"
   },
   "outputs": [],
   "source": [
    "import torch\n",
    "import torch.nn as nn\n",
    "import torch.optim as optim\n",
    "from torch.utils.data import DataLoader\n",
    "from torch.utils.data import sampler\n",
    "\n",
    "import torchvision.datasets as dset\n",
    "import torchvision.transforms as T\n",
    "\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 72
    },
    "colab_type": "code",
    "executionInfo": {
     "elapsed": 170925,
     "status": "ok",
     "timestamp": 1549435157645,
     "user": {
      "displayName": "Alexey Sorokin",
      "photoUrl": "",
      "userId": "16189107033522981450"
     },
     "user_tz": -180
    },
    "id": "oYtVQtB77UiT",
    "outputId": "7ffdfae3-758a-437c-deb5-0c0f42d7c177"
   },
   "outputs": [],
   "source": [
    "NUM_TRAIN = 49000\n",
    "\n",
    "# The torchvision.transforms package provides tools for preprocessing data\n",
    "# and for performing data augmentation; here we set up a transform to\n",
    "# preprocess the data by subtracting the mean RGB value and dividing by the\n",
    "# standard deviation of each RGB value; we've hardcoded the mean and std.\n",
    "transform = T.Compose([\n",
    "                T.ToTensor(),\n",
    "                T.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))\n",
    "            ])\n",
    "\n",
    "# We set up a Dataset object for each split (train / val / test); Datasets load\n",
    "# training examples one at a time, so we wrap each Dataset in a DataLoader which\n",
    "# iterates through the Dataset and forms minibatches. We divide the CIFAR-10\n",
    "# training set into train and val sets by passing a Sampler object to the\n",
    "# DataLoader telling how it should sample from the underlying Dataset.\n",
    "cifar10_train = dset.CIFAR10('./datasets', train=True, download=True,\n",
    "                             transform=transform)\n",
    "loader_train = DataLoader(cifar10_train, batch_size=64, \n",
    "                          sampler=sampler.SubsetRandomSampler(range(NUM_TRAIN)))\n",
    "\n",
    "cifar10_val = dset.CIFAR10('./datasets', train=True, download=True,\n",
    "                           transform=transform)\n",
    "loader_val = DataLoader(cifar10_val, batch_size=64, \n",
    "                        sampler=sampler.SubsetRandomSampler(range(NUM_TRAIN, 50000)))\n",
    "\n",
    "cifar10_test = dset.CIFAR10('./datasets', train=False, download=True, \n",
    "                            transform=transform)\n",
    "loader_test = DataLoader(cifar10_test, batch_size=64)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "ya3NPWWD7UiX"
   },
   "source": [
    "You have an option to **use GPU by setting the flag to True below**. It is not necessary to use GPU for this assignment. Note that if your computer does not have CUDA enabled, `torch.cuda.is_available()` will return False and this notebook will fallback to CPU mode.\n",
    "\n",
    "The global variables `dtype` and `device` will control the data types throughout this assignment. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 35
    },
    "colab_type": "code",
    "executionInfo": {
     "elapsed": 153089,
     "status": "ok",
     "timestamp": 1549435157649,
     "user": {
      "displayName": "Alexey Sorokin",
      "photoUrl": "",
      "userId": "16189107033522981450"
     },
     "user_tz": -180
    },
    "id": "iw90i0rV7UiX",
    "outputId": "e72e8453-cd44-4955-b684-c26ab89b7983"
   },
   "outputs": [],
   "source": [
    "USE_GPU = True\n",
    "\n",
    "dtype = torch.float32 # we will be using float throughout this tutorial\n",
    "\n",
    "if USE_GPU and torch.cuda.is_available():\n",
    "    device = torch.device('cuda')\n",
    "else:\n",
    "    device = torch.device('cpu')\n",
    "\n",
    "# Constant to control how frequently we print train loss\n",
    "print_every = 100\n",
    "\n",
    "print('using device:', device)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "heCEi9AH7Uia"
   },
   "source": [
    "# Part II. Barebones PyTorch\n",
    "\n",
    "PyTorch ships with high-level APIs to help us define model architectures conveniently, which we will cover in Part II of this tutorial. In this section, we will start with the barebone PyTorch elements to understand the autograd engine better. After this exercise, you will come to appreciate the high-level model API more.\n",
    "\n",
    "We will start with a simple fully-connected ReLU network with two hidden layers and no biases for CIFAR classification. \n",
    "This implementation computes the forward pass using operations on PyTorch Tensors, and uses PyTorch autograd to compute gradients. It is important that you understand every line, because you will write a harder version after the example.\n",
    "\n",
    "When we create a PyTorch Tensor with `requires_grad=True`, then operations involving that Tensor will not just compute values; they will also build up a computational graph in the background, allowing us to easily backpropagate through the graph to compute gradients of some Tensors with respect to a downstream loss. Concretely if x is a Tensor with `x.requires_grad == True` then after backpropagation `x.grad` will be another Tensor holding the gradient of x with respect to the scalar loss at the end."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "8RQj-Dsw7Uia"
   },
   "source": [
    "### PyTorch Tensors: Flatten Function (10 points)\n",
    "A PyTorch Tensor is conceptionally similar to a numpy array: it is an n-dimensional grid of numbers, and like numpy PyTorch provides many functions to efficiently operate on Tensors. As a simple example, we provide a `flatten` function below which reshapes image data for use in a fully-connected neural network.\n",
    "\n",
    "Recall that image data is typically stored in a Tensor of shape N x C x H x W, where:\n",
    "\n",
    "* N is the number of datapoints\n",
    "* C is the number of channels\n",
    "* H is the height of the intermediate feature map in pixels\n",
    "* W is the height of the intermediate feature map in pixels\n",
    "\n",
    "This is the right way to represent the data when we are doing something like a 2D convolution, that needs spatial understanding of where the intermediate features are relative to each other. When we use fully connected affine layers to process the image, however, we want each datapoint to be represented by a single vector -- it's no longer useful to segregate the different channels, rows, and columns of the data. So, we use a \"flatten\" operation to collapse the `C x H x W` values per representation into a single long vector. The flatten function below first reads in the N, C, H, and W values from a given batch of data, and then returns a \"view\" of that data. \"View\" is analogous to numpy's \"reshape\" method: it reshapes x's dimensions to be N x ??, where ?? is allowed to be anything (in this case, it will be C x H x W, but we don't need to specify that explicitly). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 201
    },
    "colab_type": "code",
    "executionInfo": {
     "elapsed": 581,
     "status": "ok",
     "timestamp": 1549437227705,
     "user": {
      "displayName": "Alexey Sorokin",
      "photoUrl": "",
      "userId": "16189107033522981450"
     },
     "user_tz": -180
    },
    "id": "jcUC5ijd7Uib",
    "outputId": "5d7788fc-c227-489c-82b8-124d7d48a0d7"
   },
   "outputs": [],
   "source": [
    "def flatten(x):\n",
    "    ################################################################################\n",
    "    # TODO: Implement flatten function.                                            #\n",
    "    ################################################################################\n",
    "    x_flat = x\n",
    "    ################################################################################\n",
    "    #                                 END OF YOUR CODE                             #\n",
    "    ################################################################################\n",
    "    return x_flat\n",
    "\n",
    "def test_flatten():\n",
    "    x = torch.arange(12).view(2, 1, 3, 2)\n",
    "    print('Before flattening: ', x)\n",
    "    print('After flattening: ', flatten(x))\n",
    "\n",
    "test_flatten()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "mX5b2Lr47Uid"
   },
   "source": [
    "### Barebones PyTorch: Two-Layer Network (20 points)\n",
    "\n",
    "Here we define a function `two_layer_fc` which performs the forward pass of a two-layer fully-connected ReLU network on a batch of image data. After defining the forward pass we check that it doesn't crash and that it produces outputs of the right shape by running zeros through the network."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 35
    },
    "colab_type": "code",
    "executionInfo": {
     "elapsed": 530,
     "status": "ok",
     "timestamp": 1549382933725,
     "user": {
      "displayName": "Alexey Sorokin",
      "photoUrl": "",
      "userId": "16189107033522981450"
     },
     "user_tz": -180
    },
    "id": "1gwRt77B7Uid",
    "outputId": "7dad1fe8-365d-4e4f-d495-42302db7fc16"
   },
   "outputs": [],
   "source": [
    "import torch.nn.functional as F  # useful stateless functions\n",
    "\n",
    "def two_layer_fc(x, params):\n",
    "    \"\"\"\n",
    "    A fully-connected neural networks; the architecture is:\n",
    "    NN is fully connected -> ReLU -> fully connected layer.\n",
    "    Note that this function only defines the forward pass; \n",
    "    PyTorch will take care of the backward pass for us.\n",
    "    \n",
    "    The input to the network will be a minibatch of data, of shape\n",
    "    (N, d1, ..., dM) where d1 * ... * dM = D. The hidden layer will have H units,\n",
    "    and the output layer will produce scores for C classes.\n",
    "\n",
    "    Inputs:\n",
    "    - x: A PyTorch Tensor of shape (N, d1, ..., dM) giving a minibatch of\n",
    "      input data.\n",
    "    - params: A list [w1, w2] of PyTorch Tensors giving weights for the network;\n",
    "      w1 has shape (D, H) and w2 has shape (H, C).\n",
    "\n",
    "    Returns:\n",
    "    - scores: A PyTorch Tensor of shape (N, C) giving classification scores for\n",
    "      the input data x.\n",
    "    \"\"\"\n",
    "    # first we flatten the image\n",
    "    x = flatten(x)  # shape: [batch_size, C x H x W]\n",
    "\n",
    "    w1, w2 = params\n",
    "    \n",
    "    # Forward pass: compute predicted y using operations on Tensors. Since w1 and\n",
    "    # w2 have requires_grad=True, operations involving these Tensors will cause\n",
    "    # PyTorch to build a computational graph, allowing automatic computation of\n",
    "    # gradients. Since we are no longer implementing the backward pass by hand we\n",
    "    # don't need to keep references to intermediate values.\n",
    "    # you can also use `.clamp(min=0)`, equivalent to F.relu()\n",
    "    \n",
    "    ################################################################################\n",
    "    # TODO: Implement the forward pass for the two-layer fully-connected network.  #\n",
    "    # In more details: matrix multiply input x and weight w1, apply ReLu           #\n",
    "    # nonlinearity (search the documentation for it, we discuss it in more detail  #\n",
    "    # at the course) then matrix multiply result of the previous operation by w2   #\n",
    "    # and return the result.                                                       #\n",
    "    #\n",
    "    # Write each shape-chanding operation on a separate line and provide comment   #\n",
    "    # with the current shape of the tensor                                         #\n",
    "    ################################################################################\n",
    "    pass\n",
    "    ################################################################################\n",
    "    #                                 END OF YOUR CODE                             #\n",
    "    ################################################################################\n",
    "    return x\n",
    "\n",
    "\n",
    "def two_layer_fc_test():\n",
    "    hidden_layer_size = 42\n",
    "    x = torch.zeros((64, 50), dtype=dtype)  # minibatch size 64, feature dimension 50\n",
    "    w1 = torch.zeros((50, hidden_layer_size), dtype=dtype)\n",
    "    w2 = torch.zeros((hidden_layer_size, 10), dtype=dtype)\n",
    "    if str(device) == \"cuda\":\n",
    "        x, w1, w2 = x.to(\"cuda\"), w1.to(\"cuda\"), w2.to(\"cuda\")\n",
    "    scores = two_layer_fc(x, [w1, w2])\n",
    "    print(scores.size())  # you should see [64, 10]\n",
    "\n",
    "two_layer_fc_test()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "tZnYV9Jf7Uig"
   },
   "source": [
    "### Barebones PyTorch: Initialization\n",
    "Let's write a couple utility methods to initialize the weight matrices for our models.\n",
    "\n",
    "- `random_weight(shape)` initializes a weight tensor with the Kaiming normalization method.\n",
    "- `zero_weight(shape)` initializes a weight tensor with all zeros. Useful for instantiating bias parameters.\n",
    "\n",
    "The `random_weight` function uses the Kaiming normal initialization method, described in:\n",
    "\n",
    "He et al, *Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification*, ICCV 2015, https://arxiv.org/abs/1502.01852"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 90
    },
    "colab_type": "code",
    "executionInfo": {
     "elapsed": 5936,
     "status": "ok",
     "timestamp": 1549437241359,
     "user": {
      "displayName": "Alexey Sorokin",
      "photoUrl": "",
      "userId": "16189107033522981450"
     },
     "user_tz": -180
    },
    "id": "WcwU7LBj7Uig",
    "outputId": "b1f0792a-2064-4a75-bde7-786a5b9e4b7a"
   },
   "outputs": [],
   "source": [
    "def random_weight(shape):\n",
    "    \"\"\"\n",
    "    Create random Tensors for weights; setting requires_grad=True means that we\n",
    "    want to compute gradients for these Tensors during the backward pass.\n",
    "    We use Kaiming normalization: sqrt(2 / fan_in)\n",
    "    \"\"\"\n",
    "    if len(shape) == 2:  # FC weight\n",
    "        fan_in = shape[0]\n",
    "    else:\n",
    "        fan_in = np.prod(shape[1:]) # conv weight [out_channel, in_channel, kH, kW]\n",
    "    # randn is standard normal distribution generator. \n",
    "    w = torch.randn(shape, device=device, dtype=dtype) * np.sqrt(2. / fan_in)\n",
    "    w.requires_grad = True\n",
    "    return w\n",
    "\n",
    "def zero_weight(shape):\n",
    "    return torch.zeros(shape, device=device, dtype=dtype, requires_grad=True)\n",
    "\n",
    "# create a weight of shape [3 x 5]\n",
    "# you should see the type `torch.cuda.FloatTensor` if you use GPU. \n",
    "# Otherwise it should be `torch.FloatTensor`\n",
    "random_weight((3, 5))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "fHyJh8vT7Uii"
   },
   "source": [
    "### Barebones PyTorch: Check Accuracy\n",
    "When training the model we will use the following function to check the accuracy of our model on the training or validation sets.\n",
    "\n",
    "When checking accuracy we don't need to compute any gradients; as a result we don't need PyTorch to build a computational graph for us when we compute scores. To prevent a graph from being built we scope our computation under a `torch.no_grad()` context manager."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "4nGz2ga67Uij"
   },
   "outputs": [],
   "source": [
    "def check_accuracy_part2(loader, model_fn, params):\n",
    "    \"\"\"\n",
    "    Check the accuracy of a classification model.\n",
    "    \n",
    "    Inputs:\n",
    "    - loader: A DataLoader for the data split we want to check\n",
    "    - model_fn: A function that performs the forward pass of the model,\n",
    "      with the signature scores = model_fn(x, params)\n",
    "    - params: List of PyTorch Tensors giving parameters of the model\n",
    "    \n",
    "    Returns: Nothing, but prints the accuracy of the model\n",
    "    \"\"\"\n",
    "    split = 'val' if loader.dataset.train else 'test'\n",
    "    print('Checking accuracy on the %s set' % split)\n",
    "    num_correct, num_samples = 0, 0\n",
    "    with torch.no_grad():\n",
    "        for x, y in loader:\n",
    "            x = x.to(device=device, dtype=dtype)  # move to device, e.g. GPU\n",
    "            y = y.to(device=device, dtype=torch.int64)\n",
    "            scores = model_fn(x, params)\n",
    "            _, preds = scores.max(1)\n",
    "            num_correct += (preds == y).sum()\n",
    "            num_samples += preds.size(0)\n",
    "        acc = float(num_correct) / num_samples\n",
    "        print('Got %d / %d correct (%.2f%%)' % (num_correct, num_samples, 100 * acc))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "HKjg995v7Uil"
   },
   "source": [
    "### BareBones PyTorch: Training Loop\n",
    "We can now set up a basic training loop to train our network. We will train the model using stochastic gradient descent without momentum. We will use `torch.functional.cross_entropy` to compute the loss; you can [read about it here](http://pytorch.org/docs/stable/nn.html#cross-entropy).\n",
    "\n",
    "The training loop takes as input the neural network function, a list of initialized parameters (`[w1, w2]` in our example), and learning rate."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "UxuQNigu7Uim"
   },
   "outputs": [],
   "source": [
    "def train_part2(model_fn, params, learning_rate):\n",
    "    \"\"\"\n",
    "    Train a model on CIFAR-10.\n",
    "    \n",
    "    Inputs:\n",
    "    - model_fn: A Python function that performs the forward pass of the model.\n",
    "      It should have the signature scores = model_fn(x, params) where x is a\n",
    "      PyTorch Tensor of image data, params is a list of PyTorch Tensors giving\n",
    "      model weights, and scores is a PyTorch Tensor of shape (N, C) giving\n",
    "      scores for the elements in x.\n",
    "    - params: List of PyTorch Tensors giving weights for the model\n",
    "    - learning_rate: Python scalar giving the learning rate to use for SGD\n",
    "    \n",
    "    Returns: Nothing\n",
    "    \"\"\"\n",
    "    for t, (x, y) in enumerate(loader_train):\n",
    "        # Move the data to the proper device (GPU or CPU)\n",
    "        x = x.to(device=device, dtype=dtype)\n",
    "        y = y.to(device=device, dtype=torch.long)\n",
    "\n",
    "        # Forward pass: compute scores and loss\n",
    "        scores = model_fn(x, params)\n",
    "        loss = F.cross_entropy(scores, y)\n",
    "\n",
    "        # Backward pass: PyTorch figures out which Tensors in the computational\n",
    "        # graph has requires_grad=True and uses backpropagation to compute the\n",
    "        # gradient of the loss with respect to these Tensors, and stores the\n",
    "        # gradients in the .grad attribute of each Tensor.\n",
    "        loss.backward()\n",
    "\n",
    "        # Update parameters. We don't want to backpropagate through the\n",
    "        # parameter updates, so we scope the updates under a torch.no_grad()\n",
    "        # context manager to prevent a computational graph from being built.\n",
    "        with torch.no_grad():\n",
    "            for w in params:\n",
    "                w -= learning_rate * w.grad\n",
    "\n",
    "                # Manually zero the gradients after running the backward pass\n",
    "                w.grad.zero_()\n",
    "\n",
    "        if t % print_every == 0:\n",
    "            print('Iteration %d, loss = %.4f' % (t, loss.item()))\n",
    "            check_accuracy_part2(loader_val, model_fn, params)\n",
    "            print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "z9Z1_Jdz7Uin"
   },
   "source": [
    "### Inline question 1: (20 points)\n",
    "\n",
    "What is the line number in the cell above, where\n",
    "\n",
    "1. neural network is executed? (predictions on train data are made)\n",
    "2. gradient descent step is made?\n",
    "3. how convert scores into probability distribution?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "p7K22BHs7Uio"
   },
   "outputs": [],
   "source": [
    "1. # YOUR ANSWER HERE\n",
    "2. # YOUR ANSWER HERE\n",
    "3. # YOUR ANSWER HERE"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "X5W7lzrG7Uiq"
   },
   "source": [
    "### BareBones PyTorch: Train a Two-Layer Network\n",
    "Now we are ready to run the training loop. We need to explicitly allocate tensors for the fully connected weights, `w1` and `w2`. \n",
    "\n",
    "Each minibatch of CIFAR has 64 examples, so the tensor shape is `[64, 3, 32, 32]`. \n",
    "\n",
    "After flattening, `x` shape should be `[64, 3 * 32 * 32]`. This will be the size of the first dimension of `w1`. \n",
    "The second dimension of `w1` is the hidden layer size, which will also be the first dimension of `w2`. \n",
    "\n",
    "Finally, the output of the network is a 10-dimensional vector that represents the probability distribution over 10 classes. \n",
    "\n",
    "You don't need to tune any hyperparameters but you should see accuracies above 40% after training for one epoch."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "_nHHPqGE7Uiq"
   },
   "outputs": [],
   "source": [
    "hidden_layer_size = 4000\n",
    "learning_rate = 1e-2\n",
    "\n",
    "w1 = random_weight((3 * 32 * 32, hidden_layer_size))\n",
    "w2 = random_weight((hidden_layer_size, 10))\n",
    "\n",
    "train_part2(two_layer_fc, [w1, w2], learning_rate)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "vXP7lohS7Uis"
   },
   "source": [
    "# Part III. PyTorch Module API\n",
    "\n",
    "Barebone PyTorch requires that we track all the parameter tensors by hand. This is fine for small networks with a few tensors, but it would be extremely inconvenient and error-prone to track tens or hundreds of tensors in larger networks.\n",
    "\n",
    "PyTorch provides the `nn.Module` API for you to define arbitrary network architectures, while tracking every learnable parameters for you. In Part II, we implemented SGD ourselves. PyTorch also provides the `torch.optim` package that implements all the common optimizers, such as RMSProp, Adagrad, and Adam. It even supports approximate second-order methods like L-BFGS! You can refer to the [doc](http://pytorch.org/docs/master/optim.html) for the exact specifications of each optimizer.\n",
    "\n",
    "To use the Module API, follow the steps below:\n",
    "\n",
    "1. Subclass `nn.Module`. Give your network class an intuitive name like `TwoLayerFC`. \n",
    "\n",
    "2. In the constructor `__init__()`, define all the layers you need as class attributes. Layer objects like `nn.Linear` are themselves `nn.Module` subclasses and contain learnable parameters, so that you don't have to instantiate the raw tensors yourself. `nn.Module` will track these internal parameters for you. Refer to the [doc](http://pytorch.org/docs/master/nn.html) to learn more about the dozens of builtin layers. **Warning**: don't forget to call the `super().__init__()` first!\n",
    "\n",
    "3. In the `forward()` method, define the *connectivity* of your network. You should use the attributes defined in `__init__` as function calls that take tensor as input and output the \"transformed\" tensor. Do *not* create any new layers with learnable parameters in `forward()`! All of them must be declared upfront in `__init__`. \n",
    "\n",
    "After you define your Module subclass, you can instantiate it as an object and call it just like the NN forward function in part II.\n",
    "\n",
    "**Tip:** Dense, Linear and fully connected (fc) layers are synonims and mean the same thing.\n",
    "\n",
    "**Tip 2:** Do **not** apply nonlinearity after the last layer of the network.\n",
    "\n",
    "### Module API: Two-Layer Network (20 points)\n",
    "Here is a concrete example of a 2-layer fully connected network:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 35
    },
    "colab_type": "code",
    "executionInfo": {
     "elapsed": 579,
     "status": "ok",
     "timestamp": 1549437589631,
     "user": {
      "displayName": "Alexey Sorokin",
      "photoUrl": "",
      "userId": "16189107033522981450"
     },
     "user_tz": -180
    },
    "id": "K4l-pdw37Uit",
    "outputId": "c1b5aa11-fdd9-40c1-e85b-10005fe48a3a"
   },
   "outputs": [],
   "source": [
    "class TwoLayerFC(nn.Module):\n",
    "    def __init__(self, input_size, hidden_size, num_classes):\n",
    "        super().__init__()\n",
    "        ########################################################################\n",
    "        # TODO: Assign layer objects to class attributes. Use Linear layer     #\n",
    "        ########################################################################\n",
    "        self.fc1 = None\n",
    "        self.fc2 = None\n",
    "        self.relu = None\n",
    "        ########################################################################\n",
    "        #                          END OF YOUR CODE                            #       \n",
    "        ########################################################################\n",
    "\n",
    "        # nn.init package contains convenient initialization methods\n",
    "        # http://pytorch.org/docs/master/nn.html#torch-nn-init \n",
    "        nn.init.kaiming_normal_(self.fc1.weight)\n",
    "        nn.init.kaiming_normal_(self.fc2.weight)\n",
    "        # we do not initialize ReLu, because it does not have parameters\n",
    "\n",
    "    def forward(self, x):\n",
    "        ########################################################################\n",
    "        # TODO: Make fully-connected neural network just like before, but      #\n",
    "        # using layer objects (self.fc1, self.fc2). Don't forget to flatten x! #\n",
    "        ########################################################################\n",
    "        pass\n",
    "        ########################################################################\n",
    "        #                          END OF YOUR CODE                            #       \n",
    "        ########################################################################\n",
    "        return x\n",
    "\n",
    "def test_TwoLayerFC():\n",
    "    input_size = 50\n",
    "    x = torch.zeros((64, input_size), dtype=dtype)  # minibatch size 64, feature dimension 50\n",
    "    model = TwoLayerFC(input_size, 42, 10)\n",
    "    scores = model(x)\n",
    "    print(scores.size())  # you should see [64, 10]\n",
    "test_TwoLayerFC()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "1P-V3t3P7Uiw"
   },
   "source": [
    "### Module API: Check Accuracy\n",
    "Given the validation or test set, we can check the classification accuracy of a neural network. \n",
    "\n",
    "This version is slightly different from the one in part II. You don't manually pass in the parameters anymore."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "ClxARnQn7Uix"
   },
   "outputs": [],
   "source": [
    "def check_accuracy_part34(loader, model):\n",
    "    if loader.dataset.train:\n",
    "        print('Checking accuracy on validation set')\n",
    "    else:\n",
    "        print('Checking accuracy on test set')   \n",
    "    num_correct = 0\n",
    "    num_samples = 0\n",
    "    model.eval()  # set model to evaluation mode\n",
    "    with torch.no_grad():\n",
    "        for x, y in loader:\n",
    "            x = x.to(device=device, dtype=dtype)  # move to device, e.g. GPU\n",
    "            y = y.to(device=device, dtype=torch.long)\n",
    "            scores = model(flatten(x))\n",
    "            _, preds = scores.max(1)\n",
    "            num_correct += (preds == y).sum()\n",
    "            num_samples += preds.size(0)\n",
    "        acc = float(num_correct) / num_samples\n",
    "        print('Got %d / %d correct (%.2f)' % (num_correct, num_samples, 100 * acc))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "m_jekAfY7Ui0"
   },
   "source": [
    "### Module API: Training Loop\n",
    "We also use a slightly different training loop. Rather than updating the values of the weights ourselves, we use an Optimizer object from the `torch.optim` package, which abstract the notion of an optimization algorithm and provides implementations of most of the algorithms commonly used to optimize neural networks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "zqrs0cVX7Ui1"
   },
   "outputs": [],
   "source": [
    "import torch.nn.functional as F\n",
    "\n",
    "def train_part34(model, optimizer, epochs=1):\n",
    "    \"\"\"\n",
    "    Train a model on CIFAR-10 using the PyTorch Module API.\n",
    "    \n",
    "    Inputs:\n",
    "    - model: A PyTorch Module giving the model to train.\n",
    "    - optimizer: An Optimizer object we will use to train the model\n",
    "    - epochs: (Optional) A Python integer giving the number of epochs to train for\n",
    "    \n",
    "    Returns: Nothing, but prints model accuracies during training.\n",
    "    \"\"\"\n",
    "    model = model.to(device=device)  # move the model parameters to CPU/GPU\n",
    "    for e in range(epochs):\n",
    "        for t, (x, y) in enumerate(loader_train):\n",
    "            model.train()  # put model to training mode\n",
    "            x = x.to(device=device, dtype=dtype)  # move to device, e.g. GPU\n",
    "            y = y.to(device=device, dtype=torch.long)\n",
    "\n",
    "            scores = model(flatten(x))\n",
    "            loss = F.cross_entropy(scores, y)\n",
    "\n",
    "            # Zero out all of the gradients for the variables which the optimizer\n",
    "            # will update.\n",
    "            optimizer.zero_grad()\n",
    "\n",
    "            # This is the backwards pass: compute the gradient of the loss with\n",
    "            # respect to each  parameter of the model.\n",
    "            loss.backward()\n",
    "\n",
    "            # Actually update the parameters of the model using the gradients\n",
    "            # computed by the backwards pass.\n",
    "            optimizer.step()\n",
    "\n",
    "            if t % print_every == 0:\n",
    "                print('Iteration %d, loss = %.4f' % (t, loss.item()))\n",
    "                check_accuracy_part34(loader_val, model)\n",
    "                print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "tf2IQ1Rc7Ui6"
   },
   "source": [
    "### Inline question 2: (10 points)\n",
    "\n",
    "What is the line number in the cell above, where\n",
    "\n",
    "1. neural network is executed? (predictions on train data are made)\n",
    "1. gradient descent step is made?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "eI1QPTOO7Ui7"
   },
   "outputs": [],
   "source": [
    "1. # YOUR ANSWER HERE\n",
    "2. # YOUR ANSWER HERE"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "faetHEQJ7Ui9"
   },
   "source": [
    "### Module API: Train a Two-Layer Network\n",
    "Now we are ready to run the training loop. In contrast to part II, we don't explicitly allocate parameter tensors anymore.\n",
    "\n",
    "Simply pass the input size, hidden layer size, and number of classes (i.e. output size) to the constructor of `TwoLayerFC`. \n",
    "\n",
    "You also need to define an optimizer that tracks all the learnable parameters inside `TwoLayerFC`.\n",
    "\n",
    "You don't need to tune any hyperparameters, but you should see model accuracies above 40% after training for one epoch."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 605
    },
    "colab_type": "code",
    "executionInfo": {
     "elapsed": 14994,
     "status": "ok",
     "timestamp": 1549437691804,
     "user": {
      "displayName": "Alexey Sorokin",
      "photoUrl": "",
      "userId": "16189107033522981450"
     },
     "user_tz": -180
    },
    "id": "Gq480aVO7Ui-",
    "outputId": "17baab68-367d-40c0-a892-62fb6280431d"
   },
   "outputs": [],
   "source": [
    "hidden_layer_size = 4000\n",
    "learning_rate = 1e-2\n",
    "model = TwoLayerFC(3 * 32 * 32, hidden_layer_size, 10)\n",
    "optimizer = optim.SGD(model.parameters(), lr=learning_rate)\n",
    "\n",
    "train_part34(model, optimizer)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "pNsDz0857Ui_"
   },
   "source": [
    "# Part IV. PyTorch Sequential API\n",
    "\n",
    "Part III introduced the PyTorch Module API, which allows you to define arbitrary learnable layers and their connectivity. \n",
    "\n",
    "For simple models like a stack of feed forward layers, you still need to go through 3 steps: subclass `nn.Module`, assign layers to class attributes in `__init__`, and call each layer one by one in `forward()`. Is there a more convenient way? \n",
    "\n",
    "Fortunately, PyTorch provides a container Module called `nn.Sequential`, which merges the above steps into one. It is not as flexible as `nn.Module`, because you cannot specify more complex topology than a feed-forward stack, but it's good enough for many use cases.\n",
    "\n",
    "### Sequential API: Two-Layer Network (20 points)\n",
    "Let's see how to rewrite our two-layer fully connected network example with `nn.Sequential`, and train it using the training loop defined above.\n",
    "\n",
    "Again, you don't need to tune any hyperparameters here, but you shoud achieve above 40% accuracy after one epoch of training."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "umKv-p_p7UjA"
   },
   "outputs": [],
   "source": [
    "# We need to wrap `flatten` function in a module in order to stack it\n",
    "# in nn.Sequential\n",
    "class Flatten(nn.Module):\n",
    "    def forward(self, x):\n",
    "        return flatten(x)\n",
    "\n",
    "hidden_layer_size = 4000\n",
    "learning_rate = 1e-2\n",
    "\n",
    "########################################################################\n",
    "# TODO: use nn.Sequential to make the same network as before           #\n",
    "########################################################################\n",
    "model = None\n",
    "########################################################################\n",
    "#                          END OF YOUR CODE                            #       \n",
    "########################################################################\n",
    "\n",
    "optimizer = optim.SGD(model.parameters(), lr=learning_rate)\n",
    "\n",
    "train_part34(model, optimizer)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Visualize Results\n",
    "\n",
    "The following cell shows the classification results on the test set. Stop running whenever you feel comfortable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from matplotlib import pyplot\n",
    "for t, (x, y) in enumerate(loader_test):\n",
    "    scores = model(flatten(x))\n",
    "    _, preds = scores.max(1)\n",
    "    \n",
    "    for i, img in enumerate(x):\n",
    "        pyplot.title('Label:'+str(y.tolist()[i])+' Pred:'+str(preds.tolist()[i]))\n",
    "        pyplot.imshow(x[i][0],  interpolation='nearest', aspect='auto', cmap=pyplot.get_cmap('gray'))\n",
    "        pyplot.show()"
   ]
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "collapsed_sections": [],
   "name": "PyTorch.ipynb",
   "provenance": [],
   "version": "0.3.2"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}