{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "require 'nn'\n", "torch.manualSeed(287)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 1. Preliminaries\n", "\n", "\n", "## What does 'nn' buy us?\n", " - Let's us declaratively specify neural network architectures that compute their forward and backward passes automatically\n", " - This is huge! Now we don't need to hand-code gradients; can just define our model and optimize\n", " \n", "## Networks are specified with (essentially) two kinds of abstract objects: Modules and Criteria\n", " - Modules (recursively) define a transformation from an input to an output; can think of them as functions\n", " - A Criterion calculates a loss based on an input (typically computed by a module) and a target\n", " \n", "## Be sure to check the official 'nn' documentation:\n", " - https://github.com/torch/nn/blob/master/doc/index.md\n", "\n", "## This tutorial (which partially inspired the below) may also be useful:\n", " - https://www.cs.ox.ac.uk/people/nando.defreitas/machinelearning/practicals/practical4.pdf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 2. Modules\n", "$\\newcommand{\\reals}{\\mathbb{R}}$\n", "$\\newcommand{\\boldx}{\\mathbf{x}}$\n", "$\\newcommand{\\boldw}{\\mathbf{w}}$\n", "\n", "We will use nn.CMul as our first example of a module. \n", "```lua\n", "f = nn.CMul(size)\n", "```\n", "creates a module that computes the function $f: \\reals^{size} \\rightarrow \\reals^{size}$ defined by $f(\\boldx) = \\boldx \\odot \\boldw$, where $\\odot$ is elementwise multiplication. $\\boldw$ are the function's parameters, which 'nn' will automatically initialize to something reasonable." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2a. The Forward Pass" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "-0.3180\n", "-0.0484\n", " 0.1885\n", " 0.4165\n", " 0.3364\n", "[torch.DoubleTensor of size 5]\n", "\n", "\n", "-0.3180\n", "-0.0968\n", " 0.5655\n", " 1.6661\n", " 1.6822\n", "[torch.DoubleTensor of size 5]\n", "\n" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x = torch.range(1, 5) -- will input this into the module\n", "\n", "f = nn.CMul(x:size()) -- create the module\n", "-- let's see what f's parameters were initialized to. ('nn' always inits to something reasonable)\n", "print(f.weight)\n", "print()\n", "\n", "-- to apply f to an input x we call f:forward(x)\n", "print(f:forward(x))" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "-0.3180\n", "-0.0968\n", " 0.5655\n", " 1.6661\n", " 1.6822\n", "[torch.DoubleTensor of size 5]\n", "\n" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "-- modules are stateful; they store their parameters (if any) and also their last output\n", "print(f.output) -- N.B. every module has an 'output' member" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's create another simple module. 
\n", "```lua\n", "g = nn.Sum(j) \n", "```\n", "creates a module computing the function $g: \\reals^{D_1 \\times \\ldots \\times \\, D_j \\, \\times \\ldots \\times \\, D_M} \\rightarrow \\reals^{D_1 \\times \\ldots \\times \\, D_{j-1} \\, \\times \\, D_{j+1} \\, \\times \\ldots \\times \\, D_M}$ that sums the input over dimension $j$ (thus decreasing the number of dimensions by 1)." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ " 15\n", "[torch.DoubleTensor of size 1]\n", "\n" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "g = nn.Sum(1) -- sum over dimension 1\n", "print(g:forward(x))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2b. Batching\n", "\n", "Most modules allow batching of inputs along the first dimension. That is, if your module expects inputs $x \\in \\reals^{size}$, you can give it an input $X \\in \\reals^{N \\times size}$, and it will apply itself to each $x$ along the first dimension of $X$" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "-0.3180 -0.0968 0.5655 1.6661 1.6822\n", "-0.3180 -0.0968 0.5655 1.6661 1.6822\n", "-0.3180 -0.0968 0.5655 1.6661 1.6822\n", "[torch.DoubleTensor of size 3x5]\n", "\n" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "-- let's batch calls to f\n", "X = x:view(1,5):expand(3, 5) -- here, N = 3\n", "print(f:forward(X))\n", "\n", "-- whenever you can, you should batch; it'll be much faster" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2c. Container Modules\n", "\n", "Individual modules can be combined using 'Container' modules to compute more complicated functions. 
For instance, the modules g and f can be composed to get g(f()) using \n", "```lua\n", "nn.Sequential() \n", "```" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ " 3.4990\n", "[torch.DoubleTensor of size 1]\n", "\n" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "h = nn.Sequential() -- this module computes the function defined by composing its child modules' functions in order\n", "h:add(f) -- add the module f as h's first child\n", "h:add(g) -- add the module g as h's second child\n", "print(h:forward(x)) -- computes g(f(x))) = sum_i [ x \\odot w ], where \\odot is elementwise multiplication" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ " -0.3180\n", " -0.0968\n", " 0.5655\n", " 1.6661\n", " 1.6822\n", " 15.0000\n", "[torch.DoubleTensor of size 6]\n", "\n" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "-- though nn.Sequential is the container you'll use most (at least early in the course), there are others.\n", "-- nn.Concat(j) is a container that computes the function defined by applying each of its child modules to a single\n", "-- input, and then concatenating the respective outputs along dimension j\n", "cat = nn.Concat(1) -- concatenate outputs along 1st dimension\n", "cat:add(f)\n", "cat:add(g)\n", "print(cat:forward(x))" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "nn.Sequential {\n", " [input -> (1) -> (2) -> output]\n", " (1): nn.CMul\n", " (2): nn.Sum\n", "}\n", "{\n", " gradInput : DoubleTensor - empty\n", " modules : \n", " {\n", " 1 : \n", " nn.CMul\n", " {\n", " output : DoubleTensor - size: 5\n", " gradInput : DoubleTensor - empty\n", " _output : DoubleTensor - size: 5\n", " _repeat : DoubleTensor - empty\n", " _expand : DoubleTensor - size: 3x5\n", " gradWeight : DoubleTensor - size: 5\n", " _weight : DoubleTensor - size: 5\n", " size : LongStorage - size: 1\n", " weight : DoubleTensor - size: 5\n", " }\n", " 2 : \n", " nn.Sum\n", " {\n", " gradInput : DoubleTensor - empty\n", " " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "dimension : 1\n", " output : DoubleTensor - size: 1\n", " }\n", " }\n", " output : DoubleTensor - size: 1\n", "}\n", "\n" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "-- You can print a module to see its contents\n", "print(h)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "nn.CMul\n", "{\n", " output : DoubleTensor - size: 5\n", " gradInput : DoubleTensor - empty\n", " _output : DoubleTensor - size: 5\n", " _repeat : DoubleTensor - empty\n", " _expand : DoubleTensor - size: 3x5\n", " gradWeight : DoubleTensor - size: 5\n", " _weight : DoubleTensor - size: 5\n", " size : LongStorage - size: 1\n", " weight : DoubleTensor - size: 5\n", "}\n", "nn.CMul\n", "{\n", " output : DoubleTensor - size: 5\n", " gradInput : DoubleTensor - empty\n", " _output : DoubleTensor - size: 5\n", " _repeat : DoubleTensor - empty\n", " _expand : DoubleTensor - size: 3x5\n", " gradWeight : DoubleTensor - size: 5\n", " _weight : DoubleTensor - size: 5\n", " size : LongStorage - size: 1\n", " weight : DoubleTensor - size: 5\n", "}\n" ] }, "execution_count": 9, 
"metadata": {}, "output_type": "execute_result" } ], "source": [ "-- to access the children of containers you can use :get(i) or index into a list of children returned by .modules\n", "print(h:get(1))\n", "print(h.modules[1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2d. The Backward Pass" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$\\newcommand{\\boldz}{\\mathbf{z}}$\n", "$\\newcommand{\\btheta}{\\boldsymbol{\\theta}}$\n", "\n", "Suppose we have a module computing a function $h$ that participates in the definition of a loss function $L$. For a particular input $\\boldx \\in \\reals^n$ let $\\boldz \\in \\reals^m$ be defined by $\\boldz = h(\\boldx)$, which allows us to write our loss function as $L(\\boldz)$. By the (multivariate) chain rule, the gradient of $L$ wrt $x_i$ is\n", "\n", "\\begin{align*}\n", "\\frac{\\partial L}{\\partial x_i} = \\sum_j \\frac{\\partial L}{\\partial z_j} \\frac{\\partial z_j}{\\partial x_i}\n", "\\end{align*}\n", "\n", "Assuming $L$ returns a scalar, we can rewrite the above for the entire $\\boldx$ as\n", "\n", "\\begin{align*}\n", "\\frac{\\partial L}{\\partial \\boldx} = \\left(\\frac{\\partial L}{\\partial \\boldz}\\right)^T \\frac{\\partial \\boldz}{\\partial \\boldx},\n", "\\end{align*}\n", "where $\\frac{\\partial \\boldz}{\\partial \\boldx}$ is the Jacobian, which lives in $\\reals^{m \\times n}$.\n", "\n", "Each 'nn' module knows how to (implicitly) compute $\\frac{\\partial \\boldz}{\\partial \\boldx}$ -- the gradient of its output wrt its input -- and so can compute $\\frac{\\partial L}{\\partial \\boldx}$ if it is also handed $\\frac{\\partial L}{\\partial \\boldz}$. In just the same way, if a module has parameters $\\theta$, it knows how to calculate $\\frac{\\partial \\boldz}{\\partial \\btheta}$, and can therefore calculate $\\frac{\\partial L}{\\partial \\btheta}$ if it is handed $\\frac{\\partial L}{\\partial \\boldz}$.\n", "\n", "It's very important to know the 'nn' terminology for these gradients:\n", " - $\\frac{\\partial L}{\\partial \\boldx}$ is called **'gradInput'** in nn; it's the gradient of the loss wrt a module's input\n", " \n", " - $\\frac{\\partial L}{\\partial \\boldz}$ is called **'gradOutput'** in nn; it's the gradient of the loss wrt a module's output\n", " \n", " - $\\frac{\\partial L}{\\partial \\btheta}$ is called either **'gradWeight'** or **'gradBias'** in nn; it's the gradient of the loss wrt a module's parameters\n", "\n", "Given $\\frac{\\partial \\boldz}{\\partial \\boldx}$, an 'nn' module computes $\\frac{\\partial L}{\\partial \\boldx}$ with the :backward() function, and stores it in its 'gradInput' member, as follows:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ " 1.6834\n", "[torch.DoubleTensor of size 1]\n", "\n", "-0.5353\n", "-0.0815\n", " 0.3173\n", " 0.7012\n", " 0.5664\n", "[torch.DoubleTensor of size 5]\n", "\n" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gradOut = torch.randn(1) -- let's make up a random gradOutput for dL/dz (of same dimension as output of h)\n", "print(gradOut)\n", "\n", "-- let's now compute dL/dx with our gradOut\n", "-- N.B. you MUST call :forward() before :backward() (and provide the same input); note we called :forward() above \n", "h:backward(x, gradOut) \n", "print(h.gradInput) -- N.B. 
each module also has a gradInput member" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's check the gradients :backward() computed. Recall $h = g(f(\\boldx))$ = nn.Sequential():add(f):add(g). Because $h$ is an nn.Sequential/composition, it first computes the gradient of the loss wrt $g$'s input, which is $f(\\boldx)$. So we get\n", "\\begin{align*}\n", "\\frac{\\partial L}{\\partial f(\\boldx)} = \\frac{\\partial L}{\\partial h(\\boldx)} \\cdot \\frac{\\partial h(\\boldx)}{\\partial f(\\boldx)} = gradOut \\cdot g'(f(\\boldx))\n", "\\end{align*}\n", "\n", "Since $g$ just sums, $g'(\\boldx)_i = 1$ for each $i$, and so $\\frac{\\partial L}{\\partial f(\\boldx)} = gradOut \\cdot$ torch.ones(x:size())" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ " 1.6834\n", " 1.6834\n", " 1.6834\n", " 1.6834\n", " 1.6834\n", "[torch.DoubleTensor of size 5]\n", "\n" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(g.gradInput)\n", "assert((g.gradInput - torch.ones(x:size()):mul(gradOut[1])):abs():max() < 1e-10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have $\\frac{\\partial L}{\\partial f(\\boldx)}$, we can calculate $\\frac{\\partial L}{\\partial \\boldx}$ as $(\\frac{\\partial L}{\\partial f(\\boldx)})^T \\frac{\\partial f}{\\partial \\boldx}$. Since $f(\\boldx)_i = w_i \\cdot x_i$, we have that $\\frac{\\partial f_i}{\\partial x_i} = w_i$, with all other partial derivatives equal to 0. Thus, $\\frac{\\partial L}{\\partial \\boldx}$ is g.gradInput$^T diag(\\boldw) =$ g.gradInput $\\odot \\boldw$." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true }, "outputs": [], "source": [ "assert((f.gradInput - torch.cmul(g.gradInput, f.weight)):abs():max() < 1e-10)\n", "\n", "-- (Note that f.gradInput == h.gradInput, since h = g(f(x)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In addition to computing $\\frac{\\partial L}{\\partial \\boldx}$, backward() also computes $\\frac{\\partial L}{\\partial \\btheta}$, where $\\btheta$ are the module's parameters. Specifically, modules accumulate the gradients wrt their parameters in their 'gradWeight' and 'gradBias' members. So, let's redo the above example, this time paying attention to parameters." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ " 1.6834\n", " 3.3668\n", " 5.0503\n", " 6.7337\n", " 8.4171\n", "[torch.DoubleTensor of size 5]\n", "\n" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "-- since backward() accumulates (i.e., adds) gradients, we need to start by zeroing out gradWeight and gradBias\n", "h:zeroGradParameters() -- N.B. calling zeroGradParameters() on a container recursively zeroes grads on children\n", "h:backward(x, gradOut)\n", "print(f.gradWeight)\n", "\n", "-- let's check that the gradient is correct, using a calculation similar to the one used above for dL/dx\n", "assert((f.gradWeight - torch.cmul(g.gradInput, x)):abs():max() < 1e-10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2e. Module Internals\n", "Now that we know about :forward() and :backward(), let's get a more precise sense of how they work. You'll need to know this if you ever want to implement your own modules!" 
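, "\n", "\n", "If you ever do write your own module, the usual pattern is to subclass nn.Module with torch.class and then override the update functions shown in the next two cells. As a minimal, purely illustrative sketch (a hypothetical nn.TimesTwo module that simply doubles its input):\n", "```lua\n", "-- hypothetical example module; not part of the official 'nn' package\n", "local TimesTwo, parent = torch.class('nn.TimesTwo', 'nn.Module')\n", "\n", "function TimesTwo:__init()\n", "   parent.__init(self) -- sets up empty self.output and self.gradInput tensors\n", "end\n", "\n", "function TimesTwo:updateOutput(input)\n", "   self.output:resizeAs(input):copy(input):mul(2) -- output = 2 * input\n", "   return self.output\n", "end\n", "\n", "function TimesTwo:updateGradInput(input, gradOutput)\n", "   self.gradInput:resizeAs(gradOutput):copy(gradOutput):mul(2) -- dL/dx = 2 * dL/dz\n", "   return self.gradInput\n", "end\n", "\n", "-- no accGradParameters() is needed, since this module has no parameters;\n", "-- e.g., nn.TimesTwo():forward(torch.range(1, 3)) would return [2, 4, 6]\n", "```"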
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "-- Recall that Module is an abstract class. The (abstract) functions :forward() and :backward() are defined in terms\n", "-- of 3 functions subclasses must implement: updateOutput(), updateGradInput(), accGradParameters()\n", "\n", "-- The below code is from https://github.com/torch/nn/blob/master/Module.lua; the comments are mine\n", "function Module:forward(input)\n", " return self:updateOutput(input) -- subclasses must implement updateOutput, which sets self.output\n", "end\n", "\n", "function Module:backward(input, gradOutput, scale)\n", " scale = scale or 1\n", " self:updateGradInput(input, gradOutput) -- subclasses must implement updateGradInput, which sets self.gradInput\n", " self:accGradParameters(input, gradOutput, scale) -- subclasses must add dL/d\\theta to self.gradWeight etc\n", " return self.gradInput\n", "end" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "-- here are some very simplified versions of these 3 functions for the CMul module (with new comments),\n", "-- adapted from https://github.com/torch/nn/blob/master/CMul.lua\n", "\n", "-- N.B. CMul inherits from module, and so has .output, and .gradInput members; \n", "-- because it has parameters, it also has .weight and .gradWeight members \n", "\n", "function CMul:updateOutput(input) \n", " self.output:resizeAs(input):copy(input) -- self.output = input \n", " self.output:cmul(self.weight) -- self._output = self._output .* self._weight\n", " return self.output\n", "end\n", "\n", "function CMul:updateGradInput(input, gradOutput) \n", " self.gradInput:resizeAs(input):zero() -- zero out our gradInput storer\n", " self.gradInput:addcmul(1, self.weight, gradOutput) -- self.gradInput = self.gradOutput .* self.weight\n", " return self.gradInput\n", "end\n", "\n", "function CMul:accGradParameters(input, gradOutput, scale)\n", " scale = scale or 1\n", " -- don't zero out gradWeight, because we're accumulating!\n", " self.gradWeight:addcmul(scale, input, gradOutput) -- self.gradWeight = self.gradOutput .* self.input\n", "end" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2f. Some Useful Modules with Parameters" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "-2.8031\n", " 1.9383\n", "-1.8525\n", "[torch.DoubleTensor of size 3]\n", "\n" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "-- In addition to nn.CMul, you will likely want to know about\n", "lin = nn.Linear(x:size(1), 3) -- computes Wx + b, where W \\in R^{5 x 3} and b \\in R^3\n", "print(lin:forward(x))" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "-0.3408 0.8594 1.2139\n", " 0.1566 -0.5897 0.1788\n", "-0.0315 0.1821 -0.7354\n", " 0.1246 0.2621 -0.0320\n", " 1.1030 -0.9798 0.3587\n", "[torch.DoubleTensor of size 5x3]\n", "\n", "-0.3408 0.8594 1.2139\n", " 0.1566 -0.5897 0.1788\n", " 1.1030 -0.9798 0.3587\n", "[torch.DoubleTensor of size 3x3]\n", "\n", "(1,.,.) = \n", " -0.3408 0.8594 1.2139\n", " -0.0315 0.1821 -0.7354\n", "\n", "(2,.,.) = \n", " 0.1246 0.2621 -0.0320\n", " 1.1030 -0.9798 0.3587\n", "\n", "(3,.,.) 
= \n", " 0.1566 -0.5897 0.1788\n", " -0.0315 0.1821 -0.7354\n", "[torch.DoubleTensor of size 3x2x3]\n", "\n" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "-- LookupTables will be extremely important for this course; they map indices to corresponding weight vectors\n", "LT = nn.LookupTable(5, 3) -- maps indices (1 thru 5) to corresponding weight vectors, which live in R^3\n", "\n", "-- let's look at a LookupTable's weights\n", "print(LT.weight)\n", "\n", "-- LookupTables take indices as input!\n", "idxs = torch.LongTensor({1,2,5})\n", "print(LT:forward(idxs)) -- extracts 1st, 2nd, and 5th rows of weights\n", "\n", "-- can also batch input to a LookupTable, as follows\n", "batchIdxs = torch.LongTensor({{1, 3}, {4, 5}, {2, 3}}) -- here, there are 3 examples, each associated with 2 idxs\n", "print(LT:forward(batchIdxs))\n" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ " 1.4369\n", " 2.2328\n", " 2.5575\n", " 4.2737\n", " 4.6302\n", "[torch.DoubleTensor of size 5]\n", "\n" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "-- nn.Add computes x + b, where b \\in R^5 (though can also be used to a single constant)\n", "add = nn.Add(x:size()) \n", "print(add:forward(x))\n", "\n", "-- there are many more (esp. convolutions, which we'll talk about later in the course)!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2g. Some Useful Modules Without Parameters" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ " 0.6790\n", " 0.5264\n", " 0.6779\n", " 0.6240\n", " 0.5938\n", "[torch.DoubleTensor of size 5]\n", "\n" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "-1.3853\n", "-2.0289\n", "-1.3905\n", "-1.6280\n", "-1.7545\n", "[torch.DoubleTensor of size 5]\n", "\n", " 0.6346\n", " 0.1051\n", " 0.6315\n", " 0.4672\n", " 0.3626\n", "[torch.DoubleTensor of size 5]\n", "\n", " 0.7491\n", " 0.1055\n", " 0.7440\n", " 0.5065\n", " 0.3799\n", "[torch.DoubleTensor of size 5]\n", "\n", " 0.9011\n", " 2.5226\n", "[torch.DoubleTensor of size 2]\n", "\n", " 0.4434\n", " 0.2421\n", " 1.2315\n", "[torch.DoubleTensor of size 3]\n", "\n", " 0.9011 0.0143\n", " 0.8167 1.3009\n", " 0.0597 2.5226\n", "[torch.DoubleTensor of size 3x2]\n", "\n" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ " 0.9011 -0.0143 -0.8167\n", " 1.3009 -0.0597 2.5226\n", "[torch.DoubleTensor of size 2x3]\n", "\n" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "-- non-linearities/'transfer' functions\n", "x = torch.randn(5)\n", "nonlin1 = nn.Sigmoid()\n", "nonlin2 = nn.LogSoftMax()\n", "nonlin3 = nn.Tanh()\n", "nonlin4 = nn.ReLU()\n", "\n", "print(nonlin1:forward(x))\n", "print(nonlin2:forward(x))\n", "print(nonlin3:forward(x))\n", "print(nonlin4:forward(x))\n", "\n", "-- other mathematical operations\n", "X = torch.randn(3, 2)\n", "op1 = nn.Max(1, 2) -- maxes over dimension 1, expects 2d input\n", "op2 = nn.Mean(2, 2) -- means over dimension 2, expects 2d input\n", "op3 = nn.Abs() \n", "\n", "print(op1:forward(X))\n", "print(op2:forward(X))\n", "print(op3:forward(X))\n", "\n", "-- there are also Modules that reshape or review their arguments; one you'll use most often is nn.View,\n", "-- which takes in the desired dimension 
sizes\n", "print(nn.View(2,3):forward(X))\n", "\n", "-- there are many more!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2h. Advanced Containers/Table-Layers\n", "\n", "All the containers (and other modules) we've seen so far take in single Tensors as arguments. This won't be sufficient if we want functions of multiple inputs (especially if they're of different sizes or types).\n", "\n", "As a motivating example, suppose we want to make a Linear-like layer over both sparse and dense features. That is, we want to compute \n", "\n", "\\begin{align*}\n", "\\left[ \\mathbf{W}_o \\mathbf{W}_d \\right] \\begin{bmatrix} \\boldx_o \\\\ \\boldx_d \\end{bmatrix} + \\mathbf{b}, \n", "\\end{align*}\n", "\n", "where matrices $\\mathbf{W}_o$ and $\\mathbf{W}_d$ are concatenated horizontally, and a one-hot vector $\\boldx_o$ is stacked on top of a dense vector $\\boldx_d$ (and $\\mathbf{b}$ is a bias). Note that the above is equivalent to $\\mathbf{W}_o \\boldx_o + \\mathbf{W}_d \\boldx_d + \\mathbf{b}$. Moreover, since we know that $\\mathbf{W}_o \\boldx_o$ is equivalent to a lookup in a LookupTable, we can do the following: " ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{\n", " 1 : DoubleTensor - size: 1x2\n", " 2 : DoubleTensor - size: 1x2\n", "}\n" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "D_o, D_d, D_h = 5, 3, 2 -- width of W_o, width of W_d, height of both W_o and W_d\n", "x_o = torch.LongTensor({2}) -- index equivalent of [0 1 0 0 0]\n", "x_d = torch.randn(1, D_d)\n", "\n", "-- our first example of a Table layer/container\n", "par = nn.ParallelTable() -- takes a TABLE of inputs, applies i'th child to i'th input, and returns a table\n", "par:add(nn.LookupTable(D_o, D_h)) -- first child\n", "par:add(nn.Linear(D_d, D_h)) -- second child\n", "\n", "-- this parallel table produces a table of 2 1xD_h tensors corresponding to W_o x_o and W_d x_d + b resp.\n", "print(par:forward({x_o, x_d}))" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "nn.Sequential {\n", " [input -> (1) -> (2) -> output]\n", " (1): nn.ParallelTable {\n", " input\n", " |`-> (1): nn.LookupTable\n", " |`-> (2): nn.Linear(3 -> 2)\n", " ... -> output\n", " }\n", " (2): nn.CAddTable\n", "}\n", "{\n", " gradInput : table: 0x40e50d38\n", " modules : \n", " {\n", " 1 : \n", " nn.ParallelTable {\n", " input\n", " |`-> (1): nn.LookupTable\n", " |`-> (2): nn.Linear(3 -> 2)\n", " ... 
-> output\n", " }\n", " {\n", " gradInput : table: 0x40e50d38\n", " modules : \n", " {\n", " 1 : \n", " nn.LookupTable\n", " {\n", " copiedInput : false\n", " weight : DoubleTensor - size: 5x2\n", " shouldScaleGradByFreq : false\n", " " ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ " gradWeight : DoubleTensor - size: 5x2\n", " gradInput : DoubleTensor - empty\n", " _count : IntTensor - empty\n", " _input : LongTensor - empty\n", " output : DoubleTensor - size: 1x2\n", " }\n", " 2 : \n", " nn.Linear(3 -> 2)\n", " {\n", " gradBias : DoubleTensor - size: 2\n", " weight : DoubleTensor - size: 2x3\n", " bias : DoubleTensor - size: 2\n", " gradInput : DoubleTensor - empty\n", " addBuffer : DoubleTensor - size: 1\n", " gradWeight : DoubleTensor - size: 2x3\n", " output : DoubleTensor - size: 1x2\n", " }\n", " }\n", " output : \n", " {\n", " 1 : DoubleTensor - size: 1x2\n", " 2 : DoubleTensor - size: 1x2\n", " " ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ " }\n", " }\n", " 2 : \n", " nn.CAddTable\n", " {\n", " gradInput : table: 0x41759cc8\n", " output : DoubleTensor - empty\n", " }\n", " }\n", " output : DoubleTensor - empty\n", "}\n", "\n" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "-- to get our full linear transformation, we need to add the two tables. \n", "-- as usual, to compose functions in order we use nn.Sequential\n", "spAndDenseLinear = nn.Sequential()\n", "spAndDenseLinear:add(par)\n", "spAndDenseLinear:add(nn.CAddTable()) -- CAddTable adds its incoming tables\n", "\n", "-- let's look at spAndDenseLinear\n", "print(spAndDenseLinear)\n", "print()" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ " 1.2212 -2.0951\n", "[torch.DoubleTensor of size 1x2]\n", "\n", "\n" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "-- finally, let's use spAndDenseLinear to compute W_o x_o + W_d x_d + b\n", "print(spAndDenseLinear:forward({x_o, x_d}))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that table containers/layers allow networks to take tables as input and produce them as output. Here we show some more layers that are useful when dealing with table inputs." ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "-0.6461 2.0589 -0.7707 0.0129 -0.6880 0.9008 0.0479 1.5202\n", "[torch.DoubleTensor of size 1x8]\n", "\n", "\n", "{\n", " 1 : DoubleTensor - size: 1x3\n", " 2 : DoubleTensor - size: 1x3\n", "}\n" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "t = {torch.randn(1, 3), torch.randn(1, 3), torch.randn(1, 2)}\n", "\n", "-- JoinTable(dim, nDims) makes a tensor from a table of tensors (of nDims dimensions) by concating along dim\n", "print(nn.JoinTable(2, 2):forward(t)) \n", "print()\n", "-- NarrowTable(offset, len) returns len tables starting at offset\n", "print(nn.NarrowTable(1, 2):forward(t))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Criteria\n", "\n", "Criterion objects are used to represent loss functions. They are similar to modules in that you can call :forward() and :backward() on them, and that they have .output and .gradInput members. 
The major difference is that :forward() takes two arguments, namely the predictions (e.g., scores or log-probabilities) and the true targets/labels." ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "1.8225688302461\t\n", "\n", "5.4677064907383\t\n", "\n", " 2.3650\n", "-1.6119\n", " 3.6986\n", "[torch.DoubleTensor of size 3]\n", "\n", "\n" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mse = nn.MSECriterion() -- mean squared error criterion, often used for (scalar) regression loss\n", "y = torch.randn(3) -- scalar targets\n", "yhat = torch.zeros(3) -- scalar predictions\n", "print(mse:forward(yhat, y)) -- returns MSE = 1/n sum_i (yhat_i - y_i)^2\n", "print()\n", "\n", "-- we can remove the 1/n factor as follows\n", "mse.sizeAverage = false\n", "print(mse:forward(yhat, y))\n", "print()\n", "\n", "-- get the gradient of the loss wrt the predictions using :backward()\n", "dLdyhat = mse:backward(yhat, y)\n", "print(dLdyhat)\n", "print()\n", "\n", "-- generally we will pass dLdyhat as gradOutput when calling :backward() on the network that computed yhat\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "-- here's a classification criterion you're likely to use:\n", "nllcrit = nn.ClassNLLCriterion() -- log loss for multiclass classification; expects log-probabilities and true class\n", "\n", "Z = torch.randn(2, 3) -- 3-class classification scores for 2 examples\n", "Yhat = nn.LogSoftMax():forward(Z) -- make log probabilities\n", "Y = torch.Tensor({3,1}) -- true classes\n", "\n", "print(nllcrit:forward(Yhat, Y))\n", "print()\n", "print(nllcrit:backward(Yhat, Y)) -- N.B. ClassNLLCriterion (by default) divides by numExamples, which affects grads" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Training\n", "\n", "Here we'll show how to train a 1-layer neural network on a regression-style task." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "-- let's generate some data\n", "torch.manualSeed(287)\n", "N = 5 -- num examples\n", "F = 4 -- num features\n", "X = torch.randn(N, F)\n", "y = torch.mv(X, torch.randn(F)):pow(2):add(torch.randn(N))\n", "\n", "-- let's create a 1-layer MLP\n", "H = 3 -- hidden layer size\n", "mlp = nn.Sequential()\n", "mlp:add(nn.Linear(F,H))\n", "mlp:add(nn.Tanh())\n", "mlp:add(nn.Linear(H, 1))\n", "-- note above equivalent to mlp = nn.Sequential():add(nn.Linear(F,H)):add(nn.Tanh()):add(nn.Linear(H,1))\n", "\n", "-- now define our criterion\n", "mse = nn.MSECriterion()\n", "\n", "-- we can flatten (and then retrieve) all parameters (and gradParameters) of a module in the following way:\n", "params, gradParams = mlp:getParameters() -- N.B. getParameters() moves around memory, and should only be called once!\n", "eta = 0.01\n", "\n", "-- now that we have our parameters flattened, we'll train with very simple SGD\n", "-- note that all operations are batched across all of X\n", "nEpochs = 5\n", "for i = 1, nEpochs do\n", " -- zero out our gradients\n", " gradParams:zero()\n", " -- do forward pass\n", " preds = mlp:forward(X)\n", " -- get loss\n", " loss = mse:forward(preds, y)\n", " print(\"epoch \" .. i .. \", loss: \" .. 
loss)\n", " -- backprop\n", " dLdpreds = mse:backward(preds, y) -- gradients of loss wrt preds\n", " mlp:backward(X, dLdpreds)\n", " -- update params with sgd step\n", " params:add(-eta, gradParams)\n", "end" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While extracting parameters and gradParameters with :getParameters() can often be useful, especially if you want to hand them to more sophisticated optimization algorithms (e.g., in the 'optim' package), if you're just doing (S)GD you can also use the module function :updateParameters(). Here's the same example as above using :updateParameters()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "-- let's generate some data\n", "torch.manualSeed(287)\n", "N = 5 -- num examples\n", "F = 4 -- num features\n", "X = torch.randn(N, F)\n", "y = torch.mv(X, torch.randn(F)):pow(2):add(torch.randn(N))\n", "\n", "-- let's create a 1-layer MLP\n", "H = 3 -- hidden layer size\n", "mlp = nn.Sequential()\n", "mlp:add(nn.Linear(F,H))\n", "mlp:add(nn.Tanh())\n", "mlp:add(nn.Linear(H, 1))\n", "-- note above equivalent to mlp = nn.Sequential():add(nn.Linear(F,H)):add(nn.Tanh()):add(nn.Linear(H,1))\n", "\n", "-- now define our criterion\n", "mse = nn.MSECriterion()\n", "\n", "eta = 0.01\n", "\n", "-- now that we have our parameters flattened, we'll train with very simple SGD\n", "-- note that all operations are batched across all of X\n", "nEpochs = 5\n", "for i = 1, nEpochs do\n", " -- zero out our gradients\n", " mlp:zeroGradParameters()\n", " -- do forward pass\n", " preds = mlp:forward(X)\n", " -- get loss\n", " loss = mse:forward(preds, y)\n", " print(\"epoch \" .. i .. \", loss: \" .. loss)\n", " -- backprop\n", " dLdpreds = mse:backward(preds, y) -- gradients of loss wrt preds\n", " mlp:backward(X, dLdpreds)\n", " -- update params with sgd step\n", " mlp:updateParameters(eta) -- computes parameters = parameters - eta * gradient\n", "end" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 5. Common Bugs\n", "\n", "If you're code doesn't seem to be working, here are some things to check:\n", "\n", "- Gradient Tensors are zeroed out before each epoch\n", "- :backward() receives a gradOuptut argument of same dimension as input\n", "- Batching happens only along first dimension\n", "- Used the correct sign in gradient update\n", "\n", "Also worth noting that if you're doing something vaguely complicated/non-standard with 'nn', you should always check your gradients (with finite differences)!" ] } ], "metadata": { "kernelspec": { "display_name": "iTorch", "language": "lua", "name": "itorch" }, "language_info": { "name": "lua", "version": "20100" } }, "nbformat": 4, "nbformat_minor": 0 }