{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import torch\n", "from torch.autograd import Variable\n", "import torch.nn as nn\n", "import torch.nn.functional as F\n", "import torch.utils.data as data_utils\n", "import operator\n", "\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## RNN intuition\n", "\n", "Let us assume that we have an input $x = [x_1, x_2, ..., x_N]$ and we need to learn the mapping for some output $y = [y_1, y_2, ..., y_N]$, where $N$ is variable for each instance. In this case we can't just use a simple feed forward neural network which maps $x \\rightarrow y$, as this will not work with variable length sequences. Furthermore, the number or parameters required for training such a network would be proportional to $size(x_i)*N$. This is a major memory cost. Additionally, if the sequence has some common mapping between $x_i$ and $y_i$, then we would be learning redundant weights for each pair in the sequence. This is where an RNN network is more useful. The basic idea is that each input $x_i$ is processed in a similar fashion using the same processing module and some additional context variable (which we will henseforth refer to as the **hidden state**). This hidden state should capture some information about the part of the sequence which has already been processed. Now at each step of the sequence we need to do the following:\n", "\n", "* Generate the output based on the previous hidden state and current input\n", "* Update the hidden state based on the previous hidden state and current input. \n", "\n", "The order of the above steps is not fixed and forms the basis of many RNN spin-offs. What is important, at each step, is to have a new output and a new hidden state. Sometimes, the hidden state and the outputs are the same, to make the network smaller. But the core idea remains same. Below we would like to formalize the general intuition of an RNN module. \n", "\n", "Initialize an initial hidden state $h_{0}$ with some initial value. \n", "\n", "At timestep n: \n", "$$\n", "\\begin{equation}\n", "h^{'}_{i} = f(x_{i},h_{i})\\\\\n", "y_{i} = g(x_{i},h^{'}_{i})\\\\\n", "h_{i+1} = h^{'}_{i}\\\\\n", "\\end{equation}\n", "$$\n", "\n", "Here $y_{i}$ is the output and $h^{'}_{i}$ is the intermediate hidden state." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "class Input2Hidden(nn.Module):\n", " def __init__(self, x_dim, concat_layers=False):\n", " \"\"\"Input2Hidden module\n", " \n", " Args:\n", " x_dim: input vector dimension\n", " concat_layers: weather to concat input and hidden layers or sum them\n", " \"\"\"\n", " super(Input2Hidden, self).__init__()\n", " self.concat_layers = concat_layers\n", " input_dim = x_dim\n", " if self.concat_layers:\n", " input_dim = 2*x_dim\n", " self.linear_layer = nn.Linear(input_dim, x_dim)\n", " \n", " def forward(self, x, h):\n", " if self.concat_layers:\n", " cell_input = torch.cat([x,h], dim=1)\n", " else:\n", " cell_input = x + h\n", " assert isinstance(cell_input, Variable)\n", " logit = F.tanh(self.linear_layer(cell_input))\n", " return logit\n", " \n", " \n", "class Hidden2Output(nn.Module):\n", " def __init__(self, x_dim, out_dim, concat_layers=False):\n", " \"\"\"Hidden2Output module\n", " \n", " Args:\n", " x_dim: input vector dimension\n", " out_dim: output vector dimension\n", " concat_layers: weather to concat input and hidden layers or sum them\n", " \"\"\"\n", " super(Hidden2Output, self).__init__()\n", " input_dim = x_dim\n", " self.concat_layers = concat_layers\n", " if self.concat_layers:\n", " input_dim = 2*x_dim\n", " self.linear_layer = nn.Linear(input_dim, out_dim)\n", " \n", " def forward(self, x, h):\n", " if self.concat_layers:\n", " cell_input = torch.cat([x,h], dim=1)\n", " else:\n", " cell_input = x + h\n", " assert isinstance(cell_input, Variable)\n", " logit = F.tanh(self.linear_layer(cell_input))\n", " return logit\n", " \n", " \n", "class CustomRNNCell(nn.Module):\n", " def __init__(self, i2h, h2o):\n", " super(CustomRNNCell, self).__init__()\n", " self.i2h = i2h\n", " self.h2o = h2o\n", " \n", " def forward(self, x, h):\n", " assert isinstance(x, Variable)\n", " assert isinstance(h, Variable)\n", " h_prime = self.i2h(x,h)\n", " assert isinstance(h_prime, Variable)\n", " output = self.h2o(x,h_prime)\n", " return output, h_prime\n", " \n", "class Model(nn.Module):\n", " def __init__(self, embedding, rnn_cell):\n", " super(Model, self).__init__()\n", " self.embedding = embedding\n", " self.rnn_cell = rnn_cell\n", " self.loss_function = nn.CrossEntropyLoss()\n", " \n", " def forward(self, word_ids, hidden=None):\n", " if hidden is None:\n", " hidden = Variable(torch.zeros(\n", " word_ids.data.shape[0],self.embedding.embedding_dim))\n", " assert isinstance(hidden, Variable)\n", " embeddings = self.embedding(word_ids)\n", " max_seq_length = word_ids.data.shape[-1]\n", " outputs, hidden_states = [], []\n", " for i in range(max_seq_length):\n", " x = embeddings[:, i, :]\n", " assert isinstance(x, Variable)\n", " #print(\"x={}\\nhidden={}\".format(x,hidden))\n", " output, hidden = self.rnn_cell(x, hidden)\n", " assert isinstance(output, Variable)\n", " assert isinstance(hidden, Variable)\n", " #print(\"output: {}, hidden: {}\".format(output.data.shape, hidden.data.shape))\n", " outputs.append(output.unsqueeze(1))\n", " hidden_states.append(hidden.unsqueeze(1))\n", " outputs = torch.cat(outputs, 1)\n", " hidden_states = torch.cat(hidden_states, 1)\n", " assert isinstance(outputs, Variable)\n", " assert isinstance(hidden_states, Variable)\n", " return outputs, hidden_states\n", " \n", " def loss(self, word_ids, target_ids, hidden=None):\n", " outputs, hidden_states = self.forward(word_ids, hidden=hidden)\n", " outputs = outputs.view(-1, outputs.data.shape[-1])\n", " 
    "        target_ids = target_ids.view(-1)\n",
    "        assert isinstance(outputs, Variable)\n",
    "        assert isinstance(target_ids, Variable)\n",
    "        #print(\"output={}\\ttargets={}\".format(outputs.data.shape, target_ids.data.shape))\n",
    "        loss = self.loss_function(outputs, target_ids)\n",
    "        return loss\n",
    "\n",
    "    def predict(self, word_ids, hidden=None):\n",
    "        outputs, hidden_states = self.forward(word_ids, hidden=hidden)\n",
    "        outputs = outputs.view(-1, outputs.data.shape[-1])\n",
    "        max_scores, predictions = outputs.max(1)\n",
    "        predictions = predictions.view(*word_ids.data.shape)\n",
    "        #print(word_ids.data.shape, predictions.data.shape)\n",
    "        assert word_ids.data.shape == predictions.data.shape, \"word_ids: {}, predictions: {}\".format(\n",
    "            word_ids.data.shape, predictions.data.shape\n",
    "        )\n",
    "        return predictions\n",
    "\n",
    "\n",
    "def tensors2variables(*args, requires_grad=False):\n",
    "    # wrap raw tensors in Variables so they can flow through the model\n",
    "    return tuple(map(lambda x: Variable(x, requires_grad=requires_grad), args))\n",
    "\n",
    "def get_batch(tensor_types, *args, requires_grad=False):\n",
    "    # cast each batch array to its tensor type, then wrap it in a Variable\n",
    "    return tuple(map(lambda t, arg: Variable(t(arg), requires_grad=requires_grad), tensor_types, args))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Learning to predict bit flip\n",
    "\n",
    "Let us take a simple example of using an RNN to predict the bit flip of an $N$-bit unsigned integer. In Python, for an integer `n` represented using $N$ bits, the unsigned bit flip can be written as `(~n) & ((1<