{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Deep Learning Models -- A collection of various deep learning architectures, models, and tips for TensorFlow and PyTorch in Jupyter Notebooks.\n", "- Author: Sebastian Raschka\n", "- GitHub Repository: https://github.com/rasbt/deeplearning-models" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "vY4SK0xKAJgm" }, "source": [ "# Model Zoo -- Simple RNN with Variable Length Inputs via Packing" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "sc6xejhY-NzZ" }, "source": [ "Demo of a simple RNN for sentiment classification (here: a binary classification problem with two labels, positive and negative). Note that a simple RNN usually doesn't work very well due to the vanishing and exploding gradient problems. Also, this implementation uses padding to deal with variable-length inputs. Hence, the shorter the sentence, the more `<pad>` placeholders are appended to match the length of the longest sentence in a batch. However, this example notebook uses `nn.utils.rnn.pack_padded_sequence` so that there won't be any actual computation carried out past the end of a sentence (i.e., padding tokens are ignored).\n", "\n", "Note that this RNN trains about 4 times faster than the equivalent implementation without packed sequences, [./rnn_simple_imdb.ipynb](./rnn_simple_imdb.ipynb)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "colab": {}, "colab_type": "code", "id": "moNmVfuvnImW" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sebastian Raschka \n", "\n", "CPython 3.7.1\n", "IPython 7.4.0\n", "\n", "torch 1.0.1.post2\n" ] } ], "source": [ "%load_ext watermark\n", "%watermark -a 'Sebastian Raschka' -v -p torch\n", "\n", "import torch\n", "import torch.nn.functional as F\n", "from torchtext import data\n", "from torchtext import datasets\n", "import time\n", "import random\n", "\n", "torch.backends.cudnn.deterministic = True" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "GSRL42Qgy8I8" }, "source": [ "## General Settings" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "colab": {}, "colab_type": "code", "id": "OvW1RgfepCBq" }, "outputs": [], "source": [ "RANDOM_SEED = 123\n", "torch.manual_seed(RANDOM_SEED)\n", "\n", "VOCABULARY_SIZE = 20000\n", "LEARNING_RATE = 1e-4\n", "BATCH_SIZE = 128\n", "NUM_EPOCHS = 15\n", "DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n", "\n", "EMBEDDING_DIM = 128\n", "HIDDEN_DIM = 256\n", "OUTPUT_DIM = 1" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "mQMmKUEisW4W" }, "source": [ "## Dataset" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "4GnH64XvsV8n" }, "source": [ "Load the IMDB Movie Review dataset:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 68 }, "colab_type": "code", "id": "WZ_4jiHVnMxN", "outputId": "8439fdc6-0073-4050-a3d2-45f6619604f1" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Num Train: 20000\n", "Num Valid: 5000\n", "Num Test: 25000\n" ] } ], "source": [ "TEXT = data.Field(tokenize='spacy',\n", "                  include_lengths=True) # necessary for packed_padded_sequence\n", "LABEL = data.LabelField(dtype=torch.float)\n", "train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)\n", "train_data, valid_data = train_data.split(random_state=random.seed(RANDOM_SEED),\n", "                                          split_ratio=0.8)\n", "\n",
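"# note: random.seed() seeds Python's global RNG and returns None; torchtext's\n", "# split() falls back to the (now seeded) global random state when its\n", "# random_state argument is None, so the 80/20 split above is reproducible\n", "\n",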
"print(f'Num Train: {len(train_data)}')\n", "print(f'Num Valid: {len(valid_data)}')\n", "print(f'Num Test: {len(test_data)}')" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "L-TBwKWPslPa" }, "source": [ "Build the vocabulary based on the top \"VOCABULARY_SIZE\" words:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 51 }, "colab_type": "code", "id": "e8uNrjdtn4A8", "outputId": "5613581e-8582-477c-e0fb-b54554b5ebd9" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Vocabulary size: 20002\n", "Number of classes: 2\n" ] } ], "source": [ "TEXT.build_vocab(train_data, max_size=VOCABULARY_SIZE)\n", "LABEL.build_vocab(train_data)\n", "\n", "print(f'Vocabulary size: {len(TEXT.vocab)}')\n", "print(f'Number of classes: {len(LABEL.vocab)}')" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "JpEMNInXtZsb" }, "source": [ "The TEXT.vocab dictionary will contain the word counts and indices. The reason why the number of words is VOCABULARY_SIZE + 2 is that it contains to special tokens for padding and unknown words: `` and ``." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "eIQ_zfKLwjKm" }, "source": [ "Make dataset iterators:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "i7JiHR1stHNF" }, "outputs": [], "source": [ "train_loader, valid_loader, test_loader = data.BucketIterator.splits(\n", " (train_data, valid_data, test_data), \n", " batch_size=BATCH_SIZE,\n", " sort_within_batch=True, # necessary for packed_padded_sequence\n", " device=DEVICE)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "R0pT_dMRvicQ" }, "source": [ "Testing the iterators (note that the number of rows depends on the longest document in the respective batch):" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "colab_type": "code", "id": "y8SP_FccutT0", "outputId": "1d3b02b3-1d7f-4e59-bd29-6f52f4074b3b" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train\n", "Text matrix size: torch.Size([132, 128])\n", "Target vector size: torch.Size([128])\n", "\n", "Valid:\n", "Text matrix size: torch.Size([61, 128])\n", "Target vector size: torch.Size([128])\n", "\n", "Test:\n", "Text matrix size: torch.Size([42, 128])\n", "Target vector size: torch.Size([128])\n" ] } ], "source": [ "print('Train')\n", "for batch in train_loader:\n", " print(f'Text matrix size: {batch.text[0].size()}')\n", " print(f'Target vector size: {batch.label.size()}')\n", " break\n", " \n", "print('\\nValid:')\n", "for batch in valid_loader:\n", " print(f'Text matrix size: {batch.text[0].size()}')\n", " print(f'Target vector size: {batch.label.size()}')\n", " break\n", " \n", "print('\\nTest:')\n", "for batch in test_loader:\n", " print(f'Text matrix size: {batch.text[0].size()}')\n", " print(f'Target vector size: {batch.label.size()}')\n", " break" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "G_grdW3pxCzz" }, "source": [ "## Model" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "nQIUm5EjxFNa" }, "outputs": [], "source": [ "import torch.nn as nn\n", "\n", "class RNN(nn.Module):\n", " def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):\n", " \n", " super().__init__()\n", " \n", " self.embedding = 
"        self.rnn = nn.RNN(embedding_dim, hidden_dim)\n", "        self.fc = nn.Linear(hidden_dim, output_dim)\n", "        \n", "    def forward(self, text, text_length):\n", "\n", "        # [sentence len, batch size] => [sentence len, batch size, embedding size]\n", "        embedded = self.embedding(text)\n", "        \n", "        packed = torch.nn.utils.rnn.pack_padded_sequence(embedded, text_length)\n", "        \n", "        # [sentence len, batch size, embedding size] =>\n", "        #   output: [sentence len, batch size, hidden size]\n", "        #   hidden: [1, batch size, hidden size]\n", "        output, hidden = self.rnn(packed)\n", "        \n", "        return self.fc(hidden.squeeze(0)).view(-1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "Ik3NF3faxFmZ" }, "outputs": [], "source": [ "INPUT_DIM = len(TEXT.vocab)\n", "\n", "torch.manual_seed(RANDOM_SEED)\n", "model = RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM)\n", "model = model.to(DEVICE)\n", "optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "Lv9Ny9di6VcI" }, "source": [ "## Training" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "T5t1Afn4xO11" }, "outputs": [], "source": [ "def compute_binary_accuracy(model, data_loader, device):\n", "    model.eval()\n", "    correct_pred, num_examples = 0, 0\n", "    with torch.no_grad():\n", "        for batch_idx, batch_data in enumerate(data_loader):\n", "            text, text_lengths = batch_data.text\n", "            logits = model(text, text_lengths)\n", "            predicted_labels = (torch.sigmoid(logits) > 0.5).long()\n", "            num_examples += batch_data.label.size(0)\n", "            correct_pred += (predicted_labels == batch_data.label.long()).sum()\n", "        return correct_pred.float()/num_examples * 100" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 1836 }, "colab_type": "code", "id": "EABZM8Vo0ilB", "outputId": "424c313e-0aa3-4b0c-9726-276c49ae7aab" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch: 001/015 | Batch 000/157 | Cost: 0.6821\n", "Epoch: 001/015 | Batch 050/157 | Cost: 0.6892\n", "Epoch: 001/015 | Batch 100/157 | Cost: 0.6327\n", "Epoch: 001/015 | Batch 150/157 | Cost: 0.6762\n", "training accuracy: 57.08%\n", "valid accuracy: 55.60%\n", "Time elapsed: 0.13 min\n", "Epoch: 002/015 | Batch 000/157 | Cost: 0.6738\n", "Epoch: 002/015 | Batch 050/157 | Cost: 0.6578\n", "Epoch: 002/015 | Batch 100/157 | Cost: 0.6949\n", "Epoch: 002/015 | Batch 150/157 | Cost: 0.6315\n", "training accuracy: 65.47%\n", "valid accuracy: 64.64%\n", "Time elapsed: 0.28 min\n", "Epoch: 003/015 | Batch 000/157 | Cost: 0.6614\n", "Epoch: 003/015 | Batch 050/157 | Cost: 0.5436\n", "Epoch: 003/015 | Batch 100/157 | Cost: 0.6362\n", "Epoch: 003/015 | Batch 150/157 | Cost: 0.5960\n", "training accuracy: 68.87%\n", "valid accuracy: 68.08%\n", "Time elapsed: 0.41 min\n", "Epoch: 004/015 | Batch 000/157 | Cost: 0.6148\n", "Epoch: 004/015 | Batch 050/157 | Cost: 0.5484\n", "Epoch: 004/015 | Batch 100/157 | Cost: 0.5179\n", "Epoch: 004/015 | Batch 150/157 | Cost: 0.6458\n", "training accuracy: 69.85%\n", "valid accuracy: 66.82%\n", "Time elapsed: 0.54 min\n", "Epoch: 005/015 | Batch 000/157 | Cost: 0.5394\n", "Epoch: 005/015 | Batch 050/157 | Cost: 0.6463\n", "Epoch: 005/015 | Batch 100/157 | Cost: 0.5456\n", "Epoch: 005/015 | Batch 150/157 | Cost: 0.5760\n", "training accuracy: 70.96%\n", "valid accuracy: 67.28%\n",
"Time elapsed: 0.67 min\n", "Epoch: 006/015 | Batch 000/157 | Cost: 0.5609\n", "Epoch: 006/015 | Batch 050/157 | Cost: 0.5449\n", "Epoch: 006/015 | Batch 100/157 | Cost: 0.5924\n", "Epoch: 006/015 | Batch 150/157 | Cost: 0.5842\n", "training accuracy: 73.84%\n", "valid accuracy: 70.90%\n", "Time elapsed: 0.81 min\n", "Epoch: 007/015 | Batch 000/157 | Cost: 0.5566\n", "Epoch: 007/015 | Batch 050/157 | Cost: 0.5019\n", "Epoch: 007/015 | Batch 100/157 | Cost: 0.4826\n", "Epoch: 007/015 | Batch 150/157 | Cost: 0.5885\n", "training accuracy: 68.89%\n", "valid accuracy: 64.76%\n", "Time elapsed: 0.94 min\n", "Epoch: 008/015 | Batch 000/157 | Cost: 0.5797\n", "Epoch: 008/015 | Batch 050/157 | Cost: 0.5433\n", "Epoch: 008/015 | Batch 100/157 | Cost: 0.4908\n", "Epoch: 008/015 | Batch 150/157 | Cost: 0.5703\n", "training accuracy: 75.42%\n", "valid accuracy: 71.44%\n", "Time elapsed: 1.07 min\n", "Epoch: 009/015 | Batch 000/157 | Cost: 0.5631\n", "Epoch: 009/015 | Batch 050/157 | Cost: 0.4570\n", "Epoch: 009/015 | Batch 100/157 | Cost: 0.6094\n", "Epoch: 009/015 | Batch 150/157 | Cost: 0.6365\n", "training accuracy: 72.83%\n", "valid accuracy: 68.32%\n", "Time elapsed: 1.20 min\n", "Epoch: 010/015 | Batch 000/157 | Cost: 0.5310\n", "Epoch: 010/015 | Batch 050/157 | Cost: 0.4470\n", "Epoch: 010/015 | Batch 100/157 | Cost: 0.5479\n", "Epoch: 010/015 | Batch 150/157 | Cost: 0.5513\n", "training accuracy: 75.52%\n", "valid accuracy: 70.84%\n", "Time elapsed: 1.33 min\n", "Epoch: 011/015 | Batch 000/157 | Cost: 0.4262\n", "Epoch: 011/015 | Batch 050/157 | Cost: 0.6005\n", "Epoch: 011/015 | Batch 100/157 | Cost: 0.5208\n", "Epoch: 011/015 | Batch 150/157 | Cost: 0.5247\n", "training accuracy: 75.98%\n", "valid accuracy: 70.90%\n", "Time elapsed: 1.46 min\n", "Epoch: 012/015 | Batch 000/157 | Cost: 0.5223\n", "Epoch: 012/015 | Batch 050/157 | Cost: 0.5503\n", "Epoch: 012/015 | Batch 100/157 | Cost: 0.5315\n", "Epoch: 012/015 | Batch 150/157 | Cost: 0.4270\n", "training accuracy: 77.91%\n", "valid accuracy: 72.88%\n", "Time elapsed: 1.61 min\n", "Epoch: 013/015 | Batch 000/157 | Cost: 0.5056\n", "Epoch: 013/015 | Batch 050/157 | Cost: 0.5154\n", "Epoch: 013/015 | Batch 100/157 | Cost: 0.4632\n", "Epoch: 013/015 | Batch 150/157 | Cost: 0.4700\n", "training accuracy: 78.33%\n", "valid accuracy: 73.00%\n", "Time elapsed: 1.74 min\n", "Epoch: 014/015 | Batch 000/157 | Cost: 0.4585\n", "Epoch: 014/015 | Batch 050/157 | Cost: 0.5244\n", "Epoch: 014/015 | Batch 100/157 | Cost: 0.4338\n", "Epoch: 014/015 | Batch 150/157 | Cost: 0.4698\n", "training accuracy: 77.28%\n", "valid accuracy: 72.38%\n", "Time elapsed: 1.88 min\n", "Epoch: 015/015 | Batch 000/157 | Cost: 0.5293\n", "Epoch: 015/015 | Batch 050/157 | Cost: 0.4619\n", "Epoch: 015/015 | Batch 100/157 | Cost: 0.4165\n", "Epoch: 015/015 | Batch 150/157 | Cost: 0.4715\n", "training accuracy: 79.31%\n", "valid accuracy: 73.72%\n", "Time elapsed: 2.01 min\n", "Total Training Time: 2.01 min\n", "Test accuracy: 73.94%\n" ] } ], "source": [ "start_time = time.time()\n", "\n", "for epoch in range(NUM_EPOCHS):\n", "    model.train()\n", "    for batch_idx, batch_data in enumerate(train_loader):\n", "        \n", "        text, text_lengths = batch_data.text\n", "        \n", "        ### FORWARD AND BACK PROP\n", "        logits = model(text, text_lengths)\n", "        cost = F.binary_cross_entropy_with_logits(logits, batch_data.label)\n", "        optimizer.zero_grad()\n", "        \n", "        cost.backward()\n", "        \n", "        ### UPDATE MODEL PARAMETERS\n", "        optimizer.step()\n", "        \n", "        ### LOGGING\n", "        if not batch_idx % 50:\n",
"            print(f'Epoch: {epoch+1:03d}/{NUM_EPOCHS:03d} | '\n", "                  f'Batch {batch_idx:03d}/{len(train_loader):03d} | '\n", "                  f'Cost: {cost:.4f}')\n", "\n", "    with torch.set_grad_enabled(False):\n", "        print(f'training accuracy: '\n", "              f'{compute_binary_accuracy(model, train_loader, DEVICE):.2f}%'\n", "              f'\\nvalid accuracy: '\n", "              f'{compute_binary_accuracy(model, valid_loader, DEVICE):.2f}%')\n", "        \n", "    print(f'Time elapsed: {(time.time() - start_time)/60:.2f} min')\n", "    \n", "print(f'Total Training Time: {(time.time() - start_time)/60:.2f} min')\n", "print(f'Test accuracy: {compute_binary_accuracy(model, test_loader, DEVICE):.2f}%')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "jt55pscgFdKZ" }, "outputs": [], "source": [ "import spacy\n", "nlp = spacy.load('en') # spaCy 2.x shortcut; use 'en_core_web_sm' in spaCy 3+\n", "\n", "def predict_sentiment(model, sentence):\n", "    # based on:\n", "    # https://github.com/bentrevett/pytorch-sentiment-analysis/blob/\n", "    # master/2%20-%20Upgraded%20Sentiment%20Analysis.ipynb\n", "    model.eval()\n", "    tokenized = [tok.text for tok in nlp.tokenizer(sentence)]\n", "    indexed = [TEXT.vocab.stoi[t] for t in tokenized]\n", "    length = [len(indexed)]\n", "    tensor = torch.LongTensor(indexed).to(DEVICE)\n", "    tensor = tensor.unsqueeze(1) # add a batch dimension of size 1\n", "    length_tensor = torch.LongTensor(length)\n", "    prediction = torch.sigmoid(model(tensor, length_tensor))\n", "    return prediction.item()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 51 }, "colab_type": "code", "id": "O4__q0coFJyw", "outputId": "359b120d-411b-46fe-888b-125619def8a4" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Probability positive:\n" ] }, { "data": { "text/plain": [ "0.7535440325737" ] }, "execution_count": 12, "metadata": { "tags": [] }, "output_type": "execute_result" } ], "source": [ "print('Probability positive:')\n", "predict_sentiment(model, \"I really love this movie. This movie is so great!\")" ] } ], "metadata": { "colab": { "collapsed_sections": [], "name": "rnn-simple-packed-imdb.ipynb", "provenance": [], "version": "0.3.2" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" } }, "nbformat": 4, "nbformat_minor": 2 }