{ "cells": [ { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "bOChJSNXtC9g" }, "source": [ "# Recurrent Neural Networks" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "OLIxEDq6VhvZ" }, "source": [ "\n", "\n", "When working with sequential data (time series, sentences, etc.), the order of the inputs is crucial for the task at hand. Recurrent neural networks (RNNs) process sequential data by accounting for the current input and also what has been learned from previous inputs. In this notebook, we'll learn how to create and train RNNs on sequential data.\n", "\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "VoMq0eFRvugb" }, "source": [ "# Overview" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "qWro5T5qTJJL" }, "source": [ "* **Objective:** Process sequential data by accounting for the current input and also what has been learned from previous inputs.\n", "* **Advantages:** \n", "  * Account for order and previous inputs in a meaningful way.\n", "  * Conditioned generation for generating sequences.\n", "* **Disadvantages:** \n", "  * Each time step's hidden state depends on the previous time step's hidden state, so it's difficult to parallelize RNN operations across time. \n", "  * Processing long sequences can yield memory and computation issues.\n", "  * Interpretability is difficult, but there are a few [techniques](https://arxiv.org/abs/1506.02078) that use the activations from RNNs to see what parts of the inputs are processed. \n", "* **Miscellaneous:** \n", "  * Architectural tweaks to make RNNs faster and more interpretable are an ongoing area of research." 
] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "rsHeBbehrKzl" }, "source": [ "\n", "\n", "RNN forward pass for a single time step $X_t$:\n", "\n", "$h_t = \\tanh(h_{t-1}W_{hh} + X_tW_{xh} + b_h)$\n", "\n", "$y_t = h_tW_{hy} + b_y$\n", "\n", "$P(y) = \\text{softmax}(y_t) = \\frac{e^{y_t}}{\\sum e^{y_t}}$\n", "\n", "*where*:\n", "* $X_t$ = input at time step $t$ | $\\in \\mathbb{R}^{N \\times E}$ ($N$ is the batch size, $E$ is the embedding dim)\n", "* $W_{hh}$ = hidden units weights | $\\in \\mathbb{R}^{H \\times H}$ ($H$ is the hidden dim)\n", "* $h_{t-1}$ = previous time step's hidden state | $\\in \\mathbb{R}^{N \\times H}$\n", "* $W_{xh}$ = input weights | $\\in \\mathbb{R}^{E \\times H}$\n", "* $b_h$ = hidden units bias | $\\in \\mathbb{R}^{1 \\times H}$ (broadcast across the batch)\n", "* $W_{hy}$ = output weights | $\\in \\mathbb{R}^{H \\times C}$ ($C$ is the number of classes)\n", "* $b_y$ = output bias | $\\in \\mathbb{R}^{1 \\times C}$ (broadcast across the batch)\n", "\n", "You repeat this for every time step's input ($X_{t+1}, X_{t+2}, ..., X_{T}$, where $T$ is the sequence length) to get the predicted outputs at each time step.\n", "\n", "**Note**: At the first time step, the previous hidden state $h_{t-1}$ can either be a zero vector (unconditioned) or initialized from a condition (conditioned). If we are conditioning the RNN, the first hidden state $h_0$ can belong to a specific condition, or we can concatenate the specific condition to the randomly initialized hidden vectors at each time step. More on this in the subsequent notebooks on RNNs." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "dIXlGMExJD6w" }, "source": [ "Let's see what the forward pass looks like with an RNN for a synthetic task such as processing reviews (a sequence of words) to predict the sentiment at the end of processing the review." 
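, "\n", "\n", "Before turning to the PyTorch layers, the forward pass equations can be traced by hand on a tiny example. This is a minimal sketch that uses plain Python lists instead of tensors so each term stays visible. The sizes and weight values are made up for illustration (they are not from this notebook), the batch size is 1, and the products are taken in row-vector order so they agree with the listed shapes (the input row times the E x H matrix).

```python
import math

def vec_mat(row, matrix):
    # Row vector (length K) times a K x M matrix gives a row vector (length M)
    return [sum(row[k] * matrix[k][m] for k in range(len(row)))
            for m in range(len(matrix[0]))]

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    # h_t = tanh(h_prev W_hh + x_t W_xh + b_h)
    from_hidden = vec_mat(h_prev, W_hh)
    from_input = vec_mat(x_t, W_xh)
    return [math.tanh(a + b + c) for a, b, c in zip(from_hidden, from_input, b_h)]

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical sizes: E = 2 (embedding dim), H = 2 (hidden dim), C = 2 (classes)
W_xh = [[0.5, -0.5], [1.0, 0.0]]   # E x H
W_hh = [[0.1, 0.0], [0.0, 0.1]]    # H x H
b_h = [0.0, 0.0]
W_hy = [[1.0, 0.0], [0.0, 1.0]]    # H x C
b_y = [0.0, 0.0]

inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three time steps
h_t = [0.0, 0.0]  # unconditioned start: zero hidden state
hiddens = []
for x_t in inputs:
    h_t = rnn_step(x_t, h_t, W_xh, W_hh, b_h)
    hiddens.append(h_t)

# Predict from the last hidden state only, as in the sentiment task
y_t = [a + b for a, b in zip(vec_mat(h_t, W_hy), b_y)]
probs = softmax(y_t)
print(hiddens[0])
print(probs)
```

Each hidden state is computed from the one before it, which is why the loop over time steps cannot be run in parallel."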
] }, { "cell_type": "code", "execution_count": 1, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "RcWE5cw0_cKA", "outputId": "a44156b9-b43f-409c-f0ce-4a4bd871d6a0" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: torch in /usr/local/lib/python3.6/dist-packages (1.0.0)\n" ] } ], "source": [ "# Load PyTorch library\n", "!pip3 install torch" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "o6eEK1wM_dXG" }, "outputs": [], "source": [ "import torch\n", "import torch.nn as nn\n", "import torch.nn.functional as F" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "Qi9hIEV6COLF" }, "outputs": [], "source": [ "batch_size = 5\n", "seq_size = 10 # max length per input (masking will be used for sequences that aren't this max length)\n", "x_lengths = [8, 5, 4, 10, 5] # lengths of each input sequence\n", "embedding_dim = 100\n", "rnn_hidden_dim = 256\n", "output_dim = 4" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "bLEzfxjhB94C", "outputId": "f2feefbf-8635-4b23-ef53-b5713cf2cdb2" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "torch.Size([5, 10, 100])\n" ] } ], "source": [ "# Initialize synthetic inputs\n", "x_in = torch.randn(batch_size, seq_size, embedding_dim)\n", "x_lengths = torch.tensor(x_lengths)\n", "print (x_in.size())" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "dr6oLqtXB98N", "outputId": "9817e88d-6e73-414a-dfa6-2386f40db0d9" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "torch.Size([5, 256])\n" ] } ], "source": [ "# Initialize hidden state\n", "hidden_t = 
torch.zeros((batch_size, rnn_hidden_dim))\n", "print (hidden_t.size())" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "ryZMOLLgB9-v", "outputId": "14ec0a2a-bf37-4e03-b69b-099180f8f149" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RNNCell(100, 256)\n" ] } ], "source": [ "# Initialize RNN cell\n", "rnn_cell = nn.RNNCell(embedding_dim, rnn_hidden_dim)\n", "print (rnn_cell)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "rlbZ7ujxExXb", "outputId": "6c83ba2b-94c5-4f76-c8fb-ef0c1ccdeb37" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "torch.Size([5, 10, 256])\n" ] } ], "source": [ "# Forward pass through RNN\n", "x_in = x_in.permute(1, 0, 2) # RNN needs batch_size to be at dim 1\n", "\n", "# Loop through the input's time steps\n", "hiddens = []\n", "for t in range(seq_size):\n", "    hidden_t = rnn_cell(x_in[t], hidden_t)\n", "    hiddens.append(hidden_t)\n", "hiddens = torch.stack(hiddens)\n", "hiddens = hiddens.permute(1, 0, 2) # bring batch_size back to dim 0\n", "print (hiddens.size())" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 51 }, "colab_type": "code", "id": "3TTL-jmg-MHa", "outputId": "3fae323f-c37d-4dac-c8a8-7fea7a45c95c" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "out: torch.Size([5, 10, 256])\n", "h_n: torch.Size([1, 5, 256])\n" ] } ], "source": [ "# We also could've used a more abstracted layer\n", "x_in = torch.randn(batch_size, seq_size, embedding_dim)\n", "rnn = nn.RNN(embedding_dim, rnn_hidden_dim, batch_first=True)\n", "out, h_n = rnn(x_in) # h_n is the last hidden state\n", "print (\"out: \", out.size())\n", "print (\"h_n: \", h_n.size())" ] }, { "cell_type": "code", 
"execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "iAsyRNnbHwcT" }, "outputs": [], "source": [ "def gather_last_relevant_hidden(hiddens, x_lengths):\n", "    # Index of the last non-padded time step for each sequence\n", "    x_lengths = x_lengths.long().detach().cpu().numpy() - 1\n", "    out = []\n", "    for batch_index, column_index in enumerate(x_lengths):\n", "        out.append(hiddens[batch_index, column_index])\n", "    return torch.stack(out)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "PVhp1KLqHqpA", "outputId": "d04be3ef-c2d6-48b9-f0f5-a93f619ec594" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "torch.Size([5, 256])\n" ] } ], "source": [ "# Gather the last relevant hidden state\n", "z = gather_last_relevant_hidden(hiddens, x_lengths)\n", "print (z.size())" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 119 }, "colab_type": "code", "id": "yGk_iZ5cITZl", "outputId": "84749ff2-1e45-4599-a38d-8c83cee116a9" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "torch.Size([5, 4])\n", "tensor([[0.3030, 0.2351, 0.2168, 0.2452],\n", "        [0.2614, 0.1912, 0.2617, 0.2858],\n", "        [0.2428, 0.2600, 0.2254, 0.2717],\n", "        [0.2379, 0.2226, 0.1901, 0.3494],\n", "        [0.2629, 0.2854, 0.2146, 0.2371]], grad_fn=<SoftmaxBackward>)\n" ] } ], "source": [ "# Forward pass through FC layer\n", "fc1 = nn.Linear(rnn_hidden_dim, output_dim)\n", "y_pred = fc1(z)\n", "y_pred = F.softmax(y_pred, dim=1)\n", "print (y_pred.size())\n", "print (y_pred)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "hPBQpki_n6yY" }, "source": [ "# Sequential data" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "kP1awuluoCSr" }, "source": [ "There are a variety of different sequential tasks that RNNs can help with.\n", "\n", "1. **One to one**: one input produces one output. 
\n", "    * Ex. Given a word, predict its class (verb, noun, etc.).\n", "2. **One to many**: one input generates many outputs.\n", "    * Ex. Given a sentiment (positive, negative, etc.), generate a review.\n", "3. **Many to one**: many inputs are sequentially processed to generate one output.\n", "    * Ex. Process the words in a review to predict the sentiment.\n", "4. **Many to many**: many inputs are sequentially processed to generate many outputs.\n", "    * Ex. Given a sentence in French, process the entire sentence and then generate the English translation.\n", "    * Ex. Given a sequence of time-series data, predict the probability of an event (risk of disease) at each time step.\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "tnxUIEMdukYY" }, "source": [ "# Issues with vanilla RNNs" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "uMx2s93VLUTt" }, "source": [ "There are several issues with the vanilla RNN that we've seen so far. \n", "\n", "1. When we have an input sequence that has many time steps, it becomes difficult for the model to retain information seen earlier as we process more and more of the downstream time steps. The goal of the model is to retain the useful components from the previously seen time steps, but this becomes cumbersome when we have so many time steps to process. \n", "\n", "2. During backpropagation, the gradient from the loss has to travel all the way back towards the first time step. If the factor applied to the gradient at each step is larger than 1 (${1.01}^{1000} \\approx 20959$) the gradient explodes, and if it is smaller than 1 (${0.99}^{1000} \\approx 4.3 \\times 10^{-5}$) it vanishes, so with lots of time steps this can quickly spiral out of control.\n", "\n", "To address both these issues, the concept of gating was introduced to RNNs. Gating allows RNNs to control the information flow between each time step to optimize on the task. Selectively allowing information to pass through allows the model to process inputs with many time steps. 
The most common gated RNN variants are the long short-term memory ([LSTM](https://pytorch.org/docs/stable/nn.html#torch.nn.LSTM)) units and gated recurrent units ([GRUs](https://pytorch.org/docs/stable/nn.html#torch.nn.GRU)). You can read more about how these units work [here](http://colah.github.io/posts/2015-08-Understanding-LSTMs/).\n", "\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "tirko0kwp-9J" }, "outputs": [], "source": [ "# GRU in PyTorch\n", "gru = nn.GRU(input_size=embedding_dim, hidden_size=rnn_hidden_dim, \n", "             batch_first=True)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "UZjUhh4VBWxM", "outputId": "9fe275fe-c8d9-42f0-e5d0-0295268ed83d" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "torch.Size([5, 10, 100])\n" ] } ], "source": [ "# Initialize synthetic input\n", "x_in = torch.randn(batch_size, seq_size, embedding_dim)\n", "print (x_in.size())" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 51 }, "colab_type": "code", "id": "xJ_SE7AvBfa4", "outputId": "b9411aaa-fab1-4104-aee7-8f9a423332ab" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "out: torch.Size([5, 10, 256])\n", "h_n: torch.Size([1, 5, 256])\n" ] } ], "source": [ "# Forward pass\n", "out, h_n = gru(x_in)\n", "print (\"out:\", out.size())\n", "print (\"h_n:\", h_n.size())" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "ij_GA2Rr9BbA" }, "source": [ "**Note**: Choosing whether to use GRU or LSTM really depends on the data and empirical performance. GRUs offer comparable performance with a reduced number of parameters, while LSTMs have more parameters and a separate cell state, which may make the difference in performance for your particular task." 
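, "\n", "\n", "One way to make the parameter difference concrete is to count the weights. The sketch below assumes the single-layer, unidirectional layout used by PyTorch's nn.RNN, nn.GRU and nn.LSTM, where each gate has an input weight matrix, a hidden weight matrix and two bias vectors. The helper name count_params is made up for illustration.

```python
def count_params(input_dim, hidden_dim, num_gates):
    # Per gate: input weights (H x E), hidden weights (H x H), two biases (H each)
    per_gate = hidden_dim * input_dim + hidden_dim * hidden_dim + 2 * hidden_dim
    return num_gates * per_gate

embedding_dim = 100
rnn_hidden_dim = 256

# Vanilla RNN has 1 transform, GRU has 3 gates, LSTM has 4
for name, num_gates in [('RNN', 1), ('GRU', 3), ('LSTM', 4)]:
    print(name, count_params(embedding_dim, rnn_hidden_dim, num_gates))
```

With these sizes a GRU has three quarters of the parameters of an LSTM with the same dimensions. The counts can be compared against sum(p.numel() for p in gru.parameters()) on the layer defined above."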
] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "9agJw4gwK1LC" }, "source": [ "# Bidirectional RNNs" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "Xck0n-KpmXkV" }, "source": [ "There have been many advancements with RNNs ([attention](https://www.oreilly.com/ideas/interpretability-via-attentional-and-memory-based-interfaces-using-tensorflow), Quasi RNNs, etc.) that we will cover in later lessons, but one of the most basic and widely used is the bidirectional RNN (Bi-RNN). The motivation behind bidirectional RNNs is to process an input sequence in both directions. Accounting for context from both sides can aid in performance when the entire input sequence is known at time of inference. A common application of Bi-RNNs is in translation, where it's advantageous to look at an entire sentence from both sides when translating to another language (e.g. Japanese → English).\n", "\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "gSk_5XrvApCd" }, "outputs": [], "source": [ "# BiGRU in PyTorch\n", "bi_gru = nn.GRU(input_size=embedding_dim, hidden_size=rnn_hidden_dim, \n", "                batch_first=True, bidirectional=True)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 51 }, "colab_type": "code", "id": "Fx7-GTptBCtZ", "outputId": "f0242cc5-534a-460b-ebe0-4e8c504fab22" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "out: torch.Size([5, 10, 512])\n", "h_n: torch.Size([2, 5, 256])\n" ] } ], "source": [ "# Forward pass\n", "out, h_n = bi_gru(x_in)\n", "print (\"out:\", out.size()) # collection of all hidden states from the RNN for each time step\n", "print (\"h_n:\", h_n.size()) # last hidden state from the RNN" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "k5lvJirLBjI6" }, "source": [ "Notice that the output for each sample at each time step has 
size 512 (double the hidden dim). This is because it concatenates the hidden states from both the forward and backward directions of the Bi-RNN. " ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "mJSknbofK2S9" }, "source": [ "# Document classification with RNNs" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "JgYdEZmHlmft" }, "source": [ "Let's apply RNNs to the document classification task from the [embeddings notebook](https://colab.research.google.com/drive/1yDa5ZTqKVoLl-qRgH-N9xs3pdrDJ0Fb4) where we want to predict an article's category given its title." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "eIvXqvPQEiDC" }, "source": [ "## Set up" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "muTcvMynlmAu" }, "outputs": [], "source": [ "import os\n", "from argparse import Namespace\n", "import collections\n", "import copy\n", "import json\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "import re\n", "import torch" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "00ESjecep-_y" }, "outputs": [], "source": [ "# Set Numpy and PyTorch seeds\n", "def set_seeds(seed, cuda):\n", "    np.random.seed(seed)\n", "    torch.manual_seed(seed)\n", "    if cuda:\n", "        torch.cuda.manual_seed_all(seed)\n", "\n", "# Creating directories\n", "def create_dirs(dirpath):\n", "    if not os.path.exists(dirpath):\n", "        os.makedirs(dirpath)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "m67THDvxEl1e", "outputId": "7118c77b-cbf9-4d7e-ff7a-b9dc1fb63cbb" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Using CUDA: True\n" ] } ], "source": [ "# Arguments\n", "args = Namespace(\n", "    seed=1234,\n", "    cuda=True,\n", "    shuffle=True,\n", "    
data_file=\"news.csv\",\n", "    split_data_file=\"split_news.csv\",\n", "    vectorizer_file=\"vectorizer.json\",\n", "    model_state_file=\"model.pth\",\n", "    save_dir=\"news\",\n", "    train_size=0.7,\n", "    val_size=0.15,\n", "    test_size=0.15,\n", "    pretrained_embeddings=None,\n", "    cutoff=25, # token must appear at least this many times to be in SequenceVocabulary\n", "    num_epochs=5,\n", "    early_stopping_criteria=5,\n", "    learning_rate=1e-3,\n", "    batch_size=64,\n", "    embedding_dim=100,\n", "    rnn_hidden_dim=128,\n", "    hidden_dim=100,\n", "    num_layers=1,\n", "    bidirectional=False,\n", "    dropout_p=0.1,\n", ")\n", "\n", "# Set seeds\n", "set_seeds(seed=args.seed, cuda=args.cuda)\n", "\n", "# Create save dir\n", "create_dirs(args.save_dir)\n", "\n", "# Expand filepaths\n", "args.vectorizer_file = os.path.join(args.save_dir, args.vectorizer_file)\n", "args.model_state_file = os.path.join(args.save_dir, args.model_state_file)\n", "\n", "# Check CUDA\n", "if not torch.cuda.is_available():\n", "    args.cuda = False\n", "args.device = torch.device(\"cuda\" if args.cuda else \"cpu\")\n", "print(\"Using CUDA: {}\".format(args.cuda))" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "s7T-_kGvExVW" }, "source": [ "## Data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "XVyK25xOEwjN" }, "outputs": [], "source": [ "import re\n", "import urllib.request" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "M_gclwECEwll" }, "outputs": [], "source": [ "# Download data from GitHub to the notebook's local drive\n", "url = \"https://raw.githubusercontent.com/LisonEvf/practicalAI-cn/master/data/news.csv\"\n", "response = urllib.request.urlopen(url)\n", "html = response.read()\n", "with open(args.data_file, 'wb') as fp:\n", "    fp.write(html)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, 
"colab_type": "code", "id": "V244zOIPEwoP", "outputId": "ab8b5cab-4e25-436e-9cb3-0db6f524eb9a" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
" ], "text/plain": [ " category title\n", "0 Business Wall St. Bears Claw Back Into the Black (Reuters)\n", "1 Business Carlyle Looks Toward Commercial Aerospace (Reu...\n", "2 Business Oil and Economy Cloud Stocks' Outlook (Reuters)\n", "3 Business Iraq Halts Oil Exports from Main Southern Pipe...\n", "4 Business Oil prices soar to all-time record, posing new..." ] }, "execution_count": 25, "metadata": { "tags": [] }, "output_type": "execute_result" } ], "source": [ "# Raw data\n", "df = pd.read_csv(args.data_file, header=0)\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 85 }, "colab_type": "code", "id": "ICl2MNK4EwrL", "outputId": "d2073597-71e5-40b1-a845-90bf4913ea7a" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Business: 30000\n", "Sci/Tech: 30000\n", "Sports: 30000\n", "World: 30000\n" ] } ], "source": [ "# Split by category\n", "by_category = collections.defaultdict(list)\n", "for _, row in df.iterrows():\n", "    by_category[row.category].append(row.to_dict())\n", "for category in by_category:\n", "    print (\"{0}: {1}\".format(category, len(by_category[category])))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "76PwKQHLEww5" }, "outputs": [], "source": [ "# Create split data\n", "final_list = []\n", "for _, item_list in sorted(by_category.items()):\n", "    if args.shuffle:\n", "        np.random.shuffle(item_list)\n", "    n = len(item_list)\n", "    n_train = int(args.train_size*n)\n", "    n_val = int(args.val_size*n)\n", "    n_test = int(args.test_size*n)\n", "\n", "    # Give data point a split attribute\n", "    for item in item_list[:n_train]:\n", "        item['split'] = 'train'\n", "    for item in item_list[n_train:n_train+n_val]:\n", "        item['split'] = 'val'\n", "    for item in item_list[n_train+n_val:]:\n", "        item['split'] = 'test'\n", "\n", "    # Add to final list\n", "    final_list.extend(item_list)" ] }, { 
"cell_type": "code", "execution_count": 28, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 85 }, "colab_type": "code", "id": "CQeS0KHOEwzm", "outputId": "93c9aadb-25c4-4029-f002-8a43f3956045" }, "outputs": [ { "data": { "text/plain": [ "train 84000\n", "val 18000\n", "test 18000\n", "Name: split, dtype: int64" ] }, "execution_count": 28, "metadata": { "tags": [] }, "output_type": "execute_result" } ], "source": [ "# df with split datasets\n", "split_df = pd.DataFrame(final_list)\n", "split_df[\"split\"].value_counts()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "pPJDyVusEw3-" }, "outputs": [], "source": [ "# Preprocessing\n", "def preprocess_text(text):\n", "    text = ' '.join(word.lower() for word in text.split(\" \"))\n", "    text = re.sub(r\"([.,!?])\", r\" \\1 \", text)\n", "    text = re.sub(r\"[^a-zA-Z.,!?]+\", r\" \", text)\n", "    text = text.strip()\n", "    return text\n", "\n", "split_df.title = split_df.title.apply(preprocess_text)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "colab_type": "code", "id": "IAetKendEw6b", "outputId": "d5946f7e-840e-4a0b-e492-d3da68cefd44" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
" ], "text/plain": [ " category split title\n", "0 Business train general electric posts higher rd quarter profit\n", "1 Business train lilly to eliminate up to us jobs\n", "2 Business train s amp p lowers america west outlook to negative\n", "3 Business train does rand walk the talk on labor policy ?\n", "4 Business train housekeeper advocates for changes" ] }, "execution_count": 30, "metadata": { "tags": [] }, "output_type": "execute_result" } ], "source": [ "# Save to CSV\n", "split_df.to_csv(args.split_data_file, index=False)\n", "split_df.head()" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "NHzGXAI3E7lF" }, "source": [ "## Vocabulary" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "ZIRUjX0MEw88" }, "outputs": [], "source": [ "class Vocabulary(object):\n", "    def __init__(self, token_to_idx=None):\n", "\n", "        # Token to index\n", "        if token_to_idx is None:\n", "            token_to_idx = {}\n", "        self.token_to_idx = token_to_idx\n", "\n", "        # Index to token\n", "        self.idx_to_token = {idx: token \\\n", "                             for token, idx in self.token_to_idx.items()}\n", "\n", "    def to_serializable(self):\n", "        return {'token_to_idx': self.token_to_idx}\n", "\n", "    @classmethod\n", "    def from_serializable(cls, contents):\n", "        return cls(**contents)\n", "\n", "    def add_token(self, token):\n", "        if token in self.token_to_idx:\n", "            index = self.token_to_idx[token]\n", "        else:\n", "            index = len(self.token_to_idx)\n", "            self.token_to_idx[token] = index\n", "            self.idx_to_token[index] = token\n", "        return index\n", "\n", "    def add_tokens(self, tokens):\n", "        return [self.add_token(token) for token in tokens]\n", "\n", "    def lookup_token(self, token):\n", "        return self.token_to_idx[token]\n", "\n", "    def lookup_index(self, index):\n", "        if index not in self.idx_to_token:\n", "            raise KeyError(\"the index (%d) is not in the Vocabulary\" % index)\n", "        return self.idx_to_token[index]\n", "\n", "    def 
__str__(self):\n", "        return \"<Vocabulary(size=%d)>\" % len(self)\n", "\n", "    def __len__(self):\n", "        return len(self.token_to_idx)" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 85 }, "colab_type": "code", "id": "1LtYf3lpExBb", "outputId": "617297a7-3fdb-4789-bbca-dea82d06c8ce" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<Vocabulary(size=4)>\n", "4\n", "0\n", "Business\n" ] } ], "source": [ "# Vocabulary instance\n", "category_vocab = Vocabulary()\n", "for index, row in df.iterrows():\n", "    category_vocab.add_token(row.category)\n", "print (category_vocab) # __str__\n", "print (len(category_vocab)) # __len__\n", "index = category_vocab.lookup_token(\"Business\")\n", "print (index)\n", "print (category_vocab.lookup_index(index))" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "Z0zkF6CsE_yH" }, "source": [ "## Sequence vocabulary" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "QtntaISyE_1c" }, "source": [ "Next, we're going to create a Vocabulary class for the article's title, which is a sequence of tokens." 
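, "\n", "\n", "As a rough sketch of what a sequence vocabulary adds on top of a plain token-to-index mapping: it reserves indices for padding, unknown words, and the sequence boundaries. The token names and the tiny vocabulary below are hypothetical and chosen only for illustration; the class in the next cell is the actual implementation.

```python
# Reserved tokens take the first four indices (names are illustrative)
token_to_idx = {'<MASK>': 0, '<UNK>': 1, '<BEGIN>': 2, '<END>': 3}
for word in ['federer', 'wins', 'the', 'tournament']:
    token_to_idx[word] = len(token_to_idx)

def encode(title):
    # Unknown words fall back to the <UNK> index instead of raising KeyError
    unk = token_to_idx['<UNK>']
    indices = [token_to_idx.get(token, unk) for token in title.split(' ')]
    # Wrap the sequence so the model can see where it starts and ends
    return [token_to_idx['<BEGIN>']] + indices + [token_to_idx['<END>']]

print(encode('roger federer wins the wimbledon tournament'))
```

Here the two out-of-vocabulary words both map to index 1, so the encoded title is [2, 1, 4, 5, 6, 1, 7, 3]."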
] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "ovI8QRefEw_p" }, "outputs": [], "source": [ "from collections import Counter\n", "import string" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "4W3ZouuTEw1_" }, "outputs": [], "source": [ "class SequenceVocabulary(Vocabulary):\n", "    def __init__(self, token_to_idx=None, unk_token=\"<UNK>\",\n", "                 mask_token=\"<MASK>\", begin_seq_token=\"<BEGIN>\",\n", "                 end_seq_token=\"<END>\"):\n", "\n", "        super(SequenceVocabulary, self).__init__(token_to_idx)\n", "\n", "        self.mask_token = mask_token\n", "        self.unk_token = unk_token\n", "        self.begin_seq_token = begin_seq_token\n", "        self.end_seq_token = end_seq_token\n", "\n", "        self.mask_index = self.add_token(self.mask_token)\n", "        self.unk_index = self.add_token(self.unk_token)\n", "        self.begin_seq_index = self.add_token(self.begin_seq_token)\n", "        self.end_seq_index = self.add_token(self.end_seq_token)\n", "\n", "        # Index to token\n", "        self.idx_to_token = {idx: token \\\n", "                             for token, idx in self.token_to_idx.items()}\n", "\n", "    def to_serializable(self):\n", "        contents = super(SequenceVocabulary, self).to_serializable()\n", "        contents.update({'unk_token': self.unk_token,\n", "                         'mask_token': self.mask_token,\n", "                         'begin_seq_token': self.begin_seq_token,\n", "                         'end_seq_token': self.end_seq_token})\n", "        return contents\n", "\n", "    def lookup_token(self, token):\n", "        # Unknown tokens map to the UNK index instead of raising KeyError\n", "        return self.token_to_idx.get(token, self.unk_index)\n", "\n", "    def lookup_index(self, index):\n", "        if index not in self.idx_to_token:\n", "            raise KeyError(\"the index (%d) is not in the SequenceVocabulary\" % index)\n", "        return self.idx_to_token[index]\n", "\n", "    def __str__(self):\n", "        return \"<SequenceVocabulary(size=%d)>\" % len(self.token_to_idx)\n", "\n", "    def __len__(self):\n", "        return len(self.token_to_idx)\n" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "colab": { "base_uri": "https://localhost:8080/", 
"height": 85 }, "colab_type": "code", "id": "g5UHjpi3El37", "outputId": "cb20aa34-2bd5-4178-b219-d845fdc4968e" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<SequenceVocabulary(size=4400)>\n", "4400\n", "4\n", "general\n" ] } ], "source": [ "# Get word counts\n", "word_counts = Counter()\n", "for title in split_df.title:\n", "    for token in title.split(\" \"):\n", "        if token not in string.punctuation:\n", "            word_counts[token] += 1\n", "\n", "# Create SequenceVocabulary instance\n", "title_vocab = SequenceVocabulary()\n", "for word, word_count in word_counts.items():\n", "    if word_count >= args.cutoff:\n", "        title_vocab.add_token(word)\n", "print (title_vocab) # __str__\n", "print (len(title_vocab)) # __len__\n", "index = title_vocab.lookup_token(\"general\")\n", "print (index)\n", "print (title_vocab.lookup_index(index))" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "4Dag6H0SFHAG" }, "source": [ "## Vectorizer" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "VQIfxcUuKwzz" }, "source": [ "Something new that we introduce in this Vectorizer is calculating the length of our input sequence. We will use this later on to extract the last relevant hidden state for each input sequence." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "tsNtEnhBEl6s" }, "outputs": [], "source": [ "class NewsVectorizer(object):\n", "    def __init__(self, title_vocab, category_vocab):\n", "        self.title_vocab = title_vocab\n", "        self.category_vocab = category_vocab\n", "\n", "    def vectorize(self, title):\n", "        indices = [self.title_vocab.lookup_token(token) for token in title.split(\" \")]\n", "        indices = [self.title_vocab.begin_seq_index] + indices + \\\n", "            [self.title_vocab.end_seq_index]\n", "\n", "        # Create vector\n", "        title_length = len(indices)\n", "        vector = np.zeros(title_length, dtype=np.int64)\n", "        vector[:len(indices)] = indices\n", "\n", "        return vector, title_length\n", "\n", "    def unvectorize(self, vector):\n", "        tokens = [self.title_vocab.lookup_index(index) for index in vector]\n", "        title = \" \".join(token for token in tokens)\n", "        return title\n", "\n", "    @classmethod\n", "    def from_dataframe(cls, df, cutoff):\n", "\n", "        # Create class vocab\n", "        category_vocab = Vocabulary()\n", "        for category in sorted(set(df.category)):\n", "            category_vocab.add_token(category)\n", "\n", "        # Get word counts\n", "        word_counts = Counter()\n", "        for title in df.title:\n", "            for token in title.split(\" \"):\n", "                word_counts[token] += 1\n", "\n", "        # Create title vocab\n", "        title_vocab = SequenceVocabulary()\n", "        for word, word_count in word_counts.items():\n", "            if word_count >= cutoff:\n", "                title_vocab.add_token(word)\n", "\n", "        return cls(title_vocab, category_vocab)\n", "\n", "    @classmethod\n", "    def from_serializable(cls, contents):\n", "        title_vocab = SequenceVocabulary.from_serializable(contents['title_vocab'])\n", "        category_vocab = Vocabulary.from_serializable(contents['category_vocab'])\n", "        return cls(title_vocab=title_vocab, category_vocab=category_vocab)\n", "\n", "    def to_serializable(self):\n", "        return {'title_vocab': self.title_vocab.to_serializable(),\n", "                'category_vocab': 
self.category_vocab.to_serializable()}" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 119 }, "colab_type": "code", "id": "JtRRXU53El9Y", "outputId": "ba63f1e4-d50e-458c-cb38-da4cc69e5dfa" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "<Vocabulary(size=4)>\n", "(10,)\n", "title_length: 10\n", "[   2    1 4151 1231   25    1 2392 4076   38    3]\n", "<BEGIN> <UNK> federer wins the <UNK> tennis tournament . <END>\n" ] } ], "source": [ "# Vectorizer instance\n", "vectorizer = NewsVectorizer.from_dataframe(split_df, cutoff=args.cutoff)\n", "print (vectorizer.title_vocab)\n", "print (vectorizer.category_vocab)\n", "vectorized_title, title_length = vectorizer.vectorize(preprocess_text(\n", "    \"Roger Federer wins the Wimbledon tennis tournament.\"))\n", "print (np.shape(vectorized_title))\n", "print (\"title_length:\", title_length)\n", "print (vectorized_title)\n", "print (vectorizer.unvectorize(vectorized_title))" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "uk_QvpVfFM0S" }, "source": [ "## Dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "oU7oDdelFMR9" }, "outputs": [], "source": [ "from torch.utils.data import Dataset, DataLoader" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "pB7FHmiSFMXA" }, "outputs": [], "source": [ "class NewsDataset(Dataset):\n", "    def __init__(self, df, vectorizer):\n", "        self.df = df\n", "        self.vectorizer = vectorizer\n", "\n", "        # Data splits\n", "        self.train_df = self.df[self.df.split=='train']\n", "        self.train_size = len(self.train_df)\n", "        self.val_df = self.df[self.df.split=='val']\n", "        self.val_size = len(self.val_df)\n", "        self.test_df = self.df[self.df.split=='test']\n", "        self.test_size = len(self.test_df)\n", "        self.lookup_dict = {'train': (self.train_df, self.train_size), \n", "                            'val': (self.val_df, self.val_size),\n", "                            
'test': (self.test_df, self.test_size)}\n", " self.set_split('train')\n", "\n", " # Class weights (for imbalances)\n", " class_counts = df.category.value_counts().to_dict()\n", " def sort_key(item):\n", " return self.vectorizer.category_vocab.lookup_token(item[0])\n", " sorted_counts = sorted(class_counts.items(), key=sort_key)\n", " frequencies = [count for _, count in sorted_counts]\n", " self.class_weights = 1.0 / torch.tensor(frequencies, dtype=torch.float32)\n", "\n", " @classmethod\n", " def load_dataset_and_make_vectorizer(cls, split_data_file, cutoff):\n", " df = pd.read_csv(split_data_file, header=0)\n", " train_df = df[df.split=='train']\n", " return cls(df, NewsVectorizer.from_dataframe(train_df, cutoff))\n", "\n", " @classmethod\n", " def load_dataset_and_load_vectorizer(cls, split_data_file, vectorizer_filepath):\n", " df = pd.read_csv(split_data_file, header=0)\n", " vectorizer = cls.load_vectorizer_only(vectorizer_filepath)\n", " return cls(df, vectorizer)\n", "\n", " def load_vectorizer_only(vectorizer_filepath):\n", " with open(vectorizer_filepath) as fp:\n", " return NewsVectorizer.from_serializable(json.load(fp))\n", "\n", " def save_vectorizer(self, vectorizer_filepath):\n", " with open(vectorizer_filepath, \"w\") as fp:\n", " json.dump(self.vectorizer.to_serializable(), fp)\n", "\n", " def set_split(self, split=\"train\"):\n", " self.target_split = split\n", " self.target_df, self.target_size = self.lookup_dict[split]\n", "\n", " def __str__(self):\n", " return \" software firm to cut jobs \n", "tensor([3.3333e-05, 3.3333e-05, 3.3333e-05, 3.3333e-05])\n" ] } ], "source": [ "# Dataset instance\n", "dataset = NewsDataset.load_dataset_and_make_vectorizer(args.split_data_file,\n", " args.cutoff)\n", "print (dataset) # __str__\n", "input_ = dataset[5] # __getitem__\n", "print (input_['title'], input_['title_length'], input_['category'])\n", "print (dataset.vectorizer.unvectorize(input_['title']))\n", "print (dataset.class_weights)" ] }, { 
"cell_type": "markdown", "metadata": { "colab_type": "text", "id": "_IUIqtbvFUAG" }, "source": [ "## Model" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "xJV5WlDiFVVz" }, "source": [ "input → embedding → RNN → FC " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "rZCzdZZ9FMhm" }, "outputs": [], "source": [ "import torch.nn as nn\n", "import torch.nn.functional as F" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "wbWO4lZcIdqZ" }, "outputs": [], "source": [ "def gather_last_relevant_hidden(hiddens, x_lengths):\n", " x_lengths = x_lengths.long().detach().cpu().numpy() - 1\n", " out = []\n", " for batch_index, column_index in enumerate(x_lengths):\n", " out.append(hiddens[batch_index, column_index])\n", " return torch.stack(out)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "9TT66Y-UFMcZ" }, "outputs": [], "source": [ "class NewsModel(nn.Module):\n", " def __init__(self, embedding_dim, num_embeddings, rnn_hidden_dim, \n", " hidden_dim, output_dim, num_layers, bidirectional, dropout_p, \n", " pretrained_embeddings=None, freeze_embeddings=False, \n", " padding_idx=0):\n", " super(NewsModel, self).__init__()\n", " \n", " if pretrained_embeddings is None:\n", " self.embeddings = nn.Embedding(embedding_dim=embedding_dim,\n", " num_embeddings=num_embeddings,\n", " padding_idx=padding_idx)\n", " else:\n", " pretrained_embeddings = torch.from_numpy(pretrained_embeddings).float()\n", " self.embeddings = nn.Embedding(embedding_dim=embedding_dim,\n", " num_embeddings=num_embeddings,\n", " padding_idx=padding_idx,\n", " _weight=pretrained_embeddings)\n", " \n", " # Conv weights\n", " self.gru = nn.GRU(input_size=embedding_dim, hidden_size=rnn_hidden_dim, \n", " num_layers=num_layers, batch_first=True, \n", " bidirectional=bidirectional)\n", " \n", " # FC weights\n", " 
self.dropout = nn.Dropout(dropout_p)\n", " # A bidirectional GRU concatenates both directions, doubling the output size\n", " rnn_output_dim = rnn_hidden_dim * (2 if bidirectional else 1)\n", " self.fc1 = nn.Linear(rnn_output_dim, hidden_dim)\n", " self.fc2 = nn.Linear(hidden_dim, output_dim)\n", " \n", " if freeze_embeddings:\n", " self.embeddings.weight.requires_grad = False\n", "\n", " def forward(self, x_in, x_lengths, apply_softmax=False):\n", " \n", " # Embed\n", " x_in = self.embeddings(x_in)\n", " \n", " # Feed into RNN\n", " out, h_n = self.gru(x_in)\n", " \n", " # Gather the last relevant hidden state\n", " out = gather_last_relevant_hidden(out, x_lengths)\n", "\n", " # FC layers\n", " z = self.dropout(out)\n", " z = self.fc1(z)\n", " z = self.dropout(z)\n", " y_pred = self.fc2(z)\n", "\n", " if apply_softmax:\n", " y_pred = F.softmax(y_pred, dim=1)\n", " return y_pred" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "jHPYCPd7Fl3M" }, "source": [ "## Training" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "D3seBMA7FlcC" }, "outputs": [], "source": [ "import torch.optim as optim" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "HnRKWLekFlnM" }, "outputs": [], "source": [ "class Trainer(object):\n", " def __init__(self, dataset, model, model_state_file, save_dir, device, shuffle, \n", " num_epochs, batch_size, learning_rate, early_stopping_criteria):\n", " self.dataset = dataset\n", " self.class_weights = dataset.class_weights.to(device)\n", " self.model = model.to(device)\n", " self.save_dir = save_dir\n", " self.device = device\n", " self.shuffle = shuffle\n", " self.num_epochs = num_epochs\n", " self.batch_size = batch_size\n", " self.loss_func = nn.CrossEntropyLoss(self.class_weights)\n", " self.optimizer = optim.Adam(self.model.parameters(), lr=learning_rate)\n", " self.scheduler = optim.lr_scheduler.ReduceLROnPlateau(\n", " optimizer=self.optimizer, mode='min', factor=0.5, patience=1)\n", " self.train_state = {\n", " 'stop_early': False, \n", "
'early_stopping_step': 0,\n", " 'early_stopping_best_val': 1e8,\n", " 'early_stopping_criteria': early_stopping_criteria,\n", " 'learning_rate': learning_rate,\n", " 'epoch_index': 0,\n", " 'train_loss': [],\n", " 'train_acc': [],\n", " 'val_loss': [],\n", " 'val_acc': [],\n", " 'test_loss': -1,\n", " 'test_acc': -1,\n", " 'model_filename': model_state_file}\n", " \n", " def update_train_state(self):\n", "\n", " # Verbose\n", " print (\"[EPOCH]: {0:02d} | [LR]: {1} | [TRAIN LOSS]: {2:.2f} | [TRAIN ACC]: {3:.1f}% | [VAL LOSS]: {4:.2f} | [VAL ACC]: {5:.1f}%\".format(\n", " self.train_state['epoch_index'], self.train_state['learning_rate'], \n", " self.train_state['train_loss'][-1], self.train_state['train_acc'][-1], \n", " self.train_state['val_loss'][-1], self.train_state['val_acc'][-1]))\n", "\n", " # Save one model at least\n", " if self.train_state['epoch_index'] == 0:\n", " torch.save(self.model.state_dict(), self.train_state['model_filename'])\n", " # Record the first validation loss as the best seen so far\n", " self.train_state['early_stopping_best_val'] = self.train_state['val_loss'][-1]\n", " self.train_state['stop_early'] = False\n", "\n", " # Save model if performance improved\n", " elif self.train_state['epoch_index'] >= 1:\n", " loss_tm1, loss_t = self.train_state['val_loss'][-2:]\n", "\n", " # If loss worsened\n", " if loss_t >= self.train_state['early_stopping_best_val']:\n", " # Update step\n", " self.train_state['early_stopping_step'] += 1\n", "\n", " # Loss decreased\n", " else:\n", " # Save the best model\n", " if loss_t < self.train_state['early_stopping_best_val']:\n", " torch.save(self.model.state_dict(), self.train_state['model_filename'])\n", " # Track the new best validation loss\n", " self.train_state['early_stopping_best_val'] = loss_t\n", "\n", " # Reset early stopping step\n", " self.train_state['early_stopping_step'] = 0\n", "\n", " # Stop early ?\n", " self.train_state['stop_early'] = self.train_state['early_stopping_step'] \\\n", " >= self.train_state['early_stopping_criteria']\n", " return self.train_state\n", " \n", " def compute_accuracy(self, y_pred, y_target):\n", " _, y_pred_indices = y_pred.max(dim=1)\n", " n_correct = torch.eq(y_pred_indices, y_target).sum().item()\n", "
return n_correct / len(y_pred_indices) * 100\n", " \n", " def pad_seq(self, seq, length):\n", " vector = np.zeros(length, dtype=np.int64)\n", " vector[:len(seq)] = seq\n", " vector[len(seq):] = self.dataset.vectorizer.title_vocab.mask_index\n", " return vector\n", " \n", " def collate_fn(self, batch):\n", " \n", " # Make a deep copy\n", " batch_copy = copy.deepcopy(batch)\n", " processed_batch = {\"title\": [], \"title_length\": [], \"category\": []}\n", " \n", " # Get max sequence length\n", " get_length = lambda sample: len(sample[\"title\"])\n", " max_seq_length = max(map(get_length, batch))\n", " \n", " # Pad\n", " for i, sample in enumerate(batch_copy):\n", " padded_seq = self.pad_seq(sample[\"title\"], max_seq_length)\n", " processed_batch[\"title\"].append(padded_seq)\n", " processed_batch[\"title_length\"].append(sample[\"title_length\"])\n", " processed_batch[\"category\"].append(sample[\"category\"])\n", " \n", " # Convert to appropriate tensor types\n", " processed_batch[\"title\"] = torch.LongTensor(\n", " processed_batch[\"title\"])\n", " processed_batch[\"title_length\"] = torch.LongTensor(\n", " processed_batch[\"title_length\"])\n", " processed_batch[\"category\"] = torch.LongTensor(\n", " processed_batch[\"category\"])\n", " \n", " return processed_batch \n", " \n", " def run_train_loop(self):\n", " for epoch_index in range(self.num_epochs):\n", " self.train_state['epoch_index'] = epoch_index\n", " \n", " # Iterate over train dataset\n", "\n", " # initialize batch generator, set loss and acc to 0, set train mode on\n", " self.dataset.set_split('train')\n", " batch_generator = self.dataset.generate_batches(\n", " batch_size=self.batch_size, collate_fn=self.collate_fn, \n", " shuffle=self.shuffle, device=self.device)\n", " running_loss = 0.0\n", " running_acc = 0.0\n", " self.model.train()\n", "\n", " for batch_index, batch_dict in enumerate(batch_generator):\n", " # zero the gradients\n", " self.optimizer.zero_grad()\n", "\n", " # compute the 
output\n", " y_pred = self.model(batch_dict['title'], batch_dict['title_length'])\n", "\n", " # compute the loss\n", " loss = self.loss_func(y_pred, batch_dict['category'])\n", " loss_t = loss.item()\n", " running_loss += (loss_t - running_loss) / (batch_index + 1)\n", "\n", " # compute gradients using loss\n", " loss.backward()\n", "\n", " # use optimizer to take a gradient step\n", " self.optimizer.step()\n", " \n", " # compute the accuracy\n", " acc_t = self.compute_accuracy(y_pred, batch_dict['category'])\n", " running_acc += (acc_t - running_acc) / (batch_index + 1)\n", "\n", " self.train_state['train_loss'].append(running_loss)\n", " self.train_state['train_acc'].append(running_acc)\n", "\n", " # Iterate over val dataset\n", "\n", " # # initialize batch generator, set loss and acc to 0; set eval mode on\n", " self.dataset.set_split('val')\n", " batch_generator = self.dataset.generate_batches(\n", " batch_size=self.batch_size, collate_fn=self.collate_fn, \n", " shuffle=self.shuffle, device=self.device)\n", " running_loss = 0.\n", " running_acc = 0.\n", " self.model.eval()\n", "\n", " for batch_index, batch_dict in enumerate(batch_generator):\n", "\n", " # compute the output\n", " y_pred = self.model(batch_dict['title'], batch_dict['title_length'])\n", "\n", " # compute the loss\n", " loss = self.loss_func(y_pred, batch_dict['category'])\n", " loss_t = loss.to(\"cpu\").item()\n", " running_loss += (loss_t - running_loss) / (batch_index + 1)\n", "\n", " # compute the accuracy\n", " acc_t = self.compute_accuracy(y_pred, batch_dict['category'])\n", " running_acc += (acc_t - running_acc) / (batch_index + 1)\n", "\n", " self.train_state['val_loss'].append(running_loss)\n", " self.train_state['val_acc'].append(running_acc)\n", "\n", " self.train_state = self.update_train_state()\n", " self.scheduler.step(self.train_state['val_loss'][-1])\n", " if self.train_state['stop_early']:\n", " break\n", " \n", " def run_test_loop(self):\n", " # initialize batch generator, set 
loss and acc to 0; set eval mode on\n", " self.dataset.set_split('test')\n", " batch_generator = self.dataset.generate_batches(\n", " batch_size=self.batch_size, collate_fn=self.collate_fn, \n", " shuffle=self.shuffle, device=self.device)\n", " running_loss = 0.0\n", " running_acc = 0.0\n", " self.model.eval()\n", "\n", " for batch_index, batch_dict in enumerate(batch_generator):\n", " # compute the output\n", " y_pred = self.model(batch_dict['title'], batch_dict['title_length'])\n", "\n", " # compute the loss\n", " loss = self.loss_func(y_pred, batch_dict['category'])\n", " loss_t = loss.item()\n", " running_loss += (loss_t - running_loss) / (batch_index + 1)\n", "\n", " # compute the accuracy\n", " acc_t = self.compute_accuracy(y_pred, batch_dict['category'])\n", " running_acc += (acc_t - running_acc) / (batch_index + 1)\n", "\n", " self.train_state['test_loss'] = running_loss\n", " self.train_state['test_acc'] = running_acc\n", " \n", " def plot_performance(self):\n", " # Figure size\n", " plt.figure(figsize=(15,5))\n", "\n", " # Plot Loss\n", " plt.subplot(1, 2, 1)\n", " plt.title(\"Loss\")\n", " plt.plot(self.train_state[\"train_loss\"], label=\"train\")\n", " plt.plot(self.train_state[\"val_loss\"], label=\"val\")\n", " plt.legend(loc='upper right')\n", "\n", " # Plot Accuracy\n", " plt.subplot(1, 2, 2)\n", " plt.title(\"Accuracy\")\n", " plt.plot(self.train_state[\"train_acc\"], label=\"train\")\n", " plt.plot(self.train_state[\"val_acc\"], label=\"val\")\n", " plt.legend(loc='lower right')\n", "\n", " # Save figure\n", " plt.savefig(os.path.join(self.save_dir, \"performance.png\"))\n", "\n", " # Show plots\n", " plt.show()\n", " \n", " def save_train_state(self):\n", " with open(os.path.join(self.save_dir, \"train_state.json\"), \"w\") as fp:\n", " json.dump(self.train_state, fp)" ] }, { "cell_type": "code", "execution_count": 88, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 136 }, "colab_type": "code", "id": 
"ICkiOaGtFlk-", "outputId": "57f7f143-7899-407a-acbd-17f767eb56c3" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "# Initialization\n", "dataset = NewsDataset.load_dataset_and_make_vectorizer(args.split_data_file,\n", " args.cutoff)\n", "dataset.save_vectorizer(args.vectorizer_file)\n", "vectorizer = dataset.vectorizer\n", "model = NewsModel(embedding_dim=args.embedding_dim, \n", " num_embeddings=len(vectorizer.title_vocab), \n", " rnn_hidden_dim=args.rnn_hidden_dim,\n", " hidden_dim=args.hidden_dim,\n", " output_dim=len(vectorizer.category_vocab),\n", " num_layers=args.num_layers,\n", " bidirectional=args.bidirectional,\n", " dropout_p=args.dropout_p, \n", " pretrained_embeddings=None, \n", " padding_idx=vectorizer.title_vocab.mask_index)\n", "print (model.named_modules)" ] }, { "cell_type": "code", "execution_count": 89, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 102 }, "colab_type": "code", "id": "tuaRZ4DiFlh1", "outputId": "fba7ac04-7e1d-4372-b358-7340a013960d" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[EPOCH]: 00 | [LR]: 0.001 | [TRAIN LOSS]: 0.75 | [TRAIN ACC]: 70.7% | [VAL LOSS]: 0.54 | [VAL ACC]: 80.5%\n", "[EPOCH]: 01 | [LR]: 0.001 | [TRAIN LOSS]: 0.48 | [TRAIN ACC]: 82.7% | [VAL LOSS]: 0.49 | [VAL ACC]: 82.3%\n", "[EPOCH]: 02 | [LR]: 0.001 | [TRAIN LOSS]: 0.41 | [TRAIN ACC]: 85.0% | [VAL LOSS]: 0.47 | [VAL ACC]: 83.1%\n", "[EPOCH]: 03 | [LR]: 0.001 | [TRAIN LOSS]: 0.37 | [TRAIN ACC]: 86.6% | [VAL LOSS]: 0.47 | [VAL ACC]: 83.3%\n", "[EPOCH]: 04 | [LR]: 0.001 | [TRAIN LOSS]: 0.33 | [TRAIN ACC]: 88.2% | [VAL LOSS]: 0.49 | [VAL ACC]: 83.0%\n" ] } ], "source": [ "# Train\n", "trainer = Trainer(dataset=dataset, model=model, \n", " model_state_file=args.model_state_file, \n", " save_dir=args.save_dir, device=args.device,\n", " shuffle=args.shuffle, num_epochs=args.num_epochs, \n", " batch_size=args.batch_size, learning_rate=args.learning_rate, \n", " 
early_stopping_criteria=args.early_stopping_criteria)\n", "trainer.run_train_loop()" ] }, { "cell_type": "code", "execution_count": 73, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 335 }, "colab_type": "code", "id": "mzRJIz88Flfe", "outputId": "a7ac8786-01ea-4421-e70c-d79c22c7ed4a" }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "" ] }, "metadata": { "tags": [] }, "output_type": "display_data" } ], "source": [ "# Plot performance\n", "trainer.plot_performance()" ] }, { "cell_type": "code", "execution_count": 74, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 51 }, "colab_type": "code", "id": "4EmFhiX-FMaV", "outputId": "c689c3e6-972b-4499-81b6-8812a25076d1" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test loss: 0.49\n", "Test Accuracy: 82.9%\n" ] } ], "source": [ "# Test performance\n", "trainer.run_test_loop()\n", "print(\"Test loss: {0:.2f}\".format(trainer.train_state['test_loss']))\n", "print(\"Test Accuracy: {0:.1f}%\".format(trainer.train_state['test_acc']))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "zVU1zakYFMVF" }, "outputs": [], "source": [ "# Save all results\n", "trainer.save_train_state()" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "qLoKfjSpFw7t" }, "source": [ "## Inference" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "ANrPcS7Hp_CP" }, "outputs": [], "source": [ "class Inference(object):\n", " def __init__(self, model, vectorizer):\n", " self.model = model\n", " self.vectorizer = vectorizer\n", " \n", " def predict_category(self, title):\n", " # Vectorize\n", " vectorized_title, title_length = self.vectorizer.vectorize(title)\n", " vectorized_title = torch.tensor(vectorized_title).unsqueeze(0)\n", " title_length = torch.tensor([title_length]).long()\n", " \n", " # Forward pass\n", " 
self.model.eval()\n", " y_pred = self.model(x_in=vectorized_title, x_lengths=title_length, \n", " apply_softmax=True)\n", "\n", " # Top category\n", " y_prob, indices = y_pred.max(dim=1)\n", " index = indices.item()\n", "\n", " # Predicted category\n", " category = self.vectorizer.category_vocab.lookup_index(index)\n", " probability = y_prob.item()\n", " return {'category': category, 'probability': probability}\n", " \n", " def predict_top_k(self, title, k):\n", " # Vectorize\n", " vectorized_title, title_length = self.vectorizer.vectorize(title)\n", " vectorized_title = torch.tensor(vectorized_title).unsqueeze(0)\n", " title_length = torch.tensor([title_length]).long()\n", " \n", " # Forward pass\n", " self.model.eval()\n", " y_pred = self.model(x_in=vectorized_title, x_lengths=title_length, \n", " apply_softmax=True)\n", " \n", " # Top k categories\n", " y_prob, indices = torch.topk(y_pred, k=k)\n", " probabilities = y_prob.detach().numpy()[0]\n", " indices = indices.detach().numpy()[0]\n", "\n", " # Results\n", " results = []\n", " for probability, index in zip(probabilities, indices):\n", " category = self.vectorizer.category_vocab.lookup_index(index)\n", " results.append({'category': category, 'probability': probability})\n", "\n", " return results" ] }, { "cell_type": "code", "execution_count": 77, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 136 }, "colab_type": "code", "id": "W6wr68o2p_Eh", "outputId": "3e94c736-3ad3-4c70-b24c-591edbe069ad" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "# Load the model\n", "dataset = NewsDataset.load_dataset_and_load_vectorizer(\n", " args.split_data_file, args.vectorizer_file)\n", "vectorizer = dataset.vectorizer\n", "model = NewsModel(embedding_dim=args.embedding_dim, \n", " num_embeddings=len(vectorizer.title_vocab), \n", " rnn_hidden_dim=args.rnn_hidden_dim,\n", " hidden_dim=args.hidden_dim,\n", " output_dim=len(vectorizer.category_vocab),\n", "
num_layers=args.num_layers,\n", " bidirectional=args.bidirectional,\n", " dropout_p=args.dropout_p, \n", " pretrained_embeddings=None, \n", " padding_idx=vectorizer.title_vocab.mask_index)\n", "model.load_state_dict(torch.load(args.model_state_file))\n", "model = model.to(\"cpu\")\n", "print (model.named_modules)" ] }, { "cell_type": "code", "execution_count": 80, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 51 }, "colab_type": "code", "id": "JPKgHxsfN954", "outputId": "c9f21a76-8307-4737-c785-01f1004891b6" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Enter a title to classify: President Obama signed the petition during the White House dinner.\n", "President Obama signed the petition during the White House dinner. → World (p=0.62)\n" ] } ], "source": [ "# Inference\n", "inference = Inference(model=model, vectorizer=vectorizer)\n", "title = input(\"Enter a title to classify: \")\n", "prediction = inference.predict_category(preprocess_text(title))\n", "print(\"{} → {} (p={:0.2f})\".format(title, prediction['category'], \n", " prediction['probability']))" ] }, { "cell_type": "code", "execution_count": 82, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 102 }, "colab_type": "code", "id": "JRdz4wzuQR4N", "outputId": "9a349bf0-16ba-402d-9133-11ce27e1ec59" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "President Obama signed the petition during the White House dinner.\n", "World (p=0.62)\n", "Sci/Tech (p=0.17)\n", "Business (p=0.15)\n", "Sports (p=0.06)\n" ] } ], "source": [ "# Top-k inference\n", "top_k = inference.predict_top_k(preprocess_text(title), k=len(vectorizer.category_vocab))\n", "print (\"{}\".format(title))\n", "for result in top_k:\n", " print (\"{} (p={:0.2f})\".format(result['category'], \n", " result['probability']))" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "noPtpaHZ6NAW" }, "source": [ "# Layer normalization" ] }, { 
"cell_type": "markdown", "metadata": { "colab_type": "text", "id": "t3bu5cEP6PSb" }, "source": [ "Recall from our [CNN notebook](https://colab.research.google.com/github/LisonEvf/practicalAI-cn/blob/master/notebooks/11_Convolutional_Neural_Networks.ipynb) that we used batch normalization to deal with internal covariant shift. Our activations will experience the same issues with RNNs but we will use a technique known as [layer normalization](https://arxiv.org/abs/1607.06450) (layernorm) to maintain zero mean unit variance on the activations. \n", "\n", "With layernorm it's a bit different from batchnorm. We compute the mean and var for every single sample (instead of each hidden dim) for each layer independently and then do the operations on the activations before they go through the nonlinearities. PyTorch's [LayerNorm](https://pytorch.org/docs/stable/nn.html#torch.nn.LayerNorm) class abstracts all of this for us when we feed in inputs to the layer." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "-G-mVmUL61Fk" }, "source": [ "$ LN = \\frac{a - \\mu_{L}}{\\sqrt{\\sigma^2_{L} + \\epsilon}} * \\gamma + \\beta $\n", "\n", "where:\n", "* $a$ = activation | $\\in \\mathbb{R}^{NXH}$ ($N$ is the number of samples, $H$ is the hidden dim)\n", "* $ \\mu_{L}$ = mean of input| $\\in \\mathbb{R}^{NX1}$\n", "* $\\sigma^2_{L}$ = variance of input | $\\in \\mathbb{R}^{NX1}$\n", "* $epsilon$ = noise\n", "* $\\gamma$ = scale parameter (learned parameter)\n", "* $\\beta$ = shift parameter (learned parameter)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "P0e9TnQ581-1" }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "oAhAHcgZBMFe" }, "source": [ "The most useful location to apply layernorm will be inside the RNN on the activations before the non-linearities. 
However, this is a bit involved: although PyTorch has a [LayerNorm](https://pytorch.org/docs/stable/nn.html#torch.nn.LayerNorm) class, its built-in RNN modules do not apply layernorm internally. You could implement the RNN yourself and manually add layernorm using a setup similar to the one below.\n", "\n", "```python\n", "# Layernorm (create the module once so gamma and beta are shared across time steps)\n", "layernorm = nn.LayerNorm(args.hidden_dim)\n", "for t in range(seq_size):\n", "    # Normalize the activations at this time step over the hidden dim\n", "    a = layernorm(x)\n", "```" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "1YHneO3SStOp" }, "source": [ "# TODO" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "gGHaKTe1SuEk" }, "source": [ "- interpretability with task to see which words were most influential" ] } ], "metadata": { "accelerator": "GPU", "colab": { "collapsed_sections": [], "name": "13_Recurrent_Neural_Networks", "provenance": [], "toc_visible": true, "version": "0.3.2" }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 0 }