{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Supervised sentiment: dense feature representations and neural networks" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "__author__ = \"Christopher Potts\"\n", "__version__ = \"CS224u, Stanford, Spring 2022\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Contents\n", "\n", "1. [Overview](#Overview)\n", "1. [Set-up](#Set-up)\n", "1. [Distributed representations as features](#Distributed-representations-as-features)\n", " 1. [GloVe inputs](#GloVe-inputs)\n", " 1. [Yelp representations](#Yelp-representations)\n", " 1. [Remarks on this approach](#Remarks-on-this-approach)\n", "1. [RNN classifiers](#RNN-classifiers)\n", " 1. [RNN dataset preparation](#RNN-dataset-preparation)\n", " 1. [Vocabulary for the embedding](#Vocabulary-for-the-embedding)\n", " 1. [PyTorch RNN classifier](#PyTorch-RNN-classifier)\n", " 1. [Pretrained embeddings](#Pretrained-embeddings)\n", " 1. [RNN hyperparameter tuning experiment](#RNN-hyperparameter-tuning-experiment)\n", "1. [The VecAvg baseline from Socher et al. 2013](#The-VecAvg-baseline-from-Socher-et-al.-2013)\n", " 1. [Defining the model](#Defining-the-model)\n", " 1. [VecAvg hyperparameter tuning experiment](#VecAvg-hyperparameter-tuning-experiment)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview\n", "\n", "This notebook defines and explores __vector averaging__ and __recurrent neural network (RNN) classifiers__ for the Stanford Sentiment Treebank. \n", "\n", "These approaches make their predictions based on comprehensive representations of the examples: \n", "\n", "* For the vector averaging models, each word is modeled, but we assume that words combine via a simple function that is insensitive to their order or constituent structure.\n", "* For the RNN, each word is again modeled, and we also model the sequential relationships between words.\n", "\n", "These models contrast with the ones explored in [the previous notebook](sst_02_hand_built_features.ipynb), which make predictions based on more partial, potentially idiosyncratic information extracted from the examples." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Set-up\n", "\n", "See [the first notebook in this unit](sst_01_overview.ipynb#Set-up) for set-up instructions." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from collections import Counter\n", "import numpy as np\n", "import os\n", "import pandas as pd\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.metrics import classification_report\n", "import torch\n", "import torch.nn as nn\n", "\n", "from torch_rnn_classifier import TorchRNNClassifier\n", "import sst\n", "import vsm\n", "import utils" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "utils.fix_random_seeds()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "DATA_HOME = 'data'\n", "\n", "GLOVE_HOME = os.path.join(DATA_HOME, 'glove.6B')\n", "\n", "VSMDATA_HOME = os.path.join(DATA_HOME, 'vsmdata')\n", "\n", "SST_HOME = os.path.join(DATA_HOME, 'sentiment')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Distributed representations as features\n", "\n", "As a first step in the direction of neural networks for sentiment, we can connect with our previous unit on distributed representations. 
Arguably, more than any specific model architecture, this is the major innovation of deep learning: __rather than designing feature functions by hand, we use dense, distributed representations, often derived from unsupervised models__.\n", "\n", "*(figure: distreps-as-features.png)*\n", "\n", "Our model will just be `LogisticRegression`, and we'll continue with the experiment framework from the previous notebook. Here is `fit_softmax_classifier` again:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "def fit_softmax_classifier(X, y):\n", "    mod = LogisticRegression(\n", "        fit_intercept=True,\n", "        solver='liblinear',\n", "        multi_class='auto')\n", "    mod.fit(X, y)\n", "    return mod" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### GloVe inputs\n", "\n", "To illustrate this process, we'll use the general-purpose GloVe representations released by the GloVe team, at 300d:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "glove_lookup = utils.glove2dict(\n", "    os.path.join(GLOVE_HOME, 'glove.6B.300d.txt'))" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "def vsm_phi(text, lookup, np_func=np.mean):\n", "    \"\"\"Represent `text` as a combination of the vectors of its words.\n", "\n", "    Parameters\n", "    ----------\n", "    text : str\n", "\n", "    lookup : dict\n", "        From words to vectors.\n", "\n", "    np_func : function (default: np.mean)\n", "        A numpy matrix operation that can be applied columnwise,\n", "        like `np.mean`, `np.sum`, or `np.prod`. The requirement is that\n", "        the function take `axis=0` as one of its arguments (to ensure\n", "        columnwise combination) and that it return a vector of a\n", "        fixed length, no matter what the size of the text is.\n", "\n", "    Returns\n", "    -------\n", "    np.array with the same dimensionality as the vectors in `lookup`\n", "\n", "    \"\"\"\n", "    allvecs = np.array([lookup[w] for w in text.split() if w in lookup])\n", "    if len(allvecs) == 0:\n", "        dim = len(next(iter(lookup.values())))\n", "        feats = np.zeros(dim)\n", "    else:\n", "        feats = np_func(allvecs, axis=0)\n", "    return feats" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "def glove_phi(text, np_func=np.mean):\n", "    return vsm_phi(text, glove_lookup, np_func=np_func)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "              precision    recall  f1-score   support\n", "\n", "    negative      0.613     0.724     0.664       428\n", "     neutral      0.400     0.044     0.079       229\n", "    positive      0.619     0.795     0.696       444\n", "\n", "    accuracy                          0.611      1101\n", "   macro avg      0.544     0.521     0.480      1101\n", "weighted avg      0.571     0.611     0.555      1101\n", "\n", "CPU times: user 2.48 s, sys: 75.5 ms, total: 2.56 s\n", "Wall time: 2.5 s\n" ] } ], "source": [ "%%time\n", "_ = sst.experiment(\n", "    sst.train_reader(SST_HOME),\n", "    glove_phi,\n", "    fit_softmax_classifier,\n", "    assess_dataframes=sst.dev_reader(SST_HOME),\n", "    vectorize=False)  # Tell `experiment` that we already have our feature vectors." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Yelp representations\n", "\n", "Our Yelp VSMs seem pretty well-attuned to the SST, so we might think that they can do even better than the general-purpose GloVe inputs. Here is a quick assessment of that idea, building on techniques we developed in the unit on VSMs."
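, "\n", "\n", "Before building the Yelp lookup, here is a toy illustration of how `vsm_phi` combines word vectors. (This is just a sketch: `toy_lookup` and its two-dimensional vectors are made up for illustration, and words missing from the lookup are simply skipped.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Toy lookup with made-up 2d vectors; the real lookups map words to 300d vectors:\n", "toy_lookup = {\n", "    'good': np.array([1.0, 3.0]),\n", "    'movie': np.array([3.0, 1.0])}\n", "\n", "# 'unseen' is not in the lookup, so it is skipped; the result is the\n", "# columnwise mean of the 'good' and 'movie' vectors: array([2., 2.])\n", "vsm_phi('good unseen movie', toy_lookup, np_func=np.mean)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now the Yelp pipeline itself: read in the count matrix, reweight it with PMI, reduce it to 300 dimensions with LSA, and build the lookup:"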
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "yelp20 = pd.read_csv(\n", " os.path.join(VSMDATA_HOME, 'yelp_window20-flat.csv.gz'), index_col=0)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "yelp20_ppmi = vsm.pmi(yelp20, positive=False)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "yelp20_ppmi_svd = vsm.lsa(yelp20_ppmi, k=300)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "yelp_lookup = dict(zip(yelp20_ppmi_svd.index, yelp20_ppmi_svd.values))" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "def yelp_phi(text, np_func=np.mean):\n", " return vsm_phi(text, yelp_lookup, np_func=np_func)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " negative 0.593 0.673 0.630 428\n", " neutral 0.423 0.048 0.086 229\n", " positive 0.560 0.743 0.639 444\n", "\n", " accuracy 0.571 1101\n", " macro avg 0.525 0.488 0.452 1101\n", "weighted avg 0.544 0.571 0.521 1101\n", "\n", "CPU times: user 4.62 s, sys: 340 ms, total: 4.96 s\n", "Wall time: 4.4 s\n" ] } ], "source": [ "%%time\n", "_ = sst.experiment(\n", " sst.train_reader(SST_HOME),\n", " yelp_phi,\n", " fit_softmax_classifier,\n", " assess_dataframes=sst.dev_reader(SST_HOME),\n", " vectorize=False) # Tell `experiment` that we already have our feature vectors." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Remarks on this approach\n", "\n", "* Recall that our `unigrams_phi` created feature representations with over 16K dimensions and got about 0.52 with no hyperparameter tuning.\n", "\n", "* The above models' feature representations have only 300 dimensions. While they are struggling with the neutral category, we can probably overcome this with some additional attention to the representations and to our strategies for optimization.\n", "\n", "* The promise of the Mittens model of [Dingwall and Potts 2018](https://arxiv.org/abs/1803.09901) is that we can use GloVe itself to update the general purpose information in the 'glove.6B' vectors with specialized information from one of these IMDB count matrices. That might be worth trying; the `mittens` package (`pip install mittens`) already implements this!\n", "\n", "* That said, just averaging all the word representations is pretty unappealing linguistically. There's no doubt that we're losing a lot of valuable information in doing this. The models we turn to now can be seen as addressing this shortcoming while retaining the insight that our distributed representations are valuable for this task.\n", "\n", "* We'll return to these ideas below, when we consider [the VecAvg baseline from Socher et al. 2013](#The-VecAvg-baseline-from-Socher-et-al.-2013). That model also posits a simple, fixed combination function (averaging). However, it begins with randomly initialized representations and updates them as part of training." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## RNN classifiers\n", "\n", "A recurrent neural network (RNN) is any deep learning model that process its inputs sequentially. There are many variations on this theme. 
The one that we use here is an __RNN classifier__.\n", "\n", "*(figure: RNN classifier diagram)*\n", "\n", "The version of the model that is implemented in `np_rnn_classifier.py` corresponds exactly to the above diagram. We can express it mathematically as follows:\n", "\n", "$$\\begin{align*}\n", "h_{t} &= \\tanh(x_{t}W_{xh} + h_{t-1}W_{hh}) \\\\\n", "y &= \\textbf{softmax}(h_{n}W_{hy} + b_y)\n", "\\end{align*}$$\n", "\n", "where $1 \\leqslant t \\leqslant n$. The first line defines the recurrence: each hidden state $h_{t}$ is defined by the input $x_{t}$ and the previous hidden state $h_{t-1}$, together with weight matrices $W_{xh}$ and $W_{hh}$, which are used at all timesteps. As indicated in the above diagram, the sequence of hidden states is padded with an initial state $h_{0}$. In our implementations, this is always an all $0$ vector, but it can be initialized in more sophisticated ways (some of which we will explore in our units on natural language inference and grounded natural language generation).\n", "\n", "The model in `torch_rnn_classifier.py` expands on the above and allows for more flexibility:\n", "\n", "$$\\begin{align*}\n", "h_{t} &= \\text{RNN}(x_{t}, h_{t-1}) \\\\\n", "h &= f(h_{n}W_{hh} + b_{h}) \\\\\n", "y &= \\textbf{softmax}(hW_{hy} + b_y)\n", "\\end{align*}$$\n", "\n", "Here, $\\text{RNN}$ stands for all the parameters of the recurrent part of the model. This will depend on the choice one makes for `rnn_cell_class`; options include `nn.RNN`, `nn.LSTM`, and `nn.GRU`. In addition, the classifier part includes a hidden layer (middle row), and the user can decide on the activation function $f$ to use there (parameter: `classifier_activation`).\n", "\n", "This is a potential gain over our average-vectors baseline, in that it processes each word individually, and in the context of those that came before it. Thus, not only is this model sensitive to word order, but the hidden representations create the potential to encode how the preceding context for a word affects its interpretation.\n", "\n", "The downside of this, of course, is that this model is much more difficult to set up and optimize. Let's dive into those details." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### RNN dataset preparation\n", "\n", "SST contains trees, but the RNN processes just the sequence of leaf nodes. The function `sst.build_rnn_dataset` creates datasets in this format:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "X_rnn_train, y_rnn_train = sst.build_rnn_dataset(sst.train_reader(SST_HOME))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each member of `X_rnn_train` is a list of words. 
Here's a look at the start of the first:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['The', 'Rock', 'is', 'destined', 'to', 'be']" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_rnn_train[0][: 6]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because this is a classifier, `y_rnn_train` is just a list of labels, one per example:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'positive'" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_rnn_train[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For experiments, let's build a `dev` dataset as well:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "X_rnn_dev, y_rnn_dev = sst.build_rnn_dataset(sst.dev_reader(SST_HOME))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Vocabulary for the embedding\n", "\n", "The first delicate issue we need to address is the vocabulary for our model:\n", "\n", "* As indicated in the figure above, the first thing we do when processing an example is look up the words in an embedding (a VSM), which has to have a fixed dimensionality. \n", "\n", "* We can use our training data to specify the vocabulary for this embedding; at prediction time, though, we will inevitably encounter words we haven't seen before. \n", "\n", "* The convention we adopt here is to map them to an `$UNK` token that is in our pre-specified vocabulary.\n", "\n", "* At the same time, we might want to collapse infrequent tokens into `$UNK` to make optimization easier and to try to create reasonable representations for words that we have to map to `$UNK` at test time.\n", "\n", "In `utils`, the function `get_vocab` will help you specify a vocabulary. It will let you choose a vocabulary by optionally specifying `mincount` or `n_words`, and it will ensure that `$UNK` is included." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "sst_full_train_vocab = utils.get_vocab(X_rnn_train)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "sst_full_train_vocab has 18,279 items\n" ] } ], "source": [ "print(\"sst_full_train_vocab has {:,} items\".format(len(sst_full_train_vocab)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This frankly seems too big relative to our dataset size. Let's restrict to just words that occur at least twice:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "sst_train_vocab = utils.get_vocab(X_rnn_train, mincount=2)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "sst_train_vocab has 8,736 items\n" ] } ], "source": [ "print(\"sst_train_vocab has {:,} items\".format(len(sst_train_vocab)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### PyTorch RNN classifier\n", "\n", "Here and throughout, we'll rely on `early_stopping=True` to try to find the optimal time to stop optimization. This behavior can be further refined by setting different values of `validation_fraction`, `n_iter_no_change`, and `tol`. 
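" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For instance, here is a sketch of a more explicit configuration. (The `n_iter_no_change` and `tol` values match the ones reported in the progress messages in this notebook; the `rnn_explicit` name and the `validation_fraction` value are illustrative assumptions, not recommendations.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A sketch of an explicit early-stopping configuration; the\n", "# `validation_fraction` value here is an illustrative assumption:\n", "rnn_explicit = TorchRNNClassifier(\n", "    sst_train_vocab,\n", "    early_stopping=True,\n", "    validation_fraction=0.1,  # portion of the data held out for validation\n", "    n_iter_no_change=10,      # patience, in epochs\n", "    tol=1e-5)                 # minimum improvement that counts as progress" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "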
For additional discussion, see [the section on model convergence in the evaluation methods notebook](evaluation_methods.ipynb#Assessing-models-without-convergence)." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "rnn = TorchRNNClassifier(\n", "    sst_train_vocab,\n", "    early_stopping=True)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Stopping after epoch 58. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 0.2886183727532625" ] }, { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 38.9 s, sys: 24.6 s, total: 1min 3s\n", "Wall time: 19.2 s\n" ] } ], "source": [ "%time _ = rnn.fit(X_rnn_train, y_rnn_train)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "rnn_dev_preds = rnn.predict(X_rnn_dev)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "              precision    recall  f1-score   support\n", "\n", "    negative      0.575     0.614     0.594       428\n", "     neutral      0.230     0.223     0.226       229\n", "    positive      0.637     0.606     0.621       444\n", "\n", "    accuracy                          0.530      1101\n", "   macro avg      0.481     0.481     0.481      1101\n", "weighted avg      0.529     0.530     0.529      1101\n", "\n" ] } ], "source": [ "print(classification_report(y_rnn_dev, rnn_dev_preds, digits=3))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above numbers are just a starting point. Let's try to improve on them by using pretrained embeddings and then by exploring a range of hyperparameter options." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Pretrained embeddings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With `embedding=None`, `TorchRNNClassifier` (like its counterpart in `np_rnn_classifier.py`) creates random embeddings. You can also pass in an embedding, as long as you make sure it has the right vocabulary. The utility `utils.create_pretrained_embedding` will help with that:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "glove_embedding, sst_glove_vocab = utils.create_pretrained_embedding(\n", "    glove_lookup, sst_train_vocab)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's an illustration using `TorchRNNClassifier`:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "rnn_glove = TorchRNNClassifier(\n", "    sst_glove_vocab,\n", "    embedding=glove_embedding,\n", "    early_stopping=True)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Stopping after epoch 27. Validation score did not improve by tol=1e-05 for more than 10 epochs. 
Final error is 0.3226494677364826" ] }, { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 13.1 s, sys: 9.15 s, total: 22.2 s\n", "Wall time: 5.63 s\n" ] } ], "source": [ "%time _ = rnn_glove.fit(X_rnn_train, y_rnn_train)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "rnn_glove_dev_preds = rnn_glove.predict(X_rnn_dev)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "              precision    recall  f1-score   support\n", "\n", "    negative      0.676     0.664     0.670       428\n", "     neutral      0.307     0.323     0.315       229\n", "    positive      0.700     0.694     0.697       444\n", "\n", "    accuracy                          0.605      1101\n", "   macro avg      0.561     0.560     0.561      1101\n", "weighted avg      0.609     0.605     0.607      1101\n", "\n" ] } ], "source": [ "print(classification_report(y_rnn_dev, rnn_glove_dev_preds, digits=3))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It looks like pretrained representations give us a notable boost, but we're still below most of the simpler models explored in [the previous notebook](sst_02_hand_built_features.ipynb)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### RNN hyperparameter tuning experiment\n", "\n", "As we saw in [the previous notebook](sst_02_hand_built_features.ipynb), we're not really done until we've done some hyperparameter search. So let's round out this section by cross-validating the RNN that uses GloVe embeddings, to see if we can improve on the default-parameters model we evaluated just above. For this, we'll use `sst.experiment`:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "def simple_leaves_phi(text):\n", "    return text.split()" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "def fit_rnn_with_hyperparameter_search(X, y):\n", "    basemod = TorchRNNClassifier(\n", "        sst_train_vocab,\n", "        embedding=glove_embedding,\n", "        batch_size=25,  # Inspired by comments in the paper.\n", "        bidirectional=True,\n", "        early_stopping=True)\n", "\n", "    # There are lots of other parameters and values we could\n", "    # explore, but this is at least a solid start:\n", "    param_grid = {\n", "        'embed_dim': [50, 75, 100],\n", "        'hidden_dim': [50, 75, 100],\n", "        'eta': [0.001, 0.01]}\n", "\n", "    bestmod = utils.fit_classifier_with_hyperparameter_search(\n", "        X, y, basemod, cv=3, param_grid=param_grid)\n", "\n", "    return bestmod" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Stopping after epoch 14. Validation score did not improve by tol=1e-05 for more than 10 epochs. 
Final error is 2.7038347354674672" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Best params: {'embed_dim': 75, 'eta': 0.001, 'hidden_dim': 100}\n", "Best score: 0.546\n", "              precision    recall  f1-score   support\n", "\n", "    negative      0.699     0.666     0.682       428\n", "     neutral      0.299     0.240     0.266       229\n", "    positive      0.662     0.759     0.707       444\n", "\n", "    accuracy                          0.615      1101\n", "   macro avg      0.553     0.555     0.552      1101\n", "weighted avg      0.601     0.615     0.606      1101\n", "\n", "CPU times: user 39min 55s, sys: 39.9 s, total: 40min 35s\n", "Wall time: 39min 53s\n" ] } ], "source": [ "%%time\n", "rnn_experiment_xval = sst.experiment(\n", "    sst.train_reader(SST_HOME),\n", "    simple_leaves_phi,\n", "    fit_rnn_with_hyperparameter_search,\n", "    assess_dataframes=sst.dev_reader(SST_HOME),\n", "    vectorize=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This model looks quite competitive with the simpler models we explored previously, and perhaps an even wider hyperparameter search would lead to additional improvements. In [finetuning.ipynb](finetuning.ipynb), we look at variants of the above that involve fine-tuning with BERT, and those models achieve even better results, which further highlights the value of rich pretraining." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The VecAvg baseline from Socher et al. 2013\n", "\n", "One of the baseline models from [Socher et al., Table 1](http://www.aclweb.org/anthology/D/D13/D13-1170.pdf) is __VecAvg__. This is like the model we explored above under the heading of [Distributed representations as features](#Distributed-representations-as-features), but it uses a random initial embedding that is updated as part of optimization. Another perspective: it is like the RNN we just evaluated, but with the recurrent computation replaced by averaging.\n", "\n", "In Socher et al. 2013, this model does reasonably well, scoring 80.1 on the root-only binary problem. In this section, we reimplement it, relying on `TorchRNNClassifier` to handle most of the heavy lifting, and we evaluate it with a reasonably wide hyperparameter search." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Defining the model\n", "\n", "The core model is `TorchVecAvgModel`, which just looks up embeddings, averages them, and feeds the result to a classifier layer:" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "class TorchVecAvgModel(nn.Module):\n", "    def __init__(self, vocab_size, output_dim, device, embed_dim=50):\n", "        super().__init__()\n", "        self.vocab_size = vocab_size\n", "        self.embed_dim = embed_dim\n", "        self.output_dim = output_dim\n", "        self.device = device\n", "        self.embedding = nn.Embedding(self.vocab_size, self.embed_dim)\n", "        self.classifier_layer = nn.Linear(self.embed_dim, self.output_dim)\n", "\n", "    def forward(self, X, seq_lengths):\n", "        embs = self.embedding(X)\n", "        # Mask based on the **true** lengths, so padding is excluded:\n", "        mask = [torch.ones(l, self.embed_dim) for l in seq_lengths]\n", "        mask = torch.nn.utils.rnn.pad_sequence(mask, batch_first=True)\n", "        mask = mask.to(self.device)\n", "        # True average over just the real timesteps:\n", "        mu = (embs * mask).sum(axis=1) / seq_lengths.unsqueeze(1)\n", "        # Classifier:\n", "        logits = self.classifier_layer(mu)\n", "        return logits" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the main interface, we can just subclass `TorchRNNClassifier` and change the `build_graph` method to use `TorchVecAvgModel`. 
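" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before defining that subclass, here is a quick toy check of the masked-averaging step in `forward`. (The tensors and their values are made up for illustration; the point is just that padding positions are excluded from the mean.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Toy batch of 'embeddings': two sequences, the first padded from 2 to 3:\n", "embs = torch.tensor([\n", "    [[1., 1.], [3., 3.], [0., 0.]],   # true length 2; last row is padding\n", "    [[2., 2.], [4., 4.], [6., 6.]]])  # true length 3\n", "seq_lengths = torch.tensor([2, 3])\n", "\n", "# Same masking logic as in `TorchVecAvgModel.forward`:\n", "mask = torch.nn.utils.rnn.pad_sequence(\n", "    [torch.ones(l, 2) for l in seq_lengths], batch_first=True)\n", "mu = (embs * mask).sum(axis=1) / seq_lengths.unsqueeze(1)\n", "\n", "mu  # tensor([[2., 2.], [4., 4.]]) -- the padding row is ignored" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "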
(For more details on the code and logic here, see the notebook [tutorial_pytorch_models.ipynb](tutorial_pytorch_models.ipynb).)" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "class TorchVecAvgClassifier(TorchRNNClassifier):\n", "\n", "    def build_graph(self):\n", "        return TorchVecAvgModel(\n", "            vocab_size=len(self.vocab),\n", "            output_dim=self.n_classes_,\n", "            device=self.device,\n", "            embed_dim=self.embed_dim)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### VecAvg hyperparameter tuning experiment\n", "\n", "Now that we have the model implemented, let's see if we can reproduce Socher et al.'s 80.1 on the binary, root-only version of SST." ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "train_df = sst.train_reader(SST_HOME)\n", "\n", "train_bin_df = train_df[train_df.label != 'neutral']" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "dev_df = sst.dev_reader(SST_HOME)\n", "\n", "dev_bin_df = dev_df[dev_df.label != 'neutral']" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "test_df = sst.sentiment_reader(os.path.join(SST_HOME, \"sst3-test-labeled.csv\"))\n", "\n", "test_bin_df = test_df[test_df.label != 'neutral']" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "def fit_vecavg_with_hyperparameter_search(X, y):\n", "    basemod = TorchVecAvgClassifier(\n", "        sst_train_vocab,\n", "        early_stopping=True)\n", "\n", "    param_grid = {\n", "        'embed_dim': [50, 100, 200, 300],\n", "        'eta': [0.001, 0.01, 0.05]}\n", "\n", "    bestmod = utils.fit_classifier_with_hyperparameter_search(\n", "        X, y, basemod, cv=3, param_grid=param_grid)\n", "\n", "    return bestmod" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Stopping after epoch 13. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 0.021834758925251663" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Best params: {'embed_dim': 300, 'eta': 0.05}\n", "Best score: 0.784\n", "              precision    recall  f1-score   support\n", "\n", "    negative      0.827     0.779     0.802       912\n", "    positive      0.790     0.836     0.812       909\n", "\n", "    accuracy                          0.807      1821\n", "   macro avg      0.808     0.807     0.807      1821\n", "weighted avg      0.808     0.807     0.807      1821\n", "\n", "CPU times: user 42min 50s, sys: 28min 13s, total: 1h 11min 3s\n", "Wall time: 17min 48s\n" ] } ], "source": [ "%%time\n", "vecavg_experiment_xval = sst.experiment(\n", "    [train_bin_df, dev_bin_df],\n", "    simple_leaves_phi,\n", "    fit_vecavg_with_hyperparameter_search,\n", "    assess_dataframes=test_bin_df,\n", "    vectorize=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Excellent – it looks like we basically reproduced the number from the paper (80.1)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" }, "widgets": { "state": {}, "version": "1.1.2" } }, "nbformat": 4, "nbformat_minor": 4 }