{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Supervised sentiment: dense feature representations and neural networks" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "__author__ = \"Christopher Potts\"\n", "__version__ = \"CS224u, Stanford, Fall 2020\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Contents\n", "\n", "1. [Overview](#Overview)\n", "1. [Set-up](#Set-up)\n", "1. [Distributed representations as features](#Distributed-representations-as-features)\n", " 1. [GloVe inputs](#GloVe-inputs)\n", " 1. [IMDB representations](#IMDB-representations)\n", " 1. [Remarks on this approach](#Remarks-on-this-approach)\n", "1. [RNN classifiers](#RNN-classifiers)\n", " 1. [RNN dataset preparation](#RNN-dataset-preparation)\n", " 1. [Vocabulary for the embedding](#Vocabulary-for-the-embedding)\n", " 1. [PyTorch RNN classifier](#PyTorch-RNN-classifier)\n", " 1. [Pretrained embeddings](#Pretrained-embeddings)\n", " 1. [RNN hyperparameter tuning experiment](#RNN-hyperparameter-tuning-experiment)\n", "1. [The VecAvg baseline from Socher et al. 2013](#The-VecAvg-baseline-from-Socher-et-al.-2013)\n", " 1. [Defining the model](#Defining-the-model)\n", " 1. [VecAvg hyperparameter tuning experiment](#VecAvg-hyperparameter-tuning-experiment)\n", "1. [Tree-structured neural networks](#Tree-structured-neural-networks)\n", " 1. [TreeNN dataset preparation](#TreeNN-dataset-preparation)\n", " 1. [PyTorch TreeNN](#PyTorch-TreeNN)\n", " 1. [Subtree supervision](#Subtree-supervision)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview\n", "\n", "This notebook defines and explores __vector averaging__, __recurrent neural network (RNN) classifiers__ and __tree-structured neural network (TreeNN) classifiers__ for the Stanford Sentiment Treebank. 
\n", "\n", "These approaches make their predictions based on comprehensive representations of the examples: \n", "\n", "* For the vector averaging models, each word is modeled, but we assume that words combine via a simple function that is insensitive to their order or constituent structure.\n", "* For the RNN, each word is again modeled, and we also model the sequential relationships between words.\n", "* For the TreeNN, the entire parsed structure of the sentence is modeled.\n", "\n", "All these models contrast with the ones explored in [the previous notebook](sst_02_hand_built_features.ipynb), which make predictions based on more partial, potentially idiosyncratic information extracted from the examples." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Set-up\n", "\n", "See [the first notebook in this unit](sst_01_overview.ipynb#Set-up) for set-up instructions." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from collections import Counter\n", "import numpy as np\n", "import os\n", "import pandas as pd\n", "from np_rnn_classifier import RNNClassifier\n", "from np_tree_nn import TreeNN\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.metrics import classification_report\n", "import torch\n", "import torch.nn as nn\n", "from torch_rnn_classifier import TorchRNNClassifier\n", "from torch_tree_nn import TorchTreeNN\n", "import sst\n", "import vsm\n", "import utils" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "utils.fix_random_seeds()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "DATA_HOME = 'data'\n", "\n", "GLOVE_HOME = os.path.join(DATA_HOME, 'glove.6B')\n", "\n", "VSMDATA_HOME = os.path.join(DATA_HOME, 'vsmdata')\n", "\n", "SST_HOME = os.path.join(DATA_HOME, 'trees')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Distributed representations as features\n", "\n", 
"As a first step in the direction of neural networks for sentiment, we can connect with our previous unit on distributed representations. Arguably, more than any specific model architecture, this is the major innovation of deep learning: __rather than designing feature functions by hand, we use dense, distributed representations, often derived from unsupervised models__.\n", "\n", "\"distreps-as-features.png\"\n", "\n", "Our model will just be `LogisticRegression`, and we'll continue with the experiment framework from the previous notebook. Here is `fit_maxent_classifier` again:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "def fit_maxent_classifier(X, y):\n", " mod = LogisticRegression(\n", " fit_intercept=True,\n", " solver='liblinear',\n", " multi_class='auto')\n", " mod.fit(X, y)\n", " return mod" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### GloVe inputs\n", "\n", "To illustrate this process, we'll use the general purpose GloVe representations released by the GloVe team, at 300d:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "glove_lookup = utils.glove2dict(\n", " os.path.join(GLOVE_HOME, 'glove.6B.300d.txt'))" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "def vsm_leaves_phi(tree, lookup, np_func=np.mean):\n", " \"\"\"Represent `tree` as a combination of the vector of its words.\n", "\n", " Parameters\n", " ----------\n", " tree : nltk.Tree\n", "\n", " lookup : dict\n", " From words to vectors.\n", "\n", " np_func : function (default: np.sum)\n", " A numpy matrix operation that can be applied columnwise,\n", " like `np.mean`, `np.sum`, or `np.prod`. 
The requirement is that\n", " the function take `axis=0` as one of its arguments (to ensure\n", " columnwise combination) and that it return a vector of a\n", " fixed length, no matter what the size of the tree is.\n", "\n", " Returns\n", " -------\n", " np.array, dimension `X.shape[1]`\n", "\n", " \"\"\"\n", " allvecs = np.array([lookup[w] for w in tree.leaves() if w in lookup])\n", " if len(allvecs) == 0:\n", "     dim = len(next(iter(lookup.values())))\n", "     feats = np.zeros(dim)\n", " else:\n", "     feats = np_func(allvecs, axis=0)\n", " return feats" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "def glove_leaves_phi(tree, np_func=np.sum):\n", " return vsm_leaves_phi(tree, glove_lookup, np_func=np_func)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " negative 0.746 0.789 0.767 941\n", " positive 0.816 0.778 0.797 1135\n", "\n", " accuracy 0.783 2076\n", " macro avg 0.781 0.783 0.782 2076\n", "weighted avg 0.785 0.783 0.783 2076\n", "\n", "CPU times: user 2.21 s, sys: 276 ms, total: 2.49 s\n", "Wall time: 2.01 s\n" ] } ], "source": [ "%%time\n", "_ = sst.experiment(\n", " SST_HOME,\n", " glove_leaves_phi,\n", " fit_maxent_classifier,\n", " vectorize=False) # Tell `experiment` that we already have our feature vectors." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### IMDB representations\n", "\n", "Our IMDB VSMs seem pretty well-attuned to the Stanford Sentiment Treebank, so we might think that they can do even better than the general-purpose GloVe inputs. Here is a quick assessment of that idea that builds on ideas we developed in the unit on VSMs." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "imdb20 = pd.read_csv(\n", " os.path.join(VSMDATA_HOME, 'imdb_window20-flat.csv.gz'), index_col=0)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "imdb20_ppmi = vsm.pmi(imdb20, positive=True)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "imdb20_ppmi_svd = vsm.lsa(imdb20_ppmi, k=300)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "imdb_lookup = dict(zip(imdb20_ppmi_svd.index, imdb20_ppmi_svd.values))" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "def imdb_phi(tree, np_func=np.sum):\n", " return vsm_leaves_phi(tree, imdb_lookup, np_func=np_func)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " negative 0.746 0.733 0.739 977\n", " positive 0.766 0.778 0.772 1099\n", "\n", " accuracy 0.757 2076\n", " macro avg 0.756 0.755 0.756 2076\n", "weighted avg 0.757 0.757 0.757 2076\n", "\n", "CPU times: user 2.88 s, sys: 1.06 s, total: 3.94 s\n", "Wall time: 2.04 s\n" ] } ], "source": [ "%%time\n", "_ = sst.experiment(\n", " SST_HOME,\n", " imdb_phi,\n", " fit_maxent_classifier,\n", " vectorize=False) # Tell `experiment` that we already have our feature vectors." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Remarks on this approach\n", "\n", "* Recall that our `unigrams_phi` created feature representations with over 16K dimensions and got about 0.77 with no hyperparameter tuning.\n", "\n", "* The above models' feature representations have only 300 dimensions, and they perform about the same. 
In many ways, it's striking that we can get a model that is competitive with so few dimensions.\n", "\n", "* The promise of the Mittens model of [Dingwall and Potts 2018](https://arxiv.org/abs/1803.09901) is that we can use GloVe itself to update the general-purpose information in the 'glove.6B' vectors with specialized information from one of these IMDB count matrices. That might be worth trying; the `mittens` package (`pip install mittens`) already implements this!\n", "\n", "* That said, just summing up all the word representations is pretty unappealing linguistically. There's no doubt that we're losing a lot of valuable information in doing this. The models we turn to now can be seen as addressing this shortcoming while retaining the insight that our distributed representations are valuable for this task.\n", "\n", "* We'll return to these ideas below, when we consider [the VecAvg baseline from Socher et al. 2013](#The-VecAvg-baseline-from-Socher-et-al.-2013). That model also posits a simple, fixed combination function (averaging). However, it begins with randomly initialized representations and updates them as part of training." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## RNN classifiers\n", "\n", "A recurrent neural network (RNN) is any deep learning model that processes its inputs sequentially. There are many variations on this theme. The one that we use here is an __RNN classifier__.\n", "\n", "\n", "\n", "The version of the model that is implemented in `np_rnn_classifier.py` corresponds exactly to the above diagram. We can express it mathematically as follows:\n", "\n", "$$\\begin{align*}\n", "h_{t} &= \\tanh(x_{t}W_{xh} + h_{t-1}W_{hh}) \\\\\n", "y &= \\textbf{softmax}(h_{n}W_{hy} + b_y)\n", "\\end{align*}$$\n", "\n", "where $1 \\leqslant t \\leqslant n$. 
The first line defines the recurrence: each hidden state $h_{t}$ is defined by the input $x_{t}$ and the previous hidden state $h_{t-1}$, together with weight matrices $W_{xh}$ and $W_{hh}$, which are used at all timesteps. As indicated in the above diagram, the sequence of hidden states is padded with an initial state $h_{0}$. In our implementations, this is always an all $0$ vector, but it can be initialized in more sophisticated ways (some of which we will explore in our units on natural language inference and grounded natural language generation).\n", "\n", "The model in `torch_rnn_classifier.py` expands on the above and allows for more flexibility:\n", "\n", "$$\\begin{align*}\n", "h_{t} &= \\text{RNN}(x_{t}, h_{t-1}) \\\\\n", "h &= f(h_{n}W_{hh} + b_{h}) \\\\\n", "y &= \\textbf{softmax}(hW_{hy} + b_y)\n", "\\end{align*}$$\n", "\n", "Here, $\\text{RNN}$ stands for all the parameters of the recurrent part of the model. This will depend on the choice one makes for `rnn_cell_class`; options include `nn.RNN`, `nn.LSTM`, and `nn.GRU`. In addition, the classifier part includes a hidden layer (middle row), and the user can decide on the activation function $f$ to use there (parameter: `classifier_activation`).\n", "\n", "This is a potential gain over our average-vectors baseline, in that it processes each word individually, and in the context of those that came before it. Thus, not only is this sensitive to word order, but the hidden representations create the potential to encode how the preceding context for a word affects its interpretation.\n", "\n", "The downside of this, of course, is that this model is much more difficult to set up and optimize. Let's dive into those details." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### RNN dataset preparation\n", "\n", "SST contains trees, but the RNN processes just the sequence of leaf nodes. 
The function `sst.build_rnn_dataset` creates datasets in this format:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "X_rnn_train, y_rnn_train = sst.build_rnn_dataset(\n", " SST_HOME, sst.train_reader, class_func=sst.binary_class_func)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each member of `X_rnn_train` is a list of words. Here's a look at the start of the first:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['The', 'Rock', 'is', 'destined', 'to', 'be']" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_rnn_train[0][:6]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because this is a classifier, `y_rnn_train` is just a list of labels, one per example:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'positive'" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_rnn_train[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For experiments, let's build a `dev` dataset as well:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "X_rnn_dev, y_rnn_dev = sst.build_rnn_dataset(\n", " SST_HOME, sst.dev_reader, class_func=sst.binary_class_func)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Vocabulary for the embedding\n", "\n", "The first delicate issue we need to address is the vocabulary for our model:\n", "\n", "* As indicated in the figure above, the first thing we do when processing an example is look up the words in an embedding (a VSM), which has to have a fixed dimensionality. \n", "\n", "* We can use our training data to specify the vocabulary for this embedding; at prediction time, though, we will inevitably encounter words we haven't seen before. 
\n", "\n", "* The convention we adopt here is to map them to an `$UNK` token that is in our pre-specified vocabulary.\n", "\n", "* At the same time, we might want to collapse infrequent tokens into `$UNK` to make optimization easier and to try to create reasonable representations for words that we have to map to `$UNK` at test time.\n", "\n", "In `utils`, the function `get_vocab` will help you specify a vocabulary. It will let you choose a vocabulary by optionally specifying `mincount` or `n_words`, and it will ensure that `$UNK` is included." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "sst_full_train_vocab = utils.get_vocab(X_rnn_train)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "sst_full_train_vocab has 16,283 items\n" ] } ], "source": [ "print(\"sst_full_train_vocab has {:,} items\".format(len(sst_full_train_vocab)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This frankly seems too big relative to our dataset size. Let's restrict to just words that occur at least twice:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "sst_train_vocab = utils.get_vocab(X_rnn_train, mincount=2)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "sst_train_vocab has 7,564 items\n" ] } ], "source": [ "print(\"sst_train_vocab has {:,} items\".format(len(sst_train_vocab)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### PyTorch RNN classifier\n", "\n", "Here and throughout, we'll rely on `early_stopping=True` to try to find the optimal time to stop optimization. This behavior can be further refined by setting different values of `validation_fraction`, `n_iter_no_change`, and `tol`. 
For additional discussion, see [the section on model convergence in the evaluation methods notebook](#Assessing-models-without-convergence)." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "rnn = TorchRNNClassifier(\n", " sst_train_vocab,\n", " early_stopping=True)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Stopping after epoch 52. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 0.14730898616835475" ] }, { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 36.1 s, sys: 435 ms, total: 36.6 s\n", "Wall time: 8 s\n" ] } ], "source": [ "%time _ = rnn.fit(X_rnn_train, y_rnn_train)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "rnn_dev_preds = rnn.predict(X_rnn_dev)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " negative 0.702 0.743 0.722 428\n", " positive 0.737 0.696 0.716 444\n", "\n", " accuracy 0.719 872\n", " macro avg 0.720 0.719 0.719 872\n", "weighted avg 0.720 0.719 0.719 872\n", "\n" ] } ], "source": [ "print(classification_report(y_rnn_dev, rnn_dev_preds, digits=3))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above numbers are just a starting point. Let's try to improve on them by using pretrained embeddings and then by exploring a range of hyperparameter options." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Pretrained embeddings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With `embedding=None`, `TorchRNNClassifier` (and its counterpart in `np_rnn_classifier.py`) create random embeddings. You can also pass in an embedding, as long as you make sure it has the right vocabulary. 
The utility `utils.create_pretrained_embedding` will help with that:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "glove_embedding, sst_glove_vocab = utils.create_pretrained_embedding(\n", " glove_lookup, sst_train_vocab)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's an illustration using `TorchRNNClassifier`:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "rnn_glove = TorchRNNClassifier(\n", " sst_glove_vocab,\n", " embedding=glove_embedding,\n", " early_stopping=True)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Stopping after epoch 23. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 0.12253877334296703" ] }, { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 15 s, sys: 14.7 ms, total: 15 s\n", "Wall time: 2.74 s\n" ] } ], "source": [ "%time _ = rnn_glove.fit(X_rnn_train, y_rnn_train)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "rnn_glove_dev_preds = rnn_glove.predict(X_rnn_dev)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " negative 0.824 0.724 0.771 428\n", " positive 0.762 0.851 0.804 444\n", "\n", " accuracy 0.789 872\n", " macro avg 0.793 0.788 0.788 872\n", "weighted avg 0.793 0.789 0.788 872\n", "\n" ] } ], "source": [ "print(classification_report(y_rnn_dev, rnn_glove_dev_preds, digits=3))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It looks like pretrained representations give us a notable boost, but we're still below most of the simpler models explored in [the previous notebook](sst_02_hand_built_features.ipynb)." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### RNN hyperparameter tuning experiment\n", "\n", "As we saw in [the previous notebook](sst_02_hand_built_features.ipynb), we're not really done until we've done some hyperparameter search. So let's round out this section by cross-validating the RNN that uses GloVe embeddings, to see if we can improve on the default-parameters model we evaluated just above. For this, we'll use `sst.experiment`:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "def simple_leaves_phi(tree):\n", " return tree.leaves()" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "def fit_rnn_with_hyperparameter_search(X, y):\n", " basemod = TorchRNNClassifier(\n", " sst_train_vocab,\n", " embedding=glove_embedding,\n", " batch_size=25, # Inspired by comments in the paper.\n", " bidirectional=True,\n", " early_stopping=True)\n", "\n", " # There are lots of other parameters and values we could\n", " # explore, but this is at least a solid start:\n", " param_grid = {\n", " 'embed_dim': [25, 50, 75, 100],\n", " 'hidden_dim': [25, 50, 75, 100],\n", " 'eta': [0.001, 0.01, 0.05]}\n", "\n", " bestmod = utils.fit_classifier_with_hyperparameter_search(\n", " X, y, basemod, cv=3, param_grid=param_grid)\n", "\n", " return bestmod" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Stopping after epoch 13. Validation score did not improve by tol=1e-05 for more than 10 epochs. 
Final error is 3.154094541274389555" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Best params: {'embed_dim': 75, 'eta': 0.001, 'hidden_dim': 75}\n", "Best score: 0.791\n", " precision recall f1-score support\n", "\n", " negative 0.769 0.834 0.800 1002\n", " positive 0.832 0.766 0.798 1074\n", "\n", " accuracy 0.799 2076\n", " macro avg 0.801 0.800 0.799 2076\n", "weighted avg 0.802 0.799 0.799 2076\n", "\n", "CPU times: user 34min 15s, sys: 15.6 s, total: 34min 30s\n", "Wall time: 34min 16s\n" ] } ], "source": [ "%%time\n", "rnn_experiment_xval = sst.experiment(\n", " SST_HOME,\n", " simple_leaves_phi,\n", " fit_rnn_with_hyperparameter_search,\n", " vectorize=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we carry forward the optimal model from our hyperparameter search, to run a final assessment on the test set:" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "def fit_optimized_rnn(X, y):\n", " mod = rnn_experiment_xval['model']\n", " mod.fit(X, y)\n", " return mod" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Stopping after epoch 12. Validation score did not improve by tol=1e-05 for more than 10 epochs. 
Final error is 2.3395959859716413" ] }, { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " negative 0.815 0.826 0.820 912\n", " positive 0.823 0.812 0.817 909\n", "\n", " accuracy 0.819 1821\n", " macro avg 0.819 0.819 0.819 1821\n", "weighted avg 0.819 0.819 0.819 1821\n", "\n", "CPU times: user 21.4 s, sys: 102 ms, total: 21.5 s\n", "Wall time: 21.4 s\n" ] } ], "source": [ "%%time\n", "_ = sst.experiment(\n", " SST_HOME,\n", " simple_leaves_phi,\n", " fit_optimized_rnn,\n", " class_func=sst.binary_class_func,\n", " train_reader=(sst.train_reader, sst.dev_reader),\n", " assess_reader=sst.test_reader,\n", " vectorize=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This model looks quite competitive with the simpler models we explored previously, and perhaps an even wider hyperparameter search would lead to additional improvements. In [contextualreps.ipynb](contextualreps.ipynb), we look at variants of the above that involve fine-tuning with ELMo and BERT, and those models achieve results around 0.90 on the test set, which further highlights the value of rich pretraining." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The VecAvg baseline from Socher et al. 2013\n", "\n", "One of the baseline models from [Socher et al., Table 1](http://www.aclweb.org/anthology/D/D13/D13-1170.pdf) is __VecAvg__. This is like the model we explored above under the heading of [Distributed representations as features](#Distributed-representations-as-features), but it uses a random initial embedding that is updated as part of optimization. Another perspective on it is that it is like the RNN we just evaluated, but with the RNN parameters replaced by averaging. \n", "\n", "In Socher et al. 2013, this model does reasonably well, scoring 80.1 on the root-only binary problem. 
In this section, we reimplement it, relying on `TorchRNNClassifier` to handle most of the heavy lifting, and we evaluate it with a reasonably wide hyperparameter search." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Defining the model\n", "\n", "The core model is `TorchVecAvgModel`, which just looks up embeddings, averages them, and feeds the result to a classifier layer:" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "class TorchVecAvgModel(nn.Module):\n", "    def __init__(self, vocab_size, output_dim, device, embed_dim=50):\n", "        super().__init__()\n", "        self.vocab_size = vocab_size\n", "        self.embed_dim = embed_dim\n", "        self.output_dim = output_dim\n", "        self.device = device\n", "        self.embedding = nn.Embedding(self.vocab_size, self.embed_dim)\n", "        self.classifier_layer = nn.Linear(self.embed_dim, self.output_dim)\n", "\n", "    def forward(self, X, seq_lengths):\n", "        embs = self.embedding(X)\n", "        # Mask based on the **true** lengths:\n", "        mask = [torch.ones(l, self.embed_dim) for l in seq_lengths]\n", "        mask = torch.nn.utils.rnn.pad_sequence(mask, batch_first=True)\n", "        mask = mask.to(self.device)\n", "        # True average:\n", "        mu = (embs * mask).sum(axis=1) / seq_lengths.unsqueeze(1)\n", "        # Classifier:\n", "        logits = self.classifier_layer(mu)\n", "        return logits" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the main interface, we can just subclass `TorchRNNClassifier` and change the `build_graph` method to use `TorchVecAvgModel`. 
(For more details on the code and logic here, see the notebook [tutorial_torch_models.ipynb](tutorial_torch_models.ipynb).)" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "class TorchVecAvgClassifier(TorchRNNClassifier):\n", "\n", "    def build_graph(self):\n", "        return TorchVecAvgModel(\n", "            vocab_size=len(self.vocab),\n", "            output_dim=self.n_classes_,\n", "            device=self.device,\n", "            embed_dim=self.embed_dim)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### VecAvg hyperparameter tuning experiment\n", "\n", "Now that we have the model implemented, let's see if we can reproduce Socher et al.'s 80.1 on the binary, root-only version of SST.\n", "\n", "First, we do the hyperparameter search:" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "def fit_vecavg_with_hyperparameter_search(X, y):\n", "    basemod = TorchVecAvgClassifier(\n", "        sst_train_vocab,\n", "        early_stopping=True)\n", "\n", "    param_grid = {\n", "        'embed_dim': [50, 100, 200, 300],\n", "        'eta': [0.001, 0.01, 0.05]}\n", "\n", "    bestmod = utils.fit_classifier_with_hyperparameter_search(\n", "        X, y, basemod, cv=3, param_grid=param_grid)\n", "\n", "    return bestmod" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Stopping after epoch 18. Validation score did not improve by tol=1e-05 for more than 10 epochs. 
Final error is 0.080092592164874088" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Best params: {'embed_dim': 300, 'eta': 0.01}\n", "Best score: 0.766\n", " precision recall f1-score support\n", "\n", " negative 0.786 0.715 0.749 1011\n", " positive 0.751 0.815 0.782 1065\n", "\n", " accuracy 0.766 2076\n", " macro avg 0.768 0.765 0.765 2076\n", "weighted avg 0.768 0.766 0.766 2076\n", "\n", "CPU times: user 16min 49s, sys: 23.2 s, total: 17min 13s\n", "Wall time: 4min 47s\n" ] } ], "source": [ "%%time\n", "vecavg_experiment_xval = sst.experiment(\n", " SST_HOME,\n", " simple_leaves_phi,\n", " fit_vecavg_with_hyperparameter_search,\n", " vectorize=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And then we use the best parameters found above to train a new model on the union of the train and dev sets:" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "def fit_optimized_vecavg(X, y):\n", " mod = vecavg_experiment_xval['model']\n", " mod.fit(X, y)\n", " return mod" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Stopping after epoch 17. Validation score did not improve by tol=1e-05 for more than 10 epochs. 
Final error is 0.14763524942100048" ] }, { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " negative 0.798 0.823 0.811 912\n", " positive 0.817 0.791 0.804 909\n", "\n", " accuracy 0.807 1821\n", " macro avg 0.808 0.807 0.807 1821\n", "weighted avg 0.808 0.807 0.807 1821\n", "\n", "CPU times: user 29.5 s, sys: 2.14 s, total: 31.6 s\n", "Wall time: 9.93 s\n" ] } ], "source": [ "%%time\n", "_ = sst.experiment(\n", " SST_HOME,\n", " simple_leaves_phi,\n", " fit_optimized_vecavg,\n", " class_func=sst.binary_class_func,\n", " train_reader=(sst.train_reader, sst.dev_reader),\n", " assess_reader=sst.test_reader,\n", " vectorize=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Excellent – it looks like we basically reproduced the number from the paper (80.1)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tree-structured neural networks\n", "\n", "Tree-structured neural networks (TreeNNs) are close relatives of RNN classifiers. (If you tilt your head, you can see the above sequence model as a kind of tree.) The TreeNNs we explore here are the simplest possible and actually have many fewer parameters than RNNs. 
Here's a summary:\n", "\n", "\n", "\n", "The crucial property of these networks is the way they employ recursion: the representation of a parent node $p$ has the same dimensionality as the word representations, allowing seamless repeated application of the central combination function:\n", "\n", "$$p = \\tanh([x_{L};x_{R}]W_{wh} + b)$$\n", "\n", "Here, $[x_{L};x_{R}]$ is the concatenation of the left and right child representations, and $p$ is the resulting parent node, which can then be a child node in a higher subtree.\n", "\n", "When we reach the root node $h_{r}$ of the tree, we apply a softmax classifier using that top node's representation:\n", "\n", "$$y = \\textbf{softmax}(h_{r}W_{hy} + b)$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### TreeNN dataset preparation\n", "\n", "This is the only model under consideration here that makes use of the tree structures in the SST:" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "X_tree_train, y_tree_train = sst.build_tree_dataset(\n", " SST_HOME, sst.train_reader, class_func=sst.binary_class_func)" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "Tree('positive', [Tree(None, [Tree(None, ['The']), Tree(None, ['Rock'])]), Tree('positive', [Tree('positive', [Tree(None, ['is']), Tree('positive', [Tree(None, ['destined']), Tree(None, [Tree(None, [Tree(None, [Tree(None, [Tree(None, ['to']), Tree(None, [Tree(None, ['be']), Tree(None, [Tree(None, ['the']), Tree(None, [Tree(None, ['21st']), Tree(None, [Tree(None, [Tree(None, ['Century']), Tree(None, [\"'s\"])]), Tree(None, [Tree('positive', ['new']), Tree(None, [Tree(None, ['``']), Tree(None, ['Conan'])])])])])])])]), Tree(None, [\"''\"])]), Tree(None, ['and'])]), Tree('positive', [Tree(None, ['that']), Tree('positive', [Tree(None, ['he']), Tree('positive', [Tree(None, [\"'s\"]), Tree('positive', [Tree(None, ['going']), 
Tree('positive', [Tree(None, ['to']), Tree('positive', [Tree('positive', [Tree(None, ['make']), Tree('positive', [Tree('positive', [Tree(None, ['a']), Tree('positive', ['splash'])]), Tree(None, [Tree(None, ['even']), Tree('positive', ['greater'])])])]), Tree(None, [Tree(None, ['than']), Tree(None, [Tree(None, [Tree(None, [Tree(None, [Tree('negative', [Tree(None, ['Arnold']), Tree(None, ['Schwarzenegger'])]), Tree(None, [','])]), Tree(None, [Tree(None, ['Jean-Claud']), Tree(None, [Tree(None, ['Van']), Tree(None, ['Damme'])])])]), Tree(None, ['or'])]), Tree(None, [Tree(None, ['Steven']), Tree(None, ['Segal'])])])])])])])])])])])])]), Tree(None, ['.'])])])" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_tree_train[0]" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "X_tree_dev, y_tree_dev = sst.build_tree_dataset(\n", " SST_HOME, sst.dev_reader, class_func=sst.binary_class_func)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### PyTorch TreeNN" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [ "torch_tree_nn_glove = TorchTreeNN(\n", " sst_glove_vocab,\n", " embedding=glove_embedding,\n", " max_grad_norm=10.0,\n", " early_stopping=True)" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Stopping after epoch 33. Validation score did not improve by tol=1e-05 for more than 10 epochs. 
Final error is 4.148158252239227" ] }, { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 26min 57s, sys: 175 ms, total: 26min 57s\n", "Wall time: 26min 41s\n" ] } ], "source": [ "%time _ = torch_tree_nn_glove.fit(X_tree_train, y_tree_train)" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "tree_dev_preds = torch_tree_nn_glove.predict(X_tree_dev)" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " negative 0.574 0.605 0.589 428\n", " positive 0.599 0.568 0.583 444\n", "\n", " accuracy 0.586 872\n", " macro avg 0.586 0.586 0.586 872\n", "weighted avg 0.587 0.586 0.586 872\n", "\n" ] } ], "source": [ "print(classification_report(y_tree_dev, tree_dev_preds, digits=3))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Subtree supervision\n", "\n", "We've so far ignored one of the most exciting aspects of the SST: it has sentiment labels on every constituent from the root down to the lexical nodes.\n", "\n", "It is fairly easy to extend `TorchTreeNN` to learn from these additional labels. The key change is that the recursive interpretation function has to gather all of the node representations and their true labels and pass these to the loss function." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" }, "widgets": { "state": {}, "version": "1.1.2" } }, "nbformat": 4, "nbformat_minor": 1 }