{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Supervised sentiment: Dense feature representations and neural networks" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "__author__ = \"Christopher Potts\"\n", "__version__ = \"CS224u, Stanford, Spring 2019\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Contents\n", "\n", "1. [Overview](#Overview)\n", "1. [Set-up](#Set-up)\n", "1. [Distributed representations as features](#Distributed-representations-as-features)\n", " 1. [GloVe inputs](#GloVe-inputs)\n", " 1. [IMDB representations](#IMDB-representations)\n", " 1. [Remarks on this approach](#Remarks-on-this-approach)\n", "1. [RNN classifiers](#RNN-classifiers)\n", " 1. [RNN dataset preparation](#RNN-dataset-preparation)\n", " 1. [Vocabulary for the embedding](#Vocabulary-for-the-embedding)\n", " 1. [Pure NumPy RNN implementation](#Pure-NumPy-RNN-implementation)\n", " 1. [PyTorch implementation](#PyTorch-implementation)\n", " 1. [TensorFlow implementation](#TensorFlow-implementation)\n", " 1. [Pretrained embeddings](#Pretrained-embeddings)\n", "1. [Tree-structured neural networks](#Tree-structured-neural-networks)\n", " 1. [TreeNN dataset preparation](#TreeNN-dataset-preparation)\n", " 1. [Pure NumPy TreeNN implementation](#Pure-NumPy-TreeNN-implementation)\n", " 1. [Torch TreeNN implementation](#Torch-TreeNN-implementation)\n", " 1. [Subtree supervision](#Subtree-supervision)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview\n", "\n", "This notebook defines and explores __recurrent neural network (RNN) classifiers__ and __tree-structured neural network (TreeNN) classifiers__ for the Stanford Sentiment Treebank. 
\n", "\n", "These approaches make their predictions based on comprehensive representations of the examples: \n", "\n", "* For the RNN, each word is modeled, as are its sequential relationships to the other words.\n", "* For the TreeNN, the entire parsed structure of the sentence is modeled.\n", "\n", "Both models contrast with the ones explored in [the previous notebook](sst_02_hand_built_features.ipynb), which make predictions based on more partial, potentially idiosyncratic information extracted from the examples." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Set-up\n", "\n", "See [the first notebook in this unit](sst_01_overview.ipynb#Set-up) for set-up instructions." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from collections import Counter\n", "import numpy as np\n", "import os\n", "import pandas as pd\n", "import random\n", "from np_rnn_classifier import RNNClassifier\n", "from np_tree_nn import TreeNN\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.metrics import classification_report\n", "import tensorflow as tf\n", "from tf_rnn_classifier import TfRNNClassifier\n", "import torch\n", "import torch.nn as nn\n", "from torch_rnn_classifier import TorchRNNClassifier\n", "from torch_tree_nn import TorchTreeNN\n", "from torch_subtree_nn import TorchSubtreeNN\n", "import sst\n", "import vsm\n", "import utils" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# This will limit the TensorFlow log messages to just those\n", "# that track training progress.\n", "\n", "utils.tf_train_progress_logging()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "DATA_HOME = 'data'\n", "\n", "GLOVE_HOME = os.path.join(DATA_HOME, 'glove.6B')\n", "\n", "VSMDATA_HOME = os.path.join(DATA_HOME, 'vsmdata')\n", "\n", "SST_HOME = os.path.join(DATA_HOME, 'trees')" ] }, { "cell_type": "markdown", "metadata": 
{}, "source": [ "## Distributed representations as features\n", "\n", "As a first step in the direction of neural networks for sentiment, we can connect with our previous unit on distributed representations. Arguably, more than any specific model architecture, this is the major innovation of deep learning: __rather than designing feature functions by hand, we use dense, distributed representations, often derived from unsupervised models__.\n", "\n", "\"distreps-as-features.png\"\n", "\n", "Our model will just be `LogisticRegression`, and we'll continue with the experiment framework from the previous notebook. Here is `fit_maxent_classifier` again:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "def fit_maxent_classifier(X, y): \n", "    mod = LogisticRegression(\n", "        fit_intercept=True, \n", "        solver='liblinear', \n", "        multi_class='auto')\n", "    mod.fit(X, y)\n", "    return mod" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### GloVe inputs\n", "\n", "To illustrate this process, we'll use the general purpose GloVe representations released by the GloVe team, at 50d:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "glove_lookup = utils.glove2dict(\n", " os.path.join(GLOVE_HOME, 'glove.6B.50d.txt'))" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "def vsm_leaves_phi(tree, lookup, np_func=np.sum):\n", "    \"\"\"Represent `tree` as a combination of the vectors of its words.\n", "    \n", "    Parameters\n", "    ----------\n", "    tree : nltk.Tree \n", "    lookup : dict\n", "        From words to vectors.\n", "    np_func : function (default: np.sum)\n", "        A numpy matrix operation that can be applied columnwise, \n", "        like `np.mean`, `np.sum`, or `np.prod`. 
The requirement is that \n", "        the function take `axis=0` as one of its arguments (to ensure\n", "        columnwise combination) and that it return a vector of a \n", "        fixed length, no matter what the size of the tree is.\n", "    \n", "    Returns\n", "    -------\n", "    np.array, with the same dimensionality as the vectors in `lookup`\n", "    \n", "    \"\"\" \n", "    allvecs = np.array([lookup[w] for w in tree.leaves() if w in lookup]) \n", "    if len(allvecs) == 0:\n", "        dim = len(next(iter(lookup.values())))\n", "        feats = np.zeros(dim)\n", "    else: \n", "        feats = np_func(allvecs, axis=0) \n", "    return feats" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "def glove_leaves_phi(tree, np_func=np.sum):\n", "    return vsm_leaves_phi(tree, glove_lookup, np_func=np_func)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " negative 0.612 0.715 0.660 991\n", " neutral 0.322 0.076 0.122 490\n", " positive 0.659 0.785 0.716 1083\n", "\n", " micro avg 0.622 0.622 0.622 2564\n", " macro avg 0.531 0.525 0.499 2564\n", "weighted avg 0.576 0.622 0.581 2564\n", "\n" ] } ], "source": [ "_ = sst.experiment(\n", " SST_HOME,\n", " glove_leaves_phi,\n", " fit_maxent_classifier,\n", " class_func=sst.ternary_class_func,\n", " vectorize=False) # Tell `experiment` that we already have our feature vectors." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### IMDB representations\n", "\n", "Our IMDB VSMs seem pretty well-attuned to the Stanford Sentiment Treebank, so we might think that they can do even better than the general-purpose GloVe inputs. 
Here is a quick assessment of that idea:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "imdb20 = pd.read_csv(\n", " os.path.join(VSMDATA_HOME, 'imdb_window20-flat.csv.gz'), index_col=0)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "imdb20_ppmi = vsm.pmi(imdb20, positive=False) " ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "imdb20_ppmi_svd = vsm.lsa(imdb20_ppmi, k=50) " ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "imdb_lookup = dict(zip(imdb20_ppmi_svd.index, imdb20_ppmi_svd.values))" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "def imdb_phi(tree, np_func=np.sum):\n", "    return vsm_leaves_phi(tree, imdb_lookup, np_func=np_func)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " negative 0.591 0.716 0.648 1018\n", " neutral 0.214 0.006 0.012 495\n", " positive 0.617 0.774 0.687 1051\n", "\n", " micro avg 0.603 0.603 0.603 2564\n", " macro avg 0.474 0.499 0.449 2564\n", "weighted avg 0.529 0.603 0.541 2564\n", "\n" ] } ], "source": [ "_ = sst.experiment(\n", " SST_HOME,\n", " imdb_phi,\n", " fit_maxent_classifier,\n", " class_func=sst.ternary_class_func,\n", " vectorize=False) # Tell `experiment` that we already have our feature vectors." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Remarks on this approach\n", "\n", "* Recall that our `unigrams_phi` created feature representations with over 16K dimensions and got about 0.77.\n", "\n", "* The above models have only 50 dimensions and come close in terms of performance. 
In many ways, it's striking that we can get a model that is competitive with so few dimensions.\n", "\n", "* The promise of the Mittens model of [Dingwall and Potts 2018](https://arxiv.org/abs/1803.09901) is that we can use GloVe itself to update the general purpose information in the 'glove.6B' vectors with specialized information from one of these IMDB count matrices. That might be worth trying; the `mittens` package already implements this!\n", "\n", "* That said, just summing up all the word representations is pretty unappealing linguistically. There's no doubt that we're losing a lot of valuable information in doing this. The models we turn to now can be seen as addressing this shortcoming while retaining the insight that our distributed representations are valuable for this task." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## RNN classifiers\n", "\n", "A recurrent neural network (RNN) is any deep learning model that processes its inputs sequentially. There are many variations on this theme. The one that we use here is an __RNN classifier__.\n", "\n", "\n", "\n", "For a sequence of length $n$:\n", "\n", "$$\\begin{align*}\n", "h_{t} &= \\tanh(x_{t}W_{xh} + h_{t-1}W_{hh}) \\\\\n", "y &= \\textbf{softmax}(h_{n}W_{hy} + b)\n", "\\end{align*}$$\n", "\n", "where $1 \\leqslant t \\leqslant n$. As indicated in the above diagram, the sequence of hidden states is padded with an initial state $h_{0}$. In our implementations, this is always an all $0$ vector, but it can be initialized in more sophisticated ways (some of which we will explore in our unit on natural language inference).\n", "\n", "This is a potential gain over our sum-the-word-vectors baseline, in that it processes each word individually, in the context of those that came before it. 
Thus, not only is this sensitive to word order, but the hidden representation gives us the potential to encode how the preceding context for a word affects its interpretation.\n", "\n", "The downside of this, of course, is that this model is much more difficult to set up and optimize. Let's dive into those details." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### RNN dataset preparation\n", "\n", "SST contains trees, but the RNN processes just the sequence of leaf nodes. The function `sst.build_rnn_dataset` creates datasets in this format:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "X_rnn_train, y_rnn_train = sst.build_rnn_dataset(\n", " SST_HOME, sst.train_reader, class_func=sst.ternary_class_func)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`X_rnn_train` is a list of lists of words: each member is one example, given as a list of its words. Here's a look at the start of the first:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['The', 'Rock', 'is', 'destined', 'to', 'be']" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_rnn_train[0][:6]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because this is a classifier, `y_rnn_train` is just a list of labels, one per example:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'positive'" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_rnn_train[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For experiments, let's build a `dev` dataset as well:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "X_rnn_dev, y_rnn_dev = sst.build_rnn_dataset(\n", " SST_HOME, sst.dev_reader, class_func=sst.ternary_class_func)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Vocabulary for 
the embedding\n", "\n", "The first delicate issue we need to address is the vocabulary for our model:\n", "\n", "* As indicated in the figure above, the first thing we do when processing an example is look up the words in an embedding (a VSM), which has to have a fixed dimensionality. \n", "\n", "* We can use our training data to specify the vocabulary for this embedding; at prediction time, though, we will inevitably encounter words we haven't seen before. \n", "\n", "* The convention we adopt here is to map them to an `$UNK` token that is in our pre-specified vocabulary.\n", "\n", "* At the same time, we might want to collapse infrequent tokens into `$UNK` to make optimization easier.\n", "\n", "In `utils`, the function `get_vocab` implements these strategies. Now we can extract the training vocab and use it for the model embedding, secure in the knowledge that we will be able to process tokens outside of this set (by mapping them to `$UNK`)." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "sst_full_train_vocab = utils.get_vocab(X_rnn_train)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "sst_full_train_vocab has 18,279 items\n" ] } ], "source": [ "print(\"sst_full_train_vocab has {:,} items\".format(len(sst_full_train_vocab)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This frankly seems too big relative to our dataset size. Let's restrict to just 10000 words:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "sst_train_vocab = utils.get_vocab(X_rnn_train, n_words=10000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Pure NumPy RNN implementation\n", "\n", "The first implementation we'll look at is a pure NumPy implementation of exactly the model depicted above. 
This implementation is a bit slow and might not be all that effective, but it is useful to have available in case one really wants to inspect the details of how these models process examples." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "rnn = RNNClassifier(\n", " sst_train_vocab,\n", " embedding=None, # Will be randomly initialized.\n", " embed_dim=50,\n", " hidden_dim=50,\n", " max_iter=50, \n", " eta=0.05) " ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Finished epoch 50 of 50; error is 15.908830618217432" ] }, { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 4min 7s, sys: 332 ms, total: 4min 7s\n", "Wall time: 4min 7s\n" ] } ], "source": [ "%time _ = rnn.fit(X_rnn_train, y_rnn_train)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "rnn_dev_predictions = rnn.predict(X_rnn_dev)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " negative 0.34 0.29 0.31 428\n", " neutral 0.25 0.34 0.29 229\n", " positive 0.41 0.40 0.40 444\n", "\n", " micro avg 0.34 0.34 0.34 1101\n", " macro avg 0.33 0.34 0.33 1101\n", "weighted avg 0.35 0.34 0.34 1101\n", "\n" ] } ], "source": [ "print(classification_report(y_rnn_dev, rnn_dev_predictions))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### PyTorch implementation\n", "\n", "The included PyTorch implementation is much faster and more configurable." 
] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "torch_rnn = TorchRNNClassifier(\n", " sst_train_vocab,\n", " embed_dim=50,\n", " hidden_dim=50,\n", " max_iter=50,\n", " eta=0.05) " ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Finished epoch 50 of 50; error is 0.21702845860272646" ] }, { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 11min 30s, sys: 2min 33s, total: 14min 4s\n", "Wall time: 2min 26s\n" ] } ], "source": [ "%time _ = torch_rnn.fit(X_rnn_train, y_rnn_train)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "torch_rnn_dev_predictions = torch_rnn.predict(X_rnn_dev)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " negative 0.58 0.61 0.59 428\n", " neutral 0.21 0.17 0.19 229\n", " positive 0.59 0.61 0.60 444\n", "\n", " micro avg 0.52 0.52 0.52 1101\n", " macro avg 0.46 0.46 0.46 1101\n", "weighted avg 0.51 0.52 0.51 1101\n", "\n" ] } ], "source": [ "print(classification_report(y_rnn_dev, torch_rnn_dev_predictions))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### TensorFlow implementation\n", "\n", "This has a very similar interface to the above implementations. It's generally faster than both of them, but you might find TensorFlow to be more challenging when it comes to debugging new architectures." 
] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "tf_rnn = TfRNNClassifier(\n", " sst_train_vocab,\n", " embedding=None,\n", " embed_dim=50,\n", " hidden_dim=50,\n", " hidden_activation=tf.nn.tanh,\n", " cell_class=tf.nn.rnn_cell.LSTMCell,\n", " train_embedding=True,\n", " max_iter=50,\n", " eta=0.05)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /Applications/anaconda3/envs/nlu/lib/python3.7/site-packages/tensorflow/python/ops/losses/losses_impl.py:209: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Use tf.cast instead.\n", "INFO:tensorflow:loss = 1.0981584, step = 1\n", "INFO:tensorflow:loss = 0.02402336, step = 101 (27.944 sec)\n", "INFO:tensorflow:loss = 0.013917241, step = 201 (27.987 sec)\n", "INFO:tensorflow:loss = 0.0017507431, step = 301 (27.898 sec)\n", "INFO:tensorflow:loss = 0.00930609, step = 401 (27.060 sec)\n", "INFO:tensorflow:Loss for final step: 0.038237844.\n", "CPU times: user 4min 26s, sys: 1min 12s, total: 5min 39s\n", "Wall time: 1min 56s\n" ] } ], "source": [ "%time _ = tf_rnn.fit(X_rnn_train, y_rnn_train)" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "tf_rnn_dev_predictions = tf_rnn.predict(X_rnn_dev)" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " negative 0.61 0.60 0.60 428\n", " neutral 0.25 0.23 0.24 229\n", " positive 0.62 0.65 0.63 444\n", "\n", " micro avg 0.54 0.54 0.54 1101\n", " macro avg 0.49 0.49 0.49 1101\n", "weighted avg 0.54 0.54 0.54 1101\n", "\n" ] } ], "source": [ "print(classification_report(y_rnn_dev, tf_rnn_dev_predictions))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 
Pretrained embeddings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With `embedding=None`, `RNNClassifier`, `TorchRNNClassifier` and `TfRNNClassifier` create random embeddings in which the values are drawn from a uniform distribution with bounds `[-1, 1)`. You can also pass in an embedding, as long as you make sure it has the right vocabulary. The utility `utils.create_pretrained_embedding` will help with that:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "glove_embedding, sst_glove_vocab = utils.create_pretrained_embedding(\n", " glove_lookup, sst_train_vocab)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's an illustration using `TorchRNNClassifier`:" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "torch_rnn_glove = TorchRNNClassifier(\n", " sst_glove_vocab,\n", " embedding=glove_embedding,\n", " hidden_dim=50,\n", " max_iter=50,\n", " eta=0.05) " ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Finished epoch 50 of 50; error is 3.2841555774211884" ] }, { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 8min 52s, sys: 2min 27s, total: 11min 19s\n", "Wall time: 1min 58s\n" ] } ], "source": [ "%time _ = torch_rnn_glove.fit(X_rnn_train, y_rnn_train)" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "torch_rnn_imdb_dev_predictions = torch_rnn_glove.predict(X_rnn_dev)" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " negative 0.64 0.59 0.62 428\n", " neutral 0.29 0.22 0.25 229\n", " positive 0.62 0.74 0.67 444\n", "\n", " micro avg 0.57 0.57 0.57 1101\n", " macro avg 0.52 0.52 0.51 1101\n", "weighted avg 0.56 0.57 0.56 1101\n", "\n" ] } ], "source": [ 
"print(classification_report(y_rnn_dev, torch_rnn_imdb_dev_predictions))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tree-structured neural networks\n", "\n", "Tree-structured neural networks (TreeNNs) are close relatives of RNN classifiers. (If you tilt your head, you can see the above sequence model as a kind of tree.) The TreeNNs we explore here are the simplest possible and actually have many fewer parameters than RNNs. Here's a summary:\n", "\n", "\n", "\n", "The crucial property of these networks is the way they employ recursion: the representation of a parent node $p$ has the same dimensionality as the word representations, allowing seamless repeated application of the central combination function:\n", "\n", "$$p = \\tanh([x_{L};x_{R}]W_{wh} + b)$$\n", "\n", "Here, $[x_{L};x_{R}]$ is the concatenation of the left and right child representations, and $p$ is the resulting parent node, which can then be a child node in a higher subtree.\n", "\n", "When we reach the root node $h_{r}$ of the tree, we apply a softmax classifier using that top node's representation:\n", "\n", "$$y = \\textbf{softmax}(h_{r}W_{hy} + b)$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### TreeNN dataset preparation\n", "\n", "This is the only model under consideration here that makes use of the tree structures in the SST:" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "X_tree_train, _ = sst.build_tree_dataset(SST_HOME, sst.train_reader)" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "Tree('positive', [Tree('neutral', [Tree('neutral', ['The']), Tree('neutral', ['Rock'])]), Tree('positive', [Tree('positive', [Tree('neutral', ['is']), Tree('positive', [Tree('neutral', ['destined']), Tree('neutral', [Tree('neutral', [Tree('neutral', [Tree('neutral', [Tree('neutral', ['to']), Tree('neutral', [Tree('neutral', ['be']), 
Tree('neutral', [Tree('neutral', ['the']), Tree('neutral', [Tree('neutral', ['21st']), Tree('neutral', [Tree('neutral', [Tree('neutral', ['Century']), Tree('neutral', [\"'s\"])]), Tree('neutral', [Tree('positive', ['new']), Tree('neutral', [Tree('neutral', ['``']), Tree('neutral', ['Conan'])])])])])])])]), Tree('neutral', [\"''\"])]), Tree('neutral', ['and'])]), Tree('positive', [Tree('neutral', ['that']), Tree('positive', [Tree('neutral', ['he']), Tree('positive', [Tree('neutral', [\"'s\"]), Tree('positive', [Tree('neutral', ['going']), Tree('positive', [Tree('neutral', ['to']), Tree('positive', [Tree('positive', [Tree('neutral', ['make']), Tree('positive', [Tree('positive', [Tree('neutral', ['a']), Tree('positive', ['splash'])]), Tree('neutral', [Tree('neutral', ['even']), Tree('positive', ['greater'])])])]), Tree('neutral', [Tree('neutral', ['than']), Tree('neutral', [Tree('neutral', [Tree('neutral', [Tree('neutral', [Tree('negative', [Tree('neutral', ['Arnold']), Tree('neutral', ['Schwarzenegger'])]), Tree('neutral', [','])]), Tree('neutral', [Tree('neutral', ['Jean-Claud']), Tree('neutral', [Tree('neutral', ['Van']), Tree('neutral', ['Damme'])])])]), Tree('neutral', ['or'])]), Tree('neutral', [Tree('neutral', ['Steven']), Tree('neutral', ['Segal'])])])])])])])])])])])])]), Tree('neutral', ['.'])])])" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_tree_train[0]" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "X_tree_dev, y_tree_dev = sst.build_tree_dataset(\n", " SST_HOME, sst.dev_reader, class_func=sst.ternary_class_func)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Pure NumPy TreeNN implementation\n", "\n", "`TreeNN` is a pure NumPy implementation of this model. It should be regarded as a baseline for models of this form. The original SST paper includes evaluations of a wide range of models in this family." 
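, "\n", "\n", "When reading the implementation, it can help to keep the shapes in view. As a small worked check of what the equations above imply (this is not a transcription of the code; $d$ for the embedding dimensionality and $k$ for the number of classes are symbols introduced here for illustration):\n", "\n", "$$\\begin{align*}\n", "[x_{L};x_{R}] &\\in \\mathbb{R}^{2d} & W_{wh} &\\in \\mathbb{R}^{2d \\times d} & p &\\in \\mathbb{R}^{d} \\\\\n", "h_{r} &\\in \\mathbb{R}^{d} & W_{hy} &\\in \\mathbb{R}^{d \\times k} & y &\\in \\mathbb{R}^{k}\n", "\\end{align*}$$\n", "\n", "Because $p$ has the same dimensionality as each child, it can serve directly as $x_{L}$ or $x_{R}$ one level up. Assuming $d = 50$ and the three ternary classes, $W_{wh}$ would be a $100 \\times 50$ matrix and $W_{hy}$ a $50 \\times 3$ matrix."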
] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "tree_nn_glove = TreeNN(\n", " sst_glove_vocab,\n", " embedding=glove_embedding,\n", " embed_dim=None, # Ignored when embedding is not `None`\n", " max_iter=10,\n", " eta=0.05) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `fit` method of this model is unusual in that it takes only a list of trees as its argument. It is assumed that the label on the root node of each tree (`tree.label()`) is its class label." ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "scrolled": true }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Finished epoch 10 of 10; error is 9.031078368143927" ] }, { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 27min 40s, sys: 4.74 s, total: 27min 45s\n", "Wall time: 7min 9s\n" ] } ], "source": [ "%time _ = tree_nn_glove.fit(X_tree_train)" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "tree_glove_dev_predictions = tree_nn_glove.predict(X_tree_dev)" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " negative 0.39 0.20 0.27 428\n", " neutral 0.20 0.39 0.27 229\n", " positive 0.40 0.39 0.39 444\n", "\n", " micro avg 0.32 0.32 0.32 1101\n", " macro avg 0.33 0.33 0.31 1101\n", "weighted avg 0.35 0.32 0.32 1101\n", "\n" ] } ], "source": [ "print(classification_report(y_tree_dev, tree_glove_dev_predictions))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Torch TreeNN implementation" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [ "torch_tree_nn_glove = TorchTreeNN(\n", " sst_glove_vocab,\n", " embedding=glove_embedding,\n", " embed_dim=50,\n", " max_iter=10,\n", " eta=0.05)" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, 
"outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Finished epoch 10 of 10; error is 93838.61479759216" ] }, { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1h 30s, sys: 11min 14s, total: 1h 11min 45s\n", "Wall time: 10min 32s\n" ] } ], "source": [ "%time _ = torch_tree_nn_glove.fit(X_tree_train)" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "torch_tree_glove_dev_predictions = torch_tree_nn_glove.predict(X_tree_dev)" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " negative 0.36 0.25 0.30 428\n", " neutral 0.50 0.01 0.03 229\n", " positive 0.40 0.73 0.52 444\n", "\n", " micro avg 0.39 0.39 0.39 1101\n", " macro avg 0.42 0.33 0.28 1101\n", "weighted avg 0.41 0.39 0.33 1101\n", "\n" ] } ], "source": [ "print(classification_report(y_tree_dev, torch_tree_glove_dev_predictions))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Subtree supervision\n", "\n", "We've so far ignored one of the most exciting aspects of the SST: it has sentiment labels on every constituent from the root down to the lexical nodes. \n", "\n", "It is fairly easy to extend `TorchTreeNN` to learn from these additional labels. The key change is that the recursive interpretation function has to gather all of the node representations and their true labels and pass these to the loss function:\n", "\n", "\n", "\n", "This model is implemented in `torch_subtree_nn.py`, which uses `TorchTreeNN` and `TorchTreeNNModel` to create this variant. This version should also help pave the way to other subclasses of `TorchTreeNN` that you might want to build." 
] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" }, "widgets": { "state": {}, "version": "1.1.2" } }, "nbformat": 4, "nbformat_minor": 1 }