# Supervised sentiment: dense feature representations and neural networks

In [1]:
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2022"

## Contents

1. [Overview](#Overview)
1. [Set-up](#Set-up)
1. [Distributed representations as features](#Distributed-representations-as-features)
    1. [GloVe inputs](#GloVe-inputs)
    1. [Yelp representations](#Yelp-representations)
    1. [Remarks on this approach](#Remarks-on-this-approach)
1. [RNN classifiers](#RNN-classifiers)
    1. [RNN dataset preparation](#RNN-dataset-preparation)
    1. [Vocabulary for the embedding](#Vocabulary-for-the-embedding)
    1. [PyTorch RNN classifier](#PyTorch-RNN-classifier)
    1. [Pretrained embeddings](#Pretrained-embeddings)
    1. [RNN hyperparameter tuning experiment](#RNN-hyperparameter-tuning-experiment)
1. [The VecAvg baseline from Socher et al. 2013](#The-VecAvg-baseline-from-Socher-et-al.-2013)
    1. [Defining the model](#Defining-the-model)
    1. [VecAvg hyperparameter tuning experiment](#VecAvg-hyperparameter-tuning-experiment)

## Overview

This notebook defines and explores __vector averaging__ and __recurrent neural network (RNN) classifiers__ for the Stanford Sentiment Treebank. 

These approaches make their predictions based on comprehensive representations of the examples: 

* For the vector averaging models, each word is modeled, but we assume that words combine via a simple function that is insensitive to their order or constituent structure.
* For the RNN, each word is again modeled, and we also model the sequential relationships between words.

These models contrast with the ones explored in [the previous notebook](sst_02_hand_built_features.ipynb), which make predictions based on more partial, potentially idiosyncratic information extracted from the examples.

## Set-up

See [the first notebook in this unit](sst_01_overview.ipynb#Set-up) for set-up instructions.

In [2]:
from collections import Counter
import numpy as np
import os
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
import torch
import torch.nn as nn

from torch_rnn_classifier import TorchRNNClassifier
import sst
import vsm
import utils

In [3]:
utils.fix_random_seeds()

In [4]:
DATA_HOME = 'data'

GLOVE_HOME = os.path.join(DATA_HOME, 'glove.6B')

VSMDATA_HOME = os.path.join(DATA_HOME, 'vsmdata')

SST_HOME = os.path.join(DATA_HOME, 'sentiment')

## Distributed representations as features

As a first step in the direction of neural networks for sentiment, we can connect with our previous unit on distributed representations. Arguably, more than any specific model architecture, this is the major innovation of deep learning: __rather than designing feature functions by hand, we use dense, distributed representations, often derived from unsupervised models__.

"distreps-as-features.png"

Our model will just be `LogisticRegression`, and we'll continue with the experiment framework from the previous notebook. Here is `fit_softmax_classifier` again:

In [5]:
def fit_softmax_classifier(X, y):
    mod = LogisticRegression(
        fit_intercept=True,
        solver='liblinear',
        multi_class='auto')
    mod.fit(X, y)
    return mod

### GloVe inputs

To illustrate this process, we'll use the general purpose GloVe representations released by the GloVe team, at 300d:

In [6]:
glove_lookup = utils.glove2dict(
    os.path.join(GLOVE_HOME, 'glove.6B.300d.txt'))

In [7]:
def vsm_phi(text, lookup, np_func=np.mean):
    """Represent `text` as a combination of the vectors of its words.

    Parameters
    ----------
    text : str

    lookup : dict
        From words to vectors.

    np_func : function (default: np.mean)
        A numpy matrix operation that can be applied columnwise,
        like `np.mean`, `np.sum`, or `np.prod`. The requirement is that
        the function take `axis=0` as one of its arguments (to ensure
        columnwise combination) and that it return a vector of a
        fixed length, no matter what the size of the text is.

    Returns
    -------
    np.array, with the same dimensionality as the vectors in `lookup`

    """
    allvecs = np.array([lookup[w] for w in text.split() if w in lookup])
    if len(allvecs) == 0:
        # No word of `text` is in `lookup`: fall back to a zero vector.
        dim = len(next(iter(lookup.values())))
        feats = np.zeros(dim)
    else:
        feats = np_func(allvecs, axis=0)
    return feats

In [8]:
def glove_phi(text, np_func=np.mean):
    return vsm_phi(text, glove_lookup, np_func=np_func)
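
To make the behavior of `vsm_phi` concrete, here is a small self-contained sketch with a toy three-dimensional lookup. The `toy_lookup` vectors are invented for illustration (they are not GloVe values), and the function body is a condensed copy of the definition above so that the block runs on its own.

```python
import numpy as np

def vsm_phi(text, lookup, np_func=np.mean):
    # Condensed copy of the function defined above.
    allvecs = np.array([lookup[w] for w in text.split() if w in lookup])
    if len(allvecs) == 0:
        dim = len(next(iter(lookup.values())))
        return np.zeros(dim)
    return np_func(allvecs, axis=0)

# Hypothetical toy vectors, for illustration only.
toy_lookup = {
    "good": np.array([1.0, 0.0, 2.0]),
    "movie": np.array([3.0, 4.0, 0.0])}

# "!" is not in the lookup, so it is silently skipped.
mean_feats = vsm_phi("good movie !", toy_lookup)
sum_feats = vsm_phi("good movie !", toy_lookup, np_func=np.sum)

# No known words at all: we get a zero vector of the right size.
oov_feats = vsm_phi("zzz qqq", toy_lookup)

print(mean_feats, sum_feats, oov_feats)
```

Whatever the length of the input text, the output always has the dimensionality of the lookup vectors, which is what lets us feed these features directly to `LogisticRegression`.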

In [9]:
%%time
_ = sst.experiment(
    sst.train_reader(SST_HOME),
    glove_phi,
    fit_softmax_classifier,
    assess_dataframes=sst.dev_reader(SST_HOME),
    vectorize=False)  # Tell `experiment` that we already have our feature vectors.

              precision    recall  f1-score   support

    negative      0.613     0.724     0.664       428
     neutral      0.400     0.044     0.079       229
    positive      0.619     0.795     0.696       444

    accuracy                          0.611      1101
   macro avg      0.544     0.521     0.480      1101
weighted avg      0.571     0.611     0.555      1101

CPU times: user 2.48 s, sys: 75.5 ms, total: 2.56 s
Wall time: 2.5 s
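
As a quick check on how to read the report above: the macro-averaged F1 is the unweighted mean of the per-class F1 scores, whereas the weighted average weights each class by its support. The sketch below recomputes both from the rounded per-class values printed above, so small rounding differences would be possible in general. It shows how the very low neutral F1 pulls the macro average well below the accuracy.

```python
# Per-class F1 scores and supports, copied from the report above.
f1 = {"negative": 0.664, "neutral": 0.079, "positive": 0.696}
support = {"negative": 428, "neutral": 229, "positive": 444}

# Macro average: every class counts equally.
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: every class counts in proportion to its support.
total = sum(support.values())
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total

print("macro F1: {:.3f}".format(macro_f1))
print("weighted F1: {:.3f}".format(weighted_f1))
```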


### Yelp representations

Our Yelp VSMs seem pretty well-attuned to the SST, so we might think that they can do even better than the general-purpose GloVe inputs. Here is a quick assessment of that idea, building on ideas we developed in the unit on VSMs.

In [10]:
yelp20 = pd.read_csv(
    os.path.join(VSMDATA_HOME, 'yelp_window20-flat.csv.gz'), index_col=0)

In [11]:
yelp20_ppmi = vsm.pmi(yelp20, positive=False)

In [12]:
yelp20_ppmi_svd = vsm.lsa(yelp20_ppmi, k=300)

In [13]:
yelp_lookup = dict(zip(yelp20_ppmi_svd.index, yelp20_ppmi_svd.values))

In [14]:
def yelp_phi(text, np_func=np.mean):
    return vsm_phi(text, yelp_lookup, np_func=np_func)

In [15]:
%%time
_ = sst.experiment(
    sst.train_reader(SST_HOME),
    yelp_phi,
    fit_softmax_classifier,
    assess_dataframes=sst.dev_reader(SST_HOME),
    vectorize=False)  # Tell `experiment` that we already have our feature vectors.

              precision    recall  f1-score   support

    negative      0.593     0.673     0.630       428
     neutral      0.423     0.048     0.086       229
    positive      0.560     0.743     0.639       444

    accuracy                          0.571      1101
   macro avg      0.525     0.488     0.452      1101
weighted avg      0.544     0.571     0.521      1101

CPU times: user 4.62 s, sys: 340 ms, total: 4.96 s
Wall time: 4.4 s


### Remarks on this approach

* Recall that our `unigrams_phi` created feature representations with over 16K dimensions and got about 0.52 with no hyperparameter tuning.

* The above models' feature representations have only 300 dimensions. While they are struggling with the neutral category, we can probably overcome this with some additional attention to the representations and to our strategies for optimization.

* The promise of the Mittens model of [Dingwall and Potts 2018](https://arxiv.org/abs/1803.09901) is that we can use GloVe itself to update the general-purpose information in the 'glove.6B' vectors with specialized information from one of these Yelp count matrices. That might be worth trying; the `mittens` package (`pip install mittens`) already implements this!

* That said, just averaging all the word representations is pretty unappealing linguistically. There's no doubt that we're losing a lot of valuable information in doing this. The models we turn to now can be seen as addressing this shortcoming while retaining the insight that our distributed representations are valuable for this task.

* We'll return to these ideas below, when we consider [the VecAvg baseline from Socher et al. 2013](#The-VecAvg-baseline-from-Socher-et-al.-2013). That model also posits a simple, fixed combination function (averaging). However, it begins with randomly initialized representations and updates them as part of training.
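
The linguistic cost of averaging mentioned in the remarks above can be seen in a tiny example: two sentences with the same words in a different order receive exactly the same representation. The vectors in `toy_lookup` and the helper `average_vectors` are hypothetical, chosen only to illustrate the point.

```python
import numpy as np

# Hypothetical toy vectors, for illustration only.
toy_lookup = {
    "dog": np.array([1.0, 0.0]),
    "bites": np.array([0.0, 1.0]),
    "man": np.array([2.0, 2.0])}

def average_vectors(text, lookup):
    # Order-insensitive combination: the mean of the word vectors.
    return np.mean([lookup[w] for w in text.split()], axis=0)

v1 = average_vectors("dog bites man", toy_lookup)
v2 = average_vectors("man bites dog", toy_lookup)

# The two sentences mean different things but get the same features.
same = bool(np.allclose(v1, v2))
print(v1, v2, same)
```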

## RNN classifiers

A recurrent neural network (RNN) is any deep learning model that processes its inputs sequentially. There are many variations on this theme. The one that we use here is an __RNN classifier__.



The version of the model that is implemented in `np_rnn_classifier.py` corresponds exactly to the above diagram. We can express it mathematically as follows:

$$\begin{align*}
h_{t} &= \tanh(x_{t}W_{xh} + h_{t-1}W_{hh}) \\
y &= \textbf{softmax}(h_{n}W_{hy} + b_y)
\end{align*}$$

where $1 \leqslant t \leqslant n$. The first line defines the recurrence: each hidden state $h_{t}$ is defined by the input $x_{t}$ and the previous hidden state $h_{t-1}$, together with weight matrices $W_{xh}$ and $W_{hh}$, which are used at all timesteps. As indicated in the above diagram, the sequence of hidden states is padded with an initial state $h_{0}$. In our implementations, this is always an all $0$ vector, but it can be initialized in more sophisticated ways (some of which we will explore in our units on natural language inference and grounded natural language generation).
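
The two equations above can be sketched directly in NumPy. This is a minimal illustration of the forward pass only, not the code in `np_rnn_classifier.py`: the dimensions, the random weights, and the function names `softmax` and `rnn_forward` are assumptions made for the example.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1d vector.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def rnn_forward(X, W_xh, W_hh, W_hy, b_y):
    # h_0 is the all-zero initial hidden state.
    h = np.zeros(W_hh.shape[0])
    # Recurrence: the same W_xh and W_hh are used at every timestep.
    for x_t in X:
        h = np.tanh(x_t @ W_xh + h @ W_hh)
    # Only the final hidden state h_n feeds the classifier.
    return softmax(h @ W_hy + b_y)

# Hypothetical sizes and random parameters, for illustration only.
rng = np.random.default_rng(0)
n, embed_dim, hidden_dim, n_classes = 6, 4, 5, 3
X = rng.normal(size=(n, embed_dim))
W_xh = 0.5 * rng.normal(size=(embed_dim, hidden_dim))
W_hh = 0.5 * rng.normal(size=(hidden_dim, hidden_dim))
W_hy = 0.5 * rng.normal(size=(hidden_dim, n_classes))
b_y = np.zeros(n_classes)

y = rnn_forward(X, W_xh, W_hh, W_hy, b_y)
# Reversing the sequence changes the prediction: the model is order-sensitive.
y_reversed = rnn_forward(X[::-1], W_xh, W_hh, W_hy, b_y)
print(y, y_reversed)
```

Unlike the averaging approach, feeding the same vectors in a different order generally yields a different output distribution.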

The model in `torch_rnn_classifier.py` expands on the above and allows for more flexibility:

$$\begin{align*}
h_{t} &= \text{RNN}(x_{t}, h_{t-1}) \\
h &= f(h_{n}W_{hh} + b_{h}) \\
y &= \textbf{softmax}(hW_{hy} + b_y)
\end{align*}$$

Here, $\text{RNN}$ stands for all the parameters of the recurrent part of the model. This will depend on the choice one makes for `rnn_cell_class`; options include `nn.RNN`, `nn.LSTM`, and `nn.GRU`. In addition, the classifier part includes a hidden layer (middle row), and the user can decide on the activation function $f$ to use there (parameter: `classifier_activation`).

This is a potential gain over our average-vectors baseline, in that it processes each word separately, in the context of those that came before it. Thus, not only is this sensitive to word order, but the hidden representations create the potential to encode how the preceding context for a word affects its interpretation.

The downside of this, of course, is that this model is much more difficult to set up and optimize. Let's dive into those details.

### RNN dataset preparation

SST contains trees, but the RNN processes just the sequence of leaf nodes. The function `sst.build_rnn_dataset` creates datasets in this format:

In [16]:
X_rnn_train, y_rnn_train = sst.build_rnn_dataset(sst.train_reader(SST_HOME))

`X_rnn_train` is a list of lists of words, so each of its members is a list of words. Here's a look at the start of the first:

In [17]:
X_rnn_train[0][: 6]

['The', 'Rock', 'is', 'destined', 'to', 'be']

Because this is a classifier, `y_rnn_train` is just a list of labels, one per example:

In [18]:
y_rnn_train[0]

'positive'

For experiments, let's build a `dev` dataset as well:

In [19]:
X_rnn_dev, y_rnn_dev = sst.build_rnn_dataset(sst.dev_reader(SST_HOME))

### Vocabulary for the embedding

The first delicate issue we need to address is the vocabulary for our model:

* As indicated in the figure above, the first thing we do when processing an example is look up the words in an embedding (a VSM), which has to have a fixed dimensionality. 

* We can use our training data to specify the vocabulary for this embedding; at prediction time, though, we will inevitably encounter words we haven't seen before. 

* The convention we adopt here is to map them to an `$UNK` token that is in our pre-specified vocabulary.

* At the same time, we might want to collapse infrequent tokens into `$UNK` to make optimization easier and to try to create reasonable representations for words that we have to map to `$UNK` at test time.

In `utils`, the function `get_vocab` will help you specify a vocabulary. It will let you choose a vocabulary by optionally specifying `mincount` or `n_words`, and it will ensure that `$UNK` is included.
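
The vocabulary conventions described above can be sketched in plain Python. The functions `toy_get_vocab` and `map_to_unk` below are hypothetical stand-ins written for this illustration; they are not the implementation in `utils`, and the parameter semantics are assumptions based on the description above.

```python
from collections import Counter

def toy_get_vocab(X, n_words=None, mincount=1, unk="$UNK"):
    # Count every token across all examples.
    counts = Counter(w for ex in X for w in ex)
    # Optionally keep only the `n_words` most frequent tokens.
    items = counts.most_common(n_words)
    # Optionally drop tokens that occur fewer than `mincount` times.
    vocab = {w for w, c in items if c >= mincount}
    # Always include the unknown-word token.
    vocab.add(unk)
    return sorted(vocab)

def map_to_unk(example, vocab, unk="$UNK"):
    # At prediction time, words outside the vocabulary become `$UNK`.
    vocab = set(vocab)
    return [w if w in vocab else unk for w in example]

toy_X = [
    ["a", "great", "movie"],
    ["a", "dull", "movie"],
    ["great", "fun"]]

toy_vocab = toy_get_vocab(toy_X, mincount=2)
mapped = map_to_unk(["a", "truly", "great", "movie"], toy_vocab)
print(toy_vocab)
print(mapped)
```

With `mincount=2`, the words that occur only once in the toy data are excluded from the vocabulary, so they would be treated as `$UNK` during training as well as at prediction time.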

In [20]:
sst_full_train_vocab = utils.get_vocab(X_rnn_train)

In [21]:
print("sst_full_train_vocab has {:,} items".format(len(sst_full_train_vocab)))

sst_full_train_vocab has 18,279 items


This frankly seems too big relative to our dataset size. Let's restrict to just words that occur at least twice:

In [22]:
sst_train_vocab = utils.get_vocab(X_rnn_train, mincount=2)

In [23]:
print("sst_train_vocab has {:,} items".format(len(sst_train_vocab)))

sst_train_vocab has 8,736 items


### PyTorch RNN classifier

Here and throughout, we'll rely on `early_stopping=True` to try to find the optimal time to stop optimization. This behavior can be further refined by setting different values of `validation_fraction`, `n_iter_no_change`, and `tol`. For additional discussion, see [the section on model convergence in the evaluation methods notebook](#Assessing-models-without-convergence).
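
To give a rough sense of how `n_iter_no_change` and `tol` interact, here is a simplified sketch of an early-stopping rule of this kind. The function `early_stopping_epoch` is hypothetical and only mirrors the wording of the training log ("did not improve by tol ... for more than 10 epochs"); the actual logic in the course code may differ in its details.

```python
def early_stopping_epoch(val_scores, n_iter_no_change=10, tol=1e-5):
    """Return the 1-based epoch at which training would stop, or None
    if the stopping condition is never met."""
    best = float("-inf")
    no_improvement = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best + tol:
            # A real improvement: remember it and reset the counter.
            best = score
            no_improvement = 0
        else:
            no_improvement += 1
        # Stop once we have gone more than `n_iter_no_change` epochs
        # without improving on the best score by at least `tol`.
        if no_improvement > n_iter_no_change:
            return epoch
    return None

# Hypothetical validation scores that plateau after epoch 2.
toy_scores = [0.50, 0.60] + [0.60] * 11
stop_default = early_stopping_epoch(toy_scores)
stop_patient = early_stopping_epoch(toy_scores, n_iter_no_change=20)
print(stop_default, stop_patient)
```

Larger values of `n_iter_no_change` make the procedure more patient, at the cost of additional training time.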

In [24]:
rnn = TorchRNNClassifier(
    sst_train_vocab,
    early_stopping=True)

In [25]:
%time _ = rnn.fit(X_rnn_train, y_rnn_train)

Stopping after epoch 58. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 0.2886183727532625

CPU times: user 38.9 s, sys: 24.6 s, total: 1min 3s
Wall time: 19.2 s


In [26]:
rnn_dev_preds = rnn.predict(X_rnn_dev)

In [27]:
print(classification_report(y_rnn_dev, rnn_dev_preds, digits=3))

              precision    recall  f1-score   support

    negative      0.575     0.614     0.594       428
     neutral      0.230     0.223     0.226       229
    positive      0.637     0.606     0.621       444

    accuracy                          0.530      1101
   macro avg      0.481     0.481     0.481      1101
weighted avg      0.529     0.530     0.529      1101



The above numbers are just a starting point. Let's try to improve on them by using pretrained embeddings and then by exploring a range of hyperparameter options.

### Pretrained embeddings

With `embedding=None`, `TorchRNNClassifier` (and its counterpart in `np_rnn_classifier.py`) create random embeddings. You can also pass in an embedding, as long as you make sure it has the right vocabulary. The utility `utils.create_pretrained_embedding` will help with that:

In [28]:
glove_embedding, sst_glove_vocab = utils.create_pretrained_embedding(
    glove_lookup, sst_train_vocab)
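
Conceptually, building such an embedding means creating one row per vocabulary item, using the pretrained vector where one exists and some fallback where it does not. The sketch below is a hypothetical illustration of that idea; `toy_pretrained_embedding` is not the function in `utils`, and the choice of small random vectors for missing words is an assumption made for this example.

```python
import numpy as np

def toy_pretrained_embedding(lookup, vocab, unk="$UNK", seed=0):
    rng = np.random.default_rng(seed)
    dim = len(next(iter(lookup.values())))
    # Work on a copy so that the caller's list is left untouched.
    vocab = list(vocab)
    if unk not in vocab:
        vocab.append(unk)
    rows = []
    for w in vocab:
        if w in lookup:
            # Pretrained vector available: use it as-is.
            rows.append(lookup[w])
        else:
            # Assumed fallback: a small random vector.
            rows.append(rng.uniform(-0.5, 0.5, size=dim))
    return np.array(rows), vocab

# Hypothetical toy data, for illustration only.
toy_lookup = {
    "good": np.array([1.0, 0.0, 2.0]),
    "movie": np.array([3.0, 4.0, 0.0])}
toy_vocab = ["good", "movie", "zzz"]

emb, vocab = toy_pretrained_embedding(toy_lookup, toy_vocab)
print(emb.shape, vocab)
```

The key invariant is that row `i` of the embedding corresponds to item `i` of the returned vocabulary, which is why the embedding and vocabulary should always be passed to the model together.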

Here's an illustration using `TorchRNNClassifier`:

In [29]:
rnn_glove = TorchRNNClassifier(
    sst_glove_vocab,
    embedding=glove_embedding,
    early_stopping=True)

In [30]:
%time _ = rnn_glove.fit(X_rnn_train, y_rnn_train)

Stopping after epoch 27. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 0.3226494677364826

CPU times: user 13.1 s, sys: 9.15 s, total: 22.2 s
Wall time: 5.63 s


In [31]:
rnn_glove_dev_preds = rnn_glove.predict(X_rnn_dev)

In [32]:
print(classification_report(y_rnn_dev, rnn_glove_dev_preds, digits=3))

              precision    recall  f1-score   support

    negative      0.676     0.664     0.670       428
     neutral      0.307     0.323     0.315       229
    positive      0.700     0.694     0.697       444

    accuracy                          0.605      1101
   macro avg      0.561     0.560     0.561      1101
weighted avg      0.609     0.605     0.607      1101



It looks like pretrained representations give us a notable boost, but we're still below most of the simpler models explored in [the previous notebook](sst_02_hand_built_features.ipynb).

### RNN hyperparameter tuning experiment

As we saw in [the previous notebook](sst_02_hand_built_features.ipynb), we're not really done until we've done some hyperparameter search. So let's round out this section by cross-validating the RNN that uses GloVe embeddings, to see if we can improve on the default-parameters model we evaluated just above. For this, we'll use `sst.experiment`:

In [33]:
def simple_leaves_phi(text):
    return text.split()

In [34]:
def fit_rnn_with_hyperparameter_search(X, y):
    basemod = TorchRNNClassifier(
        sst_glove_vocab,  # The vocabulary returned with `glove_embedding`.
        embedding=glove_embedding,
        batch_size=25,  # Inspired by comments in the paper.
        bidirectional=True,
        early_stopping=True)

    # There are lots of other parameters and values we could
    # explore, but this is at least a solid start:
    param_grid = {
        'embed_dim': [50, 75, 100],
        'hidden_dim': [50, 75, 100],
        'eta': [0.001, 0.01]}

    bestmod = utils.fit_classifier_with_hyperparameter_search(
        X, y, basemod, cv=3, param_grid=param_grid)

    return bestmod
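
To get a feel for the cost of this search, we can enumerate the grid ourselves. The sketch below assumes that the search is exhaustive over `param_grid` and that each setting is trained once per cross-validation fold; the internals of `utils.fit_classifier_with_hyperparameter_search` are not shown here, so treat the counts as an estimate rather than a description of that function.

```python
from itertools import product

param_grid = {
    'embed_dim': [50, 75, 100],
    'hidden_dim': [50, 75, 100],
    'eta': [0.001, 0.01]}

# Every combination of one value per parameter is one setting.
names = sorted(param_grid)
settings = [
    dict(zip(names, values))
    for values in product(*(param_grid[name] for name in names))]

n_settings = len(settings)
cv = 3
# Assumption: one model is trained per setting per fold.
n_fits = n_settings * cv

print("{} settings x {} folds = {} model fits".format(n_settings, cv, n_fits))
```

Since each fit is a full RNN training run with early stopping, even this modest grid is expensive, which is worth keeping in mind before widening it.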

In [35]:
%%time
rnn_experiment_xval = sst.experiment(
    sst.train_reader(SST_HOME),
    simple_leaves_phi,
    fit_rnn_with_hyperparameter_search,
    assess_dataframes=sst.dev_reader(SST_HOME),
    vectorize=False)

Stopping after epoch 14. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 2.7038347354674672

Best params: {'embed_dim': 75, 'eta': 0.001, 'hidden_dim': 100}
Best score: 0.546
              precision    recall  f1-score   support

    negative      0.699     0.666     0.682       428
     neutral      0.299     0.240     0.266       229
    positive      0.662     0.759     0.707       444

    accuracy                          0.615      1101
   macro avg      0.553     0.555     0.552      1101
weighted avg      0.601     0.615     0.606      1101

CPU times: user 39min 55s, sys: 39.9 s, total: 40min 35s
Wall time: 39min 53s


This model looks quite competitive with the simpler models we explored previously, and perhaps an even wider hyperparameter search would lead to additional improvements. In [finetuning.ipynb](finetuning.ipynb), we look at variants of the above that involve fine-tuning with BERT, and those models achieve even better results, which further highlights the value of rich pretraining.

## The VecAvg baseline from Socher et al. 2013

One of the baseline models from [Socher et al., Table 1](http://www.aclweb.org/anthology/D/D13/D13-1170.pdf) is __VecAvg__. This is like the model we explored above under the heading of [Distributed representations as features](#Distributed-representations-as-features), but it uses a random initial embedding that is updated as part of optimization. Another perspective on it is that it is like the RNN we just evaluated, but with the RNN parameters replaced by averaging. 

In Socher et al. 2013, this model does reasonably well, scoring 80.1 on the root-only binary problem. In this section, we reimplement it, relying on `TorchRNNClassifier` to handle most of the heavy lifting, and we evaluate it with a reasonably wide hyperparameter search.

### Defining the model

The core model is `TorchVecAvgModel`, which just looks up embeddings, averages them, and feeds the result to a classifier layer:

In [36]:
class TorchVecAvgModel(nn.Module):
    def __init__(self, vocab_size, output_dim, device, embed_dim=50):
        super().__init__()
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim
        self.output_dim = output_dim
        self.device = device
        self.embedding = nn.Embedding(self.vocab_size, self.embed_dim)
        self.classifier_layer = nn.Linear(self.embed_dim, self.output_dim)

    def forward(self, X, seq_lengths):
        embs = self.embedding(X)
        # Mask based on the **true** lengths:
        mask = [torch.ones(l, self.embed_dim) for l in seq_lengths]
        mask = torch.nn.utils.rnn.pad_sequence(mask, batch_first=True)
        mask = mask.to(self.device)
        # True average:
        mu = (embs * mask).sum(axis=1) / seq_lengths.unsqueeze(1)
        # Classifier:
        logits = self.classifier_layer(mu)
        return logits
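
The masking step matters because padded positions would otherwise be counted in the average. The following NumPy sketch illustrates the same idea outside of PyTorch; `masked_average` and the toy arrays are hypothetical and serve only to show why the mask and the division by the true lengths are both needed.

```python
import numpy as np

def masked_average(embs, seq_lengths):
    # embs: (batch, max_len, dim); seq_lengths: (batch,)
    batch, max_len, dim = embs.shape
    positions = np.arange(max_len)[None, :]
    # mask[i, t] is 1.0 for real tokens and 0.0 for padding.
    mask = (positions < seq_lengths[:, None]).astype(embs.dtype)
    summed = (embs * mask[:, :, None]).sum(axis=1)
    # Divide by the true lengths, not by max_len.
    return summed / seq_lengths[:, None]

# Toy batch: the first example has length 2, so its third row stands
# in for whatever vector the padding index happens to map to.
embs = np.array([
    [[1.0, 1.0], [3.0, 5.0], [9.0, 9.0]],
    [[2.0, 4.0], [4.0, 6.0], [6.0, 8.0]]])
seq_lengths = np.array([2, 3])

mu = masked_average(embs, seq_lengths)
naive = embs.mean(axis=1)  # Incorrectly includes the padding row.
print(mu)
print(naive)
```

For the second example, which has no padding, the masked and naive averages agree; for the first, only the masked version is a true average of the real tokens.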

For the main interface, we can just subclass `TorchRNNClassifier` and change the `build_graph` method to use `TorchVecAvgModel`. (For more details on the code and logic here, see the notebook [tutorial_pytorch_models.ipynb](tutorial_pytorch_models.ipynb).)

In [37]:
class TorchVecAvgClassifier(TorchRNNClassifier):

    def build_graph(self):
        return TorchVecAvgModel(
            vocab_size=len(self.vocab),
            output_dim=self.n_classes_,
            device=self.device,
            embed_dim=self.embed_dim)

### VecAvg hyperparameter tuning experiment

Now that we have the model implemented, let's see if we can reproduce Socher et al.'s 80.1 on the binary, root-only version of SST.

In [38]:
train_df = sst.train_reader(SST_HOME)

train_bin_df = train_df[train_df.label != 'neutral']

In [39]:
dev_df = sst.dev_reader(SST_HOME)

dev_bin_df = dev_df[dev_df.label != 'neutral']

In [40]:
test_df = sst.sentiment_reader(os.path.join(SST_HOME, "sst3-test-labeled.csv"))

test_bin_df = test_df[test_df.label != 'neutral']

In [41]:
def fit_vecavg_with_hyperparameter_search(X, y):
    basemod = TorchVecAvgClassifier(
        sst_train_vocab,
        early_stopping=True)

    param_grid = {
        'embed_dim': [50, 100, 200, 300],
        'eta': [0.001, 0.01, 0.05]}

    bestmod = utils.fit_classifier_with_hyperparameter_search(
        X, y, basemod, cv=3, param_grid=param_grid)

    return bestmod

In [42]:
%%time
vecavg_experiment_xval = sst.experiment(
    [train_bin_df, dev_bin_df],
    simple_leaves_phi,
    fit_vecavg_with_hyperparameter_search,
    assess_dataframes=test_bin_df,
    vectorize=False)

Stopping after epoch 13. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 0.021834758925251663

Best params: {'embed_dim': 300, 'eta': 0.05}
Best score: 0.784
              precision    recall  f1-score   support

    negative      0.827     0.779     0.802       912
    positive      0.790     0.836     0.812       909

    accuracy                          0.807      1821
   macro avg      0.808     0.807     0.807      1821
weighted avg      0.808     0.807     0.807      1821

CPU times: user 42min 50s, sys: 28min 13s, total: 1h 11min 3s
Wall time: 17min 48s


Excellent: with a test accuracy of 0.807, it looks like we basically reproduced the number from the paper (80.1).