# Tutorial: Using and extending the course PyTorch models

In [1]:
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Fall 2020"

## Contents

1. [Overview](#Overview)
1. [Set-up](#Set-up)
1. [General optimization choices](#General-optimization-choices)
1. [Classifiers](#Classifiers)
    1. [Softmax classifier](#Softmax-classifier)
    1. [A deeper neural classifier](#A-deeper-neural-classifier)
1. [Regression](#Regression)
    1. [Linear regression](#Linear-regression)
    1. [Deeper Linear Regression](#Deeper-Linear-Regression)
1. [RNN sequence labeling](#RNN-sequence-labeling)

## Overview

This repository contains a number of PyTorch modules designed to support our core content and provide tools for homeworks and bake-offs:

In [2]:
%ls torch*

torch_autoencoder.py torch_rnn_classifier.py
torch_color_describer.py torch_shallow_neural_classifier.py
torch_glove.py torch_tree_nn.py
torch_model_base.py


The goal of the current notebook is to provide some guidance on how you can extend these modules to create original custom systems. Once you get used to how the code is structured, this is sure to be much faster than coding from scratch, and it still allows you a lot of freedom to design new models.

The base class for all the modules is `torch_model_base.TorchModelBase`. The central role of this class is to provide a very full-featured `fit` method. See [General optimization choices](#General-optimization-choices) for an overview of the knobs and levers it provides. The interface is generic enough to accommodate a wide range of tasks.

In what follows, we consider three kinds of extension, aiming to highlight general techniques and code patterns:

* __Classifiers__: subclasses using `torch_shallow_neural_classifier.py`
* __Regressors__: subclasses using `torch_model_base.py`
* __RNN-based models__: subclasses using `torch_rnn_classifier.py`

If you are experienced with PyTorch already, you can probably dive right into this notebook. If not, then I recommend [our PyTorch tutorial notebook](tutorial_pytorch.ipynb) to start.

## Set-up

In [3]:
import nltk
from sklearn.datasets import load_iris, load_boston
from sklearn.metrics import classification_report, r2_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
from sklearn.model_selection import GridSearchCV
import torch
import torch.nn as nn
from torch_model_base import TorchModelBase
from torch_shallow_neural_classifier import TorchShallowNeuralClassifier
from torch_rnn_classifier import TorchRNNDataset, TorchRNNClassifier, TorchRNNModel
import utils

## General optimization choices

`TorchModelBase` has a number of keyword parameters that control how models are optimized:

In [4]:
TorchModelBase().params

['batch_size',
 'max_iter',
 'eta',
 'optimizer_class',
 'l2_strength',
 'gradient_accumulation_steps',
 'max_grad_norm',
 'validation_fraction',
 'early_stopping',
 'n_iter_no_change',
 'warm_start',
 'tol']

For descriptions of what these parameters do, please refer to the docstring for the class.

All of these parameters can be included in hyperparameter optimization runs using tools in `sklearn.model_selection`, as we'll see below.
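
As a rough picture of why a `params` list is enough for this to work, here is a minimal sketch. The class `ToyEstimator` and its method bodies are illustrative assumptions, not the actual `TorchModelBase` code. The point is only that scikit-learn's model-selection tools read and update hyperparameters through `get_params` and `set_params`, and a list of attribute names makes both easy to write.

```python
class ToyEstimator:
    """Hypothetical stand-in for an estimator that tracks its
    hyperparameters in a `params` list."""

    def __init__(self, eta=0.001, max_iter=1000):
        self.eta = eta
        self.max_iter = max_iter
        self.params = ["eta", "max_iter"]

    def get_params(self, deep=True):
        # sklearn calls this to read (and clone) hyperparameters.
        return {name: getattr(self, name) for name in self.params}

    def set_params(self, **params):
        # sklearn calls this to apply each candidate setting.
        for name, value in params.items():
            if name not in self.params:
                raise ValueError(f"Unknown parameter: {name}")
            setattr(self, name, value)
        return self


toy = ToyEstimator()
toy.set_params(eta=0.01)
print(toy.get_params())
```

Under this reading, any attribute a subclass appends to `params` becomes visible to tools like `GridSearchCV`.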

## Classifiers

To create a new classifier, you typically just need to subclass `TorchShallowNeuralClassifier` and write a new `build_graph` method that defines the computation graph. Here we illustrate with some representative examples, using the [Iris plants dataset](https://scikit-learn.org/stable/datasets/index.html#iris-dataset) for evaluations:

In [5]:
def iris_split():
    dataset = load_iris()
    X = dataset.data
    y = dataset.target
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=42)
    return X_train, X_test, y_train, y_test

In [6]:
X_cls_train, X_cls_test, y_cls_train, y_cls_test = iris_split()

### Softmax classifier

For a softmax classifier, we just need to write a simple `build_graph` method:

In [7]:
class TorchSoftmaxClassifier(TorchShallowNeuralClassifier):

    def build_graph(self):
        return nn.Sequential(
            nn.Linear(self.input_dim, self.n_classes_))

Since the data format and optimization process are the same as for `TorchShallowNeuralClassifier`, we needn't do anything beyond this.
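
The graph above returns raw scores (logits) with no softmax layer. The sketch below illustrates why it can still count as a softmax classifier, assuming training uses `nn.CrossEntropyLoss`, which applies log-softmax to the logits internally. The tensors are made-up toy values, not the Iris data.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy batch: 5 examples, 4 features, 3 classes (made-up data).
X = torch.randn(5, 4)
y = torch.tensor([0, 2, 1, 1, 0])

# Same shape of graph as `TorchSoftmaxClassifier.build_graph`.
graph = nn.Sequential(nn.Linear(4, 3))
logits = graph(X)

# `CrossEntropyLoss` takes the logits directly ...
ce = nn.CrossEntropyLoss()(logits, y)

# ... and matches an explicit log-softmax followed by NLL loss.
nll = nn.NLLLoss()(torch.log_softmax(logits, dim=1), y)

print(torch.allclose(ce, nll))
```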

Quick illustration:

In [8]:
sm_mod = TorchSoftmaxClassifier()

sm_mod

TorchSoftmaxClassifier(
	batch_size=1028,
	max_iter=1000,
	eta=0.001,
	optimizer_class=,
	l2_strength=0,
	gradient_accumulation_steps=1,
	max_grad_norm=None,
	validation_fraction=0.1,
	early_stopping=False,
	n_iter_no_change=10,
	warm_start=False,
	tol=1e-05,
	hidden_dim=50,
	hidden_activation=Tanh())

Note: as you can see here, this model will still accept keyword arguments `hidden_dim` and `hidden_activation`, which will be ignored since the graph doesn't use them. I'll leave this minor inconsistency aside.

In [9]:
_ = sm_mod.fit(X_cls_train, y_cls_train)

Finished epoch 1000 of 1000; error is 0.4739058315753937

In [10]:
sm_preds = sm_mod.predict(X_cls_test)

In [11]:
print(classification_report(y_cls_test, sm_preds))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       0.92      0.73      0.81        15
           2       0.79      0.94      0.86        16

    accuracy                           0.90        50
   macro avg       0.90      0.89      0.89        50
weighted avg       0.91      0.90      0.90        50



`TorchModelBase` is able to ["duck type"](https://en.wikipedia.org/wiki/Duck_typing) standard `sklearn` estimators, so we can use the functionality from `sklearn.model_selection`. For example, here we use `sklearn.model_selection.cross_validate`:

In [12]:
cross_validate(sm_mod, X_cls_train, y_cls_train, cv=5)

Finished epoch 1000 of 1000; error is 0.58722406625747686

{'fit_time': array([1.90538383, 1.82407284, 1.84190989, 1.83592701, 1.84237123]),
 'score_time': array([0.00169611, 0.0011301 , 0.00174618, 0.00141382, 0.0018909 ]),
 'test_score': array([0.68660969, 0.84242424, 0.84615385, 0.51515152, 0.76911977])}

### A deeper neural classifier

`TorchShallowNeuralClassifier` is "shallow" in that it has just one hidden layer of representation. Adding a second is very straightforward. Again, all we really have to do is write a new `build_graph`, but the implementation below also includes a new `__init__` method to allow the user to separately control the sizes of the two hidden layers:

In [13]:
class TorchDeeperNeuralClassifier(TorchShallowNeuralClassifier):
    def __init__(self, hidden_dim1=50, hidden_dim2=50, **base_kwargs):
        super().__init__(**base_kwargs)
        self.hidden_dim1 = hidden_dim1
        self.hidden_dim2 = hidden_dim2
        # Good to remove this to avoid confusion:
        self.params.remove("hidden_dim")
        # Add the new parameters to support model_selection using them:
        self.params += ["hidden_dim1", "hidden_dim2"]

    def build_graph(self):
        return nn.Sequential(
            nn.Linear(self.input_dim, self.hidden_dim1),
            self.hidden_activation,
            nn.Linear(self.hidden_dim1, self.hidden_dim2),
            self.hidden_activation,
            nn.Linear(self.hidden_dim2, self.n_classes_))

In [14]:
deep_mod = TorchDeeperNeuralClassifier()

deep_mod

TorchDeeperNeuralClassifier(
	batch_size=1028,
	max_iter=1000,
	eta=0.001,
	optimizer_class=,
	l2_strength=0,
	gradient_accumulation_steps=1,
	max_grad_norm=None,
	validation_fraction=0.1,
	early_stopping=False,
	n_iter_no_change=10,
	warm_start=False,
	tol=1e-05,
	hidden_activation=Tanh(),
	hidden_dim1=50,
	hidden_dim2=50)

In [15]:
_ = deep_mod.fit(X_cls_train, y_cls_train)

Finished epoch 1000 of 1000; error is 0.023747699335217476

In [16]:
deep_preds = deep_mod.predict(X_cls_test)

In [17]:
print(classification_report(y_cls_test, deep_preds))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       0.94      1.00      0.97        15
           2       1.00      0.94      0.97        16

    accuracy                           0.98        50
   macro avg       0.98      0.98      0.98        50
weighted avg       0.98      0.98      0.98        50



To try to find optimal values for the hidden layer dimensionalities, we could do some hyperparameter tuning:

In [18]:
xval = GridSearchCV(
    TorchDeeperNeuralClassifier(),
    param_grid={
        'hidden_dim1': [5, 10],
        'hidden_dim2': [5, 10]})

best_mod = xval.fit(X_cls_train, y_cls_train)

Finished epoch 1000 of 1000; error is 0.060364335775375366

In [19]:
xval.best_score_

0.9672889488678962

In [20]:
best_mod

GridSearchCV(estimator=TorchDeeperNeuralClassifier(
	batch_size=1028,
	max_iter=1000,
	eta=0.001,
	optimizer_class=,
	l2_strength=0,
	gradient_accumulation_steps=1,
	max_grad_norm=None,
	validation_fraction=0.1,
	early_stopping=False,
	n_iter_no_change=10,
	warm_start=False,
	tol=1e-05,
	hidden_activation=Tanh(),
	hidden_dim1=50,
	hidden_dim2=50),
 param_grid={'hidden_dim1': [5, 10], 'hidden_dim2': [5, 10]})
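
One detail worth noting about the cells above: `GridSearchCV.fit` returns the search object itself, which is why `best_mod` displays as a `GridSearchCV`. The winning settings and the refit estimator are available as `best_params_` and `best_estimator_`. The sketch below shows this with a stock scikit-learn classifier standing in for `TorchDeeperNeuralClassifier`, purely so that it runs without the course modules. The grid values are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0]},
    cv=3)

fitted = search.fit(X, y)

# `fit` returns the search object, not the best model:
print(fitted is search)

# The winning settings and the refit estimator live here:
print(search.best_params_)
print(type(search.best_estimator_).__name__)
```

With the notebook's objects, the analogous attributes would be `xval.best_params_` and `xval.best_estimator_`.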

## Regression

It is also easy to write regression models. For these, we will subclass `TorchModelBase` directly, since some fundamental things are different from the classifiers above.

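For illustrations, we'll use a random split of the [Boston house prices](https://scikit-learn.org/stable/datasets/index.html#boston-dataset) dataset. Note that `load_boston` was removed in scikit-learn 1.2, so the cells below require an earlier release: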

In [21]:
def boston_split():
    dataset = load_boston()
    X = dataset.data
    y = dataset.target
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=42)
    return X_train, X_test, y_train, y_test

In [22]:
X_reg_train, X_reg_test, y_reg_train, y_reg_test = boston_split()

### Linear regression

For linear regression, we create an `nn.Module` subclass:

In [23]:
class TorchLinearRegressionModel(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.input_dim = input_dim
        self.w = nn.Parameter(torch.zeros(self.input_dim))
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, X):
        return X.matmul(self.w) + self.b

The estimator itself, a subclass of `TorchModelBase`, needs the following methods:

* `build_graph`: to use `TorchLinearRegressionModel` from above.
* `build_dataset`: for processing the data.
* `predict`: for making predictions.
* `score`: technically optional, but required for `sklearn.model_selection` usage.

In [24]:
class TorchLinearRegresson(TorchModelBase):
    def __init__(self, **base_kwargs):
        super().__init__(**base_kwargs)
        self.loss = nn.MSELoss(reduction="mean")

    def build_graph(self):
        return TorchLinearRegressionModel(self.input_dim)

    def build_dataset(self, X, y=None):
        """
        This function will be used in training (when there is a `y`)
        and in prediction (no `y`). For both cases, we rely on a
        `TensorDataset`.
        """
        X = torch.FloatTensor(X)
        self.input_dim = X.shape[1]
        if y is None:
            dataset = torch.utils.data.TensorDataset(X)
        else:
            y = torch.FloatTensor(y)
            dataset = torch.utils.data.TensorDataset(X, y)
        return dataset

    def predict(self, X, device=None):
        """
        The `_predict` function of the base class handles all the
        details around data formatting. In this case, the
        raw output of `self.model`, as given by
        `TorchLinearRegressionModel.forward`, is all we need.
        """
        return self._predict(X, device=device).cpu().numpy()

    def score(self, X, y):
        """
        Follow sklearn in using `r2_score` as the default scorer.
        """
        preds = self.predict(X)
        return r2_score(y, preds)

In [25]:
lr = TorchLinearRegresson()

lr

TorchLinearRegresson(
	batch_size=1028,
	max_iter=1000,
	eta=0.001,
	optimizer_class=,
	l2_strength=0,
	gradient_accumulation_steps=1,
	max_grad_norm=None,
	validation_fraction=0.1,
	early_stopping=False,
	n_iter_no_change=10,
	warm_start=False,
	tol=1e-05)

In [26]:
_ = lr.fit(X_reg_train, y_reg_train)

Finished epoch 1000 of 1000; error is 52.95167922973633

In [27]:
lr_preds = lr.predict(X_reg_test)

In [28]:
r2_score(y_reg_test, lr_preds)

0.3236728529459678

### Deeper Linear Regression

We can extend the subclass we just created to easily create deeper regression models. Here's an example showing that all we need is the deeper `nn.Module` and a new `build_graph` method in the main estimator:

In [29]:
class TorchLinearRegressionModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, hidden_activation):
        super().__init__()
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.hidden_activation = hidden_activation
        self.input_layer = nn.Linear(self.input_dim, self.hidden_dim)
        self.w = nn.Parameter(torch.zeros(self.hidden_dim))
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, X):
        h = self.hidden_activation(self.input_layer(X))
        return h.matmul(self.w) + self.b


class TorchDeeperLinearRegression(TorchLinearRegresson):
    def __init__(self, hidden_dim=20, hidden_activation=nn.Tanh(), **kwargs):
        super().__init__(**kwargs)
        self.hidden_dim = hidden_dim
        self.hidden_activation = hidden_activation
        self.params += ["hidden_dim", "hidden_activation"]

    def build_graph(self):
        return TorchLinearRegressionModel(
            input_dim=self.input_dim,
            hidden_dim=self.hidden_dim,
            hidden_activation=self.hidden_activation)

In [30]:
deep_lr = TorchDeeperLinearRegression()

deep_lr

TorchDeeperLinearRegression(
	batch_size=1028,
	max_iter=1000,
	eta=0.001,
	optimizer_class=,
	l2_strength=0,
	gradient_accumulation_steps=1,
	max_grad_norm=None,
	validation_fraction=0.1,
	early_stopping=False,
	n_iter_no_change=10,
	warm_start=False,
	tol=1e-05,
	hidden_dim=20,
	hidden_activation=Tanh())

In [31]:
_ = deep_lr.fit(X_reg_train, y_reg_train)

Finished epoch 1000 of 1000; error is 132.6202392578125

In [32]:
deep_lr_preds = deep_lr.predict(X_reg_test)

In [33]:
r2_score(y_reg_test, deep_lr_preds)

-0.3762662051157306
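
A negative score like this is possible because `r2_score` compares the model's squared error with that of a baseline that always predicts the mean of the targets: the score is `1 - SS_res / SS_tot`. The plain-Python sketch below, using made-up numbers rather than the Boston data, shows a case where the predictions are worse than the mean baseline.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0]

# Predicting the mean everywhere scores exactly 0:
print(r2(y_true, [2.0, 2.0, 2.0]))

# Predictions that are further off than the mean score below 0:
print(r2(y_true, [3.0, 2.0, 1.0]))
```

So the deeper model, as trained here with the default settings, does worse on the test split than a constant prediction would.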

## RNN sequence labeling

As a final illustrative example, let's make use of our existing RNN classifier components to create a model that can do full sequence labeling. PyTorch's abstractions concerning how layers interact and how loss functions work make this surprisingly easy.

For example data, we'll use the CoNLL 2002 shared task on named entity labeling in Spanish. NLTK provides an easy interface once the corpus has been downloaded (for instance with `nltk.download('conll2002')`):

In [34]:
def sequence_dataset():
    train_seq = nltk.corpus.conll2002.iob_sents('esp.train')
    X = [[x[0] for x in seq] for seq in train_seq]
    y = [[x[2] for x in seq] for seq in train_seq]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=42)
    vocab = sorted({w for seq in X_train for w in seq}) + ["$UNK"]
    return X_train, X_test, y_train, y_test, vocab

In [35]:
X_seq_train, X_seq_test, y_seq_train, y_seq_test, seq_vocab = sequence_dataset()

Here are the first few tokens in the first training example:

In [36]:
X_seq_train[0][: 8]

['La', 'compañía', 'estatal', 'de', 'electricidad', 'de', 'Suecia', ',']

And the corresponding labels:

In [37]:
y_seq_train[0][: 8]

['O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'O']
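
The `$UNK` entry that `sequence_dataset` appends to the vocabulary gives unseen test words somewhere to go. How the course RNN code performs the lookup is not shown in this notebook, so the helper below (`tokens_to_indices`, a hypothetical name) is only a sketch of the usual pattern: map each token to its vocabulary index and fall back to the `$UNK` index for anything else.

```python
def tokens_to_indices(tokens, vocab, unk="$UNK"):
    """Map tokens to integer ids, sending unknown tokens to `unk`."""
    index = {w: i for i, w in enumerate(vocab)}
    unk_id = index[unk]
    return [index.get(tok, unk_id) for tok in tokens]


# Toy vocabulary built the same way as in `sequence_dataset`:
toy_train = [["La", "compañía", "estatal"], ["de", "Suecia"]]
toy_vocab = sorted({w for seq in toy_train for w in seq}) + ["$UNK"]

print(toy_vocab)
print(tokens_to_indices(["La", "empresa", "de", "Suecia"], toy_vocab))
```

Here `empresa` is not in the toy vocabulary, so it receives the index of `$UNK`.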

We'll start with the `nn.Module` subclass we need. In `torch_rnn_classifier.py`, we already have a pretty generic RNN module: `TorchRNNModel`. For the classifier use, `TorchRNNClassifierModel` uses the output of `TorchRNNModel` to define a classifier based on the final output state. For sequence labeling, we drop `TorchRNNClassifierModel` and replace it with a model that has a classifier on every output state:

In [38]:
class TorchSequenceLabeler(nn.Module):
    def __init__(self, rnn, output_dim):
        super().__init__()
        self.rnn = rnn
        self.output_dim = output_dim
        if self.rnn.bidirectional:
            self.classifier_dim = self.rnn.hidden_dim * 2
        else:
            self.classifier_dim = self.rnn.hidden_dim
        self.classifier_layer = nn.Linear(
            self.classifier_dim, self.output_dim)

    def forward(self, X, seq_lengths):
        outputs, state = self.rnn(X, seq_lengths)
        outputs, seq_length = torch.nn.utils.rnn.pad_packed_sequence(
            outputs, batch_first=True)
        logits = self.classifier_layer(outputs)
        # During training, we need to swap the dimensions of logits
        # to accommodate `nn.CrossEntropyLoss`:
        if self.training:
            return logits.transpose(1, 2)
        else:
            return logits
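
The transpose in `forward` reflects how `nn.CrossEntropyLoss` handles sequence targets: for targets of shape `(batch, seq_len)`, it expects the class dimension second, that is, logits of shape `(batch, n_classes, seq_len)`. The sketch below uses made-up tensors (no padding involved) to check that this gives the same loss as flattening every token into one large batch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

batch, seq_len, n_classes = 2, 3, 4

# Per-token logits as a labeler produces them: (batch, seq_len, n_classes).
logits = torch.randn(batch, seq_len, n_classes)
# One gold label per token: (batch, seq_len).
targets = torch.tensor([[0, 1, 3], [2, 2, 1]])

loss_fn = nn.CrossEntropyLoss()

# Class dimension moved to position 1, as in `forward` during training:
seq_loss = loss_fn(logits.transpose(1, 2), targets)

# Equivalent view: treat every token as its own example.
flat_loss = loss_fn(logits.reshape(-1, n_classes), targets.reshape(-1))

print(torch.allclose(seq_loss, flat_loss))
```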

We won't normally interact with this module directly, but it's perhaps instructive to see how it works on its own:

In [39]:
vocab_size = 4

seq_rnn = TorchRNNModel(vocab_size, embed_dim=4, hidden_dim=5)

In [40]:
seq_module = TorchSequenceLabeler(seq_rnn, vocab_size)

_ = seq_module.eval()

In [41]:
toy_seqs = torch.LongTensor([[0,1,2], [0,2,1]])

seq_lengths = torch.LongTensor([3,3])

This should return two sequences of 4-dimensional vectors – the per-token logits:

In [42]:
seq_module(toy_seqs, seq_lengths)

tensor([[[ 0.3255, 0.2848, 0.3470, -0.1150],
 [ 0.2264, 0.3246, 0.3123, -0.1394],
 [ 0.1972, 0.3036, 0.3240, -0.0696]],

 [[ 0.3255, 0.2848, 0.3470, -0.1150],
 [ 0.2272, 0.2959, 0.3383, -0.0673],
 [ 0.1895, 0.3257, 0.3078, -0.1153]]], grad_fn=)

The remaining tasks concern the new estimator. We need to define the following methods:

* `build_graph`: to use `TorchSequenceLabeler`.
* `build_dataset`: just like what we need for a classifier, but it has to deal with examples as full sequences.
* `predict_proba`: like a classifier `predict_proba`, but it needs to remove any sequence padding and deal with full sequences.
* `predict`: just like a classifier `predict` method, but defined for sequences.
* `score`: also very much like a classifier `score` function, but designed to deal with sequences.

In [43]:
class TorchRNNSequenceLabeler(TorchRNNClassifier):

    def build_graph(self):
        rnn = TorchRNNModel(
            vocab_size=len(self.vocab),
            embedding=self.embedding,
            use_embedding=self.use_embedding,
            embed_dim=self.embed_dim,
            rnn_cell_class=self.rnn_cell_class,
            hidden_dim=self.hidden_dim,
            bidirectional=self.bidirectional,
            freeze_embedding=self.freeze_embedding)
        model = TorchSequenceLabeler(
            rnn=rnn,
            output_dim=self.n_classes_)
        self.embed_dim = rnn.embed_dim
        return model

    def build_dataset(self, X, y=None):
        X, seq_lengths = self._prepare_sequences(X)
        if y is None:
            return TorchRNNDataset(X, seq_lengths)
        else:
            # These are the changes from a regular classifier. All
            # concern the fact that our labels are sequences of labels.
            self.classes_ = sorted({x for seq in y for x in seq})
            self.n_classes_ = len(self.classes_)
            class2index = dict(zip(self.classes_, range(self.n_classes_)))
            # `y` is a list of tensors of different length. Our Dataset
            # class will turn it into a padded tensor for processing.
            y = [torch.tensor([class2index[label] for label in seq])
                 for seq in y]
            return TorchRNNDataset(X, seq_lengths, y)

    def predict_proba(self, X):
        seq_lengths = [len(ex) for ex in X]
        # The base class does the heavy lifting:
        preds = self._predict(X)
        # Trim to the actual sequence lengths:
        preds = [p[: l] for p, l in zip(preds, seq_lengths)]
        # Use `softmax`; the model doesn't do this because the loss
        # function does it internally.
        probs = [torch.softmax(seq, dim=1) for seq in preds]
        return probs

    def predict(self, X):
        probs = self.predict_proba(X)
        return [[self.classes_[i] for i in seq.argmax(axis=1)] for seq in probs]

    def score(self, X, y):
        preds = self.predict(X)
        flat_preds = [x for seq in preds for x in seq]
        flat_y = [x for seq in y for x in seq]
        return utils.safe_macro_f1(flat_y, flat_preds)
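
The `score` method flattens all sequences into one long list of token labels before computing macro-F1. `utils.safe_macro_f1` is a course helper whose implementation is not shown here, so the sketch below substitutes a small hand-written macro-F1 (the unweighted mean of per-class F1, taken over the classes that occur in the gold labels) on made-up data, just to make the flattening step concrete.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over the classes in `gold`."""
    scores = []
    for c in sorted(set(gold)):
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


y_seqs = [["O", "O", "B-LOC"], ["B-LOC", "O"]]
pred_seqs = [["O", "O", "O"], ["B-LOC", "O"]]

# Flatten, as in `score`:
flat_y = [x for seq in y_seqs for x in seq]
flat_preds = [x for seq in pred_seqs for x in seq]

print(macro_f1(flat_y, flat_preds))
```

Because every class counts equally in the average, a model that mostly predicts `O` can have high token accuracy and still a low macro-F1.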

In [44]:
seq_mod = TorchRNNSequenceLabeler(
    seq_vocab,
    early_stopping=True,
    eta=0.001)

In [45]:
%time _ = seq_mod.fit(X_seq_train, y_seq_train)

Stopping after epoch 17. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 8.602030873298645

CPU times: user 24min 41s, sys: 3min 21s, total: 28min 3s
Wall time: 10min 22s
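
The stopping message above comes from early stopping on a held-out validation split. The exact bookkeeping inside `TorchModelBase.fit` is not reproduced in this notebook, so the function below (`epochs_until_stop`, a hypothetical name) is only a sketch of one common reading of `n_iter_no_change` and `tol`: stop once the validation score has failed to beat the best score so far by more than `tol` for more than `n_iter_no_change` consecutive epochs.

```python
def epochs_until_stop(val_scores, n_iter_no_change=10, tol=1e-5):
    """Return the 1-based epoch at which training would stop,
    or None if the patience is never exhausted."""
    best = float("-inf")
    no_improvement = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best + tol:
            best = score
            no_improvement = 0
        else:
            no_improvement += 1
        if no_improvement > n_iter_no_change:
            return epoch
    return None


# Made-up validation scores: improvement for 3 epochs, then a plateau.
scores = [0.05, 0.08, 0.11] + [0.11] * 12
print(epochs_until_stop(scores, n_iter_no_change=10))
```

Under this reading, a run that stops early has simply plateaued on the validation split, which is a separate question from whether the plateau is a good score.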


In [46]:
seq_mod.score(X_seq_test, y_seq_test)

0.11311924082554141