{ "cells": [ { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "a569ad9a41c0d43b58bb9425c5bad9df", "grade": false, "grade_id": "cell-2dfc0bc1e6fbbbd3", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "# Part 3: Sequence Classification" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "4e155a44e17248e3d102e1b80e24bf6c", "grade": false, "grade_id": "cell-16d5c7a45d3f9b23", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "__Before starting, we recommend you enable GPU acceleration if you're running on Colab.__" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "b1b77d0af7b67787cd29f90504f29014", "grade": false, "grade_id": "cell-9fa514521b79541d", "locked": true, "schema_version": 1, "solution": false } }, "outputs": [], "source": [ "# Execute this code block to install dependencies when running on colab\n", "!pip uninstall -y torch\n", "!pip install torch==2.3.0\n", "!pip install torchdata==0.8.0\n", "!pip install portalocker==2.8.2\n", "\n", "try:\n", " import torchtext\n", "except:\n", " !pip install torchtext\n", "\n", "\n", "try:\n", " import torchbearer\n", "except:\n", " !pip install torchbearer\n", "\n", "try:\n", " import spacy\n", "except:\n", " !pip install spacy\n", "\n", "try:\n", " spacy.load('en-core-web-sm')\n", "except:\n", " !python -m spacy download en" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "e0fd6d0300fd2e1ce9a1b34ffd2fefe0", "grade": false, "grade_id": "cell-cabb9cac57ae217e", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "## Sequence Classification\n", "The problem that we will use to demonstrate sequence classification in this lab is the IMDB movie review sentiment classification problem. Each movie review is a variable sequence of words and the sentiment of each movie review must be classified.\n", "\n", "The Large Movie Review Dataset (often referred to as the IMDB dataset) contains 25,000 highly-polar movie reviews (good or bad) for training and the same amount again for testing. The problem is to determine whether a given movie review has a positive or negative sentiment. The data was collected by Stanford researchers and was used in a 2011 paper where a split of 50-50 of the data was used for training and test. An accuracy of 88.89% was achieved.\n", "\n", "We'll be using a **recurrent neural network** (RNN) as they are commonly used in analysing sequences. An RNN takes in sequence of words, $X=\\{x_1, ..., x_T\\}$, one at a time, and produces a _hidden state_, $h$, for each word. We use the RNN _recurrently_ by feeding in the current word $x_t$ as well as the hidden state from the previous word, $h_{t-1}$, to produce the next hidden state, $h_t$. \n", "\n", "$$h_t = \\text{RNN}(x_t, h_{t-1})$$\n", "\n", "Once we have our final hidden state, $h_T$, (from feeding in the last word in the sequence, $x_T$) we feed it through a linear layer, $f$, (also known as a fully connected layer), to receive our predicted sentiment, $\\hat{y} = f(h_T)$.\n", "\n", "Below shows an example sentence, with the RNN predicting zero, which indicates a negative sentiment. The RNN is shown in orange and the linear layer shown in silver. Note that we use the same RNN for every word, i.e. it has the same parameters. The initial hidden state, $h_0$, is a tensor initialized to all zeros. \n", "\n", "![](https://ecs-vlc.github.io/COMP6258/labs/lab7/assets/sentiment1.png)\n", "\n", "**Note:** some layers and steps have been omitted from the diagram, but these will be explained later.\n", "\n", "\n", "The TorchText library provides easy access to the IMDB dataset. The `IMDB` class allows you to load the dataset in a format that is ready for use in neural network and deep learning models, and TorchText's utility methods allow us to easily create batches of data that are `padded` to the same length (we need to pad shorter sentences in the batch to the length of the longest sentence)." ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "494c01d089301731550196dabe047067", "grade": false, "grade_id": "cell-23e92e167a2ccd52", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "With `torchtext` we can utilise the built in tools to perform tokenisation,\n", "build vocabularies and turn the text into tensors." ] }, { "cell_type": "code", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "ecbcba375e6e1012f085383a6ccee602", "grade": false, "grade_id": "cell-e0561eba5550d048", "locked": true, "schema_version": 1, "solution": false }, "ExecuteTime": { "end_time": "2025-04-24T13:44:17.084059Z", "start_time": "2025-04-24T13:44:17.069873Z" } }, "source": [ "import torch\n", "from torchtext.data.utils import get_tokenizer\n", "\n", "tokenizer = get_tokenizer('basic_english')" ], "outputs": [ { "ename": "ModuleNotFoundError", "evalue": "No module named 'torchtext'", "output_type": "error", "traceback": [ "\u001B[31m---------------------------------------------------------------------------\u001B[39m", "\u001B[31mModuleNotFoundError\u001B[39m Traceback (most recent call last)", "\u001B[36mCell\u001B[39m\u001B[36m \u001B[39m\u001B[32mIn[4]\u001B[39m\u001B[32m, line 2\u001B[39m\n\u001B[32m 1\u001B[39m \u001B[38;5;28;01mimport\u001B[39;00m\u001B[38;5;250m \u001B[39m\u001B[34;01mtorch\u001B[39;00m\n\u001B[32m----> \u001B[39m\u001B[32m2\u001B[39m \u001B[38;5;28;01mfrom\u001B[39;00m\u001B[38;5;250m \u001B[39m\u001B[34;01mtorchtext\u001B[39;00m\u001B[34;01m.\u001B[39;00m\u001B[34;01mdata\u001B[39;00m\u001B[34;01m.\u001B[39;00m\u001B[34;01mutils\u001B[39;00m\u001B[38;5;250m \u001B[39m\u001B[38;5;28;01mimport\u001B[39;00m get_tokenizer\n\u001B[32m 4\u001B[39m tokenizer = get_tokenizer(\u001B[33m'\u001B[39m\u001B[33mbasic_english\u001B[39m\u001B[33m'\u001B[39m)\n", "\u001B[31mModuleNotFoundError\u001B[39m: No module named 'torchtext'" ] } ], "execution_count": 4 }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "c45641b2f9364a3c13d1c00d9c92682e", "grade": false, "grade_id": "cell-0689e30f35617f29", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "The following code automatically downloads the IMDb dataset and splits it\n", "into the canonical train/test splits:" ] }, { "cell_type": "code", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "73091f32f831e502e86041779578b7d0", "grade": false, "grade_id": "cell-bfc816072fd54de0", "locked": true, "schema_version": 1, "solution": false }, "ExecuteTime": { "end_time": "2025-04-24T13:42:14.434953Z", "start_time": "2025-04-24T13:42:14.374926Z" } }, "source": [ "from torchtext.datasets import IMDB\n", "from collections import Counter\n", "\n", "train_iter, test_iter = IMDB(split=('train', 'test'))" ], "outputs": [ { "ename": "OSError", "evalue": "dlopen(/Users/jsh2/anaconda3/envs/torchwatcher/lib/python3.12/site-packages/torchtext/lib/libtorchtext.so, 0x0006): Symbol not found: __ZN3c105ErrorC1ENSt3__112basic_stringIcNS1_11char_traitsIcEENS1_9allocatorIcEEEES7_PKv\n Referenced from: <7E3C8144-0701-3505-8587-6E953627B6AF> /Users/jsh2/anaconda3/envs/torchwatcher/lib/python3.12/site-packages/torchtext/lib/libtorchtext.so\n Expected in: <5445D2E4-6D7A-39F2-9003-F3A3F854555A> /Users/jsh2/anaconda3/envs/torchwatcher/lib/python3.12/site-packages/torch/lib/libc10.dylib", "output_type": "error", "traceback": [ "\u001B[31m---------------------------------------------------------------------------\u001B[39m", "\u001B[31mOSError\u001B[39m Traceback (most recent call last)", "\u001B[36mCell\u001B[39m\u001B[36m \u001B[39m\u001B[32mIn[3]\u001B[39m\u001B[32m, line 1\u001B[39m\n\u001B[32m----> \u001B[39m\u001B[32m1\u001B[39m \u001B[38;5;28;01mfrom\u001B[39;00m\u001B[38;5;250m \u001B[39m\u001B[34;01mtorchtext\u001B[39;00m\u001B[34;01m.\u001B[39;00m\u001B[34;01mdatasets\u001B[39;00m\u001B[38;5;250m \u001B[39m\u001B[38;5;28;01mimport\u001B[39;00m IMDB\n\u001B[32m 2\u001B[39m \u001B[38;5;28;01mfrom\u001B[39;00m\u001B[38;5;250m \u001B[39m\u001B[34;01mcollections\u001B[39;00m\u001B[38;5;250m \u001B[39m\u001B[38;5;28;01mimport\u001B[39;00m Counter\n\u001B[32m 3\u001B[39m \u001B[38;5;28;01mfrom\u001B[39;00m\u001B[38;5;250m \u001B[39m\u001B[34;01mtorchtext\u001B[39;00m\u001B[34;01m.\u001B[39;00m\u001B[34;01mvocab\u001B[39;00m\u001B[38;5;250m \u001B[39m\u001B[38;5;28;01mimport\u001B[39;00m Vocab\n", "\u001B[36mFile \u001B[39m\u001B[32m~/anaconda3/envs/torchwatcher/lib/python3.12/site-packages/torchtext/__init__.py:18\u001B[39m\n\u001B[32m 15\u001B[39m _WARN = \u001B[38;5;28;01mFalse\u001B[39;00m\n\u001B[32m 17\u001B[39m \u001B[38;5;66;03m# the following import has to happen first in order to load the torchtext C++ library\u001B[39;00m\n\u001B[32m---> \u001B[39m\u001B[32m18\u001B[39m \u001B[38;5;28;01mfrom\u001B[39;00m\u001B[38;5;250m \u001B[39m\u001B[34;01mtorchtext\u001B[39;00m\u001B[38;5;250m \u001B[39m\u001B[38;5;28;01mimport\u001B[39;00m _extension \u001B[38;5;66;03m# noqa: F401\u001B[39;00m\n\u001B[32m 20\u001B[39m _TEXT_BUCKET = \u001B[33m\"\u001B[39m\u001B[33mhttps://download.pytorch.org/models/text/\u001B[39m\u001B[33m\"\u001B[39m\n\u001B[32m 22\u001B[39m _CACHE_DIR = os.path.expanduser(os.path.join(_get_torch_home(), \u001B[33m\"\u001B[39m\u001B[33mtext\u001B[39m\u001B[33m\"\u001B[39m))\n", "\u001B[36mFile \u001B[39m\u001B[32m~/anaconda3/envs/torchwatcher/lib/python3.12/site-packages/torchtext/_extension.py:64\u001B[39m\n\u001B[32m 59\u001B[39m \u001B[38;5;66;03m# This import is for initializing the methods registered via PyBind11\u001B[39;00m\n\u001B[32m 60\u001B[39m \u001B[38;5;66;03m# This has to happen after the base library is loaded\u001B[39;00m\n\u001B[32m 61\u001B[39m \u001B[38;5;28;01mfrom\u001B[39;00m\u001B[38;5;250m \u001B[39m\u001B[34;01mtorchtext\u001B[39;00m\u001B[38;5;250m \u001B[39m\u001B[38;5;28;01mimport\u001B[39;00m _torchtext \u001B[38;5;66;03m# noqa\u001B[39;00m\n\u001B[32m---> \u001B[39m\u001B[32m64\u001B[39m \u001B[43m_init_extension\u001B[49m\u001B[43m(\u001B[49m\u001B[43m)\u001B[49m\n", "\u001B[36mFile \u001B[39m\u001B[32m~/anaconda3/envs/torchwatcher/lib/python3.12/site-packages/torchtext/_extension.py:58\u001B[39m, in \u001B[36m_init_extension\u001B[39m\u001B[34m()\u001B[39m\n\u001B[32m 55\u001B[39m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m _mod_utils.is_module_available(\u001B[33m\"\u001B[39m\u001B[33mtorchtext._torchtext\u001B[39m\u001B[33m\"\u001B[39m):\n\u001B[32m 56\u001B[39m \u001B[38;5;28;01mraise\u001B[39;00m \u001B[38;5;167;01mImportError\u001B[39;00m(\u001B[33m\"\u001B[39m\u001B[33mtorchtext C++ Extension is not found.\u001B[39m\u001B[33m\"\u001B[39m)\n\u001B[32m---> \u001B[39m\u001B[32m58\u001B[39m \u001B[43m_load_lib\u001B[49m\u001B[43m(\u001B[49m\u001B[33;43m\"\u001B[39;49m\u001B[33;43mlibtorchtext\u001B[39;49m\u001B[33;43m\"\u001B[39;49m\u001B[43m)\u001B[49m\n\u001B[32m 59\u001B[39m \u001B[38;5;66;03m# This import is for initializing the methods registered via PyBind11\u001B[39;00m\n\u001B[32m 60\u001B[39m \u001B[38;5;66;03m# This has to happen after the base library is loaded\u001B[39;00m\n\u001B[32m 61\u001B[39m \u001B[38;5;28;01mfrom\u001B[39;00m\u001B[38;5;250m \u001B[39m\u001B[34;01mtorchtext\u001B[39;00m\u001B[38;5;250m \u001B[39m\u001B[38;5;28;01mimport\u001B[39;00m _torchtext\n", "\u001B[36mFile \u001B[39m\u001B[32m~/anaconda3/envs/torchwatcher/lib/python3.12/site-packages/torchtext/_extension.py:50\u001B[39m, in \u001B[36m_load_lib\u001B[39m\u001B[34m(lib)\u001B[39m\n\u001B[32m 48\u001B[39m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m path.exists():\n\u001B[32m 49\u001B[39m \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[38;5;28;01mFalse\u001B[39;00m\n\u001B[32m---> \u001B[39m\u001B[32m50\u001B[39m \u001B[43mtorch\u001B[49m\u001B[43m.\u001B[49m\u001B[43mops\u001B[49m\u001B[43m.\u001B[49m\u001B[43mload_library\u001B[49m\u001B[43m(\u001B[49m\u001B[43mpath\u001B[49m\u001B[43m)\u001B[49m\n\u001B[32m 51\u001B[39m \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[38;5;28;01mTrue\u001B[39;00m\n", "\u001B[36mFile \u001B[39m\u001B[32m~/anaconda3/envs/torchwatcher/lib/python3.12/site-packages/torch/_ops.py:1357\u001B[39m, in \u001B[36m_Ops.load_library\u001B[39m\u001B[34m(self, path)\u001B[39m\n\u001B[32m 1352\u001B[39m path = _utils_internal.resolve_library_path(path)\n\u001B[32m 1353\u001B[39m \u001B[38;5;28;01mwith\u001B[39;00m dl_open_guard():\n\u001B[32m 1354\u001B[39m \u001B[38;5;66;03m# Import the shared library into the process, thus running its\u001B[39;00m\n\u001B[32m 1355\u001B[39m \u001B[38;5;66;03m# static (global) initialization code in order to register custom\u001B[39;00m\n\u001B[32m 1356\u001B[39m \u001B[38;5;66;03m# operators with the JIT.\u001B[39;00m\n\u001B[32m-> \u001B[39m\u001B[32m1357\u001B[39m \u001B[43mctypes\u001B[49m\u001B[43m.\u001B[49m\u001B[43mCDLL\u001B[49m\u001B[43m(\u001B[49m\u001B[43mpath\u001B[49m\u001B[43m)\u001B[49m\n\u001B[32m 1358\u001B[39m \u001B[38;5;28mself\u001B[39m.loaded_libraries.add(path)\n", "\u001B[36mFile \u001B[39m\u001B[32m~/anaconda3/envs/torchwatcher/lib/python3.12/ctypes/__init__.py:379\u001B[39m, in \u001B[36mCDLL.__init__\u001B[39m\u001B[34m(self, name, mode, handle, use_errno, use_last_error, winmode)\u001B[39m\n\u001B[32m 376\u001B[39m \u001B[38;5;28mself\u001B[39m._FuncPtr = _FuncPtr\n\u001B[32m 378\u001B[39m \u001B[38;5;28;01mif\u001B[39;00m handle \u001B[38;5;129;01mis\u001B[39;00m \u001B[38;5;28;01mNone\u001B[39;00m:\n\u001B[32m--> \u001B[39m\u001B[32m379\u001B[39m \u001B[38;5;28mself\u001B[39m._handle = \u001B[43m_dlopen\u001B[49m\u001B[43m(\u001B[49m\u001B[38;5;28;43mself\u001B[39;49m\u001B[43m.\u001B[49m\u001B[43m_name\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mmode\u001B[49m\u001B[43m)\u001B[49m\n\u001B[32m 380\u001B[39m \u001B[38;5;28;01melse\u001B[39;00m:\n\u001B[32m 381\u001B[39m \u001B[38;5;28mself\u001B[39m._handle = handle\n", "\u001B[31mOSError\u001B[39m: dlopen(/Users/jsh2/anaconda3/envs/torchwatcher/lib/python3.12/site-packages/torchtext/lib/libtorchtext.so, 0x0006): Symbol not found: __ZN3c105ErrorC1ENSt3__112basic_stringIcNS1_11char_traitsIcEENS1_9allocatorIcEEEES7_PKv\n Referenced from: <7E3C8144-0701-3505-8587-6E953627B6AF> /Users/jsh2/anaconda3/envs/torchwatcher/lib/python3.12/site-packages/torchtext/lib/libtorchtext.so\n Expected in: <5445D2E4-6D7A-39F2-9003-F3A3F854555A> /Users/jsh2/anaconda3/envs/torchwatcher/lib/python3.12/site-packages/torch/lib/libc10.dylib" ] } ], "execution_count": 3 }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "f868c66c17e0c39a48728520e07cb468", "grade": false, "grade_id": "cell-83b4651e016c211e", "locked": true, "schema_version": 1, "solution": false } }, "source": "We can also check an example from the train set:" }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "76a3507d5c228460a24aea95f68c5507", "grade": false, "grade_id": "cell-a3aaf4270ecf8c11", "locked": true, "schema_version": 1, "solution": false } }, "outputs": [], "source": "next(iter(train_iter))" }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "5f7210df2512451c6d38c52c76ddf6c1", "grade": false, "grade_id": "cell-1c8ef9d389e1ea7c", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "The IMDb dataset only has train/test splits, so we need to create a validation set. We can do this with the `.random_split()` method.\n", "\n", "We choose to make a 70/30 split, but this can be controlled." ] }, { "cell_type": "code", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "6282e5f4b4da6ea29f0fa5d349055147", "grade": false, "grade_id": "cell-9a9d0a261cd1d62c", "locked": true, "schema_version": 1, "solution": false }, "ExecuteTime": { "end_time": "2025-04-24T17:22:03.303268Z", "start_time": "2025-04-24T17:22:03.267678Z" } }, "source": "train_iter, valid_iter = train_iter.random_split(total_length=len(list(train_iter)), weights={\"train\": 0.7, \"valid\": 0.3}, seed=0)", "outputs": [ { "ename": "NameError", "evalue": "name 'train_data' is not defined", "output_type": "error", "traceback": [ "\u001B[31m---------------------------------------------------------------------------\u001B[39m", "\u001B[31mNameError\u001B[39m Traceback (most recent call last)", "\u001B[36mCell\u001B[39m\u001B[36m \u001B[39m\u001B[32mIn[5]\u001B[39m\u001B[32m, line 3\u001B[39m\n\u001B[32m 1\u001B[39m \u001B[38;5;28;01mimport\u001B[39;00m\u001B[38;5;250m \u001B[39m\u001B[34;01mrandom\u001B[39;00m\n\u001B[32m----> \u001B[39m\u001B[32m3\u001B[39m train_data, valid_data = \u001B[43mtrain_data\u001B[49m.random_split(weights={\u001B[33m\"\u001B[39m\u001B[33mtrain\u001B[39m\u001B[33m\"\u001B[39m: \u001B[32m0.7\u001B[39m, \u001B[33m\"\u001B[39m\u001B[33mvalid\u001B[39m\u001B[33m\"\u001B[39m: \u001B[32m0.3\u001B[39m}, seed=\u001B[32m0\u001B[39m)\n", "\u001B[31mNameError\u001B[39m: name 'train_data' is not defined" ] } ], "execution_count": 5 }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "8b77b32ded3821f4d03d18d256df19ae", "grade": false, "grade_id": "cell-ebac196da95db0fb", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "Again, we'll view how many examples are in each split." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "8e00a4cf9bddfe86221ed1b820fcea6c", "grade": false, "grade_id": "cell-11de3fcbde1d6f7f", "locked": true, "schema_version": 1, "solution": false } }, "outputs": [], "source": [ "print(f'Number of training examples: {len(list(train_iter))}')\n", "print(f'Number of validation examples: {len(list(valid_iter))}')\n", "print(f'Number of testing examples: {len(list(test_iter))}')" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "eae6f73416a822fca491ebf20f878aa5", "grade": false, "grade_id": "cell-921d2a5f1737e53b", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "Next, we have to build a _vocabulary_. This is effectively a look up table where every unique word in your data set has a corresponding _index_ (an integer).\n", "\n", "We do this as our machine learning model cannot operate on strings, only numbers. Each _index_ is used to construct a _one-hot_ vector for each word. A one-hot vector is a vector where all of the elements are 0, except one, which is 1, and dimensionality is the total number of unique words in your vocabulary, commonly denoted by $V$.\n", "\n", "![](https://ecs-vlc.github.io/COMP6258/labs/lab7/assets/sentiment5.png)\n", "\n", "The number of unique words in our training set is over 100,000, which means that our one-hot vectors will have over 100,000 dimensions! This will make training slow and possibly won't fit onto your GPU (if you're using one). \n", "\n", "There are two ways to effectively cut-down our vocabulary, we can either only take the top $n$ most common words or ignore words that appear less than $m$ times. We'll do the former, only keeping the top 25,000 words.\n", "\n", "What do we do with words that appear in examples but we have cut from the vocabulary? We replace them with a special _unknown_ or `` token. For example, if the sentence was \"This film is great and I love it\" but the word \"love\" was not in the vocabulary, it would become \"This film is great and I `` it\".\n", "\n", "The following builds the vocabulary, only keeping the most common tokens\n", "(ones that appear more than 5 times)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "6746775e3c0746e78afa6b2382d0f249", "grade": false, "grade_id": "cell-1cf0d6f0d09b9333", "locked": true, "schema_version": 1, "solution": false } }, "outputs": [], "source": [ "from torchtext.vocab import vocab as Vocab\n", "\n", "counter = Counter()\n", "for (label, line) in train_iter:\n", " counter.update(tokenizer(line))\n", "vocab = Vocab(counter, min_freq=5, specials=('', '', '', ''))\n", "vocab.set_default_index(0) # set the default token to " ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "c406d281eaa4f81948f1e423fd4876ed", "grade": false, "grade_id": "cell-1ca43190cc40ef00", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "Why do we only build the vocabulary on the training set? When testing any machine learning system you do not want to look at the test set in any way. We do not include the validation set as we want it to reflect the test set as much as possible." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "4e073ef87fd96107616b5016c51b53b7", "grade": false, "grade_id": "cell-79a871ebea509deb", "locked": true, "schema_version": 1, "solution": false } }, "outputs": [], "source": "print(f\"Unique tokens in vocabulary: {len(vocab)}\")" }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "6f0044693302832d5b864ea9f20bcaa2", "grade": false, "grade_id": "cell-68985316db3edb24", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "We can also see the vocabulary directly using either the `get_stoi` (**s**tring\n", "**to** **i**nt) or `get_itos` (**i**nt **to** **s**tring) methods." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "862acc66ef827e8a3e33ca822b682728", "grade": false, "grade_id": "cell-3f6931771dfb8b05", "locked": true, "schema_version": 1, "solution": false } }, "outputs": [], "source": "print(vocab.get_itos()[:10])" }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "217fe362a504a59655710774d6bd4f3e", "grade": false, "grade_id": "cell-35a9ad2d4bf17b42", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "The final step of preparing the data is creating the iterators. We iterate\n", "over these in the training/evaluation loop, and they return a batch of\n", "examples (indexed and converted into tensors) at each iteration. Note that we\n", " define transformations which convert the text and labels into tensors.\n", "\n", "When we feed sentences into our model, we feed a _batch_ of them at a time,\n", "i.e. more than one at a time, and all sentences in the batch need to be the\n", "same size. Thus, to ensure each sentence in the batch is the same size, any\n", "sentences which are shorter than the longest within the batch are padded.\n", "This is done by the `collate_batch` function. `collate_batch` also returns\n", "the sequence lengths as part of the data.\n", "\n", "![](https://ecs-vlc.github.io/COMP6258/labs/lab7/assets/sentiment6.png)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "526210054aa53f54d8dad8acf67d1dcf", "grade": false, "grade_id": "cell-722b81c8ccb13d25", "locked": true, "schema_version": 1, "solution": false } }, "outputs": [], "source": [ "from torch.utils.data import DataLoader\n", "from torch.nn.utils.rnn import pad_sequence\n", "\n", "text_transform = lambda x: [vocab['']] + [vocab[token] for token in tokenizer(x)] + [vocab['']]\n", "label_transform = lambda x: x - 1\n", "\n", "\n", "def collate_batch(batch):\n", " label_list, text_list, len_list = [], [], []\n", " for (_label, _text) in batch:\n", " label_list.append(label_transform(_label))\n", " processed_text = torch.tensor(text_transform(_text))\n", " text_list.append(processed_text)\n", " len_list.append(len(processed_text))\n", " return (pad_sequence(text_list, padding_value=3.0), len_list), torch.tensor(label_list).unsqueeze(1).float()\n", "\n", "train_dataloader = DataLoader(list(train_iter), batch_size=8, shuffle=True,\n", " collate_fn=collate_batch)\n", "valid_dataloader = DataLoader(list(valid_iter), batch_size=8, shuffle=False,\n", " collate_fn=collate_batch)\n", "test_dataloader = DataLoader(list(test_iter), batch_size=8, shuffle=False,\n", " collate_fn=collate_batch)" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "def10308b65f2ed87e6fe3d3f2d89727", "grade": false, "grade_id": "cell-5d7acf8d6db191d3", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "## Build the Model\n", "\n", "The next stage is building the model that we'll eventually train and evaluate. \n", "\n", "There is a small amount of boilerplate code when creating models in PyTorch, note how our `RNN` class is a sub-class of `nn.Module` and the use of `super`.\n", "\n", "Within the `__init__` we define the _layers_ of the module. Our three layers are an _embedding_ layer, our RNN, and a _linear_ layer. All layers have their parameters initialized to random values, unless explicitly specified.\n", "\n", "The embedding layer is used to transform our sparse one-hot vector (sparse as most of the elements are 0) into a dense embedding vector (dense as the dimensionality is a lot smaller and all the elements are real numbers). This embedding layer is simply a single fully connected layer. As well as reducing the dimensionality of the input to the RNN, there is the theory that words which have similar impact on the sentiment of the review are mapped close together in this dense vector space. For more information about word embeddings, see [here](https://monkeylearn.com/blog/word-embeddings-transform-text-numbers/).\n", "\n", "The RNN layer is our RNN which takes in our dense vector and the previous hidden state $h_{t-1}$, which it uses to calculate the next hidden state, $h_t$.\n", "\n", "![](https://ecs-vlc.github.io/COMP6258/labs/lab7/assets/sentiment7.png)\n", "\n", "Finally, the linear layer takes the final hidden state and feeds it through a fully connected layer, $f(h_T)$, transforming it to the correct output dimension.\n", "\n", "The `forward` method is called when we feed examples into our model.\n", "\n", "Each batch, `text_len`, is a tuple containing a tensor of size _**[max_sentence length, batch size]**_ and a tensor of **batch_size** containing the true lengths of each sentence (remember, they won't necessarily be the same; some reviews are much longer than others). \n", "\n", "The first tensor in the tuple contains the ordered word indexes for each review in the batch. The act of converting a list of tokens into a list of indexes is commonly called *numericalizing*.\n", "\n", "The input batch is then passed through the embedding layer to get `embedded`, which gives us a dense vector representation of our sentences. `embedded` is a tensor of size _**[sentence length, batch size, embedding dim]**_. \n", "\n", "`embedded` is then fed into a function called `pack_padded_sequence` before being fed into the RNN. `pack_padded_sequence` is used to create a datastructure that allows the RNN to 'mask' off the padding during the BPTT process (we don't want to learn the padding, as this could drastically influence the results!). In some frameworks you must feed the initial hidden state, $h_0$, into the RNN, however in PyTorch, if no initial hidden state is passed as an argument it defaults to a tensor of all zeros.\n", "\n", "The RNN returns 2 tensors, `output` of size _**[sentence length, batch size, hidden dim]**_ and `hidden` of size _**[1, batch size, hidden dim]**_. `output` is the concatenation of the hidden state from every time step, whereas `hidden` is simply the final hidden state. \n", "\n", "Finally, we feed the last hidden state, `hidden`, through the linear layer, `fc`, to produce a prediction. Note the `squeeze` method, which is used to remove a dimension of size 1. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "c785b0842586d20d70510c6f8a2c6b39", "grade": false, "grade_id": "cell-fbb5f94b744dd6db", "locked": true, "schema_version": 1, "solution": false } }, "outputs": [], "source": [ "import torch.nn as nn\n", "\n", "class RNN(nn.Module):\n", " def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):\n", " super().__init__()\n", " \n", " self.embedding = nn.Embedding(input_dim, embedding_dim)\n", " self.rnn = nn.RNN(embedding_dim, hidden_dim)\n", " self.fc = nn.Linear(hidden_dim, output_dim)\n", " \n", " def forward(self, text, lengths):\n", " embedded = self.embedding(text)\n", " embedded = nn.utils.rnn.pack_padded_sequence(embedded, lengths, enforce_sorted=False)\n", " packed_output, hidden = self.rnn(embedded)\n", "\n", " return self.fc(hidden.squeeze(0))" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "3c834acdf19cf4124c69e3b92c9fbd91", "grade": false, "grade_id": "cell-8b4748e05072b330", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "We now create an instance of our RNN class. \n", "\n", "The input dimension is the dimension of the one-hot vectors, which is equal to the vocabulary size. \n", "\n", "The embedding dimension is the size of the dense word vectors. This is usually around 50-250 dimensions, but depends on the size of the vocabulary.\n", "\n", "The hidden dimension is the size of the hidden states. This is usually around 100-500 dimensions, but also depends on factors such as on the vocabulary size, the size of the dense vectors and the complexity of the task.\n", "\n", "The output dimension is usually the number of classes, however in the case of only 2 classes the output value is between 0 and 1 and thus can be 1-dimensional, i.e. a single scalar real number." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "a7f66791ac9794a7da9ee4e6fc743b24", "grade": false, "grade_id": "cell-751c2df54b71d158", "locked": true, "schema_version": 1, "solution": false } }, "outputs": [], "source": [ "INPUT_DIM = len(vocab)\n", "EMBEDDING_DIM = 50\n", "HIDDEN_DIM = 100\n", "OUTPUT_DIM = 1\n", "\n", "model = RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM)" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "591ccfef47e2bcc226185c87e568821a", "grade": false, "grade_id": "cell-daf23924e258a608", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "# Train the model\n", "\n", "Now we'll set up the training and then train the model.\n", "\n", "First, we'll create an optimizer. This is the algorithm we use to update the parameters of the module. Here, we'll use _stochastic gradient descent_ (SGD). The first argument is the parameters that will be updated by the optimizer, the second is the learning rate, i.e. how much we'll change the parameters by when we do a parameter update." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "4a0e4ee985e7d35fcf80b8b3eaabdbde", "grade": false, "grade_id": "cell-d7566606b6f480ec", "locked": true, "schema_version": 1, "solution": false } }, "outputs": [], "source": [ "import torch.optim as optim\n", "\n", "optimizer = optim.SGD(model.parameters(), lr=0.001)" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "072bc1b948afe036e31f60a05469fc27", "grade": false, "grade_id": "cell-8981a2c109df3c53", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "Next, we'll define our loss function. In PyTorch this is commonly called a criterion. \n", "\n", "The loss function here is _binary cross entropy with logits_. \n", "\n", "Our model currently outputs an unbound real number. As our labels are either 0 or 1, we want to restrict the predictions to a number between 0 and 1. We do this using the _sigmoid_ function. \n", "\n", "We then use this this bound scalar to calculate the loss using binary cross entropy. \n", "\n", "The `BCEWithLogitsLoss` criterion carries out both the sigmoid and the binary cross entropy steps." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "dee16f7d05d182806906fc2f3d6c4484", "grade": false, "grade_id": "cell-99068d084d5dfb73", "locked": true, "schema_version": 1, "solution": false } }, "outputs": [], "source": [ "criterion = nn.BCEWithLogitsLoss()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "7a7cef0ba9cb2d6b49111244fb8f5841", "grade": false, "grade_id": "cell-71e6ceff6ffba4f7", "locked": true, "schema_version": 1, "solution": false } }, "source": "Finally, before we can a Torchbearer trial to train the model:" }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "fd0f46de194f98826875d7431eee5918", "grade": false, "grade_id": "cell-f011ec7d73d7ef46", "locked": true, "schema_version": 1, "solution": false } }, "outputs": [], "source": [ "from torchbearer import Trial\n", "\n", "device = torch.device(\"cuda:0\" if torch.cuda.is_available() else \"cpu\")\n", "\n", "torchbearer_trial = Trial(model, optimizer, criterion, metrics=['acc', 'loss']).to(device)\n", "torchbearer_trial.with_generators(train_generator=train_dataloader,\n", " val_generator=valid_dataloader,\n", " test_generator=test_dataloader)\n", "torchbearer_trial.run(epochs=5)\n", "torchbearer_trial.predict()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "9853268a5e221536d6e9650e8a77d473", "grade": false, "grade_id": "cell-03b5931999ab62d1", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "__Use the box below to comment on and give insight into the performance of the above model:__" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "nbgrader": { "checksum": "615ab310b18718e361d870399422e5ae", "grade": true, "grade_id": "cell-5bf61cbc741af01b", "locked": false, "points": 5, "schema_version": 1, "solution": true } }, "source": [ "YOUR ANSWER HERE" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "fbeb865b3ebb7348e9d0e0e2b3f1e6d2", "grade": false, "grade_id": "cell-acff7e648aa99e42", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "Now try and build a better model. Rather than using a plain RNN, we'll instead use a (single layer) LSTM, and we'll use Adam with an initial learning rate of 0.01 as the optimiser. __Complete the following code to implement the improved model, and then train it:__" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "checksum": "ef8857370d3a96cdb46acea6f94f99ac", "grade": true, "grade_id": "cell-7c7913d0313ff2e8", "locked": false, "points": 5, "schema_version": 1, "solution": true } }, "outputs": [], "source": [ "class ImprovedRNN(nn.Module):\n", " def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):\n", " super().__init__()\n", " \n", " self.embedding = nn.Embedding(input_dim, embedding_dim)\n", " # YOUR CODE HERE\n", " raise NotImplementedError()\n", " self.fc = nn.Linear(hidden_dim, output_dim)\n", " \n", " def forward(self, text, lengths):\n", " embedded = self.embedding(text)\n", " embedded = nn.utils.rnn.pack_padded_sequence(embedded, lengths)\n", " \n", " # YOUR CODE HERE\n", " raise NotImplementedError()\n", " \n", "INPUT_DIM = len(vocab)\n", "EMBEDDING_DIM = 50\n", "HIDDEN_DIM = 100\n", "OUTPUT_DIM = 1\n", "\n", "imodel = ImprovedRNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM)\n", "\n", "# TODO: Train and evaluate the model\n", "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "a84da93b0d8631bd2279f0a0ca7cecf7", "grade": false, "grade_id": "cell-9da7d879835eafa4", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "__What do you observe about the performance of this model? What would you do next if you wanted to improve it further? Write your answers in the box below:__" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "checksum": "bd9fe3f4f88270f113a2731db2b1b2b7", "grade": true, "grade_id": "cell-856a0834622f664f", "locked": false, "points": 10, "schema_version": 1, "solution": true } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "21752b32fcb29f71d7da22bf21a14a51", "grade": false, "grade_id": "cell-f5e06c722621a0d2", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "## User Input\n", "\n", "We can now use our models to predict the sentiment of any sentence we give it. As it has been trained on movie reviews, the sentences provided should also be movie reviews.\n", "\n", "Our `predict_sentiment` function does a few things:\n", "- tokenizes the sentence, i.e. splits it from a raw string into a list of tokens\n", "- indexes the tokens by converting them into their integer representation from our vocabulary\n", "- converts the indexes, which are a Python list into a PyTorch tensor\n", "- add a batch dimension by `unsqueeze`ing \n", "- squashes the output prediction from a real number between 0 and 1 with the `sigmoid` function\n", "- converts the tensor holding a single value into an integer with the `item()` method\n", "\n", "We are expecting reviews with a negative sentiment to return a value close to 0 and positive reviews to return a value close to 1." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "03dd780275233d49f92dc46d21217de6", "grade": false, "grade_id": "cell-256f5d0cab3585a2", "locked": true, "schema_version": 1, "solution": false } }, "outputs": [], "source": [ "def predict_sentiment(model, sentence):\n", " tokenized = [tok for tok in tokenizer(sentence)]\n", " indexed = [vocab.get_stoi()[t] for t in tokenized]\n", " tensor = torch.LongTensor(indexed).to(device)\n", " tensor = tensor.unsqueeze(1)\n", " prediction = torch.sigmoid(model(tensor, torch.tensor([tensor.shape[0]])))\n", " return prediction.item()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "2f0ff29f9b4ff26535014b40ac64f159", "grade": false, "grade_id": "cell-44ca76b13f0ef977", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "An example negative review..." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "d05743968f5f1f6c0665d09c34659f71", "grade": false, "grade_id": "cell-85a9deecd4e90b8b", "locked": true, "schema_version": 1, "solution": false } }, "outputs": [], "source": [ "predict_sentiment(imodel, \"This film is terrible\")" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "f890392d505e39def4b5dc80d028019f", "grade": false, "grade_id": "cell-78424acf52854f0e", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "and an example positive review..." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "ff7023e20c6bba3f0a62bd3784853846", "grade": false, "grade_id": "cell-dc6a5b212298be67", "locked": true, "schema_version": 1, "solution": false } }, "outputs": [], "source": [ "predict_sentiment(imodel, \"This film is great\")" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "checksum": "3033258c6669e324eb0eba06b20b0275", "grade": false, "grade_id": "cell-433552e8e38ce037", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "__Use the box below to try classifying some of your own 'movie reviews':__" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.7" } }, "nbformat": 4, "nbformat_minor": 2 }