{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Assignment 2.3: Text classification via RNN (30 points)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this assignment you will perform sentiment analysis of IMDB reviews using an RNN. An additional goal is to learn the high-level abstractions of the **torchtext** module, which consists of data processing utilities and popular datasets for natural language." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import torch\n", "\n", "from torchtext import datasets\n", "\n", "from torchtext.data import Field, LabelField\n", "from torchtext.data import BucketIterator\n", "\n", "import torch.nn as nn\n", "import torch.nn.functional as F\n", "import torch.optim as optim" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Preparing Data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "TEXT = Field(sequential=True, lower=True)\n", "LABEL = LabelField()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train, tst = datasets.IMDB.splits(TEXT, LABEL)\n", "trn, vld = train.split()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "TEXT.build_vocab(trn)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "LABEL.build_vocab(trn)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`TEXT.vocab.freqs` is a `collections.Counter` object, so we can take a look at the most frequent words." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "TEXT.vocab.freqs.most_common(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creating the Iterator (2 points)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "During training, we'll be using a special kind of Iterator, called the **BucketIterator**. When we pass data into a neural network, we want the sequences to be padded to the same length so that we can process them as a batch:\n", "\n", "e.g.\n", "\\[ \n", "\\[3, 15, 2, 7\\],\n", "\\[4, 1\\], \n", "\\[5, 5, 6, 8, 1\\] \n", "\\] -> \\[ \n", "\\[3, 15, 2, 7, **0**\\],\n", "\\[4, 1, **0**, **0**, **0**\\], \n", "\\[5, 5, 6, 8, 1\\] \n", "\\] \n", "\n", "If the sequences differ greatly in length, padding wastes a lot of memory and time. The BucketIterator groups sequences of similar lengths together for each batch to minimize padding.\n", "\n", "Complete the definition of the **BucketIterator** object." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_iter, val_iter, test_iter = BucketIterator.splits(\n", "    (trn, vld, tst),\n", "    batch_sizes=(64, 64, 64),\n", "    sort=False,\n", "    sort_key=,  # write your code here\n", "    sort_within_batch=False,\n", "    device='cuda',\n", "    repeat=False\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a look at the output of the BucketIterator. Do not be surprised by its shape: unless the `Field` is created with **batch_first=True**, the batch dimension comes second." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "batch = next(iter(train_iter)); batch.text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The batch has all the fields we passed to the Dataset as attributes. Each field's data can be accessed through the attribute of the same name." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "batch.__dict__.keys()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Define the RNN-based text classification model (10 points)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Start simple. Implement the model according to the schema below. \n", "![alt text](https://miro.medium.com/max/1396/1*v-tLYQCsni550A-hznS0mw.jpeg)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class RNNBaseline(nn.Module):\n", "    def __init__(self, hidden_dim, emb_dim):\n", "        super().__init__()\n", "        # =============================\n", "        # Write code here\n", "        # =============================\n", "\n", "    def forward(self, seq):\n", "        # =============================\n", "        # Write code here\n", "        # =============================\n", "        return preds" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "em_sz = 200\n", "nh = 300\n", "model = RNNBaseline(nh, emb_dim=em_sz); model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you're using a GPU, remember to call `model.cuda()` to move your model to the GPU." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model.cuda()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The training loop (3 points)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Define the optimizer and the loss function." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "opt = # your code goes here\n", "loss_func = # your code goes here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Define the stopping criterion (the number of training epochs)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "epochs = # your code goes here" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "for epoch in range(1, epochs + 1):\n", "    running_loss = 0.0\n", "    running_corrects = 0\n", "    model.train()\n", "    for batch in train_iter:\n", "\n", "        x = batch.text\n", "        y = batch.label\n", "\n", "        opt.zero_grad()\n", "        preds = model(x)\n", "        loss = loss_func(preds, y)\n", "        loss.backward()\n", "        opt.step()\n", "        # loss_func is assumed to return a batch mean, so weight it by the batch size\n", "        running_loss += loss.item() * y.size(0)\n", "\n", "    # average loss per training example\n", "    epoch_loss = running_loss / len(trn)\n", "\n", "    val_loss = 0.0\n", "    model.eval()\n", "    # no gradients are needed for validation\n", "    with torch.no_grad():\n", "        for batch in val_iter:\n", "\n", "            x = batch.text\n", "            y = batch.label\n", "\n", "            preds = model(x)\n", "            loss = loss_func(preds, y)\n", "            val_loss += loss.item() * y.size(0)\n", "\n", "    # average loss per validation example\n", "    val_loss /= len(vld)\n", "    print('Epoch: {}, Training Loss: {}, Validation Loss: {}'.format(epoch, epoch_loss, val_loss))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Calculate performance of the trained model (5 points)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for batch in test_iter:\n", "    x = batch.text\n", "    y = batch.label" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Write down the calculated performance:\n", "\n", "### Accuracy:\n", "### Precision:\n", "### Recall:\n", "### F1:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Experiments (10 points)\n", "\n", "Experiment with the model to achieve better results. You can find advice [here](https://arxiv.org/abs/1801.06146). Implement and describe your experiments in detail, and note what helped." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1. ?\n", "### 2. ?\n", "### 3. ?" 
] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 4 }