{ "cells": [ { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "-Jv7Y4hXwt0j" }, "source": [ "# Assignment 2: Deep N-grams\n", "\n", "Welcome to the second assignment of course 3. In this assignment you will explore Recurrent Neural Networks `RNN`.\n", "- You will be using the fundamentals of google's [trax](https://github.com/google/trax) package to implement any kind of deeplearning model. \n", "\n", "By completing this assignment, you will learn how to implement models from scratch:\n", "- How to convert a line of text into a tensor\n", "- Create an iterator to feed data to the model\n", "- Define a GRU model using `trax`\n", "- Train the model using `trax`\n", "- Compute the accuracy of your model using the perplexity\n", "- Predict using your own model" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "V8_3fIOdkGv1" }, "source": [ "## Outline\n", "\n", "- [Overview](#0)\n", "- [Part 1: Importing the Data](#1)\n", " - [1.1 Loading in the data](#1.1)\n", " - [1.2 Convert a line to tensor](#1.2)\n", " - [Exercise 01](#ex01)\n", " - [1.3 Batch generator](#1.3)\n", " - [Exercise 02](#ex02)\n", " - [1.4 Repeating Batch generator](#1.4) \n", "- [Part 2: Defining the GRU model](#2)\n", " - [Exercise 03](#ex03)\n", "- [Part 3: Training](#3)\n", " - [3.1 Training the Model](#3.1)\n", " - [Exercise 04](#ex04)\n", "- [Part 4: Evaluation](#4)\n", " - [4.1 Evaluating using the deep nets](#4.1)\n", " - [Exercise 05](#ex05)\n", "- [Part 5: Generating the language with your own model](#5) \n", "- [Summary](#6)\n" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "JP0inrk5kGv3" }, "source": [ "\n", "### Overview\n", "\n", "Your task will be to predict the next set of characters using the previous characters. \n", "- Although this task sounds simple, it is pretty useful.\n", "- You will start by converting a line of text into a tensor\n", "- Then you will create a generator to feed data into the model\n", "- You will train a neural network in order to predict the new set of characters of defined length. \n", "- You will use embeddings for each character and feed them as inputs to your model. \n", " - Many natural language tasks rely on using embeddings for predictions. \n", "- Your model will convert each character to its embedding, run the embeddings through a Gated Recurrent Unit `GRU`, and run it through a linear layer to predict the next set of characters.\n", "\n", "\n", "\n", "The figure above gives you a summary of what you are about to implement. \n", "- You will get the embeddings;\n", "- Stack the embeddings on top of each other;\n", "- Run them through two layers with a relu activation in the middle;\n", "- Finally, you will compute the softmax. \n", "\n", "To predict the next character:\n", "- Use the softmax output and identify the word with the highest probability.\n", "- The word with the highest probability is the prediction for the next word." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 1000 }, "colab_type": "code", "id": "RVSwzQ5Bwt0m", "outputId": "9b51a13e-cf54-457f-e1ea-2574f9d67453" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:tokens_length=568 inputs_length=512 targets_length=114 noise_density=0.15 mean_noise_span_length=3.0 \n" ] } ], "source": [ "import os\n", "import trax\n", "import trax.fastmath.numpy as np\n", "import pickle\n", "import numpy\n", "import random as rnd\n", "from trax import fastmath\n", "from trax import layers as tl\n", "\n", "# set random seed\n", "trax.supervised.trainer_lib.init_random_number_generators(32)\n", "rnd.seed(32)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "4sF9Hqzgwt0l" }, "source": [ "\n", "# Part 1: Importing the Data\n", "\n", "\n", "### 1.1 Loading in the data\n", "\n", "\n", "\n", "Now import the dataset and do some processing. \n", "- The dataset has one sentence per line.\n", "- You will be doing character generation, so you have to process each sentence by converting each **character** (and not word) to a number. \n", "- You will use the `ord` function to convert a unique character to a unique integer ID. \n", "- Store each line in a list.\n", "- Create a data generator that takes in the `batch_size` and the `max_length`. \n", " - The `max_length` corresponds to the maximum length of the sentence." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "colab": {}, "colab_type": "code", "id": "5sDs36m81g6f" }, "outputs": [], "source": [ "dirname = 'data/'\n", "lines = [] # storing all the lines in a variable. \n", "for filename in os.listdir(dirname):\n", " with open(os.path.join(dirname, filename)) as files:\n", " for line in files:\n", " # remove leading and trailing whitespace\n", " pure_line = line.strip()\n", " \n", " # if pure_line is not the empty string,\n", " if pure_line:\n", " # append it to the list\n", " lines.append(pure_line)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 68 }, "colab_type": "code", "id": "-zMCe7aJkGwA", "outputId": "c0eace05-246a-47d9-9542-f009a4940836" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of lines: 125097\n", "Sample line at position 0 A LOVER'S COMPLAINT\n", "Sample line at position 999 With this night's revels and expire the term\n" ] } ], "source": [ "n_lines = len(lines)\n", "print(f\"Number of lines: {n_lines}\")\n", "print(f\"Sample line at position 0 {lines[0]}\")\n", "print(f\"Sample line at position 999 {lines[999]}\")" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "G6XsiyHvkGwD" }, "source": [ "Notice that the letters are both uppercase and lowercase. In order to reduce the complexity of the task, we will convert all characters to lowercase. This way, the model only needs to predict the likelihood that a letter is 'a' and not decide between uppercase 'A' and lowercase 'a'." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 68 }, "colab_type": "code", "id": "UBO9jI8EkGwE", "outputId": "55b55d61-a5b1-4381-ff88-d146071ac671" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of lines: 125097\n", "Sample line at position 0 a lover's complaint\n", "Sample line at position 999 with this night's revels and expire the term\n" ] } ], "source": [ "# go through each line\n", "for i, line in enumerate(lines):\n", " # convert to all lowercase\n", " lines[i] = line.lower()\n", "\n", "print(f\"Number of lines: {n_lines}\")\n", "print(f\"Sample line at position 0 {lines[0]}\")\n", "print(f\"Sample line at position 999 {lines[999]}\")" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of lines for training: 124097\n", "Number of lines for validation: 1000\n" ] } ], "source": [ "eval_lines = lines[-1000:] # Create a holdout validation set\n", "lines = lines[:-1000] # Leave the rest for training\n", "\n", "print(f\"Number of lines for training: {len(lines)}\")\n", "print(f\"Number of lines for validation: {len(eval_lines)}\")" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "BDcxEmX31y3d" }, "source": [ "\n", "### 1.2 Convert a line to tensor\n", "\n", "Now that you have your list of lines, you will convert each character in that list to a number. You can use Python's `ord` function to do it. \n", "\n", "Given a string representing of one Unicode character, the `ord` function return an integer representing the Unicode code point of that character.\n", "\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 187 }, "colab_type": "code", "id": "Cc_B8ae3kGwI", "outputId": "94eb6798-827a-4494-dec0-7bb84523ce34" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ord('a'): 97\n", "ord('b'): 98\n", "ord('c'): 99\n", "ord(' '): 32\n", "ord('x'): 120\n", "ord('y'): 121\n", "ord('z'): 122\n", "ord('1'): 49\n", "ord('2'): 50\n", "ord('3'): 51\n" ] } ], "source": [ "# View the unique unicode integer associated with each character\n", "print(f\"ord('a'): {ord('a')}\")\n", "print(f\"ord('b'): {ord('b')}\")\n", "print(f\"ord('c'): {ord('c')}\")\n", "print(f\"ord(' '): {ord(' ')}\")\n", "print(f\"ord('x'): {ord('x')}\")\n", "print(f\"ord('y'): {ord('y')}\")\n", "print(f\"ord('z'): {ord('z')}\")\n", "print(f\"ord('1'): {ord('1')}\")\n", "print(f\"ord('2'): {ord('2')}\")\n", "print(f\"ord('3'): {ord('3')}\")" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "ZWB9qOLOkGwL" }, "source": [ "\n", "### Exercise 01\n", "\n", "**Instructions:** Write a function that takes in a single line and transforms each character into its unicode integer. This returns a list of integers, which we'll refer to as a tensor.\n", "- Use a special integer to represent the end of the sentence (the end of the line).\n", "- This will be the EOS_int (end of sentence integer) parameter of the function.\n", "- Include the EOS_int as the last integer of the \n", "- For this exercise, you will use the number `1` to represent the end of a sentence." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": { "colab": {}, "colab_type": "code", "id": "IO4NSPkOITNK" }, "outputs": [], "source": [ "# UNQ_C1 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)\n", "# GRADED FUNCTION: line_to_tensor\n", "def line_to_tensor(line, EOS_int=1):\n", " \"\"\"Turns a line of text into a tensor\n", "\n", " Args:\n", " line (str): A single line of text.\n", " EOS_int (int, optional): End-of-sentence integer. Defaults to 1.\n", "\n", " Returns:\n", " list: a list of integers (unicode values) for the characters in the `line`.\n", " \"\"\"\n", " \n", " # Initialize the tensor as an empty list\n", " tensor = []\n", " ### START CODE HERE (Replace instances of 'None' with your code) ###\n", " # for each character:\n", " for c in line:\n", " \n", " # convert to unicode int\n", " c_int = ord(c)\n", " \n", " # append the unicode integer to the tensor list\n", " tensor.append(c_int)\n", " \n", " # include the end-of-sentence integer\n", " tensor.append(EOS_int)\n", " ### END CODE HERE ###\n", "\n", " return tensor" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "D9Z_vtI7tTcw", "outputId": "0423ad21-af3e-4e6d-a558-472f4bf5f964" }, "outputs": [ { "data": { "text/plain": [ "[97, 98, 99, 32, 120, 121, 122, 1]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Testing your output\n", "line_to_tensor('abc xyz')" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "7MwEspKCtTc4" }, "source": [ "##### Expected Output\n", "```CPP\n", "[97, 98, 99, 32, 120, 121, 122, 1]\n", "```" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "iFOR19cX2TQs" }, "source": [ "\n", "### 1.3 Batch generator \n", "\n", "Most of the time in Natural Language Processing, and AI in general we use batches when training our data sets. Here, you will build a data generator that takes in a text and returns a batch of text lines (lines are sentences).\n", "- The generator converts text lines (sentences) into numpy arrays of integers padded by zeros so that all arrays have the same length, which is the length of the longest sentence in the entire data set.\n", "\n", "Once you create the generator, you can iterate on it like this:\n", "\n", "```\n", "next(data_generator)\n", "```\n", "\n", "This generator returns the data in a format that you could directly use in your model when computing the feed-forward of your algorithm. This iterator returns a batch of lines and per token mask. The batch is a tuple of three parts: inputs, targets, mask. The inputs and targets are identical. The second column will be used to evaluate your predictions. Mask is 1 for non-padding tokens.\n", "\n", "\n", "### Exercise 02\n", "**Instructions:** Implement the data generator below. Here are some things you will need. \n", "\n", "- While True loop: this will yield one batch at a time.\n", "- if index >= num_lines, set index to 0. \n", "- The generator should return shuffled batches of data. To achieve this without modifying the actual lines a list containing the indexes of `data_lines` is created. This list can be shuffled and used to get random batches everytime the index is reset.\n", "- if len(line) < max_length append line to cur_batch.\n", " - Note that a line that has length equal to max_length should not be appended to the batch. 
\n", " - This is because when converting the characters into a tensor of integers, an additional end of sentence token id will be added. \n", " - So if max_length is 5, and a line has 4 characters, the tensor representing those 4 characters plus the end of sentence character will be of length 5, which is the max length.\n", "- if len(cur_batch) == batch_size, go over every line, convert it to an int and store it.\n", "\n", "**Remember that when calling np you are really calling trax.fastmath.numpy which is trax’s version of numpy that is compatible with JAX. As a result of this, where you used to encounter the type numpy.ndarray now you will find the type jax.interpreters.xla.DeviceArray.**" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "_ekSEQlvtTc7" }, "source": [ "
\n", "\n", " Hints\n", "\n", "

\n", "

\n", "

" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "colab": {}, "colab_type": "code", "id": "OMingz5xzrGD" }, "outputs": [], "source": [ "# UNQ_C2 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)\n", "# GRADED FUNCTION: data_generator\n", "def data_generator(batch_size, max_length, data_lines, line_to_tensor=line_to_tensor, shuffle=True):\n", " \"\"\"Generator function that yields batches of data\n", "\n", " Args:\n", " batch_size (int): number of examples (in this case, sentences) per batch.\n", " max_length (int): maximum length of the output tensor.\n", " NOTE: max_length includes the end-of-sentence character that will be added\n", " to the tensor. \n", " Keep in mind that the length of the tensor is always 1 + the length\n", " of the original line of characters.\n", " data_lines (list): list of the sentences to group into batches.\n", " line_to_tensor (function, optional): function that converts line to tensor. Defaults to line_to_tensor.\n", " shuffle (bool, optional): True if the generator should generate random batches of data. Defaults to True.\n", "\n", " Yields:\n", " tuple: two copies of the batch (jax.interpreters.xla.DeviceArray) and mask (jax.interpreters.xla.DeviceArray).\n", " NOTE: jax.interpreters.xla.DeviceArray is trax's version of numpy.ndarray\n", " \"\"\"\n", " # initialize the index that points to the current position in the lines index array\n", " index = 0\n", " \n", " # initialize the list that will contain the current batch\n", " cur_batch = []\n", " \n", " # count the number of lines in data_lines\n", " num_lines = len(data_lines)\n", " \n", " # create an array with the indexes of data_lines that can be shuffled\n", " lines_index = [*range(num_lines)]\n", " \n", " # shuffle line indexes if shuffle is set to True\n", " if shuffle:\n", " rnd.shuffle(lines_index)\n", " \n", " ### START CODE HERE (Replace instances of 'None' with your code) ###\n", " while True:\n", " \n", " # if the index is greater or equal than to the number of lines in data_lines\n", " if index >= num_lines:\n", " # then reset the index to 0\n", " index = 0\n", " # shuffle line indexes if shuffle is set to True\n", " if shuffle:\n", " rnd.shuffle(lines_index)\n", " \n", " # get a line at the `lines_index[index]` position in data_lines\n", " line = data_lines[lines_index[index]]\n", " \n", " # if the length of the line is less than max_length\n", " if len(line) < max_length:\n", " # append the line to the current batch\n", " cur_batch.append(line)\n", " \n", " # increment the index by one\n", " index += 1\n", " \n", " # if the current batch is now equal to the desired batch size\n", " if len(cur_batch) == batch_size:\n", " \n", " batch = []\n", " mask = []\n", " \n", " # go through each line (li) in cur_batch\n", " for li in cur_batch:\n", " # convert the line (li) to a tensor of integers\n", " tensor = line_to_tensor(li)\n", " \n", " # Create a list of zeros to represent the padding\n", " # so that the tensor plus padding will have length `max_length`\n", " pad = [0] * (max_length - len(tensor))\n", " \n", " # combine the tensor plus pad\n", " tensor_pad = tensor + pad\n", " \n", " # append the padded tensor to the batch\n", " batch.append(tensor_pad)\n", "\n", " # A mask for tensor_pad is 1 wherever tensor_pad is not\n", " # 0 and 0 wherever tensor_pad is 0, i.e. 
if tensor_pad is\n", " # [1, 2, 3, 0, 0, 0] then example_mask should be\n", " # [1, 1, 1, 0, 0, 0]\n", " # Hint: Use a list comprehension for this\n", " example_mask = [0 if t == 0 else 1 for t in tensor_pad]\n", " mask.append(example_mask)\n", " \n", " # convert the batch (data type list) to a trax's numpy array\n", " batch_np_arr = np.array(batch)\n", " mask_np_arr = np.array(mask)\n", " \n", " ### END CODE HERE ###\n", " \n", " # Yield two copies of the batch and mask.\n", " yield batch_np_arr, batch_np_arr, mask_np_arr\n", " \n", " # reset the current batch to an empty list\n", " cur_batch = []\n", " " ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 119 }, "colab_type": "code", "id": "XJ1HBoEHwt0p", "outputId": "4fbe7f04-5ce9-49c5-f1dd-0136cc28a114", "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "(DeviceArray([[49, 50, 51, 52, 53, 54, 55, 56, 57, 1],\n", " [50, 51, 52, 53, 54, 55, 56, 57, 48, 1]], dtype=int32),\n", " DeviceArray([[49, 50, 51, 52, 53, 54, 55, 56, 57, 1],\n", " [50, 51, 52, 53, 54, 55, 56, 57, 48, 1]], dtype=int32),\n", " DeviceArray([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],\n", " [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int32))" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Try out your data generator\n", "tmp_lines = ['12345678901', #length 11\n", " '123456789', # length 9\n", " '234567890', # length 9\n", " '345678901'] # length 9\n", "\n", "# Get a batch size of 2, max length 10\n", "tmp_data_gen = data_generator(batch_size=2, \n", " max_length=10, \n", " data_lines=tmp_lines,\n", " shuffle=False)\n", "\n", "# get one batch\n", "tmp_batch = next(tmp_data_gen)\n", "\n", "# view the batch\n", "tmp_batch" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "1rWMOk7ikGwZ" }, "source": [ "##### Expected output\n", "\n", "```CPP\n", "(DeviceArray([[49, 50, 51, 52, 53, 54, 55, 56, 57, 1],\n", " [50, 51, 52, 53, 54, 55, 56, 57, 48, 1]], dtype=int32),\n", " DeviceArray([[49, 50, 51, 52, 53, 54, 55, 56, 57, 1],\n", " [50, 51, 52, 53, 54, 55, 56, 57, 48, 1]], dtype=int32),\n", " DeviceArray([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],\n", " [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int32))\n", "```" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "D-M0U9GDwt0r" }, "source": [ "Now that you have your generator, you can just call it and it will return tensors which correspond to your lines from Shakespeare. The first two elements of the tuple are identical. Now you can go ahead and start building your neural network. " ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "gFcB-i-rDd68" }, "source": [ "\n", "### 1.4 Repeating Batch generator \n", "\n", "The way the iterator is currently defined, it will keep providing batches forever.\n", "\n", "Although it is not needed here, we want to show you the `itertools.cycle` function, which is really useful when a generator eventually stops.\n", "\n", "Note that you are expected to use this function within the training function further below.\n", "\n", "Usually we want to cycle over the dataset multiple times during training (i.e. train for multiple *epochs*).\n", "\n", "For small datasets we can use [`itertools.cycle`](https://docs.python.org/3.8/library/itertools.html#itertools.cycle) to achieve this easily."
] }, { "cell_type": "code", "execution_count": 11, "metadata": { "colab": {}, "colab_type": "code", "id": "v589leeZETy7" }, "outputs": [], "source": [ "import itertools\n", "\n", "infinite_data_generator = itertools.cycle(\n", " data_generator(batch_size=2, max_length=10, data_lines=tmp_lines))" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "RWvsSxDUFB0p" }, "source": [ "You can see that we can get more than the 5 lines in tmp_lines using this." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "0lJhBPgJFAxb", "outputId": "0849fa48-0d82-4050-b3b4-e738d96a7ca8" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10\n" ] } ], "source": [ "ten_lines = [next(infinite_data_generator) for _ in range(10)]\n", "print(len(ten_lines))" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "KmZRBoaMwt0w" }, "source": [ "\n", "\n", "# Part 2: Defining the GRU model\n", "\n", "Now that you have the input and output tensors, you will go ahead and initialize your model. You will be implementing the `GRULM`, gated recurrent unit model. To implement this model, you will be using google's `trax` package. Instead of making you implement the `GRU` from scratch, we will give you the necessary methods from a build in package. You can use the following packages when constructing the model: \n", "\n", "\n", "- `tl.Serial`: Combinator that applies layers serially (by function composition). [docs](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.combinators.Serial) / [source code](https://github.com/google/trax/blob/1372b903bb66b0daccee19fd0b1fdf44f659330b/trax/layers/combinators.py#L26)\n", " - You can pass in the layers as arguments to `Serial`, separated by commas. \n", " - For example: `tl.Serial(tl.Embeddings(...), tl.Mean(...), tl.Dense(...), tl.LogSoftmax(...))`\n", "\n", "___\n", "\n", "- `tl.ShiftRight`: Allows the model to go right in the feed forward. [docs](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.attention.ShiftRight) / [source code](https://github.com/google/trax/blob/1372b903bb66b0daccee19fd0b1fdf44f659330b/trax/layers/attention.py#L297)\n", " - `ShiftRight(n_shifts=1, mode='train')` layer to shift the tensor to the right n_shift times\n", " - Here in the exercise you only need to specify the mode and not worry about n_shifts\n", "\n", "___\n", "\n", "- `tl.Embedding`: Initializes the embedding. In this case it is the size of the vocabulary by the dimension of the model. [docs](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.core.Embedding) / [source code](https://github.com/google/trax/blob/1372b903bb66b0daccee19fd0b1fdf44f659330b/trax/layers/core.py#L113) \n", " - `tl.Embedding(vocab_size, d_feature)`.\n", " - `vocab_size` is the number of unique words in the given vocabulary.\n", " - `d_feature` is the number of elements in the word embedding (some choices for a word embedding size range from 150 to 300, for example).\n", "___\n", "\n", "- `tl.GRU`: `Trax` GRU layer. 
[docs](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.rnn.GRU) / [source code](https://github.com/google/trax/blob/1372b903bb66b0daccee19fd0b1fdf44f659330b/trax/layers/rnn.py#L143)\n", " - `GRU(n_units)` builds a traditional GRU with `n_units` cells and dense internal transformations.\n", " - `GRU` paper: https://arxiv.org/abs/1412.3555\n", "___\n", "\n", "- `tl.Dense`: A dense layer. [docs](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.core.Dense) / [source code](https://github.com/google/trax/blob/1372b903bb66b0daccee19fd0b1fdf44f659330b/trax/layers/core.py#L28)\n", " - `tl.Dense(n_units)`: The parameter `n_units` is the number of units chosen for this dense layer.\n", "___\n", "\n", "- `tl.LogSoftmax`: Log of the output probabilities. [docs](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.core.LogSoftmax) / [source code](https://github.com/google/trax/blob/1372b903bb66b0daccee19fd0b1fdf44f659330b/trax/layers/core.py#L242)\n", " - Here, you don't need to set any parameters for `LogSoftmax()`.\n", "___\n", "\n", "\n", "### Exercise 03\n", "**Instructions:** Implement the `GRULM` function below. You should be using all the layers explained above.\n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "colab": {}, "colab_type": "code", "id": "hww76f8_wt0x" }, "outputs": [], "source": [ "# UNQ_C3 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)\n", "# GRADED FUNCTION: GRULM\n", "def GRULM(vocab_size=256, d_model=512, n_layers=2, mode='train'):\n", " \"\"\"Returns a GRU language model.\n", "\n", " Args:\n", " vocab_size (int, optional): Size of the vocabulary. Defaults to 256.\n", " d_model (int, optional): Depth of embedding (n_units in the GRU cell). Defaults to 512.\n", " n_layers (int, optional): Number of GRU layers. Defaults to 2.\n", " mode (str, optional): 'train', 'eval' or 'predict', predict mode is for fast inference. 
Defaults to \"train\".\n", "\n", " Returns:\n", " trax.layers.combinators.Serial: A GRU language model as a layer that maps from a tensor of tokens to activations over a vocab set.\n", " \"\"\"\n", " ### START CODE HERE (Replace instances of 'None' with your code) ###\n", " model = tl.Serial(\n", " tl.ShiftRight(mode=mode), # Stack the ShiftRight layer\n", " tl.Embedding(vocab_size=vocab_size, d_feature=d_model), # Stack the embedding layer\n", " [tl.GRU(n_units=d_model) for _ in range(n_layers)], # Stack GRU layers of d_model units keeping n_layer parameter in mind (use list comprehension syntax)\n", " tl.Dense(n_units=vocab_size), # Dense layer\n", " tl.LogSoftmax() # Log Softmax\n", " )\n", " ### END CODE HERE ###\n", " return model\n" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 153 }, "colab_type": "code", "id": "kvQ_jf52-JAn", "outputId": "9d13c577-f89d-427a-8944-a00ca57a4f2c" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Serial[\n", " ShiftRight(1)\n", " Embedding_256_512\n", " GRU_512\n", " GRU_512\n", " Dense_256\n", " LogSoftmax\n", "]\n" ] } ], "source": [ "# testing your model\n", "model = GRULM()\n", "print(model)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "3K8UKp48kGwi" }, "source": [ "##### Expected output\n", "\n", "```CPP\n", "Serial[\n", " ShiftRight(1)\n", " Embedding_256_512\n", " GRU_512\n", " GRU_512\n", " Dense_256\n", " LogSoftmax\n", "]\n", "```" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "lsvjaCQ6wt02" }, "source": [ "\n", "# Part 3: Training\n", "\n", "Now you are going to train your model. As usual, you have to define the cost function, the optimizer, and decide whether you will be training it on a `gpu` or `cpu`. You also have to feed in a built model. Before, going into the training, we re-introduce the `TrainTask` and `EvalTask` abstractions from the last week's assignment.\n", "\n", "To train a model on a task, Trax defines an abstraction `trax.supervised.training.TrainTask` which packages the train data, loss and optimizer (among other things) together into an object.\n", "\n", "Similarly to evaluate a model, Trax defines an abstraction `trax.supervised.training.EvalTask` which packages the eval data and metrics (among other things) into another object.\n", "\n", "The final piece tying things together is the `trax.supervised.training.Loop` abstraction that is a very simple and flexible way to put everything together and train the model, all the while evaluating it and saving checkpoints.\n", "Using `training.Loop` will save you a lot of code compared to always writing the training loop by hand, like you did in courses 1 and 2. More importantly, you are less likely to have a bug in that code that would ruin your training." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "colab": {}, "colab_type": "code", "id": "5Birerv82mLv" }, "outputs": [], "source": [ "batch_size = 32\n", "max_length = 64" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "VxRzCDPJFGxo" }, "source": [ "An `epoch` is traditionally defined as one pass through the dataset.\n", "\n", "Since the dataset was divided in `batches` you need several `steps` (gradient evaluations) in order to complete an `epoch`. So, one `epoch` corresponds to the number of examples in a `batch` times the number of `steps`. In short, in each `epoch` you go over all the dataset. 
\n", "\n", "The `max_length` variable defines the maximum length of lines to be used in training our data, lines longer that that length are discarded. \n", "\n", "Below is a function and results that indicate how many lines conform to our criteria of maximum length of a sentence in the entire dataset and how many `steps` are required in order to cover the entire dataset which in turn corresponds to an `epoch`." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 68 }, "colab_type": "code", "id": "T3NxHd-VtTcb", "outputId": "b945b175-3101-4600-ac2f-8c705c13d752" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of used lines from the dataset: 25881\n", "Batch size (a power of 2): 32\n", "Number of steps to cover one epoch: 808\n" ] } ], "source": [ "def n_used_lines(lines, max_length):\n", " '''\n", " Args: \n", " lines: all lines of text an array of lines\n", " max_length - max_length of a line in order to be considered an int\n", " output_dir - folder to save your file an int\n", " Return:\n", " number of efective examples\n", " '''\n", "\n", " n_lines = 0\n", " for l in lines:\n", " if len(l) <= max_length:\n", " n_lines += 1\n", " return n_lines\n", "\n", "num_used_lines = n_used_lines(lines, 32)\n", "print('Number of used lines from the dataset:', num_used_lines)\n", "print('Batch size (a power of 2):', int(batch_size))\n", "steps_per_epoch = int(num_used_lines/batch_size)\n", "print('Number of steps to cover one epoch:', steps_per_epoch)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "rwUM5RwHHYP8" }, "source": [ "**Expected output:** \n", "\n", "Number of used lines from the dataset: 25881\n", "\n", "Batch size (a power of 2): 32\n", "\n", "Number of steps to cover one epoch: 808" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "IgFMfH5awt07" }, "source": [ "\n", "### 3.1 Training the model\n", "\n", "You will now write a function that takes in your model and trains it. To train your model you have to decide how many times you want to iterate over the entire data set. \n", "\n", "\n", "### Exercise 04\n", "\n", "**Instructions:** Implement the `train_model` program below to train the neural network above. 
Here is a list of things you should do:\n", "\n", "- Create a `trax.supervised.training.TrainTask` object; this encapsulates the aspects of the dataset and the problem at hand:\n", " - labeled_data = the labeled data that we want to *train* on.\n", " - loss_fn = [tl.CrossEntropyLoss()](https://trax-ml.readthedocs.io/en/latest/trax.layers.html?highlight=CrossEntropyLoss#trax.layers.metrics.CrossEntropyLoss)\n", " - optimizer = [trax.optimizers.Adam()](https://trax-ml.readthedocs.io/en/latest/trax.optimizers.html?highlight=Adam#trax.optimizers.adam.Adam) with learning rate = 0.0005\n", "\n", "- Create a `trax.supervised.training.EvalTask` object; this encapsulates aspects of evaluating the model:\n", " - labeled_data = the labeled data that we want to *evaluate* on.\n", " - metrics = [tl.CrossEntropyLoss()](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.metrics.CrossEntropyLoss) and [tl.Accuracy()](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.metrics.Accuracy)\n", " - How frequently we want to evaluate and checkpoint the model.\n", "\n", "- Create a `trax.supervised.training.Loop` object; this encapsulates the following:\n", " - The previously created `TrainTask` and `EvalTask` objects.\n", " - the training model = [GRULM](#ex03)\n", " - optionally the evaluation model, if different from the training model. NOTE: in the presence of Dropout etc., we usually want the evaluation model to behave slightly differently from the training model.\n", "\n", "You will be using cross entropy loss with the Adam optimizer. Please read the [trax](https://trax-ml.readthedocs.io/en/latest/index.html) documentation to get a full understanding. Make sure you use the number of steps provided as a parameter to train for the desired number of steps.\n", "\n", "**NOTE:** Don't forget to wrap the data generator in `itertools.cycle` to iterate on it for multiple epochs." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "colab": {}, "colab_type": "code", "id": "_kbtfz4T_m7x" }, "outputs": [], "source": [ "from trax.supervised import training\n", "\n", "# UNQ_C4 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)\n", "# GRADED FUNCTION: train_model\n", "def train_model(model, data_generator, batch_size=32, max_length=64, lines=lines, eval_lines=eval_lines, n_steps=1, output_dir='model/'): \n", " \"\"\"Function that trains the model\n", "\n", " Args:\n", " model (trax.layers.combinators.Serial): GRU model.\n", " data_generator (function): Data generator function.\n", " batch_size (int, optional): Number of lines per batch. Defaults to 32.\n", " max_length (int, optional): Maximum length allowed for a line to be processed. Defaults to 64.\n", " lines (list, optional): List of lines to use for training. Defaults to lines.\n", " eval_lines (list, optional): List of lines to use for evaluation. Defaults to eval_lines.\n", " n_steps (int, optional): Number of steps to train. Defaults to 1.\n", " output_dir (str, optional): Relative path of directory to save model. 
Defaults to \"model/\".\n", "\n", " Returns:\n", " trax.supervised.training.Loop: Training loop for the model.\n", " \"\"\"\n", " \n", " ### START CODE HERE (Replace instances of 'None' with your code) ###\n", " bare_train_generator = data_generator(batch_size, max_length, data_lines=lines)\n", " infinite_train_generator = itertools.cycle(bare_train_generator)\n", " \n", " bare_eval_generator = data_generator(batch_size, max_length, data_lines=eval_lines)\n", " infinite_eval_generator = itertools.cycle(bare_eval_generator)\n", " \n", " train_task = training.TrainTask(\n", " labeled_data=infinite_train_generator, # Use infinite train data generator\n", " loss_layer=tl.CrossEntropyLoss(), # Don't forget to instantiate this object\n", " optimizer=trax.optimizers.Adam(0.0005) # Don't forget to add the learning rate parameter\n", " )\n", "\n", " eval_task = training.EvalTask(\n", " labeled_data=infinite_eval_generator, # Use infinite eval data generator\n", " metrics=[tl.CrossEntropyLoss(), tl.Accuracy()], # Don't forget to instantiate these objects\n", " n_eval_batches=3 # For better evaluation accuracy in reasonable time\n", " )\n", " \n", " training_loop = training.Loop(model,\n", " train_task,\n", " eval_task=eval_task,\n", " output_dir=output_dir)\n", "\n", " training_loop.run(n_steps=n_steps)\n", " \n", " ### END CODE HERE ###\n", " \n", " # We return this because it contains a handle to the model, which has the weights etc.\n", " return training_loop\n" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 68 }, "colab_type": "code", "id": "SwP646GpK3pD", "outputId": "4c88bcf5-a8aa-4160-cc3c-3ff16f39fd64" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Step 1: train CrossEntropyLoss | 5.54486227\n", "Step 1: eval CrossEntropyLoss | 5.48863840\n", "Step 1: eval Accuracy | 0.17598992\n" ] } ], "source": [ "# Train the model 1 step and keep the `trax.supervised.training.Loop` object.\n", "training_loop = train_model(GRULM(), data_generator)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The model was only trained for 1 step due to the constraints of this environment. Even on a GPU accelerated environment it will take many hours for it to achieve a good level of accuracy. For the rest of the assignment you will be using a pretrained model but now you should understand how the training can be done using Trax." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "abKPe7d4wt1C" }, "source": [ "\n", "# Part 4: Evaluation \n", "\n", "### 4.1 Evaluating using the deep nets\n", "\n", "Now that you have learned how to train a model, you will learn how to evaluate it. To evaluate language models, we usually use perplexity which is a measure of how well a probability model predicts a sample. Note that perplexity is defined as: \n", "\n", "$$P(W) = \\sqrt[N]{\\prod_{i=1}^{N} \\frac{1}{P(w_i| w_1,...,w_{n-1})}}$$\n", "\n", "As an implementation hack, you would usually take the log of that formula (to enable us to use the log probabilities we get as output of our `RNN`, convert exponents to products, and products into sums which makes computations less complicated and computationally more efficient). 
You should also take care of the padding, since you do not want to include the padding when calculating the perplexity (otherwise the perplexity measure would look artificially good).\n", "\n", "\n", "$$\log P(W) = \log\Big(\sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i| w_1,...,w_{i-1})}}\Big)$$\n", "\n", "$$ = \log\Big(\prod_{i=1}^{N} \frac{1}{P(w_i| w_1,...,w_{i-1})}\Big)^{\frac{1}{N}}$$ \n", "\n", "$$ = \log\Big(\prod_{i=1}^{N}{P(w_i| w_1,...,w_{i-1})}\Big)^{-\frac{1}{N}} $$\n", "$$ = -\frac{1}{N}\log\Big(\prod_{i=1}^{N}{P(w_i| w_1,...,w_{i-1})}\Big) $$\n", "$$ = -\frac{1}{N}\sum_{i=1}^{N}{\log P(w_i| w_1,...,w_{i-1})} $$\n", "\n", "\n", "\n", "### Exercise 05\n", "**Instructions:** Write a function that will help evaluate your model. As an implementation hack, your function takes in `preds` and `target`. `preds` is a tensor of log probabilities. You can use [`tl.one_hot`](https://github.com/google/trax/blob/22765bb18608d376d8cd660f9865760e4ff489cd/trax/layers/metrics.py#L154) to transform `target` into the same shape. You then multiply them and sum. \n", "\n", "You also have to create a mask so that only the non-padded probabilities are counted. Good luck! " ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "9PqIvySI6ZWb" }, "source": [ "
\n", "\n", " Hints\n", "\n", "

\n", "

\n", "

" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "colab": {}, "colab_type": "code", "id": "3OtmlEuOwt1D" }, "outputs": [], "source": [ "# UNQ_C5 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)\n", "# GRADED FUNCTION: test_model\n", "def test_model(preds, target):\n", " \"\"\"Function to test the model.\n", "\n", " Args:\n", " preds (jax.interpreters.xla.DeviceArray): Predictions of a list of batches of tensors corresponding to lines of text.\n", " target (jax.interpreters.xla.DeviceArray): Actual list of batches of tensors corresponding to lines of text.\n", "\n", " Returns:\n", " float: log_perplexity of the model.\n", " \"\"\"\n", " ### START CODE HERE (Replace instances of 'None' with your code) ###\n", " total_log_ppx = np.sum(preds * tl.one_hot(target, preds.shape[-1]),axis= -1) # HINT: tl.one_hot() should replace one of the Nones\n", "\n", " non_pad = 1.0 - np.equal(target, 0) # You should check if the target equals 0\n", " ppx = total_log_ppx * non_pad # Get rid of the padding\n", "\n", " log_ppx = np.sum(ppx) / np.sum(non_pad)\n", " ### END CODE HERE ###\n", " \n", " return -log_ppx" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "xl8X0FPAwt1F", "outputId": "1dbfef92-c8ca-4cae-c92c-7b1b4adb6963" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The log perplexity and perplexity of your model are respectively 1.9785146 7.2319922\n" ] } ], "source": [ "# UNQ_C6 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)\n", "# Testing \n", "model = GRULM()\n", "model.init_from_file('model.pkl.gz')\n", "batch = next(data_generator(batch_size, max_length, lines, shuffle=False))\n", "preds = model(batch[0])\n", "log_ppx = test_model(preds, batch[1])\n", "print('The log perplexity and perplexity of your model are respectively', log_ppx, np.exp(log_ppx))" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "8PZdy1V2wt1H" }, "source": [ "**Expected Output:** The log perplexity and perplexity of your model are respectively around 1.9 and 7.2." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "4-STC44Ywt1I" }, "source": [ "\n", "# Part 5: Generating the language with your own model\n", "\n", "We will now use your own language model to generate new sentences for that we need to make draws from a Gumble distribution." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "AXdIBCGxtTdt" }, "source": [ "The Gumbel Probability Density Function (PDF) is defined as: \n", "\n", "$$ f(z) = {1\\over{\\beta}}e^{(-z+e^{(-z)})} $$\n", "\n", "where: $$ z = {(x - \\mu)\\over{\\beta}}$$\n", "\n", "The maximum value, which is what we choose as the prediction in the last step of a Recursive Neural Network `RNN` we are using for text generation, in a sample of a random variable following an exponential distribution approaches the Gumbel distribution when the sample increases asymptotically. For that reason, the Gumbel distribution is used to sample from a categorical distribution." 
] }, { "cell_type": "code", "execution_count": 21, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "xrOJHbXewt1J", "outputId": "665bc6ff-f9ee-4097-c89b-83ef0e12c1b7" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "And in the shapes of heaven, he \n" ] } ], "source": [ "# Run this cell to generate some news sentence\n", "def gumbel_sample(log_probs, temperature=1.0):\n", " \"\"\"Gumbel sampling from a categorical distribution.\"\"\"\n", " u = numpy.random.uniform(low=1e-6, high=1.0 - 1e-6, size=log_probs.shape)\n", " g = -np.log(-np.log(u))\n", " return np.argmax(log_probs + g * temperature, axis=-1)\n", "\n", "def predict(num_chars, prefix):\n", " inp = [ord(c) for c in prefix]\n", " result = [c for c in prefix]\n", " max_len = len(prefix) + num_chars\n", " for _ in range(num_chars):\n", " cur_inp = np.array(inp + [0] * (max_len - len(inp)))\n", " outp = model(cur_inp[None, :]) # Add batch dim.\n", " next_char = gumbel_sample(outp[0, len(inp)])\n", " inp += [int(next_char)]\n", " \n", " if inp[-1] == 1:\n", " break # EOS\n", " result.append(chr(int(next_char)))\n", " \n", " return \"\".join(result)\n", "\n", "print(predict(32, \"\"))" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 68 }, "colab_type": "code", "id": "Raojyhw3z7HE", "outputId": "e8e88711-e200-462c-f984-018dfa784a63" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MARK ANTONY\tTo go, good sir.\n", "Even with a countenance, exempt \n", "I'll leave him to; so 'twere so \n" ] } ], "source": [ "print(predict(32, \"\"))\n", "print(predict(32, \"\"))\n", "print(predict(32, \"\"))\n" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "NAfV3l5Zwt1L" }, "source": [ "In the generated text above, you can see that the model generates text that makes sense capturing dependencies between words and without any input. A simple n-gram model would have not been able to capture all of that in one sentence." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "FsE8tdTLwt1M" }, "source": [ "\n", "### On statistical methods \n", "\n", "Using a statistical method like the one you implemented in course 2 will not give you results that are as good. Your model will not be able to encode information seen previously in the data set and as a result, the perplexity will increase. Remember from course 2 that the higher the perplexity, the worse your model is. Furthermore, statistical ngram models take up too much space and memory. As a result, it will be inefficient and too slow. Conversely, with deepnets, you can get a better perplexity. Note, learning about n-gram language models is still important and allows you to better understand deepnets.\n" ] } ], "metadata": { "accelerator": "GPU", "colab": { "collapsed_sections": [], "name": "C3_W2_Assignment_Solution.ipynb", "provenance": [], "toc_visible": true }, "coursera": { "schema_names": [ "NLPC3-2A" ] }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" } }, "nbformat": 4, "nbformat_minor": 4 }