{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Language Models: Auto-Complete\n", "\n", "In this assignment, you will build an auto-complete system. Auto-complete system is something you may see every day\n", "- When you google something, you often have suggestions to help you complete your search. \n", "- When you are writing an email, you get suggestions telling you possible endings to your sentence. \n", "\n", "By the end of this assignment, you will develop a prototype of such a system.\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Outline\n", "- [1 Load and Preprocess Data](#1)\n", "- [1.1: Load the data](#1.1)\n", "- [1.2 Pre-process the data](#1.2)\n", " - [Exercise 01](#ex-01)\n", " - [Exercise 02](#ex-02)\n", " - [Exercise 03](#ex-03)\n", " - [Exercise 04](#ex-04)\n", " - [Exercise 05](#ex-05)\n", " - [Exercise 06](#ex-06)\n", " - [Exercise 07](#ex-07)\n", "- [2 Develop n-gram based language models](#2)\n", " - [Exercise 08](#ex-08)\n", " - [Exercise 09](#ex-09) \n", "- [3 Perplexity](#3)\n", " - [Exercise 10](#ex-10)\n", "- [4 Build an auto-complete system](#4)\n", " - [Exercise 11](#ex-11)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A key building block for an auto-complete system is a language model.\n", "A language model assigns the probability to a sequence of words, in a way that more \"likely\" sequences receive higher scores. For example, \n", ">\"I have a pen\" \n", "is expected to have a higher probability than \n", ">\"I am a pen\"\n", "since the first one seems to be a more natural sentence in the real world.\n", "\n", "You can take advantage of this probability calculation to develop an auto-complete system. \n", "Suppose the user typed \n", ">\"I eat scrambled\"\n", "Then you can find a word `x` such that \"I eat scrambled x\" receives the highest probability. If x = \"eggs\", the sentence would be\n", ">\"I eat scrambled eggs\"\n", "\n", "While a variety of language models have been developed, this assignment uses **N-grams**, a simple but powerful method for language modeling.\n", "- N-grams are also used in machine translation and speech recognition. \n", "\n", "\n", "Here are the steps of this assignment:\n", "\n", "1. Load and preprocess data\n", " - Load and tokenize data.\n", " - Split the sentences into train and test sets.\n", " - Replace words with a low frequency by an unknown marker ``.\n", "1. Develop N-gram based language models\n", " - Compute the count of n-grams from a given data set.\n", " - Estimate the conditional probability of a next word with k-smoothing.\n", "1. Evaluate the N-gram models by computing the perplexity score.\n", "1. Use your own model to suggest an upcoming word given your sentence. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import math\n", "import random\n", "import numpy as np\n", "import pandas as pd\n", "import nltk\n", "nltk.data.path.append('.')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Part 1: Load and Preprocess Data\n", "\n", "\n", "### Part 1.1: Load the data\n", "You will use twitter data.\n", "Load the data and view the first few sentences by running the next cell.\n", "\n", "Notice that data is a long string that contains many many tweets.\n", "Observe that there is a line break \"\\n\" between tweets." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Data type: \n", "Number of letters: 3335477\n", "First 300 letters of the data\n", "-------\n" ] }, { "data": { "text/plain": [ "\"How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long.\\nWhen you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason.\\nthey've decided its more fun if I don't.\\nSo Tired D; Played Lazer Tag & Ran A \"" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "-------\n", "Last 300 letters of the data\n", "-------\n" ] }, { "data": { "text/plain": [ "\"ust had one a few weeks back....hopefully we will be back soon! wish you the best yo\\nColombia is with an 'o'...“: We now ship to 4 countries in South America (fist pump). Please welcome Columbia to the Stunner Family”\\n#GutsiestMovesYouCanMake Giving a cat a bath.\\nCoffee after 5 was a TERRIBLE idea.\\n\"" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "-------\n" ] } ], "source": [ "with open(\"en_US.twitter.txt\", \"r\") as f:\n", " data = f.read()\n", "print(\"Data type:\", type(data))\n", "print(\"Number of letters:\", len(data))\n", "print(\"First 300 letters of the data\")\n", "print(\"-------\")\n", "display(data[0:300])\n", "print(\"-------\")\n", "\n", "print(\"Last 300 letters of the data\")\n", "print(\"-------\")\n", "display(data[-300:])\n", "print(\"-------\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Part 1.2 Pre-process the data\n", "\n", "Preprocess this data with the following steps:\n", "\n", "1. Split data into sentences using \"\\n\" as the delimiter.\n", "1. Split each sentence into tokens. Note that in this assignment we use \"token\" and \"words\" interchangeably.\n", "1. Assign sentences into train or test sets.\n", "1. Find tokens that appear at least N times in the training data.\n", "1. Replace tokens that appear less than N times by ``\n", "\n", "\n", "Note: we omit validation data in this exercise.\n", "- In real applications, we should hold a part of data as a validation set and use it to tune our training.\n", "- We skip this process for simplicity." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Exercise 01\n", "\n", "Split data into sentences." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", " Hints\n", "\n", "

\n", "

\n", "

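For example, note how str.split keeps empty strings around trailing linebreaks, which is why the provided cleanup step drops empty sentences (a small illustration, not the graded code):

```python
"I have a pen.\nI have an apple.\n".split('\n')
# ['I have a pen.', 'I have an apple.', '']  -- note the trailing empty string
```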
" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# UNQ_C1 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)\n", "### GRADED_FUNCTION: split_to_sentences ###\n", "def split_to_sentences(data):\n", " \"\"\"\n", " Split data by linebreak \"\\n\"\n", " \n", " Args:\n", " data: str\n", " \n", " Returns:\n", " A list of sentences\n", " \"\"\"\n", " ### START CODE HERE (Replace instances of 'None' with your code) ###\n", " sentences = data.split('\\n')\n", " ### END CODE HERE ###\n", " \n", " # Additional clearning (This part is already implemented)\n", " # - Remove leading and trailing spaces from each sentence\n", " # - Drop sentences if they are empty strings.\n", " sentences = [s.strip() for s in sentences]\n", " sentences = [s for s in sentences if len(s) > 0]\n", " \n", " return sentences " ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "I have a pen.\n", "I have an apple. \n", "Ah\n", "Apple pen.\n", "\n", "\n" ] }, { "data": { "text/plain": [ "['I have a pen.', 'I have an apple.', 'Ah', 'Apple pen.']" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# test your code\n", "x = \"\"\"\n", "I have a pen.\\nI have an apple. \\nAh\\nApple pen.\\n\n", "\"\"\"\n", "print(x)\n", "\n", "split_to_sentences(x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Expected answer: \n", "```CPP\n", "['I have a pen.', 'I have an apple.', 'Ah', 'Apple pen.']\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Exercise 02\n", "The next step is to tokenize sentences (split a sentence into a list of words). \n", "- Convert all tokens into lower case so that words which are capitalized (for example, at the start of a sentence) in the original text are treated the same as the lowercase versions of the words.\n", "- Append each tokenized list of words into a list of tokenized sentences." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", " Hints\n", "\n", "

\n", "

    \n", "
  • Use str.lower to convert strings to lowercase.
  • \n", "
  • Please use nltk.word_tokenize to split sentences into tokens.
  • \n", "
  • If you used str.split insteaad of nltk.word_tokenize, there are additional edge cases to handle, such as the punctuation (comma, period) that follows a word.
  • \n", "
\n", "

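A small illustration of the difference (assuming the NLTK tokenizer data is available, as in the setup cell):

```python
import nltk

nltk.word_tokenize("Sky is blue.")   # ['Sky', 'is', 'blue', '.']
"Sky is blue.".split()               # ['Sky', 'is', 'blue.'] -- period stuck to the word
```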
\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# UNQ_C2 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)\n", "### GRADED_FUNCTION: tokenize_sentences ###\n", "def tokenize_sentences(sentences):\n", " \"\"\"\n", " Tokenize sentences into tokens (words)\n", " \n", " Args:\n", " sentences: List of strings\n", " \n", " Returns:\n", " List of lists of tokens\n", " \"\"\"\n", " \n", " # Initialize the list of lists of tokenized sentences\n", " tokenized_sentences = []\n", " ### START CODE HERE (Replace instances of 'None' with your code) ###\n", " \n", " # Go through each sentence\n", " for sentence in sentences:\n", " \n", " # Convert to lowercase letters\n", " sentence = sentence.lower()\n", " \n", " # Convert into a list of words\n", " tokenized = nltk.word_tokenize(sentence)\n", " \n", " # append the list of words to the list of lists\n", " tokenized_sentences.append(tokenized)\n", " \n", " ### END CODE HERE ###\n", " \n", " return tokenized_sentences" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['sky', 'is', 'blue', '.'],\n", " ['leaves', 'are', 'green', '.'],\n", " ['roses', 'are', 'red', '.']]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# test your code\n", "sentences = [\"Sky is blue.\", \"Leaves are green.\", \"Roses are red.\"]\n", "tokenize_sentences(sentences)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Expected output\n", "\n", "```CPP\n", "[['sky', 'is', 'blue', '.'],\n", " ['leaves', 'are', 'green', '.'],\n", " ['roses', 'are', 'red', '.']]\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Exercise 03\n", "\n", "\n", "Use the two functions that you have just implemented to get the tokenized data.\n", "- split the data into sentences\n", "- tokenize those sentences" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# UNQ_C3 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)\n", "### GRADED_FUNCTION: get_tokenized_data ###\n", "def get_tokenized_data(data):\n", " \"\"\"\n", " Make a list of tokenized sentences\n", " \n", " Args:\n", " data: String\n", " \n", " Returns:\n", " List of lists of tokens\n", " \"\"\"\n", " ### START CODE HERE (Replace instances of 'None' with your code) ###\n", " \n", " # Get the sentences by splitting up the data\n", " sentences = split_to_sentences(data)\n", " \n", " # Get the list of lists of tokens by tokenizing the sentences\n", " tokenized_sentences = tokenize_sentences(sentences)\n", " \n", " ### END CODE HERE ###\n", " \n", " return tokenized_sentences" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['sky', 'is', 'blue', '.'],\n", " ['leaves', 'are', 'green'],\n", " ['roses', 'are', 'red', '.']]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# test your function\n", "x = \"Sky is blue.\\nLeaves are green\\nRoses are red.\"\n", "get_tokenized_data(x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Expected outcome\n", "\n", "```CPP\n", "[['sky', 'is', 'blue', '.'],\n", " ['leaves', 'are', 'green'],\n", " ['roses', 'are', 'red', '.']]\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Split into train and test sets\n", "\n", "Now run the cell below to split data into training and test sets." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "tokenized_data = get_tokenized_data(data)\n", "random.seed(87)\n", "random.shuffle(tokenized_data)\n", "\n", "train_size = int(len(tokenized_data) * 0.8)\n", "train_data = tokenized_data[0:train_size]\n", "test_data = tokenized_data[train_size:]" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "47961 data are split into 38368 train and 9593 test set\n", "First training sample:\n", "['i', 'personally', 'would', 'like', 'as', 'our', 'official', 'glove', 'of', 'the', 'team', 'local', 'company', 'and', 'quality', 'production']\n", "First test sample\n", "['that', 'picture', 'i', 'just', 'seen', 'whoa', 'dere', '!', '!', '>', '>', '>', '>', '>', '>', '>']\n" ] } ], "source": [ "print(\"{} data are split into {} train and {} test set\".format(\n", " len(tokenized_data), len(train_data), len(test_data)))\n", "\n", "print(\"First training sample:\")\n", "print(train_data[0])\n", " \n", "print(\"First test sample\")\n", "print(test_data[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Expected output\n", "\n", "```CPP\n", "47961 data are split into 38368 train and 9593 test set\n", "First training sample:\n", "['i', 'personally', 'would', 'like', 'as', 'our', 'official', 'glove', 'of', 'the', 'team', 'local', 'company', 'and', 'quality', 'production']\n", "First test sample\n", "['that', 'picture', 'i', 'just', 'seen', 'whoa', 'dere', '!', '!', '>', '>', '>', '>', '>', '>', '>']\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Exercise 04\n", "\n", "You won't use all the tokens (words) appearing in the data for training. Instead, you will use the more frequently used words. \n", "- You will focus on the words that appear at least N times in the data.\n", "- First count how many times each word appears in the data.\n", "\n", "You will need a double for-loop, one for sentences and the other for tokens within a sentence.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", " Hints\n", "\n", "

\n", "

    \n", "
  • If you decide to import and use defaultdict, remember to cast the dictionary back to a regular 'dict' before returning it.
  • \n", "
\n", "

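If you take the defaultdict route mentioned above, a minimal counting sketch looks like this (remember the final cast back to dict):

```python
from collections import defaultdict

word_counts = defaultdict(int)        # missing keys default to 0
for token in ['sky', 'is', 'sky']:
    word_counts[token] += 1
word_counts = dict(word_counts)       # cast back to a regular dict before returning
# {'sky': 2, 'is': 1}
```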
\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# UNQ_C4 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)\n", "### GRADED_FUNCTION: count_words ###\n", "def count_words(tokenized_sentences):\n", " \"\"\"\n", " Count the number of word appearence in the tokenized sentences\n", " \n", " Args:\n", " tokenized_sentences: List of lists of strings\n", " \n", " Returns:\n", " dict that maps word (str) to the frequency (int)\n", " \"\"\"\n", " \n", " word_counts = {}\n", " ### START CODE HERE (Replace instances of 'None' with your code) ###\n", " \n", " # Loop through each sentence\n", " for sentence in tokenized_sentences: # complete this line\n", " \n", " # Go through each token in the sentence\n", " for token in sentence: # complete this line\n", "\n", " # If the token is not in the dictionary yet, set the count to 1\n", " if token not in word_counts.keys(): # complete this line\n", " word_counts[token] = 1\n", " \n", " # If the token is already in the dictionary, increment the count by 1\n", " else:\n", " word_counts[token] += 1\n", "\n", " ### END CODE HERE ###\n", " \n", " return word_counts" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'sky': 1,\n", " 'is': 1,\n", " 'blue': 1,\n", " '.': 3,\n", " 'leaves': 1,\n", " 'are': 2,\n", " 'green': 1,\n", " 'roses': 1,\n", " 'red': 1}" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# test your code\n", "tokenized_sentences = [['sky', 'is', 'blue', '.'],\n", " ['leaves', 'are', 'green', '.'],\n", " ['roses', 'are', 'red', '.']]\n", "count_words(tokenized_sentences)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Expected output\n", "\n", "Note that the order may differ.\n", "\n", "```CPP\n", "{'sky': 1,\n", " 'is': 1,\n", " 'blue': 1,\n", " '.': 3,\n", " 'leaves': 1,\n", " 'are': 2,\n", " 'green': 1,\n", " 'roses': 1,\n", " 'red': 1}\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Handling 'Out of Vocabulary' words\n", "\n", "If your model is performing autocomplete, but encounters a word that it never saw during training, it won't have an input word to help it determine the next word to suggest. The model will not be able to predict the next word because there are no counts for the current word. \n", "- This 'new' word is called an 'unknown word', or out of vocabulary (OOV) words.\n", "- The percentage of unknown words in the test set is called the OOV rate. \n", "\n", "To handle unknown words during prediction, use a special token to represent all unknown words 'unk'. \n", "- Modify the training data so that it has some 'unknown' words to train on.\n", "- Words to convert into \"unknown\" words are those that do not occur very frequently in the training set.\n", "- Create a list of the most frequent words in the training set, called the closed vocabulary . \n", "- Convert all the other words that are not part of the closed vocabulary to the token 'unk'. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Exercise 05\n", "\n", "You will now create a function that takes in a text document and a threshold 'count_threshold'.\n", "- Any word whose count is greater than or equal to the threshold 'count_threshold' is kept in the closed vocabulary.\n", "- used that you want to keep, returns the document containing only the word closed vocabulary and the word unk. 
" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# UNQ_C5 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)\n", "### GRADED_FUNCTION: get_words_with_nplus_frequency ###\n", "def get_words_with_nplus_frequency(tokenized_sentences, count_threshold):\n", " \"\"\"\n", " Find the words that appear N times or more\n", " \n", " Args:\n", " tokenized_sentences: List of lists of sentences\n", " count_threshold: minimum number of occurrences for a word to be in the closed vocabulary.\n", " \n", " Returns:\n", " List of words that appear N times or more\n", " \"\"\"\n", " # Initialize an empty list to contain the words that\n", " # appear at least 'minimum_freq' times.\n", " closed_vocab = []\n", " \n", " # Get the word couts of the tokenized sentences\n", " # Use the function that you defined earlier to count the words\n", " word_counts = count_words(tokenized_sentences)\n", " ### START CODE HERE (Replace instances of 'None' with your code) ###\n", " # for each word and its count\n", " for word, cnt in word_counts.items(): # complete this line\n", " \n", " # check that the word's count\n", " # is at least as great as the minimum count\n", " if cnt >= count_threshold:\n", " \n", " # append the word to the list\n", " closed_vocab.append(word)\n", " ### END CODE HERE ###\n", " \n", " return closed_vocab" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Closed vocabulary:\n", "['.', 'are']\n" ] } ], "source": [ "# test your code\n", "tokenized_sentences = [['sky', 'is', 'blue', '.'],\n", " ['leaves', 'are', 'green', '.'],\n", " ['roses', 'are', 'red', '.']]\n", "tmp_closed_vocab = get_words_with_nplus_frequency(tokenized_sentences, count_threshold=2)\n", "print(f\"Closed vocabulary:\")\n", "print(tmp_closed_vocab)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Expected output\n", "\n", "```CPP\n", "Closed vocabulary:\n", "['.', 'are']\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Exercise 06\n", "\n", "The words that appear 'count_threshold' times or more are in the 'closed vocabulary. \n", "- All other words are regarded as 'unknown'.\n", "- Replace words not in the closed vocabulary with the token \"\"." 
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "# UNQ_C6 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)\n", "### GRADED_FUNCTION: replace_oov_words_by_unk ###\n", "def replace_oov_words_by_unk(tokenized_sentences, vocabulary, unknown_token=\"\"):\n", " \"\"\"\n", " Replace words not in the given vocabulary with '' token.\n", " \n", " Args:\n", " tokenized_sentences: List of lists of strings\n", " vocabulary: List of strings that we will use\n", " unknown_token: A string representing unknown (out-of-vocabulary) words\n", " \n", " Returns:\n", " List of lists of strings, with words not in the vocabulary replaced\n", " \"\"\"\n", " \n", " # Place vocabulary into a set for faster search\n", " vocabulary = set(vocabulary)\n", " \n", " # Initialize a list that will hold the sentences\n", " # after less frequent words are replaced by the unknown token\n", " replaced_tokenized_sentences = []\n", " \n", " # Go through each sentence\n", " for sentence in tokenized_sentences:\n", " \n", " # Initialize the list that will contain\n", " # a single sentence with \"unknown_token\" replacements\n", " replaced_sentence = []\n", " ### START CODE HERE (Replace instances of 'None' with your code) ###\n", "\n", " # for each token in the sentence\n", " for token in sentence: # complete this line\n", " \n", " # Check if the token is in the closed vocabulary\n", " if token in vocabulary: # complete this line\n", " # If so, append the word to the replaced_sentence\n", " replaced_sentence.append(token)\n", " else:\n", " # otherwise, append the unknown token instead\n", " replaced_sentence.append(unknown_token)\n", " ### END CODE HERE ###\n", " \n", " # Append the list of tokens to the list of lists\n", " replaced_tokenized_sentences.append(replaced_sentence)\n", " \n", " return replaced_tokenized_sentences" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Original sentence:\n", "[['dogs', 'run'], ['cats', 'sleep']]\n", "tokenized_sentences with less frequent words converted to '':\n", "[['dogs', ''], ['', 'sleep']]\n" ] } ], "source": [ "tokenized_sentences = [[\"dogs\", \"run\"], [\"cats\", \"sleep\"]]\n", "vocabulary = [\"dogs\", \"sleep\"]\n", "tmp_replaced_tokenized_sentences = replace_oov_words_by_unk(tokenized_sentences, vocabulary)\n", "print(f\"Original sentence:\")\n", "print(tokenized_sentences)\n", "print(f\"tokenized_sentences with less frequent words converted to '':\")\n", "print(tmp_replaced_tokenized_sentences)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Expected answer\n", "\n", "```CPP\n", "Original sentence:\n", "[['dogs', 'run'], ['cats', 'sleep']]\n", "tokenized_sentences with less frequent words converted to '':\n", "[['dogs', ''], ['', 'sleep']]\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Exercise 07\n", "\n", "Now we are ready to process our data by combining the functions that you just implemented.\n", "\n", "1. Find tokens that appear at least count_threshold times in the training data.\n", "1. Replace tokens that appear less than count_threshold times by \"\" both for training and test data." 
] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "# UNQ_C7 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)\n", "### GRADED_FUNCTION: preprocess_data ###\n", "def preprocess_data(train_data, test_data, count_threshold):\n", " \"\"\"\n", " Preprocess data, i.e.,\n", " - Find tokens that appear at least N times in the training data.\n", " - Replace tokens that appear less than N times by \"\" both for training and test data. \n", " Args:\n", " train_data, test_data: List of lists of strings.\n", " count_threshold: Words whose count is less than this are \n", " treated as unknown.\n", " \n", " Returns:\n", " Tuple of\n", " - training data with low frequent words replaced by \"\"\n", " - test data with low frequent words replaced by \"\"\n", " - vocabulary of words that appear n times or more in the training data\n", " \"\"\"\n", " ### START CODE HERE (Replace instances of 'None' with your code) ###\n", " \n", " # Get the closed vocabulary using the train data\n", " vocabulary = get_words_with_nplus_frequency(train_data,count_threshold)\n", " \n", " # For the train data, replace less common words with \"\"\n", " train_data_replaced = replace_oov_words_by_unk(train_data,vocabulary)\n", " \n", " # For the test data, replace less common words with \"\"\n", " test_data_replaced = replace_oov_words_by_unk(test_data,vocabulary)\n", " \n", " ### END CODE HERE ###\n", " return train_data_replaced, test_data_replaced, vocabulary" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tmp_train_repl\n", "[['sky', 'is', 'blue', '.'], ['leaves', 'are', 'green']]\n", "\n", "tmp_test_repl\n", "[['', 'are', '', '.']]\n", "\n", "tmp_vocab\n", "['sky', 'is', 'blue', '.', 'leaves', 'are', 'green']\n" ] } ], "source": [ "# test your code\n", "tmp_train = [['sky', 'is', 'blue', '.'],\n", " ['leaves', 'are', 'green']]\n", "tmp_test = [['roses', 'are', 'red', '.']]\n", "\n", "tmp_train_repl, tmp_test_repl, tmp_vocab = preprocess_data(tmp_train, \n", " tmp_test, \n", " count_threshold = 1)\n", "\n", "print(\"tmp_train_repl\")\n", "print(tmp_train_repl)\n", "print()\n", "print(\"tmp_test_repl\")\n", "print(tmp_test_repl)\n", "print()\n", "print(\"tmp_vocab\")\n", "print(tmp_vocab)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Expected outcome\n", "\n", "```CPP\n", "tmp_train_repl\n", "[['sky', 'is', 'blue', '.'], ['leaves', 'are', 'green']]\n", "\n", "tmp_test_repl\n", "[['', 'are', '', '.']]\n", "\n", "tmp_vocab\n", "['sky', 'is', 'blue', '.', 'leaves', 'are', 'green']\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Preprocess the train and test data\n", "Run the cell below to complete the preprocessing both for training and test sets." 
] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "minimum_freq = 2\n", "train_data_processed, test_data_processed, vocabulary = preprocess_data(train_data, \n", " test_data, \n", " minimum_freq)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First preprocessed training sample:\n", "['i', 'personally', 'would', 'like', 'as', 'our', 'official', 'glove', 'of', 'the', 'team', 'local', 'company', 'and', 'quality', 'production']\n", "\n", "First preprocessed test sample:\n", "['that', 'picture', 'i', 'just', 'seen', 'whoa', 'dere', '!', '!', '>', '>', '>', '>', '>', '>', '>']\n", "\n", "First 10 vocabulary:\n", "['i', 'personally', 'would', 'like', 'as', 'our', 'official', 'glove', 'of', 'the']\n", "\n", "Size of vocabulary: 14821\n" ] } ], "source": [ "print(\"First preprocessed training sample:\")\n", "print(train_data_processed[0])\n", "print()\n", "print(\"First preprocessed test sample:\")\n", "print(test_data_processed[0])\n", "print()\n", "print(\"First 10 vocabulary:\")\n", "print(vocabulary[0:10])\n", "print()\n", "print(\"Size of vocabulary:\", len(vocabulary))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Expected output\n", "\n", "```CPP\n", "First preprocessed training sample:\n", "['i', 'personally', 'would', 'like', 'as', 'our', 'official', 'glove', 'of', 'the', 'team', 'local', 'company', 'and', 'quality', 'production']\n", "\n", "First preprocessed test sample:\n", "['that', 'picture', 'i', 'just', 'seen', 'whoa', 'dere', '!', '!', '>', '>', '>', '>', '>', '>', '>']\n", "\n", "First 10 vocabulary:\n", "['i', 'personally', 'would', 'like', 'as', 'our', 'official', 'glove', 'of', 'the']\n", "\n", "Size of vocabulary: 14821\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You are done with the preprocessing section of the assignment.\n", "Objects `train_data_processed`, `test_data_processed`, and `vocabulary` will be used in the rest of the exercises." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Part 2: Develop n-gram based language models\n", "\n", "In this section, you will develop the n-grams language model.\n", "- Assume the probability of the next word depends only on the previous n-gram.\n", "- The previous n-gram is the series of the previous 'n' words.\n", "\n", "The conditional probability for the word at position 't' in the sentence, given that the words preceding it are $w_{t-1}, w_{t-2} \\cdots w_{t-n}$ is:\n", "\n", "$$ P(w_t | w_{t-1}\\dots w_{t-n}) \\tag{1}$$\n", "\n", "You can estimate this probability by counting the occurrences of these series of words in the training data.\n", "- The probability can be estimated as a ratio, where\n", "- The numerator is the number of times word 't' appears after words t-1 through t-n appear in the training data.\n", "- The denominator is the number of times word t-1 through t-n appears in the training data.\n", "\n", "$$ \\hat{P}(w_t | w_{t-1}\\dots w_{t-n}) = \\frac{C(w_{t-1}\\dots w_{t-n}, w_n)}{C(w_{t-1}\\dots w_{t-n})} \\tag{2} $$\n", "\n", "- The function $C(\\cdots)$ denotes the number of occurence of the given sequence. \n", "- $\\hat{P}$ means the estimation of $P$. 
\n", "- Notice that denominator of the equation (2) is the number of occurence of the previous $n$ words, and the numerator is the same sequence followed by the word $w_t$.\n", "\n", "Later, you will modify the equation (2) by adding k-smoothing, which avoids errors when any counts are zero.\n", "\n", "The equation (2) tells us that to estimate probabilities based on n-grams, you need the counts of n-grams (for denominator) and (n+1)-grams (for numerator)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Exercise 08\n", "Next, you will implement a function that computes the counts of n-grams for an arbitrary number $n$.\n", "\n", "When computing the counts for n-grams, prepare the sentence beforehand by prepending $n-1$ starting markers \"\" to indicate the beginning of the sentence. \n", "- For example, in the bi-gram model (N=2), a sequence with two start tokens \"\" should predict the first word of a sentence.\n", "- So, if the sentence is \"I like food\", modify it to be \" I like food\".\n", "- Also prepare the sentence for counting by appending an end token \"\" so that the model can predict when to finish a sentence.\n", "\n", "Technical note: In this implementation, you will store the counts as a dictionary.\n", "- The key of each key-value pair in the dictionary is a **tuple** of n words (and not a list)\n", "- The value in the key-value pair is the number of occurrences. \n", "- The reason for using a tuple as a key instead of a list is because a list in Python is a mutable object (it can be changed after it is first created). A tuple is \"immutable\", so it cannot be altered after it is first created. This makes a tuple suitable as a data type for the key in a dictionary." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", " Hints\n", "\n", "

\n", "

    \n", "
  • To prepend or append, you can create lists and concatenate them using the + operator
  • \n", "
  • To create a list of a repeated value, you can follow this syntax: ['a'] * 3 to get ['a','a','a']
  • \n", "
  • To set the range for index 'i', think of this example: An n-gram where n=2 (bigram), and the sentence is length N=5 (including two start tokens and one end token). So the index positions are [0,1,2,3,4]. The largest index 'i' where a bigram can start is at position i=3, because the word tokens at position 3 and 4 will form the bigram.
  • \n", "
  • Remember that the range() function excludes the value that is used for the maximum of the range. range(3) produces (0,1,2) but excludes 3.
  • \n", "
\n", "

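Putting these hints together, here is how a sliding window over indices extracts bigrams from a short padded sentence (an illustration, not the graded solution):

```python
sentence = ('<s>', '<s>', 'i', 'like', 'food', '<e>')
n = 2

n_grams = [sentence[i:i+n] for i in range(len(sentence) - n + 1)]
# [('<s>', '<s>'), ('<s>', 'i'), ('i', 'like'), ('like', 'food'), ('food', '<e>')]
```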
\n" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "# UNQ_C8 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)\n", "### GRADED FUNCTION: count_n_grams ###\n", "def count_n_grams(data, n, start_token='', end_token = ''):\n", " \"\"\"\n", " Count all n-grams in the data\n", " \n", " Args:\n", " data: List of lists of words\n", " n: number of words in a sequence\n", " \n", " Returns:\n", " A dictionary that maps a tuple of n-words to its frequency\n", " \"\"\"\n", " \n", " # Initialize dictionary of n-grams and their counts\n", " n_grams = {}\n", "\n", " ### START CODE HERE (Replace instances of 'None' with your code) ###\n", " \n", " # Go through each sentence in the data\n", " for sentence in data: # complete this line\n", " \n", " # prepend start token n times, and append one time\n", " sentence = [start_token] * n+ sentence + [end_token]\n", " # convert list to tuple\n", " # So that the sequence of words can be used as\n", " # a key in the dictionary\n", " sentence = tuple(sentence)\n", " \n", " # Use 'i' to indicate the start of the n-gram\n", " # from index 0\n", " # to the last index where the end of the n-gram\n", " # is within the sentence.\n", " \n", " m = len(sentence) if n==1 else len(sentence)-1\n", " for i in range(m): # complete this line\n", " \n", " # Get the n-gram from i to i+n\n", " n_gram = sentence[i:i+n]\n", " \n", " # check if the n-gram is in the dictionary\n", " if n_gram in n_grams.keys(): # complete this line\n", " \n", " # Increment the count for this n-gram\n", " n_grams[n_gram] += 1\n", " else:\n", " # Initialize this n-gram count to 1\n", " n_grams[n_gram] = 1\n", " \n", " ### END CODE HERE ###\n", " return n_grams" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Uni-gram:\n", "{('',): 2, ('i',): 1, ('like',): 2, ('a',): 2, ('cat',): 2, ('',): 2, ('this',): 1, ('dog',): 1, ('is',): 1}\n", "Bi-gram:\n", "{('', ''): 2, ('', 'i'): 1, ('i', 'like'): 1, ('like', 'a'): 2, ('a', 'cat'): 2, ('cat', ''): 2, ('', 'this'): 1, ('this', 'dog'): 1, ('dog', 'is'): 1, ('is', 'like'): 1}\n" ] } ], "source": [ "# test your code\n", "# CODE REVIEW COMMENT: Outcome does not match expected outcome\n", "sentences = [['i', 'like', 'a', 'cat'],\n", " ['this', 'dog', 'is', 'like', 'a', 'cat']]\n", "print(\"Uni-gram:\")\n", "print(count_n_grams(sentences, 1))\n", "print(\"Bi-gram:\")\n", "print(count_n_grams(sentences, 2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Expected outcome:\n", "\n", "```CPP\n", "Uni-gram:\n", "{('',): 2, ('i',): 1, ('like',): 2, ('a',): 2, ('cat',): 2, ('',): 2, ('this',): 1, ('dog',): 1, ('is',): 1}\n", "Bi-gram:\n", "{('', ''): 2, ('', 'i'): 1, ('i', 'like'): 1, ('like', 'a'): 2, ('a', 'cat'): 2, ('cat', ''): 2, ('', 'this'): 1, ('this', 'dog'): 1, ('dog', 'is'): 1, ('is', 'like'): 1}\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Exercise 09\n", "\n", "Next, estimate the probability of a word given the prior 'n' words using the n-gram counts.\n", "\n", "$$ \\hat{P}(w_t | w_{t-1}\\dots w_{t-n}) = \\frac{C(w_{t-1}\\dots w_{t-n}, w_n)}{C(w_{t-1}\\dots w_{t-n})} \\tag{2} $$\n", "\n", "This formula doesn't work when a count of an n-gram is zero..\n", "- Suppose we encounter an n-gram that did not occur in the training data. \n", "- Then, the equation (2) cannot be evaluated (it becomes zero divided by zero).\n", "\n", "A way to handle zero counts is to add k-smoothing. 
\n", "- K-smoothing adds a positive constant $k$ to each numerator and $k \\times |V|$ in the denominator, where $|V|$ is the number of words in the vocabulary.\n", "\n", "$$ \\hat{P}(w_t | w_{t-1}\\dots w_{t-n}) = \\frac{C(w_{t-1}\\dots w_{t-n}, w_n) + k}{C(w_{t-1}\\dots w_{t-n}) + k|V|} \\tag{3} $$\n", "\n", "\n", "For n-grams that have a zero count, the equation (3) becomes $\\frac{1}{|V|}$.\n", "- This means that any n-gram with zero count has the same probability of $\\frac{1}{|V|}$.\n", "\n", "Define a function that computes the probability estimate (3) from n-gram counts and a constant $k$.\n", "\n", "- The function takes in a dictionary 'n_gram_counts', where the key is the n-gram and the value is the count of that n-gram.\n", "- The function also takes another dictionary n_plus1_gram_counts, which you'll use to find the count for the previous n-gram plus the current word." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", " Hints\n", "\n", "

\n", "

    \n", "
  • To define a tuple containing a single value, add a comma after that value. For example: ('apple',) is a tuple containing a single string 'apple'
  • \n", "
  • To concatenate two tuples, use the '+' operator
  • \n", "
  • words
  • \n", "
\n", "

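The two tuple hints combine like this:

```python
previous_n_gram = ('i', 'like')
word = 'a'

n_plus1_gram = previous_n_gram + (word,)   # note the trailing comma in (word,)
# ('i', 'like', 'a')
```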
\n" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "# UNQ_C9 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)\n", "### GRADED FUNCTION: estimate_probabilityy ###\n", "def estimate_probability(word, previous_n_gram, \n", " n_gram_counts, n_plus1_gram_counts, vocabulary_size, k=1.0):\n", " \"\"\"\n", " Estimate the probabilities of a next word using the n-gram counts with k-smoothing\n", " \n", " Args:\n", " word: next word\n", " previous_n_gram: A sequence of words of length n\n", " n_gram_counts: Dictionary of counts of (n+1)-grams\n", " n_plus1_gram_counts: Dictionary of counts of (n+1)-grams\n", " vocabulary_size: number of words in the vocabulary\n", " k: positive constant, smoothing parameter\n", " \n", " Returns:\n", " A probability\n", " \"\"\"\n", " # convert list to tuple to use it as a dictionary key\n", " previous_n_gram = tuple(previous_n_gram) \n", " \n", " ### START CODE HERE (Replace instances of 'None' with your code) ###\n", " \n", " # Set the denominator\n", " # If the previous n-gram exists in the dictionary of n-gram counts,\n", " # Get its count. Otherwise set the count to zero\n", " # Use the dictionary that has counts for n-grams\n", " previous_n_gram_count = n_gram_counts[previous_n_gram] if previous_n_gram in n_gram_counts else 0\n", " \n", " # Calculate the denominator using the count of the previous n gram\n", " # and apply k-smoothing\n", " denominator = previous_n_gram_count + k * vocabulary_size\n", " \n", " # Define n plus 1 gram as the previous n-gram plus the current word as a tuple\n", " n_plus1_gram = previous_n_gram + (word,)\n", "\n", " # Set the count to the count in the dictionary,\n", " # otherwise 0 if not in the dictionary\n", " # use the dictionary that has counts for the n-gram plus current word\n", " n_plus1_gram_count = n_plus1_gram_counts[n_plus1_gram] if n_plus1_gram in n_plus1_gram_counts else 0\n", " \n", " # Define the numerator use the count of the n-gram plus current word,\n", " # and apply smoothing\n", " numerator = n_plus1_gram_count + k\n", " \n", " # Calculate the probability as the numerator divided by denominator\n", " probability = numerator / denominator\n", " \n", " ### END CODE HERE ###\n", " \n", " return probability" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The estimated probability of word 'cat' given the previous n-gram 'a' is: 0.3333\n" ] } ], "source": [ "# test your code\n", "sentences = [['i', 'like', 'a', 'cat'],\n", " ['this', 'dog', 'is', 'like', 'a', 'cat']]\n", "unique_words = list(set(sentences[0] + sentences[1]))\n", "\n", "unigram_counts = count_n_grams(sentences, 1)\n", "\n", "bigram_counts = count_n_grams(sentences, 2)\n", "\n", "tmp_prob = estimate_probability(\"cat\", \"a\", unigram_counts, bigram_counts, len(unique_words), k=1)\n", "\n", "print(f\"The estimated probability of word 'cat' given the previous n-gram 'a' is: {tmp_prob:.4f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Expected output\n", "\n", "```CPP\n", "The estimated probability of word 'cat' given the previous n-gram 'a' is: 0.3333\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Estimate probabilities for all words\n", "\n", "The function defined below loops over all words in vocabulary to calculate probabilities for all possible words.\n", "- This function is provided for you." 
] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "def estimate_probabilities(previous_n_gram, n_gram_counts, n_plus1_gram_counts, vocabulary, k=1.0):\n", " \"\"\"\n", " Estimate the probabilities of next words using the n-gram counts with k-smoothing\n", " \n", " Args:\n", " previous_n_gram: A sequence of words of length n\n", " n_gram_counts: Dictionary of counts of (n+1)-grams\n", " n_plus1_gram_counts: Dictionary of counts of (n+1)-grams\n", " vocabulary: List of words\n", " k: positive constant, smoothing parameter\n", " \n", " Returns:\n", " A dictionary mapping from next words to the probability.\n", " \"\"\"\n", " \n", " # convert list to tuple to use it as a dictionary key\n", " previous_n_gram = tuple(previous_n_gram)\n", " \n", " # add to the vocabulary\n", " # is not needed since it should not appear as the next word\n", " vocabulary = vocabulary + [\"\", \"\"]\n", " vocabulary_size = len(vocabulary)\n", " \n", " probabilities = {}\n", " for word in vocabulary:\n", " probability = estimate_probability(word, previous_n_gram, \n", " n_gram_counts, n_plus1_gram_counts, \n", " vocabulary_size, k=k)\n", " probabilities[word] = probability\n", "\n", " return probabilities" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'dog': 0.09090909090909091,\n", " 'this': 0.09090909090909091,\n", " 'cat': 0.2727272727272727,\n", " 'is': 0.09090909090909091,\n", " 'like': 0.09090909090909091,\n", " 'a': 0.09090909090909091,\n", " 'i': 0.09090909090909091,\n", " '': 0.09090909090909091,\n", " '': 0.09090909090909091}" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# test your code\n", "sentences = [['i', 'like', 'a', 'cat'],\n", " ['this', 'dog', 'is', 'like', 'a', 'cat']]\n", "unique_words = list(set(sentences[0] + sentences[1]))\n", "unigram_counts = count_n_grams(sentences, 1)\n", "bigram_counts = count_n_grams(sentences, 2)\n", "estimate_probabilities(\"a\", unigram_counts, bigram_counts, unique_words, k=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Expected output\n", "\n", "```CPP\n", "{'cat': 0.2727272727272727,\n", " 'i': 0.09090909090909091,\n", " 'this': 0.09090909090909091,\n", " 'a': 0.09090909090909091,\n", " 'is': 0.09090909090909091,\n", " 'like': 0.09090909090909091,\n", " 'dog': 0.09090909090909091,\n", " '': 0.09090909090909091,\n", " '': 0.09090909090909091}\n", "```" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'dog': 0.09090909090909091,\n", " 'this': 0.18181818181818182,\n", " 'cat': 0.09090909090909091,\n", " 'is': 0.09090909090909091,\n", " 'like': 0.09090909090909091,\n", " 'a': 0.09090909090909091,\n", " 'i': 0.18181818181818182,\n", " '': 0.09090909090909091,\n", " '': 0.09090909090909091}" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Additional test\n", "trigram_counts = count_n_grams(sentences, 3)\n", "estimate_probabilities([\"\", \"\"], bigram_counts, trigram_counts, unique_words, k=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Expected output\n", "\n", "```CPP\n", "{'cat': 0.09090909090909091,\n", " 'i': 0.18181818181818182,\n", " 'this': 0.18181818181818182,\n", " 'a': 0.09090909090909091,\n", " 'is': 0.09090909090909091,\n", " 'like': 0.09090909090909091,\n", " 'dog': 0.09090909090909091,\n", " '': 0.09090909090909091,\n", " '': 
0.09090909090909091}\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Count and probability matrices\n", "\n", "As we have seen so far, the n-gram counts computed above are sufficient for computing the probabilities of the next word. \n", "- It can be more intuitive to present them as count or probability matrices.\n", "- The functions defined in the next cells return count or probability matrices.\n", "- This function is provided for you." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "def make_count_matrix(n_plus1_gram_counts, vocabulary):\n", " # add to the vocabulary\n", " # is omitted since it should not appear as the next word\n", " vocabulary = vocabulary + [\"\", \"\"]\n", " \n", " # obtain unique n-grams\n", " n_grams = []\n", " for n_plus1_gram in n_plus1_gram_counts.keys():\n", " n_gram = n_plus1_gram[0:-1]\n", " n_grams.append(n_gram)\n", " n_grams = list(set(n_grams))\n", " \n", " # mapping from n-gram to row\n", " row_index = {n_gram:i for i, n_gram in enumerate(n_grams)}\n", " # mapping from next word to column\n", " col_index = {word:j for j, word in enumerate(vocabulary)}\n", " \n", " nrow = len(n_grams)\n", " ncol = len(vocabulary)\n", " count_matrix = np.zeros((nrow, ncol))\n", " for n_plus1_gram, count in n_plus1_gram_counts.items():\n", " n_gram = n_plus1_gram[0:-1]\n", " word = n_plus1_gram[-1]\n", " if word not in vocabulary:\n", " continue\n", " i = row_index[n_gram]\n", " j = col_index[word]\n", " count_matrix[i, j] = count\n", " \n", " count_matrix = pd.DataFrame(count_matrix, index=n_grams, columns=vocabulary)\n", " return count_matrix" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "bigram counts\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dogthiscatislikeai<e><unk>
(<s>,)0.01.00.00.00.00.01.00.00.0
(dog,)0.00.00.01.00.00.00.00.00.0
(cat,)0.00.00.00.00.00.00.02.00.0
(this,)1.00.00.00.00.00.00.00.00.0
(is,)0.00.00.00.01.00.00.00.00.0
(like,)0.00.00.00.00.02.00.00.00.0
(a,)0.00.02.00.00.00.00.00.00.0
(i,)0.00.00.00.01.00.00.00.00.0
\n", "
" ], "text/plain": [ " dog this cat is like a i \n", "(,) 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0\n", "(dog,) 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0\n", "(cat,) 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.0 0.0\n", "(this,) 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0\n", "(is,) 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0\n", "(like,) 0.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0\n", "(a,) 0.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0\n", "(i,) 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sentences = [['i', 'like', 'a', 'cat'],\n", " ['this', 'dog', 'is', 'like', 'a', 'cat']]\n", "unique_words = list(set(sentences[0] + sentences[1]))\n", "bigram_counts = count_n_grams(sentences, 2)\n", "\n", "print('bigram counts')\n", "display(make_count_matrix(bigram_counts, unique_words))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Expected output\n", "\n", "```CPP\n", "bigram counts\n", " cat i this a is like dog \n", "(,) 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0\n", "(a,) 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0\n", "(this,) 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0\n", "(like,) 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0\n", "(dog,) 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0\n", "(cat,) 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.0 0.0\n", "(is,) 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0\n", "(i,) 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0\n", "```" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "trigram counts\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dogthiscatislikeai<e><unk>
(like, a)0.00.02.00.00.00.00.00.00.0
(<s>, <s>)0.01.00.00.00.00.01.00.00.0
(i, like)0.00.00.00.00.01.00.00.00.0
(a, cat)0.00.00.00.00.00.00.02.00.0
(dog, is)0.00.00.00.01.00.00.00.00.0
(cat,)0.00.00.00.00.00.00.02.00.0
(<s>, this)1.00.00.00.00.00.00.00.00.0
(this, dog)0.00.00.01.00.00.00.00.00.0
(is, like)0.00.00.00.00.01.00.00.00.0
(<s>, i)0.00.00.00.01.00.00.00.00.0
\n", "
" ], "text/plain": [ " dog this cat is like a i \n", "(like, a) 0.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0\n", "(, ) 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0\n", "(i, like) 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0\n", "(a, cat) 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.0 0.0\n", "(dog, is) 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0\n", "(cat,) 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.0 0.0\n", "(, this) 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0\n", "(this, dog) 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0\n", "(is, like) 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0\n", "(, i) 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Show trigram counts\n", "print('\\ntrigram counts')\n", "trigram_counts = count_n_grams(sentences, 3)\n", "display(make_count_matrix(trigram_counts, unique_words))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Expected output\n", "\n", "```CPP\n", "trigram counts\n", " cat i this a is like dog \n", "(dog, is) 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0\n", "(this, dog) 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0\n", "(a, cat) 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.0 0.0\n", "(like, a) 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0\n", "(is, like) 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0\n", "(, i) 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0\n", "(i, like) 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0\n", "(, ) 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0\n", "(, this) 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following function calculates the probabilities of each word given the previous n-gram, and stores this in matrix form.\n", "- This function is provided for you." ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "def make_probability_matrix(n_plus1_gram_counts, vocabulary, k):\n", " count_matrix = make_count_matrix(n_plus1_gram_counts, unique_words)\n", " count_matrix += k\n", " prob_matrix = count_matrix.div(count_matrix.sum(axis=1), axis=0)\n", " return prob_matrix" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "bigram probabilities\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dogthiscatislikeai<e><unk>
(<s>,)0.0909090.1818180.0909090.0909090.0909090.0909090.1818180.0909090.090909
(dog,)0.1000000.1000000.1000000.2000000.1000000.1000000.1000000.1000000.100000
(cat,)0.0909090.0909090.0909090.0909090.0909090.0909090.0909090.2727270.090909
(this,)0.2000000.1000000.1000000.1000000.1000000.1000000.1000000.1000000.100000
(is,)0.1000000.1000000.1000000.1000000.2000000.1000000.1000000.1000000.100000
(like,)0.0909090.0909090.0909090.0909090.0909090.2727270.0909090.0909090.090909
(a,)0.0909090.0909090.2727270.0909090.0909090.0909090.0909090.0909090.090909
(i,)0.1000000.1000000.1000000.1000000.2000000.1000000.1000000.1000000.100000
\n", "
" ], "text/plain": [ " dog this cat is like a i \\\n", "(,) 0.090909 0.181818 0.090909 0.090909 0.090909 0.090909 0.181818 \n", "(dog,) 0.100000 0.100000 0.100000 0.200000 0.100000 0.100000 0.100000 \n", "(cat,) 0.090909 0.090909 0.090909 0.090909 0.090909 0.090909 0.090909 \n", "(this,) 0.200000 0.100000 0.100000 0.100000 0.100000 0.100000 0.100000 \n", "(is,) 0.100000 0.100000 0.100000 0.100000 0.200000 0.100000 0.100000 \n", "(like,) 0.090909 0.090909 0.090909 0.090909 0.090909 0.272727 0.090909 \n", "(a,) 0.090909 0.090909 0.272727 0.090909 0.090909 0.090909 0.090909 \n", "(i,) 0.100000 0.100000 0.100000 0.100000 0.200000 0.100000 0.100000 \n", "\n", " \n", "(,) 0.090909 0.090909 \n", "(dog,) 0.100000 0.100000 \n", "(cat,) 0.272727 0.090909 \n", "(this,) 0.100000 0.100000 \n", "(is,) 0.100000 0.100000 \n", "(like,) 0.090909 0.090909 \n", "(a,) 0.090909 0.090909 \n", "(i,) 0.100000 0.100000 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sentences = [['i', 'like', 'a', 'cat'],\n", " ['this', 'dog', 'is', 'like', 'a', 'cat']]\n", "unique_words = list(set(sentences[0] + sentences[1]))\n", "bigram_counts = count_n_grams(sentences, 2)\n", "print(\"bigram probabilities\")\n", "display(make_probability_matrix(bigram_counts, unique_words, k=1))" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "trigram probabilities\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dogthiscatislikeai<e><unk>
(like, a)0.0909090.0909090.2727270.0909090.0909090.0909090.0909090.0909090.090909
(<s>, <s>)0.0909090.1818180.0909090.0909090.0909090.0909090.1818180.0909090.090909
(i, like)0.1000000.1000000.1000000.1000000.1000000.2000000.1000000.1000000.100000
(a, cat)0.0909090.0909090.0909090.0909090.0909090.0909090.0909090.2727270.090909
(dog, is)0.1000000.1000000.1000000.1000000.2000000.1000000.1000000.1000000.100000
(cat,)0.0909090.0909090.0909090.0909090.0909090.0909090.0909090.2727270.090909
(<s>, this)0.2000000.1000000.1000000.1000000.1000000.1000000.1000000.1000000.100000
(this, dog)0.1000000.1000000.1000000.2000000.1000000.1000000.1000000.1000000.100000
(is, like)0.1000000.1000000.1000000.1000000.1000000.2000000.1000000.1000000.100000
(<s>, i)0.1000000.1000000.1000000.1000000.2000000.1000000.1000000.1000000.100000
\n", "
" ], "text/plain": [ " dog this cat is like a \\\n", "(like, a) 0.090909 0.090909 0.272727 0.090909 0.090909 0.090909 \n", "(, ) 0.090909 0.181818 0.090909 0.090909 0.090909 0.090909 \n", "(i, like) 0.100000 0.100000 0.100000 0.100000 0.100000 0.200000 \n", "(a, cat) 0.090909 0.090909 0.090909 0.090909 0.090909 0.090909 \n", "(dog, is) 0.100000 0.100000 0.100000 0.100000 0.200000 0.100000 \n", "(cat,) 0.090909 0.090909 0.090909 0.090909 0.090909 0.090909 \n", "(, this) 0.200000 0.100000 0.100000 0.100000 0.100000 0.100000 \n", "(this, dog) 0.100000 0.100000 0.100000 0.200000 0.100000 0.100000 \n", "(is, like) 0.100000 0.100000 0.100000 0.100000 0.100000 0.200000 \n", "(, i) 0.100000 0.100000 0.100000 0.100000 0.200000 0.100000 \n", "\n", " i \n", "(like, a) 0.090909 0.090909 0.090909 \n", "(, ) 0.181818 0.090909 0.090909 \n", "(i, like) 0.100000 0.100000 0.100000 \n", "(a, cat) 0.090909 0.272727 0.090909 \n", "(dog, is) 0.100000 0.100000 0.100000 \n", "(cat,) 0.090909 0.272727 0.090909 \n", "(, this) 0.100000 0.100000 0.100000 \n", "(this, dog) 0.100000 0.100000 0.100000 \n", "(is, like) 0.100000 0.100000 0.100000 \n", "(, i) 0.100000 0.100000 0.100000 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "print(\"trigram probabilities\")\n", "trigram_counts = count_n_grams(sentences, 3)\n", "display(make_probability_matrix(trigram_counts, unique_words, k=1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Confirm that you obtain the same results as for the `estimate_probabilities` function that you implemented." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Part 3: Perplexity\n", "\n", "In this section, you will generate the perplexity score to evaluate your model on the test set. \n", "- You will also use back-off when needed. \n", "- Perplexity is used as an evaluation metric of your language model. \n", "- To calculate the the perplexity score of the test set on an n-gram model, use: \n", "\n", "$$ PP(W) =\\sqrt[N]{ \\prod_{t=n+1}^N \\frac{1}{P(w_t | w_{t-n} \\cdots w_{t-1})} } \\tag{4}$$\n", "\n", "- where $N$ is the length of the sentence.\n", "- $n$ is the number of words in the n-gram (e.g. 2 for a bigram).\n", "- In math, the numbering starts at one and not zero.\n", "\n", "In code, array indexing starts at zero, so the code will use ranges for $t$ according to this formula:\n", "\n", "$$ PP(W) =\\sqrt[N]{ \\prod_{t=n}^{N-1} \\frac{1}{P(w_t | w_{t-n} \\cdots w_{t-1})} } \\tag{4.1}$$\n", "\n", "The higher the probabilities are, the lower the perplexity will be. \n", "- The more the n-grams tell us about the sentence, the lower the perplexity score will be. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Exercise 10\n", "Compute the perplexity score given an N-gram count matrix and a sentence. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", " Hints\n", "\n", "

\n", "

    \n", "
  • Remember that range(2,4) produces the integers [2, 3] (and excludes 4).
  • \n", "
\n", "

\n" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "# UNQ_C10 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)\n", "# GRADED FUNCTION: calculate_perplexity\n", "def calculate_perplexity(sentence, n_gram_counts, n_plus1_gram_counts, vocabulary_size, k=1.0):\n", " \"\"\"\n", " Calculate perplexity for a list of sentences\n", " \n", " Args:\n", " sentence: List of strings\n", " n_gram_counts: Dictionary of counts of (n+1)-grams\n", " n_plus1_gram_counts: Dictionary of counts of (n+1)-grams\n", " vocabulary_size: number of unique words in the vocabulary\n", " k: Positive smoothing constant\n", " \n", " Returns:\n", " Perplexity score\n", " \"\"\"\n", " # length of previous words\n", " n = len(list(n_gram_counts.keys())[0]) \n", " \n", " # prepend and append \n", " sentence = [\"\"] * n + sentence + [\"\"]\n", " \n", " # Cast the sentence from a list to a tuple\n", " sentence = tuple(sentence)\n", " \n", " # length of sentence (after adding and tokens)\n", " N = len(sentence)\n", " \n", " # The variable p will hold the product\n", " # that is calculated inside the n-root\n", " # Update this in the code below\n", " product_pi = 1.0\n", " \n", " ### START CODE HERE (Replace instances of 'None' with your code) ###\n", " # Index t ranges from n to N - 1\n", " for t in range(n, N): # complete this line\n", "\n", " # get the n-gram preceding the word at position t\n", " n_gram = sentence[t-n:t]\n", " \n", " # get the word at position t\n", " word = sentence[t]\n", " \n", " # Estimate the probability of the word given the n-gram\n", " # using the n-gram counts, n-plus1-gram counts,\n", " # vocabulary size, and smoothing constant\n", " probability = estimate_probability(word,n_gram, n_gram_counts, n_plus1_gram_counts, len(unique_words), k=1)\n", " \n", " # Update the product of the probabilities\n", " # This 'product_pi' is a cumulative product \n", " # of the (1/P) factors that are calculated in the loop\n", " product_pi *= 1 / probability\n", "\n", " # Take the Nth root of the product\n", " perplexity = product_pi**(1/float(N))\n", " \n", " ### END CODE HERE ### \n", " return perplexity" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Perplexity for first train sample: 2.8040\n", "Perplexity for test sample: 3.9654\n" ] } ], "source": [ "# test your code\n", "\n", "sentences = [['i', 'like', 'a', 'cat'],\n", " ['this', 'dog', 'is', 'like', 'a', 'cat']]\n", "unique_words = list(set(sentences[0] + sentences[1]))\n", "\n", "unigram_counts = count_n_grams(sentences, 1)\n", "bigram_counts = count_n_grams(sentences, 2)\n", "\n", "\n", "perplexity_train1 = calculate_perplexity(sentences[0],\n", " unigram_counts, bigram_counts,\n", " len(unique_words), k=1.0)\n", "print(f\"Perplexity for first train sample: {perplexity_train1:.4f}\")\n", "\n", "test_sentence = ['i', 'like', 'a', 'dog']\n", "perplexity_test = calculate_perplexity(test_sentence,\n", " unigram_counts, bigram_counts,\n", " len(unique_words), k=1.0)\n", "print(f\"Perplexity for test sample: {perplexity_test:.4f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Expected Output\n", "\n", "```CPP\n", "Perplexity for first train sample: 2.8040\n", "Perplexity for test sample: 3.9654\n", "```\n", "\n", " Note: If your sentence is really long, there will be underflow when multiplying many fractions.\n", "- To handle longer sentences, modify your implementation to take the sum of the log of the probabilities." 
{ "cell_type": "markdown", "metadata": {}, "source": [ "<a name='4'></a>\n", "## Part 4: Build an auto-complete system\n", "\n", "In this section, you will combine the language models developed so far to implement an auto-complete system. \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<a name='ex-11'></a>\n", "### Exercise 11\n", "Compute probabilities for all possible next words and suggest the most likely one.\n", "- This function also takes an optional argument `start_with`, which specifies the first few letters of the next word." ] }, { "cell_type": "markdown", "metadata": {}, "source": [
\n", "\n", " Hints\n", "\n", "

\n", "

    \n", "
  • estimate_probabilities returns a dictionary where the key is a word and the value is the word's probability.
  • \n", "
  • Use str1.startswith(str2) to determine if a string starts with the letters of another string. For example, 'learning'.startswith('lea') returns True, whereas 'learning'.startswith('ear') returns False. There are two additional parameters in str.startswith(), but you can use the default values for those parameters in this case.
  • \n", "
\n", "

" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "# UNQ_C11 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)\n", "# GRADED FUNCTION: suggest_a_word\n", "def suggest_a_word(previous_tokens, n_gram_counts, n_plus1_gram_counts, vocabulary, k=1.0, start_with=None):\n", " \"\"\"\n", " Get suggestion for the next word\n", " \n", " Args:\n", " previous_tokens: The sentence you input where each token is a word. Must have length > n \n", " n_gram_counts: Dictionary of counts of (n+1)-grams\n", " n_plus1_gram_counts: Dictionary of counts of (n+1)-grams\n", " vocabulary: List of words\n", " k: positive constant, smoothing parameter\n", " start_with: If not None, specifies the first few letters of the next word\n", " \n", " Returns:\n", " A tuple of \n", " - string of the most likely next word\n", " - corresponding probability\n", " \"\"\"\n", " \n", " # length of previous words\n", " n = len(list(n_gram_counts.keys())[0]) \n", " \n", " # From the words that the user already typed\n", " # get the most recent 'n' words as the previous n-gram\n", " previous_n_gram = previous_tokens[-n:]\n", "\n", " # Estimate the probabilities that each word in the vocabulary\n", " # is the next word,\n", " # given the previous n-gram, the dictionary of n-gram counts,\n", " # the dictionary of n plus 1 gram counts, and the smoothing constant\n", " probabilities = estimate_probabilities(previous_n_gram,\n", " n_gram_counts, n_plus1_gram_counts,\n", " vocabulary, k=k)\n", " \n", " # Initialize suggested word to None\n", " # This will be set to the word with highest probability\n", " suggestion = None\n", " \n", " # Initialize the highest word probability to 0\n", " # this will be set to the highest probability \n", " # of all words to be suggested\n", " max_prob = 0\n", " \n", " ### START CODE HERE (Replace instances of 'None' with your code) ###\n", " \n", " # For each word and its probability in the probabilities dictionary:\n", " for word, prob in probabilities.items(): # complete this line\n", " \n", " # If the optional start_with string is set\n", " if start_with != None: # complete this line\n", "\n", " # Check if the word starts with the letters in 'start_with'\n", " if not word.startswith(start_with): # complete this line\n", "\n", " #If so, don't consider this word (move onto the next word)\n", " continue # complete this line\n", " \n", " # Check if this word's probability\n", " # is greater than the current maximum probability\n", " if prob > max_prob: # complete this line\n", " \n", " # If so, save this word as the best suggestion (so far)\n", " suggestion = word\n", " \n", " # Save the new maximum probability\n", " max_prob = prob\n", "\n", " ### END CODE HERE\n", " \n", " return suggestion, max_prob" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The previous words are 'i like',\n", "\tand the suggested word is `a` with a probability of 0.2727\n", "\n", "The previous words are 'i like', the suggestion must start with `c`\n", "\tand the suggested word is `cat` with a probability of 0.0909\n" ] } ], "source": [ "# test your code\n", "sentences = [['i', 'like', 'a', 'cat'],\n", " ['this', 'dog', 'is', 'like', 'a', 'cat']]\n", "unique_words = list(set(sentences[0] + sentences[1]))\n", "\n", "unigram_counts = count_n_grams(sentences, 1)\n", "bigram_counts = count_n_grams(sentences, 2)\n", "\n", "previous_tokens = [\"i\", \"like\"]\n", "tmp_suggest1 = suggest_a_word(previous_tokens, 
"tmp_suggest1 = suggest_a_word(previous_tokens, unigram_counts, bigram_counts, unique_words, k=1.0)\n", "print(f\"The previous words are 'i like',\\n\\tand the suggested word is `{tmp_suggest1[0]}` with a probability of {tmp_suggest1[1]:.4f}\")\n", "\n", "print()\n", "# test your code when setting the starts_with\n", "tmp_starts_with = 'c'\n", "tmp_suggest2 = suggest_a_word(previous_tokens, unigram_counts, bigram_counts, unique_words, k=1.0, start_with=tmp_starts_with)\n", "print(f\"The previous words are 'i like', the suggestion must start with `{tmp_starts_with}`\\n\\tand the suggested word is `{tmp_suggest2[0]}` with a probability of {tmp_suggest2[1]:.4f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Expected output\n", "\n", "```CPP\n", "The previous words are 'i like',\n", "\tand the suggested word is `a` with a probability of 0.2727\n", "\n", "The previous words are 'i like', the suggestion must start with `c`\n", "\tand the suggested word is `cat` with a probability of 0.0909\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Get multiple suggestions\n", "\n", "The function defined below loops over the various n-gram models to get multiple suggestions." ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "def get_suggestions(previous_tokens, n_gram_counts_list, vocabulary, k=1.0, start_with=None):\n", "    model_counts = len(n_gram_counts_list)\n", "    suggestions = []\n", "    for i in range(model_counts-1):\n", "        n_gram_counts = n_gram_counts_list[i]\n", "        n_plus1_gram_counts = n_gram_counts_list[i+1]\n", "        \n", "        suggestion = suggest_a_word(previous_tokens, n_gram_counts,\n", "                                    n_plus1_gram_counts, vocabulary,\n", "                                    k=k, start_with=start_with)\n", "        suggestions.append(suggestion)\n", "    return suggestions" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The previous words are 'i like', the suggestions are:\n" ] }, { "data": { "text/plain": [ "[('a', 0.2727272727272727),\n", " ('a', 0.2),\n", " ('dog', 0.1111111111111111),\n", " ('dog', 0.1111111111111111)]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# test your code\n", "sentences = [['i', 'like', 'a', 'cat'],\n", "             ['this', 'dog', 'is', 'like', 'a', 'cat']]\n", "unique_words = list(set(sentences[0] + sentences[1]))\n", "\n", "unigram_counts = count_n_grams(sentences, 1)\n", "bigram_counts = count_n_grams(sentences, 2)\n", "trigram_counts = count_n_grams(sentences, 3)\n", "quadgram_counts = count_n_grams(sentences, 4)\n", "quintgram_counts = count_n_grams(sentences, 5)\n", "\n", "n_gram_counts_list = [unigram_counts, bigram_counts, trigram_counts, quadgram_counts, quintgram_counts]\n", "previous_tokens = [\"i\", \"like\"]\n", "tmp_suggest3 = get_suggestions(previous_tokens, n_gram_counts_list, unique_words, k=1.0)\n", "\n", "print(f\"The previous words are 'i like', the suggestions are:\")\n", "display(tmp_suggest3)" ] },
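{ "cell_type": "markdown", "metadata": {}, "source": [ "Part 3 mentioned backing off to shorter contexts when needed. The sketch below is illustrative only, not part of the graded exercises, and the helper name `suggest_with_backoff` is our own: it returns the suggestion from the highest-order model whose context n-gram was actually observed in training, falling back to shorter contexts otherwise." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative sketch (hypothetical helper, not graded): a simple backoff-style\n", "# selection over the models in n_gram_counts_list.\n", "def suggest_with_backoff(previous_tokens, n_gram_counts_list, vocabulary, k=1.0, start_with=None):\n", "    # iterate from the longest context to the shortest\n", "    for i in reversed(range(len(n_gram_counts_list) - 1)):\n", "        # the model pair (i, i+1) conditions on a context of (i+1) words\n", "        context = tuple(previous_tokens[-(i + 1):])\n", "        # use this model pair only if the context was seen in training;\n", "        # otherwise fall back to the next shorter context\n", "        if context in n_gram_counts_list[i]:\n", "            return suggest_a_word(previous_tokens, n_gram_counts_list[i],\n", "                                  n_gram_counts_list[i + 1], vocabulary,\n", "                                  k=k, start_with=start_with)\n", "    # no context of any length was observed\n", "    return None, 0.0\n", "\n", "# with the toy counts above, this backs off to the bigram/trigram pair\n", "display(suggest_with_backoff(previous_tokens, n_gram_counts_list, unique_words, k=1.0))" ] },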
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Suggest multiple words using n-grams of varying length\n", "\n", "Congratulations! You have developed all the building blocks for implementing your own auto-complete system.\n", "\n", "Let's see this in action with n-grams of varying lengths (unigrams, bigrams, trigrams, 4-grams, and 5-grams) computed on the full training data." ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Computing n-gram counts with n = 1 ...\n", "Computing n-gram counts with n = 2 ...\n", "Computing n-gram counts with n = 3 ...\n", "Computing n-gram counts with n = 4 ...\n", "Computing n-gram counts with n = 5 ...\n" ] } ], "source": [ "n_gram_counts_list = []\n", "for n in range(1, 6):\n", "    print(\"Computing n-gram counts with n =\", n, \"...\")\n", "    n_model_counts = count_n_grams(train_data_processed, n)\n", "    n_gram_counts_list.append(n_model_counts)" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The previous words are ['i', 'am', 'to'], the suggestions are:\n" ] }, { "data": { "text/plain": [ "[('be', 0.027665685098338604),\n", " ('have', 0.00013487086115044844),\n", " ('have', 0.00013490725126475548),\n", " ('i', 6.746272684341901e-05)]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "previous_tokens = [\"i\", \"am\", \"to\"]\n", "tmp_suggest4 = get_suggestions(previous_tokens, n_gram_counts_list, vocabulary, k=1.0)\n", "\n", "print(f\"The previous words are {previous_tokens}, the suggestions are:\")\n", "display(tmp_suggest4)" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The previous words are ['i', 'want', 'to', 'go'], the suggestions are:\n" ] }, { "data": { "text/plain": [ "[('to', 0.014051961029228078),\n", " ('to', 0.004697942168993581),\n", " ('to', 0.0009424436216762033),\n", " ('to', 0.0004044489383215369)]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "previous_tokens = [\"i\", \"want\", \"to\", \"go\"]\n", "tmp_suggest5 = get_suggestions(previous_tokens, n_gram_counts_list, vocabulary, k=1.0)\n", "\n", "print(f\"The previous words are {previous_tokens}, the suggestions are:\")\n", "display(tmp_suggest5)" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The previous words are ['hey', 'how', 'are'], the suggestions are:\n" ] }, { "data": { "text/plain": [ "[('you', 0.023426812585499317),\n", " ('you', 0.003559435862995299),\n", " ('you', 0.00013491635186184566),\n", " ('i', 6.746272684341901e-05)]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "previous_tokens = [\"hey\", \"how\", \"are\"]\n", "tmp_suggest6 = get_suggestions(previous_tokens, n_gram_counts_list, vocabulary, k=1.0)\n", "\n", "print(f\"The previous words are {previous_tokens}, the suggestions are:\")\n", "display(tmp_suggest6)" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The previous words are ['hey', 'how', 'are', 'you'], the suggestions are:\n" ] }, { "data": { "text/plain": [ "[(\"'re\", 0.023973994311255586),\n", " ('?', 0.002888465830762161),\n", " ('?', 0.0016134453781512605),\n", " ('<e>', 0.00013491635186184566)]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "previous_tokens = [\"hey\", \"how\", \"are\", \"you\"]\n", "tmp_suggest7 = get_suggestions(previous_tokens, n_gram_counts_list, vocabulary, k=1.0)\n", "\n", "print(f\"The previous words are {previous_tokens}, the suggestions are:\")\n", "display(tmp_suggest7)" ] },
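{ "cell_type": "markdown", "metadata": {}, "source": [ "In a real auto-complete system the input arrives as a raw string rather than a token list. A small illustrative glue step (not part of the assignment; it assumes the same lowercasing and `nltk.word_tokenize` tokenization used during preprocessing in Part 1) would be:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative glue (not graded): go from a raw user string to suggestions.\n", "# Assumes the same lowercasing + nltk.word_tokenize used in Part 1's preprocessing.\n", "user_input = \"hey how are you\"\n", "previous_tokens = nltk.word_tokenize(user_input.lower())\n", "\n", "display(get_suggestions(previous_tokens, n_gram_counts_list, vocabulary, k=1.0))" ] },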
"stream", "text": [ "The previous words are ['hey', 'how', 'are', 'you'], the suggestions are:\n" ] }, { "data": { "text/plain": [ "[('do', 0.009020723283218204),\n", " ('doing', 0.0016411737674785006),\n", " ('doing', 0.00047058823529411766),\n", " ('dvd', 6.745817593092283e-05)]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "previous_tokens = [\"hey\", \"how\", \"are\", \"you\"]\n", "tmp_suggest8 = get_suggestions(previous_tokens, n_gram_counts_list, vocabulary, k=1.0, start_with=\"d\")\n", "\n", "print(f\"The previous words are {previous_tokens}, the suggestions are:\")\n", "display(tmp_suggest8)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Congratulations!\n", "\n", "You've completed this assignment by building an autocomplete model using an n-gram language model! \n", "\n", "Please continue onto the fourth and final week of this course!" ] } ], "metadata": { "coursera": { "schema_names": [ "NLPC2-3" ] }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" } }, "nbformat": 4, "nbformat_minor": 4 }