{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "# Building the language model\n", "\n", "\n", "### Count matrix\n", "\n", "To calculate the n-gram probability, you will need to count frequencies of n-grams and n-gram prefixes in the training dataset. In some of the code assignment exercises, you will store the n-gram frequencies in a dictionary. \n", "\n", "In other parts of the assignment, you will build a count matrix that keeps counts of (n-1)-gram prefix followed by all possible last words in the vocabulary.\n", "\n", "The following code shows how to check, retrieve and update counts of n-grams in the word count dictionary." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "count of n-gram ('i', 'am', 'happy'): 2\n", "n-gram ('i', 'am', 'learning') missing\n", "n-gram ('i', 'am', 'learning') found\n" ] } ], "source": [ "# manipulate n_gram count dictionary\n", "\n", "n_gram_counts = {\n", " ('i', 'am', 'happy'): 2,\n", " ('am', 'happy', 'because'): 1}\n", "\n", "# get count for an n-gram tuple\n", "print(f\"count of n-gram {('i', 'am', 'happy')}: {n_gram_counts[('i', 'am', 'happy')]}\")\n", "\n", "# check if n-gram is present in the dictionary\n", "if ('i', 'am', 'learning') in n_gram_counts:\n", " print(f\"n-gram {('i', 'am', 'learning')} found\")\n", "else:\n", " print(f\"n-gram {('i', 'am', 'learning')} missing\")\n", "\n", "# update the count in the word count dictionary\n", "n_gram_counts[('i', 'am', 'learning')] = 1\n", "if ('i', 'am', 'learning') in n_gram_counts:\n", " print(f\"n-gram {('i', 'am', 'learning')} found\")\n", "else:\n", " print(f\"n-gram {('i', 'am', 'learning')} missing\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next code snippet shows how to merge two tuples in Python. That will be handy when creating the n-gram from the prefix and the last word." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('i', 'am', 'happy', 'because')\n" ] } ], "source": [ "# concatenate tuple for prefix and tuple with the last word to create the n_gram\n", "prefix = ('i', 'am', 'happy')\n", "word = 'because'\n", "\n", "# note here the syntax for creating a tuple for a single word\n", "n_gram = prefix + (word,)\n", "print(n_gram)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the lecture, you've seen that the count matrix could be made in a single pass through the corpus. Here is one approach to do that." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " happy because i am learning .\n", "(i, am) 1.0 0.0 0.0 0.0 1.0 0.0\n", "(am, happy) 0.0 1.0 0.0 0.0 0.0 0.0\n", "(happy, because) 0.0 0.0 1.0 0.0 0.0 0.0\n", "(because, i) 0.0 0.0 0.0 1.0 0.0 0.0\n", "(am, learning) 0.0 0.0 0.0 0.0 0.0 1.0\n" ] } ], "source": [ "import numpy as np\n", "import pandas as pd\n", "from collections import defaultdict\n", "def single_pass_trigram_count_matrix(corpus):\n", " \"\"\"\n", " Creates the trigram count matrix from the input corpus in a single pass through the corpus.\n", " \n", " Args:\n", " corpus: Pre-processed and tokenized corpus. 
\n", " \n", " Returns:\n", " bigrams: list of all bigram prefixes, row index\n", " vocabulary: list of all found words, the column index\n", " count_matrix: pandas dataframe with bigram prefixes as rows, \n", " vocabulary words as columns \n", " and the counts of the bigram/word combinations (i.e. trigrams) as values\n", " \"\"\"\n", " bigrams = []\n", " vocabulary = []\n", " count_matrix_dict = defaultdict(dict)\n", " \n", " # go through the corpus once with a sliding window\n", " for i in range(len(corpus) - 3 + 1):\n", " # the sliding window starts at position i and contains 3 words\n", " trigram = tuple(corpus[i : i + 3])\n", " \n", " bigram = trigram[0 : -1]\n", " if not bigram in bigrams:\n", " bigrams.append(bigram) \n", " \n", " last_word = trigram[-1]\n", " if not last_word in vocabulary:\n", " vocabulary.append(last_word)\n", " \n", " if (bigram,last_word) not in count_matrix_dict:\n", " count_matrix_dict[bigram,last_word] = 0\n", " \n", " count_matrix_dict[bigram,last_word] += 1\n", " \n", " # convert the count_matrix to np.array to fill in the blanks\n", " count_matrix = np.zeros((len(bigrams), len(vocabulary)))\n", " for trigram_key, trigam_count in count_matrix_dict.items():\n", " count_matrix[bigrams.index(trigram_key[0]), \\\n", " vocabulary.index(trigram_key[1])]\\\n", " = trigam_count\n", " \n", " # np.array to pandas dataframe conversion\n", " count_matrix = pd.DataFrame(count_matrix, index=bigrams, columns=vocabulary)\n", " return bigrams, vocabulary, count_matrix\n", "\n", "corpus = ['i', 'am', 'happy', 'because', 'i', 'am', 'learning', '.']\n", "\n", "bigrams, vocabulary, count_matrix = single_pass_trigram_count_matrix(corpus)\n", "\n", "print(count_matrix)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Probability matrix\n", "The next step is to build a probability matrix from the count matrix. \n", "\n", "You can use an object dataframe from library pandas and its methods [sum](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sum.html?highlight=sum#pandas.DataFrame.sum) and [div](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.div.html) to normalize the cell counts with the sum of the respective rows. " ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " happy because i am learning .\n", "(i, am) 0.5 0.0 0.0 0.0 0.5 0.0\n", "(am, happy) 0.0 1.0 0.0 0.0 0.0 0.0\n", "(happy, because) 0.0 0.0 1.0 0.0 0.0 0.0\n", "(because, i) 0.0 0.0 0.0 1.0 0.0 0.0\n", "(am, learning) 0.0 0.0 0.0 0.0 0.0 1.0\n" ] } ], "source": [ "# create the probability matrix from the count matrix\n", "row_sums = count_matrix.sum(axis=1)\n", "# delete each row by its sum\n", "prob_matrix = count_matrix.div(row_sums, axis=0)\n", "\n", "print(prob_matrix)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The probability matrix now helps you to find a probability of an input trigram. 
" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "bigram: ('i', 'am')\n", "word: happy\n", "trigram_probability: 0.5\n" ] } ], "source": [ "# find the probability of a trigram in the probability matrix\n", "trigram = ('i', 'am', 'happy')\n", "\n", "# find the prefix bigram \n", "bigram = trigram[:-1]\n", "print(f'bigram: {bigram}')\n", "\n", "# find the last word of the trigram\n", "word = trigram[-1]\n", "print(f'word: {word}')\n", "\n", "# we are using the pandas dataframes here, column with vocabulary word comes first, row with the prefix bigram second\n", "trigram_probability = prob_matrix[word][bigram]\n", "print(f'trigram_probability: {trigram_probability}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the code assignment, you will be searching for the most probable words starting with a prefix. You can use the method [str.startswith](https://docs.python.org/3/library/stdtypes.html#str.startswith) to test if a word starts with a prefix. \n", "\n", "Here is a code snippet showing how to use this method." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "words in vocabulary starting with prefix: ha\n", "\n", "happy\n", "have\n" ] } ], "source": [ "# lists all words in vocabulary starting with a given prefix\n", "vocabulary = ['i', 'am', 'happy', 'because', 'learning', '.', 'have', 'you', 'seen','it', '?']\n", "starts_with = 'ha'\n", "\n", "print(f'words in vocabulary starting with prefix: {starts_with}\\n')\n", "for word in vocabulary:\n", " if word.startswith(starts_with):\n", " print(word)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Language model evaluation\n", "\n", "### Train/validation/test split\n", "In the videos, you saw that to evaluate language models, you need to keep some of the corpus data for validation and testing.\n", "\n", "The choice of the test and validation data should correspond as much as possible to the distribution of the data coming from the actual application. If nothing but the input corpus is known, then random sampling from the corpus is used to define the test and validation subset. \n", "\n", "Here is a code similar to what you'll see in the code assignment. The following function allows you to randomly sample the input data and return train/validation/test subsets in a split given by the method parameters." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "split 80/10/10:\n", " train data:[28, 76, 5, 0, 62, 29, 54, 95, 88, 58, 4, 22, 92, 14, 50, 77, 47, 33, 75, 68, 56, 74, 43, 80, 83, 84, 73, 93, 66, 87, 9, 91, 64, 79, 20, 51, 17, 27, 12, 31, 67, 81, 7, 34, 45, 72, 38, 30, 16, 60, 40, 86, 48, 21, 70, 59, 6, 19, 2, 99, 37, 36, 52, 61, 97, 44, 26, 57, 89, 55, 53, 85, 3, 39, 10, 71, 23, 32, 25, 8]\n", " validation data:[78, 65, 63, 11, 49, 98, 1, 46, 15, 41]\n", " test data:[90, 96, 82, 42, 35, 13, 69, 24, 94, 18]\n", "\n", "split 98/1/1:\n", " train data:[66, 23, 29, 28, 52, 87, 70, 13, 15, 2, 62, 43, 82, 50, 40, 32, 30, 79, 71, 89, 6, 10, 34, 78, 11, 49, 39, 42, 26, 46, 58, 96, 97, 8, 56, 86, 33, 93, 92, 91, 57, 65, 95, 20, 72, 3, 12, 9, 47, 37, 67, 1, 16, 74, 53, 99, 54, 68, 5, 18, 27, 17, 48, 36, 24, 45, 73, 19, 41, 59, 21, 98, 0, 31, 4, 85, 80, 64, 84, 88, 25, 44, 61, 22, 60, 94, 76, 38, 77, 81, 90, 69, 63, 7, 51, 14, 55, 83]\n", " validation data:[35]\n", " test data:[75]\n", "\n" ] } ], "source": [ "# we only need train and validation %, test is the remainder\n", "import random\n", "def train_validation_test_split(data, train_percent, validation_percent):\n", " \"\"\"\n", " Splits the input data to train/validation/test according to the percentage provided\n", " \n", " Args:\n", " data: Pre-processed and tokenized corpus, i.e. list of sentences.\n", " train_percent: integer 0-100, defines the portion of input corpus allocated for training\n", " validation_percent: integer 0-100, defines the portion of input corpus allocated for validation\n", " \n", " Note: train_percent + validation_percent need to be <=100\n", " the reminder to 100 is allocated for the test set\n", " \n", " Returns:\n", " train_data: list of sentences, the training part of the corpus\n", " validation_data: list of sentences, the validation part of the corpus\n", " test_data: list of sentences, the test part of the corpus\n", " \"\"\"\n", " # fixed seed here for reproducibility\n", " random.seed(87)\n", " \n", " # reshuffle all input sentences\n", " random.shuffle(data)\n", "\n", " train_size = int(len(data) * train_percent / 100)\n", " train_data = data[0:train_size]\n", " \n", " validation_size = int(len(data) * validation_percent / 100)\n", " validation_data = data[train_size:train_size + validation_size]\n", " \n", " test_data = data[train_size + validation_size:]\n", " \n", " return train_data, validation_data, test_data\n", "\n", "data = [x for x in range (0, 100)]\n", "\n", "train_data, validation_data, test_data = train_validation_test_split(data, 80, 10)\n", "print(\"split 80/10/10:\\n\",f\"train data:{train_data}\\n\", f\"validation data:{validation_data}\\n\", \n", " f\"test data:{test_data}\\n\")\n", "\n", "train_data, validation_data, test_data = train_validation_test_split(data, 98, 1)\n", "print(\"split 98/1/1:\\n\",f\"train data:{train_data}\\n\", f\"validation data:{validation_data}\\n\", \n", " f\"test data:{test_data}\\n\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Perplexity\n", "\n", "In order to implement the perplexity formula, you'll need to know how to implement m-th order root of a variable.\n", "\n", "\\begin{equation*}\n", "PP(W)=\\sqrt[M]{\\prod_{i=1}^{m}{\\frac{1}{P(w_i|w_{i-1})}}}\n", "\\end{equation*}\n", "\n", "Remember from calculus:\n", "\n", "\\begin{equation*}\n", "\\sqrt[M]{\\frac{1}{x}} = x^{-\\frac{1}{M}}\n", "\\end{equation*}\n", "\n", "Here is a code that will help 
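{ "cell_type": "markdown", "metadata": {}, "source": [ "To see the whole formula in action, here is a hedged sketch of per-sentence perplexity with a bigram model. The bigram probabilities below are hypothetical stand-ins (in the assignment they come from your trained model), and for longer sentences you would sum log probabilities instead of multiplying, to avoid the kind of numerical underflow illustrated by the tiny p above." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# hedged sketch: perplexity of one sentence under hypothetical bigram probabilities\n", "bigram_probabilities = {\n", "    ('i', 'am'): 1.0,\n", "    ('am', 'happy'): 0.5,\n", "    ('happy', '.'): 1.0,\n", "}\n", "\n", "sentence = ['i', 'am', 'happy', '.']\n", "M = len(sentence) - 1  # number of bigrams in the sentence\n", "\n", "# product of inverse probabilities, followed by the M-th root\n", "product = 1.0\n", "for i in range(1, len(sentence)):\n", "    product *= 1 / bigram_probabilities[(sentence[i - 1], sentence[i])]\n", "\n", "perplexity = product ** (1 / M)\n", "print(perplexity)" ] },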
you with the formula." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "316.22776601683796\n" ] } ], "source": [ "# to calculate the exponent, use the following syntax\n", "p = 10 ** (-250)\n", "M = 100\n", "perplexity = p ** (-1 / M)\n", "print(perplexity)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's all for the lab for \"N-gram language model\" lesson of week 3." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" } }, "nbformat": 4, "nbformat_minor": 4 }