{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2016-10-13T11:56:44.568146", "start_time": "2016-10-13T11:56:44.552022" }, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2\n", "%matplotlib inline\n", "# %cd .. \n", "import sys\n", "sys.path.append(\"..\")\n", "import matplotlib\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "matplotlib.rcParams['figure.figsize'] = (10.0, 6.0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "$$\n", "\\newcommand{\\prob}{p}\n", "\\newcommand{\\vocab}{V}\n", "\\newcommand{\\params}{\\boldsymbol{\\theta}}\n", "\\newcommand{\\param}{\\theta}\n", "\\DeclareMathOperator{\\perplexity}{PP}\n", "\\DeclareMathOperator{\\argmax}{argmax}\n", "\\newcommand{\\train}{\\mathcal{D}}\n", "\\newcommand{\\counts}[2]{\\#_{#1}(#2) }\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Language Models\n", "Language models (LMs) calculate the probability to see a given sequence of words, as defined through a [tokenization](todo) algorithm, in a given language or sub-language/domain/genre. For example, an English language model may assign a higher probability to seeing the sequence \"How are you?\" than to \"Wassup' dawg?\", and for a hip-hop language model this proportion may be reversed. Language models (LMs) calculate the probability to see a given sequence of words.\n", "\n", "There are several use cases for such models: \n", "\n", "* To filter out bad translations in machine translation.\n", "* To rank speech recognition output. \n", "* In concept-to-text generation.\n", "\n", "More formally, a language model is a stochastic process that models the probability \\\\(\\prob(w_1,\\ldots,w_d)\\\\) of observing sequences of words \\\\(w_1,\\ldots,w_d\\\\). We can, without loss of generality, decompose the probability of such sequences into \n", "\n", "$$\n", "\\prob(w_1,\\ldots,w_d) = \\prob(w_1) \\prod_{i = 2}^d \\prob(w_i|w_1,\\ldots,w_{i-1}).\n", "$$\n", "\n", "This means that a language model can be defined by how it models the conditional probablity $\\prob(w_i|w_1,\\ldots,w_{i-1})$ of seeing a word \\\\(w_i\\\\) after having seen the *history* of previous words $w_1,\\ldots,w_{i-1}$. We also have to model the prior probability $\\prob(w_1)$, but it is easy to reduce this prior to a conditional probability as well.\n", "\n", "In practice it is common to define language models based on *equivalence classes* of histories instead of having different conditional distributions for each possible history. This overcomes sparsity and efficiency problems when working with full histories." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## N-gram Language Models\n", "\n", "The most common type of equivalence class relies on *truncating* histories $w_1,\\ldots,w_{i-1}$ to length $n-1$:\n", "$$\n", "\\prob(w_i|w_1,\\ldots,w_{i-1}) = \\prob(w_i|w_{i-n},\\ldots,w_{i-1}).\n", "$$\n", "\n", "That is, the probability of a word only depends on the last $n-1$ previous words. We will refer to such model as a *n-gram language model*.\n", "\n", "## A Uniform Baseline LM\n", "\n", "*Unigram* models are the simplest 1-gram language models. 
That is, they model the conditional probability of a word using the prior probability of seeing that word:\n", "$$\n", "\\prob(w_i|w_1,\\ldots,w_{i-1}) = \\prob(w_i).\n", "$$\n", "\n", "To set up our datasets and provide a baseline for more complex language models, we first introduce the simplest instantiation of a unigram model: a *uniform* language model which assigns the same prior probability to each word. That is, given a *vocabulary* of words \\\\(\\vocab\\\\), the uniform LM is defined as:\n", "\n", "$$\n", "\\prob(w_i|w_1,\\ldots,w_{i-1}) = \\frac{1}{|\\vocab|}.\n", "$$\n", "\n", "Let us \"train\" and test such a language model on the OHHLA corpus. First we need to load this corpus. Below we focus on a subset to make our code more responsive and to allow us to test models more quickly. Check the [loading from OHHLA](load_ohhla.ipynb) notebook to see how `load_albums` and `words` are defined. " ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2016-10-13T11:56:44.936040", "start_time": "2016-10-13T11:56:44.794073" }, "run_control": { "frozen": false, "read_only": false } }, "outputs": [ { "data": { "text/plain": [ "\"[BAR] Can 't even call this a blues song [/BAR] [BAR] It 's been so long [/BAR] [BAR] Neither one of us was wrong or anything like that [/BAR] [BAR] It seems like yesterday [/BAR]\"" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import statnlpbook.util as util\n", "util.execute_notebook('load_ohhla.ipynb')\n", "# docs = load_albums(j_live)\n", "# docs = load_all_songs(\"../data/ohhla/train\")\n", "docs = load_all_songs(\"../data/ohhla/train/www.ohhla.com/anonymous/j_live/\")\n", "trainDocs, testDocs = docs[:len(docs)//2], docs[len(docs)//2:] \n", "train = words(trainDocs)\n", "test = words(testDocs)\n", "\" \".join(train[0:35])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now create a uniform language model. Language models in this book implement the `LanguageModel` [abstract base class](https://docs.python.org/3/library/abc.html). " ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2016-10-13T11:56:44.954182", "start_time": "2016-10-13T11:56:44.937783" }, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "import abc \n", "class LanguageModel(metaclass=abc.ABCMeta):\n", "    \"\"\"\n", "    Args:\n", "        vocab: the vocabulary underlying this language model. Should be a set of words.\n", "        order: the n-gram order of the model; histories have length order - 1.\n", "    \"\"\"\n", "    def __init__(self, vocab, order):\n", "        self.vocab = vocab\n", "        self.order = order\n", "\n", "    @abc.abstractmethod\n", "    def probability(self, word,*history):\n", "        \"\"\"\n", "        Args:\n", "            word: the word we need the probability of\n", "            history: words to condition on.\n", "        Returns:\n", "            the probability p(w|history)\n", "        \"\"\"\n", "        pass" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The most important method we have to provide is `probability(word,history)` which returns the probability of a word given a history. Let us implement a uniform LM using this class."
] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2016-10-13T11:56:44.975482", "start_time": "2016-10-13T11:56:44.956021" }, "run_control": { "frozen": false, "read_only": false } }, "outputs": [ { "data": { "text/plain": [ "0.0002913752913752914" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "class UniformLM(LanguageModel):\n", " \"\"\"\n", " A uniform language model that assigns the same probability to each word in the vocabulary. \n", " \"\"\"\n", " def __init__(self, vocab):\n", " super().__init__(vocab, 1)\n", " def probability(self, word,*history):\n", " return 1.0 / len(self.vocab) if word in self.vocab else 0.0\n", " \n", "vocab = set(train)\n", "baseline = UniformLM(vocab)\n", "baseline.probability(\"call\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sampling\n", "It is instructive and easy to sample language from a language model. In many, but not all, cases the more natural the generated language of an LM looks, the better this LM is.\n", "\n", "To sample from an LM one simply needs to iteratively sample from the LM conditional probability over words, and add newly sampled words to the next history. The only challenge in implementing this is to sample from a categorical distribution over words. Here we provide this functionality via `np.random.choice` from [numpy](http://www.numpy.org/). " ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2016-10-13T11:56:45.040982", "start_time": "2016-10-13T11:56:44.977077" }, "run_control": { "frozen": false, "read_only": false } }, "outputs": [ { "data": { "text/plain": [ "['ain',\n", " 'mayday',\n", " 'meaning',\n", " 'Every',\n", " 'truant',\n", " 'Or',\n", " 'veins',\n", " 'decisions',\n", " 'Africa',\n", " 'conscious']" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "\n", "def sample(lm, init, amount):\n", " \"\"\"\n", " Sample from a language model.\n", " Args:\n", " lm: the language model\n", " init: the initial sequence of words to condition on\n", " amount: how long should the sampled sequence be\n", " \"\"\"\n", " words = list(lm.vocab)\n", " result = []\n", " result += init\n", " for _ in range(0, amount):\n", " history = result[-(lm.order-1):]\n", " probs = [lm.probability(word, *history) for word in words]\n", " sampled = np.random.choice(words,p=probs)\n", " result.append(sampled)\n", " return result\n", "\n", "sample(baseline, [], 10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluation\n", "How do we determine the quality of an (n-gram) LM? One way is through *extrinsic* evaluation: assess how much the LM improves performance on *downstream tasks* such as machine translation or speech recognition. Arguably this is the most important measure of LM quality, but it can be costly as re-training such systems may take days, and when we seek to develop general-purpose LMs we may have to evaluate performance on several tasks. This is problematic when one wants to iteratively improve LMs and test new models and parameters. It is hence useful to find *intrinsic* means of evaluation that assess the stand-alone quality of LMs with minimal overhead.\n", "\n", "One intrinsic way is to measure how well the LM plays the \"Shannon Game\": Predict what the next word in actual context should be, and win if your predictions match the words in an actual corpus. 
This can be formalized using the notion of *perplexity* of the LM on a given dataset. Given a test sequence \\(w_1,\ldots,w_T\\) of \\(T\\) words, we calculate the perplexity \\(\perplexity\\) as follows:\n", "\n", "$$\n", "\\perplexity(w_1,\\ldots,w_T) = \\prob(w_1,\\ldots,w_T)^{-\\frac{1}{T}} = \\sqrt[T]{\\prod_{i=1}^T \\frac{1}{\\prob(w_i|w_{i-n+1},\\ldots,w_{i-1})}}\n", "$$\n", "\n", "We can implement a perplexity function based on the `LanguageModel` interface. " ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2016-10-13T11:56:45.059308", "start_time": "2016-10-13T11:56:45.042382" }, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "import math\n", "def perplexity(lm, data):\n", "    \"\"\"\n", "    Calculate the perplexity of the language model given the provided data.\n", "    Args:\n", "        lm: a language model.\n", "        data: the data to calculate perplexity on.\n", "\n", "    Returns:\n", "        the perplexity of `lm` on `data`.\n", "\n", "    \"\"\"\n", "    log_prob = 0.0\n", "    history_order = lm.order - 1\n", "    for i in range(history_order, len(data)):\n", "        history = data[i - history_order : i]\n", "        word = data[i]\n", "        p = lm.probability(word, *history)\n", "        log_prob += math.log(p) if p > 0.0 else float(\"-inf\")\n", "    return math.exp(-log_prob / (len(data) - history_order))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see how the uniform model does on our test set. " ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2016-10-13T11:56:45.098765", "start_time": "2016-10-13T11:56:45.060841" }, "run_control": { "frozen": false, "read_only": false } }, "outputs": [ { "data": { "text/plain": [ "inf" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "perplexity(baseline, test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Out-of-Vocabulary Words\n", "The problem in the above example is that the baseline model assigns zero probability to words that are not in the vocabulary. Test sets will usually contain such words, and this leads to the above result of infinite perplexity. For example, the following three words do not appear in the training set vocabulary `vocab` and hence receive 0 probability." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2016-10-13T11:56:45.128569", "start_time": "2016-10-13T11:56:45.100780" }, "run_control": { "frozen": false, "read_only": false } }, "outputs": [ { "data": { "text/plain": [ "[('does', 0.0), ('Ceremonies', 0.0), ('Masquerading', 0.0)]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[(w,baseline.probability(w)) for w in test if w not in vocab][:3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The Long Tail\n", "Regularly encountering new words is a common phenomenon not specific to our corpus. Generally we will see a few words that appear repeatedly, and a long tail of words that appear only a few times. While each individual long-tail word is rare, the probability of seeing any long-tail word is quite high (the long tail covers a lot of the frequency mass).\n", "\n", "Let us observe this phenomenon for our data: we will rank the words according to their frequency, and plot this frequency against the rank. Let us first extract the sorted counts and their ranks."
] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2016-10-13T11:56:51.191052", "start_time": "2016-10-13T11:56:51.173462" }, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "import collections\n", "counts = collections.defaultdict(int)\n", "for word in train:\n", " counts[word] += 1\n", "sorted_counts = sorted(counts.values(),reverse=True)\n", "ranks = range(1,len(sorted_counts)+1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now plot the counts against their rank. Play around with the x and y scale and change them to `'log'`." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2016-10-13T11:56:54.006646", "start_time": "2016-10-13T11:56:53.871654" }, "run_control": { "frozen": false, "read_only": false } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "
\n", "" ], "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%matplotlib inline\n", "import mpld3\n", "\n", "fig = plt.figure()\n", "plt.xscale('linear')\n", "plt.yscale('linear')\n", "plt.plot(ranks, sorted_counts)\n", "mpld3.display(fig)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In log-space such rank vs frequency graphs resemble linear functions. This observation is known as *Zipf's Law*, and can be formalized as follows. Let \\\\(r\\_w\\\\) be the rank of a word \\\\(w\\\\), and \\\\(f\\_w\\\\) its frequency, then we have:\n", "\n", "$$\n", " f_w \\propto \\frac{1}{r_w}.\n", "$$\n", "\n", "## Inserting Out-of-Vocabularly Tokens\n", "The long tail of infrequent words is a problem for LMs because it means there will always be words with zero counts in your training set. There are various solutions to this problem. For example, when it comes to calculating the LM perplexity we could remove words that do not appear in the training set. This overcomes the problem of infinite perplexity but doesn't solve the actual issue: the LM assigns too low probability to unseen words. Moreover, the problem only gets worse when one considers n-gram models with larger \\\\(n\\\\), because these will encounter many unseen n-grams, which, when removed, will only leave small fractions of the original sentences.\n", "\n", "The principled solution to this problem is smoothing, and we will discuss it in more detail later. Before we get there we present a simple preprocessing step that generally simplifies the handling of unseen words, and gives rise to a simple smoothing heuristic. Namely, we replace unseen words in the test corpus with an out-of-vocabularly token, say `OOV`. This means that LMs can still work with a fixed vocabularly that consists of all training words, and the `OOV` token. Now we just need a way to estimate the probability of the `OOV` token to avoid the infinite perplexity problem." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "run_control": { "frozen": false, "read_only": false } }, "outputs": [ { "data": { "text/plain": [ "['[BAR]',\n", " 'scratched',\n", " '[/BAR]',\n", " '[BAR]',\n", " 'What',\n", " '[OOV]',\n", " 'it',\n", " 'take',\n", " '[/BAR]',\n", " '[BAR]']" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "OOV = '[OOV]'\n", "def replace_OOVs(vocab,data):\n", " \"\"\"\n", " Replace every word not within the vocabulary with the `OOV` symbol.\n", " Args:\n", " vocab: the reference vocabulary.\n", " data: the sequence of tokens to replace words within\n", "\n", " Returns:\n", " a version of `data` where each word not in `vocab` is replaced by the `OOV` symbol.\n", " \"\"\"\n", "\n", " return [word if word in vocab else OOV for word in data]\n", "\n", "replace_OOVs(baseline.vocab, test[:10])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that in practice we can enable language models to operate on any test set vocabulary if we decorate the model with following wrapper. 
" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "class OOVAwareLM(LanguageModel):\n", " \"\"\"\n", " This LM converts out of vocabulary tokens to a special OOV token before their probability is calculated.\n", " \"\"\"\n", " def __init__(self, base_lm, missing_words, oov=OOV):\n", " super().__init__(base_lm.vocab | missing_words, base_lm.order)\n", " self.base_lm = base_lm\n", " self.oov = oov\n", " self.missing_words = missing_words\n", "\n", " def probability(self, word, *history):\n", " if word in self.base_lm.vocab:\n", " return self.base_lm.probability(word, *history)\n", " elif word in self.missing_words:\n", " return self.base_lm.probability(self.oov, *history) / len(self.missing_words)\n", " else:\n", " return 0.0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This wrapper takes unseen words (outside of the vocabulary) and maps them to the `OOV` token. If the underlying base model has a probability for this token the `OOVAwareLM` returns that probability divided by the number of out-of-vocabulary words we expect. This number can be set easily when a fixed test corpus is available: it amounts to counting how many words in the test corpus do not appear in the base LM vocabulary.\n", "\n", "A simple way to (heuristically) estimate the `OOV` probability is to replace the first encounter of each word in the training set with the `OOV` token. Now we can estimate LMs as before, and will automatically get some estimate of the `OOV` probability. The underlying assumption of this heuristic is that the probability of unseen words is identical to the probability of encountering a new word. We illustrate the two operations of this method in the code below." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "run_control": { "frozen": false, "read_only": false } }, "outputs": [ { "data": { "text/plain": [ "['[OOV]', 'AA', '[OOV]', 'BB', 'AA']" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def inject_OOVs(data):\n", " \"\"\"\n", " Uses a heuristic to inject OOV symbols into a dataset.\n", " Args:\n", " data: the sequence of words to inject OOVs into.\n", "\n", " Returns: the new sequence with OOV symbols injected.\n", " \"\"\"\n", "\n", " seen = set()\n", " result = []\n", " for word in data:\n", " if word in seen:\n", " result.append(word)\n", " else:\n", " result.append(OOV)\n", " seen.add(word)\n", " return result\n", "\n", "inject_OOVs([\"AA\",\"AA\",\"BB\",\"BB\",\"AA\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can apply this to our training and test set, and create a new uniform model." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "run_control": { "frozen": false, "read_only": false } }, "outputs": [ { "data": { "text/plain": [ "1290.0000000049852" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "oov_train = inject_OOVs(train)\n", "oov_vocab = set(oov_train)\n", "oov_test = replace_OOVs(oov_vocab, test)\n", "oov_baseline = UniformLM(oov_vocab)\n", "perplexity(oov_baseline,oov_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Training Language Models\n", "The uniform LM is obviously not good at modelling actual language. To improve upon this baseline, we can estimate the conditional n-gram distributions from the training data. 
To this end let us first introduce one parameter $\param_{w,h}$ for each word $w$ and history $h$ of length $n - 1$, and define a parametrized language model $p_\params$: \n", "\n", "$$\n", "\\prob_\\params(w|h) = \\param_{w,h}\n", "$$\n", "\n", "Training an n-gram LM amounts to estimating \\\\(\\params\\\\) from some training set \\\\(\\train=(w_1,\\ldots,w_d)\\\\).\n", "One way to do this is to choose the \\\\(\\params\\\\) that maximizes the log-likelihood of \\\\(\\train\\\\):\n", "$$\n", "\\params^* = \\argmax_\\params \\log p_\\params(\\train)\n", "$$\n", "\n", "As it turns out, this maximum-likelihood estimate (MLE) can be calculated in closed form, simply by counting:\n", "$$\n", "\\param^*_{w,h} = \\frac{\\counts{\\train}{w,h}}{\\counts{\\train}{h}} \n", "$$\n", "\n", "where \n", "\n", "$$\n", "\\counts{D}{e} = \\text{Count of event } e \\text{ in } D \n", "$$\n", "\n", "Here the event $h$ means seeing the history $h$, and $w,h$ seeing the history $h$ followed by the word $w$. \n", "\n", "Many LM variants can be implemented simply by estimating the counts in the numerator and denominator differently. We therefore introduce an interface for such count-based LMs. This will help us later to implement LM variants by modifying the counts of a base-LM. " ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "class CountLM(LanguageModel):\n", "    \"\"\"\n", "    A Language Model that uses counts of events and histories to calculate probabilities of words in context.\n", "    \"\"\"\n", "    @abc.abstractmethod\n", "    def counts(self, word_and_history):\n", "        pass\n", "    @abc.abstractmethod\n", "    def norm(self, history):\n", "        pass\n", "\n", "    def probability(self, word, *history):\n", "        sub_history = tuple(history[-(self.order-1):]) if self.order > 1 else () \n", "        return self.counts((word,) + sub_history) / self.norm(sub_history)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us use this to code up a generic NGram model." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "class NGramLM(CountLM):\n", "    def __init__(self, train, order):\n", "        \"\"\"\n", "        Create an NGram language model.\n", "        Args:\n", "            train: list of training tokens.\n", "            order: order of the LM.\n", "        \"\"\"\n", "        super().__init__(set(train), order)\n", "        self._counts = collections.defaultdict(float)\n", "        self._norm = collections.defaultdict(float)\n", "        for i in range(self.order, len(train)):\n", "            history = tuple(train[i - self.order + 1 : i])\n", "            word = train[i]\n", "            self._counts[(word,) + history] += 1.0\n", "            self._norm[history] += 1.0\n", "    def counts(self, word_and_history):\n", "        return self._counts[word_and_history]\n", "    def norm(self, history):\n", "        return self._norm[history]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us train a unigram model.
" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "run_control": { "frozen": false, "read_only": false } }, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYQAAAD8CAYAAAB3u9PLAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAF8RJREFUeJzt3X+UX3V95/Hny0SQHhdQMt1iEg0uqRqlRgn4m0VdNawt\n4ZwFSeoKWE6z1rJdtVrj2Ypuqqew7C6n3aWtqIhaNXLwsGY1Nusuop5WMANGIEDqEFKT+CsC4i9+\nGHjvH/cz7DdfZpjvJDOTX8/HOffMvZ/7+dzP535n5vv63h9zJ1WFJElP2NcDkCTtHwwESRJgIEiS\nGgNBkgQYCJKkxkCQJAEGgiSpMRAkSYCBIElqZu/rAUzGnDlzasGCBft6GJJ0QLnxxht/XFVDE9U7\noAJhwYIFDA8P7+thSNIBJck/DVLPU0aSJMBAkCQ1BoIkCTAQJEnNQIGQZGmSzUlGkqwaY/0pSW5K\nsivJmT3lr0yysWd6IMkZbd2VSe7qWbd46nZLkjRZE95llGQWcBnwGmA7sCHJ2qq6rafad4HzgHf2\ntq2qrwCL23aeCowA/7unyruq6uq92QFJ0tQY5LbTk4GRqtoCkGQNsAx4NBCqamtb98jjbOdM4EtV\n9cs9Hq0kadoMcspoLrCtZ3l7K5us5cBn+so+mOTmJJcmOXwPtilJmiIzclE5ybHACcD6nuL3AM8G\nTgKeCrx7nLYrkwwnGd65c+e0j1WSDlWDnDLaAczvWZ7XyibjDcA1VfWr0YKq+n6bfTDJx+i7/tBT\n73LgcoAlS5bUJPt91IJVX9zTpgPbetHrp70PSZougxwhbAAWJjkuyWF0p37WTrKfFfSdLmpHDSQJ\ncAZw6yS3KUmaQhMGQlXtAi6gO91zO3BVVW1KsjrJ6QBJTkqyHTgL+FCSTaPtkyygO8L4at+mP5Xk\nFuAWYA7wgb3fHUnSnhro4XZVtQ5Y11d2Yc/8BrpTSWO13coYF6Gr6lWTGagkaXr5l8qSJMBAkCQ1\nBoIkCTAQJEmNgSBJAgwESVJjIEiSAANBktQYCJIkwECQJDUGgiQJMBAkSY2BIEkCDARJUmMgSJIA\nA0GS1BgIkiTAQJAkNQaCJAkwECRJjYEgSQIGDIQkS5NsTjKSZNUY609JclOSXUnO7Fv3cJKNbVrb\nU35ckhvaNj+b5LC93x1J0p6aMBCSzAIuA04DFgErkizqq/Zd4Dzg02Ns4v6qWtym03vKLwYurarj\ngXuB8/dg/JKkKTLIEcLJwEhVbamqh4A1wLLeClW1tapuBh4ZpNMkAV4FXN2KPg6cMfCoJUlTbpBA\nmAts61ne3soG9aQkw0muTzL6pn8M8JOq2rWH25QkTbHZM9DHM6pqR5JnAtcmuQW4b9DGSVYCKwGe\n/vSnT9MQJUmDHCHsAOb3LM9rZQOpqh3t6xbgOuAFwN3A0UlGA2ncbVbV5VW1pKqWDA0NDdqtJGmS\nBgmEDcDCdlfQYcByYO0EbQBI8pQkh7f5OcDLgNuqqoCvAKN3JJ0LfH6yg5ckTZ0JA6Gd578AWA/c\nDlxVVZuSrE5yOkCSk5JsB84CPpRkU2v+HGA4ybfpAuCiqrqtrXs38I4kI3TXFD46lTsmSZqcga4h\nVNU6YF1f2YU98xvoTvv0t/sH4IRxtrmF7g4mSdJ+wL9UliQBBoIkqTEQJEmAgSBJagwESRJgIEiS\nGgNBkgQYCJKkxkCQJAEGgiSpMRAkSYCBIElqDARJEmAgSJIaA0GSBBgIkqTGQJAkAQaCJKkxECRJ\ngIEgSWoMBEkSMGAgJFmaZHOSkSSrxlh/SpKbkuxKcmZP+eIk30iyKcnNSc7uWXdlkruSbGzT4qnZ\nJUnSnpg9UYUks4DLgNcA24ENSdZW1W091b4LnAe8s6/5L4Fzquo7SZ4G3JhkfVX9pK1/V1Vdvbc7\nIUnaexMGAnAyMFJVWwCSrAGWAY8GQlVtbese6W1YVf/YM/+9JD8ChoCfIEnarwxyymgusK1neXsr\nm5QkJwOHAXf2FH+wnUq6NMnh47RbmWQ4yfDOnTsn260kaUAzclE5ybHAJ4E3V9XoUcR7gGcDJwFP\nBd49VtuquryqllTVkqGhoZkYriQdkgYJhB3A/J7lea1sIEmOBL4I/Mequn60vKq+X50HgY/RnZqS\nJO0jgwTCBmBhkuOSHAYsB9YOsvFW/xrgE/0Xj9tRA0kCnAHcOpmBS5Km1oSBUFW7gAuA9cDtwFVV\ntSnJ6iSnAyQ5Kcl24CzgQ0k2teZvAE4Bzhvj9tJPJbkFuAWYA3xgSvdMkjQpg9xlRFWtA9b1lV3Y\nM7+B7lRSf7u/Bf52nG2+alIjlSRNK/9SWZIEGAiSpMZAkCQBBoIkqTEQJEmAgSBJagwESRJgIEiS\nGgNBkgQYCJKkxkCQJAEGgiSpMRAkSYCBIElqDARJEmAgSJIaA0GSBBgIkqTGQJAkAQaCJKkZKBCS\nLE2yOclIklVjrD8lyU1JdiU5s2/duUm+06Zze8pPTHJL2+ZfJsne744kaU9NGAhJZgGXAacBi4AV\nSRb1VfsucB7w6b62TwXeB7wIOBl4X5KntNV/Dfw+sLBNS/d4LyRJe232AHVOBkaqagtAkjXAMuC2\n0QpVtbWte6Sv7euAL1fVPW39l4GlSa4Djqyq61v5J4AzgC/tzc7srxas+uK097H1otfvd31LOrAM\ncspoLrCtZ3l7KxvEeG3ntvk92aYkaRrs9xeVk6xMMpxkeOfOnft6OJJ00BokEHYA83uW57WyQYzX\ndkebn3CbVXV5VS2pqiVDQ0MDditJmqxBAmEDsDDJcUkOA5YDawfc/nrgtUme0i4mvxZYX1XfB36a\n5MXt7qJzgM/vwfglSVNkwkCoql3ABXRv7rcDV1XVpiSrk5wOkOSkJNuBs4APJdnU2t4D/BldqGwA\nVo9eYAbeCnwEGAHu5CC9oCxJB4pB7jKiqtYB6/rKLuyZ38Dup4B6610BXDFG+TDwvMkMVpI0ffb7\ni8qSpJlhIEiSAANBktQYCJIkwECQJDUGgiQJMBAkSY2BIEkCDARJUmMgSJIAA0GS1BgIkiTAQJAk\nNQaCJAkwECRJjYEgSQIMBElSYyBIkgADQZLUGAiSJMBAkCQ1AwVCkqVJNicZSbJqjPWHJ/lsW39D\nkgWt/I1JNvZMjyRZ3NZd17Y5uu7Xp3LHJEmTM2EgJJkFXAacBiwCViRZ1FftfODeqjoeuBS4GKCq\nPlVVi6tqMfAm4K6q2tjT7o2j66vqR1OwP5KkPTTIEcLJwEhVbamqh4A1wLK+OsuAj7f5q4FXJ0lf\nnRWtrSRpPzRIIMwFtvUsb29lY9apql3AfcAxf
XXOBj7TV/axdrrovWMEiCRpBs3IReUkLwJ+WVW3\n9hS/sapOAF7RpjeN03ZlkuEkwzt37pyB0UrSoWmQQNgBzO9ZntfKxqyTZDZwFHB3z/rl9B0dVNWO\n9vVnwKfpTk09RlVdXlVLqmrJ0NDQAMOVJO2JQQJhA7AwyXFJDqN7c1/bV2ctcG6bPxO4tqoKIMkT\ngDfQc/0gyewkc9r8E4HfBm5FkrTPzJ6oQlXtSnIBsB6YBVxRVZuSrAaGq2ot8FHgk0lGgHvoQmPU\nKcC2qtrSU3Y4sL6FwSzg/wAfnpI9kiTtkQkDAaCq1gHr+sou7Jl/ADhrnLbXAS/uK/sFcOIkxypJ\nmkb+pbIkCTAQJEmNgSBJAgwESVJjIEiSAANBktQYCJIkwECQJDUGgiQJMBAkSY2BIEkCDARJUmMg\nSJIAA0GS1BgIkiTAQJAkNQaCJAkwECRJjYEgSQIMBElSYyBIkoABAyHJ0iSbk4wkWTXG+sOTfLat\nvyHJgla+IMn9STa26W962pyY5JbW5i+TZKp2SpI0eRMGQpJZwGXAacAiYEWSRX3VzgfurarjgUuB\ni3vW3VlVi9v0lp7yvwZ+H1jYpqV7vhuSpL01yBHCycBIVW2pqoeANcCyvjrLgI+3+auBVz/eJ/4k\nxwJHVtX1VVXAJ4AzJj16SdKUGSQQ5gLbepa3t7Ix61TVLuA+4Ji27rgk30ry1SSv6Km/fYJtApBk\nZZLhJMM7d+4cYLiSpD0x3ReVvw88vapeALwD+HSSIyezgaq6vKqWVNWSoaGhaRmkJGmwQNgBzO9Z\nntfKxqyTZDZwFHB3VT1YVXcDVNWNwJ3Ab7b68ybYpiRpBg0SCBuAhUmOS3IYsBxY21dnLXBumz8T\nuLaqKslQuyhNkmfSXTzeUlXfB36a5MXtWsM5wOenYH8kSXto9kQVqmpXkguA9cAs4Iqq2pRkNTBc\nVWuBjwKfTDIC3EMXGgCnAKuT/Ap4BHhLVd3T1r0VuBI4AvhSmyRJ+8iEgQBQVeuAdX1lF/bMPwCc\nNUa7zwGfG2ebw8DzJjNYSdL08S+VJUmAgSBJagwESRJgIEiSGgNBkgQYCJKkxkCQJAEGgiSpMRAk\nSYCBIElqDARJEmAgSJIaA0GSBBgIkqRmoMdfS3tiwaovTnsfWy96/bT3IR0qPEKQJAEGgiSpMRAk\nSYCBIElqDARJEjBgICRZmmRzkpEkq8ZYf3iSz7b1NyRZ0Mpfk+TGJLe0r6/qaXNd2+bGNv36VO2U\nJGnyJrztNMks4DLgNcB2YEOStVV1W0+184F7q+r4JMuBi4GzgR8Dv1NV30vyPGA9MLen3RuraniK\n9kWStBcGOUI4GRipqi1V9RCwBljWV2cZ8PE2fzXw6iSpqm9V1fda+SbgiCSHT8XAJUlTa5BAmAts\n61nezu6f8nerU1W7gPuAY/rq/Bvgpqp6sKfsY+100XuTZFIjlyRNqRm5qJzkuXSnkf5dT/Ebq+oE\n4BVtetM4bVcmGU4yvHPnzukfrCQdogYJhB3A/J7lea1szDpJZgNHAXe35XnANcA5VXXnaIOq2tG+\n/gz4NN2pqceoqsuraklVLRkaGhpknyRJe2CQQNgALExyXJLDgOXA2r46a4Fz2/yZwLVVVUmOBr4I\nrKqqvx+tnGR2kjlt/onAbwO37t2uSJL2xoSB0K4JXEB3h9DtwFVVtSnJ6iSnt2ofBY5JMgK8Axi9\nNfUC4Hjgwr7bSw8H1ie5GdhId4Tx4ancMUnS5Az0tNOqWges6yu7sGf+AeCsMdp9APjAOJs9cfBh\nSpKmm4+/1kHJR29Lk2cgSFPMMNKBymcZSZIAA0GS1HjKSDqIeLpKe8MjBEkSYCBIkhpPGUmaEp6u\nOvB5hCBJAgwESVLjKSNJB4XpPmV1KJyuMhAkaS8dLGHkKSNJEmAgSJIaA0GSBBgIkqTGQJAkAQaC\nJKkxECRJgIEgSWoMBEkSMGAgJFmaZHOSkSSrxlh/eJLPtvU3JFnQs+49rXxzktcNuk1J0syaMBCS\nzAIuA04DFgErkizqq3Y+cG9VHQ9cClzc2i4ClgPPBZYCf5Vk1oDblCTNoEGOEE4GRqpqS1U9BKwB\nlvXVWQZ8vM1fDbw6SVr5mqp6sKruAkba9gbZpiRpBg0SCHOBbT3L21vZmHWqahdwH3DM47QdZJuS\npBm03z/tNMlKYGVb/HmSzTPU9Rzgx5NpkIv3Xf/2bd/2bd+P4xmDVBokEHYA83uW57WysepsTzIb\nOAq4e4K2E20TgKq6HLh8gHFOqSTDVbVkpvvdH/q3b/u274O378czyCmjDcDCJMclOYzuIvHavjpr\ngXPb/JnAtVVVrXx5uwvpOGAh8M0BtylJmkETHiFU1a4kFwDrgVnAFVW1KclqYLiq1gIfBT6ZZAS4\nh+4NnlbvKuA2YBfwh1X1MMBY25z63ZMkDWqgawhVtQ5Y11d2Yc/8A8BZ47T9IPDBQba5n5nx01T7\nUf/2bd/2ffD2Pa50Z3YkSYc6H10hSQIMBM2wJEcneWubPzXJF/b1mACS/Hxfj2Em9L7+h6ok/9C+\nLkjyuzPY79aZ6mtPHfSB0L7p9yfZ2JbnJfl8ku8kuTPJX7Q7nUbrvzzJN5Pc0aaVrfxfJvlG37Zn\nJ/lhkqcluSTJD5K8c4L+H06yMcm3k9yU5KV99d+W5IEkR/WUnZrkvtbujiT/pWfd2e15UF/o285u\n/bayv0nysiRXJrmrZ3vv62s7J8mvkrylr3xrkluS3Jzkq0me0cqPaNt6KMmcCb4lRwOH9BvSPnbI\nv/5VNfo7twCYsUA4IFTVQT3RfdNvbfOhu+31zW15Ft0dUpe05d8Avgu8sC3PAW4EXk8XntuAZ/Rs\neyndLbajy+8H3jle/2355z3zrwO+2lf/BuDro2NsZacCX2jzRwB3AC8ba/14/bayjW2frwTObGVP\nArYAx/XU+4M2hv6xbQXmtPn/BHx4vPWP8/1YA9zfxrIBuI7ucSd3AJ/i/1/XOhH4anv91wPHTvPP\nyc+ncdurgbf1LH8Q+A/AJcCtwC3A2WN9L4H/AZw3hWPpff0vGWsMMzEB/7N9bzcBK2eq397vNXA9\n3VMVNgJvn4F+N7SvxwJfa/3eCrxiJvf/8aaD/gihz6uAB6rqYwDV3QL7duD3kvwa8IfAlVV1U1v/\nY+BPgFVV9QhwFe2W2mY58Jm9GM+RwL2jC0n+BfBk4E+BFWM1qKrRX+ZJPeojyXOAf2z73OtJ7esv\nespWAH8MzE0yb5xNfmOyY2hWAXdW1WLgXcALgLfRPeTwmcDLkjwR+O90oXUicAVj3Kl2ALkCOAcg\nyRPofm62A4uB5wP/CrgkybEzMJbe1//6fTQGgN9r39slwB8lOWaG+u21Cvh6VS2uqkunu7OqOqnN\n/i6wvn0Pnk/3+7xfONQC4bl0n0oeVVU/pTsqOH6s9cBwK4fuzX85dI/8Bv418LlJjmH09ModwEeA\nP+tZ
t5zuE9zXgWcl+ef9jZM8he4P/L42yX5PA/6uZ/mSdjppO90DCH/Utj+f7tP4N+kC8OxxtreU\n7lPe3vpmVW1vgbuR7sjmWcDzgC+3Mf4p3V+zH5Cqaitwd5IXAK8FvgW8HPhMVT1cVT+kOxo6afyt\nTIt9OYY/SvJtulCaT/czfajYALw5yfuBE6rqZ/t4PI861AJhr1TVMPDkJM+ie4O9oarumeRm7m+f\nSJ5N96b6ifZkWOg+ma9pb46fY/e/7XhF+wXaQffp4geT7Pd17B4I72qfUH6D7um0o+dVz6YLAujC\nqf9I5StJdtDt/94cHY16sGf+Ybq/jQmwqb1Oi6vqhKp67RT0tS99BDgPeDPdEcN4drH77+WTxqt4\noEpyKt0RyUuq6vl0AXnQ7ed4quprwCl0v8tXJjlnHw/pUYdaINxGd276UUmOBJ5O92jux6xvy71/\nRT16lLC3p4uoqm/QXacYSnIC3aekL7e7EZaz+5vx19svz3OB85MsHrSfdjrs6Kr63hhj+DndefyX\nt6IVwHltDGuB30rS++ntlXQPytpIdx1hsn4G/LMJ6myme01e0sb/xCTPnaDN/u4aug8AJ9FdE/k6\ncHa6/w8yRPcG8U3gn4BF7XEvRwOvnuJx9L7+441huh1F9/9Tfpnk2cCLZ6DPsQzyszjl2s0YP6yq\nD9N9UHjhTI9hPIdaIPxf4NdGEzndP+r5r3TXDX5J9097zht9s23nNS8G/nPPNj4D/Fu66xGf35vB\ntF+GWXQPAlwBvL+qFrTpacDTRu/kGVXd/5W4CHj3JLp6JfCVccYwG3gRcGeS3wSeXFVzR8cB/Dl9\nRwnVPeL8bcA5SZ46iXFQVXcDf5/kVroLmmPVeYjumVgXt6OijcBLx6p7oGj79BXgqnYd5xrgZuDb\nwLXAn1TVD6pqG90R2q3t67emeBy9r/9LxhrDVPY3jr8DZie5ne5n+foZ6HMsNwMPtzv+3j6D/Z4K\nfDvJt+iOyP9iBvt+fPv6qvZ0Tzz2Lp/5wP8CvgPcSXfx8vCe9afQneO7g+6T6h+Msc2NdKd2+svf\nz8R3GT3c2m+k+0V8fSvfAjy7r+1/o3vjP5Xd7zw5gu5wc0Fb3m19f790d6qc2rPuSuCuNobb2msQ\n4H3ARX3b+S3g9ja/lZ67iFq79/Ys77beabfX8Qnt9V64r8fi5DTetN//P4SpVt0nsN95nPVfY4IL\na9Wde9/T/meNU/7MMcre0bN4XU/5/UzuDp+X0t1NNdr+vHHqPeYUUFXdDDynzS/oW/fvJzGGQ1a6\nfw/7BeCaqvrOvh6PNJ5D4ZTRw8BRvX+gNR2SXEJ3KukXfaumtf8kZwN/Rc/tq/39VtULq+pX09F/\nG8MRbf+eCDwyXf0cqKrqtqp6ZlX98b4ei/R4fLidJAk4NI4QJEkDMBAkSYCBIElqDARJEmAgSJKa\n/weFWYQ4j1fJOgAAAABJRU5ErkJggg==\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "unigram = NGramLM(oov_train,1)\n", "def plot_probabilities(lm, context = (), how_many = 10): \n", " probs = sorted([(word,lm.probability(word,*context)) for word in lm.vocab], key=lambda x:x[1], reverse=True)[:how_many]\n", " util.plot_bar_graph([prob for _,prob in probs], [word for word, _ in probs])\n", "plot_probabilities(unigram)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The unigram LM has substantially reduced (and hence better) perplexity:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "run_control": { "frozen": false, "read_only": false } }, "outputs": [ { "data": { "text/plain": [ "91.4414922652717" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "perplexity(unigram,oov_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us also look at the language the unigram LM generates." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "run_control": { "frozen": false, "read_only": false } }, "outputs": [ { "data": { "text/plain": [ "['[OOV]', 'are', 'a', '[BAR]', '[OOV]', 'if', '[OOV]', '[OOV]', '[OOV]', 'me']" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample(unigram, [], 10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bigram LM\n", "\n", "The unigram model ignores any correlation between consecutive words in a sentence. The next best model to overcome this shortcoming is a bigram model. This model conditions the probability of the current word on the previous word. Let us construct such model from the training data. 
\n", "\n" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "run_control": { "frozen": false, "read_only": false } }, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAX0AAAD8CAYAAACb4nSYAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAEcFJREFUeJzt3XuwXWV9xvHv0yBEy0WQjK1ATNR4CVVxjLFVwQuKoVag\nHWjj2Aod2+hMmWodnNKxBYzjDNQ62qmI0DEjdVS8ayqpFOWqKCbcDYqEiBC0CgRR5Gbg1z/Wim4O\nB84+5+yTk/B+PzN7zrq8a73vfrP3s9+91l4rqSokSW34ndlugCRp2zH0Jakhhr4kNcTQl6SGGPqS\n1BBDX5IaYuhLUkMMfUlqiKEvSQ3ZabYbMNbee+9dCxYsmO1mSNIO5bLLLrutquZNVG67C/0FCxaw\nbt262W6GJO1QkvxomHIe3pGkhhj6ktQQQ1+SGmLoS1JDDH1JaoihL0kNMfQlqSGGviQ1xNCXpIZs\nd1fkTteC48+e8TpuPPl1M16HJM0ER/qS1BBDX5IaYuhLUkMMfUlqiKEvSQ0x9CWpIYa+JDXE0Jek\nhhj6ktQQQ1+SGvKYuw3DbPIWEJK2d470Jakhhr4kNcTQl6SGGPqS1JChQj/JsiTXJdmQ5Phx1r8j\nybVJrk7y9SRPHVh3dJLr+8fRo2y8JGlyJgz9JHOAU4FDgcXAG5IsHlPsCmBJVT0P+Bzwr/22ewEn\nAi8GlgInJtlzdM2XJE3GMCP9pcCGqtpYVfcDZwGHDxaoqvOr6u5+9tvAvv30a4Fzq2pzVd0BnAss\nG03TJUmTNUzo7wPcPDC/qV/2SN4M/M8Ut5UkzaCRXpyV5C+BJcDLJ7ndCmAFwPz580fZJEnSgGFG\n+rcA+w3M79sve4gkrwbeBRxWVfdNZtuqOqOqllTVknnz5g3bdknSJA0T+muBRUkWJtkZWA6sHiyQ\n5AXA6XSB/7OBVecAhyTZsz+Be0i/TJI0CyY8vFNVW5IcSxfWc4BVVbU+yUpgXVWtBt4H7Ap8NgnA\nTVV1WFVtTvIeug8OgJVVtXlGnokkaUJDHdOvqjXAmjHLThiYfvWjbLsKWDXVBkqSRscrciWpIYa+\nJDXE0Jekhhj6ktQQQ1+SGmLoS1JDDH1JaoihL0kNMfQlqSGGviQ1xNCXpIYY+pLUEENfkhpi6EtS\nQwx9SWqIoS9JDTH0Jakhhr4kNcTQl6SGGPqS1BBDX5IaYuhLUkMMfUlqiKEvSQ0x9CWpIYa+JDXE\n0Jekhhj6ktQQQ1+SGmLoS1JDDH1JaoihL0kNMfQlqSGGviQ1xNCXpIYY+pLUEENfkhpi6EtSQ4YK\n/STLklyXZEOS48dZf1CSy5NsSXLkmHUPJLmyf6weVcMlSZO300QFkswBTgVeA2wC1iZZXVXXDhS7\nCTgGOG6cXdxTVQeMoK2SpGmaMPSBpcCGqtoIkOQs4HDgN6FfVTf26x6cgTZKkkZkmMM7+wA3D8xv\n6pcNa26SdUm+neSISbVOkjRSw4z0p+upVXVLkqcB5yW5pqpuGCyQZAWwAmD+/PnboEmS1KZhRvq3\nAPsNzO/bLxtKVd3S/90IXAC8YJwyZ1TVkqpaMm/evGF3LUmapGFCfy2wKMnCJDsDy4GhfoWTZM8k\nu/TTewMvZeBcgCRp25ow9KtqC3AscA7wPeAzVbU+ycokhwEkeVGSTcBRwOlJ1vebPwdYl+Qq4Hzg\n5DG/+pEkbUNDHdOvqjXAmjHLThiYXkt32GfsdpcAz51mGyVJI+IVuZLUEENfkhpi6EtSQwx9SWqI\noS9JDTH0Jakhhr4kNcTQl6SGGPqS1BBDX5IaYuhLUkMMfUlqiKEvSQ0x9CWpIYa+JDXE0Jekhhj6\nktQQQ1+SGmLoS1JDDH1JaoihL0kNMfQlqSGGviQ1xNCXpIYY+pLUEENfkhpi6EtSQwx9SWqIoS9J\nDTH0Jakhhr4kNcTQl6SGGPqS1BBDX5IaYuhLUkMMfUlqiKEvSQ0x9CWpIUOFfpJlSa5LsiHJ8eOs\nPyjJ5Um2JDlyzLqjk1zfP44eVcMlSZM3YegnmQOcChwKLAbekGTxmGI3AccAnxyz7V7AicCLgaXA\niUn2nH6zJUlTMcxIfymwoao2VtX9wFnA4YMFqurGqroaeHDMtq8Fzq2qzVV1B3AusGwE7ZYkTcEw\nob8PcPPA/KZ+2TCG2jbJiiTrkqy79dZbh9y1JGmytosTuVV1RlUtqaol8+bNm+3mSNJj1jChfwuw\n38D8vv2yYUxnW0nSiA0T+muBRUkWJtkZWA6sHnL/5wCHJNmzP4F7SL9MkjQLJgz9qtoCHEsX1t8D\nPlNV65OsTHIYQJIXJdkEHAWcnmR9v+1m4D10HxxrgZX9MknSLNhpmEJVtQZYM2bZCQPTa+kO3Yy3\n7Spg1TTaKEkake3iRK4kadsw9CWpIYa+JDXE0Jekhhj6ktQQQ1+SGmLoS1JDDH1JaoihL0kNMfQl\nqSGGviQ1xNCXpIYY+pLUEENfkhpi6EtSQwx9SWqIoS9JDTH0Jakhhr4kNcTQl6SGGPqS1BBDX5Ia\nYuhLUkMMfUlqiKEvSQ0x9CWpIYa+JDXE0Jekhhj6ktQQQ1+SGmLoS1JDDH1JaoihL0kNMfQlqSGG\nviQ1xNCXpIYY+pLUkKFCP8myJNcl2ZDk+HHW75Lk0/36S5Ms6JcvSHJPkiv7x0dG23xJ0mTsNFGB\nJHOAU4HXAJuAtUlWV9W1A8XeDNxRVc9Ishw4BfiLft0NVXXAiNstSZqCYUb6S4ENVbWxqu4HzgIO\nH1PmcODMfvpzwMFJMrpmSpJGYZjQ3we4eWB+U79s3DJVtQW4E3hSv25hkiuSXJjkwGm2V5I0DRMe\n3pmmnwDzq+r2JC8EvpRk/6r6xWChJCuAFQDz58+f4SZJUruGGenfAuw3ML9vv2zcMkl2AvYAbq+q\n+6rqdoCqugy4AXjm2Aqq6oyqWlJVS+bNmzf5ZyFJGsowob8WWJRkYZKdgeXA6jFlVgNH99NHAudV\nVSWZ158IJsnTgEXAxtE0XZI0WRMe3qmqLUmOBc4B5gCrqmp9kpXAuqpaDXwU+HiSDcBmug8GgIOA\nlUl+DTwIvLWqNs/EE5EkTWyoY/pVtQZYM2bZCQPT9wJHjbPd54HPT7ONkqQR8YpcSWrITP96R9vI\nguPPnvE6bjz5dTNeh6SZ5UhfkhriSF/T5rcMacfhSF+SGmLoS1JDDH1JaoihL0kNMfQlqSGGviQ1\nxNCXpIYY+pLUEC/O0g7NC8OkyXGkL0kNMfQlqSGGviQ1xNCXpIZ4IleaIk8ia0fkSF+SGmLoS1JD\nDH1JaojH9KUd0GyeT/Bcxo7Nkb4kN
cTQl6SGGPqS1BBDX5Ia4olcSTsMT2BPnyN9SWqIoS9JDTH0\nJakhhr4kNcTQl6SGGPqS1BBDX5IaYuhLUkMMfUlqiKEvSQ0x9CWpIUOFfpJlSa5LsiHJ8eOs3yXJ\np/v1lyZZMLDun/rl1yV57eiaLkmarAlDP8kc4FTgUGAx8IYki8cUezNwR1U9A/gAcEq/7WJgObA/\nsAz4cL8/SdIsGGakvxTYUFUbq+p+4Czg8DFlDgfO7Kc/BxycJP3ys6rqvqr6IbCh358kaRYME/r7\nADcPzG/ql41bpqq2AHcCTxpyW0nSNrJd3E8/yQpgRT97V5LrtmH1ewO3TWaDnGLd1m3dO0jdk65/\nB677qcMUGib0bwH2G5jft182XplNSXYC9gBuH3JbquoM4IxhGjxqSdZV1RLrtm7rfuzVPdv1z/Zz\nH88wh3fWAouSLEyyM92J2dVjyqwGju6njwTOq6rqly/vf92zEFgEfGc0TZckTdaEI/2q2pLkWOAc\nYA6wqqrWJ1kJrKuq1cBHgY8n2QBspvtgoC/3GeBaYAvwd1X1wAw9F0nSBIY6pl9Va4A1Y5adMDB9\nL3DUI2z7XuC902jjTJuVw0rWbd3W3UT9s/3cHybdURhJUgu8DYMkNaSp0E9y42y3QeNLcsQ4V3pv\nq7qPSfKU2ah7lJJcMoVtZrzfx/ZvkrcnecIM1ndj/3dBku/2069I8pWZqrOv46Qkx81kHaPQVOjP\nlP7FdU+SK/v5fZN8Ocn1SW5I8u/9L5+2ln9Zku8k+X7/WNEvf3mSb43Z905JfprkKUnel+T/doQX\n1hQcQXebj9lwDLDDh35VvWQKm22Lfj+Gh/bv24EZC309utZC/1b4zaf+hX0wb0xycpI39kF8TZKn\nT2HfN1TVAf3tJ74AfKmqFgHPBHalP5md5PeATwJvrapnAy8D3pLkdcDFwL5JBi+yeDWwvqp+XFXv\nBD4ybIOSvCnJ1UmuSvLxJK/vb4h3RZKvJXlyX+6kJKuSXND3x99P4fmPV/+/9Dfa+0aSTyU5LsnT\nk3w1yWVJLk7y7CQvAQ4D3pfkyin2/zB1H5Dk232ffDHJnkmOBJYAn+jrfvwU63vn1n5L8oEk5/XT\nr0ryiSSnJVmXZH2Sdw9sd3KSa/s2/ds0n/NdY0e0ST6U5Jjx6ppOv0+jf99G9wFwfpLz+32N2zfT\ncOsI9jGUJO9K8oMk3wCe1S97WD/0yy9IckqfMz9IcuC2audDVFVzD+AVwM+B3wd2obtg7N39urcB\nH5zk/hYA3+2nDwYuGrN+d7qL1Z4AvAdYOWb9wcDF/fT7gX8cWPcx4G8H5k8CjhuiTfsDPwD27uf3\nAvbktyfv/wZ4/8A+L+n7Yu++rY+bZh+/CLgSmAvsBlwPHAd8HVjUl3kx3TUdW5/nkSP6932kuq8G\nXt6XWbn13xm4AFgyzTr/EPhsP30x3fUojwNOBN4C7NWvm9PX9zy6W5VcN/Bv8sRptuGu/rX9lYFl\nH6IbaY9b11T6fbr9C9y49XW59bU5tm9G9DpYwG/flw/plxHt/4XANXTv693p7i02UT9sfc/9MfC1\nUbZn2EdrI/1Ba6vqJ1V1H3AD8L/98mvoXixTtT9w2eCCqvoFcBPwjPHWA+v65QCfor/OIckudC+O\nz0+hHa+iC6Hb+jZsprsi+pwk1wDvHKgT4Ozqbox3G/Az4MlTqHPQS4EvV9W9VfVL4L/pQuIlwGfT\nHQo7ne6Dd9TGq/t36YLuwr7MmcBBI6zzMuCFSXYH7gO+RTfCPZDuQ+DPk1wOXEHX74vp7lF1L/DR\nJH8G3D3C9ow1yrpG3b/j9c2O4EDgi1V1d/8eX83E/fCF/u9lTC9npqzl0L9vYPrBgfkHmcV7ElXV\nOmDXJM+iu531pX1gj8J/AB+qqufSjT7nDqwb7I8HmJk++B3g51V1wMDjOTNQzzZXVb8Gfkg3qr6E\nLuhfSfdBfw/dCPDgqnoecDYwt7qbEy6luzPtnwBfHUFTtvDQ9/Xcvn0zUde0pbtS/2F9M7utmlFb\n32cz9R6bUMuhP1Oupfva9xv96G8+3de/h63v59cPzG8d7S/vp6fiPOCoJE/q27AX3T2Rtt776OhH\n2nBEvgm8PsncJLvSBc3dwA+THNW3KUme35f/Jd2hgpmq+1fAHQPHUf8K2DoaG1XdF9MF2EX99Fvp\nRq+79/Xf2Z9HORSgb9se1V38+A/A88fb6ST9CFic7tYnT6Q7dPhodU3luU+3fwfnx+2bHcRFwBFJ\nHp9kN+D1PHo/bBe2i7tsPsZ8HTg5yZuq6r/S/acx7wc+VlV3JzkVuDTJF6rqyj6UT6E79rfVp+i+\nKu5B9x/UTFp1t8B4L3BhkgfowuckukMrd9B9KCyc4nMcpv61SVbTHd/8Kd1hszuBNwKnJflnumPe\nZwFX9X//sz8ZemRV3TADdR8NfCTdzwU3An/db/Kxfvk9wB9V1T1TrPpi4F3At6rqV0nupTtXc1WS\nK4Dv091q/Jt9+d2ALyeZCwR4xxTr3aqq6uZ0tz75Lt03jysmqGvS/T7d/qW7SvWrSX5cVa98hL7Z\n7lXV5Uk+Tff6/Rndfcrgkfthu+AVuSOQ7r+H/EpV/UE/vx/wYeDZdN+m1tCdfL2vX38Q3QfBbnRv\nwA9W1Wlj9nkl8P2qWj5m+UnAXVU1rV96bAtJdq2qu/oX/0XAiqq6/LFe92zoBw+XV9VQt9cdQX1N\n9e9jiSP9GVBVN9N91Xuk9RfR/QLi0fZxwKjbNQvOSHfhz1zgzG0cCrNZ9zaV7sKnC4BtORBopn8f\naxzpj0A/sr8EuH0mwzrJ+4A/pfvZ12kTlZeksQx9SWqIv96RpIYY+pLUEENfkhpi6EtSQwx9SWrI\n/wMxkzeMk8JmQwAAAABJRU5ErkJggg==\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "bigram = NGramLM(oov_train,2)\n", "plot_probabilities(bigram, ('I',))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can see a more peaked distribution conditioned on \"I\" than in the case of the unigram model. Let us see how the bigram LM generates language." 
] }, { "cell_type": "code", "execution_count": 21, "metadata": { "run_control": { "frozen": false, "read_only": false } }, "outputs": [ { "data": { "text/plain": [ "\"[BAR] [OOV] [/BAR] [BAR] Watch my man I wanna MC ain 't [OOV] [/BAR] [BAR] stroke Through the universe [/BAR] [BAR] Like a fender bender [/BAR] [BAR] Or good times [/BAR]\"" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\" \".join(sample(bigram, ['[BAR]'], 30))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Does the bigram model improve perplexity?" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "run_control": { "frozen": false, "read_only": false } }, "outputs": [ { "data": { "text/plain": [ "inf" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "perplexity(bigram,oov_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unfortunately the bigram model has the problem we tried to avoid using the OOV preprocessing method above. The problem is that there are contexts in which the OOV word (and other words) hasn't been seen, and hence it receives 0 probability." ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "run_control": { "frozen": false, "read_only": false } }, "outputs": [ { "data": { "text/plain": [ "0.0" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bigram.probability(\"[OOV]\",\"money\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Smoothing\n", "\n", "The general problem is that maximum likelihood estimates will always underestimate the true probability of some words, and in turn overestimate the (context-dependent) probabilities of other words. To overcome this issue we aim to _smooth_ the probabilities and move mass from seen events to unseen events.\n", "\n", "### Laplace Smoothing\n", "\n", "The easiest way to overcome the problem of zero probabilities is to simply add pseudo counts to each event in the dataset (in a Bayesian setting this amounts to a maximum a posteriori estimate under a Dirichlet prior on the parameters).\n", "\n", "$$\n", "\\param^{\\alpha}_{w,h} = \\frac{\\counts{\\train}{h,w} + \\alpha}{\\counts{\\train}{h} + \\alpha \\lvert V \\rvert } \n", "$$\n", "\n", "Let us implement this in Python."
] }, { "cell_type": "code", "execution_count": 24, "metadata": { "run_control": { "frozen": false, "read_only": false } }, "outputs": [ { "data": { "text/plain": [ "0.0007692307692307692" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "class LaplaceLM(CountLM):\n", "    def __init__(self, base_lm, alpha):\n", "        super().__init__(base_lm.vocab, base_lm.order)\n", "        self.base_lm = base_lm\n", "        self.alpha = alpha\n", "    def counts(self, word_and_history):\n", "        return self.base_lm.counts(word_and_history) + self.alpha\n", "    def norm(self, history):\n", "        return self.base_lm.norm(history) + self.alpha * len(self.base_lm.vocab)\n", "\n", "laplace_bigram = LaplaceLM(bigram, 0.1) \n", "laplace_bigram.probability(\"[OOV]\",\"money\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This should give a better perplexity value:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "run_control": { "frozen": false, "read_only": false } }, "outputs": [ { "data": { "text/plain": [ "72.41090509288753" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "perplexity(laplace_bigram,oov_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Adjusted counts\n", "It is often useful to think of smoothing algorithms as un-smoothed Maximum-Likelihood estimators that work with *adjusted* n-gram counts in the numerator, and fixed history counts in the denominator. This allows us to see how counts from high-frequency words are reduced, and counts of unseen words increased. If these changes are too big, the smoothing method is likely not very effective.\n", "\n", "Let us reformulate the Laplace LM using adjusted counts. Note that since we have histories with count 0, we do need to increase the original denominator by a small \\\\(\\epsilon\\\\) to avoid division by zero. \n", "$$\n", "\\begin{split}\n", "\\counts{\\train,\\alpha}{h,w} &= \\param^{\\alpha}_{w,h} \\cdot (\\counts{\\train}{h} + \\epsilon)\\\\\\\\\n", "\\counts{\\train,\\alpha}{h} &= \\counts{\\train}{h} + \\epsilon\n", "\\end{split}\n", "$$" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "run_control": { "frozen": false, "read_only": false } }, "outputs": [ { "data": { "text/plain": [ "(689.0, 664.1298035643539)" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "class AdjustedLaplaceLM(CountLM):\n", "    def __init__(self, base_lm, alpha):\n", "        super().__init__(base_lm.vocab, base_lm.order)\n", "        self.base_lm = base_lm\n", "        self.alpha = alpha\n", "        self.eps = 0.000001\n", "    def counts(self, word_and_history):\n", "        history = word_and_history[1:]\n", "        word = word_and_history[0]\n", "        return 0.0 if word not in self.vocab else \\\n", "            (self.base_lm.counts(word_and_history) + self.alpha) / \\\n", "            (self.base_lm.norm(history) + self.alpha * len(self.base_lm.vocab)) * \\\n", "            (self.base_lm.norm(history) + self.eps)\n", "    def norm(self, history):\n", "        return self.base_lm.norm(history) + self.eps\n", "\n", "adjusted_laplace_bigram = AdjustedLaplaceLM(bigram, 0.1)\n", "bigram.counts((OOV,OOV)), adjusted_laplace_bigram.counts((OOV,OOV))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see above that for high frequency words the absolute counts are altered quite substantially. This is unfortunate because for high frequency words we would expect the counts to be relatively accurate.
Can we test more generally whether our adjusted counts are sensible?\n", "\n", "One option is to compare the adjusted counts to average counts in a held-out set. For example, for words of count 0 in the training set, how does their average count in the held-out set compare to their adjusted count in the smoothed model? To test this we need some helper functions." ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "def avg_counts(train_lm, test_lm, vocab):\n", "    \"\"\"\n", "    Calculate a dictionary from counts in the training-LM to average counts in the test-LM. \n", "    \"\"\"\n", "    avg_test_counts = collections.defaultdict(float)\n", "    norm = collections.defaultdict(float)\n", "    for ngram in util.cross_product([list(train_lm.vocab)] * train_lm.order):\n", "        train_count = train_lm.counts(ngram)\n", "        test_count = test_lm.counts(ngram)\n", "        avg_test_counts[train_count] += test_count\n", "        norm[train_count] += 1.0\n", "    for c in avg_test_counts.keys():\n", "        avg_test_counts[c] /= norm[c]\n", "    return avg_test_counts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now calculate a table of training counts, test counts, and smoothed counts." ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "run_control": { "frozen": false, "read_only": false } }, "outputs": [ { "data": {
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Train CountTest CountSmoothed Count
000.0033750.004950
110.4374870.301893
221.1152070.754602
331.6615381.212537
442.6753251.572842
554.1018522.280637
664.6883122.562516
775.3928573.577428
\n", "
" ], "text/plain": [ " Train Count Test Count Smoothed Count\n", "0 0 0.003375 0.004950\n", "1 1 0.437487 0.301893\n", "2 2 1.115207 0.754602\n", "3 3 1.661538 1.212537\n", "4 4 2.675325 1.572842\n", "5 5 4.101852 2.280637\n", "6 6 4.688312 2.562516\n", "7 7 5.392857 3.577428" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_bigram = NGramLM(oov_test, 2)\n", "joint_vocab = set(oov_test + oov_train)\n", "avg_test_counts = avg_counts(bigram, test_bigram, joint_vocab)\n", "avg_laplace_counts = avg_counts(bigram, AdjustedLaplaceLM(bigram, 0.1), joint_vocab)\n", "frame = [(count, avg_test_counts[count], avg_laplace_counts[count]) for count in range(0,8)]\n", "pd.DataFrame(frame, columns = [\"Train Count\", \"Test Count\", \"Smoothed Count\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Train vs Test Counts\n", "\n", "In the above table it is interesting to note that the test counts usually differ from the training counts (larger than 0) by an absolute amount of about 0.6 to 1.4. With larger corpora this difference is even more consistent. This suggest that good smoothing methods should, roughly speaking, take one count off of the real training counts, and then allocate this mass to the unseen words. In the exercises you will develop a model that captures this intuition.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Interpolation\n", "For a given context the smoothing methods discussed above shift mass uniformly across the words that haven't been seen in this context. This makes sense when the words are not in the vocabularly. However, when words are in the vocabularly but just have not been seen in the given context, we can do better because we can leverage statistics about the word from other contexts. In particular, we can *back-off* to the statistics of \\\\(n-1\\\\) grams. \n" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "run_control": { "frozen": false, "read_only": false } }, "outputs": [ { "data": { "text/plain": [ "(0.0007633587786259542, 0.0007633587786259542)" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "adjusted_laplace_bigram.probability('skies','skies'), adjusted_laplace_bigram.probability('[/BAR]','skies')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A simple technique to use the \\\\(n-1\\\\) gram statistics is interpolation. Here we compose the probability of a word as the weighted sum of the probability of an \\\\(n\\\\)-gram model \\\\(p'\\\\) and a back-off \\\\(n-1\\\\) model \\\\(p''\\\\): \n", "\n", "$$\n", "\\prob_{\\alpha}(w_i|w_{i-n},\\ldots,w_{i-1}) = \\alpha \\cdot \\prob'(w_i|w_{i-n},\\ldots,w_{i-1}) + (1 - \\alpha) \\cdot \\prob''(w_i|w_{i-n+1},\\ldots,w_{i-1})\n", "$$\n", "\n", "A Python implementation of this model can be seen below. We also show how a more likely unigram now has a higher probability in a context it hasn't seen in before. 
" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "run_control": { "frozen": false, "read_only": false } }, "outputs": [ { "data": { "text/plain": [ "(0.00010818203385572938, 0.10060635388029084)" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "class InterpolatedLM(LanguageModel):\n", " def __init__(self, main, backoff, alpha):\n", " super().__init__(main.vocab, main.order)\n", " self.main = main\n", " self.backoff = backoff\n", " self.alpha = alpha\n", " def probability(self, word, *history):\n", " return self.alpha * self.main.probability(word,*history) + \\\n", " (1.0 - self.alpha) * self.backoff.probability(word,*history)\n", "\n", "interpolated = InterpolatedLM(adjusted_laplace_bigram,unigram,0.01)\n", "interpolated.probability('skies','skies'), interpolated.probability('[/BAR]','skies')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now find a good $\\alpha$ parameter to optimise for perplexity. Notice that in practice this should be done using a development set." ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "run_control": { "frozen": false, "read_only": false } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "
\n", "" ], "text/plain": [ "" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alphas = np.arange(0,1.1,0.1)\n", "perplexities = [perplexity(InterpolatedLM(adjusted_laplace_bigram,unigram,alpha),oov_test) for alpha in alphas]\n", "fig = plt.figure()\n", "plt.plot(alphas,perplexities)\n", "mpld3.display(fig)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Backoff \n", "Instead of combining probabilities for all words given a context, it makes sense to back-off only when no counts for a given event are available and rely on available counts where possible. \n", "\n", "A particularly simple, if not to say stupid, backoff method is [Stupid Backoff](http://www.aclweb.org/anthology/D07-1090.pdf). Let \\\\(w\\\\) be a word and \\\\(h_{n}\\\\) be an n-gram of length \\\\(n\\\\): \n", "\n", "$$\n", "\\prob_{\\mbox{Stupid}}(w|h_n) = \n", "\\begin{cases}\n", "\\frac{\\counts{\\train}{h_n,w}}{\\counts{\\train}{h_n}} &= \\mbox{if }\\counts{\\train}{h_n,w} > 0 \\\\\\\\\n", "\\prob_{\\mbox{Stupid}}(w|h_{n-1}) & \\mbox{otherwise}\n", "\\end{cases}\n", "$$" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "class StupidBackoff(LanguageModel):\n", " def __init__(self, main, backoff, alpha):\n", " super().__init__(main.vocab, main.order)\n", " self.main = main\n", " self.backoff = backoff\n", " self.alpha = alpha\n", " def probability(self, word, *history):\n", " return self.main.probability(word,*history) \\\n", " if self.main.counts((word,)+tuple(history)) > 0 \\\n", " else self.alpha * self.backoff.probability(word,*history)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It turns out that the Stupid LM is very effective when it comes to *extrinsic* evaluations, but it doesn't represent a valid probability distribution: when you sum over the probabilities of all words given a history, the result may be larger than 1. This is the case because the main n-gram model probabilities for all non-zero count words already sum to 1. The fact that the probabilities sum to more than 1 makes perplexity values meaningless. The code below illustrates the problem." ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "run_control": { "frozen": false, "read_only": false } }, "outputs": [ { "data": { "text/plain": [ "1.064711557993094" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stupid = StupidBackoff(bigram, unigram, 0.1)\n", "sum([stupid.probability(word, 'the') for word in stupid.vocab])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The are several \"proper backoff models\" that do not have this problem, e.g. the Katz-Backoff method. 
We refer to other material below for a deeper discussion of these.\n", "\n", "### Background Reading\n", "\n", "* Jurafsky & Martin, Speech and Language Processing: Chapter 4, N-Grams.\n", "* Bill MacCartney, Stanford NLP Lunch Tutorial: [Smoothing](http://nlp.stanford.edu/~wcmac/papers/20050421-smoothing-tutorial.pdf)" ] } ], "metadata": { "celltoolbar": "Raw Cell Format", "hide_input": false, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.2" } }, "nbformat": 4, "nbformat_minor": 1 }