{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "hideCode": false, "hidePrompt": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "%%capture\n", "%load_ext autoreload\n", "%autoreload 2\n", "# %cd ..\n", "import sys\n", "sys.path.append(\"..\")\n", "import statnlpbook.util as util\n", "util.execute_notebook('language_models.ipynb')\n", "# import tikzmagic\n", "%load_ext tikzmagic\n", "matplotlib.rcParams['figure.figsize'] = (10.0, 6.0)" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "skip" } }, "source": [ "\n", "$$\n", "\\newcommand{\\prob}{p}\n", "\\newcommand{\\x}{\\mathbf{x}}\n", "\\newcommand{\\vocab}{V}\n", "\\newcommand{\\params}{\\boldsymbol{\\theta}}\n", "\\newcommand{\\param}{\\theta}\n", "\\DeclareMathOperator{\\perplexity}{PP}\n", "\\DeclareMathOperator{\\argmax}{argmax}\n", "\\newcommand{\\train}{\\mathcal{D}}\n", "\\newcommand{\\counts}[2]{\\#_{#1}(#2) }\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "slide" } }, "source": [ "# Language Models" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "slide" } }, "source": [ "## Language Models\n", "\n", "calculate the **probability of seeing a sequence of words**. " ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "For example: how likely is the following sequence?\n", "\n", "> We're going to win bigly. " ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "fragment" } }, "source": [ "Is it more likely than this one?\n", "\n", "> We're going to win big league." 
] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "slide" } }, "source": [ "## Use Cases: Machine Translation\n", "\n", "> Wir werden haushoch gewinnen\n", "\n", "translates to?" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "fragment" } }, "source": [ "> We will win by a mile\n", "\n", "or \n", "\n", "> We will win bigly" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "## Use Cases: Speech Recognition\n", "\n", "What did he [say](https://www.theguardian.com/us-news/video/2016/may/04/donald-trump-we-are-going-to-win-bigly-believe-me-video)?\n", "\n", "> We're going to win bigly\n", "\n", "or\n", "\n", "> We're going to win big league" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "## Use Cases: Natural Language Generation\n", "\n", "https://twitter.com/deepdrumpf\n", "\n", "Other applications?" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "slide" } }, "source": [ "## Formally\n", "Models the probability \n", "\n", "$$\\prob(w_1,\\ldots,w_d)$$ \n", "\n", "of observing sequences of words \\\\(w_1,\\ldots,w_d\\\\). 
" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "Without loss of generality: \n", "\n", "\\begin{align}\n", "\\prob(w_1,\\ldots,w_d) &= p(w_1) p(w_2|w_1) p(w_3|w_1, w_2) \\ldots \\\\\n", " &= \\prob(w_1) \\prod_{i = 2}^d \\prob(w_i|w_1,\\ldots,w_{i-1})\n", "\\end{align}" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "### Structured Prediction\n", "\n", "predict word $y=w_i$ \n", "* conditioned on history $\\x=w_1,\\ldots,w_{i-1}$." ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "slide" } }, "source": [ "## N-Gram Language Models\n", "\n", "Impossible to estimate sensible probability for each history \n", "\n", "$$\n", "\\x=w_1,\\ldots,w_{i-1}\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "### Change **representation**\n", "truncate history to last $n-1$ words: \n", "\n", "$$\n", "\\mathbf{f}(\\x)=w_{i-(n-1)},\\ldots,w_{i-1}\n", "$$\n", "\n", "$\\prob(\\text{bigly}|\\text{...,blah, blah, blah, we, will, win}) \n", "= \\prob(\\text{bigly}|\\text{we, will, win})$" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "### Unigram LM\n", "\n", "Set $n=1$:\n", "$$\n", "\\prob(w_i|w_1,\\ldots,w_{i-1}) = \\prob(w_i).\n", "$$\n", "\n", "$\\prob(\\text{bigly}|\\text{we, will, win}) = \\prob(\\text{bigly})$" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "slide" } }, "source": [ "## Uniform LM\n", "Same probability for each word in a *vocabulary* \\\\(\\vocab\\\\):\n", "\n", "$$\n", "\\prob(w_i|w_1,\\ldots,w_{i-1}) = \\frac{1}{|\\vocab|}.\n", "$$\n", "\n", 
"$\\prob(\\text{big}) = \\prob(\\text{bigly}) = \\frac{1}{|\\vocab|}$" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "Let us look at a training set and create a uniform LM from it." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "hideCode": false, "hidePrompt": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "['[BAR]', 'Can', \"'t\", 'even', 'call', 'this', 'a', 'blues', 'song', '[/BAR]']" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train[:10]" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "hideCode": false, "hidePrompt": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "0.9999999999999173" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vocab = set(train)\n", "baseline = UniformLM(vocab)\n", "sum([baseline.probability(w) for w in vocab])" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "fragment" } }, "source": [ "What about other words? Summing up probabilities?" 
] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "## Sampling\n", "* Sampling from an LM is easy and instructive\n", "* Usually, the better the LM, the better the samples" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "Sample **incrementally**, one word at a time " ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "def sample_once(lm, history, words):\n", " probs = [lm.probability(word, *history) for word in words]\n", " return np.random.choice(words,p=probs)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "'east'" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_once(baseline, [], list(baseline.vocab)) " ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "def sample(lm, initial_history, amount_to_sample):\n", " words = list(lm.vocab)\n", " result = []\n", " result += initial_history\n", " for _ in range(0, amount_to_sample):\n", " history = result[-(lm.order - 1):]\n", " result.append(sample_once(lm,history,words))\n", " return result" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "hideCode": false, "hidePrompt": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "['sword',\n", " 'everytime',\n", " 'brutality',\n", " 'Well',\n", " 'breast',\n", " 'neck',\n", " 'alcohol',\n", " 'stiff',\n", " 'baron',\n", " 'system']" ] }, 
"execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample(baseline, [], 10)" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "slide" } }, "source": [ "## Evaluation\n", "* **Extrinsic**: how it improves a downstream task?\n", "* **Intrinsic**: how good does it model language?" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "## Intrinsic Evaluation\n", "**Shannon Game**: Predict next word, win if prediction match words in actual corpus (or you gave it high probability)\n", "\n", "> Our horrible trade agreements with [???]\n", "\n", "Formalised by ..." ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "### Perplexity \n", "Given test sequence \\\\(w_1,\\ldots,w_T\\\\), perplexity \\\\(\\perplexity\\\\) is **geometric mean of inverse probabilities**:\n", "\n", "\\begin{align}\n", "\\perplexity(w_1,\\ldots,w_T) &= \\sqrt[T]{\\frac{1}{\\prob(w_1)} \\frac{1}{\\prob(w_2|w_1)} \\ldots} \\\\\n", "&= \\sqrt[T]{\\prod_i^T \\frac{1}{\\prob(w_i|w_{i-n},\\ldots,w_{i-1})}}\n", "\\end{align}" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "### Interpretation\n", "\n", "Consider LM where \n", "* at each position there are exactly **2** words with $\\frac{1}{2}$ probability\n", "* in test sequence, one of these is always the true word " ] }, { "cell_type": "raw", "metadata": { "hide_egal": true, "is_egal": true, "slideshow": { "slide_type": "fragment" } }, "source": [ "Created with SnapsherapsbiglyMatkowonstronglyhelostsmally0.50.50.0" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "Then \n", "\n", "* 
$\\perplexity(w_1,\\ldots,w_T) = \\sqrt[T]{2 \\cdot 2 \\cdot\\ldots} = 2$\n", "* Perplexity $\\approx$ average number of choices" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "Perplexity of uniform LM on an **unseen** test set?" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "hideCode": false, "hidePrompt": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "inf" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "perplexity(baseline, test)" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "Problem: model assigns **zero probability** to words not in the vocabulary. " ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "hideCode": false, "hidePrompt": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "[('does', 0.0), ('Ceremonies', 0.0), ('Masquerading', 0.0)]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[(w,baseline.probability(w)) for w in test if w not in vocab][:3]" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "slide" } }, "source": [ "## The Long Tail\n", "New words not specific to our corpus: \n", "* long **tail** of words that appear only a few times\n", "* each has low probability, but probability of seeing any long tail word is high\n" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "Let us plot word frequency ranks (x-axis) against frequency (y-axis) " ] }, { "cell_type": "code", 
"execution_count": 10, "metadata": { "hideCode": false, "hidePrompt": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAlkAAAFsCAYAAAD/p6zEAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3Xl81NW9//H3ZyYbCSSQECSsAcJuECUCohb3Coq0rqDW\nq9eK+nNpa3tvba/dr9Veb631alVs1ap1xbaKolaqiAsKgarIvkNASAgS9qzn90cCIotMkpk5s7ye\njwcPzXe+mXnj4+vknXPOnK855wQAAIDwCvgOAAAAkIgoWQAAABFAyQIAAIgAShYAAEAEULIAAAAi\ngJIFAAAQAZQsAACACKBkAQAARAAlCwAAIAJSfAeQpI4dO7rCwkLfMQAAAI5o7ty5m51z+Uc6LyZK\nVmFhoUpLS33HAAAAOCIzWxPKeUwXAgAARAAlCwAAIAIoWQAAABFAyQIAAIgAShYAAEAEULIAAAAi\ngJIFAAAQAZQsAACACKBkAQAARAAlCwAAIAIoWQAAABEQE/cujLQ9tfXaUV3nO0bI0lICys5I9R0D\nAAC0QlKUrNc+3ajvPvuR7xghSwmYfj/hWJ0zpMB3FAAA0EJeS5aZjZM0rqioKKKvM6Rbjn41fnBE\nXyOcnpq9Tr98eYFO6Z+vrPSk6MEAACQcc875zqCSkhJXWlrqO0bMmLtmiy54YJZuOq1I3z+rv+84\nAABgP2Y21zlXcqTzWPgeg4b1zNX4oV300MyVWrdll+84AACgBShZMerWMQMUNNMdry7yHQUAALQA\nJStGFeS00fWn9NG0+Rs1a0Wl7zgAAKCZKFkxbNLXeqtr+zb6xdQFqm/wv3YOAACEjpIVwzJSg/rx\n2IFavHG7npmz1nccAADQDJSsGDe2uLOG98rV/76+RFW7an3HAQAAIaJkxTgz08/GDdLW3bX6/T+X\n+Y4DAABCRMmKA4O75GjC8T30+KzVWl6+3XccAAAQAkpWnPjBWf3UJi2oX768SLGwgSwAAPhqlKw4\nkdc2Xd85va9mLq3QW0vKfccBAABHQMmKI1ecUKje+Vn61cuLVFPX4DsOAAD4CpSsOJKWEtBPzh2k\nVZt36s/vr/YdBwAAfAVKVpw5tX8nndo/X/f+c5kqtlf7jgMAAA6DkhWHbjt3kHbX1uu3/1jiOwoA\nADgMSlYc6pPfVleOKtSzpev06foq33EAAMAhULLi1E2n91VuZpp+MXUBWzoAABCDKFlxKqdNqn7w\n9f6as/pzvfzJZ77jAACAA1Cy4tjFJd01qCBbd0xbpN019b7jAACA/VCy4lgw0Hhfww1Ve/TQzBW+\n4wAAgP1QsuLciN55OmdIgR58e4XWb93tOw4AAGhCyUoAPxozQM5Jd7662HcUAADQhJKVALp1yNS1\no/to6scbNHvVFt9xAACAIlCyzGygmT1oZlPM7PpwPz8O7brRvVWQk6FfTF2g+ga2dAAAwLeQSpaZ\nPWJm5Wb26QHHzzazJWa23MxulSTn3CLn3HWSLpZ0Yvgj41Ay01J065gBWrBhm6bMXec7DgAASS/U\nkazHJJ29/wEzC0q6X9IYSYMkTTSzQU2PnSfpFUnTwpYUR3TeMV1U0rOD7np9ibbtqfUdBwCApBZS\nyXLOzZR04GKf4ZKWO+dWOudqJD0jaXzT+
S8558ZIuuxwz2lmk8ys1MxKKyoqWpYeX2Jm+tm4warc\nWaP73lzuOw4AAEmtNWuyukraf16qTFJXMzvFzO41s4f0FSNZzrnJzrkS51xJfn5+K2Jgf8XdcnTR\nsG569L1VWlmxw3ccAACSVtgXvjvnZjjnbnbOXeucuz/cz48j+8HX+ys9JajbX1nkOwoAAEmrNSVr\nvaTu+33drekYPOvULkM3nVakfy4u14wl5b7jAACQlFpTsuZI6mtmvcwsTdIESS815wnMbJyZTa6q\nqmpFDBzKlScWqjAvU796eaFq6xt8xwEAIOmEuoXD05JmSepvZmVmdrVzrk7SjZJel7RI0nPOuQXN\neXHn3FTn3KScnJzm5sYRpKcEdds5g7SiYqeemLXGdxwAAJJOSignOecmHub4NLFNQ8w6fWAnndy3\no343fanGD+2ivLbpviMBAJA0uK1OAjMz/fTcQdpVU6+731jqOw4AAEnFa8liTVbk9T2qnb41sqee\nnr1WCzds8x0HAICk4bVksSYrOr53Rj/ltEnVL19eIOe4ryEAANHAdGESyMlM1S1n9dcHK7fotU83\n+o4DAEBSoGQliYnHd9eAzu10+7RF2lNb7zsOAAAJj5KVJFKCAf303EEq+3y3/vjOSt9xAABIeJSs\nJDKqqKPOHtxZ97+1Qhur9viOAwBAQuPThUnmx2MHqt45/ea1xb6jAACQ0Ph0YZLpkZepa07upb/9\na73mrtniOw4AAAmL6cIk9P9OKVJBToauf3KeVlbs8B0HAICERMlKQlnpKfrzvw9XfYPThMkfULQA\nAIgASlaS6ndUOz09aaTqG5wmPkzRAgAg3ChZSazfUe301DUjVVffWLRWbd7pOxIAAAmDTxcmuf6d\nvyhaEybPomgBABAmfLoQ+4pWLUULAICwYboQkhqL1tNNRWviZKYOAQBoLUoW9mkc0RqhmvoGTZz8\ngVZTtAAAaDFKFr5kQOfsfUVrAkULAIAWo2ThIAM6Z+sv36ZoAQDQGpQsHNLAgsaiVV1Xr4kPU7QA\nAGgutnDAYQ0syNZT14zUntrGorWmkqIFAECo2MIBX2n/ojVhMkULAIBQMV2II2qcOqRoAQDQHJQs\nhGRQly+K1kSKFgAAR0TJQsj2Fq1dTUVrbeUu35EAAIhZlCw0y6Au2XqqqWhNmDyLogUAwGFQstBs\njSNaI/YVrRfmlundZZu1dNN2bd1VI+ec74gAAHhnsfADsaSkxJWWlvqOgWZasKFKV/xptip31nzp\neFpKQPlt09UpO12d2qUrv126OrXLUKd2e49l7PsnAADxxszmOudKjniez5JlZuMkjSsqKrpm2bJl\n3nKg5fbU1mv91t2q2F6t8u3VKt+2Z9+/N/5zj8q3V2vrrtqDvveY7u111ahCjS0uUFoKg6oAgPgQ\nFyVrL0ayEl91Xb0q9hWvaq3evFPPlq7Tyoqdym+XrstG9NBlI3oqv12676gAAHwlShZiXkOD0zvL\nN+vR91ZpxpIKpQUDOndIga46sZeKu7FBLQAgNoVaslKiEQY4lEDANLpfvkb3y9fKih16fNYaPV+6\nTn/913oN69lBV51YqK8P7qzUIFOJAID4w0gWYsq2PbWaUlqmP89arTWVu9Q5O0PfOqGnLh3eQx2y\n0nzHAwCA6ULEt/oGpxlLyvXoe6v17vLNystK0+8nHKuT+nb0HQ0AkORCLVnMwyAmBQOm0wcepSe/\nPUKv3HyScrPS9K1HPtQ905eqvsH/LwYAABwJJQsxb3CXHL1444n65tCuumf6Ml356GxV7qj2HQsA\ngK9EyUJcyExL0W8vPkZ3nF+sD1dt0Tn3vqvS1Vt8xwIA4LAoWYgbZqaJw3vor9ePUnpqQJdM/kAP\nz1zJbXwAADGJkoW4c3TXHE296SSdMbCTbp+2SNc+MVdVuw/eUR4AAJ8oWYhL2RmpevDyYfrJuYP0\n5uJyn
ft/72h+WZXvWAAA7OO1ZJnZODObXFXFD0c0n5np6pN66dlrT1BdvdMFD7yvh95eobr6Bt/R\nAABgnywkhi07a3TrC5/oHws3aXCXbP3mgiE6uiu35gEAhB/7ZCGp5Gal6aFvDdMDlx2n8u3VOu++\nd3X7Kwu1q6bOdzQAQJKiZCFhmJnGFBdo+i2jdcnxPfTwO6v09XtmaubSCt/RAABJiJKFhJPTJlV3\nnF+sZyeNVGogoCsema1bnv1IW3bW+I4GAEgilCwkrBG98zTtOyfrptOK9NLHG3TG3W/riQ/WaGc1\nU4gAgMhj4TuSwpKN2/Xjv83X3DWfq116is4/rqsuH9lTfY9q5zsaACDOhLrwnZKFpOGc07y1W/Xk\nB2v0yiefqaa+QSN75+pbIwt11uCjlBpkYBcAcGSULOArVO6o1nOlZfrLh2tU9vlu5bdL1znFBcpM\nCx7y/NysNPXMy1JhXqa652YqI/XQ5wEAEh8lCwhBfYPTzKUVeuKDNXp3+eZD3gfROamu4YvjZlKX\nnDbqmZepm07rqxP65EUzMgDAM0oWEEZbd9VodeUurancqdWbG/85a2WlausbNP2W0WqfmeY7IgAg\nSkItWSnRCAPEu/aZaRqamaah3dvvO7ZwwzaNu+9d3f7KIt110TEe0wEAYhErfYEWGtQlW5O+1lvP\nzy3T+8s3+44DAIgxlCygFb5zel8V5mXqR3+brz219b7jAABiCCULaIWM1KB+fX6x1lTu0j3Tl/mO\nAwCIIZQsoJVG9emoi0u66eF3VmrBhirfcQAAMcJryTKzcWY2uaqKH0yIbz8eO1AdMtN06wvzVVff\n4DsOACAGeC1ZzrmpzrlJOTk5PmMArdY+M00/P2+Q5q+v0h9mrFDF9mrKFgAkObZwAMLknOIC/X3g\net39xlLd/cZSmUkdMtOUm5Wmo7tk65ffOFrZGam+YwIAooSSBYSJmem+S4/Tm4vLtXlHtSp31Khy\nZ7Uqtlfr5U8+0/KKHfrzVcOV1zbdd1QAQBRQsoAwykgNamxxwUHH31pcruuenKuLHpqlJ64eoa7t\n23hIBwCIJj5dCETBqQM66clvj1DF9mpd+MD7Wl6+w3ckAECEUbKAKDm+MFfPTBqp2voGXfzQLM0v\n41O1AJDIKFlAFA3ukqPnrxulNqlBTXz4A32wstJ3JABAhFCygCjr1TFLU64/QZ1zMnTFI7P1xsJN\nviMBACKAkgV4UJDTRs9de4IGdm6n656cq7/OK/MdCQAQZpQswJPcrDT95ZqRGtErV7c897H+57XF\nmr1qi3ZW1/mOBgAIA3PO+c6gkpISV1pa6jsG4MWe2np9/7mP9cr8zyRJZlLvjlkqzMuSme077+KS\nbjprcGdfMQEATcxsrnOu5IjnUbKA2FC+bY/mr6/Sp+u3af76Km3YunvfY5t3VGtndZ1m/Mepym/H\nZqYA4FOoJYvNSIEY0Sk7Q6dnZ+j0gUcd9NjKih0663czdc/0pbr9m8Ue0gEAmos1WUAc6J3fVpeP\n7Kln5qzTsk3bfccBAISAkgXEiZtP76vM1KDufHWx7ygAgBBQsoA4kZuVphtOK9I/F5fr/eWbfccB\nABwBJQuII1eOKlTX9m10+7RFqqtv8B0HAPAVKFlAHMlIDeqHYwZowYZtOv3ut/X07LWqrqv3HQsA\ncAiULCDOnHdMFz18RYly2qTqR3+dr9H/M0OvNu2xBQCIHZQsIA6dOegovXjDiXri6uHqlJ2uG56a\npylzuTUPAMQS9skC4pSZ6eS++SrpmatJT5TqB89/rJq6Bl06oofvaAAAMZIFxL02aUE9fEWJThvQ\nST/+23zd/9ZyFsUDQAyISMkys2+Y2cNm9qyZnRWJ1wDwhYzUoB68fJjOKS7QXa8v0Xn3vae5a7b4\njgUASS3kkmVmj5hZuZl9esDxs81siZktN7NbJck593fn3DWSrpN0SXg
jAziUtJSA7rv0WP3hsuP0\n+a4aXfDALP1wyieq2l3rOxoAJKXmjGQ9Juns/Q+YWVDS/ZLGSBokaaKZDdrvlNuaHgcQBWamscUF\nmn7LaF37td6aMq9MZ/3ubb21uNx3NABIOiEvfHfOzTSzwgMOD5e03Dm3UpLM7BlJ481skaQ7Jb3q\nnJsXpqwAQpSVnqIfjR2oc4YU6D+e/0RXPTZHxxd2UFGntirMy9KFw7opr22675gAkNBauyarq6R1\n+31d1nTsJklnSLrQzK471Dea2SQzKzWz0oqKilbGAHAoQ7q110s3nahbzuyn+ganfyzYpDteXazr\nn5ynhgbnOx4AJLSIbOHgnLtX0r1HOGeypMmSVFJSwrs9ECHpKUHdfHpf3Xx6X0nSM7PX6ta/ztdz\npes0YTjbPQBApLR2JGu9pO77fd2t6RiAGHXJ8d01oleufj1tkcq37/EdBwASVmtL1hxJfc2sl5ml\nSZog6aXWxwIQKWamX59frD21Dfrl1IVyjoFkAIiE5mzh8LSkWZL6m1mZmV3tnKuTdKOk1yUtkvSc\nc25BM55znJlNrqqqam5uAK3QJ7+tbjytSC9/8pkuffhD/Wvt574jAUDCsVj4LbakpMSVlpb6jgEk\nlfoGp8dnrdZ9by5X5c4aje6Xr4nDe+j0gZ2UGuRmEABwOGY21zlXcsTzKFlActtRXadH3l2lv3y4\nRpu2Vatj23RdOKybLjm+u3p1zPIdDwBiDiULQLPU1Tfo7aUVenr2Or21pFz1DU4nFXXUlaMKdeqA\nTgrYl883s0M/EQAkuLgoWWY2TtK4oqKia5YtW+YtB4Av27Rtj54vXacnP1irjdsO/gRiRmpAw3p2\n0Kg+HTWyd56GdMthihFA0oiLkrUXI1lAbKqtb9AbCzdp6abtXzq+dVetPlhZqcUbG49npQVVUpir\nId1ylBIIaGiP9hrdL99HZACIuFBLVkQ2IwWQGFKDAY0tLtDY4oJDPl65o1ofrtqiWSsqNWtlpd5e\n2nj3hoBJL1w/Ssf26BDNuAAQUxjJAhA2DQ1O26vrNOaemcpIC2razScrIzXoOxYAhBUjWQCiLhAw\n5bRJ1V0XHaPL/vihTv3fGWqbnqKh3dtr/NCuOqlvR98RASBqvK5UZTNSIDGdWNRRd55frON6dlDP\nvCy99ulGXf6nD/XiR9x1C0DyYLoQQMRV19Xr0oc/1NKN2/XKzSerR16m70gA0GKhThfymWsAEZee\nEtQ9lwyVTLr6z3O0omKH70gAEHGsyQIQFd1zM/XQ5cN049P/0rn3vqt+R7VVTmaazj+2q8YWFygt\nhd/5ACQW3tUARM2ooo6advPJGltcoA5ZaVq1eYe+++xHuvGpeWpo8L90AQDCiZEsAFHVOSdDv734\nGEmNWz5Mfmel7nx1sR54e4VuOLXIczoACB8+XQjAm0DAdO3Xemv80C767T+WaPHGbb4jAUDYeC1Z\nzrmpzrlJOTk5PmMA8MjM9PNxg5WVnqLfvLrYdxwACBvWZAHwrkNWmm44tUhvLanQ95/7WM/MXus7\nEgC0GmuyAMSEK0cV6u0lFXpj4Ua9MK9MeW3Tdeago3zHAoAWYyQLQEzISA3q6UkjVXrbmRrQuZ3+\n62/ztWnbHt+xAKDFKFkAYkpaSkB3XXiMtu6u1Sl3zdATs1b7jgQALcJ0IYCYU9wtR//47tf0s5cW\n6CcvLlAwENDxhR0kNW5qmpEa9JwQAI6MexcCiFnVdfW64k+z9eGqLfuOdW3fRrd/82id0r+Tx2QA\nklmo9y70WrLMbJykcUVFRdcsW7bMWw4AsWtXTZ1mLt2suoYG7alt0INvr9Dy8h365rFddecFxUpP\nYVQLQHTFRcnai5EsAKGqrqvX/W8u171vLte3T+ql284d5DsSgCQTasliTRaAuJKeEtQtZ/XX1t21\n+uO7q/T3jzZo3DEF+s7pfRUMmDJ
Sg0oN8pkeAP5RsgDEpR+PHagOmWlaVr5dj763Wo++t1qSlJeV\nptu/Wazibjnq2r6N35AAkholC0BcykgN6ntn9pMkzVpRqQUbquSc9FzpOl335FxJ0i1n9tNNpxXJ\nzHxGBZCkKFkA4t4JffJ0Qp88SdJlI3vo3WWbNW3+Z7r7jaXqmZep8UO7ek4IIBmxcAFAQslMS9FZ\ngzvr7ouHanCXbP3Pa0u0u6bedywASYiSBSAhBQKmn5w7SBuqduvKR2drV02d70gAkozXkmVm48xs\nclVVlc8YABLUyN55uvviY/Thqi16Yd5633EAJBmvJcs5N9U5NyknJ8dnDAAJ7BtDu6pPfpamffKZ\n7ygAkgzThQASmpnpnOICfbiqUhXbq33HAZBEKFkAEt55Q7tIkr737EeavnCTausbPCcCkAwoWQAS\nXlGndrrz/CF6b8VmffvxUv3X3+YrFm4pBiCxsU8WgKRw8fHddUr/fD36/mo9MGOFurbP1JWjCpWT\nmeo7GoAExUgWgKTRKTtD//n1/ho/tIt+N32pjv/1dL27bLPvWAASFCULQFIxM9114TG6/9Lj1DM3\nUzc9PU87qtlDC0D4UbIAJJ20lIDOGVKgO84v1ue7ajVtPts7AAg/ShaApDWsZwf16pilKaVlvqMA\nSECULABJy8x06fAemr16ix59b5W276n1HQlAAqFkAUhqV55YqKHd2+sXUxdq3P+9yz0OAYSN+dwr\nxszGSRpXVFR0zbJly7zlAJDcdtfUa+onG/SfUz5Ru4wUpQUbf/8s6tRWj151vDLT2O0GwBfMbK5z\nruSI58XChnwlJSWutLTUdwwASe7Fj9ZrzuotkqSaugY9V1qmod3b684LijWgc7bndABiRagli1/P\nAKDJ+KFdNX5o131ft89M0+SZK3XFn2br7zecqC7t23hMByDesCYLAA7jx2MH6omrh6t8e7VG3fmm\n5q7Z4jsSgDjCdCEAHMHbSys06fFSdc7JUHHXnC891j4zVbedM0gZqUFP6QBEG9OFABAmo/vl666L\njtHvpy/Vws+27Tve0OC0unKXCvOy9I1ju6pj23SPKQHEGkayAKCFnHM67773NH99lSTphetHaVjP\nDp5TAYi0UEeyWJMFAC1kZvrDZcfprguHKDsjRb+YukB3vLpIL3603nc0ADGA6UIAaIXuuZnqnpup\n8u3Vuvefy7RwQ+N04qg+HZXfjulDIJkxkgUAYXDDqUVa8t9j9Np3v6a6BqeTfvOmNlbt8R0LgEeU\nLAAIo6JObXXDqX1UXdegP8xYroYG/+teAfhByQKAMPuPrw/QiF65enzWGv3x3ZW+4wDwhJIFABFw\n9yVDlZ2RomfmrNOr8z/Tgg1VviMBiDJKFgBEQNf2bXTrmIFaWbFT1/9lni56cJZ21dT5jgUgiihZ\nABAhE47vrum3jNZvLzpGu2rq9fOXFigW9iYEEB1eS5aZjTOzyVVVDKMDSDyBgKmoU1udN7SLMlID\neq60TPPWbvUdC0CUsOM7AERBxfZqHX/7dJ056CidVNRRkpTTJlXjh3aRmXlOB6A5uHchAMSQ/Hbp\nOm1AJ72xcJPeWLhp3/HuuW00rGeux2QAIoWSBQBR8vAVJaraXStJ2rGnTqP/9y099eE6bd5Rs++c\nbh3aaHCXHF8RAYQRJQsAoiQYMOVmpUmScrPSdFyPDnphXplemFe275z0lIA+/tlZykgN+ooJIEwo\nWQDgyaNXHa91W3bt+3rums/10xcX6P0Vm3VMt/aSpLYZKUpPoXAB8YiSBQCeZGekfmlqsCCnjX72\n0gL9+2NffBCoS06G3v3haQoEWBwPxBtKFgDEiNysND3yb8dr3eeNo1vzy6r0/Nwyra7cqd75bT2n\nA9BclCwAiCGnDui0798XfbZNz88t083P/Etjji7QDacWeUwGoLnY8R0AYlTfTm11TnGBNm+v0f+9\
nuUz1Df73NQQQOkoWAMSolGBA9192nL5/Vj/tqW3Q9EWb9EnZVu2prfcdDUAImC4EgBg3pOmThtc+\nMVeSdPnIHvrvbxT7jAQgBJQsAIhx/Tu305TrTlDV7lrdM32Z5q/f5jsSgBBQsgAgDpQUNt56551l\nm/XsnHX67T+WHHROSiCgy0f2UF7b9GjHA3AIlCwAiCMnFnXUUx+u1f1vLT/osQYnZaUH9e2Te3tI\nBuBAlCwAiCNnDjpKS28fc8jHhv7yH1pduTPKiQAcDiULABJEz9xMvbmoXD+o/fiQj6enBHTLmf2Y\nTgSihJIFAAlibHGBHp+1RrNWVB70WH2D08Zte3Rsjw66cFg3D+mA5EPJAoAEce3oPrp2dJ9DPran\ntl4DfvKaPtu6O8qpgORFyQKAJJCRGlReVpqmLy5XXdPO8SkB04ThPZTfjulDIBIoWQCQJIb3ytWr\nn27Ux+u27juWlhI47OgXgNahZAFAknjg8mFf+nrwT1/Tpm3VntIAiY+SBQBJqlN2htZ9vksbq/Yc\n8vGASfnt0mVmUU4GJAZKFgAkqYKcDL2xcJPeWLjpsOf84rzB+rdRhdELBSSQsJcsM+st6b8k5Tjn\nLgz38wMAwuOX44/WnNVbDvv4f7+8UCsrdkQxEZBYQipZZvaIpHMllTvnjt7v+NmSfi8pKOmPzrk7\nnXMrJV1tZlMiERgAEB5FndqqqFPbwz7+0NsrtGVXbRQTAYkl1JGsxyTdJ+nxvQfMLCjpfklnSiqT\nNMfMXnLOLQx3SABA9HXIStPSjdv17Jy1hz1nVJ+O6p6bGcVUQPwIqWQ552aaWeEBh4dLWt40ciUz\ne0bSeEkhlSwzmyRpkiT16NEjxLgAgGjpk99WU+aW6YcvzD/sOWOO7nzQpxYBNGrNmqyuktbt93WZ\npBFmlifpdknHmtmPnHN3HOqbnXOTJU2WpJKSEteKHACACPjNBUN0y5n9Dvv4DU/N05adNVFMBMSX\nsC98d85VSrou3M8LAIiuYMDUpX2bwz6el5Wu9dymBzis1pSs9ZK67/d1t6ZjAIAkkN0mRR+tq9bb\nSyu+dDxg0rCeHZSZxi5BSG6t+T9gjqS+ZtZLjeVqgqRLm/MEZjZO0riioqJWxAAA+FCQk6HNO6r1\nb4/MPuix753RT985o6+HVEDsCHULh6clnSKpo5mVSfqZc+5PZnajpNfVuIXDI865Bc15cefcVElT\nS0pKrmlebACAbzed1lenDzxK7oBVtVc+Olubd3C7HiDUTxdOPMzxaZKmhTURACAuZKQGdVyPDgcd\nz85I1c7qOg+JgNgS8B0AAJBY2qanaGcNJQtgVSIAIKyy0oOau+ZzTXq89KDHjuneXjecyjpcJAev\nI1lmNs7MJldVVfmMAQAIo7HFBerYNl1rt+z60p/SNZ/rD28t9x0PiBpzB65Y9KCkpMSVlh78Gw8A\nIHHc9fpiPTBjhVb8eqzMzHccoMXMbK5zruRI57EmCwAQFZlpKWpwUk19g+8oQFRQsgAAUZGRGpQk\n7amhZCE5sPAdABAVmWmNJeuBt1eoXcbBP36GdMvRyX3zox0LiBivJYsd3wEgefTumKWUgOnBt1cc\n8vHuuW2IKyuTAAAKCUlEQVT0zn+eFuVUQOSw8B0AEDV19Q1qOMSPnZ+99KneWFiu0tvOiH4ooJlC\nXfjOdCEAIGpSgodeCpyRGlR1XX2U0wCRxcJ3AIB36SlBVdexIB6JhZIFAPAuPSWgmroGxcISFiBc\nmC4EAHiXntr4O/8fZqxQMHDkjUpH98vXwILsSMcCWoVPFwIAvOvdMUtm0l2vLwnp/LlrPtfDVxxx\n3THgFZ8uBADEhD219QrlR9KEhz9QdkaKnrh6RORDAYfApwsBAHFl747wR5KeElAtt+ZBHGDhOwAg\nrqQGTXX1/mdhgCOhZAEA4kpqkJEsxAdKFgAgrqQGA6phJAtxg
DVZAIC4kho0bdlZrRc/Wh/S+cVd\nc9Q7v22EUwEHYwsHAEBc6dQuQ5u2Ves7z3wU0vklPTtoyvWjIpwKOBhbOAAA4kpdfYPWbtkV0rm3\n/f1Tbd1Vq2nfOTnCqZBM2MIBAJCQUoKBkKf/sjNSVbmjJsKJgENj4TsAIGEFg6a6Bj6JCD8oWQCA\nhBU0U32D/2UxSE6ULABAwkoJmOpjYO0xkhMlCwCQsIIBUz17asETShYAIGGlBE11TBfCEz5dCABI\nWMGAaevuWt389L+a/b2nDsjXN4/tFoFUSBZsRgoASFgje+fp/eWVmr++qlnft2nbHi0r30HJQquw\nGSkAAAe4/sm5Wl6+Q2/cMtp3FMSgUDcjZU0WAAAHSAkG2PoBrUbJAgDgAKkBUy2bmKKVKFkAABwg\nJWiqY+sHtBIlCwCAAwQDAdVSstBKlCwAAA6QGjTVM12IVqJkAQBwgJRAgOlCtBolCwCAA6QGWfiO\n1qNkAQBwgGCAhe9oPW6rAwDAAVKCAdU1OD323qqwPWe7jFR989iuCgQsbM+J2MZtdQAAOED3Dm0k\nST+fujCsz9u/czsd3TUnrM+J2OW1ZDnnpkqaWlJSco3PHAAA7O+iku46a3BnNYRp1/f3V1Tqhqfm\nqbqOdV7JhOlCAAAOIadNatieK7tN44/bWLhfMKKHhe8AAERYwBrXYXE/xORCyQIAIML2liw6VnKh\nZAEAEGF7P1DIdGFyoWQBABBhwaaWVU/JSiqULAAAIsyYLkxKlCwAACJs70hWuLaEQHygZAEAEGF7\n12Q1MF2YVChZAABEGFs4JCdKFgAAEcYWDsmJkgUAQIQFmn7aMl2YXChZAABEWHDfSBYlK5lQsgAA\niDBjTVZS4gbRAABE2N4tHJ76cK3eW745Kq+ZGgzoxtOKVJDTJiqvh4N5LVlmNk7SuKKiIp8xAACI\nqKOy0zW4S7bWbtmltVt2Rfz16hucyrdX6+iuOZo4vEfEXw+H5rVkOeemSppaUlJyjc8cAABEUmZa\nil65+eSovd6mbXs04tf/ZA2YZ6zJAgAgwTTtfSo6ll+ULAAAEszehfZ0LL8oWQAAJJimjiXHUJZX\nlCwAABLM3h3m6Vh+UbIAAEgwe9dksfDdL0oWAAAJ5ovpQr85kh0lCwCABMPC99hAyQIAIMGw8D02\nULIAAEgw7JMVGyhZAAAkmH2fLmTC0CtKFgAACWbvdGEDHcsrShYAAAnGxD5ZsYCSBQBAgtm38J3p\nQq8oWQAAJBj2yYoNlCwAABLMF7fVoWX5RMkCACDBfHFbHa8xkh4lCwCABGPcIDomULIAAEgwARa+\nxwRKFgAACWbvSBbThX5RsgAASFTMF3pFyQIAIAEFTEwWekbJAgAgAZmZGhjJ8iol3E9oZlmS/iCp\nRtIM59xfwv0aAADgqwWM2ULfQhrJMrNHzKzczD494PjZZrbEzJab2a1Nh8+XNMU5d42k88KcFwAA\nhMBkLHz3LNSRrMck3Sfp8b0HzCwo6X5JZ0oqkzTHzF6S1E3S/KbT6sOWFAAAhM6kd5ZVaHdNne8k\nUXXOkC4a3ivXdwxJIZYs59xMMys84PBwScudcyslycyekTRejYWrm6SP9BUjZWY2SdIkSerRo0dz\ncwMAgK8woleu5q+v0vqtu31HiapBXbLjq2QdRldJ6/b7ukzSCEn3SrrPzM6RNPVw3+ycmyxpsiSV\nlJQwoAkAQBg9cfUI3xGSXtgXvjvndkq6KtzPCwAAEE9as4XDeknd9/u6W9MxAACApNeakjVHUl8z\n62VmaZImSHqpOU9gZuPMbHJVVVUrYgAAAMSeULdweFrSLEn9zazMzK52ztVJulHS65IWSXrOObeg\nOS/unJvqnJuUk5PT3NwAAAAxLdRPF048zPFpkqaFNREAAEAC4LY6AAAAEUDJAgAAiACvJYuF7wAA\nIFF5LVksfAcAAImK6UIAA
IAIoGQBAABEACULAAAgAlj4DgAAEAHmnPOdQWZWIWlN05c5kr6qdbX0\n8Y6SNrcooB9H+nvG0mu09Hma832hntua6+erHoun64drp2Xn897TiOunZefy3pNc105P51z+Ec92\nzsXUH0mTI/G4pFLff7dw/neIpddo6fM05/tCPbc1188RHoub64drp2Xn897D9dOac3nv4do51J9Y\nXJM1NcKPx4to/D3C9RotfZ7mfF+o57bm+uDaif5rROPaCfV83nsacf207Fzee7h2DhIT04XRYGal\nzrkS3zkQn7h+0FJcO2gNrp/4FosjWZEy2XcAxDWuH7QU1w5ag+snjiXNSBYAAEA0JdNIFgAAQNRQ\nsgAAACKAkgUAABABlCwAAIAISNqSZWZZZvZnM3vYzC7znQfxxcx6m9mfzGyK7yyIL2b2jab3nWfN\n7CzfeRA/zGygmT1oZlPM7HrfeXBkCVWyzOwRMys3s08POH62mS0xs+VmdmvT4fMlTXHOXSPpvKiH\nRcxpzvXjnFvpnLvaT1LEmmZeO39vet+5TtIlPvIidjTz2lnknLtO0sWSTvSRF82TUCVL0mOSzt7/\ngJkFJd0vaYykQZImmtkgSd0krWs6rT6KGRG7HlPo1w+wv8fU/GvntqbHkdweUzOuHTM7T9IrkqZF\nNyZaIqFKlnNupqQtBxweLml508hDjaRnJI2XVKbGoiUl2H8HtEwzrx9gn+ZcO9boN5Jedc7Ni3ZW\nxJbmvu84515yzo2RxDKXOJAM5aKrvhixkhrLVVdJf5V0gZk9oMS5bxTC75DXj5nlmdmDko41sx/5\niYYYd7j3npsknSHpQjO7zkcwxLzDve+cYmb3mtlDYiQrLqT4DuCLc26npKt850B8cs5VqnFNDdAs\nzrl7Jd3rOwfij3NuhqQZnmOgGZJhJGu9pO77fd2t6RgQCq4ftBTXDlqKaydBJEPJmiOpr5n1MrM0\nSRMkveQ5E+IH1w9aimsHLcW1kyASqmSZ2dOSZknqb2ZlZna1c65O0o2SXpe0SNJzzrkFPnMiNnH9\noKW4dtBSXDuJzZxzvjMAAAAknIQayQIAAIgVlCwAAIAIoGQBAABEACULAAAgAihZAAAAEUDJAgAA\niABKFgAAQARQsgAAACLg/wMGXetUeVWXWAAAAABJRU5ErkJggg==\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.xscale('log')\n", "plt.yscale('log') \n", "plt.plot(ranks, sorted_counts)" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "In log-space such rank vs frequency graphs are **linear** \n", "\n", "* Known as **Zipf's Law**\n", "\n", "Let $r_w$ be the rank of a word \\\\(w\\\\), and \\\\(f_w\\\\) its frequency:\n", "\n", "$$\n", " f_w \\propto \\frac{1}{r_w}.\n", "$$\n", "\n", "* Also true in [random text](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.164.8422&rep=rep1&type=pdf)" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": 
"slide" } }, "source": [ "## Out-of-Vocabularly (OOV) Tokens\n", "There will always be words with zero counts in your training set.\n", "\n", "Solutions:\n", "* Remove unseen words from test set (bad)\n", "* Move probability mass to unseen words (good, discuss later)\n", "* Replace unseen words with out-of-vocabularly token, estimate its probability" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "### Inserting OOV Tokens" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "hideCode": false, "hidePrompt": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "['[BAR]',\n", " 'scratched',\n", " '[/BAR]',\n", " '[BAR]',\n", " 'What',\n", " '[OOV]',\n", " 'it',\n", " 'take',\n", " '[/BAR]',\n", " '[BAR]']" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "replace_OOVs(baseline.vocab, test[:10])" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "fragment" } }, "source": [ "What happens to perplexity if training set is small?" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "### Estimate `OOV` probability\n", "What is the probability of seeing a word you haven't seen before?" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "Consider the \"words\"\n", "\n", "> AA AA BB BB AA\n", "\n", "Going left to right, how often do I see new words?" 
] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "Inject `OOV` tokens to mark these \"new word events\"" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "hideCode": false, "hidePrompt": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "['[OOV]', 'AA', '[OOV]', 'BB', 'AA']" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "inject_OOVs([\"AA\",\"AA\",\"BB\",\"BB\",\"AA\"])" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "Now train on replaced data..." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "hideCode": false, "hidePrompt": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "1290.0000000049852" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "oov_train = inject_OOVs(train)\n", "oov_vocab = set(oov_train)\n", "oov_test = replace_OOVs(oov_vocab, test)\n", "oov_baseline = UniformLM(oov_vocab)\n", "perplexity(oov_baseline,oov_test)" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "fragment" } }, "source": [ "What does this perplexity correspond to?" 
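, "\n", "\n", "One way to read this number, assuming every token of `oov_test` is in `oov_vocab` (which `replace_OOVs` is meant to ensure): a uniform LM gives each test token probability $1/|\\vocab|$, so the perplexity collapses to the vocabulary size.\n", "\n", "$$\n", "\\perplexity(w_1,\\ldots,w_T) = \\sqrt[T]{\\prod_{i=1}^T \\frac{1}{1/|\\vocab|}} = \\sqrt[T]{|\\vocab|^T} = |\\vocab|\n", "$$\n", "\n", "Under that assumption, the value of roughly 1290 would be the number of distinct tokens in `oov_vocab`, the `[OOV]` token included."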
] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "slide" } }, "source": [ "## Training N-Gram Language Models" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "fragment" } }, "source": [ "N-gram language models condition on a limited history: \n", "\n", "$$\n", "\\prob(w_i|w_1,\\ldots,w_{i-1}) = \\prob(w_i|w_{i-(n-1)},\\ldots,w_{i-1}).\n", "$$\n", "\n", "What are its parameters (continuous values that control its behaviour)?" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "One parameter $\\param_{w,h}$ for each word $w$ and history $h=w_{i-(n-1)},\\ldots,w_{i-1}$ pair:\n", "\n", "$$\n", "\\prob_\\params(w|h) = \\param_{w,h}\n", "$$\n", "\n", "$\\prob_\\params(\\text{bigly}|\\text{win}) = \\param_{\\text{bigly, win}}$" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "### Maximum Likelihood Estimate\n", "\n", "Assume training set \\\\(\\train=(w_1,\\ldots,w_d)\\\\)" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "### Maximum Likelihood Estimate\n", "\n", "Find \\\\(\\params\\\\) that maximizes the log-likelihood of \\\\(\\train\\\\):\n", "\n", "$$\n", "\\params^* = \\argmax_\\params \\log p_\\params(\\train)\n", "$$\n", "\n", "where\n", "\n", "$$\n", "\\prob_\\params(\\train) = \\ldots \\prob_\\params(w_i|\\ldots w_{i-1}) \\prob_\\params(w_{i+1}|\\ldots w_{i}) \\ldots \n", "$$\n", "\n", "**Structured Prediction**: this is your continuous optimization problem!" 
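] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "fragment" } }, "source": [ "A sketch of why this optimization is tractable, assuming the parametrisation $\\prob_\\params(w|h) = \\param_{w,h}$ from above. The log-likelihood decomposes over (history, word) pairs:\n", "\n", "$$\n", "\\log \\prob_\\params(\\train) = \\sum_{h,w} \\counts{\\train}{h,w} \\log \\param_{w,h} \\quad \\text{subject to} \\quad \\sum_{w} \\param_{w,h} = 1 \\text{ for every } h\n", "$$\n", "\n", "Each history $h$ gives a separate constrained problem. Adding a Lagrange multiplier $\\lambda_h$ (a symbol introduced here for illustration) and setting the derivative with respect to $\\param_{w,h}$ to zero yields $\\param_{w,h} = \\counts{\\train}{h,w} / \\lambda_h$, and the constraint then fixes $\\lambda_h = \\sum_{w'} \\counts{\\train}{h,w'}$." 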
] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "Maximum-log-likelihood estimate (MLE) can be calculated in **[closed form](/notebooks/chapters/mle.ipynb)**:\n", "$$\n", "\\prob_{\\params^*}(w|h) = \\param^*_{w,h} = \\frac{\\counts{\\train}{h,w}}{\\counts{\\train}{h}} \n", "$$\n", "\n", "where \n", "\n", "$$\n", "\\counts{D}{e} = \\text{Count of } e \\text{ in } D \n", "$$\n", "\n", "Many LM variants: different estimation of counts. " ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "slide" } }, "source": [ "## Training a Unigram Model\n", "Let us train a unigram model..." ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "What do you think the most probable words are? \n", "\n", "Remember our training set looks like this ..." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "['[BAR]', 'I', '[OOV]', 'at', 'home', 'but', 'I', 'gotta', 'escape', '[/BAR]']" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "oov_train[10000:10010]" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "hideCode": false, "hidePrompt": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAmMAAAFpCAYAAADQuy+GAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAG+9JREFUeJzt3X+w3XV95/HnqwFaqygoKWAIDXRTMbYaNcWfbXVZXcBu\ngzuthp0FtO5EZkWkK7ZZp7PLbqezqKi7tpQMtllxRqW0imY0I0NZrXYXNAEjECRroGFJGpKIHVCx\nQOC9f5zvtV8ON7nnknvvJ+fyfMycOd/v58f3fD7Jzc3rfD/f7zmpKiRJktTGT7UegCRJ0tOZYUyS\nJKkhw5gkSVJDhjFJkqSGDGOSJEkNGcYkSZIaMoxJkiQ1ZBiTJElqyDAmSZLUkGFMkiSpocNaD2A6\njjnmmFqyZEnrYUiSJE3p5ptv/l5VLZyq3ViFsSVLlrBp06bWw5AkSZpSkntGaecypSRJUkOGMUmS\npIYMY5IkSQ0ZxiRJkhoyjEmSJDVkGJMkSWrIMCZJktSQYUySJKmhkcJYktOTbE2yLcmaSepPSXJj\nkoeTXNwrf0GSzb3Hg0ku6uouSbKzV3fmzE1LkiRpPEz5CfxJFgCXA28AdgAbk6yvqjt6zb4PXAic\n1e9bVVuB5b3j7ASu7TX5aFVddlAzkCRJGmOjnBk7FdhWVXdX1SPA1cDKfoOq2lNVG4FHD3Cc04C7\nqmqkrwaQJEl6OhgljC0C7u3t7+jKpmsV8JmhsncnuTXJuiRHP4VjSpIkjbU5uYA/yRHAbwJ/2Su+\nAjiZwTLmLuDD++m7OsmmJJv27t0762OVJEmaS1NeM8bgOq/Fvf0TurLpOAO4pap2TxT0t5N8HPji\nZB2r6krgSoAVK1bUNF932pas+dJsv8SM2H7pm1oPQZIkzYBRzoxtBJYmOak7w7UKWD/N1zmboSXK\nJMf3dt8M3D7NY0qSJI29Kc+MVdW+JBcA1wELgHVVtSXJ+V392iTHAZuAZwOPdx9fsayqHkzyTAZ3\nYr5z6NAfTLIcKGD7JPWSJEnz3ijLlFTVBmDDUNna3vZ9DJYvJ+v7I+B5k5SfM62RSpIkzUN+Ar8k\nSVJDhjFJkqSGDGOSJEkNGcYkSZIaMoxJkiQ1ZBiTJElqyDAmSZLUkGFMkiSpIcOYJElSQ4YxSZKk\nhgxjkiRJDRnGJEmSGjKMSZIkNWQYkyRJasgwJkmS1JBhTJIkqSHDmCRJUkOGMUmSpIYMY5IkSQ0Z\nxiRJkhoyjEmSJDVkGJMkSWrIMCZJktSQYUySJKkhw5gkSVJDhjFJkqSGDGOSJEkNGcYkSZIaMoxJ\nkiQ1ZBiTJElqyDAmSZLUkGFMkiSpIcOYJElSQ4YxSZKkhgxjkiRJDY0UxpKcnmRrkm1J1kxSf0qS\nG5M8nOTiobrtSW5LsjnJpl75c5Ncn+S73fPRBz8dSZKk8TJlGEuyALgcOANYBpydZNlQs+8DFwKX\n7ecwr6+q5VW1ole2BrihqpYCN3T7kiRJTyujnBk7FdhWVXdX1SPA1cDKfoOq2lNVG4FHp/HaK4Gr\nuu2rgLOm0VeSJGleGCWMLQLu7e3v6MpGVcBfJ7k5yepe+bFVtavbvg84dhrHlCRJmhcOm4PXeG1V\n7Uzyc8D1Se6sqq/1G1RVJanJOncBbjXAiSeeOPujlSRJmkOjnBnbCSzu7Z/QlY2kqnZ2z3uAaxks\newLsTnI8QPe8Zz/9r6yqFVW1YuHChaO+rCRJ0lgYJYxtBJYmOSnJEcAqYP0oB0/yzCRHTmwDbwRu\n76rXA+d12+cBX5jOwCVJkuaDKZcpq2pfkguA64AFwLqq2pLk/K5+bZLjgE3As4HHk1zE4M7LY4Br\nk0y81qer6svdoS8FrknyDuAe4C0zOzVJkqRD30jXjFXVBmDDUNna3vZ9DJYvhz0IvGQ/x7wfOG3k\nkUqSJM1DfgK/JElSQ4YxSZKkhgxjkiRJDRnGJEmSGjKMSZIkN
WQYkyRJasgwJkmS1JBhTJIkqSHD\nmCRJUkOGMUmSpIYMY5IkSQ0ZxiRJkhoyjEmSJDVkGJMkSWrIMCZJktSQYUySJKkhw5gkSVJDhjFJ\nkqSGDGOSJEkNGcYkSZIaMoxJkiQ1ZBiTJElqyDAmSZLUkGFMkiSpIcOYJElSQ4YxSZKkhgxjkiRJ\nDRnGJEmSGjKMSZIkNWQYkyRJasgwJkmS1JBhTJIkqSHDmCRJUkOGMUmSpIYMY5IkSQ0ZxiRJkhoa\nKYwlOT3J1iTbkqyZpP6UJDcmeTjJxb3yxUm+kuSOJFuSvKdXd0mSnUk2d48zZ2ZKkiRJ4+OwqRok\nWQBcDrwB2AFsTLK+qu7oNfs+cCFw1lD3fcB7q+qWJEcCNye5vtf3o1V12UHPQpIkaUyNcmbsVGBb\nVd1dVY8AVwMr+w2qak9VbQQeHSrfVVW3dNs/AL4DLJqRkUuSJM0Do4SxRcC9vf0dPIVAlWQJ8FLg\nG73idye5Ncm6JEdP95iSJEnjbk4u4E/yLOCzwEVV9WBXfAVwMrAc2AV8eD99VyfZlGTT3r1752K4\nkiRJc2aUMLYTWNzbP6ErG0mSwxkEsU9V1ecmyqtqd1U9VlWPAx9nsBz6JFV1ZVWtqKoVCxcuHPVl\nJUmSxsIoYWwjsDTJSUmOAFYB60c5eJIAfw58p6o+MlR3fG/3zcDtow1ZkiRp/pjybsqq2pfkAuA6\nYAGwrqq2JDm/q1+b5DhgE/Bs4PEkFwHLgBcD5wC3JdncHfL9VbUB+GCS5UAB24F3zuzUJEmSDn1T\nhjGALjxtGCpb29u+j8Hy5bC/BbKfY54z+jAlSZLmJz+BX5IkqSHDmCRJUkOGMUmSpIYMY5IkSQ0Z\nxiRJkhoyjEmSJDVkGJMkSWrIMCZJktSQYUySJKkhw5gkSVJDhjFJkqSGDGOSJEkNGcYkSZIaMoxJ\nkiQ1ZBiTJElqyDAmSZLUkGFMkiSpIcOYJElSQ4YxSZKkhgxjkiRJDRnGJEmSGjKMSZIkNWQYkyRJ\nasgwJkmS1JBhTJIkqSHDmCRJUkOGMUmSpIYMY5IkSQ0ZxiRJkhoyjEmSJDVkGJMkSWrIMCZJktSQ\nYUySJKkhw5gkSVJDhjFJkqSGDGOSJEkNjRTGkpyeZGuSbUnWTFJ/SpIbkzyc5OJR+iZ5bpLrk3y3\nez764KcjSZI0Xg6bqkGSBcDlwBuAHcDGJOur6o5es+8DFwJnTaPvGuCGqrq0C2lrgN+fgTmpZ8ma\nL7UewpS2X/qmkdvOt/lIkjTKmbFTgW1VdXdVPQJcDazsN6iqPVW1EXh0Gn1XAld121cxFOQkSZKe\nDkYJY4uAe3v7O7qyURyo77FVtavbvg84dsRjSpIkzRuHxAX8VVVATVaXZHWSTUk27d27d45HJkmS\nNLtGCWM7gcW9/RO6slEcqO/uJMcDdM97JjtAVV1ZVSuqasXChQtHfFlJkqTxMEoY2wgsTXJSkiOA\nVcD6EY9/oL7rgfO67fOAL4w+bEmSpPlhyrspq2pfkguA64AFwLqq2pLk/K5+bZLjgE3As4HHk1wE\nLKuqByfr2x36UuCaJO8A7gHeMtOTkyRJOtRNGcYAqmoDsGGobG1v+z4GS5Aj9e3K7wdOm85gJUmS\n5ptD4gJ+SZKkpyvDmCRJUkOGMUmSpIYMY5IkSQ0ZxiRJkhoyjEmSJDVkGJMkSWrIMCZJktSQYUyS\nJKkhw5gkSVJDhjFJkqSGDGOSJEkNGcYkSZIaMoxJkiQ1ZBiTJElqyDAmSZLUkGFMkiSpIcOYJElS\nQ4YxSZKkhgxjkiRJDRnGJEmSGjKMSZIkNWQYkyRJasgwJkmS1JBhTJIkqSHDmCRJUkOGMUmSpIYM\nY5IkSQ0ZxiRJkhoyjEmSJDVkGJMkSWrIMCZJktSQYUySJKkhw5gkSVJDhjFJkqSGRgpjSU5PsjXJ\ntiRrJqlPko919bcmeVlX/
oIkm3uPB5Nc1NVdkmRnr+7MmZ2aJEnSoe+wqRokWQBcDrwB2AFsTLK+\nqu7oNTsDWNo9XgFcAbyiqrYCy3vH2Qlc2+v30aq6bCYmIkmSNI5GOTN2KrCtqu6uqkeAq4GVQ21W\nAp+sgZuAo5IcP9TmNOCuqrrnoEctSZI0T4wSxhYB9/b2d3Rl022zCvjMUNm7u2XNdUmOHmEskiRJ\n88qcXMCf5AjgN4G/7BVfAZzMYBlzF/Dh/fRdnWRTkk179+6d9bFKkiTNpVHC2E5gcW//hK5sOm3O\nAG6pqt0TBVW1u6oeq6rHgY8zWA59kqq6sqpWVNWKhQsXjjBcSZKk8TFKGNsILE1yUneGaxWwfqjN\neuDc7q7KVwIPVNWuXv3ZDC1RDl1T9mbg9mmPXpIkacxNeTdlVe1LcgFwHbAAWFdVW5Kc39WvBTYA\nZwLbgIeAt0/0T/JMBndivnPo0B9MshwoYPsk9ZIkSfPelGEMoKo2MAhc/bK1ve0C3rWfvj8CnjdJ\n+TnTGqkkSdI85CfwS5IkNWQYkyRJasgwJkmS1JBhTJIkqSHDmCRJUkOGMUmSpIYMY5IkSQ0ZxiRJ\nkhoyjEmSJDVkGJMkSWrIMCZJktSQYUySJKkhw5gkSVJDhjFJkqSGDGOSJEkNGcYkSZIaMoxJkiQ1\nZBiTJElqyDAmSZLUkGFMkiSpIcOYJElSQ4YxSZKkhgxjkiRJDRnGJEmSGjKMSZIkNWQYkyRJasgw\nJkmS1JBhTJIkqSHDmCRJUkOGMUmSpIYMY5IkSQ0ZxiRJkhoyjEmSJDVkGJMkSWrIMCZJktSQYUyS\nJKmhkcJYktOTbE2yLcmaSeqT5GNd/a1JXtar257ktiSbk2zqlT83yfVJvts9Hz0zU5IkSRofU4ax\nJAuAy4EzgGXA2UmWDTU7A1jaPVYDVwzVv76qllfVil7ZGuCGqloK3NDtS5IkPa2McmbsVGBbVd1d\nVY8AVwMrh9qsBD5ZAzcBRyU5forjrgSu6ravAs6axrglSZLmhVHC2CLg3t7+jq5s1DYF/HWSm5Os\n7rU5tqp2ddv3AceOPGpJkqR54rA5eI3XVtXOJD8HXJ/kzqr6Wr9BVVWSmqxzF+BWA5x44omzP1pJ\nkqQ5NMqZsZ3A4t7+CV3ZSG2qauJ5D3Atg2VPgN0TS5nd857JXryqrqyqFVW1YuHChSMMV5IkaXyM\nEsY2AkuTnJTkCGAVsH6ozXrg3O6uylcCD1TVriTPTHIkQJJnAm8Ebu/1Oa/bPg/4wkHORZIkaexM\nuUxZVfuSXABcBywA1lXVliTnd/VrgQ3AmcA24CHg7V33Y4Frk0y81qer6std3aXANUneAdwDvGXG\nZiVJkjQmRrpmrKo2MAhc/bK1ve0C3jVJv7uBl+znmPcDp01nsJIkSfONn8AvSZLUkGFMkiSpIcOY\nJElSQ4YxSZKkhgxjkiRJDRnGJEmSGjKMSZIkNWQYkyRJasgwJkmS1JBhTJIkqSHDmCRJUkOGMUmS\npIYMY5IkSQ0ZxiRJkhoyjEmSJDVkGJMkSWrIMCZJktSQYUySJKkhw5gkSVJDhjFJkqSGDGOSJEkN\nGcYkSZIaMoxJkiQ1ZBiTJElqyDAmSZLU0GGtByA9nS1Z86XWQ5jS9kvf1HoIkjSveWZMkiSpIcOY\nJElSQ4YxSZKkhgxjkiRJDRnGJEmSGjKMSZIkNWQYkyRJasgwJkmS1JBhTJIkqaGRwliS05NsTbIt\nyZpJ6pPkY139rUle1pUvTvKVJHck2ZLkPb0+lyTZmWRz9zhz5qYlSZI0Hqb8OqQkC4DLgTcAO4CN\nSdZX1R29ZmcAS7vHK4Aruud9wHur6pYkRwI3J7m+1/ejVXXZzE1HkiRpvIxyZuxUYFtV3V1VjwBX\nAyuH2qwEPlkDNwFHJTm+qnZV1S0AVfUD4DvAohkcvyRJ0lgbJYwtAu7t7e/gyYFqyjZJlgA
vBb7R\nK353t6y5LsnRI45ZkiRp3piTC/iTPAv4LHBRVT3YFV8BnAwsB3YBH95P39VJNiXZtHfv3rkYriRJ\n0pwZJYztBBb39k/oykZqk+RwBkHsU1X1uYkGVbW7qh6rqseBjzNYDn2SqrqyqlZU1YqFCxeOMFxJ\nkqTxMUoY2wgsTXJSkiOAVcD6oTbrgXO7uypfCTxQVbuSBPhz4DtV9ZF+hyTH93bfDNz+lGchSZI0\npqa8m7Kq9iW5ALgOWACsq6otSc7v6tcCG4AzgW3AQ8Dbu+6vAc4BbkuyuSt7f1VtAD6YZDlQwHbg\nnTM2K0mSpDExZRgD6MLThqGytb3tAt41Sb+/BbKfY54zrZFKkiTNQ34CvyRJUkOGMUmSpIYMY5Ik\nSQ2NdM2YJE1lyZovtR7CSLZf+qbWQ5CkJzCMSdIkDJeS5orLlJIkSQ0ZxiRJkhoyjEmSJDXkNWOS\n9DQwDtfAef2bnq48MyZJktSQYUySJKkhw5gkSVJDhjFJkqSGDGOSJEkNeTelJGnseHeo5hPPjEmS\nJDVkGJMkSWrIZUpJkhoahyVXcNl1NnlmTJIkqSHPjEmSpBnjmb7p88yYJElSQ4YxSZKkhgxjkiRJ\nDRnGJEmSGjKMSZIkNWQYkyRJasgwJkmS1JBhTJIkqSHDmCRJUkOGMUmSpIYMY5IkSQ0ZxiRJkhoy\njEmSJDVkGJMkSWrIMCZJktSQYUySJKmhkcJYktOTbE2yLcmaSeqT5GNd/a1JXjZV3yTPTXJ9ku92\nz0fPzJQkSZLGx5RhLMkC4HLgDGAZcHaSZUPNzgCWdo/VwBUj9F0D3FBVS4Ebun1JkqSnlVHOjJ0K\nbKuqu6vqEeBqYOVQm5XAJ2vgJuCoJMdP0XclcFW3fRVw1kHORZIkaeyMEsYWAff29nd0ZaO0OVDf\nY6tqV7d9H3DsiGOWJEmaNw5rPQCAqqokNVldktUMlj4Bfphk69yNbMYcA3xvJg+YD8zk0aZtRufT\neC4wv+bjz9oUnM+M8t/OATifGTWu8/n5URqNEsZ2Aot7+yd0ZaO0OfwAfXcnOb6qdnVLmnsme/Gq\nuhK4coRxHrKSbKqqFa3HMVOcz6FrPs0FnM+hbj7NZz7NBZzPuBllmXIjsDTJSUmOAFYB64farAfO\n7e6qfCXwQLcEeaC+64Hzuu3zgC8c5FwkSZLGzpRnxqpqX5ILgOuABcC6qtqS5Pyufi2wATgT2AY8\nBLz9QH27Q18KXJPkHcA9wFtmdGaSJEljYKRrxqpqA4PA1S9b29su4F2j9u3K7wdOm85gx9hYL7NO\nwvkcuubTXMD5HOrm03zm01zA+YyVDHKUJEmSWvDrkCRJkhoyjEljJMlRSf59t/26JF9sPabZlOSH\nrceggf7Png5dSf5P97wkyb9pPR6NxjA2Dd0P94+TbO72T0jyhe77Ne9K8j+6u0Yn2r82yTeT3Nk9\nVnflv57kxqFjH5Zkd5LnJ/lQkvuSXDzH83ksyeYk305yS5JXD7W/KMk/JnlOr+x1SR7o+t2Z5LJe\n3Vu77ySd1cAwPI+ubG2S1yT5RJK/643vPw/1PSbJoxM3pPTKtye5rfuu1b9J8vNd+TO6Yz2S5JjZ\nnNd+HAX4H6Ja8GdvDFTVxO/tJcC8CGNJtrcew2wzjE3fXVW1PEmAzwGf775f8xeBZwF/BJDkOODT\nwPlVdQrwWuCdSd4EfB04YeI/+M6/ALZU1d9X1fuAtcyNu6pqebf946paXlUvAf4j8N+G2p7N4ONK\n/vVQ+de7Y7wU+I0krwGoqr8A/t3sDf0J+vMAeCVwU7f9vq5uOXBekpN67X67a3f2JMd8fVW9GPgq\n8AcAVfXj7lh/P8PjH9WlwC90wfNDwLOS/FUXND/V/VyS5OVdiLw5yXXdZ/lpFiX5r0ku6u3/UZL3\ndG+ubu/C/Vu7uiec1UzyJ0ne1mDY0/GTn71uTk+a17h
K8vnu38qWiTfN46p3NvlS4Fe7v6/fbTkm\nTc0w9tT9c+Afq+p/AlTVY8DvAr+T5GcZ3F36iaq6pav/HvB7wJqqehy4hsHnrk1YBXxmDsc/lWcD\n/zCxk+QXGITNP2Dy4EJV/RjYzJO/LmtOJXkh8H+7v5O+n+mef9QrOxt4L7AoyQn7OeSNNJ5Tzxr+\nKXi+j0EAvghYBpwMvCbJ4cAfA79VVS8H1tG9SdCsWgecC5Dkpxj8m97B4E3ASxi84frQGAfj/s/e\nTcyfeQH8TvdvZQVwYZLntR7QDFhD90a5qj7aejAHaW/rAcw2w9hT9yLg5n5BVT0I/D/gn01WD2zq\nymEQvFYBJPlpBp/T9tlZHO8oJpbg7gT+DPjDXt0qBl/0/nXgBUme9F2iSY4GlgJfm4vBHsAZwJd7\n+x/qziTtAK6uqj0ASRYDx1fVNxmE4/29uz8d+PwsjvdgfLOqdnQBfzODpYkXAL8EXN/N+w8YfPuF\nZlFVbQfuT/JS4I3AtxicEf9MVT1WVbuBvwF+pd0oZ8x8m9eFSb7NIGQuZvB7TIeIqhrnn62RGMYa\nqapNDJaYXsAgPHyjqr7feFgTy5SnMAggn5xY9mJwBunq7j/9zzJY3pvwq90vsp3AdVV135yO+sn+\nJU8MYxPLlMcBp/WuhXsrgxAGg6A5fMbvK0l2Mvj7OZTOWvY93Nt+jMFnB4bBkvfy7vHLVfXGNsN7\n2vkz4G0MPvh63QHa7eOJv39/Zn8NNbuSvI7B2b1XdZdofAv/PjTHDGNP3R3Ay/sFSZ4NnMjgmwie\nVN/tb+ntT5wdO9SWKKmqGxl8MevCJL/M4J3i9d2FlKt4YnD5evdL7EXAO5IsHz7eXOmWiI+qqidd\n01VVP2Rw/ddru6Kzgbd1c1oPvDhJ/x3x6xl8yetm4L/M4rCn4wfAkVO02crg7+1VAEkOT/KiKfpo\nZlzL4I3MrzD45pGvA29NsiDJQuDXgG8y+NaRZUl+OslRjMcHYPd/9vY3r3H0HOAfquqhJKcwuN50\nPhjld4UOEYaxp+4G4GeTTFwjsgD4MIPrxB4CLmfwH/3yrv55wAeAD/aO8Rng3zK4/uyQ+m7O7pfS\nAuB+BqHlkqpa0j2eDzx/6AYEqurvGFw0+vtzPuB/8nrgK5NVJDkMeAVwV5JfBJ5VVYsm5sXghoUn\nnB2rqn0Mrsk6N8lzZ3XkI+i+ueJ/J7mdwQX8k7V5BPgt4APdGcvNwKsna6uZ1f3ZfwW4prtm8Vrg\nVuDbwP8Cfq+q7quqexmclb29e/5WoyGPbOhn71VMMq+W4zsIXwYOS/IdBr+/bpqi/bi4FXgsg7vj\nvYD/EOcn8E9DkiXAF6vql7r9xcCfAqcwCLYbgIur6uGu/tcYBLQjGSwd/fequmLomJuBO6tq1VD5\nJcAPq+oyZskk83kMuG2iGnh/VX0pyd3AmVV1Z6/vR4DdwDcYzPk3uvJnMDgz+Jqq2t4tAfykfrbn\nkeRPgL+qqq92dZ8Afh14ADiCQYi+EPhPwDOqak3vOC8G/qKqXtidLVvR3XhBkj8G9lTVH3b7T6iX\n4CcX7t8C/HZVfbf1eCSNh5G+m1KT697d/qsD1H+NKS5qHfo4hqaqasF+yk+epOw/9Ha/2iv/MW3v\nPHw1g7taAaiqt+2n3ZOWHavqVuCF3faSobp3z9gINS8lWQZ8EbjWICZpOlymnJ7HgOek9+GisyHJ\nhxgsX/5oqrYHaVbn03320J/S+4iMWfKTeVTVy6rq0dl6oXQf+gocDjw+W6+j8VNVd1TVyVX13tZj\nkTReXKaUJElqyDNjkiRJDRnGJEmSGjKMSZIkNWQYkyRJasgwJkmS1ND/B/q+3XHKtZaUAAAAAElF\nTkSuQmCC\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], 
"source": [ "unigram = NGramLM(oov_train,1)\n", "plot_probabilities(unigram)\n", "# sum([unigram.probability(w) for w in unigram.vocab])" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "The unigram LM has substantially reduced (and hence better) perplexity:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "hideCode": false, "hidePrompt": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "91.4414922652717" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "perplexity(unigram,oov_test)" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "Its samples look (a little) more reasonable:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "hideCode": false, "hidePrompt": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "['[OOV]',\n", " '[OOV]',\n", " '[OOV]',\n", " '[OOV]',\n", " '[OOV]',\n", " 'your',\n", " 'ways',\n", " 'is',\n", " '[BAR]',\n", " 'life']" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample(unigram, [], 10)" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "slide" } }, "source": [ "## Bigram Model\n", "We can do better by setting $n=2$" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "hideCode": false, "hidePrompt": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAl0AAAFpCAYAAACmgZ0NAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAGxpJREFUeJzt3Xu4ZXV93/H3xxmJeB0NJxYZ6NB0ohlviFPEWm8YLYPR\n0cZEqGYiMZ3yFLy1NqVJmpg05sE2GsUgU1Q0eMNLvIw4kRgQxQvCjMDAgOhkoGEoDeOTSkOoIvrt\nH+t3mO3xDGcPc+Z3zh7fr+fZz1nrt35rn+9v77XX/uy11947VYUkSZL2r/stdAGSJEk/CQxdkiRJ\nHRi6JEmSOjB0SZIkdWDokiRJ6sDQJUmS1IGhS5IkqQNDlyRJUgeGLkmSpA4MXZIkSR0sXegCZnPI\nIYfUihUrFroMSZKkOW3ZsuXbVTU1V79FGbpWrFjB5s2bF7oMSZKkOSX5n+P08+1FSZKkDgxdkiRJ\nHRi6JEmSOjB0SZIkdWDokiRJ6sDQJUmS1IGhS5IkqQNDlyRJUgeGLkmSpA4MXZIkSR0YuiRJkjow\ndEmSJHVg6JIkSepg6UIXsFBWnP6ZhS5hTjed8fyFLkGSJM0Tj3RJkiR1YOiSJEnqwNAlSZLUgaFL\nkiSpA0OXJElSB4YuSZKkDgxdkiRJHRi6JEmSOjB0SZIkdWDokiRJ6sDQJUmS1IGhS5IkqQNDlyRJ\nUgeGLkmSpA4MXZIkSR0YuiRJkjowdEmSJHVg6JIkSerA0CVJktSBoUuSJKkDQ5ckSVIHhi5JkqQO\nDF2SJEkdjBW6khyf5IYk25OcPsvyJDmzLd+a5OiRZcuSfCzJN5Jcn+Sp8zkASZKkSTBn6EqyBDgL\nWAOsAk5KsmpGtzXAynZZD5w9suxtwGer6jHAE4Hr56FuSZKkiTLOka5jgO1VtaOq7gLOB9bO6LMW\nOK8GlwHLkhya5GHAM4B3A1TVXVX1nXmsX5IkaSKME7oOA24emd/Z2sbpcySwC3hPkiuTvCvJg/ah\nXkmSpIm0v0+kXwocDZxdVU8C/gH4sXPCAJKsT7I5yeZdu3bt57IkSZL6Gid03QIcPjK/vLWN02cn\nsLOqvtbaP8YQwn5MVZ1TVauravXU1NQ4tUuSJE2McULXFcDKJEcmOQg4Edg4o89GYF37FOOxwO1V\ndWtV/W/g5iSPbv2eA1w3X8VLkiRNiqVzdaiqu5OcBlwILAHOraptSU5pyzcAm4ATgO3AncDJI1fx\nKuADLbDtmLFMkiTpJ8KcoQugqjYxBKvRtg0j0wWcuod1rwJW70ONkiRJE89vpJckSerA0CVJktSB\noUuSJKkDQ5ckSVIHhi5JkqQODF2SJEkdGLokSZI6MHRJkiR1YOiSJEnqwNAlSZLUgaFLkiSpA0OX\nJElSB4YuSZKkDgxdkiRJHRi6JEmSOjB0SZIkdWDokiRJ6sDQJUmS1IGhS5IkqQNDlyRJUgeGLkmS\npA4MXZIkSR0YuiRJkjowdEmSJHVg6JIkSerA0CVJktSBoUuSJKkDQ5ckSVIHhi5JkqQODF2SJEkd\nGLokSZI6MHRJkiR1YOiSJEnqwNAlSZLUgaFLkiSpA0OXJElSB4YuSZKkDgxdkiRJHYwVupIcn+SG\nJNuTnD7L8iQ5sy3fmuTokWU3JbkmyVVJNs9n8ZIkSZNi6VwdkiwBzgKeC+wErkiysaquG+m2BljZ\nLk8Bzm5/pz27qr49b1VLkiRNmHGOdB0DbK+qHVV1F3A+sHZGn7XAeTW4DFiW5NB5rlWSJGlijRO6\nDgNuHpnf2drG7VPAXyXZkmT9fS1UkiRpks359uI8+BdVdUuSnwE+l+QbVfXFmZ1aIFsPcMQRR3Qo\nS5IkqZ9xjnTdAhw+Mr+8tY3Vp6qm/94GfILh7cofU1XnVNXqqlo9NTU1XvWSJEkTYpzQdQWwMsmR\nSQ4CTgQ2zuizEVjXPsV4LHB7Vd2a5EFJHgKQ5EHA84Br57F+S
ZKkiTDn24tVdXeS04ALgSXAuVW1\nLckpbfkGYBNwArAduBM4ua3+SOATSab/1wer6rPzPgpJkqRFbqxzuqpqE0OwGm3bMDJdwKmzrLcD\neOI+1ihJkjTx/EZ6SZKkDgxdkiRJHRi6JEmSOjB0SZIkdWDokiRJ6sDQJUmS1IGhS5IkqQNDlyRJ\nUgeGLkmSpA4MXZIkSR0YuiRJkjowdEmSJHVg6JIkSerA0CVJktSBoUuSJKkDQ5ckSVIHhi5JkqQO\nDF2SJEkdGLokSZI6MHRJkiR1YOiSJEnqwNAlSZLUgaFLkiSpA0OXJElSB4YuSZKkDgxdkiRJHRi6\nJEmSOjB0SZIkdWDokiRJ6sDQJUmS1IGhS5IkqQNDlyRJUgeGLkmSpA4MXZIkSR0YuiRJkjowdEmS\nJHVg6JIkSerA0CVJktTBWKEryfFJbkiyPcnpsyxPkjPb8q1Jjp6xfEmSK5NcMF+FS5IkTZI5Q1eS\nJcBZwBpgFXBSklUzuq0BVrbLeuDsGctfA1y/z9VKkiRNqHGOdB0DbK+qHVV1F3A+sHZGn7XAeTW4\nDFiW5FCAJMuB5wPvmse6JUmSJso4oesw4OaR+Z2tbdw+bwV+E/jhfaxRkiRp4u3XE+mT/CJwW1Vt\nGaPv+iSbk2zetWvX/ixLkiSpu3FC1y3A4SPzy1vbOH2eBrwwyU0Mb0sel+T9s/2TqjqnqlZX1eqp\nqakxy5ckSZoM44SuK4CVSY5MchBwIrBxRp+NwLr2KcZjgdur6taq+s9VtbyqVrT1Lq6ql8/nACRJ\nkibB0rk6VNXdSU4DLgSWAOdW1bYkp7TlG4BNwAnAduBO4OT9V7IkSdLkmTN0AVTVJoZgNdq2YWS6\ngFPnuI5LgEv2ukJJkqQDgN9IL0mS1IGhS5IkqQNDlyRJUgeGLkmSpA4MXZIkSR0YuiRJkjowdEmS\nJHVg6JIkSerA0CVJktSBoUuSJKkDQ5ckSVIHhi5JkqQODF2SJEkdGLokSZI6MHRJkiR1YOiSJEnq\nwNAlSZLUgaFLkiSpA0OXJElSB4YuSZKkDgxdkiRJHRi6JEmSOjB0SZIkdWDokiRJ6sDQJUmS1IGh\nS5IkqQNDlyRJUgeGLkmSpA4MXZIkSR0YuiRJkjowdEmSJHVg6JIkSerA0CVJktSBoUuSJKkDQ5ck\nSVIHhi5JkqQODF2SJEkdGLokSZI6MHRJkiR1MFboSnJ8khuSbE9y+izLk+TMtnxrkqNb+wOSXJ7k\n6iTbkvz+fA9AkiRpEswZupIsAc4C1gCrgJOSrJrRbQ2wsl3WA2e39u8Bx1XVE4GjgOOTHDtPtUuS\nJE2McY50HQNsr6odVXUXcD6wdkaftcB5NbgMWJbk0DZ/R+tz/3ap+SpekiRpUowTug4Dbh6Z39na\nxuqTZEmSq4DbgM9V1ddm+ydJ1ifZnGTzrl27xq1fkiRpIuz3E+mr6gdVdRSwHDgmyeP20O+cqlpd\nVaunpqb2d1mSJEldjRO6bgEOH5lf3tr2qk9VfQf4PHD83pcpSZI02cYJXVcAK5McmeQg4ERg44w+\nG4F17VOMxwK3V9WtSaaSLANIcjDwXOAb81i/JEnSRFg6V4equjvJacCFwBLg3KraluSUtnwDsAk4\nAdgO3Amc3FY/FPiz9gnI+wEfqaoL5n8YkiRJi9ucoQugqjYxBKvRtg0j0wWcOst6W4En7WONkiRJ\nE89vpJckSerA0CVJktSBoUuSJKkDQ5ckSVIHhi5JkqQODF2SJEkdGLokSZI6MHRJkiR1YOiSJEnq\nwNAlSZLUgaFLkiSpA0OXJElSB4YuSZKkDgxdkiRJHRi6JEmSOjB0SZIkdWDokiRJ6sDQJUmS1IGh\nS5IkqQNDlyRJUgeGLkmSpA4MXZIkSR0YuiRJkjowdEmSJHVg6JIkSerA0CVJktSBoUuSJKkDQ5ck\nSVIHhi5JkqQODF2SJEkdG
LokSZI6MHRJkiR1YOiSJEnqwNAlSZLUgaFLkiSpA0OXJElSB4YuSZKk\nDgxdkiRJHYwVupIcn+SGJNuTnD7L8iQ5sy3fmuTo1n54ks8nuS7JtiSvme8BSJIkTYI5Q1eSJcBZ\nwBpgFXBSklUzuq0BVrbLeuDs1n438B+qahVwLHDqLOtKkiQd8MY50nUMsL2qdlTVXcD5wNoZfdYC\n59XgMmBZkkOr6taq+jpAVf09cD1w2DzWL0mSNBHGCV2HATePzO/kx4PTnH2SrACeBHxtb4uUJEma\ndF1OpE/yYODPgddW1f/dQ5/1STYn2bxr164eZUmSJHUzTui6BTh8ZH55axurT5L7MwSuD1TVx/f0\nT6rqnKpaXVWrp6amxqldkiRpYowTuq4AViY5MslBwInAxhl9NgLr2qcYjwVur6pbkwR4N3B9Vb1l\nXiuXJEmaIEvn6lBVdyc5DbgQWAKcW1XbkpzSlm8ANgEnANuBO4GT2+pPA34VuCbJVa3tt6pq0/wO\nQ5IkaXGbM3QBtJC0aUbbhpHpAk6dZb0vAdnHGiVJkiae30gvSZLUgaFLkiSpA0OXJElSB4YuSZKk\nDgxdkiRJHRi6JEmSOjB0SZIkdWDokiRJ6sDQJUmS1IGhS5IkqQNDlyRJUgeGLkmSpA4MXZIkSR0Y\nuiRJkjowdEmSJHVg6JIkSerA0CVJktSBoUuSJKkDQ5ckSVIHhi5JkqQODF2SJEkdGLokSZI6MHRJ\nkiR1YOiSJEnqwNAlSZLUgaFLkiSpA0OXJElSB4YuSZKkDgxdkiRJHRi6JEmSOjB0SZIkdWDokiRJ\n6sDQJUmS1IGhS5IkqQNDlyRJUgeGLkmSpA4MXZIkSR0YuiRJkjoYK3QlOT7JDUm2Jzl9luVJcmZb\nvjXJ0SPLzk1yW5Jr57NwSZKkSTJn6EqyBDgLWAOsAk5KsmpGtzXAynZZD5w9suy9wPHzUawkSdKk\nGudI1zHA9qraUVV3AecDa2f0WQucV4PLgGVJDgWoqi8CfzefRUuSJE2acULXYcDNI/M7W9ve9pEk\nSfqJtWhOpE+yPsnmJJt37dq10OVIkiTNq3FC1y3A4SPzy1vb3va5V1V1TlWtrqrVU1NTe7OqJEnS\nojdO6LoCWJnkyCQHAScCG2f02Qisa59iPBa4vapunedaJUmSJtacoauq7gZOAy4Ergc+UlXbkpyS\n5JTWbROwA9gOvBP4d9PrJ/kQ8FXg0Ul2JnnlPI9BkiRp0Vs6Tqeq2sQQrEbbNoxMF3DqHtY9aV8K\nlCRJOhAsmhPpJUmSDmSGLkmSpA4MXZIkSR0YuiRJkjowdEmSJHVg6JIkSerA0CVJktSBoUuSJKkD\nQ5ckSVIHhi5JkqQODF2SJEkdGLokSZI6MHRJkiR1YOiSJEnqwNAlSZLUgaFLkiSpA0OXJElSB4Yu\nSZKkDgxdkiRJHRi6JEmSOjB0SZIkdWDokiRJ6sDQJUmS1IGhS5IkqQNDlyRJUgeGLkmSpA4MXZIk\nSR0YuiRJkjowdEmSJHVg6JIkSerA0CVJktSBoUuSJKkDQ5ckSVIHhi5JkqQODF2SJEkdGLokSZI6\nMHRJkiR1YOiSJEnqYOk4nZIcD7wNWAK8q6rOmLE8bfkJwJ3AK6rq6+Osq/mx4vTPLHQJc7rpjOeP\n1W8SxgKOZzEbdyzgeBbCgbStwYE1np/Uba2XOY90JVkCnAWsAVYBJyVZNaPbGmBlu6wHzt6LdSVJ\nkg5447y9eAywvap2VNVdwPnA2hl91gLn1eAyYFmSQ8dcV5Ik6YA3Tug6DLh5ZH5naxunzzjrSpIk\nHfDGOqerhyTrGd6aBLgjyQ0LWc99dAjw7fm6srxpvq7pPpnXsYDjmWeO514cSGMBxzPPHM+9WOCx\nwOSO5x+P02mc0HULcPjI/PLWNk6f+4+xLgBVdQ5wzhj1LFpJNlfV6oWuYz4cSGMBx7PYHUj
jOZDG\nAo5nsXM8k2WctxevAFYmOTLJQcCJwMYZfTYC6zI4Fri9qm4dc11JkqQD3pxHuqrq7iSnARcyfO3D\nuVW1LckpbfkGYBPD10VsZ/jKiJPvbd39MhJJkqRFbKxzuqpqE0OwGm3bMDJdwKnjrnsAm+i3R2c4\nkMYCjmexO5DGcyCNBRzPYud4JkiGvCRJkqT9yZ8BkiRJ6sDQNaYky5N8Ksm3kvx1krclOSjJK5L8\n6Yy+lyRZneRrSa5K8jdJdrXpq5KsWJhRjCfJpiTLFroO7ZbkpiSHLHQdeyvJ55P8yxltr01y9kLV\ndF8luWOWtlOSrJtjvXct5l/iSPKGJK9f6DruqyQrklw7S/slSfb6U3CTfntMS/KsJP98oevYV0ke\nleRjbfqoJCcsdE37wtA1hvbbkh8HPllVK4GfAx4MvPHe1quqp1TVUcDvAh+uqqPa5ab9XfO+qKoT\nquo7C12HDggfYvjU8qgTW/vEq6oNVXXeHH1+o6qu61WT1DwLWHShK8lefT9oVf2vqnpJmz2K4UN7\nE8vQNZ7jgO9W1XsAquoHwOuAXwceuJCF7askn0yyJcm29gW19xxVaa8gr0/yzrb8L5McvNA1AyT5\nj0le3ab/JMnFbfq4JB9IcnaSza3u3x9Z74wk1yXZmuSPW9t7k5yZ5CtJdiR5yez/tY8kL09yeTsq\n+j/ab5hOL3tQks8kuTrJtUle2tqfnOQL7b68sP0M12LwMeD57StjaEd5HwVcmeSiJF9Pck2StdPL\nF+s2N5vpoyJJHpPk8pH2FUmuadP3HHFJckeSN7b777Ikj1ygun87yTeTfAl4dGv72SSfbdvQpUke\n09pnfXwkOS/Ji0au8wPT9+MCWNr+//VJPpbkR/bLSU5q29m1ye6vy0xyfNsGr05y0cwrTfJvkvzF\nQmyDSda1/dTVSd7XtqmLW9tFSY5o/V6Q4V2VK5P8VZJHtsfZKcDr2n7k6fNc24ok35h5myf53SRX\ntNv5nCRp/S9J8tYkm4HXzFZz6/fM7H5H6MokD2n/69q2D/kD4KVt+Uvb/vDctr+8cmQ/8tiRfejW\nJCvnc/z7pKq8zHEBXg38ySztV7Zlfzqj/RJg9cj8K2b2WSwX4BHt78HAtcBPAzcxfCvwCuBu4KjW\n5yPAyxe65lbLscBH2/SlwOUMX8b7e8C/HRnXknZ/PKGN7QZ2f4BkWfv7XuCjDC9CVjH8XuhCjevn\ngU8D92/z7wDWjdwnvwS8c6T/w9q4vwJMtbaXMnw9y4LfT62eC4C1bfp04I8ZPjn90NZ2CMPXzWSR\nb3N3zNL2BuD1bfoq4Mg2/Z+A32nT9+wPgAJe0Kb/23SfzuN4MnANwwvGh7bb/vXARcDK1ucpwMVt\netbHB/BMhqP/09vhjcDSBRjPina7Pq3Nn9vGcwmwmiHk/w0w1ba7i4EXtfmbR+6z6X3GG9r6pwGf\nAn5qAcb0WOCbwCHTtbX9wq+1+V8fue0fzu592m8Ab565bXa8zR8x0ud9I9v6JcA7RpbtqeZPj1zn\ng9v9tQK4trW9gpHnUuCPaPsHYFm7zR4EvB14WWs/CDi49324p8ui+RmgCfbwPbRPysdCX53kxW36\ncGDmK4Ibq+qqNr2F4QGwGGwBnpzkocD3gK8z7GCfzhCEfyXDkbulwKEMTxbXAd8F3p3kAoYwMO2T\nVfVD4LqFOvrQPIfhSfGK9iLxYOC2keXXAG9ur9YvqKpLkzwOeBzwubbOEuDWrlXfu+m3GD/V/r6S\nIWD9UZJnAD9k+E3W6dt9sW5zc/kIQ+A9o/196Sx97mL3drcFeG6f0n7E04FPVNWdAEk2Ag9geCvq\no20bAvipkXV+7PFRVV9I8o4kUwwvBv68qu7uNYgZbq6qL7fp9zPsA6b9M+CSqtoFwxE54BnAD4Av\nVtWNAFX1dyPrrGMIZC+qqu/v7+JncRzDi8pvT9eW5Kn
Av2rL38cQ2mH4pZcPZzi6fRBD+O1httv8\nxiS/yRDoHwFsYwhSAB8eWXdPNX8ZeEu7jz5eVTtHtsfZPA94YXafg/cA4Ajgq8BvJ1nerudb+zDO\neeXbi+O5juGJ8B7tyf4IhqNdM4PXI5jn3/baH5I8C/gF4KlV9USGsTxgRrfvjUz/gEXye51tR3gj\nwyufrzAc7Xo28E+B/8fwqus5VfUE4DPAA9oTwjEMb3n9IvDZkascHee9Psr3swB/VrvP/3t0Vb1h\nemFVfRM4miF8/WGS323rbBtZ5/FV9bwFqX52nwKek+Ro4IFVtQV4GcORhifXcN7j37J721uU29wY\nPswQ9n+O4esLZ9vRf7/ay28W19juB3xnZBs6qqp+fmT5nh4f5wEvZ/hC7HM71LknM1/k7uuL3msY\nwv7yfbyeHt7OcPTn8QxH+Wfuw/eX2W7zdwAvabW8c0Yt/zAyPWvNVXUGw5Gvg4EvT7/FfS8C/NLI\nNntEVV1fVR8EXsjwXLApyXH3bYjzz9A1nouAB6Z9SinDOTZvZjjs/jXgaUn+UVu2muEV4s0LU+pe\neRjwf6rqzrZxH7vQBe2lSxnC1Rfb9CkMwfGhDA/w29ur8jUASR4MPKyGL+x9HfDEhSh6DhcBL0ny\nMwBJHpHknh9STfIo4M6qej/w3xkC2A3AVHslTJL7J3ls/9JnV1V3AJ9neFKePoH+YcBtVfX9JM9m\nzB+LXcyq6q8ZgtR/4Udf1S82XwRelOTgJA8BXsDwSyI3JvllGD48lGScx8d7gdcC1MJ+WOCI6e0f\n+NfAl0aWXQ48M8N5qkuAk4AvAJcBz0hyJAyPtZF1rmQIAxvbY663i4FfTvLTI7V9hd0fSnkZwz4P\nhsfS9G8a/9rIdfw98JD9WOOebvNvt33tvZ0bO2vNSX62qq6pqjcx/IzgzNA1c0wXAq8aOXfsSe3v\nPwF2VNWZDC/6nrC3g9tfDF1jaK9MX8zwIPgWw/vG3wV+q6r+FngNQ5q+CngrcFI7FL/YfZbhBNTr\nGd4SuWyB69lblzK8dfjVdj98F7i0qq5m2Gl+A/ggwyFrGB6sFyTZyrCD+Pf9S7537Ynrd4C/bHV+\njmGM0x4PXN62td8D/rCq7mLYwb0pydUM5xYttk8tfYgh5E6Hrg8AqzOcbL6O4b5a7B6YZOfIZbbt\n58MMR34+0rm2sVXV1xnqvBr4C4YnNxieyF/ZtqFtwJwnxbfH3fXAe/ZPtWO7ATi17cseDtzzlSQ1\n/A7w6QzB/2pgS1V9qr3duB74eBvzjwTlqvoSw4u6z6Tz17XU8HN5bwS+0Gp7C/Aq4OS2X/hVhucd\nGM7d+miSLfzoOyyfBl6c/XAifTPbbf5OhnODL2T3djWbPdX82nbS/Fbg+wzb56jPA6umT6QH/ivD\nOa1bk2xr8wC/Alzb9pOPYzgiuyj4jfSSpPskw6cErwGOrqrbF7oe9ZHh05EXVNXjFriUieORLknS\nXkvyCwxHud5u4JLG45EuSZKkDjzSJUmS1IGhS5IkqQNDlyRJUgeGLkmSpA4MXZIkSR0YuiRJkjr4\n/72u84J4wcWvAAAAAElFTkSuQmCC\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "bigram = NGramLM(oov_train,2)\n", "plot_probabilities(laplace_bigram, (\"FIND\",)) #I, FIND, laplace .. 
" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "Samples should look (slightly) more fluent:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "hideCode": false, "hidePrompt": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "'FIND coming music ten through [/BAR] [BAR] [/BAR] [BAR] stroke roam erased Maintain talk another same stay Listening commercial PE-DA-DEE-DA-DEE-DA-DEE-DA-DEECE Shit school passed pain great Singing sayin Hey PE album mathematics'" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\" \".join(sample(laplace_bigram, ['FIND'], 30)) # try: I, FIND" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "How about perplexity?" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "hideCode": false, "hidePrompt": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "inf" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "perplexity(bigram,oov_test)" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "Some contexts where OOV word (and others) haven't been seen, hence 0 probability..." 
] }, { "cell_type": "code", "execution_count": 21, "metadata": { "hideCode": false, "hidePrompt": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "0.0" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bigram.probability(\"[OOV]\",\"money\")" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "slide" } }, "source": [ "## Smoothing\n", "\n", "Maximum likelihood \n", "* **underestimates** the true probability of some words \n", "* **overestimates** the probabilities of others\n", "\n", "Solution: _smooth_ the probabilities and **move mass** from seen to unseen events." ] }, { "cell_type": "raw", "metadata": { "hideCode": false, "hidePrompt": false, "hide_egal": true, "is_egal": true, "slideshow": { "slide_type": "subslide" } }, "source": [ "Figure: probability mass per count class, MLE vs. smoothing" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "### Laplace Smoothing\n", "\n", "Add **pseudo counts** to each event in the dataset \n", "\n", "$$\n", "\\param^{\\alpha}_{w,h} = \\frac{\\counts{\\train}{h,w} + \\alpha}{\\counts{\\train}{h} + \\alpha \\lvert V \\rvert } \n", "$$\n", "\n", "Bayesian view: *maximum a posteriori* estimate under a Dirichlet prior on the parameters." 
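] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "fragment" } }, "source": [ "A quick check that the smoothed parameters still sum to one for each history, assuming $\\counts{\\train}{h} = \\sum_{w} \\counts{\\train}{h,w}$ and that the sum runs over all $\\lvert V \\rvert$ words:\n", "\n", "$$\n", "\\sum_{w \\in V} \\frac{\\counts{\\train}{h,w} + \\alpha}{\\counts{\\train}{h} + \\alpha \\lvert V \\rvert} = \\frac{\\sum_{w \\in V} \\counts{\\train}{h,w} + \\alpha \\lvert V \\rvert}{\\counts{\\train}{h} + \\alpha \\lvert V \\rvert} = 1\n", "$$\n", "\n", "With $\\alpha = 0$ the estimate reduces to the maximum likelihood one." 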
] }, { "cell_type": "code", "execution_count": 22, "metadata": { "hideCode": false, "hidePrompt": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "0.0007692307692307692" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "laplace_bigram = LaplaceLM(bigram, 0.1) \n", "laplace_bigram.probability(\"[OOV]\",\"money\")" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "Perplexity should look better now:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "hideCode": false, "hidePrompt": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "93.79512094464789" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "perplexity(LaplaceLM(bigram, 0.001),oov_test)" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "slide" } }, "source": [ "### Example\n", "Consider three events:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
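<table><thead><tr><th>word</th><th>train count</th><th>MLE</th><th>Laplace</th><th>Same Denominator</th></tr></thead><tbody><tr><td>smally</td><td>0</td><td>0/3</td><td>1/6</td><td>0.5/3</td></tr><tr><td>bigly</td><td>1</td><td>1/3</td><td>2/6</td><td>1/3</td></tr><tr><td>tremendously</td><td>2</td><td>2/3</td><td>3/6</td><td>1.5/3</td></tr></tbody></table>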
" ], "text/plain": [ "" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = [\"word\", \"train count\", \"MLE\", \"Laplace\", \"Same Denominator\"]\n", "r1 = [\"smally\", \"0\", \"0/3\", \"1/6\", \"0.5/3\"]\n", "r2 = [\"bigly\", \"1\", \"1/3\", \"2/6\", \"1/3\"]\n", "r3 = [\"tremendously\", \"2\", \"2/3\", \"3/6\", \"1.5/3\"]\n", "util.Table([r1,r2,r3], column_names=c)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "How is mass moved for Laplace Smoothing?" ] }, { "cell_type": "raw", "metadata": { "hide_egal": true, "is_egal": true, "slideshow": { "slide_type": "fragment" } }, "source": [ "|Created with SnapCount=1|=1=2Count=2probability for count class in Maximum LikelihoodLaplace=0" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "Events with higher counts get penalised more!\n", "\n", "Is this consistent with how counts behave on an unseen test?" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "hideCode": true, "hidePrompt": false, "run_control": { "frozen": false, "read_only": false }, "scrolled": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
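<table><thead><tr><th>Train Count</th><th>Test Count</th><th>Laplace Count</th></tr></thead><tbody><tr><td>0</td><td>0.0034</td><td>0.0050</td></tr><tr><td>1</td><td>0.4375</td><td>0.3019</td></tr><tr><td>2</td><td>1.1152</td><td>0.7546</td></tr><tr><td>3</td><td>1.6615</td><td>1.2125</td></tr><tr><td>4</td><td>2.6753</td><td>1.5728</td></tr><tr><td>5</td><td>4.1019</td><td>2.2806</td></tr><tr><td>6</td><td>4.6883</td><td>2.5625</td></tr><tr><td>7</td><td>5.3929</td><td>3.5774</td></tr></tbody></table>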
" ], "text/plain": [ "" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "util.Table(frame, column_names = [\"Train Count\", \"Test Count\", \"Laplace Count\"], number_format=\"{0:.4f}\")" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "fragment" } }, "source": [ "Penalty is closer to being constant than increasing by count:\n", "* Test counts usually between 0.6 and 1.4 smaller than train counts\n", "* In larger datasets this can be a constant!" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "So \"real\" re-allocation looks more like this:" ] }, { "cell_type": "raw", "metadata": { "hide_egal": true, "is_egal": true, "slideshow": { "slide_type": "fragment" } }, "source": [ "|Created with SnapCount=1||=1=2Count=2Maximum LikelihoodTest Set Reallocation=0" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "slide" } }, "source": [ "### Interpolation\n", "* Laplace Smoothing assigns mass **uniformly** to the words that haven't been seen in a context." 
] }, { "cell_type": "code", "execution_count": 26, "metadata": { "hideCode": false, "hidePrompt": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "(0.0005649717514124294, 0.0005649717514124294)" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "laplace_bigram.probability('rhyme','man'), \\\n", "laplace_bigram.probability('of','man')" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "Not all unseen words (in a context) are equal" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "With **interpolation** we can do better: \n", "* Give more mass to words that are likely under the $(n-1)$-gram model. \n", " * Use $\\prob(\\text{of})$ for estimating $\\prob(\\text{of} | \\text{man})$\n", "* Combine an $n$-gram model \\\\(p'\\\\) and a back-off \\\\((n-1)\\\\)-gram model \\\\(p''\\\\): \n", "\n", "$$\n", "\\prob_{\\alpha}(w_i|w_{i-n+1},\\ldots,w_{i-1}) = \\alpha \\cdot \\prob'(w_i|w_{i-n+1},\\ldots,w_{i-1}) + \\\\ (1 - \\alpha) \\cdot \\prob''(w_i|w_{i-n+2},\\ldots,w_{i-1})\n", "$$\n" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "hideCode": false, "hidePrompt": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "(0.001156307129798903, 0.007390310786106033)" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "interpolated = InterpolatedLM(bigram,unigram,0.01)\n", "interpolated.probability('rhyme','man'), \\\n", "interpolated.probability('of','man')" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "Can we find a good 
$\\alpha$ parameter? Tune on some **development set**!" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "hideCode": false, "hidePrompt": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "[Figure: perplexity of InterpolatedLM(bigram, unigram, alpha) on oov_test, plotted against alpha from 0 to 1]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "alphas = np.arange(0,1.1,0.1)\n", "perplexities = [perplexity(InterpolatedLM(bigram,unigram,alpha),oov_test) \n", " for alpha in alphas]\n", "plt.plot(alphas,perplexities)" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "slide" } }, "source": [ "### Backoff \n", "* When we have counts for an event, trust these 
counts and not the simpler model\n", " * Use $\\prob(\\text{bigly}|\\text{win})$ if you have seen $(\\text{win, bigly})$, not $\\prob(\\text{bigly})$\n", "* **Back off** only when no counts for a given event are available." ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "### Stupid Backoff\n", "Let \\\\(w\\\\) be a word and \\\\(h_{m}\\\\) an n-gram of length \\\\(m\\\\): \n", "\n", "$$\n", "\\prob_{\\mbox{Stupid}}(w|h_{m}) = \n", "\\begin{cases}\n", "\\frac{\\counts{\\train}{h_{m},w}}{\\counts{\\train}{h_{m}}} & \\mbox{if }\\counts{\\train}{h_{m},w} > 0 \\\\\\\\\n", "\\prob_{\\mbox{Stupid}}(w|h_{m-1}) & \\mbox{otherwise}\n", "\\end{cases}\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "What is the problem with this model?" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "hideCode": false, "hidePrompt": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "1.0647115579930924" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stupid = StupidBackoff(bigram, unigram, 0.1)\n", "sum([stupid.probability(word, 'the') for word in stupid.vocab])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Absolute Discounting\n", "Recall that on test data, a roughly constant amount is taken away from each non-zero count. Can this be captured in a smoothing algorithm?"
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Yes: subtract a (tunable) constant $d$ from each non-zero count:\n", "\n", "$$\n", "\\prob_{\\mbox{Absolute}}(w|h_{m}) = \n", "\\begin{cases}\n", "\\frac{\\counts{\\train}{h_{m},w}-d}{\\counts{\\train}{h_{m}}} & \\mbox{if }\\counts{\\train}{h_{m},w} > 0 \\\\\\\\\n", "\\alpha(h_{m-1})\\cdot\\prob_{\\mbox{Absolute}}(w|h_{m-1}) & \\mbox{otherwise}\n", "\\end{cases}\n", "$$\n", "\n", "$\\alpha(h_{m-1})$ is a normalizer, chosen so that the probabilities sum to one" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Unigram Backoff\n", "\n", "Assume, for example:\n", "* *Mos Def* is a rapper name that appears often in the data\n", "* *glasses* appears slightly less often\n", "* neither *Def* nor *glasses* has been seen in the context of the word *reading*" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Then the final-backoff unigram model might assign a higher probability to\n", "\n", "> I can't see without my reading Def\n", "\n", "than\n", "\n", "> I can't see without my reading glasses\n", "\n", "because $\\prob(\\text{Def}) > \\prob(\\text{glasses})$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "But *Def* never follows anything but *Mos*, and we can determine this by looking at the training data!"
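, "\n", "\n", "As a rough illustration with made-up counts (hypothetical, not taken from the training data): suppose *Def* occurs 50 times but only ever after *Mos*, while *glasses* occurs 30 times after 12 different words. Counting distinct preceding words instead of raw occurrences reverses the ranking:\n", "\n", "$$\n", "\\left|\\{w_{-1}:\\counts{\\train}{w_{-1},\\text{Def}} > 0\\}\\right| = 1\n", "\\quad < \\quad\n", "\\left|\\{w_{-1}:\\counts{\\train}{w_{-1},\\text{glasses}} > 0\\}\\right| = 12\n", "$$"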
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Kneser-Ney Smoothing\n", "\n", "Absolute Discounting, but the final backoff probability of a word is proportional to the number of distinct words it follows in the training set: \n", "\n", "$$\n", "\\prob_{\\mbox{KN}}(w) = \\frac{\\left|\\{w_{-1}:\\counts{\\train}{w_{-1},w} > 0\\} \\right|}\n", "{\\sum_{w'}\\left|\\{w_{-1}:\\counts{\\train}{w_{-1},w'} > 0\\} \\right|} \n", "$$\n", "\n", "This is the *continuation probability*" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "slide" } }, "source": [ "## Summary\n", "\n", "* LMs model probability of sequences of words \n", "* Defined in terms of \"next-word\" distributions conditioned on history\n", "* N-gram models truncate history representation\n", "* Often trained by maximizing log-likelihood of training data and ...\n", "* smoothing to deal with sparsity\n" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "slide" } }, "source": [ "## Background Reading\n", "\n", "* Jurafsky & Martin, Speech and Language Processing: Chapter 4, N-Grams.\n", "* Bill MacCartney, Stanford NLP Lunch Tutorial: [Smoothing](http://nlp.stanford.edu/~wcmac/papers/20050421-smoothing-tutorial.pdf)" ] } ], "metadata": { "celltoolbar": "Slideshow", "hide_code_all_hidden": false, "hide_input": false, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.2" } }, "nbformat": 4, "nbformat_minor": 1 }