{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "hideCode": false, "hidePrompt": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "%%capture\n", "%load_ext autoreload\n", "%autoreload 2\n", "# %cd ..\n", "import sys\n", "sys.path.append(\"..\")\n", "import statnlpbook.util as util\n", "util.execute_notebook('language_models.ipynb')\n", "# import tikzmagic\n", "%load_ext tikzmagic\n", "matplotlib.rcParams['figure.figsize'] = (10.0, 6.0)" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "skip" } }, "source": [ "\n", "$$\n", "\\newcommand{\\prob}{p}\n", "\\newcommand{\\x}{\\mathbf{x}}\n", "\\newcommand{\\vocab}{V}\n", "\\newcommand{\\params}{\\boldsymbol{\\theta}}\n", "\\newcommand{\\param}{\\theta}\n", "\\DeclareMathOperator{\\perplexity}{PP}\n", "\\DeclareMathOperator{\\argmax}{argmax}\n", "\\newcommand{\\train}{\\mathcal{D}}\n", "\\newcommand{\\counts}[2]{\\#_{#1}(#2) }\n", "$$" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "from IPython.display import Image\n", "import random" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [ { "data": { "text/html": [ "\n", "
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "slide" } }, "source": [ "# Language Modelling\n" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "## Language Models\n", "\n", "calculate the **probability of seeing a sequence of words**. " ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "fragment" } }, "source": [ "What is the most likely next word?\n", "\n", "> We're going to ..." ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "fragment" } }, "source": [ "How about now?\n", "\n", "> We're going to win ..." ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "How likely is this sequence?\n", "\n", "> We're going to win bigly. " ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "fragment" } }, "source": [ "Is it more likely than this one?\n", "\n", "> We're going to win big league." ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "## Use Cases: Machine Translation\n", "\n", "> Vi skal vinne stort\n", "\n", "translates to?" 
] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "fragment" } }, "source": [ "> We will win by a mile\n", "\n", "or \n", "\n", "> We will win bigly" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "## Use Cases: Speech Recognition\n", "\n", "What did he [say](https://www.theguardian.com/us-news/video/2016/may/04/donald-trump-we-are-going-to-win-bigly-believe-me-video)?\n", "\n", "> We're going to win bigly\n", "\n", "or\n", "\n", "> We're going to win big league" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "## Use Cases: Natural Language Generation\n", "\n", "\n", "\n", "https://twitter.com/deepdrumpf" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "fragment" } }, "source": [ "Other applications of language models?" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "slide" } }, "source": [ "## Outlook: Importance of Language Models\n", "\n", "- State of the art NLP models utilise **transfer learning**\n", "- Good (neural) language models are excellent starting points for transfer learning\n", " * Step 1: unsupervised language model **pre-training**\n", " * Step 2: supervised language model **fine-tuning**" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "## Outlook: How to build good language models?\n", "\n", "- Train them on large datasets\n", "- Use a large number of features/parameters" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "... 
but first, the basics" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Overview\n", "\n", "* Language Modelling from scratch\n", "* Evaluation\n", "* Dealing with Out-Of-Vocabulary words (OOVs)\n", "* Training\n", "* Smoothing" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "slide" } }, "source": [ "## Formally\n", "Models the probability \n", "\n", "$$\\prob(w_1,\\ldots,w_d)$$ \n", "\n", "of observing sequences of words \\\\(w_1,\\ldots,w_d\\\\). " ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "Without loss of generality: \n", "\n", "\\begin{align}\n", "\\prob(w_1,\\ldots,w_d) &= p(w_1) p(w_2|w_1) p(w_3|w_1, w_2) \\ldots \\\\\n", " &= \\prob(w_1) \\prod_{i = 2}^d \\prob(w_i|w_1,\\ldots,w_{i-1})\n", "\\end{align}" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "### Structured Prediction\n", "\n", "predict word $y=w_i$ \n", "* conditioned on history $\\x=w_1,\\ldots,w_{i-1}$." 
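, "\n", "\n", "
A minimal sketch of the chain-rule factorisation above, assuming a hypothetical `cond_prob(word, history)` function that returns the conditional probability of a word given everything before it. Neither that name nor `sequence_probability` comes from the notebook; they are placeholders for illustration.

```python
from math import prod

def sequence_probability(words, cond_prob):
    # Chain rule: multiply the conditional probability of each word
    # given all the words that precede it.
    return prod(cond_prob(w, tuple(words[:i])) for i, w in enumerate(words))

# Hypothetical toy model: every word has probability 0.5 regardless of history.
def toy_cond_prob(word, history):
    return 0.5

p = sequence_probability(['we', 'will', 'win'], toy_cond_prob)
print(p)  # 0.125
```

Each factor is one structured-prediction step: predict the next word from its history.
"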
] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "slide" } }, "source": [ "## N-Gram Language Models\n", "\n", "Impossible to estimate sensible probability for each history \n", "\n", "$$\n", "\\x=w_1,\\ldots,w_{i-1}\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "### Change **representation**\n", "truncate history to last $n-1$ words: \n", "\n", "$$\n", "\\mathbf{f}(\\x)=w_{i-(n-1)},\\ldots,w_{i-1}\n", "$$\n", "\n", "$\\prob(\\text{bigly}|\\text{...,blah, blah, blah, we, will, win}) \n", "= \\prob(\\text{bigly}|\\text{we, will, win})$" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "### Unigram LM\n", "\n", "Set $n=1$:\n", "$$\n", "\\prob(w_i|w_1,\\ldots,w_{i-1}) = \\prob(w_i).\n", "$$\n", "\n", "$\\prob(\\text{bigly}|\\text{we, will, win}) = \\prob(\\text{bigly})$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Bigram LM\n", "\n", "Set $n=2$:\n", "$$\n", "\\prob(w_i|w_1,\\ldots,w_{i-1}) = \\prob(w_i|w_{i-1}).\n", "$$\n", "\n", "$\\prob(\\text{bigly}|\\text{we, will, win}) = \\prob(\\text{bigly}|\\text{win})$" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "### *Uniform* LM\n", "Set $n=0$:\n", "\n", "Same probability for each word in the vocabulary \\\\(\\vocab\\\\):\n", "\n", "$$\n", "\\prob(w_i|w_1,\\ldots,w_{i-1}) = \\frac{1}{|\\vocab|}.\n", "$$\n", "\n", "$\\prob(\\text{big}) = \\prob(\\text{bigly}) = \\frac{1}{|\\vocab|}$" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "fragment" } }, "source": [ "Let us look at a training set and create a uniform LM from it." 
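, "\n", "\n", "
As a rough sketch of the two ideas above (truncating the history to the last n-1 words, and the uniform LM), assuming hypothetical names `truncate_history` and `TinyUniformLM` that are not part of statnlpbook:

```python
def truncate_history(history, n):
    # Keep only the last n-1 words; an order of 0 or 1 keeps no history at all.
    # (history[-0:] would return the whole list, so that case is handled explicitly.)
    return list(history[-(n - 1):]) if n > 1 else []

class TinyUniformLM:
    # Hypothetical stand-in for a uniform LM: 1/|V| for every word in the vocabulary.
    def __init__(self, vocab):
        self.vocab = set(vocab)

    def probability(self, word, *history):
        # The history is ignored. Assumption: words outside the vocabulary get 0.
        return 1.0 / len(self.vocab) if word in self.vocab else 0.0

history = ['we', 'will', 'win']
print(truncate_history(history, 3))  # ['will', 'win']
print(truncate_history(history, 2))  # ['win']
print(truncate_history(history, 1))  # []

lm = TinyUniformLM(['big', 'bigly', 'win', 'we'])
print(lm.probability('bigly', *history))  # 0.25
```
"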
] }, { "cell_type": "code", "execution_count": 6, "metadata": { "hideCode": false, "hidePrompt": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "['Can', \"'t\", 'even', 'call', 'this', 'a', 'blues', 'song', 'It']" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train[:9]" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "hideCode": false, "hidePrompt": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "0.9999999999999635" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vocab = set(train)\n", "baseline = UniformLM(vocab)\n", "sum([baseline.probability(w) for w in vocab])" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "fragment" } }, "source": [ "What about words outside the vocabulary? What is their probability?" 
] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "## Sampling\n", "* Sampling from an LM is easy and instructive\n", "* Usually, the better the LM, the better the samples" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "Sample **incrementally**, one word at a time " ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "def sample_once(lm, history, words):\n", " probs = [lm.probability(word, *history) for word in words]\n", " return np.random.choice(words,p=probs)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "'sunrise'" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_once(baseline, [], list(baseline.vocab)) " ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "def sample(lm, initial_history, amount_to_sample):\n", " words = list(lm.vocab)\n", " result = []\n", " result += initial_history\n", " for _ in range(0, amount_to_sample):\n", " history = result[-(lm.order - 1):]\n", " result.append(sample_once(lm,history,words))\n", " return result" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "hideCode": false, "hidePrompt": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "['dummies',\n", " 'find',\n", " 'being',\n", " 'bars',\n", " 'clap',\n", " 'rapping',\n", " 'droppin',\n", " 'fender',\n", " 'hated',\n", " 'Recognize']" ] }, 
"execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample(baseline, [], 10)" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "slide" } }, "source": [ "## Evaluation\n", "* **Extrinsic**: how does it improve a downstream task?\n", "* **Intrinsic**: how well does it model language?" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "## Intrinsic Evaluation\n", "**Shannon Game**: Predict next word, win if prediction matches the word in actual corpus\n", "\n", "> Our horrible trade agreements with [???]\n", "\n", "The expected reward is the probability of the corpus." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Formalised by\n", "\n", "\\begin{align}\n", "\\prob(w_1) \\prob(w_2|w_1) \\ldots \\prob(w_T|w_1,\\ldots,w_{T-1}) &= \\prod_{i=1}^T \\prob(w_i|w_1,\\ldots,w_{i-1})\n", "\\end{align}" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "But then the longer the sequence, the lower the probability...\n", "\n", "$\\to$ normalise by the length" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "### Perplexity \n", "Given test sequence \\\\(w_1,\\ldots,w_T\\\\), perplexity \\\\(\\perplexity\\\\) is **geometric mean of inverse probabilities** or, put differently the **inverse probability of the test set, normalised by the number of words**:\n", "\n", "\\begin{align}\n", "\\perplexity(w_1,\\ldots,w_T) &= \\sqrt[T]{\\frac{1}{\\prob(w_1)} \\frac{1}{\\prob(w_2|w_1)} \\ldots} \\\\\n", "&= \\sqrt[T]{\\prod_i^T \\frac{1}{\\prob(w_i|w_1,\\ldots,w_{i-1})}}\n", "\\end{align}" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { 
"slide_type": "subslide" } }, "source": [ "Perplexity for a bigram language model:\n", "\n", "\\begin{align}\n", "\\perplexity(w_1,\\ldots,w_T) &= \\sqrt[T]{\\prod_i^T \\frac{1}{\\prob(w_i|w_{i-1})}}\n", "\\end{align}" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "fragment" } }, "source": [ "Perplexity for a unigram language model:\n", "\n", "\\begin{align}\n", "\\perplexity(w_1,\\ldots,w_T) &= \\sqrt[T]{\\prod_i^T \\frac{1}{\\prob(w_i)}}\n", "\\end{align}" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "Perplexity for a uniform language model:\n", "\n", "\\begin{align}\n", "\\perplexity(w_1,\\ldots,w_T) &= \\sqrt[T]{\\prod_i^T \\frac{1}{1/|V|}} = |V|\n", "\\end{align}" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "skip" } }, "source": [ "### Brief note on inverse functions\n", "\n", "- $T$ is the number of words under consideration, e.g. for bigram language models, it is 2.\n", "- For simplicity, assume $a = \\frac{1}{p(w_1, ... 
w_T)}$\n", "- The $T$-th root undoes raising to the power $T$: $\\sqrt[T]{x^T} = x$ for $x > 0$\n", "- Meaning, we are looking for a number, the perplexity $PP$, whose $T$-th power (the product of $T$ copies of $PP$) equals $a$" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "### Interpretation\n", "\n", "Consider an LM where \n", "* at each position there are exactly **2** possible words with $\\frac{1}{2}$ probability each\n", "* in the test sequence, one of these is always the true word " ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "fragment" } }, "source": [ "Then \n", "\n", "* $\\perplexity(w_1,\\ldots,w_T) = \\sqrt[T]{2 \\cdot 2 \\cdot\\ldots} = 2$\n", "* Whenever the model has to guess the next word, it is confused as to which one of the 2 to pick\n", "* Perplexity $\\approx$ average number of choices\n", "* The lower the average number of choices, i.e. the lower the perplexity, the better" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "Perplexity of uniform LM on an **unseen** test set?" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "hideCode": false, "hidePrompt": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "inf" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "perplexity(baseline, test)" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "fragment" } }, "source": [ "Problem: model assigns **zero probability** to words not in the vocabulary. 
" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "hideCode": false, "hidePrompt": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "[('does', 0.0),\n", " ('Ceremonies', 0.0),\n", " ('Masquerading', 0.0),\n", " ('also', 0.0),\n", " ('Creativity', 0.0)]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[(w,baseline.probability(w)) for w in test if w not in vocab][:5]" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "slide" } }, "source": [ "## The Long Tail\n", "New words not specific to our corpus: \n", "* long **tail** of words that appear only a few times\n", "* each individual one has low probability, but probability of seeing any long tail word is high\n" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "Let us plot word frequency ranks (x-axis) against frequency (y-axis) " ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "hideCode": false, "hidePrompt": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.xscale('log')\n", "plt.yscale('log') \n", "plt.plot(ranks, sorted_counts)" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "In log-space such rank vs frequency graphs are **linear** \n", "\n", "* Known as **Zipf's Law**\n", "\n", "Let $r_w$ be the rank of a word \\\\(w\\\\), and \\\\(f_w\\\\) its frequency:\n", "\n", "$$\n", " f_w \\propto \\frac{1}{r_w}.\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "slide" } }, "source": [ "## Out-of-Vocabularly (OOV) Tokens\n", "In your test set, there will virtually always be words with zero training counts.\n", "\n", "Why is this a problem?" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "fragment" } }, "source": [ "* If probability of a word in the test set is 0, the entire probability of the test set is 0\n", " * Perplexity is based on inverse probability of test set\n", " * Since we cannot divide by 0, we cannot compute perplexity at all at this point\n", "* Underestimating probability of unseen words\n", " * Downstream application performance suffers" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "fragment" } }, "source": [ "Solutions:\n", "1. Remove unseen words from test set (pretend there is no problem)?\n", "2. Use subword tokenisation to ensure there are no `OOV` tokens?\n", "3. **Replace unseen words with out-of-vocabularly token, estimate its probability (`OOV` injection)?**\n", "4. Move probability mass to unseen words (smoothing)?" 
] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "### `OOV` Injection Procedures\n", "\n", "* For the test set: mark all words not in the training vocabulary as `OOV`\n", "* For the training set: when a word occurs for the first time, mark it as `OOV`" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "### Replacing Words with OOV Tokens" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "hideCode": false, "hidePrompt": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['with', 'the', 'lyrics', 'of', 'the', 'year', 'Than', 'the', 'gimmick', 'with', 'the', 'gear', 'and', 'the', 'right', 'puppeteer', 'Now', 'you', 'can', 'be', 'the', 'next', 'rock', 'Shakespear', 'you', \"'\", 're', 'still', '10', 'steps', 'away', 'from', 'having', 'a', 'career', 'You', 'step', 'up', 'the', 'plate']\n", "['with', 'the', 'lyrics', 'of', 'the', '[OOV]', '[OOV]', 'the', '[OOV]', 'with', 'the', '[OOV]', 'and', 'the', 'right', '[OOV]', 'Now', 'you', 'can', 'be', 'the', 'next', 'rock', '[OOV]', 'you', \"'\", 're', 'still', '10', 'steps', 'away', 'from', '[OOV]', 'a', 'career', 'You', '[OOV]', 'up', 'the', 'plate']\n" ] } ], "source": [ "print(test[60:100])\n", "\n", "# Replace every word not within the vocabulary with the `OOV` symbol\n", "# [word if word in vocab else OOV for word in data]\n", "print(replace_OOVs(baseline.vocab, test[60:100]))" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "### Injecting OOV Tokens for New Word Events\n", "\n", "Consider the \"words\"\n", "\n", "> AA AA BB BB AA\n", "\n", "Going left to right, how often do I see new words?" 
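, "\n", "\n", "
The two `OOV` procedures above can be sketched as follows. The function names echo the notebook's `replace_OOVs` and `inject_OOVs`, but the bodies are assumptions written for illustration, not the statnlpbook implementations.

```python
OOV = '[OOV]'

def replace_oovs_sketch(vocab, data):
    # Test-time procedure: map every word outside the vocabulary to the OOV symbol.
    return [word if word in vocab else OOV for word in data]

def inject_oovs_sketch(data):
    # Training-time procedure: the first occurrence of each word becomes OOV,
    # later occurrences are kept as they are.
    seen = set()
    result = []
    for word in data:
        if word in seen:
            result.append(word)
        else:
            seen.add(word)
            result.append(OOV)
    return result

train_sketch = inject_oovs_sketch(['the', 'cat', 'the', 'dog'])
print(train_sketch)  # ['[OOV]', '[OOV]', 'the', '[OOV]']
print(replace_oovs_sketch(set(train_sketch), ['the', 'bird']))  # ['the', '[OOV]']
```

Because the injected training data contains the `OOV` symbol itself, a model trained on it can assign that symbol a non-zero probability.
"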
] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "fragment" } }, "source": [ "Inject `OOV` tokens to mark these \"new word events\"" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "hideCode": false, "hidePrompt": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "['[OOV]', 'AA', '[OOV]', 'BB', 'AA']" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "inject_OOVs([\"AA\",\"AA\",\"BB\",\"BB\",\"AA\"])" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "### Estimate `OOV` Probability\n", "What is the probability of seeing a word you haven't seen before?" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "fragment" } }, "source": [ "Train on replaced data..." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "hideCode": false, "hidePrompt": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "1287.9999999984573" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "oov_train = inject_OOVs(train)\n", "oov_vocab = set(oov_train)\n", "oov_test = replace_OOVs(oov_vocab, test)\n", "oov_baseline = UniformLM(oov_vocab)\n", "perplexity(oov_baseline,oov_test)" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "skip" } }, "source": [ "
\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "## [tinyurl.com/diku-nlp-oov](https://tinyurl.com/diku-nlp-oov)\n", "\n", "([Responses](https://docs.google.com/forms/d/1D9p_ej-puRhwGEl_P-h9RWxwvW9KK427qx63-9NB-20/edit#responses))" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "### `OOV` and Perplexity\n", "\n", "* LM can achieve low perplexity by choosing small vocabulary and assigning high probability to unknown words\n", " * Perplexities are vocabulary-dependent" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "slide" } }, "source": [ "## Training N-Gram Language Models" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "fragment" } }, "source": [ "N-gram language models condition on a limited history: \n", "\n", "$$\n", "\\prob(w_i|w_1,\\ldots,w_{i-1}) = \\prob(w_i|w_{i-(n-1)},\\ldots,w_{i-1}).\n", "$$\n", "\n", "What are its parameters (continuous values that control its behaviour)?" 
] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "fragment" } }, "source": [ "One parameter $\\param_{w,h}$ for each pair of word $w$ and history $h=w_{i-(n-1)},\\ldots,w_{i-1}$:\n", "\n", "$$\n", "\\prob_\\params(w|h) = \\param_{w,h}\n", "$$\n", "\n", "$\\prob_\\params(\\text{bigly}|\\text{win}) = \\param_{\\text{bigly, win}}$" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "### Maximum Likelihood Estimate\n", "\n", "Assume training set \\\\(\\train=(w_1,\\ldots,w_d)\\\\)" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "fragment" } }, "source": [ "Find \\\\(\\params\\\\) that maximises the log-likelihood of \\\\(\\train\\\\):\n", "\n", "$$\n", "\\params^* = \\argmax_\\params \\log \\prob_\\params(\\train)\n", "$$\n", "\n", "where\n", "\n", "$$\n", "\\prob_\\params(\\train) = \\ldots \\prob_\\params(w_i|\\ldots w_{i-1}) \\prob_\\params(w_{i+1}|\\ldots w_{i}) \\ldots \n", "$$\n", "\n", "**Structured Prediction**: this is your continuous optimisation problem!" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "The maximum likelihood estimate (MLE) can be calculated in **closed form**:\n", "$$\n", "\\prob_{\\params^*}(w|h) = \\param^*_{w,h} = \\frac{\\counts{\\train}{h,w}}{\\counts{\\train}{h}} \n", "$$\n", "\n", "where \n", "\n", "$$\n", "\\counts{D}{e} = \\text{Count of } e \\text{ in } D \n", "$$\n", "\n", "Event $h$ means seeing the history $h$, and event $h,w$ means seeing the history $h$ followed by word $w$. \n", "\n", "Many LM variants differ only in how they estimate these counts. 
" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "slide" } }, "source": [ "## Training a Unigram Model\n", "Let us train a unigram model..." ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "fragment" } }, "source": [ "What do you think the most probable words are? \n", "\n", "Remember our training set looks like this ..." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "['bird', 'I', 'know', 'this', '[OOV]', '[OOV]', '[OOV]', 'is', 'this', '[OOV]']" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "oov_train[1000:1010]" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "hideCode": false, "hidePrompt": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "unigram = NGramLM(oov_train,1)\n", "plot_probabilities(unigram)\n", "# sum([unigram.probability(w) for w in unigram.vocab])" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "The unigram LM has substantially reduced (and hence better) perplexity than the uniform LM:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "hideCode": false, "hidePrompt": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "(1287.9999999984573, 128.9093846843014)" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "perplexity(oov_baseline,oov_test), perplexity(unigram,oov_test)" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "Its samples look (a little) more reasonable:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "hideCode": false, "hidePrompt": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['hands', 'play', 'below', 'never', 'around', 'type', 'grows', 'about', 'debate', 'himself'] \n", "\n", "['the', '[OOV]', 'to', 'in', 'Singing', '[OOV]', '[OOV]', \"'m\", 'live', 'to']\n" ] } ], "source": [ "print(sample(oov_baseline, [], 10), \"\\n\")\n", "print(sample(unigram, [], 10))" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "slide" } }, "source": [ "## Bigram Model\n", "We can do better by setting $n=2$" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "hideCode": false, "hidePrompt": false, "run_control": { "frozen": false, "read_only": false }, 
"slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "bigram = NGramLM(oov_train,2)\n", "plot_probabilities(bigram, (\"I\",)) # bigrams starting with \"I\"" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "Samples should look (slightly) more fluent:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "hideCode": false, "hidePrompt": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "\"I set em Yo [OOV] enemies [OOV] [OOV] is yours what the [OOV] it So it There 's mind [OOV] Recognize your [OOV] up [OOV] [OOV] wanna get up [OOV] and\"" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\" \".join(sample(bigram, ['I'], 30)) # try: I, FIND, [OOV]" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "How about perplexity?" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "hideCode": false, "hidePrompt": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "inf" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "perplexity(bigram,oov_test)" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "Some contexts where OOV word (and others) haven't been seen, hence 0 probability..." 
] }, { "cell_type": "code", "execution_count": 25, "metadata": { "hideCode": false, "hidePrompt": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "0.0" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bigram.probability(\"[OOV]\",\"money\")" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "subslide" } }, "source": [ "## Out-of-Vocabulary (OOV) Tokens\n", "In your test set, there will virtually always be words with zero training counts.\n", "\n", "Solutions:\n", "1. Remove unseen words from test set (pretend there is no problem)?\n", "2. Use subword tokenisation to ensure there are no `OOV` tokens?\n", "3. Replace unseen words with an out-of-vocabulary token, estimate its probability (`OOV` injection)?\n", "4. **Move probability mass to unseen words (smoothing)?**" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "slide" } }, "source": [ "## Smoothing\n", "\n", "Maximum likelihood \n", "* **underestimates** true probability of some words \n", "* **overestimates** the probabilities of others\n", "\n", "Solution: _smooth_ the probabilities and **move mass** from seen to unseen events." 
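, "\n", "\n", "
One common way to move mass is additive smoothing: add a pseudo-count alpha to every count before normalising. A minimal sketch, assuming the raw counts for a single history are given as a dict; the function name `additive_smoothing` is hypothetical.

```python
from fractions import Fraction

def additive_smoothing(counts, alpha=1):
    # Add alpha to each count and renormalise over the whole vocabulary.
    # Assumption: alpha is an int or a Fraction, so the results stay exact.
    total = sum(counts.values()) + alpha * len(counts)
    return {w: Fraction(c + alpha, total) for w, c in counts.items()}

counts = {'smally': 0, 'bigly': 1, 'tremendously': 2}
smoothed = additive_smoothing(counts)
for word, p in smoothed.items():
    print(word, p)
# smally 1/6
# bigly 1/3
# tremendously 1/2
```

The unseen word now has non-zero probability, and the most frequent word has lost some of its mass (1/2 instead of 2/3).
"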
] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "skip" } }, "source": [ "### Laplace Smoothing / Additive Smoothing\n", "\n", "Add **pseudo counts** to each event in the dataset \n", "\n", "$$\n", "\\param^{\\alpha}_{w,h} = \\frac{\\counts{\\train}{h,w} + \\alpha}{\\counts{\\train}{h} + \\alpha \\lvert V \\rvert } \n", "$$" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "hideCode": false, "hidePrompt": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "0.0007704160246533128" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "laplace_bigram = LaplaceLM(bigram, 0.1) \n", "laplace_bigram.probability(\"[OOV]\",\"money\")" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "skip" } }, "source": [ "Perplexity should look better now:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "hideCode": false, "hidePrompt": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "255.11837473847797" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "perplexity(LaplaceLM(bigram, 0.001),oov_test)" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "skip" } }, "source": [ "### Example\n", "Consider three events:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
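<table><tr><th>word</th><th>train count</th><th>MLE</th><th>Laplace</th><th>Same Denominator</th></tr><tr><td>smally</td><td>0</td><td>0/3</td><td>1/6</td><td>0.5/3</td></tr><tr><td>bigly</td><td>1</td><td>1/3</td><td>2/6</td><td>1/3</td></tr><tr><td>tremendously</td><td>2</td><td>2/3</td><td>3/6</td><td>1.5/3</td></tr></table>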
" ], "text/plain": [ "" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = [\"word\", \"train count\", \"MLE\", \"Laplace\", \"Same Denominator\"]\n", "r1 = [\"smally\", \"0\", \"0/3\", \"1/6\", \"0.5/3\"]\n", "r2 = [\"bigly\", \"1\", \"1/3\", \"2/6\", \"1/3\"]\n", "r3 = [\"tremendously\", \"2\", \"2/3\", \"3/6\", \"1.5/3\"]\n", "util.Table([r1,r2,r3], column_names=c)" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "skip" } }, "source": [ "### Interpolation / Jelinek-Mercer Smoothing\n", "* Laplace Smoothing assigns mass **uniformly** to the words that haven't been seen in a context." ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "hideCode": false, "hidePrompt": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "skip" } }, "outputs": [ { "data": { "text/plain": [ "(0.0005656108597285067, 0.0005656108597285067)" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "laplace_bigram.probability('rhyme','man'), \\\n", "laplace_bigram.probability('of','man') \n", "# also try: 'skies','skies' vs. '[/BAR]','skies'" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "skip" } }, "source": [ "Problem: not all unseen words (in a context) are equal" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "skip" } }, "source": [ "With **interpolation** we can do better: \n", "* give more mass to words likely under the $n-1$-gram model. 
\n", " * Use $\\prob(\\text{of})$ for estimating $\\prob(\\text{of} | \\text{man})$" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "skip" } }, "source": [ "* Combine $n$-gram model \\\\(p'\\\\) and a back-off \\\\(n-1\\\\) model \\\\(p''\\\\): \n", "\n", "$$\n", "\\prob_{\\alpha}(w_i|w_{i-n+1},\\ldots,w_{i-1}) = \\alpha \\cdot \\prob'(w_i|w_{i-n+1},\\ldots,w_{i-1}) + \\\\ (1 - \\alpha) \\cdot \\prob''(w_i|w_{i-n+2},\\ldots,w_{i-1})\n", "$$\n" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "hideCode": false, "hidePrompt": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "skip" } }, "outputs": [ { "data": { "text/plain": [ "(0.0014514278429372768, 0.009276517083120857)" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "interpolated = InterpolatedLM(bigram,unigram,0.01)\n", "interpolated.probability('rhyme','man'), \\\n", "interpolated.probability('of','man')" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "skip" } }, "source": [ "Can we find a good $\\alpha$ parameter? Tune on some **development set**!" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "hideCode": false, "hidePrompt": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "skip" } }, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "alphas = np.arange(0,1.1,0.1)\n", "perplexities = [perplexity(InterpolatedLM(bigram,unigram,alpha),oov_test) \n", " for alpha in alphas]\n", "plt.plot(alphas,perplexities)" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "skip" } }, "source": [ "### Backoff \n", "* When we have counts for an event, trust these counts and not the simpler model\n", " * use $\\prob(\\text{bigly}|\\text{win})$ if you have seen $(\\text{win, bigly})$, not $\\prob(\\text{bigly})$\n", "* **back-off** only when no counts for a given event are available." ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "skip" } }, "source": [ "### Stupid Backoff\n", "Let \\\\(w\\\\) be a word and \\\\(h_{m}\\\\) an n-gram of length \\\\(m\\\\): \n", "\n", "$$\n", "\\prob_{\\mbox{Stupid}}(w|h_{m}) = \n", "\\begin{cases}\n", "\\frac{\\counts{\\train}{h_{m},w}}{\\counts{\\train}{h_{m}}} & \\mbox{if }\\counts{\\train}{h_{m},w} > 0 \\\\\\\\\n", "\\prob_{\\mbox{Stupid}}(w|h_{m-1}) & \\mbox{otherwise}\n", "\\end{cases}\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "skip" } }, "source": [ "What is the problem with this model?" 
] }, { "cell_type": "code", "execution_count": 32, "metadata": { "hideCode": false, "hidePrompt": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "skip" } }, "outputs": [ { "data": { "text/plain": [ "1.0684727180010114" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stupid = StupidBackoff(bigram, unigram, 0.1)\n", "sum([stupid.probability(word, 'the') for word in stupid.vocab])" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "skip" } }, "source": [ "Discuss with your neighbour and enter your answer here:\n", "## [tinyurl.com/diku-nlp-backoff](https://tinyurl.com/diku-nlp-backoff)\n", "([Responses](https://docs.google.com/forms/d/1UMmtDqpzf7pXxWqqsgK7pH3sA5EVXXWzfRaXHe9LICM/edit#responses))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "### Solution\n", "\n", "The score is not a probability distribution (probabilities do not sum to 1). Sampling thus requires further normalisation." ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "skip" } }, "source": [ "### Exercise\n", "\n", "How can we check whether a language model provides a valid probability distribution? Solve [Task 2](../exercises/language_models.ipynb)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "### Absolute Discounting\n", "Recall that in test data, a constant probability mass is taken away for each non-zero count event. Can this be captured in a smoothing algorithm?" 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Yes: subtract a (tunable) constant $d$ from each non-zero count:\n", "\n", "$$\n", "\\prob_{\\mbox{Absolute}}(w|h_{m}) = \n", "\\begin{cases}\n", "\\frac{\\counts{\\train}{h_{m},w}-d}{\\counts{\\train}{h_{m}}} & \\mbox{if }\\counts{\\train}{h_{m},w} > 0 \\\\\\\\\n", "\\alpha(h_{m-1})\\cdot\\prob_{\\mbox{Absolute}}(w|h_{m-1}) & \\mbox{otherwise}\n", "\\end{cases}\n", "$$\n", "\n", "$\\alpha(h_{m-1})$ is a normaliser" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "### Unigram Backoff\n", "\n", "Assume, for example:\n", "* *Mos Def* is a rapper name that appears often in the data\n", "* *glasses* appears slightly less often\n", "* neither *Def* nor *glasses* has been seen in the context of the word *reading*" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Then the final-backoff unigram model might assign a higher probability to\n", "\n", "> I can't see without my reading Def\n", "\n", "than\n", "\n", "> I can't see without my reading glasses\n", "\n", "because $\\prob(\\text{Def}) > \\prob(\\text{glasses})$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "But *Def* never follows anything but *Mos*, and we can determine this by looking at the training data!" 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "### Kneser Ney Smoothing\n", "\n", "Absolute Discounting, but as final backoff probability, use the probability that a word appears after (any) word in the training set, measured by the number of distinct words it follows: \n", "\n", "$$\n", "\\prob_{\\mbox{KN}}(w) = \\frac{\\left|\\{w_{-1}:\\counts{\\train}{w_{-1},w} > 0\\} \\right|}\n", "{\\sum_{w'}\\left|\\{w_{-1}:\\counts{\\train}{w_{-1},w'} > 0\\} \\right|} \n", "$$\n", "\n", "This is the *continuation probability*" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "### Modified Kneser Ney Smoothing\n", "\n", "Rather than using a single discount $d$:\n", " use three different discounts $d_1$, $d_2$, $d_3$\n", " for n-grams with count 1, count 2, and count 3 or more\n", "\n", "See [Chen and Goodman 1998, p. 19](https://dash.harvard.edu/bitstream/handle/1/25104739/tr-10-98.pdf?sequence=1)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "### Interpolation vs. Backoff\n", "\n", "* Both combine information from higher- and lower-order models, e.g. 
2-gram and 1-gram\n", "* Both use lower-order models to determine probability of n-grams with zero counts\n", "* Difference: \n", " * Interpolated models use lower-order models also for n-grams with **non-zero** counts\n", " * Backoff models only do it for n-grams with **zero** counts" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "slide" } }, "source": [ "## Summary\n", "\n", "* LMs model probability of sequences of words \n", "* Defined in terms of \"next-word\" distributions conditioned on history\n", "* N-gram models truncate history representation\n", "* Often trained by maximising log-likelihood of training data and ...\n", "* smoothing to deal with sparsity" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false, "slideshow": { "slide_type": "slide" } }, "source": [ "## Background Reading\n", "\n", "* Jurafsky & Martin, [Speech and Language Processing (Third Edition)](https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf): Chapter 3, N-Gram Language Models.\n", "* Bill MacCartney, Stanford NLP Lunch Tutorial: [Smoothing](http://nlp.stanford.edu/~wcmac/papers/20050421-smoothing-tutorial.pdf)\n", "* Chen, Stanley F. and Joshua Goodman. 1998. 
[An Empirical Study of Smoothing Techniques for Language Modeling.](https://dash.harvard.edu/bitstream/handle/1/25104739/tr-10-98.pdf?sequence=1) Harvard Computer Science Group Technical Report TR-10-98.\n", "* [Sampling from language models](https://towardsdatascience.com/how-to-sample-from-language-models-682bceb97277), a practical guide\n", "* [Interpretation of Perplexity](https://towardsdatascience.com/perplexity-in-language-models-87a196019a94)" ] } ], "metadata": { "celltoolbar": "Slideshow", "hide_code_all_hidden": false, "hide_input": false, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 1 }