{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Language Model Exercises\n", "In these exercises you will extend and develop language models. We will use the code from the notes, but within a Python package [`lm`](http://localhost:8888/edit/statnlpbook/lm.py)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup 1: Load Libraries" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2016-10-21T16:59:18.569772", "start_time": "2016-10-21T16:59:18.552156" }, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2\n", "%matplotlib inline\n", "import sys, os\n", "_snlp_book_dir = \"..\"\n", "sys.path.append(_snlp_book_dir) \n", "from statnlpbook.lm import *\n", "from statnlpbook.ohhla import *\n", "# %cd .. \n", "import sys\n", "sys.path.append(\"..\")\n", "import matplotlib\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "matplotlib.rcParams['figure.figsize'] = (10.0, 6.0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "$$\n", "\\newcommand{\\prob}{p}\n", "\\newcommand{\\vocab}{V}\n", "\\newcommand{\\params}{\\boldsymbol{\\theta}}\n", "\\newcommand{\\param}{\\theta}\n", "\\DeclareMathOperator{\\perplexity}{PP}\n", "\\DeclareMathOperator{\\argmax}{argmax}\n", "\\newcommand{\\train}{\\mathcal{D}}\n", "\\newcommand{\\counts}[2]{\\#_{#1}(#2) }\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup 2: Load Data" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2016-10-21T16:59:18.613748", "start_time": "2016-10-21T16:59:18.575886" }, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "docs = load_all_songs(\"../data/ohhla/train/www.ohhla.com/anonymous/j_live/\")\n", "assert len(docs) == 50, \"Your ohhla corpus is corrupted, please download it again!\"\n", "trainDocs, testDocs = docs[:len(docs)//2], docs[len(docs)//2:] \n", "train = words(trainDocs)\n", "test = words(testDocs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 1: Optimal Pseudo Count \n", "\n", "Plot the perplexity of Laplace smoothing on the given data as a function of alpha in the interval [0.001, 0.1], in steps of 0.001. Is it fair to assume that this is a convex function? Write a method that finds the optimal pseudo count `alpha` for [Laplace smoothing](https://github.com/uclmr/stat-nlp-book/blob/python/statnlpbook/lm.py#L180) on the given data, up to some predefined numerical precision `epsilon`, under the assumption that the perplexity is a convex function of alpha. How many times did you have to call `perplexity` to find the optimum?\n", "\n", "Tips:\n", "\n", "You don't need first- or second-order derivatives in this case; it is enough to know in which direction the perplexity decreases. 
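As a further hint: one derivative-free way to minimise a convex one-dimensional function is to repeatedly shrink the search interval around its minimum. The sketch below is generic (it only assumes some callable `f`, for example a small wrapper that builds a `LaplaceLM` with a given alpha and returns its `perplexity` on the test data) and is not part of the provided `lm` package:

```python
def minimize_convex(f, low, high, epsilon=1e-6):
    # Shrink [low, high] until it is narrower than epsilon; for a convex f
    # the minimum always lies inside the retained two thirds of the interval.
    while high - low > epsilon:
        third = (high - low) / 3.0
        m1, m2 = low + third, high - third
        if f(m1) < f(m2):
            high = m2  # the minimum cannot lie to the right of m2
        else:
            low = m1   # the minimum cannot lie to the left of m1
    return (low + high) / 2.0
```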
Think about recursively slicing up the problem.\n", "" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2016-10-21T16:59:19.151308", "start_time": "2016-10-21T16:59:18.615252" }, "run_control": { "frozen": false, "read_only": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1.0 0.0\n" ] }, { "data": { "text/plain": [ "0.0" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "<matplotlib.figure.Figure>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "oov_train = inject_OOVs(train)\n", "oov_vocab = set(oov_train)\n", "oov_test = replace_OOVs(oov_vocab, test)\n", "bigram = NGramLM(oov_train,2)\n", "\n", "interval = [x/1000.0 for x in range(1, 100, 1)]\n", "perplexity_at_1 = perplexity(LaplaceLM(bigram, alpha=1.0), oov_test)\n", "\n", "def plot_perplexities(interval):\n", " \"\"\"Plots the perplexity of LaplaceLM for every alpha in interval.\"\"\"\n", " perplexities = [0.0 for alpha in interval] # todo\n", " plt.plot(interval, perplexities)\n", " \n", "def find_optimal(low, high, epsilon=1e-6):\n", " \"\"\"Returns the optimal pseudo count alpha within the interval [low, high] and the perplexity.\"\"\"\n", " print(high, low)\n", " if high - low < epsilon:\n", " return 0.0 # todo\n", " else:\n", " return 0.0 # todo\n", "\n", "plot_perplexities(interval) \n", "find_optimal(0.0, 1.0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 2: Sanity Check LM\n", "Implement a method that tests whether a language model provides a valid probability distribution." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2016-10-21T16:59:19.237379", "start_time": "2016-10-21T16:59:19.153304" }, "run_control": { "frozen": false, "read_only": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1.064711557993089\n" ] } ], "source": [ "def sanity_check(lm, *history):\n", " \"\"\"Throws an AssertionError if lm does not define a valid probability distribution for all words \n", " in the vocabulary.\"\"\" \n", " probability_mass = 1.0 # todo\n", " assert abs(probability_mass - 1.0) < 1e-6, probability_mass\n", "\n", "unigram = NGramLM(oov_train,1)\n", "stupid = StupidBackoff(bigram, unigram, 0.1)\n", "print(sum([stupid.probability(word, 'the') for word in stupid.vocab]))\n", "sanity_check(stupid, 'the')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 3: Subtract Count LM\n", "Develop and implement a language model that subtracts a count $d\in[0,1]$ from each non-zero count in the training set. 
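For instance, with $d=0.1$ and a history $h_n$ after which 5 distinct words occur in the training data, a total mass of $5 \times 0.1 = 0.5$ counts is freed and has to be redistributed over the words that never follow $h_n$.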
Let's first formalize this:\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\\begin{align}\n", "\\#_{w=0}(h_n) &= \\sum_{w \\in V} \\mathbf{1}[\\counts{\\train}{h_n,w} = 0]\\\\\n", "\\#_{w>0}(h_n) &= \\sum_{w \\in V} \\mathbf{1}[\\counts{\\train}{h_n,w} > 0]\\\\\n", "\\prob(w|h_n) &= \n", "\\begin{cases}\n", "\\frac{\\counts{\\train}{h_n,w} - d}{\\counts{\\train}{h_n}} & \\mbox{if }\\counts{\\train}{h_n,w} > 0 \\\\\\\\\n", "\\frac{???}{\\counts{\\train}{h_n}} & \\mbox{otherwise}\n", "\\end{cases}\n", "\\end{align}" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2016-10-21T16:59:19.337884", "start_time": "2016-10-21T16:59:19.240468" }, "run_control": { "frozen": false, "read_only": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.0\n" ] }, { "data": { "text/plain": [ "inf" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "class SubtractCount(CountLM): \n", " def __init__(self, base_lm, d):\n", " super().__init__(base_lm.vocab, base_lm.order)\n", " self.base_lm = base_lm\n", " self.d = d \n", " self._counts = base_lm._counts # not good style since it is a protected member\n", " self.vocab = base_lm.vocab\n", "\n", " def counts(self, word_and_history):\n", " if self._counts[word_and_history] > 0:\n", " return 0.0 # todo\n", " else:\n", " return 0.0 # todo\n", "\n", " def norm(self, history):\n", " return self.base_lm.norm(history) \n", " \n", "subtract_lm = SubtractCount(unigram, 0.1)\n", "oov_prob = subtract_lm.probability(OOV, 'the')\n", "rest_prob = sum([subtract_lm.probability(word, 'the') for word in subtract_lm.vocab])\n", "print(oov_prob + rest_prob)\n", "sanity_check(subtract_lm, 'the')\n", "perplexity(subtract_lm, oov_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 4: Normalisation of Stupid LM\n", "Develop and implement a version of the [stupid language model](https://github.com/uclmr/stat-nlp-book/blob/python/statnlpbook/lm.py#L205) that provides probabilities summing up to 1." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2016-10-21T16:59:19.398354", "start_time": "2016-10-21T16:59:19.339446" }, "run_control": { "frozen": false, "read_only": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.0\n" ] }, { "data": { "text/plain": [ "inf" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "class StupidBackoffNormalized(LanguageModel):\n", " def __init__(self, main, backoff, alpha):\n", " super().__init__(main.vocab, main.order)\n", " self.main = main\n", " self.backoff = backoff\n", " self.alpha = alpha \n", "\n", " def probability(self, word, *history):\n", " return 0.0 # todo\n", " \n", "less_stupid = StupidBackoffNormalized(bigram, unigram, 0.1)\n", "print(sum([less_stupid.probability(word, 'the') for word in less_stupid.vocab]))\n", "sanity_check(less_stupid, 'the')\n", "perplexity(less_stupid, oov_test)" ] } ], "metadata": { "hide_input": false, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.2" } }, "nbformat": 4, "nbformat_minor": 1 }