{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Language Model Exercises\n",
"In these exercises you will extend and develop language models. We will use the code from the notes, packaged in the Python module [`lm`](http://localhost:8888/edit/statnlpbook/lm.py)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup 1: Load Libraries"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"ExecuteTime": {
"end_time": "2016-10-21T16:59:18.569772",
"start_time": "2016-10-21T16:59:18.552156"
},
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [],
"source": [
"%load_ext autoreload\n",
"%autoreload 2\n",
"%matplotlib inline\n",
"import sys, os\n",
"_snlp_book_dir = \"..\"\n",
"sys.path.append(_snlp_book_dir) \n",
"from statnlpbook.lm import *\n",
"from statnlpbook.ohhla import *\n",
"# %cd .. \n",
"import matplotlib\n",
"import matplotlib.pyplot as plt\n",
"import pandas as pd\n",
"matplotlib.rcParams['figure.figsize'] = (10.0, 6.0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"$$\n",
"\\newcommand{\\prob}{p}\n",
"\\newcommand{\\vocab}{V}\n",
"\\newcommand{\\params}{\\boldsymbol{\\theta}}\n",
"\\newcommand{\\param}{\\theta}\n",
"\\DeclareMathOperator{\\perplexity}{PP}\n",
"\\DeclareMathOperator{\\argmax}{argmax}\n",
"\\newcommand{\\train}{\\mathcal{D}}\n",
"\\newcommand{\\counts}[2]{\\#_{#1}(#2) }\n",
"$$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup 2: Load Data"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"ExecuteTime": {
"end_time": "2016-10-21T16:59:18.613748",
"start_time": "2016-10-21T16:59:18.575886"
},
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [],
"source": [
"docs = load_all_songs(\"../data/ohhla/train/www.ohhla.com/anonymous/j_live/\")\n",
"assert len(docs) == 50, \"Your ohhla corpus is corrupted, please download it again!\"\n",
"trainDocs, testDocs = docs[:len(docs)//2], docs[len(docs)//2:] \n",
"train = words(trainDocs)\n",
"test = words(testDocs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Task 1: Optimal Pseudo Count \n",
"\n",
"Plot the perplexity of Laplace smoothing on the given data as a function of alpha over the interval [0.001, 0.1] in steps of 0.001. Is it fair to assume that this is a convex function? Write a method that finds the optimal pseudo count `alpha` for [Laplace smoothing](https://github.com/uclmr/stat-nlp-book/blob/python/statnlpbook/lm.py#L180) on the given data up to some predefined numerical precision `epsilon`, under the assumption that the perplexity is a convex function of alpha. How often did you have to call `perplexity` to find the optimum?\n",
"\n",
"Tips:\n",
"\n",
"You don't need first- or second-order derivatives in this case, only the direction in which the perplexity decreases. Think about recursively slicing up the interval.\n",
""
]
},
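{
"cell_type": "markdown",
"metadata": {},
"source": [
"The hint about recursively slicing up the problem can be illustrated with a minimal ternary-search sketch. It is only an illustration under stated assumptions: `ternary_search` and the toy function `toy` are hypothetical names that are not part of the `lm` package, and the toy curve merely stands in for a convex perplexity curve.\n",
"\n",
"```python\n",
"def ternary_search(f, low, high, epsilon=1e-6):\n",
"    # Minimise a convex function f on [low, high].\n",
"    # Returns (argmin, minimum value, number of calls to f).\n",
"    calls = 0\n",
"    while high - low > epsilon:\n",
"        third = (high - low) / 3.0\n",
"        left, right = low + third, high - third\n",
"        f_left, f_right = f(left), f(right)\n",
"        calls += 2\n",
"        if f_left < f_right:\n",
"            high = right  # the minimum cannot lie in (right, high]\n",
"        else:\n",
"            low = left  # the minimum cannot lie in [low, left)\n",
"    best = (low + high) / 2.0\n",
"    return best, f(best), calls + 1\n",
"\n",
"# Hypothetical convex stand-in for the perplexity curve, with its minimum at 0.03.\n",
"toy = lambda alpha: (alpha - 0.03) ** 2 + 1.0\n",
"best_alpha, best_value, n_calls = ternary_search(toy, 0.0, 1.0)\n",
"print(best_alpha, best_value, n_calls)\n",
"```\n",
"\n",
"Each iteration costs two function evaluations and discards a third of the interval, so the number of calls grows only logarithmically in `1/epsilon`."
]
},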
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"ExecuteTime": {
"end_time": "2016-10-21T16:59:19.151308",
"start_time": "2016-10-21T16:59:18.615252"
},
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1.0 0.0\n"
]
},
{
"data": {
"text/plain": [
"0.0"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAmUAAAFpCAYAAADdpV/BAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAEthJREFUeJzt3V+M5Xd53/HPU29xIKhg8y9gs12nuKqWJm2qiWnVpEX8\nsU0lsqjxhUFVVi2VL1ou0og2plQFnKgClNZRFdrKCpFcLmpSS1E2Iq1lTFErlBLPOoRkSRwvhsQ2\nTjC2ReSg4po8vZgf0WQ7zqx9jmee2Xm9pKM5v9/vOzPP6KuZffucM+Pq7gAAsL/+3H4PAACAKAMA\nGEGUAQAMIMoAAAYQZQAAA4gyAIABRBkAwACiDABgAFEGADCAKAMAGODIfg/wbLz0pS/tY8eO7fcY\nAAC7On369Ne6+2W7rTuQUXbs2LFsbm7u9xgAALuqqt89n3WevgQAGECUAQAMIMoAAAYQZQAAA4gy\nAIABRBkAwACiDABgAFEGADCAKAMAGECUAQAMIMoAAAYQZQAAA4gyAIABRBkAwACiDABgAFEGADCA\nKAMAGECUAQAMIMoAAAYQZQAAA4gyAIABRBkAwACiDABgAFEGADCAKAMAGECUAQAMIMoAAAYQZQAA\nA4gyAIABRBkAwACiDABgAFEGADCAKAMAGECUAQAMIMoAAAYQZQAAA4gyAIABRBkAwABribKquraq\n7q2qs1V14w7XL66qjy/XP1tVx865frSqnqiqd69jHgCAg2blKKuqi5J8JMlbkhxP8vaqOn7Osncm\neby7X5Pk5iQfOuf6v0vy31adBQDgoFrHI2VXJTnb3fd395NJbkty4pw1J5Lcuty/Pckbq6qSpKre\nluRLSc6sYRYAgANpHVF2WZIHth0/uJzbcU13P5Xk60leUlUvTPLjST6whjkAAA6s/X6h//uT3Nzd\nT+y2sKpuqKrNqtp85JFHnvvJAAD20JE1fIyHkrx62/Hly7md1jxYVUeSvCjJo0lel+S6qvpwkhcn\n+eOq+j/d/TPnfpLuviXJLUmysbHRa5gbAGCMdUTZ3UmurKorshVf1yd5xzlrTiU5meRXklyX5FPd\n3Ul+8NsLqur9SZ7YKcgAAC50K0dZdz9VVe9KckeSi5L8XHefqaqbkmx296kkH03ysao6m+SxbIUb\nAACL2nrA6mDZ2Njozc3N/R4DAGBXVXW6uzd2W7ffL/QHACCiDABgBFEGADCAKAMAGECUAQAMIMoA\nAAYQZQAAA4gyAIABRBkAwACiDABgAFEGADCAKAMAGECUAQAMIMoAAAYQZQAAA4gyAIABRBkAwACi\nDABgAFEGADCAKAMAGECUAQAMIMoAAAYQZQAAA4gyAIABRBkAwACiDABgAFEGADCAKAMAGECUAQAM\nIMoAAAYQZQAAA4gyAIABRBkAwACiDABgAFEGADCAKAMAGECUAQAMIMoAAAYQZQAAA4gyAIABRBkA\nwACiDABgAFEGADCAKAMAGECUAQAMIMoAAAYQZQAAA4gyAIABRBkAwACiDABggLVEWVVdW1X3VtXZ\nqrpxh+sXV9XHl+ufrapjy/k3V9XpqvqN5e0b1jEPAMBBs3KUVdVFST6S5C1Jjid5e1UdP2fZO5M8\n3t2vSXJzkg8t57+W5K3d/T1JTib52KrzAAAcROt4pOyqJGe7+/7ufjLJbUlOnLPmRJJbl/u3J3lj\nVVV3/1p3f2U5fybJ86vq4jXMBABwoKwjyi5L8sC24weXczuu6e6nknw9yUvOWfPDSe7p7m+uYSYA\ngAPlyH4PkCRV9dpsPaV59Z+x5oYkNyTJ0aNH92gyAIC9sY5Hyh5K8uptx5cv53ZcU1VHkrwoyaPL\n8eVJfiHJj3T3F5/uk3T3Ld290d0bL3vZy9YwNgDAHOuIsruTXFlVV1TV85Jcn+TUOWtOZeuF/Ely\nXZJPdXdX1YuTfCLJjd39mTXMAgBwIK0cZctr
xN6V5I4kv5Xk57v7TFXdVFU/tCz7aJKXVNXZJD+W\n5Nt/NuNdSV6T5F9X1eeW28tXnQkA4KCp7t7vGZ6xjY2N3tzc3O8xAAB2VVWnu3tjt3X+oj8AwACi\nDABgAFEGADCAKAMAGECUAQAMIMoAAAYQZQAAA4gyAIABRBkAwACiDABgAFEGADCAKAMAGECUAQAM\nIMoAAAYQZQAAA4gyAIABRBkAwACiDABgAFEGADCAKAMAGECUAQAMIMoAAAYQZQAAA4gyAIABRBkA\nwACiDABgAFEGADCAKAMAGECUAQAMIMoAAAYQZQAAA4gyAIABRBkAwACiDABgAFEGADCAKAMAGECU\nAQAMIMoAAAYQZQAAA4gyAIABRBkAwACiDABgAFEGADCAKAMAGECUAQAMIMoAAAYQZQAAA4gyAIAB\nRBkAwACiDABgAFEGADDAWqKsqq6tqnur6mxV3bjD9Yur6uPL9c9W1bFt196znL+3qq5ZxzwAAAfN\nylFWVRcl+UiStyQ5nuTtVXX8nGXvTPJ4d78myc1JPrS87/Ek1yd5bZJrk/yH5eMBABwq63ik7Kok\nZ7v7/u5+MsltSU6cs+ZEkluX+7cneWNV1XL+tu7+Znd/KcnZ5eMBABwqR9bwMS5L8sC24weTvO7p\n1nT3U1X19SQvWc7/73Pe97I1zLSSD/zSmXzhK3+432MAAM+h46/6C3nfW1+732P8iQPzQv+quqGq\nNqtq85FHHtnvcQAA1modj5Q9lOTV244vX87ttObBqjqS5EVJHj3P902SdPctSW5Jko2NjV7D3E9r\nUjUDAIfDOh4puzvJlVV1RVU9L1sv3D91zppTSU4u969L8qnu7uX89ctvZ16R5Mokv7qGmQAADpSV\nHylbXiP2riR3JLkoyc9195mquinJZnefSvLRJB+rqrNJHstWuGVZ9/NJvpDkqST/tLu/tepMAAAH\nTW09YHWwbGxs9Obm5n6PAQCwq6o63d0bu607MC/0BwC4kIkyAIABRBkAwACiDABgAFEGADCAKAMA\nGECUAQAMIMoAAAYQZQAAA4gyAIABRBkAwACiDABgAFEGADCAKAMAGECUAQAMIMoAAAYQZQAAA4gy\nAIABRBkAwACiDABgAFEGADCAKAMAGECUAQAMIMoAAAYQZQAAA4gyAIABRBkAwACiDABgAFEGADCA\nKAMAGECUAQAMIMoAAAYQZQAAA4gyAIABRBkAwACiDABgAFEGADCAKAMAGECUAQAMIMoAAAYQZQAA\nA4gyAIABRBkAwACiDABgAFEGADCAKAMAGECUAQAMIMoAAAYQZQAAA4gyAIABVoqyqrq0qu6sqvuW\nt5c8zbqTy5r7qurkcu4FVfWJqvrtqjpTVR9cZRYAgINs1UfKbkxyV3dfmeSu5fhPqapLk7wvyeuS\nXJXkfdvi7ae6+68k+b4kf7uq3rLiPAAAB9KqUXYiya3L/VuTvG2HNdckubO7H+vux5PcmeTa7v5G\nd/+PJOnuJ5Pck+TyFecBADiQVo2yV3T3w8v930/yih3WXJbkgW3HDy7n/kRVvTjJW7P1aBsAwKFz\nZLcFVfXJJN+1w6X3bj/o7q6qfqYDVNWRJP8lyb/v7vv/jHU3JLkhSY4ePfpMPw0AwGi7Rll3v+np\nrlXVH1TVK7v74ap6ZZKv7rDsoSSv33Z8eZJPbzu+Jcl93f3Tu8xxy7I2Gxsbzzj+AAAmW/Xpy1NJ\nTi73Tyb5xR3W3JHk6qq6ZHmB/9XLuVTVTyZ5UZIfXXEOAIADbdUo+2CSN1fVfUnetBynqjaq6meT\npLsfS/ITSe5ebjd192NVdXm2ngI9nuSeqvpcVf3jFecBADiQqvvgPRO4sbHRm5ub+z0GAMCuqup0\nd2/sts5f9AcAGECUAQAMIMoAAAYQZQAAA4gyAIABRBkAwACiDABgAFEGADCAKAMAGECUAQAMIMoA\nAAYQZQAA
A4gyAIABRBkAwACiDABgAFEGADCAKAMAGECUAQAMIMoAAAYQZQAAA4gyAIABRBkAwACi\nDABgAFEGADCAKAMAGECUAQAMIMoAAAYQZQAAA4gyAIABRBkAwACiDABgAFEGADCAKAMAGECUAQAM\nIMoAAAYQZQAAA4gyAIABRBkAwACiDABgAFEGADCAKAMAGECUAQAMIMoAAAYQZQAAA4gyAIABRBkA\nwACiDABgAFEGADCAKAMAGECUAQAMsFKUVdWlVXVnVd23vL3kadadXNbcV1Und7h+qqp+c5VZAAAO\nslUfKbsxyV3dfWWSu5bjP6WqLk3yviSvS3JVkvdtj7eq+vtJnlhxDgCAA23VKDuR5Nbl/q1J3rbD\nmmuS3Nndj3X340nuTHJtklTVC5P8WJKfXHEOAIADbdUoe0V3P7zc//0kr9hhzWVJHth2/OByLkl+\nIsm/TfKNFecAADjQjuy2oKo+meS7drj03u0H3d1V1ef7iavqryf5S939z6rq2HmsvyHJDUly9OjR\n8/00AAAHwq5R1t1verprVfUHVfXK7n64ql6Z5Ks7LHsoyeu3HV+e5NNJ/laSjar68jLHy6vq0939\n+uygu29JckuSbGxsnHf8AQAcBKs+fXkqybd/m/Jkkl/cYc0dSa6uqkuWF/hfneSO7v6P3f2q7j6W\n5AeS/M7TBRkAwIVu1Sj7YJI3V9V9Sd60HKeqNqrqZ5Okux/L1mvH7l5uNy3nAABYVPfBeyZwY2Oj\nNzc393sMAIBdVdXp7t7YbZ2/6A8AMIAoAwAYQJQBAAwgygAABhBlAAADiDIAgAFEGQDAAKIMAGAA\nUQYAMIAoAwAYQJQBAAwgygAABhBlAAADiDIAgAFEGQDAAKIMAGAAUQYAMIAoAwAYQJQBAAwgygAA\nBhBlAAADiDIAgAFEGQDAAKIMAGAAUQYAMIAoAwAYQJQBAAwgygAABhBlAAADiDIAgAFEGQDAAKIM\nAGAAUQYAMIAoAwAYQJQBAAwgygAABhBlAAADiDIAgAFEGQDAAKIMAGCA6u79nuEZq6pHkvzuGj/k\nS5N8bY0fj/WxN7PZn7nszWz2Z7Z1789f7O6X7bboQEbZulXVZndv7Pcc/P/szWz2Zy57M5v9mW2/\n9sfTlwAAA4gyAIABRNmWW/Z7AJ6WvZnN/sxlb2azP7Pty/54TRkAwAAeKQMAGOCCj7Kquraq7q2q\ns1V14w7XL66qjy/XP1tVx7Zde89y/t6qumYv5z4Mnu3eVNWbq+p0Vf3G8vYNez37YbDK985y/WhV\nPVFV796rmQ+LFX+ufW9V/UpVnVm+h75jL2c/DFb42fbnq+rWZV9+q6res9ezX+jOY2/+TlXdU1VP\nVdV151w7WVX3LbeTz8mA3X3B3pJclOSLSb47yfOS/HqS4+es+SdJ/tNy//okH1/uH1/WX5zkiuXj\nXLTfX9OFcltxb74vyauW+381yUP7/fVcaLdV9mfb9duT/Nck797vr+dCuq34vXMkyeeT/LXl+CV+\nro3an3ckuW25/4IkX05ybL+/pgvldp57cyzJ9yb5z0mu23b+0iT3L28vWe5fsu4ZL/RHyq5Kcra7\n7+/uJ5PcluTEOWtOJLl1uX97kjdWVS3nb+vub3b3l5KcXT4e6/Gs96a7f627v7KcP5Pk+VV18Z5M\nfXis8r2Tqnpbki9la39Yr1X25uokn+/uX0+S7n60u7+1R3MfFqvsTyf5zqo6kuT5SZ5M8od7M/ah\nsOvedPeXu/vzSf74nPe9Jsmd3f1Ydz+e5M4k1657wAs9yi5L8sC24weXczuu6e6nknw9W//1eD7v\ny7O3yt5s98NJ7unubz5Hcx5Wz3p/quqFSX48yQf2YM7DaJXvnb+cpKvqjuUpmn+xB/MeNqvsz+1J\n/ijJw0l+L8lPdfdjz/XAh8gq/67vSRMcWfcHhL1SVa9N8qFs/dc/c7w/yc
3d/cTywBlzHEnyA0m+\nP8k3ktxVVae7+679HYvFVUm+leRV2XqK7H9V1Se7+/79HYu9cqE/UvZQkldvO758ObfjmuUh4xcl\nefQ835dnb5W9SVVdnuQXkvxId3/xOZ/28Fllf16X5MNV9eUkP5rkX1bVu57rgQ+RVfbmwST/s7u/\n1t3fSPLLSf7Gcz7x4bLK/rwjyX/v7v/b3V9N8pkk/ldM67PKv+t70gQXepTdneTKqrqiqp6XrRdU\nnjpnzakk3/4tiuuSfKq3XtV3Ksn1y2/JXJHkyiS/ukdzHwbPem+q6sVJPpHkxu7+zJ5NfLg86/3p\n7h/s7mPdfSzJTyf5N939M3s1+CGwys+1O5J8T1W9YImBv5vkC3s092Gxyv78XpI3JElVfWeSv5nk\nt/dk6sPhfPbm6dyR5OqquqSqLsnWMzR3rH3C/f5tiOf6luTvJfmdbP3GxXuXczcl+aHl/ndk6zfE\nzmYrur572/u+d3m/e5O8Zb+/lgvt9mz3Jsm/ytbrLj637fby/f56LrTbKt872z7G++O3L0ftTZJ/\nkK1fwPjNJB/e76/lQryt8LPthcv5M9mK5X++31/LhXY7j735/mw9ovxH2Xr08sy29/1Hy56dTfIP\nn4v5/EV/AIABLvSnLwEADgRRBgAwgCgDABhAlAEADCDKAAAGEGUAAAOIMgCAAUQZAMAA/w/yZ4yu\nl2HZPwAAAABJRU5ErkJggg==\n",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"oov_train = inject_OOVs(train)\n",
"oov_vocab = set(oov_train)\n",
"oov_test = replace_OOVs(oov_vocab, test)\n",
"bigram = NGramLM(oov_train,2)\n",
"\n",
"interval = [x/1000.0 for x in range(1, 101, 1)]  # 0.001, 0.002, ..., 0.1\n",
"perplexity_at_1 = perplexity(LaplaceLM(bigram, alpha=1.0), oov_test)\n",
"\n",
"def plot_perplexities(interval):\n",
"    \"\"\"Plots the perplexity of LaplaceLM for every alpha in interval.\"\"\"\n",
"    perplexities = [0.0 for alpha in interval]  # todo\n",
"    plt.plot(interval, perplexities)\n",
"\n",
"def find_optimal(low, high, epsilon=1e-6):\n",
"    \"\"\"Returns the optimal pseudo count alpha within the interval [low, high] and the perplexity.\"\"\"\n",
"    print(high, low)\n",
"    if high - low < epsilon:\n",
"        return 0.0  # todo\n",
"    else:\n",
"        return 0.0  # todo\n",
"\n",
"plot_perplexities(interval) \n",
"find_optimal(0.0, 1.0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Task 2: Sanity Check LM\n",
"Implement a method that tests whether a language model provides a valid probability distribution."
]
},
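{
"cell_type": "markdown",
"metadata": {},
"source": [
"What a valid probability distribution means can be sketched on a plain dictionary before turning to a language model. This is a hedged illustration only: `is_valid_distribution`, `good`, and `bad` are hypothetical names, and the check does not use the `lm` package.\n",
"\n",
"```python\n",
"import math\n",
"\n",
"def is_valid_distribution(probabilities, tolerance=1e-6):\n",
"    # True if every value is non-negative and the values sum to 1 within tolerance.\n",
"    values = list(probabilities.values())\n",
"    return all(p >= 0.0 for p in values) and abs(math.fsum(values) - 1.0) < tolerance\n",
"\n",
"good = {'a': 0.5, 'b': 0.3, 'c': 0.2}  # sums to 1\n",
"bad = {'a': 0.5, 'b': 0.6}  # sums to 1.1\n",
"print(is_valid_distribution(good), is_valid_distribution(bad))\n",
"```\n",
"\n",
"For a language model, the same two conditions have to hold over the whole vocabulary for a fixed history."
]
},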
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"ExecuteTime": {
"end_time": "2016-10-21T16:59:19.237379",
"start_time": "2016-10-21T16:59:19.153304"
},
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1.064711557993089\n"
]
}
],
"source": [
"def sanity_check(lm, *history):\n",
"    \"\"\"Raises an AssertionError if lm does not define a valid probability distribution over all words\n",
"    in the vocabulary, given the history.\"\"\"\n",
"    probability_mass = 1.0  # todo\n",
"    assert abs(probability_mass - 1.0) < 1e-6, probability_mass\n",
"\n",
"unigram = NGramLM(oov_train,1)\n",
"stupid = StupidBackoff(bigram, unigram, 0.1)\n",
"print(sum([stupid.probability(word, 'the') for word in stupid.vocab]))\n",
"sanity_check(stupid, 'the')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Task 3: Subtract Count LM\n",
"Develop and implement a language model that subtracts a count $d\\in[0,1]$ from each non-zero count in the training set. Let's first formalize this:\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\\begin{align}\n",
"\\#_{w=0}(h_n) &= \\sum_{w \\in V} \\mathbf{1}[\\counts{\\train}{h_n,w} = 0]\\\\\n",
"\\#_{w>0}(h_n) &= \\sum_{w \\in V} \\mathbf{1}[\\counts{\\train}{h_n,w} > 0]\\\\\n",
"\\prob(w|h_n) &= \n",
"\\begin{cases}\n",
"\\frac{\\counts{\\train}{h_n,w} - d}{\\counts{\\train}{h_n}} & \\mbox{if }\\counts{\\train}{h_n,w} > 0 \\\\\\\\\n",
"\\frac{???}{\\counts{\\train}{h_n}} & \\mbox{otherwise}\n",
"\\end{cases}\n",
"\\end{align}"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"ExecuteTime": {
"end_time": "2016-10-21T16:59:19.337884",
"start_time": "2016-10-21T16:59:19.240468"
},
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.0\n"
]
},
{
"data": {
"text/plain": [
"inf"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"class SubtractCount(CountLM):\n",
"    def __init__(self, base_lm, d):\n",
"        super().__init__(base_lm.vocab, base_lm.order)\n",
"        self.base_lm = base_lm\n",
"        self.d = d\n",
"        self._counts = base_lm._counts  # not good style since it is a protected member\n",
"        self.vocab = base_lm.vocab\n",
"\n",
"    def counts(self, word_and_history):\n",
"        if self._counts[word_and_history] > 0:\n",
"            return 0.0  # todo\n",
"        else:\n",
"            return 0.0  # todo\n",
"\n",
"    def norm(self, history):\n",
"        return self.base_lm.norm(history)\n",
" \n",
"subtract_lm = SubtractCount(unigram, 0.1)\n",
"oov_prob = subtract_lm.probability(OOV, 'the')\n",
"rest_prob = sum([subtract_lm.probability(word, 'the') for word in subtract_lm.vocab])\n",
"print(oov_prob + rest_prob)\n",
"sanity_check(subtract_lm, 'the')\n",
"perplexity(subtract_lm, oov_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Task 4: Normalisation of Stupid LM\n",
"Develop and implement a version of the [stupid backoff language model](https://github.com/uclmr/stat-nlp-book/blob/python/statnlpbook/lm.py#L205) whose probabilities sum to 1."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"ExecuteTime": {
"end_time": "2016-10-21T16:59:19.398354",
"start_time": "2016-10-21T16:59:19.339446"
},
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.0\n"
]
},
{
"data": {
"text/plain": [
"inf"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"class StupidBackoffNormalized(LanguageModel):\n",
"    def __init__(self, main, backoff, alpha):\n",
"        super().__init__(main.vocab, main.order)\n",
"        self.main = main\n",
"        self.backoff = backoff\n",
"        self.alpha = alpha\n",
"\n",
"    def probability(self, word, *history):\n",
"        return 0.0  # todo\n",
" \n",
"less_stupid = StupidBackoffNormalized(bigram, unigram, 0.1)\n",
"print(sum([less_stupid.probability(word, 'the') for word in less_stupid.vocab]))\n",
"sanity_check(less_stupid, 'the')\n",
"perplexity(less_stupid, oov_test)"
]
}
],
"metadata": {
"hide_input": false,
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.2"
}
},
"nbformat": 4,
"nbformat_minor": 1
}