{ "metadata": { "name": "Spelling corrector" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "I recently came across [this article](http://norvig.com/spell-correct.html) by Peter Norvig. It is a very interesting enquiry into the world of spelling correction. In the spirit of learning by doing, I'm going to try to deconstruct what he does and then reuse it and experiment with it inside this notebook. First of all, let's paste the complete code by Peter Norvig in the next cell." ] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "The code" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import re, collections\n", "\n", "def words(text): return re.findall('[a-z]+', text.lower()) \n", "\n", "def train(features):\n", " model = collections.defaultdict(lambda: 1)\n", " for f in features:\n", " model[f] += 1\n", " return model\n", "\n", "NWORDS = train(words(file('big.txt').read()))\n", "\n", "alphabet = 'abcdefghijklmnopqrstuvwxyz'\n", "\n", "def edits1(word):\n", " splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]\n", " deletes = [a + b[1:] for a, b in splits if b]\n", " transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]\n", " replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]\n", " inserts = [a + c + b for a, b in splits for c in alphabet]\n", " return set(deletes + transposes + replaces + inserts)\n", "\n", "def known_edits2(word):\n", " return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)\n", "\n", "def known(words): return set(w for w in words if w in NWORDS)\n", "\n", "def correct(word):\n", " candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]\n", " return max(candidates, key=NWORDS.get)" ], "language": "python", "metadata": {}, "outputs": [ { "ename": "IOError", "evalue": "[Errno 2] No such file or directory: 'big.txt'", "output_type": "pyerr", "traceback": [ 
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[1;31mIOError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[0;32m 9\u001b[0m \u001b[1;32mreturn\u001b[0m \u001b[0mmodel\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 10\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m---> 11\u001b[1;33m \u001b[0mNWORDS\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mtrain\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mwords\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mfile\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m'big.txt'\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mread\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 12\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 13\u001b[0m \u001b[0malphabet\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;34m'abcdefghijklmnopqrstuvwxyz'\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;31mIOError\u001b[0m: [Errno 2] No such file or directory: 'big.txt'" ] } ], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Apparently, it fails, which is no surprise given that I don't have the *big.txt* data file he uses to construct his dictionary of possible words (train his model, as he says). Let's do this step by step. The file we're talking about is [online](http://norvig.com/big.txt), so I'm just gonna use a local version of it." 
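] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before loading the file, one detail of `train` is worth noting: the `defaultdict(lambda: 1)` means that a word never seen in the corpus still gets a count of 1, a crude form of smoothing that prevents unseen words from having zero probability. Here is a minimal sketch of that behaviour (the toy word list is mine, not Norvig's):" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "import collections\n", "\n", "def train(features):\n", "    model = collections.defaultdict(lambda: 1)\n", "    for f in features:\n", "        model[f] += 1\n", "    return model\n", "\n", "toy = train(['the', 'cat', 'the'])\n", "(toy['the'], toy['cat'], toy['dog'])  # 'dog' was never seen, yet it gets a count of 1" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 60 },
{ "cell_type": "markdown", "metadata": {}, "source": [ "With that in mind, let's train on the real corpus."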
] }, { "cell_type": "code", "collapsed": false, "input": [ "%time NWORDS = train(words(file(r'files/big.txt').read()))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "CPU times: user 0.47 s, sys: 0.00 s, total: 0.47 s\n", "Wall time: 0.47 s\n" ] } ], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "That was surprisingly fast, given that the file is about 6 MB of text. Let's check how many distinct words actually end up in the dictionary." ] }, { "cell_type": "code", "collapsed": false, "input": [ "len(NWORDS.keys())" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 3, "text": [ "29157" ] } ], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "How about sampling some of these words?" ] }, { "cell_type": "code", "collapsed": false, "input": [ "NWORDS.keys()[:20]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 4, "text": [ "['nunnery',\n", " 'presnya',\n", " 'woods',\n", " 'clotted',\n", " 'spiders',\n", " 'hanging',\n", " 'disobeying',\n", " 'scold',\n", " 'originality',\n", " 'grenadiers',\n", " 'pigment',\n", " 'appropriation',\n", " 'strictest',\n", " 'bringing',\n", " 'revelers',\n", " 'wooded',\n", " 'wooden',\n", " 'wednesday',\n", " 'shows',\n", " 'immunities']" ] } ], "prompt_number": 4 }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Theory" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I'm going to try to understand the theory Norvig explains in this section. We are trying to find the correction $c$, out of all possible corrections, that maximizes the probability of $c$ given the original input word $w$:\n", "\n", "$$c_0 = \\arg\\max_c P(c | w)$$\n", "\n", "This expression can be expanded using Bayes' theorem, whose derivation is reproduced below as a reminder."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$P(A|B)=\\frac{P(A \\cap B)}{P(B)}, \\text{ if } P(B) \\neq 0, $$\n", "$$P(B|A) = \\frac{P(A \\cap B)}{P(A)}, \\text{ if } P(A) \\neq 0,$$\n", "$$\\implies P(A \\cap B) = P(A|B) P(B) = P(B|A)P(A),$$\n", "$$\\implies P(A|B) = \\frac{P(B|A) P(A)}{P(B)}, \\text{ if } P(B) \\neq 0.$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When we expand the original expression using Bayes' theorem, we obtain:\n", "\n", "$$c_0 = \\arg\\max_c \\frac{P(w | c) P(c)}{P(w)}$$\n", "\n", "Since $P(w)$ does not depend on the candidate correction $c$, it is the same for every candidate and can be dropped from the maximization:\n", "\n", "$$c_0 = \\arg\\max_c P(w | c) P(c)$$\n", "\n", "This leads us to an interesting expression, made up of three terms:\n", "\n", "- $P(c)$: the probability that a given correction appears in the language at hand. As Peter Norvig says, this is the language model.\n", "- $P(w | c)$: the probability that the word $w$ was typed when the intended word was $c$. This is the error model (how do people make typos?).\n", "- $\\arg\\max_c$: the control mechanism. We search for the candidate that maximizes the product of the two other terms over all words in the language model." ] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Building the language model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below is the adapted code, using the file saved to my disk." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import re, collections\n", "\n", "def words(text): return re.findall('[a-z]+', text.lower())\n", "\n", "def train(features):\n", "    model = collections.defaultdict(lambda: 1)\n", "    for f in features:\n", "        model[f] += 1\n", "    return model\n", "\n", "NWORDS = train(words(file(r'files/big.txt').read()))" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 5 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's explore some data that can be found in the dictionary."
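] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A side note first: `NWORDS` stores raw counts, not probabilities. Since the $\\arg\\max$ only compares candidates, the counts can be used directly, but the implied language model $P(c)$ is simply each count divided by the total. A quick sketch of that normalization, using a few counts from this corpus:" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "counts = {'the': 80031, 'of': 40026, 'darling': 39}\n", "total = float(sum(counts.values()))\n", "P_c = dict((w, n / total) for (w, n) in counts.items())\n", "P_c['the'] > P_c['darling']  # frequent words get a higher P(c)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 61 },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Now for some actual counts."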
] }, { "cell_type": "code", "collapsed": false, "input": [ "for word in ['darling', 'eternal', 'flame', 'better', 'host', 'the']:\n", "    print word, NWORDS[word]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "darling 39\n", "eternal 27\n", "flame 16\n", "better 267\n", "host 24\n", "the 80031\n" ] } ], "prompt_number": 6 }, { "cell_type": "markdown", "metadata": {}, "source": [ "A natural question: what are the most frequent words in this dictionary? Let's use Pandas to find out." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd\n", "words_df = pd.DataFrame(NWORDS.items())\n", "words_df.head(6)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
<table border=\"1\" class=\"dataframe\">\n",
"  <tr><th></th><th>0</th><th>1</th></tr>\n",
"  <tr><th>0</th><td>nunnery</td><td>4</td></tr>\n",
"  <tr><th>1</th><td>presnya</td><td>2</td></tr>\n",
"  <tr><th>2</th><td>woods</td><td>23</td></tr>\n",
"  <tr><th>3</th><td>clotted</td><td>2</td></tr>\n",
"  <tr><th>4</th><td>spiders</td><td>2</td></tr>\n",
"  <tr><th>5</th><td>hanging</td><td>43</td></tr>\n",
"</table>\n
" ], "output_type": "pyout", "prompt_number": 7, "text": [ " 0 1\n", "0 nunnery 4\n", "1 presnya 2\n", "2 woods 23\n", "3 clotted 2\n", "4 spiders 2\n", "5 hanging 43" ] } ], "prompt_number": 7 }, { "cell_type": "code", "collapsed": false, "input": [ "words_df.sort(columns=1).tail(5)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
<table border=\"1\" class=\"dataframe\">\n",
"  <tr><th></th><th>0</th><th>1</th></tr>\n",
"  <tr><th>11184</th><td>in</td><td>22051</td></tr>\n",
"  <tr><th>15454</th><td>to</td><td>28767</td></tr>\n",
"  <tr><th>22090</th><td>and</td><td>38314</td></tr>\n",
"  <tr><th>13587</th><td>of</td><td>40026</td></tr>\n",
"  <tr><th>9392</th><td>the</td><td>80031</td></tr>\n",
"</table>\n
" ], "output_type": "pyout", "prompt_number": 8, "text": [ " 0 1\n", "11184 in 22051\n", "15454 to 28767\n", "22090 and 38314\n", "13587 of 40026\n", "9392 the 80031" ] } ], "prompt_number": 8 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now know that the 5 most frequent words in the corpus are \"the\", \"of\", \"and\", \"to\" and \"in\"." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As Peter Norvig explains, the measure to say if a word is far from another word is to use the *edit distance* between them. Here, we can start considering all words that are within 1 edit of a given word for taking into account corrections. As he says:\n", "\n", ">An edit can be a deletion (remove one letter), a transposition (swap adjacent letters), an alteration (change one letter to another) or an insertion (add a letter). \n", "\n", "Peter's function is below." ] }, { "cell_type": "code", "collapsed": false, "input": [ "alphabet = 'abcdefghijklmnopqrstuvwxyz'\n", "def edits1(word):\n", " splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]\n", " deletes = [a + b[1:] for a, b in splits if b]\n", " transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]\n", " replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]\n", " inserts = [a + c + b for a, b in splits for c in alphabet]\n", " return set(deletes + transposes + replaces + inserts)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 9 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see what the output of this function produces." 
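] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before printing the output, we can predict its size. For a word of length $n$ there are $n$ deletions, $n-1$ transpositions, $26n$ replacements and $26(n+1)$ insertions, i.e. $54n+25$ candidate strings before duplicates are removed (this is the count Norvig gives in his article):" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "def n_candidates(n):\n", "    # 54n + 25 candidate strings, counted with duplicates\n", "    return n + (n - 1) + 26 * n + 26 * (n + 1)\n", "\n", "n_candidates(len('bloating'))  # 457 for an 8-letter word" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 62 },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The `set` built by `edits1` will be somewhat smaller, because some candidates coincide (e.g. replacing a letter by itself reproduces the original word)."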
] }, { "cell_type": "code", "collapsed": false, "input": [ "print edits1(\"bloating\")" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "set(['bloatinsg', 'bfloating', 'blozating', 'blouating', 'eloating', 'bloatzing', 'sbloating', 'cloating', 'bloatuing', 'bloaoting', 'bloatikg', 'wloating', 'bloaiting', 'bloacing', 'brloating', 'bloaeing', 'ibloating', 'bloazting', 'bloyating', 'blzoating', 'bmloating', 'bloatinlg', 'bcloating', 'blocting', 'mbloating', 'bloatineg', 'blolating', 'bloatirg', 'bjoating', 'blkoating', 'blonting', 'blooating', 'bloyting', 'byoating', 'blloating', 'bloatinzg', 'blobting', 'ybloating', 'bloatving', 'bloajting', 'bloatinwg', 'broating', 'bliating', 'bzoating', 'blofting', 'bloatqng', 'bloatigg', 'bloatyng', 'pbloating', 'bloatign', 'bloatqing', 'bgoating', 'bloatxing', 'bloatying', 'bloaiing', 'blofating', 'bloatiug', 'bolating', 'bvloating', 'blowating', 'bloiating', 'lbloating', 'bloqating', 'bloatiqng', 'blogating', 'bloatcng', 'bloatbing', 'bboating', 'blyoating', 'bloateng', 'blqating', 'zbloating', 'bloatung', 'blgating', 'bloatisng', 'bloatging', 'cbloating', 'bloatinyg', 'bloatisg', 'bloatmng', 'bloatding', 'bloatifng', 'bloatting', 'bloatiwng', 'bloatfing', 'blotating', 'bloatsng', 'bloatina', 'bloatinc', 'bloatinb', 'bloatine', 'bloatind', 'bloating', 'bloatinf', 'bloatini', 'bloatinh', 'bloatink', 'bloatinj', 'bloatinm', 'bloatinl', 'bloatino', 'bloatinn', 'bloatinq', 'bloatinp', 'bloatins', 'bloatinr', 'bloatinu', 'bloatint', 'bloatinw', 'bloatinv', 'bloatiny', 'bloatinx', 'bloatinz', 'abloating', 'bloaticg', 'blcoating', 'blovating', 'bloaring', 'jloating', 'bloaping', 'bxloating', 'lboating', 'xbloating', 'bloatiing', 'bloatjing', 'bgloating', 'bloagting', 'nbloating', 'bloatinkg', 'bloatwing', 'bllating', 'bloatixg', 'bloatwng', 'buoating', 'bloazing', 'blosting', 'bloatang', 'bloatfng', 'bloatgng', 'fbloating', 'bloatong', 'blfoating', 'bloatiqg', 
'boloating', 'blating', 'bloawting', 'bloayting', 'blotaing', 'blroating', 'bloafing', 'blmating', 'bloatimg', 'rbloating', 'bloqting', 'bzloating', 'bleoating', 'blcating', 'blooting', 'bloaming', 'bloatinrg', 'bwoating', 'uloating', 'bloatiog', 'bloabting', 'bloatijg', 'bloatling', 'bloading', 'bloatning', 'bloaqting', 'blomting', 'bloatitg', 'blfating', 'bloatsing', 'boating', 'bloaking', 'blnating', 'loating', 'btloating', 'bloathing', 'bloatinog', 'buloating', 'bloatindg', 'bpoating', 'bloaxing', 'bhoating', 'bloatinqg', 'blvoating', 'ploating', 'baloating', 'bloatdng', 'bloatinug', 'bloaving', 'nloating', 'bloatiig', 'bloattng', 'bnoating', 'blodating', 'bloaxting', 'blsating', 'bloatvng', 'bloatkng', 'bloatifg', 'bloatinvg', 'bloatrng', 'blomating', 'aloating', 'bloeting', 'bloathng', 'bnloating', 'bmoating', 'bleating', 'blnoating', 'blqoating', 'bloaeting', 'dbloating', 'bloatinjg', 'btoating', 'bvoating', 'bloatiyng', 'bloatintg', 'bloateing', 'bloatring', 'bloaling', 'blotting', 'bloeating', 'blojating', 'blosating', 'bloatipg', 'bloabing', 'bloatirng', 'byloating', 'bloatnng', 'bloatincg', 'blorting', 'blhoating', 'obloating', 'oloating', 'bljoating', 'blobating', 'bloatnig', 'qloating', 'hbloating', 'bloatinxg', 'bloatlng', 'bloatipng', 'bloativng', 'blopating', 'bxoating', 'vloating', 'bloatizng', 'blouting', 'bloatieg', 'blogting', 'jbloating', 'biloating', 'bloawing', 'bloatitng', 'bloatibg', 'blowting', 'blpating', 'bwloating', 'xloating', 'bloaating', 'blwoating', 'bluating', 'bjloating', 'tbloating', 'blojting', 'bloamting', 'bloakting', 'mloating', 'blboating', 'bloatinng', 'bioating', 'bloatingc', 'bloatingb', 'bloatinga', 'bloatingg', 'bloauing', 'bloatinge', 'bloatingd', 'bloatingk', 'bloatingj', 'bloatingi', 'bloatingh', 'bloatingo', 'bloatingn', 'bloatingm', 'bloatingl', 'kbloating', 'bloatingr', 'bloatingq', 'bloatxng', 'bloatingw', 'bloatingv', 'bloatingu', 'bloatbng', 'bloatingz', 'bloatingy', 'bloatingx', 'blonating', 'blaoting', 
'blorating', 'bloatixng', 'zloating', 'beoating', 'bloatibng', 'bloahing', 'bloatilg', 'tloating', 'bloiting', 'bbloating', 'bloatiag', 'bloting', 'wbloating', 'bloasing', 'blwating', 'booating', 'blxating', 'blokating', 'ebloating', 'bloalting', 'blocating', 'bloaqing', 'bdloating', 'bloatinig', 'blbating', 'bloauting', 'ubloating', 'vbloating', 'bloatijng', 'bloatiwg', 'bltating', 'bloatilng', 'bloaning', 'bloapting', 'bkoating', 'bloatzng', 'hloating', 'blolting', 'bloarting', 'bloxting', 'bfoating', 'blokting', 'iloating', 'bloatpng', 'blaating', 'blmoating', 'blhating', 'gloating', 'bloatng', 'bloatin', 'bloatinpg', 'blioating', 'bloatig', 'bloatihng', 'bloatihg', 'bqoating', 'bloacting', 'blohating', 'rloating', 'bldoating', 'beloating', 'blopting', 'bloataing', 'blxoating', 'bloatizg', 'blaoating', 'bloatinmg', 'bluoating', 'blpoating', 'bloatming', 'bloafting', 'blsoating', 'blkating', 'bloatingf', 'bsoating', 'bloatking', 'gbloating', 'bkloating', 'bloatinbg', 'bcoating', 'blzating', 'bldating', 'bloasting', 'blrating', 'sloating', 'bljating', 'lloating', 'blozting', 'bloationg', 'bloaing', 'kloating', 'bloahting', 'bloatiung', 'bloatings', 'bloadting', 'bloaying', 'bloatingp', 'yloating', 'bloanting', 'bloajing', 'bloatingt', 'bsloating', 'blyating', 'blvating', 'bloxating', 'bloaticng', 'bloatping', 'bloaging', 'bloatiang', 'bloaaing', 'bloatidng', 'bloatidg', 'bloatcing', 'bhloating', 'bploating', 'bloatinhg', 'blodting', 'bloatiyg', 'bloaoing', 'floating', 'bdoating', 'dloating', 'bqloating', 'bloatjng', 'bloavting', 'baoating', 'blohting', 'blovting', 'bloativg', 'bloatoing', 'bltoating', 'bloatieng', 'blgoating', 'bloatinag', 'bloatikng', 'bloatinfg', 'bloatimng', 'qbloating', 'bloaitng', 'bloatigng'])\n" ] } ], "prompt_number": 10 }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's a number of modifications of the original word! Let's see how this was done. First, let's look at the splitting done in the code. 
These splits make it easy to generate all of the one-edit modifications of the word." ] }, { "cell_type": "code", "collapsed": false, "input": [ "word = 'bloating'\n", "splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]\n", "print splits" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[('', 'bloating'), ('b', 'loating'), ('bl', 'oating'), ('blo', 'ating'), ('bloa', 'ting'), ('bloat', 'ing'), ('bloati', 'ng'), ('bloatin', 'g'), ('bloating', '')]\n" ] } ], "prompt_number": 11 }, { "cell_type": "markdown", "metadata": {}, "source": [ "From there it's easy to code deletions of one letter:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "deletes = [(a + b[1:]) for (a, b) in splits if b != '']\n", "print deletes" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "['loating', 'boating', 'blating', 'bloting', 'bloaing', 'bloatng', 'bloatig', 'bloatin']\n" ] } ], "prompt_number": 12 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then transpositions." ] }, { "cell_type": "code", "collapsed": false, "input": [ "transposes = [(a + b[1] + b[0] + b[2:]) for (a, b) in splits if len(b) > 1]\n", "print transposes" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "['lboating', 'bolating', 'blaoting', 'blotaing', 'bloaitng', 'bloatnig', 'bloatign']\n" ] } ], "prompt_number": 13 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then alterations (replacing one letter with another)."
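] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A quick prediction before generating the replacement list: each of the $n$ positions is combined with all 26 letters, so the list has exactly $26n$ entries, and it contains the original word itself whenever a letter is replaced by itself (8 times here, since all letters of *bloating* are distinct). A small check, recomputing `splits` so the cell is self-contained:" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "word = 'bloating'\n", "alphabet = 'abcdefghijklmnopqrstuvwxyz'\n", "splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]\n", "replaces = [(a + c + b[1:]) for (a, b) in splits for c in alphabet if b]\n", "(len(replaces), replaces.count(word))  # 26 * 8 = 208 candidates, 8 of which equal 'bloating'" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 63 },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Here is the list itself."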
] }, { "cell_type": "code", "collapsed": false, "input": [ "alphabet = 'abcdefghijklmnopqrstuvwxyz'\n", "replaces = [(a + c + b[1:]) for (a, b) in splits for c in alphabet if b != '']\n", "print replaces" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "['aloating', 'bloating', 'cloating', 'dloating', 'eloating', 'floating', 'gloating', 'hloating', 'iloating', 'jloating', 'kloating', 'lloating', 'mloating', 'nloating', 'oloating', 'ploating', 'qloating', 'rloating', 'sloating', 'tloating', 'uloating', 'vloating', 'wloating', 'xloating', 'yloating', 'zloating', 'baoating', 'bboating', 'bcoating', 'bdoating', 'beoating', 'bfoating', 'bgoating', 'bhoating', 'bioating', 'bjoating', 'bkoating', 'bloating', 'bmoating', 'bnoating', 'booating', 'bpoating', 'bqoating', 'broating', 'bsoating', 'btoating', 'buoating', 'bvoating', 'bwoating', 'bxoating', 'byoating', 'bzoating', 'blaating', 'blbating', 'blcating', 'bldating', 'bleating', 'blfating', 'blgating', 'blhating', 'bliating', 'bljating', 'blkating', 'bllating', 'blmating', 'blnating', 'bloating', 'blpating', 'blqating', 'blrating', 'blsating', 'bltating', 'bluating', 'blvating', 'blwating', 'blxating', 'blyating', 'blzating', 'bloating', 'blobting', 'blocting', 'blodting', 'bloeting', 'blofting', 'blogting', 'blohting', 'bloiting', 'blojting', 'blokting', 'blolting', 'blomting', 'blonting', 'blooting', 'blopting', 'bloqting', 'blorting', 'blosting', 'blotting', 'blouting', 'blovting', 'blowting', 'bloxting', 'bloyting', 'blozting', 'bloaaing', 'bloabing', 'bloacing', 'bloading', 'bloaeing', 'bloafing', 'bloaging', 'bloahing', 'bloaiing', 'bloajing', 'bloaking', 'bloaling', 'bloaming', 'bloaning', 'bloaoing', 'bloaping', 'bloaqing', 'bloaring', 'bloasing', 'bloating', 'bloauing', 'bloaving', 'bloawing', 'bloaxing', 'bloaying', 'bloazing', 'bloatang', 'bloatbng', 'bloatcng', 'bloatdng', 'bloateng', 'bloatfng', 'bloatgng', 'bloathng', 'bloating', 'bloatjng', 
'bloatkng', 'bloatlng', 'bloatmng', 'bloatnng', 'bloatong', 'bloatpng', 'bloatqng', 'bloatrng', 'bloatsng', 'bloattng', 'bloatung', 'bloatvng', 'bloatwng', 'bloatxng', 'bloatyng', 'bloatzng', 'bloatiag', 'bloatibg', 'bloaticg', 'bloatidg', 'bloatieg', 'bloatifg', 'bloatigg', 'bloatihg', 'bloatiig', 'bloatijg', 'bloatikg', 'bloatilg', 'bloatimg', 'bloating', 'bloatiog', 'bloatipg', 'bloatiqg', 'bloatirg', 'bloatisg', 'bloatitg', 'bloatiug', 'bloativg', 'bloatiwg', 'bloatixg', 'bloatiyg', 'bloatizg', 'bloatina', 'bloatinb', 'bloatinc', 'bloatind', 'bloatine', 'bloatinf', 'bloating', 'bloatinh', 'bloatini', 'bloatinj', 'bloatink', 'bloatinl', 'bloatinm', 'bloatinn', 'bloatino', 'bloatinp', 'bloatinq', 'bloatinr', 'bloatins', 'bloatint', 'bloatinu', 'bloatinv', 'bloatinw', 'bloatinx', 'bloatiny', 'bloatinz']\n" ] } ], "prompt_number": 14 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, inserts." ] }, { "cell_type": "code", "collapsed": false, "input": [ "alphabet = 'abcdefghijklmnopqrstuvwxyz'\n", "inserts = [(a + c + b) for (a, b) in splits for c in alphabet]\n", "print inserts" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "['abloating', 'bbloating', 'cbloating', 'dbloating', 'ebloating', 'fbloating', 'gbloating', 'hbloating', 'ibloating', 'jbloating', 'kbloating', 'lbloating', 'mbloating', 'nbloating', 'obloating', 'pbloating', 'qbloating', 'rbloating', 'sbloating', 'tbloating', 'ubloating', 'vbloating', 'wbloating', 'xbloating', 'ybloating', 'zbloating', 'baloating', 'bbloating', 'bcloating', 'bdloating', 'beloating', 'bfloating', 'bgloating', 'bhloating', 'biloating', 'bjloating', 'bkloating', 'blloating', 'bmloating', 'bnloating', 'boloating', 'bploating', 'bqloating', 'brloating', 'bsloating', 'btloating', 'buloating', 'bvloating', 'bwloating', 'bxloating', 'byloating', 'bzloating', 'blaoating', 'blboating', 'blcoating', 'bldoating', 'bleoating', 'blfoating', 'blgoating', 
'blhoating', 'blioating', 'bljoating', 'blkoating', 'blloating', 'blmoating', 'blnoating', 'blooating', 'blpoating', 'blqoating', 'blroating', 'blsoating', 'bltoating', 'bluoating', 'blvoating', 'blwoating', 'blxoating', 'blyoating', 'blzoating', 'bloaating', 'blobating', 'blocating', 'blodating', 'bloeating', 'blofating', 'blogating', 'blohating', 'bloiating', 'blojating', 'blokating', 'blolating', 'blomating', 'blonating', 'blooating', 'blopating', 'bloqating', 'blorating', 'blosating', 'blotating', 'blouating', 'blovating', 'blowating', 'bloxating', 'bloyating', 'blozating', 'bloaating', 'bloabting', 'bloacting', 'bloadting', 'bloaeting', 'bloafting', 'bloagting', 'bloahting', 'bloaiting', 'bloajting', 'bloakting', 'bloalting', 'bloamting', 'bloanting', 'bloaoting', 'bloapting', 'bloaqting', 'bloarting', 'bloasting', 'bloatting', 'bloauting', 'bloavting', 'bloawting', 'bloaxting', 'bloayting', 'bloazting', 'bloataing', 'bloatbing', 'bloatcing', 'bloatding', 'bloateing', 'bloatfing', 'bloatging', 'bloathing', 'bloatiing', 'bloatjing', 'bloatking', 'bloatling', 'bloatming', 'bloatning', 'bloatoing', 'bloatping', 'bloatqing', 'bloatring', 'bloatsing', 'bloatting', 'bloatuing', 'bloatving', 'bloatwing', 'bloatxing', 'bloatying', 'bloatzing', 'bloatiang', 'bloatibng', 'bloaticng', 'bloatidng', 'bloatieng', 'bloatifng', 'bloatigng', 'bloatihng', 'bloatiing', 'bloatijng', 'bloatikng', 'bloatilng', 'bloatimng', 'bloatinng', 'bloationg', 'bloatipng', 'bloatiqng', 'bloatirng', 'bloatisng', 'bloatitng', 'bloatiung', 'bloativng', 'bloatiwng', 'bloatixng', 'bloatiyng', 'bloatizng', 'bloatinag', 'bloatinbg', 'bloatincg', 'bloatindg', 'bloatineg', 'bloatinfg', 'bloatingg', 'bloatinhg', 'bloatinig', 'bloatinjg', 'bloatinkg', 'bloatinlg', 'bloatinmg', 'bloatinng', 'bloatinog', 'bloatinpg', 'bloatinqg', 'bloatinrg', 'bloatinsg', 'bloatintg', 'bloatinug', 'bloatinvg', 'bloatinwg', 'bloatinxg', 'bloatinyg', 'bloatinzg', 'bloatinga', 'bloatingb', 'bloatingc', 'bloatingd', 
'bloatinge', 'bloatingf', 'bloatingg', 'bloatingh', 'bloatingi', 'bloatingj', 'bloatingk', 'bloatingl', 'bloatingm', 'bloatingn', 'bloatingo', 'bloatingp', 'bloatingq', 'bloatingr', 'bloatings', 'bloatingt', 'bloatingu', 'bloatingv', 'bloatingw', 'bloatingx', 'bloatingy', 'bloatingz']\n" ] } ], "prompt_number": 15 }, { "cell_type": "code", "collapsed": false, "input": [ "possibilities = set(deletes + transposes + replaces + inserts)\n", "print possibilities" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "set(['bloatinsg', 'bfloating', 'blozating', 'blouating', 'eloating', 'bloatzing', 'sbloating', 'cloating', 'bloatuing', 'bloaoting', 'bloatikg', 'wloating', 'bloaiting', 'bloacing', 'brloating', 'bloaeing', 'ibloating', 'bloazting', 'bloyating', 'blzoating', 'bmloating', 'bloatinlg', 'bcloating', 'blocting', 'mbloating', 'bloatineg', 'blolating', 'bloatirg', 'bjoating', 'blkoating', 'blonting', 'blooating', 'bloyting', 'byoating', 'blloating', 'bloatinzg', 'blobting', 'ybloating', 'bloatving', 'bloajting', 'bloatinwg', 'broating', 'bliating', 'bzoating', 'blofting', 'bloatqng', 'bloatigg', 'bloatyng', 'pbloating', 'bloatign', 'bloatqing', 'bgoating', 'bloatxing', 'bloatying', 'bloaiing', 'blofating', 'bloatiug', 'bolating', 'bvloating', 'blowating', 'bloiating', 'lbloating', 'bloqating', 'bloatiqng', 'blogating', 'bloatcng', 'bloatbing', 'bboating', 'blyoating', 'bloateng', 'blqating', 'zbloating', 'bloatung', 'blgating', 'bloatisng', 'bloatging', 'cbloating', 'bloatinyg', 'bloatisg', 'bloatmng', 'bloatding', 'bloatifng', 'bloatting', 'bloatiwng', 'bloatfing', 'blotating', 'bloatsng', 'bloatina', 'bloatinc', 'bloatinb', 'bloatine', 'bloatind', 'bloating', 'bloatinf', 'bloatini', 'bloatinh', 'bloatink', 'bloatinj', 'bloatinm', 'bloatinl', 'bloatino', 'bloatinn', 'bloatinq', 'bloatinp', 'bloatins', 'bloatinr', 'bloatinu', 'bloatint', 'bloatinw', 'bloatinv', 'bloatiny', 'bloatinx', 'bloatinz', 
'abloating', 'bloaticg', 'blcoating', 'blovating', 'bloaring', 'jloating', 'bloaping', 'bxloating', 'lboating', 'xbloating', 'bloatiing', 'bloatjing', 'bgloating', 'bloagting', 'nbloating', 'bloatinkg', 'bloatwing', 'bllating', 'bloatixg', 'bloatwng', 'buoating', 'bloazing', 'blosting', 'bloatang', 'bloatfng', 'bloatgng', 'fbloating', 'bloatong', 'blfoating', 'bloatiqg', 'boloating', 'blating', 'bloawting', 'bloayting', 'blotaing', 'blroating', 'bloafing', 'blmating', 'bloatimg', 'rbloating', 'bloqting', 'bzloating', 'bleoating', 'blcating', 'blooting', 'bloaming', 'bloatinrg', 'bwoating', 'uloating', 'bloatiog', 'bloabting', 'bloatijg', 'bloatling', 'bloading', 'bloatning', 'bloaqting', 'blomting', 'bloatitg', 'blfating', 'bloatsing', 'boating', 'bloaking', 'blnating', 'loating', 'btloating', 'bloathing', 'bloatinog', 'buloating', 'bloatindg', 'bpoating', 'bloaxing', 'bhoating', 'bloatinqg', 'blvoating', 'ploating', 'baloating', 'bloatdng', 'bloatinug', 'bloaving', 'nloating', 'bloatiig', 'bloattng', 'bnoating', 'blodating', 'bloaxting', 'blsating', 'bloatvng', 'bloatkng', 'bloatifg', 'bloatinvg', 'bloatrng', 'blomating', 'aloating', 'bloeting', 'bloathng', 'bnloating', 'bmoating', 'bleating', 'blnoating', 'blqoating', 'bloaeting', 'dbloating', 'bloatinjg', 'btoating', 'bvoating', 'bloatiyng', 'bloatintg', 'bloateing', 'bloatring', 'bloaling', 'blotting', 'bloeating', 'blojating', 'blosating', 'bloatipg', 'bloabing', 'bloatirng', 'byloating', 'bloatnng', 'bloatincg', 'blorting', 'blhoating', 'obloating', 'oloating', 'bljoating', 'blobating', 'bloatnig', 'qloating', 'hbloating', 'bloatinxg', 'bloatlng', 'bloatipng', 'bloativng', 'blopating', 'bxoating', 'vloating', 'bloatizng', 'blouting', 'bloatieg', 'blogting', 'jbloating', 'biloating', 'bloawing', 'bloatitng', 'bloatibg', 'blowting', 'blpating', 'bwloating', 'xloating', 'bloaating', 'blwoating', 'bluating', 'bjloating', 'tbloating', 'blojting', 'bloamting', 'bloakting', 'mloating', 'blboating', 'bloatinng', 
'bioating', 'bloatingc', 'bloatingb', 'bloatinga', 'bloatingg', 'bloauing', 'bloatinge', 'bloatingd', 'bloatingk', 'bloatingj', 'bloatingi', 'bloatingh', 'bloatingo', 'bloatingn', 'bloatingm', 'bloatingl', 'kbloating', 'bloatingr', 'bloatingq', 'bloatxng', 'bloatingw', 'bloatingv', 'bloatingu', 'bloatbng', 'bloatingz', 'bloatingy', 'bloatingx', 'blonating', 'blaoting', 'blorating', 'bloatixng', 'zloating', 'beoating', 'bloatibng', 'bloahing', 'bloatilg', 'tloating', 'bloiting', 'bbloating', 'bloatiag', 'bloting', 'wbloating', 'bloasing', 'blwating', 'booating', 'blxating', 'blokating', 'ebloating', 'bloalting', 'blocating', 'bloaqing', 'bdloating', 'bloatinig', 'blbating', 'bloauting', 'ubloating', 'vbloating', 'bloatijng', 'bloatiwg', 'bltating', 'bloatilng', 'bloaning', 'bloapting', 'bkoating', 'bloatzng', 'hloating', 'blolting', 'bloarting', 'bloxting', 'bfoating', 'blokting', 'iloating', 'bloatpng', 'blaating', 'blmoating', 'blhating', 'gloating', 'bloatng', 'bloatin', 'bloatinpg', 'blioating', 'bloatig', 'bloatihng', 'bloatihg', 'bqoating', 'bloacting', 'blohating', 'rloating', 'bldoating', 'beloating', 'blopting', 'bloataing', 'blxoating', 'bloatizg', 'blaoating', 'bloatinmg', 'bluoating', 'blpoating', 'bloatming', 'bloafting', 'blsoating', 'blkating', 'bloatingf', 'bsoating', 'bloatking', 'gbloating', 'bkloating', 'bloatinbg', 'bcoating', 'blzating', 'bldating', 'bloasting', 'blrating', 'sloating', 'bljating', 'lloating', 'blozting', 'bloationg', 'bloaing', 'kloating', 'bloahting', 'bloatiung', 'bloatings', 'bloadting', 'bloaying', 'bloatingp', 'yloating', 'bloanting', 'bloajing', 'bloatingt', 'bsloating', 'blyating', 'blvating', 'bloxating', 'bloaticng', 'bloatping', 'bloaging', 'bloatiang', 'bloaaing', 'bloatidng', 'bloatidg', 'bloatcing', 'bhloating', 'bploating', 'bloatinhg', 'blodting', 'bloatiyg', 'bloaoing', 'floating', 'bdoating', 'dloating', 'bqloating', 'bloatjng', 'bloavting', 'baoating', 'blohting', 'blovting', 'bloativg', 'bloatoing', 
'bltoating', 'bloatieng', 'blgoating', 'bloatinag', 'bloatikng', 'bloatinfg', 'bloatimng', 'qbloating', 'bloaitng', 'bloatigng'])\n" ] } ], "prompt_number": 16 }, { "cell_type": "code", "collapsed": false, "input": [ "len(possibilities)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 17, "text": [ "442" ] } ], "prompt_number": 17 }, { "cell_type": "code", "collapsed": false, "input": [ "%time edits1('bloating');" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s\n", "Wall time: 0.00 s\n" ] } ], "prompt_number": 18 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Apparently, this function is very fast. If we go one step further and apply it recursively, we obtain the set of words at edit distance 2." ] }, { "cell_type": "code", "collapsed": false, "input": [ "def edits2(word):\n", "    return set(e2 for e1 in edits1(word) for e2 in edits1(e1))" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 19 }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, this nested syntax is a little hard for me to read. I'd rather write it like this." ] }, { "cell_type": "code", "collapsed": false, "input": [ "one_distance_words = edits1('bloating')\n", "two_distance_words = set([])\n", "for word in one_distance_words:\n", "    two_distance_words = two_distance_words.union(edits1(word))\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 40 }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many words are in the edit-distance-2 set for *bloating*?" 
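] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before counting them, a small aside: the loop above rebuilds the set with `union` at every iteration, while `set.update` does the same thing in place, which is the more idiomatic (and cheaper) spelling. A self-contained sketch, using only deletions as a toy stand-in for `edits1`:" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "def deletions(word):\n", "    # toy stand-in for edits1: only one-letter deletions\n", "    return set(word[:i] + word[i+1:] for i in range(len(word)))\n", "\n", "acc = set()\n", "for w in deletions('bloating'):\n", "    acc.update(deletions(w))  # mutates acc instead of rebuilding it\n", "len(acc)  # C(8, 2) = 28 distinct six-letter strings" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 64 },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Now, back to the actual count."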
] }, { "cell_type": "code", "collapsed": false, "input": [ "len(two_distance_words)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 41, "text": [ "90902" ] } ], "prompt_number": 41 }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, following Peter Norvig, we only keep the known words among the candidates up to edit distance 2. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "two_distance_words = [word for word in two_distance_words if word in NWORDS]\n", "len(two_distance_words)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 43, "text": [ "19" ] } ], "prompt_number": 43 }, { "cell_type": "code", "collapsed": false, "input": [ "print(two_distance_words)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "['loafing', 'clotting', 'blowing', 'beating', 'locating', 'coating', 'bleating', 'floating', 'blocking', 'blaming', 'boasting', 'loading', 'looting', 'loathing', 'blazing', 'blotting', 'blooming', 'plating', 'blasting']\n" ] } ], "prompt_number": 44 }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Interlude: sets" ] }, { "cell_type": "code", "collapsed": false, "input": [ "test_set = set()\n", "test_set" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 30, "text": [ "set([])" ] } ], "prompt_number": 30 }, { "cell_type": "code", "collapsed": false, "input": [ "test_set2 = set([1, 2, 2, 3])\n", "test_set2" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 37, "text": [ "set([1, 2, 3])" ] } ], "prompt_number": 37 }, { "cell_type": "code", "collapsed": false, "input": [ "test_set = test_set.union(test_set2)\n", "test_set" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 39, "text": [ "set([1, 2, 
3])" ] } ], "prompt_number": 39 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Back to the point" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def edits2(word):\n", " one_distance_words = edits1(word)\n", " two_distance_words = set([])\n", " for word in one_distance_words:\n", " two_distance_words = two_distance_words.union(edits1(word))\n", " return two_distance_words" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 50 }, { "cell_type": "code", "collapsed": false, "input": [ "ed2 = edits2('bloating')\n", "len(ed2)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 51, "text": [ "90902" ] } ], "prompt_number": 51 }, { "cell_type": "code", "collapsed": false, "input": [ "%time edits2('bloating');" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "CPU times: user 3.64 s, sys: 0.00 s, total: 3.64 s\n", "Wall time: 3.64 s\n" ] } ], "prompt_number": 53 }, { "cell_type": "code", "collapsed": false, "input": [ "def known_edits2(word):\n", " one_distance_words = edits1(word)\n", " two_distance_words = set([])\n", " for word in one_distance_words:\n", " two_distance_words = two_distance_words.union(edits1(word))\n", " return [word for word in two_distance_words if word in NWORDS]" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 56 }, { "cell_type": "code", "collapsed": false, "input": [ "%time known_edits2('bloating');" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "CPU times: user 3.72 s, sys: 0.00 s, total: 3.72 s\n", "Wall time: 3.72 s\n" ] } ], "prompt_number": 57 }, { "cell_type": "markdown", "metadata": {}, "source": [ "What if we compare with Norvig's function?" 
] }, { "cell_type": "code", "collapsed": false, "input": [ "def edits2_norvig(word):\n", "    return set(e2 for e1 in edits1(word) for e2 in edits1(e1))" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 58 }, { "cell_type": "code", "collapsed": false, "input": [ "%time edits2_norvig('bloating');" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "CPU times: user 0.26 s, sys: 0.00 s, total: 0.26 s\n", "Wall time: 0.26 s\n" ] } ], "prompt_number": 59 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Damn! Why? I have a clue: maybe it's all the unions performed during the loop, since every `union` call builds a brand new set. Why not accumulate one big list and build a single set at the end? That's essentially what Peter does, except that his double loop comprehension feeds the candidates straight to `set` without the intermediate list." ] }, { "cell_type": "code", "collapsed": false, "input": [ "def edits2(word):\n", "    one_distance_words = edits1(word)\n", "    two_distance_words = []\n", "    for word in one_distance_words:\n", "        two_distance_words += list(edits1(word))\n", "    return set(two_distance_words)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 65 }, { "cell_type": "code", "collapsed": false, "input": [ "%time edits2('bloating');" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "CPU times: user 0.23 s, sys: 0.00 s, total: 0.23 s\n", "Wall time: 0.23 s\n" ] } ], "prompt_number": 66 }, { "cell_type": "code", "collapsed": false, "input": [ "len(edits2('bloating')), len(edits2_norvig('bloating'))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 69, "text": [ "(90902, 90902)" ] } ], "prompt_number": 69 }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Last steps and we can correct stuff" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def known(words): return set(w for w in words if w in NWORDS)\n", "\n", "def 
correct(word, verbose=False):\n", "    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]\n", "    if verbose: \n", "        print len(candidates)\n", "        if len(candidates) < 20:\n", "            print candidates\n", "    return max(candidates, key=NWORDS.get)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 74 }, { "cell_type": "code", "collapsed": false, "input": [ "correct('treheb', verbose=True)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "2\n", "['trees', 'tree']\n" ] }, { "output_type": "pyout", "prompt_number": 77, "text": [ "'trees'" ] } ], "prompt_number": 77 }, { "cell_type": "code", "collapsed": false, "input": [ "correct('narh')" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 78, "text": [ "'are'" ] } ], "prompt_number": 78 }, { "cell_type": "markdown", "metadata": {}, "source": [ "What about this 'or' syntax?" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print known(['brurr'])\n", "known(['brlur']) or ['true']" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "set([])\n" ] }, { "output_type": "pyout", "prompt_number": 84, "text": [ "['true']" ] } ], "prompt_number": 84 }, { "cell_type": "code", "collapsed": false, "input": [ "bool(set([]))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 85, "text": [ "False" ] } ], "prompt_number": 85 }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the cells above: `or` does not turn anything into a boolean; it short-circuits and returns its first truthy operand unchanged. Since an empty set is falsy, the chain in `correct` tries each candidate generator in turn and stops at the first one that yields something, falling back to `[word]` at the end. Clever coding, but it took me a while to figure out."
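] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To convince myself, here is a minimal sketch (the `candidates` helper and the stage names are mine, not Norvig's): each stage records that it ran before returning its value, so we can see that `or` stops as soon as one stage returns a non-empty set." ] }, { "cell_type": "code", "collapsed": false, "input": [ "calls = []\n", "\n", "def candidates(name, value):\n", "    calls.append(name)  # record that this stage actually ran\n", "    return value\n", "\n", "# mimics: known([word]) or known(edits1(word)) or known_edits2(word) or [word]\n", "result = (candidates('exact', set())\n", "          or candidates('edits1', set(['tree', 'trees']))\n", "          or candidates('edits2', set(['never'])))\n", "print(result)\n", "print(calls)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 86 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The 'edits2' stage never runs, and `result` is the raw set returned by the 'edits1' stage, not a boolean."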
] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Shortcomings of Peter's article" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While discussing this with a colleague, we figured out that the code was actually straightforward to understand. Basically, it boils down to \"build a list of candidates within a small edit distance of the input and pick the one with the highest frequency in the language\". However, one important conceptual point that isn't really elaborated upon is that the edit distance has an interesting property: it works both ways. If you take the word *bloat*, *broad* is among its distance-2 words. Conversely, if you use *broad* as the input, you will also find *bloat* in its distance-2 list. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "print known_edits2('bloat')" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "['boost', 'plot', 'blest', 'lost', 'broad', 'boa', 'boast', 'bout', 'blurt', 'gloat', 'plat', 'blown', 'blows', 'boats', 'flat', 'blast', 'bolt', 'lout', 'aloft', 'blood', 'coat', 'bloom', 'clot', 'loath', 'afloat', 'brat', 'float', 'lot', 'groat', 'loot', 'sloat', 'bloated', 'blunt', 'goat', 'slot', 'cloak', 'boat', 'load', 'loaf', 'loam', 'loan', 'bloch', 'block', 'bat', 'bloke', 'bleak', 'beat', 'blow', 'blot', 'bloc', 'flout', 'boot']\n" ] } ], "prompt_number": 87 }, { "cell_type": "code", "collapsed": false, "input": [ "print known_edits2('broad')" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "['crowd', 'bred', 'bryan', 'broad', 'boa', 'rod', 'browed', 'proud', 'trod', 'brow', 'broaden', 'broader', 'breed', 'read', 'broke', 'abroad', 'blood', 'bored', 'roan', 'broadly', 'brat', 'brag', 'board', 'groan', 'groat', 'dread', 'bond', 'briar', 'brian', 'boat', 'load', 'roads', 'break', 'bread', 'brown', 'brows', 
'brood', 'broom', 'brook', 'tread', 'bad', 'bold', 'roar', 'roam', 'road', 'bead', 'brand']\n" ] } ], "prompt_number": 88 }, { "cell_type": "markdown", "metadata": {}, "source": [ "I find this symmetry very central and beautiful, yet he doesn't spend any time describing it. " ] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Conclusions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Things I learnt:\n", "\n", "- double loop list comprehensions\n", "- generator expressions (list comprehensions without brackets) \n", "- `or` chains short-circuit and return their first truthy operand, which makes fallback expressions possible\n", "- Bayes' theorem and that probabilistic math is not necessarily that hard" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Things to think about:\n", "\n", "- have some of the further work recommendations changed?\n", "- has the technology improved since 2007, when Peter Norvig wrote this?\n", "- is there another probabilistic way of doing this?\n", "- are there other word distances? (example: convert a word to sounds and use a sound distance)" ] } ], "metadata": {} } ] }