{
 "metadata": {
  "name": "",
  "signature": "sha256:86ac4edfedab2134020525a1313023d729488ab09532408f443b98e4ca33734f"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In this post, we're gonna take a look at the Finnish language. Our starting point is a file found [here](http://www.csc.fi/english/research/sciences/linguistics/taajuussanasto-B9996/view), which contains the 10000 most common words found in Finnish."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import urllib2\n",
      "response = urllib2.urlopen('http://www.csc.fi/english/research/sciences/linguistics/taajuussanasto-B9996/download')\n",
      "words = response.read()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 37
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "As the file is encoded in utf-8, we need to decode it, meaning convert it to unicode, to use it first. A great help in figuring this out is the fantastic tutorial presentation at [http://farmdev.com/talks/unicode/](http://farmdev.com/talks/unicode/)."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "words = words.decode('utf-8')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 38
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We can now split the text to its component lines in order to start exploring it."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": true,
     "input": [
      "words = words.splitlines()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 39
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "words[:10]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 40,
       "text": [
        "[u'                   Sanahakemisto (laskevan taajuuden mukaan)',\n",
        " u'',\n",
        " u'   N        Abs   Rel    Uppslagsord',\n",
        " u'   1    2716396 4,614851 olla (verbi)',\n",
        " u'   2    1566108 2,660641 ja (konjunktio)',\n",
        " u'   3     593462 1,008225 ei (verbi)',\n",
        " u'   4     538609 0,915036 se (pronomini)',\n",
        " u'   5     443301 0,753118 ett\\xe4 (konjunktio)',\n",
        " u'   6     417984 0,710108 joka (pronomini)',\n",
        " u'   7     344927 0,585992 vuosi (substantiivi)']"
       ]
      }
     ],
     "prompt_number": 40
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "As we see, the words have to be separated from their headers, which give additional information:\n",
      "\n",
      "- rank\n",
      "- absolute count of word in corpus\n",
      "- relative frequency\n",
      "- the word we're talking about\n",
      "\n",
      "This can be parsed in the following way: we develop a simple lambda function for each part of the header that we can then apply to the whole list."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "words[10]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 41,
       "text": [
        "u'   8     302803 0,514428 h\\xe4n (pronomini)'"
       ]
      }
     ],
     "prompt_number": 41
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print words[10]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "   8     302803 0,514428 h\u00e4n (pronomini)\n"
       ]
      }
     ],
     "prompt_number": 42
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "First, the rank."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "rank = lambda w: int(w[:8])\n",
      "rank(words[10])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 43,
       "text": [
        "8"
       ]
      }
     ],
     "prompt_number": 43
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The absolute word count."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "abs_count = lambda w: int(w[8:15])\n",
      "abs_count(words[10])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 44,
       "text": [
        "302803"
       ]
      }
     ],
     "prompt_number": 44
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The relative count."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "rel_count = lambda w: float(w[15:25].replace(',', '.'))\n",
      "rel_count(words[10])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 45,
       "text": [
        "0.514428"
       ]
      }
     ],
     "prompt_number": 45
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Finally, the word itself, in unicode form."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "the_word = lambda w: w[25:].split('(')[0]\n",
      "print the_word(words[10])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "h\u00e4n \n"
       ]
      }
     ],
     "prompt_number": 46
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Having set up these methods, we can apply them to each row that we want to parse from this text file."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "word_dict = dict(\n",
      "    [(the_word(w), \n",
      "      (rank(w), abs_count(w), rel_count(w))) for w in words[3:-6]])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 47
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "With this new word dictionary, we can build a sort of widget that lets us play with the words."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from IPython.html.widgets import interact"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 48
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from IPython.display import HTML, display"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 49
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def show_word(n):\n",
      "    word = word_dict.keys()[n]\n",
      "    s = '<h3>Word: %s</h3><table>\\n' % word\n",
      "    for k,v in zip(('rank', 'relative count', 'absolute count'),\n",
      "                   word_dict[word]):\n",
      "        s += '<tr><td>{0}</td><td>{1}</td></tr>\\n'.format(k,v)\n",
      "    s += '</table>'\n",
      "    display(HTML(s))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 50
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let's test this function with a call on the 3rd index item."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "show_word(3)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<h3>Word: arvokisa </h3><table>\n",
        "<tr><td>rank</td><td>4188</td></tr>\n",
        "<tr><td>relative count</td><td>1360</td></tr>\n",
        "<tr><td>absolute count</td><td>0.00231</td></tr>\n",
        "</table>"
       ],
       "metadata": {},
       "output_type": "display_data",
       "text": [
        "<IPython.core.display.HTML at 0x3902f70>"
       ]
      }
     ],
     "prompt_number": 51
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "And now, let's make this interactive with the latests IPython Widget machinery."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "interact(show_word,\n",
      "         n=(0, len(word_dict.keys()) - 1))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<h3>Word: diplomaattinen </h3><table>\n",
        "<tr><td>rank</td><td>7201</td></tr>\n",
        "<tr><td>relative count</td><td>672</td></tr>\n",
        "<tr><td>absolute count</td><td>0.001142</td></tr>\n",
        "</table>"
       ],
       "metadata": {},
       "output_type": "display_data",
       "text": [
        "<IPython.core.display.HTML at 0x4ba4570>"
       ]
      },
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 52,
       "text": [
        "<function __main__.show_word>"
       ]
      }
     ],
     "prompt_number": 52
    },
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Adding a translation from an external website"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We can easily supplement a translation of the word displayed by using an external website and displaying a search page for the word we're using:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from IPython.display import IFrame\n",
      "IFrame('http://www.fincd.com/index.php?txtSearch=tunti&lang=fi', width='100%', height=350)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "\n",
        "        <iframe\n",
        "            width=\"100%\"\n",
        "            height=350\"\n",
        "            src=\"http://www.fincd.com/index.php?txtSearch=tunti&lang=fi\"\n",
        "            frameborder=\"0\"\n",
        "            allowfullscreen\n",
        "        ></iframe>\n",
        "        "
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 53,
       "text": [
        "<IPython.lib.display.IFrame at 0x4ba4bf0>"
       ]
      }
     ],
     "prompt_number": 53
    },
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "But how to encode unicode strings for use in URLs?"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let's use an example of what we want to do: the encoded url for the word kyll\u00e4 is http://www.fincd.com/finnish/kyll%E4.html"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "my_word = u'kyll\u00e4'"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 54
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "This word has to be encoded somehow. Looking at the website, we find it uses the `iso-8859-1` coding. Let's try that on our word."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "my_word.encode('iso-8859-1')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 55,
       "text": [
        "'kyll\\xe4'"
       ]
      }
     ],
     "prompt_number": 55
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Once encoded, we can use the `quote` method to make it ready for urls:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "urllib2.quote(my_word.encode('iso-8859-1'))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 56,
       "text": [
        "'kyll%E4'"
       ]
      }
     ],
     "prompt_number": 56
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Finally, we can put this together:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print 'http://www.fincd.com/index.php?txtSearch=%s&lang=fi' % urllib2.quote(my_word.encode('iso-8859-1'))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "http://www.fincd.com/index.php?txtSearch=kyll%E4&lang=fi\n"
       ]
      }
     ],
     "prompt_number": 57
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Good! Now, let's write the word and translation thing and let's interact with it."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def show_word_and_translation(n):\n",
      "    word = word_dict.keys()[n]\n",
      "    s = '<h3>Word: %s</h3><table>\\n' % word\n",
      "    for k,v in zip(('rank', 'relative count', 'absolute count'),\n",
      "                   word_dict[word]):\n",
      "        s += '<tr><td>{0}</td><td>{1}</td></tr>\\n'.format(k,v)\n",
      "    s += '</table>'\n",
      "    display(HTML(s))\n",
      "    url = 'http://www.fincd.com/index.php?txtSearch=%s&lang=fi' % urllib2.quote(word[:-1].encode('iso-8859-1'))\n",
      "    display(IFrame(url, width='100%', height=350))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 58
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "interact(show_word_and_translation,\n",
      "         n=(0, len(word_dict.keys()) - 1))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<h3>Word: enso </h3><table>\n",
        "<tr><td>rank</td><td>2690</td></tr>\n",
        "<tr><td>relative count</td><td>2351</td></tr>\n",
        "<tr><td>absolute count</td><td>0.003994</td></tr>\n",
        "</table>"
       ],
       "metadata": {},
       "output_type": "display_data",
       "text": [
        "<IPython.core.display.HTML at 0x4ba49d0>"
       ]
      },
      {
       "html": [
        "\n",
        "        <iframe\n",
        "            width=\"100%\"\n",
        "            height=350\"\n",
        "            src=\"http://www.fincd.com/index.php?txtSearch=enso&lang=fi\"\n",
        "            frameborder=\"0\"\n",
        "            allowfullscreen\n",
        "        ></iframe>\n",
        "        "
       ],
       "metadata": {},
       "output_type": "display_data",
       "text": [
        "<IPython.lib.display.IFrame at 0x4ba4c70>"
       ]
      }
     ],
     "prompt_number": 59
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Can we do better than that? Let's see if we can just take the table with the content from the website."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "response = urllib2.urlopen('http://www.fincd.com/finnish/kyll%E4.html')\n",
      "source = response.read()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 60
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "source_split = source.decode(\"iso-8859-1\").splitlines()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 61
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The really interesting part is here:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "source_split[85:120]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 62,
       "text": [
        "[u'\\t<a href=\"/friends/\">[Links]</a>&nbsp;',\n",
        " u'\\t<a href=\"javascript:bookmark()\">[Bookmark]</a>',\n",
        " u'  </td></tr>',\n",
        " u'</table>',\n",
        " u'',\n",
        " u'<table width=\"728\" align=\"center\" id=\"tbDotBorder\">',\n",
        " u'  <tr>',\n",
        " u'    <td id=\"lang_cell\" width=\"20%\">Finnish:</td>',\n",
        " u'    <td id=\"helper_cell\" width=\"80%\"><a href = \"/finnish/kyll%E4.html\">kyll\\xe4</a>\\t',\n",
        " u'\\t</td>',\n",
        " u'  </tr>',\n",
        " u'  <tr>',\n",
        " u'    <td id=\"lang_cell\" width=\"20%\">English:</td>',\n",
        " u'    <td id=\"content_cell\" width=\"80%\">',\n",
        " u'\\t<li><a href=\"/english/yes.html\">yes</a></li>',\n",
        " u'\\t<p id=\"msg\"></p>\\t</td>',\n",
        " u'  </tr>',\n",
        " u'  <tr>',\n",
        " u'    <td colspan=\"2\" id=\"suggestion_cell\"><!--Write your own explain here--></td>',\n",
        " u'  </tr>',\n",
        " u'  <tr>',\n",
        " u'  <td colspan=\"2\">',\n",
        " u'  <table width=\"100%\">',\n",
        " u'    <td width=\"50%\" id=\"discuss_cell\"> </td>',\n",
        " u'    <td width=\"50%\" id=\"discuss_cell\"> </td>',\n",
        " u'  </tr>',\n",
        " u'  </table>',\n",
        " u'  </td>',\n",
        " u'  <tr><td colspan=\"3\" align=\"right\" id=\"copy_right\"><small><a href = \"/old/\" title=\"Suomi Englanti sanakirja \">Suomi Englanti Suomi sanakirja Beta5</a></small></td></tr>',\n",
        " u'</table>',\n",
        " u'',\n",
        " u'<br>',\n",
        " u'',\n",
        " u'<table width=\"728\" align=\"center\" id=\"tbDotBorder\" style=\"border-style:none\">',\n",
        " u'<tr>']"
       ]
      }
     ],
     "prompt_number": 62
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We can the exact indices for the table with the information here:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "source_split.index(u'<table width=\"728\" align=\"center\" id=\"tbDotBorder\">')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 63,
       "text": [
        "90"
       ]
      }
     ],
     "prompt_number": 63
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "source_split.index(u'<table width=\"728\" align=\"center\" id=\"tbDotBorder\" style=\"border-style:none\">')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 64,
       "text": [
        "118"
       ]
      }
     ],
     "prompt_number": 64
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "This can easily be extracted to HTML."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "HTML(\"\".join(source_split[90:118]))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<table width=\"728\" align=\"center\" id=\"tbDotBorder\">  <tr>    <td id=\"lang_cell\" width=\"20%\">Finnish:</td>    <td id=\"helper_cell\" width=\"80%\"><a href = \"/finnish/kyll%E4.html\">kyll\u00e4</a>\t\t</td>  </tr>  <tr>    <td id=\"lang_cell\" width=\"20%\">English:</td>    <td id=\"content_cell\" width=\"80%\">\t<li><a href=\"/english/yes.html\">yes</a></li>\t<p id=\"msg\"></p>\t</td>  </tr>  <tr>    <td colspan=\"2\" id=\"suggestion_cell\"><!--Write your own explain here--></td>  </tr>  <tr>  <td colspan=\"2\">  <table width=\"100%\">    <td width=\"50%\" id=\"discuss_cell\"> </td>    <td width=\"50%\" id=\"discuss_cell\"> </td>  </tr>  </table>  </td>  <tr><td colspan=\"3\" align=\"right\" id=\"copy_right\"><small><a href = \"/old/\" title=\"Suomi Englanti sanakirja \">Suomi Englanti Suomi sanakirja Beta5</a></small></td></tr></table><br>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 65,
       "text": [
        "<IPython.core.display.HTML at 0x4ba4e30>"
       ]
      }
     ],
     "prompt_number": 65
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let's try to extract only the meaningful information using regular expressions."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "src = \"\".join(source_split[90:118])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 90
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import re"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 91
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "p = re.compile('<tr>')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 92
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "iterator = p.finditer(src)\n",
      "for match in iterator:\n",
      "    print match.span()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "(53, 57)\n",
        "(201, 205)\n",
        "(368, 372)\n",
        "(461, 465)\n",
        "(619, 623)\n"
       ]
      }
     ],
     "prompt_number": 93
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Judging from these matches, we only need the first two table rows to extract our data. Here, the data lies therefore between characters 53 and 367."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "HTML(src[53:367])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<tr>    <td id=\"lang_cell\" width=\"20%\">Finnish:</td>    <td id=\"helper_cell\" width=\"80%\"><a href = \"/finnish/kyll%E4.html\">kyll\u00e4</a>\t\t</td>  </tr>  <tr>    <td id=\"lang_cell\" width=\"20%\">English:</td>    <td id=\"content_cell\" width=\"80%\">\t<li><a href=\"/english/yes.html\">yes</a></li>\t<p id=\"msg\"></p>\t</td>  </tr> "
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 94,
       "text": [
        "<IPython.core.display.HTML at 0x4bba330>"
       ]
      }
     ],
     "prompt_number": 94
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let's design a function that extracts exactly this last part:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def extract_word_definition(source):\n",
      "    source_split = source.decode(\"iso-8859-1\").splitlines()\n",
      "    start = source_split.index(u'<table width=\"728\" align=\"center\" id=\"tbDotBorder\">')\n",
      "    stop = source_split.index(u'<table width=\"728\" align=\"center\" id=\"tbDotBorder\" style=\"border-style:none\">')\n",
      "    src = \"\".join(source_split[start:stop])\n",
      "    p = re.compile('<tr>')\n",
      "    iterator = p.finditer(src)\n",
      "    spans = [match.span() for match in iterator]\n",
      "    start = spans[0][0]\n",
      "    stop = spans[2][0]\n",
      "    return src[start:stop]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 99
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "HTML(extract_word_definition(source))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<tr>    <td id=\"lang_cell\" width=\"20%\">Finnish:</td>    <td id=\"helper_cell\" width=\"80%\"><a href = \"/finnish/kyll%E4.html\">kyll\u00e4</a>\t\t</td>  </tr>  <tr>    <td id=\"lang_cell\" width=\"20%\">English:</td>    <td id=\"content_cell\" width=\"80%\">\t<li><a href=\"/english/yes.html\">yes</a></li>\t<p id=\"msg\"></p>\t</td>  </tr>  "
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 100,
       "text": [
        "<IPython.core.display.HTML at 0x4c364b0>"
       ]
      }
     ],
     "prompt_number": 100
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We can integrate this into the existing code:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def show_word_and_translation_html_only(n):\n",
      "    word = word_dict.keys()[n]\n",
      "    s = '<h3>Word: %s</h3><table>\\n' % word\n",
      "    for k,v in zip(('rank', 'relative count', 'absolute count'),\n",
      "                   word_dict[word]):\n",
      "        s += '<tr><td>{0}</td><td>{1}</td></tr>\\n'.format(k,v)\n",
      "    url = 'http://www.fincd.com/index.php?txtSearch=%s&lang=fi' % urllib2.quote(word[:-1].encode('iso-8859-1'))\n",
      "    s += extract_word_definition(urllib2.urlopen(url).read())\n",
      "    s += '</table>'\n",
      "    display(HTML(s))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 101
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "interact(show_word_and_translation_html_only,\n",
      "         n=(0, len(word_dict.keys()) - 1))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<h3>Word: toimisto </h3><table>\n",
        "<tr><td>rank</td><td>1561</td></tr>\n",
        "<tr><td>relative count</td><td>4584</td></tr>\n",
        "<tr><td>absolute count</td><td>0.007788</td></tr>\n",
        "<tr>    <td id=\"lang_cell\" width=\"20%\">Finnish:</td>    <td id=\"helper_cell\" width=\"80%\"><a href = \"/finnish/toimisto.html\">toimisto</a>\t\t</td>  </tr>  <tr>    <td id=\"lang_cell\" width=\"20%\">English:</td>    <td id=\"content_cell\" width=\"80%\">\t<li><a href=\"/english/board.html\">board</a></li><li><a href=\"/english/bureau.html\">bureau</a></li><li><a href=\"/english/office.html\">office</a></li>\t<p id=\"msg\"></p>\t</td>  </tr>  </table>"
       ],
       "metadata": {},
       "output_type": "display_data",
       "text": [
        "<IPython.core.display.HTML at 0x4bba5b0>"
       ]
      }
     ],
     "prompt_number": 102
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Finally, I can present the same tool but with a sorted ranking."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "word_dict[0]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "ename": "KeyError",
       "evalue": "0",
       "output_type": "pyerr",
       "traceback": [
        "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[1;31mKeyError\u001b[0m                                  Traceback (most recent call last)",
        "\u001b[1;32m<ipython-input-112-547a459bb01d>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mword_dict\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;36m0\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
        "\u001b[1;31mKeyError\u001b[0m: 0"
       ]
      }
     ],
     "prompt_number": 112
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "sorted_keys = sorted(word_dict.keys(), key=lambda n:word_dict[n][0])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 113
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "sorted_keys[:10]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 115,
       "text": [
        "[u'olla ',\n",
        " u'ja ',\n",
        " u'ei ',\n",
        " u'se ',\n",
        " u'ett\\xe4 ',\n",
        " u'joka ',\n",
        " u'h\\xe4n ',\n",
        " u'saada ',\n",
        " u'mutta ',\n",
        " u't\\xe4m\\xe4 ']"
       ]
      }
     ],
     "prompt_number": 115
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def show_word_and_translation_html_only_sorted(n):\n",
      "    word = sorted_keys[n]\n",
      "    s = '<h3>Word: %s</h3><table>\\n' % word\n",
      "    for k,v in zip(('rank', 'relative count', 'absolute count'),\n",
      "                   word_dict[word]):\n",
      "        s += '<tr><td>{0}</td><td>{1}</td></tr>\\n'.format(k,v)\n",
      "    url = 'http://www.fincd.com/index.php?txtSearch=%s&lang=fi' % urllib2.quote(word[:-1].encode('iso-8859-1'))\n",
      "    s += extract_word_definition(urllib2.urlopen(url).read())\n",
      "    s += '</table>'\n",
      "    display(HTML(s))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 116
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "interact(show_word_and_translation_html_only_sorted,\n",
      "         n=(0, len(word_dict.keys()) - 1))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<h3>Word: joukkue </h3><table>\n",
        "<tr><td>rank</td><td>180</td></tr>\n",
        "<tr><td>relative count</td><td>31653</td></tr>\n",
        "<tr><td>absolute count</td><td>0.053775</td></tr>\n",
        "<tr>    <td id=\"lang_cell\" width=\"20%\">Finnish:</td>    <td id=\"helper_cell\" width=\"80%\"><a href = \"/finnish/joukkue.html\">joukkue</a>\t\t</td>  </tr>  <tr>    <td id=\"lang_cell\" width=\"20%\">English:</td>    <td id=\"content_cell\" width=\"80%\">\t<li><a href=\"/english/team.html\">team</a></li>\t<p id=\"msg\"></p>\t</td>  </tr>  </table>"
       ],
       "metadata": {},
       "output_type": "display_data",
       "text": [
        "<IPython.core.display.HTML at 0x4b59b70>"
       ]
      }
     ],
     "prompt_number": 117
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "To make things easier, we can also just consider the first 200 words:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "interact(show_word_and_translation_html_only_sorted,\n",
      "         n=(0, 200))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<h3>Word: tulla </h3><table>\n",
        "<tr><td>rank</td><td>14</td></tr>\n",
        "<tr><td>relative count</td><td>192327</td></tr>\n",
        "<tr><td>absolute count</td><td>0.326742</td></tr>\n",
        "<tr>    <td id=\"lang_cell\" width=\"20%\">Finnish:</td>    <td id=\"helper_cell\" width=\"80%\"><a href = \"/finnish/tulla.html\">tulla</a>\t\t</td>  </tr>  <tr>    <td id=\"lang_cell\" width=\"20%\">English:</td>    <td id=\"content_cell\" width=\"80%\">\t<li><a href=\"/english/come.html\">come</a></li><li><a href=\"/english/get.html\">get</a></li><li><a href=\"/english/grow.html\">grow</a></li><li><a href=\"/english/show+up.html\">show up</a></li>\t<p id=\"msg\"></p>\t</td>  </tr>  </table>"
       ],
       "metadata": {},
       "output_type": "display_data",
       "text": [
        "<IPython.core.display.HTML at 0x4b729d0>"
       ]
      }
     ],
     "prompt_number": 118
    },
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Conclusions"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In this post, I've tried to develop a simple dictionary tool for learning the Finnish language and that can be used interactively using the IPython Notebook HTML widgets. I found this fun to write, and I hope to use this again for learning purposes. \n",
      "\n",
      "One of the things that could be done with this is to take into account the notion of words that I already know, a sort of vocabulary database, and then make suggestions based on word similarities in terms of their writing for potential learning candidates, so as to expand my current base of vocabulary."
     ]
    }
   ],
   "metadata": {}
  }
 ]
}