{
 "metadata": {
  "name": ""
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Goal of this study\n",
      "\n",
      "According to Tae Kim's guide to Japanese grammar and [its introduction to Japanese verbs](http://www.guidetojapanese.org/learn/grammar/verbs), the following holds true:\n",
      "\n",
      "> All ru-verbs end in \u300c\u308b\u300d while u-verbs can end in a number of u-vowel sounds including \u300c\u308b\u300d. Therefore, if a verb does not end in \u300c\u308b\u300d, it will always be an u-verb. For verbs ending in \u300c\u308b\u300d, if the vowel sound preceding the \u300c\u308b\u300d is an /a/, /u/ or /o/ vowel sound, it will always be an u-verb. Otherwise, if the preceding sound is an /i/ or /e/ vowel sound, it will be a ru-verb in most cases. A list of common exceptions are at the end of this section.\n",
      "\n",
      "In this exploration, we're going to check that this description and try to find the statistics (i.e. the percentages for the different categories).\n",
      "\n",
      "# Sample file extraction \n",
      "\n",
      "The first step in our study is to read the dictionary XML and browse it for verbs. \n",
      "\n",
      "We're starting small, with a sample file that contains one word so that we can extract:\n",
      "\n",
      "- the pronunciation in Hiragana\n",
      "- the grammatical function of the word"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import xml.etree.cElementTree as ET"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 1
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "tree = ET.ElementTree(file='./JMdict_example.xml')\n",
      "tree"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 2,
       "text": [
        "<ElementTree at 0x3d630d0>"
       ]
      }
     ],
     "prompt_number": 2
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "root = tree.getroot()\n",
      "root"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 3,
       "text": [
        "<Element 'entry' at 0x03DC45D8>"
       ]
      }
     ],
     "prompt_number": 3
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "for elem in root:\n",
      "    print elem"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "<Element 'ent_seq' at 0x03DC4608>\n",
        "<Element 'k_ele' at 0x03DC4488>\n",
        "<Element 'r_ele' at 0x03DC4548>\n",
        "<Element 'sense' at 0x03DC4770>\n",
        "<Element 'sense' at 0x03DC4998>\n"
       ]
      }
     ],
     "prompt_number": 4
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "From the evaluation above, we see that an element of the XML structure contains a pronunciation that we can extract from the *r_ele/reb* tag:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print root.find('r_ele/reb').text"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\u3046\u3088\u304f\n"
       ]
      }
     ],
     "prompt_number": 5
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The grammatical function comes from the *sense/pos* tag. It should be noted that a word can have more than one grammatical function in Japanese. In the case below, the word uyoku can be an adjective-noun and a noun."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print [elem.text for elem in root.findall('sense/pos')]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "['adj-no;', 'n;']\n"
       ]
      }
     ],
     "prompt_number": 8
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We can now define a function that acts as a shortcut for the two operations done above (pronunciation and grammatical function):"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def extract_information(elem):\n",
      "    return (elem.find('r_ele/reb').text,\n",
      "            [sense.text for sense in elem.findall('sense/pos')])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 9
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "extract_information(root)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 10,
       "text": [
        "(u'\\u3046\\u3088\\u304f', ['adj-no;', 'n;'])"
       ]
      }
     ],
     "prompt_number": 10
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Where are the verbs?\n",
      "\n",
      "If we take a look at the DTD of the JMdict XML file we find the following entities corresponding to verbs:"
     ]
    },
    {
     "cell_type": "raw",
     "metadata": {},
     "source": [
      "<!ENTITY v1 \"Ichidan verb\">\n",
      "<!ENTITY v2a-s \"Nidan verb with 'u' ending (archaic)\">\n",
      "<!ENTITY v4h \"Yodan verb with `hu/fu' ending (archaic)\">\n",
      "<!ENTITY v4r \"Yodan verb with `ru' ending (archaic)\">\n",
      "<!ENTITY v5 \"Godan verb (not completely classified)\">\n",
      "<!ENTITY v5aru \"Godan verb - -aru special class\">\n",
      "<!ENTITY v5b \"Godan verb with `bu' ending\">\n",
      "<!ENTITY v5g \"Godan verb with `gu' ending\">\n",
      "<!ENTITY v5k \"Godan verb with `ku' ending\">\n",
      "<!ENTITY v5k-s \"Godan verb - Iku/Yuku special class\">\n",
      "<!ENTITY v5m \"Godan verb with `mu' ending\">\n",
      "<!ENTITY v5n \"Godan verb with `nu' ending\">\n",
      "<!ENTITY v5r \"Godan verb with `ru' ending\">\n",
      "<!ENTITY v5r-i \"Godan verb with `ru' ending (irregular verb)\">\n",
      "<!ENTITY v5s \"Godan verb with `su' ending\">\n",
      "<!ENTITY v5t \"Godan verb with `tsu' ending\">\n",
      "<!ENTITY v5u \"Godan verb with `u' ending\">\n",
      "<!ENTITY v5u-s \"Godan verb with `u' ending (special class)\">\n",
      "<!ENTITY v5uru \"Godan verb - Uru old class verb (old form of Eru)\">\n",
      "<!ENTITY vz \"Ichidan verb - zuru verb (alternative form of -jiru verbs)\">\n",
      "<!ENTITY vi \"intransitive verb\">\n",
      "<!ENTITY vk \"Kuru verb - special class\">\n",
      "<!ENTITY vn \"irregular nu verb\">\n",
      "<!ENTITY vr \"irregular ru verb, plain form ends with -ri\">\n",
      "<!ENTITY vs \"noun or participle which takes the aux. verb suru\">\n",
      "<!ENTITY vs-c \"su verb - precursor to the modern suru\">\n",
      "<!ENTITY vs-s \"suru verb - special class\">\n",
      "<!ENTITY vs-i \"suru verb - irregular\">\n",
      "<!ENTITY vt \"transitive verb\">\n",
      "<!ENTITY v4k \"Yodan verb with `ku' ending (archaic)\">\n",
      "<!ENTITY v4g \"Yodan verb with `gu' ending (archaic)\">\n",
      "<!ENTITY v4s \"Yodan verb with `su' ending (archaic)\">\n",
      "<!ENTITY v4t \"Yodan verb with `tsu' ending (archaic)\">\n",
      "<!ENTITY v4n \"Yodan verb with `nu' ending (archaic)\">\n",
      "<!ENTITY v4b \"Yodan verb with `bu' ending (archaic)\">\n",
      "<!ENTITY v4m \"Yodan verb with `mu' ending (archaic)\">\n",
      "<!ENTITY v2k-k \"Nidan verb (upper class) with `ku' ending (archaic)\">\n",
      "<!ENTITY v2g-k \"Nidan verb (upper class) with `gu' ending (archaic)\">\n",
      "<!ENTITY v2t-k \"Nidan verb (upper class) with `tsu' ending (archaic)\">\n",
      "<!ENTITY v2d-k \"Nidan verb (upper class) with `dzu' ending (archaic)\">\n",
      "<!ENTITY v2h-k \"Nidan verb (upper class) with `hu/fu' ending (archaic)\">\n",
      "<!ENTITY v2b-k \"Nidan verb (upper class) with `bu' ending (archaic)\">\n",
      "<!ENTITY v2m-k \"Nidan verb (upper class) with `mu' ending (archaic)\">\n",
      "<!ENTITY v2y-k \"Nidan verb (upper class) with `yu' ending (archaic)\">\n",
      "<!ENTITY v2r-k \"Nidan verb (upper class) with `ru' ending (archaic)\">\n",
      "<!ENTITY v2k-s \"Nidan verb (lower class) with `ku' ending (archaic)\">\n",
      "<!ENTITY v2g-s \"Nidan verb (lower class) with `gu' ending (archaic)\">\n",
      "<!ENTITY v2s-s \"Nidan verb (lower class) with `su' ending (archaic)\">\n",
      "<!ENTITY v2z-s \"Nidan verb (lower class) with `zu' ending (archaic)\">\n",
      "<!ENTITY v2t-s \"Nidan verb (lower class) with `tsu' ending (archaic)\">\n",
      "<!ENTITY v2d-s \"Nidan verb (lower class) with `dzu' ending (archaic)\">\n",
      "<!ENTITY v2n-s \"Nidan verb (lower class) with `nu' ending (archaic)\">\n",
      "<!ENTITY v2h-s \"Nidan verb (lower class) with `hu/fu' ending (archaic)\">\n",
      "<!ENTITY v2b-s \"Nidan verb (lower class) with `bu' ending (archaic)\">\n",
      "<!ENTITY v2m-s \"Nidan verb (lower class) with `mu' ending (archaic)\">\n",
      "<!ENTITY v2y-s \"Nidan verb (lower class) with `yu' ending (archaic)\">\n",
      "<!ENTITY v2r-s \"Nidan verb (lower class) with `ru' ending (archaic)\">\n",
      "<!ENTITY v2w-s \"Nidan verb (lower class) with `u' ending and `we' conjugation (archaic)\">"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "As we can see, this is quite complex. Instead of parsing every single verb type and differentiating between it, we will focus on four verb types only:\n",
      "\n",
      "- Ichidan \n",
      "- Nidan\n",
      "- Yodan\n",
      "- Godan\n",
      "\n",
      "As a side note, the Ichidan verbs are the *ru* verbs Tae Kim is talking about, while the Godan verbs are the *u* verbs.\n",
      "\n",
      "Below, we're looping over the dictionary items and extracting words that possibly contain verbs. To do this, we're checking for the presence of either of the four above verb types."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "tree = ET.ElementTree(file='../JMdict.xml')\n",
      "tree"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 11,
       "text": [
        "<ElementTree at 0x3d63bf0>"
       ]
      }
     ],
     "prompt_number": 11
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "verb_list = []\n",
      "verb_types = [u'Ichidan', u'Nidan', u'Yodan', u'Godan']\n",
      "for elem in tree.getroot().findall('entry'):\n",
      "    item_info = extract_information(elem)\n",
      "    for verb_type in verb_types:\n",
      "        if \" \".join(item_info[1]).find(verb_type) != -1:\n",
      "                verb_list.append((item_info[0], verb_type))\n",
      "                break\n",
      "                "
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 19
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "len(verb_list)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 20,
       "text": [
        "10266"
       ]
      }
     ],
     "prompt_number": 20
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "verb_list[:10]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 21,
       "text": [
        "[(u'\\u3042\\u3052\\u3064\\u3089\\u3046', u'Godan'),\n",
        " (u'\\u3042\\u3057\\u3089\\u3046', u'Godan'),\n",
        " (u'\\u3042\\u3076\\u308c\\u308b', u'Ichidan'),\n",
        " (u'\\u3042\\u3084\\u3059', u'Godan'),\n",
        " (u'\\u3044\\u304b\\u3059', u'Godan'),\n",
        " (u'\\u3044\\u3058\\u3051\\u308b', u'Ichidan'),\n",
        " (u'\\u3044\\u3058\\u308a\\u307e\\u308f\\u3059', u'Godan'),\n",
        " (u'\\u3044\\u3061\\u3083\\u3064\\u304f', u'Godan'),\n",
        " (u'\\u3044\\u306a\\u306a\\u304f', u'Godan'),\n",
        " (u'\\u3044\\u3073\\u308b', u'Godan')]"
       ]
      }
     ],
     "prompt_number": 21
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "As the unicode is a bit hard to read, we can format the previous output to the following effect:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "for elem in map(\": \".join, verb_list[:10]):\n",
      "    print elem"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\u3042\u3052\u3064\u3089\u3046: Godan\n",
        "\u3042\u3057\u3089\u3046: Godan\n",
        "\u3042\u3076\u308c\u308b: Ichidan\n",
        "\u3042\u3084\u3059: Godan\n",
        "\u3044\u304b\u3059: Godan\n",
        "\u3044\u3058\u3051\u308b: Ichidan\n",
        "\u3044\u3058\u308a\u307e\u308f\u3059: Godan\n",
        "\u3044\u3061\u3083\u3064\u304f: Godan\n",
        "\u3044\u306a\u306a\u304f: Godan\n",
        "\u3044\u3073\u308b: Godan\n"
       ]
      }
     ],
     "prompt_number": 23
    },
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "First statistical breakdown"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "As we now have a list of verb types, we can calculate their relative frequency:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "for verb_type in verb_types:\n",
      "    print \"%s: %i out of %i verbs\" % (verb_type, len(filter(lambda v: v[1] == verb_type, verb_list)), len(verb_list))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Ichidan: 3415 out of 10266 verbs\n",
        "Nidan: 38 out of 10266 verbs\n",
        "Yodan: 25 out of 10266 verbs\n",
        "Godan: 6788 out of 10266 verbs\n"
       ]
      }
     ],
     "prompt_number": 38
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "for verb_type in verb_types:\n",
      "    print \"%s: %.1f percent\" % (verb_type, len(filter(lambda v: v[1] == verb_type, verb_list)) / float(len(verb_list)) * 100)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Ichidan: 33.3 percent\n",
        "Nidan: 0.4 percent\n",
        "Yodan: 0.2 percent\n",
        "Godan: 66.1 percent\n"
       ]
      }
     ],
     "prompt_number": 257
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "According to this output, there's a majority of *u* verbs in the Japanese language."
     ]
    },
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Verb types by last character "
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Following Tae Kim's data,we're going to count number of verbs of a given type by classifying them according to their last character:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "last_char_dict = {}\n",
      "for verb in verb_list:\n",
      "    last_char = verb[0][-1] \n",
      "    if last_char not in last_char_dict:\n",
      "        last_char_dict[last_char] = {verb[1] : 1}\n",
      "    else:\n",
      "        if verb[1] in last_char_dict[last_char]:\n",
      "            last_char_dict[last_char][verb[1]] += 1\n",
      "        else:\n",
      "            last_char_dict[last_char][verb[1]] = 1\n",
      "        "
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 40
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "last_char_dict"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 41,
       "text": [
        "{u'\\u3046': {u'Godan': 719, u'Nidan': 6, u'Yodan': 4},\n",
        " u'\\u304f': {u'Godan': 911, u'Nidan': 2, u'Yodan': 4},\n",
        " u'\\u3050': {u'Godan': 137, u'Nidan': 2},\n",
        " u'\\u3059': {u'Godan': 1803, u'Nidan': 1, u'Yodan': 3},\n",
        " u'\\u305a': {u'Ichidan': 1, u'Nidan': 2},\n",
        " u'\\u3064': {u'Godan': 244, u'Nidan': 2, u'Yodan': 1},\n",
        " u'\\u306c': {u'Godan': 9, u'Nidan': 2},\n",
        " u'\\u3075': {u'Nidan': 1},\n",
        " u'\\u3076': {u'Godan': 102, u'Nidan': 2, u'Yodan': 2},\n",
        " u'\\u3080': {u'Godan': 596, u'Nidan': 4},\n",
        " u'\\u3086': {u'Nidan': 7},\n",
        " u'\\u308b': {u'Godan': 2267, u'Ichidan': 3414, u'Nidan': 7, u'Yodan': 11}}"
       ]
      }
     ],
     "prompt_number": 41
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "\", \".join(map(lambda s: \": \".join((s[0], str(s[1]))), last_char_dict[last_char_dict.keys()[0]].items()))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 53,
       "text": [
        "u'Nidan: 4, Godan: 596'"
       ]
      }
     ],
     "prompt_number": 53
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "for key in last_char_dict:\n",
      "    print \"verbs ending in %s\" % key, \", \".join(map(lambda s: \": \".join((s[0], str(s[1]))), last_char_dict[key].items()))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "verbs ending in \u3080 Nidan: 4, Godan: 596\n",
        "verbs ending in \u3064 Yodan: 1, Nidan: 2, Godan: 244\n",
        "verbs ending in \u3046 Yodan: 4, Nidan: 6, Godan: 719\n",
        "verbs ending in \u308b Yodan: 11, Nidan: 7, Ichidan: 3414, Godan: 2267\n",
        "verbs ending in \u3086 Nidan: 7\n",
        "verbs ending in \u306c Nidan: 2, Godan: 9\n",
        "verbs ending in \u304f Yodan: 4, Nidan: 2, Godan: 911\n",
        "verbs ending in \u3050 Nidan: 2, Godan: 137\n",
        "verbs ending in \u3075 Nidan: 1\n",
        "verbs ending in \u3076 Yodan: 2, Nidan: 2, Godan: 102\n",
        "verbs ending in \u3059 Yodan: 3, Nidan: 1, Godan: 1803\n",
        "verbs ending in \u305a Nidan: 2, Ichidan: 1\n"
       ]
      }
     ],
     "prompt_number": 54
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "As we can see from the output the first idea given by Tae Kim seems correct: if it doesn't end in *ru* it's a Godan verb (except for one \u305a Ichidan verb). Let's check the second idea: \n",
      "> if it ends in *ru*, the second before last hiragana mostly determines the verb type\n",
      "\n"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "second_last_char_dict = {}\n",
      "for verb in verb_list:\n",
      "    last_char = verb[0][-1] \n",
      "    if last_char == u'\u308b' and len(verb[0]) >= 2:\n",
      "        second_last_char = verb[0][-2] \n",
      "        if second_last_char not in second_last_char_dict:\n",
      "            second_last_char_dict[second_last_char] = {verb[1] : 1}\n",
      "        else:\n",
      "            if verb[1] in second_last_char_dict[second_last_char]:\n",
      "                second_last_char_dict[second_last_char][verb[1]] += 1\n",
      "            else:\n",
      "                second_last_char_dict[second_last_char][verb[1]] = 1"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 57
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "for key in second_last_char_dict:\n",
      "    print key, second_last_char_dict[key]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\u3083 {u'Yodan': 2, u'Godan': 5}\n",
        "\u308f {u'Godan': 130}\n",
        "\u30a7 {u'Godan': 1}\n",
        "\u30ab {u'Godan': 1}\n",
        "\u30af {u'Godan': 5}\n",
        "\u30b3 {u'Godan': 4}\n",
        "\u30b7 {u'Godan': 3}\n",
        "\u3044 {u'Ichidan': 32, u'Godan': 76}\n",
        "\u3048 {u'Yodan': 1, u'Ichidan': 398, u'Godan': 41}\n",
        "\u30cb {u'Godan': 3}\n",
        "\u304c {u'Nidan': 1, u'Godan': 162}\n",
        "\u3050 {u'Godan': 25}\n",
        "\u30d3 {u'Godan': 1}\n",
        "\u3054 {u'Nidan': 1, u'Godan': 3}\n",
        "\u3058 {u'Ichidan': 104, u'Godan': 21}\n",
        "\u305c {u'Ichidan': 11}\n",
        "\u30df {u'Godan': 1}\n",
        "\u3060 {u'Nidan': 1, u'Godan': 18}\n",
        "\u30e3 {u'Godan': 1}\n",
        "\u3064 {u'Godan': 34}\n",
        "\u30e7 {u'Godan': 1}\n",
        "\u3068 {u'Godan': 161}\n",
        "\u306c {u'Godan': 6}\n",
        "\u3070 {u'Godan': 45}\n",
        "\u3078 {u'Ichidan': 3, u'Godan': 5}\n",
        "\u307c {u'Godan': 32}\n",
        "\u3080 {u'Godan': 13}\n",
        "\u3084 {u'Godan': 28}\n",
        "\u3088 {u'Godan': 36}\n",
        "\u308c {u'Ichidan': 548, u'Godan': 1}\n",
        "\u30b0 {u'Godan': 5}\n",
        "\u30b8 {u'Godan': 2}\n",
        "\u30c4 {u'Godan': 1}\n",
        "\u30c8 {u'Godan': 2}\n",
        "\u304b {u'Yodan': 2, u'Godan': 149}\n",
        "\u304f {u'Godan': 69}\n",
        "\u30d0 {u'Godan': 2}\n",
        "\u3053 {u'Godan': 35}\n",
        "\u30d4 {u'Godan': 2}\n",
        "\u3057 {u'Ichidan': 1, u'Godan': 37}\n",
        "\u305b {u'Ichidan': 297, u'Godan': 9}\n",
        "\u30dc {u'Godan': 2}\n",
        "\u305f {u'Godan': 72}\n",
        "\u30e0 {u'Godan': 2}\n",
        "\u3067 {u'Ichidan': 79, u'Godan': 1}\n",
        "\u306b {u'Ichidan': 3}\n",
        "\u30ec {u'Ichidan': 2}\n",
        "\u306f {u'Yodan': 1, u'Godan': 44}\n",
        "\u3073 {u'Ichidan': 40, u'Godan': 7}\n",
        "\u307b {u'Godan': 6}\n",
        "\u307f {u'Ichidan': 61}\n",
        "\u3081 {u'Ichidan': 470, u'Godan': 7}\n",
        "\u3089 {u'Ichidan': 1}\n",
        "\u30ad {u'Godan': 2}\n",
        "\u30b1 {u'Godan': 1}\n",
        "\u30b9 {u'Godan': 4}\n",
        "\u30c1 {u'Godan': 2}\n",
        "\u3042 {u'Yodan': 1, u'Godan': 28}\n",
        "\u3046 {u'Godan': 13}\n",
        "\u30c9 {u'Godan': 1}\n",
        "\u304a {u'Godan': 45}\n",
        "\u304e {u'Ichidan': 29, u'Godan': 40}\n",
        "\u30d1 {u'Godan': 1}\n",
        "\u3052 {u'Ichidan': 251, u'Godan': 5}\n",
        "\u30d5 {u'Godan': 2}\n",
        "\u3056 {u'Yodan': 2, u'Godan': 11}\n",
        "\u305a {u'Ichidan': 79, u'Godan': 21}\n",
        "\u305e {u'Godan': 3}\n",
        "\u30e1 {u'Ichidan': 1}\n",
        "\u3066 {u'Ichidan': 185, u'Godan': 3}\n",
        "\u306a {u'Nidan': 2, u'Godan': 147}\n",
        "\u30ed {u'Godan': 3}\n",
        "\u306e {u'Godan': 31}\n",
        "\u3072 {u'Ichidan': 3, u'Godan': 4}\n",
        "\u3076 {u'Nidan': 1, u'Godan': 70}\n",
        "\u307e {u'Yodan': 2, u'Godan': 147}\n",
        "\u3082 {u'Godan': 43}\n",
        "\u3086 {u'Godan': 1}\n",
        "\u308a {u'Ichidan': 24, u'Godan': 2}\n",
        "\u30ba {u'Godan': 1}\n",
        "\u30c6 {u'Godan': 2}\n",
        "\u30ca {u'Godan': 1}\n",
        "\u304d {u'Ichidan': 30, u'Godan': 108}\n",
        "\u3051 {u'Ichidan': 630, u'Godan': 22}\n",
        "\u3055 {u'Godan': 58}\n",
        "\u30d6 {u'Godan': 6}\n",
        "\u3059 {u'Godan': 16}\n",
        "\u305d {u'Nidan': 1, u'Godan': 10}\n",
        "\u3061 {u'Ichidan': 38, u'Godan': 9}\n",
        "\u30e2 {u'Godan': 3}\n",
        "\u3065 {u'Godan': 2}\n",
        "\u3069 {u'Godan': 35}\n",
        "\u306d {u'Ichidan': 61, u'Godan': 15}\n",
        "\u3071 {u'Godan': 5}\n",
        "\u3075 {u'Godan': 18}\n",
        "\u3079 {u'Ichidan': 33, u'Godan': 9}\n"
       ]
      }
     ],
     "prompt_number": 58
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "This still needs a little sorting!"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def vowel_type(character):\n",
      "    vowels = ['a', 'i', 'u', 'e', 'o']\n",
      "    chars = [u'\u3042\u304b\u3055\u305f\u306a\u306f\u307e\u3084\u3089\u308f\u3083\u30ab\u304c\u3060\u30e3\u3070\u30d0\u30d1\u3056\u30ca\u3071', \n",
      "             u'\u30b7\u3044\u30cb\u30d3\u3058\u30df\u30b8\u30d4\u3057\u306b\u3073\u307f\u30ad\u30c1\u304e\u3072\u308a\u304d\u3061',\n",
      "             u'\u30af\u3050\u3064\u306c\u3080\u30b0\u30c4\u304f\u30e0\u30b9\u3046\u30d5\u305a\u3076\u3086\u30ba\u30d6\u3059\u3065\u3075',\n",
      "             u'\u3079\u30a7\u3048\u305c\u3078\u308c\u305b\u3067\u30ec\u3081\u30b1\u3052\u30e1\u3066\u30c6\u3051\u306d',\n",
      "             u'\u30b3\u3054\u30e7\u3068\u307c\u3088\u30c8\u3053\u30dc\u307b\u30c9\u304a\u305e\u30ed\u306e\u3082\u305d\u30e2\u3069']\n",
      "    for ind, hiragana in enumerate(chars):\n",
      "        if character in hiragana:\n",
      "            return vowels[ind]\n",
      "    print character\n",
      "    raise Exception('character not supported')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 248
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "final_classification = {}\n",
      "for key in second_last_char_dict:\n",
      "    vowel = vowel_type(key)\n",
      "    if vowel not in final_classification:\n",
      "        final_classification[vowel] = second_last_char_dict[key]\n",
      "    else:\n",
      "        for verb_group in second_last_char_dict[key]:\n",
      "            if verb_group in final_classification[vowel]:\n",
      "                final_classification[vowel][verb_group] += second_last_char_dict[key][verb_group]\n",
      "            else:\n",
      "                final_classification[vowel][verb_group] = second_last_char_dict[key][verb_group]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 250
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "final_classification"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 251,
       "text": [
        "{'a': {u'Godan': 1055, u'Ichidan': 1, u'Nidan': 4, u'Yodan': 10},\n",
        " 'e': {u'Godan': 122, u'Ichidan': 2969, u'Yodan': 1},\n",
        " 'i': {u'Godan': 320, u'Ichidan': 365},\n",
        " 'o': {u'Godan': 456, u'Nidan': 2},\n",
        " 'u': {u'Godan': 314, u'Ichidan': 79, u'Nidan': 1}}"
       ]
      }
     ],
     "prompt_number": 251
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "for key in final_classification.keys():\n",
      "    print 'vowel sound ending in %s' % key\n",
      "    print \", \".join(map(lambda s: \": \".join((s[0], str(s[1]))), final_classification[key].items()))    "
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "vowel sound ending in a\n",
        "Yodan: 10, Nidan: 4, Ichidan: 1, Godan: 1055\n",
        "vowel sound ending in u\n",
        "Nidan: 1, Ichidan: 79, Godan: 314\n",
        "vowel sound ending in e\n",
        "Yodan: 1, Ichidan: 2969, Godan: 122\n",
        "vowel sound ending in i\n",
        "Ichidan: 365, Godan: 320\n",
        "vowel sound ending in o\n",
        "Nidan: 2, Godan: 456\n"
       ]
      }
     ],
     "prompt_number": 255
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "for key in final_classification.keys():\n",
      "    print 'vowel sound ending in %s' % key\n",
      "    total = sum(map(lambda i: i[1], final_classification[key].items()))\n",
      "    print \", \".join(map(lambda s: \": \".join((s[0], format(s[1] / float(total) * 100, \".2f\"))), final_classification[key].items()))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "vowel sound ending in a\n",
        "Yodan: 0.93, Nidan: 0.37, Ichidan: 0.09, Godan: 98.60\n",
        "vowel sound ending in u\n",
        "Nidan: 0.25, Ichidan: 20.05, Godan: 79.70\n",
        "vowel sound ending in e\n",
        "Yodan: 0.03, Ichidan: 96.02, Godan: 3.95\n",
        "vowel sound ending in i\n",
        "Ichidan: 53.28, Godan: 46.72\n",
        "vowel sound ending in o\n",
        "Nidan: 0.44, Godan: 99.56\n"
       ]
      }
     ],
     "prompt_number": 263
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Concluding remarks\n",
      "\n",
      "## High certainty\n",
      "\n",
      "- if vowel sound is o, it's a *u* verb in 100% of cases\n",
      "- if vowel sound is a, it's a *u* verb in 99% of cases\n",
      "- if vowel sound is e, it's a *ru* verb in 96% of cases\n",
      "\n",
      "## Lower certainty\n",
      "\n",
      "- if vowel sound is u, it's a *u* verb in 80% of cases and a *ru* verb in 20% of cases\n",
      "- if vowel sound is i, it's *ru* verb in 53% of cases and *u* verb in 47% of cases\n"
     ]
    },
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Side note"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The more I'm computing this sort of thing, the more I'm thinking I could do this in an easier way with Pandas."
     ]
    }
   ],
   "metadata": {}
  }
 ]
}