{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Goal of this study\n", "\n", "According to Tae Kim's guide to Japanese grammar and [its introduction to Japanese verbs](http://www.guidetojapanese.org/learn/grammar/verbs), the following holds true:\n", "\n", "> All ru-verbs end in \u300c\u308b\u300d while u-verbs can end in a number of u-vowel sounds including \u300c\u308b\u300d. Therefore, if a verb does not end in \u300c\u308b\u300d, it will always be an u-verb. For verbs ending in \u300c\u308b\u300d, if the vowel sound preceding the \u300c\u308b\u300d is an /a/, /u/ or /o/ vowel sound, it will always be an u-verb. Otherwise, if the preceding sound is an /i/ or /e/ vowel sound, it will be a ru-verb in most cases. A list of common exceptions are at the end of this section.\n", "\n", "In this exploration, we're going to check that this description and try to find the statistics (i.e. the percentages for the different categories).\n", "\n", "# Sample file extraction \n", "\n", "The first step in our study is to read the dictionary XML and browse it for verbs. \n", "\n", "We're starting small, with a sample file that contains one word so that we can extract:\n", "\n", "- the pronunciation in Hiragana\n", "- the grammatical function of the word" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import xml.etree.cElementTree as ET" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "code", "collapsed": false, "input": [ "tree = ET.ElementTree(file='./JMdict_example.xml')\n", "tree" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 2, "text": [ "" ] } ], "prompt_number": 2 }, { "cell_type": "code", "collapsed": false, "input": [ "root = tree.getroot()\n", "root" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 3, "text": [ "" ] } ], "prompt_number": 3 }, { "cell_type": "code", "collapsed": false, "input": [ "for elem in root:\n", " print elem" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\n", "\n", "\n", "\n", "\n" ] } ], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the evaluation above, we see that an element of the XML structure contains a pronunciation that we can extract from the *r_ele/reb* tag:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print root.find('r_ele/reb').text" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\u3046\u3088\u304f\n" ] } ], "prompt_number": 5 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The grammatical function comes from the *sense/pos* tag. It should be noted that a word can have more than one grammatical function in Japanese. In the case below, the word uyoku can be an adjective-noun and a noun." ] }, { "cell_type": "code", "collapsed": false, "input": [ "print [elem.text for elem in root.findall('sense/pos')]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "['adj-no;', 'n;']\n" ] } ], "prompt_number": 8 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now define a function that acts as a shortcut for the two operations done above (pronunciation and grammatical function):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def extract_information(elem):\n", " return (elem.find('r_ele/reb').text,\n", " [sense.text for sense in elem.findall('sense/pos')])" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 9 }, { "cell_type": "code", "collapsed": false, "input": [ "extract_information(root)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 10, "text": [ "(u'\\u3046\\u3088\\u304f', ['adj-no;', 'n;'])" ] } ], "prompt_number": 10 }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Where are the verbs?\n", "\n", "If we take a look at the DTD of the JMdict XML file we find the following entities corresponding to verbs:" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, this is quite complex. Instead of parsing every single verb type and differentiating between it, we will focus on four verb types only:\n", "\n", "- Ichidan \n", "- Nidan\n", "- Yodan\n", "- Godan\n", "\n", "As a side note, the Ichidan verbs are the *ru* verbs Tae Kim is talking about, while the Godan verbs are the *u* verbs.\n", "\n", "Below, we're looping over the dictionary items and extracting words that possibly contain verbs. To do this, we're checking for the presence of either of the four above verb types." ] }, { "cell_type": "code", "collapsed": false, "input": [ "tree = ET.ElementTree(file='../JMdict.xml')\n", "tree" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 11, "text": [ "" ] } ], "prompt_number": 11 }, { "cell_type": "code", "collapsed": false, "input": [ "verb_list = []\n", "verb_types = [u'Ichidan', u'Nidan', u'Yodan', u'Godan']\n", "for elem in tree.getroot().findall('entry'):\n", " item_info = extract_information(elem)\n", " for verb_type in verb_types:\n", " if \" \".join(item_info[1]).find(verb_type) != -1:\n", " verb_list.append((item_info[0], verb_type))\n", " break\n", " " ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 19 }, { "cell_type": "code", "collapsed": false, "input": [ "len(verb_list)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 20, "text": [ "10266" ] } ], "prompt_number": 20 }, { "cell_type": "code", "collapsed": false, "input": [ "verb_list[:10]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 21, "text": [ "[(u'\\u3042\\u3052\\u3064\\u3089\\u3046', u'Godan'),\n", " (u'\\u3042\\u3057\\u3089\\u3046', u'Godan'),\n", " (u'\\u3042\\u3076\\u308c\\u308b', u'Ichidan'),\n", " (u'\\u3042\\u3084\\u3059', u'Godan'),\n", " (u'\\u3044\\u304b\\u3059', u'Godan'),\n", " (u'\\u3044\\u3058\\u3051\\u308b', u'Ichidan'),\n", " (u'\\u3044\\u3058\\u308a\\u307e\\u308f\\u3059', u'Godan'),\n", " (u'\\u3044\\u3061\\u3083\\u3064\\u304f', u'Godan'),\n", " (u'\\u3044\\u306a\\u306a\\u304f', u'Godan'),\n", " (u'\\u3044\\u3073\\u308b', u'Godan')]" ] } ], "prompt_number": 21 }, { "cell_type": "markdown", "metadata": {}, "source": [ "As the unicode is a bit hard to read, we can format the previous output to the following effect:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "for elem in map(\": \".join, verb_list[:10]):\n", " print elem" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\u3042\u3052\u3064\u3089\u3046: Godan\n", "\u3042\u3057\u3089\u3046: Godan\n", "\u3042\u3076\u308c\u308b: Ichidan\n", "\u3042\u3084\u3059: Godan\n", "\u3044\u304b\u3059: Godan\n", "\u3044\u3058\u3051\u308b: Ichidan\n", "\u3044\u3058\u308a\u307e\u308f\u3059: Godan\n", "\u3044\u3061\u3083\u3064\u304f: Godan\n", "\u3044\u306a\u306a\u304f: Godan\n", "\u3044\u3073\u308b: Godan\n" ] } ], "prompt_number": 23 }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "First statistical breakdown" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we now have a list of verb types, we can calculate their relative frequency:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "for verb_type in verb_types:\n", " print \"%s: %i out of %i verbs\" % (verb_type, len(filter(lambda v: v[1] == verb_type, verb_list)), len(verb_list))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Ichidan: 3415 out of 10266 verbs\n", "Nidan: 38 out of 10266 verbs\n", "Yodan: 25 out of 10266 verbs\n", "Godan: 6788 out of 10266 verbs\n" ] } ], "prompt_number": 38 }, { "cell_type": "code", "collapsed": false, "input": [ "for verb_type in verb_types:\n", " print \"%s: %.1f percent\" % (verb_type, len(filter(lambda v: v[1] == verb_type, verb_list)) / float(len(verb_list)) * 100)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Ichidan: 33.3 percent\n", "Nidan: 0.4 percent\n", "Yodan: 0.2 percent\n", "Godan: 66.1 percent\n" ] } ], "prompt_number": 257 }, { "cell_type": "markdown", "metadata": {}, "source": [ "According to this output, there's a majority of *u* verbs in the Japanese language." ] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Verb types by last character " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Following Tae Kim's data,we're going to count number of verbs of a given type by classifying them according to their last character:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "last_char_dict = {}\n", "for verb in verb_list:\n", " last_char = verb[0][-1] \n", " if last_char not in last_char_dict:\n", " last_char_dict[last_char] = {verb[1] : 1}\n", " else:\n", " if verb[1] in last_char_dict[last_char]:\n", " last_char_dict[last_char][verb[1]] += 1\n", " else:\n", " last_char_dict[last_char][verb[1]] = 1\n", " " ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 40 }, { "cell_type": "code", "collapsed": false, "input": [ "last_char_dict" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 41, "text": [ "{u'\\u3046': {u'Godan': 719, u'Nidan': 6, u'Yodan': 4},\n", " u'\\u304f': {u'Godan': 911, u'Nidan': 2, u'Yodan': 4},\n", " u'\\u3050': {u'Godan': 137, u'Nidan': 2},\n", " u'\\u3059': {u'Godan': 1803, u'Nidan': 1, u'Yodan': 3},\n", " u'\\u305a': {u'Ichidan': 1, u'Nidan': 2},\n", " u'\\u3064': {u'Godan': 244, u'Nidan': 2, u'Yodan': 1},\n", " u'\\u306c': {u'Godan': 9, u'Nidan': 2},\n", " u'\\u3075': {u'Nidan': 1},\n", " u'\\u3076': {u'Godan': 102, u'Nidan': 2, u'Yodan': 2},\n", " u'\\u3080': {u'Godan': 596, u'Nidan': 4},\n", " u'\\u3086': {u'Nidan': 7},\n", " u'\\u308b': {u'Godan': 2267, u'Ichidan': 3414, u'Nidan': 7, u'Yodan': 11}}" ] } ], "prompt_number": 41 }, { "cell_type": "code", "collapsed": false, "input": [ "\", \".join(map(lambda s: \": \".join((s[0], str(s[1]))), last_char_dict[last_char_dict.keys()[0]].items()))" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 53, "text": [ "u'Nidan: 4, Godan: 596'" ] } ], "prompt_number": 53 }, { "cell_type": "code", "collapsed": false, "input": [ "for key in last_char_dict:\n", " print \"verbs ending in %s\" % key, \", \".join(map(lambda s: \": \".join((s[0], str(s[1]))), last_char_dict[key].items()))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "verbs ending in \u3080 Nidan: 4, Godan: 596\n", "verbs ending in \u3064 Yodan: 1, Nidan: 2, Godan: 244\n", "verbs ending in \u3046 Yodan: 4, Nidan: 6, Godan: 719\n", "verbs ending in \u308b Yodan: 11, Nidan: 7, Ichidan: 3414, Godan: 2267\n", "verbs ending in \u3086 Nidan: 7\n", "verbs ending in \u306c Nidan: 2, Godan: 9\n", "verbs ending in \u304f Yodan: 4, Nidan: 2, Godan: 911\n", "verbs ending in \u3050 Nidan: 2, Godan: 137\n", "verbs ending in \u3075 Nidan: 1\n", "verbs ending in \u3076 Yodan: 2, Nidan: 2, Godan: 102\n", "verbs ending in \u3059 Yodan: 3, Nidan: 1, Godan: 1803\n", "verbs ending in \u305a Nidan: 2, Ichidan: 1\n" ] } ], "prompt_number": 54 }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see from the output the first idea given by Tae Kim seems correct: if it doesn't end in *ru* it's a Godan verb (except for one \u305a Ichidan verb). Let's check the second idea: \n", "> if it ends in *ru*, the second before last hiragana mostly determines the verb type\n", "\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "second_last_char_dict = {}\n", "for verb in verb_list:\n", " last_char = verb[0][-1] \n", " if last_char == u'\u308b' and len(verb[0]) >= 2:\n", " second_last_char = verb[0][-2] \n", " if second_last_char not in second_last_char_dict:\n", " second_last_char_dict[second_last_char] = {verb[1] : 1}\n", " else:\n", " if verb[1] in second_last_char_dict[second_last_char]:\n", " second_last_char_dict[second_last_char][verb[1]] += 1\n", " else:\n", " second_last_char_dict[second_last_char][verb[1]] = 1" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 57 }, { "cell_type": "code", "collapsed": false, "input": [ "for key in second_last_char_dict:\n", " print key, second_last_char_dict[key]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\u3083 {u'Yodan': 2, u'Godan': 5}\n", "\u308f {u'Godan': 130}\n", "\u30a7 {u'Godan': 1}\n", "\u30ab {u'Godan': 1}\n", "\u30af {u'Godan': 5}\n", "\u30b3 {u'Godan': 4}\n", "\u30b7 {u'Godan': 3}\n", "\u3044 {u'Ichidan': 32, u'Godan': 76}\n", "\u3048 {u'Yodan': 1, u'Ichidan': 398, u'Godan': 41}\n", "\u30cb {u'Godan': 3}\n", "\u304c {u'Nidan': 1, u'Godan': 162}\n", "\u3050 {u'Godan': 25}\n", "\u30d3 {u'Godan': 1}\n", "\u3054 {u'Nidan': 1, u'Godan': 3}\n", "\u3058 {u'Ichidan': 104, u'Godan': 21}\n", "\u305c {u'Ichidan': 11}\n", "\u30df {u'Godan': 1}\n", "\u3060 {u'Nidan': 1, u'Godan': 18}\n", "\u30e3 {u'Godan': 1}\n", "\u3064 {u'Godan': 34}\n", "\u30e7 {u'Godan': 1}\n", "\u3068 {u'Godan': 161}\n", "\u306c {u'Godan': 6}\n", "\u3070 {u'Godan': 45}\n", "\u3078 {u'Ichidan': 3, u'Godan': 5}\n", "\u307c {u'Godan': 32}\n", "\u3080 {u'Godan': 13}\n", "\u3084 {u'Godan': 28}\n", "\u3088 {u'Godan': 36}\n", "\u308c {u'Ichidan': 548, u'Godan': 1}\n", "\u30b0 {u'Godan': 5}\n", "\u30b8 {u'Godan': 2}\n", "\u30c4 {u'Godan': 1}\n", "\u30c8 {u'Godan': 2}\n", "\u304b {u'Yodan': 2, u'Godan': 149}\n", "\u304f {u'Godan': 69}\n", "\u30d0 {u'Godan': 2}\n", "\u3053 {u'Godan': 35}\n", "\u30d4 {u'Godan': 2}\n", "\u3057 {u'Ichidan': 1, u'Godan': 37}\n", "\u305b {u'Ichidan': 297, u'Godan': 9}\n", "\u30dc {u'Godan': 2}\n", "\u305f {u'Godan': 72}\n", "\u30e0 {u'Godan': 2}\n", "\u3067 {u'Ichidan': 79, u'Godan': 1}\n", "\u306b {u'Ichidan': 3}\n", "\u30ec {u'Ichidan': 2}\n", "\u306f {u'Yodan': 1, u'Godan': 44}\n", "\u3073 {u'Ichidan': 40, u'Godan': 7}\n", "\u307b {u'Godan': 6}\n", "\u307f {u'Ichidan': 61}\n", "\u3081 {u'Ichidan': 470, u'Godan': 7}\n", "\u3089 {u'Ichidan': 1}\n", "\u30ad {u'Godan': 2}\n", "\u30b1 {u'Godan': 1}\n", "\u30b9 {u'Godan': 4}\n", "\u30c1 {u'Godan': 2}\n", "\u3042 {u'Yodan': 1, u'Godan': 28}\n", "\u3046 {u'Godan': 13}\n", "\u30c9 {u'Godan': 1}\n", "\u304a {u'Godan': 45}\n", "\u304e {u'Ichidan': 29, u'Godan': 40}\n", "\u30d1 {u'Godan': 1}\n", "\u3052 {u'Ichidan': 251, u'Godan': 5}\n", "\u30d5 {u'Godan': 2}\n", "\u3056 {u'Yodan': 2, u'Godan': 11}\n", "\u305a {u'Ichidan': 79, u'Godan': 21}\n", "\u305e {u'Godan': 3}\n", "\u30e1 {u'Ichidan': 1}\n", "\u3066 {u'Ichidan': 185, u'Godan': 3}\n", "\u306a {u'Nidan': 2, u'Godan': 147}\n", "\u30ed {u'Godan': 3}\n", "\u306e {u'Godan': 31}\n", "\u3072 {u'Ichidan': 3, u'Godan': 4}\n", "\u3076 {u'Nidan': 1, u'Godan': 70}\n", "\u307e {u'Yodan': 2, u'Godan': 147}\n", "\u3082 {u'Godan': 43}\n", "\u3086 {u'Godan': 1}\n", "\u308a {u'Ichidan': 24, u'Godan': 2}\n", "\u30ba {u'Godan': 1}\n", "\u30c6 {u'Godan': 2}\n", "\u30ca {u'Godan': 1}\n", "\u304d {u'Ichidan': 30, u'Godan': 108}\n", "\u3051 {u'Ichidan': 630, u'Godan': 22}\n", "\u3055 {u'Godan': 58}\n", "\u30d6 {u'Godan': 6}\n", "\u3059 {u'Godan': 16}\n", "\u305d {u'Nidan': 1, u'Godan': 10}\n", "\u3061 {u'Ichidan': 38, u'Godan': 9}\n", "\u30e2 {u'Godan': 3}\n", "\u3065 {u'Godan': 2}\n", "\u3069 {u'Godan': 35}\n", "\u306d {u'Ichidan': 61, u'Godan': 15}\n", "\u3071 {u'Godan': 5}\n", "\u3075 {u'Godan': 18}\n", "\u3079 {u'Ichidan': 33, u'Godan': 9}\n" ] } ], "prompt_number": 58 }, { "cell_type": "markdown", "metadata": {}, "source": [ "This still needs a little sorting!" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def vowel_type(character):\n", " vowels = ['a', 'i', 'u', 'e', 'o']\n", " chars = [u'\u3042\u304b\u3055\u305f\u306a\u306f\u307e\u3084\u3089\u308f\u3083\u30ab\u304c\u3060\u30e3\u3070\u30d0\u30d1\u3056\u30ca\u3071', \n", " u'\u30b7\u3044\u30cb\u30d3\u3058\u30df\u30b8\u30d4\u3057\u306b\u3073\u307f\u30ad\u30c1\u304e\u3072\u308a\u304d\u3061',\n", " u'\u30af\u3050\u3064\u306c\u3080\u30b0\u30c4\u304f\u30e0\u30b9\u3046\u30d5\u305a\u3076\u3086\u30ba\u30d6\u3059\u3065\u3075',\n", " u'\u3079\u30a7\u3048\u305c\u3078\u308c\u305b\u3067\u30ec\u3081\u30b1\u3052\u30e1\u3066\u30c6\u3051\u306d',\n", " u'\u30b3\u3054\u30e7\u3068\u307c\u3088\u30c8\u3053\u30dc\u307b\u30c9\u304a\u305e\u30ed\u306e\u3082\u305d\u30e2\u3069']\n", " for ind, hiragana in enumerate(chars):\n", " if character in hiragana:\n", " return vowels[ind]\n", " print character\n", " raise Exception('character not supported')" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 248 }, { "cell_type": "code", "collapsed": false, "input": [ "final_classification = {}\n", "for key in second_last_char_dict:\n", " vowel = vowel_type(key)\n", " if vowel not in final_classification:\n", " final_classification[vowel] = second_last_char_dict[key]\n", " else:\n", " for verb_group in second_last_char_dict[key]:\n", " if verb_group in final_classification[vowel]:\n", " final_classification[vowel][verb_group] += second_last_char_dict[key][verb_group]\n", " else:\n", " final_classification[vowel][verb_group] = second_last_char_dict[key][verb_group]" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 250 }, { "cell_type": "code", "collapsed": false, "input": [ "final_classification" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 251, "text": [ "{'a': {u'Godan': 1055, u'Ichidan': 1, u'Nidan': 4, u'Yodan': 10},\n", " 'e': {u'Godan': 122, u'Ichidan': 2969, u'Yodan': 1},\n", " 'i': {u'Godan': 320, u'Ichidan': 365},\n", " 'o': {u'Godan': 456, u'Nidan': 2},\n", " 'u': {u'Godan': 314, u'Ichidan': 79, u'Nidan': 1}}" ] } ], "prompt_number": 251 }, { "cell_type": "code", "collapsed": false, "input": [ "for key in final_classification.keys():\n", " print 'vowel sound ending in %s' % key\n", " print \", \".join(map(lambda s: \": \".join((s[0], str(s[1]))), final_classification[key].items())) " ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "vowel sound ending in a\n", "Yodan: 10, Nidan: 4, Ichidan: 1, Godan: 1055\n", "vowel sound ending in u\n", "Nidan: 1, Ichidan: 79, Godan: 314\n", "vowel sound ending in e\n", "Yodan: 1, Ichidan: 2969, Godan: 122\n", "vowel sound ending in i\n", "Ichidan: 365, Godan: 320\n", "vowel sound ending in o\n", "Nidan: 2, Godan: 456\n" ] } ], "prompt_number": 255 }, { "cell_type": "code", "collapsed": false, "input": [ "for key in final_classification.keys():\n", " print 'vowel sound ending in %s' % key\n", " total = sum(map(lambda i: i[1], final_classification[key].items()))\n", " print \", \".join(map(lambda s: \": \".join((s[0], format(s[1] / float(total) * 100, \".2f\"))), final_classification[key].items()))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "vowel sound ending in a\n", "Yodan: 0.93, Nidan: 0.37, Ichidan: 0.09, Godan: 98.60\n", "vowel sound ending in u\n", "Nidan: 0.25, Ichidan: 20.05, Godan: 79.70\n", "vowel sound ending in e\n", "Yodan: 0.03, Ichidan: 96.02, Godan: 3.95\n", "vowel sound ending in i\n", "Ichidan: 53.28, Godan: 46.72\n", "vowel sound ending in o\n", "Nidan: 0.44, Godan: 99.56\n" ] } ], "prompt_number": 263 }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Concluding remarks\n", "\n", "## High certainty\n", "\n", "- if vowel sound is o, it's a *u* verb in 100% of cases\n", "- if vowel sound is a, it's a *u* verb in 99% of cases\n", "- if vowel sound is e, it's a *ru* verb in 96% of cases\n", "\n", "## Lower certainty\n", "\n", "- if vowel sound is u, it's a *u* verb in 80% of cases and a *ru* verb in 20% of cases\n", "- if vowel sound is i, it's *ru* verb in 53% of cases and *u* verb in 47% of cases\n" ] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Side note" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The more I'm computing this sort of thing, the more I'm thinking I could do this in an easier way with Pandas." ] } ], "metadata": {} } ] }