{
"metadata": {
"name": "Kanjidic2 & JMDict"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This document is organized in three sections:\n",
"\n",
"- A recap of XML parsing with ElementTree, following [Eli Bendersky's online article](http://eli.thegreenplace.net/2012/03/15/processing-xml-in-python-with-elementtree/)\n",
"- Parsing and helper functions for Kanjidic2\n",
"- Parsing and helper functions for JMdict"
]
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Tests from Eli Bendersky's article"
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Basic parsing"
]
},
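{
"cell_type": "raw",
"metadata": {},
"source": [
"<?xml version=\"1.0\"?>\n",
"<doc>\n",
"    <branch name=\"testing\" hash=\"1cdf045c\">\n",
"        text,source\n",
"    </branch>\n",
"    <branch name=\"release01\" hash=\"f200013e\">\n",
"        <sub-branch name=\"subrelease01\">\n",
"            xml,sgml\n",
"        </sub-branch>\n",
"    </branch>\n",
"    <branch name=\"invalid\">\n",
"    </branch>\n",
"</doc>"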
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import xml.etree.cElementTree as ET\n",
"tree = ET.ElementTree(file='doc1.xml')"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 2
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"tree.getroot()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 80,
"text": [
""
]
}
],
"prompt_number": 80
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"root = tree.getroot()\n",
"root.tag, root.attrib"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 83,
"text": [
"('doc', {})"
]
}
],
"prompt_number": 83
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"for child_of_root in root:\n",
"    print child_of_root.tag, child_of_root.attrib"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"branch {'hash': '1cdf045c', 'name': 'testing'}\n",
"branch {'hash': 'f200013e', 'name': 'release01'}\n",
"branch {'name': 'invalid'}\n"
]
}
],
"prompt_number": 84
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"root[0].tag, root[0].text"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 85,
"text": [
"('branch', '\\n text,source\\n ')"
]
}
],
"prompt_number": 85
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Finding interesting elements"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"for elem in tree.iter():\n",
"    print elem.tag, elem.attrib"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"doc {}\n",
"branch {'hash': '1cdf045c', 'name': 'testing'}\n",
"branch {'hash': 'f200013e', 'name': 'release01'}\n",
"sub-branch {'name': 'subrelease01'}\n",
"branch {'name': 'invalid'}\n"
]
}
],
"prompt_number": 86
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"for elem in tree.iter(tag='branch'):\n",
"    print elem.tag, elem.attrib"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"branch {'hash': '1cdf045c', 'name': 'testing'}\n",
"branch {'hash': 'f200013e', 'name': 'release01'}\n",
"branch {'name': 'invalid'}\n"
]
}
],
"prompt_number": 87
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Using XPath"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"for elem in tree.iterfind('branch/sub-branch'):\n",
"    print elem.tag, elem.attrib"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"sub-branch {'name': 'subrelease01'}\n"
]
}
],
"prompt_number": 97
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"for elem in tree.iterfind('branch'):\n",
"    print elem.tag, elem.attrib"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"branch {'hash': '1cdf045c', 'name': 'testing'}\n",
"branch {'hash': 'f200013e', 'name': 'release01'}\n",
"branch {'name': 'invalid'}\n"
]
}
],
"prompt_number": 99
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"for elem in tree.iterfind('branch[@name=\"release01\"]'):\n",
"    print elem.tag, elem.attrib"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"branch {'hash': 'f200013e', 'name': 'release01'}\n"
]
}
],
"prompt_number": 90
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Kanjidic2"
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Example file"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"\n",
"\n",
" \u672c\n",
" \n",
" 672c\n",
" 43-60\n",
" \n",
" \n",
" 75\n",
" 2\n",
" \n",
" \n",
" 1\n",
" 5\n",
" 52-81\n",
" 10\n",
" 4\n",
" \n",
" \n",
" 96\n",
" 2536\n",
" 3502\n",
" 2183\n",
" 211\n",
" 15\n",
" 212\n",
" 20\n",
" 14421\n",
" 70\n",
" 25\n",
" 45\n",
" 61\n",
" 76\n",
" 47\n",
" 6\n",
" 37\n",
" 2.1\n",
" 1046\n",
" 215\n",
" \n",
" \n",
" 4-5-3\n",
" 0a5.25\n",
" 5023.0\n",
" 1855\n",
" \n",
" \n",
" \n",
" ben3\n",
" bon\n",
" \ubcf8\n",
" \u30db\u30f3\n",
" \u3082\u3068\n",
" book\n",
" present\n",
" main\n",
" true\n",
" real\n",
" counter for long cylindrical things\n",
" livre\n",
" pr\u00e9sent\n",
" essentiel\n",
" origine\n",
" principal\n",
" r\u00e9alit\u00e9\n",
" v\u00e9rit\u00e9\n",
" compteur d'objets allong\u00e9s\n",
" libro\n",
" origen\n",
" base\n",
" contador de cosas alargadas\n",
" livro\n",
" presente\n",
" real\n",
" verdadeiro\n",
" principal\n",
" sufixo p/ contagem De coisas longas\n",
" \n",
" \u307e\u3068\n",
" \n",
""
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import xml.etree.cElementTree as ET\n",
"tree = ET.ElementTree(file='kanjidic2_example.xml')"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 105
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First of all, what does the tree look like in this example file?"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"elems = [elem for elem in tree.iter()][:10]\n",
"elems"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 25,
"text": [
"[,\n",
" ,\n",
" ,\n",
" ,\n",
" ,\n",
" ,\n",
" ,\n",
" ,\n",
" ,\n",
" ]"
]
}
],
"prompt_number": 25
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Getting the root: the 'character'."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"root = tree.getroot()\n",
"root"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 15,
"text": [
""
]
}
],
"prompt_number": 15
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Getting the literal."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"literal = root[0]\n",
"literal"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 16,
"text": [
""
]
}
],
"prompt_number": 16
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"kanji = literal.text\n",
"kanji"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 17,
"text": [
"u'\\u672c'"
]
}
],
"prompt_number": 17
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print kanji"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\u672c\n"
]
}
],
"prompt_number": 18
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Getting the meanings."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"meanings = [elem for elem in tree.iter('meaning')]\n",
"[meaning.text for meaning in meanings]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 44,
"text": [
"['book',\n",
" 'present',\n",
" 'main',\n",
" 'true',\n",
" 'real',\n",
" 'counter for long cylindrical things',\n",
" 'livre',\n",
" u'pr\\xe9sent',\n",
" 'essentiel',\n",
" 'origine',\n",
" 'principal',\n",
" u'r\\xe9alit\\xe9',\n",
" u'v\\xe9rit\\xe9',\n",
" u\"compteur d'objets allong\\xe9s\",\n",
" 'libro',\n",
" 'origen',\n",
" 'base',\n",
" 'contador de cosas alargadas',\n",
" 'livro',\n",
" 'presente',\n",
" 'real',\n",
" 'verdadeiro',\n",
" 'principal',\n",
" 'sufixo p/ contagem De coisas longas']"
]
}
],
"prompt_number": 44
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But here we only want the English meanings, which are the `meaning` elements without an `m_lang` attribute."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"meanings[10].attrib"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 46,
"text": [
"{'m_lang': 'fr'}"
]
}
],
"prompt_number": 46
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"english_meanings = filter(lambda elem: elem.attrib == {}, meanings)\n",
"[meaning.text for meaning in english_meanings]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 51,
"text": [
"['book',\n",
" 'present',\n",
" 'main',\n",
" 'true',\n",
" 'real',\n",
" 'counter for long cylindrical things']"
]
}
],
"prompt_number": 51
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we can get the readings, which include the kana."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"readings = [elem for elem in tree.iter('reading')]\n",
"print [reading.text for reading in readings]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"['ben3', 'bon', u'\\ubcf8', u'\\u30db\\u30f3', u'\\u3082\\u3068']\n"
]
}
],
"prompt_number": 60
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Filtering for the kana: only the readings whose `r_type` is `ja_on` or `ja_kun` are kept."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"readings[0].attrib['r_type']"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 64,
"text": [
"'pinyin'"
]
}
],
"prompt_number": 64
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"'r_type' in readings[0].attrib"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 67,
"text": [
"True"
]
}
],
"prompt_number": 67
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"kanas = filter(lambda reading: reading.attrib['r_type'] in ['ja_on', 'ja_kun'], readings)\n",
"kanas"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 73,
"text": [
"[, ]"
]
}
],
"prompt_number": 73
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"for kana in kanas:\n",
"    print kana.text"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\u30db\u30f3\n",
"\u3082\u3068\n"
]
}
],
"prompt_number": 75
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"The whole Kanjidic2 file"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import xml.etree.cElementTree as ET\n",
"tree = ET.ElementTree(file='kanjidic2.xml')\n",
"tree"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 1,
"text": [
""
]
}
],
"prompt_number": 1
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"root = tree.getroot()\n",
"root"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 2,
"text": [
""
]
}
],
"prompt_number": 2
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"root.findall('character/literal')[:10]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 3,
"text": [
"[,\n",
" ,\n",
" ,\n",
" ,\n",
" ,\n",
" ,\n",
" ,\n",
" ,\n",
" ,\n",
" ]"
]
}
],
"prompt_number": 3
},
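{
"cell_type": "markdown",
"metadata": {},
"source": [
"I *understand* now: `findall` matches a path relative to the element it is called on, so the exact branching (`character/literal`) has to be spelled out, or the search has to be widened to all descendants with `.//literal`. `iter` works with a bare tag name because it walks the whole subtree depth first and keeps only the elements with that tag."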
]
},
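{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the difference between `findall` and `iter`, run on a small inline document instead of `kanjidic2.xml`. The sample XML and all the `demo_*` names are assumptions made for illustration, and the plain `xml.etree.ElementTree` module is imported under its own name so that the `tree` and `root` variables used in this notebook are left untouched."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from xml.etree import ElementTree\n",
"\n",
"# hypothetical two-entry document that mimics the kanjidic2 layout\n",
"demo_xml = '''<kanjidic2>\n",
"<character><literal>A</literal></character>\n",
"<character><literal>B</literal></character>\n",
"</kanjidic2>'''\n",
"demo_root = ElementTree.fromstring(demo_xml)\n",
"\n",
"# a bare tag only matches direct children of demo_root, so nothing is found\n",
"demo_direct = demo_root.findall('literal')\n",
"# spelling out the branching finds both literals\n",
"demo_by_path = [e.text for e in demo_root.findall('character/literal')]\n",
"# './/' widens the search to all descendants\n",
"demo_anywhere = [e.text for e in demo_root.findall('.//literal')]\n",
"# iter walks the whole subtree depth first and filters on the tag\n",
"demo_by_iter = [e.text for e in demo_root.iter('literal')]\n",
"\n",
"print(demo_by_path == demo_anywhere == demo_by_iter)"
],
"language": "python",
"metadata": {},
"outputs": []
},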
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Searching for the entry of a specific kanji."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"search_kanji = u'\u672c'\n",
"literals = root.findall('character/literal')\n",
"literals[:10]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 4,
"text": [
"[,\n",
" ,\n",
" ,\n",
" ,\n",
" ,\n",
" ,\n",
" ,\n",
" ,\n",
" ,\n",
" ]"
]
}
],
"prompt_number": 4
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"len(literals)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 5,
"text": [
"13108"
]
}
],
"prompt_number": 5
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"tree.find('character/literal')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 147,
"text": [
""
]
}
],
"prompt_number": 147
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"[literal.text for literal in literals].index(u'\u8a71')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 6,
"text": [
"2948"
]
}
],
"prompt_number": 6
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print literals[2948].text"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\u8a71\n"
]
}
],
"prompt_number": 7
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Getting the parent node."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"characters = root.findall('character')\n",
"characters[:10]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 151,
"text": [
"[,\n",
" ,\n",
" ,\n",
" ,\n",
" ,\n",
" ,\n",
" ,\n",
" ,\n",
" ,\n",
" ]"
]
}
],
"prompt_number": 151
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print characters[2948][0].text"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\u8a71\n"
]
}
],
"prompt_number": 153
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Defining helper functions"
]
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Find a specific kanji in the dictionary"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def find_element_by_kanji(tree, kanji):\n",
"    \"\"\"returns the 'character' element whose literal is kanji, or None if it is absent\"\"\"\n",
"    # a single pass over the characters avoids building two full lists\n",
"    for character in tree.getroot().findall('character'):\n",
"        if character.find('literal').text == kanji:\n",
"            return character\n",
"    return None"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 155
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"kuruma = find_element_by_kanji(tree, u'\u8eca')\n",
"kuruma"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 158,
"text": [
""
]
}
],
"prompt_number": 158
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print kuruma[0].text"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\u8eca\n"
]
}
],
"prompt_number": 160
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Extract meaningful information from a 'character'"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def extract_data(element):\n",
"    \"\"\"returns the kanji, the kana and the English meanings from a 'character' element\"\"\"\n",
"    kanji = element.find('literal').text\n",
"    readings = element.findall('reading_meaning/rmgroup/reading')\n",
"    # keep only the Japanese on and kun readings\n",
"    kana = [r.text for r in readings if r.attrib.get('r_type') in ('ja_on', 'ja_kun')]\n",
"    # English meanings are the ones without any attribute (no m_lang)\n",
"    glosses = element.findall('reading_meaning/rmgroup/meaning')\n",
"    meanings = [m.text for m in glosses if not m.attrib]\n",
"    return (kanji, kana, meanings)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 173
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def disp_data(data):\n",
"    print data[0]\n",
"    for item in data[1]:\n",
"        print item\n",
"    for item in data[2]:\n",
"        print item\n",
"\n",
"data = extract_data(kuruma)\n",
"disp_data(data)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\u8eca\n",
"\u30b7\u30e3\n",
"\u304f\u308b\u307e\n",
"car\n"
]
}
],
"prompt_number": 176
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"disp_data(extract_data(find_element_by_kanji(tree, u'\u8a71')))"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\u8a71\n",
"\u30ef\n",
"\u306f\u306a.\u3059\n",
"\u306f\u306a\u3057\n",
"tale\n",
"talk\n"
]
}
],
"prompt_number": 178
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"disp_data(extract_data(find_element_by_kanji(tree, u'\u5c16')))"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\u5c16\n",
"\u30bb\u30f3\n",
"\u3068\u304c.\u308b\n",
"\u3055\u304d\n",
"\u3059\u308b\u3069.\u3044\n",
"be pointed\n",
"sharp\n",
"taper\n",
"displeased\n",
"angry\n",
"edgy\n"
]
}
],
"prompt_number": 179
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"JMdict"
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Working with the example file"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"\n",
" 1171270\n",
" \n",
" \u53f3\u7ffc\n",
" ichi1\n",
" news1\n",
" nf04\n",
" \n",
" \n",
" \u3046\u3088\u304f\n",
" ichi1\n",
" news1\n",
" nf04\n",
" \n",
" \n",
" adj-no;\n",
" right-wing\n",
" aile droite (oiseau, arm\u00e9e, parti politique, base-ball)\n",
" \u043f\u0440\u0430\u0301\u0432\u043e\u0435 \u043a\u0440\u044b\u043b\u043e\u0301\n",
" \u043f\u0440\u0430\u0301\u0432\u044b\u0439 \u0444\u043b\u0430\u043d\u0433\n",
" die Rechte\n",
" rechter Fl\u00fcgel\n",
" \n",
" \n",
" n;\n",
" right field (e.g. in sport)\n",
" right flank\n",
" right wing\n",
" {Sport}\n",
" rechte Flanke\n",
" rechter Fl\u00fcgel\n",
" \n",
""
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"tree = ET.ElementTree(file='JMdict_example.xml')\n",
"tree"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 195,
"text": [
""
]
}
],
"prompt_number": 195
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"root = tree.getroot()\n",
"root"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 196,
"text": [
""
]
}
],
"prompt_number": 196
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Looking at the first few elements."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"elems = [elem for elem in tree.iter()][:10]\n",
"elems"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 197,
"text": [
"[,\n",
" ,\n",
" ,\n",
" ,\n",
" ,\n",
" ,\n",
" ,\n",
" ,\n",
" ,\n",
" ]"
]
}
],
"prompt_number": 197
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"expression = root.find('k_ele/keb').text\n",
"print expression"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\u53f3\u7ffc\n"
]
}
],
"prompt_number": 198
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"reading = root.find('r_ele/reb').text\n",
"print reading"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\u3046\u3088\u304f\n"
]
}
],
"prompt_number": 200
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"senses = root.findall('sense/gloss')\n",
"senses"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 203,
"text": [
"[,\n",
" ,\n",
" ,\n",
" ,\n",
" ,\n",
" ,\n",
" ,\n",
" ,\n",
" ,\n",
" ,\n",
" ,\n",
" ]"
]
}
],
"prompt_number": 203
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"senses = filter(lambda sense: sense.attrib == {}, senses)\n",
"senses"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 208,
"text": [
"[,\n",
" ,\n",
" ,\n",
" ]"
]
}
],
"prompt_number": 208
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"for sense in senses:\n",
"    print sense.text"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"right-wing\n",
"right field (e.g. in sport)\n",
"right flank\n",
"right wing\n"
]
}
],
"prompt_number": 209
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Working with the whole file"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"tree = ET.ElementTree(file='JMdict.xml')\n",
"tree"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 3,
"text": [
""
]
}
],
"prompt_number": 3
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"root = tree.getroot()\n",
"root"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 4,
"text": [
""
]
}
],
"prompt_number": 4
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"word_entries = tree.getroot().findall('entry/k_ele/keb')\n",
"words = [entry.text for entry in word_entries]"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 5
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"len(words)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 6,
"text": [
"165048"
]
}
],
"prompt_number": 6
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"for word in words[:50]:\n",
"    print word"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\u3003\n",
"\u4edd\n",
"\u3005\n",
"\u6f22\u6570\u5b57\u30bc\u30ed\n",
"\u25cb\n",
"\u3007\n",
"\uff21\uff22\uff23\u9806\n",
"\uff23\uff24\u30d7\u30ec\u30fc\u30e4\u30fc\n",
"\uff23\uff24\u30d7\u30ec\u30a4\u30e4\u30fc\n",
"\uff2e\u97ff\n",
"\uff2f\u30d0\u30c3\u30af\n",
"\uff32\uff33\uff12\uff13\uff12\u30b1\u30fc\u30d6\u30eb\n",
"\uff34\u30b7\u30e3\u30c4\n",
"\uff34\u30d0\u30c3\u30af\n",
"\u3042\u3046\u3093\u306e\u547c\u5438\n",
"\u963f\u543d\u306e\u547c\u5438\n",
"\u660e\u767d\n",
"\u660e\u767d\n",
"\u5078\u9591\n",
"\u767d\u5730\n",
"\u660e\u304b\u3093\n",
"\u60aa\u3069\u3044\n",
"\u8ad6\u3046\n",
"\u99ac\u9154\u6728\n",
"\u5f7c\u51e6\n",
"\u5f7c\u6240\n",
"\u3042\u3063\u3068\u8a00\u3046\u9593\u306b\n",
"\u3042\u3063\u3068\u3044\u3046\u9593\u306b\n",
"\u3042\u3063\u3068\u3086\u3046\u9593\u306b\n",
"\u5f7c\u306e\n",
"\u3042\u306e\u4eba\n",
"\u5f7c\u306e\u4eba\n",
"\u3042\u306e\u65b9\n",
"\u5f7c\u306e\u65b9\n",
"\u6ea2\u308c\u308b\n",
"\u963f\u5446\u9640\u7f85\n",
"\u7518\u5b50\n",
"\u5929\u9b5a\n",
"\u96e8\u5b50\n",
"\ud867\ude8a\n",
"\u5f7c\n",
"\u3044\u3044\u52a0\u6e1b\u306b\u3057\u306a\u3055\u3044\n",
"\u3044\u3044\u5e74\u3092\u3057\u3066\n",
"\u5426\u3005\n",
"\u5426\u5426\n",
"\u5982\u4f55\u308f\u3057\u3044\n",
"\u3044\u304b\u306a\u308b\u5834\u5408\u3067\u3082\n",
"\u5982\u4f55\u306b\u3082\n",
"\u5e7e\u3064\u3082\n",
"\u884c\u3051\u306a\u3044\n"
]
}
],
"prompt_number": 221
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"words[49][0] in words[34]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 237,
"text": [
"False"
]
}
],
"prompt_number": 237
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finding all expressions that contain a specific kanji:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"filtered_words = filter(lambda expression: u'\u5bfa' in expression, words)\n",
"for word in filtered_words:\n",
"    print word"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\u99c6\u3051\u8fbc\u307f\u5bfa\n",
"\u99c6\u8fbc\u307f\u5bfa\n",
"\u53e4\u793e\u5bfa\n",
"\u5c71\u5bfa\n",
"\u5bfa\n",
"\u5bfa\u9662\n",
"\u7985\u5bfa\n",
"\u50e7\u5bfa\n",
"\u5927\u5bfa\u9662\n",
"\u4e2d\u7985\u5bfa\u6e56\n",
"\u5c3c\u5bfa\n",
"\u4ecf\u5bfa\n",
"\u672b\u5bfa\n",
"\u53e4\u5bfa\n",
"\u5bfa\u793e\n",
"\u793e\u5bfa\n",
"\u56fd\u5206\u5bfa\n",
"\u5bfa\u53c2\u308a\n",
"\u5bfa\u5b50\u5c4b\n",
"\u5bfa\u5c0f\u5c4b\n",
"\u56de\u6559\u5bfa\u9662\n",
"\u7e01\u5207\u308a\u5bfa\n",
"\u6c0f\u5bfa\n",
"\u6a80\u90a3\u5bfa\n",
"\u52c5\u9858\u5bfa\n",
"\u5bfa\u7537\n",
"\u5bfa\u92ad\n",
"\u83e9\u63d0\u5bfa\n",
"\u5bfa\u683c\n",
"\u516b\u767e\u516b\u5bfa\n",
"\u5bfa\u5185\n",
"\u5165\u5bfa\n",
"\u6575\u306f\u672c\u80fd\u5bfa\u306b\u3042\u308a\n",
"\u6575\u306f\u672c\u80fd\u5bfa\u306b\u5728\u308a\n",
"\u5bfa\u53f7\n",
"\u5bfa\u57df\n",
"\u5b98\u5bfa\n",
"\u5927\u899a\u5bfa\u7d71\n",
"\u8107\u5bfa\n",
"\u5bfa\u4e2d\n",
"\u5bfa\u793e\u5949\u884c\n",
"\u5bfa\u9810\u3051\n",
"\u5bfa\u5165\u308a\n",
"\u5357\u90fd\u4e03\u5927\u5bfa\n",
"\u4e03\u5927\u5bfa\n",
"\u672c\u9858\u5bfa\u6d3e\n",
"\u4ecf\u5149\u5bfa\u6d3e\n",
"\u8aa0\u7167\u5bfa\u6d3e\n",
"\u5c11\u6797\u5bfa\u62f3\u6cd5\n",
"\u5bfa\n",
"\u304a\u5bfa\n",
"\u5fa1\u5bfa\n",
"\u304a\u5bfa\u69d8\n",
"\u304a\u5bfa\u3055\u307e\n",
"\u5fa1\u5bfa\u69d8\n",
"\u7d05\u5999\u84ee\u5bfa\n",
"\u5bfa\u8acb\n",
"\u5bfa\u8acb\u3051\n",
"\u5bfa\u8acb\u5236\u5ea6\n",
"\u5bfa\u6a80\u5236\u5ea6\n",
"\u4e09\u4e95\u5bfa\u6b69\u884c\u866b\n",
"\u4e09\u4e95\u5bfa\u82a5\u866b\n",
"\u5bfa\u5b50\n",
"\u5c11\u6797\u5bfa\u6d41\n",
"\u5bfa\u9818\n",
"\u672c\u80fd\u5bfa\u306e\u5909\n",
"\u5bfa\u52d9\n",
"\u76e3\u5bfa\n",
"\u90fd\u5bfa\n",
"\u526f\u5bfa\n",
"\u5bfa\u52d9\u6240\n",
"\u79c1\u5bfa\n",
"\u304a\u5bfa\u3055\u3093\n",
"\u5fa1\u5bfa\u3055\u3093\n",
"\u9053\u660e\u5bfa\u7c89\n",
"\u5ec3\u5bfa\n",
"\u5f53\u5bfa\n",
"\u5bfa\u5185\u753a\n",
"\u5bae\u5bfa\n",
"\u795e\u5bae\u5bfa\n",
"\u8af8\u5bfa\n"
]
}
],
"prompt_number": 240
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Outline of what could be done from this"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- build a sort of exploratory app that starts from a kanji or a word, lists all compounds in the dictionary that contain the given kanji, lets the user reselect any of them at a later stage, and offers a way to visualize the data associated with each kanji\n",
"- probably the easiest thing to do is to classify words by frequency\n",
"- add support for reading Anki decks or, better, integrate with Anki desktop, since it is written in Python"
]
},
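{
"cell_type": "markdown",
"metadata": {},
"source": [
"A possible starting point for the first idea, as a rough sketch rather than a finished design: index every word by the characters it contains, so that listing the compounds for a kanji becomes a dictionary lookup instead of a scan over all expressions. `build_kanji_index`, `sample_words`, `kanji_index` and `temple_words` are hypothetical names, and the short sample list stands in for the `words` list built from JMdict."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from collections import defaultdict\n",
"\n",
"def build_kanji_index(words):\n",
"    \"\"\"maps each character to the list of words that contain it\"\"\"\n",
"    index = defaultdict(list)\n",
"    for word in words:\n",
"        # set() so that a word with a repeated character is listed once\n",
"        for char in set(word):\n",
"            index[char].append(word)\n",
"    return index\n",
"\n",
"# hypothetical sample standing in for the `words` list built from JMdict\n",
"sample_words = [u'\u5c71\u5bfa', u'\u5bfa\u9662', u'\u53e4\u5bfa', u'\u660e\u767d']\n",
"kanji_index = build_kanji_index(sample_words)\n",
"# all sample words that contain the kanji for temple, in their original order\n",
"temple_words = kanji_index[u'\u5bfa']\n",
"print(len(temple_words))"
],
"language": "python",
"metadata": {},
"outputs": []
},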
{
"cell_type": "code",
"collapsed": false,
"input": [],
"language": "python",
"metadata": {},
"outputs": []
}
],
"metadata": {}
}
]
}