{ "metadata": { "name": "Kanjidic2 & JMDict" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "This document is organized in 3 sections:\n", "\n", "- Recap of Eli Bendersky's XML parsing from [his online article](http://eli.thegreenplace.net/2012/03/15/processing-xml-in-python-with-elementtree/)\n", "- Parsing and helper functions for Kanjidic2\n", "- Parsing and helper functions for JMDict" ] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Eli Bendersky's article tests" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Basic parsing" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "<?xml version=\"1.0\"?>\n", "<doc>\n", "    <branch name=\"testing\" hash=\"1cdf045c\">\n", "        text,source\n", "    </branch>\n", "    <branch name=\"release01\" hash=\"f200013e\">\n", "        <sub-branch name=\"subrelease01\">\n", "            xml,sgml\n", "        </sub-branch>\n", "    </branch>\n", "    <branch name=\"invalid\">\n", "    </branch>\n", "</doc>" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import xml.etree.cElementTree as ET\n", "tree = ET.ElementTree(file='doc1.xml')" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "code", "collapsed": false, "input": [ "tree.getroot()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 80, "text": [ "" ] } ], "prompt_number": 80 }, { "cell_type": "code", "collapsed": false, "input": [ "root = tree.getroot()\n", "root.tag, root.attrib" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 83, "text": [ "('doc', {})" ] } ], "prompt_number": 83 }, { "cell_type": "code", "collapsed": false, "input": [ "for child_of_root in root:\n", " print child_of_root.tag, child_of_root.attrib" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "branch {'hash': '1cdf045c', 'name': 'testing'}\n", "branch {'hash': 'f200013e', 'name': 'release01'}\n", "branch {'name': 'invalid'}\n" ] } ], "prompt_number": 84 }, { "cell_type": "code", "collapsed": false, "input": [ "root[0].tag, 
root[0].text" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 85, "text": [ "('branch', '\\n text,source\\n ')" ] } ], "prompt_number": 85 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Find interesting stuff" ] }, { "cell_type": "code", "collapsed": false, "input": [ "for elem in tree.iter():\n", " print elem.tag, elem.attrib" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "doc {}\n", "branch {'hash': '1cdf045c', 'name': 'testing'}\n", "branch {'hash': 'f200013e', 'name': 'release01'}\n", "sub-branch {'name': 'subrelease01'}\n", "branch {'name': 'invalid'}\n" ] } ], "prompt_number": 86 }, { "cell_type": "code", "collapsed": false, "input": [ "for elem in tree.iter(tag='branch'):\n", " print elem.tag, elem.attrib" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "branch {'hash': '1cdf045c', 'name': 'testing'}\n", "branch {'hash': 'f200013e', 'name': 'release01'}\n", "branch {'name': 'invalid'}\n" ] } ], "prompt_number": 87 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Using XPath" ] }, { "cell_type": "code", "collapsed": false, "input": [ "for elem in tree.iterfind('branch/sub-branch'):\n", " print elem.tag, elem.attrib" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "sub-branch {'name': 'subrelease01'}\n" ] } ], "prompt_number": 97 }, { "cell_type": "code", "collapsed": false, "input": [ "for elem in tree.iterfind('branch'):\n", " print elem.tag, elem.attrib" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "branch {'hash': '1cdf045c', 'name': 'testing'}\n", "branch {'hash': 'f200013e', 'name': 'release01'}\n", "branch {'name': 'invalid'}\n" ] } ], "prompt_number": 99 }, { "cell_type": "code", "collapsed": false, "input": 
[ "for elem in tree.iterfind('branch[@name=\"release01\"]'):\n", " print elem.tag, elem.attrib" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "branch {'hash': 'f200013e', 'name': 'release01'}\n" ] } ], "prompt_number": 90 }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Kanjidic2" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Example file" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "\n", "\n", " \u672c\n", " \n", " 672c\n", " 43-60\n", " \n", " \n", " 75\n", " 2\n", " \n", " \n", " 1\n", " 5\n", " 52-81\n", " 10\n", " 4\n", " \n", " \n", " 96\n", " 2536\n", " 3502\n", " 2183\n", " 211\n", " 15\n", " 212\n", " 20\n", " 14421\n", " 70\n", " 25\n", " 45\n", " 61\n", " 76\n", " 47\n", " 6\n", " 37\n", " 2.1\n", " 1046\n", " 215\n", " \n", " \n", " 4-5-3\n", " 0a5.25\n", " 5023.0\n", " 1855\n", " \n", " \n", " \n", " ben3\n", " bon\n", " \ubcf8\n", " \u30db\u30f3\n", " \u3082\u3068\n", " book\n", " present\n", " main\n", " true\n", " real\n", " counter for long cylindrical things\n", " livre\n", " pr\u00e9sent\n", " essentiel\n", " origine\n", " principal\n", " r\u00e9alit\u00e9\n", " v\u00e9rit\u00e9\n", " compteur d'objets allong\u00e9s\n", " libro\n", " origen\n", " base\n", " contador de cosas alargadas\n", " livro\n", " presente\n", " real\n", " verdadeiro\n", " principal\n", " sufixo p/ contagem De coisas longas\n", " \n", " \u307e\u3068\n", " \n", "" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import xml.etree.cElementTree as ET\n", "tree = ET.ElementTree(file='kanjidic2_example.xml')" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 105 }, { "cell_type": "markdown", "metadata": {}, "source": [ "First of all, what does the tree look like in this example file?" 
] }, { "cell_type": "code", "collapsed": false, "input": [ "elems = [elem for elem in tree.iter()][:10]\n", "elems" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 25, "text": [ "[,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ]" ] } ], "prompt_number": 25 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Getting the root: the 'character'." ] }, { "cell_type": "code", "collapsed": false, "input": [ "root = tree.getroot()\n", "root" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 15, "text": [ "" ] } ], "prompt_number": 15 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Getting the literal." ] }, { "cell_type": "code", "collapsed": false, "input": [ "literal = root[0]\n", "literal" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 16, "text": [ "" ] } ], "prompt_number": 16 }, { "cell_type": "code", "collapsed": false, "input": [ "kanji = literal.text\n", "kanji" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 17, "text": [ "u'\\u672c'" ] } ], "prompt_number": 17 }, { "cell_type": "code", "collapsed": false, "input": [ "print kanji" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\u672c\n" ] } ], "prompt_number": 18 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Getting the meanings." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "meanings = [elem for elem in tree.iter('meaning')]\n", "[meaning.text for meaning in meanings]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 44, "text": [ "['book',\n", " 'present',\n", " 'main',\n", " 'true',\n", " 'real',\n", " 'counter for long cylindrical things',\n", " 'livre',\n", " u'pr\\xe9sent',\n", " 'essentiel',\n", " 'origine',\n", " 'principal',\n", " u'r\\xe9alit\\xe9',\n", " u'v\\xe9rit\\xe9',\n", " u\"compteur d'objets allong\\xe9s\",\n", " 'libro',\n", " 'origen',\n", " 'base',\n", " 'contador de cosas alargadas',\n", " 'livro',\n", " 'presente',\n", " 'real',\n", " 'verdadeiro',\n", " 'principal',\n", " 'sufixo p/ contagem De coisas longas']" ] } ], "prompt_number": 44 }, { "cell_type": "markdown", "metadata": {}, "source": [ "But here we only want the English meanings. In this file they are the `meaning` elements that carry no `m_lang` attribute." ] }, { "cell_type": "code", "collapsed": false, "input": [ "meanings[10].attrib" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 46, "text": [ "{'m_lang': 'fr'}" ] } ], "prompt_number": 46 }, { "cell_type": "code", "collapsed": false, "input": [ "english_meanings = filter(lambda elem: elem.attrib == {}, meanings)\n", "[meaning.text for meaning in english_meanings]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 51, "text": [ "['book',\n", " 'present',\n", " 'main',\n", " 'true',\n", " 'real',\n", " 'counter for long cylindrical things']" ] } ], "prompt_number": 51 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we can get the kana readings." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "readings = [elem for elem in tree.iter('reading')]\n", "print [reading.text for reading in readings]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "['ben3', 'bon', u'\\ubcf8', u'\\u30db\\u30f3', u'\\u3082\\u3068']\n" ] } ], "prompt_number": 60 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Filtering for the kana readings, i.e. those whose `r_type` is `ja_on` or `ja_kun`." ] }, { "cell_type": "code", "collapsed": false, "input": [ "readings[0].attrib['r_type']" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 64, "text": [ "'pinyin'" ] } ], "prompt_number": 64 }, { "cell_type": "code", "collapsed": false, "input": [ "'r_type' in readings[0].attrib" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 67, "text": [ "True" ] } ], "prompt_number": 67 }, { "cell_type": "code", "collapsed": false, "input": [ "kanas = filter(lambda reading: reading.attrib['r_type'] in ['ja_on', 'ja_kun'], readings)\n", "kanas" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 73, "text": [ "[, ]" ] } ], "prompt_number": 73 }, { "cell_type": "code", "collapsed": false, "input": [ "for kana in kanas:\n", " print kana.text" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\u30db\u30f3\n", "\u3082\u3068\n" ] } ], "prompt_number": 75 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "The whole Kanjidic2 file" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import xml.etree.cElementTree as ET\n", "tree = ET.ElementTree(file='kanjidic2.xml')\n", "tree" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 1, "text": [ "" ] } ], "prompt_number": 1 }, { "cell_type": "code", "collapsed": false, "input": [ "root = tree.getroot()\n", "root" ], "language": 
"python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 2, "text": [ "" ] } ], "prompt_number": 2 }, { "cell_type": "code", "collapsed": false, "input": [ "root.findall('character/literal')[:10]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 3, "text": [ "[,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ]" ] } ], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "I *understand* now: `findall` only matches the path it is given, relative to the element it is called on, so the exact branching (`character/literal`) has to be spelled out. `iter` works with a bare tag name because it walks the whole tree depth-first and filters the elements by tag." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Searching for the entry of a specific kanji." ] }, { "cell_type": "code", "collapsed": false, "input": [ "search_kanji = u'\u672c'\n", "literals = root.findall('character/literal')\n", "literals[:10]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 4, "text": [ "[,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ]" ] } ], "prompt_number": 4 }, { "cell_type": "code", "collapsed": false, "input": [ "len(literals)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 5, "text": [ "13108" ] } ], "prompt_number": 5 }, { "cell_type": "code", "collapsed": false, "input": [ "tree.find('character/literal')" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 147, "text": [ "" ] } ], "prompt_number": 147 }, { "cell_type": "code", "collapsed": false, "input": [ "[literal.text for literal in literals].index(u'\u8a71')\n", "\n" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 6, "text": [ "2948" ] } ], "prompt_number": 6 }, { "cell_type": "code", "collapsed": false, "input": [ "print literals[2948].text" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", 
"stream": "stdout", "text": [ "\u8a71\n" ] } ], "prompt_number": 7 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Getting the parent node." ] }, { "cell_type": "code", "collapsed": false, "input": [ "characters = root.findall('character')\n", "characters[:10]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 151, "text": [ "[,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ]" ] } ], "prompt_number": 151 }, { "cell_type": "code", "collapsed": false, "input": [ "print characters[2948][0].text" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\u8a71\n" ] } ], "prompt_number": 153 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Defining helper functions" ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Find a specific kanji in the dictionary" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def find_element_by_kanji(tree, kanji):\n", " root = tree.getroot()\n", " literals = root.findall('character/literal')\n", " index = [literal.text for literal in literals].index(kanji)\n", " return root.findall('character')[index]" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 155 }, { "cell_type": "code", "collapsed": false, "input": [ "kuruma = find_element_by_kanji(tree, u'\u8eca')\n", "kuruma" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 158, "text": [ "" ] } ], "prompt_number": 158 }, { "cell_type": "code", "collapsed": false, "input": [ "print kuruma[0].text" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\u8eca\n" ] } ], "prompt_number": 160 }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Extract meaningful information from a 'character'" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def 
extract_data(element):\n", " \"\"\"returns the kanji, the kana and the meanings from an element\"\"\"\n", " kanji = element.find('literal').text\n", " kana = [elem.text for elem in filter(lambda reading: reading.attrib['r_type'] in ['ja_on', 'ja_kun'], element.findall('reading_meaning/rmgroup/reading'))]\n", " meanings = [elem.text for elem in filter(lambda elem: elem.attrib == {}, element.findall('reading_meaning/rmgroup/meaning'))]\n", " return (kanji, kana, meanings)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 173 }, { "cell_type": "code", "collapsed": false, "input": [ "def disp_data(data):\n", " print data[0]\n", " for item in data[1]:\n", " print item\n", " for item in data[2]:\n", " print item \n", "\n", "data = extract_data(kuruma)\n", "disp_data(data)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\u8eca\n", "\u30b7\u30e3\n", "\u304f\u308b\u307e\n", "car\n" ] } ], "prompt_number": 176 }, { "cell_type": "code", "collapsed": false, "input": [ "disp_data(extract_data(find_element_by_kanji(tree, u'\u8a71')))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\u8a71\n", "\u30ef\n", "\u306f\u306a.\u3059\n", "\u306f\u306a\u3057\n", "tale\n", "talk\n" ] } ], "prompt_number": 178 }, { "cell_type": "code", "collapsed": false, "input": [ "disp_data(extract_data(find_element_by_kanji(tree, u'\u5c16')))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\u5c16\n", "\u30bb\u30f3\n", "\u3068\u304c.\u308b\n", "\u3055\u304d\n", "\u3059\u308b\u3069.\u3044\n", "be pointed\n", "sharp\n", "taper\n", "displeased\n", "angry\n", "edgy\n" ] } ], "prompt_number": 179 }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "JMdict" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Working with the example file" ] }, { "cell_type": 
"raw", "metadata": {}, "source": [ "\n", " 1171270\n", " \n", " \u53f3\u7ffc\n", " ichi1\n", " news1\n", " nf04\n", " \n", " \n", " \u3046\u3088\u304f\n", " ichi1\n", " news1\n", " nf04\n", " \n", " \n", " adj-no;\n", " right-wing\n", " aile droite (oiseau, arm\u00e9e, parti politique, base-ball)\n", " \u043f\u0440\u0430\u0301\u0432\u043e\u0435 \u043a\u0440\u044b\u043b\u043e\u0301\n", " \u043f\u0440\u0430\u0301\u0432\u044b\u0439 \u0444\u043b\u0430\u043d\u0433\n", " die Rechte\n", " rechter Fl\u00fcgel\n", " \n", " \n", " n;\n", " right field (e.g. in sport)\n", " right flank\n", " right wing\n", " {Sport}\n", " rechte Flanke\n", " rechter Fl\u00fcgel\n", " \n", "" ] }, { "cell_type": "code", "collapsed": false, "input": [ "tree = ET.ElementTree(file='JMdict_example.xml')\n", "tree" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 195, "text": [ "" ] } ], "prompt_number": 195 }, { "cell_type": "code", "collapsed": false, "input": [ "root = tree.getroot()\n", "root" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 196, "text": [ "" ] } ], "prompt_number": 196 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looking at the first few lines." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "elems = [elem for elem in tree.iter()][:10]\n", "elems" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 197, "text": [ "[,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ]" ] } ], "prompt_number": 197 }, { "cell_type": "code", "collapsed": false, "input": [ "expression = root.find('k_ele/keb').text\n", "print expression" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\u53f3\u7ffc\n" ] } ], "prompt_number": 198 }, { "cell_type": "code", "collapsed": false, "input": [ "reading = root.find('r_ele/reb').text\n", "print reading" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\u3046\u3088\u304f\n" ] } ], "prompt_number": 200 }, { "cell_type": "code", "collapsed": false, "input": [ "senses = root.findall('sense/gloss')\n", "senses" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 203, "text": [ "[,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ]" ] } ], "prompt_number": 203 }, { "cell_type": "code", "collapsed": false, "input": [ "senses = filter(lambda sense: sense.attrib == {}, senses)\n", "senses" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 208, "text": [ "[,\n", " ,\n", " ,\n", " ]" ] } ], "prompt_number": 208 }, { "cell_type": "code", "collapsed": false, "input": [ "for sense in senses:\n", " print sense.text" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "right-wing\n", "right field (e.g. 
in sport)\n", "right flank\n", "right wing\n" ] } ], "prompt_number": 209 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Working with the whole file" ] }, { "cell_type": "code", "collapsed": false, "input": [ "tree = ET.ElementTree(file='JMdict.xml')\n", "tree" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 3, "text": [ "" ] } ], "prompt_number": 3 }, { "cell_type": "code", "collapsed": false, "input": [ "root = tree.getroot()\n", "root" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 4, "text": [ "" ] } ], "prompt_number": 4 }, { "cell_type": "code", "collapsed": false, "input": [ "word_entries = tree.getroot().findall('entry/k_ele/keb')\n", "words = [entry.text for entry in word_entries]" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 5 }, { "cell_type": "code", "collapsed": false, "input": [ "len(words)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 6, "text": [ "165048" ] } ], "prompt_number": 6 }, { "cell_type": "code", "collapsed": false, "input": [ "for word in words[:50]:\n", " print word" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\u3003\n", "\u4edd\n", "\u3005\n", "\u6f22\u6570\u5b57\u30bc\u30ed\n", "\u25cb\n", "\u3007\n", "\uff21\uff22\uff23\u9806\n", "\uff23\uff24\u30d7\u30ec\u30fc\u30e4\u30fc\n", "\uff23\uff24\u30d7\u30ec\u30a4\u30e4\u30fc\n", "\uff2e\u97ff\n", "\uff2f\u30d0\u30c3\u30af\n", "\uff32\uff33\uff12\uff13\uff12\u30b1\u30fc\u30d6\u30eb\n", "\uff34\u30b7\u30e3\u30c4\n", "\uff34\u30d0\u30c3\u30af\n", "\u3042\u3046\u3093\u306e\u547c\u5438\n", "\u963f\u543d\u306e\u547c\u5438\n", "\u660e\u767d\n", "\u660e\u767d\n", "\u5078\u9591\n", "\u767d\u5730\n", "\u660e\u304b\u3093\n", "\u60aa\u3069\u3044\n", "\u8ad6\u3046\n", "\u99ac\u9154\u6728\n", "\u5f7c\u51e6\n", "\u5f7c\u6240\n", 
"\u3042\u3063\u3068\u8a00\u3046\u9593\u306b\n", "\u3042\u3063\u3068\u3044\u3046\u9593\u306b\n", "\u3042\u3063\u3068\u3086\u3046\u9593\u306b\n", "\u5f7c\u306e\n", "\u3042\u306e\u4eba\n", "\u5f7c\u306e\u4eba\n", "\u3042\u306e\u65b9\n", "\u5f7c\u306e\u65b9\n", "\u6ea2\u308c\u308b\n", "\u963f\u5446\u9640\u7f85\n", "\u7518\u5b50\n", "\u5929\u9b5a\n", "\u96e8\u5b50\n", "\ud867\ude8a\n", "\u5f7c\n", "\u3044\u3044\u52a0\u6e1b\u306b\u3057\u306a\u3055\u3044\n", "\u3044\u3044\u5e74\u3092\u3057\u3066\n", "\u5426\u3005\n", "\u5426\u5426\n", "\u5982\u4f55\u308f\u3057\u3044\n", "\u3044\u304b\u306a\u308b\u5834\u5408\u3067\u3082\n", "\u5982\u4f55\u306b\u3082\n", "\u5e7e\u3064\u3082\n", "\u884c\u3051\u306a\u3044\n" ] } ], "prompt_number": 221 }, { "cell_type": "code", "collapsed": false, "input": [ "words[49][0] in words[34]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 237, "text": [ "False" ] } ], "prompt_number": 237 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Find every expression that contains a specific kanji:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "filtered_words = filter(lambda expression: u'\u5bfa' in expression, words)\n", "for word in filtered_words:\n", " print word" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\u99c6\u3051\u8fbc\u307f\u5bfa\n", "\u99c6\u8fbc\u307f\u5bfa\n", "\u53e4\u793e\u5bfa\n", "\u5c71\u5bfa\n", "\u5bfa\n", "\u5bfa\u9662\n", "\u7985\u5bfa\n", "\u50e7\u5bfa\n", "\u5927\u5bfa\u9662\n", "\u4e2d\u7985\u5bfa\u6e56\n", "\u5c3c\u5bfa\n", "\u4ecf\u5bfa\n", "\u672b\u5bfa\n", "\u53e4\u5bfa\n", "\u5bfa\u793e\n", "\u793e\u5bfa\n", "\u56fd\u5206\u5bfa\n", "\u5bfa\u53c2\u308a\n", "\u5bfa\u5b50\u5c4b\n", "\u5bfa\u5c0f\u5c4b\n", "\u56de\u6559\u5bfa\u9662\n", "\u7e01\u5207\u308a\u5bfa\n", "\u6c0f\u5bfa\n", "\u6a80\u90a3\u5bfa\n", "\u52c5\u9858\u5bfa\n", "\u5bfa\u7537\n", "\u5bfa\u92ad\n", "\u83e9\u63d0\u5bfa\n", 
"\u5bfa\u683c\n", "\u516b\u767e\u516b\u5bfa\n", "\u5bfa\u5185\n", "\u5165\u5bfa\n", "\u6575\u306f\u672c\u80fd\u5bfa\u306b\u3042\u308a\n", "\u6575\u306f\u672c\u80fd\u5bfa\u306b\u5728\u308a\n", "\u5bfa\u53f7\n", "\u5bfa\u57df\n", "\u5b98\u5bfa\n", "\u5927\u899a\u5bfa\u7d71\n", "\u8107\u5bfa\n", "\u5bfa\u4e2d\n", "\u5bfa\u793e\u5949\u884c\n", "\u5bfa\u9810\u3051\n", "\u5bfa\u5165\u308a\n", "\u5357\u90fd\u4e03\u5927\u5bfa\n", "\u4e03\u5927\u5bfa\n", "\u672c\u9858\u5bfa\u6d3e\n", "\u4ecf\u5149\u5bfa\u6d3e\n", "\u8aa0\u7167\u5bfa\u6d3e\n", "\u5c11\u6797\u5bfa\u62f3\u6cd5\n", "\u5bfa\n", "\u304a\u5bfa\n", "\u5fa1\u5bfa\n", "\u304a\u5bfa\u69d8\n", "\u304a\u5bfa\u3055\u307e\n", "\u5fa1\u5bfa\u69d8\n", "\u7d05\u5999\u84ee\u5bfa\n", "\u5bfa\u8acb\n", "\u5bfa\u8acb\u3051\n", "\u5bfa\u8acb\u5236\u5ea6\n", "\u5bfa\u6a80\u5236\u5ea6\n", "\u4e09\u4e95\u5bfa\u6b69\u884c\u866b\n", "\u4e09\u4e95\u5bfa\u82a5\u866b\n", "\u5bfa\u5b50\n", "\u5c11\u6797\u5bfa\u6d41\n", "\u5bfa\u9818\n", "\u672c\u80fd\u5bfa\u306e\u5909\n", "\u5bfa\u52d9\n", "\u76e3\u5bfa\n", "\u90fd\u5bfa\n", "\u526f\u5bfa\n", "\u5bfa\u52d9\u6240\n", "\u79c1\u5bfa\n", "\u304a\u5bfa\u3055\u3093\n", "\u5fa1\u5bfa\u3055\u3093\n", "\u9053\u660e\u5bfa\u7c89\n", "\u5ec3\u5bfa\n", "\u5f53\u5bfa\n", "\u5bfa\u5185\u753a\n", "\u5bae\u5bfa\n", "\u795e\u5bae\u5bfa\n", "\u8af8\u5bfa\n" ] } ], "prompt_number": 240 }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Outline of what could be done from this" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- build an exploratory app that starts from a kanji or a word, lists every compound in the dictionary that contains the given kanji, lets the user select any of those compounds in turn, and displays the data associated with each kanji\n", "- probably the easiest thing to do is to classify words by frequency\n", "- add support for reading Anki decks or, better, integrate with Anki desktop, as it is written in Python" ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }