{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Visualizing polysemy in dictionaries" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Data\n", "\n", "For this tutorial we will use data from the project \"[Quantitative Historical Linguistics](http://www.quanthistling.info/)\". The website of the project provides a ZIP package of GrAF/XML files for the printed sources that were digitized within the project:\n", "\n", "http://www.quanthistling.info/data/downloads/xml/data.zip\n", "\n", "The ZIP package contains several files encoded as described in the [ISO standard 24612](http://www.iso.org/iso/catalogue_detail.htm?csnumber=37326) \"Linguistic annotation framework (LAF)\". The QuantHistLing data contains dictionary and wordlist sources. Those were first tokenized into entries; for each entry you will find annotations for at least the head word(s) (\"head\" annotation) and translation(s) (\"translation\" annotation) in the case of dictionaries. We will only use the dictionaries of the \"Witotoan\" component in this tutorial. The ZIP package also contains a CSV file \"sources.csv\" with some information for each source, for example the languages as ISO codes, the type of source, etc. Be aware that the ZIP package contains a filtered version of the sources: only entries that contain a Spanish word that is part of the Spanish Swadesh list are included in the download package.\n", "\n", "For a simple example of how to parse one of the sources, please see here:\n", "\n", "http://graf-python.readthedocs.org/en/latest/Querying%20GrAF%20graphs.html\n" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "import os\n", "import sys\n", "import csv\n", "import codecs\n", "import collections\n", "import glob\n", "\n", "import graf\n", "import scipy.sparse\n", "import networkx\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 19 },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Getting the data\n", "\n", "In the first step we download and extract the data. You may change to a local \"tmp\" directory before the download or just download the data to the current working directory. For this you need to install the Python library `requests`. You may also download and extract the data manually; the data is only downloaded for you if the file `sources.csv` is not found." ] },
{ "cell_type": "code", "collapsed": false, "input": [ "# adjust this path to your local working directory\n", "os.chdir(\"/Users/pbouda/Projects/git-github/notebooks/polysemy\")\n", "if not os.path.exists(\"sources.csv\"):\n", "    import requests\n", "    import zipfile\n", "    r = requests.get(\n", "        \"http://www.quanthistling.info/data/downloads/xml/data.zip\")\n", "    with open(\"data.zip\", \"wb\") as f:\n", "        f.write(r.content)\n", "\n", "    z = zipfile.ZipFile(\"data.zip\")\n", "    z.extractall()" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 10 },
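{ "cell_type": "markdown", "metadata": {}, "source": [ "Before we filter the sources we can peek at the header row of \"sources.csv\" to see which columns are available. This is a minimal sketch; the exact set of columns depends on the downloaded file:" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "# print the header row of sources.csv as a quick sanity check\n", "with open(\"sources.csv\", \"rU\") as f:\n", "    header = next(csv.reader(f, delimiter=\"\\t\"))\n", "print(header)" ], "language": "python", "metadata": {}, "outputs": [] },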
{ "cell_type": "markdown", "metadata": {}, "source": [ "Now we open the file \"sources.csv\" and read out all the sources that are part of the component \"Witotoan\" and that are dictionaries. We will store a list of those sources in `dict_sources`:" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "sources = csv.reader(open(\"sources.csv\", \"rU\"), delimiter=\"\\t\")\n", "dict_sources = list()\n", "for source in sources:\n", "    # source[0] is the QLCID of the source, source[1] its type; skip the\n", "    # header row and keep only dictionaries. To restrict the selection to\n", "    # the \"Witotoan\" component, additionally check the component column\n", "    # of sources.csv here.\n", "    if source[0] != \"QLCID\" and source[1] == \"dictionary\":\n", "        dict_sources.append(source[0])" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 3 },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## GrAF to Matrix\n", "\n", "Next we parse the GrAF files and collect the annotations. First we define a helper function that removes punctuation and symbol characters from a string, and we download a Spanish stopword list from the NLTK data repository to filter out function words. We then traverse each GrAF graph by querying for all entries. For each entry we look for connected nodes that carry a \"head\" or \"translation\" annotation. Spanish annotations are cleaned, split into words and stored in the set `entry_spa`; all non-Spanish annotations are stored in `others`, tagged with the ID of their source. In the end each Spanish word is mapped to all non-Spanish translations of its entries in `spa_to_indi`, and Spanish words that co-occur within a single annotation are collected in `spa_to_spa`:" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "import unicodedata\n", "# translation table that maps all Unicode punctuation (category P*) and\n", "# symbol (category S*) characters to None, i.e. deletes them\n", "tbl = dict.fromkeys(i for i in xrange(sys.maxunicode)\n", "                    if unicodedata.category(unichr(i)).startswith('P')\n", "                    or unicodedata.category(unichr(i)).startswith('S'))\n", "\n", "def remove_punctuation(text):\n", "    return text.translate(tbl)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 11 },
{ "cell_type": "code", "collapsed": false, "input": [ "# download and extract the NLTK stopword lists if they are not present\n", "if not os.path.exists(os.path.join(\"stopwords\", \"spanish\")):\n", "    import requests\n", "    import zipfile\n", "    r = requests.get(\n", "        \"https://github.com/nltk/nltk_data/blob/gh-pages/packages/corpora/stopwords.zip?raw=true\")\n", "    with open(\"stopwords.zip\", \"wb\") as f:\n", "        f.write(r.content)\n", "\n", "    z = zipfile.ZipFile(\"stopwords.zip\")\n", "    z.extractall()" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 12 },
{ "cell_type": "code", "collapsed": false, "input": [ "# read the Spanish stopword list, one word per line\n", "stopwords = list()\n", "with codecs.open(os.path.join(\"stopwords\", \"spanish\"), \"r\", \"utf-8\") as f:\n", "    for line in f:\n", "        stopwords.append(line.strip())" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 13 },
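{ "cell_type": "markdown", "metadata": {}, "source": [ "A quick check of the two preprocessing steps on a made-up entry string (the example string is hypothetical; `remove_punctuation` and `stopwords` were defined above):" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "# punctuation is stripped first, then the stopword \"la\" is filtered out\n", "sample = u\"la casa, grande!\"\n", "cleaned = remove_punctuation(sample)\n", "print([w for w in cleaned.split(\" \") if w not in stopwords])" ], "language": "python", "metadata": {}, "outputs": [] },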
{ "cell_type": "code", "collapsed": false, "input": [ "parser = graf.GraphParser()\n", "\n", "# Spanish word -> set of \"translation|source\" strings\n", "spa_to_indi = collections.defaultdict(set)\n", "indi = set()     # all non-Spanish translations\n", "spa = set()      # all Spanish words\n", "spa_to_spa = []  # Spanish words that co-occur within one annotation\n", "\n", "for d in dict_sources:\n", "    for f in glob.glob(os.path.join(d, \"dict-*-dictinterpretation.xml\")):\n", "        #print(\"Parsing {0}...\".format(f))\n", "        graf_graph = parser.parse(f)\n", "\n", "        for (node_id, node) in graf_graph.nodes.items():\n", "            if node_id.endswith(\"..entry\"):\n", "                entry_spa = set()\n", "                spa_to_spa_tmp = list()\n", "                others = set()\n", "                for e in node.out_edges:\n", "                    if e.annotations.get_first().label == \"head\" or e.annotations.get_first().label == \"translation\":\n", "                        # find the language of the annotated substring\n", "                        for n in e.to_node.links[0][0].nodes:\n", "                            if n.annotations.get_first().label == \"iso-639-3\":\n", "                                if n.annotations.get_first().features.get_value(\"substring\") == \"spa\":\n", "                                    # Spanish: clean, split into words, filter stopwords\n", "                                    substr = remove_punctuation(e.to_node.annotations.get_first().features.get_value(\"substring\"))\n", "                                    collo = set()\n", "                                    for w in substr.split(\" \"):\n", "                                        if w not in stopwords:\n", "                                            entry_spa.add(w)\n", "                                            collo.add(w)\n", "                                    if len(collo) > 1:\n", "                                        spa_to_spa_tmp.append(list(collo))\n", "                                    break\n", "                                else:\n", "                                    # non-Spanish: store the string together with its source ID\n", "                                    trans = u\"{0}|{1}\".format(e.to_node.annotations.get_first().features.get_value(\"substring\"), d)\n", "                                    others.add(trans)\n", "                                    break\n", "\n", "                # only keep entries that have both Spanish and non-Spanish annotations\n", "                if len(entry_spa) > 0 and len(others) > 0:\n", "                    spa_to_spa.extend(spa_to_spa_tmp)\n", "                    for head in entry_spa:\n", "                        for translation in others:\n", "                            spa_to_indi[head].add(translation)\n", "                            spa.add(head)\n", "                            indi.add(translation)\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 14 },
{ "cell_type": "code", "collapsed": false, "input": [ "import gc\n", "gc.collect()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 15, "text": [ "30862" ] } ], "prompt_number": 15 },
{ "cell_type": "code", "collapsed": false, "input": [ "# sparse co-occurrence matrix: rows = translations, columns = Spanish words\n", "spa = list(spa)\n", "indi = list(indi)\n", "indi_indices = { w: i for i, w in enumerate(indi) }\n", "spa_indices = { w: i for i, w in enumerate(spa) }\n", "all_dicts_cooc = scipy.sparse.lil_matrix((len(indi), len(spa)))" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 16 },
{ "cell_type": "code", "collapsed": false, "input": [ "len(spa)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 17, "text": [ "45515" ] } ], "prompt_number": 17 },
{ "cell_type": "code", "collapsed": false, "input": [ "# set a 1 for every translation a Spanish word has\n", "for i, head in enumerate(spa):\n", "    for trans in spa_to_indi[head]:\n", "        all_dicts_cooc[indi_indices[trans], i] = 1" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 18 },
{ "cell_type": "code", "collapsed": false, "input": [ "# count how often two Spanish words appear together within one annotation\n", "all_dicts_spa_collo = scipy.sparse.lil_matrix((len(spa), len(spa)))\n", "for p in spa_to_spa:\n", "    for i in range(len(p)-1):\n", "        for w in p[i+1:]:\n", "            all_dicts_spa_collo[spa_indices[p[i]], spa_indices[w]] += 1" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 20 },
{ "cell_type": "code", "collapsed": false, "input": [ "spa_to_spa[19764]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 21, "text": [ "[u'quedar', u'carbonizado']" ] } ], "prompt_number": 21 },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The LIL format is convenient for building the matrices incrementally, but inefficient for matrix arithmetic, so we convert both matrices to the CSC format before multiplying:" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "all_dicts_cooc = scipy.sparse.csc_matrix(all_dicts_cooc)\n", "all_dicts_spa_collo = scipy.sparse.csc_matrix(all_dicts_spa_collo)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 22 },
{ "cell_type": "code", "collapsed": false, "input": [ "# entry (i, j) counts the translations that Spanish words i and j share\n", "spa_similarity = all_dicts_cooc.T * all_dicts_cooc" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 23 },
{ "cell_type": "code", "collapsed": false, "input": [ "# subtract within-annotation co-occurrences, so that collocations such as\n", "# \"quedar carbonizado\" are not counted as shared translations\n", "spa_similarity_without_collo = spa_similarity - all_dicts_spa_collo" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 24 },
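{ "cell_type": "markdown", "metadata": {}, "source": [ "Why does `all_dicts_cooc.T * all_dicts_cooc` measure similarity? Entry (i, j) of the product sums over all rows, i.e. over all non-Spanish translations, in which both column i and column j contain a 1; that is exactly the number of translations the two Spanish words share. A tiny worked example with a hypothetical 3x2 co-occurrence matrix:" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "# hypothetical toy example: 3 translations (rows) x 2 Spanish words (columns)\n", "C = scipy.sparse.csc_matrix([[1, 1],\n", "                             [1, 0],\n", "                             [0, 1]])\n", "# the diagonal counts translations per word, the off-diagonal\n", "# entries count shared translations\n", "print((C.T * C).todense())" ], "language": "python", "metadata": {}, "outputs": [] },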
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Building a graph\n", "\n", "We now turn the similarity matrix into a networkx graph and relabel the nodes, which are matrix indices at first, with the Spanish words. Here we use the raw `spa_similarity`; `spa_similarity_without_collo` could be used instead to ignore collocations:" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "g = networkx.Graph(spa_similarity)\n", "#solitary = [ n for n, d in g.degree_iter() if d==2 ]\n", "#g.remove_nodes_from(solitary)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 25 },
{ "cell_type": "code", "collapsed": false, "input": [ "# map the matrix indices back to the Spanish words\n", "labels = dict(zip(range(len(spa)), spa))\n", "g2 = networkx.relabel_nodes(g, labels)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 26 },
{ "cell_type": "code", "collapsed": false, "input": [ "# extract the neighbourhood of one word, keeping only edges whose\n", "# weight is at least the cutoff\n", "word = u\"casa\"\n", "cutoff = 50\n", "word_nodes = g2[word]\n", "word_graph = networkx.Graph()\n", "word_graph.add_node(word)\n", "for n in word_nodes:\n", "    if word_nodes[n]['weight'] >= cutoff:\n", "        word_graph.add_node(n)\n", "        word_graph.add_edge(word, n, weight=word_nodes[n]['weight'])" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 27 },
{ "cell_type": "code", "collapsed": false, "input": [ "len(word_graph)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 28, "text": [ "18" ] } ], "prompt_number": 28 },
{ "cell_type": "code", "collapsed": false, "input": [ "from networkx.readwrite import json_graph\n", "import json\n", "word_json = json_graph.node_link_data(word_graph)\n", "#json.dump(word_json, codecs.open(\"word_graph.json\", \"w\", \"utf-8\"))" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 29 },
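{ "cell_type": "markdown", "metadata": {}, "source": [ "Before rendering the graph with d3 we can inspect it textually. This is a minimal sketch that lists the neighbours of the selected word in `word_graph`, sorted by their co-occurrence weight:" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "# neighbours of the word, strongest co-occurrence first\n", "neighbours = sorted(word_graph[word].items(),\n", "                    key=lambda item: -item[1][\"weight\"])\n", "for n, attrs in neighbours:\n", "    print(u\"{0}\\t{1}\".format(n, attrs[\"weight\"]))" ], "language": "python", "metadata": {}, "outputs": [] },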
{ "cell_type": "code", "collapsed": false, "input": [ "from IPython.display import HTML, Javascript\n", "from IPython.core.display import display\n", "\n", "# minimal HTML scaffold for the visualization: styles for links and\n", "# labels, a container div and the d3.js library\n", "html = \"\"\"\n", "<style>\n", ".link { stroke: #ccc; }\n", ".node text { pointer-events: none; font: 10px sans-serif; }\n", "</style>\n", "<div id=\"nav\"></div>\n", "<script src=\"http://d3js.org/d3.v3.min.js\"></script>\n", "\"\"\"\n", "\n", "javascript = \"\"\"\n", "var color = d3.scale.category20();\n", "\n", "var width = 500,\n", "    height = 400;\n", "\n", "var svg = d3.select(\"#nav\").append(\"svg\")\n", "    .attr(\"width\", width)\n", "    .attr(\"height\", height);\n", "\n", "var force = d3.layout.force()\n", "    .gravity(.05)\n", "    .distance(100)\n", "    .charge(-250)\n", "    .size([width, height]);\n", "\n", "var json = \"\"\" + json.dumps(word_json) + \"\"\";\n", "\n", "//d3.json(\"word_graph.json\", function(error, json) {\n", "  force\n", "      .nodes(json.nodes)\n", "      .links(json.links)\n", "      .start();\n", "\n", "  var link = svg.selectAll(\"line.link\")\n", "      .data(json.links)\n", "      .enter().append(\"line\")\n", "      .attr(\"class\", \"link\")\n", "      .style(\"stroke-width\", function(d) { return d.weight/\"\"\" + str(cutoff) + \"\"\"; });\n", "\n", "  var node = svg.selectAll(\"circle.node\")\n", "      .data(json.nodes)\n", "      .enter().append(\"g\")\n", "      .attr(\"class\", \"node\")\n", "      .call(force.drag);\n", "\n", "  node.append(\"circle\")\n", "      .attr(\"r\", 5);\n", "      //.style(\"fill\", function(d) { return color(d.group); })\n", "\n", "  node.append(\"text\")\n", "      .attr(\"dx\", 12)\n", "      .attr(\"dy\", \".35em\")\n", "      .text(function(d) { return d.id });\n", "\n", "  force.on(\"tick\", function() {\n", "    link.attr(\"x1\", function(d) { return d.source.x; })\n", "        .attr(\"y1\", function(d) { return d.source.y; })\n", "        .attr(\"x2\", function(d) { return d.target.x; })\n", "        .attr(\"y2\", function(d) { return d.target.y; });\n", "\n", "    node.attr(\"transform\", function(d) { return \"translate(\" + d.x + \",\" + d.y + \")\"; });\n", "  });\n", "//});\n", "\"\"\"\n", "display(HTML(html))\n", "display(Javascript(javascript))" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "\n", "<style>\n", ".link { stroke: #ccc; }\n", ".node text { pointer-events: none; font: 10px sans-serif; }\n", "</style>\n", "<div id=\"nav\"></div>\n", "<script src=\"http://d3js.org/d3.v3.min.js\"></script>\n" ], "metadata": {}, "output_type": "display_data", "text": [ "" ] }, { "javascript": [ "\n", "var color = d3.scale.category20();\n", "\n", "var width = 500,\n", "    height = 400;\n", "\n", "var svg = d3.select(\"#nav\").append(\"svg\")\n", "    .attr(\"width\", width)\n", "    .attr(\"height\", height);\n", "\n", "var force = d3.layout.force()\n", "    .gravity(.05)\n", "    .distance(100)\n", "    .charge(-250)\n", "    .size([width, height]);\n", "\n", "var json = {\"directed\": false, \"graph\": [], \"nodes\": [{\"id\": \"\"}, {\"id\": \"hacer\"}, {\"id\": \"a\\u0301rbol\"}, {\"id\": \"casa\"}, {\"id\": \"xo\"}, {\"id\": \"esta\\u0301\"}, {\"id\": \"especie\"}, {\"id\": \"construir\"}, {\"id\": \"tsi\"}, {\"id\": \"xobo\"}, {\"id\": \"techo\"}, {\"id\": \"hojas\"}, {\"id\": \"grande\"}, {\"id\": \"e\\u0301l\"}, {\"id\": \"qui\"}, {\"id\": \"ca\"}, {\"id\": \"ja\"}, {\"id\": \"jahue\\u0308\"}], \"links\": [{\"source\": 0, \"target\": 3, \"weight\": 264.0}, {\"source\": 1, \"target\": 3, \"weight\": 79.0}, {\"source\": 2, \"target\": 3, \"weight\": 75.0}, {\"source\": 3, \"target\": 3, \"weight\": 1359.0}, {\"source\": 3, \"target\": 4, \"weight\": 100.0}, {\"source\": 3, \"target\": 5, \"weight\": 58.0}, {\"source\": 3, \"target\": 6, \"weight\": 65.0}, {\"source\": 3, \"target\": 7, \"weight\": 60.0}, {\"source\": 3, \"target\": 8, \"weight\": 185.0}, {\"source\": 3, \"target\": 9, \"weight\": 197.0}, {\"source\": 3, \"target\": 10, \"weight\": 101.0}, {\"source\": 3, \"target\": 11, \"weight\": 77.0}, {\"source\": 3, \"target\": 12, \"weight\": 51.0}, {\"source\": 3, \"target\": 13, \"weight\": 57.0}, {\"source\": 3, \"target\": 14, \"weight\": 55.0}, {\"source\": 3, \"target\": 15, \"weight\": 101.0}, {\"source\": 3, \"target\": 16, \"weight\": 51.0}, {\"source\": 3, \"target\": 17, \"weight\": 64.0}], \"multigraph\": false};\n", "\n", "//d3.json(\"word_graph.json\", function(error, json) {\n", "  force\n", "      .nodes(json.nodes)\n", "      .links(json.links)\n", "      .start();\n", "\n", "  var link = svg.selectAll(\"line.link\")\n", "      .data(json.links)\n", "      .enter().append(\"line\")\n", "      .attr(\"class\", \"link\")\n", "      .style(\"stroke-width\", function(d) { return d.weight/50; });\n", "\n", "  var node = svg.selectAll(\"circle.node\")\n", "      .data(json.nodes)\n", "      .enter().append(\"g\")\n", "      .attr(\"class\", \"node\")\n", "      .call(force.drag);\n", "\n", "  node.append(\"circle\")\n", "      .attr(\"r\", 5);\n", "      //.style(\"fill\", function(d) { return color(d.group); })\n", "\n", "  node.append(\"text\")\n", "      .attr(\"dx\", 12)\n", "      .attr(\"dy\", \".35em\")\n", "      .text(function(d) { return d.id });\n", "\n", "  force.on(\"tick\", function() {\n", "    link.attr(\"x1\", function(d) { return d.source.x; })\n", "        .attr(\"y1\", function(d) { return d.source.y; })\n", "        .attr(\"x2\", function(d) { return d.target.x; })\n", "        .attr(\"y2\", function(d) { return d.target.y; });\n", "\n", "    node.attr(\"transform\", function(d) { return \"translate(\" + d.x + \",\" + d.y + \")\"; });\n", "  });\n", "//});\n" ], "metadata": {}, "output_type": "display_data", "text": [ "" ] } ], "prompt_number": 32 },
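{ "cell_type": "markdown", "metadata": {}, "source": [ "The edge weights tell us how many translations two Spanish words share, but not which ones. As a last step we can intersect the translation sets collected earlier; a minimal sketch using \"xobo\", one of the nodes connected to \"casa\" in the graph above:" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "# translations (each tagged with the ID of its source dictionary)\n", "# shared by \"casa\" and \"xobo\"\n", "shared = spa_to_indi[u\"casa\"] & spa_to_indi[u\"xobo\"]\n", "for t in sorted(shared):\n", "    print(t)" ], "language": "python", "metadata": {}, "outputs": [] },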
{ "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }