{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Visualizing polysemy of body parts in dictionaries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data\n", "\n", "For this tutorial we will use data from the project \"[Quantitative Historical Linguistics](http://www.quanthistling.info/)\". The website of the project provides a ZIP package of GrAF/XML files for the printed sources that were digitized within the project:\n", "\n", "http://www.quanthistling.info/data/downloads/xml/data.zip\n", "\n", "The ZIP package contains several files encoded as described in the [ISO standard 24612](http://www.iso.org/iso/catalogue_detail.htm?csnumber=37326) \"Linguistic annotation framework (LAF)\". The QuantHistLing data contains dictionary and wordlist sources. Those were first tokenized into entries, for each entry you will find annotations for at least the head word(s) (\"head\" annotation) and translation(s) (\"translation\" annotation) in the case of dictionaries. We will only use the dictionaries of the \"Witotoan\" compoment in this tutorial. The ZIP package also contains a CSV file \"sources.csv\" with some information for each source, for example the languages as ISO codes, type of source, etc. Be aware that the ZIP package contains a filtered version of the sources: only entries that contain a Spanish word that is part of the Spanish swadesh list are included in the download package.\n", "\n", "For a simple example how to parse one of the source please see here:\n", "\n", "http://graf-python.readthedocs.org/en/latest/Querying%20GrAF%20graphs.html\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Requirements\n", "\n", "The following Python libraries are required to process the GrAF/XML files, calculate the co-ocurrence matrices and visualize the polysemy:\n", "\n", "* NetworkX: http://networkx.lanl.gov/\n", "* graf-python: https://github.com/cidles/graf-python\n", "* SciPy: http://scipy.org/\n", "* requests: http://docs.python-requests.org/en/latest/ (only if you want automated download of the data)\n", "\n", "To visualize the graphs we use the [D3.js](http://d3js.org/) library, but we will load this on-the-fly when we start with the visualization." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import os\n", "import sys\n", "import csv\n", "import codecs\n", "import collections\n", "import glob\n", "import re\n", "\n", "import graf\n", "import scipy.sparse\n", "import networkx" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get the data\n", "\n", "In the first step we download and extract the data. You may change to a local \"tmp\" directory before the download or just download the data to the current working directory. For this you need to install the Python library `requests`. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "Now we open the file \"sources.csv\" and read out all the sources that are dictionaries. We will store the IDs of those sources in `dict_sources`:" ] },
{ "cell_type": "code", "collapsed": false, "input": [
 "sources = csv.reader(open(\"sources.csv\", \"rU\"), delimiter=\"\\t\")\n",
 "dict_sources = list()\n",
 "for source in sources:\n",
 "    if source[0] != \"QLCID\" and source[1] == \"dictionary\":\n",
 "        dict_sources.append(source[0])"
], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 3 },
{ "cell_type": "markdown", "metadata": {}, "source": [
 "## Some notes about co-occurrence matrices\n",
 "\n",
 "We will encode the dictionary data as a co-occurrence matrix with one row per indigenous-language string (tagged with its source) and one column per Spanish body-part term; a cell is set to 1 whenever the two strings appear in the same dictionary entry. Multiplying the transposed matrix with the matrix itself then yields a square Spanish-by-Spanish matrix in which each cell counts how many indigenous strings a pair of Spanish terms shares. This is exactly the information we need to connect Spanish terms that are translated by the same indigenous words."
] },
{ "cell_type": "markdown", "metadata": {}, "source": [
 "## Building a co-occurrence matrix for body parts\n",
 "\n",
 "First we load the list of Spanish body-part terms, one term per line, from the file \"body-part-terms-spanish.txt\". We also download the Spanish stopword list from the NLTK data repository (if it is not present yet) and prepare two helpers: a regular expression that strips stopwords and a function that removes punctuation and symbol characters. Both are applied to the Spanish strings before they are matched against the body-part terms."
] },
{ "cell_type": "code", "collapsed": false, "input": [
 "with codecs.open(\"body-part-terms-spanish.txt\", \"r\", \"utf-8\") as f:\n",
 "    bodyparts = f.read().splitlines()"
], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 4 },
{ "cell_type": "code", "collapsed": false, "input": [
 "if not os.path.exists(os.path.join(\"stopwords\", \"spanish\")):\n",
 "    import requests\n",
 "    import zipfile\n",
 "    r = requests.get(\n",
 "        \"https://github.com/nltk/nltk_data/blob/gh-pages/packages/corpora/stopwords.zip?raw=true\")\n",
 "    with open(\"stopwords.zip\", \"wb\") as f:\n",
 "        f.write(r.content)\n",
 "\n",
 "    z = zipfile.ZipFile(\"stopwords.zip\")\n",
 "    z.extractall()"
], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 5 },
{ "cell_type": "code", "collapsed": false, "input": [
 "stopwords = list()\n",
 "with codecs.open(os.path.join(\"stopwords\", \"spanish\"), \"r\", \"utf-8\") as f:\n",
 "    for line in f:\n",
 "        stopwords.append(line.strip())\n",
 "# Match any stopword as a whole word so that it can be stripped from the Spanish strings below.\n",
 "re_stopwords_str = u\"\\\\b(?:{0})\\\\b\\\\s*\".format(u\"|\".join(stopwords))\n",
 "re_stopwords = re.compile(re_stopwords_str)"
], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 6 },
{ "cell_type": "code", "collapsed": false, "input": [
 "import unicodedata\n",
 "# Translation table that maps all Unicode punctuation (P*) and symbol (S*) characters to None.\n",
 "tbl = dict.fromkeys(i for i in xrange(sys.maxunicode)\n",
 "                    if unicodedata.category(unichr(i)).startswith('P') or unicodedata.category(unichr(i)).startswith('S'))\n",
 "def remove_punctuation(text):\n",
 "    return text.translate(tbl)"
], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 7 },
{ "cell_type": "code", "collapsed": false, "input": [
 "parser = graf.GraphParser()\n",
 "\n",
 "spa_to_indi = collections.defaultdict(list)\n",
 "#spa_to_spa = []\n",
 "indi = set()\n",
 "spa = set()\n",
 "\n",
 "for d in dict_sources:\n",
 "    for f in glob.glob(os.path.join(d, \"dict-*-dictinterpretation.xml\")):\n",
 "        #print(\"Parsing {0}...\".format(f))\n",
 "        graf_graph = parser.parse(f)\n",
 "\n",
 "        for (node_id, node) in graf_graph.nodes.items():\n",
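 "            # Each node whose id ends in \"..entry\" represents one dictionary entry. For every\n",
 "            # entry we collect the Spanish head/translation strings that denote body parts\n",
 "            # (entry_spa) and the strings in the indigenous language (entry_indi), and record\n",
 "            # below which pairs co-occur in the same entry.\n",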
 "            if node_id.endswith(\"..entry\"):\n",
 "                entry_spa = set()\n",
 "                #spa_to_spa_tmp = list()\n",
 "                entry_indi = set()\n",
 "                for e in node.out_edges:\n",
 "                    if e.annotations.get_first().label == \"head\" or e.annotations.get_first().label == \"translation\":\n",
 "                        # get lang\n",
 "                        for n in e.to_node.links[0][0].nodes:\n",
 "                            if n.annotations.get_first().label == \"iso-639-3\":\n",
 "                                if n.annotations.get_first().features.get_value(\"substring\") == \"spa\":\n",
 "                                    substr = e.to_node.annotations.get_first().features.get_value(\"substring\")\n",
 "                                    substr = remove_punctuation(substr)\n",
 "                                    substr = re_stopwords.sub(\"\", substr)\n",
 "                                    if substr in bodyparts:\n",
 "                                        entry_spa.add(substr)\n",
 "                                    break\n",
 "                                else:\n",
 "                                    substr = e.to_node.annotations.get_first().features.get_value(\"substring\")\n",
 "                                    entry_indi.add(u\"{0}|{1}\".format(substr, d))\n",
 "                                    break\n",
 "\n",
 "                if len(entry_spa) > 0:\n",
 "                    for head in entry_spa:\n",
 "                        for translation in entry_indi:\n",
 "                            spa_to_indi[head].append(translation)\n",
 "                            spa.add(head)\n",
 "                            indi.add(translation)\n"
], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 8 },
{ "cell_type": "code", "collapsed": false, "input": [
 "# Fix an ordering of the sets so that they can serve as row and column indices of the matrix.\n",
 "spa = list(spa)\n",
 "indi = list(indi)\n",
 "all_dicts_cooc = scipy.sparse.lil_matrix((len(indi), len(spa)))\n",
 "#all_dicts_cooc = numpy.zeros((len(indi), len(spa)))"
], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 9 },
{ "cell_type": "code", "collapsed": false, "input": [
 "for i, head in enumerate(spa):\n",
 "    for trans in spa_to_indi[head]:\n",
 "        all_dicts_cooc[indi.index(trans), i] = 1"
], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 10 },
{ "cell_type": "code", "collapsed": false, "input": [
 "all_dicts_cooc = scipy.sparse.csc_matrix(all_dicts_cooc)"
], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 11 },
{ "cell_type": "code", "collapsed": false, "input": [
 "# Entry (i, j) counts how many indigenous strings the Spanish terms i and j share.\n",
 "spa_similarity = all_dicts_cooc.T * all_dicts_cooc"
], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 12 },
{ "cell_type": "markdown", "metadata": {}, "source": [
 "## Building a graph\n",
 "\n",
 "Finally we convert the similarity matrix into a NetworkX graph, remove all terms that do not share a translation with any other term, relabel the nodes with the Spanish terms and write the graph to a JSON file that the D3.js visualization can load."
] },
{ "cell_type": "code", "collapsed": false, "input": [
 "g = networkx.Graph(spa_similarity)\n",
 "# The diagonal of spa_similarity adds a self-loop to every node; a degree of 2 therefore\n",
 "# means that a term is connected only to itself, so we drop those nodes.\n",
 "solitary = [n for n, d in g.degree_iter() if d == 2]\n",
 "g.remove_nodes_from(solitary)"
], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 13 },
{ "cell_type": "code", "collapsed": false, "input": [
 "labels = dict(zip(range(len(spa)), spa))\n",
 "#labels = { k: v for k,v in enumerate(spa) if k in g }\n",
 "g2 = networkx.relabel_nodes(g, labels)"
], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 14 },
{ "cell_type": "code", "collapsed": false, "input": [
 "from networkx.readwrite import json_graph\n",
 "import json\n",
 "bodyparts_json = json_graph.node_link_data(g2)\n",
 "with codecs.open(\"bodyparts_graph.json\", \"w\", \"utf-8\") as f:\n",
 "    json.dump(bodyparts_json, f)"
], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 15 },
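{ "cell_type": "markdown", "metadata": {}, "source": [
 "Before we hand the JSON file over to the D3.js visualization it is useful to get a quick overview of the graph we just serialized. The following cell is an optional sketch: it prints the size of the graph and the ten Spanish terms with the highest degree, i.e. the terms that share indigenous translations with the largest number of other terms. It assumes the NetworkX 1.x API, in which `degree()` returns a plain dict (the notebook already relies on `degree_iter()`, which only exists in 1.x); note that the self-loops coming from the diagonal of `spa_similarity` also contribute to the degrees."
] },
{ "cell_type": "code", "collapsed": false, "input": [
 "# Optional overview of the co-occurrence graph (assumes NetworkX 1.x, where degree() returns a dict).\n",
 "print(\"nodes: {0}\".format(g2.number_of_nodes()))\n",
 "print(\"edges: {0}\".format(g2.number_of_edges()))\n",
 "# Spanish terms with the highest degree (self-loops included) are the strongest polysemy candidates.\n",
 "for term, deg in sorted(g2.degree().items(), key=lambda x: x[1], reverse=True)[:10]:\n",
 "    print(u\"{0}\\t{1}\".format(term, deg))"
], "language": "python", "metadata": {}, "outputs": [] }
], "metadata": {} } ] }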