{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Using LX-DepParser to parse sentences and display dependency tree graphs\n", "This example notebook illustrates how to use the LX-DepParser web service to parse\n", "sentences and how to visualize dependency tree graphs in a notebook.\n", "\n", "**Before you run this example**, replace `access_key_goes_here` with your webservice access key, below:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "LXDEPPARSER_WS_API_KEY = 'access_key_goes_here'\n", "LXDEPPARSER_WS_API_URL = 'https://portulanclarin.net/workbench/lx-depparser/api/'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Importing required Python modules\n", "The next cell installs the `requests` and `pydependencygrapher` packages,\n", "if they are not already installed, and makes them available in this notebook." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "try:\n", "    import requests\n", "except ImportError:\n", "    !pip3 install requests\n", "    import requests\n", "try:\n", "    import pydependencygrapher\n", "except ImportError:\n", "    # see https://github.com/pygobject/pycairo/issues/39#issuecomment-391830334\n", "    !apt-get install libcairo2-dev libjpeg-dev libgif-dev\n", "    !pip3 install pydependencygrapher\n", "    import pydependencygrapher\n", "\n", "import base64\n", "import IPython" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Wrapping the complexities of the JSON-RPC API in a simple, easy to use function\n", "\n", "The `WSException` class defined below will be used later to identify errors\n", "from the webservice." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "class WSException(Exception):\n", "    'Webservice Exception'\n", "    def __init__(self, errordata):\n", "        \"errordata is a dict returned by the webservice with details about the error\"\n", "        assert isinstance(errordata, dict)\n", "        self.message = errordata[\"message\"]\n", "        # see https://json-rpc.readthedocs.io/en/latest/exceptions.html for more info\n", "        # about JSON-RPC error codes\n", "        if -32099 <= errordata[\"code\"] <= -32000: # Server Error\n", "            if errordata[\"data\"][\"type\"] == \"WebServiceException\":\n", "                self.message += f\": {errordata['data']['message']}\"\n", "            else:\n", "                self.message += f\": {errordata['data']!r}\"\n", "        # pass the final message (not self) to the base Exception\n", "        super().__init__(self.message)\n", "    def __str__(self):\n", "        return self.message" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next function invokes the LX-DepParser webservice through its public JSON-RPC API." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "def parse(text, tagset, format):\n", "    '''\n", "    Arguments\n", "        text: a string of Portuguese text to be processed, with a maximum\n", "            of 2000 characters\n", "        tagset: either 'CINTIL' or 'UD' (universal dependencies)\n", "        format: either 'CONLL' or 'JSON'\n", "\n", "    Returns the output (a string for 'CONLL', a list for 'JSON') according\n", "    to the specification in https://portulanclarin.net/workbench/lx-depparser/\n", "\n", "    Raises a WSException if an error occurs.\n", "    '''\n", "\n", "    request_data = {\n", "        'method': 'parse',\n", "        'jsonrpc': '2.0',\n", "        'id': 0,\n", "        'params': {\n", "            'text': text,\n", "            'tagset': tagset,\n", "            'format': format,\n", "            'key': LXDEPPARSER_WS_API_KEY,\n", "        },\n", "    }\n", "    request = requests.post(LXDEPPARSER_WS_API_URL, json=request_data)\n", "    response_data = request.json()\n", "    if \"error\" in response_data:\n", "        raise WSException(response_data[\"error\"])\n", "    else:\n", "        return 
response_data[\"result\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us test the function we just defined:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "#id\tform\tlemma\tcpos\tpos\tfeat\thead\tdeprel\tphead\tpdeprel\n", "1\tEsta\t-\tDEM\tDEM\tfs\t2\tSP\t2\tSP\n", "2\tfrase\tFRASE\tCN\tCN\tfs\t3\tSJ\t3\tSJ\n", "3\tserve\tSERVIR\tV\tV\tpi-3s\t0\tROOT\t0\tROOT\n", "4\tpara\t-\tPREP\tPREP\t-\t3\tC\t3\tC\n", "5\ttestar\tTESTAR\tV\tV\tINF-nInf\t3\tCOORD\t3\tCOORD\n", "6\to\t-\tDA\tDA\tms\t7\tSP\t7\tSP\n", "7\tfuncionamento\tFUNCIONAMENTO\tCN\tCN\tms\t5\tDO\t5\tDO\n", "8\tde_\t-\tPREP\tPREP\t-\t7\tOBL\t7\tOBL\n", "9\to\t-\tDA\tDA\tms\t10\tSP\t10\tSP\n", "10\tparser\tPARSER\tCN\tCN\tms\t8\tC\t8\tC\n", "11\tde\t-\tPREP\tPREP\t-\t10\tM\t10\tM\n", "12\tdependências\tDEPENDÊNCIA\tCN\tCN\tfp\t11\tC\t11\tC\n", "13\t.\t-\tPNT\tPNT\t-\t3\tPUNCT\t3\tPUNCT\n", "\n", "\n", "#id\tform\tlemma\tcpos\tpos\tfeat\thead\tdeprel\tphead\tpdeprel\n", "1\tEsta\t-\tDEM\tDEM\tfs\t3\tSP\t3\tSP\n", "2\toutra\tOUTRO\tADJ\tADJ\tfs\t3\tSP\t3\tSP\n", "3\tfrase\tFRASE\tCN\tCN\tfs\t4\tSJ\t4\tSJ\n", "4\tfaz\tFAZER\tV\tV\tpi-3s\t0\tROOT\t0\tROOT\n", "5\to\t-\tLDEM1\tLDEM1\t-\t4\tDO\t4\tDO\n", "6\tmesmo\t-\tLDEM2\tLDEM2\t-\t4\tDO\t4\tDO\n", "7\t.\t-\tPNT\tPNT\t-\t4\tPUNCT\t4\tPUNCT\n", "\n", "\n" ] } ], "source": [ "text = '''Esta frase serve para testar o funcionamento do parser de dependências. 
Esta outra\n", "frase faz o mesmo.'''\n", "# the CONLL annotation format is a popular format for annotating part of speech\n", "# and dependency tree graphs\n", "result = parse(text, tagset=\"CINTIL\", format=\"CONLL\")\n", "print(result)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Displaying dependency tree graphs from parsed text in CONLL format\n", "\n", "To view dependency tree graphs for the parsed sentences, we first split the CONLL output on empty lines to get one list of lines per sentence (each line carries the information for one token)." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "def group_sentence_conll_lines(conll_lines):\n", "    \"\"\"Groups CONLL-encoded lines (one line encodes one token) by sentence.\n", "\n", "    Takes a sequence of CONLL lines and returns a list of lists,\n", "    each one containing the CONLL lines of one sentence.\n", "    \"\"\"\n", "    parsed_sentences = []\n", "    current_sentence = []\n", "    for line in conll_lines:\n", "        # lines starting with # are comments; ignore\n", "        if line.startswith(\"#\"):\n", "            continue\n", "        # one or more consecutive empty lines mark the end of a sentence\n", "        if not line:\n", "            if current_sentence:\n", "                parsed_sentences.append(current_sentence)\n", "                current_sentence = []\n", "        else:\n", "            current_sentence.append(line)\n", "    if current_sentence:\n", "        parsed_sentences.append(current_sentence)\n", "    return parsed_sentences" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us define a function `render_tree` that displays a sentence dependency graph, making use of the `pydependencygrapher` package for rendering the graph into an image and the `IPython` package for displaying the resulting image.\n", "\n", "We also define a function `render_tree_from_conll` that will take a CONLL sentence (a list of CONLL-formatted lines, one for each token) and create one 
`pydependencygrapher.Token` object for each token, before calling `render_tree` to display the dependency graph." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "def render_tree(sentence):\n", "    graph = pydependencygrapher.DependencyGraph(sentence)\n", "    graph.draw()\n", "    b64png = graph.save_buffer()\n", "    IPython.display.display(IPython.display.Image(data=base64.b64decode(b64png)))\n", "\n", "def render_tree_from_conll(conll_sentence):\n", "    sentence = [pydependencygrapher.Token(*conll_token.split(\"\\t\")) for conll_token in conll_sentence]\n", "    return render_tree(sentence)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "conll_lines = result.splitlines(keepends=False)\n", "for conll_sentence in group_sentence_conll_lines(conll_lines):\n", "    render_tree_from_conll(conll_sentence)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The JSON output format\n", "\n", "The JSON format (which we obtain by passing `format=\"JSON\"` into the `parse` function) is more\n", "convenient when we need to further process the annotations, because each abstraction is mapped\n", "directly into a Python native object (lists, dicts, strings, etc.) as follows:\n", "- The returned object is a `list`, where each element corresponds to a paragraph of the given text;\n", "- In turn, each paragraph is a `list` where each element represents a sentence;\n", "- Each sentence is a `list` where each element represents a token;\n", "- Each token is a `dict` where each key-value pair is an attribute of the token." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "paragraph 1:\n", " sentence 1:\n", " token 1: {'form': 'Esta', 'space': 'LR', 'pos': 'DEM', 'infl': 'fs', 'deprel': 'SP', 'parent': 2}\n", " token 2: {'form': 'frase', 'space': 'LR', 'pos': 'CN', 'lemma': 'FRASE', 'infl': 'fs', 'deprel': 'SJ', 'parent': 3}\n", " token 3: {'form': 'serve', 'space': 'LR', 'pos': 'V', 'lemma': 'SERVIR', 'infl': 'pi-3s', 'deprel': 'ROOT', 'parent': 0}\n", " token 4: {'form': 'para', 'space': 'LR', 'pos': 'PREP', 'deprel': 'C', 'parent': 3}\n", " token 5: {'form': 'testar', 'space': 'LR', 'pos': 'V', 'lemma': 'TESTAR', 'infl': 'INF-nInf', 'deprel': 'COORD', 'parent': 3}\n", " token 6: {'form': 'o', 'space': 'LR', 'pos': 'DA', 'infl': 'ms', 'deprel': 'SP', 'parent': 7}\n", " token 7: {'form': 'funcionamento', 'space': 'LR', 'pos': 'CN', 'lemma': 'FUNCIONAMENTO', 'infl': 'ms', 'deprel': 'DO', 'parent': 5}\n", " token 8: {'form': 'de_', 'space': 'L', 'raw': 'do', 'pos': 'PREP', 'deprel': 'OBL', 'parent': 7}\n", " token 9: {'form': 'o', 'space': 'R', 'pos': 'DA', 'infl': 'ms', 'deprel': 'SP', 'parent': 10}\n", " token 10: {'form': 'parser', 'space': 'LR', 'pos': 'CN', 'lemma': 'PARSER', 'infl': 'ms', 'deprel': 'C', 'parent': 8}\n", " token 11: {'form': 'de', 'space': 'LR', 'pos': 'PREP', 'deprel': 'M', 'parent': 10}\n", " token 12: {'form': 'dependências', 'space': 'L', 'pos': 'CN', 'lemma': 'DEPENDÊNCIA', 'infl': 'fp', 'deprel': 'C', 'parent': 11}\n", " token 13: {'form': '.', 'space': 'R', 'pos': 'PNT', 'deprel': 'PUNCT', 'parent': 3}\n", " sentence 2:\n", " token 1: {'form': 'Esta', 'space': 'LR', 'pos': 'DEM', 'infl': 'fs', 'deprel': 'SP', 'parent': 3}\n", " token 2: {'form': 'outra', 'space': 'LR', 'pos': 'ADJ', 'lemma': 'OUTRO', 'infl': 'fs', 'deprel': 'SP', 'parent': 3}\n", " token 3: {'form': 'frase', 'space': 'LR', 'pos': 'CN', 'lemma': 'FRASE', 'infl': 'fs', 'deprel': 'SJ', 'parent': 4}\n", " token 
4: {'form': 'faz', 'space': 'LR', 'pos': 'V', 'lemma': 'FAZER', 'infl': 'pi-3s', 'deprel': 'ROOT', 'parent': 0}\n", " token 5: {'form': 'o', 'space': 'LR', 'pos': 'LDEM1', 'deprel': 'DO', 'parent': 4}\n", " token 6: {'form': 'mesmo', 'space': 'L', 'pos': 'LDEM2', 'deprel': 'DO', 'parent': 4}\n", " token 7: {'form': '.', 'space': 'R', 'pos': 'PNT', 'deprel': 'PUNCT', 'parent': 4}\n" ] } ], "source": [ "parsed_text = parse(text, tagset=\"CINTIL\", format=\"JSON\")\n", "for pnum, paragraph in enumerate(parsed_text, start=1): # enumerate paragraphs in text, starting at 1\n", "    print(f\"paragraph {pnum}:\")\n", "    for snum, sentence in enumerate(paragraph, start=1): # enumerate sentences in paragraph, starting at 1\n", "        print(f\" sentence {snum}:\")\n", "        for tnum, token in enumerate(sentence, start=1): # enumerate tokens in sentence, starting at 1\n", "            print(f\" token {tnum}: {token!r}\") # print a token representation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Displaying dependency graphs from parsed text in JSON format\n", "Let us define a function, similar to `render_tree_from_conll`, to display dependency graphs for JSON-encoded sentences." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "def render_tree_from_json(json_sentence):\n", "    token_attributes = [\"form\", \"lemma\", \"pos\", \"pos\", \"infl\", \"parent\", \"deprel\", \"parent\", \"deprel\"]\n", "    sentence = []\n", "    for num, token in enumerate(json_sentence, start=1):\n", "        sentence.append(\n", "            pydependencygrapher.Token(\n", "                num,\n", "                *[token.get(attribute, \"_\") for attribute in token_attributes]\n", "            )\n", "        )\n", "    return render_tree(sentence)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us test the function we just defined:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "for paragraph in parsed_text:\n", "    for sentence in paragraph:\n", "        render_tree_from_json(sentence)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting the status of a webservice access key" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "def get_key_status():\n", "    '''Returns a dict with the detailed status of the webservice access key'''\n", "\n", "    request_data = {\n", "        'method': 'key_status',\n", "        'jsonrpc': '2.0',\n", "        'id': 0,\n", "        'params': {\n", "            'key': LXDEPPARSER_WS_API_KEY,\n", "        },\n", "    }\n", "    request = requests.post(LXDEPPARSER_WS_API_URL, json=request_data)\n", "    response_data = request.json()\n", "    if \"error\" in response_data:\n", "        raise WSException(response_data[\"error\"])\n", "    else:\n", "        return response_data[\"result\"]" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'requests_remaining': 99999970,\n", " 'chars_remaining': 999998849,\n", " 'expiry': '2030-01-10T00:00+00:00'}" ] }, 
"execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "get_key_status()" ] } ], "metadata": { "interpreter": { "hash": "006d5deb8e6cdcd4312641bdf15f3bc20f0769a7305d81173599a7b40f33b4a2" }, "kernelspec": { "display_name": "Python 3.7.7 64-bit", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 2 }