{ "cells": [ { "cell_type": "markdown", "source": [ "# Using LX-Suite to annotate a text from the BDCamões corpus\n", "This is an example notebook that illustrates how you can use the LX-Suite web service to annotate\n", "a sample text from the BDCamões corpus (the full corpus is [available from the PORTULAN CLARIN repository](https://portulanclarin.net/repository/browse/bdcamoes-corpus-collection-of-portuguese-literary-documents-from-the-digital-library-of-camoes-ip-part-i/52f2b16412c411ea8a1302420a000005407eb504ccc045a4a0582ab53dfd43fd/)).\n", "\n", "**Before you run this example**, replace `access_key_goes_here` with your webservice access key below:" ], "metadata": {} }, { "cell_type": "code", "execution_count": 1, "source": [ "LXSUITE_WS_API_KEY = 'access_key_goes_here'\n", "LXSUITE_WS_API_URL = 'https://portulanclarin.net/workbench/lx-suite/api/'" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "## Importing required Python modules\n", "The next cell installs the `requests` and `matplotlib` packages,\n", "if they are not already installed, and makes them available in this notebook." ], "metadata": {} }, { "cell_type": "code", "execution_count": 2, "source": [ "try:\n", "    import requests\n", "except ImportError:\n", "    !pip3 install requests\n", "    import requests\n", "try:\n", "    import matplotlib.pyplot as plt\n", "except ImportError:\n", "    !pip3 install matplotlib\n", "    import matplotlib.pyplot as plt\n", "import collections" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "## Wrapping the complexities of the JSON-RPC API in a simple, easy-to-use function\n", "\n", "The `WSException` class defined below will be used later to identify errors\n", "from the webservice." 
], "metadata": {} }, { "cell_type": "code", "execution_count": 3, "source": [ "class WSException(Exception):\n", "    'Webservice Exception'\n", "    def __init__(self, errordata):\n", "        \"errordata is a dict returned by the webservice with details about the error\"\n", "        assert isinstance(errordata, dict)\n", "        super().__init__(errordata[\"message\"])\n", "        self.message = errordata[\"message\"]\n", "        # see https://json-rpc.readthedocs.io/en/latest/exceptions.html for more info\n", "        # about JSON-RPC error codes\n", "        if -32099 <= errordata[\"code\"] <= -32000: # Server Error\n", "            if errordata[\"data\"][\"type\"] == \"WebServiceException\":\n", "                self.message += f\": {errordata['data']['message']}\"\n", "            else:\n", "                self.message += f\": {errordata['data']!r}\"\n", "    def __str__(self):\n", "        return self.message" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "The next function invokes the LX-Suite webservice through its public JSON-RPC API." ], "metadata": {} }, { "cell_type": "code", "execution_count": 4, "source": [ "def annotate(text, format):\n", "    '''\n", "    Arguments\n", "        text: a string with a maximum of 4000 characters, Portuguese text, with\n", "            the input to be processed\n", "        format: either 'CINTIL', 'CONLL' or 'JSON'\n", "\n", "    Returns the annotation output (a string, or a list of paragraphs when\n", "    format is 'JSON'), according to the specification in\n", "    https://portulanclarin.net/workbench/lx-suite/\n", "\n", "    Raises a WSException if an error occurs.\n", "    '''\n", "\n", "    request_data = {\n", "        'method': 'annotate',\n", "        'jsonrpc': '2.0',\n", "        'id': 0,\n", "        'params': {\n", "            'text': text,\n", "            'format': format,\n", "            'key': LXSUITE_WS_API_KEY,\n", "        },\n", "    }\n", "    request = requests.post(LXSUITE_WS_API_URL, json=request_data)\n", "    response_data = request.json()\n", "    if \"error\" in response_data:\n", "        raise WSException(response_data[\"error\"])\n", "    else:\n", "        return response_data[\"result\"]" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "Let us test the function 
we just defined:" ], "metadata": {} }, { "cell_type": "code", "execution_count": 5, "source": [ "text = '''Esta frase serve para testar o funcionamento da suite. Esta outra\n", "frase faz o mesmo.'''\n", "# the CONLL annotation format is a popular format for annotating part of speech\n", "result = annotate(text, format=\"CONLL\")\n", "print(result)" ], "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "#id\tform\tlemma\tcpos\tpos\tfeat\thead\tdeprel\tphead\tpdeprel\n", "1\tEsta\t-\tDEM\tDEM\tfs\t-\t-\t-\t-\n", "2\tfrase\tFRASE\tCN\tCN\tfs\t-\t-\t-\t-\n", "3\tserve\tSERVIR\tV\tV\tpi-3s\t-\t-\t-\t-\n", "4\tpara\t-\tPREP\tPREP\t-\t-\t-\t-\t-\n", "5\ttestar\tTESTAR\tV\tV\tINF-nInf\t-\t-\t-\t-\n", "6\to\t-\tDA\tDA\tms\t-\t-\t-\t-\n", "7\tfuncionamento\tFUNCIONAMENTO\tCN\tCN\tms\t-\t-\t-\t-\n", "8\tde_\t-\tPREP\tPREP\t-\t-\t-\t-\t-\n", "9\ta\t-\tDA\tDA\tfs\t-\t-\t-\t-\n", "10\tsuite\tSUITE\tCN\tCN\tfs\t-\t-\t-\t-\n", "11\t.\t-\tPNT\tPNT\t-\t-\t-\t-\t-\n", "\n", "\n", "#id\tform\tlemma\tcpos\tpos\tfeat\thead\tdeprel\tphead\tpdeprel\n", "1\tEsta\t-\tDEM\tDEM\tfs\t-\t-\t-\t-\n", "2\toutra\tOUTRO\tADJ\tADJ\tfs\t-\t-\t-\t-\n", "3\tfrase\tFRASE\tCN\tCN\tfs\t-\t-\t-\t-\n", "4\tfaz\tFAZER\tV\tV\tpi-3s\t-\t-\t-\t-\n", "5\to\t-\tLDEM1\tLDEM1\t-\t-\t-\t-\t-\n", "6\tmesmo\t-\tLDEM2\tLDEM2\t-\t-\t-\t-\t-\n", "7\t.\t-\tPNT\tPNT\t-\t-\t-\t-\t-\n", "\n", "\n" ] } ], "metadata": {} }, { "cell_type": "markdown", "source": [ "## The JSON output format\n", "\n", "The JSON format (which we obtain by passing `format=\"JSON\"` into the `annotate` function) is more\n", "convenient when we need to further process the annotations, because each abstraction is mapped\n", "directly into a Python native object (lists, dicts, strings, etc) as follows:\n", "- The returned object is a `list`, where each element corresponds to a paragraph of the given text;\n", "- In turn, each paragraph is a `list` where each element represents a sentence;\n", "- Each sentence is a `list` where each 
element represents a token;\n", "- Each token is a `dict` where each key-value pair is an attribute of the token." ], "metadata": {} }, { "cell_type": "code", "execution_count": 6, "source": [ "annotated_text = annotate(text, format=\"JSON\")\n", "for pnum, paragraph in enumerate(annotated_text, start=1): # enumerate paragraphs in text, starting at 1\n", "    print(f\"paragraph {pnum}:\")\n", "    for snum, sentence in enumerate(paragraph, start=1): # enumerate sentences in paragraph, starting at 1\n", "        print(f\" sentence {snum}:\")\n", "        for tnum, token in enumerate(sentence, start=1): # enumerate tokens in sentence, starting at 1\n", "            print(f\" token {tnum}: {token!r}\") # print a token representation" ], "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "paragraph 1:\n", " sentence 1:\n", " token 1: {'form': 'Esta', 'space': 'LR', 'pos': 'DEM', 'infl': 'fs'}\n", " token 2: {'form': 'frase', 'space': 'LR', 'pos': 'CN', 'lemma': 'FRASE', 'infl': 'fs'}\n", " token 3: {'form': 'serve', 'space': 'LR', 'pos': 'V', 'lemma': 'SERVIR', 'infl': 'pi-3s'}\n", " token 4: {'form': 'para', 'space': 'LR', 'pos': 'PREP'}\n", " token 5: {'form': 'testar', 'space': 'LR', 'pos': 'V', 'lemma': 'TESTAR', 'infl': 'INF-nInf'}\n", " token 6: {'form': 'o', 'space': 'LR', 'pos': 'DA', 'infl': 'ms'}\n", " token 7: {'form': 'funcionamento', 'space': 'LR', 'pos': 'CN', 'lemma': 'FUNCIONAMENTO', 'infl': 'ms'}\n", " token 8: {'form': 'de_', 'space': 'L', 'raw': 'da', 'pos': 'PREP'}\n", " token 9: {'form': 'a', 'space': 'R', 'pos': 'DA', 'infl': 'fs'}\n", " token 10: {'form': 'suite', 'space': 'L', 'pos': 'CN', 'lemma': 'SUITE', 'infl': 'fs'}\n", " token 11: {'form': '.', 'space': 'R', 'pos': 'PNT'}\n", " sentence 2:\n", " token 1: {'form': 'Esta', 'space': 'LR', 'pos': 'DEM', 'infl': 'fs'}\n", " token 2: {'form': 'outra', 'space': 'LR', 'pos': 'ADJ', 'lemma': 'OUTRO', 'infl': 'fs'}\n", " token 3: {'form': 'frase', 'space': 'LR', 'pos': 'CN', 'lemma': 'FRASE', 'infl': 'fs'}\n", " 
token 4: {'form': 'faz', 'space': 'LR', 'pos': 'V', 'lemma': 'FAZER', 'infl': 'pi-3s'}\n", " token 5: {'form': 'o', 'space': 'LR', 'pos': 'LDEM1'}\n", " token 6: {'form': 'mesmo', 'space': 'L', 'pos': 'LDEM2'}\n", " token 7: {'form': '.', 'space': 'R', 'pos': 'PNT'}\n" ] } ], "metadata": {} }, { "cell_type": "markdown", "source": [ "## Downloading and preparing our working text\n", "\n", "In the next code cell, we will download a copy of the book \"Viagens na minha terra\" and prepare it to be used as our working text." ], "metadata": {} }, { "cell_type": "code", "execution_count": 7, "source": [ "# A plain text version of this book is available from our GitHub repository:\n", "sample_text_url = \"https://github.com/portulanclarin/jupyter-notebooks/raw/main/sample-data/viagensnaminhaterra.txt\"\n", "\n", "req = requests.get(sample_text_url)\n", "sample_text_lines = req.text.splitlines()\n", "\n", "num_lines = len(sample_text_lines)\n", "print(f\"The downloaded text contains {num_lines} lines\")\n", "\n", "# discard whitespace at beginning and end of each line:\n", "sample_text_lines = [line.strip() for line in sample_text_lines]\n", "\n", "# discard empty lines\n", "sample_text_lines = [line for line in sample_text_lines if line]\n", "\n", "# how many lines do we have left?\n", "num_lines = len(sample_text_lines)\n", "print(f\"After discarding empty lines we are left with {num_lines} non-empty lines\")\n" ], "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "The downloaded text contains 2509 lines\n", "After discarding empty lines we are left with 2205 non-empty lines\n" ] } ], "metadata": {} }, { "cell_type": "markdown", "source": [ "## Annotating with the LX-Suite web service\n", "\n", "There is a limit on the number of web service requests per hour that can be made with any given key.\n", "Thus, we should send as much text as possible in each request while staying within the limit of\n", "4000 characters per request.\n", "\n", 
"To this end, the following function slices our text into chunks smaller than 4K:" ], "metadata": {} }, { "cell_type": "code", "execution_count": 8, "source": [ "def slice_into_chunks(lines, max_chunk_size=4000):\n", " chunk, chunk_size = [], 0\n", " for lnum, line in enumerate(lines, start=1):\n", " if (chunk_size + len(line)) <= max_chunk_size:\n", " chunk.append(line)\n", " chunk_size += len(line) + 1\n", " # the + 1 above is for the newline character terminating each line\n", " else:\n", " yield \"\\n\".join(chunk)\n", " if len(line) > max_chunk_size:\n", " print(f\"line {lnum} is longer than 4000 characters; truncating\")\n", " line = line[:4000]\n", " chunk, chunk_size = [line], len(line) + 1\n", " if chunk:\n", " yield \"\\n\".join(chunk)" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "Next, we will apply `slice_into_chunks` to the sample text to get the chunks to be annotated." ], "metadata": {} }, { "cell_type": "code", "execution_count": 9, "source": [ "chunks = list(slice_into_chunks(sample_text_lines))\n", "annotated_text = [] # annotated paragraphs will be stored here\n", "chunks_processed = 0 # this variable keeps track of which chunks have been processed already\n", "print(f\"There are {len(chunks)} chunks to be annotated\")" ], "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "There are 105 chunks to be annotated\n" ] } ], "metadata": {} }, { "cell_type": "markdown", "source": [ "Next, we will invoke `annotate` on each chunk.\n", "If we get an exception while annotating a chunk:\n", "- check the exception message to determine what was the cause;\n", "- if the maximum number of requests per hour has been exceeded, then wait some time before retrying;\n", "- if a temporary error occurred in the webservice, try again later.\n", "\n", "In any case, as long as the notebook is not shutdown or restarted, the text that has been annotated thus far is not lost,\n", "and re-running the following cell will pick up 
from the point where the exception occurred." ], "metadata": {} }, { "cell_type": "code", "execution_count": 10, "source": [ "for cnum, chunk in enumerate(chunks[chunks_processed:], start=chunks_processed+1):\n", "    try:\n", "        annotated_text.extend(annotate(chunk, format=\"JSON\"))\n", "        chunks_processed = cnum\n", "        # print one dot for each annotated chunk to get some progress feedback\n", "        print(\".\", end=\"\", flush=True)\n", "    except Exception as exc:\n", "        chunk_preview = chunk[:100] + \"[...]\" if len(chunk) > 100 else chunk\n", "        print(\n", "            f\"\\nError: annotation of chunk {cnum} failed ({exc}); chunk contents:\\n\\n{chunk_preview}\\n\\n\"\n", "        )\n", "        break" ], "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "........................................................................................................." ] } ], "metadata": {} }, { "cell_type": "markdown", "source": [ "## Let's create a pie chart with the most common part-of-speech tags" ], "metadata": {} }, { "cell_type": "code", "execution_count": 11, "source": [ "%matplotlib inline\n", "\n", "tag_frequencies = collections.Counter(\n", "    token[\"pos\"]\n", "    for paragraph in annotated_text\n", "    for sentence in paragraph\n", "    for token in sentence\n", ").most_common()\n", "\n", "# keep the nine most frequent tags as individual slices\n", "tags = [tag for tag, _ in tag_frequencies[:9]]\n", "freqs = [freq for _, freq in tag_frequencies[:9]]\n", "\n", "# group all remaining tags (from the tenth onwards) into a single slice\n", "tags.append(\"other\")\n", "freqs.append(sum(freq for _, freq in tag_frequencies[9:]))\n", "\n", "plt.rcParams['figure.figsize'] = [10, 10]\n", "fig1, ax1 = plt.subplots()\n", "ax1.pie(freqs, labels=tags, autopct='%1.1f%%', startangle=90)\n", "ax1.axis('equal') # equal aspect ratio ensures that pie is drawn as a circle.\n", "\n", "plt.show()\n", "# To learn more about matplotlib visit https://matplotlib.org/" ], "outputs": [ { "output_type": "display_data", "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {} } ], "metadata": {} }, { "cell_type": "markdown", "source": [ "## Getting the status of a webservice access key" ], "metadata": {} }, { "cell_type": "code", "execution_count": 12, "source": [ "def get_key_status():\n", " '''Returns a string with the detailed status of the webservice access key'''\n", " \n", " request_data = {\n", " 'method': 'key_status',\n", " 'jsonrpc': '2.0',\n", " 'id': 0,\n", " 'params': {\n", " 'key': LXSUITE_WS_API_KEY,\n", " },\n", " }\n", " request = requests.post(LXSUITE_WS_API_URL, json=request_data)\n", " response_data = request.json()\n", " if \"error\" in response_data:\n", " raise WSException(response_data[\"error\"])\n", " else:\n", " return response_data[\"result\"]" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": 13, "source": [ "get_key_status()" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "{'requests_remaining': 99999140,\n", " 'chars_remaining': 998236690,\n", " 'expiry': '2030-01-10T00:00+00:00'}" ] }, "metadata": {}, "execution_count": 13 } ], "metadata": {} } ], "metadata": { "interpreter": { "hash": "006d5deb8e6cdcd4312641bdf15f3bc20f0769a7305d81173599a7b40f33b4a2" }, "kernelspec": { "name": "python3", "display_name": "Python 3.7.7 64-bit" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 2 }