{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Python Parsing with NLTK and Foma" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**(C) 2017-2024 by [Damir Cavar](http://damir.cavar.me/)**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Download:** This and various other Jupyter notebooks are available from my [GitHub repo](https://github.com/dcavar/python-tutorial-notebooks)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**License:** [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/) ([CA BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Prerequisites:**" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Defaulting to user installation because normal site-packages is not writeable\n", "\u001b[33mWARNING: Skipping /usr/lib/python3.12/dist-packages/argcomplete-3.1.4.dist-info due to invalid metadata entry 'name'\u001b[0m\u001b[33m\n", "\u001b[0m\u001b[33mWARNING: Skipping /usr/lib/python3.12/dist-packages/pybind11-2.11.1.dist-info due to invalid metadata entry 'name'\u001b[0m\u001b[33m\n", "\u001b[0mRequirement already satisfied: nltk in /usr/lib/python3/dist-packages (3.8.1)\n", "\u001b[33mWARNING: Skipping /usr/lib/python3.12/dist-packages/argcomplete-3.1.4.dist-info due to invalid metadata entry 'name'\u001b[0m\u001b[33m\n", "\u001b[0m\u001b[33mWARNING: Skipping /usr/lib/python3.12/dist-packages/pybind11-2.11.1.dist-info due to invalid metadata entry 'name'\u001b[0m\u001b[33m\n", "\u001b[0m" ] } ], "source": [ "!pip install -U nltk" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a tutorial related to the discussion of grammar engineering and parsing in the classes *Alternative Syntactic Theories* and *Advanced Natural Language Processing* taught at Indiana University in Spring 2017, Fall 2018, and Spring 2019, Fall 2020." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following code examples require the [*foma*](https://github.com/mhulden/foma) application and library to be installed, as well as the *foma.py* module. Since I am using [Python 3.x](https://www.python.org/downloads/) here, I would recommend to use my version of *foma.py* and install it in the local modules folder of your Python distribution. In my case, since I use [Anaconda](https://www.continuum.io/downloads), the file goes in *anaconda/lib/python3.6/* in my home directory. On a Linux distribution you might have a folder */usr/local/lib/python3.6* or similar, where the module *foma.py* has to be copied to." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You might need to install [Foma](https://fomafst.github.io/) on your local machine. It seems to be a package in the general catalog of [Ubuntu](https://ubuntu.com/) and [Fedora Linux](https://fedoraproject.org/)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Grammars and Parsers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will use NLTK in the following:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "import nltk" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can declare a feature structure and display it using NLTK:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ [ GND = 'fem' ] ]\n", "[ AGR = [ NUM = 'pl' ] ]\n", "[ [ PER = 3 ] ]\n", "[ ]\n", "[ POS = 'N' ]\n" ] } ], "source": [ "fstr = nltk.FeatStruct(\"[POS='N', AGR=[PER=3, NUM='pl', GND='fem']]\")\n", "print(fstr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will also use Feature grammars:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "from nltk.grammar import FeatureGrammar, FeatDict" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can define a feature grammar in the following way:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "grammarText = \"\"\"\n", "% start S\n", "# ############################\n", "# Grammar Rules\n", "# ############################\n", "S[TYPE=decl] -> NP[NUM=?n, PERS=?p, CASE='nom'] VP[NUM=?n, PERS=?p] '.'\n", "S[TYPE=inter] -> NP[NUM=?n, PERS=?p, CASE='nom', PRONTYPE=inter] VP[NUM=?n, PERS=?p] '?'\n", "S[TYPE=inter] -> NP[NUM=?n, PERS=?p, CASE='acc', PRONTYPE=inter] AUX NP[NUM=?n, PERS=?p, CASE='nom'] VP[NUM=?n, PERS=?p, VAL=2] '?'\n", "NP[NUM=?n, PERS=?p, CASE=?c] -> N[NUM=?n, PERS=?p, CASE=?c]\n", "NP[NUM=?n, PERS=?p, CASE=?c] -> D[NUM=?n, CASE=?c] N[NUM=?n,PERS=?p, CASE=?c]\n", "NP[NUM=?n, PERS=?p, CASE=?c] -> Pron[NUM=?n, PERS=?p, CASE=?c]\n", "VP[NUM=?n, PERS=?p] -> V[NUM=?n, PERS=?p]\n", "VP[NUM=?n, PERS=?p, VAL=2] -> V[NUM=?n, PERS=?p, VAL=2] NP[CASE='acc']\n", "VP[NUM=?n, PERS=?p, VAL=2] -> V[NUM=?n, PERS=?p, VAL=2]\n", "\"\"\"\n", "\n", "grammar = FeatureGrammar.fromstring(grammarText)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Testing the grammar with the input sentence *John loves Mary* fails, because there is not lexical defintion of these entries in the grammar." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The grammar is missing token definitions.\n", "* John loves Mary\n", "The grammar is missing token definitions.\n", "* Mary loves John\n" ] } ], "source": [ "parser = nltk.parse.FeatureChartParser(grammar)\n", "sentences = [\"John loves Mary\",\n", " \"Mary loves John\"]\n", "\n", "for sentence in sentences:\n", " result = []\n", " try:\n", " result = list(parser.parse(sentence.split())) \n", " except ValueError:\n", " print(\"The grammar is missing token definitions.\")\n", "\n", " if result:\n", " for x in result:\n", " print(x)\n", " else:\n", " print(\"*\", sentence)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can include foma and a morphology using the following code:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "import foma" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following line will load the *eng.fst* Foma morphology:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "fst = foma.FST.load(b'eng.fst')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can print out the analysis for each single token by submitting it to the FST:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "John+N+Sg+Masc+NEPerson\n", "call+V+3P+Sg\n", "Mary+N+Sg+Fem+NEPerson\n" ] } ], "source": [ "tokens = \"John calls Mary\".split()\n", "for token in tokens:\n", " result = list(fst.apply_up(str.encode(token)))\n", " for r in result:\n", " print(r.decode('utf8'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we want to convert the flat string annotation from the morphology to a NLTK feature structure, we need to translate some entires to a corresponding Attribute Value Matrix (AVM). In the following table we define a feature in the morphology output and the corresponging feature structure that it corresponds with:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "featureMapping = {\n", " 'Q' : \"PRONTYPE = inter\",\n", " 'Animate' : \"ANIMATE = 1\",\n", " 'Def' : \"DETTYPE = def\",\n", " 'Indef' : \"DETTYPE = indef\",\n", " 'Sg' : \"NUM = sg\",\n", " 'Pl' : \"NUM = pl\",\n", " '3P' : \"PERS = 3\",\n", " 'Masc' : \"GENDSEM = male\",\n", " 'Fem' : \"GENDSEM = female\",\n", " 'Dat' : \"CASE = dat\",\n", " 'Acc' : \"CASE = acc\",\n", " 'Nom' : \"CASE = nom\",\n", " 'NEPersonName' : \"\"\"NTYPE = [NSYN = proper,\n", " NSEM = [PROPER = [PROPERTYPE = name,\n", " NAMETYPE = first_name]]],\n", " HUMAN = 1\"\"\"\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use a function *feat2LFG* to convert the feature tag in the morphological analysis to a LFG-compatible AVM:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "def feat2LFG(f):\n", " result = featureMapping.get(f, \"\")\n", " return(nltk.FeatStruct(\"\".join( (\"[\", result, \"]\") )))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following helper function is recursive. It mapps the AVM to a bracketed string annotation of feature structures, as used in the NLTK feature grammar format:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "def flatFStructure(f):\n", " res = \"\"\n", " for key in f.keys():\n", " val = f[key]\n", " if res:\n", " res += ', '\n", " if (isinstance(val, FeatDict)):\n", " res += key + '=' + flatFStructure(val)\n", " else:\n", " res += key + \"=\" + str(val)\n", " return('[' + res + ']')" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[GENDSEM='male']" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "feat2LFG(\"Masc\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following function is a parse function that prints out parse trees for an input, maintaining the extended feature structures at the lexical level. It can now parse sentences that contain words that are not specified as lexical words in the grammar, but rather as paths in the morphological finite state transducer." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "def parseFoma(sentence):\n", " tokens = sentence.split()\n", "\n", " tokenAnalyses = {}\n", " rules = []\n", " count = 0\n", " for token in tokens:\n", " aVal = []\n", " result = list(fst.apply_up(str.encode(token)))\n", " for r in result:\n", " elements = r.decode('utf8').split('+')\n", "\n", " lemma = elements[0]\n", " tokFeat = nltk.FeatStruct(\"[PRED=\" + lemma + \"]\")\n", "\n", " cat = elements[1]\n", " if len(elements) > 2:\n", " feats = tuple(elements[2:])\n", " else:\n", " feats = ()\n", " for x in feats:\n", " fRes2 = feat2LFG(x)\n", " fRes = tokFeat.unify(fRes2)\n", " if fRes:\n", " tokFeat = fRes\n", " else:\n", " print(\"Error unifying:\", tokFeat, fRes2)\n", " flatFStr = flatFStructure(tokFeat)\n", " aVal.append(cat + flatFStr)\n", " rules.append(cat + flatFStr + \" -> \" + \"'\" + token + \"'\")\n", " tokenAnalyses[count] = aVal\n", " count += 1\n", "\n", " grammarText2 = grammarText + \"\\n\" + \"\\n\".join(rules)\n", "\n", " grammar = FeatureGrammar.fromstring(grammarText2)\n", " parser = nltk.parse.FeatureChartParser(grammar)\n", " result = list(parser.parse(tokens))\n", " if result:\n", " for x in result:\n", " print(x)\n", " else:\n", " print(\"*\", sentence)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can call this function using the following code:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(S[TYPE='decl']\n", " (NP[CASE=?c, NUM='sg', PERS=?p]\n", " (N[GENDSEM='male', NUM='sg', PRED='John'] John))\n", " (VP[NUM=?n, PERS=?p, VAL=2]\n", " (V[PRED='call'] called)\n", " (NP[CASE=?c, NUM='sg', PERS=?p]\n", " (N[GENDSEM='female', NUM='sg', PRED='Mary'] Mary)))\n", " .)\n" ] } ], "source": [ "parseFoma(\"John called Mary .\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(C) 2017-2024 by [Damir Cavar](http://damir.cavar.me/) - [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/) ([CA BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.3" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": false, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": false, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": false, "toc_window_display": false }, "vscode": { "interpreter": { "hash": "1e28a5307a9b5c2fbeb0b263581f1cf3bfba9739188743f6a231f74c7de58892" } } }, "nbformat": 4, "nbformat_minor": 4 }