{ "cells": [ { "cell_type": "code", "execution_count": 64, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "skip" } }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "\n", "
\n", "" ] }, { "cell_type": "code", "execution_count": 65, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "%%capture\n", "%load_ext autoreload\n", "%autoreload 2\n", "import sys\n", "sys.path.append(\"..\")\n", "from statnlpbook.util import execute_notebook\n", "import statnlpbook.parsing as parsing\n", "from statnlpbook.transition import *\n", "from statnlpbook.dep import *\n", "import pandas as pd\n", "from io import StringIO\n", "from IPython.display import display, HTML\n", "\n", "execute_notebook('transition-based_dependency_parsing.ipynb')" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "is_executing": false, "name": "#%% md\n" }, "slideshow": { "slide_type": "skip" } }, "source": [ "\n", "$$\n", "\\newcommand{\\Xs}{\\mathcal{X}}\n", "\\newcommand{\\Ys}{\\mathcal{Y}}\n", "\\newcommand{\\y}{\\mathbf{y}}\n", "\\newcommand{\\balpha}{\\boldsymbol{\\alpha}}\n", "\\newcommand{\\bbeta}{\\boldsymbol{\\beta}}\n", "\\newcommand{\\aligns}{\\mathbf{a}}\n", "\\newcommand{\\align}{a}\n", "\\newcommand{\\source}{\\mathbf{s}}\n", "\\newcommand{\\target}{\\mathbf{t}}\n", "\\newcommand{\\ssource}{s}\n", "\\newcommand{\\starget}{t}\n", "\\newcommand{\\repr}{\\mathbf{f}}\n", "\\newcommand{\\repry}{\\mathbf{g}}\n", "\\newcommand{\\x}{\\mathbf{x}}\n", "\\newcommand{\\prob}{p}\n", "\\newcommand{\\a}{\\alpha}\n", "\\newcommand{\\b}{\\beta}\n", "\\newcommand{\\vocab}{V}\n", "\\newcommand{\\params}{\\boldsymbol{\\theta}}\n", "\\newcommand{\\param}{\\theta}\n", "\\DeclareMathOperator{\\perplexity}{PP}\n", "\\DeclareMathOperator{\\argmax}{argmax}\n", "\\DeclareMathOperator{\\argmin}{argmin}\n", "\\newcommand{\\train}{\\mathcal{D}}\n", "\\newcommand{\\counts}[2]{\\#_{#1}(#2) }\n", "\\newcommand{\\length}[1]{\\text{length}(#1) }\n", "\\newcommand{\\indi}{\\mathbb{I}}\n", "$$" ] }, { "cell_type": "code", "execution_count": 66, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "skip" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The tikzmagic extension is already loaded. To reload it, use:\n", " %reload_ext tikzmagic\n" ] } ], "source": [ "%load_ext tikzmagic" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Parsing" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Schedule\n", "\n", "+ Parsing motivation\n", "\n", "+ Background: parsing (10 min.)\n", "\n", "+ Exercise: multi-word expressions (10 min.)\n", "\n", "+ Background: Universal Dependencies (5 min.)\n", "\n", "+ Background: transition-based parsing (10 min.)\n", "\n", "+ Break (10 min.)\n", "\n", "+ Example: transition-based parsing (5 min.)\n", "\n", "+ Motivation: natural language understanding (5 min.)\n", "\n", "+ Background: learning to parse (10 min.)\n", "\n", "+ Math: dependency parsing evaluation (5 min.)\n", "\n", "+ Examples: dependency parsers (5 min.)\n", "\n", "+ Background: semantic parsing (15 min.)\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Motivation: information extraction\n", "\n", "> Dechra Pharmaceuticals, which has just made its second acquisition, had previously purchased Genitrix.\n", "\n", "> Trinity Mirror plc, the largest British newspaper, purchased Local World, its rival.\n", "\n", "> Kraft, owner of Milka, purchased Cadbury Dairy Milk and is now gearing up for a roll-out of its new brand.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Check out [UDPipe](https://lindat.mff.cuni.cz/services/udpipe/run.php?model=english-ewt-ud-2.6-200830&data=Kraft,%20owner%20of%20Milka,%20purchased%20Cadbury%20Dairy%20Milk%20and%20is%20now%20gearing%20up%20for%20a%20roll-out%20of%20its%20new%20brand.) and [Stanza](http://stanza.run/)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Motivation: question answering by reading comprehension\n", "\n", "
\n", " \n", "
\n", "\n", "
\n", " (from [Rajpurkar et al., 2016](https://aclanthology.org/D16-1264))\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Motivation: question answering from knowledge bases\n", "\n", "
\n", " \n", "
\n", "\n", "
\n", " (from [Reddy et al., 2017](https://aclanthology.org/D17-1009/))\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Parsing is is the process of **constructing these graphs**:\n", "\n", "* very important for downstream applications\n", "* researched in academia and [industry](https://ai.googleblog.com/2016/05/announcing-syntaxnet-worlds-most.html)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "How is this done?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Syntactic Dependencies\n", "\n", "* **Lexical Elements**: words\n", "* **Syntactic Relations**: subject, direct object, nominal modifier, etc. \n", "\n", "Task: determine the syntactic relations between words" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Grammatical Relations\n", "> Kraft, owner of Milka, purchased Cadbury Dairy Milk and is now gearing up for a roll-out of its new brand.\n", "\n", "* *Subject* of **purchased**: Kraft\n", "* *Object* of **purchased**: Cadbury" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", " " ], "text/plain": [ "" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "conllu = \"\"\"\n", "# ID\tFORM\tLEMMA\tUPOS\tXPOS\tFEATS\tHEAD\tDEPREL\tDEPS\tMISC\n", "1\tKraft\tKraft\tNOUN\tNN\t_\t7\tnsubj\t_\t_\n", "2\t,\t,\tPUNCT\t,\t_\t1\tpunct\t_\t_\n", "3\towner\towner\tNOUN\tNN\t_\t1\tappos\t_\t_\n", "4\tof\tof\tADP\tIN\t_\t5\tcase\t_\t_\n", "5\tMilka\tMilka\tPROPN\tNNP\t_\t3\tnmod\t_\t_\n", "6\t,\t,\tPUNCT\t,\t_\t7\tpunct\t_\t_\n", "7\tpurchased\tpurchase\tVERB\tVBD\t_\t0\troot\t_\t_\n", "8\tCadbury\tCadbury\tPROPN\tNNP\t_\t7\tdobj\t_\t_\n", "9\tDairy\tDairy\tPROPN\tNNP\t_\t8\tflat\t_\t_\n", "10\tMilk\tmilk\tPROPN\tNNP\t_\t8\tflat\t_\t_\n", "\"\"\"\n", "arcs, tokens = to_displacy_graph(*load_arcs_tokens(conllu))\n", "render_displacy(arcs, tokens,\"1200px\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Anatomy of a Dependency Tree\n", "\n", "* Nodes (vertices):\n", " * Words of the sentence (+ punctuation tokens)\n", " * a ROOT node\n", "* Arcs (edges):\n", " * Directed from syntactic **head** to **dependent**\n", " * Each **non-ROOT** token has **exactly one head**\n", " * the word that controls its syntactic function, or\n", " * the word \"it depends on\"\n", "* ROOT **has no head**\n" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", " " ], "text/plain": [ "" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "conllu = \"\"\"\n", "# ID\tFORM\tLEMMA\tUPOS\tXPOS\tFEATS\tHEAD\tDEPREL\tDEPS\tMISC\n", "1\tKraft\tKraft\tNOUN\tNN\t_\t7\tnsubj\t_\t_\n", "2\t,\t,\tPUNCT\t,\t_\t1\tpunct\t_\t_\n", "3\towner\towner\tNOUN\tNN\t_\t1\tappos\t_\t_\n", "4\tof\tof\tADP\tIN\t_\t5\tcase\t_\t_\n", "5\tMilka\tMilka\tPROPN\tNNP\t_\t3\tnmod\t_\t_\n", "6\t,\t,\tPUNCT\t,\t_\t7\tpunct\t_\t_\n", "7\tpurchased\tpurchase\tVERB\tVBD\t_\t0\troot\t_\t_\n", "8\tCadbury\tCadbury\tPROPN\tNNP\t_\t7\tdobj\t_\t_\n", "9\tDairy\tDairy\tPROPN\tNNP\t_\t8\tflat\t_\t_\n", "10\tMilk\tmilk\tPROPN\tNNP\t_\t8\tflat\t_\t_\n", "\"\"\"\n", "arcs, tokens = to_displacy_graph(*load_arcs_tokens(conllu))\n", "render_displacy(arcs, tokens,\"1200px\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Example\n", "\n", "(in [CoNLL-U Format](https://universaldependencies.org/format.html))" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "pycharm": { "is_executing": false, "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
# IDFORMLEMMAUPOSXPOSFEATSHEADDEPRELDEPSMISC
1Alice____2nsubj__
2saw____0root__
3Bob____2dobj__
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "
\n", " " ], "text/plain": [ "" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "conllu = \"\"\"\n", "# ID\tFORM\tLEMMA\tUPOS\tXPOS\tFEATS\tHEAD\tDEPREL\tDEPS\tMISC\n", "1\tAlice\t_\t_\t_\t_\t2\tnsubj\t_\t_\n", "2\tsaw\t_\t_\t_\t_\t0\troot\t_\t_\n", "3\tBob\t_\t_\t_\t_\t2\tdobj\t_\t_\n", "\"\"\"\n", "display(HTML(pd.read_csv(StringIO(conllu), sep=\"\\t\").to_html(index=False)))\n", "arcs, tokens = to_displacy_graph(*load_arcs_tokens(conllu))\n", "render_displacy(arcs, tokens,\"900px\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### https://ucph.padlet.org/dh/mw" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Need for Universal Syntax\n", "\n", "### https://cl.lingfil.uu.se/~nivre/docs/NivreCLIN2020.pdf" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Universal Syntax\n", "\n", "English and Danish are similar, while others are more distant:\n", "![similarities](https://www.mitpressjournals.org/na101/home/literatum/publisher/mit/journals/content/coli/2019/coli.2019.45.issue-2/coli_a_00351/20190614/images/large/00351f03c.jpeg)\n", "\n", "
\n", " Left: clustering based on syntactic dependencies; right: genetic tree\n", " (from Bjerva et al., 2019)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Danish Example" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
# IDFORMLEMMAUPOSXPOSFEATSHEADDEPRELDEPSMISC
1AliceAliceNOUN__2nsubj__
2seVERB__0root__
3BobBobPROPN__2obj__
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "
\n", " " ], "text/plain": [ "" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "conllu = \"\"\"\n", "# ID\tFORM\tLEMMA\tUPOS\tXPOS\tFEATS\tHEAD\tDEPREL\tDEPS\tMISC\n", "1\tAlice\tAlice\tNOUN\t_\t_\t2\tnsubj\t_\t_\n", "2\tså\tse\tVERB\t_\t_\t0\troot\t_\t_\n", "3\tBob\tBob\tPROPN\t_\t_\t2\tobj\t_\t_\n", "\"\"\"\n", "display(HTML(pd.read_csv(StringIO(conllu), sep=\"\\t\").to_html(index=False)))\n", "arcs, tokens = to_displacy_graph(*load_arcs_tokens(conllu))\n", "render_displacy(arcs, tokens,\"900px\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Korean Example" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
# IDFORMLEMMAUPOSXPOSFEATSHEADDEPRELDEPSMISC
1앨리스는앨리스+는NOUN__3nsubj__
2밥을밥+을NOUN__3obj__
3보았다보+았+다VERB__0root__
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "
\n", " " ], "text/plain": [ "" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "conllu = \"\"\"\n", "# ID\tFORM\tLEMMA\tUPOS\tXPOS\tFEATS\tHEAD\tDEPREL\tDEPS\tMISC\n", "1\t앨리스는\t앨리스+는\tNOUN\t_\t_\t3\tnsubj\t_\t_\n", "2\t밥을\t밥+을\tNOUN\t_\t_\t3\tobj\t_\t_\n", "3\t보았다\t보+았+다\tVERB\t_\t_\t0\troot\t_\t_\n", "\"\"\"\n", "display(HTML(pd.read_csv(StringIO(conllu), sep=\"\\t\").to_html(index=False)))\n", "arcs, tokens = to_displacy_graph(*load_arcs_tokens(conllu))\n", "render_displacy(arcs, tokens,\"900px\")\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Longer English Example" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
# IDFORMLEMMAUPOSXPOSFEATSHEADDEPRELDEPSMISC
1KraftKraftNOUNNN_7nsubj__
2,,PUNCT,_1punct__
3ownerownerNOUNNN_1appos__
4ofofADPIN_5case__
5MilkaMilkaPROPNNNP_3nmod__
6,,PUNCT,_7punct__
7purchasedpurchaseVERBVBD_0root__
8CadburyCadburyPROPNNNP_7dobj__
9DairyDairyPROPNNNP_8flat__
10MilkmilkPROPNNNP_8flat__
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "
\n", " " ], "text/plain": [ "" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "conllu = \"\"\"\n", "# ID\tFORM\tLEMMA\tUPOS\tXPOS\tFEATS\tHEAD\tDEPREL\tDEPS\tMISC\n", "1\tKraft\tKraft\tNOUN\tNN\t_\t7\tnsubj\t_\t_\n", "2\t,\t,\tPUNCT\t,\t_\t1\tpunct\t_\t_\n", "3\towner\towner\tNOUN\tNN\t_\t1\tappos\t_\t_\n", "4\tof\tof\tADP\tIN\t_\t5\tcase\t_\t_\n", "5\tMilka\tMilka\tPROPN\tNNP\t_\t3\tnmod\t_\t_\n", "6\t,\t,\tPUNCT\t,\t_\t7\tpunct\t_\t_\n", "7\tpurchased\tpurchase\tVERB\tVBD\t_\t0\troot\t_\t_\n", "8\tCadbury\tCadbury\tPROPN\tNNP\t_\t7\tdobj\t_\t_\n", "9\tDairy\tDairy\tPROPN\tNNP\t_\t8\tflat\t_\t_\n", "10\tMilk\tmilk\tPROPN\tNNP\t_\t8\tflat\t_\t_\n", "\"\"\"\n", "display(HTML(pd.read_csv(StringIO(conllu), sep=\"\\t\").to_html(index=False)))\n", "arcs, tokens = to_displacy_graph(*load_arcs_tokens(conllu))\n", "render_displacy(arcs, tokens,\"1200px\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Universal Dependencies \n", "\n", "* Annotation framework featuring [37 syntactic relations](https://universaldependencies.org/u/dep/all.html)\n", "* [Treebanks](http://universaldependencies.org/) in over 90 languages\n", "* Large project with over 200 contributors\n", "* Linguistically universal [annotation guidelines](https://universaldependencies.org/guidelines.html)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### UD Dependency Relations\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\t \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \t\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Nominals Clauses Modifier words Function Words
\n", "\tCore arguments\n", " \n", "\t nsubj
\n", "\t obj
\n", "\t iobj\n", "
\n", "\t csubj
\n", "\t ccomp
\n", "\t xcomp\n", "
\n", "\tNon-core dependents\n", " \n", "\t obl
\n", "\t vocative
\n", "\t expl
\n", "\t dislocated\n", "
\n", "\t advcl\n", " \n", "\t advmod
\n", "\t discourse\n", "
\n", "\t aux
\n", "\t cop
\n", "\t mark\n", "
\n", "\tNominal dependents\n", " \n", "\t nmod
\n", "\t appos
\n", "\t nummod\n", "
\n", "\t acl\n", " \n", "\t amod\n", " \n", "\t det
\n", "\t clf
\n", "\t case\n", "
Coordination MWE Loose Special Other
\n", "\t conj
\n", "\t cc\n", "
\n", "\t fixed
\n", "\t flat
\n", "\t compound\n", "
\n", "\t list
\n", "\t parataxis\n", "
\n", "\t orphan
\n", "\t goeswith
\n", "\t reparandum\n", "
\n", "\t punct
\n", "\t root
\n", "\t dep\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Universal POS Tags (UPOS)\n", "\n", "As opposed to language-specific POS tags (XPOS).\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Open class wordsClosed class wordsOther
ADJADPPUNCT
ADVAUXSYM
INTJCCONJX
NOUNDET 
PROPNNUM 
VERBPART 
 PRON 
 SCONJ 
" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "## Dependency Parsing\n", "\n", "* Predict **head** and **relation** for each word.\n", "* Structured prediction, just like POS tagging.\n", "* Or is it?" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
# IDFORMLEMMAUPOSXPOSFEATSHEADDEPRELDEPSMISC
1Alice____2nsubj__
2saw____0root__
3Bob____2dobj__
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "conllu = \"\"\"\n", "# ID\tFORM\tLEMMA\tUPOS\tXPOS\tFEATS\tHEAD\tDEPREL\tDEPS\tMISC\n", "1\tAlice\t_\t_\t_\t_\t2\tnsubj\t_\t_\n", "2\tsaw\t_\t_\t_\t_\t0\troot\t_\t_\n", "3\tBob\t_\t_\t_\t_\t2\tdobj\t_\t_\n", "\"\"\"\n", "display(HTML(pd.read_csv(StringIO(conllu), sep=\"\\t\").to_html(index=False)))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Dependency Parsing Approaches\n", "\n", "* Graph-based: score all possible parts (e.g. word pairs), find best combination (e.g. maximum spanning tree)\n", "* Transition-based: incrementally build the tree, one arc at a time, by applying a sequence of actions" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Transition-Based Parsing\n", "\n", "* Learn to perform the right action / transition in a bottom-up left-right parser\n", "* Train classifiers $p(y|\\x)$ where $y$ is an action, and $\\x$ is the **parser state**\n", "* Many possible transition systems; shown here: **arc-standard** ([Nivre, 2004](https://www.aclweb.org/anthology/W04-0308))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Configuration (Parser State)\n", "\n", "Consists of a buffer, stack and set of arcs created so far." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Buffer\n", "of tokens waiting for processing" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "\n", "
stackbufferparseaction
ROOTAlice saw Bob\n", "
\n", "
" ], "text/plain": [ ".Output at 0x7fbd7aaa2898>" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "render_transitions_displacy(transitions[0:1], tokenized_sentence)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "is_executing": false, "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "### Stack\n", "of tokens currently being processed" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "\n", "
stackbufferparseaction
ROOT Alice sawBob\n", "
\n", "
shift
" ], "text/plain": [ ".Output at 0x7fbd7a820898>" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "render_transitions_displacy(transitions[2:3],tokenized_sentence)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "is_executing": false, "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "### Parse (set of arcs)\n", "tree built so far" ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "\n", "
stackbufferparseaction
ROOT\n", "
\n", "
rightArc-root
" ], "text/plain": [ ".Output at 0x7fbd7a7f3f60>" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "render_transitions_displacy(transitions[6:7], tokenized_sentence)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "is_executing": false, "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "We use the following \n", "### Actions" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Shift\n", "\n", "Push the word at the top of the buffer to the stack. \n", "\n", "$$\n", "(S, i|B, A)\\rightarrow(S|i, B, A)\n", "$$" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
stackbufferparseaction
ROOTAlice saw Bob\n", "
\n", "
ROOT Alicesaw Bob\n", "
\n", "
shift
" ], "text/plain": [ ".Output at 0x7fbd7a820cf8>" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "render_transitions_displacy(transitions[0:2], tokenized_sentence)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "is_executing": false, "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "### rightArc-[label]\n", "\n", "Add labeled arc from secondmost top node of stack \\\\(i\\\\) to top of the stack \\\\(j\\\\). Pop the top of the stack.\n", "\n", "$$\n", "(S|i|j, B, A) \\rightarrow (S|i, B, A\\cup\\{(i,j,l)\\})\n", "$$\n" ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
stackbufferparseaction
ROOT saw Bob\n", "
\n", "
shift
ROOT saw\n", "
\n", "
rightArc-dobj
ROOT\n", "
\n", "
rightArc-root
" ], "text/plain": [ ".Output at 0x7fbd7a7f30f0>" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "render_transitions_displacy(transitions[4:7], tokenized_sentence)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "is_executing": false, "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "### leftArc-[label] \n", "\n", "Add labeled arc from top of stack, \\\\(j\\\\), to secondmost top node of stack, \\\\(i\\\\). Reduce the secondmost top node of the stack.\n", "\n", "$$\n", "(S|i|j, B, A) \\rightarrow (S|j, B, A\\cup\\{(j,i,l)\\})\n", "$$\n" ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
stackbufferparseaction
ROOT Alice sawBob\n", "
\n", "
shift
ROOT sawBob\n", "
\n", "
leftArc-nsubj
" ], "text/plain": [ ".Output at 0x7fbd7a7f35c0>" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "render_transitions_displacy(transitions[2:4], tokenized_sentence)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Full Example" ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "pycharm": { "name": "#%%\n" }, "scrolled": true, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
stackbufferparseaction
ROOTAlice saw Bob\n", "
\n", "
ROOT Alicesaw Bob\n", "
\n", "
shift
ROOT Alice sawBob\n", "
\n", "
shift
ROOT sawBob\n", "
\n", "
leftArc-nsubj
ROOT saw Bob\n", "
\n", "
shift
ROOT saw\n", "
\n", "
rightArc-dobj
ROOT\n", "
\n", "
rightArc-root
ROOT\n", "
\n", "
" ], "text/plain": [ ".Output at 0x7fbd7a7f33c8>" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "render_transitions_displacy(transitions[:], tokenized_sentence)\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "
\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "is_executing": false, "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "### Summary: Configuration\n", "\n", "**Configuration**:\n", "- Stack \\\\(S\\\\): a last-in, first-out memory to keep track of words to process later\n", "- Buffer \\\\(B\\\\): words not processed so far\n", "- Arcs \\\\(A\\\\): the dependency edges predicted so far\n", "\n", "We further define two special configurations:\n", "- initial: buffer is initialised to the words in the sentence, stack contains root, and arcs are empty\n", "- terminal: buffer is empty, stack contains only root" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "### Summary: Actions\n", "\n", "- shift: Push the word at the top of the buffer to the stack \\\\((S, i|B, A)\\rightarrow(S|i, B, A)\\\\)\n", "- rightArc-label: Add labeled arc from secondmost top node of stack \\\\(i\\\\) to top of the stack \\\\(j\\\\). Pop the top of the stack.\n", "- leftArc-label: Add labeled arc from top of stack, \\\\(j\\\\), to secondmost top node of stack, \\\\(i\\\\). Reduce the secondmost top node of the stack." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Syntactic Ambiguity" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
# IDFORMLEMMAUPOSXPOSFEATSHEADDEPRELDEPSMISC
1I____2nsubj__
2saw____0root__
3the____4det__
4star____2dobj__
5with____7case__
6the____7det__
7telescope____2obl__
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "
\n", " " ], "text/plain": [ "" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "conllu = \"\"\"\n", "# ID\tFORM\tLEMMA\tUPOS\tXPOS\tFEATS\tHEAD\tDEPREL\tDEPS\tMISC\n", "1\tI\t_\t_\t_\t_\t2\tnsubj\t_\t_\n", "2\tsaw\t_\t_\t_\t_\t0\troot\t_\t_\n", "3\tthe\t_\t_\t_\t_\t4\tdet\t_\t_\n", "4\tstar\t_\t_\t_\t_\t2\tdobj\t_\t_\n", "5\twith\t_\t_\t_\t_\t7\tcase\t_\t_\n", "6\tthe\t_\t_\t_\t_\t7\tdet\t_\t_\n", "7\ttelescope\t_\t_\t_\t_\t2\tobl\t_\t_\n", "\"\"\"\n", "display(HTML(pd.read_csv(StringIO(conllu), sep=\"\\t\").to_html(index=False)))\n", "arcs, tokens = to_displacy_graph(*load_arcs_tokens(conllu))\n", "render_displacy(arcs, tokens,\"900px\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "
\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Syntactic Ambiguity" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
# IDFORMLEMMAUPOSXPOSFEATSHEADDEPRELDEPSMISC
1I____2nsubj__
2saw____0root__
3the____4det__
4star____2dobj__
5with____7case__
6the____7det__
7telescope____4nmod__
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "
\n", " " ], "text/plain": [ "" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "conllu = \"\"\"\n", "# ID\tFORM\tLEMMA\tUPOS\tXPOS\tFEATS\tHEAD\tDEPREL\tDEPS\tMISC\n", "1\tI\t_\t_\t_\t_\t2\tnsubj\t_\t_\n", "2\tsaw\t_\t_\t_\t_\t0\troot\t_\t_\n", "3\tthe\t_\t_\t_\t_\t4\tdet\t_\t_\n", "4\tstar\t_\t_\t_\t_\t2\tdobj\t_\t_\n", "5\twith\t_\t_\t_\t_\t7\tcase\t_\t_\n", "6\tthe\t_\t_\t_\t_\t7\tdet\t_\t_\n", "7\ttelescope\t_\t_\t_\t_\t4\tnmod\t_\t_\n", "\"\"\"\n", "display(HTML(pd.read_csv(StringIO(conllu), sep=\"\\t\").to_html(index=False)))\n", "arcs, tokens = to_displacy_graph(*load_arcs_tokens(conllu))\n", "render_displacy(arcs, tokens,\"900px\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "
\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Learning a Transition-Based Parser\n", "\n", "* Decompose parse tree into a sequence of **actions**\n", "* Learn to score individual actions\n", " * Structured prediction problem!\n", " * Sequence labeling? Sequence-to-sequence?\n", "\n", "
\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "How to decide what action to take? \n", "\n", "* Learn a discriminative classifier $p(y | \\x)$ where \n", " * $\\x$ is a representation of buffer, stack and parse\n", " * $y$ is the action to choose\n", "* Current state-of-the-art systems use neural networks as classifiers (Bi-LSTMs, Transformers, BERT)\n", "* Use **greedy search** or **beam search** to find the highest scoring sequence of steps" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "
\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "
\n", " \n", "
\n", "\n", "
\n", " (from Hershcovich et al., 2018)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Oracle\n", "\n", "How do we get training data for the classifier?\n", "\n", "* Training data: whole trees labelled as correct\n", "* We need to design an **oracle**\n", " * function that, given a sentence and its dependency tree, recovers the sequence of actions used to construct it\n", " * can also be thought of reverse engineering a tree into a sequence of actions\n", "* An oracle does this for every possible parse tree\n", "* Oracle can also be thought of as human demonstrator teaching the parser" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Dependency Parsing Evaluation\n", "\n", "* Unlabeled Attachment Score (**UAS**): % of words with correct head\n", "* Labeled Attachment Score (**LAS**): % of words with correct head and label\n", "\n", "Always 0 $\\leq$ LAS $\\leq$ UAS $\\leq$ 100%." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Example: LAS and UAS\n", "\n", "
\n", " \n", "
\n", "\n", "
\n", " $\\mathrm{UAS}=\\frac{8}{12}=67\\%$\n", "
\n", "\n", "
\n", " $\\mathrm{LAS}=\\frac{7}{12}=58\\%$\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### State-of-the-Art in Dependency Parsing\n", "\n", "* [CoNLL 2018 Shared Task](https://universaldependencies.org/conll18/results-las.html)\n", "* [IWPT 2020 Shared Task](http://pauillac.inria.fr/~seddah/coarse_IWPT_SharedTask_unofficial_results.html)\n", "\n", "
\n", " \n", "
\n", "\n", "
\n", " (from Che et al., 2018)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### NN Parsers\n", "\n", "
\n", " \n", "
\n", "\n", "
\n", " (from Chen and Manning, 2014)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Stack LSTMs\n", "\n", "
\n", " \n", "
\n", "\n", "
\n", " (from Dyer et al., 2015)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Transition-Based Neural Networks\n", "\n", "
\n", " \n", "
\n", "\n", "
\n", " (from Kong et al., 2017)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "### mBERT for zero-shot cross-lingual parsing\n", "\n", "
\n", " \n", "
\n", "\n", "
\n", " (from Kondratyuk and Straka, 2019)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "## Beyond Dependency Parsing: Meaning Representations\n", "\n", "### https://danielhers.github.io/dikubits_20200218.pdf" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Summary\n", "\n", "* **Dependency parsing** predicts word-to-word dependencies\n", "* Simple annotations in many languages, thanks to **UD**\n", "* Fast parsing, e.g. **transition-based**\n", "* Sufficient for most **down-stream applications**\n", "* More sophisticated **meaning representations** are more informative but harder to parse" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Background Material\n", "\n", "* Arc-standard transition-based parsing system ([Nivre, 2004](https://www.aclweb.org/anthology/W04-0308))\n", "* [EACL 2014 tutorial](http://stp.lingfil.uu.se/~nivre/eacl14.html)\n", "* Jurafsky & Martin, [Speech and Language Processing (Third Edition)](https://web.stanford.edu/~jurafsky/slp3/15.pdf): Chapter 15, Dependency Parsing." ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.2" } }, "nbformat": 4, "nbformat_minor": 1 }