{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    },
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<script>\n",
       "  function code_toggle() {\n",
       "    if (code_shown){\n",
       "      $('div.input').hide('500');\n",
       "      $('#toggleButton').val('Show Code')\n",
       "    } else {\n",
       "      $('div.input').show('500');\n",
       "      $('#toggleButton').val('Hide Code')\n",
       "    }\n",
       "    code_shown = !code_shown\n",
       "  }\n",
       "\n",
       "  $( document ).ready(function(){\n",
       "    code_shown=false;\n",
       "    $('div.input').hide()\n",
       "  });\n",
       "</script>\n",
       "<form action=\"javascript:code_toggle()\"><input type=\"submit\" id=\"toggleButton\" value=\"Show Code\"></form>\n",
       "<style>\n",
       ".rendered_html td {\n",
       "    font-size: xx-large;\n",
       "    text-align: left; !important\n",
       "}\n",
       ".rendered_html th {\n",
       "    font-size: xx-large;\n",
       "    text-align: left; !important\n",
       "}\n",
       "</style>\n"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "%%html\n",
    "<script>\n",
    "  function code_toggle() {\n",
    "    if (code_shown){\n",
    "      $('div.input').hide('500');\n",
    "      $('#toggleButton').val('Show Code')\n",
    "    } else {\n",
    "      $('div.input').show('500');\n",
    "      $('#toggleButton').val('Hide Code')\n",
    "    }\n",
    "    code_shown = !code_shown\n",
    "  }\n",
    "\n",
    "  $( document ).ready(function(){\n",
    "    code_shown=false;\n",
    "    $('div.input').hide()\n",
    "  });\n",
    "</script>\n",
    "<form action=\"javascript:code_toggle()\"><input type=\"submit\" id=\"toggleButton\" value=\"Show Code\"></form>\n",
    "<style>\n",
    ".rendered_html td {\n",
    "    font-size: xx-large;\n",
    "    text-align: left; !important\n",
    "}\n",
    ".rendered_html th {\n",
    "    font-size: xx-large;\n",
    "    text-align: left; !important\n",
    "}\n",
    "</style>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    },
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "outputs": [],
   "source": [
    "%%capture\n",
    "%load_ext autoreload\n",
    "%autoreload 2\n",
    "import sys\n",
    "sys.path.append(\"..\")\n",
    "from statnlpbook.util import execute_notebook\n",
    "import statnlpbook.parsing as parsing\n",
    "from statnlpbook.transition import *\n",
    "from statnlpbook.dep import *\n",
    "import pandas as pd\n",
    "from io import StringIO\n",
    "from IPython.display import display, HTML\n",
    "\n",
    "execute_notebook('transition-based_dependency_parsing.ipynb')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "is_executing": false,
     "name": "#%% md\n"
    },
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "<!---\n",
    "Latex Macros\n",
    "-->\n",
    "$$\n",
    "\\newcommand{\\Xs}{\\mathcal{X}}\n",
    "\\newcommand{\\Ys}{\\mathcal{Y}}\n",
    "\\newcommand{\\y}{\\mathbf{y}}\n",
    "\\newcommand{\\balpha}{\\boldsymbol{\\alpha}}\n",
    "\\newcommand{\\bbeta}{\\boldsymbol{\\beta}}\n",
    "\\newcommand{\\aligns}{\\mathbf{a}}\n",
    "\\newcommand{\\align}{a}\n",
    "\\newcommand{\\source}{\\mathbf{s}}\n",
    "\\newcommand{\\target}{\\mathbf{t}}\n",
    "\\newcommand{\\ssource}{s}\n",
    "\\newcommand{\\starget}{t}\n",
    "\\newcommand{\\repr}{\\mathbf{f}}\n",
    "\\newcommand{\\repry}{\\mathbf{g}}\n",
    "\\newcommand{\\x}{\\mathbf{x}}\n",
    "\\newcommand{\\prob}{p}\n",
    "\\newcommand{\\a}{\\alpha}\n",
    "\\newcommand{\\b}{\\beta}\n",
    "\\newcommand{\\vocab}{V}\n",
    "\\newcommand{\\params}{\\boldsymbol{\\theta}}\n",
    "\\newcommand{\\param}{\\theta}\n",
    "\\DeclareMathOperator{\\perplexity}{PP}\n",
    "\\DeclareMathOperator{\\argmax}{argmax}\n",
    "\\DeclareMathOperator{\\argmin}{argmin}\n",
    "\\newcommand{\\train}{\\mathcal{D}}\n",
    "\\newcommand{\\counts}[2]{\\#_{#1}(#2) }\n",
    "\\newcommand{\\length}[1]{\\text{length}(#1) }\n",
    "\\newcommand{\\indi}{\\mathbb{I}}\n",
    "$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    },
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "outputs": [],
   "source": [
    "%load_ext tikzmagic"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    },
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Parsing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "+ Syntactic dependencies\n",
    "+ Parsing algorithms\n",
    "+ Evaluation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "# Syntactic constituency"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "## Reminder: parts of speech (POS)\n",
    "\n",
    "[Parts of speech](sequence_labeling_slides.ipynb) categorise the syntactic function of words.\n",
    "\n",
    "[Penn Treebank POS tagset](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html):\n",
    "\n",
    "Tag || Example\n",
    ":--- | :--- | :---\n",
    "CC | Coordinating conjunction | *and*\n",
    "CD | Cardinal number | *1*\n",
    "DT | Determiner | *the*\n",
    "EX | Existential there | *there*\n",
    "FW | Foreign word | *שלום*\n",
    "IN | Preposition or subordinating conjunction | *in*\n",
    "JJ | Adjective | *high*\n",
    "JJR | Adjective, comparative | *higher*\n",
    "JJS | Adjective, superlative | *highest*\n",
    "LS | List item marker | *,*\n",
    "MD | Modal | *can*\n",
    "NN | Noun, singular or mass | *desk*\n",
    "NNS | Noun, plural | *desks*\n",
    "NNP | Proper noun, singular | *Denmark*\n",
    "NNPS | Proper noun, plural | *Danes*\n",
    "PDT | Predeterminer | *both*\n",
    "POS | Possessive ending | *'s*\n",
    "PRP | Personal pronoun | *you*\n",
    "PRP$ | Possessive pronoun | *your*\n",
    "RB | Adverb | *well*\n",
    "RBR | Adverb, comparative | *better*\n",
    "RBS | Adverb, superlative | *best*\n",
    "RP | Particle |\n",
    "SYM | Symbol |\n",
    "TO | to |\n",
    "UH | Interjection |\n",
    "VB | Verb, base form | *see*\n",
    "VBD | Verb, past tense | *saw*\n",
    "VBG | Verb, gerund or present participle | *seeing*\n",
    "VBN | Verb, past participle | *seen*\n",
    "VBP | Verb, non-3rd person singular present | *see*\n",
    "VBZ | Verb, 3rd person singular present | *sees*\n",
    "WDT | Wh-determiner |\n",
    "WP | Wh-pronoun |\n",
    "WP\\$ | Possessive wh-pronoun |\n",
    "WRB | Wh-adverb |"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "## Syntactic constituents\n",
    "\n",
    "**Phrases** also have a grammatical function when they are syntactic constituents.\n",
    "\n",
    "[Penn Treebank constituent tagset](https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/penn-etb-2-style-guidelines.pdf):\n",
    "\n",
    "Phrase Level || Example\n",
    ":--- | :--- | :---\n",
    "ADJP | Adjective Phrase | *really high*\n",
    "ADVP | Adverb Phrase | *very well*\n",
    "CONJP | Conjunction Phrase | *as well as*\n",
    "FRAG | Fragment |\n",
    "INTJ | Interjection |\n",
    "LST | List marker |\n",
    "NP | Noun Phrase | *high desk*\n",
    "PP | Prepositional Phrase | *at home*\n",
    "PRN | Parenthetical |\n",
    "PRT | Particle. Category for words that should be tagged RP |\n",
    "QP | Quantifier Phrase (i.e. complex measure/amount phrase); used within NP |\n",
    "RRC | Reduced Relative Clause |\n",
    "VP | Verb Phrase | *see the desk*\n",
    "WHADJP | Wh-adjective Phrase. Adjectival phrase containing a wh-adverb | *how hot*\n",
    "WHAVP | Wh-adverb Phrase, containing a wh-adverb | *how well*\n",
    "WHNP | Wh-noun Phrase, containing some wh-word | *which book*\n",
    "WHPP | Wh-prepositional Phrase, containing a wh-noun phrase | *of which*\n",
    "X | Unknown, uncertain, or unbracketable. |"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "Clause Level ||\n",
    ":--- | :---\n",
    "S | simple declarative clause, i.e. one that is not introduced by a (possible empty) subordinating conjunction or a wh-word and that does not exhibit subject-verb inversion.\n",
    "SBAR | Clause introduced by a (possibly empty) subordinating conjunction.\n",
    "SBARQ | Direct question introduced by a wh-word or a wh-phrase. Indirect questions and relative clauses should be bracketed as SBAR, not SBARQ.\n",
    "SINV | Inverted declarative sentence, i.e. one in which the subject follows the tensed verb or modal.\n",
    "SQ | Inverted yes/no question, or main clause of a wh-question, following the wh-phrase in SBARQ."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## Trees\n",
    "\n",
    "A **tree** is a connected acyclic undirected graph.\n",
    "\n",
    "Graphs consist of **nodes** and **edges** between them.\n",
    "\n",
    "<center><img src=\"https://upload.wikimedia.org/wikipedia/commons/thumb/2/24/Tree_graph.svg/1920px-Tree_graph.svg.png\" width=40%/></center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "## Syntactic constituency trees (phrase structure trees)\n",
    "\n",
    "* **Nodes**: syntactic constituents, labeled by type (including individual words, labeled by POS).\n",
    "* **Edges**: connecting phrases to their constituents, unlabeled.\n",
    "\n",
    "<center>\n",
    "    <img src=\"../img/spaghetti1.png\" width=80%>\n",
    "</center>\n",
    "\n",
    "<div style=\"text-align: right;\">\n",
    "    (from <a href=\"https://aclanthology.org/P13-1045/\">Socher et al., 2013</a>)\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "<center>\n",
    "    <table><tr>\n",
    "    <td><img src=\"https://d3i71xaburhd42.cloudfront.net/0fea9208b2e18dd679a0e913a918bbb949eb8589/2-Figure2-1.png\" width=80%></td>\n",
    "    <td><img src=\"https://d3i71xaburhd42.cloudfront.net/0fea9208b2e18dd679a0e913a918bbb949eb8589/3-Figure5-1.png\" width=80%></td>\n",
    "        </tr></table>\n",
    "</center>\n",
    "\n",
    "<div style=\"text-align: right;\">\n",
    "    (from <a href=\"https://aclanthology.org/N18-5011\">Gokcen et al., 2018</a>)\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "Another example of a *PP attachment* problem: does the **PP** (prepositional phrase) attach to the **VP** (verbal phrase) or the **NP** (noun phrase)?\n",
    "<center>\n",
    "    <img src=\"../img/proposal.png\" width=60%>\n",
    "</center>\n",
    "\n",
    "<div style=\"text-align: right;\">\n",
    "    (from <a href=\"https://aclanthology.org/2022.acl-long.220\">Kitaev et al., 2022</a>)\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "## Constituency parsers\n",
    "\n",
    "Structured prediction: trained on treebanks to build constituency trees from text.\n",
    "\n",
    "See more in the [chapter from this book about constituency parsing](parsing.ipynb) ([slides](parsing_slides.ipynb)).\n",
    "\n",
    "<center>\n",
    "    <img src=\"https://d3i71xaburhd42.cloudfront.net/30a4754062a5f8c99e665db0b702a4da060af340/5-Figure2-1.png\" width=30%>\n",
    "</center>\n",
    "\n",
    "<div style=\"text-align: right;\">\n",
    "    (from <a href=\"https://aclanthology.org/2022.acl-long.220\">Kitaev et al., 2022</a>)\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Syntactic dependencies"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    },
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "## Motivation: information extraction\n",
    "\n",
    "In [relation extraction](information_extraction_slides.ipynb), it helps to define **linguistic** patterns such as `<subject> <verb> <object>` instead of purely text-based patterns.\n",
    "\n",
    "> <font color=\"blue\">Dechra Pharmaceuticals</font>, which has just made its second acquisition, had previously purchased <font color=\"green\">Genitrix</font>.\n",
    "\n",
    "> <font color=\"blue\">Trinity Mirror plc</font>, the largest British newspaper, purchased <font color=\"green\">Local World</font>, its rival.\n",
    "\n",
    "> <font color=\"blue\">Kraft</font>, owner of <font color=\"blue\">Milka</font>, purchased <font color=\"green\">Cadbury Dairy Milk</font> and is now gearing up for a roll-out of its new brand.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    },
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "**Syntactic dependencies** are a useful representation for this purpose.\n",
    "\n",
    "<center>\n",
    "    <img src=\"parsing_figures/dep4re.png\" width=\"60%\">\n",
    "</center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    },
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "## Motivation: question answering by reading comprehension\n",
    "\n",
    "<center>\n",
    "    <img src=\"https://d3i71xaburhd42.cloudfront.net/05dd7254b632376973f3a1b4d39485da17814df5/6-Figure4-1.png\" width=100%>\n",
    "</center>\n",
    "\n",
    "<div style=\"text-align: right;\">\n",
    "    (from <a href=\"https://aclanthology.org/D16-1264\">Rajpurkar et al., 2016</a>)\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    },
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "## Motivation: question answering from knowledge bases\n",
    "\n",
    "<center>\n",
    "    <img src=\"https://d3i71xaburhd42.cloudfront.net/faee0c81a1170402b149500f1b91c51ccaf24027/2-Figure1-1.png\" width=50%>\n",
    "</center>\n",
    "\n",
    "<div style=\"text-align: right;\">\n",
    "    (from <a href=\"https://aclanthology.org/D17-1009\">Reddy et al., 2017</a>)\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "## Motivation: machine translation\n",
    "\n",
    "Reordering rules can be stated in terms of syntactic dependencies:\n",
    "\n",
    "<center>\n",
    "    <img src=\"https://d3i71xaburhd42.cloudfront.net/f4c750cdf8f557eea3a4b76be16e99ec15f0c92b/3-Figure2-1.png\" width=100%>\n",
    "</center>\n",
    "\n",
    "<div style=\"text-align: right;\">\n",
    "    (from <a href=\"https://arxiv.org/abs/2104.08384\">Rasooli et al., 2021</a>)\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    },
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## Syntactic dependency trees\n",
    "\n",
    "* **Nodes**: individual words, and a special `ROOT` node.\n",
    "* Edges (**arcs**): labeled syntactic relations between words: from **head** to **dependent**.\n",
    "\n",
    "Must be a tree: **every word has exactly one head, and `ROOT` has no head**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    },
    "scrolled": true,
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "    <div id='displacy3' style=\"width: 2400px;\"></div>\n",
       "    <script>\n",
       "    $(function() {\n",
       "    requirejs.config({\n",
       "        paths: {\n",
       "            'displaCy': ['/files/node_modules/displacy/displacy'],\n",
       "                                                  // strip .js ^, require adds it back\n",
       "        },\n",
       "    });\n",
       "    require(['displaCy'], function() {\n",
       "        console.log(\"Loaded :)\");\n",
       "        const displacy = new displaCy('http://localhost:8000', {\n",
       "            container: '#displacy3',\n",
       "            format: 'spacy',\n",
       "            distance: 150,\n",
       "            offsetX: 0,\n",
       "            wordSpacing: 20,\n",
       "            arrowSpacing: 3,\n",
       "\n",
       "        });\n",
       "        const parse = {\n",
       "            arcs: [{\"start\": 1, \"end\": 2, \"label\": \"nsubj\", \"dir\": \"left\"}, {\"start\": 6, \"end\": 7, \"label\": \"det\", \"dir\": \"left\"}, {\"start\": 5, \"end\": 7, \"label\": \"case\", \"dir\": \"left\"}, {\"start\": 3, \"end\": 4, \"label\": \"det\", \"dir\": \"left\"}, {\"start\": 2, \"end\": 4, \"label\": \"dobj\", \"dir\": \"right\"}, {\"start\": 0, \"end\": 2, \"label\": \"root\", \"dir\": \"right\"}, {\"start\": 2, \"end\": 7, \"label\": \"obl\", \"dir\": \"right\"}],\n",
       "            words: [{\"text\": \"ROOT\"}, {\"text\": \"I\"}, {\"text\": \"saw\"}, {\"text\": \"the\"}, {\"text\": \"star\"}, {\"text\": \"with\"}, {\"text\": \"the\"}, {\"text\": \"telescope\"}]\n",
       "        };\n",
       "\n",
       "        displacy.render(parse, {\n",
       "            uniqueId: 'render_displacy3'\n",
       "            //color: '#ff0000'\n",
       "        });\n",
       "        return {};\n",
       "    });\n",
       "    });\n",
       "    </script>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "conllu = \"\"\"\n",
    "# ID\tFORM\tLEMMA\tUPOS\tXPOS\tFEATS\tHEAD\tDEPREL\tDEPS\tMISC\n",
    "1\tI\t_\t_\t_\t_\t2\tnsubj\t_\t_\n",
    "2\tsaw\t_\t_\t_\t_\t0\troot\t_\t_\n",
    "3\tthe\t_\t_\t_\t_\t4\tdet\t_\t_\n",
    "4\tstar\t_\t_\t_\t_\t2\tdobj\t_\t_\n",
    "5\twith\t_\t_\t_\t_\t7\tcase\t_\t_\n",
    "6\tthe\t_\t_\t_\t_\t7\tdet\t_\t_\n",
    "7\ttelescope\t_\t_\t_\t_\t2\tobl\t_\t_\n",
    "\"\"\"\n",
    "arcs, tokens = to_displacy_graph(*load_arcs_tokens(conllu))\n",
    "render_displacy(arcs, tokens,\"2400px\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "### Syntactic ambiguity"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    },
    "scrolled": true,
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "    <div id='displacy4' style=\"width: 2400px;\"></div>\n",
       "    <script>\n",
       "    $(function() {\n",
       "    requirejs.config({\n",
       "        paths: {\n",
       "            'displaCy': ['/files/node_modules/displacy/displacy'],\n",
       "                                                  // strip .js ^, require adds it back\n",
       "        },\n",
       "    });\n",
       "    require(['displaCy'], function() {\n",
       "        console.log(\"Loaded :)\");\n",
       "        const displacy = new displaCy('http://localhost:8000', {\n",
       "            container: '#displacy4',\n",
       "            format: 'spacy',\n",
       "            distance: 150,\n",
       "            offsetX: 0,\n",
       "            wordSpacing: 20,\n",
       "            arrowSpacing: 3,\n",
       "\n",
       "        });\n",
       "        const parse = {\n",
       "            arcs: [{\"start\": 1, \"end\": 2, \"label\": \"nsubj\", \"dir\": \"left\"}, {\"start\": 6, \"end\": 7, \"label\": \"det\", \"dir\": \"left\"}, {\"start\": 5, \"end\": 7, \"label\": \"case\", \"dir\": \"left\"}, {\"start\": 3, \"end\": 4, \"label\": \"det\", \"dir\": \"left\"}, {\"start\": 2, \"end\": 4, \"label\": \"dobj\", \"dir\": \"right\"}, {\"start\": 0, \"end\": 2, \"label\": \"root\", \"dir\": \"right\"}, {\"start\": 2, \"end\": 7, \"label\": \"obl\", \"dir\": \"right\"}],\n",
       "            words: [{\"text\": \"ROOT\"}, {\"text\": \"I\"}, {\"text\": \"saw\"}, {\"text\": \"the\"}, {\"text\": \"star\"}, {\"text\": \"with\"}, {\"text\": \"the\"}, {\"text\": \"telescope\"}]\n",
       "        };\n",
       "\n",
       "        displacy.render(parse, {\n",
       "            uniqueId: 'render_displacy4'\n",
       "            //color: '#ff0000'\n",
       "        });\n",
       "        return {};\n",
       "    });\n",
       "    });\n",
       "    </script>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "conllu = \"\"\"\n",
    "# ID\tFORM\tLEMMA\tUPOS\tXPOS\tFEATS\tHEAD\tDEPREL\tDEPS\tMISC\n",
    "1\tI\t_\t_\t_\t_\t2\tnsubj\t_\t_\n",
    "2\tsaw\t_\t_\t_\t_\t0\troot\t_\t_\n",
    "3\tthe\t_\t_\t_\t_\t4\tdet\t_\t_\n",
    "4\tstar\t_\t_\t_\t_\t2\tdobj\t_\t_\n",
    "5\twith\t_\t_\t_\t_\t7\tcase\t_\t_\n",
    "6\tthe\t_\t_\t_\t_\t7\tdet\t_\t_\n",
    "7\ttelescope\t_\t_\t_\t_\t2\tobl\t_\t_\n",
    "\"\"\"\n",
    "arcs, tokens = to_displacy_graph(*load_arcs_tokens(conllu))\n",
    "render_displacy(arcs, tokens,\"2400px\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "    <div id='displacy5' style=\"width: 2400px;\"></div>\n",
       "    <script>\n",
       "    $(function() {\n",
       "    requirejs.config({\n",
       "        paths: {\n",
       "            'displaCy': ['/files/node_modules/displacy/displacy'],\n",
       "                                                  // strip .js ^, require adds it back\n",
       "        },\n",
       "    });\n",
       "    require(['displaCy'], function() {\n",
       "        console.log(\"Loaded :)\");\n",
       "        const displacy = new displaCy('http://localhost:8000', {\n",
       "            container: '#displacy5',\n",
       "            format: 'spacy',\n",
       "            distance: 150,\n",
       "            offsetX: 0,\n",
       "            wordSpacing: 20,\n",
       "            arrowSpacing: 3,\n",
       "\n",
       "        });\n",
       "        const parse = {\n",
       "            arcs: [{\"start\": 1, \"end\": 2, \"label\": \"nsubj\", \"dir\": \"left\"}, {\"start\": 6, \"end\": 7, \"label\": \"det\", \"dir\": \"left\"}, {\"start\": 5, \"end\": 7, \"label\": \"case\", \"dir\": \"left\"}, {\"start\": 3, \"end\": 4, \"label\": \"det\", \"dir\": \"left\"}, {\"start\": 2, \"end\": 4, \"label\": \"dobj\", \"dir\": \"right\"}, {\"start\": 0, \"end\": 2, \"label\": \"root\", \"dir\": \"right\"}, {\"start\": 4, \"end\": 7, \"label\": \"nmod\", \"dir\": \"right\"}],\n",
       "            words: [{\"text\": \"ROOT\"}, {\"text\": \"I\"}, {\"text\": \"saw\"}, {\"text\": \"the\"}, {\"text\": \"star\"}, {\"text\": \"with\"}, {\"text\": \"the\"}, {\"text\": \"telescope\"}]\n",
       "        };\n",
       "\n",
       "        displacy.render(parse, {\n",
       "            uniqueId: 'render_displacy5'\n",
       "            //color: '#ff0000'\n",
       "        });\n",
       "        return {};\n",
       "    });\n",
       "    });\n",
       "    </script>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "conllu = \"\"\"\n",
    "# ID\tFORM\tLEMMA\tUPOS\tXPOS\tFEATS\tHEAD\tDEPREL\tDEPS\tMISC\n",
    "1\tI\t_\t_\t_\t_\t2\tnsubj\t_\t_\n",
    "2\tsaw\t_\t_\t_\t_\t0\troot\t_\t_\n",
    "3\tthe\t_\t_\t_\t_\t4\tdet\t_\t_\n",
    "4\tstar\t_\t_\t_\t_\t2\tdobj\t_\t_\n",
    "5\twith\t_\t_\t_\t_\t7\tcase\t_\t_\n",
    "6\tthe\t_\t_\t_\t_\t7\tdet\t_\t_\n",
    "7\ttelescope\t_\t_\t_\t_\t4\tnmod\t_\t_\n",
    "\"\"\"\n",
    "arcs, tokens = to_displacy_graph(*load_arcs_tokens(conllu))\n",
    "render_displacy(arcs, tokens,\"2400px\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "<center>\n",
    "    <table>\n",
    "        <tr>\n",
    "            <td><img src=\"parsing_figures/telescope1.jpeg\" width=50%/></td>\n",
    "            <td><img src=\"parsing_figures/telescope2.jpg\" width=50%/></td>\n",
    "        </tr>\n",
    "    </table>\n",
    "</center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## Treebanks\n",
    "\n",
    "A dataset that consists of a text corpus with annotated (syntactic) trees.\n",
    "\n",
    "Some commonly used treebanks:\n",
    "\n",
    "* English: *Penn Treebank* (4.8 million words)\n",
    "* Mandarin Chinese: *Chinese Treebank* (1.5 million words)\n",
    "* German: *TIGER* (0.9 million words), *TüBa-D/Z* (1.6 million words)\n",
    "* Czech: *Prague Dependency Treebank* (2 million words)\n",
    "* Danish: *Arboretum* (0.2 million words)\n",
    "* ...\n",
    "* Multilingual: *Universal Dependencies* (more in a few slides)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "### CoNLL-U format\n",
    "\n",
    "Tabular format with 10 columns indicating various morphosyntactic attributes.\n",
    "\n",
    "Shown here: ID, surface form, dependency head and dependency relation.\n",
    "\n",
    "(The others are shown as `_` but normally they would be filled in too.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th># ID</th>\n",
       "      <th>FORM</th>\n",
       "      <th>LEMMA</th>\n",
       "      <th>UPOS</th>\n",
       "      <th>XPOS</th>\n",
       "      <th>FEATS</th>\n",
       "      <th>HEAD</th>\n",
       "      <th>DEPREL</th>\n",
       "      <th>DEPS</th>\n",
       "      <th>MISC</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>I</td>\n",
       "      <td>_</td>\n",
       "      <td>_</td>\n",
       "      <td>_</td>\n",
       "      <td>_</td>\n",
       "      <td>2</td>\n",
       "      <td>nsubj</td>\n",
       "      <td>_</td>\n",
       "      <td>_</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2</td>\n",
       "      <td>saw</td>\n",
       "      <td>_</td>\n",
       "      <td>_</td>\n",
       "      <td>_</td>\n",
       "      <td>_</td>\n",
       "      <td>0</td>\n",
       "      <td>root</td>\n",
       "      <td>_</td>\n",
       "      <td>_</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>3</td>\n",
       "      <td>the</td>\n",
       "      <td>_</td>\n",
       "      <td>_</td>\n",
       "      <td>_</td>\n",
       "      <td>_</td>\n",
       "      <td>4</td>\n",
       "      <td>det</td>\n",
       "      <td>_</td>\n",
       "      <td>_</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>4</td>\n",
       "      <td>star</td>\n",
       "      <td>_</td>\n",
       "      <td>_</td>\n",
       "      <td>_</td>\n",
       "      <td>_</td>\n",
       "      <td>2</td>\n",
       "      <td>dobj</td>\n",
       "      <td>_</td>\n",
       "      <td>_</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>5</td>\n",
       "      <td>with</td>\n",
       "      <td>_</td>\n",
       "      <td>_</td>\n",
       "      <td>_</td>\n",
       "      <td>_</td>\n",
       "      <td>7</td>\n",
       "      <td>case</td>\n",
       "      <td>_</td>\n",
       "      <td>_</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>6</td>\n",
       "      <td>the</td>\n",
       "      <td>_</td>\n",
       "      <td>_</td>\n",
       "      <td>_</td>\n",
       "      <td>_</td>\n",
       "      <td>7</td>\n",
       "      <td>det</td>\n",
       "      <td>_</td>\n",
       "      <td>_</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>7</td>\n",
       "      <td>telescope</td>\n",
       "      <td>_</td>\n",
       "      <td>_</td>\n",
       "      <td>_</td>\n",
       "      <td>_</td>\n",
       "      <td>4</td>\n",
       "      <td>nmod</td>\n",
       "      <td>_</td>\n",
       "      <td>_</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "\n",
       "    <div id='displacy6' style=\"width: 2400px;\"></div>\n",
       "    <script>\n",
       "    $(function() {\n",
       "    requirejs.config({\n",
       "        paths: {\n",
       "            'displaCy': ['/files/node_modules/displacy/displacy'],\n",
       "                                                  // strip .js ^, require adds it back\n",
       "        },\n",
       "    });\n",
       "    require(['displaCy'], function() {\n",
       "        console.log(\"Loaded :)\");\n",
       "        const displacy = new displaCy('http://localhost:8000', {\n",
       "            container: '#displacy6',\n",
       "            format: 'spacy',\n",
       "            distance: 150,\n",
       "            offsetX: 0,\n",
       "            wordSpacing: 20,\n",
       "            arrowSpacing: 3,\n",
       "\n",
       "        });\n",
       "        const parse = {\n",
       "            arcs: [{\"start\": 1, \"end\": 2, \"label\": \"nsubj\", \"dir\": \"left\"}, {\"start\": 6, \"end\": 7, \"label\": \"det\", \"dir\": \"left\"}, {\"start\": 5, \"end\": 7, \"label\": \"case\", \"dir\": \"left\"}, {\"start\": 3, \"end\": 4, \"label\": \"det\", \"dir\": \"left\"}, {\"start\": 2, \"end\": 4, \"label\": \"dobj\", \"dir\": \"right\"}, {\"start\": 0, \"end\": 2, \"label\": \"root\", \"dir\": \"right\"}, {\"start\": 4, \"end\": 7, \"label\": \"nmod\", \"dir\": \"right\"}],\n",
       "            words: [{\"text\": \"ROOT\"}, {\"text\": \"I\"}, {\"text\": \"saw\"}, {\"text\": \"the\"}, {\"text\": \"star\"}, {\"text\": \"with\"}, {\"text\": \"the\"}, {\"text\": \"telescope\"}]\n",
       "        };\n",
       "\n",
       "        displacy.render(parse, {\n",
       "            uniqueId: 'render_displacy6'\n",
       "            //color: '#ff0000'\n",
       "        });\n",
       "        return {};\n",
       "    });\n",
       "    });\n",
       "    </script>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "display(HTML(pd.read_csv(StringIO(conllu), sep=\"\\t\").to_html(index=False)))\n",
    "render_displacy(arcs, tokens,\"2400px\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    },
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "### Need for universal syntactic annotation\n",
    "\n",
    "How to define the relation labels? There are different linguistic traditions in different languages...\n",
    "<center>\n",
    "    <img src=\"../img/ud.png\" width=60%>\n",
    "</center>\n",
    "\n",
    "<div style=\"text-align: right;\">\n",
    "    (from <a href=\"https://cl.lingfil.uu.se/~miryam/assets/pdf/thesis.pdf\">de Lhoneux, 2019</a>)\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    },
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "### Universal Dependencies\n",
    "\n",
    "* Annotation framework featuring [37 syntactic relations](https://universaldependencies.org/u/dep/all.html)\n",
    "* [Treebanks](http://universaldependencies.org/) in 161 languages\n",
    "* Cross-linguistically consistent annotation of typologically diverse languages ([de Marneffe et al., 2021](https://aclanthology.org/2021.cl-2.11/))\n",
    "\n",
    "<center><img src=\"../img/ud_treebanks.png\" width=70%/></center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    },
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "### UD dependency relations\n",
    "\n",
    "<table border=\"1\">\n",
    "  <tr style=\"background-color:cornflowerblue; font-size: x-large; text-align: left;\">\n",
    "      <td style=\"text-align: left;\"> </td>\n",
    "      <td style=\"text-align: left;\"> Nominals </td>\n",
    "      <td style=\"text-align: left;\"> Clauses </td>\n",
    "      <td style=\"text-align: left;\"> Modifier words </td>\n",
    "      <td style=\"text-align: left;\"> Function Words </td>\n",
    "  </tr>\n",
    "  <tr style=\"font-size: x-large; text-align: left;\">\n",
    "      <td style=\"background-color:darkseagreen\">\n",
    "\tCore arguments\n",
    "      </td>\n",
    "      <td style=\"text-align: left;\">\n",
    "\t    <a href=\"https://universaldependencies.org/u/dep/nsubj.html\" title=\"u-dep nsubj\">nsubj</a><br>\n",
    "\t    <a href=\"https://universaldependencies.org/u/dep/obj.html\" title=\"u-dep obj\">obj</a><br>\n",
    "\t    <a href=\"https://universaldependencies.org/u/dep/iobj.html\" title=\"u-dep iobj\">iobj</a>\n",
    "      </td>\n",
    "      <td style=\"text-align: left;\">\n",
    "\t    <a href=\"https://universaldependencies.org/u/dep/csubj.html\" title=\"u-dep csubj\">csubj</a><br>\n",
    "\t    <a href=\"https://universaldependencies.org/u/dep/ccomp.html\" title=\"u-dep ccomp\">ccomp</a><br>\n",
    "\t    <a href=\"https://universaldependencies.org/u/dep/xcomp.html\" title=\"u-dep xcomp\">xcomp</a>\n",
    "      </td>\n",
    "\t  <td style=\"text-align: left;\"></td><td style=\"text-align: left;\"></td>\n",
    "  </tr>\n",
    "  <tr style=\"font-size: x-large; text-align: left;\">\n",
    "      <td style=\"background-color:darkseagreen;\">\n",
    "\tNon-core dependents\n",
    "      </td>\n",
    "      <td style=\"text-align: left;\">\n",
    "\t    <a href=\"https://universaldependencies.org/u/dep/obl.html\" title=\"u-dep obl\">obl</a><br>\n",
    "\t    <a href=\"https://universaldependencies.org/u/dep/vocative.html\" title=\"u-dep vocative\">vocative</a><br>\n",
    "\t    <a href=\"https://universaldependencies.org/u/dep/expl.html\" title=\"u-dep expl\">expl</a><br>\n",
    "\t    <a href=\"https://universaldependencies.org/u/dep/dislocated.html\" title=\"u-dep dislocated\">dislocated</a>\n",
    "      </td>\n",
    "      <td style=\"text-align: left;\">\n",
    "\t    <a href=\"https://universaldependencies.org/u/dep/advcl.html\" title=\"u-dep advcl\">advcl</a>\n",
    "      </td>\n",
    "      <td style=\"text-align: left;\">\n",
    "\t    <a href=\"https://universaldependencies.org/u/dep/advmod.html\" title=\"u-dep advmod\">advmod</a><br>\n",
    "\t    <a href=\"https://universaldependencies.org/u/dep/discourse.html\" title=\"u-dep discourse\">discourse</a>\n",
    "      </td>\n",
    "      <td style=\"text-align: left;\">\n",
    "\t    <a href=\"https://universaldependencies.org/u/dep/aux_.html\" title=\"u-dep aux\">aux</a><br>\n",
    "\t    <a href=\"https://universaldependencies.org/u/dep/cop.html\" title=\"u-dep cop\">cop</a><br>\n",
    "\t    <a href=\"https://universaldependencies.org/u/dep/mark.html\" title=\"u-dep mark\">mark</a>\n",
    "      </td>\n",
    "  </tr>\n",
    "  <tr style=\"font-size: x-large; text-align: left;\">\n",
    "      <td style=\"background-color:darkseagreen\">\n",
    "\tNominal dependents\n",
    "      </td>\n",
    "      <td style=\"text-align: left;\">\n",
    "\t    <a href=\"https://universaldependencies.org/u/dep/nmod.html\" title=\"u-dep nmod\">nmod</a><br>\n",
    "\t    <a href=\"https://universaldependencies.org/u/dep/appos.html\" title=\"u-dep appos\">appos</a><br>\n",
    "\t    <a href=\"https://universaldependencies.org/u/dep/nummod.html\" title=\"u-dep nummod\">nummod</a>\n",
    "      </td>\n",
    "      <td style=\"text-align: left;\">\n",
    "\t    <a href=\"https://universaldependencies.org/u/dep/acl.html\" title=\"u-dep acl\">acl</a>\n",
    "      </td>\n",
    "      <td style=\"text-align: left;\">\n",
    "\t    <a href=\"https://universaldependencies.org/u/dep/amod.html\" title=\"u-dep amod\">amod</a>\n",
    "      </td>\n",
    "      <td style=\"text-align: left;\">\n",
    "\t    <a href=\"https://universaldependencies.org/u/dep/det.html\" title=\"u-dep det\">det</a><br>\n",
    "\t    <a href=\"https://universaldependencies.org/u/dep/clf.html\" title=\"u-dep clf\">clf</a><br>\n",
    "\t    <a href=\"https://universaldependencies.org/u/dep/case.html\" title=\"u-dep case\">case</a>\n",
    "      </td>\n",
    "  </tr style=\"font-size: x-large; text-align: left;\">\n",
    "  <tr style=\"background-color:cornflowerblue; font-size: x-large; text-align: left;\">\n",
    "      <td style=\"text-align: left;\"> Coordination </td>\n",
    "      <td style=\"text-align: left;\"> MWE </td>\n",
    "      <td style=\"text-align: left;\"> Loose </td>\n",
    "      <td style=\"text-align: left;\"> Special </td>\n",
    "      <td style=\"text-align: left;\"> Other </td>\n",
    "  </tr>\n",
    "  <tr style=\"font-size: x-large; text-align: left;\">\n",
    "      <td style=\"text-align: left;\">\n",
    "\t    <a href=\"https://universaldependencies.org/u/dep/conj.html\" title=\"u-dep conj\">conj</a><br>\n",
    "\t    <a href=\"https://universaldependencies.org/u/dep/cc.html\" title=\"u-dep cc\">cc</a>\n",
    "      </td>\n",
    "      <td style=\"text-align: left;\">\n",
    "\t  <a href=\"https://universaldependencies.org/u/dep/fixed.html\" title=\"u-dep fixed\">fixed</a><br>\n",
    "\t  <a href=\"https://universaldependencies.org/u/dep/flat.html\" title=\"u-dep flat\">flat</a><br>\n",
    "\t  <a href=\"https://universaldependencies.org/u/dep/compound.html\" title=\"u-dep compound\">compound</a>\n",
    "    </td>\n",
    "    <td style=\"text-align: left;\">\n",
    "\t  <a href=\"https://universaldependencies.org/u/dep/list.html\" title=\"u-dep list\">list</a><br>\n",
    "\t  <a href=\"https://universaldependencies.org/u/dep/parataxis.html\" title=\"u-dep parataxis\">parataxis</a>\n",
    "    </td>\n",
    "    <td style=\"text-align: left;\">\n",
    "\t  <a href=\"https://universaldependencies.org/u/dep/orphan.html\" title=\"u-dep orphan\">orphan</a><br>\n",
    "\t  <a href=\"https://universaldependencies.org/u/dep/goeswith.html\" title=\"u-dep goeswith\">goeswith</a><br>\n",
    "\t  <a href=\"https://universaldependencies.org/u/dep/reparandum.html\" title=\"u-dep reparandum\">reparandum</a>\n",
    "    </td>\n",
    "    <td style=\"text-align: left;\">\n",
    "\t  <a href=\"https://universaldependencies.org/u/dep/punct.html\" title=\"u-dep punct\">punct</a><br>\n",
    "\t  <a href=\"https://universaldependencies.org/u/dep/root.html\" title=\"u-dep root\">root</a><br>\n",
    "\t  <a href=\"https://universaldependencies.org/u/dep/dep.html\" title=\"u-dep dep\">dep</a>\n",
    "    </td>\n",
    "  </tr>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    },
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## Beyond dependency trees\n",
    "\n",
    "UD also includes other morphosyntactic annotation:\n",
    "\n",
    "* Tokenisation and word segmentation\n",
    "* Morphological features (e.g., lemmas, case, gender)\n",
    "* **Universal part of speech tags (UPOS)**: coarse abstraction over language-specific POS tags (XPOS).\n",
    "\n",
    "<table class=\"typeindex\">\n",
    "  <thead>\n",
    "    <tr style=\"font-size: x-large; text-align: left;\">\n",
    "      <th>Open class words</th>\n",
    "      <th>Closed class words</th>\n",
    "      <th>Other</th>\n",
    "    </tr>\n",
    "  </thead>\n",
    "  <tbody>\n",
    "    <tr style=\"font-size: x-large; text-align: left;\">\n",
    "      <td style=\"text-align: left;\"><a href=\"https://universaldependencies.org/u/pos/ADJ.html\" class=\"doclink doclabel\" title=\"u-pos ADJ\">ADJ</a></td>\n",
    "      <td style=\"text-align: left;\"><a href=\"https://universaldependencies.org/u/pos/ADP.html\" class=\"doclink doclabel\" title=\"u-pos ADP\">ADP</a></td>\n",
    "      <td style=\"text-align: left;\"><a href=\"https://universaldependencies.org/u/pos/PUNCT.html\" class=\"doclink doclabel\" title=\"u-pos PUNCT\">PUNCT</a></td>\n",
    "    </tr>\n",
    "    <tr style=\"font-size: x-large; text-align: left;\">\n",
    "      <td style=\"text-align: left;\"><a href=\"https://universaldependencies.org/u/pos/ADV.html\" class=\"doclink doclabel\" title=\"u-pos ADV\">ADV</a></td>\n",
    "      <td style=\"text-align: left;\"><a href=\"https://universaldependencies.org/u/pos/AUX_.html\" class=\"doclink doclabel\" title=\"u-pos AUX\">AUX</a></td>\n",
    "      <td style=\"text-align: left;\"><a href=\"https://universaldependencies.org/u/pos/SYM.html\" class=\"doclink doclabel\" title=\"u-pos SYM\">SYM</a></td>\n",
    "    </tr>\n",
    "    <tr style=\"font-size: x-large; text-align: left;\">\n",
    "      <td style=\"text-align: left;\"><a href=\"https://universaldependencies.org/u/pos/INTJ.html\" class=\"doclink doclabel\" title=\"u-pos INTJ\">INTJ</a></td>\n",
    "      <td style=\"text-align: left;\"><a href=\"https://universaldependencies.org/u/pos/CCONJ.html\" class=\"doclink doclabel\" title=\"u-pos CCONJ\">CCONJ</a></td>\n",
    "      <td style=\"text-align: left;\"><a href=\"https://universaldependencies.org/u/pos/X.html\" class=\"doclink doclabel\" title=\"u-pos X\">X</a></td>\n",
    "    </tr>\n",
    "    <tr style=\"font-size: x-large; text-align: left;\">\n",
    "      <td style=\"text-align: left;\"><a href=\"https://universaldependencies.org/u/pos/NOUN.html\" class=\"doclink doclabel\" title=\"u-pos NOUN\">NOUN</a></td>\n",
    "      <td style=\"text-align: left;\"><a href=\"https://universaldependencies.org/u/pos/DET.html\" class=\"doclink doclabel\" title=\"u-pos DET\">DET</a></td>\n",
    "      <td style=\"text-align: left;\">&nbsp;</td>\n",
    "    </tr>\n",
    "    <tr style=\"font-size: x-large; text-align: left;\">\n",
    "      <td style=\"text-align: left;\"><a href=\"https://universaldependencies.org/u/pos/PROPN.html\" class=\"doclink doclabel\" title=\"u-pos PROPN\">PROPN</a></td>\n",
    "      <td style=\"text-align: left;\"><a href=\"https://universaldependencies.org/u/pos/NUM.html\" class=\"doclink doclabel\" title=\"u-pos NUM\">NUM</a></td>\n",
    "      <td style=\"text-align: left;\">&nbsp;</td>\n",
    "    </tr>\n",
    "    <tr style=\"font-size: x-large; text-align: left;\">\n",
    "      <td style=\"text-align: left;\"><a href=\"https://universaldependencies.org/u/pos/VERB.html\" class=\"doclink doclabel\" title=\"u-pos VERB\">VERB</a></td>\n",
    "      <td style=\"text-align: left;\"><a href=\"https://universaldependencies.org/u/pos/PART.html\" class=\"doclink doclabel\" title=\"u-pos PART\">PART</a></td>\n",
    "      <td style=\"text-align: left;\">&nbsp;</td>\n",
    "    </tr>\n",
    "    <tr style=\"font-size: x-large; text-align: left;\">\n",
    "      <td style=\"text-align: left;\">&nbsp;</td>\n",
    "      <td style=\"text-align: left;\"><a href=\"https://universaldependencies.org/u/pos/PRON.html\" class=\"doclink doclabel\" title=\"u-pos PRON\">PRON</a></td>\n",
    "      <td style=\"text-align: left;\">&nbsp;</td>\n",
    "    </tr>\n",
    "    <tr style=\"font-size: x-large; text-align: left;\">\n",
    "      <td style=\"text-align: left;\">&nbsp;</td>\n",
    "      <td style=\"text-align: left;\"><a href=\"https://universaldependencies.org/u/pos/SCONJ.html\" class=\"doclink doclabel\" title=\"u-pos SCONJ\">SCONJ</a></td>\n",
    "      <td style=\"text-align: left;\">&nbsp;</td>\n",
    "    </tr>\n",
    "  </tbody>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    },
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "### Danish UD example\n",
    "*the big fish ate the small fish*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    },
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "    <div id='displacy7' style=\"width: 1400px;\"></div>\n",
       "    <script>\n",
       "    $(function() {\n",
       "    requirejs.config({\n",
       "        paths: {\n",
       "            'displaCy': ['/files/node_modules/displacy/displacy'],\n",
       "                                                  // strip .js ^, require adds it back\n",
       "        },\n",
       "    });\n",
       "    require(['displaCy'], function() {\n",
       "        console.log(\"Loaded :)\");\n",
       "        const displacy = new displaCy('http://localhost:8000', {\n",
       "            container: '#displacy7',\n",
       "            format: 'spacy',\n",
       "            distance: 150,\n",
       "            offsetX: 0,\n",
       "            wordSpacing: 20,\n",
       "            arrowSpacing: 3,\n",
       "\n",
       "        });\n",
       "        const parse = {\n",
       "            arcs: [{\"start\": 2, \"end\": 3, \"label\": \"amod\", \"dir\": \"left\"}, {\"start\": 3, \"end\": 4, \"label\": \"nsubj\", \"dir\": \"left\"}, {\"start\": 6, \"end\": 7, \"label\": \"amod\", \"dir\": \"left\"}, {\"start\": 4, \"end\": 7, \"label\": \"obj\", \"dir\": \"right\"}, {\"start\": 0, \"end\": 4, \"label\": \"root\", \"dir\": \"right\"}, {\"start\": 1, \"end\": 3, \"label\": \"det\", \"dir\": \"left\"}, {\"start\": 5, \"end\": 7, \"label\": \"det\", \"dir\": \"left\"}],\n",
       "            words: [{\"text\": \"ROOT\"}, {\"text\": \"Den\"}, {\"text\": \"store\"}, {\"text\": \"fisk\"}, {\"text\": \"spiste\"}, {\"text\": \"den\"}, {\"text\": \"lille\"}, {\"text\": \"fisk\"}]\n",
       "        };\n",
       "\n",
       "        displacy.render(parse, {\n",
       "            uniqueId: 'render_displacy7'\n",
       "            //color: '#ff0000'\n",
       "        });\n",
       "        return {};\n",
       "    });\n",
       "    });\n",
       "    </script>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "conllu = \"\"\"\n",
    "# ID\tFORM\tLEMMA\tUPOS\tXPOS\tFEATS\tHEAD\tDEPREL\tDEPS\tMISC\n",
    "1\tDen\t_\t_\t_\t_\t3\tdet\t_\t_\n",
    "2\tstore\t_\t_\t_\t_\t3\tamod\t_\t_\n",
    "3\tfisk\t_\t_\t_\t_\t4\tnsubj\t_\t_\n",
    "4\tspiste\t_\t_\t_\t_\t0\troot\t_\t_\n",
    "5\tden\t_\t_\t_\t_\t7\tdet\t_\t_\n",
    "6\tlille\t_\t_\t_\t_\t7\tamod\t_\t_\n",
    "7\tfisk\t_\t_\t_\t_\t4\tobj\t_\t_\n",
    "\"\"\"\n",
    "arcs, tokens = to_displacy_graph(*load_arcs_tokens(conllu))\n",
    "render_displacy(arcs, tokens,\"1400px\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "### Korean UD example\n",
    "*big fish small fish ate*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    },
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "    <div id='displacy8' style=\"width: 1400px;\"></div>\n",
       "    <script>\n",
       "    $(function() {\n",
       "    requirejs.config({\n",
       "        paths: {\n",
       "            'displaCy': ['/files/node_modules/displacy/displacy'],\n",
       "                                                  // strip .js ^, require adds it back\n",
       "        },\n",
       "    });\n",
       "    require(['displaCy'], function() {\n",
       "        console.log(\"Loaded :)\");\n",
       "        const displacy = new displaCy('http://localhost:8000', {\n",
       "            container: '#displacy8',\n",
       "            format: 'spacy',\n",
       "            distance: 150,\n",
       "            offsetX: 0,\n",
       "            wordSpacing: 20,\n",
       "            arrowSpacing: 3,\n",
       "\n",
       "        });\n",
       "        const parse = {\n",
       "            arcs: [{\"start\": 1, \"end\": 2, \"label\": \"amod\", \"dir\": \"left\"}, {\"start\": 0, \"end\": 5, \"label\": \"root\", \"dir\": \"right\"}, {\"start\": 2, \"end\": 5, \"label\": \"nsubj\", \"dir\": \"left\"}, {\"start\": 4, \"end\": 5, \"label\": \"obj\", \"dir\": \"left\"}, {\"start\": 3, \"end\": 4, \"label\": \"amod\", \"dir\": \"left\"}],\n",
       "            words: [{\"text\": \"ROOT\"}, {\"text\": \"\\ud070\"}, {\"text\": \"\\ubb3c\\uace0\\uae30\\uac00\"}, {\"text\": \"\\uc791\\uc740\"}, {\"text\": \"\\ubb3c\\uace0\\uae30\\ub97c\"}, {\"text\": \"\\uba39\\uc5c8\\ub2e4\"}]\n",
       "        };\n",
       "\n",
       "        displacy.render(parse, {\n",
       "            uniqueId: 'render_displacy8'\n",
       "            //color: '#ff0000'\n",
       "        });\n",
       "        return {};\n",
       "    });\n",
       "    });\n",
       "    </script>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "conllu = \"\"\"\n",
    "# ID\tFORM\tLEMMA\tUPOS\tXPOS\tFEATS\tHEAD\tDEPREL\tDEPS\tMISC\n",
    "1\t큰\t_\t_\t_\t_\t2\tamod\t_\t_\n",
    "2\t물고기가\t_\t_\t_\t_\t5\tnsubj\t_\t_\n",
    "3\t작은\t_\t_\t_\t_\t4\tamod\t_\t_\n",
    "4\t물고기를\t_\t_\t_\t_\t5\tobj\t_\t_\n",
    "5\t먹었다\t_\t_\t_\t_\t0\troot\t_\t_\n",
    "\"\"\"\n",
    "arcs, tokens = to_displacy_graph(*load_arcs_tokens(conllu))\n",
    "render_displacy(arcs, tokens,\"1400px\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    },
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Dependency parsing\n",
    "\n",
    "Task:\n",
    "* Predict **head** and **relation** for each word.\n",
    "* Classification? Sequence tagging? Sequence-to-sequence? Span selection? Or something else?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    },
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th># ID</th>\n",
       "      <th>FORM</th>\n",
       "      <th>LEMMA</th>\n",
       "      <th>UPOS</th>\n",
       "      <th>XPOS</th>\n",
       "      <th>FEATS</th>\n",
       "      <th>HEAD</th>\n",
       "      <th>DEPREL</th>\n",
       "      <th>DEPS</th>\n",
       "      <th>MISC</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>Alice</td>\n",
       "      <td>_</td>\n",
       "      <td>_</td>\n",
       "      <td>_</td>\n",
       "      <td>_</td>\n",
       "      <td>2</td>\n",
       "      <td>nsubj</td>\n",
       "      <td>_</td>\n",
       "      <td>_</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2</td>\n",
       "      <td>saw</td>\n",
       "      <td>_</td>\n",
       "      <td>_</td>\n",
       "      <td>_</td>\n",
       "      <td>_</td>\n",
       "      <td>0</td>\n",
       "      <td>root</td>\n",
       "      <td>_</td>\n",
       "      <td>_</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>3</td>\n",
       "      <td>Bob</td>\n",
       "      <td>_</td>\n",
       "      <td>_</td>\n",
       "      <td>_</td>\n",
       "      <td>_</td>\n",
       "      <td>2</td>\n",
       "      <td>dobj</td>\n",
       "      <td>_</td>\n",
       "      <td>_</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "conllu = \"\"\"\n",
    "# ID\tFORM\tLEMMA\tUPOS\tXPOS\tFEATS\tHEAD\tDEPREL\tDEPS\tMISC\n",
    "1\tAlice\t_\t_\t_\t_\t2\tnsubj\t_\t_\n",
    "2\tsaw\t_\t_\t_\t_\t0\troot\t_\t_\n",
    "3\tBob\t_\t_\t_\t_\t2\tdobj\t_\t_\n",
    "\"\"\"\n",
    "display(HTML(pd.read_csv(StringIO(conllu), sep=\"\\t\").to_html(index=False)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    },
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "## Dependency parsing approaches\n",
    "\n",
    "* **Graph-based**: score all possible word pairs, find best combination (often a maximum spanning tree). Examples:\n",
    "  * [UDPipe](https://lindat.mff.cuni.cz/services/udpipe/run.php?model=english-ewt-ud-2.10-220711&data=Kraft,%20owner%20of%20Milka,%20purchased%20Cadbury%20Dairy%20Milk%20and%20is%20now%20gearing%20up%20for%20a%20roll-out%20of%20its%20new%20brand.)\n",
    "  * [Stanza](http://stanza.run/)\n",
    "* **Transition-based**: incrementally build the tree, one arc at a time, by applying a sequence of actions. Examples:\n",
    "  * [spaCy](https://demos.explosion.ai/displacy?text=Kraft%2C%20owner%20of%20Milka%2C%20purchased%20Cadbury%20Dairy%20Milk%20and%20is%20now%20gearing%20up%20for%20a%20roll-out%20of%20its%20new%20brand.&model=en_core_web_sm&cpu=0&cph=0)\n",
    "  * [UUParser](https://github.com/UppsalaNLP/uuparser)\n",
    "  * [TUPA](https://github.com/huji-nlp/tupa/)\n",
    "  \n",
    "<center>\n",
    "    <img src=\"https://d3i71xaburhd42.cloudfront.net/c267b4a64066b56c8eef053de391c3cbe58c9eb3/3-Figure2-1.png\" width=60%>\n",
    "</center>\n",
    "\n",
    "<div style=\"text-align: right;\">\n",
    "    (from <a href=\"https://aclanthology.org/P18-2077/\">Dozat & Manning, 2018</a>)\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    },
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Dependency parsing evaluation\n",
    "\n",
    "* Unlabeled Attachment Score (**UAS**): % of words with correct head\n",
    "* Labeled Attachment Score (**LAS**): % of words with correct head and label\n",
    "\n",
    "Always 0 $\\leq$ LAS $\\leq$ UAS $\\leq$ 100%."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    },
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "### Example: LAS and UAS\n",
    "\n",
    "<center>\n",
    "    <img src=\"parsing_figures/as.png\" width=70%/>\n",
    "</center>\n",
    "\n",
    "<center>\n",
    "    $\\mathrm{UAS}=\\frac{8}{12}=67\\%$\n",
    "</center>\n",
    "\n",
    "<center>\n",
    "    $\\mathrm{LAS}=\\frac{7}{12}=58\\%$\n",
    "</center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    },
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Transition-based parsers\n",
    "\n",
    "Consist of a **<span style=\"color:blue\">buffer</span>** and **<span style=\"color:red\">stack</span>**, incrementally build the **parse** by applying **actions** (transitions)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "is_executing": false,
     "name": "#%% md\n"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "### Configuration\n",
    "\n",
    "- Stack \\\\(S\\\\): a last-in, first-out memory to keep track of words to process\n",
    "- Buffer \\\\(B\\\\): words remaining to be processed\n",
    "- Arcs \\\\(A\\\\): the dependency arcs created so far in the parse tree"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "What are the possible actions? Depends which **transition system** we are using!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "Common transition systems:\n",
    "+ arc-standard ([Nivre, 2003](https://aclanthology.org/W03-3017/))\n",
    "+ arc-eager ([Nivre, 2004](https://www.aclweb.org/anthology/W04-0308))\n",
    "+ arc-hybrid ([Kuhlmann et al., 2011](https://aclanthology.org/P11-1068/))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    },
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## arc-standard\n",
    "\n",
    "Possible actions at each step:\n",
    "- **SHIFT**: move the buffer top item to the stack.\n",
    "- For each relation $r$,\n",
    "  - **RIGHT-ARC-$r$**: create $r$ arc from second stack item to stack top. Then pop stack top.\n",
    "  - **LEFT-ARC-$r$**: create $r$ arc from stack top to second stack item. Then pop second stack item."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "Two special configurations:\n",
    "- **initial**: buffer contains the words, stack contains root, and arcs are empty.\n",
    "- **terminal**: buffer is empty, stack contains only root."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "### arc-standard example"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    },
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table><tr><td style='font-size: x-large;'>stack</td><td style='font-size: x-large;'>buffer</td><td style='font-size: x-large;'>parse</td><td style='font-size: x-large;'>action</td></tr>\n",
       "<tr><td style='font-size: x-large;'>ROOT</td><td style='font-size: x-large;'>Alice saw Bob</td><td style='font-size: x-large;'>\n",
       "    <div id='displacy9' style=\"width: 500px;\"></div>\n",
       "    <script>\n",
       "    $(function() {\n",
       "    requirejs.config({\n",
       "        paths: {\n",
       "            'displaCy': ['/files/node_modules/displacy/displacy'],\n",
       "                                                  // strip .js ^, require adds it back\n",
       "        },\n",
       "    });\n",
       "    require(['displaCy'], function() {\n",
       "        console.log(\"Loaded :)\");\n",
       "        const displacy = new displaCy('http://localhost:8000', {\n",
       "            container: '#displacy9',\n",
       "            format: 'spacy',\n",
       "            distance: 150,\n",
       "            offsetX: 0,\n",
       "            wordSpacing: 20,\n",
       "            arrowSpacing: 3,\n",
       "\n",
       "        });\n",
       "        const parse = {\n",
       "            arcs: [],\n",
       "            words: [{\"text\": \"ROOT\"}, {\"text\": \"Alice\"}]\n",
       "        };\n",
       "\n",
       "        displacy.render(parse, {\n",
       "            uniqueId: 'render_displacy9'\n",
       "            //color: '#ff0000'\n",
       "        });\n",
       "        return {};\n",
       "    });\n",
       "    });\n",
       "    </script></td><td style='font-size: x-large;'></td></tr>\n",
       "<tr><td style='font-size: x-large;'>ROOT Alice</td><td style='font-size: x-large;'>saw Bob</td><td style='font-size: x-large;'>\n",
       "    <div id='displacy10' style=\"width: 500px;\"></div>\n",
       "    <script>\n",
       "    $(function() {\n",
       "    requirejs.config({\n",
       "        paths: {\n",
       "            'displaCy': ['/files/node_modules/displacy/displacy'],\n",
       "                                                  // strip .js ^, require adds it back\n",
       "        },\n",
       "    });\n",
       "    require(['displaCy'], function() {\n",
       "        console.log(\"Loaded :)\");\n",
       "        const displacy = new displaCy('http://localhost:8000', {\n",
       "            container: '#displacy10',\n",
       "            format: 'spacy',\n",
       "            distance: 150,\n",
       "            offsetX: 0,\n",
       "            wordSpacing: 20,\n",
       "            arrowSpacing: 3,\n",
       "\n",
       "        });\n",
       "        const parse = {\n",
       "            arcs: [],\n",
       "            words: [{\"text\": \"ROOT\"}, {\"text\": \"Alice\"}, {\"text\": \"saw\"}]\n",
       "        };\n",
       "\n",
       "        displacy.render(parse, {\n",
       "            uniqueId: 'render_displacy10'\n",
       "            //color: '#ff0000'\n",
       "        });\n",
       "        return {};\n",
       "    });\n",
       "    });\n",
       "    </script></td><td style='font-size: x-large;'>shift</td></tr>\n",
       "<tr><td style='font-size: x-large;'>ROOT Alice saw</td><td style='font-size: x-large;'>Bob</td><td style='font-size: x-large;'>\n",
       "    <div id='displacy11' style=\"width: 500px;\"></div>\n",
       "    <script>\n",
       "    $(function() {\n",
       "    requirejs.config({\n",
       "        paths: {\n",
       "            'displaCy': ['/files/node_modules/displacy/displacy'],\n",
       "                                                  // strip .js ^, require adds it back\n",
       "        },\n",
       "    });\n",
       "    require(['displaCy'], function() {\n",
       "        console.log(\"Loaded :)\");\n",
       "        const displacy = new displaCy('http://localhost:8000', {\n",
       "            container: '#displacy11',\n",
       "            format: 'spacy',\n",
       "            distance: 150,\n",
       "            offsetX: 0,\n",
       "            wordSpacing: 20,\n",
       "            arrowSpacing: 3,\n",
       "\n",
       "        });\n",
       "        const parse = {\n",
       "            arcs: [],\n",
       "            words: [{\"text\": \"ROOT\"}, {\"text\": \"Alice\"}, {\"text\": \"saw\"}, {\"text\": \"Bob\"}]\n",
       "        };\n",
       "\n",
       "        displacy.render(parse, {\n",
       "            uniqueId: 'render_displacy11'\n",
       "            //color: '#ff0000'\n",
       "        });\n",
       "        return {};\n",
       "    });\n",
       "    });\n",
       "    </script></td><td style='font-size: x-large;'>shift</td></tr>\n",
       "<tr><td style='font-size: x-large;'>ROOT saw</td><td style='font-size: x-large;'>Bob</td><td style='font-size: x-large;'>\n",
       "    <div id='displacy12' style=\"width: 500px;\"></div>\n",
       "    <script>\n",
       "    $(function() {\n",
       "    requirejs.config({\n",
       "        paths: {\n",
       "            'displaCy': ['/files/node_modules/displacy/displacy'],\n",
       "                                                  // strip .js ^, require adds it back\n",
       "        },\n",
       "    });\n",
       "    require(['displaCy'], function() {\n",
       "        console.log(\"Loaded :)\");\n",
       "        const displacy = new displaCy('http://localhost:8000', {\n",
       "            container: '#displacy12',\n",
       "            format: 'spacy',\n",
       "            distance: 150,\n",
       "            offsetX: 0,\n",
       "            wordSpacing: 20,\n",
       "            arrowSpacing: 3,\n",
       "\n",
       "        });\n",
       "        const parse = {\n",
       "            arcs: [{\"start\": 1, \"end\": 2, \"label\": \"nsubj\", \"dir\": \"left\"}],\n",
       "            words: [{\"text\": \"ROOT\"}, {\"text\": \"Alice\"}, {\"text\": \"saw\"}, {\"text\": \"Bob\"}]\n",
       "        };\n",
       "\n",
       "        displacy.render(parse, {\n",
       "            uniqueId: 'render_displacy12'\n",
       "            //color: '#ff0000'\n",
       "        });\n",
       "        return {};\n",
       "    });\n",
       "    });\n",
       "    </script></td><td style='font-size: x-large;'>leftArc-nsubj</td></tr>\n",
       "<tr><td style='font-size: x-large;'>ROOT saw Bob</td><td style='font-size: x-large;'></td><td style='font-size: x-large;'>\n",
       "    <div id='displacy13' style=\"width: 500px;\"></div>\n",
       "    <script>\n",
       "    $(function() {\n",
       "    requirejs.config({\n",
       "        paths: {\n",
       "            'displaCy': ['/files/node_modules/displacy/displacy'],\n",
       "                                                  // strip .js ^, require adds it back\n",
       "        },\n",
       "    });\n",
       "    require(['displaCy'], function() {\n",
       "        console.log(\"Loaded :)\");\n",
       "        const displacy = new displaCy('http://localhost:8000', {\n",
       "            container: '#displacy13',\n",
       "            format: 'spacy',\n",
       "            distance: 150,\n",
       "            offsetX: 0,\n",
       "            wordSpacing: 20,\n",
       "            arrowSpacing: 3,\n",
       "\n",
       "        });\n",
       "        const parse = {\n",
       "            arcs: [{\"start\": 1, \"end\": 2, \"label\": \"nsubj\", \"dir\": \"left\"}],\n",
       "            words: [{\"text\": \"ROOT\"}, {\"text\": \"Alice\"}, {\"text\": \"saw\"}, {\"text\": \"Bob\"}]\n",
       "        };\n",
       "\n",
       "        displacy.render(parse, {\n",
       "            uniqueId: 'render_displacy13'\n",
       "            //color: '#ff0000'\n",
       "        });\n",
       "        return {};\n",
       "    });\n",
       "    });\n",
       "    </script></td><td style='font-size: x-large;'>shift</td></tr>\n",
       "<tr><td style='font-size: x-large;'>ROOT saw</td><td style='font-size: x-large;'></td><td style='font-size: x-large;'>\n",
       "    <div id='displacy14' style=\"width: 500px;\"></div>\n",
       "    <script>\n",
       "    $(function() {\n",
       "    requirejs.config({\n",
       "        paths: {\n",
       "            'displaCy': ['/files/node_modules/displacy/displacy'],\n",
       "                                                  // strip .js ^, require adds it back\n",
       "        },\n",
       "    });\n",
       "    require(['displaCy'], function() {\n",
       "        console.log(\"Loaded :)\");\n",
       "        const displacy = new displaCy('http://localhost:8000', {\n",
       "            container: '#displacy14',\n",
       "            format: 'spacy',\n",
       "            distance: 150,\n",
       "            offsetX: 0,\n",
       "            wordSpacing: 20,\n",
       "            arrowSpacing: 3,\n",
       "\n",
       "        });\n",
       "        const parse = {\n",
       "            arcs: [{\"start\": 1, \"end\": 2, \"label\": \"nsubj\", \"dir\": \"left\"}, {\"start\": 2, \"end\": 3, \"label\": \"dobj\", \"dir\": \"right\"}],\n",
       "            words: [{\"text\": \"ROOT\"}, {\"text\": \"Alice\"}, {\"text\": \"saw\"}, {\"text\": \"Bob\"}]\n",
       "        };\n",
       "\n",
       "        displacy.render(parse, {\n",
       "            uniqueId: 'render_displacy14'\n",
       "            //color: '#ff0000'\n",
       "        });\n",
       "        return {};\n",
       "    });\n",
       "    });\n",
       "    </script></td><td style='font-size: x-large;'>rightArc-dobj</td></tr>\n",
       "<tr><td style='font-size: x-large;'>ROOT</td><td style='font-size: x-large;'></td><td style='font-size: x-large;'>\n",
       "    <div id='displacy15' style=\"width: 500px;\"></div>\n",
       "    <script>\n",
       "    $(function() {\n",
       "    requirejs.config({\n",
       "        paths: {\n",
       "            'displaCy': ['/files/node_modules/displacy/displacy'],\n",
       "                                                  // strip .js ^, require adds it back\n",
       "        },\n",
       "    });\n",
       "    require(['displaCy'], function() {\n",
       "        console.log(\"Loaded :)\");\n",
       "        const displacy = new displaCy('http://localhost:8000', {\n",
       "            container: '#displacy15',\n",
       "            format: 'spacy',\n",
       "            distance: 150,\n",
       "            offsetX: 0,\n",
       "            wordSpacing: 20,\n",
       "            arrowSpacing: 3,\n",
       "\n",
       "        });\n",
       "        const parse = {\n",
       "            arcs: [{\"start\": 1, \"end\": 2, \"label\": \"nsubj\", \"dir\": \"left\"}, {\"start\": 2, \"end\": 3, \"label\": \"dobj\", \"dir\": \"right\"}, {\"start\": 0, \"end\": 2, \"label\": \"root\", \"dir\": \"right\"}],\n",
       "            words: [{\"text\": \"ROOT\"}, {\"text\": \"Alice\"}, {\"text\": \"saw\"}, {\"text\": \"Bob\"}]\n",
       "        };\n",
       "\n",
       "        displacy.render(parse, {\n",
       "            uniqueId: 'render_displacy15'\n",
       "            //color: '#ff0000'\n",
       "        });\n",
       "        return {};\n",
       "    });\n",
       "    });\n",
       "    </script></td><td style='font-size: x-large;'>rightArc-root</td></tr>\n",
       "<tr><td style='font-size: x-large;'>ROOT</td><td style='font-size: x-large;'></td><td style='font-size: x-large;'>\n",
       "    <div id='displacy16' style=\"width: 500px;\"></div>\n",
       "    <script>\n",
       "    $(function() {\n",
       "    requirejs.config({\n",
       "        paths: {\n",
       "            'displaCy': ['/files/node_modules/displacy/displacy'],\n",
       "                                                  // strip .js ^, require adds it back\n",
       "        },\n",
       "    });\n",
       "    require(['displaCy'], function() {\n",
       "        console.log(\"Loaded :)\");\n",
       "        const displacy = new displaCy('http://localhost:8000', {\n",
       "            container: '#displacy16',\n",
       "            format: 'spacy',\n",
       "            distance: 150,\n",
       "            offsetX: 0,\n",
       "            wordSpacing: 20,\n",
       "            arrowSpacing: 3,\n",
       "\n",
       "        });\n",
       "        const parse = {\n",
       "            arcs: [{\"start\": 1, \"end\": 2, \"label\": \"nsubj\", \"dir\": \"left\"}, {\"start\": 2, \"end\": 3, \"label\": \"dobj\", \"dir\": \"right\"}, {\"start\": 0, \"end\": 2, \"label\": \"root\", \"dir\": \"right\"}],\n",
       "            words: [{\"text\": \"ROOT\"}, {\"text\": \"Alice\"}, {\"text\": \"saw\"}, {\"text\": \"Bob\"}]\n",
       "        };\n",
       "\n",
       "        displacy.render(parse, {\n",
       "            uniqueId: 'render_displacy16'\n",
       "            //color: '#ff0000'\n",
       "        });\n",
       "        return {};\n",
       "    });\n",
       "    });\n",
       "    </script></td><td style='font-size: x-large;'></td></tr></table>"
      ],
      "text/plain": [
       "<statnlpbook.transition.render_transitions_displacy.<locals>.Output at 0x105d76790>"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "render_transitions_displacy(transitions, tokenized_sentence)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    },
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Transition-based parsing as structured prediction\n",
    "\n",
    "**Model** $p(a|c)$: how likely is action $a$ to be next, given that the current configuration is $c$?\n",
    "$$p(a|c) \\approx s_\\params(a,c)$$\n",
    "\n",
    "**Training**: learn $\\params$ with an annotated training set\n",
    "$$\n",
    "\\argmax_\\params \\prod_{x \\in \\train} \\prod_{i=1}^{|x|} s_\\params(a_i,c_i)\n",
    "$$\n",
    "\n",
    "**Decoding**: try to find the most likely action sequence\n",
    "$$\\argmax_{a_1,\\ldots,a_{|x|}}  \\prod_{i=1}^{|x|} s_\\params(a_i,c_i)$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    },
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "### Neural transition classifiers\n",
    "\n",
    "Sequence-to-sequence?\n",
    "<center>\n",
    "    <img src=\"parsing_figures/tb1.png\" width=60%/>\n",
    "</center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    },
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "Sequence-to-sequence, but with control structure:\n",
    "<center>\n",
    "    <img src=\"parsing_figures/tb2.png\" width=30%/>\n",
    "</center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    },
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "### Neural transition classifiers\n",
    "* Each step is a new classification instance\n",
    "* Architectures using Word embedding ([Chen and Manning, 2014](https://aclanthology.org/D14-1082/)), Stack-LSTM ([Dyer et al., 2015](https://aclanthology.org/P15-1033/)), BiLSTM ([Kiperwasser and Goldberg, 2016](https://aclanthology.org/Q16-1023/)), attention ([Ma et al., 2018](https://aclanthology.org/P18-1130/)), Stack-Transformer ([Fernandez Astudillo et al., 2020](https://aclanthology.org/2020.findings-emnlp.89/))\n",
    "\n",
    "<center>\n",
    "    <img src=\"https://d3i71xaburhd42.cloudfront.net/8292a74aba4eab2ca864b457c17b02634fef4ddd/5-Figure7-1.png\" width=30%/>\n",
    "</center>\n",
    "\n",
    "<div style=\"text-align: right;\">\n",
    "    (from <a href=\"https://aclanthology.org/K18-2010/\">Hershcovich et al., 2018</a>)\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    },
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "### Training\n",
    "\n",
    "+ Loss function: often negative log-likelihood or max-margin\n",
    "+ **Teacher forcing:**  always choose the ground truth action.\n",
    "\n",
    "*Alternative:* (see also [MT slides](nmt_slides_active.ipynb))\n",
    "\n",
    "+ **Scheduled sampling:** with a certain probability, use model predictions instead.\n",
    "\n",
    "But what is the **ground truth**? Treebanks contain **trees**, not action sequences!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "**Oracle**: rules to select the right action given the configuration **and the correct tree**.\n",
    "\n",
    "<center><img src=\"../img/oracle.png\" width=50%/></center>\n",
    "\n",
    "<div style=\"text-align: right;\">\n",
    "    (from <a href=\"https://aclanthology.org/P17-1104/\">Hershcovich et al., 2017</a>)\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "### Decoding\n",
    "\n",
    "+ Greedy decoding:\n",
    "    + Always pick the **most likely action** (according to the classifier)\n",
    "    + Continue applying more actions **until a terminal configuration is reached**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "+ Beam search:\n",
    "    * Maintains a list of top-$k$ action+configuration sequences in a **beam**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    },
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "## arc-hybrid\n",
    "\n",
    "- **SHIFT**: move the buffer top item to the stack.\n",
    "- **RIGHT-ARC-$r$**: create $r$ arc from second stack item to stack top. Then pop stack top.\n",
    "- **LEFT-ARC-$r$**: create $r$ arc from **buffer top** to stack top. Then pop **stack top**.\n",
    "\n",
    "<center><img src=\"../img/archybrid.png\" width=20%/></center>\n",
    "\n",
    "- **initial**: buffer contains the words **followed by root**, stack is **empty**, and arcs are empty.\n",
    "- **terminal**: buffer **contains only root**, stack **is empty**."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    },
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "### arc-hybrid example\n",
    "\n",
    "**Unlabeled parsing** (without relation labels), just for simplicity.\n",
    "\n",
    "https://danielhers.github.io/archybrid.pdf"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    },
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "<center><img src=\"../img/quiz_time.png\"></center>\n",
    "\n",
    "# [tinyurl.com/diku-nlp-tb](https://tinyurl.com/diku-nlp-tb)\n",
    "\n",
    "([Responses](https://app.quizalize.com/dash/R3JvdXA6Y2U1ZWFmYTMtMzdhMS00NTNiLWI3OTMtNThmYzg2MWRmNDQ3/activity/QWN0aXZpdHk6NGJmMTQxYTUtODg1Yy00ZDJmLTk5MmEtYWE4MTIxMmRmNTZk/board/leaderboard))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    },
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "## Summary: arc-hybrid vs arc-standard\n",
    "<table class=\"typeindex\">\n",
    "  <thead>\n",
    "    <tr style=\"font-size: x-large; text-align: left;\">\n",
    "      <th></th>\n",
    "      <th>LEFT-ARC</th>\n",
    "      <th>initial configuration</th>\n",
    "      <th>terminal configuration</th>\n",
    "    </tr>\n",
    "  </thead>\n",
    "  <tbody>\n",
    "    <tr style=\"font-size: x-large; text-align: left;\">\n",
    "      <td style=\"text-align: left;\">arc-standard</td>\n",
    "        <td style=\"text-align: left;\">create arc from <b>stack</b> top to <b>second</b> stack item,<br>pop <b>second</b> stack item</td>\n",
    "        <td style=\"text-align: left;\">stack contains <b>root</b>,<br>buffer contains words</td>\n",
    "        <td style=\"text-align: left;\">stack contains <b>root</b>,<br>buffer is <b>empty</b></td>\n",
    "    </tr>\n",
    "    <tr style=\"font-size: x-large; text-align: left;\">\n",
    "      <td style=\"text-align: left;\">arc-hybrid</td>\n",
    "        <td style=\"text-align: left;\">create arc from <b>buffer</b> top to stack <b>top</b>,<br>pop stack <b>top</b></td>\n",
    "        <td style=\"text-align: left;\">stack is <b>empty</b>,<br>buffer contains words <b>and root</b></td>\n",
    "        <td style=\"text-align: left;\">stack is <b>empty</b>,<br>buffer contains <b>root</b></td>\n",
    "    </tr>\n",
    "  </tbody>\n",
    "</table>\n",
    "\n",
    "<center><img src=\"../img/archybrid.png\" width=20%/>(arc-hybrid)</center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    },
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Summary\n",
    "\n",
    "* **Dependency parsing** predicts word-to-word dependencies\n",
    "* Treebanks in many languages, thanks to **UD**\n",
    "* Fast and accurate parsing, e.g. **transition-based**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    },
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Further reading\n",
    "\n",
    "* [EACL 2014 tutorial on dependency parsing](http://stp.lingfil.uu.se/~nivre/eacl14.html)\n",
    "* [Slides about semantic parsing](https://danielhers.github.io/mr.pdf)\n",
    "* [Chapter from this book about transition-based dependency parsing](http://localhost:8888/notebooks/chapters/transition-based_dependency_parsing.ipynb)\n",
    "* [Chapter from this book about constituency parsing](parsing.ipynb) ([slides](parsing_slides.ipynb))\n",
    "* [Jurafsky & Martin, Chapter 19](https://web.stanford.edu/~jurafsky/slp3/19.pdf)"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.18"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}