{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "skip" } }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "\n", "
\n", "" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "%%capture\n", "%load_ext autoreload\n", "%autoreload 2\n", "import sys\n", "sys.path.append(\"..\")\n", "from statnlpbook.util import execute_notebook\n", "import statnlpbook.parsing as parsing\n", "from statnlpbook.transition import *\n", "from statnlpbook.dep import *\n", "import pandas as pd\n", "from io import StringIO\n", "from IPython.display import display, HTML\n", "\n", "execute_notebook('transition-based_dependency_parsing.ipynb')" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "is_executing": false, "name": "#%% md\n" }, "slideshow": { "slide_type": "skip" } }, "source": [ "\n", "$$\n", "\\newcommand{\\Xs}{\\mathcal{X}}\n", "\\newcommand{\\Ys}{\\mathcal{Y}}\n", "\\newcommand{\\y}{\\mathbf{y}}\n", "\\newcommand{\\balpha}{\\boldsymbol{\\alpha}}\n", "\\newcommand{\\bbeta}{\\boldsymbol{\\beta}}\n", "\\newcommand{\\aligns}{\\mathbf{a}}\n", "\\newcommand{\\align}{a}\n", "\\newcommand{\\source}{\\mathbf{s}}\n", "\\newcommand{\\target}{\\mathbf{t}}\n", "\\newcommand{\\ssource}{s}\n", "\\newcommand{\\starget}{t}\n", "\\newcommand{\\repr}{\\mathbf{f}}\n", "\\newcommand{\\repry}{\\mathbf{g}}\n", "\\newcommand{\\x}{\\mathbf{x}}\n", "\\newcommand{\\prob}{p}\n", "\\newcommand{\\a}{\\alpha}\n", "\\newcommand{\\b}{\\beta}\n", "\\newcommand{\\vocab}{V}\n", "\\newcommand{\\params}{\\boldsymbol{\\theta}}\n", "\\newcommand{\\param}{\\theta}\n", "\\DeclareMathOperator{\\perplexity}{PP}\n", "\\DeclareMathOperator{\\argmax}{argmax}\n", "\\DeclareMathOperator{\\argmin}{argmin}\n", "\\newcommand{\\train}{\\mathcal{D}}\n", "\\newcommand{\\counts}[2]{\\#_{#1}(#2) }\n", "\\newcommand{\\length}[1]{\\text{length}(#1) }\n", "\\newcommand{\\indi}{\\mathbb{I}}\n", "$$" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "skip" } }, 
"outputs": [], "source": [ "%load_ext tikzmagic" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "# Parsing" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "fragment" } }, "source": [ "+ Syntactic dependencies\n", "+ Parsing algorithms\n", "+ Evaluation" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "# Syntactic constituency" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "## Reminder: parts of speech (POS)\n", "\n", "[Parts of speech](sequence_labeling_slides.ipynb) categorise the syntactic function of words.\n", "\n", "[Penn Treebank POS tagset](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html):\n", "\n", "Tag || Example\n", ":--- | :--- | :---\n", "CC | Coordinating conjunction | *and*\n", "CD | Cardinal number | *1*\n", "DT | Determiner | *the*\n", "EX | Existential there | *there*\n", "FW | Foreign word | *שלום*\n", "IN | Preposition or subordinating conjunction | *in*\n", "JJ | Adjective | *high*\n", "JJR | Adjective, comparative | *higher*\n", "JJS | Adjective, superlative | *highest*\n", "LS | List item marker | *,*\n", "MD | Modal | *can*\n", "NN | Noun, singular or mass | *desk*\n", "NNS | Noun, plural | *desks*\n", "NNP | Proper noun, singular | *Denmark*\n", "NNPS | Proper noun, plural | *Danes*\n", "PDT | Predeterminer | *both*\n", "POS | Possessive ending | *'s*\n", "PRP | Personal pronoun | *you*\n", "PRP$ | Possessive pronoun | *your*\n", "RB | Adverb | *well*\n", "RBR | Adverb, comparative | *better*\n", "RBS | Adverb, superlative | *best*\n", "RP | Particle |\n", "SYM | Symbol |\n", "TO | to |\n", "UH | Interjection |\n", "VB | Verb, base form | *see*\n", "VBD | Verb, past tense | *saw*\n", "VBG | Verb, gerund or present participle | *seeing*\n", "VBN | Verb, past 
participle | *seen*\n", "VBP | Verb, non-3rd person singular present | *see*\n", "VBZ | Verb, 3rd person singular present | *sees*\n", "WDT | Wh-determiner |\n", "WP | Wh-pronoun |\n", "WP\\$ | Possessive wh-pronoun |\n", "WRB | Wh-adverb |" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "## Syntactic constituents\n", "\n", "**Phrases** also have a grammatical function when they are syntactic constituents.\n", "\n", "[Penn Treebank constituent tagset](https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/penn-etb-2-style-guidelines.pdf):\n", "\n", "Phrase Level || Example\n", ":--- | :--- | :---\n", "ADJP | Adjective Phrase | *really high*\n", "ADVP | Adverb Phrase | *very well*\n", "CONJP | Conjunction Phrase | *as well as*\n", "FRAG | Fragment |\n", "INTJ | Interjection |\n", "LST | List marker |\n", "NP | Noun Phrase | *high desk*\n", "PP | Prepositional Phrase | *at home*\n", "PRN | Parenthetical |\n", "PRT | Particle. Category for words that should be tagged RP |\n", "QP | Quantifier Phrase (i.e. complex measure/amount phrase); used within NP |\n", "RRC | Reduced Relative Clause |\n", "VP | Verb Phrase | *see the desk*\n", "WHADJP | Wh-adjective Phrase. Adjectival phrase containing a wh-adverb | *how hot*\n", "WHADVP | Wh-adverb Phrase, containing a wh-adverb | *how well*\n", "WHNP | Wh-noun Phrase, containing some wh-word | *which book*\n", "WHPP | Wh-prepositional Phrase, containing a wh-noun phrase | *of which*\n", "X | Unknown, uncertain, or unbracketable. |" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Clause Level ||\n", ":--- | :---\n", "S | Simple declarative clause, i.e. 
one that is not introduced by a (possibly empty) subordinating conjunction or a wh-word and that does not exhibit subject-verb inversion.\n", "SBAR | Clause introduced by a (possibly empty) subordinating conjunction.\n", "SBARQ | Direct question introduced by a wh-word or a wh-phrase. Indirect questions and relative clauses should be bracketed as SBAR, not SBARQ.\n", "SINV | Inverted declarative sentence, i.e. one in which the subject follows the tensed verb or modal.\n", "SQ | Inverted yes/no question, or main clause of a wh-question, following the wh-phrase in SBARQ." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Trees\n", "\n", "A **tree** is a connected acyclic undirected graph.\n", "\n", "Graphs consist of **nodes** and **edges** between them.\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "## Syntactic constituency trees (phrase structure trees)\n", "\n", "* **Nodes**: syntactic constituents, labeled by type (including individual words, labeled by POS).\n", "* **Edges**: connecting phrases to their constituents, unlabeled.\n", "\n", "
\n", " \n", "
\n", "\n", "
\n", " (from Socher et al., 2013)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "
\n", " \n", " \n", " \n", "
\n", "
\n", "\n", "
\n", " (from Gokcen et al., 2018)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Another example of a *PP attachment* problem: does the **PP** (prepositional phrase) attach to the **VP** (verb phrase) or the **NP** (noun phrase)?\n", "
\n", " \n", "
\n", "\n", "
\n", " (from Kitaev et al., 2022)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "## Constituency parsers\n", "\n", "Structured prediction: trained on treebanks to build constituency trees from text.\n", "\n", "See more in the [chapter from this book about constituency parsing](parsing.ipynb) ([slides](parsing_slides.ipynb)).\n", "\n", "
\n", " \n", "
\n", "\n", "
\n", " (from Kitaev et al., 2022)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Syntactic dependencies" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "skip" } }, "source": [ "## Motivation: information extraction\n", "\n", "In [relation extraction](information_extraction_slides.ipynb), it helps to define **linguistic** patterns such as ` ` instead of purely text-based patterns.\n", "\n", "> Dechra Pharmaceuticals, which has just made its second acquisition, had previously purchased Genitrix.\n", "\n", "> Trinity Mirror plc, the largest British newspaper, purchased Local World, its rival.\n", "\n", "> Kraft, owner of Milka, purchased Cadbury Dairy Milk and is now gearing up for a roll-out of its new brand.\n" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "skip" } }, "source": [ "**Syntactic dependencies** are a useful representation for this purpose.\n", "\n", "
\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "skip" } }, "source": [ "## Motivation: question answering by reading comprehension\n", "\n", "
\n", " \n", "
\n", "\n", "
\n", " (from Rajpurkar et al., 2016)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "skip" } }, "source": [ "## Motivation: question answering from knowledge bases\n", "\n", "
\n", " \n", "
\n", "\n", "
\n", " (from Reddy et al., 2017)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "## Motivation: machine translation\n", "\n", "Reordering rules can be stated in terms of syntactic dependencies:\n", "\n", "
\n", " \n", "
\n", "\n", "
\n", " (from Rasooli et al., 2021)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "## Syntactic dependency trees\n", "\n", "* **Nodes**: individual words, and a special `ROOT` node.\n", "* Edges (**arcs**): labeled syntactic relations between words: from **head** to **dependent**.\n", "\n", "Must be a tree: **every word has exactly one head, and `ROOT` has no head**." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "pycharm": { "name": "#%%\n" }, "scrolled": true, "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", " " ], "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "conllu = \"\"\"\n", "# ID\tFORM\tLEMMA\tUPOS\tXPOS\tFEATS\tHEAD\tDEPREL\tDEPS\tMISC\n", "1\tI\t_\t_\t_\t_\t2\tnsubj\t_\t_\n", "2\tsaw\t_\t_\t_\t_\t0\troot\t_\t_\n", "3\tthe\t_\t_\t_\t_\t4\tdet\t_\t_\n", "4\tstar\t_\t_\t_\t_\t2\tdobj\t_\t_\n", "5\twith\t_\t_\t_\t_\t7\tcase\t_\t_\n", "6\tthe\t_\t_\t_\t_\t7\tdet\t_\t_\n", "7\ttelescope\t_\t_\t_\t_\t2\tobl\t_\t_\n", "\"\"\"\n", "arcs, tokens = to_displacy_graph(*load_arcs_tokens(conllu))\n", "render_displacy(arcs, tokens,\"2400px\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Syntactic ambiguity" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "pycharm": { "name": "#%%\n" }, "scrolled": true, "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", " " ], "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "conllu = \"\"\"\n", "# ID\tFORM\tLEMMA\tUPOS\tXPOS\tFEATS\tHEAD\tDEPREL\tDEPS\tMISC\n", "1\tI\t_\t_\t_\t_\t2\tnsubj\t_\t_\n", "2\tsaw\t_\t_\t_\t_\t0\troot\t_\t_\n", "3\tthe\t_\t_\t_\t_\t4\tdet\t_\t_\n", "4\tstar\t_\t_\t_\t_\t2\tdobj\t_\t_\n", "5\twith\t_\t_\t_\t_\t7\tcase\t_\t_\n", "6\tthe\t_\t_\t_\t_\t7\tdet\t_\t_\n", "7\ttelescope\t_\t_\t_\t_\t2\tobl\t_\t_\n", "\"\"\"\n", "arcs, tokens = to_displacy_graph(*load_arcs_tokens(conllu))\n", "render_displacy(arcs, tokens,\"2400px\")" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", " " ], "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "conllu = \"\"\"\n", "# ID\tFORM\tLEMMA\tUPOS\tXPOS\tFEATS\tHEAD\tDEPREL\tDEPS\tMISC\n", "1\tI\t_\t_\t_\t_\t2\tnsubj\t_\t_\n", "2\tsaw\t_\t_\t_\t_\t0\troot\t_\t_\n", "3\tthe\t_\t_\t_\t_\t4\tdet\t_\t_\n", "4\tstar\t_\t_\t_\t_\t2\tdobj\t_\t_\n", "5\twith\t_\t_\t_\t_\t7\tcase\t_\t_\n", "6\tthe\t_\t_\t_\t_\t7\tdet\t_\t_\n", "7\ttelescope\t_\t_\t_\t_\t4\tnmod\t_\t_\n", "\"\"\"\n", "arcs, tokens = to_displacy_graph(*load_arcs_tokens(conllu))\n", "render_displacy(arcs, tokens,\"2400px\")" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "fragment" } }, "source": [ "
\n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Treebanks\n", "\n", "A dataset that consists of a text corpus with annotated (syntactic) trees.\n", "\n", "Some commonly used treebanks:\n", "\n", "* English: *Penn Treebank* (4.8 million words)\n", "* Mandarin Chinese: *Chinese Treebank* (1.5 million words)\n", "* German: *TIGER* (0.9 million words), *TüBa-D/Z* (1.6 million words)\n", "* Czech: *Prague Dependency Treebank* (2 million words)\n", "* Danish: *Arboretum* (0.2 million words)\n", "* ...\n", "* Multilingual: *Universal Dependencies* (more in a few slides)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### CoNLL-U format\n", "\n", "Tabular format with 10 columns indicating various morphosyntactic attributes.\n", "\n", "Shown here: ID, surface form, dependency head and dependency relation.\n", "\n", "(The others are shown as `_` but normally they would be filled in too.)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
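# ID | FORM | LEMMA | UPOS | XPOS | FEATS | HEAD | DEPREL | DEPS | MISC
1 | I | _ | _ | _ | _ | 2 | nsubj | _ | _
2 | saw | _ | _ | _ | _ | 0 | root | _ | _
3 | the | _ | _ | _ | _ | 4 | det | _ | _
4 | star | _ | _ | _ | _ | 2 | dobj | _ | _
5 | with | _ | _ | _ | _ | 7 | case | _ | _
6 | the | _ | _ | _ | _ | 7 | det | _ | _
7 | telescope | _ | _ | _ | _ | 4 | nmod | _ | _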
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "
\n", " " ], "text/plain": [ "" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "display(HTML(pd.read_csv(StringIO(conllu), sep=\"\\t\").to_html(index=False)))\n", "render_displacy(arcs, tokens,\"2400px\")" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "skip" } }, "source": [ "### Need for universal syntactic annotation\n", "\n", "How to define the relation labels? There are different linguistic traditions in different languages...\n", "
\n", " \n", "
\n", "\n", "
\n", " (from de Lhoneux, 2019)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "### Universal Dependencies\n", "\n", "* Annotation framework featuring [37 syntactic relations](https://universaldependencies.org/u/dep/all.html)\n", "* [Treebanks](http://universaldependencies.org/) in 161 languages\n", "* Cross-linguistically consistent annotation of typologically diverse languages ([de Marneffe et al., 2021](https://aclanthology.org/2021.cl-2.11/))\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "### UD dependency relations\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\t \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| | Nominals | Clauses | Modifier words | Function Words |
| :--- | :--- | :--- | :--- | :--- |
| Core arguments | nsubj, obj, iobj | csubj, ccomp, xcomp | | |
| Non-core dependents | obl, vocative, expl, dislocated | advcl | advmod, discourse | aux, cop, mark |
| Nominal dependents | nmod, appos, nummod | acl | amod | det, clf, case |

| Coordination | MWE | Loose | Special | Other |
| :--- | :--- | :--- | :--- | :--- |
| conj, cc | fixed, flat, compound | list, parataxis | orphan, goeswith, reparandum | punct, root, dep |
" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "## Beyond dependency trees\n", "\n", "UD also includes other morphosyntactic annotation:\n", "\n", "* Tokenisation and word segmentation\n", "* Morphological features (e.g., lemmas, case, gender)\n", "* **Universal part of speech tags (UPOS)**: coarse abstraction over language-specific POS tags (XPOS).\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
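| Open class words | Closed class words | Other |
| :--- | :--- | :--- |
| ADJ | ADP | PUNCT |
| ADV | AUX | SYM |
| INTJ | CCONJ | X |
| NOUN | DET | |
| PROPN | NUM | |
| VERB | PART | |
| | PRON | |
| | SCONJ | |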
" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "### Danish UD example\n", "*the big fish ate the small fish*" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", " " ], "text/plain": [ "" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "conllu = \"\"\"\n", "# ID\tFORM\tLEMMA\tUPOS\tXPOS\tFEATS\tHEAD\tDEPREL\tDEPS\tMISC\n", "1\tDen\t_\t_\t_\t_\t3\tdet\t_\t_\n", "2\tstore\t_\t_\t_\t_\t3\tamod\t_\t_\n", "3\tfisk\t_\t_\t_\t_\t4\tnsubj\t_\t_\n", "4\tspiste\t_\t_\t_\t_\t0\troot\t_\t_\n", "5\tden\t_\t_\t_\t_\t7\tdet\t_\t_\n", "6\tlille\t_\t_\t_\t_\t7\tamod\t_\t_\n", "7\tfisk\t_\t_\t_\t_\t4\tobj\t_\t_\n", "\"\"\"\n", "arcs, tokens = to_displacy_graph(*load_arcs_tokens(conllu))\n", "render_displacy(arcs, tokens,\"1400px\")" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "fragment" } }, "source": [ "### Korean UD example\n", "*big fish small fish ate*" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", " " ], "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "conllu = \"\"\"\n", "# ID\tFORM\tLEMMA\tUPOS\tXPOS\tFEATS\tHEAD\tDEPREL\tDEPS\tMISC\n", "1\t큰\t_\t_\t_\t_\t2\tamod\t_\t_\n", "2\t물고기가\t_\t_\t_\t_\t5\tnsubj\t_\t_\n", "3\t작은\t_\t_\t_\t_\t4\tamod\t_\t_\n", "4\t물고기를\t_\t_\t_\t_\t5\tobj\t_\t_\n", "5\t먹었다\t_\t_\t_\t_\t0\troot\t_\t_\n", "\"\"\"\n", "arcs, tokens = to_displacy_graph(*load_arcs_tokens(conllu))\n", "render_displacy(arcs, tokens,\"1400px\")" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "## Dependency parsing\n", "\n", "Task:\n", "* Predict **head** and **relation** for each word.\n", "* Classification? Sequence tagging? Sequence-to-sequence? Span selection? Or something else?" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
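# ID | FORM | LEMMA | UPOS | XPOS | FEATS | HEAD | DEPREL | DEPS | MISC
1 | Alice | _ | _ | _ | _ | 2 | nsubj | _ | _
2 | saw | _ | _ | _ | _ | 0 | root | _ | _
3 | Bob | _ | _ | _ | _ | 2 | dobj | _ | _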
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "conllu = \"\"\"\n", "# ID\tFORM\tLEMMA\tUPOS\tXPOS\tFEATS\tHEAD\tDEPREL\tDEPS\tMISC\n", "1\tAlice\t_\t_\t_\t_\t2\tnsubj\t_\t_\n", "2\tsaw\t_\t_\t_\t_\t0\troot\t_\t_\n", "3\tBob\t_\t_\t_\t_\t2\tdobj\t_\t_\n", "\"\"\"\n", "display(HTML(pd.read_csv(StringIO(conllu), sep=\"\\t\").to_html(index=False)))" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "skip" } }, "source": [ "## Dependency parsing approaches\n", "\n", "* **Graph-based**: score all possible word pairs, find best combination (often a maximum spanning tree). Examples:\n", " * [UDPipe](https://lindat.mff.cuni.cz/services/udpipe/run.php?model=english-ewt-ud-2.10-220711&data=Kraft,%20owner%20of%20Milka,%20purchased%20Cadbury%20Dairy%20Milk%20and%20is%20now%20gearing%20up%20for%20a%20roll-out%20of%20its%20new%20brand.)\n", " * [Stanza](http://stanza.run/)\n", "* **Transition-based**: incrementally build the tree, one arc at a time, by applying a sequence of actions. Examples:\n", " * [spaCy](https://demos.explosion.ai/displacy?text=Kraft%2C%20owner%20of%20Milka%2C%20purchased%20Cadbury%20Dairy%20Milk%20and%20is%20now%20gearing%20up%20for%20a%20roll-out%20of%20its%20new%20brand.&model=en_core_web_sm&cpu=0&cph=0)\n", " * [UUParser](https://github.com/UppsalaNLP/uuparser)\n", " * [TUPA](https://github.com/huji-nlp/tupa/)\n", " \n", "
\n", " \n", "
\n", "\n", "
\n", " (from Dozat & Manning, 2018)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "## Dependency parsing evaluation\n", "\n", "* Unlabeled Attachment Score (**UAS**): % of words with correct head\n", "* Labeled Attachment Score (**LAS**): % of words with correct head and label\n", "\n", "Always 0 $\\leq$ LAS $\\leq$ UAS $\\leq$ 100%." ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "### Example: LAS and UAS\n", "\n", "
\n", " \n", "
\n", "\n", "
\n", " $\\mathrm{UAS}=\\frac{8}{12}=67\\%$\n", "
\n", "\n", "
\n", " $\\mathrm{LAS}=\\frac{7}{12}=58\\%$\n", "
" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "## Transition-based parsers\n", "\n", "A transition-based parser maintains a **buffer** and a **stack**, and incrementally builds the **parse** by applying **actions** (transitions)." ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "is_executing": false, "name": "#%% md\n" }, "slideshow": { "slide_type": "fragment" } }, "source": [ "### Configuration\n", "\n", "- Stack \\\\(S\\\\): a last-in, first-out memory to keep track of words to process\n", "- Buffer \\\\(B\\\\): words remaining to be processed\n", "- Arcs \\\\(A\\\\): the dependency arcs created so far in the parse tree" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "fragment" } }, "source": [ "What are the possible actions? That depends on which **transition system** we are using!" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "fragment" } }, "source": [ "Common transition systems:\n", "+ arc-standard ([Nivre, 2003](https://aclanthology.org/W03-3017/))\n", "+ arc-eager ([Nivre, 2004](https://www.aclweb.org/anthology/W04-0308))\n", "+ arc-hybrid ([Kuhlmann et al., 2011](https://aclanthology.org/P11-1068/))" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "## arc-standard\n", "\n", "Possible actions at each step:\n", "- **SHIFT**: move the buffer top item to the stack.\n", "- For each relation $r$,\n", " - **RIGHT-ARC-$r$**: create $r$ arc from second stack item to stack top. Then pop stack top.\n", " - **LEFT-ARC-$r$**: create $r$ arc from stack top to second stack item. Then pop second stack item." 
] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "fragment" } }, "source": [ "Two special configurations:\n", "- **initial**: buffer contains the words, stack contains root, and arcs are empty.\n", "- **terminal**: buffer is empty, stack contains only root." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### arc-standard example" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
stack | buffer | parse | action
ROOT | Alice saw Bob | |
ROOT Alice | saw Bob | | shift
ROOT Alice saw | Bob | | shift
ROOT saw | Bob | | leftArc-nsubj
ROOT saw Bob | | | shift
ROOT saw | | | rightArc-dobj
ROOT | | | rightArc-root
ROOT | | |
" ], "text/plain": [ ".Output at 0x105d76790>" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "render_transitions_displacy(transitions, tokenized_sentence)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "## Transition-based parsing as structured prediction\n", "\n", "**Model** $p(a|c)$: how likely is action $a$ to be next, given that the current configuration is $c$?\n", "$$p(a|c) \\approx s_\\params(a,c)$$\n", "\n", "**Training**: learn $\\params$ with an annotated training set\n", "$$\n", "\\argmax_\\params \\prod_{x \\in \\train} \\prod_{i=1}^{|x|} s_\\params(a_i,c_i)\n", "$$\n", "\n", "**Decoding**: try to find the most likely action sequence\n", "$$\\argmax_{a_1,\\ldots,a_{|x|}} \\prod_{i=1}^{|x|} s_\\params(a_i,c_i)$$" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "skip" } }, "source": [ "### Neural transition classifiers\n", "\n", "Sequence-to-sequence?\n", "
\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "skip" } }, "source": [ "Sequence-to-sequence, but with control structure:\n", "
\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "skip" } }, "source": [ "### Neural transition classifiers\n", "* Each step is a new classification instance\n", "* Architectures use word embeddings ([Chen and Manning, 2014](https://aclanthology.org/D14-1082/)), Stack-LSTM ([Dyer et al., 2015](https://aclanthology.org/P15-1033/)), BiLSTM ([Kiperwasser and Goldberg, 2016](https://aclanthology.org/Q16-1023/)), attention ([Ma et al., 2018](https://aclanthology.org/P18-1130/)), or Stack-Transformer ([Fernandez Astudillo et al., 2020](https://aclanthology.org/2020.findings-emnlp.89/))\n", "\n", "
\n", " \n", "
\n", "\n", "
\n", " (from Hershcovich et al., 2018)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "### Training\n", "\n", "+ Loss function: often negative log-likelihood or max-margin\n", "+ **Teacher forcing:** always choose the ground truth action.\n", "\n", "*Alternative:* (see also [MT slides](nmt_slides_active.ipynb))\n", "\n", "+ **Scheduled sampling:** with a certain probability, use model predictions instead.\n", "\n", "But what is the **ground truth**? Treebanks contain **trees**, not action sequences!" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "**Oracle**: rules to select the right action given the configuration **and the correct tree**.\n", "\n", "
\n", "\n", "
\n", " (from Hershcovich et al., 2017)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Decoding\n", "\n", "+ Greedy decoding:\n", " + Always pick the **most likely action** (according to the classifier)\n", " + Continue applying more actions **until a terminal configuration is reached**" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "+ Beam search:\n", " * Maintains a list of top-$k$ action+configuration sequences in a **beam**" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "skip" } }, "source": [ "## arc-hybrid\n", "\n", "- **SHIFT**: move the buffer top item to the stack.\n", "- **RIGHT-ARC-$r$**: create $r$ arc from second stack item to stack top. Then pop stack top.\n", "- **LEFT-ARC-$r$**: create $r$ arc from **buffer top** to stack top. Then pop **stack top**.\n", "\n", "
\n", "\n", "- **initial**: buffer contains the words **followed by root**, stack is **empty**, and arcs are empty.\n", "- **terminal**: buffer **contains only root**, stack **is empty**." ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "skip" } }, "source": [ "### arc-hybrid example\n", "\n", "**Unlabeled parsing** (without relation labels), just for simplicity.\n", "\n", "https://danielhers.github.io/archybrid.pdf" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "skip" } }, "source": [ "
\n", "\n", "# [tinyurl.com/diku-nlp-tb](https://tinyurl.com/diku-nlp-tb)\n", "\n", "([Responses](https://app.quizalize.com/dash/R3JvdXA6Y2U1ZWFmYTMtMzdhMS00NTNiLWI3OTMtNThmYzg2MWRmNDQ3/activity/QWN0aXZpdHk6NGJmMTQxYTUtODg1Yy00ZDJmLTk5MmEtYWE4MTIxMmRmNTZk/board/leaderboard))" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "skip" } }, "source": [ "## Summary: arc-hybrid vs arc-standard\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| | LEFT-ARC | initial configuration | terminal configuration |
| :--- | :--- | :--- | :--- |
| arc-standard | create arc from stack top to second stack item, pop second stack item | stack contains root, buffer contains words | stack contains root, buffer is empty |
| arc-hybrid | create arc from buffer top to stack top, pop stack top | stack is empty, buffer contains words and root | stack is empty, buffer contains root |
\n", "\n", "
(arc-hybrid)
" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "## Summary\n", "\n", "* **Dependency parsing** predicts word-to-word dependencies\n", "* Treebanks in many languages, thanks to **UD**\n", "* Fast and accurate parsing, e.g. **transition-based**" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "## Further reading\n", "\n", "* [EACL 2014 tutorial on dependency parsing](http://stp.lingfil.uu.se/~nivre/eacl14.html)\n", "* [Slides about semantic parsing](https://danielhers.github.io/mr.pdf)\n", "* [Chapter from this book about transition-based dependency parsing](transition-based_dependency_parsing.ipynb)\n", "* [Chapter from this book about constituency parsing](parsing.ipynb) ([slides](parsing_slides.ipynb))\n", "* [Jurafsky & Martin, Chapter 19](https://web.stanford.edu/~jurafsky/slp3/19.pdf)" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.18" } }, "nbformat": 4, "nbformat_minor": 1 }