{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Stanza Tutorial\n", "\n", "(C) 2023-2024 by [Damir Cavar](http://damir.cavar.me/)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Version:** 1.1, January 2024" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Download:** This and various other Jupyter notebooks are available from my [GitHub repo](https://github.com/dcavar/python-tutorial-for-ipython)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Prerequisites:**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install -U stanza" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To install [spaCy](https://spacy.io/) follow the instructions on the [Install spaCy page](https://spacy.io/usage)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install -U pip setuptools wheel" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following installation of spaCy is ideal for my environment, i.e., using a GPU and CUDA 12.x. See the [spaCy homepage](https://spacy.io/usage) for detailed installation instructions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install -U 'spacy[cuda12x,transformers,lookups,ja]'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a tutorial related to the [L645 Advanced Natural Language Processing](http://damir.cavar.me/l645/) course in Fall 2023 at Indiana University. The following tutorial assumes that you are using a newer distribution of [Python 3.x](https://python.org/) and [Stanza](https://stanfordnlp.github.io/stanza/) 1.5.1 or newer." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook assumes that you have set up [Stanza](https://stanfordnlp.github.io/stanza/) on your computer with your [Python](https://python.org/) distribution. Follow the instructions on the [Stanza](https://stanfordnlp.github.io/stanza/) installation page to set up a working environment for the following code. The code will also require that you are online and that the specific language models can be downloaded and installed." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Loading the [Stanza](https://stanfordnlp.github.io/stanza/) module and [spaCy's Displacy](https://spacy.io/usage/visualizers) for visualization:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import stanza\n", "from stanza.models.common.doc import Document\n", "from stanza.pipeline.core import Pipeline\n", "from spacy import displacy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following code will load the English language model for [Stanza](https://stanfordnlp.github.io/stanza/):" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "96d6347621814771ba37fb82cc243440", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.7.0.json: 0%| …" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "2024-01-23 12:31:57 INFO: Downloading default packages for language: en (English) ...\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "21090f54744249249eaa4071e8ab3e40", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.7.0/models/default.zip: 0%| | 0…" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "2024-01-23 12:32:13 INFO: Finished downloading models and saved to /home/damir/stanza_resources.\n" ] } ], "source": [ "stanza.download('en')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can configure the [Stanza](https://stanfordnlp.github.io/stanza/) pipeline to contain all desired linguistic annotation modules. In this case we use:\n", "- tokenizer\n", "- multi-word-tokenizer\n", "- Part-of-Speech tagger\n", "- lemmatizer\n", "- dependency parser\n", "- constituent parser" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2024-01-23 12:32:27 WARNING: Can not find ner: ontonotes from official model list. 
Ignoring it.\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "a395565183484c739053793543c31bda", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.7.0/models/ner/ncbi_disease.pt: 0%| …" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "6cfa2cf60ed940a8a36c887af7deb50d", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.7.0/models/forward_charlm/pubmed.pt: 0%|…" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "cbcd807c6e5f4942b5722c0cc8293a68", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.7.0/models/backward_charlm/pubmed.pt: 0%…" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "732dce84bb8b4cc89f0858d0d0f02b91", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.7.0/models/pretrain/biomed.pt: 0%| …" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "2024-01-23 12:32:33 INFO: Loading these models for language: en (English):\n", "======================================\n", "| Processor | Package |\n", "--------------------------------------\n", "| tokenize | combined |\n", "| mwt | combined |\n", "| pos | combined_charlm |\n", "| lemma | combined_nocharlm |\n", "| constituency | ptb3-revised_charlm |\n", "| depparse | combined_charlm |\n", "| sentiment | sstplus |\n", "| ner | ncbi_disease |\n", "======================================\n", "\n", "2024-01-23 12:32:33 INFO: Using device: cpu\n", "2024-01-23 12:32:33 INFO: Loading: tokenize\n", "/home/damir/.local/lib/python3.12/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. 
Please use torch.utils._pytree.register_pytree_node instead.\n", " _torch_pytree._register_pytree_node(\n", "2024-01-23 12:32:33 INFO: Loading: mwt\n", "2024-01-23 12:32:33 INFO: Loading: pos\n", "2024-01-23 12:32:34 INFO: Loading: lemma\n", "2024-01-23 12:32:34 INFO: Loading: constituency\n", "2024-01-23 12:32:34 INFO: Loading: depparse\n", "2024-01-23 12:32:34 INFO: Loading: sentiment\n", "2024-01-23 12:32:34 INFO: Loading: ner\n", "2024-01-23 12:32:35 INFO: Done loading processors!\n" ] } ], "source": [ "nlp = stanza.Pipeline('en', processors='tokenize,mwt,pos,lemma,ner,depparse,constituency,sentiment', package={\"ner\": [\"ncbi_disease\", \"ontonotes\"]}, use_gpu=False, download_method=\"reuse_resources\")" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "====== Sentence 1 tokens =======\n", "id: (1,)\ttext: The\n", "id: (2,)\ttext: pilot\n", "id: (3,)\ttext: had\n", "id: (4,)\ttext: arthritis\n", "id: (5,)\ttext: .\n", "====== Sentence 2 tokens =======\n", "id: (1, 2)\ttext: What's\n", "id: (3,)\ttext: so\n", "id: (4,)\ttext: important\n", "id: (5,)\ttext: to\n", "id: (6,)\ttext: underline\n", "id: (7,)\ttext: is\n", "id: (8,)\ttext: that\n", "id: (9,)\ttext: Metz\n", "id: (10,)\ttext: worked\n", "id: (11,)\ttext: for\n", "id: (12,)\ttext: both\n", "id: (13,)\ttext: Northrop\n", "id: (14,)\ttext: and\n", "id: (15,)\ttext: Lockheed\n", "id: (16,)\ttext: Martin\n", "id: (17,)\ttext: in\n", "id: (18,)\ttext: New\n", "id: (19,)\ttext: York\n", "id: (20,)\ttext: City\n", "id: (21,)\ttext: and\n", "id: (22,)\ttext: is\n", "id: (23,)\ttext: not\n", "id: (24,)\ttext: known\n", "id: (25,)\ttext: for\n", "id: (26,)\ttext: hyperbole\n", "id: (27,)\ttext: .\n", "====== Sentence 3 tokens =======\n", "id: (1,)\ttext: Yet\n", "id: (2,)\ttext: even\n", "id: (3,)\ttext: after\n", "id: (4,)\ttext: flying\n", "id: (5,)\ttext: the\n", "id: (6,)\ttext: pre-production\n", "id: (7,)\ttext: F\n", "id: (8,)\ttext: -\n", "id: (9,)\ttext: 22\n", "id: (10,)\ttext: ,\n", "id: (11,)\ttext: a\n", "id: (12,)\ttext: far\n", "id: (13,)\ttext: more\n", "id: (14,)\ttext: mature\n", "id: (15,)\ttext: machine\n", "id: (16,)\ttext: than\n", "id: (17,)\ttext: the\n", "id: (18,)\ttext: YF\n", "id: (19,)\ttext: -\n", "id: (20,)\ttext: 23\n", "id: (21,)\ttext: ever\n", "id: (22,)\ttext: was\n", "id: (23,)\ttext: ,\n", "id: (24,)\ttext: he\n", "id: (25,)\ttext: makes\n", "id: (26,)\ttext: it\n", "id: (27,)\ttext: quite\n", "id: (28,)\ttext: clear\n", "id: (29,)\ttext: that\n", "id: (30, 31)\ttext: Northrop's\n", "id: (32,)\ttext: offering\n", "id: (33,)\ttext: was\n", "id: (34,)\ttext: on\n", "id: (35,)\ttext: par\n", "id: (36,)\ttext: with\n", "id: (37, 38)\ttext: Lockheed's\n", "id: (39,)\ttext: ,\n", "id: (40,)\ttext: if\n", "id: (41,)\ttext: not\n", "id: (42,)\ttext: superior\n", "id: (43,)\ttext: .\n" ] } ], "source": [ "doc = nlp(\"The pilot had arthritis. What's so important to underline is that Metz worked for both Northrop and Lockheed Martin in New York City and is not known for hyperbole. 
Yet even after flying the pre-production F-22, a far more mature machine than the YF-23 ever was, he makes it quite clear that Northrop's offering was on par with Lockheed's, if not superior.\")\n", "for i, sentence in enumerate(doc.sentences):\n", " print(f'====== Sentence {i+1} tokens =======')\n", " print(*[f'id: {token.id}\\ttext: {token.text}' for token in sentence.tokens], sep='\\n')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "word: The\tupos: DET\txpos: DT\tfeats: Definite=Def|PronType=Art\n", "word: pilot\tupos: NOUN\txpos: NN\tfeats: Number=Sing\n", "word: had\tupos: VERB\txpos: VBD\tfeats: Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin\n", "word: arthritis\tupos: NOUN\txpos: NN\tfeats: Number=Sing\n", "word: .\tupos: PUNCT\txpos: .\tfeats: _\n", "word: What\tupos: PRON\txpos: WP\tfeats: PronType=Int\n", "word: 's\tupos: AUX\txpos: VBZ\tfeats: Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin\n", "word: so\tupos: ADV\txpos: RB\tfeats: _\n", "word: important\tupos: ADJ\txpos: JJ\tfeats: Degree=Pos\n", "word: to\tupos: PART\txpos: TO\tfeats: _\n", "word: underline\tupos: VERB\txpos: VB\tfeats: VerbForm=Inf\n", "word: is\tupos: AUX\txpos: VBZ\tfeats: Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin\n", "word: that\tupos: SCONJ\txpos: IN\tfeats: _\n", "word: Metz\tupos: PROPN\txpos: NNP\tfeats: Number=Sing\n", "word: worked\tupos: VERB\txpos: VBD\tfeats: Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin\n", "word: for\tupos: ADP\txpos: IN\tfeats: _\n", "word: both\tupos: CCONJ\txpos: CC\tfeats: _\n", "word: Northrop\tupos: PROPN\txpos: NNP\tfeats: Number=Sing\n", "word: and\tupos: CCONJ\txpos: CC\tfeats: _\n", "word: Lockheed\tupos: PROPN\txpos: NNP\tfeats: Number=Sing\n", "word: Martin\tupos: PROPN\txpos: NNP\tfeats: Number=Sing\n", "word: in\tupos: ADP\txpos: IN\tfeats: _\n", "word: New\tupos: ADJ\txpos: NNP\tfeats: Degree=Pos\n", "word: York\tupos: PROPN\txpos: NNP\tfeats: Number=Sing\n", "word: City\tupos: PROPN\txpos: NNP\tfeats: Number=Sing\n", "word: and\tupos: CCONJ\txpos: CC\tfeats: _\n", "word: is\tupos: AUX\txpos: VBZ\tfeats: Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin\n", "word: not\tupos: PART\txpos: RB\tfeats: _\n", "word: known\tupos: VERB\txpos: VBN\tfeats: Tense=Past|VerbForm=Part|Voice=Pass\n", "word: for\tupos: ADP\txpos: IN\tfeats: _\n", "word: hyperbole\tupos: NOUN\txpos: NN\tfeats: Number=Sing\n", "word: .\tupos: PUNCT\txpos: .\tfeats: _\n", "word: Yet\tupos: CCONJ\txpos: CC\tfeats: _\n", "word: even\tupos: ADV\txpos: RB\tfeats: _\n", "word: after\tupos: SCONJ\txpos: IN\tfeats: _\n", "word: flying\tupos: VERB\txpos: VBG\tfeats: VerbForm=Ger\n", "word: the\tupos: DET\txpos: DT\tfeats: Definite=Def|PronType=Art\n", "word: pre-production\tupos: NOUN\txpos: NN\tfeats: Number=Sing\n", "word: F\tupos: PROPN\txpos: NNP\tfeats: Number=Sing\n", "word: -\tupos: PUNCT\txpos: HYPH\tfeats: _\n", "word: 22\tupos: NUM\txpos: CD\tfeats: NumForm=Digit|NumType=Card\n", "word: ,\tupos: PUNCT\txpos: ,\tfeats: _\n", "word: a\tupos: DET\txpos: DT\tfeats: Definite=Ind|PronType=Art\n", "word: far\tupos: ADV\txpos: RB\tfeats: Degree=Pos\n", "word: more\tupos: ADV\txpos: RBR\tfeats: Degree=Cmp\n", "word: mature\tupos: ADJ\txpos: JJ\tfeats: Degree=Pos\n", "word: machine\tupos: NOUN\txpos: NN\tfeats: Number=Sing\n", "word: than\tupos: ADP\txpos: IN\tfeats: _\n", "word: the\tupos: DET\txpos: DT\tfeats: Definite=Def|PronType=Art\n", "word: YF\tupos: PROPN\txpos: 
NNP\tfeats: Number=Sing\n", "word: -\tupos: PUNCT\txpos: HYPH\tfeats: _\n", "word: 23\tupos: NUM\txpos: CD\tfeats: NumForm=Digit|NumType=Card\n", "word: ever\tupos: ADV\txpos: RB\tfeats: _\n", "word: was\tupos: AUX\txpos: VBD\tfeats: Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin\n", "word: ,\tupos: PUNCT\txpos: ,\tfeats: _\n", "word: he\tupos: PRON\txpos: PRP\tfeats: Case=Nom|Gender=Masc|Number=Sing|Person=3|PronType=Prs\n", "word: makes\tupos: VERB\txpos: VBZ\tfeats: Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin\n", "word: it\tupos: PRON\txpos: PRP\tfeats: Case=Acc|Gender=Neut|Number=Sing|Person=3|PronType=Prs\n", "word: quite\tupos: ADV\txpos: RB\tfeats: _\n", "word: clear\tupos: ADJ\txpos: JJ\tfeats: Degree=Pos\n", "word: that\tupos: SCONJ\txpos: IN\tfeats: _\n", "word: Northrop\tupos: PROPN\txpos: NNP\tfeats: Number=Sing\n", "word: 's\tupos: PART\txpos: POS\tfeats: _\n", "word: offering\tupos: NOUN\txpos: NN\tfeats: Number=Sing\n", "word: was\tupos: AUX\txpos: VBD\tfeats: Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin\n", "word: on\tupos: ADP\txpos: IN\tfeats: _\n", "word: par\tupos: NOUN\txpos: NN\tfeats: Number=Sing\n", "word: with\tupos: ADP\txpos: IN\tfeats: _\n", "word: Lockheed\tupos: PROPN\txpos: NNP\tfeats: Number=Sing\n", "word: 's\tupos: PART\txpos: POS\tfeats: _\n", "word: ,\tupos: PUNCT\txpos: ,\tfeats: _\n", "word: if\tupos: SCONJ\txpos: IN\tfeats: _\n", "word: not\tupos: PART\txpos: RB\tfeats: _\n", "word: superior\tupos: ADJ\txpos: JJ\tfeats: Degree=Pos\n", "word: .\tupos: PUNCT\txpos: .\tfeats: _\n" ] } ], "source": [ "print(*[f'word: {word.text}\\tupos: {word.upos}\\txpos: {word.xpos}\\tfeats: {word.feats if word.feats else \"_\"}' for sent in doc.sentences for word in sent.words], sep='\\n')" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "word: The \tlemma: the\n", "word: pilot \tlemma: pilot\n", "word: had \tlemma: have\n", "word: arthritis \tlemma: arthritis\n", "word: . \tlemma: .\n", "word: What \tlemma: what\n", "word: 's \tlemma: be\n", "word: so \tlemma: so\n", "word: important \tlemma: important\n", "word: to \tlemma: to\n", "word: underline \tlemma: underline\n", "word: is \tlemma: be\n", "word: that \tlemma: that\n", "word: Metz \tlemma: Metz\n", "word: worked \tlemma: work\n", "word: for \tlemma: for\n", "word: both \tlemma: both\n", "word: Northrop \tlemma: Northrop\n", "word: and \tlemma: and\n", "word: Lockheed \tlemma: Lockheed\n", "word: Martin \tlemma: Martin\n", "word: in \tlemma: in\n", "word: New \tlemma: New\n", "word: York \tlemma: York\n", "word: City \tlemma: City\n", "word: and \tlemma: and\n", "word: is \tlemma: be\n", "word: not \tlemma: not\n", "word: known \tlemma: know\n", "word: for \tlemma: for\n", "word: hyperbole \tlemma: hyperbole\n", "word: . 
\tlemma: .\n", "word: Yet \tlemma: yet\n", "word: even \tlemma: even\n", "word: after \tlemma: after\n", "word: flying \tlemma: fly\n", "word: the \tlemma: the\n", "word: pre-production \tlemma: pre-production\n", "word: F \tlemma: F\n", "word: - \tlemma: -\n", "word: 22 \tlemma: 22\n", "word: , \tlemma: ,\n", "word: a \tlemma: a\n", "word: far \tlemma: far\n", "word: more \tlemma: more\n", "word: mature \tlemma: mature\n", "word: machine \tlemma: machine\n", "word: than \tlemma: than\n", "word: the \tlemma: the\n", "word: YF \tlemma: YF\n", "word: - \tlemma: -\n", "word: 23 \tlemma: 23\n", "word: ever \tlemma: ever\n", "word: was \tlemma: be\n", "word: , \tlemma: ,\n", "word: he \tlemma: he\n", "word: makes \tlemma: make\n", "word: it \tlemma: it\n", "word: quite \tlemma: quite\n", "word: clear \tlemma: clear\n", "word: that \tlemma: that\n", "word: Northrop \tlemma: Northrop\n", "word: 's \tlemma: 's\n", "word: offering \tlemma: offering\n", "word: was \tlemma: be\n", "word: on \tlemma: on\n", "word: par \tlemma: par\n", "word: with \tlemma: with\n", "word: Lockheed \tlemma: Lockheed\n", "word: 's \tlemma: 's\n", "word: , \tlemma: ,\n", "word: if \tlemma: if\n", "word: not \tlemma: not\n", "word: superior \tlemma: superior\n", "word: . \tlemma: .\n" ] } ], "source": [ "print(*[f'word: {word.text+\" \"}\\tlemma: {word.lemma}' for sent in doc.sentences for word in sent.words], sep='\\n')" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(ROOT (S (NP (DT The) (NN pilot)) (VP (VBD had) (NP (NN arthritis))) (. .)))\n", "(ROOT (S (SBAR (WHNP (WP What)) (S (VP (VBZ 's) (ADJP (RB so) (JJ important) (SBAR (S (VP (TO to) (VP (VB underline))))))))) (VP (VBZ is) (SBAR (IN that) (S (NP (NNP Metz)) (VP (VP (VBD worked) (PP (IN for) (NP (CC both) (NP (NNP Northrop)) (CC and) (NP (NNP Lockheed) (NNP Martin)))) (PP (IN in) (NP (NML (NNP New) (NNP York)) (NNP City)))) (CC and) (VP (VBZ is) (RB not) (VP (VBN known) (PP (IN for) (NP (NN hyperbole))))))))) (. .)))\n", "(ROOT (S (CC Yet) (PP (ADVP (RB even)) (IN after) (S (VP (VBG flying) (NP (DT the) (NN pre-production) (NNP F) (HYPH -) (CD 22))))) (, ,) (NP (NP (DT a) (ADJP (ADVP (RB far) (RBR more)) (JJ mature)) (NN machine)) (PP (IN than) (NP (DT the) (NNP YF) (HYPH -) (CD 23)))) (ADVP (RB ever)) (VP (VBD was)) (, ,) (NP (NP (PRP he))) (VP (VBZ makes) (S (NP (NP (PRP it))) (ADJP (RB quite) (JJ clear)) (SBAR (IN that) (S (NP (NP (NNP Northrop) (POS 's)) (NN offering)) (VP (VBD was) (PP (IN on) (NP (NN par))) (PP (IN with) (ADJP (NP (NNP Lockheed) (POS 's)) (, ,) (SBAR (IN if) (FRAG (RB not) (ADJP (JJ superior))))))))))) (. 
.)))\n" ] } ], "source": [ "for sentence in doc.sentences:\n", " print(sentence.constituency)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "entity: arthritis\ttype: DISEASE\n" ] } ], "source": [ "print(*[f'entity: {ent.text}\\ttype: {ent.type}' for ent in doc.ents], sep='\\n')" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "token: The\tner: O\n", "token: pilot\tner: O\n", "token: had\tner: O\n", "token: arthritis\tner: S-DISEASE\n", "token: .\tner: O\n", "token: What\tner: O\n", "token: 's\tner: O\n", "token: so\tner: O\n", "token: important\tner: O\n", "token: to\tner: O\n", "token: underline\tner: O\n", "token: is\tner: O\n", "token: that\tner: O\n", "token: Metz\tner: S-ORG\n", "token: worked\tner: O\n", "token: for\tner: O\n", "token: both\tner: O\n", "token: Northrop\tner: S-ORG\n", "token: and\tner: O\n", "token: Lockheed\tner: B-ORG\n", "token: Martin\tner: E-ORG\n", "token: in\tner: O\n", "token: New\tner: B-GPE\n", "token: York\tner: I-GPE\n", "token: City\tner: E-GPE\n", "token: and\tner: O\n", "token: is\tner: O\n", "token: not\tner: O\n", "token: known\tner: O\n", "token: for\tner: O\n", "token: hyperbole\tner: O\n", "token: .\tner: O\n", "token: Yet\tner: O\n", "token: even\tner: O\n", "token: after\tner: O\n", "token: flying\tner: O\n", "token: the\tner: O\n", "token: pre-production\tner: O\n", "token: F\tner: B-PRODUCT\n", "token: -\tner: I-PRODUCT\n", "token: 22\tner: E-PRODUCT\n", "token: ,\tner: O\n", "token: a\tner: O\n", "token: far\tner: O\n", "token: more\tner: O\n", "token: mature\tner: O\n", "token: machine\tner: O\n", "token: than\tner: O\n", "token: the\tner: B-PRODUCT\n", "token: YF\tner: I-PRODUCT\n", "token: -\tner: I-PRODUCT\n", "token: 23\tner: E-PRODUCT\n", "token: ever\tner: O\n", "token: was\tner: O\n", "token: ,\tner: O\n", "token: he\tner: O\n", "token: makes\tner: O\n", "token: it\tner: O\n", "token: quite\tner: O\n", "token: clear\tner: O\n", "token: that\tner: O\n", "token: Northrop\tner: S-ORG\n", "token: 's\tner: O\n", "token: offering\tner: O\n", "token: was\tner: O\n", "token: on\tner: O\n", "token: par\tner: O\n", "token: with\tner: O\n", "token: Lockheed\tner: S-ORG\n", "token: 's\tner: O\n", "token: ,\tner: O\n", "token: if\tner: O\n", "token: not\tner: O\n", "token: superior\tner: O\n", "token: .\tner: O\n" ] } ], "source": [ "print(*[f'token: {token.text}\\tner: {token.ner}' for sent in doc.sentences for token in sent.tokens], sep='\\n')" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 -> 0\n", "1 -> 2\n", "2 -> 0\n" ] } ], "source": [ "for i, sentence in enumerate(doc.sentences):\n", " print(\"%d -> %d\" % (i, sentence.sentiment))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Language ID" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "61e8608ae99e484aa2e2c59cc9d065f6", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.5.0.json: 0%| …" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "2023-09-20 17:34:37 INFO: Downloading default packages for language: multilingual (multilingual) ...\n", "2023-09-20 
17:34:37 INFO: File exists: C:\\Users\\damir\\stanza_resources\\multilingual\\default.zip\n", "2023-09-20 17:34:37 INFO: Finished downloading models and saved to C:\\Users\\damir\\stanza_resources.\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "879099071f2545ec9e93245b3e8c6fd6", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.5.0.json: 0%| …" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "2023-09-20 17:34:38 INFO: Downloading default packages for language: en (English) ...\n", "2023-09-20 17:34:38 INFO: File exists: C:\\Users\\damir\\stanza_resources\\en\\default.zip\n", "2023-09-20 17:34:42 INFO: Finished downloading models and saved to C:\\Users\\damir\\stanza_resources.\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "83c23cc5a77946dc98ee6c66c54603f1", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.5.0.json: 0%| …" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "2023-09-20 17:34:42 INFO: Downloading default packages for language: de (German) ...\n", "2023-09-20 17:34:43 INFO: File exists: C:\\Users\\damir\\stanza_resources\\de\\default.zip\n", "2023-09-20 17:34:47 INFO: Finished downloading models and saved to C:\\Users\\damir\\stanza_resources.\n" ] } ], "source": [ "stanza.download(lang=\"multilingual\")\n", "stanza.download(lang=\"en\")\n", "# stanza.download(lang=\"fr\")\n", "stanza.download(lang=\"de\")" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2023-09-20 17:36:07 INFO: Checking for updates to resources.json in case models have been updated. 
Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "aabdb7f579ca4d089cd326086aafd80b", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.5.0.json: 0%| …" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "2023-09-20 17:36:07 INFO: Loading these models for language: multilingual ():\n", "=======================\n", "| Processor | Package |\n", "-----------------------\n", "| langid | ud |\n", "=======================\n", "\n", "2023-09-20 17:36:07 INFO: Using device: cuda\n", "2023-09-20 17:36:07 INFO: Loading: langid\n", "2023-09-20 17:36:07 INFO: Done loading processors!\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Hello world.\ten\n", "Hallo, Welt!\tit\n" ] } ], "source": [ "nlp_langid = Pipeline(lang=\"multilingual\", processors=\"langid\")\n", "docs = [\"Hello world.\", \"Hallo, Welt!\"]\n", "docs = [Document([], text=text) for text in docs]\n", "nlp_langid(docs)\n", "print(\"\\n\".join(f\"{doc.text}\\t{doc.lang}\" for doc in docs))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that language identification on very short segments can be unreliable: in the output above, the German greeting \"Hallo, Welt!\" is misidentified as Italian (`it`). We also assign the language-identification pipeline to its own variable `nlp_langid`, so that it does not overwrite the English pipeline in `nlp`, which we continue to use below." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Processing Dependency Parse Trees" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I wrote the following function to convert the [Stanza](https://stanfordnlp.github.io/stanza/) dependency tree data structure into a format compatible with [spaCy's Displacy](https://spacy.io/usage/visualizers), so that dependency trees can be visualized using [spaCy's](https://spacy.io/) excellent visualizer:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "def get_stanza_dep_displacy_manual(doc):\n", "    res = []\n", "    for sentence in doc.sentences:\n", "        words = []\n", "        arcs = []\n", "        for w in sentence.words:\n", "            # Displacy expects one entry per word with its text and tag\n", "            words.append({\"text\": w.text, \"tag\": w.upos})\n", "            # the root has no incoming arc\n", "            if w.deprel == \"root\":\n", "                continue\n", "            # Displacy requires start < end; the arrow direction is encoded separately\n", "            start = w.head - 1\n", "            end = w.id - 1\n", "            if start < end:\n", "                arcs.append({\"start\": start, \"end\": end, \"label\": w.deprel, \"dir\": \"right\"})\n", "            else:\n", "                arcs.append({\"start\": end, \"end\": start, \"label\": w.deprel, \"dir\": \"left\"})\n", "        res.append({\"words\": words, \"arcs\": arcs})\n", "    return res" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can generate an annotation object with [Stanza](https://stanfordnlp.github.io/stanza/), similarly to [spaCy's](https://spacy.io/) approach, by submitting a sentence or text segment to the NLP pipeline that we specified above and assigned to the `nlp` variable:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "doc = nlp(\"John loves to read books and Mary newspapers.\")" ] }
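, { "cell_type": "markdown", "metadata": {}, "source": [ "Before converting and rendering the tree, we can inspect the dependency relations directly. The following minimal sketch prints one word per line together with its head word and relation; `word.head` holds the 1-based index of the head, with 0 marking the root:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Print dependency triples: each word, its head word, and the relation\n", "for sent in doc.sentences:\n", "    for word in sent.words:\n", "        head = sent.words[word.head - 1].text if word.head > 0 else \"root\"\n", "        print(f\"{word.id}\\t{word.text}\\t{head}\\t{word.deprel}\")" ] }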
[ "\n", "\n", " John\n", " PROPN\n", "\n", "\n", "\n", " loves\n", " VERB\n", "\n", "\n", "\n", " to\n", " PART\n", "\n", "\n", "\n", " read\n", " VERB\n", "\n", "\n", "\n", " books\n", " NOUN\n", "\n", "\n", "\n", " and\n", " CCONJ\n", "\n", "\n", "\n", " Mary\n", " PROPN\n", "\n", "\n", "\n", " newspapers\n", " NOUN\n", "\n", "\n", "\n", " .\n", " PUNCT\n", "\n", "\n", "\n", " \n", " \n", " nsubj\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " mark\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " xcomp\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " obj\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " cc\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " compound\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " conj\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " punct\n", " \n", " \n", "\n", "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "displacy.render(res, style=\"dep\", manual=True, options={\"compact\":False, \"distance\":110})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Format - CoNLL" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "from stanza.utils.conll import CoNLL" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "CoNLL.write_doc2conll(doc, \"output.conllu\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**(C) 2023-2024 by [Damir Cavar](http://damir.cavar.me/) <>**" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.1" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }