{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Stanza Tutorial\n", "\n", "(C) 2023-2025 by [Damir Cavar](http://damir.cavar.me/)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Version:** 1.2, January 2025" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Download:** This and various other Jupyter notebooks are available from my [GitHub repo](https://github.com/dcavar/python-tutorial-for-ipython)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Prerequisites:**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install -U stanza" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To install [spaCy](https://spacy.io/) follow the instructions on the [Install spaCy page](https://spacy.io/usage)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install -U pip setuptools wheel" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following installation of spaCy is ideal for my environment, i.e., using a GPU and CUDA 12.x. See the [spaCy homepage](https://spacy.io/usage) for detailed installation instructions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install -U 'spacy[cuda12x,transformers,lookups,ja]'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a tutorial related to the [L645 Advanced Natural Language Processing](http://damir.cavar.me/l645/) course in Fall 2023 at Indiana University. The following tutorial assumes that you are using a newer distribution of [Python 3.x](https://python.org/) and [Stanza](https://stanfordnlp.github.io/stanza/) 1.5.1 or newer." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook assumes that you have set up [Stanza](https://stanfordnlp.github.io/stanza/) on your computer with your [Python](https://python.org/) distribution. Follow the instructions on the [Stanza](https://stanfordnlp.github.io/stanza/) installation page to set up a working environment for the following code. The code will also require that you are online and that the specific language models can be downloaded and installed." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Loading the [Stanza](https://stanfordnlp.github.io/stanza/) module and [spaCy's Displacy](https://spacy.io/usage/visualizers) for visualization:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import stanza\n", "from stanza.models.common.doc import Document\n", "from stanza.pipeline.core import Pipeline\n", "from spacy import displacy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following code will load the English language model for [Stanza](https://stanfordnlp.github.io/stanza/):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "stanza.download('de')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can configure the [Stanza](https://stanfordnlp.github.io/stanza/) pipeline to contain all desired linguistic annotation modules. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "We can configure the [Stanza](https://stanfordnlp.github.io/stanza/) pipeline to contain all desired linguistic annotation modules. In this case we use:\n", "- tokenizer\n", "- multi-word token expander\n", "- part-of-speech tagger\n", "- lemmatizer\n", "- named entity recognizer\n", "- dependency parser\n", "- constituency parser\n", "- sentiment classifier" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# use the default German models for all processors\n", "nlp = stanza.Pipeline('de', processors='tokenize,mwt,pos,lemma,ner,depparse,constituency,sentiment', use_gpu=False, download_method=\"reuse_resources\")" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# roughly: 'As for gummy bears, I have not eaten any green ones yet.'\n", "doc = nlp(\"Gummibärchen habe ich grüne noch keine gegessen.\")\n", "for i, sentence in enumerate(doc.sentences):\n", "    print(f'====== Sentence {i+1} tokens =======')\n", "    print(*[f'id: {token.id}\\ttext: {token.text}' for token in sentence.tokens], sep='\\n')" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(*[f'word: {word.text}\\tupos: {word.upos}\\txpos: {word.xpos}\\tfeats: {word.feats if word.feats else \"_\"}' for sent in doc.sentences for word in sent.words], sep='\\n')" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(*[f'word: {word.text}\\tlemma: {word.lemma}' for sent in doc.sentences for word in sent.words], sep='\\n')" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for sentence in doc.sentences:\n", "    print(sentence.constituency)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(*[f'entity: {ent.text}\\ttype: {ent.type}' for ent in doc.ents], sep='\\n')" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(*[f'token: {token.text}\\tner: {token.ner}' for sent in doc.sentences for token in sent.tokens], sep='\\n')" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# sentiment scores: 0 = negative, 1 = neutral, 2 = positive\n", "for i, sentence in enumerate(doc.sentences):\n", "    print(f'{i} -> {sentence.sentiment}')" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Language ID" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "stanza.download(lang=\"multilingual\")\n", "stanza.download(lang=\"en\")\n", "# stanza.download(lang=\"fr\")\n", "stanza.download(lang=\"de\")" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nlp = Pipeline(lang=\"multilingual\", processors=\"langid\")\n", "docs = [\"Hello world.\", \"Hallo, Welt!\", \"Ciao mondo!\", \"Hola mundo!\"]\n", "docs = [Document([], text=text) for text in docs]\n", "nlp(docs)\n", "print(\"\\n\".join(f\"{doc.text}\\t{doc.lang}\" for doc in docs))" ] },
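{ "cell_type": "markdown", "metadata": {}, "source": [ "One practical use of language identification is routing each text to a language-specific pipeline. The following is a minimal sketch, not part of the original tutorial: the function name `analyze` and the `lang_pipelines` cache are illustrative, and the sketch assumes that the models for the detected languages (here English and German, downloaded above) are available locally:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# keep a separate handle for the language ID pipeline\n", "langid_nlp = Pipeline(lang=\"multilingual\", processors=\"langid\")\n", "# cache one pipeline per detected language\n", "lang_pipelines = {}\n", "\n", "def analyze(text):\n", "    # run language identification on the raw text\n", "    doc = Document([], text=text)\n", "    langid_nlp([doc])\n", "    lang = doc.lang\n", "    # build (and cache) a pipeline for that language on first use\n", "    if lang not in lang_pipelines:\n", "        lang_pipelines[lang] = stanza.Pipeline(lang=lang, processors='tokenize,pos,lemma', download_method=\"reuse_resources\")\n", "    return lang, lang_pipelines[lang](text)\n", "\n", "lang, result = analyze(\"Hallo, Welt!\")\n", "print(lang, [(w.text, w.lemma) for w in result.sentences[0].words])" ] },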
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Processing Dependency Parse Trees" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "I wrote the following function to convert the [Stanza](https://stanfordnlp.github.io/stanza/) dependency tree data structure into a [displaCy](https://spacy.io/usage/visualizers)-compatible data structure for visualizing dependency trees with [spaCy's](https://spacy.io/) excellent visualizer:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_stanza_dep_displacy_manual(doc):\n", "    res = []\n", "    for x in doc.sentences:\n", "        words = []\n", "        arcs = []\n", "        for w in x.words:\n", "            words.append({\"text\": w.text, \"tag\": w.upos})\n", "            # the root word has no incoming arc\n", "            if w.deprel == \"root\":\n", "                continue\n", "            start = w.head - 1\n", "            end = w.id - 1\n", "            # displaCy expects start < end and an explicit arc direction\n", "            if start < end:\n", "                arcs.append({\"start\": start, \"end\": end, \"label\": w.deprel, \"dir\": \"right\"})\n", "            else:\n", "                arcs.append({\"start\": end, \"end\": start, \"label\": w.deprel, \"dir\": \"left\"})\n", "        res.append({\"words\": words, \"arcs\": arcs})\n", "    return res" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We can generate an annotation object with [Stanza](https://stanfordnlp.github.io/stanza/), similarly to [spaCy's](https://spacy.io/) approach, by submitting a sentence or text segment to the NLP pipeline. Since the `nlp` variable was reassigned to the language ID pipeline above, we first recreate the German pipeline:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nlp = stanza.Pipeline('de', processors='tokenize,mwt,pos,lemma,ner,depparse,constituency,sentiment', use_gpu=False, download_method=\"reuse_resources\")\n", "doc = nlp(\"Gummibärchen habe ich grüne noch keine gegessen.\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We can now generate the [spaCy](https://spacy.io/)-compatible data format from the dependency tree so that we can visualize it:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "res = get_stanza_dep_displacy_manual(doc)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The rendering can be achieved using the [displaCy](https://spacy.io/usage/visualizers) call; `manual=True` is required because we pass pre-formatted dictionaries rather than spaCy `Doc` objects:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "displacy.render(res, style=\"dep\", manual=True, options={\"compact\": False, \"distance\": 110})" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Data Format - CoNLL" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from stanza.utils.conll import CoNLL" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "CoNLL.write_doc2conll(doc, \"output.conllu\")" ] },
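{ "cell_type": "markdown", "metadata": {}, "source": [ "A CoNLL-U file can also be read back into a [Stanza](https://stanfordnlp.github.io/stanza/) `Document`. The following is a minimal sketch, assuming the `CoNLL.conll2doc` reader available in recent Stanza versions; it loads the file written above and prints a few annotation layers:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# load the CoNLL-U file written above back into a Document\n", "doc_from_file = CoNLL.conll2doc(\"output.conllu\")\n", "print(*[f'{word.text}\\t{word.lemma}\\t{word.deprel}' for sent in doc_from_file.sentences for word in sent.words], sep='\\n')" ] },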
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Visualization using PyPlot" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "stanza.download('en')" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nlp = stanza.Pipeline(lang='en', processors='tokenize,pos,lemma,depparse,constituency', use_gpu=True)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "doc = nlp(\"I saw the man with the binoculars.\")" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import networkx as nx\n", "import matplotlib.pyplot as plt" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "G = nx.DiGraph()" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# note: word ids restart at 1 in every sentence, so this simple graph assumes a single-sentence document\n", "for sentence in doc.sentences:\n", "    # add a node for each word\n", "    for word in sentence.words:\n", "        G.add_node(word.id, label=word.text)\n", "\n", "    # add edges based on dependency relations\n", "    for word in sentence.words:\n", "        if word.head > 0:  # not the root\n", "            G.add_edge(word.head, word.id, label=word.deprel)\n", "        else:  # handle the root word: connect it to a virtual root node\n", "            G.add_node(0, label=\"ROOT\")\n", "            G.add_edge(0, word.id, label=\"root\")" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# spring_layout is non-deterministic; rerun the cell for a different arrangement\n", "pos = nx.spring_layout(G)\n", "nx.draw(G, pos, with_labels=True, labels=nx.get_node_attributes(G, 'label'))\n", "nx.draw_networkx_edge_labels(G, pos, edge_labels=nx.get_edge_attributes(G, 'label'))\n", "plt.show()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Visualize the Constituent Parse Tree with NLTK" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from nltk import Tree" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "constituent_tree_string = str(doc.sentences[0].constituency)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nltk_tree = Tree.fromstring(constituent_tree_string)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# opens a separate Tk window; use nltk_tree.pretty_print() for inline text output\n", "nltk_tree.draw()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "**(C) 2023-2025 by [Damir Cavar](http://damir.cavar.me/)**" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.3" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }