# Stanza Tutorial

(C) 2023-2024 by [Damir Cavar](http://damir.cavar.me/)

**Version:** 1.1, January 2024

**Download:** This and various other Jupyter notebooks are available from my [GitHub repo](https://github.com/dcavar/python-tutorial-for-ipython).

**Prerequisites:**

In [None]:
!pip install -U stanza

[spaCy](https://spacy.io/) is only needed here for its visualizer. To install it, follow the instructions on the [Install spaCy page](https://spacy.io/usage).

In [None]:
!pip install -U pip setuptools wheel

The following spaCy installation matches my environment, i.e., a GPU with CUDA 12.x. Adjust the extras in brackets to your own setup; see the [spaCy homepage](https://spacy.io/usage) for detailed installation instructions.

In [None]:
!pip install -U 'spacy[cuda12x,transformers,lookups,ja]'

## Introduction

This is a tutorial related to the [L645 Advanced Natural Language Processing](http://damir.cavar.me/l645/) course in Fall 2023 at Indiana University. The following tutorial assumes that you are using a newer distribution of [Python 3.x](https://python.org/) and [Stanza](https://stanfordnlp.github.io/stanza/) 1.5.1 or newer.

This notebook assumes that you have set up [Stanza](https://stanfordnlp.github.io/stanza/) with your [Python](https://python.org/) distribution. If you have not, follow the instructions on the [Stanza](https://stanfordnlp.github.io/stanza/) installation page first. The code also requires an internet connection, because the language models are downloaded and installed when the cells run.

We load the [Stanza](https://stanfordnlp.github.io/stanza/) module and [spaCy's displaCy](https://spacy.io/usage/visualizers) for visualization:

In [1]:
import stanza
from stanza.models.common.doc import Document
from stanza.pipeline.core import Pipeline
from spacy import displacy

The following code will load the English language model for [Stanza](https://stanfordnlp.github.io/stanza/):

In [2]:
stanza.download('en')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.7.0.json: 0%| …

2024-01-23 12:31:57 INFO: Downloading default packages for language: en (English) ...


Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.7.0/models/default.zip: 0%| | 0…

2024-01-23 12:32:13 INFO: Finished downloading models and saved to /home/damir/stanza_resources.


We can configure the [Stanza](https://stanfordnlp.github.io/stanza/) pipeline to contain all desired linguistic annotation modules. In this case we use:
- tokenizer
- multi-word token (MWT) expander
- part-of-speech tagger
- lemmatizer
- named entity recognizer
- dependency parser
- constituency parser
- sentiment classifier

In [3]:
nlp = stanza.Pipeline('en', processors='tokenize,mwt,pos,lemma,ner,depparse,constituency,sentiment', package={"ner": ["ncbi_disease", "ontonotes"]}, use_gpu=False, download_method="reuse_resources")



Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.7.0/models/ner/ncbi_disease.pt: 0%| …

Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.7.0/models/forward_charlm/pubmed.pt: 0%|…

Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.7.0/models/backward_charlm/pubmed.pt: 0%…

Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.7.0/models/pretrain/biomed.pt: 0%| …

2024-01-23 12:32:33 INFO: Loading these models for language: en (English):
| Processor | Package |
--------------------------------------
| tokenize | combined |
| mwt | combined |
| pos | combined_charlm |
| lemma | combined_nocharlm |
| constituency | ptb3-revised_charlm |
| depparse | combined_charlm |
| sentiment | sstplus |
| ner | ncbi_disease |

2024-01-23 12:32:33 INFO: Using device: cpu
2024-01-23 12:32:33 INFO: Loading: tokenize
2024-01-23 12:32:33 INFO: Loading: mwt
2024-01-23 12:32:33 INFO: Loading: pos
2024-01-23 12:32:34 INFO: Loading: lemma
2024-01-23 12:32:34 INFO: Loading: constituency
2024-01-23 12:32:34 INFO: Loading: depparse
2024-01-23 12:32:34 INFO: Loading: sentiment
2024-01-23 12:32:34 INFO: Loading: ner
2024-01-23 12:32:35 INFO: Done loading processors!
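
The order and combination of processors matters, because most of them build on the output of earlier ones. The following is a minimal sketch of a sanity check for a `processors` string. The `REQUIRES` table and the helper `missing_prerequisites` are hypothetical; the table reflects my reading of the Stanza documentation and should be verified against the version you have installed.

```python
# Assumed prerequisite table (not read from Stanza itself); verify against
# the documentation of your installed Stanza version.
REQUIRES = {
    "tokenize": [],
    "mwt": ["tokenize"],
    "pos": ["tokenize", "mwt"],
    "lemma": ["tokenize", "mwt", "pos"],
    "ner": ["tokenize", "mwt"],
    "depparse": ["tokenize", "mwt", "pos", "lemma"],
    "constituency": ["tokenize", "mwt", "pos"],
    "sentiment": ["tokenize", "mwt"],
}


def missing_prerequisites(processors):
    """Return {processor: [missing prerequisites]} for a comma-separated list."""
    requested = [p.strip() for p in processors.split(",") if p.strip()]
    missing = {}
    for proc in requested:
        lacking = [r for r in REQUIRES.get(proc, []) if r not in requested]
        if lacking:
            missing[proc] = lacking
    return missing


# The processor string used in this notebook has no gaps under these assumptions.
print(missing_prerequisites("tokenize,mwt,pos,lemma,ner,depparse,constituency,sentiment"))
# A dependency parser on its own would lack several earlier stages.
print(missing_prerequisites("tokenize,depparse"))
```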


In [4]:
doc = nlp("The pilot had arthritis. What's so important to underline is that Metz worked for both Northrop and Lockheed Martin in New York City and is not known for hyperbole. Yet even after flying the pre-production F-22, a far more mature machine than the YF-23 ever was, he makes it quite clear that Northrop's offering was on par with Lockheed's, if not superior.")
for i, sentence in enumerate(doc.sentences):
    print(f'====== Sentence {i+1} tokens =======')
    print(*[f'id: {token.id}\ttext: {token.text}' for token in sentence.tokens], sep='\n')

id: (1,)	text: The
id: (2,)	text: pilot
id: (3,)	text: had
id: (4,)	text: arthritis
id: (5,)	text: .
id: (1, 2)	text: What's
id: (3,)	text: so
id: (4,)	text: important
id: (5,)	text: to
id: (6,)	text: underline
id: (7,)	text: is
id: (8,)	text: that
id: (9,)	text: Metz
id: (10,)	text: worked
id: (11,)	text: for
id: (12,)	text: both
id: (13,)	text: Northrop
id: (14,)	text: and
id: (15,)	text: Lockheed
id: (16,)	text: Martin
id: (17,)	text: in
id: (18,)	text: New
id: (19,)	text: York
id: (20,)	text: City
id: (21,)	text: and
id: (22,)	text: is
id: (23,)	text: not
id: (24,)	text: known
id: (25,)	text: for
id: (26,)	text: hyperbole
id: (27,)	text: .
id: (1,)	text: Yet
id: (2,)	text: even
id: (3,)	text: after
id: (4,)	text: flying
id: (5,)	text: the
id: (6,)	text: pre-production
id: (7,)	text: F
id: (8,)	text: -
id: (9,)	text: 22
id: (10,)	text: ,
id: (11,)	text: a
id: (12,)	text: far
id: (13,)	text: more
id: (14,)	text: mature
id: (15,)	text: machine
id: (16,)	text: than
id: (17,)	text: the


In [5]:
print(*[f'word: {word.text}\tupos: {word.upos}\txpos: {word.xpos}\tfeats: {word.feats if word.feats else "_"}' for sent in doc.sentences for word in sent.words], sep='\n')

word: The	upos: DET	xpos: DT	feats: Definite=Def|PronType=Art
word: pilot	upos: NOUN	xpos: NN	feats: Number=Sing
word: had	upos: VERB	xpos: VBD	feats: Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin
word: arthritis	upos: NOUN	xpos: NN	feats: Number=Sing
word: .	upos: PUNCT	xpos: .	feats: _
word: What	upos: PRON	xpos: WP	feats: PronType=Int
word: 's	upos: AUX	xpos: VBZ	feats: Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
word: so	upos: ADV	xpos: RB	feats: _
word: important	upos: ADJ	xpos: JJ	feats: Degree=Pos
word: to	upos: PART	xpos: TO	feats: _
word: underline	upos: VERB	xpos: VB	feats: VerbForm=Inf
word: is	upos: AUX	xpos: VBZ	feats: Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
word: that	upos: SCONJ	xpos: IN	feats: _
word: Metz	upos: PROPN	xpos: NNP	feats: Number=Sing
word: worked	upos: VERB	xpos: VBD	feats: Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin
word: for	upos: ADP	xpos: IN	feats: _
word: both	upos: CCONJ	xpos: CC	feats: _
word: Northrop	upos: 
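
The `feats` value is a single string of `Name=Value` pairs separated by `|`. If individual features are needed, the string can be split into a dictionary. This is a minimal sketch; `parse_feats` is a hypothetical helper, not a Stanza function.

```python
def parse_feats(feats):
    """Split a 'Name=Value|Name=Value' feature string into a dict."""
    if not feats or feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# Feature string copied from the output for "had" above.
had = parse_feats("Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin")
print(had["Tense"], had["VerbForm"])
print(parse_feats(None))
```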

In [6]:
print(*[f'word: {word.text+" "}\tlemma: {word.lemma}' for sent in doc.sentences for word in sent.words], sep='\n')

word: The 	lemma: the
word: pilot 	lemma: pilot
word: had 	lemma: have
word: arthritis 	lemma: arthritis
word: . 	lemma: .
word: What 	lemma: what
word: 's 	lemma: be
word: so 	lemma: so
word: important 	lemma: important
word: to 	lemma: to
word: underline 	lemma: underline
word: is 	lemma: be
word: that 	lemma: that
word: Metz 	lemma: Metz
word: worked 	lemma: work
word: for 	lemma: for
word: both 	lemma: both
word: Northrop 	lemma: Northrop
word: and 	lemma: and
word: Lockheed 	lemma: Lockheed
word: Martin 	lemma: Martin
word: in 	lemma: in
word: New 	lemma: New
word: York 	lemma: York
word: City 	lemma: City
word: and 	lemma: and
word: is 	lemma: be
word: not 	lemma: not
word: known 	lemma: know
word: for 	lemma: for
word: hyperbole 	lemma: hyperbole
word: . 	lemma: .
word: Yet 	lemma: yet
word: even 	lemma: even
word: after 	lemma: after
word: flying 	lemma: fly
word: the 	lemma: the
word: pre-production 	lemma: pre-production
word: F 	lemma: F
word: - 	lemma: -
word: 22 	lemma: 22

In [8]:
for sentence in doc.sentences:
    print(sentence.constituency)

(ROOT (S (NP (DT The) (NN pilot)) (VP (VBD had) (NP (NN arthritis))) (. .)))
(ROOT (S (SBAR (WHNP (WP What)) (S (VP (VBZ 's) (ADJP (RB so) (JJ important) (SBAR (S (VP (TO to) (VP (VB underline))))))))) (VP (VBZ is) (SBAR (IN that) (S (NP (NNP Metz)) (VP (VP (VBD worked) (PP (IN for) (NP (CC both) (NP (NNP Northrop)) (CC and) (NP (NNP Lockheed) (NNP Martin)))) (PP (IN in) (NP (NML (NNP New) (NNP York)) (NNP City)))) (CC and) (VP (VBZ is) (RB not) (VP (VBN known) (PP (IN for) (NP (NN hyperbole))))))))) (. .)))
(ROOT (S (CC Yet) (PP (ADVP (RB even)) (IN after) (S (VP (VBG flying) (NP (DT the) (NN pre-production) (NNP F) (HYPH -) (CD 22))))) (, ,) (NP (NP (DT a) (ADJP (ADVP (RB far) (RBR more)) (JJ mature)) (NN machine)) (PP (IN than) (NP (DT the) (NNP YF) (HYPH -) (CD 23)))) (ADVP (RB ever)) (VP (VBD was)) (, ,) (NP (NP (PRP he))) (VP (VBZ makes) (S (NP (NP (PRP it))) (ADJP (RB quite) (JJ clear)) (SBAR (IN that) (S (NP (NP (NNP Northrop) (POS 's)) (NN offering)) (VP (VBD was) (PP (IN on) 
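
The printed trees use the bracketed Penn Treebank notation. The following sketch reads such a string into nested tuples and collects all phrases with a given label. It only operates on the printed string; `parse_tree`, `leaves`, and `phrases` are hypothetical helpers written for this illustration.

```python
import re


def parse_tree(text):
    """Parse a bracketed tree string into (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1                      # skip "("
            label = tokens[pos]
            pos += 1
            children = []
            while tokens[pos] != ")":
                children.append(read())
            pos += 1                      # skip ")"
            return (label, children)
        word = tokens[pos]                # a leaf, i.e. a word form
        pos += 1
        return word

    return read()


def leaves(node):
    """Return the word forms under a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [w for child in node[1] for w in leaves(child)]


def phrases(node, label):
    """Collect the text of every constituent carrying the given label."""
    if isinstance(node, str):
        return []
    found = [" ".join(leaves(node))] if node[0] == label else []
    for child in node[1]:
        found.extend(phrases(child, label))
    return found


# Tree string copied from the first sentence above.
tree = parse_tree("(ROOT (S (NP (DT The) (NN pilot)) (VP (VBD had) (NP (NN arthritis))) (. .)))")
print(leaves(tree))
print(phrases(tree, "NP"))
```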

In [7]:
print(*[f'entity: {ent.text}\ttype: {ent.type}' for ent in doc.ents], sep='\n')

entity: arthritis	type: DISEASE


In [57]:
print(*[f'token: {token.text}\tner: {token.ner}' for sent in doc.sentences for token in sent.tokens], sep='\n')

token: The	ner: O
token: pilot	ner: O
token: had	ner: O
token: arthritis	ner: S-DISEASE
token: .	ner: O
token: What	ner: O
token: 's	ner: O
token: so	ner: O
token: important	ner: O
token: to	ner: O
token: underline	ner: O
token: is	ner: O
token: that	ner: O
token: Metz	ner: S-ORG
token: worked	ner: O
token: for	ner: O
token: both	ner: O
token: Northrop	ner: S-ORG
token: and	ner: O
token: Lockheed	ner: B-ORG
token: Martin	ner: E-ORG
token: in	ner: O
token: New	ner: B-GPE
token: York	ner: I-GPE
token: City	ner: E-GPE
token: and	ner: O
token: is	ner: O
token: not	ner: O
token: known	ner: O
token: for	ner: O
token: hyperbole	ner: O
token: .	ner: O
token: Yet	ner: O
token: even	ner: O
token: after	ner: O
token: flying	ner: O
token: the	ner: O
token: pre-production	ner: O
token: F	ner: B-PRODUCT
token: -	ner: I-PRODUCT
token: 22	ner: E-PRODUCT
token: ,	ner: O
token: a	ner: O
token: far	ner: O
token: more	ner: O
token: mature	ner: O
token: machine	ner: O
token: than	ner: O
token: the	ner: B-P
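
The token-level tags follow a BIOES scheme: `B-`, `I-`, and `E-` mark the beginning, inside, and end of a multi-token entity, `S-` marks a single-token entity, and `O` marks tokens outside any entity. The following sketch groups such tags back into entity spans. `bioes_to_spans` is a hypothetical helper, and the input is copied from the output above.

```python
def bioes_to_spans(tokens, tags):
    """Group BIOES-tagged tokens into (entity text, entity type) pairs."""
    spans, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        prefix, _, etype = tag.partition("-")
        if prefix == "S":
            spans.append((token, etype))
            current, current_type = [], None
        elif prefix == "B":
            current, current_type = [token], etype
        elif prefix in ("I", "E") and current and etype == current_type:
            current.append(token)
            if prefix == "E":
                spans.append((" ".join(current), etype))
                current, current_type = [], None
        else:
            # "O" or a malformed sequence: discard any open span.
            current, current_type = [], None
    return spans


tokens = ["Northrop", "and", "Lockheed", "Martin", "in", "New", "York", "City"]
tags = ["S-ORG", "O", "B-ORG", "E-ORG", "O", "B-GPE", "I-GPE", "E-GPE"]
print(bioes_to_spans(tokens, tags))
```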

In [58]:
for i, sentence in enumerate(doc.sentences):
    print("%d -> %d" % (i, sentence.sentiment))

0 -> 0
1 -> 2
2 -> 0
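
The sentiment values are class indices. According to the Stanza documentation, the available models use 0 for negative, 1 for neutral, and 2 for positive. The following sketch maps the scores printed above to readable labels; `SENTIMENT_LABELS` and `label_sentiment` are hypothetical names introduced here.

```python
# Assumed label mapping, taken from the Stanza sentiment documentation.
SENTIMENT_LABELS = {0: "negative", 1: "neutral", 2: "positive"}


def label_sentiment(score):
    """Translate a sentiment class index into a readable label."""
    return SENTIMENT_LABELS.get(score, "unknown")


# Scores copied from the output above.
scores = [0, 2, 0]
for i, score in enumerate(scores):
    print(f"{i} -> {score} ({label_sentiment(score)})")
```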


## Language ID

In [59]:
stanza.download(lang="multilingual")
stanza.download(lang="en")
# stanza.download(lang="fr")
stanza.download(lang="de")

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.5.0.json: 0%| …

2023-09-20 17:34:37 INFO: Downloading default packages for language: multilingual (multilingual) ...
2023-09-20 17:34:37 INFO: File exists: C:\Users\damir\stanza_resources\multilingual\default.zip
2023-09-20 17:34:37 INFO: Finished downloading models and saved to C:\Users\damir\stanza_resources.


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.5.0.json: 0%| …

2023-09-20 17:34:38 INFO: Downloading default packages for language: en (English) ...
2023-09-20 17:34:38 INFO: File exists: C:\Users\damir\stanza_resources\en\default.zip
2023-09-20 17:34:42 INFO: Finished downloading models and saved to C:\Users\damir\stanza_resources.


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.5.0.json: 0%| …

2023-09-20 17:34:42 INFO: Downloading default packages for language: de (German) ...
2023-09-20 17:34:43 INFO: File exists: C:\Users\damir\stanza_resources\de\default.zip
2023-09-20 17:34:47 INFO: Finished downloading models and saved to C:\Users\damir\stanza_resources.


In [61]:
# Use a separate variable so that the English `nlp` pipeline defined above stays available.
langid_nlp = Pipeline(lang="multilingual", processors="langid")
docs = ["Hello world.", "Hallo, Welt!"]
docs = [Document([], text=text) for text in docs]
langid_nlp(docs)
print("\n".join(f"{doc.text}\t{doc.lang}" for doc in docs))

2023-09-20 17:36:07 INFO: Checking for updates to resources.json in case models have been updated. Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.5.0.json: 0%| …

2023-09-20 17:36:07 INFO: Loading these models for language: multilingual ():
| Processor | Package |
-----------------------
| langid | ud |

2023-09-20 17:36:07 INFO: Using device: cuda
2023-09-20 17:36:07 INFO: Loading: langid
2023-09-20 17:36:07 INFO: Done loading processors!


Hello world.	en
Hallo, Welt!	it
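
In the output above, the short German greeting is labeled `it`. Very short inputs give the classifier little evidence. One option that may help when the candidate languages are known in advance is to restrict the predictions with the `langid_lang_subset` option described in the Stanza language identification documentation. The following is a sketch under that assumption; `build_langid_pipeline` and `identify` are hypothetical helpers, and I do not show an output because it depends on the installed models.

```python
def build_langid_pipeline(languages):
    """Create a language-ID pipeline restricted to the given language codes."""
    # Imported here so that defining the helper does not require Stanza yet.
    from stanza.pipeline.core import Pipeline
    return Pipeline(lang="multilingual", processors="langid",
                    langid_lang_subset=list(languages))


def identify(pipeline, texts):
    """Run language ID over raw strings and return (text, language) pairs."""
    from stanza.models.common.doc import Document
    docs = [Document([], text=text) for text in texts]
    pipeline(docs)
    return [(d.text, d.lang) for d in docs]


# Usage (requires the multilingual model downloaded above):
# langid_subset = build_langid_pipeline(["en", "de"])
# print(identify(langid_subset, ["Hello world.", "Hallo, Welt!"]))
```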


## Processing Dependency Parse Trees

I wrote the following function to convert the [Stanza](https://stanfordnlp.github.io/stanza/) dependency tree data structure into a data structure compatible with [spaCy's displaCy](https://spacy.io/usage/visualizers), so that dependency trees can be rendered with [spaCy's](https://spacy.io/) excellent visualizer:

In [9]:
def get_stanza_dep_displacy_manual(doc):
    """Convert a Stanza document into displaCy's manual dependency format."""
    res = []
    for x in doc.sentences:
        words = []
        arcs = []
        for w in x.words:
            words.append({"text": w.text, "tag": w.upos})
            # The root word has no incoming arc.
            if w.deprel == "root":
                continue
            # Stanza word ids are 1-based; displaCy indices are 0-based.
            start = w.head - 1
            end = w.id - 1
            if start < end:
                arcs.append({"start": start, "end": end, "label": w.deprel, "dir": "right"})
            else:
                arcs.append({"start": end, "end": start, "label": w.deprel, "dir": "left"})
        res.append({"words": words, "arcs": arcs})
    return res
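
To make the target format concrete, the following sketch builds the same kind of structure from plain tuples, without running a parser. The rows are hand-annotated for illustration and are not parser output; `rows_to_displacy` is a hypothetical helper that mirrors the arc logic of the function above.

```python
def rows_to_displacy(rows):
    """rows: (id, text, upos, head, deprel) with 1-based ids and head 0 for root."""
    words = [{"text": text, "tag": upos} for _, text, upos, _, _ in rows]
    arcs = []
    for wid, _, _, head, deprel in rows:
        if head == 0:
            continue                      # the root has no incoming arc
        # displaCy expects start < end and encodes the arrow side in "dir".
        start, end = sorted((head - 1, wid - 1))
        direction = "right" if head < wid else "left"
        arcs.append({"start": start, "end": end, "label": deprel, "dir": direction})
    return {"words": words, "arcs": arcs}


# Hand-annotated toy sentence (an assumption for illustration only).
rows = [
    (1, "John", "PROPN", 2, "nsubj"),
    (2, "reads", "VERB", 0, "root"),
    (3, "books", "NOUN", 2, "obj"),
]
parsed = rows_to_displacy(rows)
print(parsed["words"])
print(parsed["arcs"])
```

A list of such dictionaries, one per sentence, is what `displacy.render` accepts when called with `manual=True`.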

As in [spaCy](https://spacy.io/), we generate an annotation object with [Stanza](https://stanfordnlp.github.io/stanza/) by passing a sentence or text segment to the NLP pipeline that we specified above and assigned to the `nlp` variable:

In [16]:
doc = nlp("John loves to read books and Mary newspapers.")

We can now generate the [spaCy](https://spacy.io/)-compatible data format from the dependency tree to be able to visualize it:

In [17]:
res = get_stanza_dep_displacy_manual(doc)

We render the tree with [displaCy](https://spacy.io/usage/visualizers) in manual mode, which accepts the prepared data structure instead of a spaCy document:

In [18]:
displacy.render(res, style="dep", manual=True, options={"compact":False, "distance":110})

## Data Format - CoNLL

In [43]:
from stanza.utils.conll import CoNLL

In [44]:
CoNLL.write_doc2conll(doc, "output.conllu")
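
The written file is in CoNLL-U format: one word per line with ten tab-separated columns, comment lines starting with `#`, and a blank line after each sentence. The following sketch reads such text with plain Python. The sample lines are hand-written for illustration and are not the content of `output.conllu`; `read_conllu` and `CONLLU_FIELDS` are hypothetical names.

```python
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text):
    """Parse CoNLL-U text into a list of sentences, each a list of dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("#"):
            continue                      # skip sentence-level comments
        else:
            current.append(dict(zip(CONLLU_FIELDS, line.split("\t"))))
    if current:
        sentences.append(current)
    return sentences


# Hand-written sample (an assumption, not actual Stanza output).
sample = "\n".join([
    "# text = John reads.",
    "1\tJohn\tJohn\tPROPN\tNNP\tNumber=Sing\t2\tnsubj\t_\t_",
    "2\treads\tread\tVERB\tVBZ\t_\t0\troot\t_\t_",
    "3\t.\t.\tPUNCT\t.\t_\t2\tpunct\t_\t_",
    "",
])
sentences = read_conllu(sample)
print([(w["form"], w["head"], w["deprel"]) for w in sentences[0]])

# To inspect the file written above:
# with open("output.conllu", encoding="utf-8") as f:
#     sentences = read_conllu(f.read())
```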

**(C) 2023-2024 by [Damir Cavar](http://damir.cavar.me/)**