# Named Entity Recognition for Danish

:::{note}
This section, "Working in Languages Beyond English," is co-authored with <a href="http://www.quinndombrowski.com/">Quinn Dombrowski</a>, the Academic Technology Specialist at Stanford University and a leading voice in multilingual digital humanities. I'm grateful to Quinn for helping expand this textbook to serve languages beyond English. 
:::

In this lesson, we're going to learn about a text analysis method called *Named Entity Recognition* (NER) as applied to Danish. This method will help us computationally identify people, places, and things (of various kinds) in a text or collection of texts.

---

## Dataset

The example text for Danish is *Evangelines Genvordigheder: Til Kvinder med rødt Haar* by Elinor Glyn [from Project Gutenberg](http://www.gutenberg.org/ebooks/33632).

**Here's a preview of spaCy's NER tagging of *Evangelines Genvordigheder: Til Kvinder med rødt Haar*.**

If you compare the results to the [English example](Named-Entity-Recognition), you'll notice that the Danish NER is much worse at recognizing entities, and is especially bad at distinguishing different kinds of entities, like ORG vs. LOC vs. PER. You need a lot of examples to train a model to distinguish different entity types; currently, the English model is the only one that does a decent job of it.

You can read more about the [data sources used to train Danish](https://spacy.io/models/da) on the spaCy model page.

In [19]:
displacy.render(document, style="ent")

---

## NER with spaCy
If you've already used the pre-processing notebook for this language, you can skip the steps for installing spaCy and downloading the language model.

### Install spaCy

In [None]:
!pip install -U spacy

### Import Libraries

We're going to import `spacy` and `displacy`, a special spaCy module for visualization.

In [16]:
import spacy
from spacy import displacy
from collections import Counter
import pandas as pd
pd.options.display.max_rows = 600
pd.options.display.max_colwidth = 400

We're also going to import the `Counter` class from the `collections` module for counting people, places, and things, and the `pandas` library for organizing and displaying data (we're also changing the pandas default display settings for the maximum number of rows and the maximum column width).
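As a minimal sketch of what `Counter` does, the example below tallies a short, made-up list of names. The list is illustrative only and is not taken from the model's output.

```python
from collections import Counter

# Hypothetical list of names, standing in for entities pulled from a text
names = ["Mama", "Papa", "Mama", "Christopher", "Mama", "Papa"]

# Counter maps each distinct item to the number of times it appears
name_tally = Counter(names)

# most_common(2) returns the two most frequent (item, count) pairs
top_names = name_tally.most_common(2)
print(top_names)  # [('Mama', 3), ('Papa', 2)]
```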

### Download Language Model

Next we need to download the Danish-language model (`da_core_news_md`), which will be processing and making predictions about our texts. You can read more about the [data sources used to train Danish](https://spacy.io/models/da) on the spaCy model page.

In [1]:
!python -m spacy download da_core_news_md

Collecting da-core-news-md==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/da_core_news_md-3.7.0/da_core_news_md-3.7.0-py3-none-any.whl (42.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42.2/42.2 MB 28.0 MB/s eta 0:00:00
Installing collected packages: da-core-news-md
Successfully installed da-core-news-md-3.7.0
✔ Download and installation successful
You can now load the package via spacy.load('da_core_news_md')


### Load Language Model

Once the model is downloaded, we need to load it. There are two ways to load a spaCy language model.

**1.** We can import the model as a module and then load it from the module.

In [17]:
import da_core_news_md
nlp = da_core_news_md.load()

**2.** We can load the model by name.

In [4]:
#nlp = spacy.load('da_core_news_md')

If you just downloaded the model for the first time, it's advisable to use Option 1. Then you can use the model immediately. Otherwise, you'll likely need to restart your Jupyter kernel (which you can do by clicking Kernel -> Restart Kernel... in the JupyterLab menu).
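If you are unsure whether the model is visible to your current session, a quick check along the following lines may help. This is a sketch using only Python's standard library; the helper name `model_is_importable` is made up for illustration, and whether the check succeeds right after a fresh install can depend on your environment.

```python
import importlib
import importlib.util

def model_is_importable(model_name):
    """Return True if Python can find an installed package with this name."""
    # Clear cached directory listings so recently installed packages are noticed
    importlib.invalidate_caches()
    return importlib.util.find_spec(model_name) is not None

# True if the Danish model package can be imported in this session
print(model_is_importable("da_core_news_md"))
```

If this prints `False` even though the download succeeded, restarting the kernel as described above is the simplest next step.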

## Process Document

We first need to process our `document` with the loaded NLP model. Most of the heavy NLP lifting is done in this line of code.

After processing, the `document` object will contain tons of juicy language data — named entities, sentence boundaries, parts of speech — and the rest of our work will be devoted to accessing this information.

In the cell below, we open and read the example document. Then we run `nlp()` on the text and create our document.

In [18]:
filepath = '../texts/da.txt'
# Read the whole file as UTF-8 so Danish characters (æ, ø, å) load correctly
with open(filepath, encoding='utf-8') as file:
    text = file.read()
document = nlp(text)

## Get Named Entities

All the named entities in our `document` can be found in the `document.ents` property. If we check out `document.ents`, we can see all the entities from the example document.

In [4]:
document.ents

(ELINOR,
 KØBENHAVN,
 BEGYNDELSEN PAA EVANGELINES,
 Eventyrerske,
 Slags Ting,
 Eventyrerske,
 Fru Carruthers,
 Godset,
 Fru Carruthers,
 Fru Carruthers,
 Papa,
 Mama,
 Mama,
 Mama,
 Lord,
 Mama,
 Papa,
 Officer,
 Indien,
 Eventyrerske,
 Carruthers,
 Visitter,
 Slags Katte,
 Fru Carruthers,
 Pjank,
 Godset,
 Diplomat,
 Paris,
 Rusland,
 England,
 Herre,
 Eventyrerske,
 Mænd,
 Liv,
 Hr. Carruthers,
 Sort,
 Brystet paa,
 Fru Carruthers Død,
 Doktor Garrison,
 Masser af Ting,
 Selskabslivet,
 Fru Carruthers,
 Sæsonen,
 Schweiz,
 London,
 Thomas,
 Diamantring,
 Eventyrerske,
 Hr. Carruthers,
 Gudernes Skød,
 Officielt,
 Mænd,
 Bridge,
 Fru
 Carruthers,
 Ex-Ambassadører,
 Korridoren,
 Selskabeligheden,
 London,
 Tennisbold,
 Piger,
 Paris,
 Hr. Carruthers,
 London,
 Fru Carruthers,
 Hr. Carruthers,
 Christopher,
 Christopher,
 Christopher,
 Aarevis,
 Mænd,
 Metal,
 Gud,
 Fru Carruthers,
 Piger,
 Begyndelsen,
 Smaragder,
 Cicely Parkers,
 Præstens,
 Eventyrerske,
 Hr. Carruthers,
 Hr. Carrut

Each of the named entities in `document.ents` contains [more information about itself](https://spacy.io/usage/linguistic-features#accessing), which we can access by iterating through the `document.ents` with a simple `for` loop.

For each `named_entity` in `document.ents`, we will extract the `named_entity` and its corresponding `named_entity.label_`.

In [5]:
for named_entity in document.ents:
    print(named_entity, named_entity.label_)

ELINOR ORG
KØBENHAVN LOC
BEGYNDELSEN PAA EVANGELINES ORG
Eventyrerske MISC
Slags Ting ORG
Eventyrerske ORG
Fru Carruthers PER
Godset PER
Fru Carruthers PER
Fru Carruthers PER
Papa PER
Mama PER
Mama PER
Mama PER
Lord PER
Mama PER
Papa PER
Officer MISC
Indien LOC
Eventyrerske MISC
Carruthers PER
Visitter MISC
Slags Katte ORG
Fru Carruthers PER
Pjank ORG
Godset PER
Diplomat ORG
Paris LOC
Rusland LOC
England LOC
Herre PER
Eventyrerske ORG
Mænd ORG
Liv PER
Hr. Carruthers PER
Sort PER
Brystet paa MISC
Fru Carruthers Død PER
Doktor Garrison PER
Masser af Ting ORG
Selskabslivet LOC
Fru Carruthers PER
Sæsonen LOC
Schweiz LOC
London LOC
Thomas PER
Diamantring MISC
Eventyrerske MISC
Hr. Carruthers PER
Gudernes Skød MISC
Officielt MISC
Mænd ORG
Bridge PER
Fru
Carruthers PER
Ex-Ambassadører MISC
Korridoren LOC
Selskabeligheden LOC
London LOC
Tennisbold MISC
Piger PER
Paris LOC
Hr. Carruthers PER
London LOC
Fru Carruthers PER
Hr. Carruthers PER
Christopher PER
Christopher PER
Christopher PER
Aarevis
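The output above shows the same word receiving different labels: `Eventyrerske` is tagged `MISC` in some places and `ORG` in others. One way to surface such inconsistencies is to group the labels by entity text. The sketch below uses a hypothetical `Entity` stand-in (a `namedtuple` with `.text` and `.label_`) so that it runs without a model; the helper name `labels_by_entity_text` is also made up for illustration.

```python
from collections import Counter, defaultdict, namedtuple

# Stand-in for spaCy entity spans: anything with .text and .label_ attributes works
Entity = namedtuple("Entity", ["text", "label_"])

def labels_by_entity_text(entities):
    """Map each entity text to a Counter of the labels it was given."""
    labels = defaultdict(Counter)
    for entity in entities:
        labels[entity.text][entity.label_] += 1
    return labels

# A few (text, label) pairs copied from the output above
sample_entities = [
    Entity("Eventyrerske", "MISC"),
    Entity("Eventyrerske", "ORG"),
    Entity("Fru Carruthers", "PER"),
    Entity("Fru Carruthers", "PER"),
    Entity("Eventyrerske", "MISC"),
]

# Print only the entity texts that received more than one kind of label
for text, label_counts in labels_by_entity_text(sample_entities).items():
    if len(label_counts) > 1:
        print(text, dict(label_counts))  # Eventyrerske {'MISC': 2, 'ORG': 1}
```

With a loaded model, the same function should accept `document.ents` directly, since spaCy entities expose `.text` and `.label_`.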

To extract just the named entities that have been identified as `PER` (person), we can add a simple `if` statement into the mix:

In [6]:
for named_entity in document.ents:
    if named_entity.label_ == "PER":
        print(named_entity)

Fru Carruthers
Godset
Fru Carruthers
Fru Carruthers
Papa
Mama
Mama
Mama
Lord
Mama
Papa
Carruthers
Fru Carruthers
Godset
Herre
Liv
Hr. Carruthers
Sort
Fru Carruthers Død
Doktor Garrison
Fru Carruthers
Thomas
Hr. Carruthers
Bridge
Fru
Carruthers
Piger
Hr. Carruthers
Fru Carruthers
Hr. Carruthers
Christopher
Christopher
Christopher
Mænd
Gud
Fru Carruthers
Piger
Smaragder
Cicely Parkers
Præstens
Hr. Carruthers
Hr. Carruthers
Hr. Barton
Fru Carruthers
Véronique
Hr. Carruthers
Carruthers
Véronique
Hr. Carruthers
Hage
Hans Væsen
Hr. Barton
Hr. Barton
Hr. Carruthers
Hr. Barton
Hr. Carruthers
Tante
Fru
Carruthers
Fru Carruthers
Fru Carruthers
Christopher
Liv
Hr. Barton
Hr. Barton
Sindsbevægelser
Hr.
Carruthers
Milady
Hr. Carruthers
Nat
Hr.
Carruthers
Faders
Broder
Fru Carruthers
Fred
Ulvs
Hr. Barton
Skænd
Claridges
Frøken Tomkins
Papa
Ironi
Hr. Barton
Herre
Carruthers
Carruthers
Liv
Hr. Carruthers
Fru Carruthers
Brændeild
Mile
Liv
Generaler
Carruthers
Fred
Hr. Barton
Hr. Carruthers Ansigt
Stenb
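Some entries in the output above, such as `Fru` followed by `Carruthers` on the next line, appear split in two, most likely because the entity text contains a linebreak from the source file. If such entities are counted as-is, they are tallied separately from `Fru Carruthers`. A minimal sketch of one way to handle this is to collapse whitespace before counting; the helper name `normalize_entity_text` and the sample list are assumptions for illustration.

```python
from collections import Counter

def normalize_entity_text(entity_text):
    """Collapse linebreaks and repeated spaces into single spaces."""
    return " ".join(entity_text.split())

# Hypothetical raw entity texts, some containing linebreaks
raw_people = ["Fru Carruthers", "Fru\nCarruthers", "Hr.\nCarruthers", "Hr. Carruthers"]

# Count the normalized forms so that both spellings land in the same bucket
normalized_tally = Counter(normalize_entity_text(name) for name in raw_people)
print(normalized_tally.most_common())
# [('Fru Carruthers', 2), ('Hr. Carruthers', 2)]
```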

## NER with Long Texts or Many Texts

In [20]:
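import math

# Number of roughly equal pieces to split the text into
number_of_chunks = 80

# Number of characters per chunk, rounded up so no text is left over
chunk_size = math.ceil(len(text) / number_of_chunks)

text_chunks = []

# Step through the text one chunk at a time and slice out each piece
for number in range(0, len(text), chunk_size):
    text_chunk = text[number:number+chunk_size]
    text_chunks.append(text_chunk)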

In [21]:
chunked_documents = list(nlp.pipe(text_chunks))
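Slicing at fixed character offsets can cut a word in half at a chunk boundary. As an alternative sketch (not the approach used in the cells above), the function below moves each cut point forward to the next whitespace character. The name `chunk_on_whitespace` is made up for illustration, and a multi-word entity that straddles a boundary could still be split.

```python
import math

def chunk_on_whitespace(text, number_of_chunks):
    """Split text into roughly equal chunks, cutting only at whitespace."""
    target_size = math.ceil(len(text) / number_of_chunks)
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + target_size, len(text))
        # Move the cut point forward until it lands on whitespace (or the end of the text)
        while end < len(text) and not text[end].isspace():
            end += 1
        chunks.append(text[start:end])
        start = end
    return chunks

sample_text = "Fordi han i Almindelighed tilbeder mig"
sample_chunks = chunk_on_whitespace(sample_text, 3)
print(sample_chunks)
```

Because words are never cut, the function may return fewer chunks than requested, and joining the chunks back together reproduces the original text exactly.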

## Get People

To extract and count the people, we will use an `if` statement that keeps an entity only if its label (`label_`) matches "PER."

In [22]:
people = []

for document in chunked_documents:
    for named_entity in document.ents:
        if named_entity.label_ == "PER":
            people.append(named_entity.text)

people_tally = Counter(people)

df = pd.DataFrame(people_tally.most_common(), columns=['character', 'count'])
df

| | character | count |
|---|---|---|
| 0 | Robert | 135 |
| 1 | Lady | 88 |
| 2 | Lord Robert | 79 |
| 3 | Hr. Carruthers | 68 |
| 4 | Fru Carruthers | 50 |
| 5 | Christopher | 48 |
| 6 | Lady Katherine | 47 |
| 7 | Lady Merrenden | 39 |
| 8 | Véronique | 36 |
| 9 | Malcolm | 33 |


## Get Places

To extract and count places, we can follow the same model as above, except we will change our `if` statement to check for "ent" labels that match "LOC."

In [23]:
places = []
for document in chunked_documents:
    for named_entity in document.ents:
        if named_entity.label_ == "LOC":
            places.append(named_entity.text)

places_tally = Counter(places)

df = pd.DataFrame(places_tally.most_common(), columns=['place', 'count'])
df

| | place | count |
|---|---|---|
| 0 | London | 22 |
| 1 | Paris | 21 |
| 2 | Stuen | 15 |
| 3 | Parken | 10 |
| 4 | Vestibulen | 8 |
| 5 | Vejen | 8 |
| 6 | Teatret | 7 |
| 7 | England | 6 |
| 8 | Trappen | 5 |
| 9 | Northumberland | 5 |
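Since the people and places cells differ only in the label they check, the pattern can be wrapped in one reusable function. The sketch below is one possible way to do that; `tally_entities` is a made-up name, and the `Entity` and `Document` stand-ins are hypothetical substitutes for spaCy objects so the example runs without a model.

```python
from collections import Counter, namedtuple

# Stand-ins for spaCy objects: a document has .ents, an entity has .text and .label_
Entity = namedtuple("Entity", ["text", "label_"])
Document = namedtuple("Document", ["ents"])

def tally_entities(documents, label):
    """Count the entity texts with the given label across several documents."""
    tally = Counter()
    for document in documents:
        for named_entity in document.ents:
            if named_entity.label_ == label:
                tally[named_entity.text] += 1
    return tally

sample_documents = [
    Document(ents=[Entity("London", "LOC"), Entity("Christopher", "PER")]),
    Document(ents=[Entity("London", "LOC"), Entity("Paris", "LOC")]),
]

places_example = tally_entities(sample_documents, "LOC").most_common()
print(places_example)  # [('London', 2), ('Paris', 1)]
```

With the real data, `tally_entities(chunked_documents, "PER")` and `tally_entities(chunked_documents, "LOC")` should give the same counts as the two cells above.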


## Get NER in Context

In [10]:
from IPython.display import Markdown, display
import re

def get_ner_in_context(keyword, document, desired_ner_labels=None):
    # If no labels are specified, use every label the NER component can predict
    if not desired_ner_labels:
        desired_ner_labels = list(nlp.get_pipe('ner').labels)

    # Iterate through all the sentences in the document
    for sentence in document.sents:
        # Re-process each sentence on its own to get its named entities
        sentence_doc = nlp(sentence.text)
        for named_entity in sentence_doc.ents:
            # Check whether the keyword is in the entity text (ignoring capitalization)
            # and whether the entity has one of the desired labels
            if keyword.lower() in named_entity.text.lower() and named_entity.label_ in desired_ner_labels:
                # Replace linebreaks with spaces in both the sentence and the entity text
                sentence_text = re.sub('\n', ' ', sentence.text)
                entity_text = re.sub('\n', ' ', named_entity.text)
                # Bold the entity, escaping it so characters like "." are matched literally
                sentence_text = re.sub(re.escape(entity_text), lambda match: f"**{match.group(0)}**", sentence_text, flags=re.IGNORECASE)

                print('---')
                display(Markdown(f"**{named_entity.label_}**"))
                display(Markdown(sentence_text))
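The bolding step in `get_ner_in_context` relies on a regular-expression substitution. As a small standalone sketch of that step, the hypothetical helper `bold_phrase` below shows why escaping the search phrase matters: without `re.escape`, the `.` in a name like `Hr. Carruthers` would act as a wildcard and could match unintended text.

```python
import re

def bold_phrase(sentence_text, phrase):
    """Return the sentence with every occurrence of phrase wrapped in ** (Markdown bold)."""
    # Replace any run of whitespace (including linebreaks) with a single space
    sentence_text = re.sub(r"\s+", " ", sentence_text)
    # re.escape makes characters such as "." match literally instead of acting as wildcards
    pattern = re.escape(phrase)
    # Wrap each match in ** while keeping the capitalization found in the sentence
    return re.sub(pattern, lambda match: f"**{match.group(0)}**", sentence_text, flags=re.IGNORECASE)

print(bold_phrase("Jeg vil skrive til\nParis.", "paris"))
# Jeg vil skrive til **Paris**.
```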

In [13]:
for document in chunked_documents:
    get_ner_in_context('Paris', document)

---


**LOC**

Han er Diplomat og bor i **Paris** og Rusland og den Slags morsomme Steder, saa han kommer sjælden til England.

---


**LOC**

Maaske til **Paris** -- naturligvis, hvis jeg ikke blive gift med Hr. Carruthers, -- jeg antager ikke, at det er kedeligt at være gift.

---


**LOC**

"  "Hør," sagde han og kastede sig i en Lænestol, "De kan gifte Dem med mig, saa skal jeg tage Dem med til **Paris**, eller hvorhen De vil, og jeg skal ikke kommandere Dem -- jeg skal kun hindre de andre Bæster af Mænd i at se paa Dem.

---


**LOC**

ordi han i Almindelighed tilbeder mig, og i bedste Tilfælde kun forlader mig for at tage paa en tre Ugers Baderejse til Homburg, eller nu og da en Uge til **Paris**; men Malcolm kunde man stadig sende til Klippebjergene og den Slags Steder; han er en hel Sportsmand.

---


**LOC**

Sir Charles Verningham er for Øjeblikket i **Paris**, saa jeg har endnu ikke set ham.  

---


**LOC**

England er kedeligt -- hvad mener De om **Paris**?"  Hvor det morede mig at udslynge disse Bemærkninger!

---


**LOC**

"De maa ikke tage til **Paris** -- alene.

---


**LOC**

"  "I Gaar ved Lunch var der nogen, der sagde, at der var en smuk Dame i **Paris**, hvis Hjerte bankede for Dem," sagde jeg og saá igen paa ham.  

---


**LOC**

"Saa synes De, at **Paris** ligger langt borte!" sagde jeg uskyldigt.  

---


**LOC**

Jeg vil ikke være hjemme, naar Charlie kommer fra **Paris**.

---


**LOC**

Fru Carruthers lod mig lære det, hver Gang vi kom til **Paris**, hun holdt selv af at se det.  

---


**LOC**

"  Gode Manerer er blevet trommet ind i mig fra min tidligste Barndom, og jeg sagde høfligt: "De kom først fra **Paris** sent i Aftes, ikke sandt?

---


**LOC**

har du bragt nogle ny Dukker med til os fra **Paris**?

---


**LOC**

Han vilde spise mig op, og saa tage tilbage til **Paris** til den Dame, han elsker -- men saa kunde jeg leve det Liv, jeg holder af -- og Carruthers Smaragderne er smukke -- og jeg elsker Branches -- og -- og --  "Hendes Naade vilde gerne tale med Dem, Frøken," sagde en Tjener.  

---


**LOC**

"  "Maaske."  "Naa, det gør han altid, naar han kommer fra **Paris**.

---


**LOC**

"  Vi kyssede hinanden flygtigt, og jeg gik op paa mit Værelse.  Ja, det bedste, jeg kan gøre, er at gifte mig med Christopher, jeg bryder mig saa lidt om ham, at Damen i **Paris** ikke vilde have noget at betyde for mig, selv om hun ligner Sir Charles' Poulet à la Victoria aux truffes.

---


**LOC**

De drillede ham allesammen med **Paris**, og han tog det meget godmodigt -- det syntes endogsaa at more ham.

---


**LOC**

Hun gjorde dem alle mulige Spørgsmaal om deres ny Kjoler og sagde, at de skulde hellere tage til **Paris** engang imellem.

---


**LOC**

"Jeg vil skrive til **Paris**; min gamle Mademoiselle er gift dèr med en Digter, tror jeg; hun vil maaske modtage mig som Pensionær i nogen Tid.

---


**LOC**

Det er altsaa sandt, antager jeg, at hun i Mellemrummene mellem **Paris** kan faa hans Hjerte til at banke.  

---


**LOC**

Christopher sendte mig dette karakteristiske Brev sammen med de Ørenringe, der er hans Gave -- meget store Smaragder indfattet med Diamanter:  "Det gør mig saa ondt, at jeg ikke skal se Dem paa den lykkelige Dag, men jeg har været heldig nok til at opdage, at **Paris** endnu har Glæder for mig.  