# Enriching Wikidata with the Getty KG

The [Getty vocabularies](https://www.getty.edu/research/tools/vocabularies/lod/index.html) contain rich data represented in RDF format.

This notebook shows how external graphs such as the Getty vocabularies can be used to enrich Wikidata with `kgtk` operations. We demonstrate the enrichment on the people in the `Arnold Schwarzenegger` graph that exist both in Wikidata (with a Qnode) and in the Getty vocabularies (with a ULAN ID), focusing on their `date of birth` information.

Specifically, we will investigate: *Does Getty contain complementary information to Wikidata about people's date of birth?*

We will use KGTK to import Getty data, align Getty to Wikidata, query dates of birth in both graphs separately, compare the results, and enrich the Wikidata graph with the missing information.

## Step 0: Install KGTK

Only run the following cell if KGTK is not installed, for example when running in [Google Colab](https://colab.research.google.com/).

In [None]:
!pip install kgtk

In [2]:
import os
import json

from kgtk.configure_kgtk_notebooks import ConfigureKGTK
from kgtk.functions import kgtk, kypher

## Set up environment path
Here we set up the environment variables used in the following sections, including the folders, the input graph files, and the locations of the query outputs.

In [3]:
# Parameters

# Folder on local machine where to create the output and temporary folders

input_path = None
output_path = "/tmp/kgtk-projects"
project_name = "getty-enrichment"

In [None]:
files = [
    "all",
    "label"
]
additional_files = {
    "ulan_terms": "ULANOut_2Terms.nt.gz",
    "ulan_subjects": "ULANOut_1Subjects.nt.gz",
    "ulan_agentmap": "ULANOut_AgentMap.nt.gz",
    "ulan_biographies": "ULANOut_Biographies.nt.gz",
    "namespaces": "namespaces.tsv",
}

ck = ConfigureKGTK(files, 
                   input_files_url="https://github.com/usc-isi-i2/kgtk-tutorial-files/raw/main/datasets/getty")
ck.configure_kgtk(input_graph_path=input_path,
                  output_path=output_path,
                  project_name=project_name,
                  additional_files=additional_files)

In [8]:
ck.print_env_variables()

EXAMPLES_DIR: /Users/amandeep/Github/kgtk-notebooks/examples
KGTK_GRAPH_CACHE: /tmp/kgtk-projects/getty-enrichment/temp.getty-enrichment/wikidata.sqlite3.db
STORE: /tmp/kgtk-projects/getty-enrichment/temp.getty-enrichment/wikidata.sqlite3.db
KGTK_OPTION_DEBUG: false
USE_CASES_DIR: /Users/amandeep/Github/kgtk-notebooks/use-cases
KGTK_LABEL_FILE: /Users/amandeep/isi-kgtk-tutorial/getty-enrichment_input/labels.en.tsv.gz
OUT: /tmp/kgtk-projects/getty-enrichment
GRAPH: /Users/amandeep/isi-kgtk-tutorial/getty-enrichment_input
kypher: kgtk query --graph-cache /tmp/kgtk-projects/getty-enrichment/temp.getty-enrichment/wikidata.sqlite3.db
TEMP: /tmp/kgtk-projects/getty-enrichment/temp.getty-enrichment
kgtk: kgtk
all: /Users/amandeep/isi-kgtk-tutorial/getty-enrichment_input/all.tsv.gz
label: /Users/amandeep/isi-kgtk-tutorial/getty-enrichment_input/labels.en.tsv.gz
ulan_terms: /Users/amandeep/isi-kgtk-tutorial/getty-enrichment_input/ULANOut_2Terms.nt.gz
ulan_subjects: /Users/amandeep/isi-kgtk-tuto

## Approach overview

The Getty knowledge graph consists of [multiple vocabulary files](https://www.getty.edu/research/tools/vocabularies/), including ULAN (Union List of Artist Names), TGN (Thesaurus of Geographic Names), and AAT (Art & Architecture Thesaurus).
In this tutorial, we will focus on the ULAN vocabulary, which "includes names, rich relationships, notes, sources, and biographical information for artists, architects, firms, studios, repositories, and patrons, both individuals and corporate bodies, named and anonymous". The procedures for the other vocabularies should be analogous as they are also in `.nt` format.

The method that we will use consists of the following 5 steps:
1. Import Getty's ULAN file into KGTK
2. Align Getty to Wikidata
3. Query Wikidata, record known & unknown values
4. Query Getty to see if we can find these unknown values
5. Append the newly found values to Wikidata

## 1. Import Getty's ULAN data into `kgtk`

As both ULAN and TGN are stored in n-triples (`.nt`) format, we can simply use the `import-ntriples` command. 

**Understanding prefixes** Getty conveniently provides an ontology file in an [RDF format](http://vocab.getty.edu/ontology.rdf), which defines the prefixes in the file header. We have transformed this file into KGTK format (`namespaces.tsv`), and we will use it to help KGTK understand the prefixes in the data. Here are its contents:

In [9]:
kgtk("""
    cat -i $GRAPH/namespaces.tsv
""")

Unnamed: 0,node1,label,node2
0,xml-schema-type,prefix_expansion,http://www.w3.org/2001/XMLSchema#
1,ulan_scopeNote,prefix_expansion,http://vocab.getty.edu/ulan/scopeNote/
2,tgn_term,prefix_expansion,http://vocab.getty.edu/tgn/term/
3,rrx,prefix_expansion,http://purl.org/r2rml-ext/
4,tgn_scopeNote,prefix_expansion,http://vocab.getty.edu/tgn/scopeNote/
...,...,...,...
57,vann,prefix_expansion,http://purl.org/vocab/vann/
58,vcard,prefix_expansion,http://www.w3.org/2006/vcard/ns#
59,ulan_source,prefix_expansion,http://vocab.getty.edu/ulan/source/
60,cc,prefix_expansion,http://creativecommons.org/ns#
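
To make the role of the namespace file concrete, here is a minimal sketch of how a full URI could be shortened with such a prefix table. It is plain Python, not the KGTK implementation: the `compact_uri` helper, the `ulan` row, and the source ID in the second call are assumptions made for illustration, and `import-ntriples` may use a different matching strategy.

```python
# Toy namespace table: three rows copied from namespaces.tsv above,
# plus a `ulan` row that is assumed here for illustration.
NAMESPACES = {
    "xml-schema-type": "http://www.w3.org/2001/XMLSchema#",
    "ulan_scopeNote": "http://vocab.getty.edu/ulan/scopeNote/",
    "ulan_source": "http://vocab.getty.edu/ulan/source/",
    "ulan": "http://vocab.getty.edu/ulan/",  # assumed row, not shown above
}


def compact_uri(uri, namespaces=NAMESPACES):
    """Replace the longest matching namespace expansion with its prefix."""
    best_prefix, best_expansion = None, ""
    for prefix, expansion in namespaces.items():
        if uri.startswith(expansion) and len(expansion) > len(best_expansion):
            best_prefix, best_expansion = prefix, expansion
    if best_prefix is None:
        return uri  # no known namespace: leave the URI untouched
    return f"{best_prefix}:{uri[len(best_expansion):]}"


print(compact_uri("http://vocab.getty.edu/ulan/500523031"))
print(compact_uri("http://vocab.getty.edu/ulan/source/123"))  # made-up source ID
print(compact_uri("http://example.org/unknown"))
```

Picking the longest expansion matters here because `http://vocab.getty.edu/ulan/` is itself a prefix of the more specific `ulan_source` namespace.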


**Getty files** We will use four files from Getty's ULAN vocabulary:
1. `Biography`, which links agents to biographies using the `gvp:biographyPreferred` property.
2. `Agent Map`, which links people to their roles ("agents") through the `foaf:focus` property.
3. `Subjects`, which uses `dc:identifier` to link ULAN nodes to their ULAN ID strings.
4. `Terms`, which links agents to their year of birth and death using the `gvp:estStart` and `gvp:estEnd` properties.

Together, the four files are needed to enable the following path from people to birthdates:

![ULAN](../media/ULAN.png)
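
The path can also be pictured as a chain of edge labels that is followed one hop at a time. The sketch below walks that chain over a handful of toy edges in plain Python. The edge shapes mirror the property chain used in the Getty query of step 4, but the `follow` helper, the intermediate node `node-1`, and the year `1850` are hypothetical.

```python
# Toy edges shaped like the imported ULAN data; the last two rows use an
# invented intermediate node and an invented year, for illustration only.
edges = [
    ("ulan:500000002", "foaf:focus", "ulan:500000002-agent"),
    ("ulan:500000002-agent", "gvp:biographyPreferred", "ulan_bio:4000336014"),
    ("ulan_bio:4000336014", "gvp:estStart", "node-1"),   # hypothetical node
    ("node-1", "gvp:structured_value", "1850"),          # hypothetical year
]

# Property chain from a ULAN node to its estimated birth year.
PATH = ["foaf:focus", "gvp:biographyPreferred", "gvp:estStart", "gvp:structured_value"]


def follow(edges, start, labels):
    """Follow a sequence of edge labels from `start`; return the end nodes."""
    frontier = {start}
    for label in labels:
        frontier = {n2 for n1, lab, n2 in edges if n1 in frontier and lab == label}
    return frontier


print(follow(edges, "ulan:500000002", PATH))
```

If any hop is missing (for example, an agent without a preferred biography), the frontier becomes empty and no birth year is returned.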

For convenience, we have uploaded the files in `.gz` format to this GitHub repository, so we just have to gunzip them before using them in KGTK:

In [10]:
!gunzip $GRAPH/ULANOut_2Terms.nt.gz
!gunzip $GRAPH/ULANOut_1Subjects.nt.gz
!gunzip $GRAPH/ULANOut_AgentMap.nt.gz
!gunzip $GRAPH/ULANOut_Biographies.nt.gz

We can now import each of these four files into KGTK:

In [11]:
%%time
kgtk("""
    import-ntriples 
        -i $GRAPH/ULANOut_2Terms.nt
        -o $TEMP/ULAN_term_KGTK.tsv
        --namespace-file $GRAPH/namespaces.tsv
        --namespace-id-use-uuid True 
        --build-new-namespaces False 
        --output-only-used-namespaces True 
        --structured-value-label gvp:structured_value 
        --structured-uri-label gvp:structured_uri 
        --newnode-prefix node 
        --newnode-use-uuid True
    """)

CPU times: user 46.7 ms, sys: 40 ms, total: 86.7 ms
Wall time: 2min 27s


In [12]:
kgtk("""
    head -i $TEMP/ULAN_term_KGTK.tsv
""")

Unnamed: 0,node1,label,node2
0,ulan:500523031,gvp:prefLabelGVP,ulan_term:1501308164
1,ulan:500523031,skosxl:prefLabel,ulan_term:1501308164
2,ulan:500523038,gvp:prefLabelGVP,ulan_term:1501308170
3,ulan:500523038,skosxl:prefLabel,ulan_term:1501308170
4,ulan:500523041,gvp:prefLabelGVP,ulan_term:1501308173
5,ulan:500523041,skosxl:prefLabel,ulan_term:1501308173
6,ulan:500523044,gvp:prefLabelGVP,ulan_term:1501308176
7,ulan:500523044,skosxl:prefLabel,ulan_term:1501308176
8,ulan:500523050,gvp:prefLabelGVP,ulan_term:1501308181
9,ulan:500523050,skosxl:prefLabel,ulan_term:1501308181


In [13]:
%%time
kgtk("""
    import-ntriples 
        -i $GRAPH/ULANOut_1Subjects.nt
        -o $TEMP/ULAN_subject_KGTK.tsv 
        --namespace-file $GRAPH/namespaces.tsv
        --namespace-id-use-uuid True 
        --build-new-namespaces False 
        --output-only-used-namespaces True 
        --structured-value-label gvp:structured_value 
        --structured-uri-label gvp:structured_uri 
        --newnode-prefix node 
        --newnode-use-uuid True
    """)

CPU times: user 18 ms, sys: 21.2 ms, total: 39.1 ms
Wall time: 59.2 s


In [14]:
kgtk("""
    head -i $TEMP/ULAN_subject_KGTK.tsv
""")

Unnamed: 0,node1,label,node2
0,ulan:500204004,rdf:type,gvp:UnknownPersonConcept
1,ulan:500204004,gvp:displayOrder,1
2,ulan:500204004,gvp:parentStringAbbrev,Unknown People by Culture
3,ulan:500204004,gvp:parentString,Unknown People by Culture
4,ulan:500204004,dc:identifier,500204004
5,ulan:500204004,dcterm:license,http://opendatacommons.org/licenses/by/1.0/
6,ulan:500204004,cc:license,http://opendatacommons.org/licenses/by/1.0/
7,ulan:500204004,skos:inScheme,ulan:
8,ulan:500204004,void:inDataset,http://vocab.getty.edu/dataset/ulan
9,ulan:500372685,rdf:type,gvp:UnknownPersonConcept


In [15]:
%%time
kgtk("""
    import-ntriples 
        -i $GRAPH/ULANOut_AgentMap.nt
        -o $TEMP/ULAN_agentmap_KGTK.tsv 
        --namespace-file $GRAPH/namespaces.tsv 
        --namespace-id-use-uuid True 
        --build-new-namespaces False 
        --output-only-used-namespaces True 
        --structured-value-label gvp:structured_value 
        --structured-uri-label gvp:structured_uri 
        --newnode-prefix node 
        --newnode-use-uuid True
    """)

CPU times: user 10.8 ms, sys: 15.6 ms, total: 26.4 ms
Wall time: 29.4 s


In [16]:
kgtk("""
    head -i $TEMP/ULAN_agentmap_KGTK.tsv
""")

Unnamed: 0,node1,label,node2
0,ulan:500000002,foaf:focus,ulan:500000002-agent
1,ulan:500000003,foaf:focus,ulan:500000003-agent
2,ulan:500000004,foaf:focus,ulan:500000004-agent
3,ulan:500000005,foaf:focus,ulan:500000005-agent
4,ulan:500000006,foaf:focus,ulan:500000006-agent
5,ulan:500000007,foaf:focus,ulan:500000007-agent
6,ulan:500000009,foaf:focus,ulan:500000009-agent
7,ulan:500000010,foaf:focus,ulan:500000010-agent
8,ulan:500000011,foaf:focus,ulan:500000011-agent
9,ulan:500000012,foaf:focus,ulan:500000012-agent


In [17]:
%%time
kgtk("""
    import-ntriples 
        -i $GRAPH/ULANOut_Biographies.nt
        -o $TEMP/ULAN_biography_KGTK.tsv 
        --namespace-file $GRAPH/namespaces.tsv
        --namespace-id-use-uuid True 
        --build-new-namespaces False 
        --output-only-used-namespaces True 
        --structured-value-label gvp:structured_value 
        --structured-uri-label gvp:structured_uri 
        --newnode-prefix node 
        --newnode-use-uuid True
    """)

CPU times: user 33.1 ms, sys: 30.6 ms, total: 63.7 ms
Wall time: 1min 55s


In [18]:
kgtk("""
    head -i $TEMP/ULAN_biography_KGTK.tsv
""")

Unnamed: 0,node1,label,node2
0,ulan:500000002-agent,gvp:biographyPreferred,ulan_bio:4000336014
1,ulan:500000003-agent,gvp:biographyPreferred,ulan_bio:4000336015
2,ulan:500000004-agent,gvp:biographyNonPreferred,ulan_bio:4000000003
3,ulan:500000004-agent,gvp:biographyPreferred,ulan_bio:4000000001
4,ulan:500000004-agent,gvp:biographyNonPreferred,ulan_bio:4000000002
5,ulan:500000004-agent,gvp:biographyNonPreferred,ulan_bio:4000334645
6,ulan:500000004-agent,gvp:biographyNonPreferred,ulan_bio:4000757338
7,ulan:500000005-agent,gvp:biographyPreferred,ulan_bio:4000000004
8,ulan:500000005-agent,gvp:biographyNonPreferred,ulan_bio:4000000005
9,ulan:500000005-agent,gvp:biographyNonPreferred,ulan_bio:4000000006


After importing the files, we can use KGTK operations on them. We start by using `kgtk cat` to concatenate them into a single file, which is more convenient to work with.

In [19]:
%%time
kgtk("""
    cat -i $TEMP/ULAN_term_KGTK.tsv $TEMP/ULAN_subject_KGTK.tsv $TEMP/ULAN_agentmap_KGTK.tsv $TEMP/ULAN_biography_KGTK.tsv 
        -o $TEMP/ULAN_all.tsv
    """)

CPU times: user 22.3 ms, sys: 21.9 ms, total: 44.2 ms
Wall time: 1min 8s


## 2. Build Getty-Wikidata Alignment
Getty provides a `WikidataAlignment` file but our analysis showed that this alignment file is incomplete or out-of-date. Thus, we build our own alignment file, which links ULAN IDs to Wikidata Qnodes.

We perform a join between the Wikidata and the ULAN graph, through the ULAN identifiers available in both graphs.
Wikidata uses the property `P245` to map Qnode IDs to ULAN identifiers, whereas Getty links ULAN nodes to their IDs with the `dc:identifier` property.

We will use the `skos:exactMatch` property to indicate alignment between ULAN nodes and Wikidata nodes.

*This query takes our subgraph of Wikidata and the Getty ULAN graph, which originally came in an entirely different format, and queries the two jointly.*
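
Conceptually, the alignment is an equi-join on the shared identifier string. The following plain-Python sketch illustrates the idea on a few toy edges. It is not how Kypher executes the query, the `align` helper is hypothetical, `Q_example` is a made-up Qnode, and the quoting of KGTK string values is ignored for simplicity.

```python
# Toy edges: (node1, label, node2). `Q_example` is a made-up Qnode whose
# identifier has no counterpart on the ULAN side.
wikidata_edges = [
    ("Q100948", "P245", "500224955"),
    ("Q101791", "P245", "500001235"),
    ("Q_example", "P245", "599999999"),
]
ulan_edges = [
    ("ulan:500224955", "dc:identifier", "500224955"),
    ("ulan:500001235", "dc:identifier", "500001235"),
    ("ulan:500204004", "dc:identifier", "500204004"),
]


def align(wikidata_edges, ulan_edges):
    """Join the two graphs on the identifier and emit skos:exactMatch edges."""
    ulan_by_identifier = {}
    for node, label, identifier in ulan_edges:
        if label == "dc:identifier":
            ulan_by_identifier.setdefault(identifier, []).append(node)
    rows = set()  # a set mirrors the `distinct` in the query
    for qnode, label, identifier in wikidata_edges:
        if label == "P245":
            for ulan_node in ulan_by_identifier.get(identifier, []):
                rows.add((ulan_node, "skos:exactMatch", qnode))
    return sorted(rows)


for row in align(wikidata_edges, ulan_edges):
    print(row)
```

Only identifiers present on both sides produce an alignment edge, so unmatched Qnodes and unmatched ULAN nodes are silently dropped.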

Let's first see what results we get with this join operation:

In [21]:
%%time
kgtk("""
    query -i $all $TEMP/ULAN_all.tsv 
        --match '
            all: (qnode)-[:P245]->(identifier), 
            ULAN: (ulanid)-[p]->(identifier)' 
        --where 'p.label = "dc:identifier"' 
        --return '
                distinct ulanid as node1, 
                "skos:exactMatch" as label, 
                qnode as node2' 
    / add-labels
    """)

CPU times: user 8.26 ms, sys: 12.3 ms, total: 20.6 ms
Wall time: 3.56 s


Unnamed: 0,node1,label,node2,node2;label
0,ulan:500224955,skos:exactMatch,Q100948,'Rachel Carson'@en
1,ulan:500281177,skos:exactMatch,Q101771,'Gottfried Gruben'@en
2,ulan:500001235,skos:exactMatch,Q101791,'Sep Ruf'@en
3,ulan:500256782,skos:exactMatch,Q102139,'Margrethe II of Denmark'@en
4,ulan:500302331,skos:exactMatch,Q1024362,'Spanish National Research Council'@en
...,...,...,...,...
538,ulan:500262206,skos:exactMatch,Q9696,'John F. Kennedy'@en
539,ulan:500247140,skos:exactMatch,Q972381,'George Hall'@en
540,ulan:500324997,skos:exactMatch,Q97416,'Gerhart Rodenwaldt'@en
541,ulan:500274474,skos:exactMatch,Q979511,'Stuart Craig'@en


The results look reasonable, so let's go ahead and store the alignment into a KGTK file:

In [22]:
%%time
kgtk("""
    query -i $all $TEMP/ULAN_all.tsv 
        --match '
            all: (qnode)-[:P245]->(identifier), 
            ULAN: (ulanid)-[p]->(identifier)' 
        --where 'p.label = "dc:identifier"' 
        --return '
                distinct ulanid as node1, 
                "skos:exactMatch" as label, 
                qnode as node2' 
        -o $TEMP/ULAN_ALIGN.tsv
    """)

CPU times: user 3.52 ms, sys: 9.88 ms, total: 13.4 ms
Wall time: 1.47 s


We will now run a simple Kypher query to count the Qnodes for which we have ULAN mapping:

In [23]:
kgtk("""
    query -i $TEMP/ULAN_ALIGN.tsv 
        --match '(ulanid)-[]->(qnode)' 
        --return 'count(distinct qnode) as QNODE'
    """)

Unnamed: 0,QNODE
0,535


Hmm... So there are 535 Qnodes that correspond to 543 ULAN nodes, which means that we have some Qnodes with more than one ULAN ID. In theory, this should not happen - each entity in Wikidata should correspond to a single ULAN node.

Let's find the Qnodes with multiple mappings and inspect them more closely:

In [24]:
kgtk("""
    query -i $TEMP/ULAN_ALIGN.tsv
        --match '
            (u1)-[]->(qnode),
            (u2)-[]->(qnode)'
        --where 'u1<u2'
        --return 'qnode as qnode, u1 as ulan1, u2 as ulan2' 
""")

Unnamed: 0,qnode,ulan1,ulan2
0,Q1244372,ulan:500304981,ulan:500305436
1,Q127064,ulan:500279772,ulan:500304559
2,Q157808,ulan:500210203,ulan:500303345
3,Q1600831,ulan:500227540,ulan:500312167
4,Q2837755,ulan:500312076,ulan:500312077
5,Q2945260,ulan:500251050,ulan:500307043
6,Q526170,ulan:500307065,ulan:500312663
7,Q66149,ulan:500023792,ulan:500358178


After manual inspection, we see that https://www.wikidata.org/wiki/Q1244372 indeed has two ULAN identifiers associated with it: 
1. Allard Pierson Museum (Dutch repository, Amsterdam, contemporary) with id `500304981`
2. Universiteit van Amsterdam, Allard Pierson Museum (Dutch repository, Amsterdam, contemporary) with id `500305436`

Thus, the cases where we have multiple ULAN nodes for a Wikidata Qnode are not mapping mistakes; they exist in the data.

**Finding:** we obtain 543 ULAN mappings for 535 Wikidata nodes. Eight Wikidata Qnodes have two ULAN nodes associated with them.
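
The same check can be expressed as a group-by over the alignment edges. Below is a small plain-Python sketch, using three rows taken from the outputs above. The `multi_mapped` helper is hypothetical and only illustrates the counting logic.

```python
from collections import defaultdict

# Three alignment rows copied from the outputs above.
alignment = [
    ("ulan:500304981", "skos:exactMatch", "Q1244372"),
    ("ulan:500305436", "skos:exactMatch", "Q1244372"),
    ("ulan:500224955", "skos:exactMatch", "Q100948"),
]


def multi_mapped(alignment):
    """Return the Qnodes that are aligned to more than one ULAN node."""
    ulans_by_qnode = defaultdict(set)
    for ulan_node, _label, qnode in alignment:
        ulans_by_qnode[qnode].add(ulan_node)
    return {q: sorted(u) for q, u in ulans_by_qnode.items() if len(u) > 1}


print(multi_mapped(alignment))

# Each Qnode with two ULAN nodes contributes one extra alignment row,
# which accounts for the reported totals: 535 Qnodes + 8 extras = 543 rows.
print(535 + 8)
```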

## 3. Query Wikidata (our KG)
We query our KG subset of Wikidata for these 535 people to see whether it records a date of birth for them, using the `P569` property.

We provide first a glimpse of the query results:

In [25]:
%%time
kgtk("""
    query -i $TEMP/ULAN_ALIGN.tsv $all 
        --match 'ALIGN: (ulanid)-[]->(qnode), 
                 all: (qnode)-[p:P569]->(birthdate)' 
        --return 'qnode as node1, p.label as label, birthdate as node2' 
    / add-labels
    """)

CPU times: user 8.05 ms, sys: 11.3 ms, total: 19.4 ms
Wall time: 4.62 s


Unnamed: 0,node1,label,node2,node1;label,label;label
0,Q100948,P569,^1907-05-27T00:00:00Z/11,'Rachel Carson'@en,'date of birth'@en
1,Q101771,P569,^1929-06-21T00:00:00Z/11,'Gottfried Gruben'@en,'date of birth'@en
2,Q101791,P569,^1908-03-09T00:00:00Z/11,'Sep Ruf'@en,'date of birth'@en
3,Q102139,P569,^1940-04-16T00:00:00Z/11,'Margrethe II of Denmark'@en,'date of birth'@en
4,Q102711,P569,^1936-05-17T00:00:00Z/11,'Dennis Hopper'@en,'date of birth'@en
...,...,...,...,...,...
295,Q9696,P569,^1917-05-29T00:00:00Z/11,'John F. Kennedy'@en,'date of birth'@en
296,Q972381,P569,^1916-11-19T00:00:00Z/11,'George Hall'@en,'date of birth'@en
297,Q97416,P569,^1886-10-16T00:00:00Z/11,'Gerhart Rodenwaldt'@en,'date of birth'@en
298,Q979511,P569,^1942-04-14T00:00:00Z/11,'Stuart Craig'@en,'date of birth'@en


Now that we understand the results, we perform the query for all 535 people:

In [26]:
%%time
kgtk("""
    query -i $TEMP/ULAN_ALIGN.tsv $all 
        --match 'ALIGN: (ulanid)-[]->(qnode), 
                 all: (qnode)-[p:P569]->(birthdate)' 
        --return 'qnode as node1, p.label as label, birthdate as node2' 
        -o $TEMP/WD_BD.tsv
    """)

CPU times: user 3.02 ms, sys: 10 ms, total: 13 ms
Wall time: 1.33 s


And we count the date of birth rows that we find:

In [27]:
kgtk("""
    query -i $TEMP/WD_BD.tsv
        --match '(qnode)-[]->()' 
        --return 'count(distinct qnode) as Qnode'
    """)

Unnamed: 0,Qnode
0,266


Again, each person should in theory have a single date of birth, as it is a functional property. Hence, finding 300 dates for 266 people calls for further investigation:

In [28]:
kgtk("""
    query -i $TEMP/WD_BD.tsv
        --match '
            (qnode)-[]->(bd1),
            (qnode)-[]->(bd2)'
        --where 'bd1<bd2'
        --return 'qnode as qnode, bd1 as bd1, bd2 as bd2' 
""")

Unnamed: 0,qnode,bd1,bd2
0,Q106775,^1930-10-01T00:00:00Z/11,^1930-10-02T00:00:00Z/11
1,Q11806,^1735-10-19T00:00:00Z/11,^1735-10-30T00:00:00Z/11
2,Q131981,^1683-10-30T00:00:00Z/11,^1683-11-10T00:00:00Z/11
3,Q1434,^0161-01-01T00:00:00Z/9,^0161-08-31T00:00:00Z/11
4,Q1681112,^1940-01-01T00:00:00Z/9,^1940-07-13T00:00:00Z/11
5,Q174880,^1515-01-01T00:00:00Z/9,^1515-03-28T00:00:00Z/11
6,Q177847,^0120-01-01T00:00:00Z/9,^0125-01-01T00:00:00Z/9
7,Q182021,^1573-04-26T00:00:00Z/11,^1575-04-26T00:00:00Z/11
8,Q2643,^1943-01-01T00:00:00Z/9,^1943-02-25T00:00:00Z/11
9,Q30875,^1854-06-16T00:00:00Z/11,^1854-10-16T00:00:00Z/11


So, indeed some entities have multiple dates of birth recorded. For example, https://www.wikidata.org/wiki/Q75612 has the dates `^1902-11-21T00:00:00Z/11` and `^1904-07-14T00:00:00Z/11`, both in our file and in Wikidata's GUI. Interestingly, these two dates both have a large number of references (5-6), which makes it difficult to pick the right one.

**Finding:** Out of the 535 people, 266 have a date of birth in Wikidata. In total, we obtain 300 dates of birth for these 266 people, due to co-existing conflicting information.
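
Not every pair in the table above is a contradiction: some pairs differ only in precision (year `/9` versus day `/11`). As a rough sketch, one could compare two dates at the coarser of their precisions. The helpers below are hypothetical and are not part of KGTK. They assume the Wikidata precision codes 9 (year), 10 (month), and 11 (day), and treat anything coarser than a year as a year, which is a simplification.

```python
import re

# KGTK date literal, e.g. ^1940-07-13T00:00:00Z/11
DATE_RE = re.compile(r"\^(-?\d+)-(\d{2})-(\d{2})T\d{2}:\d{2}:\d{2}Z/(\d+)")


def parse(date):
    """Split a KGTK date literal into (year, month, day, precision)."""
    match = DATE_RE.fullmatch(date)
    if match is None:
        raise ValueError(f"not a KGTK date literal: {date!r}")
    year, month, day, precision = (int(group) for group in match.groups())
    return year, month, day, precision


def compatible(date1, date2):
    """True if both dates agree at the coarser of their two precisions."""
    y1, m1, d1, p1 = parse(date1)
    y2, m2, d2, p2 = parse(date2)
    precision = min(p1, p2)
    if precision <= 9:       # year (or coarser, simplified to year)
        return y1 == y2
    if precision == 10:      # month
        return (y1, m1) == (y2, m2)
    return (y1, m1, d1) == (y2, m2, d2)


# Pairs copied from the table above.
print(compatible("^1940-01-01T00:00:00Z/9", "^1940-07-13T00:00:00Z/11"))   # Q1681112
print(compatible("^1930-10-01T00:00:00Z/11", "^1930-10-02T00:00:00Z/11"))  # Q106775
print(compatible("^0120-01-01T00:00:00Z/9", "^0125-01-01T00:00:00Z/9"))    # Q177847
```

Under these assumptions, the Q1681112 pair is a precision difference rather than a conflict, while the Q106775 and Q177847 pairs are genuine disagreements.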

## 4. Query Getty

*Let's see whether Getty can fill the knowledge gaps for the remaining people in Wikidata...*

We now query Getty for the same set of 535 people. In this query, we take the ULAN IDs that correspond to the Qnodes of interest, we link these ULAN nodes to ULAN agents or "roles" (using `foaf:focus`), we find their biography (using `gvp:biographyPreferred`), and we get the birth year based on the `gvp:estStart` property. As the birth year is a structured literal, we consider its value (`gvp:structured_value`).

Getty provides dates of birth at year granularity. We therefore query Getty for years of birth and format them as dates, using the appropriate year precision marker (`/9` in Wikidata and KGTK).
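
As a small illustration of the target format, the sketch below turns a year into a KGTK date literal with year precision. The `year_to_kgtk_date` helper is hypothetical. Unlike a plain `%s` substitution, it zero-pads the year to four digits, matching the shape of Wikidata values such as `^0161-01-01T00:00:00Z/9`; that padding is a design choice of this sketch.

```python
def year_to_kgtk_date(year):
    """Format an integer year as a KGTK date literal with year precision (/9)."""
    sign = "-" if year < 0 else ""
    # January 1st is only a placeholder; the /9 suffix says "year precision".
    return f"^{sign}{abs(year):04d}-01-01T00:00:00Z/9"


print(year_to_kgtk_date(1907))
print(year_to_kgtk_date(100))
```

The month and day in the result carry no information, so consumers should look at the precision suffix before comparing such values with day-precision dates.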

In [29]:
%%time
kgtk("""
    query -i $TEMP/ULAN_ALIGN.tsv $TEMP/ULAN_all.tsv
        --match '
                ALIGN: (ulanid)-[]->(qnode), 
                all: (ulanid)-[p0]->(ulanagent), 
                all: (ulanagent)-[p1]->()-[p2]->()-[p3]->(datevalue)' 
        --where '
                p0.label = "foaf:focus" 
                AND p1.label = "gvp:biographyPreferred" 
                AND p2.label = "gvp:estStart" 
                AND p3.label = "gvp:structured_value"' 
        --return '
                  distinct qnode as node1, 
                  "P569" as label, 
                  printf("^%s-01-01T00:00:00Z/9", kgtk_unstringify(datevalue)) as node2' 
    / add-labels
    """)

CPU times: user 15.5 ms, sys: 16.1 ms, total: 31.6 ms
Wall time: 21.5 s


Unnamed: 0,node1,label,node2,node1;label,label;label
0,Q100948,P569,^1907-01-01T00:00:00Z/9,'Rachel Carson'@en,'date of birth'@en
1,Q101771,P569,^1929-01-01T00:00:00Z/9,'Gottfried Gruben'@en,'date of birth'@en
2,Q101791,P569,^1908-01-01T00:00:00Z/9,'Sep Ruf'@en,'date of birth'@en
3,Q102139,P569,^1940-01-01T00:00:00Z/9,'Margrethe II of Denmark'@en,'date of birth'@en
4,Q1024362,P569,^1800-01-01T00:00:00Z/9,'Spanish National Research Council'@en,'date of birth'@en
...,...,...,...,...,...
535,Q9696,P569,^1917-01-01T00:00:00Z/9,'John F. Kennedy'@en,'date of birth'@en
536,Q972381,P569,^1916-01-01T00:00:00Z/9,'George Hall'@en,'date of birth'@en
537,Q97416,P569,^1886-01-01T00:00:00Z/9,'Gerhart Rodenwaldt'@en,'date of birth'@en
538,Q979511,P569,^1942-01-01T00:00:00Z/9,'Stuart Craig'@en,'date of birth'@en


As expected, we obtain dates of birth with a year precision (`/9`). We can thus go ahead and query for the dates of birth for all 535 entities:

In [30]:
%%time
kgtk("""
    query -i $TEMP/ULAN_ALIGN.tsv $TEMP/ULAN_all.tsv
        --match 'ALIGN: (ulanid)-[]->(qnode), 
                 all: (ulanid)-[p0]->(ulanagent), 
                 all: (ulanagent)-[p1]->()-[p2]->()-[p3]->(datevalue)' 
        --where 'p0.label = "foaf:focus" AND p1.label = "gvp:biographyPreferred" AND p2.label = "gvp:estStart" AND p3.label = "gvp:structured_value"' 
        --return 'distinct qnode as node1, "P569" as label, printf("^%s-01-01T00:00:00Z/9", kgtk_unstringify(datevalue)) as node2' 
        -o $TEMP/Getty_BD.tsv
    """)

CPU times: user 3.18 ms, sys: 10.2 ms, total: 13.3 ms
Wall time: 1.45 s


Let's see how many results we found in Getty:

In [31]:
kgtk("""
    query -i $TEMP/Getty_BD.tsv 
        --match '(qnode)-[]->()' 
        --return 'count(distinct qnode) as Qnode'
    """)

Unnamed: 0,Qnode
0,535


**Finding:** We find a date of birth for all 535 people in our Getty knowledge graph! We get 540 dates in total, which again means that we have some duplicates.

### 4a. How many values are novel?
Here we count how many new dates of birth we found in Getty:

In [32]:
%%time
kgtk("""
    ifnotexists -i $TEMP/Getty_BD.tsv 
        --filter-on $TEMP/WD_BD.tsv 
        --input-keys node1 
        --filter-keys node1 
        -o $TEMP/New_BD.tsv
    """)

CPU times: user 3.44 ms, sys: 10.2 ms, total: 13.6 ms
Wall time: 1.38 s


In [33]:
kgtk("""
    query -i $TEMP/New_BD.tsv
        --match '(qnode)-[]->()' 
        --return 'count(distinct qnode) as Qnode'
    """)

Unnamed: 0,Qnode
0,269


**Finding:** There are newly found dates of birth in Getty for 269 entities. This is expected, given that Getty has values for 535 entities and Wikidata has values for 266 of them (535 - 266 = 269).

Let's see how many values we get in total:

In [34]:
kgtk("""
    query -i $TEMP/New_BD.tsv
        --match '(qnode)-[]->()' 
        --return 'count(qnode) as Qnode'
    """)

Unnamed: 0,Qnode
0,273


**Finding:** We see that in four of the novel cases, Getty has two birth dates for a node.
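
The filtering step behaves like an anti-join on `node1`: a Getty edge is kept only if its `node1` does not occur as `node1` in the Wikidata file. The sketch below mimics this in plain Python on made-up edges (`Q_A`, `Q_B`, `Q_C` are placeholders, and `split_by_node1` is a hypothetical helper), and also returns the complementary set that `ifexists` would keep.

```python
# Made-up edges; Q_A, Q_B and Q_C are placeholder Qnodes.
getty_bd = [
    ("Q_A", "P569", "^1900-01-01T00:00:00Z/9"),
    ("Q_B", "P569", "^1850-01-01T00:00:00Z/9"),
    ("Q_B", "P569", "^1851-01-01T00:00:00Z/9"),  # two Getty years for one node
    ("Q_C", "P569", "^1700-01-01T00:00:00Z/9"),
]
wd_bd = [
    ("Q_A", "P569", "^1900-03-04T00:00:00Z/11"),
]


def split_by_node1(edges, filter_edges):
    """Split `edges` by whether their node1 occurs as node1 in `filter_edges`."""
    filter_keys = {node1 for node1, _label, _node2 in filter_edges}
    missing = [edge for edge in edges if edge[0] not in filter_keys]  # ifnotexists
    present = [edge for edge in edges if edge[0] in filter_keys]      # ifexists
    return missing, present


new_bd, matching_bd = split_by_node1(getty_bd, wd_bd)
print("distinct nodes:", len({edge[0] for edge in new_bd}))
print("rows:", len(new_bd))
```

The gap between distinct nodes and rows in this toy data has the same cause as the 269 versus 273 difference above: a node with two Getty years contributes two rows.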

### 4b. Do the known values in Getty and Wikidata match?
Let's check if the found results in Getty match with those in Wikidata. We first obtain the list of matching birth dates, using the `ifexists` command:

In [35]:
%%time
kgtk("""
    ifexists -i $TEMP/Getty_BD.tsv 
        --filter-on $TEMP/WD_BD.tsv 
        --input-keys node1 
        --filter-keys node1 
        -o $TEMP/matching_bd.tsv
    """)

CPU times: user 3.59 ms, sys: 10.4 ms, total: 14 ms
Wall time: 1.34 s


We expect to get birth date values from both sources for 266 nodes:

In [36]:
kgtk("""
    query -i $TEMP/matching_bd.tsv 
        --match '(qnode)-[]->()' 
        --return 'count(distinct qnode) as Qnode'
    """)

Unnamed: 0,Qnode
0,266


Ok, our expectation is correct. Let's now see for how many of those nodes Wikidata and Getty agree on the birth year:

In [37]:
kgtk("""
    query -i $TEMP/Getty_BD.tsv $TEMP/WD_BD.tsv
        --match '
                Getty: (qnode)-[p]->(v1), 
                WD: (qnode)-[]->(v2)' 
        --where 'kgtk_date_year(v1) = kgtk_date_year(v2)' 
        --return 'count(distinct qnode)'
    """)

Unnamed: 0,"count(DISTINCT graph_5_c1.""node1"")"
0,250


Ok, so Getty and Wikidata agree for 250 out of the 266 overlapping entities. Let's inspect the entities for which they contain different information:

In [38]:
kgtk("""
    query -i $TEMP/Getty_BD.tsv $TEMP/WD_BD.tsv
        --match '
                Getty: (qnode)-[p]->(v1), 
                WD: (qnode)-[]->(v2)' 
        --where 'kgtk_date_year(v1) != kgtk_date_year(v2)' 
        --return 'distinct qnode, kgtk_date_year(v1) as getty_year, kgtk_date_year(v2) as wd_year'
        --order-by 'qnode' 
    / add-labels
    """)

Unnamed: 0,node1,getty_year,wd_year,node1;label
0,Q177847,100,125,'Lucian of Samosata'@en
1,Q177847,100,120,'Lucian of Samosata'@en
2,Q182021,1573,1575,'Marie de' Medici'@en
3,Q193426,1923,1921,'Nancy Reagan'@en
4,Q2454564,1970,1930,'Charles Kalani'@en
5,Q3089653,1803,1805,'Frédéric Bourgeois de Mercey'@en
6,Q311469,1635,1634,'Mariana of Austria'@en
7,Q3124601,1926,1921,'LeRoy Neiman'@en
8,Q43689,339,340,'Ambrose'@en
9,Q472520,1899,1898,'Hal B. Wallis'@en


**Finding:** 250 of the 266 overlapping entities have identical years of birth in Wikidata and Getty. In the remaining cases, the years usually differ only slightly (e.g., 1181 vs 1182).
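
One subtlety of the two queries above is that they compare every Getty value with every Wikidata value of the same Qnode, so a Qnode with several Wikidata dates can show up in both the "agree" count and the "disagree" table. The plain-Python sketch below reproduces this pairwise logic on values taken from the outputs in this notebook. The helpers are hypothetical, and the exact literal form of the Getty value for Q177847 is assumed from the year shown in the table.

```python
import re


def kgtk_year(date):
    """Extract the year from a KGTK date literal such as ^1907-05-27T00:00:00Z/11."""
    match = re.match(r"\^(-?\d+)-", date)
    if match is None:
        raise ValueError(f"not a KGTK date literal: {date!r}")
    return int(match.group(1))


def compare_years(getty_edges, wikidata_edges):
    """Compare all Getty/Wikidata value pairs per Qnode, at year level."""
    wd_years = {}
    for qnode, _label, date in wikidata_edges:
        wd_years.setdefault(qnode, set()).add(kgtk_year(date))
    agree, disagree = set(), set()
    for qnode, _label, date in getty_edges:
        getty_year = kgtk_year(date)
        for wd_year in wd_years.get(qnode, ()):
            if getty_year == wd_year:
                agree.add(qnode)
            else:
                disagree.add((qnode, getty_year, wd_year))
    return agree, sorted(disagree)


getty = [
    ("Q100948", "P569", "^1907-01-01T00:00:00Z/9"),
    ("Q182021", "P569", "^1573-01-01T00:00:00Z/9"),
    ("Q177847", "P569", "^100-01-01T00:00:00Z/9"),  # literal form assumed
]
wikidata = [
    ("Q100948", "P569", "^1907-05-27T00:00:00Z/11"),
    ("Q182021", "P569", "^1573-04-26T00:00:00Z/11"),
    ("Q182021", "P569", "^1575-04-26T00:00:00Z/11"),
    ("Q177847", "P569", "^0120-01-01T00:00:00Z/9"),
    ("Q177847", "P569", "^0125-01-01T00:00:00Z/9"),
]

agree, disagree = compare_years(getty, wikidata)
print(sorted(agree))
print(disagree)
```

In this sample, Q182021 lands in both results, because its Getty year matches one of its two Wikidata dates and differs from the other.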

## 5. Append the newly found years to our Wikidata subgraph

We are now ready to insert the 273 new values for the 269 entities from Getty into our Wikidata subgraph. 

We first complete each edge with an id, using the `add-id` command:

In [39]:
%%time
kgtk("""add-id --debug -i $TEMP/New_BD.tsv --id-style wikidata -o $TEMP/New_BD_with_ID.tsv""")

CPU times: user 3.42 ms, sys: 10.5 ms, total: 13.9 ms
Wall time: 1.31 s


Finally, we concatenate the original Wikidata graph with the new edges from Getty:

In [40]:
%%time
kgtk("""cat -i $all $TEMP/New_BD_with_ID.tsv -o $OUT/all_plus_getty.tsv""")

CPU times: user 5.74 ms, sys: 13.8 ms, total: 19.6 ms
Wall time: 9.29 s


Let's count the number of edges in Wikidata before and after enrichment.

Before:

In [41]:
%%time
kgtk("""
    query -i $all 
    --match '(q)-[]->()'
    --return 'count(q)'
    """)

CPU times: user 5.95 ms, sys: 13.1 ms, total: 19.1 ms
Wall time: 1.52 s


Unnamed: 0,"count(graph_1_c1.""node1"")"
0,2614949


After:

In [42]:
%%time
kgtk("""
    query -i $OUT/all_plus_getty.tsv 
    --match '(q)-[]->()'
    --return 'count(q)'
    """)

CPU times: user 9.66 ms, sys: 14.4 ms, total: 24.1 ms
Wall time: 16.8 s


Unnamed: 0,"count(graph_8_c1.""node1"")"
0,2615222


**Finding:** As expected, the difference is 273 (2,615,222 - 2,614,949) edges.