{ "cells": [ { "cell_type": "markdown", "id": "d34e89ca", "metadata": {}, "source": [ "# Enriching Wikidata with the Getty KG\n", "\n", "The [Getty vocabularies](https://www.getty.edu/research/tools/vocabularies/lod/index.html) contain rich data represented in RDF format.\n", "\n", "This notebook shows how graphs like the Getty vocabularies can be used to enrich Wikidata by using `kgtk` operations. We will show this enrichment on the records of people in the `Arnold Schwarzenegger` graph that exist both in Wikidata (with a Qnode) and in the Getty vocabularies (with a ULAN ID). We will enrich their `date of birth` information.\n", "\n", "Specifically, we will investigate: *Does Getty contain complementary information to Wikidata about people's dates of birth?*\n", "\n", "We will use KGTK to import Getty data, align Getty to Wikidata, query dates of birth in both graphs separately, compare the results, and enrich the Wikidata graph with the missing information." ] }, { "cell_type": "markdown", "id": "70b8e1c1-d0ba-46de-aa7f-14a5e4db304c", "metadata": {}, "source": [ "## Step 0: Install KGTK" ] }, { "cell_type": "markdown", "id": "9d6e7007-7d51-43f4-b33f-25bbcb0d9155", "metadata": {}, "source": [ "Only run the following cell if KGTK is not installed,\n", "for example when running in [Google Colab](https://colab.research.google.com/)." ] }, { "cell_type": "code", "execution_count": null, "id": "00fc9a9a-b525-4774-9528-0316b266fb84", "metadata": {}, "outputs": [], "source": [ "!pip install kgtk" ] }, { "cell_type": "code", "execution_count": 2, "id": "ad7e2663", "metadata": {}, "outputs": [], "source": [ "import os\n", "import json\n", "\n", "from kgtk.configure_kgtk_notebooks import ConfigureKGTK\n", "from kgtk.functions import kgtk, kypher" ] }, { "cell_type": "markdown", "id": "0d135d32", "metadata": {}, "source": [ "## Set up environment path\n", "Here we set up the environment variables that will be used in the following sections, including folders, files such as the basic databases, query 
output and so on." ] }, { "cell_type": "code", "execution_count": 3, "id": "f62fd5b0-c242-48e7-9dfd-4d9dbaedb164", "metadata": {}, "outputs": [], "source": [ "# Parameters\n", "\n", "# Folder on local machine where to create the output and temporary folders\n", "\n", "input_path = None\n", "output_path = \"/tmp/kgtk-projects\"\n", "project_name = \"getty-enrichment\"" ] }, { "cell_type": "code", "execution_count": null, "id": "fd172fb9-8155-4824-b443-4faa8d742279", "metadata": {}, "outputs": [], "source": [ "files = [\n", " \"all\",\n", " \"label\"\n", "]\n", "additional_files = {\n", " \"ulan_terms\": \"ULANOut_2Terms.nt.gz\", \n", " \"ulan_subjects\": \"ULANOut_1Subjects.nt.gz\",\n", " \"ulan_agentmap\": \"ULANOut_AgentMap.nt.gz\", \n", " \"ulan_biographies\": \"ULANOut_Biographies.nt.gz\",\n", " \"namespaces\": \"namespaces.tsv\"}\n", "\n", "ck = ConfigureKGTK(files, \n", " input_files_url=\"https://github.com/usc-isi-i2/kgtk-tutorial-files/raw/main/datasets/getty\")\n", "ck.configure_kgtk(input_graph_path=input_path,\n", " output_path=output_path,\n", " project_name=project_name,\n", " additional_files=additional_files)" ] }, { "cell_type": "code", "execution_count": 8, "id": "c413a881-de11-4e47-a409-ee1d5720288f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "EXAMPLES_DIR: /Users/amandeep/Github/kgtk-notebooks/examples\n", "KGTK_GRAPH_CACHE: /tmp/kgtk-projects/getty-enrichment/temp.getty-enrichment/wikidata.sqlite3.db\n", "STORE: /tmp/kgtk-projects/getty-enrichment/temp.getty-enrichment/wikidata.sqlite3.db\n", "KGTK_OPTION_DEBUG: false\n", "USE_CASES_DIR: /Users/amandeep/Github/kgtk-notebooks/use-cases\n", "KGTK_LABEL_FILE: /Users/amandeep/isi-kgtk-tutorial/getty-enrichment_input/labels.en.tsv.gz\n", "OUT: /tmp/kgtk-projects/getty-enrichment\n", "GRAPH: /Users/amandeep/isi-kgtk-tutorial/getty-enrichment_input\n", "kypher: kgtk query --graph-cache 
/tmp/kgtk-projects/getty-enrichment/temp.getty-enrichment/wikidata.sqlite3.db\n", "TEMP: /tmp/kgtk-projects/getty-enrichment/temp.getty-enrichment\n", "kgtk: kgtk\n", "all: /Users/amandeep/isi-kgtk-tutorial/getty-enrichment_input/all.tsv.gz\n", "label: /Users/amandeep/isi-kgtk-tutorial/getty-enrichment_input/labels.en.tsv.gz\n", "ulan_terms: /Users/amandeep/isi-kgtk-tutorial/getty-enrichment_input/ULANOut_2Terms.nt.gz\n", "ulan_subjects: /Users/amandeep/isi-kgtk-tutorial/getty-enrichment_input/ULANOut_1Subjects.nt.gz\n", "ulan_agentmap: /Users/amandeep/isi-kgtk-tutorial/getty-enrichment_input/ULANOut_AgentMap.nt.gz\n", "ulan_biographies: /Users/amandeep/isi-kgtk-tutorial/getty-enrichment_input/ULANOut_Biographies.nt.gz\n", "namespaces: /Users/amandeep/isi-kgtk-tutorial/getty-enrichment_input/namespaces.tsv\n" ] } ], "source": [ "ck.print_env_variables()" ] }, { "cell_type": "markdown", "id": "791ea034-d77d-44e6-85f1-0946ded85070", "metadata": {}, "source": [ "## Approach overview\n", "\n", "The Getty knowledge graph consists of [multiple vocabulary files](https://www.getty.edu/research/tools/vocabularies/), including ULAN (Union List of Artist Names), TGN (Thesaurus of Geographic Names), and AAT (Art & Architecture Thesaurus).\n", "In this tutorial, we will focus on the ULAN vocabulary, which \"includes names, rich relationships, notes, sources, and biographical information for artists, architects, firms, studios, repositories, and patrons, both individuals and corporate bodies, named and anonymous\". The procedures for the other vocabularies should be analogous as they are also in `.nt` format.\n", "\n", "The method that we will use consists of the following 5 steps:\n", "1. Import Getty's ULAN file into KGTK\n", "2. Align Getty to Wikidata\n", "3. Query Wikidata, record known & unknown values\n", "4. Query Getty to see if we can find these unknown values\n", "5. 
Append the newly found values to Wikidata" ] }, { "cell_type": "markdown", "id": "0923ad59", "metadata": {}, "source": [ "## 1. Import Getty's ULAN data into `kgtk`" ] }, { "cell_type": "markdown", "id": "2c08d3f1", "metadata": {}, "source": [ "As both ULAN and TGN are stored in n-triples (`.nt`) format, we can simply use the `import-ntriples` command. \n", "\n", "**Understanding prefixes** Getty conveniently provides an ontology file in [RDF format](http://vocab.getty.edu/ontology.rdf), which defines the prefixes in the file header. We have transformed this file into KGTK format (`namespaces.tsv`), and we will use it to help KGTK understand the prefixes in the data. Here are its contents:" ] }, { "cell_type": "code", "execution_count": 9, "id": "8b1d2555-3bc3-4026-bab3-53760fa1bdc3", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " node1 label node2\n", "0 xml-schema-type prefix_expansion http://www.w3.org/2001/XMLSchema#\n", "1 ulan_scopeNote prefix_expansion http://vocab.getty.edu/ulan/scopeNote/\n", "2 tgn_term prefix_expansion http://vocab.getty.edu/tgn/term/\n", "3 rrx prefix_expansion http://purl.org/r2rml-ext/\n", "4 tgn_scopeNote prefix_expansion http://vocab.getty.edu/tgn/scopeNote/\n", ".. ... ... ...\n", "57 vann prefix_expansion http://purl.org/vocab/vann/\n", "58 vcard prefix_expansion http://www.w3.org/2006/vcard/ns#\n", "59 ulan_source prefix_expansion http://vocab.getty.edu/ulan/source/\n", "60 cc prefix_expansion http://creativecommons.org/ns#\n", "61 rdfs prefix_expansion http://www.w3.org/2000/01/rdf-schema#\n", "\n", "[62 rows x 3 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " cat -i $GRAPH/namespaces.tsv\n", "\"\"\")" ] }, { "cell_type": "markdown", "id": "0699ef39-bdc6-441d-9c1b-a32223405e99", "metadata": {}, "source": [ "**Getty files** We will use four files from Getty's ULAN vocabulary:\n", "1. `Biography`, which links agents to biographies using the `gvp:biographyPreferred` property.\n", "2. `Agent Map`, which links people to their roles (\"agents\") through the `foaf:focus` property.\n", "3. `Subjects`, which uses `dc:identifier` to link ULAN nodes to their ULAN ID strings.\n", "4. 
`Terms`, which links agents to their years of birth and death using the `gvp:estStart` and `gvp:estEnd` properties.\n", "\n", "Together, the four files are needed to enable the following path from people to birthdates:\n", "\n", "![ULAN](../media/ULAN.png)\n", "\n", "For convenience, we have uploaded the files in `.gz` format to this GitHub repository, so we just have to gunzip them before using them in KGTK:" ] }, { "cell_type": "code", "execution_count": 10, "id": "73fb6915-037d-48c3-85a7-d6ff33d6e9e9", "metadata": {}, "outputs": [], "source": [ "!gunzip $GRAPH/ULANOut_2Terms.nt.gz\n", "!gunzip $GRAPH/ULANOut_1Subjects.nt.gz\n", "!gunzip $GRAPH/ULANOut_AgentMap.nt.gz\n", "!gunzip $GRAPH/ULANOut_Biographies.nt.gz" ] }, { "cell_type": "markdown", "id": "c15d3b39-f818-44ef-927b-e02ec94480a7", "metadata": {}, "source": [ "We can now import each of these four files into KGTK:" ] }, { "cell_type": "code", "execution_count": 11, "id": "bb458fc6", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 46.7 ms, sys: 40 ms, total: 86.7 ms\n", "Wall time: 2min 27s\n" ] } ], "source": [ "%%time\n", "kgtk(\"\"\"\n", " import-ntriples \n", " -i $GRAPH/ULANOut_2Terms.nt\n", " -o $TEMP/ULAN_term_KGTK.tsv\n", " --namespace-file $GRAPH/namespaces.tsv\n", " --namespace-id-use-uuid True \n", " --build-new-namespaces False \n", " --output-only-used-namespaces True \n", " --structured-value-label gvp:structured_value \n", " --structured-uri-label gvp:structured_uri \n", " --newnode-prefix node \n", " --newnode-use-uuid True\n", " \"\"\")" ] }, { "cell_type": "code", "execution_count": 12, "id": "e9277089-59e8-4909-861c-158cf5f9e7bb", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " node1 label node2\n", "0 ulan:500523031 gvp:prefLabelGVP ulan_term:1501308164\n", "1 ulan:500523031 skosxl:prefLabel ulan_term:1501308164\n", "2 ulan:500523038 gvp:prefLabelGVP ulan_term:1501308170\n", "3 ulan:500523038 skosxl:prefLabel ulan_term:1501308170\n", "4 ulan:500523041 gvp:prefLabelGVP ulan_term:1501308173\n", "5 ulan:500523041 skosxl:prefLabel ulan_term:1501308173\n", "6 ulan:500523044 gvp:prefLabelGVP ulan_term:1501308176\n", "7 ulan:500523044 skosxl:prefLabel ulan_term:1501308176\n", "8 ulan:500523050 gvp:prefLabelGVP ulan_term:1501308181\n", "9 ulan:500523050 skosxl:prefLabel ulan_term:1501308181" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " head -i $TEMP/ULAN_term_KGTK.tsv\n", "\"\"\")" ] }, { "cell_type": "code", "execution_count": 13, "id": "bcd33982", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 18 ms, sys: 21.2 ms, total: 39.1 ms\n", "Wall time: 59.2 s\n" ] } ], "source": [ "%%time\n", "kgtk(\"\"\"\n", " import-ntriples \n", " -i $GRAPH/ULANOut_1Subjects.nt\n", " -o $TEMP/ULAN_subject_KGTK.tsv \n", " --namespace-file $GRAPH/namespaces.tsv\n", " --namespace-id-use-uuid True \n", " --build-new-namespaces False \n", " --output-only-used-namespaces True \n", " --structured-value-label gvp:structured_value \n", " --structured-uri-label gvp:structured_uri \n", " --newnode-prefix node \n", " --newnode-use-uuid True\n", " \"\"\")" ] }, { "cell_type": "code", "execution_count": 14, "id": "0c663f44-2cfa-4769-a3ef-294a07acc1ec", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " node1 label \\\n", "0 ulan:500204004 rdf:type \n", "1 ulan:500204004 gvp:displayOrder \n", "2 ulan:500204004 gvp:parentStringAbbrev \n", "3 ulan:500204004 gvp:parentString \n", "4 ulan:500204004 dc:identifier \n", "5 ulan:500204004 dcterm:license \n", "6 ulan:500204004 cc:license \n", "7 ulan:500204004 skos:inScheme \n", "8 ulan:500204004 void:inDataset \n", "9 ulan:500372685 rdf:type \n", "\n", " node2 \n", "0 gvp:UnknownPersonConcept \n", "1 1 \n", "2 Unknown People by Culture \n", "3 Unknown People by Culture \n", "4 500204004 \n", "5 http://opendatacommons.org/licenses/by/1.0/ \n", "6 http://opendatacommons.org/licenses/by/1.0/ \n", "7 ulan: \n", "8 http://vocab.getty.edu/dataset/ulan \n", "9 gvp:UnknownPersonConcept " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " head -i $TEMP/ULAN_subject_KGTK.tsv\n", "\"\"\")" ] }, { "cell_type": "code", "execution_count": 15, "id": "5e7a747e-cc2c-4f9a-aa85-343c6e7a08a4", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 10.8 ms, sys: 15.6 ms, total: 26.4 ms\n", "Wall time: 29.4 s\n" ] } ], "source": [ "%%time\n", "kgtk(\"\"\"\n", " import-ntriples \n", " -i $GRAPH/ULANOut_AgentMap.nt\n", " -o $TEMP/ULAN_agentmap_KGTK.tsv \n", " --namespace-file $GRAPH/namespaces.tsv \n", " --namespace-id-use-uuid True \n", " --build-new-namespaces False \n", " --output-only-used-namespaces True \n", " --structured-value-label gvp:structured_value \n", " --structured-uri-label gvp:structured_uri \n", " --newnode-prefix node \n", " --newnode-use-uuid True\n", " \"\"\")" ] }, { "cell_type": "code", "execution_count": 16, "id": "320e7bac-f24e-4801-b282-f6b2b7db1b2a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " node1 label node2\n", "0 ulan:500000002 foaf:focus ulan:500000002-agent\n", "1 ulan:500000003 foaf:focus ulan:500000003-agent\n", "2 ulan:500000004 foaf:focus ulan:500000004-agent\n", "3 ulan:500000005 foaf:focus ulan:500000005-agent\n", "4 ulan:500000006 foaf:focus ulan:500000006-agent\n", "5 ulan:500000007 foaf:focus ulan:500000007-agent\n", "6 ulan:500000009 foaf:focus ulan:500000009-agent\n", "7 ulan:500000010 foaf:focus ulan:500000010-agent\n", "8 ulan:500000011 foaf:focus ulan:500000011-agent\n", "9 ulan:500000012 foaf:focus ulan:500000012-agent" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " head -i $TEMP/ULAN_agentmap_KGTK.tsv\n", "\"\"\")" ] }, { "cell_type": "code", "execution_count": 17, "id": "bc17eafd-8925-4895-a19c-ecc33e265c4b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 33.1 ms, sys: 30.6 ms, total: 63.7 ms\n", "Wall time: 1min 55s\n" ] } ], "source": [ "%%time\n", "kgtk(\"\"\"\n", " import-ntriples \n", " -i $GRAPH/ULANOut_Biographies.nt\n", " -o $TEMP/ULAN_biography_KGTK.tsv \n", " --namespace-file $GRAPH/namespaces.tsv\n", " --namespace-id-use-uuid True \n", " --build-new-namespaces False \n", " --output-only-used-namespaces True \n", " --structured-value-label gvp:structured_value \n", " --structured-uri-label gvp:structured_uri \n", " --newnode-prefix node \n", " --newnode-use-uuid True\n", " \"\"\")" ] }, { "cell_type": "code", "execution_count": 18, "id": "74629484-42d9-4bc3-87e1-a682d2ea62f7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " node1 label node2\n", "0 ulan:500000002-agent gvp:biographyPreferred ulan_bio:4000336014\n", "1 ulan:500000003-agent gvp:biographyPreferred ulan_bio:4000336015\n", "2 ulan:500000004-agent gvp:biographyNonPreferred ulan_bio:4000000003\n", "3 ulan:500000004-agent gvp:biographyPreferred ulan_bio:4000000001\n", "4 ulan:500000004-agent gvp:biographyNonPreferred ulan_bio:4000000002\n", "5 ulan:500000004-agent gvp:biographyNonPreferred ulan_bio:4000334645\n", "6 ulan:500000004-agent gvp:biographyNonPreferred ulan_bio:4000757338\n", "7 ulan:500000005-agent gvp:biographyPreferred ulan_bio:4000000004\n", "8 ulan:500000005-agent gvp:biographyNonPreferred ulan_bio:4000000005\n", "9 ulan:500000005-agent gvp:biographyNonPreferred ulan_bio:4000000006" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " head -i $TEMP/ULAN_biography_KGTK.tsv\n", "\"\"\")" ] }, { "cell_type": "markdown", "id": "0ae93142-7680-427a-8555-0f3d8fb9a088", "metadata": {}, "source": [ "After importing each of the files, we can now use KGTK operations on them. We start by using `kgtk cat` to concatenate them into a single file, which is more convenient to work with." ] }, { "cell_type": "code", "execution_count": 19, "id": "7ff48c95-5e70-400d-bd22-b7d55a1c161d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 22.3 ms, sys: 21.9 ms, total: 44.2 ms\n", "Wall time: 1min 8s\n" ] } ], "source": [ "%%time\n", "kgtk(\"\"\"\n", " cat -i $TEMP/ULAN_term_KGTK.tsv $TEMP/ULAN_subject_KGTK.tsv $TEMP/ULAN_agentmap_KGTK.tsv $TEMP/ULAN_biography_KGTK.tsv \n", " -o $TEMP/ULAN_all.tsv\n", " \"\"\")" ] }, { "cell_type": "markdown", "id": "caae378c", "metadata": {}, "source": [ "## 2. Build Getty-Wikidata Alignment\n", "Getty provides a `WikidataAlignment` file, but our analysis showed that this alignment file is incomplete or out of date. 
Thus, we build our own alignment file, which links ULAN IDs to Wikidata Qnodes." ] }, { "cell_type": "markdown", "id": "7dd3db2d", "metadata": {}, "source": [ "We perform a join between the Wikidata graph and the ULAN graph through the ULAN identifiers available in both graphs.\n", "Wikidata uses the property `P245` to map Qnode IDs to ULAN identifiers, whereas Getty links ULAN nodes to their IDs with the `dc:identifier` property.\n", "\n", "We will use the `skos:exactMatch` property to indicate alignment between ULAN nodes and Wikidata nodes.\n", "\n", "*This query takes our subgraph of Wikidata and the Getty ULAN graph, which was originally in an entirely different format, and queries the two jointly.*\n", "\n", "Let's first see what results we get with this join operation:" ] }, { "cell_type": "code", "execution_count": 21, "id": "69cf5f27", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 8.26 ms, sys: 12.3 ms, total: 20.6 ms\n", "Wall time: 3.56 s\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " node1 label node2 \\\n", "0 ulan:500224955 skos:exactMatch Q100948 \n", "1 ulan:500281177 skos:exactMatch Q101771 \n", "2 ulan:500001235 skos:exactMatch Q101791 \n", "3 ulan:500256782 skos:exactMatch Q102139 \n", "4 ulan:500302331 skos:exactMatch Q1024362 \n", ".. ... ... ... \n", "538 ulan:500262206 skos:exactMatch Q9696 \n", "539 ulan:500247140 skos:exactMatch Q972381 \n", "540 ulan:500324997 skos:exactMatch Q97416 \n", "541 ulan:500274474 skos:exactMatch Q979511 \n", "542 ulan:500030880 skos:exactMatch Q9916 \n", "\n", " node2;label \n", "0 'Rachel Carson'@en \n", "1 'Gottfried Gruben'@en \n", "2 'Sep Ruf'@en \n", "3 'Margrethe II of Denmark'@en \n", "4 'Spanish National Research Council'@en \n", ".. ... \n", "538 'John F. Kennedy'@en \n", "539 'George Hall'@en \n", "540 'Gerhart Rodenwaldt'@en \n", "541 'Stuart Craig'@en \n", "542 'Dwight D. Eisenhower'@en \n", "\n", "[543 rows x 4 columns]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "kgtk(\"\"\"\n", " query -i $all $TEMP/ULAN_all.tsv \n", " --match '\n", " all: (qnode)-[:P245]->(identifier), \n", " ULAN: (ulanid)-[p]->(identifier)' \n", " --where 'p.label = \"dc:identifier\"' \n", " --return '\n", " distinct ulanid as node1, \n", " \"skos:exactMatch\" as label, \n", " qnode as node2' \n", " / add-labels\n", " \"\"\")" ] }, { "cell_type": "markdown", "id": "17acc728-cc7e-4cb9-ade5-e92a40a36d21", "metadata": {}, "source": [ "The results look reasonable, so let's go ahead and store the alignment into a KGTK file:" ] }, { "cell_type": "code", "execution_count": 22, "id": "4a3fdca2", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 3.52 ms, sys: 9.88 ms, total: 13.4 ms\n", "Wall time: 1.47 s\n" ] } ], "source": [ "%%time\n", "kgtk(\"\"\"\n", " query -i $all $TEMP/ULAN_all.tsv \n", " --match '\n", " all: (qnode)-[:P245]->(identifier), \n", " ULAN: (ulanid)-[p]->(identifier)' \n", " 
--where 'p.label = \"dc:identifier\"' \n", " --return '\n", " distinct ulanid as node1, \n", " \"skos:exactMatch\" as label, \n", " qnode as node2' \n", " -o $TEMP/ULAN_ALIGN.tsv\n", " \"\"\")" ] }, { "cell_type": "markdown", "id": "2daceb10-44c6-4e82-9dc7-e628618451ad", "metadata": {}, "source": [ "We will now run a simple Kypher query to count the Qnodes for which we have a ULAN mapping:" ] }, { "cell_type": "code", "execution_count": 23, "id": "5e2e7b05", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " QNODE\n", "0 535" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " query -i $TEMP/ULAN_ALIGN.tsv \n", " --match '(ulanid)-[]->(qnode)' \n", " --return 'count(distinct qnode) as QNODE'\n", " \"\"\")" ] }, { "cell_type": "markdown", "id": "3723772c-7c29-4a2b-8104-77b08d66cb89", "metadata": {}, "source": [ "Hmm... So there are 535 Qnodes that correspond to 543 ULAN nodes, which means that some Qnodes have more than one ULAN ID. In theory, this should not happen: each entity in Wikidata should correspond to a single ULAN node.\n", "\n", "Let's find the Qnodes with multiple mappings and inspect them more closely:" ] }, { "cell_type": "code", "execution_count": 24, "id": "0aa9882a-4269-4efe-a457-1031dc708404", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " qnode ulan1 ulan2\n", "0 Q1244372 ulan:500304981 ulan:500305436\n", "1 Q127064 ulan:500279772 ulan:500304559\n", "2 Q157808 ulan:500210203 ulan:500303345\n", "3 Q1600831 ulan:500227540 ulan:500312167\n", "4 Q2837755 ulan:500312076 ulan:500312077\n", "5 Q2945260 ulan:500251050 ulan:500307043\n", "6 Q526170 ulan:500307065 ulan:500312663\n", "7 Q66149 ulan:500023792 ulan:500358178" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " query -i $TEMP/ULAN_ALIGN.tsv\n", " --match '\n", " (u1)-[]->(qnode),\n", " (u2)-[]->(qnode)'\n", " --where 'u1\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "" ], "text/plain": [ " node1 label node2 node1;label \\\n", "0 Q100948 P569 ^1907-05-27T00:00:00Z/11 'Rachel Carson'@en \n", "1 Q101771 P569 ^1929-06-21T00:00:00Z/11 'Gottfried Gruben'@en \n", "2 Q101791 P569 ^1908-03-09T00:00:00Z/11 'Sep Ruf'@en \n", "3 Q102139 P569 ^1940-04-16T00:00:00Z/11 'Margrethe II of Denmark'@en \n", "4 Q102711 P569 ^1936-05-17T00:00:00Z/11 'Dennis Hopper'@en \n", ".. ... ... ... ... \n", "295 Q9696 P569 ^1917-05-29T00:00:00Z/11 'John F. Kennedy'@en \n", "296 Q972381 P569 ^1916-11-19T00:00:00Z/11 'George Hall'@en \n", "297 Q97416 P569 ^1886-10-16T00:00:00Z/11 'Gerhart Rodenwaldt'@en \n", "298 Q979511 P569 ^1942-04-14T00:00:00Z/11 'Stuart Craig'@en \n", "299 Q9916 P569 ^1890-10-14T00:00:00Z/11 'Dwight D. Eisenhower'@en \n", "\n", " label;label \n", "0 'date of birth'@en \n", "1 'date of birth'@en \n", "2 'date of birth'@en \n", "3 'date of birth'@en \n", "4 'date of birth'@en \n", ".. ... \n", "295 'date of birth'@en \n", "296 'date of birth'@en \n", "297 'date of birth'@en \n", "298 'date of birth'@en \n", "299 'date of birth'@en \n", "\n", "[300 rows x 5 columns]" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "kgtk(\"\"\"\n", " query -i $TEMP/ULAN_ALIGN.tsv $all \n", " --match 'ALIGN: (ulanid)-[]->(qnode), \n", " all: (qnode)-[p:P569]->(birthdate)' \n", " --return 'qnode as node1, p.label as label, birthdate as node2' \n", " / add-labels\n", " \"\"\")" ] }, { "cell_type": "markdown", "id": "1dbe420d-5fd1-41f2-9f46-3d2b919d794a", "metadata": {}, "source": [ "Now that we understand the results, we perform the query for all 535 people:" ] }, { "cell_type": "code", "execution_count": 26, "id": "6b0a1f41", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 3.02 ms, sys: 10 ms, total: 13 ms\n", "Wall time: 1.33 s\n" ] } ], "source": [ "%%time\n", "kgtk(\"\"\"\n", " query -i $TEMP/ULAN_ALIGN.tsv $all \n", " --match 'ALIGN: 
(ulanid)-[]->(qnode), \n", " all: (qnode)-[p:P569]->(birthdate)' \n", " --return 'qnode as node1, p.label as label, birthdate as node2' \n", " -o $TEMP/WD_BD.tsv\n", " \"\"\")" ] }, { "cell_type": "markdown", "id": "02722b2b-fe2e-4692-bb2c-4624824773e5", "metadata": {}, "source": [ "And we count the date of birth rows that we find:" ] }, { "cell_type": "code", "execution_count": 27, "id": "b56692c5", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Qnode
0266
\n", "
" ], "text/plain": [ " Qnode\n", "0 266" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " query -i $TEMP/WD_BD.tsv\n", " --match '(qnode)-[]->()' \n", " --return 'count(distinct qnode) as Qnode'\n", " \"\"\")" ] }, { "cell_type": "markdown", "id": "4001e999-b302-4c5d-9abc-cfb58cfa73ee", "metadata": {}, "source": [ "Again here, each person should in theory have a single date of birth, as it is a functional property. Hence, the finding that we find 300 dates for 266 people needs further investigation:" ] }, { "cell_type": "code", "execution_count": 28, "id": "d2ea5e68-6955-474d-a433-de86f15bf52b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
qnodebd1bd2
0Q106775^1930-10-01T00:00:00Z/11^1930-10-02T00:00:00Z/11
1Q11806^1735-10-19T00:00:00Z/11^1735-10-30T00:00:00Z/11
2Q131981^1683-10-30T00:00:00Z/11^1683-11-10T00:00:00Z/11
3Q1434^0161-01-01T00:00:00Z/9^0161-08-31T00:00:00Z/11
4Q1681112^1940-01-01T00:00:00Z/9^1940-07-13T00:00:00Z/11
5Q174880^1515-01-01T00:00:00Z/9^1515-03-28T00:00:00Z/11
6Q177847^0120-01-01T00:00:00Z/9^0125-01-01T00:00:00Z/9
7Q182021^1573-04-26T00:00:00Z/11^1575-04-26T00:00:00Z/11
8Q2643^1943-01-01T00:00:00Z/9^1943-02-25T00:00:00Z/11
9Q30875^1854-06-16T00:00:00Z/11^1854-10-16T00:00:00Z/11
10Q3089653^1803-05-20T00:00:00Z/11^1805-01-01T00:00:00Z/9
11Q311469^1634-12-22T00:00:00Z/11^1634-12-24T00:00:00Z/11
12Q352^1889-01-01T00:00:00Z/9^1889-04-20T00:00:00Z/11
13Q395578^1627-01-01T00:00:00Z/9^1627-10-26T00:00:00Z/11
14Q43689^0339-01-01T00:00:00Z/9^0340-01-01T00:00:00Z/9
15Q44281^1491-10-23T00:00:00Z/11^1491-11-01T00:00:00Z/11
16Q443540^1912-10-13T00:00:00Z/11^1912-10-31T00:00:00Z/11
17Q472520^1898-09-14T00:00:00Z/11^1899-09-14T00:00:00Z/11
18Q472520^1898-09-14T00:00:00Z/11^1898-10-18T00:00:00Z/11
19Q472520^1898-09-14T00:00:00Z/11^1898-10-08T00:00:00Z/11
20Q472520^1898-10-18T00:00:00Z/11^1899-09-14T00:00:00Z/11
21Q472520^1898-10-08T00:00:00Z/11^1899-09-14T00:00:00Z/11
22Q472520^1898-10-08T00:00:00Z/11^1898-10-18T00:00:00Z/11
23Q5738^1797-04-14T00:00:00Z/11^1797-04-15T00:00:00Z/11
24Q676555^1182-01-01T00:00:00Z/9^1182-06-24T00:00:00Z/11
25Q7322^1451-01-01T00:00:00Z/9^1451-10-31T00:00:00Z/11
26Q7322^1451-01-01T00:00:00Z/9^1451-09-01T00:00:00Z/11
27Q7322^1451-01-01T00:00:00Z/9^1451-10-31T00:00:00Z/11
28Q7322^1451-01-01T00:00:00Z/9^1451-09-01T00:00:00Z/11
29Q7322^1450-01-01T00:00:00Z/9^1451-01-01T00:00:00Z/9
30Q7322^1450-01-01T00:00:00Z/9^1451-01-01T00:00:00Z/9
31Q7322^1450-01-01T00:00:00Z/9^1451-10-31T00:00:00Z/11
32Q7322^1450-01-01T00:00:00Z/9^1451-09-01T00:00:00Z/11
33Q7322^1451-09-01T00:00:00Z/11^1451-10-31T00:00:00Z/11
34Q75612^1902-11-21T00:00:00Z/11^1904-07-14T00:00:00Z/11
35Q8018^0354-01-01T00:00:00Z/9^0354-11-13T00:00:00Z/11
36Q8018^0354-01-01T00:00:00Z/9^0354-11-13T00:00:00Z/11
37Q8479^1672-05-30T00:00:00Z/11^1672-06-09T00:00:00Z/11
38Q855^1878-12-18T00:00:00Z/11^1879-12-09T00:00:00Z/11
39Q930679^1881-09-18T00:00:00Z/11^1881-09-28T00:00:00Z/11
\n", "
" ], "text/plain": [ " qnode bd1 bd2\n", "0 Q106775 ^1930-10-01T00:00:00Z/11 ^1930-10-02T00:00:00Z/11\n", "1 Q11806 ^1735-10-19T00:00:00Z/11 ^1735-10-30T00:00:00Z/11\n", "2 Q131981 ^1683-10-30T00:00:00Z/11 ^1683-11-10T00:00:00Z/11\n", "3 Q1434 ^0161-01-01T00:00:00Z/9 ^0161-08-31T00:00:00Z/11\n", "4 Q1681112 ^1940-01-01T00:00:00Z/9 ^1940-07-13T00:00:00Z/11\n", "5 Q174880 ^1515-01-01T00:00:00Z/9 ^1515-03-28T00:00:00Z/11\n", "6 Q177847 ^0120-01-01T00:00:00Z/9 ^0125-01-01T00:00:00Z/9\n", "7 Q182021 ^1573-04-26T00:00:00Z/11 ^1575-04-26T00:00:00Z/11\n", "8 Q2643 ^1943-01-01T00:00:00Z/9 ^1943-02-25T00:00:00Z/11\n", "9 Q30875 ^1854-06-16T00:00:00Z/11 ^1854-10-16T00:00:00Z/11\n", "10 Q3089653 ^1803-05-20T00:00:00Z/11 ^1805-01-01T00:00:00Z/9\n", "11 Q311469 ^1634-12-22T00:00:00Z/11 ^1634-12-24T00:00:00Z/11\n", "12 Q352 ^1889-01-01T00:00:00Z/9 ^1889-04-20T00:00:00Z/11\n", "13 Q395578 ^1627-01-01T00:00:00Z/9 ^1627-10-26T00:00:00Z/11\n", "14 Q43689 ^0339-01-01T00:00:00Z/9 ^0340-01-01T00:00:00Z/9\n", "15 Q44281 ^1491-10-23T00:00:00Z/11 ^1491-11-01T00:00:00Z/11\n", "16 Q443540 ^1912-10-13T00:00:00Z/11 ^1912-10-31T00:00:00Z/11\n", "17 Q472520 ^1898-09-14T00:00:00Z/11 ^1899-09-14T00:00:00Z/11\n", "18 Q472520 ^1898-09-14T00:00:00Z/11 ^1898-10-18T00:00:00Z/11\n", "19 Q472520 ^1898-09-14T00:00:00Z/11 ^1898-10-08T00:00:00Z/11\n", "20 Q472520 ^1898-10-18T00:00:00Z/11 ^1899-09-14T00:00:00Z/11\n", "21 Q472520 ^1898-10-08T00:00:00Z/11 ^1899-09-14T00:00:00Z/11\n", "22 Q472520 ^1898-10-08T00:00:00Z/11 ^1898-10-18T00:00:00Z/11\n", "23 Q5738 ^1797-04-14T00:00:00Z/11 ^1797-04-15T00:00:00Z/11\n", "24 Q676555 ^1182-01-01T00:00:00Z/9 ^1182-06-24T00:00:00Z/11\n", "25 Q7322 ^1451-01-01T00:00:00Z/9 ^1451-10-31T00:00:00Z/11\n", "26 Q7322 ^1451-01-01T00:00:00Z/9 ^1451-09-01T00:00:00Z/11\n", "27 Q7322 ^1451-01-01T00:00:00Z/9 ^1451-10-31T00:00:00Z/11\n", "28 Q7322 ^1451-01-01T00:00:00Z/9 ^1451-09-01T00:00:00Z/11\n", "29 Q7322 ^1450-01-01T00:00:00Z/9 ^1451-01-01T00:00:00Z/9\n", "30 Q7322 
^1450-01-01T00:00:00Z/9 ^1451-01-01T00:00:00Z/9\n", "31 Q7322 ^1450-01-01T00:00:00Z/9 ^1451-10-31T00:00:00Z/11\n", "32 Q7322 ^1450-01-01T00:00:00Z/9 ^1451-09-01T00:00:00Z/11\n", "33 Q7322 ^1451-09-01T00:00:00Z/11 ^1451-10-31T00:00:00Z/11\n", "34 Q75612 ^1902-11-21T00:00:00Z/11 ^1904-07-14T00:00:00Z/11\n", "35 Q8018 ^0354-01-01T00:00:00Z/9 ^0354-11-13T00:00:00Z/11\n", "36 Q8018 ^0354-01-01T00:00:00Z/9 ^0354-11-13T00:00:00Z/11\n", "37 Q8479 ^1672-05-30T00:00:00Z/11 ^1672-06-09T00:00:00Z/11\n", "38 Q855 ^1878-12-18T00:00:00Z/11 ^1879-12-09T00:00:00Z/11\n", "39 Q930679 ^1881-09-18T00:00:00Z/11 ^1881-09-28T00:00:00Z/11" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " query -i $TEMP/WD_BD.tsv\n", " --match '\n", " (qnode)-[]->(bd1),\n", " (qnode)-[]->(bd2)'\n", " --where 'bd1\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
node1labelnode2node1;labellabel;label
0Q100948P569^1907-01-01T00:00:00Z/9'Rachel Carson'@en'date of birth'@en
1Q101771P569^1929-01-01T00:00:00Z/9'Gottfried Gruben'@en'date of birth'@en
2Q101791P569^1908-01-01T00:00:00Z/9'Sep Ruf'@en'date of birth'@en
3Q102139P569^1940-01-01T00:00:00Z/9'Margrethe II of Denmark'@en'date of birth'@en
4Q1024362P569^1800-01-01T00:00:00Z/9'Spanish National Research Council'@en'date of birth'@en
..................
535Q9696P569^1917-01-01T00:00:00Z/9'John F. Kennedy'@en'date of birth'@en
536Q972381P569^1916-01-01T00:00:00Z/9'George Hall'@en'date of birth'@en
537Q97416P569^1886-01-01T00:00:00Z/9'Gerhart Rodenwaldt'@en'date of birth'@en
538Q979511P569^1942-01-01T00:00:00Z/9'Stuart Craig'@en'date of birth'@en
539Q9916P569^1890-01-01T00:00:00Z/9'Dwight D. Eisenhower'@en'date of birth'@en
\n", "

540 rows × 5 columns

\n", "" ], "text/plain": [ " node1 label node2 \\\n", "0 Q100948 P569 ^1907-01-01T00:00:00Z/9 \n", "1 Q101771 P569 ^1929-01-01T00:00:00Z/9 \n", "2 Q101791 P569 ^1908-01-01T00:00:00Z/9 \n", "3 Q102139 P569 ^1940-01-01T00:00:00Z/9 \n", "4 Q1024362 P569 ^1800-01-01T00:00:00Z/9 \n", ".. ... ... ... \n", "535 Q9696 P569 ^1917-01-01T00:00:00Z/9 \n", "536 Q972381 P569 ^1916-01-01T00:00:00Z/9 \n", "537 Q97416 P569 ^1886-01-01T00:00:00Z/9 \n", "538 Q979511 P569 ^1942-01-01T00:00:00Z/9 \n", "539 Q9916 P569 ^1890-01-01T00:00:00Z/9 \n", "\n", " node1;label label;label \n", "0 'Rachel Carson'@en 'date of birth'@en \n", "1 'Gottfried Gruben'@en 'date of birth'@en \n", "2 'Sep Ruf'@en 'date of birth'@en \n", "3 'Margrethe II of Denmark'@en 'date of birth'@en \n", "4 'Spanish National Research Council'@en 'date of birth'@en \n", ".. ... ... \n", "535 'John F. Kennedy'@en 'date of birth'@en \n", "536 'George Hall'@en 'date of birth'@en \n", "537 'Gerhart Rodenwaldt'@en 'date of birth'@en \n", "538 'Stuart Craig'@en 'date of birth'@en \n", "539 'Dwight D. Eisenhower'@en 'date of birth'@en \n", "\n", "[540 rows x 5 columns]" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "kgtk(\"\"\"\n", " query -i $TEMP/ULAN_ALIGN.tsv $TEMP/ULAN_all.tsv\n", " --match '\n", " ALIGN: (ulanid)-[]->(qnode), \n", " all: (ulanid)-[p0]->(ulanagent), \n", " all: (ulanagent)-[p1]->()-[p2]->()-[p3]->(datevalue)' \n", " --where '\n", " p0.label = \"foaf:focus\" \n", " AND p1.label = \"gvp:biographyPreferred\" \n", " AND p2.label = \"gvp:estStart\" \n", " AND p3.label = \"gvp:structured_value\"' \n", " --return '\n", " distinct qnode as node1, \n", " \"P569\" as label, \n", " printf(\"^%s-01-01T00:00:00Z/9\", kgtk_unstringify(datevalue)) as node2' \n", " / add-labels\n", " \"\"\")" ] }, { "cell_type": "markdown", "id": "67ecee50-1e8e-441b-a730-c538f25dea70", "metadata": {}, "source": [ "As expected, we obtain dates of birth with a year precision (`/9`). 
We can thus go ahead and query for the dates of birth for all 535 entities:" ] }, { "cell_type": "code", "execution_count": 30, "id": "265a5ca7", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 3.18 ms, sys: 10.2 ms, total: 13.3 ms\n", "Wall time: 1.45 s\n" ] } ], "source": [ "%%time\n", "kgtk(\"\"\"\n", " query -i $TEMP/ULAN_ALIGN.tsv $TEMP/ULAN_all.tsv\n", " --match 'ALIGN: (ulanid)-[]->(qnode), \n", " all: (ulanid)-[p0]->(ulanagent), \n", " all: (ulanagent)-[p1]->()-[p2]->()-[p3]->(datevalue)' \n", " --where 'p0.label = \"foaf:focus\" AND p1.label = \"gvp:biographyPreferred\" AND p2.label = \"gvp:estStart\" AND p3.label = \"gvp:structured_value\"' \n", " --return 'distinct qnode as node1, \"P569\" as label, printf(\"^%s-01-01T00:00:00Z/9\", kgtk_unstringify(datevalue)) as node2' \n", " -o $TEMP/Getty_BD.tsv\n", " \"\"\")" ] }, { "cell_type": "markdown", "id": "ba953da9", "metadata": {}, "source": [ "Let's see how many results we found in Getty:" ] }, { "cell_type": "code", "execution_count": 31, "id": "7199e4a8", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Qnode
0535
\n", "
" ], "text/plain": [ " Qnode\n", "0 535" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " query -i $TEMP/Getty_BD.tsv \n", " --match '(qnode)-[]->()' \n", " --return 'count(distinct qnode) as Qnode'\n", " \"\"\")" ] }, { "cell_type": "markdown", "id": "71cfd959-712e-429a-8b99-f82fdde6f642", "metadata": {}, "source": [ "**Finding:** We find date of birth for all 535 people in our Getty knowledge graph! We get 540 dates in total, which again means that we have some duplicates." ] }, { "cell_type": "markdown", "id": "c6b9ea99", "metadata": {}, "source": [ "### 4a. How many values are novel?\n", "Here we count for how many new date of birth we found in Getty:" ] }, { "cell_type": "code", "execution_count": 32, "id": "93b9ba52", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 3.44 ms, sys: 10.2 ms, total: 13.6 ms\n", "Wall time: 1.38 s\n" ] } ], "source": [ "%%time\n", "kgtk(\"\"\"\n", " ifnotexists -i $TEMP/Getty_BD.tsv \n", " --filter-on $TEMP/WD_BD.tsv \n", " --input-keys node1 \n", " --filter-keys node1 \n", " -o $TEMP/New_BD.tsv\n", " \"\"\")" ] }, { "cell_type": "code", "execution_count": 33, "id": "f624e3fe", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Qnode
0269
\n", "
" ], "text/plain": [ " Qnode\n", "0 269" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " query -i $TEMP/New_BD.tsv\n", " --match '(qnode)-[]->()' \n", " --return 'count(distinct qnode) as Qnode'\n", " \"\"\")" ] }, { "cell_type": "markdown", "id": "399e2e73", "metadata": {}, "source": [ "**Finding:** There are newly found dates of birth in Getty for 269 entities -- this is expected, given that Getty has 535 values, and Wikidata had 266 values." ] }, { "cell_type": "markdown", "id": "30e3b26d-2548-4a4d-92b9-4807f8a14c63", "metadata": {}, "source": [ "Let's see how many values we get in total:" ] }, { "cell_type": "code", "execution_count": 34, "id": "8635cf9e-fccf-409d-b1f9-ba893996abe9", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Qnode
0273
\n", "
" ], "text/plain": [ " Qnode\n", "0 273" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " query -i $TEMP/New_BD.tsv\n", " --match '(qnode)-[]->()' \n", " --return 'count(qnode) as Qnode'\n", " \"\"\")" ] }, { "cell_type": "markdown", "id": "f9113f44-2b62-4d53-96a4-cb4ab939fdb8", "metadata": {}, "source": [ "**Finding:** We see that in four of the novel cases, Getty has two birth dates for a node." ] }, { "cell_type": "markdown", "id": "771e3669", "metadata": {}, "source": [ "### 4b. Do the known values in Getty and Wikidata match?\n", "Let's check if the found results in Getty match with those in Wikidata. We first obtain the list of matching birth dates, using the `ifexists` command:" ] }, { "cell_type": "code", "execution_count": 35, "id": "0c413e48", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 3.59 ms, sys: 10.4 ms, total: 14 ms\n", "Wall time: 1.34 s\n" ] } ], "source": [ "%%time\n", "kgtk(\"\"\"\n", " ifexists -i $TEMP/Getty_BD.tsv \n", " --filter-on $TEMP/WD_BD.tsv \n", " --input-keys node1 \n", " --filter-keys node1 \n", " -o $TEMP/matching_bd.tsv\n", " \"\"\")" ] }, { "cell_type": "markdown", "id": "6912b9d0-ed38-4050-bfe3-41079f40d72d", "metadata": {}, "source": [ "We expect to get birth date values by both sources for 266 nodes:" ] }, { "cell_type": "code", "execution_count": 36, "id": "3194a6af", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Qnode
0266
\n", "
" ], "text/plain": [ " Qnode\n", "0 266" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " query -i $TEMP/matching_bd.tsv \n", " --match '(qnode)-[]->()' \n", " --return 'count(distinct qnode) as Qnode'\n", " \"\"\")" ] }, { "cell_type": "markdown", "id": "2f4e4850-b38f-48ae-9bce-bb8429835f2c", "metadata": {}, "source": [ "Ok, our expectation is correct. Let's now see for how many of those nodes do Wikidata and Getty agree on the birth year:" ] }, { "cell_type": "code", "execution_count": 37, "id": "7d3574ad", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
count(DISTINCT graph_5_c1.\"node1\")
0250
\n", "
" ], "text/plain": [ " count(DISTINCT graph_5_c1.\"node1\")\n", "0 250" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " query -i $TEMP/Getty_BD.tsv $TEMP/WD_BD.tsv\n", " --match '\n", " Getty: (qnode)-[p]->(v1), \n", " WD: (qnode)-[]->(v2)' \n", " --where 'kgtk_date_year(v1) = kgtk_date_year(v2)' \n", " --return 'count(distinct qnode)'\n", " \"\"\")" ] }, { "cell_type": "markdown", "id": "d5f8a09a-5a08-438f-a0b2-0a18f8ea7cc1", "metadata": {}, "source": [ "Ok, so Getty and Wikidata agree for 250 out of the 266 overlapping entities. Let's inspect the entities for which they contain different information:" ] }, { "cell_type": "code", "execution_count": 38, "id": "9523d4af-7424-4fe3-82dd-9137b43a7d36", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
node1getty_yearwd_yearnode1;label
0Q177847100125'Lucian of Samosata'@en
1Q177847100120'Lucian of Samosata'@en
2Q18202115731575'Marie de' Medici'@en
3Q19342619231921'Nancy Reagan'@en
4Q245456419701930'Charles Kalani'@en
5Q308965318031805'Frédéric Bourgeois de Mercey'@en
6Q31146916351634'Mariana of Austria'@en
7Q312460119261921'LeRoy Neiman'@en
8Q43689339340'Ambrose'@en
9Q47252018991898'Hal B. Wallis'@en
10Q51144618501940'Luciana Arrighi'@en
11Q52548717391673'Jean Chalgrin'@en
12Q5303919261928'Lina Wertmüller'@en
13Q56259619551956'David Alan Grier'@en
14Q573817961797'Adolphe Thiers'@en
15Q6008018311833'Wilhelm Dilthey'@en
16Q6142318541853'Adolf Furtwängler'@en
17Q6614917921791'Friedrich von Gärtner'@en
18Q67655511811182'Francis of Assisi'@en
19Q68799719181926'Dale Hennesy'@en
20Q72581615611571'Salomon de Brosse'@en
21Q732214511450'Christopher Columbus'@en
22Q7561219041902'Isaac Bashevis Singer'@en
23Q85518791878'Joseph Stalin'@en
24Q93726719141912'Frank Thomas'@en
\n", "
" ], "text/plain": [ " node1 getty_year wd_year node1;label\n", "0 Q177847 100 125 'Lucian of Samosata'@en\n", "1 Q177847 100 120 'Lucian of Samosata'@en\n", "2 Q182021 1573 1575 'Marie de' Medici'@en\n", "3 Q193426 1923 1921 'Nancy Reagan'@en\n", "4 Q2454564 1970 1930 'Charles Kalani'@en\n", "5 Q3089653 1803 1805 'Frédéric Bourgeois de Mercey'@en\n", "6 Q311469 1635 1634 'Mariana of Austria'@en\n", "7 Q3124601 1926 1921 'LeRoy Neiman'@en\n", "8 Q43689 339 340 'Ambrose'@en\n", "9 Q472520 1899 1898 'Hal B. Wallis'@en\n", "10 Q511446 1850 1940 'Luciana Arrighi'@en\n", "11 Q525487 1739 1673 'Jean Chalgrin'@en\n", "12 Q53039 1926 1928 'Lina Wertmüller'@en\n", "13 Q562596 1955 1956 'David Alan Grier'@en\n", "14 Q5738 1796 1797 'Adolphe Thiers'@en\n", "15 Q60080 1831 1833 'Wilhelm Dilthey'@en\n", "16 Q61423 1854 1853 'Adolf Furtwängler'@en\n", "17 Q66149 1792 1791 'Friedrich von Gärtner'@en\n", "18 Q676555 1181 1182 'Francis of Assisi'@en\n", "19 Q687997 1918 1926 'Dale Hennesy'@en\n", "20 Q725816 1561 1571 'Salomon de Brosse'@en\n", "21 Q7322 1451 1450 'Christopher Columbus'@en\n", "22 Q75612 1904 1902 'Isaac Bashevis Singer'@en\n", "23 Q855 1879 1878 'Joseph Stalin'@en\n", "24 Q937267 1914 1912 'Frank Thomas'@en" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " query -i $TEMP/Getty_BD.tsv $TEMP/WD_BD.tsv\n", " --match '\n", " Getty: (qnode)-[p]->(v1), \n", " WD: (qnode)-[]->(v2)' \n", " --where 'kgtk_date_year(v1) != kgtk_date_year(v2)' \n", " --return 'distinct qnode, kgtk_date_year(v1) as getty_year, kgtk_date_year(v2) as wd_year'\n", " --order-by 'qnode' \n", " / add-labels\n", " \"\"\")" ] }, { "cell_type": "markdown", "id": "487ed9da", "metadata": {}, "source": [ "**Finding:** 250 of the 266 ULAN ids have identical years of birth in Wikidata and Getty. In the remaining cases, the years usually differ a little bit (e.g., 1181 vs 1182)." 
] }, { "cell_type": "markdown", "id": "7f00dc6a", "metadata": {}, "source": [ "# 5. Append the newly found years to our Wikidata subgraph" ] }, { "cell_type": "markdown", "id": "88a66808", "metadata": {}, "source": [ "We are now ready to insert the 273 new values for the 269 entities from Getty into our Wikidata subgraph. \n", "\n", "We first complete each edge with an id, using the `add-id` command:" ] }, { "cell_type": "code", "execution_count": 39, "id": "c1ea1318", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 3.42 ms, sys: 10.5 ms, total: 13.9 ms\n", "Wall time: 1.31 s\n" ] } ], "source": [ "%%time\n", "kgtk(\"\"\"add-id --debug -i $TEMP/New_BD.tsv --id-style wikidata -o $TEMP/New_BD_with_ID.tsv\"\"\")" ] }, { "cell_type": "markdown", "id": "4c927fcd", "metadata": {}, "source": [ "Finally, we concatenate the original Wikidata graph with the new edges from Getty:" ] }, { "cell_type": "code", "execution_count": 40, "id": "9643994d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 5.74 ms, sys: 13.8 ms, total: 19.6 ms\n", "Wall time: 9.29 s\n" ] } ], "source": [ "%%time\n", "kgtk(\"\"\"cat -i $all $TEMP/New_BD_with_ID.tsv -o $OUT/all_plus_getty.tsv\"\"\")" ] }, { "cell_type": "markdown", "id": "5da279ca-bc9b-4939-95ec-07d9ce630d03", "metadata": {}, "source": [ "Let's count the number of edges in Wikidata before and after enrichment.\n", "\n", "Before:" ] }, { "cell_type": "code", "execution_count": 41, "id": "4c22293a-df9f-4684-bf8d-46fa7329ec9d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 5.95 ms, sys: 13.1 ms, total: 19.1 ms\n", "Wall time: 1.52 s\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
count(graph_1_c1.\"node1\")
02614949
\n", "
" ], "text/plain": [ " count(graph_1_c1.\"node1\")\n", "0 2614949" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "kgtk(\"\"\"\n", " query -i $all \n", " --match '(q)-[]->()'\n", " --return 'count(q)'\n", " \"\"\")" ] }, { "cell_type": "markdown", "id": "3b028c09-d13b-4679-969e-abda4aaac3ad", "metadata": {}, "source": [ "After:" ] }, { "cell_type": "code", "execution_count": 42, "id": "d7143b3c-cddf-4537-9c16-b52dbd63e91d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 9.66 ms, sys: 14.4 ms, total: 24.1 ms\n", "Wall time: 16.8 s\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
count(graph_8_c1.\"node1\")
02615222
\n", "
" ], "text/plain": [ " count(graph_8_c1.\"node1\")\n", "0 2615222" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "kgtk(\"\"\"\n", " query -i $OUT/all_plus_getty.tsv \n", " --match '(q)-[]->()'\n", " --return 'count(q)'\n", " \"\"\")" ] }, { "cell_type": "markdown", "id": "cd63c295-b8fa-497c-b094-1e5edbe60753", "metadata": {}, "source": [ "**Finding:** As expected, the difference is 273 (2,615,222 - 2,614,949) edges." ] } ], "metadata": { "kernelspec": { "display_name": "kgtk-env", "language": "python", "name": "kgtk-env" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 5 }