{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Knowledge Graph Profiling\n", "\n", "The goal of profiling is to produce a summary of the classes, properties and instances present in a KG. Profiling is challenging because it is computationally expensive: the queries touch large parts of the KG. In this part of the tutorial, you will learn how to use KGTK to profile a KG, and how KGTK addresses the computational challenges of computing profiles. Along the way, you will learn advanced uses of the KGTK query command.\n", "\n", "This part of the tutorial is divided into multiple subsections:\n", "- Counting the number of instances, classes and properties\n", "- Counting the number of instances of each class, the most basic form of profiling\n", "- Extending instance counting to include the instances of all subclasses of a class\n", "- Generalizing the Wikidata `instance of (P31)` to include `occupation (P106)` and `position held (P39)` so that our profiles include statistics about classes such as `film director (Q2526255)`, which in Wikidata don't have instances\n", "- Counting the number of times each property is used in the instances of each class and all its subclasses; you will learn how to divide a computationally challenging task into simpler queries that you can chain together \n", "- Customizing the profiles to include items of interest\n", "\n", "At the end, you will load the profile data in the browser so that you can get more insights into the knowledge present in the tutorial KG." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 0: Install KGTK" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Only run the following cell if KGTK is not installed.\n", " For example, if running in [Google Colab](https://colab.research.google.com/)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install kgtk" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preamble: set up the environment and files used in the tutorial" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import io\n", "import os\n", "\n", "from kgtk.configure_kgtk_notebooks import ConfigureKGTK\n", "from kgtk.functions import kgtk, kypher" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "tags": [ "parameters" ] }, "outputs": [], "source": [ "# Parameters\n", "\n", "# Folder on local machine where to create the output and temporary folders\n", "input_path = None\n", "output_path = \"/tmp/projects\"\n", "project_name = \"tutorial-profiling\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our Wikidata distribution partitions the knowledge in Wikidata into smaller files that make it possible for you to pick and choose which files you want to use. Our tutorial KG is a subset of Wikidata, and is partitioned in the same way as the full Wikidata. 
The following is a partial list of all the files:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "files = [\n", " \"all\",\n", " \"label\",\n", " \"quantity\",\n", " \"item\",\n", " \"wikibase_property\",\n", " \"qualifiers\",\n", " \"p279star\",\n", " \"p31\"\n", "]\n", "ck = ConfigureKGTK(files)\n", "ck.configure_kgtk(input_graph_path=input_path,\n", " output_path=output_path,\n", " project_name=project_name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The KGTK setup command defines environment variables for all the files so that you can reuse the Jupyter notebook when you install it on your local machine." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "TEMP: /tmp/projects/tutorial-profiling/temp.tutorial-profiling\n", "KGTK_GRAPH_CACHE: /tmp/projects/tutorial-profiling/temp.tutorial-profiling/wikidata.sqlite3.db\n", "EXAMPLES_DIR: /Users/amandeep/Github/kgtk-notebooks/examples\n", "OUT: /tmp/projects/tutorial-profiling\n", "GRAPH: /Users/amandeep/isi-kgtk-tutorial/tutorial-profiling_input\n", "kgtk: kgtk\n", "KGTK_OPTION_DEBUG: false\n", "kypher: kgtk query --graph-cache /tmp/projects/tutorial-profiling/temp.tutorial-profiling/wikidata.sqlite3.db\n", "KGTK_LABEL_FILE: /Users/amandeep/isi-kgtk-tutorial/tutorial-profiling_input/labels.en.tsv.gz\n", "STORE: /tmp/projects/tutorial-profiling/temp.tutorial-profiling/wikidata.sqlite3.db\n", "USE_CASES_DIR: /Users/amandeep/Github/kgtk-notebooks/use-cases\n", "all: /Users/amandeep/isi-kgtk-tutorial/tutorial-profiling_input/all.tsv.gz\n", "label: /Users/amandeep/isi-kgtk-tutorial/tutorial-profiling_input/labels.en.tsv.gz\n", "quantity: /Users/amandeep/isi-kgtk-tutorial/tutorial-profiling_input/claims.quantity.tsv.gz\n", "item: /Users/amandeep/isi-kgtk-tutorial/tutorial-profiling_input/claims.wikibase-item.tsv.gz\n", "wikibase_property: 
/Users/amandeep/isi-kgtk-tutorial/tutorial-profiling_input/claims.wikibase-property.tsv.gz\n", "qualifiers: /Users/amandeep/isi-kgtk-tutorial/tutorial-profiling_input/qualifiers.tsv.gz\n", "p279star: /Users/amandeep/isi-kgtk-tutorial/tutorial-profiling_input/derived.P279star.tsv.gz\n", "p31: /Users/amandeep/isi-kgtk-tutorial/tutorial-profiling_input/derived.P31.tsv.gz\n" ] } ], "source": [ "ck.print_env_variables()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The KGTK query command (https://kgtk.readthedocs.io/en/latest/transform/query/) uses a database to cache the files used in the queries. In this tutorial, we populate the cache now with the files we need; otherwise, KGTK would populate it on demand, the first time you use a file. I like to do it at configuration time to keep all the aliases in one place so that I can quickly come here and see the aliases of all the files." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "kgtk query --graph-cache /tmp/projects/tutorial-profiling/temp.tutorial-profiling/wikidata.sqlite3.db -i \"/Users/amandeep/isi-kgtk-tutorial/tutorial-profiling_input/all.tsv.gz\" --as all -i \"/Users/amandeep/isi-kgtk-tutorial/tutorial-profiling_input/labels.en.tsv.gz\" --as label -i \"/Users/amandeep/isi-kgtk-tutorial/tutorial-profiling_input/claims.quantity.tsv.gz\" --as quantity -i \"/Users/amandeep/isi-kgtk-tutorial/tutorial-profiling_input/claims.wikibase-item.tsv.gz\" --as item -i \"/Users/amandeep/isi-kgtk-tutorial/tutorial-profiling_input/claims.wikibase-property.tsv.gz\" --as wikibase_property -i \"/Users/amandeep/isi-kgtk-tutorial/tutorial-profiling_input/qualifiers.tsv.gz\" --as qualifiers -i \"/Users/amandeep/isi-kgtk-tutorial/tutorial-profiling_input/derived.P279star.tsv.gz\" --as p279star -i \"/Users/amandeep/isi-kgtk-tutorial/tutorial-profiling_input/derived.P31.tsv.gz\" --as p31 --limit 3\n", 
"node1\tlabel\tnode2\tid\tnode2;wikidatatype\n", "P10\talias\t'gif'@en\tP10-alias-en-282226-0\t\n", "P10\talias\t'animation'@en\tP10-alias-en-2f86d8-0\t\n", "P10\talias\t'media'@en\tP10-alias-en-c1427e-0\t\n", "CPU times: user 3.54 ms, sys: 11 ms, total: 14.5 ms\n", "Wall time: 30.9 s\n" ] } ], "source": [ "%%time\n", "ck.load_files_into_cache()" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Compute global KG statistics\n", "In this part of the tutorial we will compute global statistics about the number of instances in the KG, the number of properties used to describe all the instances and classes, and the number of classes.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Total number of edges in our graph:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 2654671\n" ] } ], "source": [ "%%bash\n", "zcat < $all | wc -l" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Counting the total number of nodes is a bit harder as nodes can appear in the `node1` position or the `node2` position. \n", "In the queries below we count literals as nodes, as in KGTK they are nodes:\n", "- list all the nodes that appear in the `node1` position.\n", "- list all the nodes that appear in the `node2` position.\n", "- concatenate and deduplicate the two files" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " id\n", "0 $a United States. $b Department of the Interior\n", "1 ((0?[1-9]|[1-2][0-9]|3[0-6])[LRC]?)(/(0?[1-9]|...\n", "2 (([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]?|25[0...\n", "3 (+20) 2\n", "4 (+32) 2\n", "... ...\n", "1420765 url\n", "1420766 wikibase-form\n", "1420767 wikibase-item\n", "1420768 wikibase-property\n", "1420769 wikibase-sense\n", "\n", "[1420770 rows x 1 columns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " query -i all\n", " --match '(n1)-[id]->(n2)'\n", " --return 'distinct n1 as id'\n", " -o $TEMP/node1.tsv\n", "\"\"\")\n", "\n", "kgtk(\"\"\"\n", " query -i all\n", " --match '(n1)-[id]->(n2)'\n", " --return 'distinct n2 as id'\n", " -o $TEMP/node2.tsv\n", "\"\"\")\n", "\n", "kgtk(\"\"\"\n", " cat -i $TEMP/node1.tsv -i $TEMP/node2.tsv\n", " / compact\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Counting the number of instances is easy as we can use the `instance of (P31)` property to identify the instances:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 6.37 ms, sys: 17.7 ms, total: 24.1 ms\n", "Wall time: 3 s\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " count_instances\n", "0 58831" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "kgtk(\"\"\"\n", " query -i all\n", " --match '(instance)-[:P31]->(class)'\n", " --return 'count(distinct instance) as count_instances'\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Counting the number of properties used is also easy: you do a query over all statements in the KG, and count the occurrence of each property:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 6.17 ms, sys: 14.8 ms, total: 21 ms\n", "Wall time: 857 ms\n" ] }, { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " count_property\n", "0 3874" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "kgtk(\"\"\"\n", " query -i all\n", " --match '(instance)-[l {label: property}]->(class)'\n", " --return 'count(distinct property) as count_property'\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Counting the number of classes is more challenging because the notion of class in Wikidata is implicit. Here, we define **class** to be any item that is involved in a `subclass of (P279)` statement. Some classes don't have instances, so we cannot use `instance of (P31)` to count classes. The KGTK `p279star` graph is very handy for this task, and for any other task where you want to quickly traverse the `subclass of (P279)` hierarchy. KGTK defines the `subclass of (transitive) (P279star)` property to record all the superclasses of each class, including itself.\n", "\n", "You can count the number of classes by counting the number of distinct classes that appear as values of `P279star`:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 5.72 ms, sys: 15.5 ms, total: 21.2 ms\n", "Wall time: 1.28 s\n" ] }, { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " count_classes\n", "0 14598" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "kgtk(\"\"\"\n", " query -i p279star\n", " --match '(class)-[:P279star]->(super_class)'\n", " --return 'count(distinct super_class) as count_classes'\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Count the number of qualifiers. All the qualifiers are in a file, so we can count them by getting the number of lines in the file:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 322505\n" ] } ], "source": [ "%%bash\n", "zcat < $qualifiers | wc -l" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also count the number of qualifier edges using a query. It is instructive to do so, as this example shows how to access the qualifiers on an edge:\n", "- The first match clause has `[id]`, which binds the variable `id` to the identifier of the edge.\n", "- The second match clause uses `(id)` in the `node1` position, and puts the identifier of the qualifier edge in the `qualifier_id` variable.\n", "- The return clause returns the count of `qualifier_id`, which is the number of qualifier edges." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " count(DISTINCT graph_9_c2.\"id\")\n", "0 455226" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " query -i all\n", " --match '\n", " (n1)-[id]->(n2),\n", " (id)-[qualifier_id]->(qualifier_value)'\n", " --return 'count(distinct qualifier_id)'\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can enhance the query to show us the distribution of properties used as qualifiers by introducing a variable `qualifier_property` to capture the property:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 11.2 ms, sys: 19.1 ms, total: 30.3 ms\n", "Wall time: 6.62 s\n" ] }, { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " node1 label node2 node1;label\n", "0 P1545 count 134301 'series ordinal'@en\n", "1 P585 count 96781 'point in time'@en\n", "2 P580 count 33212 'start time'@en\n", "3 P459 count 32944 'determination method'@en\n", "4 P805 count 19969 'statement is subject of'@en\n", ".. ... ... ... ...\n", "719 P945 count 1 'allegiance'@en\n", "720 P952 count 1 'ISCO-88 occupation code'@en\n", "721 P97 count 1 'noble title'@en\n", "722 P974 count 1 'tributary'@en\n", "723 P991 count 1 'successful candidate'@en\n", "\n", "[724 rows x 4 columns]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "kgtk(\"\"\"\n", " query -i all\n", " --match '\n", " (n1)-[id]->(n2),\n", " (id)-[qualifier_id {label: qualifier_property}]->(qualifier_value)'\n", " --return 'qualifier_property as node1, \"count\" as label, count(distinct qualifier_id) as node2'\n", " --order-by 'cast(node2, int) desc'\n", " / add-labels\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Get instance counts for each class\n", "\n", "In this part you will do the simplest profiling query where you count the number of direct instances of each class.\n", "We can compute the instance counts by retrieving all statements that use `instance of (P31)` and counting the instances for each class.\n", "We order the result by the number of instances to see the classes that have the most instances.\n", "You can see that our tutorial KG contains a large number of people, and that there is a long tail of classes with very few instances; this is common in Wikidata, which defines over 1 million classes." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 36 ms, sys: 21.9 ms, total: 57.9 ms\n", "Wall time: 2.27 s\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " class count class;label\n", "0 Q5 13873 'human'@en\n", "1 Q15221623 3177 'bilateral relation'@en\n", "2 Q11424 2136 'film'@en\n", "3 Q4022 1550 'river'@en\n", "4 Q3918 815 'university'@en\n", "... ... ... ...\n", "5779 Q100509052 1 'metallurgical plant'@en\n", "5780 Q1004887 1 'thermal bath'@en\n", "5781 Q1002812 1 'metropolitan borough'@en\n", "5782 Q100212002 1 'subjective quality'@en\n", "5783 Q100039327 1 'autonomous constitutional agency'@en\n", "\n", "[5784 rows x 3 columns]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "kgtk(\"\"\"\n", " query -i all\n", " --match '(instance)-[:P31]->(class)'\n", " --return 'class as class, count(distinct instance) as count'\n", " --order-by 'cast(count, int) desc' \n", " / add-labels\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We want to add the profiling data back into the KG so that we can use it in queries and look at it in the browser.\n", "To do so, we create a KGTK graph by using `node1, label, node2` as column headers:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 6.15 ms, sys: 15.6 ms, total: 21.7 ms\n", "Wall time: 1.08 s\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " node1 label node2\n", "0 Q5 P31_count 13873\n", "1 Q15221623 P31_count 3177\n", "2 Q11424 P31_count 2136\n", "3 Q4022 P31_count 1550\n", "4 Q3918 P31_count 815\n", "5 Q4164871 P31_count 645\n", "6 Q1549591 P31_count 627\n", "7 Q3917681 P31_count 614\n", "8 Q19595382 P31_count 595\n", "9 Q11862829 P31_count 568" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "kgtk(\"\"\"\n", " query -i all\n", " --match '(instance)-[:P31]->(class)'\n", " --return 'class as node1, \"P31_count\" as label, count(distinct instance) as node2'\n", " --order-by 'cast(node2, int) desc'\n", " --limit 10 \n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is good practice to add identifiers to the edges so that we can add qualifiers later if we desire. To add the identifiers, we chain the query output to the `add-id` command:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 63 ms, sys: 29.1 ms, total: 92.1 ms\n", "Wall time: 1.98 s\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " node1 label node2 id\n", "0 Q5 P31count 13873 Q5-P31count-247e30\n", "1 Q15221623 P31count 3177 Q15221623-P31count-61d8c4\n", "2 Q11424 P31count 2136 Q11424-P31count-907bdc\n", "3 Q4022 P31count 1550 Q4022-P31count-c27484\n", "4 Q3918 P31count 815 Q3918-P31count-96da2f\n", "... ... ... ... ...\n", "5779 Q995347 P31count 1 Q995347-P31count-6b86b2\n", "5780 Q99566538 P31count 1 Q99566538-P31count-6b86b2\n", "5781 Q996839 P31count 1 Q996839-P31count-6b86b2\n", "5782 Q99960791 P31count 1 Q99960791-P31count-6b86b2\n", "5783 Q99972219 P31count 1 Q99972219-P31count-6b86b2\n", "\n", "[5784 rows x 4 columns]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "kgtk(\"\"\"\n", " query -i all\n", " --match '(instance)-[:P31]->(class)'\n", " --return 'class as node1, \"P31count\" as label, count(distinct instance) as node2'\n", " --order-by 'cast(node2, int) desc' \n", " / add-id --id-style wikidata\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we saw the steps to create the graph with the counts, we want to output the results to a file using the `-o` option:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 3.79 ms, sys: 13.9 ms, total: 17.7 ms\n", "Wall time: 1.94 s\n" ] } ], "source": [ "%%time\n", "kgtk(\"\"\"\n", " query -i all\n", " --match '(instance)-[:P31]->(class)'\n", " --return 'class as node1, \"P31count\" as label, count(distinct instance) as node2'\n", " --order-by 'cast(node2, int) desc'\n", " / add-id --id-style wikidata\n", " -o $OUT/metadata.p31.count.tsv\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Confirm that the output file went to the right place:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "total 5952\n", "-rw-r--r-- 1 
amandeep wheel 2171415 Mar 22 14:04 P39_P106.tsv\n", "-rw-r--r-- 1 amandeep wheel 601680 Mar 22 14:04 metadata.p31.count.transitive.tsv\n", "-rw-r--r-- 1 amandeep wheel 261137 Mar 22 14:12 metadata.p31.count.tsv\n", "drwxr-xr-x 5 amandeep wheel 160 Mar 22 14:12 \u001b[34mtemp.tutorial-profiling\u001b[m\u001b[m\n" ] } ], "source": [ "!ls -l $OUT" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load the `P31count` graph in the KGTK cache so that we can use it in queries later" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " node1 label node2 id\n", "0 Q5 P31count 13873 Q5-P31count-247e30\n", "1 Q15221623 P31count 3177 Q15221623-P31count-61d8c4" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " query -i $OUT/metadata.p31.count.tsv --as p31count --limit 2\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Summary of this section:\n", "- In this section we computed the count of instances for every class in our KG.\n", "- We illustrated the use of `instance of (P31)` to do queries.\n", "- We illustrated common conventions to add identifiers to edges and to save results to files." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Compute `P31count_transitive`, the count of instances of a class including the instances of all the subclasses\n", "\n", "Approach:\n", "- get the class of each instance\n", "- get all the superclasses of the class of each instance\n", "- for every superclass, count all the instances\n", "\n", "> This query will run at the scale of all Wikidata, which contains millions of classes\n", "\n", "We add the labels to see the results. Not surprisingly, `entity` has the most instances, and the top classes are those at the top of the Wikidata ontology:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 52.1 ms, sys: 29.7 ms, total: 81.8 ms\n", "Wall time: 14.3 s\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " class count class;label\n", "0 Q35120 58496 'entity'@en\n", "1 Q99527517 38373 'collection entity'@en\n", "2 Q28813620 35555 'set'@en\n", "3 Q16887380 35533 'group'@en\n", "4 Q58415929 30837 'spatio-temporal entity'@en\n", "... ... ... ...\n", "8926 Q100166391 1 'salt production facility'@en\n", "8927 Q1001059 1 'writ'@en\n", "8928 Q1000660 1 'algebra over a field'@en\n", "8929 Q100052008 1 'anthropomorphic Pantherinae'@en\n", "8930 Q100039327 1 'autonomous constitutional agency'@en\n", "\n", "[8931 rows x 3 columns]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "kgtk(\"\"\"\n", " query -i all\n", " --match '\n", " (instance)-[:P31]->(class),\n", " (class)-[:P279star]->(superclass)'\n", " --return 'superclass as class, count(distinct instance) as count'\n", " --order-by 'cast(count, int) desc'\n", " / add-labels\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Store the results in a file using a new property `P31count_transitive`" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 7.26 ms, sys: 15.8 ms, total: 23 ms\n", "Wall time: 12.6 s\n" ] } ], "source": [ "%%time\n", "kgtk(\"\"\"\n", " query -i all \n", " --match '\n", " (instance)-[:P31]->(class),\n", " (class)-[:P279star]->(superclass)'\n", " --return 'superclass as node1, \"P31count_transitive\" as label, count(distinct instance) as node2'\n", " --order-by 'cast(node2, int) desc'\n", " / add-id --id-style wikidata\n", " -o $OUT/metadata.p31.count.transitive.tsv\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Find the number of instances of `Q5: human`, `artist: Q483501` and `film director: Q2526255`. There are many instances of human, but only one of artist and zero of film director." 
] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " node1 label node2 id \\\n", "0 Q5 P31count_transitive 13944 Q5-P31count_transitive-c2d55f \n", "1 Q483501 P31count_transitive 1 Q483501-P31count_transitive-6b86b2 \n", "\n", " node1;label \n", "0 'human'@en \n", "1 'artist'@en " ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " filter -i $OUT/metadata.p31.count.transitive.tsv -p \"Q5, Q483501, Q2526255 ;;\" / add-labels\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The reason there is only one instance of `artist: Q483501` and none of `film director: Q2526255` is that Wikidata uses the property `occupation: P106` to relate people to their occupations, so the connection between a human and artist or film director is not `instance of: P31`. It would be nice if the browser page for `artist: Q483501` or `film director: Q2526255` showed the number of people with this occupation. DBpedia uses a different model where humans are instances of artist or film director.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Summary of this section\n", "In this section we:\n", "- Computed the count of instances of every class, including the instances of all its subclasses.\n", "- Introduced `P279star`, the precomputed transitive closure of the Wikidata `subclass of (P279)` property that allows you to conveniently do queries over all superclasses or subclasses of an entity." 
] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Define `P31x`, a generalization of `instance of: P31`\n", "\n", "In our KG we are going to define a new property called `instance of (generalized): P31x` that behaves like DBpedia, so that we can ask for instances of `artist: Q483501`.\n", "We do this by treating `occupation: P106` and `position held: P39` statements as if they were `P31` statements.\n", "\n", "Approach:\n", "- Combine `x P31 y`, `x P106 y` and `x P39 y` statements using a new `P31x` predicate" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use the `filter` command to take a peek at the data and see whether our plan makes sense." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "node1\tlabel\tnode2\tid\tnode2;wikidatatype\tnode1;label\tlabel;label\tnode2;label\n", "Q1000048\tP106\tQ1622272\tQ1000048-P106-Q1622272-3a1be6b5-0\twikibase-item\t'Franz Zimmermann'@en\t'occupation'@en\t'university teacher'@en\n", "Q1000048\tP106\tQ16267607\tQ1000048-P106-Q16267607-e13e45d1-0\twikibase-item\t'Franz Zimmermann'@en\t'occupation'@en\t'classical philologist'@en\n", "Q100063874\tP39\tQ1162163\tQ100063874-P39-Q1162163-ae076e77-0\twikibase-item\t'Catherine Musson'@en\t'position held'@en\t'director'@en\n", "Q100066085\tP39\tQ1162163\tQ100066085-P39-Q1162163-93ac33fd-0\twikibase-item\t'Anne-Laurence Mennessier'@en\t'position held'@en\t'director'@en\n", "Q1001\tP106\tQ11774202\tQ1001-P106-Q11774202-45d8eb34-0\twikibase-item\t'Mahatma Gandhi'@en\t'occupation'@en\t'essayist'@en\n", "Q1001\tP106\tQ17351648\tQ1001-P106-Q17351648-e64838e9-0\twikibase-item\t'Mahatma Gandhi'@en\t'occupation'@en\t'newspaper editor'@en\n", "Q1001\tP106\tQ1930187\tQ1001-P106-Q1930187-6cf568db-0\twikibase-item\t'Mahatma Gandhi'@en\t'occupation'@en\t'journalist'@en\n", "Q1001\tP106\tQ4964182\tQ1001-P106-Q4964182-a0867b04-0\twikibase-item\t'Mahatma 
Gandhi'@en\t'occupation'@en\t'philosopher'@en\n", "Q1001\tP106\tQ808967\tQ1001-P106-Q808967-57fe7a7e-0\twikibase-item\t'Mahatma Gandhi'@en\t'occupation'@en\t'barrister'@en\n", "Q100159381\tP106\tQ37226\tQ100159381-P106-Q37226-d95f0b81-0\twikibase-item\t'Victor Cherner'@en\t'occupation'@en\t'teacher'@en\n", "Exception ignored in: <_io.TextIOWrapper name='' mode='w' encoding='utf-8'>\n", "BrokenPipeError: [Errno 32] Broken pipe\n" ] } ], "source": [ "!kgtk --debug filter -i $item -p \"; P39, P106 ;\" / add-labels / head" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "kgtk(\"\"\"\n", " filter -i $item -p \"; P39, P106 ;\"\n", " / add-labels\n", " -o $OUT/P39_P106.tsv\n", "\"\"\")" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
node1labelnode2idnode2;wikidatatypenode1;labellabel;labelnode2;label
0Q1000048P106Q1622272Q1000048-P106-Q1622272-3a1be6b5-0wikibase-item'Franz Zimmermann'@en'occupation'@en'university teacher'@en
1Q1000048P106Q16267607Q1000048-P106-Q16267607-e13e45d1-0wikibase-item'Franz Zimmermann'@en'occupation'@en'classical philologist'@en
2Q100063874P39Q1162163Q100063874-P39-Q1162163-ae076e77-0wikibase-item'Catherine Musson'@en'position held'@en'director'@en
3Q100066085P39Q1162163Q100066085-P39-Q1162163-93ac33fd-0wikibase-item'Anne-Laurence Mennessier'@en'position held'@en'director'@en
4Q1001P106Q11774202Q1001-P106-Q11774202-45d8eb34-0wikibase-item'Mahatma Gandhi'@en'occupation'@en'essayist'@en
5Q1001P106Q17351648Q1001-P106-Q17351648-e64838e9-0wikibase-item'Mahatma Gandhi'@en'occupation'@en'newspaper editor'@en
6Q1001P106Q1930187Q1001-P106-Q1930187-6cf568db-0wikibase-item'Mahatma Gandhi'@en'occupation'@en'journalist'@en
7Q1001P106Q4964182Q1001-P106-Q4964182-a0867b04-0wikibase-item'Mahatma Gandhi'@en'occupation'@en'philosopher'@en
8Q1001P106Q808967Q1001-P106-Q808967-57fe7a7e-0wikibase-item'Mahatma Gandhi'@en'occupation'@en'barrister'@en
9Q100159381P106Q37226Q100159381-P106-Q37226-d95f0b81-0wikibase-item'Victor Cherner'@en'occupation'@en'teacher'@en
\n", "
" ], "text/plain": [ " node1 label node2 id \\\n", "0 Q1000048 P106 Q1622272 Q1000048-P106-Q1622272-3a1be6b5-0 \n", "1 Q1000048 P106 Q16267607 Q1000048-P106-Q16267607-e13e45d1-0 \n", "2 Q100063874 P39 Q1162163 Q100063874-P39-Q1162163-ae076e77-0 \n", "3 Q100066085 P39 Q1162163 Q100066085-P39-Q1162163-93ac33fd-0 \n", "4 Q1001 P106 Q11774202 Q1001-P106-Q11774202-45d8eb34-0 \n", "5 Q1001 P106 Q17351648 Q1001-P106-Q17351648-e64838e9-0 \n", "6 Q1001 P106 Q1930187 Q1001-P106-Q1930187-6cf568db-0 \n", "7 Q1001 P106 Q4964182 Q1001-P106-Q4964182-a0867b04-0 \n", "8 Q1001 P106 Q808967 Q1001-P106-Q808967-57fe7a7e-0 \n", "9 Q100159381 P106 Q37226 Q100159381-P106-Q37226-d95f0b81-0 \n", "\n", " node2;wikidatatype node1;label label;label \\\n", "0 wikibase-item 'Franz Zimmermann'@en 'occupation'@en \n", "1 wikibase-item 'Franz Zimmermann'@en 'occupation'@en \n", "2 wikibase-item 'Catherine Musson'@en 'position held'@en \n", "3 wikibase-item 'Anne-Laurence Mennessier'@en 'position held'@en \n", "4 wikibase-item 'Mahatma Gandhi'@en 'occupation'@en \n", "5 wikibase-item 'Mahatma Gandhi'@en 'occupation'@en \n", "6 wikibase-item 'Mahatma Gandhi'@en 'occupation'@en \n", "7 wikibase-item 'Mahatma Gandhi'@en 'occupation'@en \n", "8 wikibase-item 'Mahatma Gandhi'@en 'occupation'@en \n", "9 wikibase-item 'Victor Cherner'@en 'occupation'@en \n", "\n", " node2;label \n", "0 'university teacher'@en \n", "1 'classical philologist'@en \n", "2 'director'@en \n", "3 'director'@en \n", "4 'essayist'@en \n", "5 'newspaper editor'@en \n", "6 'journalist'@en \n", "7 'philosopher'@en \n", "8 'barrister'@en \n", "9 'teacher'@en " ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " head -i $OUT/P39_P106.tsv\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Select all the `P31`, `P39` and `P106` statements and rewrite them as `P31x` statements, and also make sure that we do this only for humans:" ] }, { "cell_type": "code", 
"execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
node1labelnode2node1;labelnode2;label
0Q1000048P31xQ1622272'Franz Zimmermann'@en'university teacher'@en
1Q1000048P31xQ16267607'Franz Zimmermann'@en'classical philologist'@en
2Q1000048P31xQ5'Franz Zimmermann'@en'human'@en
3Q1000061P31xQ5'Valentyn Symonenko'@en'human'@en
4Q100063874P31xQ5'Catherine Musson'@en'human'@en
5Q100063874P31xQ1162163'Catherine Musson'@en'director'@en
6Q100066085P31xQ5'Anne-Laurence Mennessier'@en'human'@en
7Q100066085P31xQ1162163'Anne-Laurence Mennessier'@en'director'@en
8Q1001P31xQ11774202'Mahatma Gandhi'@en'essayist'@en
9Q1001P31xQ17351648'Mahatma Gandhi'@en'newspaper editor'@en
\n", "
" ], "text/plain": [ " node1 label node2 node1;label \\\n", "0 Q1000048 P31x Q1622272 'Franz Zimmermann'@en \n", "1 Q1000048 P31x Q16267607 'Franz Zimmermann'@en \n", "2 Q1000048 P31x Q5 'Franz Zimmermann'@en \n", "3 Q1000061 P31x Q5 'Valentyn Symonenko'@en \n", "4 Q100063874 P31x Q5 'Catherine Musson'@en \n", "5 Q100063874 P31x Q1162163 'Catherine Musson'@en \n", "6 Q100066085 P31x Q5 'Anne-Laurence Mennessier'@en \n", "7 Q100066085 P31x Q1162163 'Anne-Laurence Mennessier'@en \n", "8 Q1001 P31x Q11774202 'Mahatma Gandhi'@en \n", "9 Q1001 P31x Q17351648 'Mahatma Gandhi'@en \n", "\n", " node2;label \n", "0 'university teacher'@en \n", "1 'classical philologist'@en \n", "2 'human'@en \n", "3 'human'@en \n", "4 'human'@en \n", "5 'director'@en \n", "6 'human'@en \n", "7 'director'@en \n", "8 'essayist'@en \n", "9 'newspaper editor'@en " ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " query -i all\n", " --match '\n", " (n1)-[:P31]->(:Q5),\n", " (n1)-[r {label: property}]->(n2)'\n", " --where 'property in [\"P106\", \"P39\", \"P31\"]'\n", " --return 'distinct n1 as node1, \"P31x\" as label, n2 as node2'\n", " --limit 10\n", " / add-labels\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The query needs to be more sophisticated, because the previous query produces `P31x` statements only for humans. We keep that restriction because, without it, fictional characters that have occupations would end up below `human (Q5)` due to the way the Wikidata ontology is structured. 
The fix is to concatenate (`cat`) the results of the previous query with the original `instance of (P31)` graph and to deduplicate (`compact`).\n", "The resulting graph goes in file `derived.P31x.tsv`:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 4.8 ms, sys: 14.7 ms, total: 19.5 ms\n", "Wall time: 4.15 s\n" ] } ], "source": [ "%%time\n", "kgtk(\"\"\"\n", " query -i item\n", " --match '\n", " (n1)-[:P31]->(:Q5),\n", " (n1)-[r {label: property}]->(n2)'\n", " --where 'property in [\"P106\", \"P39\", \"P31\"]'\n", " --return 'distinct n1 as node1, \"P31x\" as label, n2 as node2'\n", " / add-id --id-style wikidata\n", " / cat -i - -i $p31\n", " / compact\n", " -o $OUT/derived.P31x.tsv\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load the `p31x` graph defining our generalized `instance of` property:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
node1labelnode2id
0P10P31Q18610173P10-P31-Q18610173-85ef4d24-0
1P1000P31Q18608871P1000-P31-Q18608871-093affb5-0
\n", "
" ], "text/plain": [ " node1 label node2 id\n", "0 P10 P31 Q18610173 P10-P31-Q18610173-85ef4d24-0\n", "1 P1000 P31 Q18608871 P1000-P31-Q18608871-093affb5-0" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " query -i $OUT/derived.P31x.tsv --as p31x --limit 2\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can fix our `P31count_transitive` property to also include classes such as `film director (Q2526255)`. Use the new `P31x` graph to substitute `P31x` for `P31` in our query that computes the class counts:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 4.32 ms, sys: 13.1 ms, total: 17.4 ms\n", "Wall time: 3.71 s\n" ] } ], "source": [ "%%time\n", "kgtk(\"\"\"\n", " query -i all -i p31x\n", " --match '\n", " p31x: (instance)-[:P31x]->(class),\n", " all: (class)-[:P279star]->(superclass)'\n", " --return 'superclass as node1, \"P31xcount_transitive\" as label, count(distinct instance) as node2'\n", " --order-by 'cast(node2, int) desc'\n", " / add-id --id-style wikidata\n", " -o $OUT/metadata.p31x.count.transitive.tsv\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Redo our query to get the number of instances of `Q5: human`, `artist: Q483501` and `film director: Q2526255`.\n", "Now we get more reasonable counts for artist and film directors:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
node1labelnode2idnode1;label
0Q5P31xcount_transitive13873Q5-P31xcount_transitive-247e30'human'@en
1Q483501P31xcount_transitive2575Q483501-P31xcount_transitive-e7303a'artist'@en
2Q2526255P31xcount_transitive674Q2526255-P31xcount_transitive-8ef532'film director'@en
\n", "
" ], "text/plain": [ " node1 label node2 \\\n", "0 Q5 P31xcount_transitive 13873 \n", "1 Q483501 P31xcount_transitive 2575 \n", "2 Q2526255 P31xcount_transitive 674 \n", "\n", " id node1;label \n", "0 Q5-P31xcount_transitive-247e30 'human'@en \n", "1 Q483501-P31xcount_transitive-e7303a 'artist'@en \n", "2 Q2526255-P31xcount_transitive-8ef532 'film director'@en " ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " filter -i $OUT/metadata.p31x.count.transitive.tsv -p \"Q5, Q483501, Q2526255 ;;\" / add-labels\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Find the classes that appear in the new file but did not appear in the old file. To do this we use the `ifnotexists` command, which subtracts the statements of one graph from the statements of another graph.\n", "> Some classes may appear in both graphs and have their counts updated (e.g., artist appeared with a count of 1 before):" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
node1labelnode2idnode1;label
0Q713200P31xcount_transitive1912Q713200-P31xcount_transitive-a991b8'performing artist'@en
1Q33999P31xcount_transitive1911Q33999-P31xcount_transitive-dc4bc8'actor'@en
2Q15980804P31xcount_transitive1400Q15980804-P31xcount_transitive-55fdec'media professional'@en
3Q2285706P31xcount_transitive1222Q2285706-P31xcount_transitive-16a3e9'head of government'@en
4Q3282637P31xcount_transitive881Q3282637-P31xcount_transitive-28096b'film producer'@en
..................
902Q957729P31xcount_transitive1Q957729-P31xcount_transitive-6b86b2'photojournalist'@en
903Q96172702P31xcount_transitive1Q96172702-P31xcount_transitive-6b86b2'Minister General of the Order of Franciscans'@en
904Q978708P31xcount_transitive1Q978708-P31xcount_transitive-6b86b2'Prime Minister of East Timor'@en
905Q98084799P31xcount_transitive1Q98084799-P31xcount_transitive-6b86b2'professional photographer'@en
906Q994779P31xcount_transitive1Q994779-P31xcount_transitive-6b86b2'delegate'@en
\n", "

907 rows × 5 columns

\n", "
" ], "text/plain": [ " node1 label node2 \\\n", "0 Q713200 P31xcount_transitive 1912 \n", "1 Q33999 P31xcount_transitive 1911 \n", "2 Q15980804 P31xcount_transitive 1400 \n", "3 Q2285706 P31xcount_transitive 1222 \n", "4 Q3282637 P31xcount_transitive 881 \n", ".. ... ... ... \n", "902 Q957729 P31xcount_transitive 1 \n", "903 Q96172702 P31xcount_transitive 1 \n", "904 Q978708 P31xcount_transitive 1 \n", "905 Q98084799 P31xcount_transitive 1 \n", "906 Q994779 P31xcount_transitive 1 \n", "\n", " id \\\n", "0 Q713200-P31xcount_transitive-a991b8 \n", "1 Q33999-P31xcount_transitive-dc4bc8 \n", "2 Q15980804-P31xcount_transitive-55fdec \n", "3 Q2285706-P31xcount_transitive-16a3e9 \n", "4 Q3282637-P31xcount_transitive-28096b \n", ".. ... \n", "902 Q957729-P31xcount_transitive-6b86b2 \n", "903 Q96172702-P31xcount_transitive-6b86b2 \n", "904 Q978708-P31xcount_transitive-6b86b2 \n", "905 Q98084799-P31xcount_transitive-6b86b2 \n", "906 Q994779-P31xcount_transitive-6b86b2 \n", "\n", " node1;label \n", "0 'performing artist'@en \n", "1 'actor'@en \n", "2 'media professional'@en \n", "3 'head of government'@en \n", "4 'film producer'@en \n", ".. ... \n", "902 'photojournalist'@en \n", "903 'Minister General of the Order of Franciscans'@en \n", "904 'Prime Minister of East Timor'@en \n", "905 'professional photographer'@en \n", "906 'delegate'@en \n", "\n", "[907 rows x 5 columns]" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " ifnotexists -i $OUT/metadata.p31x.count.transitive.tsv\n", " --filter-on $OUT/metadata.p31.count.transitive.tsv\n", " --input-keys node1\n", " --filter-keys node1\n", " / add-labels\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Summary of this section\n", "In this section we:\n", "- Computed `P31x` representing our generalized instance of property. 
Results in `derived.P31x.tsv`.\n", "- Computed `P31xcount_transitive` as a revision of `P31count_transitive` to also include counts via occupation and position held links. Results in `metadata.p31x.count.transitive.tsv`.\n", "- Illustrated how to work with precomputed transitive closures (`P279star`), which enable KGTK to efficiently execute queries that otherwise would be very expensive" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Compute the number of times each property appears in a class\n", "\n", "In this section we will compute the distribution of the use of properties in every class in the KG. \n", "We want to know the count of the different properties used in all instances of a class.\n", "For example, if we look at `film (Q11424)` we want to see what properties are used to describe films, including all subclasses of film.\n", "\n", "Computing this distribution is challenging because, as the query below shows, there are many classes in our KG:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
count of classes
07483
\n", "
" ], "text/plain": [ " count of classes\n", "0 7483" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " query -i all --match '(entity)-[:P279]->(class)' --return 'count(distinct class) as `count of classes`'\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Approach: we divide the task into two steps:\n", "- For every entity, compute the set of properties used to describe it, and store this information in `item_properties.tsv`\n", "- For every class, collect all the instances below it, and count the number of times each property appears in `item_properties.tsv`\n", "\n", "The query for the first step is below. \n", "The first clause of the `--match` pattern gets the properties used in every instance of the KG.\n", "The second clause gets the data type of each property, and the `--where` clause excludes properties with external identifiers: there are many of them, and for the tutorial we want the query to run faster." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 4.79 s, sys: 870 ms, total: 5.66 s\n", "Wall time: 15.5 s\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
node1labelnode2node1;labelnode2;label
0P8874Phas_propertyP1001'Hong Kong film rating'@en'applies to jurisdiction'@en
1Q1001543Phas_propertyP1001'Embassy of Finland, Budapest'@en'applies to jurisdiction'@en
2Q100325415Phas_propertyP1001'Embassy of Belarus, Budapest'@en'applies to jurisdiction'@en
3Q1005422Phas_propertyP1001'Federal Office of Bundeswehr Equipment, Infor...'applies to jurisdiction'@en
4Q1006360Phas_propertyP1001'Bundesminister'@en'applies to jurisdiction'@en
..................
837038Q7020999Phas_propertyP991'2017 French presidential election'@en'successful candidate'@en
837039Q72251Phas_propertyP991'1876 United States presidential election'@en'successful candidate'@en
837040Q72472Phas_propertyP991'1892 United States presidential election'@en'successful candidate'@en
837041Q72835Phas_propertyP991'1908 United States presidential election'@en'successful candidate'@en
837042P991-P1855-Q327959-2d857cd2-0Phas_propertyP991NaN'successful candidate'@en
\n", "

837043 rows × 5 columns

\n", "
" ], "text/plain": [ " node1 label node2 \\\n", "0 P8874 Phas_property P1001 \n", "1 Q1001543 Phas_property P1001 \n", "2 Q100325415 Phas_property P1001 \n", "3 Q1005422 Phas_property P1001 \n", "4 Q1006360 Phas_property P1001 \n", "... ... ... ... \n", "837038 Q7020999 Phas_property P991 \n", "837039 Q72251 Phas_property P991 \n", "837040 Q72472 Phas_property P991 \n", "837041 Q72835 Phas_property P991 \n", "837042 P991-P1855-Q327959-2d857cd2-0 Phas_property P991 \n", "\n", " node1;label \\\n", "0 'Hong Kong film rating'@en \n", "1 'Embassy of Finland, Budapest'@en \n", "2 'Embassy of Belarus, Budapest'@en \n", "3 'Federal Office of Bundeswehr Equipment, Infor... \n", "4 'Bundesminister'@en \n", "... ... \n", "837038 '2017 French presidential election'@en \n", "837039 '1876 United States presidential election'@en \n", "837040 '1892 United States presidential election'@en \n", "837041 '1908 United States presidential election'@en \n", "837042 NaN \n", "\n", " node2;label \n", "0 'applies to jurisdiction'@en \n", "1 'applies to jurisdiction'@en \n", "2 'applies to jurisdiction'@en \n", "3 'applies to jurisdiction'@en \n", "4 'applies to jurisdiction'@en \n", "... ... \n", "837038 'successful candidate'@en \n", "837039 'successful candidate'@en \n", "837040 'successful candidate'@en \n", "837041 'successful candidate'@en \n", "837042 'successful candidate'@en \n", "\n", "[837043 rows x 5 columns]" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "kgtk(\"\"\"\n", " query -i all\n", " --match '\n", " (entity)-[l {label: property}]->(),\n", " (property)-[:datatype]->(datatype)'\n", " --where 'datatype != \"external-id\"' \n", " --return 'distinct entity as node1, \"Phas_property\" as label, property as node2'\n", " / add-labels\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The results look good, so we add the identifiers to the edges and store the results in `item_properties.tsv`." 
] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 5.33 ms, sys: 20.3 ms, total: 25.7 ms\n", "Wall time: 7.53 s\n" ] } ], "source": [ "%%time\n", "kgtk(\"\"\"\n", " query -i all\n", " --match '\n", " (property)-[:datatype]->(datatype), \n", " (entity)-[l {label: property}]->()'\n", " --where 'datatype != \"external-id\"' \n", " --return 'distinct entity as node1, \"Phas_property\" as label, property as node2'\n", " / add-id --id-style wikidata\n", " -o $TEMP/item_properties.tsv\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the second step, we use `P279star` to get all the superclasses of the classes of each entity, and then look up the entity in the `item_properties` graph to find the properties it uses.\n", "We invent a new property called `P1963computed` to store the counts. Wikidata has a property `properties for this type (P1963)` where editors can manually specify the properties that should be used to describe the instances of a class. We are computing the properties bottom up from the data, so we call the property `P1963computed`.\n", "\n", "In the return clause, we list `superclass` and `property` ahead of the `count` expression to tell KGTK that we want to aggregate by superclass and property. We reuse the Wikidata property `quantity (P1114)` to record the counts:\n", "\n", "> This query is very expensive to run on the full Wikidata as it touches every entity in Wikidata, but it will complete after many hours." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 58.4 ms, sys: 54.5 ms, total: 113 ms\n", "Wall time: 2min 10s\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
node1labelnode2P1114node1;labelnode2;label
0Q35120P1963computedP3157874'entity'@en'instance of'@en
1Q99527517P1963computedP3137827'collection entity'@en'instance of'@en
2Q28813620P1963computedP3135036'set'@en'instance of'@en
3Q16887380P1963computedP3135014'group'@en'instance of'@en
4Q58415929P1963computedP3132620'spatio-temporal entity'@en'instance of'@en
.....................
95Q24229398P1963computedP1710839'agent'@en'country'@en
96Q43229P1963computedP1710815'organization'@en'country'@en
97Q58416391P1963computedP85610631'spatial entity'@en'official website'@en
98Q27096213P1963computedP85610630'geographic entity'@en'official website'@en
99Q17334923P1963computedP85610601'location'@en'official website'@en
\n", "

100 rows × 6 columns

\n", "
" ], "text/plain": [ " node1 label node2 P1114 node1;label \\\n", "0 Q35120 P1963computed P31 57874 'entity'@en \n", "1 Q99527517 P1963computed P31 37827 'collection entity'@en \n", "2 Q28813620 P1963computed P31 35036 'set'@en \n", "3 Q16887380 P1963computed P31 35014 'group'@en \n", "4 Q58415929 P1963computed P31 32620 'spatio-temporal entity'@en \n", ".. ... ... ... ... ... \n", "95 Q24229398 P1963computed P17 10839 'agent'@en \n", "96 Q43229 P1963computed P17 10815 'organization'@en \n", "97 Q58416391 P1963computed P856 10631 'spatial entity'@en \n", "98 Q27096213 P1963computed P856 10630 'geographic entity'@en \n", "99 Q17334923 P1963computed P856 10601 'location'@en \n", "\n", " node2;label \n", "0 'instance of'@en \n", "1 'instance of'@en \n", "2 'instance of'@en \n", "3 'instance of'@en \n", "4 'instance of'@en \n", ".. ... \n", "95 'country'@en \n", "96 'country'@en \n", "97 'official website'@en \n", "98 'official website'@en \n", "99 'official website'@en \n", "\n", "[100 rows x 6 columns]" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "kgtk(\"\"\"\n", " query -i all -i p31x -i $TEMP/item_properties.tsv\n", " --match ' \n", " p31x: (entity)-[]->(class), \n", " all: (class)-[:P279star]->(superclass),\n", " item_properties: (entity)-[l]->(property)'\n", " --return 'distinct superclass as node1, \"P1963computed\" as label, property as node2, count(distinct l) as P1114' \\\n", " --order-by 'cast(P1114, int) desc'\n", " --limit 100\n", " / add-labels\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The results look good, so we store them in `derived.P1963computed.tsv`" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 45.6 ms, sys: 45.2 ms, total: 90.8 ms\n", "Wall time: 1min 47s\n" ] } ], "source": [ "%%time\n", "kgtk(\"\"\"\n", " query -i all -i p31x -i 
$TEMP/item_properties.tsv\n", " --match ' \n", " p31x: (entity)-[]->(class), \n", " all: (class)-[:P279star]->(superclass),\n", " item_properties: (entity)-[l]->(property)'\n", " --return 'distinct superclass as node1, \"P1963computed\" as label, property as node2, count(distinct l) as P1114' \n", " / add-id --id-style wikidata\n", " / normalize --add-id True\n", " -o $OUT/derived.P1963computed.tsv\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Add the new graph to the database and define the alias `p1963computed` for it." ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
node1labelnode2id
0Q100039327P1963computedP159Q100039327-P1963computed-P159
1Q100039327-P1963computed-P159P11141Q100039327-P1963computed-P159-P1114-1-0000
2Q100039327P1963computedP17Q100039327-P1963computed-P17
3Q100039327-P1963computed-P17P11141Q100039327-P1963computed-P17-P1114-1-0000
4Q100039327P1963computedP1813Q100039327-P1963computed-P1813
5Q100039327-P1963computed-P1813P11141Q100039327-P1963computed-P1813-P1114-1-0000
6Q100039327P1963computedP31Q100039327-P1963computed-P31
7Q100039327-P1963computed-P31P11141Q100039327-P1963computed-P31-P1114-1-0000
8Q100039327P1963computedP373Q100039327-P1963computed-P373
9Q100039327-P1963computed-P373P11141Q100039327-P1963computed-P373-P1114-1-0000
\n", "
" ], "text/plain": [ " node1 label node2 \\\n", "0 Q100039327 P1963computed P159 \n", "1 Q100039327-P1963computed-P159 P1114 1 \n", "2 Q100039327 P1963computed P17 \n", "3 Q100039327-P1963computed-P17 P1114 1 \n", "4 Q100039327 P1963computed P1813 \n", "5 Q100039327-P1963computed-P1813 P1114 1 \n", "6 Q100039327 P1963computed P31 \n", "7 Q100039327-P1963computed-P31 P1114 1 \n", "8 Q100039327 P1963computed P373 \n", "9 Q100039327-P1963computed-P373 P1114 1 \n", "\n", " id \n", "0 Q100039327-P1963computed-P159 \n", "1 Q100039327-P1963computed-P159-P1114-1-0000 \n", "2 Q100039327-P1963computed-P17 \n", "3 Q100039327-P1963computed-P17-P1114-1-0000 \n", "4 Q100039327-P1963computed-P1813 \n", "5 Q100039327-P1963computed-P1813-P1114-1-0000 \n", "6 Q100039327-P1963computed-P31 \n", "7 Q100039327-P1963computed-P31-P1114-1-0000 \n", "8 Q100039327-P1963computed-P373 \n", "9 Q100039327-P1963computed-P373-P1114-1-0000 " ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " query -i $OUT/derived.P1963computed.tsv --as p1963computed --limit 10\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see the distribution of properties for `film (Q11424)`:\n", "> You can also try it for `film director (Q2526255)` or for `entity (Q35120)`, which gives you the distribution of all properties in the KG:" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 10.9 ms, sys: 25.5 ms, total: 36.4 ms\n", "Wall time: 4.79 s\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " class property count class;label \\\n", "0 Q11424 P31 2447 'film'@en \n", "1 Q11424 P577 1402 'film'@en \n", "2 Q11424 P495 1398 'film'@en \n", "3 Q11424 P1476 1381 'film'@en \n", "4 Q11424 P364 1368 'film'@en \n", ".. ... ... ... ... \n", "94 Q11424 P6251 1 'film'@en \n", "95 Q11424 P641 1 'film'@en \n", "96 Q11424 P767 1 'film'@en \n", "97 Q11424 P8411 1 'film'@en \n", "98 Q11424 P941 1 'film'@en \n", "\n", " property;label \n", "0 'instance of'@en \n", "1 'publication date'@en \n", "2 'country of origin'@en \n", "3 'title'@en \n", "4 'original language of film or TV show'@en \n", ".. ... \n", "94 'catchphrase'@en \n", "95 'sport'@en \n", "96 'contributor to the creative work or subject'@en \n", "97 'set in environment'@en \n", "98 'inspired by'@en \n", "\n", "[99 rows x 5 columns]" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "kgtk(\"\"\"\n", " query -i p1963computed\n", " --match '\n", " (class:Q11424)-[l:P1963computed]->(property),\n", " (l)-[:P1114]->(quantity)'\n", " --return 'distinct class as class, property as property, quantity as count'\n", " --order-by 'cast(count, int) desc'\n", " / add-labels\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Store the resulting graph in `derived.Pproperty_domain.tsv` and define the alias `property_domain` for it in the database:" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 14.6 ms, sys: 40.4 ms, total: 55 ms\n", "Wall time: 15.2 s\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " node1 label node2 \\\n", "0 P1001 Pproperty_domain Q35120 \n", "1 P1001-Pproperty_domain-Q35120 P1114 2317 \n", "2 P1001 Pproperty_domain Q99527517 \n", "3 P1001-Pproperty_domain-Q99527517 P1114 1940 \n", "4 P1001 Pproperty_domain Q16889133 \n", "5 P1001-Pproperty_domain-Q16889133 P1114 1866 \n", "6 P1001 Pproperty_domain Q16686448 \n", "7 P1001-Pproperty_domain-Q16686448 P1114 1667 \n", "8 P1001 Pproperty_domain Q16887380 \n", "9 P1001-Pproperty_domain-Q16887380 P1114 1410 \n", "\n", " id \n", "0 P1001-Pproperty_domain-Q35120 \n", "1 P1001-Pproperty_domain-Q35120-P1114-2317-0000 \n", "2 P1001-Pproperty_domain-Q99527517 \n", "3 P1001-Pproperty_domain-Q99527517-P1114-1940-0000 \n", "4 P1001-Pproperty_domain-Q16889133 \n", "5 P1001-Pproperty_domain-Q16889133-P1114-1866-0000 \n", "6 P1001-Pproperty_domain-Q16686448 \n", "7 P1001-Pproperty_domain-Q16686448-P1114-1667-0000 \n", "8 P1001-Pproperty_domain-Q16887380 \n", "9 P1001-Pproperty_domain-Q16887380-P1114-1410-0000 " ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "kgtk(\"\"\"\n", " query -i p1963computed\n", " --match '\n", " (class)-[l:P1963computed]->(property),\n", " (l)-[:P1114]->(quantity)'\n", " --return 'distinct property as node1, \"Pproperty_domain\" as label, class as node2, quantity as P1114'\n", " --order-by 'property, cast(P1114, int) desc'\n", " / add-id --id-style wikidata\n", " / normalize --add-id True\n", " -o $OUT/derived.Pproperty_domain.tsv\n", "\"\"\")\n", "\n", "kgtk(\"query -i $OUT/derived.Pproperty_domain.tsv --as property_domain --limit 10\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see the distribution of classes for `cast member(P161)`. We restrict the results to be subclasses of `visual artwork (Q4502142)` because otherwise the results contain too many of the abstract classes. 
We see that property `cast member(P161)` is defined for film and subclasses of film:" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " node1 label node2 P1114 node1;label \\\n", "0 P161 Pproperty_domain Q11424 1133 'cast member'@en \n", "1 P161 Pproperty_domain Q4502142 1133 'cast member'@en \n", "2 P161 Pproperty_domain Q24869 38 'cast member'@en \n", "3 P161 Pproperty_domain Q229390 36 'cast member'@en \n", "4 P161 Pproperty_domain Q506240 17 'cast member'@en \n", "5 P161 Pproperty_domain Q61283808 8 'cast member'@en \n", "6 P161 Pproperty_domain Q202866 5 'cast member'@en \n", "7 P161 Pproperty_domain Q25110269 5 'cast member'@en \n", "8 P161 Pproperty_domain Q517386 5 'cast member'@en \n", "9 P161 Pproperty_domain Q24862 4 'cast member'@en \n", "\n", " node2;label \n", "0 'film'@en \n", "1 'visual artwork'@en \n", "2 'feature film'@en \n", "3 '3D film'@en \n", "4 'television film'@en \n", "5 'Star Trek film'@en \n", "6 'animated film'@en \n", "7 'live-action animated film'@en \n", "8 'live action'@en \n", "9 'short film'@en " ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " query -i property_domain -i all\n", " --match '\n", " all: (class)-[:P279star]->(:Q4502142), \n", " property_domain: (property:P161)-[l:Pproperty_domain]->(class),\n", " property_domain: (l)-[:P1114]->(quantity)'\n", " --return 'distinct property as node1, \"Pproperty_domain\" as label, class as node2, quantity as P1114'\n", " --order-by 'property, cast(P1114, int) desc'\n", " --limit 10\n", " / add-labels\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": { "jp-MarkdownHeadingCollapsed": true, "tags": [] }, "source": [ "### Summary of this section\n", "In this section we:\n", "- Computed `P1963computed` to record how frequently each property is used in every class.\n", "- Used `P1963computed` to see the distribution of properties for a few classes.\n", "- Illustrated the ability to break down very expensive queries into simpler steps.\n", "- Illustrated a KGTK feature that allows you to use the results of one query as a new 
graph (`$TEMP/item_properties.tsv`) that can be integrated into other queries." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Compute the distribution of units for quantity properties\n", "This part of the tutorial illustrates how to work with KGTK structured literals:\n", "- quantities: composed of a numeric value followed by the identifier of a unit; quantities can also define tolerances\n", "- dates and times: composed of an ISO-formatted date, followed by a numeric precision indicator, and sometimes by a calendar\n", "- monolingual strings: composed of a Unicode string followed by a language tag\n", "\n", "Additional documentation on the KGTK file format is at https://kgtk.readthedocs.io/en/latest/specification/\n", "and documentation for the functions that operate on structured literals within queries is at https://kgtk.readthedocs.io/en/latest/transform/query/\n", "\n", "Below is a specific example of how to query the units in structured literals. The objective is to compute the distribution of the units used in all properties that store quantities.\n", "The query uses the `quantity` graph, which contains all properties whose values are quantities. \n", "\n", "The results of the query are interesting because they reveal some inconsistencies in the data present in our small subset of Wikidata. \n", "For example, most values of `population (P1082)` have no unit, but two have unit `point in time (Q186408)` and one has unit `Habitants (Q15621516)`, neither of which is an instance of `unit of measurement (Q47574)`." ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " node1 label node2 P1114 \\\n", "0 P1081 Pproperty_units_used NaN 6810 \n", "1 P1082 Pproperty_units_used NaN 46643 \n", "2 P1082 Pproperty_units_used Q186408 2 \n", "3 P1082 Pproperty_units_used Q15621516 1 \n", "4 P1082 Pproperty_units_used Q5727902 1 \n", ".. ... ... ... ... \n", "418 P8476 Pproperty_units_used NaN 992 \n", "419 P8477 Pproperty_units_used NaN 970 \n", "420 P8687 Pproperty_units_used NaN 6469 \n", "421 P8843 Pproperty_units_used NaN 201 \n", "422 P8887 Pproperty_units_used Q712226 1 \n", "\n", " node1;label node2;label \n", "0 'Human Development Index'@en NaN \n", "1 'population'@en NaN \n", "2 'population'@en 'point in time'@en \n", "3 'population'@en 'Habitants'@en \n", "4 'population'@en 'circa'@en \n", ".. ... ... \n", "418 'BTI Governance Index'@en NaN \n", "419 'BTI Status Index'@en NaN \n", "420 'social media followers'@en NaN \n", "421 'poverty incidence'@en NaN \n", "422 'water area'@en 'square kilometre'@en \n", "\n", "[423 rows x 6 columns]" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " query -i quantity\n", " --match '(n1)-[l {label: property}]->(quantity)'\n", " --return 'distinct property as node1, \"Pproperty_units_used\" as label, kgtk_quantity_wd_units(quantity) as node2, count(distinct l) as P1114'\n", " --order-by 'property, cast(P1114, int) desc'\n", " / add-labels\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will store the units graph in `derived.Pproperty_units_used.tsv`. The final query includes a `where` clause to filter out the NULL values." ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " node1 label node2 \\\n", "0 P1082 Pproperty_units_used Q186408 \n", "1 P1082-Pproperty_units_used-Q186408 P1114 2 \n", "2 P1082 Pproperty_units_used Q15621516 \n", "3 P1082-Pproperty_units_used-Q15621516 P1114 1 \n", "4 P1082 Pproperty_units_used Q5727902 \n", "5 P1082-Pproperty_units_used-Q5727902 P1114 1 \n", "6 P1083 Pproperty_units_used Q44666669 \n", "7 P1083-Pproperty_units_used-Q44666669 P1114 2 \n", "8 P1083 Pproperty_units_used Q42177 \n", "9 P1083-Pproperty_units_used-Q42177 P1114 1 \n", "\n", " id \n", "0 P1082-Pproperty_units_used-Q186408 \n", "1 P1082-Pproperty_units_used-Q186408-P1114-2-0000 \n", "2 P1082-Pproperty_units_used-Q15621516 \n", "3 P1082-Pproperty_units_used-Q15621516-P1114-1-0000 \n", "4 P1082-Pproperty_units_used-Q5727902 \n", "5 P1082-Pproperty_units_used-Q5727902-P1114-1-0000 \n", "6 P1083-Pproperty_units_used-Q44666669 \n", "7 P1083-Pproperty_units_used-Q44666669-P1114-2-0000 \n", "8 P1083-Pproperty_units_used-Q42177 \n", "9 P1083-Pproperty_units_used-Q42177-P1114-1-0000 " ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " query -i quantity\n", " --match '(n1)-[l {label: property}]->(quantity)'\n", " --where 'kgtk_quantity_wd_units(quantity) IS NOT NULL'\n", " --return 'distinct property as node1, \"Pproperty_units_used\" as label, kgtk_quantity_wd_units(quantity) as node2, count(distinct l) as P1114'\n", " --order-by 'property, cast(P1114, int) desc'\n", " / add-id --id-style wikidata\n", " / normalize --add-id True\n", " -o $OUT/derived.Pproperty_units_used.tsv\n", "\"\"\")\n", "\n", "kgtk(\"query -i $OUT/derived.Pproperty_units_used.tsv --as property_units_used --limit 10\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Summary of this section\n", "In this section we:\n", "- Computed the distribution of the units used for properties that store quantities\n", "- Found examples of inappropriate use of units of measure in 
Wikidata\n", "- Illustrated how to use functions in `query` to extract elements from structured literals" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Compute the number of awards by sex or gender of the receiver" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, get a distribution of the `sex or gender (P21)` of people in our graph.\n", "The distribution is skewed, perhaps because it is skewed in Wikidata or because of how the tutorial graph was constructed." ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " sex_or_gender count sex_or_gender;label\n", "0 Q6581072 1783 'female'@en\n", "1 Q6581097 8111 'male'@en" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " query -i all\n", " --match '\n", " (person)-[:P31]->(:Q5),\n", " (person)-[:P21]->(sex_or_gender)'\n", " --return 'distinct sex_or_gender as sex_or_gender, count(distinct person) as count'\n", " / add-labels\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below, we compute the distribution of `sex or gender (P21)` per type of award. We use the property `award received (P166)` to extract the awards that people received.\n", "\n", "We create a new property `Paward_count` to record the count, and store the `sex or gender (P21)` as a qualifier." ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 8.96 ms, sys: 21.5 ms, total: 30.4 ms\n", "Wall time: 3.35 s\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " node1 label P21 node2 \\\n", "0 Q101007233 Paward_count Q6581097 1 \n", "1 Q1011547 Paward_count Q6581072 38 \n", "2 Q1011547 Paward_count Q6581097 42 \n", "3 Q101251494 Paward_count Q6581097 24 \n", "4 Q1044427 Paward_count Q6581072 8 \n", ".. ... ... ... ... \n", "220 Q96474707 Paward_count Q6581097 16 \n", "221 Q96474709 Paward_count Q6581072 2 \n", "222 Q96474709 Paward_count Q6581097 121 \n", "223 Q973011 Paward_count Q6581097 18 \n", "224 Q97551638 Paward_count Q6581097 2 \n", "\n", " node1;label P21;label \n", "0 'film critics association'@en 'male'@en \n", "1 'Golden Globe Award'@en 'female'@en \n", "2 'Golden Globe Award'@en 'male'@en \n", "3 'star'@en 'male'@en \n", "4 'Primetime Emmy Award'@en 'female'@en \n", ".. ... ... \n", "220 'honorary award'@en 'male'@en \n", "221 'award for best visual effects'@en 'female'@en \n", "222 'award for best visual effects'@en 'male'@en \n", "223 'campaign medal'@en 'male'@en \n", "224 'merit order'@en 'male'@en \n", "\n", "[225 rows x 6 columns]" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "kgtk(\"\"\"\n", " query -i all\n", " --match '\n", " (actor)-[:P31]->(:Q5),\n", " (actor)-[:P21]->(sex_or_gender),\n", " (actor)-[:P166]->(award)-[:P31]->(award_type)'\n", " --return 'distinct award_type as node1, \"Paward_count\" as label, sex_or_gender as P21, count(distinct actor) as node2'\n", " --order-by 'award_type'\n", " / add-labels\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Store the new `Paward_count` graph in a file and define the alias `award_count` for it" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 10.6 ms, sys: 35.9 ms, total: 46.4 ms\n", "Wall time: 4.35 s\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " node1 label node2 \\\n", "0 Q101007233 Paward_count 1 \n", "1 Q101007233-Paward_count-6b86b2 P21 Q6581097 \n", "2 Q1011547 Paward_count 38 \n", "3 Q1011547-Paward_count-aea921 P21 Q6581072 \n", "4 Q1011547 Paward_count 42 \n", "5 Q1011547-Paward_count-73475c P21 Q6581097 \n", "6 Q101251494 Paward_count 24 \n", "7 Q101251494-Paward_count-c23560 P21 Q6581097 \n", "8 Q1044427 Paward_count 8 \n", "9 Q1044427-Paward_count-2c6242 P21 Q6581072 \n", "\n", " id \n", "0 Q101007233-Paward_count-6b86b2 \n", "1 Q101007233-Paward_count-6b86b2-P21-Q6581097-0000 \n", "2 Q1011547-Paward_count-aea921 \n", "3 Q1011547-Paward_count-aea921-P21-Q6581072-0000 \n", "4 Q1011547-Paward_count-73475c \n", "5 Q1011547-Paward_count-73475c-P21-Q6581097-0000 \n", "6 Q101251494-Paward_count-c23560 \n", "7 Q101251494-Paward_count-c23560-P21-Q6581097-0000 \n", "8 Q1044427-Paward_count-2c6242 \n", "9 Q1044427-Paward_count-2c6242-P21-Q6581072-0000 " ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "kgtk(\"\"\"\n", " query -i all\n", " --match '\n", " (actor)-[:P31]->(:Q5),\n", " (actor)-[:P21]->(sex_or_gender),\n", " (actor)-[:P166]->(award)-[:P31]->(award_type)'\n", " --return 'distinct award_type as node1, \"Paward_count\" as label, sex_or_gender as P21, count(distinct actor) as node2'\n", " --order-by 'award_type'\n", " / add-id --id-style wikidata\n", " / normalize --add-id True\n", " -o $OUT/derived.Paward_count.tsv\n", "\"\"\")\n", "\n", "kgtk(\"query -i $OUT/derived.Paward_count.tsv --as award_count --limit 10\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Summary of this section\n", "In this section we:\n", "- Profiled awards to find the gender or sex of awardees, and found that males appear more frequently. We don't know if it is a skew in Wikidata or the real world.\n", "- Defined a new property to hold the data so that it can be shown in the browser." 
] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " award_type award_type;label\n", "0 Q1011547 'Golden Globe Award'@en\n", "1 Q106301 'Academy Award for Best Supporting Actress'@en\n", "2 Q110145 'MTV Movie Awards'@en\n", "3 Q1111310 'Directors Guild of America Award'@en\n", "4 Q1131772 'Saturn Award for Best Science Fiction Film'@en\n", ".. ... ...\n", "90 Q96474700 'award for best screenplay'@en\n", "91 Q96474701 'award for best adapted screenplay'@en\n", "92 Q96474704 'award for best makeup and hairdressing'@en\n", "93 Q96474707 'honorary award'@en\n", "94 Q96474709 'award for best visual effects'@en\n", "\n", "[95 rows x 2 columns]" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " query -i all\n", " --match '\n", " (award)-[P31]->(award_type)-[:P279star]->(:Q4220917)'\n", " --return 'distinct award_type as award_type'\n", " / add-labels\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Deploy the results (optional)\n", "\n", "**Do not run the cells below if you are running in Google Colab.**\n", "\n", "Deploy the tutorial KG after profiling so that the profiles can be used in other notebooks." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Files produced by this notebook that are copied to the deployment folder\n", "files_to_deploy = [\n", "    \"metadata.p31x.count.transitive.tsv\",\n", "    \"derived.P31x.tsv\",\n", "    \"derived.P1963computed.tsv\",\n", "    \"derived.Pproperty_domain.tsv\",\n", "    \"derived.Pproperty_units_used.tsv\",\n", "    \"derived.Paward_count.tsv\"\n", "]\n", "\n", "# First copy all the files from add-derived-graphs; we will overwrite the ones that change, e.g., all.tsv\n", "!cp -p {tutorial_deployment_path + \"/arnold\"}/*.tsv* {project_deployment_path}\n", "\n", "for file in files_to_deploy:\n", "    path = \"$OUT/\" + file\n", "    !cp -p {path} {project_deployment_path}\n", "\n", "# Rebuild all.tsv.gz so that it includes the newly derived graphs\n", "all_file_path = project_deployment_path + \"/all.tsv.gz\"\n", "if os.path.exists(all_file_path):\n", "    !rm {all_file_path}\n", "!kgtk cat -i {tutorial_deployment_path + \"/arnold/all.tsv.gz\"} -i {project_deployment_path}/*.tsv -o {all_file_path}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "List all the files:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!ls -l {project_deployment_path}" ] } ], "metadata": { "kernelspec": { "display_name": "kgtk-env", "language": "python", "name": "kgtk-env" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 4 }