# Creating a subset of Wikidata

This notebook illustrates how to create a subset of Wikidata. We use https://www.wikidata.org/wiki/Q11173 (chemical compound) as an example.

Parameters are set up in the first cell so that we can run this notebook in batch mode. Example invocation command:

```
papermill Example8\ -\ Wikidata\ Subset.ipynb example8.out.ipynb \
-p wikidata_parts_path /Users/pedroszekely/Downloads/kypher/output.all.10 \
-p subset_name Q11173 \
-p output_path /Users/pedroszekely/Downloads/kypher \
-p cache_path /Users/pedroszekely/Downloads/kypher \
-p delete_database no \
-p hops_right 0
```

### Parameters for invoking the notebook

- `wikidata_parts_path`: a folder containing the part files of Wikidata, including files such as `part.wikibase-item.tsv.gz`
- `subset_name`: the name of the subset being created. In the current implementation the `subset_name` must be a q-node in Wikidata representing a class.
- `output_path`: the path where a folder will be created to hold the KGTK files for the subset. A folder named `subset_name` will be created in this folder.
- `cache_path`: the path of a folder where the Kypher SQL database will be created.
- `delete_database`: whether to delete the SQL database before running the notebook: "" or "no" means don't delete it.
- `hops_right`: after getting the initial collection of q-nodes for the subset, how many hops forward to follow links, can be 0, 1 or 2.
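
The parameter cell below stores `hops_right` and `delete_database` as strings, so it can help to normalize them once before use. The following is a minimal sketch of one way to do that; `normalize_parameters` is a hypothetical helper and not part of the notebook.

```python
def normalize_parameters(hops_right, delete_database):
    """Return (hop_count, should_delete) from the raw notebook parameters.

    Hypothetical helper for illustration; the notebook does not define it.
    """
    hop_count = int(hops_right)  # accepts "1" as well as 1
    if hop_count not in (0, 1, 2):
        raise ValueError("hops_right must be 0, 1 or 2, got {!r}".format(hops_right))
    # Mirrors the documented convention: "" or "no" means keep the database.
    should_delete = bool(delete_database) and delete_database != "no"
    return hop_count, should_delete


print(normalize_parameters("1", "no"))
```

Failing early on an unexpected `hops_right` value avoids discovering the problem only after the long-running queries have started.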

In [1]:
# Parameters
wikidata_parts_path = "/Users/pedroszekely/Downloads/kypher/useful_wikidata_files"
#wikidata_parts_path = "/Users/pedroszekely/Downloads/kypher/output.all.10"
subset_name = "Q11173"
#subset_name = "Q318"
subset_name = "Q5"
subset_name = "Q44"
output_path = "/Users/pedroszekely/Downloads/kypher"
cache_path = "/Users/pedroszekely/Downloads/kypher"
hops_right = "1"
delete_database = "no"

In [2]:
temp_folder = subset_name + "-temp"
output_folder = subset_name

In [3]:
import io
import os
import subprocess
import sys

import numpy as np
import pandas as pd

# from IPython.display import display, HTML, Image
# from pandas_profiling import ProfileReport

A convenience function to run templatized commands, substituting NAME with the name of the dataset and substituting other keys provided in a dictionary.

In [4]:
def run_command(command, substitution_dictionary=None):
    """Run a templatized command."""
    cmd = command.replace("NAME", subset_name)
    # Apply the extra substitutions, if any were provided.
    for k, v in (substitution_dictionary or {}).items():
        cmd = cmd.replace(k, v)

    print(cmd)
    # Run through the shell so that $VARIABLES, globs and pipes are expanded.
    output = subprocess.run(cmd, shell=True, universal_newlines=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    print(output.stdout)
    print(output.stderr)
    #print(output.returncode)
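
To see what the substitution does without invoking a shell, the string expansion can be looked at in isolation. This is a minimal sketch that assumes the same replace-in-order behaviour as `run_command`; the name `expand_template` is hypothetical.

```python
def expand_template(command, name, substitutions=None):
    """Expand NAME and any extra keys in a command template (no execution)."""
    cmd = command.replace("NAME", name)
    for key, value in (substitutions or {}).items():
        cmd = cmd.replace(key, value)
    return cmd


template = "$kgtk query -i $TEMP/NAME.all-items.tsv.gz -i $WIKIDATA_PARTS/part.TYPE_FILE.tsv.gz"
print(expand_template(template, "Q44", {"TYPE_FILE": "time"}))
```

Because the replacement is purely textual, a template that contained `$NAME` would become `$Q44`, so templates passed to this function should use the bare word `NAME`.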

### Set up environment variables and folders that we need
We need to define environment variables to pass to the KGTK commands.

In [133]:
# folder containing wikidata broken down into smaller files.
os.environ['WIKIDATA_PARTS'] = wikidata_parts_path
# name of the dataset
os.environ['NAME'] = subset_name
# folder where to put the output
os.environ['OUT'] = "{}/{}".format(output_path, output_folder)
# temporary folder
os.environ['TEMP'] = "{}/{}".format(output_path, temp_folder)
# kgtk command to run
os.environ['kgtk'] = "kgtk"
# os.environ['kgtk'] = "time kgtk --debug"
# absolute path of the db
if cache_path:
    os.environ['STORE'] = "{}/wikidata.sqlite3.db".format(cache_path)
else:
    os.environ['STORE'] = "{}/{}/wikidata.sqlite3.db".format(output_path, temp_folder)

In [6]:
cd $output_folder

/Users/pedroszekely/Documents/GitHub/kgtk/examples/Q44


In [7]:
!mkdir $output_folder
!mkdir $temp_folder

mkdir: Q44: File exists
mkdir: Q44-temp: File exists


In [8]:
!rm $OUT/*.tsv $OUT/*.tsv.gz
!rm $TEMP/*.tsv $TEMP/*.tsv.gz

rm: /Users/pedroszekely/Downloads/kypher/Q44/*.tsv: No such file or directory


In [9]:
if delete_database and delete_database != "no":
    print("Deleted database")
    !rm $STORE

### Extract the Q-nodes for the items we want
Here we assume that the subset is defined by a single q-node, so that the subset name is the ID of that q-node. We should generalize this so that the query can be passed in as a parameter. We construct a file that contains every node1 that has an `isa` edge to the given NAME q-node.

In [10]:
command = "$kgtk query -i $WIKIDATA_PARTS/all.isa.tsv.gz \
 --graph-cache $STORE \
 -o $TEMP/qnodelist.NAME.tsv.gz \
 --match 'isa: (n1)-[l:isa]->(n2:NAME)' \
 --return 'distinct n1, l.label, n2'"
run_command(command)

$kgtk query -i $WIKIDATA_PARTS/all.isa.tsv.gz --graph-cache $STORE -o $TEMP/qnodelist.Q44.tsv.gz --match 'isa: (n1)-[l:isa]->(n2:Q44)' --return 'distinct n1, l.label, n2'




In [11]:
!gzcat $TEMP/qnodelist.$NAME.tsv.gz | head 

node1	label	node2
Q2579953	isa	Q44
Q15883984	isa	Q44
Q63379154	isa	Q44
Q3699039	isa	Q44
Q3360035	isa	Q44
Q16070652	isa	Q44
Q85313643	isa	Q44
Q897293	isa	Q44
Q999745	isa	Q44


In [12]:
!gzcat $TEMP/qnodelist.$NAME.tsv.gz | wc 

 454 1362 7887
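
The `--match` pattern in the `isa` query above keeps only edges whose label is `isa` and whose node2 is the subset q-node. The same selection on a small in-memory edge list can be sketched as follows. The sample rows are partly made up (the `Q2` to `Q5` edge is hypothetical), and this is not how Kypher executes the query.

```python
def select_isa(edges, target):
    """Return distinct (node1, label, node2) edges that are isa edges into target."""
    seen = set()
    result = []
    for node1, label, node2 in edges:
        edge = (node1, label, node2)
        if label == "isa" and node2 == target and edge not in seen:
            seen.add(edge)
            result.append(edge)
    return result


sample_edges = [
    ("Q2579953", "isa", "Q44"),
    ("Q15883984", "isa", "Q44"),
    ("Q2579953", "isa", "Q44"),   # duplicate, dropped by the distinct step
    ("Q2", "isa", "Q5"),          # hypothetical edge into a different class
]
print(select_isa(sample_edges, "Q44"))
```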


In [13]:
command = "$kgtk query -i $WIKIDATA_PARTS/all.P279star.tsv.gz \
 --graph-cache $STORE \
 -o $TEMP/all.P279star.NAME.tsv.gz \
 --match '(n1)-[l:P279star]->(n2:NAME)' \
 --return 'distinct n1, l.label, n2'"
run_command(command)

$kgtk query -i $WIKIDATA_PARTS/all.P279star.tsv.gz --graph-cache $STORE -o $TEMP/all.P279star.Q44.tsv.gz --match '(n1)-[l:P279star]->(n2:Q44)' --return 'distinct n1, l.label, n2'




In [14]:
!gzcat $TEMP/all.P279star.$NAME.tsv.gz | wc

 320 960 7110


### Generate the nodes one hop to the right

In [15]:
hops_right_count = int(hops_right)

In [16]:
command = "$kgtk query \
 -i $TEMP/qnodelist.NAME.tsv.gz \
 -i $WIKIDATA_PARTS/part.wikibase-item.tsv.gz \
 -o $TEMP/NAME.hop.right1.tsv.gz \
 --graph-cache $STORE \
 --match 'qnodelist: (n1)-[]->(), `wikibase-item`: (n1)-[]->(n2), `wikibase-item`: (n2)-[l]->(n3)' \
 --return 'distinct l, n2 as node1, l.label as label, n3 as node2'" 

if hops_right_count > 0:
    run_command(command)

$kgtk query -i $TEMP/qnodelist.Q44.tsv.gz -i $WIKIDATA_PARTS/part.wikibase-item.tsv.gz -o $TEMP/Q44.hop.right1.tsv.gz --graph-cache $STORE --match 'qnodelist: (n1)-[]->(), `wikibase-item`: (n1)-[]->(n2), `wikibase-item`: (n2)-[l]->(n3)' --return 'distinct l, n2 as node1, l.label as label, n3 as node2'
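
The pattern above follows an item edge from each q-node in the list to a neighbour `n2` and then returns the item edges that leave `n2`. A small in-memory sketch of that two-step join is shown below; the identifiers `QA` to `QD` are placeholders and `one_hop_right` is a hypothetical name.

```python
def one_hop_right(seed_nodes, item_edges):
    """Return the distinct edges that leave the direct neighbours of the seed nodes."""
    # Step 1: nodes reachable from a seed node through one item edge.
    neighbours = {node2 for node1, _label, node2 in item_edges if node1 in seed_nodes}
    # Step 2: all item edges whose node1 is one of those neighbours.
    return sorted({edge for edge in item_edges if edge[0] in neighbours})


# Placeholder data for illustration only.
item_edges = [
    ("QA", "P1", "QB"),
    ("QB", "P2", "QC"),
    ("QC", "P3", "QD"),
]
print(one_hop_right({"QA"}, item_edges))
```

Only the edge leaving `QB` is returned: `QC` is two hops from the seed, so its outgoing edge is not part of the result.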




Create a dummy, header-only hop file so that the `kgtk cat` command below doesn't fail when the number of hops is zero.

In [17]:
!echo -e "node1\tlabel\tnode2\tid" | gzip > $TEMP/$NAME.hop.dummy.tsv.gz

In [18]:
!kgtk cat -i $TEMP/$NAME.hop.*.tsv.gz $TEMP/all.P279star.$NAME.tsv.gz $TEMP/qnodelist.$NAME.tsv.gz | gzip > $TEMP/$NAME.all-items.tsv.gz
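
The dummy file guarantees that the glob `$TEMP/$NAME.hop.*.tsv.gz` matches at least one file. With default shell settings an unmatched glob is passed through literally, so the command would be handed a file name that does not exist. A rough Python equivalent of writing such a header-only gzipped file is sketched below; the function name and the temporary directory are illustrative only.

```python
import glob
import gzip
import os
import tempfile


def write_header_only(path, columns=("node1", "label", "node2", "id")):
    """Write a gzipped TSV file that contains only the header row."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write("\t".join(columns) + "\n")


with tempfile.TemporaryDirectory() as temp_dir:
    dummy = os.path.join(temp_dir, "Q44.hop.dummy.tsv.gz")
    write_header_only(dummy)
    # The glob now has something to match even when no hop query was run.
    matches = glob.glob(os.path.join(temp_dir, "Q44.hop.*.tsv.gz"))
    with gzip.open(dummy, "rt", encoding="utf-8") as f:
        header = f.read()
    print(len(matches), repr(header))
```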

### Generate the parts of this dataset

In [19]:
types = [
 "time",
 "wikibase-item",
 "math",
 "wikibase-form",
 "quantity",
 "string",
 "external-id",
 "commonsMedia",
 "globe-coordinate",
 "monolingualtext",
 "musical-notation",
 "geo-shape",
 "wikibase-property",
 "url",
]
command = "$kgtk query -i $TEMP/NAME.all-items.tsv.gz -i $WIKIDATA_PARTS/part.TYPE_FILE.tsv.gz --graph-cache $STORE \
 -o $OUT/NAME.part.TYPE_FILE.tsv.gz \
 --match 'NAME: (n1)-[]->(), `TYPE_FILE`: (n1)-[l]->(n2)' \
 --return 'distinct l, n1, l.label, n2' \
 --order-by 'n1, l.label, n2'"
for type in types:
    run_command(command, {"TYPE_FILE": type})


$kgtk query -i $TEMP/Q44.all-items.tsv.gz -i $WIKIDATA_PARTS/part.time.tsv.gz --graph-cache $STORE -o $OUT/Q44.part.time.tsv.gz --match 'Q44: (n1)-[]->(), `time`: (n1)-[l]->(n2)' --return 'distinct l, n1, l.label, n2' --order-by 'n1, l.label, n2'


$kgtk query -i $TEMP/Q44.all-items.tsv.gz -i $WIKIDATA_PARTS/part.wikibase-item.tsv.gz --graph-cache $STORE -o $OUT/Q44.part.wikibase-item.tsv.gz --match 'Q44: (n1)-[]->(), `wikibase-item`: (n1)-[l]->(n2)' --return 'distinct l, n1, l.label, n2' --order-by 'n1, l.label, n2'


$kgtk query -i $TEMP/Q44.all-items.tsv.gz -i $WIKIDATA_PARTS/part.math.tsv.gz --graph-cache $STORE -o $OUT/Q44.part.math.tsv.gz --match 'Q44: (n1)-[]->(), `math`: (n1)-[l]->(n2)' --return 'distinct l, n1, l.label, n2' --order-by 'n1, l.label, n2'


$kgtk query -i $TEMP/Q44.all-items.tsv.gz -i $WIKIDATA_PARTS/part.wikibase-form.tsv.gz --graph-cache $STORE -o $OUT/Q44.part.wikibase-form.tsv.gz --match 'Q44: (n1)-[]->(), `wikibase-form`: (n1)-[l]->(n2)' --return 'distinct l

### Generate a P279star file

First generate the P279 and P31 edges for every node2 in the wikibase-item file.

In [20]:
command_p279 = "$kgtk query -i $OUT/NAME.part.wikibase-item.tsv.gz -i $WIKIDATA_PARTS/all.P279.tsv.gz --graph-cache $STORE \
-o $TEMP/NAME.node2.P279.tsv.gz \
--match 'NAME: ()-[]->(n1), P279: (n1)-[l]->(n2)' \
--return 'distinct l, n1 as node1, l.label, n2' \
--order-by 'n1, l.label, n2'"

command_p31 = "$kgtk query -i $OUT/NAME.part.wikibase-item.tsv.gz -i $WIKIDATA_PARTS/all.P31.tsv.gz --graph-cache $STORE \
-o $TEMP/NAME.node2.P31.tsv.gz \
--match 'NAME: ()-[]->(n1), P31: (n1)-[l]->(n2)' \
--return 'distinct l, n1 as node1, l.label, n2' \
--order-by 'n1, l.label, n2'"

run_command(command_p279)
run_command(command_p31)

!$kgtk cat -i $TEMP/$NAME.node2.P279.tsv.gz $TEMP/$NAME.node2.P31.tsv.gz | gzip > $TEMP/$NAME.P279_P31.tsv.gz


$kgtk query -i $OUT/Q44.part.wikibase-item.tsv.gz -i $WIKIDATA_PARTS/all.P279.tsv.gz --graph-cache $STORE -o $TEMP/Q44.node2.P279.tsv.gz --match 'Q44: ()-[]->(n1), P279: (n1)-[l]->(n2)' --return 'distinct l, n1 as node1, l.label, n2' --order-by 'n1, l.label, n2'


$kgtk query -i $OUT/Q44.part.wikibase-item.tsv.gz -i $WIKIDATA_PARTS/all.P31.tsv.gz --graph-cache $STORE -o $TEMP/Q44.node2.P31.tsv.gz --match 'Q44: ()-[]->(n1), P31: (n1)-[l]->(n2)' --return 'distinct l, n1 as node1, l.label, n2' --order-by 'n1, l.label, n2'




In [21]:
!kgtk cat -i $OUT/$NAME.part.*.tsv.gz | gzip > $TEMP/$NAME.all_1.tsv.gz

In [22]:
command_node1 = "$kgtk query -i $WIKIDATA_PARTS/all.P279star.tsv.gz -i $TEMP/NAME.all_1.tsv.gz \
 --graph-cache $STORE \
 -o $TEMP/NAME.P279star.1.tsv.gz \
 --match 'P279star: (n1)-[l]->(n2), all_1: (n1)-[]->()' \
 --return 'distinct l, n1, l.label, n2'"

command_node2 = "$kgtk query -i $WIKIDATA_PARTS/all.P279star.tsv.gz -i $TEMP/NAME.all_1.tsv.gz \
 --graph-cache $STORE \
 -o $TEMP/NAME.P279star.2.tsv.gz \
 --match 'P279star: (n1)-[l]->(n2), all_1: ()-[]->(n1)' \
 --return 'distinct l, n1 as node1, l.label, n2'" 

cat_command = "$kgtk cat -i $TEMP/NAME.P279star.1.tsv.gz $TEMP/NAME.P279star.2.tsv.gz | gzip > $OUT/NAME.P279star.tsv.gz"

run_command(command_node1)
run_command(command_node2)
run_command(cat_command)

$kgtk query -i $WIKIDATA_PARTS/all.P279star.tsv.gz -i $TEMP/Q44.all_1.tsv.gz --graph-cache $STORE -o $TEMP/Q44.P279star.1.tsv.gz --match 'P279star: (n1)-[l]->(n2), all_1: (n1)-[]->()' --return 'distinct l, n1, l.label, n2'


$kgtk query -i $WIKIDATA_PARTS/all.P279star.tsv.gz -i $TEMP/Q44.all_1.tsv.gz --graph-cache $STORE -o $TEMP/Q44.P279star.2.tsv.gz --match 'P279star: (n1)-[l]->(n2), all_1: ()-[]->(n1)' --return 'distinct l, n1 as node1, l.label, n2'


$kgtk cat -i $TEMP/Q44.P279star.1.tsv.gz $TEMP/Q44.P279star.2.tsv.gz | gzip > $OUT/Q44.P279star.tsv.gz




### Get info on all properties

In [23]:
!$kgtk cat -i $OUT/*.gz | gzip > $TEMP/$NAME.everything_1.tsv.gz

First get a list of all the properties used in this file.

In [24]:
!$kgtk query -i $TEMP/$NAME.everything_1.tsv.gz --graph-cache $STORE \
-o $TEMP/$NAME.properties.tsv \
--match '(n1)-[l]->(n2)' \
--return 'distinct l.label as node1, "dummy" as label, "dummy" as node2' 

Now get all the information about these properties.

In [25]:
!$kgtk query -i $TEMP/$NAME.properties.tsv -i $WIKIDATA_PARTS/part.wikibase-item.tsv.gz --graph-cache $STORE \
-o $OUT/$NAME.properties.tsv.gz \
--match '`wikibase-item`: (p)-[l]->(n2), properties: (p)-[]->()' \
--return 'distinct l, p, l.label, n2' 

### Generate the labels, aliases and descriptions
We want the labels, aliases and descriptions for every q-node in our dataset. This means that we need these labels for all q-nodes that appear in the node1 or node2 position.

The first step is to concatenate all the files in our dataset.

In [26]:
!$kgtk cat -i $OUT/*.gz | gzip > $TEMP/$NAME.everything_2.tsv.gz

Now we extract the labels from our input Wikidata folder. We do this by matching node1, then node2, and then we concatenate the resulting label files.

In [27]:
labels = [
 "label",
 "alias",
 "description"
]

command_node1 = "$kgtk query -i $TEMP/NAME.everything_2.tsv.gz -i $WIKIDATA_PARTS/part.LABEL.en.tsv.gz --graph-cache $STORE \
 -o $TEMP/NAME.LABEL.en.1.tsv.gz \
 --match 'everything_2: (n1)-[]->(), part: (n1)-[l]->(n2)' \
 --return 'distinct l, n1, l.label, n2' \
 --order-by 'n1, l.label, n2'"

command_node2 = "$kgtk query -i $TEMP/NAME.everything_2.tsv.gz -i $WIKIDATA_PARTS/part.LABEL.en.tsv.gz --graph-cache $STORE \
 -o $TEMP/NAME.LABEL.en.2.tsv.gz \
 --match 'everything_2: ()-[]->(n1), part: (n1)-[l]->(n2)' \
 --return 'distinct l, n1 as node1, l.label, n2' \
 --order-by 'n1, l.label, n2'"

command_label = "$kgtk query -i $TEMP/NAME.everything_2.tsv.gz -i $WIKIDATA_PARTS/part.LABEL.en.tsv.gz --graph-cache $STORE \
 -o $TEMP/NAME.LABEL.en.3.tsv.gz \
 --match 'everything_2: ()-[l {label: n1}]->(), part: (n1)-[l]->(n2)' \
 --return 'distinct l, n1 as node1, l.label, n2' \
 --order-by 'n1, l.label, n2'"

cat_command = "kgtk cat -i $TEMP/NAME.LABEL.*.gz | gzip > $OUT/NAME.LABEL.en.tsv.gz"

for label in labels:
    run_command(command_node1, {"LABEL": label})
    run_command(command_node2, {"LABEL": label})
    run_command(cat_command, {"LABEL": label})


$kgtk query -i $TEMP/Q44.everything_2.tsv.gz -i $WIKIDATA_PARTS/part.label.en.tsv.gz --graph-cache $STORE -o $TEMP/Q44.label.en.1.tsv.gz --match 'everything_2: (n1)-[]->(), part: (n1)-[l]->(n2)' --return 'distinct l, n1, l.label, n2' --order-by 'n1, l.label, n2'


$kgtk query -i $TEMP/Q44.everything_2.tsv.gz -i $WIKIDATA_PARTS/part.label.en.tsv.gz --graph-cache $STORE -o $TEMP/Q44.label.en.2.tsv.gz --match 'everything_2: ()-[]->(n1), part: (n1)-[l]->(n2)' --return 'distinct l, n1 as node1, l.label, n2' --order-by 'n1, l.label, n2'


kgtk cat -i $TEMP/Q44.label.*.gz | gzip > $OUT/Q44.label.en.tsv.gz


$kgtk query -i $TEMP/Q44.everything_2.tsv.gz -i $WIKIDATA_PARTS/part.alias.en.tsv.gz --graph-cache $STORE -o $TEMP/Q44.alias.en.1.tsv.gz --match 'everything_2: (n1)-[]->(), part: (n1)-[l]->(n2)' --return 'distinct l, n1, l.label, n2' --order-by 'n1, l.label, n2'


$kgtk query -i $TEMP/Q44.everything_2.tsv.gz -i $WIKIDATA_PARTS/part.alias.en.tsv.gz --graph-cache $STORE -o $TEMP/Q44.alias.en
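
The idea behind the label queries is to collect every q-node that occurs in the node1 or node2 position and to keep only the label edges for those nodes. A small sketch of that selection follows. The q-node IDs are taken from outputs in this notebook, the label strings imitate the language-qualified style of those outputs, and `labels_for_dataset` is a hypothetical name.

```python
def labels_for_dataset(dataset_edges, label_edges):
    """Return label edges whose node1 occurs as node1 or node2 in the dataset."""
    wanted = set()
    for node1, _label, node2 in dataset_edges:
        wanted.update((node1, node2))
    return [edge for edge in label_edges if edge[0] in wanted]


dataset_edges = [("Q1017471", "P31", "Q44")]
label_edges = [
    ("Q1017471", "label", "'Bush'@en"),
    ("Q44", "label", "'beer'@en"),
    ("Q247949", "label", "'Bush'@en"),   # not in the dataset, so not kept
]
print(labels_for_dataset(dataset_edges, label_edges))
```

The sketch handles both positions in one pass, whereas the notebook runs one query per position and concatenates the results.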

### Summary of what we got

In [28]:
%%bash
for f in $OUT/*.tsv.gz; do
 echo -n `basename $f`
 gzcat $f | wc -l
done

Q44.P279star.tsv.gz 183410
Q44.alias.en.tsv.gz 36266
Q44.description.en.tsv.gz 27447
Q44.label.en.tsv.gz 36591
Q44.part.commonsMedia.tsv.gz 1386
Q44.part.external-id.tsv.gz 8406
Q44.part.geo-shape.tsv.gz 75
Q44.part.globe-coordinate.tsv.gz 416
Q44.part.math.tsv.gz 1
Q44.part.monolingualtext.tsv.gz 3594
Q44.part.musical-notation.tsv.gz 1
Q44.part.quantity.tsv.gz 25281
Q44.part.string.tsv.gz 1314
Q44.part.time.tsv.gz 366
Q44.part.url.tsv.gz 314
Q44.part.wikibase-form.tsv.gz 1
Q44.part.wikibase-item.tsv.gz 21942
Q44.part.wikibase-property.tsv.gz 20
Q44.properties.tsv.gz 10582
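
`gzcat` is not available under that name on every system, so the same line counts can also be produced from Python. The following is a minimal sketch, assuming the files are gzip-compressed text; `count_lines` is a hypothetical helper, and the demo writes its own tiny file instead of reading `$OUT`.

```python
import gzip
import os
import tempfile


def count_lines(path):
    """Count the lines (header included) of a gzip-compressed text file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return sum(1 for _ in f)


with tempfile.TemporaryDirectory() as demo_dir:
    demo_file = os.path.join(demo_dir, "demo.tsv.gz")
    # One header row plus one edge row.
    with gzip.open(demo_file, "wt", encoding="utf-8") as f:
        f.write("node1\tlabel\tnode2\nQ1017471\tP31\tQ44\n")
    line_count = count_lines(demo_file)
    print(os.path.basename(demo_file), line_count)
```

The counts include the header row, so a count of 1, as for `Q44.part.math.tsv.gz` above, most likely means that the file holds only its header.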


Unzip the everything file because graph-statistics cannot work with gz files.

In [29]:
!rm $TEMP/$NAME.everything_2.tsv

rm: /Users/pedroszekely/Downloads/kypher/Q44-temp/Q44.everything_2.tsv: No such file or directory


In [30]:
!gunzip --keep $TEMP/$NAME.everything_2.tsv.gz

In [31]:
!$kgtk graph-statistics --log $OUT/$NAME.everything.statistics.txt \
 --statistics-only --pagerank -i $TEMP/$NAME.everything_2.tsv \
 | gzip > $OUT/$NAME.statistics.tsv.gz

In [32]:
!cat $OUT/$NAME.everything.statistics.txt

graph loaded! It has 53099 nodes and 257093 edges

###Top relations:
P279star	183409
P2302	4031
P530	3434
P1082	3347
P2936	3235
P2131	3158
P2132	3058
P2134	2891
P31	2630
P1549	2574

###PageRank
Max pageranks
44	Q4406616	0.001093
43	Q44	0.001219
46	Q488383	0.001689
38	Q35120	0.001846
57	novalue	0.012314


In [33]:
!ls -lh $OUT

total 12712
-rw-r--r-- 1 pedroszekely staff 1.2M Oct 16 22:39 Q44.P279star.tsv.gz
-rw-r--r-- 1 pedroszekely staff 407K Oct 16 22:39 Q44.alias.en.tsv.gz
-rw-r--r-- 1 pedroszekely staff 479K Oct 16 22:39 Q44.description.en.tsv.gz
-rw-r--r-- 1 pedroszekely staff 304B Oct 16 22:39 Q44.everything.statistics.txt
-rw-r--r-- 1 pedroszekely staff 446K Oct 16 22:39 Q44.label.en.tsv.gz
-rw-r--r-- 1 pedroszekely staff 22K Oct 16 22:38 Q44.part.commonsMedia.tsv.gz
-rw-r--r-- 1 pedroszekely staff 93K Oct 16 22:38 Q44.part.external-id.tsv.gz
-rw-r--r-- 1 pedroszekely staff 938B Oct 16 22:38 Q44.part.geo-shape.tsv.gz
-rw-r--r-- 1 pedroszekely staff 6.3K Oct 16 22:38 Q44.part.globe-coordinate.tsv.gz
-rw-r--r-- 1 pedroszekely staff 62B Oct 16 22:38 Q44.part.math.tsv.gz
-rw-r--r-- 1 pedroszekely staff 38K Oct 16 22:38 Q44.part.monolingualtext.tsv.gz
-rw-r--r-- 1 pedroszekely staff 74B Oct 16 22:38 Q44.part.musical-notation.tsv.gz
-rw-r--r-- 1 pedroszekely staff 201K Oct 16 22:38 Q44.part.quantity.tsv.gz


Example of how to get statistics on the properties. 

In [34]:
!kgtk query -i $TEMP/$NAME.everything_2.tsv.gz -i $WIKIDATA_PARTS/part.label.en.tsv.gz --graph-cache $STORE \
--match 'everything: (n1)-[l:P106]->(n2), label: (n2)-[:label]->(label)' \
--return 'distinct l.label as property_id, label as property_label, n2 as value, count(n2) as value_count' \
--order-by 'count(n2) desc' \
--limit 10 \
| column -t -s $'\t' 

property_id property_label value value_count
P106 'Entrepreneur'@en-ca Q131524 3
P106 'entrepreneur'@en Q131524 3
P106 'entrepreneur'@en-gb Q131524 3
P106 'toy maker'@en Q2310380 1


Distribution of label/node2

## Entity Profiles
The cells in this section should be moved to a new `Example10 Entity Profiler` notebook

### Entity profiler for items
Get distinct P31(node1)/label/node2 triples, and count the number of instances of such edges.

Represent the result as KGTK edges:
- `node1`: the property, i.e., the `label` in our definition
- `label`: a new property we call `Pprofiler_count`
- `node2`: the count

Use qualifiers to represent the context:
- `Pcontext_item`: represents the `node2` in our definition
- `Pcontext_type`: represents `P31(node1)` in our definition

In [151]:
!$kgtk query -i $OUT/$NAME.part.wikibase-item.tsv.gz -i $OUT/$NAME.label.en.tsv.gz --graph-cache $STORE \
--match 'item: (n1)-[l {label: p}]->(n2), item: (n1)-[:P31]->(type), label: (p)-[:label]->(lab), label: (type)-[:label]->(type_label), label: (n2)-[:label]->(n2_label)' \
--where 'lab.kgtk_lqstring_lang_suffix = "en"' \
--return 'distinct type as Pcontext_type, l.label as node1_dummy, n2 as Pcontext_item, count(n1) as node2, lab as `node1;label`, n2_label as `Pcontext_item;label`, type_label as `Pcontext_type;label`, "Pprofiler_count" as label' \
--order-by 'type, p, count(n1) desc' \
--limit 10 \
| column -t -s $'\t' 

Pcontext_type node1_dummy Pcontext_item node2 node1;label Pcontext_item;label Pcontext_type;label label
Q1066984 P1151 Q11028213 6 'topic\\\\\\\\'s main Wikimedia portal'@en 'Portal:Munich'@en 'Financial centre'@en-ca Pprofiler_count
Q1066984 P131 Q10562 18 'located in the administrative territorial entity'@en 'Upper Bavaria'@en 'Financial centre'@en-ca Pprofiler_count
Q1066984 P131 Q1673724 6 'located in the administrative territorial entity'@en 'Isarkreis'@en 'Financial centre'@en-ca Pprofiler_count
Q1066984 P1313 Q11902879 36 'office held by head of government'@en 'Lord Mayor'@en 'Financial centre'@en-ca Pprofiler_count
Q1066984 P1313 Q1958954 12 'office held by head of government'@en 'list of mayors of Munich'@en 'Financial centre'@en-ca Pprofiler_count
Q1066984 P1343 Q97879676 18 'described by source'@en 'Regesta Imperii XIII'@en 'Financial centre'@en-ca Pprofiler_count
Q1066984 P1343 Q316838 12 'described by source'@en 'Regesta Imperii'@en 'Financial centre'@en-ca Pprofiler_count
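
Conceptually, the profiler groups edges by the type of node1, the property, and node2, and counts how many edges fall into each group. The sketch below does this with `collections.Counter` on made-up data. It leaves out the label joins, so it is an illustration of the grouping and not a substitute for the Kypher query; `profile_items` and the identifiers `QA`, `QT`, and so on are placeholders.

```python
from collections import Counter


def profile_items(item_edges, types_by_node):
    """Count edges per (type of node1, property, node2) combination."""
    counts = Counter()
    for node1, prop, node2 in item_edges:
        # A node with several P31 values contributes to each of its types.
        for node_type in types_by_node.get(node1, ()):
            counts[(node_type, prop, node2)] += 1
    return counts


# Placeholder identifiers for illustration only.
types_by_node = {"QA": ["QT"], "QB": ["QT"]}
item_edges = [
    ("QA", "P131", "QR"),
    ("QB", "P131", "QR"),
    ("QA", "P17", "QC"),
]
for (node_type, prop, node2), count in profile_items(item_edges, types_by_node).most_common():
    print(node_type, prop, node2, count)
```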

### The cells below compute profiles for other data types and should be refactored to follow the pattern of the Entity profiler for items

In [52]:
!$kgtk query -i $OUT/$NAME.part.time.tsv.gz -i $OUT/$NAME.part.wikibase-item.tsv.gz -i $OUT/$NAME.label.en.tsv.gz --graph-cache $STORE \
--match 'time: (n1)-[l {label: p}]->(n2), item: (n1)-[:P31]->(type), label: (p)-[:label]->(lab)' \
--return 'distinct type as type, l.label as prop, lab as property_label, kgtk_date_year(n2) as year, count(n1) as count' \
--where 'lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'count(n1) desc' \
--limit 10 \
| column -t -s $'\t' 

type prop property_label year count
Q3624078 P571 'inception'@en 1991 18
Q3624078 P571 'inception'@en 1918 16
Q6256 P571 'inception'@en 1991 12
Q6256 P571 'inception'@en 1918 10
Q123480 P571 'inception'@en 1991 8
Q179164 P571 'inception'@en 1991 8
Q4209223 P571 'inception'@en 1991 8
Q44 P571 'inception'@en 2001 8
Q619610 P571 'inception'@en 1991 8
Q63791824 P571 'inception'@en 1918 8


In [53]:
!$kgtk query -i $OUT/$NAME.part.quantity.tsv.gz -i $OUT/$NAME.part.wikibase-item.tsv.gz -i $OUT/$NAME.label.en.tsv.gz --graph-cache $STORE \
--match 'quantity: (n1)-[l {label: p}]->(n2), item: (n1)-[:P31]->(type), label: (p)-[:label]->(lab)' \
--return 'distinct type as type, l.label as prop, lab as property_label, kgtk_quantity_number(n2) as value, count(n1) as count' \
--where 'lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'count(n1) desc' \
--limit 10 \
| column -t -s $'\t' 

type prop property_label value count
Q3624078 P3000 'marriageable age'@en 18 56
Q3624078 P2997 'age of majority'@en 18 53
Q3624078 P2884 'mains voltage'@en 230 41
Q3624078 P1279 'inflation rate'@en 1.7 39
Q3624078 P1279 'inflation rate'@en 1.8 39
Q3624078 P1279 'inflation rate'@en 2.1 37
Q3624078 P1279 'inflation rate'@en 1.5 32
Q3624078 P1279 'inflation rate'@en 2 31
Q3624078 P1279 'inflation rate'@en 2.8 31
Q3624078 P1279 'inflation rate'@en 3.5 29


## Extending KG to include nodes with ambiguous names

Find node2s where we have node1/label/node1_label in qnodelist such that there exists a node2/alias/node2_alias in Wikidata such that node2_alias = node1_label

In [161]:
!$kgtk query -i $TEMP/qnodelist.$NAME.tsv.gz -i $WIKIDATA_PARTS/part.label.en.tsv.gz -i $WIKIDATA_PARTS/part.alias.en.tsv.gz --graph-cache $STORE \
--match 'qnodelist: (n1)-[]->(), label: (n1)-[:label]->(n1_label), alias: (n2)-[:alias]->(n1_label), label: (n2)-[:label]->(n2_label)' \
--where 'n1 != n2' \
--return 'distinct n1 as node1, n1_label as `node1;label`, n2 as node2, n2_label as `node2;label`, "Pshares_name" as label' \
--limit 10 \
| column -t -s $'\t' 

node1 node1;label node2 node2;label label
Q1017471 'Bush'@en Q80857164 'The Bush'@en Pshares_name
Q1017471 'Bush'@en Q80857164 'The Bush'@en-gb Pshares_name
Q1017471 'Bush'@en Q21810649 'Norton Bush'@en Pshares_name
Q1017471 'Bush'@en Q60614686 'The Gentlemen'@en Pshares_name
Q1017471 'Bush'@en Q4888621 'Benjamin Franklin Bush'@en Pshares_name
Q1017471 'Bush'@en Q54888574 'Bush, Washington'@en Pshares_name
Q10350781 'Polar'@en Q1500857 'Polar Electro'@en Pshares_name
Q1041750 'Carling'@en Q7230524 'Port Carling'@en Pshares_name
Q12009657 'Victoria'@en Q286499 'Vitruvia'@en Pshares_name
Q12009657 'Victoria'@en Q3557663 'Michel Sardou'@en Pshares_name
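
The query above pairs a q-node from the list with any other node that has an alias equal to the first node's label. A dictionary index on alias strings gives the same pairing in memory. The data below is invented for illustration (`QX` and `QY` are placeholders) and `shares_name` is a hypothetical helper.

```python
from collections import defaultdict


def shares_name(labels, aliases):
    """Return (node1, "Pshares_name", node2) edges where an alias of node2 equals the label of node1."""
    nodes_by_alias = defaultdict(set)
    for node, alias_list in aliases.items():
        for alias in alias_list:
            nodes_by_alias[alias].add(node)
    pairs = []
    for node1, label in labels.items():
        for node2 in sorted(nodes_by_alias.get(label, ())):
            if node1 != node2:  # same role as the n1 != n2 clause
                pairs.append((node1, "Pshares_name", node2))
    return pairs


labels = {"QX": "'Bush'@en"}
aliases = {"QY": ["'Bush'@en", "'The Bush'@en"], "QX": ["'Bush'@en"]}
print(shares_name(labels, aliases))
```

The comparison is on the full literal, language tag included, so `'Bush'@en` and `'Bush'@en-gb` count as different names in this sketch.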


Find node2s where we have node1/alias/node1_alias in qnodelist such that there exists a node2/label/node2_label in Wikidata such that node2_label = node1_alias

In [160]:
!$kgtk query -i $TEMP/qnodelist.$NAME.tsv.gz -i $WIKIDATA_PARTS/part.label.en.tsv.gz -i $WIKIDATA_PARTS/part.alias.en.tsv.gz --graph-cache $STORE \
--match 'qnodelist: (n1)-[]->(), label: (n1)-[:label]->(n1_label), alias: (n1)-[:alias]->(n1_alias), label: (n2)-[:label]->(n1_alias)' \
--where 'n1 != n2' \
--return 'distinct n1 as node1, n1_label as `node1;label`, n2 as node2, n1_alias as `node2;label`, "Pshares_name" as label' \
--limit 10 \
| column -t -s $'\t' 

node1 node1;label node2 node2;label label
Q1157108 'Cerveza Sol'@en Q64961707 'Sol'@en Pshares_name
Q1157108 'Cerveza Sol'@en Q7555482 'Sol'@en Pshares_name
Q1157108 'Cerveza Sol'@en Q69509964 'Sol'@en Pshares_name
Q1157108 'Cerveza Sol'@en Q1237552 'Sol'@en Pshares_name
Q1157108 'Cerveza Sol'@en Q7555484 'Sol'@en Pshares_name
Q1157108 'Cerveza Sol'@en Q7555486 'Sol'@en Pshares_name
Q1157108 'Cerveza Sol'@en Q64961436 'Sol'@en Pshares_name
Q1157108 'Cerveza Sol'@en Q3489075 'Sol'@en Pshares_name
Q1157108 'Cerveza Sol'@en Q37563235 'Sol'@en Pshares_name
Q1157108 'Cerveza Sol'@en Q23664473 'Sol'@en Pshares_name


Find node2s where we have node1/alias/node1_alias in qnodelist such that there exists a node2/alias/node2_alias in Wikidata such that node2_alias = node1_alias

In [163]:
!$kgtk query -i $TEMP/qnodelist.$NAME.tsv.gz -i $WIKIDATA_PARTS/part.label.en.tsv.gz -i $WIKIDATA_PARTS/part.alias.en.tsv.gz --graph-cache $STORE \
--match 'qnodelist: (n1)-[]->(), label: (n1)-[:label]->(n1_label), alias: (n1)-[:alias]->(n1_alias), alias: (n2)-[:alias]->(n1_alias), label: (n2)-[:label]->(n2_label)' \
--where 'n1 != n2' \
--return 'distinct n1 as node1, n1_label as `node1;label`, n2 as node2, n2_label as `node2;label`, "Pshares_name" as label' \
--limit 10 \
| column -t -s $'\t' 

node1 node1;label node2 node2;label label
Q1157108 'Cerveza Sol'@en Q22583558 'J-Hope'@en Pshares_name
Q1157108 'Cerveza Sol'@en Q18607853 'Solomon'@en Pshares_name
Q1157108 'Cerveza Sol'@en Q18607853 'Solomon'@en-ca Pshares_name
Q1157108 'Cerveza Sol'@en Q18607853 'Solomon'@en-gb Pshares_name
Q1157108 'Cerveza Sol'@en Q28800560 'El Sol'@en Pshares_name
Q1157108 'Cerveza Sol'@en Q28800560 'El Sol'@en-ca Pshares_name
Q1157108 'Cerveza Sol'@en Q28800560 'El Sol'@en-gb Pshares_name
Q1157108 'Cerveza Sol'@en Q654596 'Sól'@en Pshares_name
Q1157108 'Cerveza Sol'@en Q7666238 'Sól'@en Pshares_name
Q1157108 'Cerveza Sol'@en Q525 'Sun'@en Pshares_name


Find node2s where we have node1/label/node1_label in qnodelist such that there exists a node2/label/node2_label in Wikidata such that node2_label = node1_label

In [165]:
!$kgtk query -i $TEMP/qnodelist.$NAME.tsv.gz -i $WIKIDATA_PARTS/part.label.en.tsv.gz --graph-cache $STORE \
--match 'qnodelist: (n1)-[]->(), label: (n1)-[:label]->(n1_label), label: (n2)-[:label]->(n1_label)' \
--where 'n1 != n2' \
--return 'distinct n1 as node1, n1_label as `node1;label`, n2 as node2, n1_label as `node2;label`, "Pshares_name" as label' \
--limit 10 \
| column -t -s $'\t' 

node1 node1;label node2 node2;label label
Q1017471 'Bush'@en Q5001360 'Bush'@en Pshares_name
Q1017471 'Bush'@en Q77894031 'Bush'@en Pshares_name
Q1017471 'Bush'@en Q5001365 'Bush'@en Pshares_name
Q1017471 'Bush'@en Q20482703 'Bush'@en Pshares_name
Q1017471 'Bush'@en Q18793771 'Bush'@en Pshares_name
Q1017471 'Bush'@en Q247949 'Bush'@en Pshares_name
Q1017471 'Bush'@en Q1017464 'Bush'@en Pshares_name
Q1017471 'Bush'@en Q1484464 'Bush'@en Pshares_name
Q1017471 'Bush'@en Q224168 'Bush'@en Pshares_name
Q1017471 'Bush'@en Q2469309 'Bush'@en Pshares_name


In [167]:
!wd u Q1017471 Q247949

id Q1017471
Label Bush
Description Beer of Belgium (Wallonia)
instance of (P31): beer brand (Q15075508) | beer (Q44)

id Q247949
Label Bush
Description British rock band
instance of (P31): musical group (Q215380)
