# Generating Useful Wikidata Files

### Batch Invocation
The notebook can be run in batch mode with papermill. The second argument is the notebook where the executed output is stored; you can open it to check progress.

```
papermill Example7\ -\ Wikidata\ Outputs.ipynb example7.out.ipynb \
-p home /Users/pedroszekely/Downloads/kypher \
-p wiki_file all.10.tsv.gz \
-p output_folder output.all.10 \
-p temp_folder temp.all.10 \
-p delete_database true 
```
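The same invocation can be assembled from Python, which may be convenient when launching several runs with different parameters. The sketch below only builds and prints the argument list; `build_papermill_command` is a hypothetical helper (not part of papermill), and the parameter values are copied from the command above.

```python
import shlex


def build_papermill_command(notebook, output_notebook, parameters):
    """Return the papermill argument list for one batch run."""
    args = ["papermill", notebook, output_notebook]
    for name, value in parameters.items():
        # Each notebook parameter becomes a "-p name value" triple.
        args += ["-p", name, str(value)]
    return args


command = build_papermill_command(
    "Example7 - Wikidata Outputs.ipynb",
    "example7.out.ipynb",
    {
        "home": "/Users/pedroszekely/Downloads/kypher",
        "wiki_file": "all.10.tsv.gz",
        "output_folder": "output.all.10",
        "temp_folder": "temp.all.10",
        "delete_database": "true",
    },
)

# Quote each argument so the notebook name, which contains spaces, stays intact.
print(" ".join(shlex.quote(arg) for arg in command))
```

Passing `command` to `subprocess.run(command, check=True)` would start the run without going through a shell, so no manual escaping of the spaces is needed.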

In [7]:
# Parameters
home = "/Users/pedroszekely/Downloads/kypher"
# wiki_file is assigned twice; the second assignment (KGTK-public-graphs) is the one in effect.
wiki_file = "/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200803-v3/all.tsv.gz"
wiki_file = "/Volumes/GoogleDrive/Shared drives/KGTK-public-graphs/wikidata-20200803-v3/all.tsv.gz"
output_path = "/Users/pedroszekely/Downloads/kypher"
cache_path = "/Users/pedroszekely/Downloads/kypher/temp.useful_wikidata_files_v3"
output_folder = "useful_wikidata_files_v3"
temp_folder = "temp.useful_wikidata_files_v3"
delete_database = "no"

In [8]:
import io
import os
import subprocess
import sys

import numpy as np
import pandas as pd

# from IPython.display import display, HTML, Image
# from pandas_profiling import ProfileReport

## Set up environment and folders to store the files

- `OUT` folder where the output files go
- `TEMP` folder to keep temporary files, including the database
- `kgtk` shortcut to invoke the kgtk software

The current implementation of some of the kgtk commands does not understand compressed files. In particular, `query` often rejects `gz` files.

To dos:

- Make sure that all files have id columns as `query` gets unhappy when files have no ids.
- Create an output folder for a subset of Wikidata without scholarly articles. This is half done: the remaining work is to subtract the scholarly articles from `EDGES` and repeat the workflow.
- Change the naming convention to make it clear which files are a partition of the original `EDGES`, so users know what files they need to get to have a full version.
- Create a qualifier file for the partition files of Wikidata: this is so that if a user gets one of the partitions, they can get the corresponding qualifier file.
- Add pagerank and other stats. We can compute the pagerank from the `all.item` file, so the output could be called `all.item.pagerank.tsv`.

Naming convention: the name `all` is redundant, and we should consider removing it. I recommend using the prefix `part.` to name the files that partition Wikidata, e.g., `part.label`, `part.quantity`. Files such as `P279` are not part of the partition because they are subsets of `part.item`.

If we create a subset of Wikidata, e.g., no scholarly articles, we could call it `minus.Q13442814`; if we remove galaxies too, we could call it `minus.Q13442814-Q318`, so the files would be `minus.Q13442814-Q318.part.quantity.tsv` (the idea of `all` is in contrast to `minus`). We can also have files that start with Qnodes, e.g., `Q5.part.quantity.tsv`; constructing such files is harder as we don't want dangling nodes in the item file.
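To make the proposed convention concrete, here is a minimal sketch of a name builder. `wikidata_file_name` and its parameters are hypothetical and only encode the three patterns described above; the convention itself is still a proposal.

```python
def wikidata_file_name(part, include=None, exclude=(), extension="tsv"):
    """Build a file name following the proposed naming convention.

    part:    partition name, e.g. "label" or "quantity"
    include: optional Qnode the subset starts from, e.g. "Q5"
    exclude: Qnodes removed from the dataset, e.g. ["Q13442814", "Q318"]
    """
    if include and exclude:
        raise ValueError("use either include or exclude, not both")
    pieces = []
    if include:
        pieces.append(include)
    elif exclude:
        # Excluded Qnodes are joined with "-" after the "minus." prefix.
        pieces.append("minus." + "-".join(exclude))
    pieces.append("part." + part)
    pieces.append(extension)
    return ".".join(pieces)


print(wikidata_file_name("label"))
print(wikidata_file_name("quantity", exclude=["Q13442814", "Q318"]))
print(wikidata_file_name("quantity", include="Q5"))
```

The three calls print `part.label.tsv`, `minus.Q13442814-Q318.part.quantity.tsv` and `Q5.part.quantity.tsv`.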

In [9]:
os.environ['OUT'] = "{}/{}".format(output_path, output_folder)
os.environ['TEMP'] = "{}/{}".format(output_path, temp_folder)
os.environ['kgtk'] = "kgtk"
# Override the plain command to time each invocation and print debug output.
os.environ['kgtk'] = "time kgtk --debug"
os.environ['EDGES'] = wiki_file
if cache_path:
 os.environ['STORE'] = "{}/wikidata.sqlite3.db".format(cache_path)
else:
 os.environ['STORE'] = "{}/{}/wikidata.sqlite3.db".format(output_path, temp_folder)

In [10]:
!echo $OUT
!echo $TEMP
!echo $kgtk
!echo $EDGES
!echo $STORE

/Users/pedroszekely/Downloads/kypher/useful_wikidata_files_v3
/Users/pedroszekely/Downloads/kypher/temp.useful_wikidata_files_v3
time kgtk --debug
/Volumes/GoogleDrive/Shared drives/KGTK-public-graphs/wikidata-20200803-v3/all.tsv.gz
/Users/pedroszekely/Downloads/kypher/temp.useful_wikidata_files_v3/wikidata.sqlite3.db


In [11]:
cd $output_path

/Users/pedroszekely/Downloads/kypher


In [12]:
!mkdir $OUT
!mkdir $TEMP

mkdir: /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_v3: File exists
mkdir: /Users/pedroszekely/Downloads/kypher/temp.useful_wikidata_files_v3: File exists
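The `File exists` messages are harmless, but they can be avoided by creating the folders idempotently. A minimal sketch using the standard library follows; `ensure_folders` is a hypothetical helper, and the demonstration uses a throwaway directory rather than the real `$OUT` and `$TEMP`.

```python
import os
import tempfile


def ensure_folders(*paths):
    """Create each folder if it is missing; existing folders are left untouched."""
    for path in paths:
        os.makedirs(path, exist_ok=True)


# Demonstration in a temporary location.
base = tempfile.mkdtemp()
out_dir = os.path.join(base, "useful_wikidata_files_v3")
temp_dir = os.path.join(base, "temp.useful_wikidata_files_v3")

ensure_folders(out_dir, temp_dir)
ensure_folders(out_dir, temp_dir)  # the second call is a no-op and raises no error
```

In the notebook this would be called as `ensure_folders(os.environ['OUT'], os.environ['TEMP'])`.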


Clean up the output and temp folders before we start

In [7]:
# !rm $OUT/*.tsv $OUT/*.tsv.gz
# !rm $TEMP/*.tsv $TEMP/*.tsv.gz

In [8]:
if delete_database and delete_database != "no":
 !rm $STORE
 print("Deleted database")

The cell above removes the SQLite database when `delete_database` is set to anything other than `no`. Loading all the data and creating the indices takes a long time, so don't remove the database unless you have changed files that were already loaded and need to force a reload.
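Note that the check `delete_database != "no"` treats every other non-empty string, including `"false"`, as a request to delete. Since rebuilding the database is expensive, a stricter interpretation of the flag may be safer. The sketch below is one possible approach; `wants_delete` and `delete_store` are hypothetical helpers, and the demonstration runs on a throwaway file.

```python
import os
import tempfile


def wants_delete(flag):
    """Interpret the delete_database parameter, which may be a bool or a string."""
    if isinstance(flag, bool):
        return flag
    return str(flag).strip().lower() in {"true", "yes", "1"}


def delete_store(path, flag):
    """Remove the database file when requested; return True if a file was removed."""
    if wants_delete(flag) and os.path.exists(path):
        os.remove(path)
        return True
    return False


# Demonstration on a temporary file instead of the real database.
handle, store_path = tempfile.mkstemp(suffix=".sqlite3.db")
os.close(handle)
kept = delete_store(store_path, "no")       # flag says no: the file stays
removed = delete_store(store_path, "true")  # flag says yes: the file is removed
```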

### Get a sample and force importing the edge file into the database

In [None]:
!$kgtk query -i "$EDGES" --limit 10 --graph-cache $STORE

[2020-11-03 22:16:22 sqlstore]: IMPORT graph directly into table graph_1 from /Volumes/GoogleDrive/Shared drives/KGTK-public-graphs/wikidata-20200803-v3/all.tsv.gz ...


Force creation of the index on the label column

In [13]:
!$kgtk query -i "$EDGES" --graph-cache $STORE -o - \
 --match '(i)-[:P31]->(c)' \
 --limit 5

[2020-10-23 17:59:58 query]: SQL Translation:
---------------------------------------------
 SELECT *
 FROM graph_14 AS graph_14_c1
 WHERE graph_14_c1."label"=?
 LIMIT ?
 PARAS: ['P31', 5]
---------------------------------------------
[2020-10-23 17:59:58 sqlstore]: CREATE INDEX on table graph_14 column label ...
[2020-10-23 18:33:51 sqlstore]: ANALYZE INDEX on table graph_14 column label ...
id	node1	label	node2	rank	node2;wikidatatype
Q1-P31-Q36906466-q1$8983b0ea-4a9c-0902-c0db-785db33f767c-0	Q1	P31	Q36906466	normal	wikibase-item
Q100-P31-Q1093829-q100$3f4925a8-32d0-424f-b65a-4e3b5dbd07ec-0	Q100	P31	Q1093829	normal	wikibase-item
Q100-P31-Q1549591-q100$ad5b329b-43c9-f6d9-9d0b-a08c1f4f0abb-0	Q100	P31	Q1549591	normal	wikibase-item
Q100-P31-Q21518270-q100$5b85ea08-419d-51f3-81d2-c7d50fc935f3-0	Q100	P31	Q21518270	preferred	wikibase-item
Q1000-P31-Q179023-q1000$fd440406-4ef4-6bb3-f9ed-484f630a4f8c-0	Q1000	P31	Q179023	normal	wikibase-item
 2150.60 real 1114.30 user 483.40 sys


Force creation of the index on the node2 column

In [14]:
!$kgtk query -i "$EDGES" --graph-cache $STORE -o - \
 --match '(i)-[r]->(:Q5)' \
 --limit 5

[2020-10-23 18:35:49 query]: SQL Translation:
---------------------------------------------
 SELECT *
 FROM graph_14 AS graph_14_c1
 WHERE graph_14_c1."node2"=?
 LIMIT ?
 PARAS: ['Q5', 5]
---------------------------------------------
[2020-10-23 18:35:49 sqlstore]: CREATE INDEX on table graph_14 column node2 ...
[2020-10-23 19:33:50 sqlstore]: ANALYZE INDEX on table graph_14 column node2 ...
id	node1	label	node2	rank	node2;wikidatatype
Q10000001-P31-Q5-q10000001$63adfe23-2df9-477f-aa0a-62a68a9eab1d-0	Q10000001	P31	Q5	normal	wikibase-item
Q1000002-P31-Q5-q1000002$bfba0a52-a667-4574-9eb9-4efc08582259-0	Q1000002	P31	Q5	normal	wikibase-item
Q1000005-P31-Q5-q1000005$d7f256b6-91a1-4bdf-9e77-03798a7d0c36-0	Q1000005	P31	Q5	normal	wikibase-item
Q1000006-P31-Q5-q1000006$3f995cf5-520a-4ec5-99b7-987fd8c57a6a-0	Q1000006	P31	Q5	normal	wikibase-item
Q1000015-P31-Q5-q1000015$909acc6d-b7a7-43e2-8cd0-50eab8891d52-0	Q1000015	P31	Q5	normal	wikibase-item
 3648.45 real 1715.03 user 797.70 sys


### Count the number of edges

In [15]:
!$kgtk query -i "$EDGES" --graph-cache $STORE \
 --match 'all: ()-[r]->()' \
 --return 'count(r) as count' \
 --limit 10

[2020-10-23 19:36:38 query]: SQL Translation:
---------------------------------------------
 SELECT count(graph_14_c1."id") "count"
 FROM graph_14 AS graph_14_c1
 LIMIT ?
 PARAS: [10]
---------------------------------------------
count
1102827643
 726.36 real 103.43 user 171.80 sys


### Get the distribution of the label column
I would like to have it sorted numerically, but don't know how to make it happen

In [15]:
!$kgtk unique --column label -i "$EDGES" / sort2 -c node2 -r -o $OUT/all-distribution.tsv 

Traceback (most recent call last):
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/cli/unique.py", line 143, in run
 uniq.process()
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/join/unique.py", line 166, in process
 for row in kr:
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/io/kgtkreader.py", line 976, in __next__
 return self.nextrow()
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/io/kgtkreader.py", line 860, in nextrow
 line = next(self.source) # Will throw StopIteration
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/utils/closableiter.py", line 30, in __next__
 return self.s.__next__()
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/gzip.py", line 300, in read1
 return self._b

In [16]:
!head $OUT/all-distribution.tsv | column -t -s $'\t' 
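One way to get a numeric ordering is to sort the distribution file in Python after it has been written. The sketch below assumes the file has `node1`, `label` and `node2` columns with the count in `node2` (which the `sort2 -c node2` call above suggests, but which is not verified here); the sample rows are made up for illustration.

```python
import csv
import io


def sort_by_count(tsv_text, count_column="node2"):
    """Return the rows of a TSV distribution sorted by count, largest first."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    rows = list(reader)
    # Convert to int so that 1200 sorts above 95, unlike a string comparison.
    rows.sort(key=lambda row: int(row[count_column]), reverse=True)
    return rows


# Made-up sample in the assumed layout.
sample = (
    "node1\tlabel\tnode2\n"
    "P31\tcount\t95\n"
    "P279\tcount\t1200\n"
    "P106\tcount\t7\n"
)

for row in sort_by_count(sample):
    print(row["node1"], row["node2"])
```

For the real file, the text could be read with `open(path, encoding="utf-8").read()`; the distribution has only a few thousand rows, so loading it into memory is not a concern.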

### Compute files with labels, aliases and descriptions
Return the id, node1, label and node2 columns

In [17]:
!$kgtk query -i "$EDGES" --graph-cache $STORE -o $OUT/part.label.tsv.gz \
 --match '(n1)-[l:label]->(n2)' \
 --return 'l, n1, l.label, n2' 

[2020-10-23 20:04:12 query]: SQL Translation:
---------------------------------------------
 SELECT graph_14_c1."id", graph_14_c1."node1", graph_14_c1."label", graph_14_c1."node2"
 FROM graph_14 AS graph_14_c1
 WHERE graph_14_c1."label"=?
 PARAS: ['label']
---------------------------------------------
 1.12 real 0.58 user 0.22 sys


In [18]:
!$kgtk query -i "$EDGES" --graph-cache $STORE -o $OUT/part.alias.tsv.gz \
 --match '(n1)-[l:alias]->(n2)' \
 --return 'l, n1, l.label, n2'

[2020-10-23 02:14:59 sqlstore]: IMPORT graph directly into table graph_7 from /Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200803-v3/all.tsv.gz ...
Exception in thread background thread for pid 80140:
Traceback (most recent call last):
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/threading.py", line 926, in _bootstrap_inner
 self.run()
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/threading.py", line 870, in run
 self._target(*self._args, **self._kwargs)
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/sh-1.13.1-py3.7.egg/sh.py", line 1662, in wrap
 fn(*args, **kwargs)
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/sh-1.13.1-py3.7.egg/sh.py", line 2606, in background_thread
 handle_exit_code(exit_code)
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/sh-1.13.1-py3.7.egg/sh.py", line 2304, in fn
 return self.command.handle_command_exit_code(ex

In [19]:
!$kgtk query -i "$EDGES" --graph-cache $STORE -o $OUT/part.description.tsv.gz \
 --match '(n1)-[l:description]->(n2)' \
 --return 'l, n1, l.label, n2'

[2020-10-23 02:21:35 sqlstore]: IMPORT graph directly into table graph_8 from /Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200803-v3/all.tsv.gz ...
Exception in thread background thread for pid 80236:
Traceback (most recent call last):
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/threading.py", line 926, in _bootstrap_inner
 self.run()
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/threading.py", line 870, in run
 self._target(*self._args, **self._kwargs)
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/sh-1.13.1-py3.7.egg/sh.py", line 1662, in wrap
 fn(*args, **kwargs)
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/sh-1.13.1-py3.7.egg/sh.py", line 2606, in background_thread
 handle_exit_code(exit_code)
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/sh-1.13.1-py3.7.egg/sh.py", line 2304, in fn
 return self.command.handle_command_exit_code(ex

### Now create files with the English labels, aliases and descriptions

In [20]:
!$kgtk query -i $OUT/part.label.tsv.gz --graph-cache $STORE -o $OUT/part.label.en.tsv.gz \
 --match '()-[]->(n2)' \
 --where 'n2.kgtk_lqstring_lang_suffix = "en"' 

Traceback (most recent call last):
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/cli/query.py", line 148, in run
 index=options.get('index'))
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/query.py", line 180, in __init__
 store.add_graph(file)
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/sqlstore.py", line 565, in add_graph
 self.import_graph_data_via_import(table, file)
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/sqlstore.py", line 630, in import_graph_data_via_import
 if header.endswith('\r\n'):
TypeError: endswith first arg must be bytes or a tuple of bytes, not str

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/pyth

In [21]:
!$kgtk query -i $OUT/part.alias.tsv.gz --graph-cache $STORE -o $OUT/part.alias.en.tsv.gz \
 --match '()-[]->(n2)' \
 --where 'n2.kgtk_lqstring_lang_suffix = "en"'

Traceback (most recent call last):
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/cli/query.py", line 148, in run
 index=options.get('index'))
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/query.py", line 180, in __init__
 store.add_graph(file)
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/sqlstore.py", line 565, in add_graph
 self.import_graph_data_via_import(table, file)
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/sqlstore.py", line 630, in import_graph_data_via_import
 if header.endswith('\r\n'):
TypeError: endswith first arg must be bytes or a tuple of bytes, not str

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/pyth

In [22]:
!$kgtk query -i $OUT/part.description.tsv.gz --graph-cache $STORE -o $OUT/part.description.en.tsv.gz \
 --match '()-[]->(n2)' \
 --where 'n2.kgtk_lqstring_lang_suffix = "en"' 

Traceback (most recent call last):
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/cli/query.py", line 148, in run
 index=options.get('index'))
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/query.py", line 180, in __init__
 store.add_graph(file)
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/sqlstore.py", line 565, in add_graph
 self.import_graph_data_via_import(table, file)
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/sqlstore.py", line 630, in import_graph_data_via_import
 if header.endswith('\r\n'):
TypeError: endswith first arg must be bytes or a tuple of bytes, not str

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/pyth

Let's sample these files to see what they look like:

* we are getting all variants of English, but we really want `en` only
* the labels carry language tags; how do we output only the string without the tag?

In [23]:
!gzcat $OUT/part.label.en.tsv.gz | head | column -t -s $'\t' 
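Both questions can be handled in a post-processing step outside `kgtk`. The sketch below assumes that language-qualified strings have the shape `'text'@lang` or `'text'@lang-suffix`; the regular expression and the helper names are illustrative, not part of KGTK, and escaped quotes inside the text are left as they are.

```python
import re

# Assumed shape of a language-qualified string: 'text'@lang or 'text'@lang-suffix
LQ_STRING = re.compile(
    r"^'(?P<text>.*)'@(?P<lang>[A-Za-z]+)(?:-(?P<suffix>[A-Za-z0-9-]+))?$",
    re.DOTALL,
)


def split_lq_string(value):
    """Return (text, lang, suffix) or None if the value does not match the assumed shape."""
    match = LQ_STRING.match(value)
    if match is None:
        return None
    return match.group("text"), match.group("lang"), match.group("suffix")


def plain_english(value):
    """Return the bare text for values tagged exactly 'en', otherwise None."""
    parsed = split_lq_string(value)
    if parsed is None:
        return None
    text, lang, suffix = parsed
    if lang == "en" and suffix is None:
        return text
    return None


print(plain_english("'Douglas Adams'@en"))
print(plain_english("'colour'@en-gb"))
```

The first call prints `Douglas Adams`; the second prints `None` because `en-gb` is a regional variant rather than plain `en`.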

### Compute the distribution of the number of edges for each Wikidata type

In [24]:
!$kgtk unique --column 'node2;wikidatatype' -i "$EDGES" / sort2 -c node2 -r | gzip > $OUT/all.wikidatatype.distribution.tsv.gz

Traceback (most recent call last):
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/cli/unique.py", line 143, in run
 uniq.process()
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/join/unique.py", line 166, in process
 for row in kr:
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/io/kgtkreader.py", line 976, in __next__
 return self.nextrow()
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/io/kgtkreader.py", line 860, in nextrow
 line = next(self.source) # Will throw StopIteration
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/utils/closableiter.py", line 30, in __next__
 return self.s.__next__()
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/gzip.py", line 300, in read1
 return self._b

In [25]:
!gzcat $OUT/all.wikidatatype.distribution.tsv.gz | column -t -s $'\t' 

### Create a file to contain the edges for each wikidata type

In [None]:
types = [
 "time",
 "wikibase-item",
 "math",
 "wikibase-form",
 "quantity",
 "string",
 "external-id",
 "commonsMedia",
 "globe-coordinate",
 "monolingualtext",
 "musical-notation",
 "geo-shape",
 "wikibase-property",
 "url",
]

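# Template for the per-type query; TYPE_FILE and TYPE are placeholders filled in below.
command = "$kgtk query -i \"$EDGES\" --graph-cache $STORE -o $OUT/part.TYPE_FILE.tsv.gz \
 --match '(n1)-[l]->(n2 {wikidatatype: type})' \
 --return 'l, n1, l.label, n2'\
 --where 'type = \"TYPE\"'"
for wikidata_type in types:
 # Replace TYPE_FILE first: it contains TYPE as a prefix.
 cmd = command.replace("TYPE_FILE", wikidata_type)
 cmd = cmd.replace("TYPE", wikidata_type)

 print(cmd)
 os.system(cmd)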

$kgtk query -i "$EDGES" --graph-cache $STORE -o $OUT/part.time.tsv.gz --match '(n1)-[l]->(n2 {wikidatatype: type})' --return 'l, n1, l.label, n2' --where 'type = "time"'
$kgtk query -i "$EDGES" --graph-cache $STORE -o $OUT/part.wikibase-item.tsv.gz --match '(n1)-[l]->(n2 {wikidatatype: type})' --return 'l, n1, l.label, n2' --where 'type = "wikibase-item"'
$kgtk query -i "$EDGES" --graph-cache $STORE -o $OUT/part.math.tsv.gz --match '(n1)-[l]->(n2 {wikidatatype: type})' --return 'l, n1, l.label, n2' --where 'type = "math"'
$kgtk query -i "$EDGES" --graph-cache $STORE -o $OUT/part.wikibase-form.tsv.gz --match '(n1)-[l]->(n2 {wikidatatype: type})' --return 'l, n1, l.label, n2' --where 'type = "wikibase-form"'
$kgtk query -i "$EDGES" --graph-cache $STORE -o $OUT/part.quantity.tsv.gz --match '(n1)-[l]->(n2 {wikidatatype: type})' --return 'l, n1, l.label, n2' --where 'type = "quantity"'
$kgtk query -i "$EDGES" --graph-cache $STORE -o $OUT/part.string.tsv.gz --match '(n1)-[l]->(n2 {wikidataty
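The placeholder approach depends on replacing `TYPE_FILE` before `TYPE`. An alternative that avoids this ordering concern is to build each command with a small function. The sketch below is one way to do it; `type_partition_command` is a hypothetical helper that produces the same command text as the loop above.

```python
def type_partition_command(wikidata_type):
    """Return the kgtk query command that extracts the edges of one Wikidata type."""
    return (
        '$kgtk query -i "$EDGES" --graph-cache $STORE '
        f"-o $OUT/part.{wikidata_type}.tsv.gz "
        "--match '(n1)-[l]->(n2 {wikidatatype: type})' "
        "--return 'l, n1, l.label, n2' "
        f"--where 'type = \"{wikidata_type}\"'"
    )


for wikidata_type in ["time", "quantity"]:
    print(type_partition_command(wikidata_type))
```

Each returned string can be passed to `os.system` as before; the `$kgtk`, `$EDGES`, `$STORE` and `$OUT` variables are still expanded by the shell.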

### Create a file with the sitelinks

In [None]:
!$kgtk query -i "$EDGES" --graph-cache $STORE -o $OUT/part.wikipedia_sitelink.tsv.gz \
 --match '(n1)-[l:wikipedia_sitelink]->(n2)' \
 --return 'l, n1, l.label, n2' 

### Create a file that specifies for each node whether it is an item or a property

In [None]:
!$kgtk query -i "$EDGES" --graph-cache $STORE -o $OUT/part.type.tsv.gz \
 --match '(n1)-[l:type]->(n2)' \
 --return 'l, n1, l.label, n2' 

### Create the P31 and P279 files

In [None]:
!$kgtk query -i "$EDGES" --graph-cache $STORE -o $OUT/all.P31.tsv.gz \
 --match '(n1)-[l:P31]->(n2)' \
 --return 'l, n1, l.label, n2' 

In [None]:
!$kgtk query -i "$EDGES" --graph-cache $STORE -o $OUT/all.P279.tsv.gz \
 --match '(n1)-[l:P279]->(n2)' \
 --return 'l, n1, l.label, n2' 

In [None]:
!gzcat $OUT/all.P31.tsv.gz | head | column -t -s $'\t' 

In [None]:
!$kgtk cat -i $OUT/all.P279.tsv.gz -i $OUT/all.P31.tsv.gz -o $OUT/all.P31_P279.tsv.gz 

In [None]:
!gzcat $OUT/all.P31_P279.tsv.gz | head | column -t -s $'\t' 

### Create the file that contains all nodes reachable via P279 starting from a node2 in P31 or a node1 in P279

First compute the roots

In [None]:
!$kgtk query -i $OUT/all.P279.tsv.gz --graph-cache $STORE -o $TEMP/P279.n1.tsv.gz \
 --match '(n1)-[l]->()' \
 --return 'n1 as node, l as id' 

In [None]:
!$kgtk query -i $OUT/all.P31.tsv.gz --graph-cache $STORE -o $TEMP/P31.n2.tsv.gz \
 --match '()-[l]->(n2)' \
 --return 'n2 as node, l as id' 

In [None]:
!$kgtk cat --mode NONE -i $TEMP/P31.n2.tsv.gz $TEMP/P279.n1.tsv.gz \
 / compact --mode NONE --columns node \
 > $TEMP/P279.roots.tsv

Now we can invoke the reachable-nodes command

In [None]:
!head $TEMP/P279.roots.tsv

In [None]:
!$kgtk reachable-nodes \
 --mode NONE \
 --rootfile $TEMP/P279.roots.tsv \
 --rootfilecolumn 0 \
 --subj 1 --pred 2 --obj 3 \
 -i $OUT/all.P279.tsv.gz \
 | kgtk sort2 \
 | gzip > $TEMP/P279.reachable.tsv.gz

The reachable-nodes command produces edges labeled `reachable`, so we need one command to rename them.

In [36]:
!$kgtk query -i $TEMP/P279.reachable.tsv.gz --graph-cache $STORE -o $TEMP/P279star.1.tsv.gz \
 --match '(n1)-[]->(n2)' \
 --return 'n1, "P279star" as label, n2 as node2' 

[2020-10-22 02:28:48 sqlstore]: IMPORT graph directly into table graph_7 from /Users/pedroszekely/Downloads/kypher/temp.useful_wikidata_files_2/P279.reachable.tsv.gz ...
[2020-10-22 02:28:48 query]: SQL Translation:
---------------------------------------------
 SELECT graph_7_c1."node1", ? "label", graph_7_c1."node2" "node2"
 FROM graph_7 AS graph_7_c1
 PARAS: ['P279star']
---------------------------------------------
Traceback (most recent call last):
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/cli/query.py", line 155, in run
 result = query.execute()
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/query.py", line 677, in execute
 result = self.store.execute(query, params)
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/sqlstore.py", line 231, in execute
 return self.get_conn().execute(*args, **kwa

We also want `P279star` to be reflexive, i.e., to contain `(n1)-[:P279star]->(n1)` for every node1.

In [37]:
!$kgtk query -i $TEMP/P279.reachable.tsv.gz --graph-cache $STORE -o $TEMP/P279star.2.tsv.gz \
 --match '(n1)-[]->(n2)' \
 --return 'n1 as node1, "P279star" as label, n1 as node2' 

[2020-10-22 02:28:48 query]: SQL Translation:
---------------------------------------------
 SELECT graph_7_c1."node1" "node1", ? "label", graph_7_c1."node1" "node2"
 FROM graph_7 AS graph_7_c1
 PARAS: ['P279star']
---------------------------------------------
Traceback (most recent call last):
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/cli/query.py", line 155, in run
 result = query.execute()
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/query.py", line 677, in execute
 result = self.store.execute(query, params)
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/sqlstore.py", line 231, in execute
 return self.get_conn().execute(*args, **kwargs)
sqlite3.OperationalError: no such column: graph_7_c1.node1

During handling of the above exception, another exception occurred:

Traceback (most recent call 

In [38]:
!$kgtk query -i $TEMP/P279.reachable.tsv.gz --graph-cache $STORE -o $TEMP/P279star.3.tsv.gz \
 --match '(n1)-[]->(n2)' \
 --return 'n2 as node1, "P279star" as label, n2 as node2' 

[2020-10-22 02:28:49 query]: SQL Translation:
---------------------------------------------
 SELECT graph_7_c1."node2" "node1", ? "label", graph_7_c1."node2" "node2"
 FROM graph_7 AS graph_7_c1
 PARAS: ['P279star']
---------------------------------------------
Traceback (most recent call last):
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/cli/query.py", line 155, in run
 result = query.execute()
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/query.py", line 677, in execute
 result = self.store.execute(query, params)
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/sqlstore.py", line 231, in execute
 return self.get_conn().execute(*args, **kwargs)
sqlite3.OperationalError: no such column: graph_7_c1.node2

During handling of the above exception, another exception occurred:

Traceback (most recent call 

In [39]:
!$kgtk query -i $OUT/all.P31.tsv.gz --graph-cache $STORE -o $TEMP/P279star.4.tsv.gz \
 --match '(n1)-[]->(n2)' \
 --return 'n2 as node1, "P279star" as label, n2 as node2' 

[2020-10-22 02:28:50 query]: SQL Translation:
---------------------------------------------
 SELECT graph_6_c1."node2" "node1", ? "label", graph_6_c1."node2" "node2"
 FROM graph_6 AS graph_6_c1
 PARAS: ['P279star']
---------------------------------------------
 234.72 real 223.22 user 4.97 sys


Now we can concatenate these files to produce the final output

In [40]:
!$kgtk cat --mode NONE -i $TEMP/P279star.1.tsv.gz $TEMP/P279star.2.tsv.gz $TEMP/P279star.3.tsv.gz $TEMP/P279star.4.tsv.gz \
 | kgtk compact \
 | kgtk sort2 \
 | kgtk add-id --id-style node1-label-node2-num \
 | gzip > $OUT/all.P279star.tsv.gz

Traceback (most recent call last):
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/io/kgtkreader.py", line 718, in _build_column_names
 header: str = next(source).rstrip("\r\n")
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/utils/closableiter.py", line 30, in __next__
 return self.s.__next__()
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/cli/cat.py", line 152, in run
 kc.process()
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/join/kgtkcat.py", line 91, in process
 very_verbose=self.very_verbose,
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/io/kgtkreader.py", line 513, in open
 (h

This is difficult to test with our Wikidata subset because our hierarchy is very sparse.
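A toy hierarchy makes the intended semantics easier to check. The sketch below computes the reflexive transitive closure in plain Python; it illustrates what `P279star` should contain and is not how `reachable-nodes` is implemented. The node names are made up, and `extra_nodes` stands in for classes that appear only as node2 of a `P31` edge.

```python
from collections import defaultdict, deque


def p279_star(p279_edges, extra_nodes=()):
    """Return the reflexive transitive closure of subclass edges as (node1, node2) pairs."""
    parents = defaultdict(set)
    nodes = set(extra_nodes)
    for child, parent in p279_edges:
        parents[child].add(parent)
        nodes.update((child, parent))

    closure = set()
    for start in nodes:
        closure.add((start, start))  # reflexive pair
        queue = deque([start])
        seen = {start}  # guards against cycles in the hierarchy
        while queue:
            current = queue.popleft()
            for parent in parents.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    closure.add((start, parent))
                    queue.append(parent)
    return closure


toy_p279 = [("pilsner", "lager"), ("lager", "beer")]
closure = p279_star(toy_p279, extra_nodes=["wine"])
for pair in sorted(closure):
    print(pair)
```

The result has seven pairs: three starting at `pilsner`, two at `lager`, and the reflexive pairs for `beer` and `wine`.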

This is how we would express the typical `?item P31/P279* ?class` pattern in Kypher.
The example shows how to count all the `n1` that are instances of subclasses of beer (Q44).

In [41]:
!$kgtk query -i $OUT/all.P31.tsv.gz -i $OUT/all.P279star.tsv.gz --graph-cache $STORE -o - \
 --match 'P31: (n1)-[:P31]->(c), P279star: (c)-[]->(:Q44)' \
 --return 'count(n1) as count'

Traceback (most recent call last):
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/cli/query.py", line 148, in run
 index=options.get('index'))
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/query.py", line 180, in __init__
 store.add_graph(file)
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/sqlstore.py", line 565, in add_graph
 self.import_graph_data_via_import(table, file)
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/sqlstore.py", line 630, in import_graph_data_via_import
 if header.endswith('\r\n'):
TypeError: endswith first arg must be bytes or a tuple of bytes, not str

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/pyth
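The join that this query expresses can be illustrated on toy data. The sketch below is an in-memory analogue, assuming `P279star` is reflexive as described earlier; `instances_of` and the node names are made up. It returns distinct items, which can differ from `count(n1)` when an item has several matching classes.

```python
def instances_of(p31_edges, p279star_pairs, target_class):
    """Return the items whose class reaches target_class through P279star."""
    # Classes c with (c)-[:P279star]->(target_class); the target itself is
    # included because the closure is assumed to be reflexive.
    classes = {sub for sub, sup in p279star_pairs if sup == target_class}
    return {item for item, cls in p31_edges if cls in classes}


toy_p31 = [("bottle1", "pilsner"), ("bottle2", "beer"), ("bottle3", "wine")]
toy_p279star = [
    ("pilsner", "pilsner"), ("pilsner", "lager"), ("pilsner", "beer"),
    ("lager", "lager"), ("lager", "beer"),
    ("beer", "beer"),
    ("wine", "wine"),
]

beers = instances_of(toy_p31, toy_p279star, "beer")
print(sorted(beers))
```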

### Create a file to do generalized Is-A queries
The idea is that `(n1)-[:isa]->(n2)` holds when `(n1)-[:P31]->(n2)` or `(n1)-[:P279]->(n2)`.

We do this by concatenating the files and renaming the relation

In [42]:
!$kgtk cat -i $OUT/all.P31.tsv.gz $OUT/all.P279.tsv.gz \
 | gzip > $TEMP/isa.1.tsv.gz

 406.55 real 404.76 user 1.41 sys


In [43]:
!$kgtk query -i $TEMP/isa.1.tsv.gz --graph-cache $STORE -o $OUT/all.isa.tsv.gz \
 --match '(n1)-[]->(n2)' \
 --return 'n1, "isa" as label, n2' 

[2020-10-22 02:39:33 sqlstore]: IMPORT graph directly into table graph_8 from /Users/pedroszekely/Downloads/kypher/temp.useful_wikidata_files_2/isa.1.tsv.gz ...
[2020-10-22 02:43:46 query]: SQL Translation:
---------------------------------------------
 SELECT graph_8_c1."node1", ? "label", graph_8_c1."node2"
 FROM graph_8 AS graph_8_c1
 PARAS: ['isa']
---------------------------------------------
 600.27 real 757.12 user 17.87 sys


Example of how to use the `isa` relation

In [44]:
!$kgtk query -i $OUT/all.isa.tsv.gz -i $OUT/all.P279star.tsv.gz --graph-cache $STORE -o - \
 --match 'isa: (n1)-[l:isa]->(c), P279star: (c)-[]->(:Q44)' \
 --return 'distinct n1, l.label, "Q44" as node2' \
 --limit 10

[2020-10-22 02:49:34 sqlstore]: IMPORT graph directly into table graph_9 from /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/all.isa.tsv.gz ...
Traceback (most recent call last):
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/cli/query.py", line 148, in run
 index=options.get('index'))
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/query.py", line 180, in __init__
 store.add_graph(file)
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/sqlstore.py", line 565, in add_graph
 self.import_graph_data_via_import(table, file)
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/sqlstore.py", line 630, in import_graph_data_via_import
 if header.endswith('\r\n'):
TypeError: endswith first arg must be bytes or a tuple of bytes, not str

Durin

### Creating a subset of Wikidata without scholarly articles (Q13442814)
First create a file with the scholarly articles

In [45]:
!$kgtk query -i $OUT/all.isa.tsv.gz -i $OUT/all.P279star.tsv.gz --graph-cache $STORE -o $OUT/all.isa.Q13442814.tsv.gz \
 --match 'isa: (n1)-[l:isa]->(n2:Q13442814)' \
 --return 'distinct n1, l.label, n2'

Traceback (most recent call last):
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/cli/query.py", line 148, in run
 index=options.get('index'))
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/query.py", line 180, in __init__
 store.add_graph(file)
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/sqlstore.py", line 565, in add_graph
 self.import_graph_data_via_import(table, file)
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/sqlstore.py", line 630, in import_graph_data_via_import
 if header.endswith('\r\n'):
TypeError: endswith first arg must be bytes or a tuple of bytes, not str

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
 File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/pyth

Now we need to remove from `$EDGES` every edge whose node1 or node2 appears as a node1 in `$OUT/all.isa.Q13442814.tsv.gz`. The result will be `$OUT/minus.Q13442814.tsv`. We can then rerun the whole notebook with this new file as `$EDGES` and compute all the product files in a new output directory.

In [46]:
!gzcat $OUT/all.isa.Q13442814.tsv.gz | head | column -t -s $'\t' 
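The subtraction step is not implemented in this notebook. One possible approach is a streaming filter in Python, sketched below. It assumes both files are gzipped TSVs with `node1` and `node2` header columns, and that the set of excluded nodes fits in memory, which may not hold for the full set of scholarly articles. The helper names and the toy rows are made up for illustration.

```python
import gzip
import os
import tempfile


def read_node1_set(path):
    """Collect the node1 values of a gzipped KGTK edge file."""
    with gzip.open(path, "rt", encoding="utf-8") as source:
        columns = source.readline().rstrip("\n").split("\t")
        index = columns.index("node1")
        return {line.rstrip("\n").split("\t")[index] for line in source}


def remove_nodes(edges_path, excluded, minus_path):
    """Copy edges whose node1 and node2 are both outside `excluded`; return how many were kept."""
    kept = 0
    with gzip.open(edges_path, "rt", encoding="utf-8") as source, \
         gzip.open(minus_path, "wt", encoding="utf-8") as target:
        header = source.readline()
        columns = header.rstrip("\n").split("\t")
        node1_index = columns.index("node1")
        node2_index = columns.index("node2")
        target.write(header)
        for line in source:
            fields = line.rstrip("\n").split("\t")
            if fields[node1_index] in excluded or fields[node2_index] in excluded:
                continue
            target.write(line)
            kept += 1
    return kept


def write_gz(path, text):
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        handle.write(text)


# Toy demonstration: A1 plays the role of a scholarly article.
workdir = tempfile.mkdtemp()
edges_path = os.path.join(workdir, "edges.tsv.gz")
articles_path = os.path.join(workdir, "articles.tsv.gz")
minus_path = os.path.join(workdir, "minus.tsv.gz")
write_gz(edges_path,
         "id\tnode1\tlabel\tnode2\n"
         "e1\tA1\tP31\tQ13442814\n"
         "e2\tA1\tP50\tH1\n"
         "e3\tH1\tP31\tQ5\n"
         "e4\tH2\tP800\tA1\n")
write_gz(articles_path, "node1\tlabel\tnode2\nA1\tisa\tQ13442814\n")

excluded = read_node1_set(articles_path)
kept = remove_nodes(edges_path, excluded, minus_path)
print(kept)
```

Only edge `e3` survives in the toy data, because every other edge has `A1` as node1 or node2.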

## Summary

In [47]:
!wc -l $OUT/*.tsv $OUT/*.tsv.gz "$EDGES"

 7479 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/all-distribution.tsv
 45941 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/all.P279.tsv.gz
 0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/all.P279star.tsv.gz
 2206506 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/all.P31.tsv.gz
 2254104 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/all.P31_P279.tsv.gz
 0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/all.isa.Q13442814.tsv.gz
 1208961 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/all.isa.tsv.gz
 0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/all.wikidatatype.distribution.tsv.gz
 374344 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.alias.en.tsv.gz
 383165 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.alias.tsv.gz
 0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.commonsMedia.tsv.gz
 2253868 /Users/pedroszekely
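Keep in mind that `wc -l` on a `.gz` file counts newline bytes in the compressed stream, so the numbers reported for the `.tsv.gz` files do not correspond to row counts. A minimal sketch for counting the decompressed lines follows; `count_lines` is a hypothetical helper and the demonstration file is made up.

```python
import gzip
import os
import tempfile


def count_lines(path):
    """Count the lines of a plain or gzipped text file, header included."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as handle:
        return sum(1 for _ in handle)


# Demonstration on a small temporary file: one header line and two edges.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.tsv.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as handle:
    handle.write("node1\tlabel\tnode2\nQ1\tP31\tQ2\nQ3\tP279\tQ4\n")

line_count = count_lines(demo_path)
print(line_count)
```

Subtract one from the result to get the number of edges, since the first line is the header.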

Number of distinct items in our dataset

In [48]:
!$kgtk query -i "$EDGES" --graph-cache $STORE -o - \
 --match '(n1)-[]->()' \
 --return 'count(distinct n1) as count'

[2020-10-22 02:52:30 query]: SQL Translation:
---------------------------------------------
 SELECT count(DISTINCT graph_1_c1."node1") "count"
 FROM graph_1 AS graph_1_c1
 PARAS: []
---------------------------------------------
count
88228944
 1364.75 real 1000.96 user 122.75 sys
