# Example Scenario 3: Reachable occupations for selected people in Wikidata

*Carol would like to combine two subsets of Wikidata: one containing all subclass relations, and the other containing occupations for several notable people. The combined file needs to be sorted by subject, after which she will compute the set of nodes reachable from those people via the properties `occupation (P106)` or `subclass of (P279)`.*

**Note on the expected running time:** Running this notebook takes around 1.5 hours on a MacBook Pro laptop with macOS Catalina 10.15, a 2.3 GHz 8-core Intel Core i9 processor, a 2 TB SSD, and 64 GB of 2667 MHz DDR4 memory.

## Preparation (same as in Example 2)

To run this notebook, Carol needs the Wikidata edges file. We will work with version `20200504` of Wikidata. Presumably, this file is not present on Carol's laptop, so we need to download and unpack it first:
* Download the file [here](https://drive.google.com/file/d/1WIQIYXJC1IdSlPchtqz0NDr2zEEOz-Hb/view?usp=sharing).
* Unpack it by running: `gunzip wikidata_edges_20200504.tsv.gz`

You are all set!

*Note*: Here we assume that the Wikidata file has already been transformed from Wikidata's `json.bz2` dump to KGTK format. This can be done with the following KGTK command (we skip it in this demonstration, as its execution takes around 11 hours): `kgtk import-wikidata -i wikidata-20200504-all.json.bz2 --node wikidata_nodes_20200504.tsv --edge wikidata_edges_20200504.tsv --qual wikidata_qualifiers_20200504.tsv`

## Implementation in KGTK

First, Carol needs to extract the two subsets with the `filter` operation:

In [1]:
%%bash
kgtk filter -p ' ; P279 ; ' -i wikidata_edges_20200504.tsv > subclass.tsv

In [2]:
%%bash
kgtk filter -p ' Q8023,Q483203,Q1426 ; P106 ; ' -i wikidata_edges_20200504.tsv > people.tsv
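
To make the pattern syntax concrete, here is a minimal Python sketch of how a `subject ; predicate ; object` pattern can be read: each field is a comma-separated list of allowed values, and an empty field matches anything. The helper `filter_edges` and the toy edges (with placeholder identifiers such as `Q_A`) are assumptions made for illustration; this is not KGTK's implementation.

```python
def filter_edges(edges, pattern):
    """Keep (node1, label, node2) edges matching a 'subj ; pred ; obj' pattern.

    Hypothetical helper: each field is a comma-separated list of allowed
    values, and an empty field matches any value.
    """
    fields = [part.strip() for part in pattern.split(";")]
    if len(fields) != 3:
        raise ValueError("pattern must have exactly three ';'-separated fields")
    # None means "no constraint" for that position.
    allowed = [
        {value.strip() for value in field.split(",")} if field else None
        for field in fields
    ]
    return [
        edge for edge in edges
        if all(a is None or value in a for a, value in zip(allowed, edge))
    ]


# Toy edges with placeholder objects (Q_A, Q_B, ...), not real Wikidata content.
edges = [
    ("Q1426", "P106", "Q_A"),
    ("Q1426", "P31", "Q_Z"),
    ("Q_A", "P279", "Q_B"),
    ("Q_OTHER", "P106", "Q_A"),
]

subclass = filter_edges(edges, " ; P279 ; ")
people = filter_edges(edges, " Q8023,Q483203,Q1426 ; P106 ; ")
print(subclass)
print(people)
```

On this toy input, `subclass` keeps only the `P279` edge, and `people` keeps only the `P106` edge whose subject is one of the three listed nodes.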


Then, she can merge the two files into one, sort that file, and generate the set of reachable nodes for the three nodes of interest.

In [3]:
%%bash
kgtk cat -i people.tsv subclass.tsv / sort -c "node1" > cat.tsv
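
The `/` in the command above chains the two KGTK operations, so the output of `cat` feeds directly into `sort`. As a rough illustration of the combined effect, the following Python sketch concatenates TSV inputs that share a header and sorts the rows by the `node1` column. The function `cat_and_sort` and the toy data are hypothetical, and the sketch uses plain string ordering, which may differ from the ordering KGTK produces.

```python
def cat_and_sort(tsv_texts, sort_column="node1"):
    """Concatenate TSV texts that share one header, then sort by a column.

    Hypothetical helper for illustration; assumes every input has the
    same header line and uses plain string comparison for sorting.
    """
    header, rows = None, []
    for text in tsv_texts:
        lines = text.strip("\n").split("\n")
        file_header = lines[0].split("\t")
        if header is None:
            header = file_header
        elif file_header != header:
            raise ValueError("all inputs must share the same header")
        rows.extend(line.split("\t") for line in lines[1:] if line)
    key_index = header.index(sort_column)
    rows.sort(key=lambda row: row[key_index])
    return header, rows


# Toy inputs with placeholder identifiers, not real Wikidata content.
people_tsv = "node1\tlabel\tnode2\nQ1426\tP106\tQ_A\n"
subclass_tsv = "node1\tlabel\tnode2\nQ_B\tP279\tQ_C\nQ_A\tP279\tQ_B\n"

header, rows = cat_and_sort([people_tsv, subclass_tsv])
print("\t".join(header))
for row in rows:
    print("\t".join(row))
```

The header is emitted once, and all edges with the same subject end up on adjacent lines.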

In [4]:
%%bash
kgtk reachable-nodes --subj 1 --pred 2 --obj 3 --props P106,P279 --root "Q8023,Q483203,Q1426" -i cat.tsv > reachable.tsv
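
Conceptually, this step is a graph traversal: starting from each root, follow only edges whose label is `P106` or `P279` and record every node encountered. The breadth-first search below is a minimal sketch of that idea under stated assumptions; the function `reachable_nodes` and the toy edges are hypothetical, and the way KGTK computes the result internally may differ.

```python
from collections import defaultdict, deque


def reachable_nodes(edges, roots, props):
    """Return (root, 'reachable', node) triples via breadth-first search.

    Hypothetical helper: only edges whose label is in `props` are
    followed, and the root itself is not reported.
    """
    adjacency = defaultdict(list)
    for node1, label, node2 in edges:
        if label in props:
            adjacency[node1].append(node2)

    result = []
    for root in roots:
        seen = {root}
        queue = deque([root])
        while queue:
            current = queue.popleft()
            for neighbor in adjacency.get(current, ()):
                if neighbor not in seen:  # guards against cycles
                    seen.add(neighbor)
                    queue.append(neighbor)
                    result.append((root, "reachable", neighbor))
    return result


# Toy edges with placeholder identifiers, including a cycle and an
# edge (P31) that must be ignored. Not real Wikidata content.
edges = [
    ("Q1426", "P106", "Q_A"),
    ("Q_A", "P279", "Q_B"),
    ("Q_B", "P279", "Q_C"),
    ("Q_C", "P279", "Q_A"),
    ("Q_A", "P31", "Q_Z"),
]

result = reachable_nodes(edges, roots=["Q1426"], props={"P106", "P279"})
for triple in result:
    print("\t".join(triple))
```

Tracking visited nodes per root keeps the traversal finite even when the subclass hierarchy contains cycles.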

We can now inspect the output:

In [5]:
%%bash
cat reachable.tsv

node1	label	node2
Q1426	reachable	Q10833314
Q1426	reachable	Q2066131
Q1426	reachable	Q50995749
Q1426	reachable	Q215627
Q1426	reachable	Q830077
Q1426	reachable	Q35120
Q1426	reachable	Q24229398
Q1426	reachable	Q23958946
Q1426	reachable	Q18336849
Q1426	reachable	Q795052
Q1426	reachable	Q18536342
Q1426	reachable	Q4197743
Q483203	reachable	Q33999
Q483203	reachable	Q483501
Q483203	reachable	Q2500638
Q483203	reachable	Q215627
Q483203	reachable	Q830077
Q483203	reachable	Q35120
Q483203	reachable	Q24229398
Q483203	reachable	Q23958946
Q483203	reachable	Q18336849
Q483203	reachable	Q795052
Q483203	reachable	Q488205
Q483203	reachable	Q177220
Q483203	reachable	Q2643890
Q483203	reachable	Q639669
Q483203	reachable	Q753110
Q483203	reachable	Q28771895
Q483203	reachable	Q36834
Q483203	reachable	Q482980
Q483203	reachable	Q584301
Q483203	reachable	Q1278335
Q483203	reachable	Q855091
Q483203	reachable	Q21166956
Q483203	reachable	Q10800557
Q483203	reachable	Q183945
Q483203	reachable	Q13235160
Q483203	reachable