{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Example Scenario 3: Reachable occupations for selected people in Wikidata\n", "\n", "*Carol would like to combine two subsets of Wikidata: one containing all subclass relations, and the other containing occuppations for several notable people. The combined file needs to be sorted by subject, after which she would compute the set of reachable nodes for those people via the properties `occupation (P106)` or `subclass of (P279)`.*\n", "\n", "**Note on the expected running time:** Running this notebook takes around 1.5 hours on a Macbook Pro laptop with MacOS Catalina 10.15, a 2.3 GHz 8-Core Intel Core i9 processor, 2TB SSD disk, and 64 GB 2667 MHz DDR4 memory." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preparation (same as in Example 2)\n", "\n", "To run this notebook, Carol would need the Wikidata edges file. We will work with version `20200405` of Wikidata. Presumably, this file is not present on Carol's laptop, so we need to download and unpack it first:\n", "* please download the file [here](https://drive.google.com/file/d/1WIQIYXJC1IdSlPchtqz0NDr2zEEOz-Hb/view?usp=sharing)\n", "* unpack it by running : `gunzip wikidata_edges_20200504.tsv.gz`\n", "\n", "You are all set!\n", "\n", "*Note*: Here we assume that the Wikidata file has already been transformed to KGTK format from Wikidata's `json.bz2` dump. This can be done with the following KGTK command (for demonstration purposes, we will skip this command, as its execution takes around 11 hours): `kgtk import-wikidata -i wikidata-20200504-all.json.bz2 --node wikidata_nodes_20200504.tsv --edge wikidata_edges_20200504.tsv -qual wikidata_qualifiers_20200504.tsv`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Implementation in KGTK\n", "\n", "First, Carol needs to extract the two subsets with the `filter` operation:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "kgtk filter -p ' ; P279 ; ' -i wikidata_edges_20200504.tsv > subclass.tsv" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "kgtk filter -p ' Q8023,Q483203,Q1426 ; P106 ; ' -i wikidata_edges_20200504.tsv > people.tsv\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, she can merge the two files into one, sort that file, and generate the set of reachable nodes for the three nodes of interest." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "kgtk cat -i people.tsv subclass.tsv / sort -c \"node1\" > cat.tsv" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "kgtk reachable-nodes --subj 1 --pred 2 --obj 3 --props P106,P279 --root \"Q8023,Q483203,Q1426\" -i cat.tsv > reachable.tsv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now inspect the output:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "node1\tlabel\tnode2\n", "Q1426\treachable\tQ10833314\n", "Q1426\treachable\tQ2066131\n", "Q1426\treachable\tQ50995749\n", "Q1426\treachable\tQ215627\n", "Q1426\treachable\tQ830077\n", "Q1426\treachable\tQ35120\n", "Q1426\treachable\tQ24229398\n", "Q1426\treachable\tQ23958946\n", "Q1426\treachable\tQ18336849\n", "Q1426\treachable\tQ795052\n", "Q1426\treachable\tQ18536342\n", "Q1426\treachable\tQ4197743\n", "Q483203\treachable\tQ33999\n", "Q483203\treachable\tQ483501\n", "Q483203\treachable\tQ2500638\n", "Q483203\treachable\tQ215627\n", "Q483203\treachable\tQ830077\n", "Q483203\treachable\tQ35120\n", "Q483203\treachable\tQ24229398\n", "Q483203\treachable\tQ23958946\n", "Q483203\treachable\tQ18336849\n", "Q483203\treachable\tQ795052\n", "Q483203\treachable\tQ488205\n", "Q483203\treachable\tQ177220\n", "Q483203\treachable\tQ2643890\n", "Q483203\treachable\tQ639669\n", "Q483203\treachable\tQ753110\n", "Q483203\treachable\tQ28771895\n", "Q483203\treachable\tQ36834\n", "Q483203\treachable\tQ482980\n", "Q483203\treachable\tQ584301\n", "Q483203\treachable\tQ1278335\n", "Q483203\treachable\tQ855091\n", "Q483203\treachable\tQ21166956\n", "Q483203\treachable\tQ10800557\n", "Q483203\treachable\tQ183945\n", "Q483203\treachable\tQ13235160\n", "Q483203\treachable\tQ131524\n", "Q483203\treachable\tQ43845\n", "Q483203\treachable\tQ702269\n", "Q483203\treachable\tQ327055\n", "Q483203\treachable\tQ128124\n", "Q483203\treachable\tQ81096\n", "Q8023\treachable\tQ82955\n", "Q8023\treachable\tQ702269\n", "Q8023\treachable\tQ327055\n", "Q8023\treachable\tQ215627\n", "Q8023\treachable\tQ830077\n", "Q8023\treachable\tQ35120\n", "Q8023\treachable\tQ24229398\n", "Q8023\treachable\tQ23958946\n", "Q8023\treachable\tQ18336849\n", "Q8023\treachable\tQ795052\n", "Q8023\treachable\tQ18814623\n", "Q8023\treachable\tQ864380\n", "Q8023\treachable\tQ15980158\n", "Q8023\treachable\tQ36180\n", "Q8023\treachable\tQ482980\n", "Q8023\treachable\tQ2500638\n", "Q8023\treachable\tQ40348\n", "Q8023\treachable\tQ185351\n", "Q8023\treachable\tQ57735705\n", "Q8023\treachable\tQ11499147\n", "Q8023\treachable\tQ15253558\n" ] } ], "source": [ "%%bash\n", "cat reachable.tsv" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 4 }