{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Generating Useful Wikidata Files" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Batch Invocation\n", "Example batch command. The second argument is a notebook where the output will be stored. You can load it to see progress.\n", "\n", "```\n", "papermill Example7\\ -\\ Wikidata\\ Outputs.ipynb example7.out.ipynb \\\n", "-p home /Users/pedroszekely/Downloads/kypher \\\n", "-p wiki_file all.10.tsv.gz \\\n", "-p output_folder output.all.10 \\\n", "-p temp_folder temp.all.10 \\\n", "-p delete_database true \n", "```" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "tags": [ "parameters" ] }, "outputs": [], "source": [ "# Parameters\n", "home = \"/Users/pedroszekely/Downloads/kypher\"\n", "wiki_file = \"/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200803-v3/all.tsv.gz\"\n", "wiki_file = \"/Volumes/GoogleDrive/Shared drives/KGTK-public-graphs/wikidata-20200803-v3/all.tsv.gz\"\n", "label_file = \"/Volumes/GoogleDrive/Shared drives/KGTK-public-graphs/wikidata-20200803-v3/part.label.en.tsv.gz\"\n", "output_path = \"/Users/pedroszekely/Downloads/kypher\"\n", "cache_path = \"/Users/pedroszekely/Downloads/kypher/temp.useful_wikidata_files_v3\"\n", "output_folder = \"useful_wikidata_files_v3\"\n", "temp_folder = \"temp.useful_wikidata_files_v3\"\n", "delete_database = \"no\"" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import io\n", "import os\n", "import subprocess\n", "import sys\n", "\n", "import numpy as np\n", "import pandas as pd\n", "\n", "# from IPython.display import display, HTML, Image\n", "# from pandas_profiling import ProfileReport" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Set up environment and folders to store the files\n", "\n", "- `OUT` folder where the output files go\n", "- `TEMP` folder to keep temporary files , including the database\n", "- `kgtk` shortcut to invoke the kgtk software\n", "\n", "The current implementation of some of the kgtk commands does not understand compressed files. In particular, `query` often rejects `gz` files.\n", "\n", "To dos:\n", "\n", "- Make sure that all files have id columns as `query` gets unhappy when files have no ids.\n", "- Create an output folder for a subset of Wikidata without scholarly articles. This is half done: the remaining work is to subtract the scholarly articles from `EDGES` and repeat the workflow.\n", "- Change the naming convention to make it clear which files are a partition of the original `EDGES`, so users know what files they need to get to have a full version.\n", "- Create a qualifier file for the partition files of Wikidata: this is so that if a user gets one of the partitions, they can get the corresponding qualifier file.\n", "- Add pagerank and other stats. We can compute the pagerank from the `all.item` file, so maybe should be called `all.item.pagerank.tsv`\n", "\n", "Naming convention: the name `all` is redundant, we should consider removing it. I recomment using the prefix `part.` to name the partition of Wikidata, e.g., `part.label`, `part.quantity`. Files such as `P279` are not partitions as it is a subset of `part.item`.\n", "\n", "If we create a subset of Wikidata, e.g., no scholarly articles, we could call it `minus.Q13442814`; if we remove galaxies too, we could call it `minus.Q13442814-Q318`, so the files would be `minus.Q13442814-Q318.part.quantity.tsv` (the idea of `all` is in contrast to `minus`). We can also have files that start with Qnodes, e.g, `Q5.part.quantity.tsv`; constructing such files is harder as we don't want dangling nodes in the item file." ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "os.environ['OUT'] = \"{}/{}\".format(output_path, output_folder)\n", "os.environ['TEMP'] = \"{}/{}\".format(output_path, temp_folder)\n", "os.environ['kgtk'] = \"kgtk\"\n", "os.environ['kgtk'] = \"time kgtk --debug\"\n", "os.environ['EDGES'] = wiki_file\n", "os.environ['LABELS'] = label_file\n", "if cache_path:\n", " os.environ['STORE'] = \"{}/wikidata.sqlite3.db\".format(cache_path)\n", "else:\n", " os.environ['STORE'] = \"{}/{}/wikidata.sqlite3.db\".format(output_path, temp_folder)" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/Users/pedroszekely/Downloads/kypher/useful_wikidata_files_v3\n", "/Users/pedroszekely/Downloads/kypher/temp.useful_wikidata_files_v3\n", "time kgtk --debug\n", "/Volumes/GoogleDrive/Shared drives/KGTK-public-graphs/wikidata-20200803-v3/all.tsv.gz\n", "/Volumes/GoogleDrive/Shared drives/KGTK-public-graphs/wikidata-20200803-v3/part.label.en.tsv.gz\n", "/Users/pedroszekely/Downloads/kypher/temp.useful_wikidata_files_v3/wikidata.sqlite3.db\n" ] } ], "source": [ "!echo $OUT\n", "!echo $TEMP\n", "!echo $kgtk\n", "!echo $EDGES\n", "!echo $LABELS\n", "!echo $STORE" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/Users/pedroszekely/Downloads/kypher\n" ] } ], "source": [ "cd $output_path" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "mkdir: /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_v3: File exists\n", "mkdir: /Users/pedroszekely/Downloads/kypher/temp.useful_wikidata_files_v3: File exists\n" ] } ], "source": [ "!mkdir $OUT\n", "!mkdir $TEMP" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Clean up the output and temp folders before we start" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# !rm $OUT/*.tsv $OUT/*.tsv.gz\n", "# !rm $TEMP/*.tsv $TEMP/*.tsv.gz" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "if delete_database and delete_database != \"no\":\n", " print(\"Deleted database\")\n", " !rm $STORE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Uncomment the line below to remove the sqllite2 database. It takes a long time to load all the data and create indices, so don't remove the database unless you change files that have already been loaded and you need to force a reload." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Get a sample and force importing the edge file into the database" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2020-11-04 21:29:51 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT *\n", " FROM graph_1 AS graph_1_c1\n", " LIMIT ?\n", " PARAS: [10]\n", "---------------------------------------------\n", "id\tnode1\tlabel\tnode2\trank\tnode2;wikidatatype\n", "Q1-P1036-418bc4-78f5a565-0\tQ1\tP1036\t\"113\"\tnormal\texternal-id\n", "Q1-P1036-b98c08-1dc98be9-0\tQ1\tP1036\t\"523.1\"\tnormal\texternal-id\n", "Q1-P1051-d70eb1-60991f20-0\tQ1\tP1051\t\"517\"\tnormal\texternal-id\n", "Q1-P1245-ee25a9-46be09ed-0\tQ1\tP1245\t\"8506\"\tnormal\texternal-id\n", "Q1-P1256-8da0ce-af30f4e9-0\tQ1\tP1256\t\"51A11\"\tnormal\texternal-id\n", "Q1-P1296-f73b4e-4d0c1e5d-0\tQ1\tP1296\t\"0216407\"\tnormal\texternal-id\n", "Q1-P1343-Q19190511-ab132b87-0\tQ1\tP1343\tQ19190511\tnormal\twikibase-item\n", "Q1-P1343-Q2041543-4ed8a129-0\tQ1\tP1343\tQ2041543\tnormal\twikibase-item\n", "Q1-P1343-Q602358-12bf99e2-0\tQ1\tP1343\tQ602358\tnormal\twikibase-item\n", "Q1-P1343-Q88672152-5080b9e2-0\tQ1\tP1343\tQ88672152\tnormal\twikibase-item\n", " 0.97 real 0.57 user 0.12 sys\n" ] } ], "source": [ "!$kgtk query -i \"$EDGES\" --limit 10 --graph-cache $STORE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Force creation of the index on the label column" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2020-11-04 00:23:41 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT *\n", " FROM graph_1 AS graph_1_c1\n", " WHERE graph_1_c1.\"label\"=?\n", " LIMIT ?\n", " PARAS: ['P31', 5]\n", "---------------------------------------------\n", "[2020-11-04 00:23:41 sqlstore]: CREATE INDEX on table graph_1 column label ...\n", "[2020-11-04 00:55:13 sqlstore]: ANALYZE INDEX on table graph_1 column label ...\n", "id\tnode1\tlabel\tnode2\trank\tnode2;wikidatatype\n", "Q1-P31-Q36906466-ae5f57ac-0\tQ1\tP31\tQ36906466\tnormal\twikibase-item\n", "Q100-P31-Q1093829-d2e8fc52-0\tQ100\tP31\tQ1093829\tnormal\twikibase-item\n", "Q100-P31-Q1549591-56524ffd-0\tQ100\tP31\tQ1549591\tnormal\twikibase-item\n", "Q100-P31-Q21518270-a4c0505d-0\tQ100\tP31\tQ21518270\tpreferred\twikibase-item\n", "Q1000-P31-Q179023-e86c0605-0\tQ1000\tP31\tQ179023\tnormal\twikibase-item\n", " 2014.94 real 1094.60 user 460.64 sys\n" ] } ], "source": [ "!$kgtk query -i \"$EDGES\" --graph-cache $STORE -o - \\\n", " --match '(i)-[:P31]->(c)' \\\n", " --limit 5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Force creation of the index on the node2 column" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2020-11-04 14:23:52 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT *\n", " FROM graph_1 AS graph_1_c1\n", " WHERE graph_1_c1.\"node2\"=?\n", " LIMIT ?\n", " PARAS: ['Q5', 5]\n", "---------------------------------------------\n", "id\tnode1\tlabel\tnode2\trank\tnode2;wikidatatype\n", "Q10000001-P31-Q5-cc1c4199-0\tQ10000001\tP31\tQ5\tnormal\twikibase-item\n", "Q1000002-P31-Q5-5c9914ea-0\tQ1000002\tP31\tQ5\tnormal\twikibase-item\n", "Q1000005-P31-Q5-4d5e2a2b-0\tQ1000005\tP31\tQ5\tnormal\twikibase-item\n", "Q1000006-P31-Q5-38728290-0\tQ1000006\tP31\tQ5\tnormal\twikibase-item\n", "Q1000015-P31-Q5-81519a6c-0\tQ1000015\tP31\tQ5\tnormal\twikibase-item\n", " 0.91 real 0.59 user 0.17 sys\n" ] } ], "source": [ "!$kgtk query -i \"$EDGES\" --graph-cache $STORE -o - \\\n", " --match '(i)-[r]->(:Q5)' \\\n", " --limit 5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Count the number of edges" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2020-11-04 07:08:13 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT count(graph_1_c1.\"id\") \"count\"\n", " FROM graph_1 AS graph_1_c1\n", " LIMIT ?\n", " PARAS: [10]\n", "---------------------------------------------\n", "count\n", "1102827643\n", " 632.62 real 97.92 user 153.08 sys\n" ] } ], "source": [ "!$kgtk query -i \"$EDGES\" --graph-cache $STORE \\\n", " --match 'all: ()-[r]->()' \\\n", " --return 'count(r) as count' \\\n", " --limit 10" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create a file that specifies for each node whether it is an item or a property" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "!$kgtk query -i \"$EDGES\" --graph-cache $STORE -o $OUT/part.type.tsv.gz \\\n", " --match '(n1)-[l:type]->(n2)' \\\n", " --return 'l, n1, l.label, n2' " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create the P31 and P279 files" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2020-11-04 07:18:46 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT graph_1_c1.\"id\", graph_1_c1.\"node1\", graph_1_c1.\"label\", graph_1_c1.\"node2\"\n", " FROM graph_1 AS graph_1_c1\n", " WHERE graph_1_c1.\"label\"=?\n", " PARAS: ['P31']\n", "---------------------------------------------\n", " 1676.81 real 940.66 user 250.77 sys\n" ] } ], "source": [ "!$kgtk query -i \"$EDGES\" --graph-cache $STORE -o $OUT/all.P31.tsv.gz \\\n", " --match '(n1)-[l:P31]->(n2)' \\\n", " --return 'l, n1, l.label, n2' " ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2020-11-04 07:46:43 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT graph_1_c1.\"id\", graph_1_c1.\"node1\", graph_1_c1.\"label\", graph_1_c1.\"node2\"\n", " FROM graph_1 AS graph_1_c1\n", " WHERE graph_1_c1.\"label\"=?\n", " PARAS: ['P279']\n", "---------------------------------------------\n", " 106.45 real 38.45 user 18.43 sys\n" ] } ], "source": [ "!$kgtk query -i \"$EDGES\" --graph-cache $STORE -o $OUT/all.P279.tsv.gz \\\n", " --match '(n1)-[l:P279]->(n2)' \\\n", " --return 'l, n1, l.label, n2' " ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "gzcat: error writing to output: Broken pipe\n", "gzcat: /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_v3/all.P31.tsv.gz: uncompress failed\n", "id node1 label node2\n", "Q1-P31-Q36906466-ae5f57ac-0 Q1 P31 Q36906466\n", "Q100-P31-Q1093829-d2e8fc52-0 Q100 P31 Q1093829\n", "Q100-P31-Q1549591-56524ffd-0 Q100 P31 Q1549591\n", "Q100-P31-Q21518270-a4c0505d-0 Q100 P31 Q21518270\n", "Q1000-P31-Q179023-e86c0605-0 Q1000 P31 Q179023\n", "Q1000-P31-Q3624078-3aade447-0 Q1000 P31 Q3624078\n", "Q1000-P31-Q6256-bbaa8a03-0 Q1000 P31 Q6256\n", "Q10000-P31-Q10876391-e428a782-0 Q10000 P31 Q10876391\n", "Q100000-P31-Q1852859-3665587f-0 Q100000 P31 Q1852859\n" ] } ], "source": [ "!gzcat $OUT/all.P31.tsv.gz | head | column -t -s $'\\t' " ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 1023.85 real 1008.10 user 7.25 sys\n" ] } ], "source": [ "!$kgtk cat -i $OUT/all.P279.tsv.gz -i $OUT/all.P31.tsv.gz -o $OUT/all.P31_P279.tsv.gz " ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "gzcat: id node1 label node2\n", "error writing to outputQ1000032-P279-Q1813494-0aa0f1dc-0 Q1000032 P279 Q1813494\n", ": Q1000032-P279-Q83602-482a1943-0 Q1000032 P279 Q83602\n", "Broken pipe\n", "Q1000039-P279-Q11555767-2dddfd86-0 Q1000039 P279 Q11555767\n", "Q1000064-P279-Q11016-0ab23344-0 Q1000064 P279 Q11016\n", "Q1000084-P279-Q159810-7852671b-0 Q1000084 P279 Q159810\n", "Q1000108-P279-Q849640-02ec6f84-0 Q1000108 P279 Q849640\n", "Q1000116-P279-Q179692-f8d04cc2-0 Q1000116 P279 Q179692\n", "Q1000120-P279-Q12510-3e6da659-0 Q1000120 P279 Q12510\n", "gzcat: Q1000120-P279-Q181296-f30f7109-0 Q1000120 P279 Q181296\n", "/Users/pedroszekely/Downloads/kypher/useful_wikidata_files_v3/all.P31_P279.tsv.gz: uncompress failed\n" ] } ], "source": [ "!gzcat $OUT/all.P31_P279.tsv | head | column -t -s $'\\t' " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create the file that contains all nodes reachable via P279 starting from a node2 in P31 or a node1 in P279" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First compute the roots" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2020-11-04 20:51:51 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT graph_2_c1.\"node1\" \"id\"\n", " FROM graph_2 AS graph_2_c1\n", " PARAS: []\n", "---------------------------------------------\n", " 16.80 real 15.43 user 0.57 sys\n" ] } ], "source": [ "!$kgtk query -i $OUT/all.P279.tsv.gz --graph-cache $STORE -o $TEMP/P279.n1.tsv.gz \\\n", " --match '(n1)-[l]->()' \\\n", " --return 'n1 as id' " ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2020-11-04 20:52:07 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT graph_3_c1.\"node2\" \"id\"\n", " FROM graph_3 AS graph_3_c1\n", " PARAS: []\n", "---------------------------------------------\n", " 175.55 real 154.56 user 11.03 sys\n" ] } ], "source": [ "!$kgtk query -i $OUT/all.P31.tsv.gz --graph-cache $STORE -o $TEMP/P31.n2.tsv.gz \\\n", " --match '()-[l]->(n2)' \\\n", " --return 'n2 as id' " ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 348.65 real 346.10 user 1.47 sys\n" ] } ], "source": [ "!$kgtk cat --mode NONE -i $TEMP/P31.n2.tsv.gz $TEMP/P279.n1.tsv.gz \\\n", " | gzip > $TEMP/P279.roots.1.tsv.gz" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 70.85 real 56.25 user 12.59 sys\n" ] } ], "source": [ "!$kgtk sort2 --mode NONE --column id -i $TEMP/P279.roots.1.tsv.gz \\\n", " | gzip > $TEMP/P279.roots.2.tsv.gz" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id\n", "Q1\n", "Q1\n", "Q1000032\n", "Q1000032\n", "Q1000039\n", "Q1000064\n", "Q1000084\n", "Q1000108\n", "Q1000116\n", "zcat: error writing to output: Broken pipe\n" ] } ], "source": [ "!zcat < $TEMP/P279.roots.2.tsv.gz | head" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 382.41 real 380.34 user 1.00 sys\n" ] } ], "source": [ "!$kgtk compact -i $TEMP/P279.roots.2.tsv.gz --mode NONE \\\n", " --presorted \\\n", " --columns id \\\n", "> $TEMP/P279.roots.tsv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can invoke the reachable-nodes command" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 5839.30 real 3373.77 user 2374.71 sys\n" ] } ], "source": [ "!$kgtk reachable-nodes \\\n", " --rootfile $TEMP/P279.roots.tsv \\\n", " --selflink \\\n", " -i $OUT/all.P279.tsv.gz \\\n", " | gzip > $TEMP/P279.reachable.tsv.gz" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "node1\tlabel\tnode2\n", "Q1000032\treachable\tQ1000032\n", "Q1000032\treachable\tQ1813494\n", "Q1000032\treachable\tQ1799072\n", "Q1000032\treachable\tQ16686448\n", "Q1000032\treachable\tQ35120\n", "Q1000032\treachable\tnovalue\n", "Q1000032\treachable\tQ2695280\n", "Q1000032\treachable\tQ1914636\n", "Q1000032\treachable\tQ20937557\n", "zcat: error writing to output: Broken pipe\n" ] } ], "source": [ "!zcat < $TEMP/P279.reachable.tsv.gz | head" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The reachable-nodes command produces edges labeled `reachable`, so we need one command to rename them." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2020-11-05 09:25:49 sqlstore]: DROP graph data table graph_4 from /Users/pedroszekely/Downloads/kypher/temp.useful_wikidata_files_v3/P279.reachable.tsv.gz\n", "[2020-11-05 09:25:49 sqlstore]: IMPORT graph directly into table graph_4 from /Users/pedroszekely/Downloads/kypher/temp.useful_wikidata_files_v3/P279.reachable.tsv.gz ...\n", "[2020-11-05 09:29:12 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT graph_4_c1.\"node1\", ? \"label\", graph_4_c1.\"node2\" \"node2\"\n", " FROM graph_4 AS graph_4_c1\n", " PARAS: ['P279star']\n", "---------------------------------------------\n", " 847.92 real 955.62 user 28.23 sys\n" ] } ], "source": [ "!$kgtk query -i $TEMP/P279.reachable.tsv.gz --graph-cache $STORE -o $TEMP/P279star.1.tsv.gz \\\n", " --match '(n1)-[]->(n2)' \\\n", " --return 'n1, \"P279star\" as label, n2 as node2' " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can concatenate these files to produce the final output" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 284.59 real 242.05 user 65.20 sys\n" ] } ], "source": [ "!$kgtk sort2 -i $TEMP/P279star.1.tsv.gz -o $TEMP/P279star.2.tsv.gz" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 1371.31 real 1362.32 user 3.93 sys\n" ] } ], "source": [ "!$kgtk compact --presorted -i $TEMP/P279star.2.tsv.gz -o $TEMP/P279star.3.tsv.gz" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 1336.10 real 1284.93 user 28.65 sys\n" ] } ], "source": [ "!$kgtk add-id --id-style node1-label-node2-num -i $TEMP/P279star.3.tsv.gz -o $OUT/all.P279star.tsv.gz" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "node1\tlabel\tnode2\tid\n", "Q1000032\tP279star\tQ1000032\tQ1000032-P279star-Q1000032-0000\n", "Q1000032\tP279star\tQ1150070\tQ1000032-P279star-Q1150070-0000\n", "Q1000032\tP279star\tQ1190554\tQ1000032-P279star-Q1190554-0000\n", "Q1000032\tP279star\tQ133500\tQ1000032-P279star-Q133500-0000\n", "Q1000032\tP279star\tQ13878858\tQ1000032-P279star-Q13878858-0000\n", "Q1000032\tP279star\tQ14819853\tQ1000032-P279star-Q14819853-0000\n", "Q1000032\tP279star\tQ14912053\tQ1000032-P279star-Q14912053-0000\n", "Q1000032\tP279star\tQ16686448\tQ1000032-P279star-Q16686448-0000\n", "Q1000032\tP279star\tQ16722960\tQ1000032-P279star-Q16722960-0000\n", "zcat: error writing to output: Broken pipe\n" ] } ], "source": [ "!zcat < $OUT/all.P279star.tsv.gz | head" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is how we would do the typical `?item P31/P279* ?class` in Kypher. \n", "The example shows how to get all the `n1` that are instances of subclasses of beer (q44)." ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2020-11-05 19:26:57 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT DISTINCT graph_3_c1.\"node1\", graph_5_c2.\"node1\" \"class\", count(graph_5_c2.\"node1\") \"count\", graph_7_c3.\"node2\" \"n1_label\", graph_7_c4.\"node2\" \"class name\"\n", " FROM graph_3 AS graph_3_c1, graph_5 AS graph_5_c2, graph_7 AS graph_7_c3, graph_7 AS graph_7_c4\n", " WHERE graph_3_c1.\"label\"=?\n", " AND graph_5_c2.\"node2\"=?\n", " AND graph_7_c3.\"label\"=?\n", " AND graph_7_c4.\"label\"=?\n", " AND graph_3_c1.\"node1\"=graph_7_c3.\"node1\"\n", " AND graph_3_c1.\"node2\"=graph_5_c2.\"node1\"\n", " AND graph_3_c1.\"node2\"=graph_7_c4.\"node1\"\n", " GROUP BY graph_3_c1.\"node1\", class\n", " ORDER BY graph_3_c1.\"node1\" ASC, graph_5_c2.\"node1\" DESC, count(graph_3_c1.\"node1\") DESC\n", " LIMIT ?\n", " PARAS: ['P31', 'Q44', 'label', 'label', 10]\n", "---------------------------------------------\n", " 1.84 real 0.65 user 0.35 sys\n", "node1 class count n1_label class name\n", "Q1000737 Q15075508 1 'Marthabräu'@en 'beer brand'@en\n", "Q1017471 Q44 1 'Bush'@en 'beer'@en\n", "Q1017471 Q15075508 1 'Bush'@en 'beer brand'@en\n", "Q1026242 Q44 1 'Calanda Bräu'@en 'beer'@en\n", "Q10304159 Q44 1 'Itaipava'@en 'beer'@en\n", "Q10350781 Q15075508 1 'Polar'@en 'beer brand'@en\n", "Q1035257 Q44 1 'Cara Pils'@en 'beer'@en\n", "Q1041750 Q44 1 'Carling'@en 'beer'@en\n", "Q10487716 Q253481 1 'Eriksberg'@en 'lager'@en\n", "Q10577358 Q44 1 'Mariestads'@en 'beer'@en\n" ] } ], "source": [ "!$kgtk query -i $OUT/all.P31.tsv.gz -i $OUT/all.P279star.tsv.gz -i \"$LABELS\" --graph-cache $STORE \\\n", "--match 'P31: (n1)-[:P31]->(c), P279star: (c)-[]->(:Q44), label: (n1)-[:label]->(label), label: (c)-[:label]->(c_label)' \\\n", "--return 'distinct n1, c as class, count(c) as count, label as n1_label, c_label as `class name`' \\\n", "--order-by 'n1, c desc, count(n1) desc' \\\n", "--limit 10 \\\n", "| column -t -s $'\\t' " ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2020-11-05 19:29:23 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT *\n", " FROM graph_5 AS graph_5_c1\n", " WHERE graph_5_c1.\"node2\"=?\n", " LIMIT ?\n", " PARAS: ['Q44', 10]\n", "---------------------------------------------\n", " 0.77 real 0.61 user 0.14 sys\n", "node1 label node2 id\n", "Q10262008 P279star Q44 Q10262008-P279star-Q44-0000\n", "Q10313616 P279star Q44 Q10313616-P279star-Q44-0000\n", "Q10314719 P279star Q44 Q10314719-P279star-Q44-0000\n", "Q10317863 P279star Q44 Q10317863-P279star-Q44-0000\n", "Q10340754 P279star Q44 Q10340754-P279star-Q44-0000\n", "Q10350781 P279star Q44 Q10350781-P279star-Q44-0000\n", "Q10355535 P279star Q44 Q10355535-P279star-Q44-0000\n", "Q10382024 P279star Q44 Q10382024-P279star-Q44-0000\n", "Q10461865 P279star Q44 Q10461865-P279star-Q44-0000\n", "Q1046968 P279star Q44 Q1046968-P279star-Q44-0000\n" ] } ], "source": [ "!$kgtk query -i $OUT/all.P279star.tsv.gz -i \"$LABELS\" --graph-cache $STORE \\\n", "--match 'star: (n1)-[]->(c:Q44)' \\\n", "--limit 10 \\\n", "| column -t -s $'\\t' " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create a file to do generalized Is-A queries\n", "The idea is that `(n1)-[:isa]->(n2)` when `(n1)-[:P31]->(n2)` or `(n1)-[:P279]->(n2)`\n", "\n", "We do this by concatenating the files and renaming the relation" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 538.04 real 519.06 user 9.69 sys\n" ] } ], "source": [ "!$kgtk cat -i $OUT/all.P31.tsv.gz $OUT/all.P279.tsv.gz \\\n", " | gzip > $TEMP/isa.1.tsv.gz" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2020-11-05 14:26:22 sqlstore]: IMPORT graph directly into table graph_6 from /Users/pedroszekely/Downloads/kypher/temp.useful_wikidata_files_v3/isa.1.tsv.gz ...\n", "[2020-11-05 14:33:10 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT graph_6_c1.\"node1\", ? \"label\", graph_6_c1.\"node2\"\n", " FROM graph_6 AS graph_6_c1\n", " PARAS: ['isa']\n", "---------------------------------------------\n", " 835.15 real 1061.27 user 43.66 sys\n" ] } ], "source": [ "!$kgtk query -i $TEMP/isa.1.tsv.gz --graph-cache $STORE -o $OUT/all.isa.tsv.gz \\\n", " --match '(n1)-[]->(n2)' \\\n", " --return 'n1, \"isa\" as label, n2' " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Example of how to use the `isa` relation" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2020-11-05 20:07:32 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT DISTINCT graph_8_c1.\"node1\", graph_8_c1.\"label\", ? \"node2\", graph_7_c3.\"node2\" \"n1_label\"\n", " FROM graph_5 AS graph_5_c2, graph_7 AS graph_7_c3, graph_8 AS graph_8_c1\n", " WHERE graph_5_c2.\"node2\"=?\n", " AND graph_7_c3.\"label\"=?\n", " AND graph_8_c1.\"label\"=?\n", " AND graph_5_c2.\"node1\"=graph_8_c1.\"node2\"\n", " AND graph_7_c3.\"node1\"=graph_8_c1.\"node1\"\n", " LIMIT ?\n", " PARAS: ['Q44', 'Q44', 'label', 'isa', 10]\n", "---------------------------------------------\n", "[2020-11-05 20:07:32 sqlstore]: CREATE INDEX on table graph_8 column node1 ...\n", "[2020-11-05 20:08:27 sqlstore]: ANALYZE INDEX on table graph_8 column node1 ...\n", "node1\tlabel\tnode2\tn1_label\n", "Q15875298\tisa\tQ44\t'Floreffe'@en\n", "Q1917255\tisa\tQ44\t'St-Idesbald'@en\n", "Q2004062\tisa\tQ44\t'Sancti Adalberti'@en\n", "Q2006077\tisa\tQ44\t'Bonne-Espérance abbey'@en\n", "Q2272636\tisa\tQ44\t'Ename beer'@en\n", "Q2290730\tisa\tQ44\t'Ter Dolen (beer)'@en\n", "Q3625571\tisa\tQ44\t'Herkenrode Tripel'@en\n", "Q505815\tisa\tQ44\t'Trappist beer'@en\n", "Q747216\tisa\tQ44\t'Bornem'@en\n", "Q15985396\tisa\tQ44\t'Crabbelaer'@en\n", " 69.00 real 40.73 user 10.87 sys\n" ] } ], "source": [ "!$kgtk query -i $OUT/all.isa.tsv.gz -i $OUT/all.P279star.tsv.gz -i \"$LABELS\" --graph-cache $STORE -o - \\\n", " --match 'isa: (n1)-[l:isa]->(c), P279star: (c)-[]->(:Q44), label: (n1)-[:label]->(label)' \\\n", " --return 'distinct n1, l.label, \"Q44\" as node2, label as n1_label' \\\n", " --limit 10" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creating a subset of Wikidata without scholarly articles (Q13442814)\n", "First create a file with the schloarly articles" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2020-11-06 10:01:39 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT DISTINCT graph_8_c1.\"node1\", graph_8_c1.\"label\", graph_8_c1.\"node2\" \"node2\"\n", " FROM graph_8 AS graph_8_c1\n", " WHERE graph_8_c1.\"label\"=?\n", " AND (graph_8_c1.\"node2\" IN (?, ?, ?, ?, ?, ?, ?, ?))\n", " PARAS: ['isa', 'Q13442814', 'Q523', 'Q16521', 'Q318', 'Q7318358', 'Q7187', 'Q11173', 'Q8054']\n", "---------------------------------------------\n", " 411.40 real 279.05 user 46.43 sys\n" ] } ], "source": [ "!$kgtk query -i $OUT/all.isa.tsv.gz -i $OUT/all.P279star.tsv.gz --graph-cache $STORE -o $OUT/all.isa.Q13442814.Q523.Q16521.Q318.Q7318358.Q7187.Q11173.Q8054.tsv.gz \\\n", "--match 'isa: (n1)-[l:isa]->(class)' \\\n", "--where 'class in [\"Q13442814\", \"Q523\", \"Q16521\", \"Q318\", \"Q7318358\", \"Q7187\", \"Q11173\", \"Q8054\"]' \\\n", "--return 'distinct n1, l.label, class as node2' " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!$kypher -i $NKG \\\n", "--match '(n1)-[r {label: label}]->(n2), (r)-[:source]->(source)' \\\n", "--where 'source = $s and n1 =~ \".*/catch/.*\"' \\\n", "--return 'distinct n1 as node1' \\\n", "--spara s='CN' \\\n", "--limit 100" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 35933570 107800710 862405612\n" ] } ], "source": [ "!gzcat $OUT/all.isa.Q13442814.tsv | wc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we need to remove from `$EDGES` any edge where node1 or node2 is in node1 of `$OUT/all.isa.Q13442814.tsv`. The result will be `$OUT/minus.Q13442814.tsv`. We can then run the whole notebook with this new file as $EDGES and compute all the product files in a new output directory" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2020-11-05 20:45:35 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT DISTINCT graph_1_c2.\"node1\" \"node1\", graph_1_c2.\"label\" \"label\", graph_1_c2.\"node2\" \"node2\", graph_1_c2.\"id\" \"id\"\n", " FROM graph_1 AS graph_1_c2, graph_9 AS graph_9_c1\n", " WHERE ((graph_9_c1.\"node1\" != graph_1_c2.\"node1\") OR (graph_9_c1.\"node1\" != graph_1_c2.\"node2\"))\n", " PARAS: []\n", "---------------------------------------------\n", "^C\n" ] } ], "source": [ "!$kgtk query -i $OUT/all.isa.Q13442814.tsv.gz -i \"$EDGES\" --graph-cache $STORE \\\n", "--match 'isa: (n1)-[]->(), all: (n2)-[l]->(n3)' \\\n", "--where 'n1 != n2 or n1 != n3' \\\n", "--return 'distinct n2 as node1, l.label as label, n3 as node2, l as id' \\\n", "-o $OUT/all.minus.Q13442814.tsv.gz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 7479 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/all-distribution.tsv\n", " 45941 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/all.P279.tsv.gz\n", " 0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/all.P279star.tsv.gz\n", " 2206506 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/all.P31.tsv.gz\n", " 2254104 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/all.P31_P279.tsv.gz\n", " 0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/all.isa.Q13442814.tsv.gz\n", " 1208961 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/all.isa.tsv.gz\n", " 0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/all.wikidatatype.distribution.tsv.gz\n", " 374344 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.alias.en.tsv.gz\n", " 383165 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.alias.tsv.gz\n", " 0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.commonsMedia.tsv.gz\n", " 2253868 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.description.en.tsv.gz\n", " 3688188 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.description.tsv.gz\n", " 0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.external-id.tsv.gz\n", " 0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.geo-shape.tsv.gz\n", " 0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.globe-coordinate.tsv.gz\n", " 7895945 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.label.en.tsv.gz\n", " 8034944 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.label.tsv.gz\n", " 0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.math.tsv.gz\n", " 0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.monolingualtext.tsv.gz\n", " 0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.musical-notation.tsv.gz\n", " 0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.quantity.tsv.gz\n", " 0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.string.tsv.gz\n", " 0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.time.tsv.gz\n", " 1351961 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.type.tsv.gz\n", " 0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.url.tsv.gz\n", " 0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.wikibase-form.tsv.gz\n", " 0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.wikibase-item.tsv.gz\n", " 0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.wikibase-property.tsv.gz\n", " 3404579 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.wikipedia_sitelink.tsv.gz\n", " 65610166 /Users/pedroszekely/Downloads/kypher/almost.all.edges.sorted.tsv.gz\n", " 98720151 total\n" ] } ], "source": [ "!wc -l $OUT/*.tsv $OUT/*.tsv.gz $EDGES" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Number of distinct items in our dataset" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2020-10-22 02:52:30 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT count(DISTINCT graph_1_c1.\"node1\") \"count\"\n", " FROM graph_1 AS graph_1_c1\n", " PARAS: []\n", "---------------------------------------------\n", "count\n", "88228944\n", " 1364.75 real 1000.96 user 122.75 sys\n" ] } ], "source": [ "!$kgtk query -i $EDGES --graph-cache $STORE -o - \\\n", " --match '(n1)-[]->()' \\\n", " --return 'count(distinct n1) as count'" ] } ], "metadata": { "celltoolbar": "Tags", "kernelspec": { "display_name": "kgtk", "language": "python", "name": "kgtk" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.8" } }, "nbformat": 4, "nbformat_minor": 4 }