{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Generating Useful Wikidata Files" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Batch Invocation\n", "Example batch command. The second argument is a notebook where the output will be stored. You can load it to see progress.\n", "\n", "```\n", "papermill Example7\\ -\\ Wikidata\\ Outputs.ipynb example7.out.ipynb \\\n", "-p home /Users/pedroszekely/Downloads/kypher \\\n", "-p wiki_file all.10.tsv.gz \\\n", "-p output_folder output.all.10 \\\n", "-p temp_folder temp.all.10 \\\n", "-p delete_database true \n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "parameters" ] }, "outputs": [], "source": [ "# Parameters\n", "home = \"/Users/pedroszekely/Downloads/kypher\"\n", "wiki_file = \"all.tsv.gz\"\n", "wiki_file = \"all.10.tsv.gz\"\n", "output_folder = \"output\"\n", "output_folder = \"output.all.10\"\n", "temp_folder = \"temp\"\n", "temp_folder = \"temp.all.10\"\n", "delete_database = \"true\"" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import io\n", "import os\n", "import subprocess\n", "import sys\n", "\n", "import numpy as np\n", "import pandas as pd\n", "\n", "# from IPython.display import display, HTML, Image\n", "# from pandas_profiling import ProfileReport" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Set up environment and folders to store the files\n", "\n", "- `WIKIDATA_HOME` folder where you put your Wikidata data\n", "- `OUT` folder where the output files go\n", "- `TEMP` folder to keep temporary files , including the database\n", "- `kgtk` shortcut to invoke the kgtk software\n", "\n", "The current implementation of some of the kgtk commands does not understand compressed files. In particular, `query` often rejects `gz` files.\n", "\n", "To dos:\n", "\n", "- Make sure that all files have id columns as `query` gets unhappy when files have no ids.\n", "- Create an output folder for a subset of Wikidata without scholarly articles. This is half done: the remaining work is to subtract the scholarly articles from `EDGES` and repeat the workflow.\n", "- Change the naming convention to make it clear which files are a partition of the original `EDGES`, so users know what files they need to get to have a full version.\n", "- Create a qualifier file for the partition files of Wikidata: this is so that if a user gets one of the partitions, they can get the corresponding qualifier file.\n", "- Add pagerank and other stats. We can compute the pagerank from the `all.item` file, so maybe should be called `all.item.pagerank.tsv`\n", "\n", "Naming convention: the name `all` is redundant, we should consider removing it. I recomment using the prefix `part.` to name the partition of Wikidata, e.g., `part.label`, `part.quantity`. Files such as `P279` are not partitions as it is a subset of `part.item`.\n", "\n", "If we create a subset of Wikidata, e.g., no scholarly articles, we could call it `minus.Q13442814`; if we remove galaxies too, we could call it `minus.Q13442814-Q318`, so the files would be `minus.Q13442814-Q318.part.quantity.tsv` (the idea of `all` is in contrast to `minus`). We can also have files that start with Qnodes, e.g, `Q5.part.quantity.tsv`; constructing such files is harder as we don't want dangling nodes in the item file." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "os.environ['WIKIDATA_HOME'] = home\n", "os.environ['OUT'] = \"{}/{}\".format(os.environ['WIKIDATA_HOME'], output_folder)\n", "os.environ['TEMP'] = \"{}/{}\".format(os.environ['WIKIDATA_HOME'], temp_folder)\n", "os.environ['kgtk'] = \"kgtk\"\n", "os.environ['kgtk'] = \"time kgtk --debug\"\n", "wikidata_home = os.environ['WIKIDATA_HOME']" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/Users/pedroszekely/Downloads/kypher\n", "/Users/pedroszekely/Downloads/kypher/output.all.10\n", "/Users/pedroszekely/Downloads/kypher/temp.all.10\n", "time kgtk --debug\n" ] } ], "source": [ "!echo $WIKIDATA_HOME\n", "!echo $OUT\n", "!echo $TEMP\n", "!echo $kgtk" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/Users/pedroszekely/Downloads/kypher\n" ] } ], "source": [ "cd $wikidata_home" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "mkdir: /Users/pedroszekely/Downloads/kypher/output.all.10: File exists\n" ] } ], "source": [ "!mkdir $OUT\n", "!mkdir $TEMP" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Clean up the output and temp folders before we start" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "rm: /Users/pedroszekely/Downloads/kypher/temp.all.10/*.tsv: No such file or directory\n", "rm: /Users/pedroszekely/Downloads/kypher/temp.all.10/*.tsv.gz: No such file or directory\n" ] } ], "source": [ "!rm $OUT/*.tsv $OUT/*.tsv.gz\n", "!rm $TEMP/*.tsv $TEMP/*.tsv.gz" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Deleting database\n", "rm: /Users/pedroszekely/Downloads/kypher/temp.all.10/wikidata.sqlite3.db: No such file or directory\n" ] } ], "source": [ "if delete_database:\n", " print(\"Deleting database\")\n", " !rm $TEMP/wikidata.sqlite3.db" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `all` file contains 100M edges of the full dump, `all.10` contains 10M edges. This is for testing, as we should run on the full edges file." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "env: STORE=/Users/pedroszekely/Downloads/kypher/temp/wikidata.sqlite3.db\n", "env: EDGES=/Users/pedroszekely/Downloads/kypher/all.10.tsv.gz\n" ] } ], "source": [ "%env STORE=$wikidata_home/temp/wikidata.sqlite3.db\n", "# %env EDGES=$wikidata_home/all.10.tsv\n", "%env EDGES=$wikidata_home/$wiki_file\n", "\n", "#%env QUALS=$wikidata_home/wikidata-20200803-all-qualifiers.tsv.gz\n", "#%env LABELS=$wikidata_home/wikidata-20200803-all-labels-en-sorted.tsv.gz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Uncomment the line below to remove the sqllite2 database. It takes a long time to load all the data and create indices, so don't remove the database unless you change files that have already been loaded and you need to force a reload." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Get a sample and force importing the edge file into the database" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2020-09-30 11:21:25 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT *\n", " FROM graph_8 AS graph_8_c1\n", " LIMIT ?\n", " PARAS: [10]\n", "---------------------------------------------\n", "id\tnode1\tlabel\tnode2\trank\tnode2;magnitude\tnode2;unit\tnode2;date\tnode2;item\tnode2;lower\tnode2;upper\tnode2;latitude\tnode2;longitude\tnode2;precision\tnode2;calendar\tnode2;entity-type\tnode2;wikidatatype\n", "Q45-label-en\tQ45\tlabel\t'Portugal'@en\t\t\t\t\t\t\t\t\t\t\t\t\t\n", "Q45-label-fr\tQ45\tlabel\t'Portugal'@fr\t\t\t\t\t\t\t\t\t\t\t\t\t\n", "Q45-label-nb\tQ45\tlabel\t'Portugal'@nb\t\t\t\t\t\t\t\t\t\t\t\t\t\n", "Q45-label-it\tQ45\tlabel\t'Portogallo'@it\t\t\t\t\t\t\t\t\t\t\t\t\t\n", "Q45-label-ru\tQ45\tlabel\t'Португалия'@ru\t\t\t\t\t\t\t\t\t\t\t\t\t\n", "Q45-label-nl\tQ45\tlabel\t'Portugal'@nl\t\t\t\t\t\t\t\t\t\t\t\t\t\n", "Q45-label-es\tQ45\tlabel\t'Portugal'@es\t\t\t\t\t\t\t\t\t\t\t\t\t\n", "Q45-label-de\tQ45\tlabel\t'Portugal'@de\t\t\t\t\t\t\t\t\t\t\t\t\t\n", "Q45-label-pl\tQ45\tlabel\t'Portugalia'@pl\t\t\t\t\t\t\t\t\t\t\t\t\t\n", "Q45-label-be-tarask\tQ45\tlabel\t'Партугалія'@be-tarask\t\t\t\t\t\t\t\t\t\t\t\t\t\n", " 2.22 real 0.60 user 0.16 sys\n" ] } ], "source": [ "!$kgtk query -i $EDGES --limit 10 --graph-cache $STORE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Force creation of the index on the label column" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2020-09-30 11:21:26 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT *\n", " FROM graph_8 AS graph_8_c1\n", " WHERE graph_8_c1.\"label\"=?\n", " LIMIT ?\n", " PARAS: ['P31', 5]\n", "---------------------------------------------\n", "id\tnode1\tlabel\tnode2\trank\tnode2;magnitude\tnode2;unit\tnode2;date\tnode2;item\tnode2;lower\tnode2;upper\tnode2;latitude\tnode2;longitude\tnode2;precision\tnode2;calendar\tnode2;entity-type\tnode2;wikidatatype\n", "Q45-P31-1\tQ45\tP31\tQ3624078\tnormal\t\t\t\tQ3624078\t\t\t\t\t\t\titem\twikibase-item\n", "Q45-P31-2\tQ45\tP31\tQ6256\tnormal\t\t\t\tQ6256\t\t\t\t\t\t\titem\twikibase-item\n", "Q45-P31-3\tQ45\tP31\tQ20181813\tnormal\t\t\t\tQ20181813\t\t\t\t\t\t\titem\twikibase-item\n", "Q140-P31-1\tQ140\tP31\tQ16521\tnormal\t\t\t\tQ16521\t\t\t\t\t\t\titem\twikibase-item\n", "Q183-P31-1\tQ183\tP31\tQ3624078\tpreferred\t\t\t\tQ3624078\t\t\t\t\t\t\titem\twikibase-item\n", " 0.71 real 0.57 user 0.13 sys\n" ] } ], "source": [ "!$kgtk query -i $EDGES --graph-cache $STORE -o - \\\n", " --match '(i)-[:P31]->(c)' \\\n", " --limit 5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Force creation of the index on the node2 column" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2020-09-30 11:21:27 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT *\n", " FROM graph_8 AS graph_8_c1\n", " WHERE graph_8_c1.\"node2\"=?\n", " LIMIT ?\n", " PARAS: ['Q5', 5]\n", "---------------------------------------------\n", "id\tnode1\tlabel\tnode2\trank\tnode2;magnitude\tnode2;unit\tnode2;date\tnode2;item\tnode2;lower\tnode2;upper\tnode2;latitude\tnode2;longitude\tnode2;precision\tnode2;calendar\tnode2;entity-type\tnode2;wikidatatype\n", "Q1253-P31-1\tQ1253\tP31\tQ5\tnormal\t\t\t\tQ5\t\t\t\t\t\t\titem\twikibase-item\n", "Q1526-P31-1\tQ1526\tP31\tQ5\tnormal\t\t\t\tQ5\t\t\t\t\t\t\titem\twikibase-item\n", "Q3794-P31-1\tQ3794\tP31\tQ5\tnormal\t\t\t\tQ5\t\t\t\t\t\t\titem\twikibase-item\n", "Q4291-P31-1\tQ4291\tP31\tQ5\tnormal\t\t\t\tQ5\t\t\t\t\t\t\titem\twikibase-item\n", "Q4489-P31-1\tQ4489\tP31\tQ5\tnormal\t\t\t\tQ5\t\t\t\t\t\t\titem\twikibase-item\n", " 0.70 real 0.56 user 0.12 sys\n" ] } ], "source": [ "!$kgtk query -i $EDGES --graph-cache $STORE -o - \\\n", " --match '(i)-[r]->(:Q5)' \\\n", " --limit 5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Count the number of edges" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2020-09-30 11:21:28 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT count(graph_8_c1.\"id\") \"count\"\n", " FROM graph_8 AS graph_8_c1\n", " LIMIT ?\n", " PARAS: [10]\n", "---------------------------------------------\n", "count\n", "9999999\n", " 6.80 real 1.57 user 1.54 sys\n" ] } ], "source": [ "!$kgtk query -i $EDGES --graph-cache $STORE \\\n", " --match 'all: ()-[r]->()' \\\n", " --return 'count(r) as count' \\\n", " --limit 10" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Get the distribution of the label column\n", "I would like to have it sorted numerically, but don't know how to make it happen" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 43.44 real 42.83 user 0.79 sys\n" ] } ], "source": [ "!$kgtk unique --column label -i $EDGES / sort2 -c node2 -r -o $OUT/all-distribution.tsv " ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "node1 label node2\n", "P3987 count 998\n", "P410 count 987\n", "P575 count 985\n", "P6879 count 98\n", "P6562 count 98\n", "P5395 count 98\n", "P3153 count 98\n", "P4933 count 97\n", "P3135 count 97\n" ] } ], "source": [ "!head $OUT/all-distribution.tsv | column -t -s $'\\t' " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Compute files with labels, aliases and descriptions\n", "Return the id, node1, label and node2 columns" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2020-09-30 11:22:19 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT graph_8_c1.\"id\", graph_8_c1.\"node1\", graph_8_c1.\"label\", graph_8_c1.\"node2\"\n", " FROM graph_8 AS graph_8_c1\n", " WHERE graph_8_c1.\"label\"=?\n", " PARAS: ['label']\n", "---------------------------------------------\n", " 20.17 real 17.45 user 1.26 sys\n" ] } ], "source": [ "!$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/part.label.tsv.gz \\\n", " --match '(n1)-[l:label]->(n2)' \\\n", " --return 'l, n1, l.label, n2' " ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2020-09-30 11:22:39 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT graph_8_c1.\"id\", graph_8_c1.\"node1\", graph_8_c1.\"label\", graph_8_c1.\"node2\"\n", " FROM graph_8 AS graph_8_c1\n", " WHERE graph_8_c1.\"label\"=?\n", " PARAS: ['alias']\n", "---------------------------------------------\n", " 3.65 real 2.86 user 0.42 sys\n" ] } ], "source": [ "!$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/part.alias.tsv.gz \\\n", " --match '(n1)-[l:alias]->(n2)' \\\n", " --return 'l, n1, l.label, n2'" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2020-09-30 11:22:43 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT graph_8_c1.\"id\", graph_8_c1.\"node1\", graph_8_c1.\"label\", graph_8_c1.\"node2\"\n", " FROM graph_8 AS graph_8_c1\n", " WHERE graph_8_c1.\"label\"=?\n", " PARAS: ['description']\n", "---------------------------------------------\n", " 60.09 real 50.85 user 3.29 sys\n" ] } ], "source": [ "!$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/part.description.tsv.gz \\\n", " --match '(n1)-[l:description]->(n2)' \\\n", " --return 'l, n1, l.label, n2'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Now create files with the English labels, aliases and descriptions" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2020-09-30 11:23:43 sqlstore]: DROP graph data table graph_18 from /Users/pedroszekely/Downloads/kypher/output.all.10/part.label.tsv.gz\n", "[2020-09-30 11:23:44 sqlstore]: IMPORT graph directly into table graph_38 from /Users/pedroszekely/Downloads/kypher/output.all.10/part.label.tsv.gz ...\n", "[2020-09-30 11:23:54 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT *\n", " FROM graph_38 AS graph_38_c1\n", " WHERE (kgtk_lqstring_lang_suffix(graph_38_c1.\"node2\") = ?)\n", " PARAS: ['en']\n", "---------------------------------------------\n", " 15.08 real 18.64 user 1.55 sys\n" ] } ], "source": [ "!$kgtk query -i $OUT/part.label.tsv.gz --graph-cache $STORE -o $OUT/part.label.en.tsv.gz \\\n", " --match '()-[]->(n2)' \\\n", " --where 'n2.kgtk_lqstring_lang_suffix = \"en\"' " ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2020-09-30 11:23:58 sqlstore]: DROP graph data table graph_19 from /Users/pedroszekely/Downloads/kypher/output.all.10/part.alias.tsv.gz\n", "[2020-09-30 11:23:58 sqlstore]: IMPORT graph directly into table graph_39 from /Users/pedroszekely/Downloads/kypher/output.all.10/part.alias.tsv.gz ...\n", "[2020-09-30 11:23:59 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT *\n", " FROM graph_39 AS graph_39_c1\n", " WHERE (kgtk_lqstring_lang_suffix(graph_39_c1.\"node2\") = ?)\n", " PARAS: ['en']\n", "---------------------------------------------\n", " 2.66 real 3.06 user 0.29 sys\n" ] } ], "source": [ "!$kgtk query -i $OUT/part.alias.tsv.gz --graph-cache $STORE -o $OUT/part.alias.en.tsv.gz \\\n", " --match '()-[]->(n2)' \\\n", " --where 'n2.kgtk_lqstring_lang_suffix = \"en\"'" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2020-09-30 11:24:01 sqlstore]: DROP graph data table graph_20 from /Users/pedroszekely/Downloads/kypher/output.all.10/part.description.tsv.gz\n", "[2020-09-30 11:24:06 sqlstore]: IMPORT graph directly into table graph_40 from /Users/pedroszekely/Downloads/kypher/output.all.10/part.description.tsv.gz ...\n", "[2020-09-30 11:24:45 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT *\n", " FROM graph_40 AS graph_40_c1\n", " WHERE (kgtk_lqstring_lang_suffix(graph_40_c1.\"node2\") = ?)\n", " PARAS: ['en']\n", "---------------------------------------------\n", " 53.85 real 71.20 user 5.32 sys\n" ] } ], "source": [ "!$kgtk query -i $OUT/part.description.tsv.gz --graph-cache $STORE -o $OUT/part.description.en.tsv.gz \\\n", " --match '()-[]->(n2)' \\\n", " --where 'n2.kgtk_lqstring_lang_suffix = \"en\"' " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's sample these files to see what they look like:\n", "\n", "* we are getting all variants of English, we really want `en` only\n", "* the labels have the language tags, how do we output only the string without the language tag?" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id node1 label node2\n", "Q45-label-en Q45 label 'Portugal'@en\n", "Q140-label-en Q140 label 'lion'@en\n", "Q183-label-en Q183 label 'Germany'@en\n", "Q317-label-en Q317 label 'dictatorship'@en\n", "Q433-label-en Q433 label 'Gmina Kurów'@en\n", "Q514-label-en Q514 label 'anatomy'@en\n", "Q595-label-en Q595 label 'The Intouchables'@en\n", "Q647-label-en Q647 label 'Rennes'@en\n", "Q716-label-en Q716 label 'titanium'@en\n", "gzcat: error writing to output: Broken pipe\n", "gzcat: /Users/pedroszekely/Downloads/kypher/output.all.10/part.label.en.tsv.gz: uncompress failed\n" ] } ], "source": [ "!gzcat $OUT/part.label.en.tsv.gz | head | column -t -s $'\\t' " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Compute the distribution of the number of edges for each Wikidata type" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 42.94 real 42.38 user 0.81 sys\n" ] } ], "source": [ "!$kgtk unique --column 'node2;wikidatatype' -i $EDGES / sort2 -c node2 -r | gzip > $OUT/all.wikidatatype.distribution.tsv.gz" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "node1 label node2\n", "time count 76936\n", "wikibase-item count 729535\n", "math count 70\n", "wikibase-form count 7\n", "quantity count 69823\n", "string count 68283\n", "external-id count 416408\n", "commonsMedia count 36794\n", "globe-coordinate count 26063\n", "monolingualtext count 24131\n", "musical-notation count 2\n", "geo-shape count 183\n", "wikibase-property count 148\n", "url count 12874\n" ] } ], "source": [ "!gzcat $OUT/all.wikidatatype.distribution.tsv.gz | column -t -s $'\\t' " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create a file to contain the edges for each wikidata type" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/part.time.tsv.gz --match '(n1)-[l]->(n2 {wikidatatype: type})' --return 'l, n1, l.label, n2' --where 'type = \"time\"'\n", "$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/part.wikibase-item.tsv.gz --match '(n1)-[l]->(n2 {wikidatatype: type})' --return 'l, n1, l.label, n2' --where 'type = \"wikibase-item\"'\n", "$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/part.math.tsv.gz --match '(n1)-[l]->(n2 {wikidatatype: type})' --return 'l, n1, l.label, n2' --where 'type = \"math\"'\n", "$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/part.wikibase-form.tsv.gz --match '(n1)-[l]->(n2 {wikidatatype: type})' --return 'l, n1, l.label, n2' --where 'type = \"wikibase-form\"'\n", "$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/part.quantity.tsv.gz --match '(n1)-[l]->(n2 {wikidatatype: type})' --return 'l, n1, l.label, n2' --where 'type = \"quantity\"'\n", "$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/part.string.tsv.gz --match '(n1)-[l]->(n2 {wikidatatype: type})' --return 'l, n1, l.label, n2' --where 'type = \"string\"'\n", "$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/part.external-id.tsv.gz --match '(n1)-[l]->(n2 {wikidatatype: type})' --return 'l, n1, l.label, n2' --where 'type = \"external-id\"'\n", "$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/part.commonsMedia.tsv.gz --match '(n1)-[l]->(n2 {wikidatatype: type})' --return 'l, n1, l.label, n2' --where 'type = \"commonsMedia\"'\n", "$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/part.globe-coordinate.tsv.gz --match '(n1)-[l]->(n2 {wikidatatype: type})' --return 'l, n1, l.label, n2' --where 'type = \"globe-coordinate\"'\n", "$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/part.monolingualtext.tsv.gz --match '(n1)-[l]->(n2 {wikidatatype: type})' --return 'l, n1, l.label, n2' --where 'type = \"monolingualtext\"'\n", "$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/part.musical-notation.tsv.gz --match '(n1)-[l]->(n2 {wikidatatype: type})' --return 'l, n1, l.label, n2' --where 'type = \"musical-notation\"'\n", "$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/part.geo-shape.tsv.gz --match '(n1)-[l]->(n2 {wikidatatype: type})' --return 'l, n1, l.label, n2' --where 'type = \"geo-shape\"'\n", "$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/part.wikibase-property.tsv.gz --match '(n1)-[l]->(n2 {wikidatatype: type})' --return 'l, n1, l.label, n2' --where 'type = \"wikibase-property\"'\n", "$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/part.url.tsv.gz --match '(n1)-[l]->(n2 {wikidatatype: type})' --return 'l, n1, l.label, n2' --where 'type = \"url\"'\n" ] } ], "source": [ "types = [\n", " \"time\",\n", " \"wikibase-item\",\n", " \"math\",\n", " \"wikibase-form\",\n", " \"quantity\",\n", " \"string\",\n", " \"external-id\",\n", " \"commonsMedia\",\n", " \"globe-coordinate\",\n", " \"monolingualtext\",\n", " \"musical-notation\",\n", " \"geo-shape\",\n", " \"wikibase-property\",\n", " \"url\",\n", "]\n", "\n", "command = \"$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/part.TYPE_FILE.tsv.gz \\\n", " --match '(n1)-[l]->(n2 {wikidatatype: type})' \\\n", " --return 'l, n1, l.label, n2'\\\n", " --where 'type = \\\"TYPE\\\"'\"\n", "for type in types:\n", " cmd = command.replace(\"TYPE_FILE\", type)\n", " cmd = cmd.replace(\"TYPE\", type)\n", "\n", " print(cmd)\n", " os.system(cmd)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create a file with the sitelinks" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2020-09-30 11:26:37 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT graph_8_c1.\"id\", graph_8_c1.\"node1\", graph_8_c1.\"label\", graph_8_c1.\"node2\"\n", " FROM graph_8 AS graph_8_c1\n", " WHERE graph_8_c1.\"label\"=?\n", " PARAS: ['wikipedia_sitelink']\n", "---------------------------------------------\n", " 8.27 real 6.88 user 0.71 sys\n" ] } ], "source": [ "!$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/part.wikipedia_sitelink.tsv.gz \\\n", " --match '(n1)-[l:wikipedia_sitelink]->(n2)' \\\n", " --return 'l, n1, l.label, n2' " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create a file that specifies for each node whether it is an item or a property" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2020-09-30 11:26:46 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT graph_8_c1.\"id\", graph_8_c1.\"node1\", graph_8_c1.\"label\", graph_8_c1.\"node2\"\n", " FROM graph_8 AS graph_8_c1\n", " WHERE graph_8_c1.\"label\"=?\n", " PARAS: ['type']\n", "---------------------------------------------\n", " 3.21 real 2.50 user 0.58 sys\n" ] } ], "source": [ "!$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/part.type.tsv.gz \\\n", " --match '(n1)-[l:type]->(n2)' \\\n", " --return 'l, n1, l.label, n2' " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create the P31 and P279 files" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2020-09-30 11:26:49 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT graph_8_c1.\"id\", graph_8_c1.\"node1\", graph_8_c1.\"label\", graph_8_c1.\"node2\"\n", " FROM graph_8 AS graph_8_c1\n", " WHERE graph_8_c1.\"label\"=?\n", " PARAS: ['P31']\n", "---------------------------------------------\n", " 2.73 real 2.17 user 0.45 sys\n" ] } ], "source": [ "!$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/all.P31.tsv.gz \\\n", " --match '(n1)-[l:P31]->(n2)' \\\n", " --return 'l, n1, l.label, n2' " ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2020-09-30 11:26:52 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT graph_8_c1.\"id\", graph_8_c1.\"node1\", graph_8_c1.\"label\", graph_8_c1.\"node2\"\n", " FROM graph_8 AS graph_8_c1\n", " WHERE graph_8_c1.\"label\"=?\n", " PARAS: ['P279']\n", "---------------------------------------------\n", " 0.79 real 0.64 user 0.14 sys\n" ] } ], "source": [ "!$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/all.P279.tsv.gz \\\n", " --match '(n1)-[l:P279]->(n2)' \\\n", " --return 'l, n1, l.label, n2' " ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "gzcat: error writing to output: Broken pipe\n", "gzcat: /Users/pedroszekely/Downloads/kypher/output.all.10/all.P31.tsv.gz: uncompress failed\n", "id node1 label node2\n", "Q45-P31-1 Q45 P31 Q3624078\n", "Q45-P31-2 Q45 P31 Q6256\n", "Q45-P31-3 Q45 P31 Q20181813\n", "Q140-P31-1 Q140 P31 Q16521\n", "Q183-P31-1 Q183 P31 Q3624078\n", "Q183-P31-2 Q183 P31 Q43702\n", "Q183-P31-3 Q183 P31 Q7270\n", "Q183-P31-4 Q183 P31 Q619610\n", "Q183-P31-5 Q183 P31 Q4209223\n" ] } ], "source": [ "!gzcat $OUT/all.P31.tsv.gz | head | column -t -s $'\\t' " ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 2.57 real 2.37 user 0.14 sys\n" ] } ], "source": [ "!$kgtk cat -i $OUT/all.P279.tsv.gz -i $OUT/all.P31.tsv.gz -o $OUT/all.P31_P279.tsv.gz " ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id node1 label node2\n", "Q317-P279-1 Q317 P279 Q173424\n", "Q514-P279-1 Q514 P279 Q420\n", "Q514-P279-2 Q514 P279 Q11190\n", "Q716-P279-1 Q716 P279 Q19588\n", "Q716-P279-2 Q716 P279 Q428766\n", "Q901-P279-1 Q901 P279 Q1650915\n", "Q901-P279-2 Q901 P279 Q20826540\n", "Q1071-P279-1 Q1071 P279 Q34749\n", "Q1071-P279-2 Q1071 P279 Q8008\n", "gzcat: error writing to output: Broken pipe\n", "gzcat: /Users/pedroszekely/Downloads/kypher/output.all.10/all.P31_P279.tsv.gz: uncompress failed\n" ] } ], "source": [ "!gzcat $OUT/all.P31_P279.tsv | head | column -t -s $'\\t' " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create the file that contains all nodes reachable via P279 starting from a node2 in P31 or a node1 in P279" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First compute the roots" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2020-09-30 11:26:56 sqlstore]: DROP graph data table graph_21 from /Users/pedroszekely/Downloads/kypher/output.all.10/all.P279.tsv.gz\n", "[2020-09-30 11:26:56 sqlstore]: IMPORT graph directly into table graph_41 from /Users/pedroszekely/Downloads/kypher/output.all.10/all.P279.tsv.gz ...\n", "[2020-09-30 11:26:56 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT graph_41_c1.\"node1\" \"node\"\n", " FROM graph_41 AS graph_41_c1\n", " PARAS: []\n", "---------------------------------------------\n", " 0.75 real 0.59 user 0.15 sys\n" ] } ], "source": [ "!$kgtk query -i $OUT/all.P279.tsv.gz --graph-cache $STORE -o $TEMP/P279.n1.tsv.gz \\\n", " --match '(n1)-[]->()' \\\n", " --return 'n1 as node' " ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2020-09-30 11:26:57 sqlstore]: DROP graph data table graph_22 from /Users/pedroszekely/Downloads/kypher/output.all.10/all.P31.tsv.gz\n", "[2020-09-30 11:26:57 sqlstore]: IMPORT graph directly into table graph_42 from /Users/pedroszekely/Downloads/kypher/output.all.10/all.P31.tsv.gz ...\n", "[2020-09-30 11:26:57 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT graph_42_c1.\"node2\" \"node\"\n", " FROM graph_42 AS graph_42_c1\n", " PARAS: []\n", "---------------------------------------------\n", " 1.40 real 1.37 user 0.22 sys\n" ] } ], "source": [ "!$kgtk query -i $OUT/all.P31.tsv.gz --graph-cache $STORE -o $TEMP/P31.n2.tsv.gz \\\n", " --match '()-[]->(n2)' \\\n", " --return 'n2 as node' " ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 2.34 real 2.95 user 0.44 sys\n" ] } ], "source": [ "!$kgtk cat --mode NONE -i $TEMP/P31.n2.tsv.gz $TEMP/P279.n1.tsv.gz \\\n", " / compact --mode NONE --columns node \\\n", " > $TEMP/P279.roots.tsv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can invoke the reachable-nodes command" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 2.19 real 0.86 user 0.21 sys\n" ] } ], "source": [ "!$kgtk reachable-nodes \\\n", " --rootfile $TEMP/P279.roots.tsv \\\n", " --rootfilecolumn 0 \\\n", " --subj 1 --pred 2 --obj 3 \\\n", " -i $OUT/all.P279.tsv.gz \\\n", " | kgtk sort2 \\\n", " | gzip > $TEMP/P279.reachable.tsv.gz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The reachable-nodes command produces edges labeled `reachable`, so we need one command to rename them." ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2020-09-30 11:27:03 sqlstore]: DROP graph data table graph_23 from /Users/pedroszekely/Downloads/kypher/temp.all.10/P279.reachable.tsv.gz\n", "[2020-09-30 11:27:03 sqlstore]: IMPORT graph directly into table graph_43 from /Users/pedroszekely/Downloads/kypher/temp.all.10/P279.reachable.tsv.gz ...\n", "[2020-09-30 11:27:03 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT graph_43_c1.\"node1\", ? \"label\", graph_43_c1.\"node2\" \"node2\"\n", " FROM graph_43 AS graph_43_c1\n", " PARAS: ['P279star']\n", "---------------------------------------------\n", " 0.84 real 0.66 user 0.16 sys\n" ] } ], "source": [ "!$kgtk query -i $TEMP/P279.reachable.tsv.gz --graph-cache $STORE -o $TEMP/P279star.1.tsv.gz \\\n", " --match '(n1)-[]->(n2)' \\\n", " --return 'n1, \"P279star\" as label, n2 as node2' " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We also want `P279star` to be relflexive, ie, contain `(n1)-[:P279star]->(n1)` for all node1" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2020-09-30 11:27:04 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT graph_43_c1.\"node1\" \"node1\", ? \"label\", graph_43_c1.\"node1\" \"node2\"\n", " FROM graph_43 AS graph_43_c1\n", " PARAS: ['P279star']\n", "---------------------------------------------\n", " 0.81 real 0.65 user 0.14 sys\n" ] } ], "source": [ "!$kgtk query -i $TEMP/P279.reachable.tsv.gz --graph-cache $STORE -o $TEMP/P279star.2.tsv.gz \\\n", " --match '(n1)-[]->(n2)' \\\n", " --return 'n1 as node1, \"P279star\" as label, n1 as node2' " ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2020-09-30 11:27:05 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT graph_43_c1.\"node2\" \"node1\", ? \"label\", graph_43_c1.\"node2\" \"node2\"\n", " FROM graph_43 AS graph_43_c1\n", " PARAS: ['P279star']\n", "---------------------------------------------\n", " 0.79 real 0.64 user 0.13 sys\n" ] } ], "source": [ "!$kgtk query -i $TEMP/P279.reachable.tsv.gz --graph-cache $STORE -o $TEMP/P279star.3.tsv.gz \\\n", " --match '(n1)-[]->(n2)' \\\n", " --return 'n2 as node1, \"P279star\" as label, n2 as node2' " ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2020-09-30 11:27:06 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT graph_42_c1.\"node2\" \"node1\", ? \"label\", graph_42_c1.\"node2\" \"node2\"\n", " FROM graph_42 AS graph_42_c1\n", " PARAS: ['P279star']\n", "---------------------------------------------\n", " 1.30 real 1.11 user 0.16 sys\n" ] } ], "source": [ "!$kgtk query -i $OUT/all.P31.tsv.gz --graph-cache $STORE -o $TEMP/P279star.4.tsv.gz \\\n", " --match '(n1)-[]->(n2)' \\\n", " --return 'n2 as node1, \"P279star\" as label, n2 as node2' " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can concatenate these files to produce the final output" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 1.78 real 1.50 user 0.18 sys\n" ] } ], "source": [ "!$kgtk cat --mode NONE -i $TEMP/P279star.1.tsv.gz $TEMP/P279star.2.tsv.gz $TEMP/P279star.3.tsv.gz $TEMP/P279star.4.tsv.gz \\\n", " | kgtk compact \\\n", " | kgtk sort2 \\\n", " | kgtk add-id --id-style node1-label-node2-num \\\n", " | gzip > $OUT/all.P279star.tsv.gz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is difficult to test with our Wikidata subset because our hierarchy is very sparse.\n", "\n", "This is how we would do the typical `?item P31/P279* ?class` in Kypher. \n", "The example shows how to get all the `n1` that are instances of subclasses of beer (q44)." ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2020-09-30 11:27:10 sqlstore]: DROP graph data table graph_24 from /Users/pedroszekely/Downloads/kypher/output.all.10/all.P279star.tsv.gz\n", "[2020-09-30 11:27:10 sqlstore]: IMPORT graph directly into table graph_44 from /Users/pedroszekely/Downloads/kypher/output.all.10/all.P279star.tsv.gz ...\n", "[2020-09-30 11:27:10 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT count(graph_42_c1.\"node1\") \"count\"\n", " FROM graph_42 AS graph_42_c1, graph_44 AS graph_44_c2\n", " WHERE graph_42_c1.\"label\"=?\n", " AND graph_44_c2.\"node2\"=?\n", " AND graph_42_c1.\"node2\"=graph_44_c2.\"node1\"\n", " PARAS: ['P31', 'Q44']\n", "---------------------------------------------\n", "[2020-09-30 11:27:10 sqlstore]: CREATE INDEX on table graph_42 column node2 ...\n", "[2020-09-30 11:27:11 sqlstore]: ANALYZE INDEX on table graph_42 column node2 ...\n", "[2020-09-30 11:27:11 sqlstore]: CREATE INDEX on table graph_44 column node2 ...\n", "[2020-09-30 11:27:11 sqlstore]: ANALYZE INDEX on table graph_44 column node2 ...\n", "[2020-09-30 11:27:11 sqlstore]: CREATE INDEX on table graph_44 column node1 ...\n", "[2020-09-30 11:27:11 sqlstore]: ANALYZE INDEX on table graph_44 column node1 ...\n", "[2020-09-30 11:27:11 sqlstore]: CREATE INDEX on table graph_42 column label ...\n", "[2020-09-30 11:27:11 sqlstore]: ANALYZE INDEX on table graph_42 column label ...\n", "count\n", "1\n", " 1.13 real 0.94 user 0.21 sys\n" ] } ], "source": [ "!$kgtk query -i $OUT/all.P31.tsv.gz -i $OUT/all.P279star.tsv.gz --graph-cache $STORE -o - \\\n", " --match 'P31: (n1)-[:P31]->(c), P279star: (c)-[]->(:Q44)' \\\n", " --return 'count(n1) as count'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create a file to do generalized Is-A queries\n", "The idea is that `(n1)-[:isa]->(n2)` when `(n1)-[:P31]->(n2)` or `(n1)-[:P279]->(n2)`\n", "\n", "We do this by concatenating the files and renaming the relation" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 1.43 real 1.26 user 0.14 sys\n" ] } ], "source": [ "!$kgtk cat -i $OUT/all.P31.tsv.gz $OUT/all.P279.tsv.gz \\\n", " | gzip > $TEMP/isa.1.tsv.gz" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2020-09-30 11:27:13 sqlstore]: DROP graph data table graph_25 from /Users/pedroszekely/Downloads/kypher/temp.all.10/isa.1.tsv.gz\n", "[2020-09-30 11:27:13 sqlstore]: IMPORT graph directly into table graph_45 from /Users/pedroszekely/Downloads/kypher/temp.all.10/isa.1.tsv.gz ...\n", "[2020-09-30 11:27:14 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT graph_45_c1.\"node1\", ? \"label\", graph_45_c1.\"node2\"\n", " FROM graph_45 AS graph_45_c1\n", " PARAS: ['isa']\n", "---------------------------------------------\n", " 2.10 real 1.91 user 0.25 sys\n" ] } ], "source": [ "!$kgtk query -i $TEMP/isa.1.tsv.gz --graph-cache $STORE -o $OUT/all.isa.tsv.gz \\\n", " --match '(n1)-[]->(n2)' \\\n", " --return 'n1, \"isa\" as label, n2' " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Example of how to use the `isa` relation" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2020-09-30 11:27:15 sqlstore]: DROP graph data table graph_28 from /Users/pedroszekely/Downloads/kypher/output.all.10/all.isa.tsv.gz\n", "[2020-09-30 11:27:15 sqlstore]: IMPORT graph directly into table graph_28 from /Users/pedroszekely/Downloads/kypher/output.all.10/all.isa.tsv.gz ...\n", "[2020-09-30 11:27:16 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT DISTINCT graph_28_c1.\"node1\", graph_28_c1.\"label\", ? \"node2\"\n", " FROM graph_28 AS graph_28_c1, graph_44 AS graph_44_c2\n", " WHERE graph_28_c1.\"label\"=?\n", " AND graph_44_c2.\"node2\"=?\n", " AND graph_28_c1.\"node2\"=graph_44_c2.\"node1\"\n", " LIMIT ?\n", " PARAS: ['Q44', 'isa', 'Q44', 10]\n", "---------------------------------------------\n", "[2020-09-30 11:27:16 sqlstore]: CREATE INDEX on table graph_28 column label ...\n", "[2020-09-30 11:27:16 sqlstore]: ANALYZE INDEX on table graph_28 column label ...\n", "[2020-09-30 11:27:16 sqlstore]: CREATE INDEX on table graph_28 column node2 ...\n", "[2020-09-30 11:27:16 sqlstore]: ANALYZE INDEX on table graph_28 column node2 ...\n", "node1\tlabel\tnode2\n", "Q2579953\tisa\tQ44\n", "Q4488344\tisa\tQ44\n", "Q7587890\tisa\tQ44\n", "Q10313616\tisa\tQ44\n", " 1.27 real 1.16 user 0.25 sys\n" ] } ], "source": [ "!$kgtk query -i $OUT/all.isa.tsv.gz -i $OUT/all.P279star.tsv.gz --graph-cache $STORE -o - \\\n", " --match 'isa: (n1)-[l:isa]->(c), P279star: (c)-[]->(:Q44)' \\\n", " --return 'distinct n1, l.label, \"Q44\" as node2' \\\n", " --limit 10" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creating a subset of Wikidata without scholarly articles (Q13442814)\n", "First create a file with the schloarly articles" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2020-09-30 11:27:17 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT DISTINCT graph_28_c1.\"node1\", graph_28_c1.\"label\", graph_28_c1.\"node2\"\n", " FROM graph_28 AS graph_28_c1\n", " WHERE graph_28_c1.\"label\"=?\n", " AND graph_28_c1.\"node2\"=?\n", " PARAS: ['isa', 'Q13442814']\n", "---------------------------------------------\n", " 0.84 real 0.67 user 0.15 sys\n" ] } ], "source": [ "!$kgtk query -i $OUT/all.isa.tsv.gz -i $OUT/all.P279star.tsv.gz --graph-cache $STORE -o $OUT/all.isa.Q13442814.tsv.gz \\\n", " --match 'isa: (n1)-[l:isa]->(n2:Q13442814)' \\\n", " --return 'distinct n1, l.label, n2'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we need to remove from `$EDGES` any edge where node1 or node2 is in node1 of `$OUT/all.isa.Q13442814.tsv`. The result will be `$OUT/minus.Q13442814.tsv`. We can then run the whole notebook with this new file as $EDGES and compute all the product files in a new output directory" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "node1 label node2\n", "Q1801903 isa Q13442814\n" ] } ], "source": [ "!gzcat $OUT/all.isa.Q13442814.tsv | head | column -t -s $'\\t' " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 4882 /Users/pedroszekely/Downloads/kypher/output.all.10/all-distribution.tsv\n", " 143 /Users/pedroszekely/Downloads/kypher/output.all.10/all.P279.tsv.gz\n", " 793 /Users/pedroszekely/Downloads/kypher/output.all.10/all.P279star.tsv.gz\n", " 5348 /Users/pedroszekely/Downloads/kypher/output.all.10/all.P31.tsv.gz\n", " 5512 /Users/pedroszekely/Downloads/kypher/output.all.10/all.P31_P279.tsv.gz\n", " 0 /Users/pedroszekely/Downloads/kypher/output.all.10/all.isa.Q13442814.tsv.gz\n", " 2111 /Users/pedroszekely/Downloads/kypher/output.all.10/all.isa.tsv.gz\n", " 3 /Users/pedroszekely/Downloads/kypher/output.all.10/all.wikidatatype.distribution.tsv.gz\n", " 1814 /Users/pedroszekely/Downloads/kypher/output.all.10/part.alias.en.tsv.gz\n", " 13162 /Users/pedroszekely/Downloads/kypher/output.all.10/part.alias.tsv.gz\n", " 3125 /Users/pedroszekely/Downloads/kypher/output.all.10/part.commonsMedia.tsv.gz\n", " 6171 /Users/pedroszekely/Downloads/kypher/output.all.10/part.description.en.tsv.gz\n", " 227442 /Users/pedroszekely/Downloads/kypher/output.all.10/part.description.tsv.gz\n", " 22109 /Users/pedroszekely/Downloads/kypher/output.all.10/part.external-id.tsv.gz\n", " 16 /Users/pedroszekely/Downloads/kypher/output.all.10/part.geo-shape.tsv.gz\n", " 1495 /Users/pedroszekely/Downloads/kypher/output.all.10/part.globe-coordinate.tsv.gz\n", " 7336 /Users/pedroszekely/Downloads/kypher/output.all.10/part.label.en.tsv.gz\n", " 80294 /Users/pedroszekely/Downloads/kypher/output.all.10/part.label.tsv.gz\n", " 9 /Users/pedroszekely/Downloads/kypher/output.all.10/part.math.tsv.gz\n", " 1880 /Users/pedroszekely/Downloads/kypher/output.all.10/part.monolingualtext.tsv.gz\n", " 1 /Users/pedroszekely/Downloads/kypher/output.all.10/part.musical-notation.tsv.gz\n", " 2624 /Users/pedroszekely/Downloads/kypher/output.all.10/part.quantity.tsv.gz\n", " 3903 /Users/pedroszekely/Downloads/kypher/output.all.10/part.string.tsv.gz\n", " 2415 /Users/pedroszekely/Downloads/kypher/output.all.10/part.time.tsv.gz\n", " 6435 /Users/pedroszekely/Downloads/kypher/output.all.10/part.type.tsv.gz\n", " 943 /Users/pedroszekely/Downloads/kypher/output.all.10/part.url.tsv.gz\n", " 2 /Users/pedroszekely/Downloads/kypher/output.all.10/part.wikibase-form.tsv.gz\n", " 28782 /Users/pedroszekely/Downloads/kypher/output.all.10/part.wikibase-item.tsv.gz\n", " 8 /Users/pedroszekely/Downloads/kypher/output.all.10/part.wikibase-property.tsv.gz\n", " 40876 /Users/pedroszekely/Downloads/kypher/output.all.10/part.wikipedia_sitelink.tsv.gz\n", " 547657 /Users/pedroszekely/Downloads/kypher/all.10.tsv.gz\n", " 1017291 total\n" ] } ], "source": [ "!wc -l $OUT/*.tsv $OUT/*.tsv.gz $EDGES" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Number of distinct items in our dataset" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2020-09-30 11:27:18 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT count(DISTINCT graph_8_c1.\"node1\") \"count\"\n", " FROM graph_8 AS graph_8_c1\n", " PARAS: []\n", "---------------------------------------------\n", "count\n", "156559\n", " 5.91 real 4.82 user 1.00 sys\n" ] } ], "source": [ "!$kgtk query -i $EDGES --graph-cache $STORE -o - \\\n", " --match '(n1)-[]->()' \\\n", " --return 'count(distinct n1) as count'" ] } ], "metadata": { "celltoolbar": "Tags", "kernelspec": { "display_name": "kgtk", "language": "python", "name": "kgtk" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.8" } }, "nbformat": 4, "nbformat_minor": 4 }