{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Generating Useful Wikidata Files" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Batch Invocation\n", "Example batch command. The second argument is the notebook where the output is stored; you can open it to check progress.\n", "\n", "```\n", "papermill Example7\\ -\\ Wikidata\\ Outputs.ipynb example7.out.ipynb \\\n", "-p home /Users/pedroszekely/Downloads/kypher \\\n", "-p wiki_file all.10.tsv.gz \\\n", "-p output_folder output.all.10 \\\n", "-p temp_folder temp.all.10 \\\n", "-p delete_database true \n", "```" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "tags": [ "parameters" ] }, "outputs": [], "source": [ "# Parameters\n", "home = \"/Users/pedroszekely/Downloads/kypher\"\n", "wiki_file = \"/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200803-v3/all.tsv.gz\"\n", "wiki_file = \"/Volumes/GoogleDrive/Shared drives/KGTK-public-graphs/wikidata-20200803-v3/all.tsv.gz\"\n", "output_path = \"/Users/pedroszekely/Downloads/kypher\"\n", "cache_path = \"/Users/pedroszekely/Downloads/kypher/temp.useful_wikidata_files_v3\"\n", "output_folder = \"useful_wikidata_files_v3\"\n", "temp_folder = \"temp.useful_wikidata_files_v3\"\n", "delete_database = \"no\"" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "import io\n", "import os\n", "import subprocess\n", "import sys\n", "\n", "import numpy as np\n", "import pandas as pd\n", "\n", "# from IPython.display import display, HTML, Image\n", "# from pandas_profiling import ProfileReport" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Set up environment and folders to store the files\n", "\n", "- `OUT` folder where the output files go\n", "- `TEMP` folder to keep temporary files, including the database\n", "- `kgtk` shortcut to invoke the kgtk software\n", "\n", "The current implementation of some of the kgtk commands does not understand compressed 
files. In particular, `query` often rejects `gz` files.\n", "\n", "To dos:\n", "\n", "- Make sure that all files have an `id` column, because `query` has trouble with files that have no ids.\n", "- Create an output folder for a subset of Wikidata without scholarly articles. This is half done: the remaining work is to subtract the scholarly articles from `EDGES` and repeat the workflow.\n", "- Change the naming convention to make it clear which files are a partition of the original `EDGES`, so users know which files they need to get a full version.\n", "- Create a qualifier file for each partition file of Wikidata, so that a user who gets one of the partitions can also get the corresponding qualifier file.\n", "- Add pagerank and other stats. We can compute the pagerank from the `all.item` file, so the output could be called `all.item.pagerank.tsv`.\n", "\n", "Naming convention: the name `all` is redundant, so we should consider removing it. I recommend using the prefix `part.` to name the partitions of Wikidata, e.g., `part.label`, `part.quantity`. Files such as `P279` are not partitions, as they are subsets of `part.item`.\n", "\n", "If we create a subset of Wikidata, e.g., no scholarly articles, we could call it `minus.Q13442814`; if we remove galaxies too, we could call it `minus.Q13442814-Q318`, so the files would be `minus.Q13442814-Q318.part.quantity.tsv` (the idea of `all` is in contrast to `minus`). We can also have files that start with Qnodes, e.g., `Q5.part.quantity.tsv`; constructing such files is harder as we don't want dangling nodes in the item file." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "os.environ['OUT'] = \"{}/{}\".format(output_path, output_folder)\n", "os.environ['TEMP'] = \"{}/{}\".format(output_path, temp_folder)\n", "os.environ['kgtk'] = \"kgtk\"\n", "os.environ['kgtk'] = \"time kgtk --debug\"\n", "os.environ['EDGES'] = wiki_file\n", "if cache_path:\n", " os.environ['STORE'] = \"{}/wikidata.sqlite3.db\".format(cache_path)\n", "else:\n", " os.environ['STORE'] = \"{}/{}/wikidata.sqlite3.db\".format(output_path, temp_folder)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/Users/pedroszekely/Downloads/kypher/useful_wikidata_files_v3\n", "/Users/pedroszekely/Downloads/kypher/temp.useful_wikidata_files_v3\n", "time kgtk --debug\n", "/Volumes/GoogleDrive/Shared drives/KGTK-public-graphs/wikidata-20200803-v3/all.tsv.gz\n", "/Users/pedroszekely/Downloads/kypher/temp.useful_wikidata_files_v3/wikidata.sqlite3.db\n" ] } ], "source": [ "!echo $OUT\n", "!echo $TEMP\n", "!echo $kgtk\n", "!echo $EDGES\n", "!echo $STORE" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/Users/pedroszekely/Downloads/kypher\n" ] } ], "source": [ "cd $output_path" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "mkdir: /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_v3: File exists\n", "mkdir: /Users/pedroszekely/Downloads/kypher/temp.useful_wikidata_files_v3: File exists\n" ] } ], "source": [ "!mkdir $OUT\n", "!mkdir $TEMP" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Clean up the output and temp folders before we start" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# !rm $OUT/*.tsv $OUT/*.tsv.gz\n", "# !rm $TEMP/*.tsv $TEMP/*.tsv.gz" ] }, { 
"cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "if delete_database and delete_database != \"no\":\n", " print(\"Deleted database\")\n", " !rm $STORE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The cell above removes the SQLite database when `delete_database` is set to anything other than `no`. Loading all the data and creating the indices takes a long time, so don't remove the database unless you changed files that have already been loaded and need to force a reload." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Get a sample and force importing the edge file into the database" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2020-11-03 22:16:22 sqlstore]: IMPORT graph directly into table graph_1 from /Volumes/GoogleDrive/Shared drives/KGTK-public-graphs/wikidata-20200803-v3/all.tsv.gz ...\n" ] } ], "source": [ "!$kgtk query -i \"$EDGES\" --limit 10 --graph-cache $STORE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Force creation of the index on the label column" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2020-10-23 17:59:58 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT *\n", " FROM graph_14 AS graph_14_c1\n", " WHERE graph_14_c1.\"label\"=?\n", " LIMIT ?\n", " PARAS: ['P31', 5]\n", "---------------------------------------------\n", "[2020-10-23 17:59:58 sqlstore]: CREATE INDEX on table graph_14 column label ...\n", "[2020-10-23 18:33:51 sqlstore]: ANALYZE INDEX on table graph_14 column label ...\n", "id\tnode1\tlabel\tnode2\trank\tnode2;wikidatatype\n", "Q1-P31-Q36906466-q1$8983b0ea-4a9c-0902-c0db-785db33f767c-0\tQ1\tP31\tQ36906466\tnormal\twikibase-item\n", "Q100-P31-Q1093829-q100$3f4925a8-32d0-424f-b65a-4e3b5dbd07ec-0\tQ100\tP31\tQ1093829\tnormal\twikibase-item\n", 
"Q100-P31-Q1549591-q100$ad5b329b-43c9-f6d9-9d0b-a08c1f4f0abb-0\tQ100\tP31\tQ1549591\tnormal\twikibase-item\n", "Q100-P31-Q21518270-q100$5b85ea08-419d-51f3-81d2-c7d50fc935f3-0\tQ100\tP31\tQ21518270\tpreferred\twikibase-item\n", "Q1000-P31-Q179023-q1000$fd440406-4ef4-6bb3-f9ed-484f630a4f8c-0\tQ1000\tP31\tQ179023\tnormal\twikibase-item\n", " 2150.60 real 1114.30 user 483.40 sys\n" ] } ], "source": [ "!$kgtk query -i \"$EDGES\" --graph-cache $STORE -o - \\\n", " --match '(i)-[:P31]->(c)' \\\n", " --limit 5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Force creation of the index on the node2 column" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2020-10-23 18:35:49 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT *\n", " FROM graph_14 AS graph_14_c1\n", " WHERE graph_14_c1.\"node2\"=?\n", " LIMIT ?\n", " PARAS: ['Q5', 5]\n", "---------------------------------------------\n", "[2020-10-23 18:35:49 sqlstore]: CREATE INDEX on table graph_14 column node2 ...\n", "[2020-10-23 19:33:50 sqlstore]: ANALYZE INDEX on table graph_14 column node2 ...\n", "id\tnode1\tlabel\tnode2\trank\tnode2;wikidatatype\n", "Q10000001-P31-Q5-q10000001$63adfe23-2df9-477f-aa0a-62a68a9eab1d-0\tQ10000001\tP31\tQ5\tnormal\twikibase-item\n", "Q1000002-P31-Q5-q1000002$bfba0a52-a667-4574-9eb9-4efc08582259-0\tQ1000002\tP31\tQ5\tnormal\twikibase-item\n", "Q1000005-P31-Q5-q1000005$d7f256b6-91a1-4bdf-9e77-03798a7d0c36-0\tQ1000005\tP31\tQ5\tnormal\twikibase-item\n", "Q1000006-P31-Q5-q1000006$3f995cf5-520a-4ec5-99b7-987fd8c57a6a-0\tQ1000006\tP31\tQ5\tnormal\twikibase-item\n", "Q1000015-P31-Q5-q1000015$909acc6d-b7a7-43e2-8cd0-50eab8891d52-0\tQ1000015\tP31\tQ5\tnormal\twikibase-item\n", " 3648.45 real 1715.03 user 797.70 sys\n" ] } ], "source": [ "!$kgtk query -i \"$EDGES\" --graph-cache $STORE -o - \\\n", " --match '(i)-[r]->(:Q5)' \\\n", " --limit 5" ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ "### Count the number of edges" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2020-10-23 19:36:38 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT count(graph_14_c1.\"id\") \"count\"\n", " FROM graph_14 AS graph_14_c1\n", " LIMIT ?\n", " PARAS: [10]\n", "---------------------------------------------\n", "count\n", "1102827643\n", " 726.36 real 103.43 user 171.80 sys\n" ] } ], "source": [ "!$kgtk query -i \"$EDGES\" --graph-cache $STORE \\\n", " --match 'all: ()-[r]->()' \\\n", " --return 'count(r) as count' \\\n", " --limit 10" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Get the distribution of the label column\n", "I would like to have it sorted numerically, but don't know how to make it happen" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Traceback (most recent call last):\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/cli/unique.py\", line 143, in run\n", " uniq.process()\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/join/unique.py\", line 166, in process\n", " for row in kr:\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/io/kgtkreader.py\", line 976, in __next__\n", " return self.nextrow()\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/io/kgtkreader.py\", line 860, in nextrow\n", " line = next(self.source) # Will throw StopIteration\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/utils/closableiter.py\", line 30, in __next__\n", " return 
self.s.__next__()\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/gzip.py\", line 300, in read1\n", " return self._buffer.read1(size)\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/_compression.py\", line 68, in readinto\n", " data = self.read(len(byte_view))\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/gzip.py\", line 480, in read\n", " buf = self._fp.read(io.DEFAULT_BUFFER_SIZE)\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/gzip.py\", line 96, in read\n", " self.file.read(size-self._length+read)\n", "OSError: [Errno 5] Input/output error\n", "\n", "During handling of the above exception, another exception occurred:\n", "\n", "Traceback (most recent call last):\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/exceptions.py\", line 42, in __call__\n", " return_code = func(*args, **kwargs) or 0\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/cli/unique.py\", line 150, in run\n", " raise KGTKException(str(e))\n", "kgtk.exceptions.KGTKException: [Errno 5] Input/output error\n", "[Errno 5] Input/output error\n", "In input header '': Column 0 has an invalid name in the file header\n", "Traceback (most recent call last):\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/exceptions.py\", line 42, in __call__\n", " return_code = func(*args, **kwargs) or 0\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/cli/sort2.py\", line 403, in run\n", " kr: KgtkReader = KgtkReader.open(Path(\"<%d\" % header_read_fd))\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/io/kgtkreader.py\", line 526, in open\n", " error_file=error_file)\n", " File 
\"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/io/kgtkbase.py\", line 142, in build_column_name_map\n", " header_line=header_line, who=who, error_action=error_action, error_file=error_file)\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/io/kgtkbase.py\", line 35, in _yelp\n", " sys.exit(1)\n", "SystemExit: 1\n", "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/exceptions.py:62: UserWarning: Please raise KGTKException instead of \n", " warnings.warn('Please raise KGTKException instead of {}'.format(type_))\n", "KGTKException found\n", "\n", " 198.37 real 157.54 user 0.88 sys\n" ] } ], "source": [ "!$kgtk unique --column label -i \"$EDGES\" / sort2 -c node2 -r -o $OUT/all-distribution.tsv " ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "!head $OUT/all-distribution.tsv | column -t -s $'\\t' " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Compute files with labels, aliases and descriptions\n", "Return the id, node1, label and node2 columns" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2020-10-23 20:04:12 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT graph_14_c1.\"id\", graph_14_c1.\"node1\", graph_14_c1.\"label\", graph_14_c1.\"node2\"\n", " FROM graph_14 AS graph_14_c1\n", " WHERE graph_14_c1.\"label\"=?\n", " PARAS: ['label']\n", "---------------------------------------------\n", " 1.12 real 0.58 user 0.22 sys\n" ] } ], "source": [ "!$kgtk query -i \"$EDGES\" --graph-cache $STORE -o $OUT/part.label.tsv.gz \\\n", " --match '(n1)-[l:label]->(n2)' \\\n", " --return 'l, n1, l.label, n2' " ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": 
"stream", "text": [ "[2020-10-23 02:14:59 sqlstore]: IMPORT graph directly into table graph_7 from /Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200803-v3/all.tsv.gz ...\n", "Exception in thread background thread for pid 80140:\n", "Traceback (most recent call last):\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/threading.py\", line 926, in _bootstrap_inner\n", " self.run()\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/threading.py\", line 870, in run\n", " self._target(*self._args, **self._kwargs)\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/sh-1.13.1-py3.7.egg/sh.py\", line 1662, in wrap\n", " fn(*args, **kwargs)\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/sh-1.13.1-py3.7.egg/sh.py\", line 2606, in background_thread\n", " handle_exit_code(exit_code)\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/sh-1.13.1-py3.7.egg/sh.py\", line 2304, in fn\n", " return self.command.handle_command_exit_code(exit_code)\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/sh-1.13.1-py3.7.egg/sh.py\", line 877, in handle_command_exit_code\n", " raise exc\n", "sh.ErrorReturnCode_1: \n", "\n", " RAN: /usr/bin/gunzip -c '/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200803-v3/all.tsv.gz'\n", "\n", " STDOUT:\n", "\n", "\n", " STDERR:\n", "gunzip: failed to read stdin: Input/output error\n", "gunzip: /Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200803-v3/all.tsv.gz: uncompress failed\n", "\n", "\n", "Traceback (most recent call last):\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/cli/query.py\", line 148, in run\n", " index=options.get('index'))\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/query.py\", line 180, 
in __init__\n", " store.add_graph(file)\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/sqlstore.py\", line 565, in add_graph\n", " self.import_graph_data_via_import(table, file)\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/sqlstore.py\", line 652, in import_graph_data_via_import\n", " sqlproc.wait()\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/sh-1.13.1-py3.7.egg/sh.py\", line 856, in wait\n", " self.process._stdin_process.command.wait()\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/sh-1.13.1-py3.7.egg/sh.py\", line 856, in wait\n", " self.process._stdin_process.command.wait()\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/sh-1.13.1-py3.7.egg/sh.py\", line 849, in wait\n", " self.handle_command_exit_code(exit_code)\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/sh-1.13.1-py3.7.egg/sh.py\", line 877, in handle_command_exit_code\n", " raise exc\n", "sh.ErrorReturnCode_1: \n", "\n", " RAN: /usr/bin/gunzip -c '/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200803-v3/all.tsv.gz'\n", "\n", " STDOUT:\n", "\n", "\n", " STDERR:\n", "gunzip: failed to read stdin: Input/output error\n", "gunzip: /Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200803-v3/all.tsv.gz: uncompress failed\n", "\n", "\n", "During handling of the above exception, another exception occurred:\n", "\n", "Traceback (most recent call last):\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/exceptions.py\", line 42, in __call__\n", " return_code = func(*args, **kwargs) or 0\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/cli/query.py\", line 180, in run\n", " raise 
KGTKException(str(e) + '\\n')\n", "kgtk.exceptions.KGTKException: \n", "\n", " RAN: /usr/bin/gunzip -c '/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200803-v3/all.tsv.gz'\n", "\n", " STDOUT:\n", "\n", "\n", " STDERR:\n", "gunzip: failed to read stdin: Input/output error\n", "gunzip: /Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200803-v3/all.tsv.gz: uncompress failed\n", "\n", "\n", "\n", "\n", " RAN: /usr/bin/gunzip -c '/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200803-v3/all.tsv.gz'\n", "\n", " STDOUT:\n", "\n", "\n", " STDERR:\n", "gunzip: failed to read stdin: Input/output error\n", "gunzip: /Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200803-v3/all.tsv.gz: uncompress failed\n", "\n", "\n", " 390.98 real 570.95 user 12.99 sys\n" ] } ], "source": [ "!$kgtk query -i \"$EDGES\" --graph-cache $STORE -o $OUT/part.alias.tsv.gz \\\n", " --match '(n1)-[l:alias]->(n2)' \\\n", " --return 'l, n1, l.label, n2'" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2020-10-23 02:21:35 sqlstore]: IMPORT graph directly into table graph_8 from /Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200803-v3/all.tsv.gz ...\n", "Exception in thread background thread for pid 80236:\n", "Traceback (most recent call last):\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/threading.py\", line 926, in _bootstrap_inner\n", " self.run()\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/threading.py\", line 870, in run\n", " self._target(*self._args, **self._kwargs)\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/sh-1.13.1-py3.7.egg/sh.py\", line 1662, in wrap\n", " fn(*args, **kwargs)\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/sh-1.13.1-py3.7.egg/sh.py\", line 2606, in background_thread\n", " 
handle_exit_code(exit_code)\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/sh-1.13.1-py3.7.egg/sh.py\", line 2304, in fn\n", " return self.command.handle_command_exit_code(exit_code)\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/sh-1.13.1-py3.7.egg/sh.py\", line 877, in handle_command_exit_code\n", " raise exc\n", "sh.ErrorReturnCode_1: \n", "\n", " RAN: /usr/bin/gunzip -c '/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200803-v3/all.tsv.gz'\n", "\n", " STDOUT:\n", "\n", "\n", " STDERR:\n", "gunzip: failed to read stdin: Input/output error\n", "gunzip: /Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200803-v3/all.tsv.gz: uncompress failed\n", "\n", "\n", "Traceback (most recent call last):\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/cli/query.py\", line 148, in run\n", " index=options.get('index'))\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/query.py\", line 180, in __init__\n", " store.add_graph(file)\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/sqlstore.py\", line 565, in add_graph\n", " self.import_graph_data_via_import(table, file)\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/sqlstore.py\", line 652, in import_graph_data_via_import\n", " sqlproc.wait()\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/sh-1.13.1-py3.7.egg/sh.py\", line 856, in wait\n", " self.process._stdin_process.command.wait()\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/sh-1.13.1-py3.7.egg/sh.py\", line 856, in wait\n", " self.process._stdin_process.command.wait()\n", " File 
\"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/sh-1.13.1-py3.7.egg/sh.py\", line 849, in wait\n", " self.handle_command_exit_code(exit_code)\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/sh-1.13.1-py3.7.egg/sh.py\", line 877, in handle_command_exit_code\n", " raise exc\n", "sh.ErrorReturnCode_1: \n", "\n", " RAN: /usr/bin/gunzip -c '/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200803-v3/all.tsv.gz'\n", "\n", " STDOUT:\n", "\n", "\n", " STDERR:\n", "gunzip: failed to read stdin: Input/output error\n", "gunzip: /Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200803-v3/all.tsv.gz: uncompress failed\n", "\n", "\n", "During handling of the above exception, another exception occurred:\n", "\n", "Traceback (most recent call last):\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/exceptions.py\", line 42, in __call__\n", " return_code = func(*args, **kwargs) or 0\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/cli/query.py\", line 180, in run\n", " raise KGTKException(str(e) + '\\n')\n", "kgtk.exceptions.KGTKException: \n", "\n", " RAN: /usr/bin/gunzip -c '/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200803-v3/all.tsv.gz'\n", "\n", " STDOUT:\n", "\n", "\n", " STDERR:\n", "gunzip: failed to read stdin: Input/output error\n", "gunzip: /Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200803-v3/all.tsv.gz: uncompress failed\n", "\n", "\n", "\n", "\n", " RAN: /usr/bin/gunzip -c '/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200803-v3/all.tsv.gz'\n", "\n", " STDOUT:\n", "\n", "\n", " STDERR:\n", "gunzip: failed to read stdin: Input/output error\n", "gunzip: /Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200803-v3/all.tsv.gz: uncompress failed\n", "\n", "\n", " 394.36 real 569.52 user 13.17 sys\n" ] } ], "source": 
[ "!$kgtk query -i \"$EDGES\" --graph-cache $STORE -o $OUT/part.description.tsv.gz \\\n", " --match '(n1)-[l:description]->(n2)' \\\n", " --return 'l, n1, l.label, n2'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Now create files with the English labels, aliases and descriptions" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Traceback (most recent call last):\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/cli/query.py\", line 148, in run\n", " index=options.get('index'))\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/query.py\", line 180, in __init__\n", " store.add_graph(file)\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/sqlstore.py\", line 565, in add_graph\n", " self.import_graph_data_via_import(table, file)\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/sqlstore.py\", line 630, in import_graph_data_via_import\n", " if header.endswith('\\r\\n'):\n", "TypeError: endswith first arg must be bytes or a tuple of bytes, not str\n", "\n", "During handling of the above exception, another exception occurred:\n", "\n", "Traceback (most recent call last):\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/exceptions.py\", line 42, in __call__\n", " return_code = func(*args, **kwargs) or 0\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/cli/query.py\", line 180, in run\n", " raise KGTKException(str(e) + '\\n')\n", "kgtk.exceptions.KGTKException: endswith first arg must be bytes or a tuple of bytes, not str\n", "\n", "endswith first arg must be bytes or a tuple of bytes, 
not str\n", "\n", " 0.98 real 0.61 user 0.17 sys\n" ] } ], "source": [ "!$kgtk query -i $OUT/part.label.tsv.gz --graph-cache $STORE -o $OUT/part.label.en.tsv.gz \\\n", " --match '()-[]->(n2)' \\\n", " --where 'n2.kgtk_lqstring_lang_suffix = \"en\"' " ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Traceback (most recent call last):\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/cli/query.py\", line 148, in run\n", " index=options.get('index'))\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/query.py\", line 180, in __init__\n", " store.add_graph(file)\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/sqlstore.py\", line 565, in add_graph\n", " self.import_graph_data_via_import(table, file)\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/sqlstore.py\", line 630, in import_graph_data_via_import\n", " if header.endswith('\\r\\n'):\n", "TypeError: endswith first arg must be bytes or a tuple of bytes, not str\n", "\n", "During handling of the above exception, another exception occurred:\n", "\n", "Traceback (most recent call last):\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/exceptions.py\", line 42, in __call__\n", " return_code = func(*args, **kwargs) or 0\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/cli/query.py\", line 180, in run\n", " raise KGTKException(str(e) + '\\n')\n", "kgtk.exceptions.KGTKException: endswith first arg must be bytes or a tuple of bytes, not str\n", "\n", "endswith first arg must be bytes or a tuple of bytes, not str\n", "\n", " 0.73 real 0.59 user 0.12 sys\n" 
] } ], "source": [ "!$kgtk query -i $OUT/part.alias.tsv.gz --graph-cache $STORE -o $OUT/part.alias.en.tsv.gz \\\n", " --match '()-[]->(n2)' \\\n", " --where 'n2.kgtk_lqstring_lang_suffix = \"en\"'" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Traceback (most recent call last):\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/cli/query.py\", line 148, in run\n", " index=options.get('index'))\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/query.py\", line 180, in __init__\n", " store.add_graph(file)\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/sqlstore.py\", line 565, in add_graph\n", " self.import_graph_data_via_import(table, file)\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/sqlstore.py\", line 630, in import_graph_data_via_import\n", " if header.endswith('\\r\\n'):\n", "TypeError: endswith first arg must be bytes or a tuple of bytes, not str\n", "\n", "During handling of the above exception, another exception occurred:\n", "\n", "Traceback (most recent call last):\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/exceptions.py\", line 42, in __call__\n", " return_code = func(*args, **kwargs) or 0\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/cli/query.py\", line 180, in run\n", " raise KGTKException(str(e) + '\\n')\n", "kgtk.exceptions.KGTKException: endswith first arg must be bytes or a tuple of bytes, not str\n", "\n", "endswith first arg must be bytes or a tuple of bytes, not str\n", "\n", " 0.81 real 0.67 user 0.13 sys\n" ] } ], "source": [ "!$kgtk query -i 
$OUT/part.description.tsv.gz --graph-cache $STORE -o $OUT/part.description.en.tsv.gz \\\n", " --match '()-[]->(n2)' \\\n", " --where 'n2.kgtk_lqstring_lang_suffix = \"en\"' " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's sample these files to see what they look like:\n", "\n", "* we are getting all variants of English, we really want `en` only\n", "* the labels have the language tags, how do we output only the string without the language tag?" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "!gzcat $OUT/part.label.en.tsv.gz | head | column -t -s $'\\t' " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Compute the distribution of the number of edges for each Wikidata type" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Traceback (most recent call last):\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/cli/unique.py\", line 143, in run\n", " uniq.process()\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/join/unique.py\", line 166, in process\n", " for row in kr:\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/io/kgtkreader.py\", line 976, in __next__\n", " return self.nextrow()\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/io/kgtkreader.py\", line 860, in nextrow\n", " line = next(self.source) # Will throw StopIteration\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/utils/closableiter.py\", line 30, in __next__\n", " return self.s.__next__()\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/gzip.py\", line 300, in read1\n", " return self._buffer.read1(size)\n", " 
File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/_compression.py\", line 68, in readinto\n", " data = self.read(len(byte_view))\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/gzip.py\", line 480, in read\n", " buf = self._fp.read(io.DEFAULT_BUFFER_SIZE)\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/gzip.py\", line 96, in read\n", " self.file.read(size-self._length+read)\n", "OSError: [Errno 5] Input/output error\n", "\n", "During handling of the above exception, another exception occurred:\n", "\n", "Traceback (most recent call last):\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/exceptions.py\", line 42, in __call__\n", " return_code = func(*args, **kwargs) or 0\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/cli/unique.py\", line 150, in run\n", " raise KGTKException(str(e))\n", "kgtk.exceptions.KGTKException: [Errno 5] Input/output error\n", "[Errno 5] Input/output error\n", "In input header '': Column 0 has an invalid name in the file header\n", "Traceback (most recent call last):\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/exceptions.py\", line 42, in __call__\n", " return_code = func(*args, **kwargs) or 0\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/cli/sort2.py\", line 403, in run\n", " kr: KgtkReader = KgtkReader.open(Path(\"<%d\" % header_read_fd))\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/io/kgtkreader.py\", line 526, in open\n", " error_file=error_file)\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/io/kgtkbase.py\", line 142, in build_column_name_map\n", " header_line=header_line, who=who, 
error_action=error_action, error_file=error_file)\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/io/kgtkbase.py\", line 35, in _yelp\n", " sys.exit(1)\n", "SystemExit: 1\n", "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/exceptions.py:62: UserWarning: Please raise KGTKException instead of \n", " warnings.warn('Please raise KGTKException instead of {}'.format(type_))\n", "KGTKException found\n", "\n", " 193.72 real 153.07 user 0.76 sys\n" ] } ], "source": [ "!$kgtk unique --column 'node2;wikidatatype' -i \"$EDGES\" / sort2 -c node2 -r | gzip > $OUT/all.wikidatatype.distribution.tsv.gz" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "!gzcat $OUT/all.wikidatatype.distribution.tsv.gz | column -t -s $'\\t' " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create a file to contain the edges for each wikidata type" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "$kgtk query -i \"$EDGES\" --graph-cache $STORE -o $OUT/part.time.tsv.gz --match '(n1)-[l]->(n2 {wikidatatype: type})' --return 'l, n1, l.label, n2' --where 'type = \"time\"'\n", "$kgtk query -i \"$EDGES\" --graph-cache $STORE -o $OUT/part.wikibase-item.tsv.gz --match '(n1)-[l]->(n2 {wikidatatype: type})' --return 'l, n1, l.label, n2' --where 'type = \"wikibase-item\"'\n", "$kgtk query -i \"$EDGES\" --graph-cache $STORE -o $OUT/part.math.tsv.gz --match '(n1)-[l]->(n2 {wikidatatype: type})' --return 'l, n1, l.label, n2' --where 'type = \"math\"'\n", "$kgtk query -i \"$EDGES\" --graph-cache $STORE -o $OUT/part.wikibase-form.tsv.gz --match '(n1)-[l]->(n2 {wikidatatype: type})' --return 'l, n1, l.label, n2' --where 'type = \"wikibase-form\"'\n", "$kgtk query -i \"$EDGES\" --graph-cache $STORE -o $OUT/part.quantity.tsv.gz --match '(n1)-[l]->(n2 
{wikidatatype: type})' --return 'l, n1, l.label, n2' --where 'type = \"quantity\"'\n", "$kgtk query -i \"$EDGES\" --graph-cache $STORE -o $OUT/part.string.tsv.gz --match '(n1)-[l]->(n2 {wikidatatype: type})' --return 'l, n1, l.label, n2' --where 'type = \"string\"'\n" ] } ], "source": [ "types = [\n", " \"time\",\n", " \"wikibase-item\",\n", " \"math\",\n", " \"wikibase-form\",\n", " \"quantity\",\n", " \"string\",\n", " \"external-id\",\n", " \"commonsMedia\",\n", " \"globe-coordinate\",\n", " \"monolingualtext\",\n", " \"musical-notation\",\n", " \"geo-shape\",\n", " \"wikibase-property\",\n", " \"url\",\n", "]\n", "\n", "command = \"$kgtk query -i \\\"$EDGES\\\" --graph-cache $STORE -o $OUT/part.TYPE_FILE.tsv.gz \\\n", " --match '(n1)-[l]->(n2 {wikidatatype: type})' \\\n", " --return 'l, n1, l.label, n2'\\\n", " --where 'type = \\\"TYPE\\\"'\"\n", "# Write one partition file per Wikidata datatype.\n", "for wikidata_type in types:\n", " cmd = command.replace(\"TYPE_FILE\", wikidata_type)\n", " cmd = cmd.replace(\"TYPE\", wikidata_type)\n", "\n", " print(cmd)\n", " os.system(cmd)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create a file with the sitelinks" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!$kgtk query -i \"$EDGES\" --graph-cache $STORE -o $OUT/part.wikipedia_sitelink.tsv.gz \\\n", " --match '(n1)-[l:wikipedia_sitelink]->(n2)' \\\n", " --return 'l, n1, l.label, n2' " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create a file that specifies for each node whether it is an item or a property" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!$kgtk query -i \"$EDGES\" --graph-cache $STORE -o $OUT/part.type.tsv.gz \\\n", " --match '(n1)-[l:type]->(n2)' \\\n", " --return 'l, n1, l.label, n2' " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create the P31 and P279 files" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!$kgtk query -i 
\"$EDGES\" --graph-cache $STORE -o $OUT/all.P31.tsv.gz \\\n", " --match '(n1)-[l:P31]->(n2)' \\\n", " --return 'l, n1, l.label, n2' " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!$kgtk query -i \"$EDGES\" --graph-cache $STORE -o $OUT/all.P279.tsv.gz \\\n", " --match '(n1)-[l:P279]->(n2)' \\\n", " --return 'l, n1, l.label, n2' " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!gzcat $OUT/all.P31.tsv.gz | head | column -t -s $'\\t' " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!$kgtk cat -i $OUT/all.P279.tsv.gz -i $OUT/all.P31.tsv.gz -o $OUT/all.P31_P279.tsv.gz " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!gzcat $OUT/all.P31_P279.tsv.gz | head | column -t -s $'\\t' " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create the file that contains all nodes reachable via P279 starting from a node2 in P31 or a node1 in P279" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First compute the roots" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!$kgtk query -i $OUT/all.P279.tsv.gz --graph-cache $STORE -o $TEMP/P279.n1.tsv.gz \\\n", " --match '(n1)-[l]->()' \\\n", " --return 'n1 as node, l as id' " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!$kgtk query -i $OUT/all.P31.tsv.gz --graph-cache $STORE -o $TEMP/P31.n2.tsv.gz \\\n", " --match '()-[l]->(n2)' \\\n", " --return 'n2 as node, l as id' " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!$kgtk cat --mode NONE -i $TEMP/P31.n2.tsv.gz $TEMP/P279.n1.tsv.gz \\\n", " / compact --mode NONE --columns node \\\n", " > $TEMP/P279.roots.tsv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can invoke the reachable-nodes command" ] }, 
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!head $TEMP/P279.roots.tsv" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!$kgtk reachable-nodes \\\n", " --mode NONE \\\n", " --rootfile $TEMP/P279.roots.tsv \\\n", " --rootfilecolumn 0 \\\n", " --subj 1 --pred 2 --obj 3 \\\n", " -i $OUT/all.P279.tsv.gz \\\n", " | kgtk sort2 \\\n", " | gzip > $TEMP/P279.reachable.tsv.gz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The reachable-nodes command produces edges labeled `reachable`, so we need one command to rename them." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2020-10-22 02:28:48 sqlstore]: IMPORT graph directly into table graph_7 from /Users/pedroszekely/Downloads/kypher/temp.useful_wikidata_files_2/P279.reachable.tsv.gz ...\n", "[2020-10-22 02:28:48 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT graph_7_c1.\"node1\", ? 
\"label\", graph_7_c1.\"node2\" \"node2\"\n", " FROM graph_7 AS graph_7_c1\n", " PARAS: ['P279star']\n", "---------------------------------------------\n", "Traceback (most recent call last):\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/cli/query.py\", line 155, in run\n", " result = query.execute()\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/query.py\", line 677, in execute\n", " result = self.store.execute(query, params)\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/sqlstore.py\", line 231, in execute\n", " return self.get_conn().execute(*args, **kwargs)\n", "sqlite3.OperationalError: no such column: graph_7_c1.node1\n", "\n", "During handling of the above exception, another exception occurred:\n", "\n", "Traceback (most recent call last):\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/exceptions.py\", line 42, in __call__\n", " return_code = func(*args, **kwargs) or 0\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/cli/query.py\", line 180, in run\n", " raise KGTKException(str(e) + '\\n')\n", "kgtk.exceptions.KGTKException: no such column: graph_7_c1.node1\n", "\n", "no such column: graph_7_c1.node1\n", "\n", " 0.73 real 0.52 user 0.13 sys\n" ] } ], "source": [ "!$kgtk query -i $TEMP/P279.reachable.tsv.gz --graph-cache $STORE -o $TEMP/P279star.1.tsv.gz \\\n", " --match '(n1)-[]->(n2)' \\\n", " --return 'n1, \"P279star\" as label, n2 as node2' " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We also want `P279star` to be reflexive, i.e., to contain `(n1)-[:P279star]->(n1)` for every node." ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ 
"[2020-10-22 02:28:48 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT graph_7_c1.\"node1\" \"node1\", ? \"label\", graph_7_c1.\"node1\" \"node2\"\n", " FROM graph_7 AS graph_7_c1\n", " PARAS: ['P279star']\n", "---------------------------------------------\n", "Traceback (most recent call last):\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/cli/query.py\", line 155, in run\n", " result = query.execute()\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/query.py\", line 677, in execute\n", " result = self.store.execute(query, params)\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/sqlstore.py\", line 231, in execute\n", " return self.get_conn().execute(*args, **kwargs)\n", "sqlite3.OperationalError: no such column: graph_7_c1.node1\n", "\n", "During handling of the above exception, another exception occurred:\n", "\n", "Traceback (most recent call last):\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/exceptions.py\", line 42, in __call__\n", " return_code = func(*args, **kwargs) or 0\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/cli/query.py\", line 180, in run\n", " raise KGTKException(str(e) + '\\n')\n", "kgtk.exceptions.KGTKException: no such column: graph_7_c1.node1\n", "\n", "no such column: graph_7_c1.node1\n", "\n", " 0.58 real 0.47 user 0.09 sys\n" ] } ], "source": [ "!$kgtk query -i $TEMP/P279.reachable.tsv.gz --graph-cache $STORE -o $TEMP/P279star.2.tsv.gz \\\n", " --match '(n1)-[]->(n2)' \\\n", " --return 'n1 as node1, \"P279star\" as label, n1 as node2' " ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ 
"[2020-10-22 02:28:49 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT graph_7_c1.\"node2\" \"node1\", ? \"label\", graph_7_c1.\"node2\" \"node2\"\n", " FROM graph_7 AS graph_7_c1\n", " PARAS: ['P279star']\n", "---------------------------------------------\n", "Traceback (most recent call last):\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/cli/query.py\", line 155, in run\n", " result = query.execute()\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/query.py\", line 677, in execute\n", " result = self.store.execute(query, params)\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/sqlstore.py\", line 231, in execute\n", " return self.get_conn().execute(*args, **kwargs)\n", "sqlite3.OperationalError: no such column: graph_7_c1.node2\n", "\n", "During handling of the above exception, another exception occurred:\n", "\n", "Traceback (most recent call last):\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/exceptions.py\", line 42, in __call__\n", " return_code = func(*args, **kwargs) or 0\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/cli/query.py\", line 180, in run\n", " raise KGTKException(str(e) + '\\n')\n", "kgtk.exceptions.KGTKException: no such column: graph_7_c1.node2\n", "\n", "no such column: graph_7_c1.node2\n", "\n", " 0.59 real 0.48 user 0.10 sys\n" ] } ], "source": [ "!$kgtk query -i $TEMP/P279.reachable.tsv.gz --graph-cache $STORE -o $TEMP/P279star.3.tsv.gz \\\n", " --match '(n1)-[]->(n2)' \\\n", " --return 'n2 as node1, \"P279star\" as label, n2 as node2' " ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ 
"[2020-10-22 02:28:50 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT graph_6_c1.\"node2\" \"node1\", ? \"label\", graph_6_c1.\"node2\" \"node2\"\n", " FROM graph_6 AS graph_6_c1\n", " PARAS: ['P279star']\n", "---------------------------------------------\n", " 234.72 real 223.22 user 4.97 sys\n" ] } ], "source": [ "!$kgtk query -i $OUT/all.P31.tsv.gz --graph-cache $STORE -o $TEMP/P279star.4.tsv.gz \\\n", " --match '(n1)-[]->(n2)' \\\n", " --return 'n2 as node1, \"P279star\" as label, n2 as node2' " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can concatenate these files to produce the final output" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Traceback (most recent call last):\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/io/kgtkreader.py\", line 718, in _build_column_names\n", " header: str = next(source).rstrip(\"\\r\\n\")\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/utils/closableiter.py\", line 30, in __next__\n", " return self.s.__next__()\n", "StopIteration\n", "\n", "During handling of the above exception, another exception occurred:\n", "\n", "Traceback (most recent call last):\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/cli/cat.py\", line 152, in run\n", " kc.process()\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/join/kgtkcat.py\", line 91, in process\n", " very_verbose=self.very_verbose,\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/io/kgtkreader.py\", line 513, in open\n", " (header, column_names) = cls._build_column_names(source, options, error_file=error_file, 
verbose=verbose)\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/io/kgtkreader.py\", line 720, in _build_column_names\n", " raise ValueError(\"No header line in file\")\n", "ValueError: No header line in file\n", "\n", "During handling of the above exception, another exception occurred:\n", "\n", "Traceback (most recent call last):\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/exceptions.py\", line 42, in __call__\n", " return_code = func(*args, **kwargs) or 0\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/cli/cat.py\", line 160, in run\n", " raise KGTKException(str(e))\n", "kgtk.exceptions.KGTKException: No header line in file\n", "No header line in file\n", " 0.77 real 0.54 user 0.12 sys\n", "No header line in file\n", "In input header '': Column 0 has an invalid name in the file header\n", "In input header '': Column 0 has an invalid name in the file header\n", "Exit requested\n", "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/exceptions.py:62: UserWarning: Please raise KGTKException instead of \n", " warnings.warn('Please raise KGTKException instead of {}'.format(type_))\n", "KGTKException found\n", "\n" ] } ], "source": [ "!$kgtk cat --mode NONE -i $TEMP/P279star.1.tsv.gz $TEMP/P279star.2.tsv.gz $TEMP/P279star.3.tsv.gz $TEMP/P279star.4.tsv.gz \\\n", " | kgtk compact \\\n", " | kgtk sort2 \\\n", " | kgtk add-id --id-style node1-label-node2-num \\\n", " | gzip > $OUT/all.P279star.tsv.gz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is difficult to test with our Wikidata subset because our hierarchy is very sparse.\n", "\n", "This is how we would express the typical SPARQL pattern `?item P31/P279* ?class` in Kypher. \n", "The example counts all the `n1` that are instances of beer (Q44) or of one of its subclasses." 
] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Traceback (most recent call last):\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/cli/query.py\", line 148, in run\n", " index=options.get('index'))\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/query.py\", line 180, in __init__\n", " store.add_graph(file)\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/sqlstore.py\", line 565, in add_graph\n", " self.import_graph_data_via_import(table, file)\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/sqlstore.py\", line 630, in import_graph_data_via_import\n", " if header.endswith('\\r\\n'):\n", "TypeError: endswith first arg must be bytes or a tuple of bytes, not str\n", "\n", "During handling of the above exception, another exception occurred:\n", "\n", "Traceback (most recent call last):\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/exceptions.py\", line 42, in __call__\n", " return_code = func(*args, **kwargs) or 0\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/cli/query.py\", line 180, in run\n", " raise KGTKException(str(e) + '\\n')\n", "kgtk.exceptions.KGTKException: endswith first arg must be bytes or a tuple of bytes, not str\n", "\n", "endswith first arg must be bytes or a tuple of bytes, not str\n", "\n", " 0.60 real 0.49 user 0.10 sys\n" ] } ], "source": [ "!$kgtk query -i $OUT/all.P31.tsv.gz -i $OUT/all.P279star.tsv.gz --graph-cache $STORE -o - \\\n", " --match 'P31: (n1)-[:P31]->(c), P279star: (c)-[]->(:Q44)' \\\n", " --return 'count(n1) as count'" ] }, { "cell_type": "markdown", 
"metadata": {}, "source": [ "### Create a file to do generalized Is-A queries\n", "The idea is that `(n1)-[:isa]->(n2)` when `(n1)-[:P31]->(n2)` or `(n1)-[:P279]->(n2)`\n", "\n", "We do this by concatenating the files and renaming the relation" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 406.55 real 404.76 user 1.41 sys\n" ] } ], "source": [ "!$kgtk cat -i $OUT/all.P31.tsv.gz $OUT/all.P279.tsv.gz \\\n", " | gzip > $TEMP/isa.1.tsv.gz" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2020-10-22 02:39:33 sqlstore]: IMPORT graph directly into table graph_8 from /Users/pedroszekely/Downloads/kypher/temp.useful_wikidata_files_2/isa.1.tsv.gz ...\n", "[2020-10-22 02:43:46 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT graph_8_c1.\"node1\", ? \"label\", graph_8_c1.\"node2\"\n", " FROM graph_8 AS graph_8_c1\n", " PARAS: ['isa']\n", "---------------------------------------------\n", " 600.27 real 757.12 user 17.87 sys\n" ] } ], "source": [ "!$kgtk query -i $TEMP/isa.1.tsv.gz --graph-cache $STORE -o $OUT/all.isa.tsv.gz \\\n", " --match '(n1)-[]->(n2)' \\\n", " --return 'n1, \"isa\" as label, n2' " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Example of how to use the `isa` relation" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2020-10-22 02:49:34 sqlstore]: IMPORT graph directly into table graph_9 from /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/all.isa.tsv.gz ...\n", "Traceback (most recent call last):\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/cli/query.py\", line 148, in run\n", " index=options.get('index'))\n", " File 
\"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/query.py\", line 180, in __init__\n", " store.add_graph(file)\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/sqlstore.py\", line 565, in add_graph\n", " self.import_graph_data_via_import(table, file)\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/sqlstore.py\", line 630, in import_graph_data_via_import\n", " if header.endswith('\\r\\n'):\n", "TypeError: endswith first arg must be bytes or a tuple of bytes, not str\n", "\n", "During handling of the above exception, another exception occurred:\n", "\n", "Traceback (most recent call last):\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/exceptions.py\", line 42, in __call__\n", " return_code = func(*args, **kwargs) or 0\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/cli/query.py\", line 180, in run\n", " raise KGTKException(str(e) + '\\n')\n", "kgtk.exceptions.KGTKException: endswith first arg must be bytes or a tuple of bytes, not str\n", "\n", "endswith first arg must be bytes or a tuple of bytes, not str\n", "\n", " 150.87 real 262.99 user 5.51 sys\n" ] } ], "source": [ "!$kgtk query -i $OUT/all.isa.tsv.gz -i $OUT/all.P279star.tsv.gz --graph-cache $STORE -o - \\\n", " --match 'isa: (n1)-[l:isa]->(c), P279star: (c)-[]->(:Q44)' \\\n", " --return 'distinct n1, l.label, \"Q44\" as node2' \\\n", " --limit 10" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creating a subset of Wikidata without scholarly articles (Q13442814)\n", "First, create a file with the scholarly articles." ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Traceback (most recent 
call last):\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/cli/query.py\", line 148, in run\n", " index=options.get('index'))\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/query.py\", line 180, in __init__\n", " store.add_graph(file)\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/sqlstore.py\", line 565, in add_graph\n", " self.import_graph_data_via_import(table, file)\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/sqlstore.py\", line 630, in import_graph_data_via_import\n", " if header.endswith('\\r\\n'):\n", "TypeError: endswith first arg must be bytes or a tuple of bytes, not str\n", "\n", "During handling of the above exception, another exception occurred:\n", "\n", "Traceback (most recent call last):\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/exceptions.py\", line 42, in __call__\n", " return_code = func(*args, **kwargs) or 0\n", " File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/cli/query.py\", line 180, in run\n", " raise KGTKException(str(e) + '\\n')\n", "kgtk.exceptions.KGTKException: endswith first arg must be bytes or a tuple of bytes, not str\n", "\n", "endswith first arg must be bytes or a tuple of bytes, not str\n", "\n", " 0.59 real 0.48 user 0.10 sys\n" ] } ], "source": [ "!$kgtk query -i $OUT/all.isa.tsv.gz -i $OUT/all.P279star.tsv.gz --graph-cache $STORE -o $OUT/all.isa.Q13442814.tsv.gz \\\n", " --match 'isa: (n1)-[l:isa]->(n2:Q13442814)' \\\n", " --return 'distinct n1, l.label, n2'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we need to remove from `$EDGES` any edge where node1 or node2 is in node1 of `$OUT/all.isa.Q13442814.tsv`. 
The result will be `$OUT/minus.Q13442814.tsv`. We can then rerun the whole notebook with this new file as `$EDGES` and compute all the derived files in a new output directory." ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "!gzcat $OUT/all.isa.Q13442814.tsv.gz | head | column -t -s $'\\t' " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 7479 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/all-distribution.tsv\n", " 45941 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/all.P279.tsv.gz\n", " 0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/all.P279star.tsv.gz\n", " 2206506 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/all.P31.tsv.gz\n", " 2254104 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/all.P31_P279.tsv.gz\n", " 0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/all.isa.Q13442814.tsv.gz\n", " 1208961 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/all.isa.tsv.gz\n", " 0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/all.wikidatatype.distribution.tsv.gz\n", " 374344 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.alias.en.tsv.gz\n", " 383165 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.alias.tsv.gz\n", " 0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.commonsMedia.tsv.gz\n", " 2253868 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.description.en.tsv.gz\n", " 3688188 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.description.tsv.gz\n", " 0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.external-id.tsv.gz\n", " 0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.geo-shape.tsv.gz\n", " 0 
/Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.globe-coordinate.tsv.gz\n", " 7895945 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.label.en.tsv.gz\n", " 8034944 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.label.tsv.gz\n", " 0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.math.tsv.gz\n", " 0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.monolingualtext.tsv.gz\n", " 0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.musical-notation.tsv.gz\n", " 0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.quantity.tsv.gz\n", " 0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.string.tsv.gz\n", " 0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.time.tsv.gz\n", " 1351961 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.type.tsv.gz\n", " 0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.url.tsv.gz\n", " 0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.wikibase-form.tsv.gz\n", " 0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.wikibase-item.tsv.gz\n", " 0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.wikibase-property.tsv.gz\n", " 3404579 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.wikipedia_sitelink.tsv.gz\n", " 65610166 /Users/pedroszekely/Downloads/kypher/almost.all.edges.sorted.tsv.gz\n", " 98720151 total\n" ] } ], "source": [ "!wc -l $OUT/*.tsv $OUT/*.tsv.gz $EDGES" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Number of distinct items in our dataset" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2020-10-22 02:52:30 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT count(DISTINCT graph_1_c1.\"node1\") \"count\"\n", " FROM graph_1 AS graph_1_c1\n", " 
PARAS: []\n", "---------------------------------------------\n", "count\n", "88228944\n", " 1364.75 real 1000.96 user 122.75 sys\n" ] } ], "source": [ "!$kgtk query -i \"$EDGES\" --graph-cache $STORE -o - \\\n", " --match '(n1)-[]->()' \\\n", " --return 'count(distinct n1) as count'" ] } ], "metadata": { "celltoolbar": "Tags", "kernelspec": { "display_name": "kgtk", "language": "python", "name": "kgtk" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.8" } }, "nbformat": 4, "nbformat_minor": 4 }