{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Creating a subset of Wikidata\n", "\n", "This notebook illustrates how to create a subset of Wikidata. As an example, we use https://www.wikidata.org/wiki/Q11173 (chemical compound).\n", "\n", "Parameters are set up in the first code cell so that we can run this notebook in batch mode. Example invocation command:\n", "\n", "```\n", "papermill Example8\\ -\\ Wikidata\\ Subset.ipynb example8.out.ipynb \\\n", "-p wikidata_parts_path /Users/pedroszekely/Downloads/kypher/output.all.10 \\\n", "-p subset_name Q11173 \\\n", "-p output_path /Users/pedroszekely/Downloads/kypher \\\n", "-p cache_path /Users/pedroszekely/Downloads/kypher \\\n", "-p delete_database no \\\n", "-p hops_right 0\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Parameters for invoking the notebook\n", "\n", "- `wikidata_parts_path`: a folder containing the part files of Wikidata, including files such as `part.wikibase-item.tsv.gz`\n", "- `subset_name`: the name of the subset being created. In the current implementation the `subset_name` must be a q-node in Wikidata representing a class.\n", "- `output_path`: the path where a folder will be created to hold the KGTK files for the subset. A folder named `subset_name` will be created in this folder.\n", "- `cache_path`: the path of a folder where the Kypher SQL database will be created.\n", "- `delete_database`: whether to delete the SQL database before running the notebook: \"\" or \"no\" means don't delete it.\n", "- `hops_right`: after getting the initial collection of q-nodes for the subset, how many hops forward to follow links; can be 0, 1 or 2."
] }, { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [ "parameters" ] }, "outputs": [], "source": [ "# Parameters\n", "wikidata_parts_path = \"/Users/pedroszekely/Downloads/kypher/useful_wikidata_files\"\n", "#wikidata_parts_path = \"/Users/pedroszekely/Downloads/kypher/output.all.10\"\n", "subset_name = \"Q11173\"\n", "#subset_name = \"Q318\"\n", "subset_name = \"Q5\"\n", "subset_name = \"Q44\"\n", "output_path = \"/Users/pedroszekely/Downloads/kypher\"\n", "cache_path = \"/Users/pedroszekely/Downloads/kypher\"\n", "hops_right = \"1\"\n", "delete_database = \"no\"" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "temp_folder = subset_name + \"-temp\"\n", "output_folder = subset_name" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import io\n", "import os\n", "import subprocess\n", "import sys\n", "\n", "import numpy as np\n", "import pandas as pd\n", "\n", "# from IPython.display import display, HTML, Image\n", "# from pandas_profiling import ProfileReport" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A convenience function to run templatized commands, substituting NAME with the name of the dataset and substituting other keys provided in a dictionary."
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "def run_command(command, substitution_dictionary=None):\n", "    \"\"\"Run a templatized command.\"\"\"\n", "    cmd = command.replace(\"NAME\", subset_name)\n", "    # Avoid a mutable default argument; treat None as no extra substitutions.\n", "    for k, v in (substitution_dictionary or {}).items():\n", "        cmd = cmd.replace(k, v)\n", "\n", "    print(cmd)\n", "    output = subprocess.run(cmd, shell=True, universal_newlines=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)\n", "    print(output.stdout)\n", "    print(output.stderr)\n", "    #print(output.returncode)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Set up environment variables and folders that we need\n", "We need to define environment variables to pass to the KGTK commands." ] }, { "cell_type": "code", "execution_count": 133, "metadata": {}, "outputs": [], "source": [ "# folder containing wikidata broken down into smaller files.\n", "os.environ['WIKIDATA_PARTS'] = wikidata_parts_path\n", "# name of the dataset\n", "os.environ['NAME'] = subset_name\n", "# folder where to put the output\n", "os.environ['OUT'] = \"{}/{}\".format(output_path, output_folder)\n", "# temporary folder\n", "os.environ['TEMP'] = \"{}/{}\".format(output_path, temp_folder)\n", "# kgtk command to run\n", "os.environ['kgtk'] = \"kgtk\"\n", "# os.environ['kgtk'] = \"time kgtk --debug\"\n", "# absolute path of the db\n", "if cache_path:\n", "    os.environ['STORE'] = \"{}/wikidata.sqlite3.db\".format(cache_path)\n", "else:\n", "    os.environ['STORE'] = \"{}/{}/wikidata.sqlite3.db\".format(output_path, temp_folder)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/Users/pedroszekely/Documents/GitHub/kgtk/examples/Q44\n" ] } ], "source": [ "cd $output_folder" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "mkdir: Q44: File exists\n", "mkdir: Q44-temp: File
exists\n" ] } ], "source": [ "!mkdir $output_folder\n", "!mkdir $temp_folder" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "rm: /Users/pedroszekely/Downloads/kypher/Q44/*.tsv: No such file or directory\n" ] } ], "source": [ "!rm $OUT/*.tsv $OUT/*.tsv.gz\n", "!rm $TEMP/*.tsv $TEMP/*.tsv.gz" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "if delete_database and delete_database != \"no\":\n", " print(\"Deleted database\")\n", " !rm $STORE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Extract the Q-nodes for the items we want\n", "Here we assume that the subset is for an individual q-node, so that the subset name is the name of the q-node. We should generalize this so that this query can be passed in as a parameter. We construct a file that contains all the node1s that are isa of the given NAME q-node." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "$kgtk query -i $WIKIDATA_PARTS/all.isa.tsv.gz --graph-cache $STORE -o $TEMP/qnodelist.Q44.tsv.gz --match 'isa: (n1)-[l:isa]->(n2:Q44)' --return 'distinct n1, l.label, n2'\n", "\n", "\n" ] } ], "source": [ "command = \"$kgtk query -i $WIKIDATA_PARTS/all.isa.tsv.gz \\\n", " --graph-cache $STORE \\\n", " -o $TEMP/qnodelist.NAME.tsv.gz \\\n", " --match 'isa: (n1)-[l:isa]->(n2:NAME)' \\\n", " --return 'distinct n1, l.label, n2'\"\n", "run_command(command)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "node1\tlabel\tnode2\n", "Q2579953\tisa\tQ44\n", "Q15883984\tisa\tQ44\n", "Q63379154\tisa\tQ44\n", "Q3699039\tisa\tQ44\n", "Q3360035\tisa\tQ44\n", "Q16070652\tisa\tQ44\n", "Q85313643\tisa\tQ44\n", "Q897293\tisa\tQ44\n", "Q999745\tisa\tQ44\n" ] } ], "source": [ "!gzcat $TEMP/qnodelist.$NAME.tsv.gz | head " 
] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 454 1362 7887\n" ] } ], "source": [ "!gzcat $TEMP/qnodelist.$NAME.tsv.gz | wc " ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "$kgtk query -i $WIKIDATA_PARTS/all.P279star.tsv.gz --graph-cache $STORE -o $TEMP/all.P279star.Q44.tsv.gz --match '(n1)-[l:P279star]->(n2:Q44)' --return 'distinct n1, l.label, n2'\n", "\n", "\n" ] } ], "source": [ "command = \"$kgtk query -i $WIKIDATA_PARTS/all.P279star.tsv.gz \\\n", " --graph-cache $STORE \\\n", " -o $TEMP/all.P279star.NAME.tsv.gz \\\n", " --match '(n1)-[l:P279star]->(n2:NAME)' \\\n", " --return 'distinct n1, l.label, n2'\"\n", "run_command(command)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 320 960 7110\n" ] } ], "source": [ "!gzcat $TEMP/all.P279star.$NAME.tsv.gz | wc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Generate the nodes one hop to the right" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "hops_right_count = int(hops_right)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "$kgtk query -i $TEMP/qnodelist.Q44.tsv.gz -i $WIKIDATA_PARTS/part.wikibase-item.tsv.gz -o $TEMP/Q44.hop.right1.tsv.gz --graph-cache $STORE --match 'qnodelist: (n1)-[]->(), `wikibase-item`: (n1)-[]->(n2), `wikibase-item`: (n2)-[l]->(n3)' --return 'distinct l, n2 as node1, l.label as label, n3 as node2'\n", "\n", "\n" ] } ], "source": [ "command = \"$kgtk query \\\n", " -i $TEMP/qnodelist.NAME.tsv.gz \\\n", " -i $WIKIDATA_PARTS/part.wikibase-item.tsv.gz \\\n", " -o $TEMP/NAME.hop.right1.tsv.gz \\\n", " --graph-cache $STORE \\\n", " --match 'qnodelist: (n1)-[]->(),
`wikibase-item`: (n1)-[]->(n2), `wikibase-item`: (n2)-[l]->(n3)' \\\n", " --return 'distinct l, n2 as node1, l.label as label, n3 as node2'\" \n", "\n", "if hops_right_count > 0:\n", " run_command(command)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a dummy empty hop file so that the `kgtk cat` command below doesn't fail if the number of hops is zero." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "!echo -e \"node1\\tlabel\\tnode2\\tid\" | gzip > $TEMP/$NAME.hop.dummy.tsv.gz" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "!kgtk cat -i $TEMP/$NAME.hop.*.tsv.gz $TEMP/all.P279star.$NAME.tsv.gz $TEMP/qnodelist.$NAME.tsv.gz | gzip > $TEMP/$NAME.all-items.tsv.gz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Generate the parts of this dataset" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "$kgtk query -i $TEMP/Q44.all-items.tsv.gz -i $WIKIDATA_PARTS/part.time.tsv.gz --graph-cache $STORE -o $OUT/Q44.part.time.tsv.gz --match 'Q44: (n1)-[]->(), `time`: (n1)-[l]->(n2)' --return 'distinct l, n1, l.label, n2' --order-by 'n1, l.label, n2'\n", "\n", "\n", "$kgtk query -i $TEMP/Q44.all-items.tsv.gz -i $WIKIDATA_PARTS/part.wikibase-item.tsv.gz --graph-cache $STORE -o $OUT/Q44.part.wikibase-item.tsv.gz --match 'Q44: (n1)-[]->(), `wikibase-item`: (n1)-[l]->(n2)' --return 'distinct l, n1, l.label, n2' --order-by 'n1, l.label, n2'\n", "\n", "\n", "$kgtk query -i $TEMP/Q44.all-items.tsv.gz -i $WIKIDATA_PARTS/part.math.tsv.gz --graph-cache $STORE -o $OUT/Q44.part.math.tsv.gz --match 'Q44: (n1)-[]->(), `math`: (n1)-[l]->(n2)' --return 'distinct l, n1, l.label, n2' --order-by 'n1, l.label, n2'\n", "\n", "\n", "$kgtk query -i $TEMP/Q44.all-items.tsv.gz -i $WIKIDATA_PARTS/part.wikibase-form.tsv.gz --graph-cache $STORE -o $OUT/Q44.part.wikibase-form.tsv.gz --match 'Q44:
(n1)-[]->(), `wikibase-form`: (n1)-[l]->(n2)' --return 'distinct l, n1, l.label, n2' --order-by 'n1, l.label, n2'\n", "\n", "\n", "$kgtk query -i $TEMP/Q44.all-items.tsv.gz -i $WIKIDATA_PARTS/part.quantity.tsv.gz --graph-cache $STORE -o $OUT/Q44.part.quantity.tsv.gz --match 'Q44: (n1)-[]->(), `quantity`: (n1)-[l]->(n2)' --return 'distinct l, n1, l.label, n2' --order-by 'n1, l.label, n2'\n", "\n", "\n", "$kgtk query -i $TEMP/Q44.all-items.tsv.gz -i $WIKIDATA_PARTS/part.string.tsv.gz --graph-cache $STORE -o $OUT/Q44.part.string.tsv.gz --match 'Q44: (n1)-[]->(), `string`: (n1)-[l]->(n2)' --return 'distinct l, n1, l.label, n2' --order-by 'n1, l.label, n2'\n", "\n", "\n", "$kgtk query -i $TEMP/Q44.all-items.tsv.gz -i $WIKIDATA_PARTS/part.external-id.tsv.gz --graph-cache $STORE -o $OUT/Q44.part.external-id.tsv.gz --match 'Q44: (n1)-[]->(), `external-id`: (n1)-[l]->(n2)' --return 'distinct l, n1, l.label, n2' --order-by 'n1, l.label, n2'\n", "\n", "\n", "$kgtk query -i $TEMP/Q44.all-items.tsv.gz -i $WIKIDATA_PARTS/part.commonsMedia.tsv.gz --graph-cache $STORE -o $OUT/Q44.part.commonsMedia.tsv.gz --match 'Q44: (n1)-[]->(), `commonsMedia`: (n1)-[l]->(n2)' --return 'distinct l, n1, l.label, n2' --order-by 'n1, l.label, n2'\n", "\n", "\n", "$kgtk query -i $TEMP/Q44.all-items.tsv.gz -i $WIKIDATA_PARTS/part.globe-coordinate.tsv.gz --graph-cache $STORE -o $OUT/Q44.part.globe-coordinate.tsv.gz --match 'Q44: (n1)-[]->(), `globe-coordinate`: (n1)-[l]->(n2)' --return 'distinct l, n1, l.label, n2' --order-by 'n1, l.label, n2'\n", "\n", "\n", "$kgtk query -i $TEMP/Q44.all-items.tsv.gz -i $WIKIDATA_PARTS/part.monolingualtext.tsv.gz --graph-cache $STORE -o $OUT/Q44.part.monolingualtext.tsv.gz --match 'Q44: (n1)-[]->(), `monolingualtext`: (n1)-[l]->(n2)' --return 'distinct l, n1, l.label, n2' --order-by 'n1, l.label, n2'\n", "\n", "\n", "$kgtk query -i $TEMP/Q44.all-items.tsv.gz -i $WIKIDATA_PARTS/part.musical-notation.tsv.gz --graph-cache $STORE -o $OUT/Q44.part.musical-notation.tsv.gz 
--match 'Q44: (n1)-[]->(), `musical-notation`: (n1)-[l]->(n2)' --return 'distinct l, n1, l.label, n2' --order-by 'n1, l.label, n2'\n", "\n", "\n", "$kgtk query -i $TEMP/Q44.all-items.tsv.gz -i $WIKIDATA_PARTS/part.geo-shape.tsv.gz --graph-cache $STORE -o $OUT/Q44.part.geo-shape.tsv.gz --match 'Q44: (n1)-[]->(), `geo-shape`: (n1)-[l]->(n2)' --return 'distinct l, n1, l.label, n2' --order-by 'n1, l.label, n2'\n", "\n", "\n", "$kgtk query -i $TEMP/Q44.all-items.tsv.gz -i $WIKIDATA_PARTS/part.wikibase-property.tsv.gz --graph-cache $STORE -o $OUT/Q44.part.wikibase-property.tsv.gz --match 'Q44: (n1)-[]->(), `wikibase-property`: (n1)-[l]->(n2)' --return 'distinct l, n1, l.label, n2' --order-by 'n1, l.label, n2'\n", "\n", "\n", "$kgtk query -i $TEMP/Q44.all-items.tsv.gz -i $WIKIDATA_PARTS/part.url.tsv.gz --graph-cache $STORE -o $OUT/Q44.part.url.tsv.gz --match 'Q44: (n1)-[]->(), `url`: (n1)-[l]->(n2)' --return 'distinct l, n1, l.label, n2' --order-by 'n1, l.label, n2'\n", "\n", "\n" ] } ], "source": [ "types = [\n", " \"time\",\n", " \"wikibase-item\",\n", " \"math\",\n", " \"wikibase-form\",\n", " \"quantity\",\n", " \"string\",\n", " \"external-id\",\n", " \"commonsMedia\",\n", " \"globe-coordinate\",\n", " \"monolingualtext\",\n", " \"musical-notation\",\n", " \"geo-shape\",\n", " \"wikibase-property\",\n", " \"url\",\n", "]\n", "command = \"$kgtk query -i $TEMP/NAME.all-items.tsv.gz -i $WIKIDATA_PARTS/part.TYPE_FILE.tsv.gz --graph-cache $STORE \\\n", " -o $OUT/NAME.part.TYPE_FILE.tsv.gz \\\n", " --match 'NAME: (n1)-[]->(), `TYPE_FILE`: (n1)-[l]->(n2)' \\\n", " --return 'distinct l, n1, l.label, n2' \\\n", " --order-by 'n1, l.label, n2'\"\n", "for type_name in types:\n", "    run_command(command, {\"TYPE_FILE\": type_name})\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Generate a P279star file" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First generate the P279 and P31 edges for every node2 in the wikibase-item file."
] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "$kgtk query -i $OUT/Q44.part.wikibase-item.tsv.gz -i $WIKIDATA_PARTS/all.P279.tsv.gz --graph-cache $STORE -o $TEMP/Q44.node2.P279.tsv.gz --match 'Q44: ()-[]->(n1), P279: (n1)-[l]->(n2)' --return 'distinct l, n1 as node1, l.label, n2' --order-by 'n1, l.label, n2'\n", "\n", "\n", "$kgtk query -i $OUT/Q44.part.wikibase-item.tsv.gz -i $WIKIDATA_PARTS/all.P31.tsv.gz --graph-cache $STORE -o $TEMP/Q44.node2.P31.tsv.gz --match 'Q44: ()-[]->(n1), P31: (n1)-[l]->(n2)' --return 'distinct l, n1 as node1, l.label, n2' --order-by 'n1, l.label, n2'\n", "\n", "\n" ] } ], "source": [ "command_p279 = \"$kgtk query -i $OUT/NAME.part.wikibase-item.tsv.gz -i $WIKIDATA_PARTS/all.P279.tsv.gz --graph-cache $STORE \\\n", "-o $TEMP/NAME.node2.P279.tsv.gz \\\n", "--match 'NAME: ()-[]->(n1), P279: (n1)-[l]->(n2)' \\\n", "--return 'distinct l, n1 as node1, l.label, n2' \\\n", "--order-by 'n1, l.label, n2'\"\n", "\n", "command_p31 = \"$kgtk query -i $OUT/NAME.part.wikibase-item.tsv.gz -i $WIKIDATA_PARTS/all.P31.tsv.gz --graph-cache $STORE \\\n", "-o $TEMP/NAME.node2.P31.tsv.gz \\\n", "--match 'NAME: ()-[]->(n1), P31: (n1)-[l]->(n2)' \\\n", "--return 'distinct l, n1 as node1, l.label, n2' \\\n", "--order-by 'n1, l.label, n2'\"\n", "\n", "run_command(command_p279)\n", "run_command(command_p31)\n", "\n", "!$kgtk cat -i $TEMP/$NAME.node2.P279.tsv.gz $TEMP/$NAME.node2.P31.tsv.gz | gzip > $TEMP/$NAME.P279_P31.tsv.gz\n" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "!kgtk cat -i $OUT/$NAME.part.*.tsv.gz | gzip > $TEMP/$NAME.all_1.tsv.gz" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "$kgtk query -i $WIKIDATA_PARTS/all.P279star.tsv.gz -i $TEMP/Q44.all_1.tsv.gz --graph-cache $STORE -o $TEMP/Q44.P279star.1.tsv.gz --match 'P279star: 
(n1)-[l]->(n2), all_1: (n1)-[]->()' --return 'distinct l, n1, l.label, n2'\n", "\n", "\n", "$kgtk query -i $WIKIDATA_PARTS/all.P279star.tsv.gz -i $TEMP/Q44.all_1.tsv.gz --graph-cache $STORE -o $TEMP/Q44.P279star.2.tsv.gz --match 'P279star: (n1)-[l]->(n2), all_1: ()-[]->(n1)' --return 'distinct l, n1 as node1, l.label, n2'\n", "\n", "\n", "$kgtk cat -i $TEMP/Q44.P279star.1.tsv.gz $TEMP/Q44.P279star.2.tsv.gz | gzip > $OUT/Q44.P279star.tsv.gz\n", "\n", "\n" ] } ], "source": [ "command_node1 = \"$kgtk query -i $WIKIDATA_PARTS/all.P279star.tsv.gz -i $TEMP/NAME.all_1.tsv.gz \\\n", " --graph-cache $STORE \\\n", " -o $TEMP/NAME.P279star.1.tsv.gz \\\n", " --match 'P279star: (n1)-[l]->(n2), all_1: (n1)-[]->()' \\\n", " --return 'distinct l, n1, l.label, n2'\"\n", "\n", "command_node2 = \"$kgtk query -i $WIKIDATA_PARTS/all.P279star.tsv.gz -i $TEMP/NAME.all_1.tsv.gz \\\n", " --graph-cache $STORE \\\n", " -o $TEMP/NAME.P279star.2.tsv.gz \\\n", " --match 'P279star: (n1)-[l]->(n2), all_1: ()-[]->(n1)' \\\n", " --return 'distinct l, n1 as node1, l.label, n2'\" \n", "\n", "cat_command = \"$kgtk cat -i $TEMP/NAME.P279star.1.tsv.gz $TEMP/NAME.P279star.2.tsv.gz | gzip > $OUT/NAME.P279star.tsv.gz\"\n", "\n", "run_command(command_node1)\n", "run_command(command_node2)\n", "run_command(cat_command)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Get info on all properties" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "scrolled": true }, "outputs": [], "source": [ "!$kgtk cat -i $OUT/*.gz | gzip > $TEMP/$NAME.everything_1.tsv.gz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First get a list of all the properties used in this file." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "!$kgtk query -i $TEMP/$NAME.everything_1.tsv.gz --graph-cache $STORE \\\n", "-o $TEMP/$NAME.properties.tsv \\\n", "--match '(n1)-[l]->(n2)' \\\n", "--return 'distinct l.label as node1, \"dummy\" as label, \"dummy\"
as node2' " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now get all the info on these properties." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "!$kgtk query -i $TEMP/$NAME.properties.tsv -i $WIKIDATA_PARTS/part.wikibase-item.tsv.gz --graph-cache $STORE \\\n", "-o $OUT/$NAME.properties.tsv.gz \\\n", "--match '`wikibase-item`: (p)-[l]->(n2), properties: (p)-[]->()' \\\n", "--return 'distinct l, p, l.label, n2' " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Generate the labels, aliases and descriptions\n", "We want the labels, aliases and descriptions for every q-node in our dataset. This means that we need these labels for all q-nodes that appear in the node1 or node2 position.\n", "\n", "The first step is to concatenate all the files in our dataset." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "!$kgtk cat -i $OUT/*.gz | gzip > $TEMP/$NAME.everything_2.tsv.gz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we extract the labels from our input Wikidata folder. We do this by matching node1, then node2, and then we concatenate the resulting label files."
] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "$kgtk query -i $TEMP/Q44.everything_2.tsv.gz -i $WIKIDATA_PARTS/part.label.en.tsv.gz --graph-cache $STORE -o $TEMP/Q44.label.en.1.tsv.gz --match 'everything_2: (n1)-[]->(), part: (n1)-[l]->(n2)' --return 'distinct l, n1, l.label, n2' --order-by 'n1, l.label, n2'\n", "\n", "\n", "$kgtk query -i $TEMP/Q44.everything_2.tsv.gz -i $WIKIDATA_PARTS/part.label.en.tsv.gz --graph-cache $STORE -o $TEMP/Q44.label.en.2.tsv.gz --match 'everything_2: ()-[]->(n1), part: (n1)-[l]->(n2)' --return 'distinct l, n1 as node1, l.label, n2' --order-by 'n1, l.label, n2'\n", "\n", "\n", "kgtk cat -i $TEMP/Q44.label.*.gz | gzip > $OUT/Q44.label.en.tsv.gz\n", "\n", "\n", "$kgtk query -i $TEMP/Q44.everything_2.tsv.gz -i $WIKIDATA_PARTS/part.alias.en.tsv.gz --graph-cache $STORE -o $TEMP/Q44.alias.en.1.tsv.gz --match 'everything_2: (n1)-[]->(), part: (n1)-[l]->(n2)' --return 'distinct l, n1, l.label, n2' --order-by 'n1, l.label, n2'\n", "\n", "\n", "$kgtk query -i $TEMP/Q44.everything_2.tsv.gz -i $WIKIDATA_PARTS/part.alias.en.tsv.gz --graph-cache $STORE -o $TEMP/Q44.alias.en.2.tsv.gz --match 'everything_2: ()-[]->(n1), part: (n1)-[l]->(n2)' --return 'distinct l, n1 as node1, l.label, n2' --order-by 'n1, l.label, n2'\n", "\n", "\n", "kgtk cat -i $TEMP/Q44.alias.*.gz | gzip > $OUT/Q44.alias.en.tsv.gz\n", "\n", "\n", "$kgtk query -i $TEMP/Q44.everything_2.tsv.gz -i $WIKIDATA_PARTS/part.description.en.tsv.gz --graph-cache $STORE -o $TEMP/Q44.description.en.1.tsv.gz --match 'everything_2: (n1)-[]->(), part: (n1)-[l]->(n2)' --return 'distinct l, n1, l.label, n2' --order-by 'n1, l.label, n2'\n", "\n", "\n", "$kgtk query -i $TEMP/Q44.everything_2.tsv.gz -i $WIKIDATA_PARTS/part.description.en.tsv.gz --graph-cache $STORE -o $TEMP/Q44.description.en.2.tsv.gz --match 'everything_2: ()-[]->(n1), part: (n1)-[l]->(n2)' --return 'distinct l, n1 as node1, l.label, n2' --order-by 
'n1, l.label, n2'\n", "\n", "\n", "kgtk cat -i $TEMP/Q44.description.*.gz | gzip > $OUT/Q44.description.en.tsv.gz\n", "\n", "\n" ] } ], "source": [ "labels = [\n", " \"label\",\n", " \"alias\",\n", " \"description\"\n", "]\n", "\n", "command_node1 = \"$kgtk query -i $TEMP/NAME.everything_2.tsv.gz -i $WIKIDATA_PARTS/part.LABEL.en.tsv.gz --graph-cache $STORE \\\n", " -o $TEMP/NAME.LABEL.en.1.tsv.gz \\\n", " --match 'everything_2: (n1)-[]->(), part: (n1)-[l]->(n2)' \\\n", " --return 'distinct l, n1, l.label, n2' \\\n", " --order-by 'n1, l.label, n2'\"\n", "\n", "command_node2 = \"$kgtk query -i $TEMP/NAME.everything_2.tsv.gz -i $WIKIDATA_PARTS/part.LABEL.en.tsv.gz --graph-cache $STORE \\\n", " -o $TEMP/NAME.LABEL.en.2.tsv.gz \\\n", " --match 'everything_2: ()-[]->(n1), part: (n1)-[l]->(n2)' \\\n", " --return 'distinct l, n1 as node1, l.label, n2' \\\n", " --order-by 'n1, l.label, n2'\"\n", "\n", "command_label = \"$kgtk query -i $TEMP/NAME.everything_2.tsv.gz -i $WIKIDATA_PARTS/part.LABEL.en.tsv.gz --graph-cache $STORE \\\n", " -o $TEMP/NAME.LABEL.en.3.tsv.gz \\\n", " --match 'everything_2: ()-[l {label: n1}]->(), part: (n1)-[l]->(n2)' \\\n", " --return 'distinct l, n1 as node1, l.label, n2' \\\n", " --order-by 'n1, l.label, n2'\"\n", "\n", "cat_command = \"kgtk cat -i $TEMP/NAME.LABEL.*.gz | gzip > $OUT/NAME.LABEL.en.tsv.gz\"\n", "\n", "for label in labels:\n", " run_command(command_node1, {\"LABEL\": label})\n", " run_command(command_node2, {\"LABEL\": label})\n", " run_command(cat_command, {\"LABEL\": label})\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Summary of what we got" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Q44.P279star.tsv.gz 183410\n", "Q44.alias.en.tsv.gz 36266\n", "Q44.description.en.tsv.gz 27447\n", "Q44.label.en.tsv.gz 36591\n", "Q44.part.commonsMedia.tsv.gz 1386\n", "Q44.part.external-id.tsv.gz 8406\n", "Q44.part.geo-shape.tsv.gz 
75\n", "Q44.part.globe-coordinate.tsv.gz 416\n", "Q44.part.math.tsv.gz 1\n", "Q44.part.monolingualtext.tsv.gz 3594\n", "Q44.part.musical-notation.tsv.gz 1\n", "Q44.part.quantity.tsv.gz 25281\n", "Q44.part.string.tsv.gz 1314\n", "Q44.part.time.tsv.gz 366\n", "Q44.part.url.tsv.gz 314\n", "Q44.part.wikibase-form.tsv.gz 1\n", "Q44.part.wikibase-item.tsv.gz 21942\n", "Q44.part.wikibase-property.tsv.gz 20\n", "Q44.properties.tsv.gz 10582\n" ] } ], "source": [ "%%bash\n", "for f in $OUT/*.tsv.gz; do\n", " echo -n `basename $f`\n", " gzcat $f | wc -l\n", "done" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unzip the everything file, as graph-statistics cannot work with gz files." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "rm: /Users/pedroszekely/Downloads/kypher/Q44-temp/Q44.everything_2.tsv: No such file or directory\n" ] } ], "source": [ "!rm $TEMP/$NAME.everything_2.tsv" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "!gunzip --keep $TEMP/$NAME.everything_2.tsv.gz" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "!$kgtk graph-statistics --log $OUT/$NAME.everything.statistics.txt \\\n", " --statistics-only --pagerank -i $TEMP/$NAME.everything_2.tsv \\\n", " | gzip > $OUT/$NAME.statistics.tsv.gz" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "graph loaded!
It has 53099 nodes and 257093 edges\n", "\n", "###Top relations:\n", "P279star\t183409\n", "P2302\t4031\n", "P530\t3434\n", "P1082\t3347\n", "P2936\t3235\n", "P2131\t3158\n", "P2132\t3058\n", "P2134\t2891\n", "P31\t2630\n", "P1549\t2574\n", "\n", "###PageRank\n", "Max pageranks\n", "44\tQ4406616\t0.001093\n", "43\tQ44\t0.001219\n", "46\tQ488383\t0.001689\n", "38\tQ35120\t0.001846\n", "57\tnovalue\t0.012314\n" ] } ], "source": [ "!cat $OUT/$NAME.everything.statistics.txt" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "total 12712\n", "-rw-r--r-- 1 pedroszekely staff 1.2M Oct 16 22:39 Q44.P279star.tsv.gz\n", "-rw-r--r-- 1 pedroszekely staff 407K Oct 16 22:39 Q44.alias.en.tsv.gz\n", "-rw-r--r-- 1 pedroszekely staff 479K Oct 16 22:39 Q44.description.en.tsv.gz\n", "-rw-r--r-- 1 pedroszekely staff 304B Oct 16 22:39 Q44.everything.statistics.txt\n", "-rw-r--r-- 1 pedroszekely staff 446K Oct 16 22:39 Q44.label.en.tsv.gz\n", "-rw-r--r-- 1 pedroszekely staff 22K Oct 16 22:38 Q44.part.commonsMedia.tsv.gz\n", "-rw-r--r-- 1 pedroszekely staff 93K Oct 16 22:38 Q44.part.external-id.tsv.gz\n", "-rw-r--r-- 1 pedroszekely staff 938B Oct 16 22:38 Q44.part.geo-shape.tsv.gz\n", "-rw-r--r-- 1 pedroszekely staff 6.3K Oct 16 22:38 Q44.part.globe-coordinate.tsv.gz\n", "-rw-r--r-- 1 pedroszekely staff 62B Oct 16 22:38 Q44.part.math.tsv.gz\n", "-rw-r--r-- 1 pedroszekely staff 38K Oct 16 22:38 Q44.part.monolingualtext.tsv.gz\n", "-rw-r--r-- 1 pedroszekely staff 74B Oct 16 22:38 Q44.part.musical-notation.tsv.gz\n", "-rw-r--r-- 1 pedroszekely staff 201K Oct 16 22:38 Q44.part.quantity.tsv.gz\n", "-rw-r--r-- 1 pedroszekely staff 15K Oct 16 22:38 Q44.part.string.tsv.gz\n", "-rw-r--r-- 1 pedroszekely staff 3.8K Oct 16 22:38 Q44.part.time.tsv.gz\n", "-rw-r--r-- 1 pedroszekely staff 5.6K Oct 16 22:38 Q44.part.url.tsv.gz\n", "-rw-r--r-- 1 pedroszekely staff 71B Oct 16 22:38 Q44.part.wikibase-form.tsv.gz\n", 
"-rw-r--r-- 1 pedroszekely staff 154K Oct 16 22:38 Q44.part.wikibase-item.tsv.gz\n", "-rw-r--r-- 1 pedroszekely staff 239B Oct 16 22:38 Q44.part.wikibase-property.tsv.gz\n", "-rw-r--r-- 1 pedroszekely staff 76K Oct 16 22:39 Q44.properties.tsv.gz\n", "-rw-r--r-- 1 pedroszekely staff 1.6M Oct 16 22:39 Q44.statistics.tsv.gz\n" ] } ], "source": [ "!ls -lh $OUT" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Example of how to get statistics on the properties. " ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "property_id property_label value value_count\n", "P106 'Entrepreneur'@en-ca Q131524 3\n", "P106 'entrepreneur'@en Q131524 3\n", "P106 'entrepreneur'@en-gb Q131524 3\n", "P106 'toy maker'@en Q2310380 1\n" ] } ], "source": [ "!kgtk query -i $TEMP/$NAME.everything_2.tsv.gz -i $WIKIDATA_PARTS/part.label.en.tsv.gz --graph-cache $STORE \\\n", "--match 'everything: (n1)-[l:P106]->(n2), label: (n2)-[:label]->(label)' \\\n", "--return 'distinct l.label as property_id, label as property_label, n2 as value, count(n2) as value_count' \\\n", "--order-by 'count(n2) desc' \\\n", "--limit 10 \\\n", "| column -t -s $'\\t' " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Distribution of label/node2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Entity Profiles\n", "The cells in this section should be moved to a new `Example10 Entity Profiler` notebook" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Entity profiler for items\n", "Get distinct P31(node1)/label/node2 triples, and count the number of instances of such edges.\n", "\n", "Represent the result as KGTK edges:\n", "- `node1`: the property, ie the `label` in our definition\n", "- `label`: a new property we call `Pprofiler_count`\n", "- `node2`: the count\n", "\n", "Use qualifiers to represent the context:\n", "- `Pcontext_item`: represents the `node2` in our definition\n", "- 
`Pcontext_type`: represents `P31(node1)` in our definition" ] }, { "cell_type": "code", "execution_count": 151, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Pcontext_type node1_dummy Pcontext_item node2 node1;label Pcontext_item;label Pcontext_type;label label\n", "Q1066984 P1151 Q11028213 6 'topic\\\\\\\\\\\\\\\\'s main Wikimedia portal'@en 'Portal:Munich'@en 'Financial centre'@en-ca Pprofiler_count\n", "Q1066984 P131 Q10562 18 'located in the administrative territorial entity'@en 'Upper Bavaria'@en 'Financial centre'@en-ca Pprofiler_count\n", "Q1066984 P131 Q1673724 6 'located in the administrative territorial entity'@en 'Isarkreis'@en 'Financial centre'@en-ca Pprofiler_count\n", "Q1066984 P1313 Q11902879 36 'office held by head of government'@en 'Lord Mayor'@en 'Financial centre'@en-ca Pprofiler_count\n", "Q1066984 P1313 Q1958954 12 'office held by head of government'@en 'list of mayors of Munich'@en 'Financial centre'@en-ca Pprofiler_count\n", "Q1066984 P1343 Q97879676 18 'described by source'@en 'Regesta Imperii XIII'@en 'Financial centre'@en-ca Pprofiler_count\n", "Q1066984 P1343 Q316838 12 'described by source'@en 'Regesta Imperii'@en 'Financial centre'@en-ca Pprofiler_count\n", "Q1066984 P1343 Q19190511 6 'described by source'@en 'New Encyclopedic Dictionary'@en 'Financial centre'@en-ca Pprofiler_count\n", "Q1066984 P1376 Q980 18 'capital of'@en 'Bavaria'@en 'Financial centre'@en-ca Pprofiler_count\n", "Q1066984 P1376 Q58738 18 'capital of'@en 'Bavarian Soviet Republic'@en 'Financial centre'@en-ca Pprofiler_count\n" ] } ], "source": [ "!$kgtk query -i $OUT/$NAME.part.wikibase-item.tsv.gz -i $OUT/$NAME.label.en.tsv.gz --graph-cache $STORE \\\n", "--match 'item: (n1)-[l {label: p}]->(n2), item: (n1)-[:P31]->(type), label: (p)-[:label]->(lab), label: (type)-[:label]->(type_label), label: (n2)-[:label]->(n2_label)' \\\n", "--where 'lab.kgtk_lqstring_lang_suffix = \"en\"' \\\n", "--return 'distinct type as Pcontext_type, 
l.label as node1_dummy, n2 as Pcontext_item, count(n1) as node2, lab as `node1;label`, n2_label as `Pcontext_item;label`, type_label as `Pcontext_type;label`, \"Pprofiler_count\" as label' \\\n", "--order-by 'type, p, count(n1) desc' \\\n", "--limit 10 \\\n", "| column -t -s $'\\t' " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The cells below compute profiles for other data types and should be refactored to follow the pattern of the Entity profiler for items" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "type prop property_label year count\n", "Q3624078 P571 'inception'@en 1991 18\n", "Q3624078 P571 'inception'@en 1918 16\n", "Q6256 P571 'inception'@en 1991 12\n", "Q6256 P571 'inception'@en 1918 10\n", "Q123480 P571 'inception'@en 1991 8\n", "Q179164 P571 'inception'@en 1991 8\n", "Q4209223 P571 'inception'@en 1991 8\n", "Q44 P571 'inception'@en 2001 8\n", "Q619610 P571 'inception'@en 1991 8\n", "Q63791824 P571 'inception'@en 1918 8\n" ] } ], "source": [ "!$kgtk query -i $OUT/$NAME.part.time.tsv.gz -i $OUT/$NAME.part.wikibase-item.tsv.gz -i $OUT/$NAME.label.en.tsv.gz --graph-cache $STORE \\\n", "--match 'time: (n1)-[l {label: p}]->(n2), item: (n1)-[:P31]->(type), label: (p)-[:label]->(lab)' \\\n", "--return 'distinct type as type, l.label as prop, lab as property_label, kgtk_date_year(n2) as year, count(n1) as count' \\\n", "--where 'lab.kgtk_lqstring_lang_suffix = \"en\"' \\\n", "--order-by 'count(n1) desc' \\\n", "--limit 10 \\\n", "| column -t -s $'\\t' " ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "type prop property_label value count\n", "Q3624078 P3000 'marriageable age'@en 18 56\n", "Q3624078 P2997 'age of majority'@en 18 53\n", "Q3624078 P2884 'mains voltage'@en 230 41\n", "Q3624078 P1279 'inflation rate'@en 1.7 39\n", "Q3624078 P1279 'inflation rate'@en 1.8 
39\n", "Q3624078 P1279 'inflation rate'@en 2.1 37\n", "Q3624078 P1279 'inflation rate'@en 1.5 32\n", "Q3624078 P1279 'inflation rate'@en 2 31\n", "Q3624078 P1279 'inflation rate'@en 2.8 31\n", "Q3624078 P1279 'inflation rate'@en 3.5 29\n" ] } ], "source": [ "!$kgtk query -i $OUT/$NAME.part.quantity.tsv.gz -i $OUT/$NAME.part.wikibase-item.tsv.gz -i $OUT/$NAME.label.en.tsv.gz --graph-cache $STORE \\\n", "--match 'quantity: (n1)-[l {label: p}]->(n2), item: (n1)-[:P31]->(type), label: (p)-[:label]->(lab)' \\\n", "--return 'distinct type as type, l.label as prop, lab as property_label, kgtk_quantity_number(n2) as value, count(n1) as count' \\\n", "--where 'lab.kgtk_lqstring_lang_suffix = \"en\"' \\\n", "--order-by 'count(n1) desc' \\\n", "--limit 10 \\\n", "| column -t -s $'\\t' " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Extending KG to include nodes with ambiguous names" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Find node2s where we have node1/label/node1_label in qnodelist such that there exists a node2/alias/node2_alias in Wikidata such that node2_alias = node1_label" ] }, { "cell_type": "code", "execution_count": 161, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "node1 node1;label node2 node2;label label\n", "Q1017471 'Bush'@en Q80857164 'The Bush'@en Pshares_name\n", "Q1017471 'Bush'@en Q80857164 'The Bush'@en-gb Pshares_name\n", "Q1017471 'Bush'@en Q21810649 'Norton Bush'@en Pshares_name\n", "Q1017471 'Bush'@en Q60614686 'The Gentlemen'@en Pshares_name\n", "Q1017471 'Bush'@en Q4888621 'Benjamin Franklin Bush'@en Pshares_name\n", "Q1017471 'Bush'@en Q54888574 'Bush, Washington'@en Pshares_name\n", "Q10350781 'Polar'@en Q1500857 'Polar Electro'@en Pshares_name\n", "Q1041750 'Carling'@en Q7230524 'Port Carling'@en Pshares_name\n", "Q12009657 'Victoria'@en Q286499 'Vitruvia'@en Pshares_name\n", "Q12009657 'Victoria'@en Q3557663 'Michel Sardou'@en Pshares_name\n" ] } ], "source": [ "!$kgtk 
query -i $TEMP/qnodelist.$NAME.tsv.gz -i $WIKIDATA_PARTS/part.label.en.tsv.gz -i $WIKIDATA_PARTS/part.alias.en.tsv.gz --graph-cache $STORE \\\n", "--match 'qnodelist: (n1)-[]->(), label: (n1)-[:label]->(n1_label), alias: (n2)-[:alias]->(n1_label), label: (n2)-[:label]->(n2_label)' \\\n", "--where 'n1 != n2' \\\n", "--return 'distinct n1 as node1, n1_label as `node1;label`, n2 as node2, n2_label as `node2;label`, \"Pshares_name\" as label' \\\n", "--limit 10 \\\n", "| column -t -s $'\\t' " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Find node2s where we have node1/alias/node1_alias in qnodelist such that there exists a node2/label/node2_label in Wikidata such that node2_label = node1_alias" ] }, { "cell_type": "code", "execution_count": 160, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "node1 node1;label node2 node2;label label\n", "Q1157108 'Cerveza Sol'@en Q64961707 'Sol'@en Pshares_name\n", "Q1157108 'Cerveza Sol'@en Q7555482 'Sol'@en Pshares_name\n", "Q1157108 'Cerveza Sol'@en Q69509964 'Sol'@en Pshares_name\n", "Q1157108 'Cerveza Sol'@en Q1237552 'Sol'@en Pshares_name\n", "Q1157108 'Cerveza Sol'@en Q7555484 'Sol'@en Pshares_name\n", "Q1157108 'Cerveza Sol'@en Q7555486 'Sol'@en Pshares_name\n", "Q1157108 'Cerveza Sol'@en Q64961436 'Sol'@en Pshares_name\n", "Q1157108 'Cerveza Sol'@en Q3489075 'Sol'@en Pshares_name\n", "Q1157108 'Cerveza Sol'@en Q37563235 'Sol'@en Pshares_name\n", "Q1157108 'Cerveza Sol'@en Q23664473 'Sol'@en Pshares_name\n" ] } ], "source": [ "!$kgtk query -i $TEMP/qnodelist.$NAME.tsv.gz -i $WIKIDATA_PARTS/part.label.en.tsv.gz -i $WIKIDATA_PARTS/part.alias.en.tsv.gz --graph-cache $STORE \\\n", "--match 'qnodelist: (n1)-[]->(), label: (n1)-[:label]->(n1_label), alias: (n1)-[:alias]->(n1_alias), label: (n2)-[:label]->(n1_alias)' \\\n", "--where 'n1 != n2' \\\n", "--return 'distinct n1 as node1, n1_label as `node1;label`, n2 as node2, n1_alias as `node2;label`, \"Pshares_name\" as label' \\\n", 
"--limit 10 \\\n", "| column -t -s $'\\t' " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Find node2s where we have node1/alias/node1_alias in qnodelist such that there exists a node2/alias/node2_alias in Wikidata such that node2_alias = node1_alias" ] }, { "cell_type": "code", "execution_count": 163, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "node1 node1;label node2 node2;label label\n", "Q1157108 'Cerveza Sol'@en Q22583558 'J-Hope'@en Pshares_name\n", "Q1157108 'Cerveza Sol'@en Q18607853 'Solomon'@en Pshares_name\n", "Q1157108 'Cerveza Sol'@en Q18607853 'Solomon'@en-ca Pshares_name\n", "Q1157108 'Cerveza Sol'@en Q18607853 'Solomon'@en-gb Pshares_name\n", "Q1157108 'Cerveza Sol'@en Q28800560 'El Sol'@en Pshares_name\n", "Q1157108 'Cerveza Sol'@en Q28800560 'El Sol'@en-ca Pshares_name\n", "Q1157108 'Cerveza Sol'@en Q28800560 'El Sol'@en-gb Pshares_name\n", "Q1157108 'Cerveza Sol'@en Q654596 'Sól'@en Pshares_name\n", "Q1157108 'Cerveza Sol'@en Q7666238 'Sól'@en Pshares_name\n", "Q1157108 'Cerveza Sol'@en Q525 'Sun'@en Pshares_name\n" ] } ], "source": [ "!$kgtk query -i $TEMP/qnodelist.$NAME.tsv.gz -i $WIKIDATA_PARTS/part.label.en.tsv.gz -i $WIKIDATA_PARTS/part.alias.en.tsv.gz --graph-cache $STORE \\\n", "--match 'qnodelist: (n1)-[]->(), label: (n1)-[:label]->(n1_label), alias: (n1)-[:alias]->(n1_alias), alias: (n2)-[:alias]->(n1_alias), label: (n2)-[:label]->(n2_label)' \\\n", "--where 'n1 != n2' \\\n", "--return 'distinct n1 as node1, n1_label as `node1;label`, n2 as node2, n2_label as `node2;label`, \"Pshares_name\" as label' \\\n", "--limit 10 \\\n", "| column -t -s $'\\t' " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Find node2s where we have node1/label/node1_label in qnodelist such that there exists a node2/label/node2_label in Wikidata such that node2_label = node1_label" ] }, { "cell_type": "code", "execution_count": 165, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": 
"stream", "text": [ "node1 node1;label node2 node2;label label\n", "Q1017471 'Bush'@en Q5001360 'Bush'@en Pshares_name\n", "Q1017471 'Bush'@en Q77894031 'Bush'@en Pshares_name\n", "Q1017471 'Bush'@en Q5001365 'Bush'@en Pshares_name\n", "Q1017471 'Bush'@en Q20482703 'Bush'@en Pshares_name\n", "Q1017471 'Bush'@en Q18793771 'Bush'@en Pshares_name\n", "Q1017471 'Bush'@en Q247949 'Bush'@en Pshares_name\n", "Q1017471 'Bush'@en Q1017464 'Bush'@en Pshares_name\n", "Q1017471 'Bush'@en Q1484464 'Bush'@en Pshares_name\n", "Q1017471 'Bush'@en Q224168 'Bush'@en Pshares_name\n", "Q1017471 'Bush'@en Q2469309 'Bush'@en Pshares_name\n" ] } ], "source": [ "!$kgtk query -i $TEMP/qnodelist.$NAME.tsv.gz -i $WIKIDATA_PARTS/part.label.en.tsv.gz --graph-cache $STORE \\\n", "--match 'qnodelist: (n1)-[]->(), label: (n1)-[:label]->(n1_label), label: (n2)-[:label]->(n1_label)' \\\n", "--where 'n1 != n2' \\\n", "--return 'distinct n1 as node1, n1_label as `node1;label`, n2 as node2, n1_label as `node2;label`, \"Pshares_name\" as label' \\\n", "--limit 10 \\\n", "| column -t -s $'\\t' " ] }, { "cell_type": "code", "execution_count": 167, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[90mid\u001b[39m Q1017471\n", "\u001b[42mLabel\u001b[49m Bush\n", "\u001b[44mDescription\u001b[49m Beer of Belgium (Wallonia)\n", "\u001b[30m\u001b[47minstance of\u001b[49m\u001b[39m \u001b[90m(P31)\u001b[39m\u001b[90m: \u001b[39mbeer brand \u001b[90m(Q15075508)\u001b[39m | beer \u001b[90m(Q44)\u001b[39m\n", "\n", "\u001b[90mid\u001b[39m Q247949\n", "\u001b[42mLabel\u001b[49m Bush\n", "\u001b[44mDescription\u001b[49m British rock band\n", "\u001b[30m\u001b[47minstance of\u001b[49m\u001b[39m \u001b[90m(P31)\u001b[39m\u001b[90m: \u001b[39mmusical group \u001b[90m(Q215380)\u001b[39m\n" ] } ], "source": [ "!wd u Q1017471 Q247949" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": 
"kgtk", "language": "python", "name": "kgtk" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.8" } }, "nbformat": 4, "nbformat_minor": 4 }