{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Find ambiguous Items\n", "\n", "This notebook illustrates how to create a subset of Wikidata of ambiguous items. Given a list of qnodes the notebook returns additional qnodes with the same label string or the same alias string.\n", "\n", "Parameters are in the first cell so that we can run this notebook in batch mode. Example invocation command:\n", "\n", "```\n", "papermill 'Example 11 - Find Ambiguous Items.ipynb'\n", "-p wikidata_parts_path /lfs1/ktyao/Shared/KGTK/datasets/wikidata-20200803-v2/useful_wikidata_files\n", "-p instance_csv_file /lfs1/ktyao/data/iswc-2019/t2dv2_gt.csv\n", "-p instance_qnode_col kg_id\n", "-p subset_name iscw\n", "-p output_path /home/ktyao/dev/kgtk-cache\n", "-p cache_path /home/ktyao/dev/kgtk-cache\n", "-p delete_database no\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Parameters for invoking the notebook\n", "\n", "- `wikidata_parts_path`: a folder containing the part files of Wikidata, including files such as `part.wikibase-item.tsv.gz`\n", "- `instance_csv_file`: the path to the csv containing the input qnodes\n", "- `instance_qnode_col`: the name of the csv table column containing the input qnodes\n", "- `subset_name`: a name of the subset being created.\n", "- `output_path`: the path where a folder will be created to hold the KGTK files for the subset. A folder named `subset_name` will be createed in this filder.\n", "- `cache_path`: the path of a folder where the Kypher SQL database will be created.\n", "- `delete_database`: whether to delete the SQL database before running the notebook: \"\" or \"no\" means don't delete it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sample output\n", "\n", "Pairs of qnodes that have common label and/or alias strings.\n", "```\n", "$ zcat shares_name.iscw.tsv.gz | head\n", "node1 node1;label node2 node2;label label\n", "Q100 'Boston'@en Q26664318 'Royal Court Theatre, Wigan'@en Pshares_name\n", "Q100 'Boston'@en Q26664318 'The Hub'@en-gb Pshares_name\n", "Q100 'Boston'@en Q27714978 'Bostonia (Boston, Mass. : 1986)'@en Pshares_name\n", "Q100 'Boston'@en-ca Q26664318 'Royal Court Theatre, Wigan'@en Pshares_name\n", "Q100 'Boston'@en-ca Q26664318 'The Hub'@en-gb Pshares_name\n", "Q100 'Boston'@en-ca Q27714978 'Bostonia (Boston, Mass. : 1986)'@en Pshares_name\n", "Q100 'Boston'@en-gb Q26664318 'Royal Court Theatre, Wigan'@en Pshares_name\n", "Q100 'Boston'@en-gb Q26664318 'The Hub'@en-gb Pshares_name\n", "Q100 'Boston'@en-gb Q27714978 'Bostonia (Boston, Mass. : 1986)'@en Pshares_name\n", "```\n", "\n", "All qnodes including input qnodes and ambiguous qnodes.\n", "```\n", "$ zcat qnodes_combined.iscw.tsv.gz | head\n", "node1 label node2\n", "Qiscw Pcontains_entity Q100\n", "Qiscw Pcontains_entity Q1000\n", "Qiscw Pcontains_entity Q1001437\n", "Qiscw Pcontains_entity Q1001910\n", "Qiscw Pcontains_entity Q1002142\n", "Qiscw Pcontains_entity Q1002860\n", "Qiscw Pcontains_entity Q1003177\n", "Qiscw Pcontains_entity Q1004657\n", "Qiscw Pcontains_entity Q1005\n", "```" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Parameters\n", "wikidata_parts_path = \"/lfs1/ktyao/Shared/KGTK/datasets/wikidata-20200803-v2/useful_wikidata_files\"\n", "instance_csv_file=\"/lfs1/ktyao/data/iswc-2019/t2dv2_gt.csv\"\n", "instance_qnode_col=\"kg_id\"\n", "subset_name=\"iscw\"\n", "output_path = \"/home/ktyao/dev/kgtk-cache\"\n", "cache_path = \"/home/ktyao/dev/kgtk-cache\"\n", "delete_database = \"no\"" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "temp_folder = subset_name + \"-temp\"\n", "output_folder = subset_name" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import csv\n", "import io\n", "import os\n", "import subprocess\n", "import sys\n", "\n", "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A convenience function to run templetazed commands, substituting NAME with the name of the dataset and substituting other keys provided in a dictionary." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "def run_command(command, substitution_dictionary = {}):\n", " \"\"\"Run a templetized command.\"\"\"\n", " cmd = command.replace(\"NAME\", subset_name)\n", " for k, v in substitution_dictionary.items():\n", " cmd = cmd.replace(k, v)\n", " \n", " print(cmd)\n", " output = subprocess.run([cmd], shell=True, universal_newlines=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)\n", " print(output.stdout)\n", " print(output.stderr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Set up environment variables and folders that we need\n", "We need to define environment variables to pass to the KGTK commands." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# folder containing wikidata broken down into smaller files.\n", "os.environ['WIKIDATA_PARTS'] = wikidata_parts_path\n", "# name of the dataset\n", "os.environ['NAME'] = subset_name\n", "# folder where to put the output\n", "os.environ['OUT'] = \"{}/{}\".format(output_path, output_folder)\n", "# temporary folder\n", "os.environ['TEMP'] = \"{}/{}\".format(output_path, temp_folder)\n", "# kgtk command to run\n", "os.environ['kgtk'] = \"kgtk\"\n", "# os.environ['kgtk'] = \"time kgtk --debug\"\n", "# absolute path of the db\n", "if cache_path:\n", " os.environ['STORE'] = \"{}/wikidata.sqlite3.db\".format(cache_path)\n", "else:\n", " os.environ['STORE'] = \"{}/{}/wikidata.sqlite3.db\".format(output_path, temp_folder)\n", "# alias zcat\n", "has_zcat = !command -v zcat\n", "if has_zcat:\n", " # for linux\n", " %alias gzcat zcat\n", "else:\n", " # for mac\n", " %alias gzcat gzcat" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/home/ktyao/dev/kgtk-cache\n" ] } ], "source": [ "cd $output_path" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "mkdir: cannot create directory ‘iscw’: File exists\n", "mkdir: cannot create directory ‘iscw-temp’: File exists\n" ] } ], "source": [ "!mkdir $output_folder\n", "!mkdir $temp_folder" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "rm: cannot remove '/home/ktyao/dev/kgtk-cache/iscw/*.tsv': No such file or directory\r\n" ] } ], "source": [ "!rm $OUT/*.tsv $OUT/*.tsv.gz\n", "!rm $TEMP/*.tsv $TEMP/*.tsv.gz" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "if delete_database and delete_database != \"no\":\n", " print(\"Deleted database\")\n", " !rm $STORE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "### Extract Q-nodes from input file" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "node1\tlabel\tnode2\n", "n1\tl\tQ470336\n", "n1\tl\tQ494085\n", "n1\tl\tQ105702\n", "n1\tl\tQ26060\n", "n1\tl\tQ2345\n", "n1\tl\tQ42198\n", "n1\tl\tQ320588\n", "n1\tl\tQ132689\n", "n1\tl\tQ45602\n", " 11892 35676 157305 /home/ktyao/dev/kgtk-cache/iscw-temp/instance_qnodes.tsv\n" ] } ], "source": [ "df = pd.read_csv(instance_csv_file, lineterminator='\\n')\n", "result = pd.DataFrame(df[instance_qnode_col].unique())\n", "result.columns = ['node2']\n", "result['node1'] = 'n1'\n", "result['label'] = 'l'\n", "result = result[['node1', 'label', 'node2']]\n", "result.to_csv(f\"{output_path}/{temp_folder}/instance_qnodes.tsv\", sep=\"\\t\", quoting=csv.QUOTE_NONE, index=None)\n", "!head $TEMP/instance_qnodes.tsv\n", "!wc $TEMP/instance_qnodes.tsv" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "$kgtk query -i $WIKIDATA_PARTS/all.isa.tsv.gz -i $TEMP/instance_qnodes.tsv --graph-cache $STORE -o $TEMP/qnodelist.iscw.tsv.gz --match 'instance: ()-[]-(n1), isa: (n1)-[l:isa]->(n2)' --return 'distinct n1 as node1, l.label as label, n2 as node2'\n", "\n", "\n", "node1\tlabel\tnode2\n", "Q100\tisa\tQ1549591\n", "Q100\tisa\tQ21518270\n", "Q100\tisa\tQ1093829\n", "Q1000\tisa\tQ6256\n", "Q1000\tisa\tQ3624078\n", "Q1000\tisa\tQ179023\n", "Q1001437\tisa\tQ4830453\n", "Q1001910\tisa\tQ5\n", "Q1002142\tisa\tQ219557\n", "\n", "gzip: stdout: Broken pipe\n", "wc -l /home/ktyao/dev/kgtk-cache/iscw-temp/qnodelist.iscw.tsv.gz\n", "16584\n" ] } ], "source": [ "command = \"$kgtk query -i $WIKIDATA_PARTS/all.isa.tsv.gz -i $TEMP/instance_qnodes.tsv \\\n", " --graph-cache $STORE \\\n", " -o $TEMP/qnodelist.NAME.tsv.gz \\\n", " --match 'instance: ()-[]-(n1), isa: (n1)-[l:isa]->(n2)' \\\n", " --return 'distinct n1 as node1, l.label as label, n2 as node2'\"\n", "run_command(command)\n", "%gzcat $TEMP/qnodelist.$NAME.tsv.gz | head\n", "!echo wc -l $TEMP/qnodelist.$NAME.tsv.gz\n", "%gzcat $TEMP/qnodelist.$NAME.tsv.gz | wc -l" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Extending KG to include nodes with ambiguous names" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Find node2s where we have node1/label/node1_label in qnodelist such that there exists a node2/alias/node2_alias in Wikidata such that node2_alias = node1_label" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "!$kgtk query -i $TEMP/qnodelist.$NAME.tsv.gz -i $WIKIDATA_PARTS/part.label.en.tsv.gz -i $WIKIDATA_PARTS/part.alias.en.tsv.gz --graph-cache $STORE \\\n", "--match 'qnodelist: (n1)-[]->(), label: (n1)-[:label]->(n1_label), alias: (n2)-[:alias]->(n1_label), label: (n2)-[:label]->(n2_label)' \\\n", "--where 'n1 != n2' \\\n", "--return 'distinct n1 as node1, n1_label as `node1;label`, n2 as node2, n2_label as `node2;label`, \"Pshares_name\" as label' \\\n", " | gzip > $TEMP/shares_name.label_alias.$NAME.tsv.gz " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Find node2s where we have node1/alias/node1_alias in qnodelist such that there exists a node2/label/node2_label in Wikidata such that node2_label = node1_alias" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "!$kgtk query -i $TEMP/qnodelist.$NAME.tsv.gz -i $WIKIDATA_PARTS/part.label.en.tsv.gz -i $WIKIDATA_PARTS/part.alias.en.tsv.gz --graph-cache $STORE \\\n", "--match 'qnodelist: (n1)-[]->(), label: (n1)-[:label]->(n1_label), alias: (n1)-[:alias]->(n1_alias), label: (n2)-[:label]->(n1_alias)' \\\n", "--where 'n1 != n2' \\\n", "--return 'distinct n1 as node1, n1_label as `node1;label`, n2 as node2, n1_alias as `node2;label`, \"Pshares_name\" as label' \\\n", " | gzip > $TEMP/shares_name.alias_label.$NAME.tsv.gz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Find node2s where we have node1/alias/node1_alias in qnodelist such that there exists a node2/alias/node2_alias in Wikidata such that node2_alias = node1_alias" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "!$kgtk query -i $TEMP/qnodelist.$NAME.tsv.gz -i $WIKIDATA_PARTS/part.label.en.tsv.gz -i $WIKIDATA_PARTS/part.alias.en.tsv.gz --graph-cache $STORE \\\n", "--match 'qnodelist: (n1)-[]->(), label: (n1)-[:label]->(n1_label), alias: (n1)-[:alias]->(n1_alias), alias: (n2)-[:alias]->(n1_alias), label: (n2)-[:label]->(n2_label)' \\\n", "--where 'n1 != n2' \\\n", "--return 'distinct n1 as node1, n1_label as `node1;label`, n2 as node2, n2_label as `node2;label`, \"Pshares_name\" as label' \\\n", " | gzip > $TEMP/shares_name.alias_alias.$NAME.tsv.gz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Find node2s where we have node1/alias/node1_alias in qnodelist such that there exists a node2/label/node2_label in Wikidata such that node2_label = node1_alias" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "!$kgtk query -i $TEMP/qnodelist.$NAME.tsv.gz -i $WIKIDATA_PARTS/part.label.en.tsv.gz --graph-cache $STORE \\\n", "--match 'qnodelist: (n1)-[]->(), label: (n1)-[:label]->(n1_label), label: (n2)-[:label]->(n1_label)' \\\n", "--where 'n1 != n2' \\\n", "--return 'distinct n1 as node1, n1_label as `node1;label`, n2 as node2, n1_label as `node2;label`, \"Pshares_name\" as label' \\\n", "| gzip > $TEMP/shares_name.label_label.$NAME.tsv.gz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Combine the above four files into one file" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "!ls $TEMP/shares_name.*.$NAME.tsv.gz | xargs echo \"-i\" | xargs kgtk cat | gzip > $OUT/shares_name.$NAME.tsv.gz" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 143901 951117 9511726\r\n" ] } ], "source": [ "%gzcat $OUT/shares_name.$NAME.tsv.gz | wc" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Unique ambiguous Q-nodes\n", "132800\n" ] } ], "source": [ "!echo Unique ambiguous Q-nodes\n", "%gzcat $OUT/shares_name.$NAME.tsv.gz | awk -F'\\t' '{print $3}' | tail -n +2 | uniq | wc -l" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Generate combined KG of input Q-nodes and ambiguous Q-nodes" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "$kgtk query -i $WIKIDATA_PARTS/all.isa.tsv.gz -i $OUT/shares_name.iscw.tsv.gz --graph-cache $STORE -o $TEMP/qnodelist_ambiguous.iscw.tsv.gz --match 'shares: ()-[]-(n1), isa: (n1)-[l:isa]->(n2)' --return 'distinct n1 as node1, l.label as label, n2 as node2'\n", "\n", "\n" ] } ], "source": [ "command = \"$kgtk query -i $WIKIDATA_PARTS/all.isa.tsv.gz -i $OUT/shares_name.NAME.tsv.gz \\\n", " --graph-cache $STORE \\\n", " -o $TEMP/qnodelist_ambiguous.NAME.tsv.gz \\\n", " --match 'shares: ()-[]-(n1), isa: (n1)-[l:isa]->(n2)' \\\n", " --return 'distinct n1 as node1, l.label as label, n2 as node2'\"\n", "run_command(command)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "!kgtk cat -i $TEMP/qnodelist.$NAME.tsv.gz -i $TEMP/qnodelist_ambiguous.$NAME.tsv.gz \\\n", " | gzip > $OUT/isa_combined.$NAME.tsv.gz" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "kgtk query -i $OUT/isa_combined.iscw.tsv.gz --graph-cache $STORE -o $OUT/qnodes_combined.iscw.tsv.gz --match '(n2)-[]->()' --return 'distinct \"Qiscw\" as node1, \"Pcontains_entity\" as label, n2 as node2'\n", "\n", "\n" ] } ], "source": [ "command = \"\"\"kgtk query -i $OUT/isa_combined.NAME.tsv.gz \\\n", " --graph-cache $STORE \\\n", " -o $OUT/qnodes_combined.NAME.tsv.gz \\\n", " --match '(n2)-[]->()' \\\n", " --return 'distinct \"QNAME\" as node1, \"Pcontains_entity\" as label, n2 as node2'\"\"\"\n", "run_command(command)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python [conda env:kgtk-env] *", "language": "python", "name": "conda-env-kgtk-env-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.8" } }, "nbformat": 4, "nbformat_minor": 4 }