{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Creating a subset of Wikidata\n",
    "\n",
    "This notebook illustrates how to create a subset of Wikidata, using https://www.wikidata.org/wiki/Q11173 (chemical compound) as an example.\n",
    "\n",
    "Parameters are set up in the first cell so that we can run this notebook in batch mode. Example invocation command:\n",
    "\n",
    "```\n",
    "papermill Example8\\ -\\ Wikidata\\ Subset.ipynb example8.out.ipynb \\\n",
    "-p wikidata_parts_path /Users/pedroszekely/Downloads/kypher/output.all.10 \\\n",
    "-p subset_name Q11173 \\\n",
    "-p output_path /Users/pedroszekely/Downloads/kypher \\\n",
    "-p cache_path /Users/pedroszekely/Downloads/kypher \\\n",
    "-p delete_database no \\\n",
    "-p hops_right 0\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Parameters for invoking the notebook\n",
    "\n",
    "- `wikidata_parts_path`: a folder containing the part files of Wikidata, including files such as `part.wikibase-item.tsv.gz`\n",
    "- `subset_name`: the name of the subset being created. In the current implementation the `subset_name` must be a q-node in Wikidata representing a class.\n",
    "- `output_path`: the path where a folder will be created to hold the KGTK files for the subset. A folder named `subset_name` will be created in this folder.\n",
    "- `cache_path`: the path of a folder where the Kypher SQL database will be created.\n",
    "- `delete_database`: whether to delete the SQL database before running the notebook: \"\" or \"no\" means don't delete it.\n",
    "- `hops_right`: how many hops forward to follow links after getting the initial collection of q-nodes for the subset; can be 0, 1 or 2."
   ]
  },
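  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The parameter descriptions above can be turned into a small validation step. The following is a minimal sketch, not part of the original notebook: `normalize_parameters` is a hypothetical helper name, and it assumes that `hops_right` may arrive either as a string or as a number depending on how the notebook is invoked.\n",
    "\n",
    "```python\n",
    "def normalize_parameters(hops_right, delete_database):\n",
    "    '''Return (hop_count, should_delete) from the raw parameter values.'''\n",
    "    hop_count = int(hops_right)\n",
    "    if hop_count not in (0, 1, 2):\n",
    "        raise ValueError('hops_right must be 0, 1 or 2, got {!r}'.format(hops_right))\n",
    "    # An empty string or 'no' means keep the database; anything else deletes it.\n",
    "    should_delete = bool(delete_database) and delete_database != 'no'\n",
    "    return hop_count, should_delete\n",
    "\n",
    "\n",
    "print(normalize_parameters('1', 'no'))\n",
    "```\n",
    "\n",
    "Checking the hop count up front makes an unsupported value fail before any long-running query starts."
   ]
  },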
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "tags": [
     "parameters"
    ]
   },
   "outputs": [],
   "source": [
    "# Parameters\n",
    "wikidata_parts_path = \"/Users/pedroszekely/Downloads/kypher/useful_wikidata_files\"\n",
    "#wikidata_parts_path = \"/Users/pedroszekely/Downloads/kypher/output.all.10\"\n",
    "subset_name = \"Q11173\"\n",
    "#subset_name = \"Q318\"\n",
    "subset_name = \"Q5\"\n",
    "subset_name = \"Q44\"\n",
    "output_path = \"/Users/pedroszekely/Downloads/kypher\"\n",
    "cache_path = \"/Users/pedroszekely/Downloads/kypher\"\n",
    "hops_right = \"1\"\n",
    "delete_database = \"no\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "temp_folder = subset_name + \"-temp\"\n",
    "output_folder = subset_name"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import io\n",
    "import os\n",
    "import subprocess\n",
    "import sys\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# from IPython.display import display, HTML, Image\n",
    "# from pandas_profiling import ProfileReport"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A convenience function to run templatized commands, substituting NAME with the name of the dataset and substituting other keys provided in a dictionary."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "def run_command(command, substitution_dictionary=None):\n",
    "    \"\"\"Run a templatized command.\"\"\"\n",
    "    # Replace NAME first, then any extra keys such as TYPE_FILE or LABEL.\n",
    "    cmd = command.replace(\"NAME\", subset_name)\n",
    "    for k, v in (substitution_dictionary or {}).items():\n",
    "        cmd = cmd.replace(k, v)\n",
    "\n",
    "    print(cmd)\n",
    "    # With shell=True the command is passed as a single string.\n",
    "    output = subprocess.run(cmd, shell=True, universal_newlines=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)\n",
    "    print(output.stdout)\n",
    "    print(output.stderr)\n",
    "    #print(output.returncode)"
   ]
  },
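  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To inspect what a template expands to without launching a subprocess, the substitution step can be isolated. This is a minimal sketch under stated assumptions: `render_command` is a hypothetical helper, and it takes the subset name as an argument instead of reading the global `subset_name`.\n",
    "\n",
    "```python\n",
    "def render_command(command, name, substitutions=None):\n",
    "    '''Expand NAME and any extra keys in a command template.'''\n",
    "    cmd = command.replace('NAME', name)\n",
    "    for key, value in (substitutions or {}).items():\n",
    "        cmd = cmd.replace(key, value)\n",
    "    return cmd\n",
    "\n",
    "\n",
    "template = '$kgtk query -i $WIKIDATA_PARTS/part.TYPE_FILE.tsv.gz -o $OUT/NAME.part.TYPE_FILE.tsv.gz'\n",
    "print(render_command(template, 'Q44', {'TYPE_FILE': 'time'}))\n",
    "\n",
    "# Plain substring replacement also rewrites NAME inside $NAME.\n",
    "print(render_command('$TEMP/$NAME.tsv', 'Q44'))\n",
    "```\n",
    "\n",
    "Because the replacement is a plain substring match, templates should use the bare token `NAME` rather than the shell variable `$NAME`."
   ]
  },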
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Set up environment variables and folders that we need\n",
    "We need to define environment variables to pass to the KGTK commands."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 133,
   "metadata": {},
   "outputs": [],
   "source": [
    "# folder containing wikidata broken down into smaller files.\n",
    "os.environ['WIKIDATA_PARTS'] = wikidata_parts_path\n",
    "# name of the dataset\n",
    "os.environ['NAME'] = subset_name\n",
    "# folder where to put the output\n",
    "os.environ['OUT'] = \"{}/{}\".format(output_path, output_folder)\n",
    "# temporary folder\n",
    "os.environ['TEMP'] = \"{}/{}\".format(output_path, temp_folder)\n",
    "# kgtk command to run\n",
    "os.environ['kgtk'] = \"kgtk\"\n",
    "# os.environ['kgtk'] = \"time kgtk --debug\"\n",
    "# absolute path of the db\n",
    "if cache_path:\n",
    "    os.environ['STORE'] = \"{}/wikidata.sqlite3.db\".format(cache_path)\n",
    "else:\n",
    "    os.environ['STORE'] = \"{}/{}/wikidata.sqlite3.db\".format(output_path, temp_folder)"
   ]
  },
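  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The choice of database location in the cell above can be expressed as a small function. This is only a sketch: `store_path` is a hypothetical name, and `posixpath` is used on the assumption that the paths are consumed by a POSIX shell, as in the rest of this notebook.\n",
    "\n",
    "```python\n",
    "import posixpath\n",
    "\n",
    "\n",
    "def store_path(cache_path, output_path, temp_folder, db_name='wikidata.sqlite3.db'):\n",
    "    '''Return the database path, preferring cache_path when it is set.'''\n",
    "    if cache_path:\n",
    "        return posixpath.join(cache_path, db_name)\n",
    "    # Fall back to the temporary folder of this subset.\n",
    "    return posixpath.join(output_path, temp_folder, db_name)\n",
    "\n",
    "\n",
    "print(store_path('', '/data/out', 'Q44-temp'))\n",
    "```\n",
    "\n",
    "Sharing one `cache_path` across runs lets the graph cache be reused between subsets, while the fallback keeps the cache next to the temporary files."
   ]
  },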
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/Users/pedroszekely/Documents/GitHub/kgtk/examples/Q44\n"
     ]
    }
   ],
   "source": [
    "cd $output_folder"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "mkdir: Q44: File exists\n",
      "mkdir: Q44-temp: File exists\n"
     ]
    }
   ],
   "source": [
    "!mkdir $output_folder\n",
    "!mkdir $temp_folder"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "rm: /Users/pedroszekely/Downloads/kypher/Q44/*.tsv: No such file or directory\n"
     ]
    }
   ],
   "source": [
    "!rm $OUT/*.tsv $OUT/*.tsv.gz\n",
    "!rm $TEMP/*.tsv $TEMP/*.tsv.gz"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "if delete_database and delete_database != \"no\":\n",
    "    print(\"Deleted database\")\n",
    "    !rm $STORE"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Extract the Q-nodes for the items we want\n",
    "Here we assume that the subset is for an individual q-node, so that the subset name is the name of the q-node. We should generalize this so that this query can be passed in as a parameter. We construct a file that contains every node1 that has an `isa` edge to the given NAME q-node."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "$kgtk query -i $WIKIDATA_PARTS/all.isa.tsv.gz     --graph-cache $STORE     -o $TEMP/qnodelist.Q44.tsv.gz      --match 'isa: (n1)-[l:isa]->(n2:Q44)'     --return 'distinct n1, l.label, n2'\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "command = \"$kgtk query -i $WIKIDATA_PARTS/all.isa.tsv.gz \\\n",
    "    --graph-cache $STORE \\\n",
    "    -o $TEMP/qnodelist.NAME.tsv.gz  \\\n",
    "    --match 'isa: (n1)-[l:isa]->(n2:NAME)' \\\n",
    "    --return 'distinct n1, l.label, n2'\"\n",
    "run_command(command)"
   ]
  },
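  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an illustration of what the match and return clauses select, the same filter can be written in plain Python over a small in-memory edge list. This is a sketch for intuition only, not how Kypher evaluates the query; `isa_members` is a hypothetical helper and the edge list is toy data.\n",
    "\n",
    "```python\n",
    "def isa_members(edges, class_qnode):\n",
    "    '''Return the distinct isa edges that point at class_qnode.'''\n",
    "    seen = set()\n",
    "    result = []\n",
    "    for edge in edges:\n",
    "        node1, label, node2 = edge\n",
    "        if label == 'isa' and node2 == class_qnode and edge not in seen:\n",
    "            seen.add(edge)\n",
    "            result.append(edge)\n",
    "    return result\n",
    "\n",
    "\n",
    "# Toy data (illustrative only).\n",
    "edges = [\n",
    "    ('Q2579953', 'isa', 'Q44'),\n",
    "    ('Q15883984', 'isa', 'Q44'),\n",
    "    ('Q2579953', 'isa', 'Q44'),  # duplicate, dropped by the distinct step\n",
    "    ('Q42', 'isa', 'Q5'),        # different class, filtered out\n",
    "]\n",
    "print(isa_members(edges, 'Q44'))\n",
    "```"
   ]
  },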
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "node1\tlabel\tnode2\n",
      "Q2579953\tisa\tQ44\n",
      "Q15883984\tisa\tQ44\n",
      "Q63379154\tisa\tQ44\n",
      "Q3699039\tisa\tQ44\n",
      "Q3360035\tisa\tQ44\n",
      "Q16070652\tisa\tQ44\n",
      "Q85313643\tisa\tQ44\n",
      "Q897293\tisa\tQ44\n",
      "Q999745\tisa\tQ44\n"
     ]
    }
   ],
   "source": [
    "!gzcat $TEMP/qnodelist.$NAME.tsv.gz | head "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     454    1362    7887\n"
     ]
    }
   ],
   "source": [
    "!gzcat $TEMP/qnodelist.$NAME.tsv.gz | wc "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "$kgtk query -i $WIKIDATA_PARTS/all.P279star.tsv.gz      --graph-cache $STORE     -o $TEMP/all.P279star.Q44.tsv.gz      --match '(n1)-[l:P279star]->(n2:Q44)'     --return 'distinct n1, l.label, n2'\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "command = \"$kgtk query -i $WIKIDATA_PARTS/all.P279star.tsv.gz  \\\n",
    "    --graph-cache $STORE \\\n",
    "    -o $TEMP/all.P279star.NAME.tsv.gz  \\\n",
    "    --match '(n1)-[l:P279star]->(n2:NAME)' \\\n",
    "    --return 'distinct n1, l.label, n2'\"\n",
    "run_command(command)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     320     960    7110\n"
     ]
    }
   ],
   "source": [
    "!gzcat $TEMP/all.P279star.$NAME.tsv.gz | wc"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Generate the nodes one hop to the right"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "hops_right_count = int(hops_right)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "$kgtk query     -i $TEMP/qnodelist.Q44.tsv.gz     -i $WIKIDATA_PARTS/part.wikibase-item.tsv.gz     -o $TEMP/Q44.hop.right1.tsv.gz     --graph-cache $STORE     --match 'qnodelist: (n1)-[]->(), `wikibase-item`: (n1)-[]->(n2), `wikibase-item`: (n2)-[l]->(n3)'     --return 'distinct l, n2 as node1, l.label as label, n3 as node2'\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "command = \"$kgtk query \\\n",
    "    -i $TEMP/qnodelist.NAME.tsv.gz \\\n",
    "    -i $WIKIDATA_PARTS/part.wikibase-item.tsv.gz \\\n",
    "    -o $TEMP/NAME.hop.right1.tsv.gz \\\n",
    "    --graph-cache $STORE \\\n",
    "    --match 'qnodelist: (n1)-[]->(), `wikibase-item`: (n1)-[]->(n2), `wikibase-item`: (n2)-[l]->(n3)' \\\n",
    "    --return 'distinct l, n2 as node1, l.label as label, n3 as node2'\" \n",
    "\n",
    "if hops_right_count > 0:\n",
    "    run_command(command)"
   ]
  },
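  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The three-part match above joins the subset members to their neighbours and then returns the outgoing edges of those neighbours. A rough in-memory equivalent, as a sketch only: `one_hop_right` is a hypothetical helper, the data is made up, and edge ids are ignored for brevity.\n",
    "\n",
    "```python\n",
    "def one_hop_right(members, item_edges):\n",
    "    '''Return the edges that start at a node one hop away from a member.'''\n",
    "    # Nodes reachable from any subset member by following one edge.\n",
    "    neighbours = {node2 for node1, _, node2 in item_edges if node1 in members}\n",
    "    # Outgoing edges of those nodes, de-duplicated and sorted for stable output.\n",
    "    return sorted({edge for edge in item_edges if edge[0] in neighbours})\n",
    "\n",
    "\n",
    "# Toy data (illustrative only).\n",
    "members = {'A'}\n",
    "item_edges = [\n",
    "    ('A', 'P1', 'B'),\n",
    "    ('B', 'P2', 'C'),\n",
    "    ('C', 'P3', 'D'),\n",
    "    ('X', 'P1', 'B'),\n",
    "]\n",
    "print(one_hop_right(members, item_edges))\n",
    "```\n",
    "\n",
    "Only the edge leaving `B` is returned: `B` is one hop from the member `A`, while `C` is two hops away."
   ]
  },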
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create a dummy empty hop file so that the `kgtk cat` command below doesn't fail if the number of hops is zero."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "!echo -e \"node1\\tlabel\\tnode2\\tid\" | gzip > $TEMP/$NAME.hop.dummy.tsv.gz"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "!kgtk cat -i $TEMP/$NAME.hop.*.tsv.gz $TEMP/all.P279star.$NAME.tsv.gz $TEMP/qnodelist.$NAME.tsv.gz | gzip > $TEMP/$NAME.all-items.tsv.gz"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Generate the parts of this dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "$kgtk query -i $TEMP/Q44.all-items.tsv.gz -i $WIKIDATA_PARTS/part.time.tsv.gz --graph-cache $STORE      -o $OUT/Q44.part.time.tsv.gz      --match 'Q44: (n1)-[]->(), `time`: (n1)-[l]->(n2)'     --return 'distinct l, n1, l.label, n2'     --order-by 'n1, l.label, n2'\n",
      "\n",
      "\n",
      "$kgtk query -i $TEMP/Q44.all-items.tsv.gz -i $WIKIDATA_PARTS/part.wikibase-item.tsv.gz --graph-cache $STORE      -o $OUT/Q44.part.wikibase-item.tsv.gz      --match 'Q44: (n1)-[]->(), `wikibase-item`: (n1)-[l]->(n2)'     --return 'distinct l, n1, l.label, n2'     --order-by 'n1, l.label, n2'\n",
      "\n",
      "\n",
      "$kgtk query -i $TEMP/Q44.all-items.tsv.gz -i $WIKIDATA_PARTS/part.math.tsv.gz --graph-cache $STORE      -o $OUT/Q44.part.math.tsv.gz      --match 'Q44: (n1)-[]->(), `math`: (n1)-[l]->(n2)'     --return 'distinct l, n1, l.label, n2'     --order-by 'n1, l.label, n2'\n",
      "\n",
      "\n",
      "$kgtk query -i $TEMP/Q44.all-items.tsv.gz -i $WIKIDATA_PARTS/part.wikibase-form.tsv.gz --graph-cache $STORE      -o $OUT/Q44.part.wikibase-form.tsv.gz      --match 'Q44: (n1)-[]->(), `wikibase-form`: (n1)-[l]->(n2)'     --return 'distinct l, n1, l.label, n2'     --order-by 'n1, l.label, n2'\n",
      "\n",
      "\n",
      "$kgtk query -i $TEMP/Q44.all-items.tsv.gz -i $WIKIDATA_PARTS/part.quantity.tsv.gz --graph-cache $STORE      -o $OUT/Q44.part.quantity.tsv.gz      --match 'Q44: (n1)-[]->(), `quantity`: (n1)-[l]->(n2)'     --return 'distinct l, n1, l.label, n2'     --order-by 'n1, l.label, n2'\n",
      "\n",
      "\n",
      "$kgtk query -i $TEMP/Q44.all-items.tsv.gz -i $WIKIDATA_PARTS/part.string.tsv.gz --graph-cache $STORE      -o $OUT/Q44.part.string.tsv.gz      --match 'Q44: (n1)-[]->(), `string`: (n1)-[l]->(n2)'     --return 'distinct l, n1, l.label, n2'     --order-by 'n1, l.label, n2'\n",
      "\n",
      "\n",
      "$kgtk query -i $TEMP/Q44.all-items.tsv.gz -i $WIKIDATA_PARTS/part.external-id.tsv.gz --graph-cache $STORE      -o $OUT/Q44.part.external-id.tsv.gz      --match 'Q44: (n1)-[]->(), `external-id`: (n1)-[l]->(n2)'     --return 'distinct l, n1, l.label, n2'     --order-by 'n1, l.label, n2'\n",
      "\n",
      "\n",
      "$kgtk query -i $TEMP/Q44.all-items.tsv.gz -i $WIKIDATA_PARTS/part.commonsMedia.tsv.gz --graph-cache $STORE      -o $OUT/Q44.part.commonsMedia.tsv.gz      --match 'Q44: (n1)-[]->(), `commonsMedia`: (n1)-[l]->(n2)'     --return 'distinct l, n1, l.label, n2'     --order-by 'n1, l.label, n2'\n",
      "\n",
      "\n",
      "$kgtk query -i $TEMP/Q44.all-items.tsv.gz -i $WIKIDATA_PARTS/part.globe-coordinate.tsv.gz --graph-cache $STORE      -o $OUT/Q44.part.globe-coordinate.tsv.gz      --match 'Q44: (n1)-[]->(), `globe-coordinate`: (n1)-[l]->(n2)'     --return 'distinct l, n1, l.label, n2'     --order-by 'n1, l.label, n2'\n",
      "\n",
      "\n",
      "$kgtk query -i $TEMP/Q44.all-items.tsv.gz -i $WIKIDATA_PARTS/part.monolingualtext.tsv.gz --graph-cache $STORE      -o $OUT/Q44.part.monolingualtext.tsv.gz      --match 'Q44: (n1)-[]->(), `monolingualtext`: (n1)-[l]->(n2)'     --return 'distinct l, n1, l.label, n2'     --order-by 'n1, l.label, n2'\n",
      "\n",
      "\n",
      "$kgtk query -i $TEMP/Q44.all-items.tsv.gz -i $WIKIDATA_PARTS/part.musical-notation.tsv.gz --graph-cache $STORE      -o $OUT/Q44.part.musical-notation.tsv.gz      --match 'Q44: (n1)-[]->(), `musical-notation`: (n1)-[l]->(n2)'     --return 'distinct l, n1, l.label, n2'     --order-by 'n1, l.label, n2'\n",
      "\n",
      "\n",
      "$kgtk query -i $TEMP/Q44.all-items.tsv.gz -i $WIKIDATA_PARTS/part.geo-shape.tsv.gz --graph-cache $STORE      -o $OUT/Q44.part.geo-shape.tsv.gz      --match 'Q44: (n1)-[]->(), `geo-shape`: (n1)-[l]->(n2)'     --return 'distinct l, n1, l.label, n2'     --order-by 'n1, l.label, n2'\n",
      "\n",
      "\n",
      "$kgtk query -i $TEMP/Q44.all-items.tsv.gz -i $WIKIDATA_PARTS/part.wikibase-property.tsv.gz --graph-cache $STORE      -o $OUT/Q44.part.wikibase-property.tsv.gz      --match 'Q44: (n1)-[]->(), `wikibase-property`: (n1)-[l]->(n2)'     --return 'distinct l, n1, l.label, n2'     --order-by 'n1, l.label, n2'\n",
      "\n",
      "\n",
      "$kgtk query -i $TEMP/Q44.all-items.tsv.gz -i $WIKIDATA_PARTS/part.url.tsv.gz --graph-cache $STORE      -o $OUT/Q44.part.url.tsv.gz      --match 'Q44: (n1)-[]->(), `url`: (n1)-[l]->(n2)'     --return 'distinct l, n1, l.label, n2'     --order-by 'n1, l.label, n2'\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "types = [\n",
    "    \"time\",\n",
    "    \"wikibase-item\",\n",
    "    \"math\",\n",
    "    \"wikibase-form\",\n",
    "    \"quantity\",\n",
    "    \"string\",\n",
    "    \"external-id\",\n",
    "    \"commonsMedia\",\n",
    "    \"globe-coordinate\",\n",
    "    \"monolingualtext\",\n",
    "    \"musical-notation\",\n",
    "    \"geo-shape\",\n",
    "    \"wikibase-property\",\n",
    "    \"url\",\n",
    "]\n",
    "command = \"$kgtk query -i $TEMP/NAME.all-items.tsv.gz -i $WIKIDATA_PARTS/part.TYPE_FILE.tsv.gz --graph-cache $STORE  \\\n",
    "    -o $OUT/NAME.part.TYPE_FILE.tsv.gz  \\\n",
    "    --match 'NAME: (n1)-[]->(), `TYPE_FILE`: (n1)-[l]->(n2)' \\\n",
    "    --return 'distinct l, n1, l.label, n2' \\\n",
    "    --order-by 'n1, l.label, n2'\"\n",
    "for type_name in types:\n",
    "    run_command(command, {\"TYPE_FILE\": type_name})\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Generate a P279star file"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First generate the P279 and P31 edges for every node2 in the wikibase-item file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "$kgtk query -i $OUT/Q44.part.wikibase-item.tsv.gz -i $WIKIDATA_PARTS/all.P279.tsv.gz --graph-cache $STORE -o $TEMP/Q44.node2.P279.tsv.gz --match 'Q44: ()-[]->(n1), P279: (n1)-[l]->(n2)' --return 'distinct l, n1 as node1, l.label, n2' --order-by 'n1, l.label, n2'\n",
      "\n",
      "\n",
      "$kgtk query -i $OUT/Q44.part.wikibase-item.tsv.gz -i $WIKIDATA_PARTS/all.P31.tsv.gz --graph-cache $STORE -o $TEMP/Q44.node2.P31.tsv.gz --match 'Q44: ()-[]->(n1), P31: (n1)-[l]->(n2)' --return 'distinct l, n1 as node1, l.label, n2' --order-by 'n1, l.label, n2'\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "command_p279 = \"$kgtk query -i $OUT/NAME.part.wikibase-item.tsv.gz -i $WIKIDATA_PARTS/all.P279.tsv.gz --graph-cache $STORE \\\n",
    "-o $TEMP/NAME.node2.P279.tsv.gz \\\n",
    "--match 'NAME: ()-[]->(n1), P279: (n1)-[l]->(n2)' \\\n",
    "--return 'distinct l, n1 as node1, l.label, n2' \\\n",
    "--order-by 'n1, l.label, n2'\"\n",
    "\n",
    "command_p31 = \"$kgtk query -i $OUT/NAME.part.wikibase-item.tsv.gz -i $WIKIDATA_PARTS/all.P31.tsv.gz --graph-cache $STORE \\\n",
    "-o $TEMP/NAME.node2.P31.tsv.gz \\\n",
    "--match 'NAME: ()-[]->(n1), P31: (n1)-[l]->(n2)' \\\n",
    "--return 'distinct l, n1 as node1, l.label, n2' \\\n",
    "--order-by 'n1, l.label, n2'\"\n",
    "\n",
    "run_command(command_p279)\n",
    "run_command(command_p31)\n",
    "\n",
    "!$kgtk cat -i $TEMP/$NAME.node2.P279.tsv.gz $TEMP/$NAME.node2.P31.tsv.gz | gzip > $TEMP/$NAME.P279_P31.tsv.gz\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "!kgtk cat -i $OUT/$NAME.part.*.tsv.gz  | gzip > $TEMP/$NAME.all_1.tsv.gz"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "$kgtk query -i $WIKIDATA_PARTS/all.P279star.tsv.gz -i $TEMP/Q44.all_1.tsv.gz     --graph-cache $STORE      -o $TEMP/Q44.P279star.1.tsv.gz     --match 'P279star: (n1)-[l]->(n2), all_1: (n1)-[]->()'     --return 'distinct l, n1, l.label, n2'\n",
      "\n",
      "\n",
      "$kgtk query -i $WIKIDATA_PARTS/all.P279star.tsv.gz -i $TEMP/Q44.all_1.tsv.gz     --graph-cache $STORE      -o $TEMP/Q44.P279star.2.tsv.gz     --match 'P279star: (n1)-[l]->(n2), all_1: ()-[]->(n1)'     --return 'distinct l, n1 as node1, l.label, n2'\n",
      "\n",
      "\n",
      "$kgtk cat -i $TEMP/Q44.P279star.1.tsv.gz $TEMP/Q44.P279star.2.tsv.gz | gzip > $OUT/Q44.P279star.tsv.gz\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "command_node1 = \"$kgtk query -i $WIKIDATA_PARTS/all.P279star.tsv.gz -i $TEMP/NAME.all_1.tsv.gz \\\n",
    "    --graph-cache $STORE  \\\n",
    "    -o $TEMP/NAME.P279star.1.tsv.gz \\\n",
    "    --match 'P279star: (n1)-[l]->(n2), all_1: (n1)-[]->()' \\\n",
    "    --return 'distinct l, n1, l.label, n2'\"\n",
    "\n",
    "command_node2 = \"$kgtk query -i $WIKIDATA_PARTS/all.P279star.tsv.gz -i $TEMP/NAME.all_1.tsv.gz \\\n",
    "    --graph-cache $STORE  \\\n",
    "    -o $TEMP/NAME.P279star.2.tsv.gz \\\n",
    "    --match 'P279star: (n1)-[l]->(n2), all_1: ()-[]->(n1)' \\\n",
    "    --return 'distinct l, n1 as node1, l.label, n2'\" \n",
    "\n",
    "cat_command = \"$kgtk cat -i $TEMP/NAME.P279star.1.tsv.gz $TEMP/NAME.P279star.2.tsv.gz | gzip > $OUT/NAME.P279star.tsv.gz\"\n",
    "\n",
    "run_command(command_node1)\n",
    "run_command(command_node2)\n",
    "run_command(cat_command)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Get info on all properties"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "!$kgtk cat -i $OUT/*.gz | gzip > $TEMP/$NAME.everything_1.tsv.gz"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First get a list of all the properties used in this file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "!$kgtk query -i $TEMP/$NAME.everything_1.tsv.gz --graph-cache $STORE \\\n",
    "-o $TEMP/$NAME.properties.tsv \\\n",
    "--match '(n1)-[l]->(n2)' \\\n",
    "--return 'distinct l.label as node1, \"dummy\" as label, \"dummy\" as node2' "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now get all the info about these properties."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "!$kgtk query -i $TEMP/$NAME.properties.tsv -i $WIKIDATA_PARTS/part.wikibase-item.tsv.gz --graph-cache $STORE \\\n",
    "-o $OUT/$NAME.properties.tsv.gz \\\n",
    "--match '`wikibase-item`: (p)-[l]->(n2), properties: (p)-[]->()' \\\n",
    "--return 'distinct l, p, l.label, n2' "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Generate the labels, aliases and descriptions\n",
    "We want the labels, aliases and descriptions for every q-node in our dataset. This means that we need these labels for all q-nodes that appear in the node1 or node2 position.\n",
    "\n",
    "The first step is to concatenate all the files in our dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "!$kgtk cat -i $OUT/*.gz | gzip > $TEMP/$NAME.everything_2.tsv.gz"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we extract the labels from our input Wikidata folder. We do this by matching node1, then node2, and then we concatenate the resulting label files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "$kgtk query -i $TEMP/Q44.everything_2.tsv.gz -i $WIKIDATA_PARTS/part.label.en.tsv.gz --graph-cache $STORE      -o $TEMP/Q44.label.en.1.tsv.gz      --match 'everything_2: (n1)-[]->(), part: (n1)-[l]->(n2)'     --return 'distinct l, n1, l.label, n2'     --order-by 'n1, l.label, n2'\n",
      "\n",
      "\n",
      "$kgtk query -i $TEMP/Q44.everything_2.tsv.gz -i $WIKIDATA_PARTS/part.label.en.tsv.gz --graph-cache $STORE      -o $TEMP/Q44.label.en.2.tsv.gz      --match 'everything_2: ()-[]->(n1), part: (n1)-[l]->(n2)'     --return 'distinct l, n1 as node1, l.label, n2'     --order-by 'n1, l.label, n2'\n",
      "\n",
      "\n",
      "kgtk cat -i $TEMP/Q44.label.*.gz | gzip > $OUT/Q44.label.en.tsv.gz\n",
      "\n",
      "\n",
      "$kgtk query -i $TEMP/Q44.everything_2.tsv.gz -i $WIKIDATA_PARTS/part.alias.en.tsv.gz --graph-cache $STORE      -o $TEMP/Q44.alias.en.1.tsv.gz      --match 'everything_2: (n1)-[]->(), part: (n1)-[l]->(n2)'     --return 'distinct l, n1, l.label, n2'     --order-by 'n1, l.label, n2'\n",
      "\n",
      "\n",
      "$kgtk query -i $TEMP/Q44.everything_2.tsv.gz -i $WIKIDATA_PARTS/part.alias.en.tsv.gz --graph-cache $STORE      -o $TEMP/Q44.alias.en.2.tsv.gz      --match 'everything_2: ()-[]->(n1), part: (n1)-[l]->(n2)'     --return 'distinct l, n1 as node1, l.label, n2'     --order-by 'n1, l.label, n2'\n",
      "\n",
      "\n",
      "kgtk cat -i $TEMP/Q44.alias.*.gz | gzip > $OUT/Q44.alias.en.tsv.gz\n",
      "\n",
      "\n",
      "$kgtk query -i $TEMP/Q44.everything_2.tsv.gz -i $WIKIDATA_PARTS/part.description.en.tsv.gz --graph-cache $STORE      -o $TEMP/Q44.description.en.1.tsv.gz      --match 'everything_2: (n1)-[]->(), part: (n1)-[l]->(n2)'     --return 'distinct l, n1, l.label, n2'     --order-by 'n1, l.label, n2'\n",
      "\n",
      "\n",
      "$kgtk query -i $TEMP/Q44.everything_2.tsv.gz -i $WIKIDATA_PARTS/part.description.en.tsv.gz --graph-cache $STORE      -o $TEMP/Q44.description.en.2.tsv.gz      --match 'everything_2: ()-[]->(n1), part: (n1)-[l]->(n2)'     --return 'distinct l, n1 as node1, l.label, n2'     --order-by 'n1, l.label, n2'\n",
      "\n",
      "\n",
      "kgtk cat -i $TEMP/Q44.description.*.gz | gzip > $OUT/Q44.description.en.tsv.gz\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "labels = [\n",
    "    \"label\",\n",
    "    \"alias\",\n",
    "    \"description\"\n",
    "]\n",
    "\n",
    "command_node1 = \"$kgtk query -i $TEMP/NAME.everything_2.tsv.gz -i $WIKIDATA_PARTS/part.LABEL.en.tsv.gz --graph-cache $STORE  \\\n",
    "    -o $TEMP/NAME.LABEL.en.1.tsv.gz  \\\n",
    "    --match 'everything_2: (n1)-[]->(), part: (n1)-[l]->(n2)' \\\n",
    "    --return 'distinct l, n1, l.label, n2' \\\n",
    "    --order-by 'n1, l.label, n2'\"\n",
    "\n",
    "command_node2 = \"$kgtk query -i $TEMP/NAME.everything_2.tsv.gz -i $WIKIDATA_PARTS/part.LABEL.en.tsv.gz --graph-cache $STORE  \\\n",
    "    -o $TEMP/NAME.LABEL.en.2.tsv.gz  \\\n",
    "    --match 'everything_2: ()-[]->(n1), part: (n1)-[l]->(n2)' \\\n",
    "    --return 'distinct l, n1 as node1, l.label, n2' \\\n",
    "    --order-by 'n1, l.label, n2'\"\n",
    "\n",
    "command_label = \"$kgtk query -i $TEMP/NAME.everything_2.tsv.gz -i $WIKIDATA_PARTS/part.LABEL.en.tsv.gz --graph-cache $STORE  \\\n",
    "    -o $TEMP/NAME.LABEL.en.3.tsv.gz  \\\n",
    "    --match 'everything_2: ()-[l {label: n1}]->(), part: (n1)-[l]->(n2)' \\\n",
    "    --return 'distinct l, n1 as node1, l.label, n2' \\\n",
    "    --order-by 'n1, l.label, n2'\"\n",
    "\n",
    "cat_command = \"kgtk cat -i $TEMP/NAME.LABEL.*.gz | gzip > $OUT/NAME.LABEL.en.tsv.gz\"\n",
    "\n",
    "for label in labels:\n",
    "    run_command(command_node1, {\"LABEL\": label})\n",
    "    run_command(command_node2, {\"LABEL\": label})\n",
    "    run_command(cat_command, {\"LABEL\": label})\n"
   ]
  },
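  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The two label queries above (one matching node1, one matching node2) amount to keeping the label edges of every node that appears anywhere in the dataset. A minimal in-memory sketch of that idea, with a hypothetical helper `select_labels` and toy data:\n",
    "\n",
    "```python\n",
    "def select_labels(edges, label_edges):\n",
    "    '''Keep the label edges whose node1 appears as node1 or node2 in edges.'''\n",
    "    wanted = set()\n",
    "    for node1, _, node2 in edges:\n",
    "        wanted.add(node1)\n",
    "        wanted.add(node2)\n",
    "    return [edge for edge in label_edges if edge[0] in wanted]\n",
    "\n",
    "\n",
    "# Toy data (illustrative only).\n",
    "edges = [('Q1', 'P31', 'Q2')]\n",
    "label_edges = [\n",
    "    ('Q1', 'label', 'first'),\n",
    "    ('Q2', 'label', 'second'),\n",
    "    ('Q3', 'label', 'third'),\n",
    "]\n",
    "print(select_labels(edges, label_edges))\n",
    "```\n",
    "\n",
    "Collecting the node set once covers both positions in a single pass, whereas the notebook runs one query per position and concatenates the results."
   ]
  },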
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Summary of what we got"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Q44.P279star.tsv.gz  183410\n",
      "Q44.alias.en.tsv.gz   36266\n",
      "Q44.description.en.tsv.gz   27447\n",
      "Q44.label.en.tsv.gz   36591\n",
      "Q44.part.commonsMedia.tsv.gz    1386\n",
      "Q44.part.external-id.tsv.gz    8406\n",
      "Q44.part.geo-shape.tsv.gz      75\n",
      "Q44.part.globe-coordinate.tsv.gz     416\n",
      "Q44.part.math.tsv.gz       1\n",
      "Q44.part.monolingualtext.tsv.gz    3594\n",
      "Q44.part.musical-notation.tsv.gz       1\n",
      "Q44.part.quantity.tsv.gz   25281\n",
      "Q44.part.string.tsv.gz    1314\n",
      "Q44.part.time.tsv.gz     366\n",
      "Q44.part.url.tsv.gz     314\n",
      "Q44.part.wikibase-form.tsv.gz       1\n",
      "Q44.part.wikibase-item.tsv.gz   21942\n",
      "Q44.part.wikibase-property.tsv.gz      20\n",
      "Q44.properties.tsv.gz   10582\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "for f in $OUT/*.tsv.gz; do\n",
    "    echo -n `basename $f`\n",
    "    gzcat $f | wc -l\n",
    "done"
   ]
  },
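  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "On systems without `gzcat`, the same line counts can be obtained with the Python standard library. This is a sketch, with `count_lines` as a hypothetical helper name; the counts include the header line, as in the shell loop above.\n",
    "\n",
    "```python\n",
    "import glob\n",
    "import gzip\n",
    "import os\n",
    "\n",
    "\n",
    "def count_lines(path):\n",
    "    '''Count the lines of a gzip-compressed text file.'''\n",
    "    with gzip.open(path, 'rt') as handle:\n",
    "        return sum(1 for _ in handle)\n",
    "\n",
    "\n",
    "# OUT is the environment variable set earlier in this notebook.\n",
    "for path in sorted(glob.glob(os.path.join(os.environ.get('OUT', '.'), '*.tsv.gz'))):\n",
    "    print(os.path.basename(path), count_lines(path))\n",
    "```"
   ]
  },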
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Unzip the everything file, as graph-statistics cannot work with gz files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "rm: /Users/pedroszekely/Downloads/kypher/Q44-temp/Q44.everything_2.tsv: No such file or directory\n"
     ]
    }
   ],
   "source": [
    "!rm $TEMP/$NAME.everything_2.tsv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
    "!gunzip --keep $TEMP/$NAME.everything_2.tsv.gz"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [],
   "source": [
    "!$kgtk graph-statistics --log $OUT/$NAME.everything.statistics.txt \\\n",
    "    --statistics-only --pagerank -i $TEMP/$NAME.everything_2.tsv \\\n",
    "    | gzip > $OUT/$NAME.statistics.tsv.gz"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "graph loaded! It has 53099 nodes and 257093 edges\n",
      "\n",
      "###Top relations:\n",
      "P279star\t183409\n",
      "P2302\t4031\n",
      "P530\t3434\n",
      "P1082\t3347\n",
      "P2936\t3235\n",
      "P2131\t3158\n",
      "P2132\t3058\n",
      "P2134\t2891\n",
      "P31\t2630\n",
      "P1549\t2574\n",
      "\n",
      "###PageRank\n",
      "Max pageranks\n",
      "44\tQ4406616\t0.001093\n",
      "43\tQ44\t0.001219\n",
      "46\tQ488383\t0.001689\n",
      "38\tQ35120\t0.001846\n",
      "57\tnovalue\t0.012314\n"
     ]
    }
   ],
   "source": [
    "!cat $OUT/$NAME.everything.statistics.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "total 12712\n",
      "-rw-r--r--  1 pedroszekely  staff   1.2M Oct 16 22:39 Q44.P279star.tsv.gz\n",
      "-rw-r--r--  1 pedroszekely  staff   407K Oct 16 22:39 Q44.alias.en.tsv.gz\n",
      "-rw-r--r--  1 pedroszekely  staff   479K Oct 16 22:39 Q44.description.en.tsv.gz\n",
      "-rw-r--r--  1 pedroszekely  staff   304B Oct 16 22:39 Q44.everything.statistics.txt\n",
      "-rw-r--r--  1 pedroszekely  staff   446K Oct 16 22:39 Q44.label.en.tsv.gz\n",
      "-rw-r--r--  1 pedroszekely  staff    22K Oct 16 22:38 Q44.part.commonsMedia.tsv.gz\n",
      "-rw-r--r--  1 pedroszekely  staff    93K Oct 16 22:38 Q44.part.external-id.tsv.gz\n",
      "-rw-r--r--  1 pedroszekely  staff   938B Oct 16 22:38 Q44.part.geo-shape.tsv.gz\n",
      "-rw-r--r--  1 pedroszekely  staff   6.3K Oct 16 22:38 Q44.part.globe-coordinate.tsv.gz\n",
      "-rw-r--r--  1 pedroszekely  staff    62B Oct 16 22:38 Q44.part.math.tsv.gz\n",
      "-rw-r--r--  1 pedroszekely  staff    38K Oct 16 22:38 Q44.part.monolingualtext.tsv.gz\n",
      "-rw-r--r--  1 pedroszekely  staff    74B Oct 16 22:38 Q44.part.musical-notation.tsv.gz\n",
      "-rw-r--r--  1 pedroszekely  staff   201K Oct 16 22:38 Q44.part.quantity.tsv.gz\n",
      "-rw-r--r--  1 pedroszekely  staff    15K Oct 16 22:38 Q44.part.string.tsv.gz\n",
      "-rw-r--r--  1 pedroszekely  staff   3.8K Oct 16 22:38 Q44.part.time.tsv.gz\n",
      "-rw-r--r--  1 pedroszekely  staff   5.6K Oct 16 22:38 Q44.part.url.tsv.gz\n",
      "-rw-r--r--  1 pedroszekely  staff    71B Oct 16 22:38 Q44.part.wikibase-form.tsv.gz\n",
      "-rw-r--r--  1 pedroszekely  staff   154K Oct 16 22:38 Q44.part.wikibase-item.tsv.gz\n",
      "-rw-r--r--  1 pedroszekely  staff   239B Oct 16 22:38 Q44.part.wikibase-property.tsv.gz\n",
      "-rw-r--r--  1 pedroszekely  staff    76K Oct 16 22:39 Q44.properties.tsv.gz\n",
      "-rw-r--r--  1 pedroszekely  staff   1.6M Oct 16 22:39 Q44.statistics.tsv.gz\n"
     ]
    }
   ],
   "source": [
    "!ls -lh $OUT"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
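    "Example of how to get statistics on a property: the query below counts the values of `P106` (occupation) in the subset, most frequent first. Note that the `property_label` column holds the label of the value (`n2`), not the label of the property."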
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "property_id  property_label        value     value_count\n",
      "P106         'Entrepreneur'@en-ca  Q131524   3\n",
      "P106         'entrepreneur'@en     Q131524   3\n",
      "P106         'entrepreneur'@en-gb  Q131524   3\n",
      "P106         'toy maker'@en        Q2310380  1\n"
     ]
    }
   ],
   "source": [
    "!kgtk query -i $TEMP/$NAME.everything_2.tsv.gz -i $WIKIDATA_PARTS/part.label.en.tsv.gz --graph-cache $STORE \\\n",
    "--match 'everything: (n1)-[l:P106]->(n2), label: (n2)-[:label]->(label)' \\\n",
    "--return 'distinct l.label as property_id, label as property_label, n2 as value, count(n2) as value_count' \\\n",
    "--order-by 'count(n2) desc' \\\n",
    "--limit 10 \\\n",
    "| column -t -s $'\\t' "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Distribution of label/node2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Entity Profiles\n",
    "The cells in this section should be moved to a new `Example10 Entity Profiler` notebook."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Entity profiler for items\n",
    "For every item `node1`, get the distinct `P31(node1)`/`label`/`node2` triples, where `P31(node1)` is the type of `node1`, and count the number of edges that produce each triple.\n",
    "\n",
    "Represent the result as KGTK edges:\n",
    "- `node1`: the property, i.e., the `label` in our definition\n",
    "- `label`: a new property we call `Pprofiler_count`\n",
    "- `node2`: the count\n",
    "\n",
    "Use qualifiers to represent the context:\n",
    "- `Pcontext_item`: represents the `node2` in our definition\n",
    "- `Pcontext_type`: represents `P31(node1)` in our definition"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 151,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Pcontext_type  node1_dummy  Pcontext_item  node2  node1;label                                            Pcontext_item;label               Pcontext_type;label       label\n",
      "Q1066984       P1151        Q11028213      6      'topic\\\\\\\\\\\\\\\\'s main Wikimedia portal'@en             'Portal:Munich'@en                'Financial centre'@en-ca  Pprofiler_count\n",
      "Q1066984       P131         Q10562         18     'located in the administrative territorial entity'@en  'Upper Bavaria'@en                'Financial centre'@en-ca  Pprofiler_count\n",
      "Q1066984       P131         Q1673724       6      'located in the administrative territorial entity'@en  'Isarkreis'@en                    'Financial centre'@en-ca  Pprofiler_count\n",
      "Q1066984       P1313        Q11902879      36     'office held by head of government'@en                 'Lord Mayor'@en                   'Financial centre'@en-ca  Pprofiler_count\n",
      "Q1066984       P1313        Q1958954       12     'office held by head of government'@en                 'list of mayors of Munich'@en     'Financial centre'@en-ca  Pprofiler_count\n",
      "Q1066984       P1343        Q97879676      18     'described by source'@en                               'Regesta Imperii XIII'@en         'Financial centre'@en-ca  Pprofiler_count\n",
      "Q1066984       P1343        Q316838        12     'described by source'@en                               'Regesta Imperii'@en              'Financial centre'@en-ca  Pprofiler_count\n",
      "Q1066984       P1343        Q19190511      6      'described by source'@en                               'New Encyclopedic Dictionary'@en  'Financial centre'@en-ca  Pprofiler_count\n",
      "Q1066984       P1376        Q980           18     'capital of'@en                                        'Bavaria'@en                      'Financial centre'@en-ca  Pprofiler_count\n",
      "Q1066984       P1376        Q58738         18     'capital of'@en                                        'Bavarian Soviet Republic'@en     'Financial centre'@en-ca  Pprofiler_count\n"
     ]
    }
   ],
   "source": [
    "!$kgtk query -i $OUT/$NAME.part.wikibase-item.tsv.gz -i $OUT/$NAME.label.en.tsv.gz --graph-cache $STORE \\\n",
    "--match 'item: (n1)-[l {label: p}]->(n2), item: (n1)-[:P31]->(type), label: (p)-[:label]->(lab), label: (type)-[:label]->(type_label), label: (n2)-[:label]->(n2_label)' \\\n",
    "--where 'lab.kgtk_lqstring_lang_suffix = \"en\"' \\\n",
    "--return 'distinct type as Pcontext_type, l.label as node1_dummy, n2 as Pcontext_item, count(n1) as node2, lab as `node1;label`, n2_label as `Pcontext_item;label`, type_label as `Pcontext_type;label`, \"Pprofiler_count\" as label' \\\n",
    "--order-by 'type, p, count(n1) desc' \\\n",
    "--limit 10 \\\n",
    "| column -t -s $'\\t' "
   ]
  },
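  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The counting performed by the query above can be sketched in plain Python. This is a minimal illustration, not the KGTK implementation: the toy edges, the node ids `Q_a` and `Q_b`, and the helper name `profile` are assumptions made for the example, and the sketch ignores the label joins, which multiply the counts in the real output.\n",
    "\n",
    "```python\n",
    "from collections import Counter\n",
    "\n",
    "# Toy (node1, label, node2) edges. The P/Q ids on the right appear in the\n",
    "# output above; Q_a, Q_b and the edges themselves are hypothetical.\n",
    "edges = [\n",
    "    ('Q_a', 'P31', 'Q1066984'),\n",
    "    ('Q_b', 'P31', 'Q1066984'),\n",
    "    ('Q_a', 'P1376', 'Q980'),\n",
    "    ('Q_b', 'P1376', 'Q980'),\n",
    "    ('Q_a', 'P131', 'Q10562'),\n",
    "]\n",
    "\n",
    "def profile(edges):\n",
    "    # Map each node1 to the set of its P31 types.\n",
    "    types = {}\n",
    "    for n1, label, n2 in edges:\n",
    "        if label == 'P31':\n",
    "            types.setdefault(n1, set()).add(n2)\n",
    "    # Count every (type, property, value) combination.\n",
    "    counts = Counter()\n",
    "    for n1, label, n2 in edges:\n",
    "        for t in types.get(n1, ()):\n",
    "            counts[(t, label, n2)] += 1\n",
    "    # One profile edge per combination, with the context as qualifiers.\n",
    "    return [\n",
    "        {'node1': prop, 'label': 'Pprofiler_count', 'node2': n,\n",
    "         'Pcontext_item': value, 'Pcontext_type': t}\n",
    "        for (t, prop, value), n in sorted(counts.items())\n",
    "    ]\n",
    "\n",
    "rows = profile(edges)\n",
    "for row in rows:\n",
    "    print(row)\n",
    "```\n",
    "\n",
    "Each printed row has the shape described above: the property in `node1`, the count in `node2`, and the type and value as `Pcontext_type` and `Pcontext_item`."
   ]
  },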
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### The cells below compute profiles for other data types and should be refactored to follow the pattern of the Entity profiler for items"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "type       prop  property_label  year  count\n",
      "Q3624078   P571  'inception'@en  1991  18\n",
      "Q3624078   P571  'inception'@en  1918  16\n",
      "Q6256      P571  'inception'@en  1991  12\n",
      "Q6256      P571  'inception'@en  1918  10\n",
      "Q123480    P571  'inception'@en  1991  8\n",
      "Q179164    P571  'inception'@en  1991  8\n",
      "Q4209223   P571  'inception'@en  1991  8\n",
      "Q44        P571  'inception'@en  2001  8\n",
      "Q619610    P571  'inception'@en  1991  8\n",
      "Q63791824  P571  'inception'@en  1918  8\n"
     ]
    }
   ],
   "source": [
    "!$kgtk query -i $OUT/$NAME.part.time.tsv.gz -i $OUT/$NAME.part.wikibase-item.tsv.gz -i $OUT/$NAME.label.en.tsv.gz --graph-cache $STORE \\\n",
    "--match 'time: (n1)-[l {label: p}]->(n2), item: (n1)-[:P31]->(type), label: (p)-[:label]->(lab)' \\\n",
    "--return 'distinct type as type, l.label as prop, lab as property_label, kgtk_date_year(n2) as year, count(n1) as count' \\\n",
    "--where 'lab.kgtk_lqstring_lang_suffix = \"en\"' \\\n",
    "--order-by 'count(n1) desc' \\\n",
    "--limit 10 \\\n",
    "| column -t -s $'\\t' "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "type      prop   property_label         value  count\n",
      "Q3624078  P3000  'marriageable age'@en  18     56\n",
      "Q3624078  P2997  'age of majority'@en   18     53\n",
      "Q3624078  P2884  'mains voltage'@en     230    41\n",
      "Q3624078  P1279  'inflation rate'@en    1.7    39\n",
      "Q3624078  P1279  'inflation rate'@en    1.8    39\n",
      "Q3624078  P1279  'inflation rate'@en    2.1    37\n",
      "Q3624078  P1279  'inflation rate'@en    1.5    32\n",
      "Q3624078  P1279  'inflation rate'@en    2      31\n",
      "Q3624078  P1279  'inflation rate'@en    2.8    31\n",
      "Q3624078  P1279  'inflation rate'@en    3.5    29\n"
     ]
    }
   ],
   "source": [
    "!$kgtk query -i $OUT/$NAME.part.quantity.tsv.gz -i $OUT/$NAME.part.wikibase-item.tsv.gz -i $OUT/$NAME.label.en.tsv.gz --graph-cache $STORE \\\n",
    "--match 'quantity: (n1)-[l {label: p}]->(n2), item: (n1)-[:P31]->(type), label: (p)-[:label]->(lab)' \\\n",
    "--return 'distinct type as type, l.label as prop, lab as property_label, kgtk_quantity_number(n2) as value, count(n1) as count' \\\n",
    "--where 'lab.kgtk_lqstring_lang_suffix = \"en\"' \\\n",
    "--order-by 'count(n1) desc' \\\n",
    "--limit 10 \\\n",
    "| column -t -s $'\\t' "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Extending KG to include nodes with ambiguous names"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Find node2s where we have node1/label/node1_label in qnodelist such that there exists a node2/alias/node2_alias in Wikidata such that node2_alias = node1_label"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 161,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "node1      node1;label    node2      node2;label                  label\n",
      "Q1017471   'Bush'@en      Q80857164  'The Bush'@en                Pshares_name\n",
      "Q1017471   'Bush'@en      Q80857164  'The Bush'@en-gb             Pshares_name\n",
      "Q1017471   'Bush'@en      Q21810649  'Norton Bush'@en             Pshares_name\n",
      "Q1017471   'Bush'@en      Q60614686  'The Gentlemen'@en           Pshares_name\n",
      "Q1017471   'Bush'@en      Q4888621   'Benjamin Franklin Bush'@en  Pshares_name\n",
      "Q1017471   'Bush'@en      Q54888574  'Bush, Washington'@en        Pshares_name\n",
      "Q10350781  'Polar'@en     Q1500857   'Polar Electro'@en           Pshares_name\n",
      "Q1041750   'Carling'@en   Q7230524   'Port Carling'@en            Pshares_name\n",
      "Q12009657  'Victoria'@en  Q286499    'Vitruvia'@en                Pshares_name\n",
      "Q12009657  'Victoria'@en  Q3557663   'Michel Sardou'@en           Pshares_name\n"
     ]
    }
   ],
   "source": [
    "!$kgtk query -i $TEMP/qnodelist.$NAME.tsv.gz -i $WIKIDATA_PARTS/part.label.en.tsv.gz -i $WIKIDATA_PARTS/part.alias.en.tsv.gz --graph-cache $STORE \\\n",
    "--match 'qnodelist: (n1)-[]->(), label: (n1)-[:label]->(n1_label), alias: (n2)-[:alias]->(n1_label), label: (n2)-[:label]->(n2_label)' \\\n",
    "--where 'n1 != n2' \\\n",
    "--return 'distinct n1 as node1, n1_label as `node1;label`, n2 as node2, n2_label as `node2;label`, \"Pshares_name\" as label' \\\n",
    "--limit 10 \\\n",
    "| column -t -s $'\\t' "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Find node2s where we have node1/alias/node1_alias in qnodelist such that there exists a node2/label/node2_label in Wikidata such that node2_label = node1_alias"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 160,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "node1     node1;label       node2      node2;label  label\n",
      "Q1157108  'Cerveza Sol'@en  Q64961707  'Sol'@en     Pshares_name\n",
      "Q1157108  'Cerveza Sol'@en  Q7555482   'Sol'@en     Pshares_name\n",
      "Q1157108  'Cerveza Sol'@en  Q69509964  'Sol'@en     Pshares_name\n",
      "Q1157108  'Cerveza Sol'@en  Q1237552   'Sol'@en     Pshares_name\n",
      "Q1157108  'Cerveza Sol'@en  Q7555484   'Sol'@en     Pshares_name\n",
      "Q1157108  'Cerveza Sol'@en  Q7555486   'Sol'@en     Pshares_name\n",
      "Q1157108  'Cerveza Sol'@en  Q64961436  'Sol'@en     Pshares_name\n",
      "Q1157108  'Cerveza Sol'@en  Q3489075   'Sol'@en     Pshares_name\n",
      "Q1157108  'Cerveza Sol'@en  Q37563235  'Sol'@en     Pshares_name\n",
      "Q1157108  'Cerveza Sol'@en  Q23664473  'Sol'@en     Pshares_name\n"
     ]
    }
   ],
   "source": [
    "!$kgtk query -i $TEMP/qnodelist.$NAME.tsv.gz -i $WIKIDATA_PARTS/part.label.en.tsv.gz -i $WIKIDATA_PARTS/part.alias.en.tsv.gz --graph-cache $STORE \\\n",
    "--match 'qnodelist: (n1)-[]->(), label: (n1)-[:label]->(n1_label), alias: (n1)-[:alias]->(n1_alias), label: (n2)-[:label]->(n1_alias)' \\\n",
    "--where 'n1 != n2' \\\n",
    "--return 'distinct n1 as node1, n1_label as `node1;label`, n2 as node2, n1_alias as `node2;label`, \"Pshares_name\" as label' \\\n",
    "--limit 10 \\\n",
    "| column -t -s $'\\t' "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Find node2s where we have node1/alias/node1_alias in qnodelist such that there exists a node2/alias/node2_alias in Wikidata such that node2_alias = node1_alias"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 163,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "node1     node1;label       node2      node2;label      label\n",
      "Q1157108  'Cerveza Sol'@en  Q22583558  'J-Hope'@en      Pshares_name\n",
      "Q1157108  'Cerveza Sol'@en  Q18607853  'Solomon'@en     Pshares_name\n",
      "Q1157108  'Cerveza Sol'@en  Q18607853  'Solomon'@en-ca  Pshares_name\n",
      "Q1157108  'Cerveza Sol'@en  Q18607853  'Solomon'@en-gb  Pshares_name\n",
      "Q1157108  'Cerveza Sol'@en  Q28800560  'El Sol'@en      Pshares_name\n",
      "Q1157108  'Cerveza Sol'@en  Q28800560  'El Sol'@en-ca   Pshares_name\n",
      "Q1157108  'Cerveza Sol'@en  Q28800560  'El Sol'@en-gb   Pshares_name\n",
      "Q1157108  'Cerveza Sol'@en  Q654596    'Sól'@en         Pshares_name\n",
      "Q1157108  'Cerveza Sol'@en  Q7666238   'Sól'@en         Pshares_name\n",
      "Q1157108  'Cerveza Sol'@en  Q525       'Sun'@en         Pshares_name\n"
     ]
    }
   ],
   "source": [
    "!$kgtk query -i $TEMP/qnodelist.$NAME.tsv.gz -i $WIKIDATA_PARTS/part.label.en.tsv.gz -i $WIKIDATA_PARTS/part.alias.en.tsv.gz --graph-cache $STORE \\\n",
    "--match 'qnodelist: (n1)-[]->(), label: (n1)-[:label]->(n1_label), alias: (n1)-[:alias]->(n1_alias), alias: (n2)-[:alias]->(n1_alias), label: (n2)-[:label]->(n2_label)' \\\n",
    "--where 'n1 != n2' \\\n",
    "--return 'distinct n1 as node1, n1_label as `node1;label`, n2 as node2, n2_label as `node2;label`, \"Pshares_name\" as label' \\\n",
    "--limit 10 \\\n",
    "| column -t -s $'\\t' "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Find node2s where we have node1/label/node1_label in qnodelist such that there exists a node2/label/node2_label in Wikidata such that node2_label = node1_label"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 165,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "node1     node1;label  node2      node2;label  label\n",
      "Q1017471  'Bush'@en    Q5001360   'Bush'@en    Pshares_name\n",
      "Q1017471  'Bush'@en    Q77894031  'Bush'@en    Pshares_name\n",
      "Q1017471  'Bush'@en    Q5001365   'Bush'@en    Pshares_name\n",
      "Q1017471  'Bush'@en    Q20482703  'Bush'@en    Pshares_name\n",
      "Q1017471  'Bush'@en    Q18793771  'Bush'@en    Pshares_name\n",
      "Q1017471  'Bush'@en    Q247949    'Bush'@en    Pshares_name\n",
      "Q1017471  'Bush'@en    Q1017464   'Bush'@en    Pshares_name\n",
      "Q1017471  'Bush'@en    Q1484464   'Bush'@en    Pshares_name\n",
      "Q1017471  'Bush'@en    Q224168    'Bush'@en    Pshares_name\n",
      "Q1017471  'Bush'@en    Q2469309   'Bush'@en    Pshares_name\n"
     ]
    }
   ],
   "source": [
    "!$kgtk query -i $TEMP/qnodelist.$NAME.tsv.gz -i $WIKIDATA_PARTS/part.label.en.tsv.gz --graph-cache $STORE \\\n",
    "--match 'qnodelist: (n1)-[]->(), label: (n1)-[:label]->(n1_label), label: (n2)-[:label]->(n1_label)' \\\n",
    "--where 'n1 != n2' \\\n",
    "--return 'distinct n1 as node1, n1_label as `node1;label`, n2 as node2, n1_label as `node2;label`, \"Pshares_name\" as label' \\\n",
    "--limit 10 \\\n",
    "| column -t -s $'\\t' "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 167,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[90mid\u001b[39m Q1017471\n",
      "\u001b[42mLabel\u001b[49m Bush\n",
      "\u001b[44mDescription\u001b[49m Beer of Belgium (Wallonia)\n",
      "\u001b[30m\u001b[47minstance of\u001b[49m\u001b[39m \u001b[90m(P31)\u001b[39m\u001b[90m: \u001b[39mbeer brand \u001b[90m(Q15075508)\u001b[39m | beer \u001b[90m(Q44)\u001b[39m\n",
      "\n",
      "\u001b[90mid\u001b[39m Q247949\n",
      "\u001b[42mLabel\u001b[49m Bush\n",
      "\u001b[44mDescription\u001b[49m British rock band\n",
      "\u001b[30m\u001b[47minstance of\u001b[49m\u001b[39m \u001b[90m(P31)\u001b[39m\u001b[90m: \u001b[39mmusical group \u001b[90m(Q215380)\u001b[39m\n"
     ]
    }
   ],
   "source": [
    "!wd u Q1017471 Q247949"
   ]
  },
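  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The four name-matching queries above differ only in which name field of each node is compared (label or alias). A minimal Python sketch of that idea follows. It is an illustration under stated assumptions: the helper `shares_name`, the node `Q_x` and the toy tables are hypothetical, and names are plain strings without the language tags that KGTK labels carry.\n",
    "\n",
    "```python\n",
    "# Toy name tables. Q1017471 and Q247949 appear in the output above;\n",
    "# Q_x and its names are made up for the example.\n",
    "labels = {'Q1017471': ['Bush'], 'Q247949': ['Bush'], 'Q_x': ['Some band']}\n",
    "aliases = {'Q_x': ['Bush']}\n",
    "subset = ['Q1017471']  # stands in for qnodelist\n",
    "\n",
    "def shares_name(subset, left, right):\n",
    "    # Index the right-hand table by name for constant-time lookup.\n",
    "    by_name = {}\n",
    "    for node, names in right.items():\n",
    "        for name in names:\n",
    "            by_name.setdefault(name, set()).add(node)\n",
    "    # Emit one Pshares_name edge per distinct pair of different nodes.\n",
    "    found = set()\n",
    "    for n1 in subset:\n",
    "        for name in left.get(n1, []):\n",
    "            for n2 in by_name.get(name, ()):\n",
    "                if n1 != n2:\n",
    "                    found.add((n1, 'Pshares_name', n2))\n",
    "    return sorted(found)\n",
    "\n",
    "label_label = shares_name(subset, labels, labels)\n",
    "label_alias = shares_name(subset, labels, aliases)\n",
    "print(label_label)\n",
    "print(label_alias)\n",
    "```\n",
    "\n",
    "Passing `aliases` as the left table gives the alias/label and alias/alias variants in the same way."
   ]
  },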
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "kgtk",
   "language": "python",
   "name": "kgtk"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}