{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Generating Useful Wikidata Files"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Batch Invocation\n",
    "Example batch command. The second argument is a notebook where the output will be stored. You can load it to see progress.\n",
    "\n",
    "```\n",
    "papermill Example7\\ -\\ Wikidata\\ Outputs.ipynb example7.out.ipynb \\\n",
    "-p home /Users/pedroszekely/Downloads/kypher \\\n",
    "-p wiki_file all.10.tsv.gz \\\n",
    "-p output_folder output.all.10 \\\n",
    "-p temp_folder temp.all.10 \\\n",
    "-p delete_database true \n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "parameters"
    ]
   },
   "outputs": [],
   "source": [
    "# Parameters\n",
    "home = \"/Users/pedroszekely/Downloads/kypher\"\n",
    "wiki_file = \"all.tsv.gz\"\n",
    "wiki_file = \"all.10.tsv.gz\"\n",
    "output_folder = \"output\"\n",
    "output_folder = \"output.all.10\"\n",
    "temp_folder = \"temp\"\n",
    "temp_folder = \"temp.all.10\"\n",
    "delete_database = \"true\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import io\n",
    "import os\n",
    "import subprocess\n",
    "import sys\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# from IPython.display import display, HTML, Image\n",
    "# from pandas_profiling import ProfileReport"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Set up environment and folders to store the files\n",
    "\n",
    "- `WIKIDATA_HOME` folder where you put your Wikidata data\n",
    "- `OUT` folder where the output files go\n",
    "- `TEMP` folder to keep temporary files , including the database\n",
    "- `kgtk` shortcut to invoke the kgtk software\n",
    "\n",
    "The current implementation of some of the kgtk commands does not understand compressed files. In particular, `query` often rejects `gz` files.\n",
    "\n",
    "To dos:\n",
    "\n",
    "- Make sure that all files have id columns as `query` gets unhappy when files have no ids.\n",
    "- Create an output folder for a subset of Wikidata without scholarly articles. This is half done: the remaining work is to subtract the scholarly articles from `EDGES` and repeat the workflow.\n",
    "- Change the naming convention to make it clear which files are a partition of the original `EDGES`, so users know what files they need to get to have a full version.\n",
    "- Create a qualifier file for the partition files of Wikidata: this is so that if a user gets one of the partitions, they can get the corresponding qualifier file.\n",
    "- Add pagerank and other stats. We can compute the pagerank from the `all.item` file, so maybe should be called `all.item.pagerank.tsv`\n",
    "\n",
    "Naming convention: the name `all` is redundant, we should consider removing it. I recomment using the prefix `part.` to name the partition of Wikidata, e.g., `part.label`, `part.quantity`. Files such as `P279` are not partitions as it is a subset of `part.item`.\n",
    "\n",
    "If we create a subset of Wikidata, e.g., no scholarly articles, we could call it `minus.Q13442814`; if we remove galaxies too, we could call it `minus.Q13442814-Q318`, so the files would be `minus.Q13442814-Q318.part.quantity.tsv` (the idea of `all` is in contrast to `minus`). We can also have files that start with Qnodes, e.g, `Q5.part.quantity.tsv`; constructing such files is harder as we don't want dangling nodes in the item file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "os.environ['WIKIDATA_HOME'] = home\n",
    "os.environ['OUT'] = \"{}/{}\".format(os.environ['WIKIDATA_HOME'], output_folder)\n",
    "os.environ['TEMP'] = \"{}/{}\".format(os.environ['WIKIDATA_HOME'], temp_folder)\n",
    "os.environ['kgtk'] = \"kgtk\"\n",
    "os.environ['kgtk'] = \"time kgtk --debug\"\n",
    "wikidata_home = os.environ['WIKIDATA_HOME']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/Users/pedroszekely/Downloads/kypher\n",
      "/Users/pedroszekely/Downloads/kypher/output.all.10\n",
      "/Users/pedroszekely/Downloads/kypher/temp.all.10\n",
      "time kgtk --debug\n"
     ]
    }
   ],
   "source": [
    "!echo $WIKIDATA_HOME\n",
    "!echo $OUT\n",
    "!echo $TEMP\n",
    "!echo $kgtk"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/Users/pedroszekely/Downloads/kypher\n"
     ]
    }
   ],
   "source": [
    "cd $wikidata_home"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "mkdir: /Users/pedroszekely/Downloads/kypher/output.all.10: File exists\n"
     ]
    }
   ],
   "source": [
    "!mkdir $OUT\n",
    "!mkdir $TEMP"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Clean up the output and temp folders before we start"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "rm: /Users/pedroszekely/Downloads/kypher/temp.all.10/*.tsv: No such file or directory\n",
      "rm: /Users/pedroszekely/Downloads/kypher/temp.all.10/*.tsv.gz: No such file or directory\n"
     ]
    }
   ],
   "source": [
    "!rm $OUT/*.tsv $OUT/*.tsv.gz\n",
    "!rm $TEMP/*.tsv $TEMP/*.tsv.gz"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Deleting database\n",
      "rm: /Users/pedroszekely/Downloads/kypher/temp.all.10/wikidata.sqlite3.db: No such file or directory\n"
     ]
    }
   ],
   "source": [
    "if delete_database:\n",
    "    print(\"Deleting database\")\n",
    "    !rm $TEMP/wikidata.sqlite3.db"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `all` file contains 100M edges of the full dump, `all.10` contains 10M edges. This is for testing, as we should run on the full edges file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "env: STORE=/Users/pedroszekely/Downloads/kypher/temp/wikidata.sqlite3.db\n",
      "env: EDGES=/Users/pedroszekely/Downloads/kypher/all.10.tsv.gz\n"
     ]
    }
   ],
   "source": [
    "%env STORE=$wikidata_home/temp/wikidata.sqlite3.db\n",
    "# %env EDGES=$wikidata_home/all.10.tsv\n",
    "%env EDGES=$wikidata_home/$wiki_file\n",
    "\n",
    "#%env QUALS=$wikidata_home/wikidata-20200803-all-qualifiers.tsv.gz\n",
    "#%env LABELS=$wikidata_home/wikidata-20200803-all-labels-en-sorted.tsv.gz"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Uncomment the line below to remove the sqllite2 database. It takes a long time to load all the data and create indices, so don't remove the database unless you change files that have already been loaded and you need to force a reload."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Get a sample and force importing the edge file into the database"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2020-09-30 11:21:25 query]: SQL Translation:\n",
      "---------------------------------------------\n",
      "  SELECT *\n",
      "     FROM graph_8 AS graph_8_c1\n",
      "     LIMIT ?\n",
      "  PARAS: [10]\n",
      "---------------------------------------------\n",
      "id\tnode1\tlabel\tnode2\trank\tnode2;magnitude\tnode2;unit\tnode2;date\tnode2;item\tnode2;lower\tnode2;upper\tnode2;latitude\tnode2;longitude\tnode2;precision\tnode2;calendar\tnode2;entity-type\tnode2;wikidatatype\n",
      "Q45-label-en\tQ45\tlabel\t'Portugal'@en\t\t\t\t\t\t\t\t\t\t\t\t\t\n",
      "Q45-label-fr\tQ45\tlabel\t'Portugal'@fr\t\t\t\t\t\t\t\t\t\t\t\t\t\n",
      "Q45-label-nb\tQ45\tlabel\t'Portugal'@nb\t\t\t\t\t\t\t\t\t\t\t\t\t\n",
      "Q45-label-it\tQ45\tlabel\t'Portogallo'@it\t\t\t\t\t\t\t\t\t\t\t\t\t\n",
      "Q45-label-ru\tQ45\tlabel\t'Португалия'@ru\t\t\t\t\t\t\t\t\t\t\t\t\t\n",
      "Q45-label-nl\tQ45\tlabel\t'Portugal'@nl\t\t\t\t\t\t\t\t\t\t\t\t\t\n",
      "Q45-label-es\tQ45\tlabel\t'Portugal'@es\t\t\t\t\t\t\t\t\t\t\t\t\t\n",
      "Q45-label-de\tQ45\tlabel\t'Portugal'@de\t\t\t\t\t\t\t\t\t\t\t\t\t\n",
      "Q45-label-pl\tQ45\tlabel\t'Portugalia'@pl\t\t\t\t\t\t\t\t\t\t\t\t\t\n",
      "Q45-label-be-tarask\tQ45\tlabel\t'Партугалія'@be-tarask\t\t\t\t\t\t\t\t\t\t\t\t\t\n",
      "        2.22 real         0.60 user         0.16 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk query -i $EDGES --limit 10 --graph-cache $STORE"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Force creation of the index on the label column"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2020-09-30 11:21:26 query]: SQL Translation:\n",
      "---------------------------------------------\n",
      "  SELECT *\n",
      "     FROM graph_8 AS graph_8_c1\n",
      "     WHERE graph_8_c1.\"label\"=?\n",
      "     LIMIT ?\n",
      "  PARAS: ['P31', 5]\n",
      "---------------------------------------------\n",
      "id\tnode1\tlabel\tnode2\trank\tnode2;magnitude\tnode2;unit\tnode2;date\tnode2;item\tnode2;lower\tnode2;upper\tnode2;latitude\tnode2;longitude\tnode2;precision\tnode2;calendar\tnode2;entity-type\tnode2;wikidatatype\n",
      "Q45-P31-1\tQ45\tP31\tQ3624078\tnormal\t\t\t\tQ3624078\t\t\t\t\t\t\titem\twikibase-item\n",
      "Q45-P31-2\tQ45\tP31\tQ6256\tnormal\t\t\t\tQ6256\t\t\t\t\t\t\titem\twikibase-item\n",
      "Q45-P31-3\tQ45\tP31\tQ20181813\tnormal\t\t\t\tQ20181813\t\t\t\t\t\t\titem\twikibase-item\n",
      "Q140-P31-1\tQ140\tP31\tQ16521\tnormal\t\t\t\tQ16521\t\t\t\t\t\t\titem\twikibase-item\n",
      "Q183-P31-1\tQ183\tP31\tQ3624078\tpreferred\t\t\t\tQ3624078\t\t\t\t\t\t\titem\twikibase-item\n",
      "        0.71 real         0.57 user         0.13 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk query -i $EDGES --graph-cache $STORE -o - \\\n",
    "    --match '(i)-[:P31]->(c)' \\\n",
    "    --limit 5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Force creation of the index on the node2 column"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2020-09-30 11:21:27 query]: SQL Translation:\n",
      "---------------------------------------------\n",
      "  SELECT *\n",
      "     FROM graph_8 AS graph_8_c1\n",
      "     WHERE graph_8_c1.\"node2\"=?\n",
      "     LIMIT ?\n",
      "  PARAS: ['Q5', 5]\n",
      "---------------------------------------------\n",
      "id\tnode1\tlabel\tnode2\trank\tnode2;magnitude\tnode2;unit\tnode2;date\tnode2;item\tnode2;lower\tnode2;upper\tnode2;latitude\tnode2;longitude\tnode2;precision\tnode2;calendar\tnode2;entity-type\tnode2;wikidatatype\n",
      "Q1253-P31-1\tQ1253\tP31\tQ5\tnormal\t\t\t\tQ5\t\t\t\t\t\t\titem\twikibase-item\n",
      "Q1526-P31-1\tQ1526\tP31\tQ5\tnormal\t\t\t\tQ5\t\t\t\t\t\t\titem\twikibase-item\n",
      "Q3794-P31-1\tQ3794\tP31\tQ5\tnormal\t\t\t\tQ5\t\t\t\t\t\t\titem\twikibase-item\n",
      "Q4291-P31-1\tQ4291\tP31\tQ5\tnormal\t\t\t\tQ5\t\t\t\t\t\t\titem\twikibase-item\n",
      "Q4489-P31-1\tQ4489\tP31\tQ5\tnormal\t\t\t\tQ5\t\t\t\t\t\t\titem\twikibase-item\n",
      "        0.70 real         0.56 user         0.12 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk query -i $EDGES --graph-cache $STORE -o - \\\n",
    "    --match '(i)-[r]->(:Q5)' \\\n",
    "    --limit 5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Count the number of edges"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2020-09-30 11:21:28 query]: SQL Translation:\n",
      "---------------------------------------------\n",
      "  SELECT count(graph_8_c1.\"id\") \"count\"\n",
      "     FROM graph_8 AS graph_8_c1\n",
      "     LIMIT ?\n",
      "  PARAS: [10]\n",
      "---------------------------------------------\n",
      "count\n",
      "9999999\n",
      "        6.80 real         1.57 user         1.54 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk query -i $EDGES --graph-cache $STORE \\\n",
    "    --match 'all: ()-[r]->()' \\\n",
    "    --return 'count(r) as count' \\\n",
    "    --limit 10"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Get the distribution of the label column\n",
    "I would like to have it sorted numerically, but don't know how to make it happen"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "       43.44 real        42.83 user         0.79 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk unique --column label -i $EDGES / sort2 -c node2 -r -o $OUT/all-distribution.tsv "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "node1  label  node2\n",
      "P3987  count  998\n",
      "P410   count  987\n",
      "P575   count  985\n",
      "P6879  count  98\n",
      "P6562  count  98\n",
      "P5395  count  98\n",
      "P3153  count  98\n",
      "P4933  count  97\n",
      "P3135  count  97\n"
     ]
    }
   ],
   "source": [
    "!head $OUT/all-distribution.tsv | column -t -s $'\\t' "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Compute files with labels, aliases and descriptions\n",
    "Return the id, node1, label and node2 columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2020-09-30 11:22:19 query]: SQL Translation:\n",
      "---------------------------------------------\n",
      "  SELECT graph_8_c1.\"id\", graph_8_c1.\"node1\", graph_8_c1.\"label\", graph_8_c1.\"node2\"\n",
      "     FROM graph_8 AS graph_8_c1\n",
      "     WHERE graph_8_c1.\"label\"=?\n",
      "  PARAS: ['label']\n",
      "---------------------------------------------\n",
      "       20.17 real        17.45 user         1.26 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/part.label.tsv.gz \\\n",
    "    --match '(n1)-[l:label]->(n2)' \\\n",
    "    --return 'l, n1, l.label, n2' "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2020-09-30 11:22:39 query]: SQL Translation:\n",
      "---------------------------------------------\n",
      "  SELECT graph_8_c1.\"id\", graph_8_c1.\"node1\", graph_8_c1.\"label\", graph_8_c1.\"node2\"\n",
      "     FROM graph_8 AS graph_8_c1\n",
      "     WHERE graph_8_c1.\"label\"=?\n",
      "  PARAS: ['alias']\n",
      "---------------------------------------------\n",
      "        3.65 real         2.86 user         0.42 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/part.alias.tsv.gz \\\n",
    "    --match '(n1)-[l:alias]->(n2)' \\\n",
    "    --return 'l, n1, l.label, n2'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2020-09-30 11:22:43 query]: SQL Translation:\n",
      "---------------------------------------------\n",
      "  SELECT graph_8_c1.\"id\", graph_8_c1.\"node1\", graph_8_c1.\"label\", graph_8_c1.\"node2\"\n",
      "     FROM graph_8 AS graph_8_c1\n",
      "     WHERE graph_8_c1.\"label\"=?\n",
      "  PARAS: ['description']\n",
      "---------------------------------------------\n",
      "       60.09 real        50.85 user         3.29 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/part.description.tsv.gz \\\n",
    "    --match '(n1)-[l:description]->(n2)' \\\n",
    "    --return 'l, n1, l.label, n2'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Now create files with the English labels, aliases and descriptions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2020-09-30 11:23:43 sqlstore]: DROP graph data table graph_18 from /Users/pedroszekely/Downloads/kypher/output.all.10/part.label.tsv.gz\n",
      "[2020-09-30 11:23:44 sqlstore]: IMPORT graph directly into table graph_38 from /Users/pedroszekely/Downloads/kypher/output.all.10/part.label.tsv.gz ...\n",
      "[2020-09-30 11:23:54 query]: SQL Translation:\n",
      "---------------------------------------------\n",
      "  SELECT *\n",
      "     FROM graph_38 AS graph_38_c1\n",
      "     WHERE (kgtk_lqstring_lang_suffix(graph_38_c1.\"node2\") = ?)\n",
      "  PARAS: ['en']\n",
      "---------------------------------------------\n",
      "       15.08 real        18.64 user         1.55 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk query -i $OUT/part.label.tsv.gz --graph-cache $STORE -o $OUT/part.label.en.tsv.gz \\\n",
    "    --match '()-[]->(n2)' \\\n",
    "    --where 'n2.kgtk_lqstring_lang_suffix = \"en\"' "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2020-09-30 11:23:58 sqlstore]: DROP graph data table graph_19 from /Users/pedroszekely/Downloads/kypher/output.all.10/part.alias.tsv.gz\n",
      "[2020-09-30 11:23:58 sqlstore]: IMPORT graph directly into table graph_39 from /Users/pedroszekely/Downloads/kypher/output.all.10/part.alias.tsv.gz ...\n",
      "[2020-09-30 11:23:59 query]: SQL Translation:\n",
      "---------------------------------------------\n",
      "  SELECT *\n",
      "     FROM graph_39 AS graph_39_c1\n",
      "     WHERE (kgtk_lqstring_lang_suffix(graph_39_c1.\"node2\") = ?)\n",
      "  PARAS: ['en']\n",
      "---------------------------------------------\n",
      "        2.66 real         3.06 user         0.29 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk query -i $OUT/part.alias.tsv.gz --graph-cache $STORE -o $OUT/part.alias.en.tsv.gz \\\n",
    "    --match '()-[]->(n2)' \\\n",
    "    --where 'n2.kgtk_lqstring_lang_suffix = \"en\"'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2020-09-30 11:24:01 sqlstore]: DROP graph data table graph_20 from /Users/pedroszekely/Downloads/kypher/output.all.10/part.description.tsv.gz\n",
      "[2020-09-30 11:24:06 sqlstore]: IMPORT graph directly into table graph_40 from /Users/pedroszekely/Downloads/kypher/output.all.10/part.description.tsv.gz ...\n",
      "[2020-09-30 11:24:45 query]: SQL Translation:\n",
      "---------------------------------------------\n",
      "  SELECT *\n",
      "     FROM graph_40 AS graph_40_c1\n",
      "     WHERE (kgtk_lqstring_lang_suffix(graph_40_c1.\"node2\") = ?)\n",
      "  PARAS: ['en']\n",
      "---------------------------------------------\n",
      "       53.85 real        71.20 user         5.32 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk query -i $OUT/part.description.tsv.gz --graph-cache $STORE -o $OUT/part.description.en.tsv.gz \\\n",
    "    --match '()-[]->(n2)' \\\n",
    "    --where 'n2.kgtk_lqstring_lang_suffix = \"en\"' "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's sample these files to see what they look like:\n",
    "\n",
    "* we are getting all variants of English, we really want `en` only\n",
    "* the labels have the language tags, how do we output only the string without the language tag?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "id             node1  label  node2\n",
      "Q45-label-en   Q45    label  'Portugal'@en\n",
      "Q140-label-en  Q140   label  'lion'@en\n",
      "Q183-label-en  Q183   label  'Germany'@en\n",
      "Q317-label-en  Q317   label  'dictatorship'@en\n",
      "Q433-label-en  Q433   label  'Gmina Kurów'@en\n",
      "Q514-label-en  Q514   label  'anatomy'@en\n",
      "Q595-label-en  Q595   label  'The Intouchables'@en\n",
      "Q647-label-en  Q647   label  'Rennes'@en\n",
      "Q716-label-en  Q716   label  'titanium'@en\n",
      "gzcat: error writing to output: Broken pipe\n",
      "gzcat: /Users/pedroszekely/Downloads/kypher/output.all.10/part.label.en.tsv.gz: uncompress failed\n"
     ]
    }
   ],
   "source": [
    "!gzcat $OUT/part.label.en.tsv.gz | head | column -t -s $'\\t' "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Compute the distribution of the number of edges for each Wikidata type"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "       42.94 real        42.38 user         0.81 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk unique --column 'node2;wikidatatype' -i $EDGES / sort2 -c node2 -r | gzip > $OUT/all.wikidatatype.distribution.tsv.gz"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "node1              label  node2\n",
      "time               count  76936\n",
      "wikibase-item      count  729535\n",
      "math               count  70\n",
      "wikibase-form      count  7\n",
      "quantity           count  69823\n",
      "string             count  68283\n",
      "external-id        count  416408\n",
      "commonsMedia       count  36794\n",
      "globe-coordinate   count  26063\n",
      "monolingualtext    count  24131\n",
      "musical-notation   count  2\n",
      "geo-shape          count  183\n",
      "wikibase-property  count  148\n",
      "url                count  12874\n"
     ]
    }
   ],
   "source": [
    "!gzcat $OUT/all.wikidatatype.distribution.tsv.gz | column -t -s $'\\t' "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create a file to contain the edges for each wikidata type"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/part.time.tsv.gz     --match '(n1)-[l]->(n2 {wikidatatype: type})'     --return 'l, n1, l.label, n2'    --where 'type = \"time\"'\n",
      "$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/part.wikibase-item.tsv.gz     --match '(n1)-[l]->(n2 {wikidatatype: type})'     --return 'l, n1, l.label, n2'    --where 'type = \"wikibase-item\"'\n",
      "$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/part.math.tsv.gz     --match '(n1)-[l]->(n2 {wikidatatype: type})'     --return 'l, n1, l.label, n2'    --where 'type = \"math\"'\n",
      "$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/part.wikibase-form.tsv.gz     --match '(n1)-[l]->(n2 {wikidatatype: type})'     --return 'l, n1, l.label, n2'    --where 'type = \"wikibase-form\"'\n",
      "$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/part.quantity.tsv.gz     --match '(n1)-[l]->(n2 {wikidatatype: type})'     --return 'l, n1, l.label, n2'    --where 'type = \"quantity\"'\n",
      "$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/part.string.tsv.gz     --match '(n1)-[l]->(n2 {wikidatatype: type})'     --return 'l, n1, l.label, n2'    --where 'type = \"string\"'\n",
      "$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/part.external-id.tsv.gz     --match '(n1)-[l]->(n2 {wikidatatype: type})'     --return 'l, n1, l.label, n2'    --where 'type = \"external-id\"'\n",
      "$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/part.commonsMedia.tsv.gz     --match '(n1)-[l]->(n2 {wikidatatype: type})'     --return 'l, n1, l.label, n2'    --where 'type = \"commonsMedia\"'\n",
      "$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/part.globe-coordinate.tsv.gz     --match '(n1)-[l]->(n2 {wikidatatype: type})'     --return 'l, n1, l.label, n2'    --where 'type = \"globe-coordinate\"'\n",
      "$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/part.monolingualtext.tsv.gz     --match '(n1)-[l]->(n2 {wikidatatype: type})'     --return 'l, n1, l.label, n2'    --where 'type = \"monolingualtext\"'\n",
      "$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/part.musical-notation.tsv.gz     --match '(n1)-[l]->(n2 {wikidatatype: type})'     --return 'l, n1, l.label, n2'    --where 'type = \"musical-notation\"'\n",
      "$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/part.geo-shape.tsv.gz     --match '(n1)-[l]->(n2 {wikidatatype: type})'     --return 'l, n1, l.label, n2'    --where 'type = \"geo-shape\"'\n",
      "$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/part.wikibase-property.tsv.gz     --match '(n1)-[l]->(n2 {wikidatatype: type})'     --return 'l, n1, l.label, n2'    --where 'type = \"wikibase-property\"'\n",
      "$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/part.url.tsv.gz     --match '(n1)-[l]->(n2 {wikidatatype: type})'     --return 'l, n1, l.label, n2'    --where 'type = \"url\"'\n"
     ]
    }
   ],
   "source": [
    "types = [\n",
    "    \"time\",\n",
    "    \"wikibase-item\",\n",
    "    \"math\",\n",
    "    \"wikibase-form\",\n",
    "    \"quantity\",\n",
    "    \"string\",\n",
    "    \"external-id\",\n",
    "    \"commonsMedia\",\n",
    "    \"globe-coordinate\",\n",
    "    \"monolingualtext\",\n",
    "    \"musical-notation\",\n",
    "    \"geo-shape\",\n",
    "    \"wikibase-property\",\n",
    "    \"url\",\n",
    "]\n",
    "\n",
    "command = \"$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/part.TYPE_FILE.tsv.gz \\\n",
    "    --match '(n1)-[l]->(n2 {wikidatatype: type})' \\\n",
    "    --return 'l, n1, l.label, n2'\\\n",
    "    --where 'type = \\\"TYPE\\\"'\"\n",
    "for type in types:\n",
    "    cmd = command.replace(\"TYPE_FILE\", type)\n",
    "    cmd = cmd.replace(\"TYPE\", type)\n",
    "\n",
    "    print(cmd)\n",
    "    os.system(cmd)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create a file with the sitelinks"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2020-09-30 11:26:37 query]: SQL Translation:\n",
      "---------------------------------------------\n",
      "  SELECT graph_8_c1.\"id\", graph_8_c1.\"node1\", graph_8_c1.\"label\", graph_8_c1.\"node2\"\n",
      "     FROM graph_8 AS graph_8_c1\n",
      "     WHERE graph_8_c1.\"label\"=?\n",
      "  PARAS: ['wikipedia_sitelink']\n",
      "---------------------------------------------\n",
      "        8.27 real         6.88 user         0.71 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/part.wikipedia_sitelink.tsv.gz \\\n",
    "    --match '(n1)-[l:wikipedia_sitelink]->(n2)' \\\n",
    "    --return 'l, n1, l.label, n2' "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create a file that specifies for each node whether it is an item or a property"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2020-09-30 11:26:46 query]: SQL Translation:\n",
      "---------------------------------------------\n",
      "  SELECT graph_8_c1.\"id\", graph_8_c1.\"node1\", graph_8_c1.\"label\", graph_8_c1.\"node2\"\n",
      "     FROM graph_8 AS graph_8_c1\n",
      "     WHERE graph_8_c1.\"label\"=?\n",
      "  PARAS: ['type']\n",
      "---------------------------------------------\n",
      "        3.21 real         2.50 user         0.58 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/part.type.tsv.gz \\\n",
    "    --match '(n1)-[l:type]->(n2)' \\\n",
    "    --return 'l, n1, l.label, n2' "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create the P31 and P279 files"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2020-09-30 11:26:49 query]: SQL Translation:\n",
      "---------------------------------------------\n",
      "  SELECT graph_8_c1.\"id\", graph_8_c1.\"node1\", graph_8_c1.\"label\", graph_8_c1.\"node2\"\n",
      "     FROM graph_8 AS graph_8_c1\n",
      "     WHERE graph_8_c1.\"label\"=?\n",
      "  PARAS: ['P31']\n",
      "---------------------------------------------\n",
      "        2.73 real         2.17 user         0.45 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/all.P31.tsv.gz \\\n",
    "    --match '(n1)-[l:P31]->(n2)' \\\n",
    "    --return 'l, n1, l.label, n2' "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2020-09-30 11:26:52 query]: SQL Translation:\n",
      "---------------------------------------------\n",
      "  SELECT graph_8_c1.\"id\", graph_8_c1.\"node1\", graph_8_c1.\"label\", graph_8_c1.\"node2\"\n",
      "     FROM graph_8 AS graph_8_c1\n",
      "     WHERE graph_8_c1.\"label\"=?\n",
      "  PARAS: ['P279']\n",
      "---------------------------------------------\n",
      "        0.79 real         0.64 user         0.14 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/all.P279.tsv.gz \\\n",
    "    --match '(n1)-[l:P279]->(n2)' \\\n",
    "    --return 'l, n1, l.label, n2' "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "gzcat: error writing to output: Broken pipe\n",
      "gzcat: /Users/pedroszekely/Downloads/kypher/output.all.10/all.P31.tsv.gz: uncompress failed\n",
      "id          node1  label  node2\n",
      "Q45-P31-1   Q45    P31    Q3624078\n",
      "Q45-P31-2   Q45    P31    Q6256\n",
      "Q45-P31-3   Q45    P31    Q20181813\n",
      "Q140-P31-1  Q140   P31    Q16521\n",
      "Q183-P31-1  Q183   P31    Q3624078\n",
      "Q183-P31-2  Q183   P31    Q43702\n",
      "Q183-P31-3  Q183   P31    Q7270\n",
      "Q183-P31-4  Q183   P31    Q619610\n",
      "Q183-P31-5  Q183   P31    Q4209223\n"
     ]
    }
   ],
   "source": [
    "!gzcat $OUT/all.P31.tsv.gz | head | column -t -s $'\\t' "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "        2.57 real         2.37 user         0.14 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk cat -i $OUT/all.P279.tsv.gz -i $OUT/all.P31.tsv.gz -o $OUT/all.P31_P279.tsv.gz "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "id            node1  label  node2\n",
      "Q317-P279-1   Q317   P279   Q173424\n",
      "Q514-P279-1   Q514   P279   Q420\n",
      "Q514-P279-2   Q514   P279   Q11190\n",
      "Q716-P279-1   Q716   P279   Q19588\n",
      "Q716-P279-2   Q716   P279   Q428766\n",
      "Q901-P279-1   Q901   P279   Q1650915\n",
      "Q901-P279-2   Q901   P279   Q20826540\n",
      "Q1071-P279-1  Q1071  P279   Q34749\n",
      "Q1071-P279-2  Q1071  P279   Q8008\n",
      "gzcat: error writing to output: Broken pipe\n",
      "gzcat: /Users/pedroszekely/Downloads/kypher/output.all.10/all.P31_P279.tsv.gz: uncompress failed\n"
     ]
    }
   ],
   "source": [
    "!gzcat $OUT/all.P31_P279.tsv | head | column -t -s $'\\t' "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create the file that contains all nodes reachable via P279 starting from a node2 in P31 or a node1 in P279"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First compute the roots"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2020-09-30 11:26:56 sqlstore]: DROP graph data table graph_21 from /Users/pedroszekely/Downloads/kypher/output.all.10/all.P279.tsv.gz\n",
      "[2020-09-30 11:26:56 sqlstore]: IMPORT graph directly into table graph_41 from /Users/pedroszekely/Downloads/kypher/output.all.10/all.P279.tsv.gz ...\n",
      "[2020-09-30 11:26:56 query]: SQL Translation:\n",
      "---------------------------------------------\n",
      "  SELECT graph_41_c1.\"node1\" \"node\"\n",
      "     FROM graph_41 AS graph_41_c1\n",
      "  PARAS: []\n",
      "---------------------------------------------\n",
      "        0.75 real         0.59 user         0.15 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk query -i $OUT/all.P279.tsv.gz --graph-cache $STORE -o $TEMP/P279.n1.tsv.gz \\\n",
    "    --match '(n1)-[]->()' \\\n",
    "    --return 'n1 as node' "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2020-09-30 11:26:57 sqlstore]: DROP graph data table graph_22 from /Users/pedroszekely/Downloads/kypher/output.all.10/all.P31.tsv.gz\n",
      "[2020-09-30 11:26:57 sqlstore]: IMPORT graph directly into table graph_42 from /Users/pedroszekely/Downloads/kypher/output.all.10/all.P31.tsv.gz ...\n",
      "[2020-09-30 11:26:57 query]: SQL Translation:\n",
      "---------------------------------------------\n",
      "  SELECT graph_42_c1.\"node2\" \"node\"\n",
      "     FROM graph_42 AS graph_42_c1\n",
      "  PARAS: []\n",
      "---------------------------------------------\n",
      "        1.40 real         1.37 user         0.22 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk query -i $OUT/all.P31.tsv.gz --graph-cache $STORE  -o $TEMP/P31.n2.tsv.gz \\\n",
    "    --match '()-[]->(n2)' \\\n",
    "    --return 'n2 as node' "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "        2.34 real         2.95 user         0.44 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk cat --mode NONE -i $TEMP/P31.n2.tsv.gz $TEMP/P279.n1.tsv.gz \\\n",
    "    / compact --mode NONE --columns node \\\n",
    "    > $TEMP/P279.roots.tsv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can invoke the reachable-nodes command"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "        2.19 real         0.86 user         0.21 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk reachable-nodes \\\n",
    "    --rootfile $TEMP/P279.roots.tsv \\\n",
    "    --rootfilecolumn 0 \\\n",
    "    --subj 1 --pred 2 --obj 3 \\\n",
    "    -i $OUT/all.P279.tsv.gz \\\n",
    "    | kgtk sort2 \\\n",
    "    | gzip > $TEMP/P279.reachable.tsv.gz"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The reachable-nodes command produces edges labeled `reachable`, so we need one command to rename them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2020-09-30 11:27:03 sqlstore]: DROP graph data table graph_23 from /Users/pedroszekely/Downloads/kypher/temp.all.10/P279.reachable.tsv.gz\n",
      "[2020-09-30 11:27:03 sqlstore]: IMPORT graph directly into table graph_43 from /Users/pedroszekely/Downloads/kypher/temp.all.10/P279.reachable.tsv.gz ...\n",
      "[2020-09-30 11:27:03 query]: SQL Translation:\n",
      "---------------------------------------------\n",
      "  SELECT graph_43_c1.\"node1\", ? \"label\", graph_43_c1.\"node2\" \"node2\"\n",
      "     FROM graph_43 AS graph_43_c1\n",
      "  PARAS: ['P279star']\n",
      "---------------------------------------------\n",
      "        0.84 real         0.66 user         0.16 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk query -i $TEMP/P279.reachable.tsv.gz --graph-cache $STORE  -o $TEMP/P279star.1.tsv.gz \\\n",
    "    --match '(n1)-[]->(n2)' \\\n",
    "    --return 'n1, \"P279star\" as label, n2 as node2' "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We also want `P279star` to be relflexive, ie, contain `(n1)-[:P279star]->(n1)` for all node1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2020-09-30 11:27:04 query]: SQL Translation:\n",
      "---------------------------------------------\n",
      "  SELECT graph_43_c1.\"node1\" \"node1\", ? \"label\", graph_43_c1.\"node1\" \"node2\"\n",
      "     FROM graph_43 AS graph_43_c1\n",
      "  PARAS: ['P279star']\n",
      "---------------------------------------------\n",
      "        0.81 real         0.65 user         0.14 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk query -i $TEMP/P279.reachable.tsv.gz --graph-cache $STORE  -o $TEMP/P279star.2.tsv.gz \\\n",
    "    --match '(n1)-[]->(n2)' \\\n",
    "    --return 'n1 as node1, \"P279star\" as label, n1 as node2' "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2020-09-30 11:27:05 query]: SQL Translation:\n",
      "---------------------------------------------\n",
      "  SELECT graph_43_c1.\"node2\" \"node1\", ? \"label\", graph_43_c1.\"node2\" \"node2\"\n",
      "     FROM graph_43 AS graph_43_c1\n",
      "  PARAS: ['P279star']\n",
      "---------------------------------------------\n",
      "        0.79 real         0.64 user         0.13 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk query -i $TEMP/P279.reachable.tsv.gz --graph-cache $STORE  -o $TEMP/P279star.3.tsv.gz \\\n",
    "    --match '(n1)-[]->(n2)' \\\n",
    "    --return 'n2 as node1, \"P279star\" as label, n2 as node2' "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2020-09-30 11:27:06 query]: SQL Translation:\n",
      "---------------------------------------------\n",
      "  SELECT graph_42_c1.\"node2\" \"node1\", ? \"label\", graph_42_c1.\"node2\" \"node2\"\n",
      "     FROM graph_42 AS graph_42_c1\n",
      "  PARAS: ['P279star']\n",
      "---------------------------------------------\n",
      "        1.30 real         1.11 user         0.16 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk query -i $OUT/all.P31.tsv.gz --graph-cache $STORE  -o $TEMP/P279star.4.tsv.gz \\\n",
    "    --match '(n1)-[]->(n2)' \\\n",
    "    --return 'n2 as node1, \"P279star\" as label, n2 as node2' "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can concatenate these files to produce the final output"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "        1.78 real         1.50 user         0.18 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk cat --mode NONE -i $TEMP/P279star.1.tsv.gz $TEMP/P279star.2.tsv.gz $TEMP/P279star.3.tsv.gz $TEMP/P279star.4.tsv.gz \\\n",
    "    | kgtk compact \\\n",
    "    | kgtk sort2 \\\n",
    "    | kgtk add-id --id-style node1-label-node2-num \\\n",
    "    | gzip > $OUT/all.P279star.tsv.gz"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is difficult to test with our Wikidata subset because our hierarchy is very sparse.\n",
    "\n",
    "This is how we would do the typical `?item P31/P279* ?class` in Kypher. \n",
    "The example shows how to get all the `n1` that are instances of subclasses of beer (q44)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2020-09-30 11:27:10 sqlstore]: DROP graph data table graph_24 from /Users/pedroszekely/Downloads/kypher/output.all.10/all.P279star.tsv.gz\n",
      "[2020-09-30 11:27:10 sqlstore]: IMPORT graph directly into table graph_44 from /Users/pedroszekely/Downloads/kypher/output.all.10/all.P279star.tsv.gz ...\n",
      "[2020-09-30 11:27:10 query]: SQL Translation:\n",
      "---------------------------------------------\n",
      "  SELECT count(graph_42_c1.\"node1\") \"count\"\n",
      "     FROM graph_42 AS graph_42_c1, graph_44 AS graph_44_c2\n",
      "     WHERE graph_42_c1.\"label\"=?\n",
      "     AND graph_44_c2.\"node2\"=?\n",
      "     AND graph_42_c1.\"node2\"=graph_44_c2.\"node1\"\n",
      "  PARAS: ['P31', 'Q44']\n",
      "---------------------------------------------\n",
      "[2020-09-30 11:27:10 sqlstore]: CREATE INDEX on table graph_42 column node2 ...\n",
      "[2020-09-30 11:27:11 sqlstore]: ANALYZE INDEX on table graph_42 column node2 ...\n",
      "[2020-09-30 11:27:11 sqlstore]: CREATE INDEX on table graph_44 column node2 ...\n",
      "[2020-09-30 11:27:11 sqlstore]: ANALYZE INDEX on table graph_44 column node2 ...\n",
      "[2020-09-30 11:27:11 sqlstore]: CREATE INDEX on table graph_44 column node1 ...\n",
      "[2020-09-30 11:27:11 sqlstore]: ANALYZE INDEX on table graph_44 column node1 ...\n",
      "[2020-09-30 11:27:11 sqlstore]: CREATE INDEX on table graph_42 column label ...\n",
      "[2020-09-30 11:27:11 sqlstore]: ANALYZE INDEX on table graph_42 column label ...\n",
      "count\n",
      "1\n",
      "        1.13 real         0.94 user         0.21 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk query -i $OUT/all.P31.tsv.gz -i $OUT/all.P279star.tsv.gz --graph-cache $STORE  -o - \\\n",
    "    --match 'P31: (n1)-[:P31]->(c), P279star: (c)-[]->(:Q44)' \\\n",
    "    --return 'count(n1) as count'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create a file to do generalized Is-A queries\n",
    "The idea is that `(n1)-[:isa]->(n2)` when `(n1)-[:P31]->(n2)` or `(n1)-[:P279]->(n2)`\n",
    "\n",
    "We do this by concatenating the files and renaming the relation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "        1.43 real         1.26 user         0.14 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk cat -i $OUT/all.P31.tsv.gz $OUT/all.P279.tsv.gz \\\n",
    "    | gzip > $TEMP/isa.1.tsv.gz"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2020-09-30 11:27:13 sqlstore]: DROP graph data table graph_25 from /Users/pedroszekely/Downloads/kypher/temp.all.10/isa.1.tsv.gz\n",
      "[2020-09-30 11:27:13 sqlstore]: IMPORT graph directly into table graph_45 from /Users/pedroszekely/Downloads/kypher/temp.all.10/isa.1.tsv.gz ...\n",
      "[2020-09-30 11:27:14 query]: SQL Translation:\n",
      "---------------------------------------------\n",
      "  SELECT graph_45_c1.\"node1\", ? \"label\", graph_45_c1.\"node2\"\n",
      "     FROM graph_45 AS graph_45_c1\n",
      "  PARAS: ['isa']\n",
      "---------------------------------------------\n",
      "        2.10 real         1.91 user         0.25 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk query -i $TEMP/isa.1.tsv.gz --graph-cache $STORE  -o $OUT/all.isa.tsv.gz \\\n",
    "    --match '(n1)-[]->(n2)' \\\n",
    "    --return 'n1, \"isa\" as label, n2' "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Example of how to use the `isa` relation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2020-09-30 11:27:15 sqlstore]: DROP graph data table graph_28 from /Users/pedroszekely/Downloads/kypher/output.all.10/all.isa.tsv.gz\n",
      "[2020-09-30 11:27:15 sqlstore]: IMPORT graph directly into table graph_28 from /Users/pedroszekely/Downloads/kypher/output.all.10/all.isa.tsv.gz ...\n",
      "[2020-09-30 11:27:16 query]: SQL Translation:\n",
      "---------------------------------------------\n",
      "  SELECT DISTINCT graph_28_c1.\"node1\", graph_28_c1.\"label\", ? \"node2\"\n",
      "     FROM graph_28 AS graph_28_c1, graph_44 AS graph_44_c2\n",
      "     WHERE graph_28_c1.\"label\"=?\n",
      "     AND graph_44_c2.\"node2\"=?\n",
      "     AND graph_28_c1.\"node2\"=graph_44_c2.\"node1\"\n",
      "     LIMIT ?\n",
      "  PARAS: ['Q44', 'isa', 'Q44', 10]\n",
      "---------------------------------------------\n",
      "[2020-09-30 11:27:16 sqlstore]: CREATE INDEX on table graph_28 column label ...\n",
      "[2020-09-30 11:27:16 sqlstore]: ANALYZE INDEX on table graph_28 column label ...\n",
      "[2020-09-30 11:27:16 sqlstore]: CREATE INDEX on table graph_28 column node2 ...\n",
      "[2020-09-30 11:27:16 sqlstore]: ANALYZE INDEX on table graph_28 column node2 ...\n",
      "node1\tlabel\tnode2\n",
      "Q2579953\tisa\tQ44\n",
      "Q4488344\tisa\tQ44\n",
      "Q7587890\tisa\tQ44\n",
      "Q10313616\tisa\tQ44\n",
      "        1.27 real         1.16 user         0.25 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk query -i $OUT/all.isa.tsv.gz -i $OUT/all.P279star.tsv.gz --graph-cache $STORE  -o - \\\n",
    "    --match 'isa: (n1)-[l:isa]->(c), P279star: (c)-[]->(:Q44)' \\\n",
    "    --return 'distinct n1, l.label, \"Q44\" as node2' \\\n",
    "    --limit 10"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Creating a subset of Wikidata without scholarly articles (Q13442814)\n",
    "First create a file with the schloarly articles"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2020-09-30 11:27:17 query]: SQL Translation:\n",
      "---------------------------------------------\n",
      "  SELECT DISTINCT graph_28_c1.\"node1\", graph_28_c1.\"label\", graph_28_c1.\"node2\"\n",
      "     FROM graph_28 AS graph_28_c1\n",
      "     WHERE graph_28_c1.\"label\"=?\n",
      "     AND graph_28_c1.\"node2\"=?\n",
      "  PARAS: ['isa', 'Q13442814']\n",
      "---------------------------------------------\n",
      "        0.84 real         0.67 user         0.15 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk query -i $OUT/all.isa.tsv.gz -i $OUT/all.P279star.tsv.gz --graph-cache $STORE  -o $OUT/all.isa.Q13442814.tsv.gz \\\n",
    "    --match 'isa: (n1)-[l:isa]->(n2:Q13442814)' \\\n",
    "    --return 'distinct n1, l.label, n2'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we need to remove from `$EDGES` any edge where node1 or node2 is in node1 of `$OUT/all.isa.Q13442814.tsv`. The result will be `$OUT/minus.Q13442814.tsv`. We can then run the whole notebook with this new file as $EDGES and compute all the product files in a new output directory"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "node1     label  node2\n",
      "Q1801903  isa    Q13442814\n"
     ]
    }
   ],
   "source": [
    "!gzcat $OUT/all.isa.Q13442814.tsv | head | column -t -s $'\\t' "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    4882 /Users/pedroszekely/Downloads/kypher/output.all.10/all-distribution.tsv\n",
      "     143 /Users/pedroszekely/Downloads/kypher/output.all.10/all.P279.tsv.gz\n",
      "     793 /Users/pedroszekely/Downloads/kypher/output.all.10/all.P279star.tsv.gz\n",
      "    5348 /Users/pedroszekely/Downloads/kypher/output.all.10/all.P31.tsv.gz\n",
      "    5512 /Users/pedroszekely/Downloads/kypher/output.all.10/all.P31_P279.tsv.gz\n",
      "       0 /Users/pedroszekely/Downloads/kypher/output.all.10/all.isa.Q13442814.tsv.gz\n",
      "    2111 /Users/pedroszekely/Downloads/kypher/output.all.10/all.isa.tsv.gz\n",
      "       3 /Users/pedroszekely/Downloads/kypher/output.all.10/all.wikidatatype.distribution.tsv.gz\n",
      "    1814 /Users/pedroszekely/Downloads/kypher/output.all.10/part.alias.en.tsv.gz\n",
      "   13162 /Users/pedroszekely/Downloads/kypher/output.all.10/part.alias.tsv.gz\n",
      "    3125 /Users/pedroszekely/Downloads/kypher/output.all.10/part.commonsMedia.tsv.gz\n",
      "    6171 /Users/pedroszekely/Downloads/kypher/output.all.10/part.description.en.tsv.gz\n",
      "  227442 /Users/pedroszekely/Downloads/kypher/output.all.10/part.description.tsv.gz\n",
      "   22109 /Users/pedroszekely/Downloads/kypher/output.all.10/part.external-id.tsv.gz\n",
      "      16 /Users/pedroszekely/Downloads/kypher/output.all.10/part.geo-shape.tsv.gz\n",
      "    1495 /Users/pedroszekely/Downloads/kypher/output.all.10/part.globe-coordinate.tsv.gz\n",
      "    7336 /Users/pedroszekely/Downloads/kypher/output.all.10/part.label.en.tsv.gz\n",
      "   80294 /Users/pedroszekely/Downloads/kypher/output.all.10/part.label.tsv.gz\n",
      "       9 /Users/pedroszekely/Downloads/kypher/output.all.10/part.math.tsv.gz\n",
      "    1880 /Users/pedroszekely/Downloads/kypher/output.all.10/part.monolingualtext.tsv.gz\n",
      "       1 /Users/pedroszekely/Downloads/kypher/output.all.10/part.musical-notation.tsv.gz\n",
      "    2624 /Users/pedroszekely/Downloads/kypher/output.all.10/part.quantity.tsv.gz\n",
      "    3903 /Users/pedroszekely/Downloads/kypher/output.all.10/part.string.tsv.gz\n",
      "    2415 /Users/pedroszekely/Downloads/kypher/output.all.10/part.time.tsv.gz\n",
      "    6435 /Users/pedroszekely/Downloads/kypher/output.all.10/part.type.tsv.gz\n",
      "     943 /Users/pedroszekely/Downloads/kypher/output.all.10/part.url.tsv.gz\n",
      "       2 /Users/pedroszekely/Downloads/kypher/output.all.10/part.wikibase-form.tsv.gz\n",
      "   28782 /Users/pedroszekely/Downloads/kypher/output.all.10/part.wikibase-item.tsv.gz\n",
      "       8 /Users/pedroszekely/Downloads/kypher/output.all.10/part.wikibase-property.tsv.gz\n",
      "   40876 /Users/pedroszekely/Downloads/kypher/output.all.10/part.wikipedia_sitelink.tsv.gz\n",
      "  547657 /Users/pedroszekely/Downloads/kypher/all.10.tsv.gz\n",
      " 1017291 total\n"
     ]
    }
   ],
   "source": [
    "!wc -l $OUT/*.tsv $OUT/*.tsv.gz $EDGES"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Number of distinct items in our dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2020-09-30 11:27:18 query]: SQL Translation:\n",
      "---------------------------------------------\n",
      "  SELECT count(DISTINCT graph_8_c1.\"node1\") \"count\"\n",
      "     FROM graph_8 AS graph_8_c1\n",
      "  PARAS: []\n",
      "---------------------------------------------\n",
      "count\n",
      "156559\n",
      "        5.91 real         4.82 user         1.00 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk query -i $EDGES --graph-cache $STORE  -o - \\\n",
    "    --match '(n1)-[]->()' \\\n",
    "    --return 'count(distinct n1) as count'"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Tags",
  "kernelspec": {
   "display_name": "kgtk",
   "language": "python",
   "name": "kgtk"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}