{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Generating Useful Wikidata Files"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Batch Invocation\n",
    "Example batch command. The second argument is a notebook where the output will be stored. You can load it to see progress.\n",
    "\n",
    "```\n",
    "papermill Example7\\ -\\ Wikidata\\ Outputs.ipynb example7.out.ipynb \\\n",
    "-p home /Users/pedroszekely/Downloads/kypher \\\n",
    "-p wiki_file all.10.tsv.gz \\\n",
    "-p output_folder output.all.10 \\\n",
    "-p temp_folder temp.all.10 \\\n",
    "-p delete_database true \n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {
    "tags": [
     "parameters"
    ]
   },
   "outputs": [],
   "source": [
    "# Parameters\n",
    "home = \"/Users/pedroszekely/Downloads/kypher\"\n",
    "wiki_file = \"/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200803-v3/all.tsv.gz\"\n",
    "wiki_file = \"/Volumes/GoogleDrive/Shared drives/KGTK-public-graphs/wikidata-20200803-v3/all.tsv.gz\"\n",
    "label_file = \"/Volumes/GoogleDrive/Shared drives/KGTK-public-graphs/wikidata-20200803-v3/part.label.en.tsv.gz\"\n",
    "output_path = \"/Users/pedroszekely/Downloads/kypher\"\n",
    "cache_path = \"/Users/pedroszekely/Downloads/kypher/temp.useful_wikidata_files_v3\"\n",
    "output_folder = \"useful_wikidata_files_v3\"\n",
    "temp_folder = \"temp.useful_wikidata_files_v3\"\n",
    "delete_database = \"no\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import io\n",
    "import os\n",
    "import subprocess\n",
    "import sys\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# from IPython.display import display, HTML, Image\n",
    "# from pandas_profiling import ProfileReport"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Set up environment and folders to store the files\n",
    "\n",
    "- `OUT` folder where the output files go\n",
    "- `TEMP` folder to keep temporary files , including the database\n",
    "- `kgtk` shortcut to invoke the kgtk software\n",
    "\n",
    "The current implementation of some of the kgtk commands does not understand compressed files. In particular, `query` often rejects `gz` files.\n",
    "\n",
    "To dos:\n",
    "\n",
    "- Make sure that all files have id columns as `query` gets unhappy when files have no ids.\n",
    "- Create an output folder for a subset of Wikidata without scholarly articles. This is half done: the remaining work is to subtract the scholarly articles from `EDGES` and repeat the workflow.\n",
    "- Change the naming convention to make it clear which files are a partition of the original `EDGES`, so users know what files they need to get to have a full version.\n",
    "- Create a qualifier file for the partition files of Wikidata: this is so that if a user gets one of the partitions, they can get the corresponding qualifier file.\n",
    "- Add pagerank and other stats. We can compute the pagerank from the `all.item` file, so maybe should be called `all.item.pagerank.tsv`\n",
    "\n",
    "Naming convention: the name `all` is redundant, we should consider removing it. I recomment using the prefix `part.` to name the partition of Wikidata, e.g., `part.label`, `part.quantity`. Files such as `P279` are not partitions as it is a subset of `part.item`.\n",
    "\n",
    "If we create a subset of Wikidata, e.g., no scholarly articles, we could call it `minus.Q13442814`; if we remove galaxies too, we could call it `minus.Q13442814-Q318`, so the files would be `minus.Q13442814-Q318.part.quantity.tsv` (the idea of `all` is in contrast to `minus`). We can also have files that start with Qnodes, e.g, `Q5.part.quantity.tsv`; constructing such files is harder as we don't want dangling nodes in the item file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [],
   "source": [
    "os.environ['OUT'] = \"{}/{}\".format(output_path, output_folder)\n",
    "os.environ['TEMP'] = \"{}/{}\".format(output_path, temp_folder)\n",
    "os.environ['kgtk'] = \"kgtk\"\n",
    "os.environ['kgtk'] = \"time kgtk --debug\"\n",
    "os.environ['EDGES'] = wiki_file\n",
    "os.environ['LABELS'] = label_file\n",
    "if cache_path:\n",
    "    os.environ['STORE'] = \"{}/wikidata.sqlite3.db\".format(cache_path)\n",
    "else:\n",
    "    os.environ['STORE'] = \"{}/{}/wikidata.sqlite3.db\".format(output_path, temp_folder)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/Users/pedroszekely/Downloads/kypher/useful_wikidata_files_v3\n",
      "/Users/pedroszekely/Downloads/kypher/temp.useful_wikidata_files_v3\n",
      "time kgtk --debug\n",
      "/Volumes/GoogleDrive/Shared drives/KGTK-public-graphs/wikidata-20200803-v3/all.tsv.gz\n",
      "/Volumes/GoogleDrive/Shared drives/KGTK-public-graphs/wikidata-20200803-v3/part.label.en.tsv.gz\n",
      "/Users/pedroszekely/Downloads/kypher/temp.useful_wikidata_files_v3/wikidata.sqlite3.db\n"
     ]
    }
   ],
   "source": [
    "!echo $OUT\n",
    "!echo $TEMP\n",
    "!echo $kgtk\n",
    "!echo $EDGES\n",
    "!echo $LABELS\n",
    "!echo $STORE"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/Users/pedroszekely/Downloads/kypher\n"
     ]
    }
   ],
   "source": [
    "cd $output_path"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "mkdir: /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_v3: File exists\n",
      "mkdir: /Users/pedroszekely/Downloads/kypher/temp.useful_wikidata_files_v3: File exists\n"
     ]
    }
   ],
   "source": [
    "!mkdir $OUT\n",
    "!mkdir $TEMP"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Clean up the output and temp folders before we start"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "# !rm $OUT/*.tsv $OUT/*.tsv.gz\n",
    "# !rm $TEMP/*.tsv $TEMP/*.tsv.gz"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "if delete_database and delete_database != \"no\":\n",
    "    print(\"Deleted database\")\n",
    "    !rm $STORE"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Uncomment the line below to remove the sqllite2 database. It takes a long time to load all the data and create indices, so don't remove the database unless you change files that have already been loaded and you need to force a reload."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Get a sample and force importing the edge file into the database"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2020-11-04 21:29:51 query]: SQL Translation:\n",
      "---------------------------------------------\n",
      "  SELECT *\n",
      "     FROM graph_1 AS graph_1_c1\n",
      "     LIMIT ?\n",
      "  PARAS: [10]\n",
      "---------------------------------------------\n",
      "id\tnode1\tlabel\tnode2\trank\tnode2;wikidatatype\n",
      "Q1-P1036-418bc4-78f5a565-0\tQ1\tP1036\t\"113\"\tnormal\texternal-id\n",
      "Q1-P1036-b98c08-1dc98be9-0\tQ1\tP1036\t\"523.1\"\tnormal\texternal-id\n",
      "Q1-P1051-d70eb1-60991f20-0\tQ1\tP1051\t\"517\"\tnormal\texternal-id\n",
      "Q1-P1245-ee25a9-46be09ed-0\tQ1\tP1245\t\"8506\"\tnormal\texternal-id\n",
      "Q1-P1256-8da0ce-af30f4e9-0\tQ1\tP1256\t\"51A11\"\tnormal\texternal-id\n",
      "Q1-P1296-f73b4e-4d0c1e5d-0\tQ1\tP1296\t\"0216407\"\tnormal\texternal-id\n",
      "Q1-P1343-Q19190511-ab132b87-0\tQ1\tP1343\tQ19190511\tnormal\twikibase-item\n",
      "Q1-P1343-Q2041543-4ed8a129-0\tQ1\tP1343\tQ2041543\tnormal\twikibase-item\n",
      "Q1-P1343-Q602358-12bf99e2-0\tQ1\tP1343\tQ602358\tnormal\twikibase-item\n",
      "Q1-P1343-Q88672152-5080b9e2-0\tQ1\tP1343\tQ88672152\tnormal\twikibase-item\n",
      "        0.97 real         0.57 user         0.12 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk query -i \"$EDGES\" --limit 10 --graph-cache $STORE"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Force creation of the index on the label column"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2020-11-04 00:23:41 query]: SQL Translation:\n",
      "---------------------------------------------\n",
      "  SELECT *\n",
      "     FROM graph_1 AS graph_1_c1\n",
      "     WHERE graph_1_c1.\"label\"=?\n",
      "     LIMIT ?\n",
      "  PARAS: ['P31', 5]\n",
      "---------------------------------------------\n",
      "[2020-11-04 00:23:41 sqlstore]: CREATE INDEX on table graph_1 column label ...\n",
      "[2020-11-04 00:55:13 sqlstore]: ANALYZE INDEX on table graph_1 column label ...\n",
      "id\tnode1\tlabel\tnode2\trank\tnode2;wikidatatype\n",
      "Q1-P31-Q36906466-ae5f57ac-0\tQ1\tP31\tQ36906466\tnormal\twikibase-item\n",
      "Q100-P31-Q1093829-d2e8fc52-0\tQ100\tP31\tQ1093829\tnormal\twikibase-item\n",
      "Q100-P31-Q1549591-56524ffd-0\tQ100\tP31\tQ1549591\tnormal\twikibase-item\n",
      "Q100-P31-Q21518270-a4c0505d-0\tQ100\tP31\tQ21518270\tpreferred\twikibase-item\n",
      "Q1000-P31-Q179023-e86c0605-0\tQ1000\tP31\tQ179023\tnormal\twikibase-item\n",
      "     2014.94 real      1094.60 user       460.64 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk query -i \"$EDGES\" --graph-cache $STORE -o - \\\n",
    "    --match '(i)-[:P31]->(c)' \\\n",
    "    --limit 5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Force creation of the index on the node2 column"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2020-11-04 14:23:52 query]: SQL Translation:\n",
      "---------------------------------------------\n",
      "  SELECT *\n",
      "     FROM graph_1 AS graph_1_c1\n",
      "     WHERE graph_1_c1.\"node2\"=?\n",
      "     LIMIT ?\n",
      "  PARAS: ['Q5', 5]\n",
      "---------------------------------------------\n",
      "id\tnode1\tlabel\tnode2\trank\tnode2;wikidatatype\n",
      "Q10000001-P31-Q5-cc1c4199-0\tQ10000001\tP31\tQ5\tnormal\twikibase-item\n",
      "Q1000002-P31-Q5-5c9914ea-0\tQ1000002\tP31\tQ5\tnormal\twikibase-item\n",
      "Q1000005-P31-Q5-4d5e2a2b-0\tQ1000005\tP31\tQ5\tnormal\twikibase-item\n",
      "Q1000006-P31-Q5-38728290-0\tQ1000006\tP31\tQ5\tnormal\twikibase-item\n",
      "Q1000015-P31-Q5-81519a6c-0\tQ1000015\tP31\tQ5\tnormal\twikibase-item\n",
      "        0.91 real         0.59 user         0.17 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk query -i \"$EDGES\" --graph-cache $STORE -o - \\\n",
    "    --match '(i)-[r]->(:Q5)' \\\n",
    "    --limit 5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Count the number of edges"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2020-11-04 07:08:13 query]: SQL Translation:\n",
      "---------------------------------------------\n",
      "  SELECT count(graph_1_c1.\"id\") \"count\"\n",
      "     FROM graph_1 AS graph_1_c1\n",
      "     LIMIT ?\n",
      "  PARAS: [10]\n",
      "---------------------------------------------\n",
      "count\n",
      "1102827643\n",
      "      632.62 real        97.92 user       153.08 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk query -i \"$EDGES\" --graph-cache $STORE \\\n",
    "    --match 'all: ()-[r]->()' \\\n",
    "    --return 'count(r) as count' \\\n",
    "    --limit 10"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create a file that specifies for each node whether it is an item or a property"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "!$kgtk query -i \"$EDGES\" --graph-cache $STORE -o $OUT/part.type.tsv.gz \\\n",
    "    --match '(n1)-[l:type]->(n2)' \\\n",
    "    --return 'l, n1, l.label, n2' "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create the P31 and P279 files"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2020-11-04 07:18:46 query]: SQL Translation:\n",
      "---------------------------------------------\n",
      "  SELECT graph_1_c1.\"id\", graph_1_c1.\"node1\", graph_1_c1.\"label\", graph_1_c1.\"node2\"\n",
      "     FROM graph_1 AS graph_1_c1\n",
      "     WHERE graph_1_c1.\"label\"=?\n",
      "  PARAS: ['P31']\n",
      "---------------------------------------------\n",
      "     1676.81 real       940.66 user       250.77 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk query -i \"$EDGES\" --graph-cache $STORE -o $OUT/all.P31.tsv.gz \\\n",
    "    --match '(n1)-[l:P31]->(n2)' \\\n",
    "    --return 'l, n1, l.label, n2' "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2020-11-04 07:46:43 query]: SQL Translation:\n",
      "---------------------------------------------\n",
      "  SELECT graph_1_c1.\"id\", graph_1_c1.\"node1\", graph_1_c1.\"label\", graph_1_c1.\"node2\"\n",
      "     FROM graph_1 AS graph_1_c1\n",
      "     WHERE graph_1_c1.\"label\"=?\n",
      "  PARAS: ['P279']\n",
      "---------------------------------------------\n",
      "      106.45 real        38.45 user        18.43 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk query -i \"$EDGES\" --graph-cache $STORE -o $OUT/all.P279.tsv.gz \\\n",
    "    --match '(n1)-[l:P279]->(n2)' \\\n",
    "    --return 'l, n1, l.label, n2' "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "gzcat: error writing to output: Broken pipe\n",
      "gzcat: /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_v3/all.P31.tsv.gz: uncompress failed\n",
      "id                               node1    label  node2\n",
      "Q1-P31-Q36906466-ae5f57ac-0      Q1       P31    Q36906466\n",
      "Q100-P31-Q1093829-d2e8fc52-0     Q100     P31    Q1093829\n",
      "Q100-P31-Q1549591-56524ffd-0     Q100     P31    Q1549591\n",
      "Q100-P31-Q21518270-a4c0505d-0    Q100     P31    Q21518270\n",
      "Q1000-P31-Q179023-e86c0605-0     Q1000    P31    Q179023\n",
      "Q1000-P31-Q3624078-3aade447-0    Q1000    P31    Q3624078\n",
      "Q1000-P31-Q6256-bbaa8a03-0       Q1000    P31    Q6256\n",
      "Q10000-P31-Q10876391-e428a782-0  Q10000   P31    Q10876391\n",
      "Q100000-P31-Q1852859-3665587f-0  Q100000  P31    Q1852859\n"
     ]
    }
   ],
   "source": [
    "!gzcat $OUT/all.P31.tsv.gz | head | column -t -s $'\\t' "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     1023.85 real      1008.10 user         7.25 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk cat -i $OUT/all.P279.tsv.gz -i $OUT/all.P31.tsv.gz -o $OUT/all.P31_P279.tsv.gz "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "gzcat: id                                  node1     label  node2\n",
      "error writing to outputQ1000032-P279-Q1813494-0aa0f1dc-0   Q1000032  P279   Q1813494\n",
      ": Q1000032-P279-Q83602-482a1943-0     Q1000032  P279   Q83602\n",
      "Broken pipe\n",
      "Q1000039-P279-Q11555767-2dddfd86-0  Q1000039  P279   Q11555767\n",
      "Q1000064-P279-Q11016-0ab23344-0     Q1000064  P279   Q11016\n",
      "Q1000084-P279-Q159810-7852671b-0    Q1000084  P279   Q159810\n",
      "Q1000108-P279-Q849640-02ec6f84-0    Q1000108  P279   Q849640\n",
      "Q1000116-P279-Q179692-f8d04cc2-0    Q1000116  P279   Q179692\n",
      "Q1000120-P279-Q12510-3e6da659-0     Q1000120  P279   Q12510\n",
      "gzcat: Q1000120-P279-Q181296-f30f7109-0    Q1000120  P279   Q181296\n",
      "/Users/pedroszekely/Downloads/kypher/useful_wikidata_files_v3/all.P31_P279.tsv.gz: uncompress failed\n"
     ]
    }
   ],
   "source": [
    "!gzcat $OUT/all.P31_P279.tsv | head | column -t -s $'\\t' "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create the file that contains all nodes reachable via P279 starting from a node2 in P31 or a node1 in P279"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First compute the roots"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2020-11-04 20:51:51 query]: SQL Translation:\n",
      "---------------------------------------------\n",
      "  SELECT graph_2_c1.\"node1\" \"id\"\n",
      "     FROM graph_2 AS graph_2_c1\n",
      "  PARAS: []\n",
      "---------------------------------------------\n",
      "       16.80 real        15.43 user         0.57 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk query -i $OUT/all.P279.tsv.gz --graph-cache $STORE -o $TEMP/P279.n1.tsv.gz \\\n",
    "    --match '(n1)-[l]->()' \\\n",
    "    --return 'n1 as id' "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2020-11-04 20:52:07 query]: SQL Translation:\n",
      "---------------------------------------------\n",
      "  SELECT graph_3_c1.\"node2\" \"id\"\n",
      "     FROM graph_3 AS graph_3_c1\n",
      "  PARAS: []\n",
      "---------------------------------------------\n",
      "      175.55 real       154.56 user        11.03 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk query -i $OUT/all.P31.tsv.gz --graph-cache $STORE  -o $TEMP/P31.n2.tsv.gz \\\n",
    "    --match '()-[l]->(n2)' \\\n",
    "    --return 'n2 as id' "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "      348.65 real       346.10 user         1.47 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk cat --mode NONE -i $TEMP/P31.n2.tsv.gz $TEMP/P279.n1.tsv.gz \\\n",
    "    | gzip > $TEMP/P279.roots.1.tsv.gz"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "       70.85 real        56.25 user        12.59 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk sort2 --mode NONE --column id -i $TEMP/P279.roots.1.tsv.gz \\\n",
    "    | gzip > $TEMP/P279.roots.2.tsv.gz"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "id\n",
      "Q1\n",
      "Q1\n",
      "Q1000032\n",
      "Q1000032\n",
      "Q1000039\n",
      "Q1000064\n",
      "Q1000084\n",
      "Q1000108\n",
      "Q1000116\n",
      "zcat: error writing to output: Broken pipe\n"
     ]
    }
   ],
   "source": [
    "!zcat < $TEMP/P279.roots.2.tsv.gz | head"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "      382.41 real       380.34 user         1.00 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk compact -i $TEMP/P279.roots.2.tsv.gz --mode NONE \\\n",
    "    --presorted \\\n",
    "    --columns id \\\n",
    "> $TEMP/P279.roots.tsv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can invoke the reachable-nodes command"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     5839.30 real      3373.77 user      2374.71 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk reachable-nodes \\\n",
    "    --rootfile $TEMP/P279.roots.tsv \\\n",
    "    --selflink \\\n",
    "    -i $OUT/all.P279.tsv.gz \\\n",
    "    | gzip > $TEMP/P279.reachable.tsv.gz"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "node1\tlabel\tnode2\n",
      "Q1000032\treachable\tQ1000032\n",
      "Q1000032\treachable\tQ1813494\n",
      "Q1000032\treachable\tQ1799072\n",
      "Q1000032\treachable\tQ16686448\n",
      "Q1000032\treachable\tQ35120\n",
      "Q1000032\treachable\tnovalue\n",
      "Q1000032\treachable\tQ2695280\n",
      "Q1000032\treachable\tQ1914636\n",
      "Q1000032\treachable\tQ20937557\n",
      "zcat: error writing to output: Broken pipe\n"
     ]
    }
   ],
   "source": [
    "!zcat < $TEMP/P279.reachable.tsv.gz | head"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The reachable-nodes command produces edges labeled `reachable`, so we need one command to rename them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2020-11-05 09:25:49 sqlstore]: DROP graph data table graph_4 from /Users/pedroszekely/Downloads/kypher/temp.useful_wikidata_files_v3/P279.reachable.tsv.gz\n",
      "[2020-11-05 09:25:49 sqlstore]: IMPORT graph directly into table graph_4 from /Users/pedroszekely/Downloads/kypher/temp.useful_wikidata_files_v3/P279.reachable.tsv.gz ...\n",
      "[2020-11-05 09:29:12 query]: SQL Translation:\n",
      "---------------------------------------------\n",
      "  SELECT graph_4_c1.\"node1\", ? \"label\", graph_4_c1.\"node2\" \"node2\"\n",
      "     FROM graph_4 AS graph_4_c1\n",
      "  PARAS: ['P279star']\n",
      "---------------------------------------------\n",
      "      847.92 real       955.62 user        28.23 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk query -i $TEMP/P279.reachable.tsv.gz --graph-cache $STORE  -o $TEMP/P279star.1.tsv.gz \\\n",
    "    --match '(n1)-[]->(n2)' \\\n",
    "    --return 'n1, \"P279star\" as label, n2 as node2' "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can concatenate these files to produce the final output"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "      284.59 real       242.05 user        65.20 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk sort2 -i $TEMP/P279star.1.tsv.gz -o $TEMP/P279star.2.tsv.gz"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     1371.31 real      1362.32 user         3.93 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk compact --presorted -i $TEMP/P279star.2.tsv.gz -o $TEMP/P279star.3.tsv.gz"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     1336.10 real      1284.93 user        28.65 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk add-id --id-style node1-label-node2-num -i $TEMP/P279star.3.tsv.gz -o $OUT/all.P279star.tsv.gz"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "node1\tlabel\tnode2\tid\n",
      "Q1000032\tP279star\tQ1000032\tQ1000032-P279star-Q1000032-0000\n",
      "Q1000032\tP279star\tQ1150070\tQ1000032-P279star-Q1150070-0000\n",
      "Q1000032\tP279star\tQ1190554\tQ1000032-P279star-Q1190554-0000\n",
      "Q1000032\tP279star\tQ133500\tQ1000032-P279star-Q133500-0000\n",
      "Q1000032\tP279star\tQ13878858\tQ1000032-P279star-Q13878858-0000\n",
      "Q1000032\tP279star\tQ14819853\tQ1000032-P279star-Q14819853-0000\n",
      "Q1000032\tP279star\tQ14912053\tQ1000032-P279star-Q14912053-0000\n",
      "Q1000032\tP279star\tQ16686448\tQ1000032-P279star-Q16686448-0000\n",
      "Q1000032\tP279star\tQ16722960\tQ1000032-P279star-Q16722960-0000\n",
      "zcat: error writing to output: Broken pipe\n"
     ]
    }
   ],
   "source": [
    "!zcat < $OUT/all.P279star.tsv.gz | head"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is how we would do the typical `?item P31/P279* ?class` in Kypher. \n",
    "The example shows how to get all the `n1` that are instances of subclasses of beer (q44)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2020-11-05 19:26:57 query]: SQL Translation:\n",
      "---------------------------------------------\n",
      "  SELECT DISTINCT graph_3_c1.\"node1\", graph_5_c2.\"node1\" \"class\", count(graph_5_c2.\"node1\") \"count\", graph_7_c3.\"node2\" \"n1_label\", graph_7_c4.\"node2\" \"class name\"\n",
      "     FROM graph_3 AS graph_3_c1, graph_5 AS graph_5_c2, graph_7 AS graph_7_c3, graph_7 AS graph_7_c4\n",
      "     WHERE graph_3_c1.\"label\"=?\n",
      "     AND graph_5_c2.\"node2\"=?\n",
      "     AND graph_7_c3.\"label\"=?\n",
      "     AND graph_7_c4.\"label\"=?\n",
      "     AND graph_3_c1.\"node1\"=graph_7_c3.\"node1\"\n",
      "     AND graph_3_c1.\"node2\"=graph_5_c2.\"node1\"\n",
      "     AND graph_3_c1.\"node2\"=graph_7_c4.\"node1\"\n",
      "     GROUP BY graph_3_c1.\"node1\", class\n",
      "     ORDER BY graph_3_c1.\"node1\" ASC, graph_5_c2.\"node1\" DESC, count(graph_3_c1.\"node1\") DESC\n",
      "     LIMIT ?\n",
      "  PARAS: ['P31', 'Q44', 'label', 'label', 10]\n",
      "---------------------------------------------\n",
      "        1.84 real         0.65 user         0.35 sys\n",
      "node1      class      count  n1_label           class name\n",
      "Q1000737   Q15075508  1      'Marthabräu'@en    'beer brand'@en\n",
      "Q1017471   Q44        1      'Bush'@en          'beer'@en\n",
      "Q1017471   Q15075508  1      'Bush'@en          'beer brand'@en\n",
      "Q1026242   Q44        1      'Calanda Bräu'@en  'beer'@en\n",
      "Q10304159  Q44        1      'Itaipava'@en      'beer'@en\n",
      "Q10350781  Q15075508  1      'Polar'@en         'beer brand'@en\n",
      "Q1035257   Q44        1      'Cara Pils'@en     'beer'@en\n",
      "Q1041750   Q44        1      'Carling'@en       'beer'@en\n",
      "Q10487716  Q253481    1      'Eriksberg'@en     'lager'@en\n",
      "Q10577358  Q44        1      'Mariestads'@en    'beer'@en\n"
     ]
    }
   ],
   "source": [
    "!$kgtk query -i $OUT/all.P31.tsv.gz -i $OUT/all.P279star.tsv.gz -i \"$LABELS\" --graph-cache $STORE  \\\n",
    "--match 'P31: (n1)-[:P31]->(c), P279star: (c)-[]->(:Q44), label: (n1)-[:label]->(label), label: (c)-[:label]->(c_label)' \\\n",
    "--return 'distinct n1, c as class, count(c) as count, label as n1_label, c_label as `class name`' \\\n",
    "--order-by 'n1, c desc, count(n1) desc' \\\n",
    "--limit 10 \\\n",
    "| column -t -s $'\\t' "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2020-11-05 19:29:23 query]: SQL Translation:\n",
      "---------------------------------------------\n",
      "  SELECT *\n",
      "     FROM graph_5 AS graph_5_c1\n",
      "     WHERE graph_5_c1.\"node2\"=?\n",
      "     LIMIT ?\n",
      "  PARAS: ['Q44', 10]\n",
      "---------------------------------------------\n",
      "        0.77 real         0.61 user         0.14 sys\n",
      "node1      label     node2  id\n",
      "Q10262008  P279star  Q44    Q10262008-P279star-Q44-0000\n",
      "Q10313616  P279star  Q44    Q10313616-P279star-Q44-0000\n",
      "Q10314719  P279star  Q44    Q10314719-P279star-Q44-0000\n",
      "Q10317863  P279star  Q44    Q10317863-P279star-Q44-0000\n",
      "Q10340754  P279star  Q44    Q10340754-P279star-Q44-0000\n",
      "Q10350781  P279star  Q44    Q10350781-P279star-Q44-0000\n",
      "Q10355535  P279star  Q44    Q10355535-P279star-Q44-0000\n",
      "Q10382024  P279star  Q44    Q10382024-P279star-Q44-0000\n",
      "Q10461865  P279star  Q44    Q10461865-P279star-Q44-0000\n",
      "Q1046968   P279star  Q44    Q1046968-P279star-Q44-0000\n"
     ]
    }
   ],
   "source": [
    "!$kgtk query -i $OUT/all.P279star.tsv.gz -i \"$LABELS\" --graph-cache $STORE  \\\n",
    "--match 'star: (n1)-[]->(c:Q44)' \\\n",
    "--limit 10 \\\n",
    "| column -t -s $'\\t' "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create a file to do generalized Is-A queries\n",
    "The idea is that `(n1)-[:isa]->(n2)` when `(n1)-[:P31]->(n2)` or `(n1)-[:P279]->(n2)`\n",
    "\n",
    "We do this by concatenating the files and renaming the relation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "      538.04 real       519.06 user         9.69 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk cat -i $OUT/all.P31.tsv.gz $OUT/all.P279.tsv.gz \\\n",
    "    | gzip > $TEMP/isa.1.tsv.gz"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2020-11-05 14:26:22 sqlstore]: IMPORT graph directly into table graph_6 from /Users/pedroszekely/Downloads/kypher/temp.useful_wikidata_files_v3/isa.1.tsv.gz ...\n",
      "[2020-11-05 14:33:10 query]: SQL Translation:\n",
      "---------------------------------------------\n",
      "  SELECT graph_6_c1.\"node1\", ? \"label\", graph_6_c1.\"node2\"\n",
      "     FROM graph_6 AS graph_6_c1\n",
      "  PARAS: ['isa']\n",
      "---------------------------------------------\n",
      "      835.15 real      1061.27 user        43.66 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk query -i $TEMP/isa.1.tsv.gz --graph-cache $STORE  -o $OUT/all.isa.tsv.gz \\\n",
    "    --match '(n1)-[]->(n2)' \\\n",
    "    --return 'n1, \"isa\" as label, n2' "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Example of how to use the `isa` relation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2020-11-05 20:07:32 query]: SQL Translation:\n",
      "---------------------------------------------\n",
      "  SELECT DISTINCT graph_8_c1.\"node1\", graph_8_c1.\"label\", ? \"node2\", graph_7_c3.\"node2\" \"n1_label\"\n",
      "     FROM graph_5 AS graph_5_c2, graph_7 AS graph_7_c3, graph_8 AS graph_8_c1\n",
      "     WHERE graph_5_c2.\"node2\"=?\n",
      "     AND graph_7_c3.\"label\"=?\n",
      "     AND graph_8_c1.\"label\"=?\n",
      "     AND graph_5_c2.\"node1\"=graph_8_c1.\"node2\"\n",
      "     AND graph_7_c3.\"node1\"=graph_8_c1.\"node1\"\n",
      "     LIMIT ?\n",
      "  PARAS: ['Q44', 'Q44', 'label', 'isa', 10]\n",
      "---------------------------------------------\n",
      "[2020-11-05 20:07:32 sqlstore]: CREATE INDEX on table graph_8 column node1 ...\n",
      "[2020-11-05 20:08:27 sqlstore]: ANALYZE INDEX on table graph_8 column node1 ...\n",
      "node1\tlabel\tnode2\tn1_label\n",
      "Q15875298\tisa\tQ44\t'Floreffe'@en\n",
      "Q1917255\tisa\tQ44\t'St-Idesbald'@en\n",
      "Q2004062\tisa\tQ44\t'Sancti Adalberti'@en\n",
      "Q2006077\tisa\tQ44\t'Bonne-Espérance abbey'@en\n",
      "Q2272636\tisa\tQ44\t'Ename beer'@en\n",
      "Q2290730\tisa\tQ44\t'Ter Dolen (beer)'@en\n",
      "Q3625571\tisa\tQ44\t'Herkenrode Tripel'@en\n",
      "Q505815\tisa\tQ44\t'Trappist beer'@en\n",
      "Q747216\tisa\tQ44\t'Bornem'@en\n",
      "Q15985396\tisa\tQ44\t'Crabbelaer'@en\n",
      "       69.00 real        40.73 user        10.87 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk query -i $OUT/all.isa.tsv.gz -i $OUT/all.P279star.tsv.gz -i \"$LABELS\" --graph-cache $STORE  -o - \\\n",
    "    --match 'isa: (n1)-[l:isa]->(c), P279star: (c)-[]->(:Q44), label: (n1)-[:label]->(label)' \\\n",
    "    --return 'distinct n1, l.label, \"Q44\" as node2, label as n1_label' \\\n",
    "    --limit 10"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Creating a subset of Wikidata without scholarly articles (Q13442814)\n",
    "First create a file with the schloarly articles"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2020-11-06 10:01:39 query]: SQL Translation:\n",
      "---------------------------------------------\n",
      "  SELECT DISTINCT graph_8_c1.\"node1\", graph_8_c1.\"label\", graph_8_c1.\"node2\" \"node2\"\n",
      "     FROM graph_8 AS graph_8_c1\n",
      "     WHERE graph_8_c1.\"label\"=?\n",
      "     AND (graph_8_c1.\"node2\" IN (?, ?, ?, ?, ?, ?, ?, ?))\n",
      "  PARAS: ['isa', 'Q13442814', 'Q523', 'Q16521', 'Q318', 'Q7318358', 'Q7187', 'Q11173', 'Q8054']\n",
      "---------------------------------------------\n",
      "      411.40 real       279.05 user        46.43 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk query -i $OUT/all.isa.tsv.gz -i $OUT/all.P279star.tsv.gz --graph-cache $STORE  -o $OUT/all.isa.Q13442814.Q523.Q16521.Q318.Q7318358.Q7187.Q11173.Q8054.tsv.gz \\\n",
    "--match 'isa: (n1)-[l:isa]->(class)' \\\n",
    "--where 'class in [\"Q13442814\", \"Q523\", \"Q16521\", \"Q318\", \"Q7318358\", \"Q7187\", \"Q11173\", \"Q8054\"]' \\\n",
    "--return 'distinct n1, l.label, class as node2' "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!$kypher -i $NKG \\\n",
    "--match '(n1)-[r {label: label}]->(n2), (r)-[:source]->(source)' \\\n",
    "--where 'source = $s and n1 =~ \".*/catch/.*\"' \\\n",
    "--return 'distinct n1 as node1' \\\n",
    "--spara s='CN' \\\n",
    "--limit 100"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " 35933570 107800710 862405612\n"
     ]
    }
   ],
   "source": [
    "!gzcat $OUT/all.isa.Q13442814.tsv | wc"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we need to remove from `$EDGES` any edge where node1 or node2 is in node1 of `$OUT/all.isa.Q13442814.tsv`. The result will be `$OUT/minus.Q13442814.tsv`. We can then run the whole notebook with this new file as $EDGES and compute all the product files in a new output directory"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2020-11-05 20:45:35 query]: SQL Translation:\n",
      "---------------------------------------------\n",
      "  SELECT DISTINCT graph_1_c2.\"node1\" \"node1\", graph_1_c2.\"label\" \"label\", graph_1_c2.\"node2\" \"node2\", graph_1_c2.\"id\" \"id\"\n",
      "     FROM graph_1 AS graph_1_c2, graph_9 AS graph_9_c1\n",
      "     WHERE ((graph_9_c1.\"node1\" != graph_1_c2.\"node1\") OR (graph_9_c1.\"node1\" != graph_1_c2.\"node2\"))\n",
      "  PARAS: []\n",
      "---------------------------------------------\n",
      "^C\n"
     ]
    }
   ],
   "source": [
    "!$kgtk query -i $OUT/all.isa.Q13442814.tsv.gz -i \"$EDGES\" --graph-cache $STORE  \\\n",
    "--match 'isa: (n1)-[]->(), all: (n2)-[l]->(n3)' \\\n",
    "--where 'n1 != n2 or n1 != n3' \\\n",
    "--return 'distinct n2 as node1, l.label as label, n3 as node2, l as id' \\\n",
    "-o $OUT/all.minus.Q13442814.tsv.gz"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    7479 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/all-distribution.tsv\n",
      "   45941 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/all.P279.tsv.gz\n",
      "       0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/all.P279star.tsv.gz\n",
      " 2206506 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/all.P31.tsv.gz\n",
      " 2254104 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/all.P31_P279.tsv.gz\n",
      "       0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/all.isa.Q13442814.tsv.gz\n",
      " 1208961 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/all.isa.tsv.gz\n",
      "       0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/all.wikidatatype.distribution.tsv.gz\n",
      "  374344 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.alias.en.tsv.gz\n",
      "  383165 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.alias.tsv.gz\n",
      "       0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.commonsMedia.tsv.gz\n",
      " 2253868 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.description.en.tsv.gz\n",
      " 3688188 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.description.tsv.gz\n",
      "       0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.external-id.tsv.gz\n",
      "       0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.geo-shape.tsv.gz\n",
      "       0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.globe-coordinate.tsv.gz\n",
      " 7895945 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.label.en.tsv.gz\n",
      " 8034944 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.label.tsv.gz\n",
      "       0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.math.tsv.gz\n",
      "       0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.monolingualtext.tsv.gz\n",
      "       0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.musical-notation.tsv.gz\n",
      "       0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.quantity.tsv.gz\n",
      "       0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.string.tsv.gz\n",
      "       0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.time.tsv.gz\n",
      " 1351961 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.type.tsv.gz\n",
      "       0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.url.tsv.gz\n",
      "       0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.wikibase-form.tsv.gz\n",
      "       0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.wikibase-item.tsv.gz\n",
      "       0 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.wikibase-property.tsv.gz\n",
      " 3404579 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_2/part.wikipedia_sitelink.tsv.gz\n",
      " 65610166 /Users/pedroszekely/Downloads/kypher/almost.all.edges.sorted.tsv.gz\n",
      " 98720151 total\n"
     ]
    }
   ],
   "source": [
    "!wc -l $OUT/*.tsv $OUT/*.tsv.gz $EDGES"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Number of distinct items in our dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2020-10-22 02:52:30 query]: SQL Translation:\n",
      "---------------------------------------------\n",
      "  SELECT count(DISTINCT graph_1_c1.\"node1\") \"count\"\n",
      "     FROM graph_1 AS graph_1_c1\n",
      "  PARAS: []\n",
      "---------------------------------------------\n",
      "count\n",
      "88228944\n",
      "     1364.75 real      1000.96 user       122.75 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk query -i $EDGES --graph-cache $STORE  -o - \\\n",
    "    --match '(n1)-[]->()' \\\n",
    "    --return 'count(distinct n1) as count'"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Tags",
  "kernelspec": {
   "display_name": "kgtk",
   "language": "python",
   "name": "kgtk"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}