{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# KGTK Tutorial\n",
    "\n",
    "Beer sites:\n",
    "- https://www.realbeer.com/edu/health/calories.php\n",
    "- http://getdrunknotfat.com/alcohol-content-of-beer/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys  \n",
    "sys.path.insert(0, 'tutorial')\n",
    "from tutorial_setup import *"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['ALIAS',\n",
       " 'ALL',\n",
       " 'CLAIMS',\n",
       " 'DESCRIPTION',\n",
       " 'EXAMPLES_DIR',\n",
       " 'GE',\n",
       " 'ISA',\n",
       " 'ITEM',\n",
       " 'LABEL',\n",
       " 'OUT',\n",
       " 'P279',\n",
       " 'P279STAR',\n",
       " 'PROPERTY_DATATYPES',\n",
       " 'Q154ALIAS',\n",
       " 'Q154ALL',\n",
       " 'Q154CLAIMS',\n",
       " 'Q154DESCRIPTION',\n",
       " 'Q154ISA',\n",
       " 'Q154ITEM',\n",
       " 'Q154LABEL',\n",
       " 'Q154P279',\n",
       " 'Q154P279STAR',\n",
       " 'Q154PROPERTY_DATATYPES',\n",
       " 'Q154QUALIFIERS',\n",
       " 'Q154QUALIFIERS_TIME',\n",
       " 'Q154SITELINKS',\n",
       " 'QUALIFIERS',\n",
       " 'QUALIFIERS_TIME',\n",
       " 'SITELINKS',\n",
       " 'STORE',\n",
       " 'TE',\n",
       " 'TEMP',\n",
       " 'WIKIDATA',\n",
       " 'kgtk',\n",
       " 'kypher']"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "kgtk_environment_variables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/Users/pedroszekely/Downloads/kypher\n"
     ]
    }
   ],
   "source": [
    "%cd {output_path}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "mkdir: wikidata_os_v5: File exists\n",
      "mkdir: temp.wikidata_os_v5: File exists\n"
     ]
    }
   ],
   "source": [
    "!mkdir {output_folder}\n",
    "!mkdir {temp_folder}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "mkdir: /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/graph-embedding: File exists\n"
     ]
    }
   ],
   "source": [
    "!mkdir \"$GE\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "mkdir: /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/text-embedding: File exists\n"
     ]
    }
   ],
   "source": [
    "!mkdir \"$TE\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Wikidata in KGTK\n",
    "KGTK has the ability to import a Wikidata JSON dump and covert it to the KGTK representation to make it easy to process the full Wikidata KG in a laptop. There are 86 files which include all the information available in the Wikidata dump and files containing commonly used information derived from the dump. We partitioned the files because in most use cases you only need to use a subset of the files.\n",
    "\n",
    "The files are very large. `claims.tsv` (23GB compressed) contains all the statements in the Wikidata dump, `qualifiers.tsv` contains the qualifiers of those edges, and `labels.en.tsv`, `aliases.en.tsv` and `descriptions.en.tsv` contain the English labels, aliases and descriptions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "-rw-r--r--  1 pedroszekely  staff    68M Nov 16 08:07 /Users/pedroszekely/Downloads/kypher/wikidata_os_v1/aliases.en.tsv.gz\n",
      "-rw-r--r--  1 pedroszekely  staff   4.7G Nov 16 08:05 /Users/pedroszekely/Downloads/kypher/wikidata_os_v1/claims.tsv.gz\n",
      "-rw-r--r--  1 pedroszekely  staff   269M Nov 16 08:08 /Users/pedroszekely/Downloads/kypher/wikidata_os_v1/descriptions.en.tsv.gz\n",
      "-rw-r--r--  1 pedroszekely  staff   376M Nov 16 08:06 /Users/pedroszekely/Downloads/kypher/wikidata_os_v1/labels.en.tsv.gz\n",
      "-rw-r--r--  1 pedroszekely  staff   662M Nov 16 08:43 /Users/pedroszekely/Downloads/kypher/wikidata_os_v1/qualifiers.tsv.gz\n"
     ]
    }
   ],
   "source": [
    "!ls -lh \"$CLAIMS\" \"$QUALIFIERS\" \"$LABEL\" \"$ALIAS\" \"$DESCRIPTION\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`claims.tsv` contains many edges:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " 254135077 1578463882 20285305033\n",
      "\n",
      "real\t1m15.857s\n",
      "user\t2m7.309s\n",
      "sys\t0m8.130s\n"
     ]
    }
   ],
   "source": [
    "!time zcat < \"$CLAIMS\" | wc"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# KGTK Data Model\n",
    "The KGTK data model is a generalization of RDF and property graphs, inspired by the Wikidata data model. In KGTK, a KG is represented using TSV files with four columns: three columns to store the subject, predicate and object of a triple, and a fourth column to store an identifier for the triple. By convention, we use the heading `id` for the identifier, `node1` for the subject, `node2` for the object and `label` for the predicate, as it labels the edge between `node1` and `node2`. The order of the columns is arbitrary.\n",
    "\n",
    "All KGTK files must include the required `id`, `node1`, `label` and `node2` columns, and can contain additional columns to store addtional information about an edge or the nodes in the edge. We will explain the details after we discuss *qualifiers*.\n",
    "Let's take a look at the first few lines of the `claims.tsv` file. We see the four required columns and two additional columns that the Wikidata import includes to facilitate processing of the `claims` file using custom scripts. The `rank` column records the Wikidata rank of a statement, and the `node2;wikidatatype` records the Wikidata type of the value in the `node2` column."
   ]
  },
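  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The column conventions above can be illustrated with a minimal Python sketch. It is not part of KGTK: `to_tsv` and `read_edges` are hypothetical helpers, and the sample edge is copied from the `claims.tsv` listing in the next section. The sketch shows that, because columns are identified by their header names, the same edge can be written with the columns in a different order.\n",
    "\n",
    "```python\n",
    "import csv\n",
    "import io\n",
    "\n",
    "def to_tsv(rows):\n",
    "    # Join cells with tabs and rows with newlines to build TSV text.\n",
    "    return '\\n'.join('\\t'.join(row) for row in rows)\n",
    "\n",
    "def read_edges(text):\n",
    "    # Parse TSV text into one dict per edge, keyed by column name.\n",
    "    reader = csv.DictReader(io.StringIO(text), delimiter='\\t', quoting=csv.QUOTE_NONE)\n",
    "    return list(reader)\n",
    "\n",
    "# The same edge written with two different column orders.\n",
    "edge_id = 'P10-P1629-Q34508-bcc39400-0'\n",
    "text_a = to_tsv([['id', 'node1', 'label', 'node2'], [edge_id, 'P10', 'P1629', 'Q34508']])\n",
    "text_b = to_tsv([['node2', 'label', 'node1', 'id'], ['Q34508', 'P1629', 'P10', edge_id]])\n",
    "\n",
    "edges_a = read_edges(text_a)\n",
    "edges_b = read_edges(text_b)\n",
    "print(edges_a == edges_b)  # the parsed edges do not depend on column order\n",
    "```\n",
    "\n",
    "Reading by header name rather than by position is the design choice that lets tools add extra columns such as `rank` without breaking existing scripts."
   ]
  },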
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Claims"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "id                              node1  label  node2                                    rank    node2;wikidatatype\n",
      "P10-P1628-32b85d-7927ece6-0     P10    P1628  \"http://www.w3.org/2006/vcard/ns#Video\"  normal  url\n",
      "P10-P1628-acf60d-b8950832-0     P10    P1628  \"https://schema.org/video\"               normal  url\n",
      "P10-P1629-Q34508-bcc39400-0     P10    P1629  Q34508                                   normal  wikibase-item\n",
      "P10-P1659-P1651-c4068028-0      P10    P1659  P1651                                    normal  wikibase-property\n",
      "P10-P1659-P18-5e4b9c4f-0        P10    P1659  P18                                      normal  wikibase-property\n",
      "P10-P1659-P4238-d21d1ac0-0      P10    P1659  P4238                                    normal  wikibase-property\n",
      "P10-P1659-P51-86aca4c5-0        P10    P1659  P51                                      normal  wikibase-property\n",
      "P10-P1855-Q15075950-7eff6d65-0  P10    P1855  Q15075950                                normal  wikibase-item\n",
      "P10-P1855-Q69063653-c8cdb04c-0  P10    P1855  Q69063653                                normal  wikibase-item\n",
      "zcat: error writing to output: Broken pipe\n"
     ]
    }
   ],
   "source": [
    "!zcat < \"$CLAIMS\" | head | column -t -s $'\\t'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Wikidata uses numbers to identify items and properties. We can use the `wd` utility (https://github.com/maxlath/wikibase-cli) to understand the first few lines. The second line states that the `P10` property in Wikidata has an equivalent property in another ontology. Notice that each edge has a distinct id. These ids are unique identifiers for statements (the format of the id can be arbitrary, but we assigned ids so that sorting files by id arranges the information so that all edges about a subject are consecutive."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/usr/local/lib/node_modules/wikibase-cli/lib/entity_data_parser.js:6\n",
      "module.exports = async params => {\n",
      "                       ^^^^^^\n",
      "\n",
      "SyntaxError: Unexpected identifier\n",
      "    at createScript (vm.js:56:10)\n",
      "    at Object.runInThisContext (vm.js:97:10)\n",
      "    at Module._compile (module.js:549:28)\n",
      "    at Object.Module._extensions..js (module.js:586:10)\n",
      "    at Module.load (module.js:494:32)\n",
      "    at tryModuleLoad (module.js:453:12)\n",
      "    at Function.Module._load (module.js:445:3)\n",
      "    at Module.require (module.js:504:17)\n",
      "    at require (internal/module.js:20:19)\n",
      "    at Object.<anonymous> (/usr/local/lib/node_modules/wikibase-cli/bin/wb-summary:2:26)\n"
     ]
    }
   ],
   "source": [
    "!wd u P10 P1628 P1629"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at a more meaningful example. `Q31` (https://www.wikidata.org/wiki/Q31) is the Wikidata item about Belgium. We will use the KGTK query to fetch edges about Belgium. `$kypher` is a shortcut to the `kgtk query` command where in addition we pass in the location of the SQLite database we are using ot store the files. KGTK queries use Cypher syntax (https://neo4j.com/developer/cypher/): the following simple query retrieves 10 edges where `node1` is `Q31`, the q-node for Belgium. The results include an edge with `label` `P1036` (Dewey Decimal Classification) and several edges with label `P1081` (human development index)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 262,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>node1</th>\n",
       "      <th>label</th>\n",
       "      <th>node2</th>\n",
       "      <th>rank</th>\n",
       "      <th>node2;wikidatatype</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Q31-P1036-c4e1ad-df86eeb8-0</td>\n",
       "      <td>Q31</td>\n",
       "      <td>P1036</td>\n",
       "      <td>\"2--493\"</td>\n",
       "      <td>normal</td>\n",
       "      <td>external-id</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Q31-P1081-02c2ed-033524b0-0</td>\n",
       "      <td>Q31</td>\n",
       "      <td>P1081</td>\n",
       "      <td>+0.866</td>\n",
       "      <td>normal</td>\n",
       "      <td>quantity</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Q31-P1081-02c2ed-7971505b-0</td>\n",
       "      <td>Q31</td>\n",
       "      <td>P1081</td>\n",
       "      <td>+0.866</td>\n",
       "      <td>normal</td>\n",
       "      <td>quantity</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Q31-P1081-068470-c1c63b8d-0</td>\n",
       "      <td>Q31</td>\n",
       "      <td>P1081</td>\n",
       "      <td>+0.889</td>\n",
       "      <td>normal</td>\n",
       "      <td>quantity</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Q31-P1081-068470-ddac01e0-0</td>\n",
       "      <td>Q31</td>\n",
       "      <td>P1081</td>\n",
       "      <td>+0.889</td>\n",
       "      <td>normal</td>\n",
       "      <td>quantity</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Q31-P1081-144738-c1851cdc-0</td>\n",
       "      <td>Q31</td>\n",
       "      <td>P1081</td>\n",
       "      <td>+0.905</td>\n",
       "      <td>normal</td>\n",
       "      <td>quantity</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Q31-P1081-175742-c07ac1c8-0</td>\n",
       "      <td>Q31</td>\n",
       "      <td>P1081</td>\n",
       "      <td>+0.888</td>\n",
       "      <td>normal</td>\n",
       "      <td>quantity</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Q31-P1081-19636d-c08dd8a8-0</td>\n",
       "      <td>Q31</td>\n",
       "      <td>P1081</td>\n",
       "      <td>+0.896</td>\n",
       "      <td>normal</td>\n",
       "      <td>quantity</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Q31-P1081-1efc03-433a7a4d-0</td>\n",
       "      <td>Q31</td>\n",
       "      <td>P1081</td>\n",
       "      <td>+0.913</td>\n",
       "      <td>normal</td>\n",
       "      <td>quantity</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Q31-P1081-1f8602-ddac530d-0</td>\n",
       "      <td>Q31</td>\n",
       "      <td>P1081</td>\n",
       "      <td>+0.852</td>\n",
       "      <td>normal</td>\n",
       "      <td>quantity</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                            id node1  label     node2    rank  \\\n",
       "0  Q31-P1036-c4e1ad-df86eeb8-0   Q31  P1036  \"2--493\"  normal   \n",
       "1  Q31-P1081-02c2ed-033524b0-0   Q31  P1081    +0.866  normal   \n",
       "2  Q31-P1081-02c2ed-7971505b-0   Q31  P1081    +0.866  normal   \n",
       "3  Q31-P1081-068470-c1c63b8d-0   Q31  P1081    +0.889  normal   \n",
       "4  Q31-P1081-068470-ddac01e0-0   Q31  P1081    +0.889  normal   \n",
       "5  Q31-P1081-144738-c1851cdc-0   Q31  P1081    +0.905  normal   \n",
       "6  Q31-P1081-175742-c07ac1c8-0   Q31  P1081    +0.888  normal   \n",
       "7  Q31-P1081-19636d-c08dd8a8-0   Q31  P1081    +0.896  normal   \n",
       "8  Q31-P1081-1efc03-433a7a4d-0   Q31  P1081    +0.913  normal   \n",
       "9  Q31-P1081-1f8602-ddac530d-0   Q31  P1081    +0.852  normal   \n",
       "\n",
       "  node2;wikidatatype  \n",
       "0        external-id  \n",
       "1           quantity  \n",
       "2           quantity  \n",
       "3           quantity  \n",
       "4           quantity  \n",
       "5           quantity  \n",
       "6           quantity  \n",
       "7           quantity  \n",
       "8           quantity  \n",
       "9           quantity  "
      ]
     },
     "execution_count": 262,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result = !$kypher_raw -i \"$CLAIMS\" \\\n",
    "--match '(:Q31)-[]-()' \\\n",
    "--limit 10 \n",
    "\n",
    "kgtk_to_dataframe(result)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The output of the command above is hard to read because we are seeing the numeric Wikidata identifiers. To make the output more readable, we need to look up the labels of the Wikidata nodes. This information is in the `labels.en.tsv` file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "zcat: id              node1  label  node2\n",
      "P10-label-en    P10    label  'video'@en\n",
      "P1000-label-en  P1000  label  'record held'@en\n",
      "P1001-label-en  P1001  label  'applies to jurisdiction'@en\n",
      "P1002-label-en  P1002  label  'engine configuration'@en\n",
      "error writing to outputP1003-label-en  P1003  label  'National Library of Romania ID'@en\n",
      ": P1004-label-en  P1004  label  'MusicBrainz place ID'@en\n",
      "Broken pipe\n",
      "P1005-label-en  P1005  label  'Portuguese National Library ID'@en\n",
      "P1006-label-en  P1006  label  'Nationale Thesaurus voor Auteurs ID'@en\n",
      "P1007-label-en  P1007  label  'Lattes Platform number'@en\n"
     ]
    }
   ],
   "source": [
    "!zcat < \"$LABEL\" | head | column -t -s $'\\t'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With KGTK accepts multiple files as input, and can do a join to retrieve the label for each property. When using multiple files, it is necessary to tag each clause with the file that provides the data for the clause. For example, the first clause is tagged with `claim` as the word `claim` is part of the file name. The variable property is used to connect the two clauses."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "        0.90 real         0.77 user         0.11 sys\n",
      "id                           node1  label  node2     label;label\n",
      "Q31-P1036-c4e1ad-df86eeb8-0  Q31    P1036  \"2--493\"  'Dewey Decimal Classification'@en\n",
      "Q31-P1081-02c2ed-033524b0-0  Q31    P1081  +0.866    'Human Development Index'@en\n",
      "Q31-P1081-02c2ed-7971505b-0  Q31    P1081  +0.866    'Human Development Index'@en\n",
      "Q31-P1081-068470-c1c63b8d-0  Q31    P1081  +0.889    'Human Development Index'@en\n",
      "Q31-P1081-068470-ddac01e0-0  Q31    P1081  +0.889    'Human Development Index'@en\n",
      "Q31-P1081-144738-c1851cdc-0  Q31    P1081  +0.905    'Human Development Index'@en\n",
      "Q31-P1081-175742-c07ac1c8-0  Q31    P1081  +0.888    'Human Development Index'@en\n",
      "Q31-P1081-19636d-c08dd8a8-0  Q31    P1081  +0.896    'Human Development Index'@en\n",
      "Q31-P1081-1efc03-433a7a4d-0  Q31    P1081  +0.913    'Human Development Index'@en\n",
      "Q31-P1081-1f8602-ddac530d-0  Q31    P1081  +0.852    'Human Development Index'@en\n"
     ]
    }
   ],
   "source": [
    "!$kypher -i \"$CLAIMS\" -i \"$LABEL\" \\\n",
    "--match 'claim: (n1:Q31)-[l {label: property}]-(n2), label: (property)-[:label]->(property_label)' \\\n",
    "--return 'l as id, n1 as node1, property as label, n2 as node2, property_label as `label;label`' \\\n",
    "--limit 10 \\\n",
    "| column -t -s $'\\t'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at a the heads of state of Belgium recorded in property `P35`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "        0.86 real         0.74 user         0.10 sys\n",
      "id                            node1  label  node2      node2;label\n",
      "Q31-P35-Q1079522-c82ed584-0   Q31    P35    Q1079522   'Erasme Louis Surlet de Chokier'@en\n",
      "Q31-P35-Q12967-f2b9aaf3-0     Q31    P35    Q12967     'Leopold II of Belgium'@en\n",
      "Q31-P35-Q12971-2088471b-0     Q31    P35    Q12971     'Leopold I of Belgium'@en\n",
      "Q31-P35-Q12973-31c1b700-0     Q31    P35    Q12973     'Leopold III of Belgium'@en\n",
      "Q31-P35-Q12976-f3e8a567-0     Q31    P35    Q12976     'Baudouin I of Belgium'@en\n",
      "Q31-P35-Q155004-619ba603-0    Q31    P35    Q155004    'Philippe I of Belgium'@en\n",
      "Q31-P35-Q3911-137f01fe-0      Q31    P35    Q3911      'Albert II of Belgium'@en\n",
      "Q31-P35-Q445553-7599749f-0    Q31    P35    Q445553    'Prince Charles, Count of Flanders'@en\n",
      "Q31-P35-Q55008046-725dce40-0  Q31    P35    Q55008046  'Albert I of Belgium'@en\n"
     ]
    }
   ],
   "source": [
    "!$kypher -i \"$CLAIMS\" -i \"$LABEL\" \\\n",
    "--match 'claims: (n1:Q31)-[l:P35]->(n2), labels: (n2)-[:label]->(n2_label)' \\\n",
    "--return 'l as id, n1 as node1, l.label as label, n2 as node2, n2_label as `node2;label`' \\\n",
    "--limit 10 \\\n",
    "| column -t -s $'\\t'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Qualifiers\n",
    "Qualifiers provide additional information about the claims stated in the edges. For `P1081` the qualifiers tell use the year, and for head of state the qualifiers provide information about the period of time and position held by the head of state. The qualifiers can be retrieved using the identifiers of the edges. Let's retrieve the qualifiers associated with the edge for the first head of state (Erasme Louis). To do so, we use the identifier of the edge (`Q31-P35-Q1079522-c82ed584-0`) as `node1` in the `qualifiers.tsv` file. We get three edges, meaning that the edge `Q31/P35/Q1079522` has three qualifiers. Note that the qualifier edges are the same as any other edge in KGTK, having `id`, `node1`, `label` and `node2` columns:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "        0.90 real         0.77 user         0.11 sys\n",
      "id                                         node1                        label  node2                     node2;wikidatatype\n",
      "Q31-P35-Q1079522-c82ed584-0-P39-Q477406-0  Q31-P35-Q1079522-c82ed584-0  P39    Q477406                   wikibase-item\n",
      "Q31-P35-Q1079522-c82ed584-0-P580-106076-0  Q31-P35-Q1079522-c82ed584-0  P580   ^1831-02-25T00:00:00Z/11  time\n",
      "Q31-P35-Q1079522-c82ed584-0-P582-774519-0  Q31-P35-Q1079522-c82ed584-0  P582   ^1831-07-20T00:00:00Z/11  time\n"
     ]
    }
   ],
   "source": [
    "!$kypher -i \"$QUALIFIERS\" \\\n",
    "--match '(n1:`Q31-P35-Q1079522-c82ed584-0`)-[l]->(n2)' \\\n",
    "--limit 10 \\\n",
    "| column -t -s $'\\t'"
   ]
  },
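  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The edge-about-edge pattern can also be sketched in plain Python, without KGTK. The tuples are copied from the outputs above; the variable names are illustrative only. The point is that the `node1` of a qualifier edge is the `id` of the claim it describes, so a simple index keyed by claim id is enough to attach qualifiers to claims.\n",
    "\n",
    "```python\n",
    "from collections import defaultdict\n",
    "\n",
    "# Edges as (id, node1, label, node2) tuples, copied from the outputs above.\n",
    "claims = [\n",
    "    ('Q31-P35-Q1079522-c82ed584-0', 'Q31', 'P35', 'Q1079522'),\n",
    "]\n",
    "qualifiers = [\n",
    "    ('Q31-P35-Q1079522-c82ed584-0-P39-Q477406-0', 'Q31-P35-Q1079522-c82ed584-0', 'P39', 'Q477406'),\n",
    "    ('Q31-P35-Q1079522-c82ed584-0-P580-106076-0', 'Q31-P35-Q1079522-c82ed584-0', 'P580', '^1831-02-25T00:00:00Z/11'),\n",
    "    ('Q31-P35-Q1079522-c82ed584-0-P582-774519-0', 'Q31-P35-Q1079522-c82ed584-0', 'P582', '^1831-07-20T00:00:00Z/11'),\n",
    "]\n",
    "\n",
    "# Index qualifier edges by node1, which holds the id of the claim they describe.\n",
    "by_claim = defaultdict(list)\n",
    "for _qualifier_id, claim_id, prop, value in qualifiers:\n",
    "    by_claim[claim_id].append((prop, value))\n",
    "\n",
    "# Print each claim followed by its qualifiers.\n",
    "for edge_id, node1, label, node2 in claims:\n",
    "    print(node1, label, node2)\n",
    "    for prop, value in by_claim[edge_id]:\n",
    "        print('   ', prop, value)\n",
    "```"
   ]
  },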
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's make them readable: the following query combines the patterns of the previous two queries to retrieve the labels of the property and node2. The query omits the identifier of the qualifier edges to save space. Also, the headers of the two additional columns can be arbitrary, i.e., you can name them whatever you want; the names used follow a KGTK convention that enabled KGTK to automatically parse the output, which is useful if we want to use the output as an input to another KGTK command. The word before the `;` refers to one of the standard columns, and the name after the `;` refers to a property of that element. In this example, we used `label` as the column contains the label of the entity."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "        0.90 real         0.77 user         0.11 sys\n",
      "node1                        label  node2                     label;label\n",
      "Q31-P35-Q1079522-c82ed584-0  P39    Q477406                   'position held'@en\n",
      "Q31-P35-Q1079522-c82ed584-0  P580   ^1831-02-25T00:00:00Z/11  'start time'@en\n",
      "Q31-P35-Q1079522-c82ed584-0  P582   ^1831-07-20T00:00:00Z/11  'end time'@en\n"
     ]
    }
   ],
   "source": [
    "!$kypher -i \"$QUALIFIERS\" -i \"$LABEL\" \\\n",
    "--match 'qual: (n1:`Q31-P35-Q1079522-c82ed584-0`)-[l {label: property}]->(n2), labels: (property)-[:label]->(property_label)' \\\n",
    "--return 'n1 as node1, property as label, n2 as node2, property_label as `label;label`' \\\n",
    "--limit 10 \\\n",
    "| column -t -s $'\\t'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's put all the values of `P35` in a file, which we will conveniently name `Q31.P35.tsv`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "        0.83 real         0.71 user         0.09 sys\n"
     ]
    }
   ],
   "source": [
    "!$kypher -i \"$CLAIMS\" \\\n",
    "--match '(n1:Q31)-[l:P35]->(n2)' \\\n",
    "--return 'l as id, n1 as node1, l.label as label, n2 as node2' \\\n",
    "-o \"$TEMP\"/Q31.P35.tsv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we are going to combine the `P35` edges of Belgium with the qualifiers. To do this we will run a query that uses the edges that we stored in `Q31.P35.tsv`, and retrieve the qualifiers for each of those edges; the result of our query will be the qualifier edges of the head of state edges. To union the qualifier edges with the claim edges, we feed the output of the query to the `cat` command (concatenate), and then feed the output to the `sort2` command to sort the edges. The first 12 edges are shown below. We see a claim edge followed by the qualifiers defined for it.\n",
    "\n",
    "This snippet illustrates that KGTK commands can be chained using the `/` chain operator to compose more complex workflows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "id                                         node1                        label  node2\n",
      "Q31-P35-Q1079522-c82ed584-0                Q31                          P35    Q1079522\n",
      "Q31-P35-Q1079522-c82ed584-0-P39-Q477406-0  Q31-P35-Q1079522-c82ed584-0  P39    Q477406\n",
      "Q31-P35-Q1079522-c82ed584-0-P580-106076-0  Q31-P35-Q1079522-c82ed584-0  P580   ^1831-02-25T00:00:00Z/11\n",
      "Q31-P35-Q1079522-c82ed584-0-P582-774519-0  Q31-P35-Q1079522-c82ed584-0  P582   ^1831-07-20T00:00:00Z/11\n",
      "Q31-P35-Q12967-f2b9aaf3-0                  Q31                          P35    Q12967\n",
      "Q31-P35-Q12967-f2b9aaf3-0-P39-Q13592862-0  Q31-P35-Q12967-f2b9aaf3-0    P39    Q13592862\n",
      "Q31-P35-Q12967-f2b9aaf3-0-P580-f29037-0    Q31-P35-Q12967-f2b9aaf3-0    P580   ^1865-12-17T00:00:00Z/11\n",
      "Q31-P35-Q12967-f2b9aaf3-0-P582-136f02-0    Q31-P35-Q12967-f2b9aaf3-0    P582   ^1909-12-17T00:00:00Z/11\n",
      "Q31-P35-Q12971-2088471b-0                  Q31                          P35    Q12971\n",
      "Q31-P35-Q12971-2088471b-0-P39-Q13592862-0  Q31-P35-Q12971-2088471b-0    P39    Q13592862\n",
      "Q31-P35-Q12971-2088471b-0-P580-a35d41-0    Q31-P35-Q12971-2088471b-0    P580   ^1831-06-04T00:00:00Z/11\n",
      "        1.83 real         2.86 user         0.47 sys\n"
     ]
    }
   ],
   "source": [
    "!$kypher -i \"$QUALIFIERS\" -i \"$TEMP\"/Q31.P35.tsv \\\n",
    "--match 'P35: ()-[l]->(), qual: (l)-[lq]->(n2)' \\\n",
    "--return 'lq as id, l as node1, lq.label as label, n2 as node2' \\\n",
    "/ cat -i - -i \"$TEMP\"/Q31.P35.tsv \\\n",
    "/ sort2 \\\n",
    "| head -12 \\\n",
    "| column -t -s $'\\t'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "- KGTK represents graphs in TSV files with standard columns `id`, `node1`, `label` and `node2`\n",
    "- It is possible to include arbitrary additional columns in KGTK files\n",
    "- The identifier of an edge can be used as a node in another edge enabling the representation of edges about edges\n",
    "- KGTK provides a powerful query command based on Cypher as well as a host of other commands, type `kgtk --help` to see the list of commands."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Use Case: A Knowledge Graph About Alocholic Beverages\n",
    "We are going to build a small KG about alcoholoc beverages by extracting from Wikidata the subgraph that relates to alcoholic beverages (https://www.wikidata.org/wiki/Q154)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 1: create a list of all descendants of `alcoholic beverage` (https://www.wikidata.org/wiki/Q154)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/usr/local/lib/node_modules/wikibase-cli/lib/entity_data_parser.js:6\n",
      "module.exports = async params => {\n",
      "                       ^^^^^^\n",
      "\n",
      "SyntaxError: Unexpected identifier\n",
      "    at createScript (vm.js:56:10)\n",
      "    at Object.runInThisContext (vm.js:97:10)\n",
      "    at Module._compile (module.js:549:28)\n",
      "    at Object.Module._extensions..js (module.js:586:10)\n",
      "    at Module.load (module.js:494:32)\n",
      "    at tryModuleLoad (module.js:453:12)\n",
      "    at Function.Module._load (module.js:445:3)\n",
      "    at Module.require (module.js:504:17)\n",
      "    at require (internal/module.js:20:19)\n",
      "    at Object.<anonymous> (/usr/local/lib/node_modules/wikibase-cli/bin/wb-summary:2:26)\n"
     ]
    }
   ],
   "source": [
    "!wd u Q154"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Wikidata uses two properties to organize entities in a hierarchy: the `instance of` property (`P31`) and the `subclass of` (`P279`) property. In many cases, the distinction between instance of and subclass of is subtle, and we find many situations in Wikidata where either one or the other is used to organize hierarchies. For this reason, we created a new property called `isa` that contains the union of `P31` and `P279` and stored in the file `derived.isa.tsv`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "node1\tlabel\tnode2\n",
      "P10\tisa\tQ18610173\n",
      "P1000\tisa\tQ18608871\n",
      "P1001\tisa\tQ15720608\n",
      "P1001\tisa\tQ22984026\n",
      "zcat: error writing to output: Broken pipe\n"
     ]
    }
   ],
   "source": [
    "!zcat < \"$ISA\" | head -5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To get all the alcoholic beverages, we need to get all entities that are `isa` of alcoholic beverage (`Q154`) or that are `isa` of any descendant of `Q154` in the `subclass of` (`P279`) hierarchy. The length of the chain of `P279` edges can be arbitrarily long. To support this uise case, KGTK offers the `derived.P279star.tsv` file that contains edges `n1/P279star/n2` if `n1` is a descendant of `n2` on chains of `P279` edges, includiing chains of zero length (`n1/P279star/n1`)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "zcat: node1     label     node2     id\n",
      "Q1000032  P279star  Q1000032  Q1000032-P279star-Q1000032-0000\n",
      "Q1000032  P279star  Q1150070  Q1000032-P279star-Q1150070-0000\n",
      "Q1000032  P279star  Q1190554  Q1000032-P279star-Q1190554-0000\n",
      "Q1000032  P279star  Q133500   Q1000032-P279star-Q133500-0000\n",
      "error writing to output: Broken pipe\n"
     ]
    }
   ],
   "source": [
    "!zcat < \"$P279STAR\" | head -5 | column -t -s $'\\t'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To get all alcoholic beverages, we need to find all nodes `n1` that are connected to `Q154` with an `isa` edge and a chain of `P279` edges:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "        3.18 real         0.93 user         0.57 sys\n"
     ]
    }
   ],
   "source": [
    "!$kypher -i \"$ISA\" -i \"$P279STAR\" -i \"$LABEL\" \\\n",
    "--match 'isa: (n1)-[]->(n2), star: (n2)-[]->(n3:Q154), label: (n1)-[]->(n1l)' \\\n",
    "--return 'n1 as node1, n1l as `node1;label`, n3 as node2, \"isastar\" as label' \\\n",
    "-o \"$TEMP\"/Q154.descendant.tsv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here is a sample of alcoholic beverages in Wikidata"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "node1      node1;label                  node2  label\n",
      "Q1350656   'Corn whiskey'@en            Q154   isastar\n",
      "Q20713240  'Buckwheat whisky'@en        Q154   isastar\n",
      "Q2535077   'Rye Whiskey'@en             Q154   isastar\n",
      "Q536976    'Canadian whisky'@en         Q154   isastar\n",
      "Q7991845   'Wheat whiskey'@en           Q154   isastar\n",
      "Q10429117  'Beyaz'@en                   Q154   isastar\n",
      "Q1069954   'Prosecco'@en                Q154   isastar\n",
      "Q1094850   'Clairette du Languedoc'@en  Q154   isastar\n",
      "Q1135592   'Cortese di Gavi'@en         Q154   isastar\n"
     ]
    }
   ],
   "source": [
    "!head \"$TEMP\"/Q154.descendant.tsv | column -t -s $'\\t'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An the total number:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    3251   16116  133341 /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/Q154.descendant.tsv\n"
     ]
    }
   ],
   "source": [
    "!wc \"$TEMP\"/Q154.descendant.tsv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The computation of `Q154.descendant.tsv` can be implemented in SPARQL using the common `P31/P279*` graph pattern, but the query will time out if the result size is large. For example, the query will time out when requesting all descendants of chemical compounds, as there are over one million chemical compounds in Wikidata. The query can be easily done in KGTK."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 2: get the incoming and outgoing edges\n",
    "We want out graph to have the neighbors of all alcoholic beverages, so we need to get the incoming and outgoing edges.\n",
    "\n",
    "The following query gets the outgoing edges."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "        2.34 real         1.03 user         0.36 sys\n"
     ]
    }
   ],
   "source": [
    "!$kypher -i \"$CLAIMS\" -i \"$TEMP\"/Q154.descendant.tsv \\\n",
    "--match 'Q154: (n1)-[]->(), claims: (n1)-[l]->(n2)' \\\n",
    "--return 'distinct l as id, n1 as node1, l.label as label, n2 as node2' \\\n",
    "-o \"$TEMP\"/Q154.node1.tsv.gz"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that we are getting several properties for our items:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "id                                   node1     label  node2\n",
      "Q1000737-P1435-Q17297633-53903946-0  Q1000737  P1435  Q17297633\n",
      "Q1000737-P1454-Q460178-8ad4931b-0    Q1000737  P1454  Q460178\n",
      "Q1000737-P159-Q16003-31e24011-0      Q1000737  P159   Q16003\n",
      "Q1000737-P17-Q183-24107fe2-0         Q1000737  P17    Q183\n",
      "Q1000737-P18-147fc9-667304f8-0       Q1000737  P18    \"Marthabräuhalle 2011-04-03.jpg\"\n",
      "Q1000737-P31-Q131734-f97bd6f6-0      Q1000737  P31    Q131734\n",
      "Q1000737-P31-Q15075508-a4c83928-0    Q1000737  P31    Q15075508\n",
      "Q1000737-P373-689157-3110aade-0      Q1000737  P373   \"Marthabräu\"\n",
      "Q1000737-P452-Q869095-f5d8e7a2-0     Q1000737  P452   Q869095\n",
      "zcat: error writing to output: Broken pipe\n"
     ]
    }
   ],
   "source": [
    "!zcat < \"$TEMP\"/Q154.node1.tsv.gz | head | column -t -s $'\\t'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now get the incoming edges:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "        2.23 real         0.98 user         0.36 sys\n"
     ]
    }
   ],
   "source": [
    "!$kypher -i \"$CLAIMS\" -i \"$TEMP\"/Q154.descendant.tsv \\\n",
    "--match 'Q154: (n1)-[]->(), claims: (n3)-[l]->(n1)' \\\n",
    "--return 'distinct l as id, n3 as node1, l.label as label, n1 as node2' \\\n",
    "-o \"$TEMP\"/Q154.node2.tsv.gz"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here is a sample of the edges we are getting"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "id                                  node1      label  node2\n",
      "Q1350656-P279-Q1007164-7e3ecba9-0   Q1350656   P279   Q1007164\n",
      "zcat: Q20713240-P279-Q1007164-b3112260-0  Q20713240  P279   Q1007164\n",
      "Q2535077-P279-Q1007164-b2d3684b-0   Q2535077   P279   Q1007164\n",
      "Q536976-P279-Q1007164-8bf7467b-0    Q536976    P279   Q1007164\n",
      "Q7991845-P279-Q1007164-18bc383a-0   Q7991845   P279   Q1007164\n",
      "Q10337004-P186-Q10210-c56dd7ce-0    Q10337004  P186   Q10210\n",
      "Q10429117-P31-Q10210-d342f061-0     Q10429117  P31    Q10210\n",
      "Q1051699-P279-Q10210-65d32c67-0     Q1051699   P279   Q10210\n",
      "error writing to outputQ1058259-P279-Q10210-e204554a-0     Q1058259   P279   Q10210\n",
      ": Broken pipe\n"
     ]
    }
   ],
   "source": [
    "!zcat < \"$TEMP\"/Q154.node2.tsv.gz | head | column -t -s $'\\t'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Concatenate the incoming and outgoing edges to put them in a single file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "        1.23 real         1.10 user         0.10 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk cat -i \"$TEMP\"/Q154.node1.tsv.gz -i \"$TEMP\"/Q154.node2.tsv.gz -o \"$TEMP\"/Q154.claims.tsv.gz"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have over 30,000 edges:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   28142  116045 1584824\n"
     ]
    }
   ],
   "source": [
    "!zcat < \"$TEMP\"/Q154.claims.tsv.gz | wc"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Summary of where we are:\n",
    "- Computed the list of entities below alcoholic beverage\n",
    "- Found all incoming and outgoing edges to these entities; for the new entities we bring in, we have no information, we only have the q-node\n",
    "\n",
    "Not having any information about the entities connected to the alcoholic beverages is limiting, so let's get their outgoing edges. We run the query with `Q154.claims.tsv` which will use all the entities in our graph, including the alcoholic beverages for which we already got outgoing edges; no harm done, as we can eliminate duplicated later."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "        5.75 real         3.92 user         0.57 sys\n"
     ]
    }
   ],
   "source": [
    "!$kypher -i \"$CLAIMS\" -i \"$TEMP\"/Q154.claims.tsv.gz \\\n",
    "--match 'Q154: ()-[]->(n1), claims: (n1)-[l]->(n2)' \\\n",
    "--return 'distinct l as id, n1 as node1, l.label as label, n2 as node2' \\\n",
    "-o \"$TEMP\"/Q154.hop.out.tsv.gz"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For sanity check, let's take a peek:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "id                             node1  label  node2\n",
      "Q1000-P1036-9bef62-f77ac5cf-0  Q1000  P1036  \"2--6721\"\n",
      "Q1000-P1081-0d345f-3a33abf5-0  Q1000  P1081  +0.641\n",
      "Q1000-P1081-0d345f-6da37c02-0  Q1000  P1081  +0.641\n",
      "Q1000-P1081-1100e3-c7631769-0  Q1000  P1081  +0.624\n",
      "Q1000-P1081-1ada51-7c71c229-0  Q1000  P1081  +0.639\n",
      "Q1000-P1081-345681-88a99cab-0  Q1000  P1081  +0.702\n",
      "Q1000-P1081-347db1-da0e5e03-0  Q1000  P1081  +0.637\n",
      "Q1000-P1081-419245-b03a8b59-0  Q1000  P1081  +0.647\n",
      "Q1000-P1081-419245-f8cd58e8-0  Q1000  P1081  +0.647\n",
      "zcat: error writing to output: Broken pipe\n"
     ]
    }
   ],
   "source": [
    "!zcat < \"$TEMP\"/Q154.hop.out.tsv.gz | head | column -t -s $'\\t'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's consolidate our edge files into one larger file. We use compact to remove duplicates and sort to keep edges for the same subject together:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "        5.07 real         7.09 user         0.59 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk cat -i \"$TEMP\"/Q154.claims.tsv.gz -i \"$TEMP\"/Q154.hop.out.tsv.gz \\\n",
    "/ compact \\\n",
    "/ sort2 \\\n",
    "-o \"$TEMP\"/Q154.edges.1.tsv.gz"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we have over 170,000 edges:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  165133  678398 8868474\n"
     ]
    }
   ],
   "source": [
    "!zcat < \"$TEMP\"/Q154.edges.1.tsv.gz | wc"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Take a peek:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "id                                node1  label  node2\n",
      "P1389-P1855-Q1109662-9e2ef218-0   P1389  P1855  Q1109662\n",
      "P1582-P1855-Q17329207-f4ef508d-0  P1582  P1855  Q17329207\n",
      "P2581-P1855-Q7639844-08b3a4c7-0   P2581  P1855  Q7639844\n",
      "P2665-P1855-Q1067702-402a80a9-0   P2665  P1855  Q1067702\n",
      "P2665-P1855-Q170210-30d44f0b-0    P2665  P1855  Q170210\n",
      "P5420-P1855-Q44-209cffb1-0        P5420  P1855  Q44\n",
      "P5420-P1855-Q722338-73d7be75-0    P5420  P1855  Q722338\n",
      "zcat: P6088-P1855-Q1543214-3d934541-0   P6088  P1855  Q1543214\n",
      "P6088-P1855-Q4626-4ed65964-0      P6088  P1855  Q4626\n",
      "error writing to output: Broken pipe\n"
     ]
    }
   ],
   "source": [
    "!zcat < \"$TEMP\"/Q154.edges.1.tsv.gz | head | column -t -s $'\\t'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once we have all the alcoholic beverages, we want to get the upper ontology of all the classes used, so that every class in our KG has a path to the root of the ontology. For example, first go to `drink` (`Q40050`), then to `liquid` (`Q11435`), then `fluid` (`Q102205`) and so on until we reach `entity` (`Q35120`).\n",
    "\n",
    "To do this, we need to get all the `isa` of all items in our graph, then get `P279star` so we get the list of all classes that these items descend from. Finally we need to get all the `P279` edges between them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "       13.58 real         9.23 user         1.18 sys\n"
     ]
    }
   ],
   "source": [
    "!$kypher -i \"$TEMP\"/Q154.edges.1.tsv.gz -i \"$P279STAR\" -i \"$ISA\" \\\n",
    "--match 'Q154: (n1)-[]->(), isa: (n1)-[]->(n2), P279: (n2)-[]->(class)' \\\n",
    "--return 'distinct class as node1' \\\n",
    "-o \"$TEMP\"/Q154.classes.tsv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have almost 3,000 classes in the upper ontology for the entities in our graph:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    2846    2846   24939 /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/Q154.classes.tsv\n"
     ]
    }
   ],
   "source": [
    "!wc \"$TEMP\"/Q154.classes.tsv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now use the `derived.P279.tsv` file to get the `P279` edges that connect a class to its superclass."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "        1.48 real         0.89 user         0.22 sys\n"
     ]
    }
   ],
   "source": [
    "!$kypher -i \"$TEMP\"/Q154.classes.tsv -i \"$P279\" \\\n",
    "--match 'Q154: (class)-[]->(), P279: (class)-[l]->(super)' \\\n",
    "--return 'distinct l as id, class as node1, l.label as label, super as node2' \\\n",
    "-o \"$TEMP\"/Q154.P279.tsv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We get close to 5,000 `P279` edges in the upper ontology; we will take care of potential duplicates at a final cleanup step:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    4517   18068  249492 /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/Q154.P279.tsv\n"
     ]
    }
   ],
   "source": [
    "!wc \"$TEMP\"/Q154.P279.tsv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see several q-nodes below `entity` (`Q35120`), a good indication that we computed the upper ontology correctly:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Q16686448-P279-Q35120-674edbf9-0  Q16686448  P279  Q35120\n",
      "Q35120-P279-25b964-0520e300-0     Q35120     P279  novalue\n",
      "Q58415929-P279-Q35120-75659d0c-0  Q58415929  P279  Q35120\n",
      "Q23958946-P279-Q35120-70a9ed90-0  Q23958946  P279  Q35120\n",
      "Q488383-P279-Q35120-5fad2ad7-0    Q488383    P279  Q35120\n"
     ]
    }
   ],
   "source": [
    "!grep Q35120 \"$TEMP\"/Q154.P279.tsv | head -5 | column -t -s $'\\t'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's consolidate the edges again:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "        5.14 real         7.12 user         0.57 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk cat -i \"$TEMP\"/Q154.edges.1.tsv.gz -i \"$TEMP\"/Q154.P279.tsv \\\n",
    "/ compact \\\n",
    "/ sort2 \\\n",
    "-o \"$TEMP\"/Q154.edges.2.tsv.gz"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have over 175,000 edges:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  169047  694054 9085731\n"
     ]
    }
   ],
   "source": [
    "!zcat < \"$TEMP\"/Q154.edges.2.tsv.gz | wc"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Summary:\n",
    "- We have the instances of alcoholic beverages\n",
    "- We added incoming and outgoing edges\n",
    "- For the outgoing edges, we went one hop forward\n",
    "- We got the upper ontology\n",
    "\n",
    "The properties are also items in Wikidata, so let's collect them all and get their edges."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "        2.13 real         2.03 user         0.31 sys\n"
     ]
    }
   ],
   "source": [
    "!$kypher -i \"$TEMP\"/Q154.edges.2.tsv.gz \\\n",
    "--match '()-[l {label: property}]->()' \\\n",
    "--return 'distinct property as node1' \\\n",
    "-o \"$TEMP\"/Q154.properties.tsv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "node1\n",
      "P10\n",
      "P1001\n",
      "P1003\n",
      "P1004\n",
      "P1005\n",
      "P1006\n",
      "P101\n",
      "P1014\n",
      "P1015\n"
     ]
    }
   ],
   "source": [
    "!head \"$TEMP\"/Q154.properties.tsv | column -t -s $'\\t'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's get the edges of these properties:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "        1.25 real         0.91 user         0.18 sys\n"
     ]
    }
   ],
   "source": [
    "!$kypher -i \"$CLAIMS\" -i \"$TEMP\"/Q154.properties.tsv \\\n",
    "--match 'Q154: (p)-[]->(), claims: (p)-[l]->(n2)' \\\n",
    "--return 'distinct l as id, p as node1, l.label as label, n2 as node2' \\\n",
    "-o \"$TEMP\"/Q154.properties.edges.tsv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Take a peek, looks like what we had before as the file is sorted, let's proceed:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "id                              node1  label  node2\n",
      "P10-P1628-32b85d-7927ece6-0     P10    P1628  \"http://www.w3.org/2006/vcard/ns#Video\"\n",
      "P10-P1628-acf60d-b8950832-0     P10    P1628  \"https://schema.org/video\"\n",
      "P10-P1629-Q34508-bcc39400-0     P10    P1629  Q34508\n",
      "P10-P1659-P1651-c4068028-0      P10    P1659  P1651\n",
      "P10-P1659-P18-5e4b9c4f-0        P10    P1659  P18\n",
      "P10-P1659-P4238-d21d1ac0-0      P10    P1659  P4238\n",
      "P10-P1659-P51-86aca4c5-0        P10    P1659  P51\n",
      "P10-P1855-Q15075950-7eff6d65-0  P10    P1855  Q15075950\n",
      "P10-P1855-Q69063653-c8cdb04c-0  P10    P1855  Q69063653\n"
     ]
    }
   ],
   "source": [
    "!head \"$TEMP\"/Q154.properties.edges.tsv | column -t -s $'\\t'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's consolidate the edges again:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "        6.18 real         8.48 user         0.65 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk cat -i \"$TEMP\"/Q154.edges.2.tsv.gz -i \"$TEMP\"/Q154.properties.edges.tsv \\\n",
    "/ compact \\\n",
    "/ sort2 \\\n",
    "-o \"$TEMP\"/Q154.edges.3.tsv.gz"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The number of edges grew a bit to 206,000"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  197521  811687 10791930\n"
     ]
    }
   ],
   "source": [
    "!zcat < \"$TEMP\"/Q154.edges.3.tsv.gz | wc"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Summary:\n",
    "- We have the instances of alcoholic beverages\n",
    "- We added incoming and outgoing edges\n",
    "- For the outgoing edges, we went one hop forward\n",
    "- We got the upper ontology\n",
    "- And we have the edges on all the properties being used\n",
    "\n",
    "We will stop adding nodes to the KG at this time, and proceed to add the labels for all the nodes."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 3: get the labels, aliases and descriptions of all the items in our KG\n",
    "Before we start, let's define an environment variable to hold the final edges file so that if we change our mind later, we can update it without having to change the commands below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [],
   "source": [
    "os.environ[\"Q154GRAPH\"] = os.environ[\"TEMP\"] + \"/Q154.edges.3.tsv.gz\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/Q154.edges.3.tsv.gz\n"
     ]
    }
   ],
   "source": [
    "!ls \"$Q154GRAPH\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Get the labels of the `node1` nodes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "        5.02 real         2.81 user         0.87 sys\n"
     ]
    }
   ],
   "source": [
    "!$kypher -i \"$Q154GRAPH\" -i \"$LABEL\" \\\n",
    "--match 'Q154: (n1)-[]-(), label: (n1)-[l]->(n2)' \\\n",
    "--return 'distinct l as id, n1 as node1, l.label as label, n2 as node2' \\\n",
    "-o \"$TEMP\"/Q154.label.node1.tsv.gz"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "id              node1  label  node2\n",
      "P10-label-en    P10    label  'video'@en\n",
      "P1001-label-en  P1001  label  'applies to jurisdiction'@en\n",
      "P1003-label-en  P1003  label  'National Library of Romania ID'@en\n",
      "P1004-label-en  P1004  label  'MusicBrainz place ID'@en\n",
      "P1005-label-en  P1005  label  'Portuguese National Library ID'@en\n",
      "P1006-label-en  P1006  label  'Nationale Thesaurus voor Auteurs ID'@en\n",
      "P101-label-en   P101   label  'field of work'@en\n",
      "P1014-label-en  P1014  label  'Getty AAT ID'@en\n",
      "P1015-label-en  P1015  label  'NORAF ID'@en\n",
      "zcat: error writing to output: Broken pipe\n"
     ]
    }
   ],
   "source": [
    "!zcat < \"$TEMP\"/Q154.label.node1.tsv.gz | head | column -t -s $'\\t'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Get the labels of the `node2` nodes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "        8.45 real         2.05 user         1.71 sys\n"
     ]
    }
   ],
   "source": [
    "!$kypher -i \"$Q154GRAPH\" -i \"$LABEL\" \\\n",
    "--match 'Q154: ()-[]-(n2), label: (n2)-[l]->(n3)' \\\n",
    "--return 'distinct l as id, n2 as node1, l.label as label, n3 as node2' \\\n",
    "-o \"$TEMP\"/Q154.label.node2.tsv.gz"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Concatenate the two label files"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "        1.66 real         1.52 user         0.10 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk cat -i \"$TEMP\"/Q154.label.node1.tsv.gz -i \"$TEMP\"/Q154.label.node2.tsv.gz \\\n",
    "-o \"$TEMP\"/labels.tsv.gz"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   56123  289814 3031029\n"
     ]
    }
   ],
   "source": [
    "!zcat < \"$TEMP\"/labels.tsv.gz | wc"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Get the aliases of `node1` nodes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "        2.55 real         1.51 user         0.37 sys\n"
     ]
    }
   ],
   "source": [
    "!$kypher -i \"$Q154GRAPH\" -i \"$ALIAS\" \\\n",
    "--match 'Q154: (n1)-[]-(), alias: (n1)-[l]->(n2)' \\\n",
    "--return 'distinct l as id, n1 as node1, l.label as label, n2 as node2' \\\n",
    "-o \"$TEMP\"/Q154.alias.node1.tsv.gz"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Get the aliases of `node2` nodes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "        3.44 real         1.59 user         0.59 sys\n"
     ]
    }
   ],
   "source": [
    "!$kypher -i \"$Q154GRAPH\" -i \"$ALIAS\" \\\n",
    "--match 'Q154: ()-[]-(n2), alias: (n2)-[l]->(n3)' \\\n",
    "--return 'distinct l as id, n2 as node1, l.label as label, n3 as node2' \\\n",
    "-o \"$TEMP\"/Q154.alias.node2.tsv.gz"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Concatenate the two alias files"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "        1.63 real         1.49 user         0.11 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk cat -i \"$TEMP\"/Q154.alias.node1.tsv.gz -i \"$TEMP\"/Q154.alias.node2.tsv.gz \\\n",
    "-o \"$TEMP\"/alias.tsv.gz"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Get the descriptions of `node1` nodes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "        3.09 real         1.11 user         0.52 sys\n"
     ]
    }
   ],
   "source": [
    "!$kypher -i \"$Q154GRAPH\" -i \"$DESCRIPTION\" \\\n",
    "--match 'Q154: (n1)-[]-(), description: (n1)-[l]->(n2)' \\\n",
    "--return 'distinct l as id, n1 as node1, l.label as label, n2 as node2' \\\n",
    "-o \"$TEMP\"/Q154.description.node1.tsv.gz"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Get the descriptions of `node2` nodes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "        8.51 real         1.94 user         1.70 sys\n"
     ]
    }
   ],
   "source": [
    "!$kypher -i \"$Q154GRAPH\" -i \"$DESCRIPTION\" \\\n",
    "--match 'Q154: ()-[]-(n2), description: (n2)-[l]->(n3)' \\\n",
    "--return 'distinct l as id, n2 as node1, l.label as label, n3 as node2' \\\n",
    "-o \"$TEMP\"/Q154.description.node2.tsv.gz"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Concatenate the two description files"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "        1.67 real         1.48 user         0.11 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk cat -i \"$TEMP\"/Q154.description.node1.tsv.gz -i \"$TEMP\"/Q154.description.node2.tsv.gz \\\n",
    "-o \"$TEMP\"/description.tsv.gz"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 4: get the qualifiers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "        5.29 real         2.44 user         0.73 sys\n"
     ]
    }
   ],
   "source": [
    "!$kypher -i \"$Q154GRAPH\" -i \"$QUALIFIERS\" \\\n",
    "--match 'Q154: ()-[l]->(), qual: (l)-[lq]->(n2)' \\\n",
    "--return 'lq as id, l as node1, lq.label as label, n2 as node2' \\\n",
    "-o \"$OUT\"/qualifiers.tsv.gz"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "zcat: error writing to output: Broken pipe\n",
      "id                                                node1                           label  node2\n",
      "P10-P1855-Q15075950-7eff6d65-0-P10-54b214-0       P10-P1855-Q15075950-7eff6d65-0  P10    \"Smoorverliefd 12 september.webm\"\n",
      "P10-P1855-Q15075950-7eff6d65-0-P3831-Q622550-0    P10-P1855-Q15075950-7eff6d65-0  P3831  Q622550\n",
      "P10-P1855-Q69063653-c8cdb04c-0-P10-6fb08f-0       P10-P1855-Q69063653-c8cdb04c-0  P10    \"Couch Commander.webm\"\n",
      "P10-P1855-Q7378-555592a4-0-P10-8a982d-0           P10-P1855-Q7378-555592a4-0      P10    \"Elephants Dream (2006).webm\"\n",
      "P10-P2302-Q21502404-d012aef4-0-P1793-f4c2ed-0     P10-P2302-Q21502404-d012aef4-0  P1793  \"(?i).+\\\\\\\\.(webm\\\\|ogv\\\\|ogg\\\\|gif)\"\n",
      "P10-P2302-Q21502404-d012aef4-0-P2316-Q21502408-0  P10-P2302-Q21502404-d012aef4-0  P2316  Q21502408\n",
      "P10-P2302-Q21502404-d012aef4-0-P2916-cb0917-0     P10-P2302-Q21502404-d012aef4-0  P2916  'filename with extension: webm, ogg, ogv, or gif (case insensitive)'@en\n",
      "P10-P2302-Q21510851-5224fe0b-0-P2306-P175-0       P10-P2302-Q21510851-5224fe0b-0  P2306  P175\n",
      "P10-P2302-Q21510851-5224fe0b-0-P2306-P180-0       P10-P2302-Q21510851-5224fe0b-0  P2306  P180\n"
     ]
    }
   ],
   "source": [
    "!zcat < \"$TEMP\"/Q154.qualifiers.tsv.gz | head | column -t -s $'\\t'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  109816  446163 10639203\n"
     ]
    }
   ],
   "source": [
    "!zcat < \"$TEMP\"/Q154.qualifiers.tsv.gz | wc"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 5: consolidate all the files"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--2020-12-23 18:28:45--  https://raw.githubusercontent.com/usc-isi-i2/kgtk/dev/kgtk-properties/kgtk.properties.tsv\n",
      "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...\n",
      "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 2617 (2.6K) [text/plain]\n",
      "Saving to: ‘/Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/kgtk.properties.tsv’\n",
      "\n",
      "/Users/pedroszekely 100%[===================>]   2.56K  --.-KB/s    in 0s      \n",
      "\n",
      "2020-12-23 18:28:46 (14.4 MB/s) - ‘/Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/kgtk.properties.tsv’ saved [2617/2617]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "!wget https://raw.githubusercontent.com/usc-isi-i2/kgtk/dev/kgtk-properties/kgtk.properties.tsv -O \"$TEMP\"/kgtk.properties.tsv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "node1     label        node2                                   id\n",
      "isa       label        \"is a\"@en                               isa-label-e79b73\n",
      "isa       alias        \"isa\"@en                                isa-alias-7773c5\n",
      "isa       description  \"Instance or subclass relationship\"@en  isa-description-0b5cdc\n",
      "isa       P31          Q18616576                               isa-P31-Q18616576\n",
      "isa       P31          Q28326461                               isa-P31-Q28326461\n",
      "isa       P31          Q18647519                               isa-P31-Q18647519\n",
      "isa       data_type    wikibase-item                           isa-data_type-643cc9\n",
      "P279star  label        \"is a\"@en                               P279star-label-e79b73\n",
      "P279star  alias        \"isa\"@en                                P279star-alias-7773c5\n"
     ]
    }
   ],
   "source": [
    "!head \"$TEMP\"/kgtk.properties.tsv | column -t -s $'\\t'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Check the contents of the property datatypes file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "id\tnode1\tlabel\tnode2\n",
      "P10-datatype\tP10\tdatatype\tcommonsMedia\n",
      "P1000-datatype\tP1000\tdatatype\twikibase-item\n",
      "P1001-datatype\tP1001\tdatatype\twikibase-item\n",
      "P1002-datatype\tP1002\tdatatype\twikibase-item\n",
      "P1003-datatype\tP1003\tdatatype\texternal-id\n",
      "P1004-datatype\tP1004\tdatatype\texternal-id\n",
      "P1005-datatype\tP1005\tdatatype\texternal-id\n",
      "P1006-datatype\tP1006\tdatatype\texternal-id\n",
      "P1007-datatype\tP1007\tdatatype\texternal-id\n",
      "zcat: error writing to output: Broken pipe\n"
     ]
    }
   ],
   "source": [
    "!zcat < \"$PROPERTY_DATATYPES\" | head"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "        1.26 real         0.76 user         0.11 sys\n"
     ]
    }
   ],
   "source": [
    "!$kypher -i \"$Q154GRAPH\" -i \"$PROPERTY_DATATYPES\" \\\n",
    "--match 'Q15: (n1)-[]->(), property: (n1)-[l:datatype]->(n2)' \\\n",
    "--return 'distinct l as id, n1 as node1, l.label as label, n2 as node2' \\\n",
    "-o \"$TEMP\"/Q154.metadata.property.datatype.tsv.gz"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 85,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "id\tnode1\tlabel\tnode2\n",
      "P10-datatype\tP10\tdatatype\tcommonsMedia\n",
      "P1001-datatype\tP1001\tdatatype\twikibase-item\n",
      "P1003-datatype\tP1003\tdatatype\texternal-id\n",
      "P1004-datatype\tP1004\tdatatype\texternal-id\n",
      "P1005-datatype\tP1005\tdatatype\texternal-id\n",
      "P1006-datatype\tP1006\tdatatype\texternal-id\n",
      "P101-datatype\tP101\tdatatype\twikibase-item\n",
      "P1014-datatype\tP1014\tdatatype\texternal-id\n",
      "P1015-datatype\tP1015\tdatatype\texternal-id\n"
     ]
    }
   ],
   "source": [
    "!zcat < \"$TEMP\"/Q154.metadata.property.datatype.tsv.gz | head"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "        8.95 real        11.87 user         0.73 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk cat \\\n",
    "-i \"$TEMP\"/labels.tsv.gz \\\n",
    "-i \"$TEMP\"/alias.tsv.gz \\\n",
    "-i \"$TEMP\"/description.tsv.gz \\\n",
    "-i \"$TEMP\"/Q154.edges.3.tsv.gz \\\n",
    "-i \"$TEMP\"/kgtk.properties.tsv \\\n",
    "-i \"$TEMP\"/Q154.metadata.property.datatype.tsv.gz \\\n",
    "/ compact \\\n",
    "/ sort2 \\\n",
    "-o \"$OUT\"/all.tsv.gz"
   ]
  },
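  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check on the consolidated file, the sketch below counts edges per `label` using only the Python standard library. It assumes that `OUT` is available as an environment variable in this kernel (the shell cells rely on `$OUT`); the helper name `count_labels` is ours for illustration and is not part of KGTK."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import csv\n",
    "import gzip\n",
    "import os\n",
    "from collections import Counter\n",
    "\n",
    "\n",
    "def count_labels(path):\n",
    "    \"\"\"Count edges per label in a gzipped KGTK TSV file with a header row.\"\"\"\n",
    "    counts = Counter()\n",
    "    with gzip.open(path, \"rt\", encoding=\"utf-8\", newline=\"\") as f:\n",
    "        # KGTK files are tab-separated text; quote handling is disabled\n",
    "        # because node2 values may contain literal double quotes.\n",
    "        reader = csv.DictReader(f, delimiter=\"\\t\", quoting=csv.QUOTE_NONE)\n",
    "        for row in reader:\n",
    "            counts[row[\"label\"]] += 1\n",
    "    return counts\n",
    "\n",
    "\n",
    "# Assumption: OUT is set as an environment variable; skip quietly otherwise.\n",
    "out_dir = os.environ.get(\"OUT\")\n",
    "if out_dir:\n",
    "    label_counts = count_labels(os.path.join(out_dir, \"all.tsv.gz\"))\n",
    "    for label, n in label_counts.most_common(10):\n",
    "        print(label, n)"
   ]
  },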
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "count(DISTINCT graph_35_c1.\"node1\")\n",
      "13147\n",
      "        0.92 real         0.79 user         0.10 sys\n"
     ]
    }
   ],
   "source": [
    "!$kypher -i \"$TEMP\"/Q154.edges.3.tsv.gz \\\n",
    "--match '(n1)-[]->()' \\\n",
    "--return 'count(distinct n1)'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  346639 1718566 20581359\n"
     ]
    }
   ],
   "source": [
    "!zcat < \"$OUT\"/all.tsv.gz | wc"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 6: partition the files to follow the conventions KGTK uses for Wikidata"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll use the `partition-wikidata` notebook to complete this step. The notebook expects a single input file that contains all edges and qualifiers together. We also need to specify a directory where the partitioned files will be created and a directory for temporary files. The temporary directory should be different from our `$TEMP` directory, because the partition notebook deletes any existing `.tsv` and `.tsv.gz` files in that folder."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {},
   "outputs": [],
   "source": [
    "!mkdir -p \"$OUT\"/parts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "        6.40 real         6.18 user         0.16 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk cat -i $OUT/all.tsv.gz -i $OUT/qualifiers.tsv.gz -o $TEMP/all_and_qualifiers.tsv.gz"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 86,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "id\tnode1\tlabel\tnode2\n",
      "P10-P1628-32b85d-7927ece6-0\tP10\tP1628\t\"http://www.w3.org/2006/vcard/ns#Video\"\n",
      "P10-P1628-acf60d-b8950832-0\tP10\tP1628\t\"https://schema.org/video\"\n",
      "P10-P1629-Q34508-bcc39400-0\tP10\tP1629\tQ34508\n",
      "P10-P1659-P1651-c4068028-0\tP10\tP1659\tP1651\n",
      "P10-P1659-P18-5e4b9c4f-0\tP10\tP1659\tP18\n",
      "P10-P1659-P4238-d21d1ac0-0\tP10\tP1659\tP4238\n",
      "P10-P1659-P51-86aca4c5-0\tP10\tP1659\tP51\n",
      "P10-P1855-Q15075950-7eff6d65-0\tP10\tP1855\tQ15075950\n",
      "P10-P1855-Q69063653-c8cdb04c-0\tP10\tP1855\tQ69063653\n",
      "zcat: error writing to output: Broken pipe\n"
     ]
    }
   ],
   "source": [
    "!zcat < $TEMP/all_and_qualifiers.tsv.gz | head"
   ]
  },
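  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The combined file mixes claims and qualifiers. The partition notebook separates them with the rule that a row is a qualifier when its `node1` value is the `id` of another row. The sketch below applies that rule to a few rows copied from earlier outputs in this notebook. The function name `split_claims_and_qualifiers` is illustrative and is not a KGTK API; the partition notebook itself does this with KGTK commands on sorted files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def split_claims_and_qualifiers(rows):\n",
    "    \"\"\"Split (id, node1, label, node2) rows into claims and qualifiers.\"\"\"\n",
    "    edge_ids = {row[0] for row in rows}\n",
    "    claims, qualifiers = [], []\n",
    "    for row in rows:\n",
    "        # A qualifier's node1 is the id of the edge it qualifies.\n",
    "        if row[1] in edge_ids:\n",
    "            qualifiers.append(row)\n",
    "        else:\n",
    "            claims.append(row)\n",
    "    return claims, qualifiers\n",
    "\n",
    "\n",
    "# Sample rows taken from the outputs shown earlier in this notebook.\n",
    "sample = [\n",
    "    (\"P10-P1855-Q15075950-7eff6d65-0\", \"P10\", \"P1855\", \"Q15075950\"),\n",
    "    (\"P10-P1855-Q15075950-7eff6d65-0-P3831-Q622550-0\",\n",
    "     \"P10-P1855-Q15075950-7eff6d65-0\", \"P3831\", \"Q622550\"),\n",
    "    (\"P10-P1659-P18-5e4b9c4f-0\", \"P10\", \"P1659\", \"P18\"),\n",
    "]\n",
    "claims, qualifiers = split_claims_and_qualifiers(sample)\n",
    "print(len(claims), \"claims;\", len(qualifiers), \"qualifiers\")"
   ]
  },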
  {
   "cell_type": "code",
   "execution_count": 87,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "fb65c07ac2d747fe83c873ace33123bd",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(HTML(value='Executing'), FloatProgress(value=0.0, max=49.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{'cells': [{'cell_type': 'markdown',\n",
       "   'metadata': {'tags': [],\n",
       "    'papermill': {'exception': False,\n",
       "     'start_time': '2020-12-24T04:01:18.778765',\n",
       "     'end_time': '2020-12-24T04:01:18.804088',\n",
       "     'duration': 0.025323,\n",
       "     'status': 'completed'}},\n",
       "   'source': '# Partitioning a subset of Wikidata\\n\\nThis notebook illustrates how to partition a Wikidata KGTK edges file.\\n\\nParameters are set up in the first cell so that we can run this notebook in batch mode. Example invocation command:\\n\\n```\\npapermill partition-wikidata.ipynb partition-wikidata.out.ipynb \\\\\\n-p wikidata_input_path /data3/rogers/kgtk/gd/kgtk_public_graphs/cache/wikidata-20201130/data/all.tsv.gz \\\\\\n-p wikidata_parts_path /data3/rogers/kgtk/gd/kgtk_public_graphs/cache/wikidata-20201130/parts \\\\\\n```\\n\\nHere is a sample of the records that might appear in the input KGTK file:\\n```\\nid\\tnode1\\tlabel\\tnode2\\trank\\tnode2;wikidatatype\\tlang\\nQ1-P1036-418bc4-78f5a565-0\\tQ1\\tP1036\\t\"113\"\\tnormal\\texternal-id\\t\\nQ1-P1343-Q19190511-ab132b87-0   Q1      P1343   Q19190511       normal  wikibase-item   \\nQ1-P18-92a7b3-0dcac501-0        Q1      P18     \"Hubble ultra deep field.jpg\"   normal  commonsMedia    \\nQ1-P2386-cedfb0-0fdbd641-0      Q1      P2386   +880000000000000000000000Q828224        normal  quantity        \\nQ1-P580-a2fccf-63cf4743-0       Q1      P580    ^-13798000000-00-00T00:00:00Z/3 normal  time    \\nQ1-P920-47c0f2-52689c4e-0       Q1      P920    \"LEM201201756\"  normal  string  \\nQ1-P1343-Q19190511-ab132b87-0-P805-Q84065667-0  Q1-P1343-Q19190511-ab132b87-0   P805    Q84065667               wikibase-item   \\nQ1-P1343-Q88672152-5080b9e2-0-P304-5724c3-0     Q1-P1343-Q88672152-5080b9e2-0   P304    \"13-36\"         string  \\nQ1-P2670-Q18343-030eb87e-0-P1107-ce87f8-0       Q1-P2670-Q18343-030eb87e-0      P1107   +0.70           quantity        \\nQ1-P793-Q273508-1900d69c-0-P585-a2fccf-0        Q1-P793-Q273508-1900d69c-0      P585    ^-13798000000-00-00T00:00:00Z/3         time    \\nP10-alias-en-282226-0   P10     alias   \\'gif\\'@en\\nP10-description-en      P10     description     \\'relevant video. For images, use the property P18. 
For film trailers, qualify with \\\\\"object has role\\\\\" (P3831)=\\\\\"trailer\\\\\" (Q622550)\\'@en                        en\\nP10-label-en    P10     label   \\'video\\'@en                      en\\nQ1-addl_wikipedia_sitelink-19e42a-0     Q1      addl_wikipedia_sitelink http://enwikiquote.org/wiki/Universe                    en\\nQ1-addl_wikipedia_sitelink-19e42a-0-language-0  Q1-addl_wikipedia_sitelink-19e42a-0     sitelink-language       en                      en\\nQ1-addl_wikipedia_sitelink-19e42a-0-site-0      Q1-addl_wikipedia_sitelink-19e42a-0     sitelink-site   enwikiquote                     en\\nQ1-addl_wikipedia_sitelink-19e42a-0-title-0     Q1-addl_wikipedia_sitelink-19e42a-0     sitelink-title  \"Universe\"                      en\\nQ1-wikipedia_sitelink-5e459a-0  Q1      wikipedia_sitelink      http://en.wikipedia.org/wiki/Universe                   en\\nQ1-wikipedia_sitelink-5e459a-0-badge-Q17437798  Q1-wikipedia_sitelink-5e459a-0  sitelink-badge  Q17437798                       en\\nQ1-wikipedia_sitelink-5e459a-0-language-0       Q1-wikipedia_sitelink-5e459a-0  sitelink-language       en                      en\\nQ1-wikipedia_sitelink-5e459a-0-site-0   Q1-wikipedia_sitelink-5e459a-0  sitelink-site   enwiki                  en\\nQ1-wikipedia_sitelink-5e459a-0-title-0  Q1-wikipedia_sitelink-5e459a-0  sitelink-title  \"Universe\"                      en\\n```\\nHere are some contraints on the contents of the input file:\\n- The input file starts with a KGTK header record.\\n  - In addition to the `id`, `node1`, `label`, and `node2` columns, the file may contain the `node2;wikidatatype` column.\\n  - The `node2;wikidatatype` column is used to partition claims by Wikidata property datatype.\\n  - If it does not exist, it will be created during the partitioning process and populated using `datatype` relationships.\\n  - If it does exist, any empty values in the column will be populated using `datatype` relationships.\\n- The `id` column must 
contain a nonempty value.\\n- The first section of an `id` value must be the `node` value for the record.\\n  - The qualifier extraction operations depend upon this constraint. \\n- In addition to the claims and qualifiers, the input file is expected to contain:\\n  - English language labels for all property entities appearing in the file.\\n- The input file ought to contain the following:\\n  - claims records,\\n  - qualifier records,\\n  - alias records in appropriate languages,\\n  - description records in appropriate languages,\\n  - label records in appropriate languages, and\\n  - sitelink records in appropriate languages.\\n  - `datatype` records that map Wikidata property entities to Wikidata property datatypes. These records are required if the input file does not contain the `node2;wikidatatype` column.\\n- Additionally, this script provides for the appearance of `type` records in the input file.\\n  - `type` records that list all `entityId` values and identify them as properties or items. These records provides a correctness check on the operation of `kgtk import-wikidata`, and may be deprecated in the future.\\n- The input file is assumed to be unsorted. If it is already sorted on the (`id` `node1` `label` `node2`) columns , then set the `presorted` parameter to `True` to shorten the execution time of this script.'},\n",
       "  {'cell_type': 'markdown',\n",
       "   'metadata': {'tags': [],\n",
       "    'papermill': {'exception': False,\n",
       "     'start_time': '2020-12-24T04:01:18.823948',\n",
       "     'end_time': '2020-12-24T04:01:18.922481',\n",
       "     'duration': 0.098533,\n",
       "     'status': 'completed'}},\n",
       "   'source': \"### Parameters for invoking the notebook\\n\\n| Parameter | Description | Default |\\n| --------- | ----------- | ------- |\\n| `wikidata_input_path` | A folder containing the Wikidata KGTK edges to partition. | '/data4/rogers/elicit/cache/datasets/wikidata-20200803/data/all.tsv.gz' |\\n| `wikidata_parts_path` | A folder to receive the partitioned Wikidata files, such as `part.wikibase-item.tsv.gz` | '/data4/rogers/elicit/cache/datasets/wikidata-20200803/parts' |\\n| `temp_folder_path` |    A folder that may be used for temporary files. | wikidata_parts_path + '/temp' |\\n| `gzip_command` |        The compression command for sorting. | 'pigz'  (Note: use version 2.4 or later)|\\n| `kgtk_command` |        The kgtk commmand. | 'time kgtk' |\\n| `kgtk_options` |        The kgtk commmand options. | '--debug --timing' |\\n| `kgtk_extension` |      The file extension for generated KGTK files. Appending `.gz` implies gzip compression. | 'tsv.gz' |\\n| `presorted` |           When True, the input file is already sorted on the (`id` `node1` `label` `node2`) columns. | 'False' |\\n| `sort_extras` |         Extra parameters for the sort program.  The default specifies a path for temporary files. Other useful parameters include '--buffer-size' and '--parallel'. | '--parallel 24 --buffer-size 30% --temporary-directory ' + temp_folder_path |\\n| `use_mgzip` |           When True, use the mgzip program where appropriate for faster compression. | 'True' |\\n| `verbose` |             When True, produce additional feedback messages. | 'True' |\\n\\nNote: if `pigz` version 2.4 (or later) is not available on your system, use `gzip`.\\n\"},\n",
       "  {'cell_type': 'code',\n",
       "   'execution_count': 1,\n",
       "   'metadata': {'tags': ['parameters'],\n",
       "    'papermill': {'exception': False,\n",
       "     'start_time': '2020-12-24T04:01:18.943621',\n",
       "     'end_time': '2020-12-24T04:01:18.971367',\n",
       "     'duration': 0.027746,\n",
       "     'status': 'completed'},\n",
       "    'execution': {'iopub.status.busy': '2020-12-24T04:01:18.968715Z',\n",
       "     'iopub.execute_input': '2020-12-24T04:01:18.969252Z',\n",
       "     'iopub.status.idle': '2020-12-24T04:01:18.970542Z',\n",
       "     'shell.execute_reply': '2020-12-24T04:01:18.971150Z'}},\n",
       "   'outputs': [],\n",
       "   'source': \"# Parameters\\nwikidata_input_path = '/data3/rogers/kgtk/gd/kgtk_public_graphs/cache/wikidata-20201130/data/all.tsv.gz'\\nwikidata_parts_path = '/data3/rogers/kgtk/gd/kgtk_public_graphs/cache/wikidata-20201130/parts'\\ntemp_folder_path =    wikidata_parts_path + '/temp'\\ngzip_command =        'pigz'\\nkgtk_command =        'time kgtk'\\nkgtk_options =        '--debug --timing'\\nkgtk_extension =      'tsv.gz'\\npresorted =           'False'\\nsort_extras =         '--parallel 24 --buffer-size 30% --temporary-directory ' + temp_folder_path\\nuse_mgzip =           'True'\\nverbose =             'True'\\n\"},\n",
       "  {'cell_type': 'code',\n",
       "   'metadata': {'tags': ['injected-parameters'],\n",
       "    'papermill': {'exception': False,\n",
       "     'start_time': '2020-12-24T04:01:18.989412',\n",
       "     'end_time': '2020-12-24T04:01:19.014298',\n",
       "     'duration': 0.024886,\n",
       "     'status': 'completed'},\n",
       "    'execution': {'iopub.status.busy': '2020-12-24T04:01:19.011516Z',\n",
       "     'iopub.execute_input': '2020-12-24T04:01:19.012024Z',\n",
       "     'iopub.status.idle': '2020-12-24T04:01:19.013826Z',\n",
       "     'shell.execute_reply': '2020-12-24T04:01:19.014178Z'}},\n",
       "   'execution_count': 2,\n",
       "   'source': '# Parameters\\nwikidata_input_path = (\\n    \"/Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/all_and_qualifiers.tsv.gz\"\\n)\\nwikidata_parts_path = \"/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts\"\\ntemp_folder_path = \"/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/temp\"\\nsort_extras = \"--buffer-size 30% --temporary-directory $OUT/parts/temp\"\\nverbose = False\\n',\n",
       "   'outputs': []},\n",
       "  {'cell_type': 'code',\n",
       "   'execution_count': 3,\n",
       "   'metadata': {'tags': [],\n",
       "    'papermill': {'exception': False,\n",
       "     'start_time': '2020-12-24T04:01:19.032647',\n",
       "     'end_time': '2020-12-24T04:01:19.061326',\n",
       "     'duration': 0.028679,\n",
       "     'status': 'completed'},\n",
       "    'execution': {'iopub.status.busy': '2020-12-24T04:01:19.057283Z',\n",
       "     'iopub.execute_input': '2020-12-24T04:01:19.057926Z',\n",
       "     'iopub.status.idle': '2020-12-24T04:01:19.060554Z',\n",
       "     'shell.execute_reply': '2020-12-24T04:01:19.061208Z'}},\n",
       "   'outputs': [{'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': \"wikidata_input_path = '/Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/all_and_qualifiers.tsv.gz'\\nwikidata_parts_path = '/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts'\\ntemp_folder_path = '/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/temp'\\ngzip_command = 'pigz'\\nkgtk_command = 'time kgtk'\\nkgtk_options = '--debug --timing'\\nkgtk_extension = 'tsv.gz'\\npresorted = 'False'\\nsort_extras = '--buffer-size 30% --temporary-directory $OUT/parts/temp'\\nuse_mgzip = 'True'\\nverbose = False\\n\"}],\n",
       "   'source': \"print('wikidata_input_path = %s' % repr(wikidata_input_path))\\nprint('wikidata_parts_path = %s' % repr(wikidata_parts_path))\\nprint('temp_folder_path = %s' % repr(temp_folder_path))\\nprint('gzip_command = %s' % repr(gzip_command))\\nprint('kgtk_command = %s' % repr(kgtk_command))\\nprint('kgtk_options = %s' % repr(kgtk_options))\\nprint('kgtk_extension = %s' % repr(kgtk_extension))\\nprint('presorted = %s' % repr(presorted))\\nprint('sort_extras = %s' % repr(sort_extras))\\nprint('use_mgzip = %s' % repr(use_mgzip))\\nprint('verbose = %s' % repr(verbose))\\n\"},\n",
       "  {'cell_type': 'markdown',\n",
       "   'metadata': {'tags': [],\n",
       "    'papermill': {'exception': False,\n",
       "     'start_time': '2020-12-24T04:01:19.080737',\n",
       "     'end_time': '2020-12-24T04:01:19.099938',\n",
       "     'duration': 0.019201,\n",
       "     'status': 'completed'}},\n",
       "   'source': '### Create working folders and empty them'},\n",
       "  {'cell_type': 'code',\n",
       "   'execution_count': 4,\n",
       "   'metadata': {'tags': [],\n",
       "    'papermill': {'exception': False,\n",
       "     'start_time': '2020-12-24T04:01:19.119819',\n",
       "     'end_time': '2020-12-24T04:01:19.391809',\n",
       "     'duration': 0.27199,\n",
       "     'status': 'completed'},\n",
       "    'execution': {'iopub.status.busy': '2020-12-24T04:01:19.143818Z',\n",
       "     'iopub.execute_input': '2020-12-24T04:01:19.144596Z',\n",
       "     'iopub.status.idle': '2020-12-24T04:01:19.390949Z',\n",
       "     'shell.execute_reply': '2020-12-24T04:01:19.391649Z'}},\n",
       "   'outputs': [{'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': 'mkdir: /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts: File exists\\r\\n'},\n",
       "    {'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': 'mkdir: /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/temp: File exists\\r\\n'}],\n",
       "   'source': '!mkdir {wikidata_parts_path}\\n!mkdir {temp_folder_path}'},\n",
       "  {'cell_type': 'code',\n",
       "   'execution_count': 5,\n",
       "   'metadata': {'tags': [],\n",
       "    'papermill': {'exception': False,\n",
       "     'start_time': '2020-12-24T04:01:19.420593',\n",
       "     'end_time': '2020-12-24T04:01:19.692890',\n",
       "     'duration': 0.272297,\n",
       "     'status': 'completed'},\n",
       "    'execution': {'iopub.status.busy': '2020-12-24T04:01:19.453075Z',\n",
       "     'iopub.execute_input': '2020-12-24T04:01:19.453750Z',\n",
       "     'iopub.status.idle': '2020-12-24T04:01:19.691403Z',\n",
       "     'shell.execute_reply': '2020-12-24T04:01:19.692716Z'}},\n",
       "   'outputs': [{'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': 'rm: /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/*.tsv: No such file or directory\\r\\nrm: /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/*.tsv.gz: No such file or directory\\r\\n'},\n",
       "    {'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': 'rm: /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/temp/*.tsv: No such file or directory\\r\\nrm: /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/temp/*.tsv.gz: No such file or directory\\r\\n'}],\n",
       "   'source': '!rm {wikidata_parts_path}/*.tsv {wikidata_parts_path}/*.tsv.gz\\n!rm {temp_folder_path}/*.tsv {temp_folder_path}/*.tsv.gz'},\n",
       "  {'cell_type': 'markdown',\n",
       "   'metadata': {'tags': [],\n",
       "    'papermill': {'exception': False,\n",
       "     'start_time': '2020-12-24T04:01:19.724442',\n",
       "     'end_time': '2020-12-24T04:01:19.750139',\n",
       "     'duration': 0.025697,\n",
       "     'status': 'completed'}},\n",
       "   'source': '### Sort the Input Data Unless Presorted\\nSort the input data file by (id, node1, label, node2).\\nThis may take a while.'},\n",
       "  {'cell_type': 'code',\n",
       "   'execution_count': 6,\n",
       "   'metadata': {'tags': [],\n",
       "    'papermill': {'exception': False,\n",
       "     'start_time': '2020-12-24T04:01:19.772923',\n",
       "     'end_time': '2020-12-24T04:01:23.550324',\n",
       "     'duration': 3.777401,\n",
       "     'status': 'completed'},\n",
       "    'execution': {'iopub.status.busy': '2020-12-24T04:01:19.803414Z',\n",
       "     'iopub.execute_input': '2020-12-24T04:01:19.804062Z',\n",
       "     'iopub.status.idle': '2020-12-24T04:01:23.549339Z',\n",
       "     'shell.execute_reply': '2020-12-24T04:01:23.550119Z'}},\n",
       "   'outputs': [{'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': \"Sorting the input file '/Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/all_and_qualifiers.tsv.gz'.\\n\"},\n",
       "    {'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': 'Timing: elapsed=0:00:03.387055 CPU=0:00:00.823541 ( 24.3%): sort2 --verbose=False --gzip-command=pigz --input-file /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/all_and_qualifiers.tsv.gz --output-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/all.tsv.gz --columns id node1 label node2 --extra --buffer-size 30% --temporary-directory /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/temp\\r\\n'},\n",
       "    {'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': '\\r\\nreal\\t0m3.627s\\r\\nuser\\t0m3.006s\\r\\nsys\\t0m0.326s\\r\\n'}],\n",
       "   'source': 'if presorted.lower() == \"true\": \\n    print(\\'Using a presorted input file %s.\\' % repr(wikidata_input_path))\\n    partition_input_file = wikidata_input_path \\nelse: \\n    print(\\'Sorting the input file %s.\\' % repr(wikidata_input_path))\\n    partition_input_file = wikidata_parts_path + \\'/all.\\' + kgtk_extension \\n    !{kgtk_command} {kgtk_options} sort2 --verbose={verbose} --gzip-command={gzip_command} \\\\\\n --input-file {wikidata_input_path} \\\\\\n --output-file {partition_input_file} \\\\\\n --columns     id node1 label node2 \\\\\\n --extra       \"{sort_extras}\"'},\n",
       "  {'cell_type': 'markdown',\n",
       "   'metadata': {'tags': [],\n",
       "    'papermill': {'exception': False,\n",
       "     'start_time': '2020-12-24T04:01:23.576957',\n",
       "     'end_time': '2020-12-24T04:01:23.601527',\n",
       "     'duration': 0.02457,\n",
       "     'status': 'completed'}},\n",
       "   'source': '### Partition the Claims, Qualifiers, and Entity Data\\nSplit out the entity data (alias, description, label, and sitelinks) and additional metadata (datatype, type).  Separate the qualifiers from the claims.\\n'},\n",
       "  {'cell_type': 'code',\n",
       "   'execution_count': 7,\n",
       "   'metadata': {'tags': [],\n",
       "    'papermill': {'exception': False,\n",
       "     'start_time': '2020-12-24T04:01:23.624491',\n",
       "     'end_time': '2020-12-24T04:01:31.645484',\n",
       "     'duration': 8.020993,\n",
       "     'status': 'completed'},\n",
       "    'execution': {'iopub.status.busy': '2020-12-24T04:01:23.658652Z',\n",
       "     'iopub.execute_input': '2020-12-24T04:01:23.677999Z',\n",
       "     'iopub.status.idle': '2020-12-24T04:01:31.644483Z',\n",
       "     'shell.execute_reply': '2020-12-24T04:01:31.645314Z'}},\n",
       "   'outputs': [{'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': 'Timing: elapsed=0:00:06.827845 CPU=0:00:06.726057 ( 98.5%): filter --verbose=False --use-mgzip=True --first-match-only --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/all.tsv.gz -p ; datatype ; -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/metadata.property.datatypes.tsv.gz -p ; alias ; -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/aliases.tsv.gz -p ; description ; -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/descriptions.tsv.gz -p ; label ; -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/labels.tsv.gz -p ; addl_wikipedia_sitelink,wikipedia_sitelink ; -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/sitelinks.tsv.gz -p ; sitelink-badge,sitelink-language,sitelink-site,sitelink-title ; -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/sitelinks.qualifiers.tsv.gz -p ; type ; -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/metadata.types.tsv.gz --reject-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/temp/claims-and-qualifiers.sorted-by-id.tsv.gz\\r\\n'},\n",
       "    {'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': '\\r\\nreal\\t0m7.802s\\r\\nuser\\t0m6.653s\\r\\nsys\\t0m0.252s\\r\\n'}],\n",
       "   'source': \"!{kgtk_command} {kgtk_options} filter --verbose={verbose} --use-mgzip={use_mgzip} --first-match-only \\\\\\n --input-file {partition_input_file} \\\\\\n -p '; datatype ;'        -o {wikidata_parts_path}/metadata.property.datatypes.{kgtk_extension} \\\\\\n -p '; alias ;'           -o {wikidata_parts_path}/aliases.{kgtk_extension} \\\\\\n -p '; description ;'     -o {wikidata_parts_path}/descriptions.{kgtk_extension} \\\\\\n -p '; label ;'           -o {wikidata_parts_path}/labels.{kgtk_extension} \\\\\\n -p '; addl_wikipedia_sitelink,wikipedia_sitelink ;' \\\\\\n                          -o {wikidata_parts_path}/sitelinks.{kgtk_extension} \\\\\\n -p '; sitelink-badge,sitelink-language,sitelink-site,sitelink-title ;' \\\\\\n                          -o {wikidata_parts_path}/sitelinks.qualifiers.{kgtk_extension} \\\\\\n -p '; type ;'            -o {wikidata_parts_path}/metadata.types.{kgtk_extension} \\\\\\n --reject-file {temp_folder_path}/claims-and-qualifiers.sorted-by-id.{kgtk_extension}\"},\n",
       "  {'cell_type': 'markdown',\n",
       "   'metadata': {'tags': [],\n",
       "    'papermill': {'exception': False,\n",
       "     'start_time': '2020-12-24T04:01:31.675980',\n",
       "     'end_time': '2020-12-24T04:01:31.699820',\n",
       "     'duration': 0.02384,\n",
       "     'status': 'completed'}},\n",
       "   'source': '### Sort the claims and qualifiers on Node1\\nSort the combined claims and qualifiers file by the node1 column.\\nThis may take a while.'},\n",
       "  {'cell_type': 'code',\n",
       "   'execution_count': 8,\n",
       "   'metadata': {'tags': [],\n",
       "    'papermill': {'exception': False,\n",
       "     'start_time': '2020-12-24T04:01:31.721352',\n",
       "     'end_time': '2020-12-24T04:01:33.048996',\n",
       "     'duration': 1.327644,\n",
       "     'status': 'completed'},\n",
       "    'execution': {'iopub.status.busy': '2020-12-24T04:01:31.746849Z',\n",
       "     'iopub.execute_input': '2020-12-24T04:01:31.747450Z',\n",
       "     'iopub.status.idle': '2020-12-24T04:01:33.047944Z',\n",
       "     'shell.execute_reply': '2020-12-24T04:01:33.048824Z'}},\n",
       "   'outputs': [{'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': 'Timing: elapsed=0:00:01.061974 CPU=0:00:00.680296 ( 64.1%): sort2 --verbose=False --gzip-command=pigz --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/temp/claims-and-qualifiers.sorted-by-id.tsv.gz --output-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/temp/claims-and-qualifiers.sorted-by-node1.tsv.gz --columns node1 --extra --buffer-size 30% --temporary-directory /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/temp\\r\\n'},\n",
       "    {'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': '\\r\\nreal\\t0m1.183s\\r\\nuser\\t0m1.964s\\r\\nsys\\t0m0.176s\\r\\n'}],\n",
       "   'source': '!{kgtk_command} {kgtk_options} sort2 --verbose={verbose} --gzip-command={gzip_command} \\\\\\n --input-file {temp_folder_path}/claims-and-qualifiers.sorted-by-id.{kgtk_extension} \\\\\\n --output-file {temp_folder_path}/claims-and-qualifiers.sorted-by-node1.{kgtk_extension}\\\\\\n --columns     node1 \\\\\\n --extra       \"{sort_extras}\"'},\n",
       "  {'cell_type': 'markdown',\n",
       "   'metadata': {'tags': [],\n",
       "    'papermill': {'exception': False,\n",
       "     'start_time': '2020-12-24T04:01:33.078983',\n",
       "     'end_time': '2020-12-24T04:01:33.103943',\n",
       "     'duration': 0.02496,\n",
       "     'status': 'completed'}},\n",
       "   'source': \"### Split the claims and qualifiers\\nIf row A's node1 value matches some other row's id value, the then row A is a qualifier.\"},\n",
       "  {'cell_type': 'code',\n",
       "   'execution_count': 9,\n",
       "   'metadata': {'tags': [],\n",
       "    'papermill': {'exception': False,\n",
       "     'start_time': '2020-12-24T04:01:33.127543',\n",
       "     'end_time': '2020-12-24T04:01:39.868141',\n",
       "     'duration': 6.740598,\n",
       "     'status': 'completed'},\n",
       "    'execution': {'iopub.status.busy': '2020-12-24T04:01:33.155629Z',\n",
       "     'iopub.execute_input': '2020-12-24T04:01:33.156229Z',\n",
       "     'iopub.status.idle': '2020-12-24T04:01:39.867126Z',\n",
       "     'shell.execute_reply': '2020-12-24T04:01:39.867972Z'}},\n",
       "   'outputs': [{'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': 'Timing: elapsed=0:00:06.268601 CPU=0:00:06.180480 ( 98.6%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/temp/claims-and-qualifiers.sorted-by-node1.tsv.gz --filter-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/temp/claims-and-qualifiers.sorted-by-id.tsv.gz --output-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/temp/qualifiers.sorted-by-node1.tsv.gz --reject-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/temp/claims.sorted-by-node1.tsv.gz --input-keys node1 --filter-keys id\\r\\n'},\n",
       "    {'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': '\\r\\nreal\\t0m6.595s\\r\\nuser\\t0m6.092s\\r\\nsys\\t0m0.225s\\r\\n'}],\n",
       "   'source': '!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \\\\\\n --input-file {temp_folder_path}/claims-and-qualifiers.sorted-by-node1.{kgtk_extension} \\\\\\n --filter-file {temp_folder_path}/claims-and-qualifiers.sorted-by-id.{kgtk_extension} \\\\\\n --output-file {temp_folder_path}/qualifiers.sorted-by-node1.{kgtk_extension}\\\\\\n --reject-file {temp_folder_path}/claims.sorted-by-node1.{kgtk_extension}\\\\\\n --input-keys node1 \\\\\\n --filter-keys id'},\n",
       "  {'cell_type': 'markdown',\n",
       "   'metadata': {'tags': [],\n",
       "    'papermill': {'exception': False,\n",
       "     'start_time': '2020-12-24T04:01:39.913151',\n",
       "     'end_time': '2020-12-24T04:01:39.941215',\n",
       "     'duration': 0.028064,\n",
       "     'status': 'completed'}},\n",
       "   'source': '### Sort the claims by ID\\nSort the split claims by id, node1, label, node2.\\nThis may take a while.'},\n",
       "  {'cell_type': 'code',\n",
       "   'execution_count': 10,\n",
       "   'metadata': {'tags': [],\n",
       "    'papermill': {'exception': False,\n",
       "     'start_time': '2020-12-24T04:01:39.965000',\n",
       "     'end_time': '2020-12-24T04:01:41.342312',\n",
       "     'duration': 1.377312,\n",
       "     'status': 'completed'},\n",
       "    'execution': {'iopub.status.busy': '2020-12-24T04:01:39.997328Z',\n",
       "     'iopub.execute_input': '2020-12-24T04:01:39.998314Z',\n",
       "     'iopub.status.idle': '2020-12-24T04:01:41.341422Z',\n",
       "     'shell.execute_reply': '2020-12-24T04:01:41.342149Z'}},\n",
       "   'outputs': [{'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': 'Timing: elapsed=0:00:01.079537 CPU=0:00:00.685110 ( 63.5%): sort2 --verbose=False --gzip-command=pigz --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/temp/claims.sorted-by-node1.tsv.gz --output-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/temp/claims.no-datatype.tsv.gz --columns id node1 label node2 --extra --buffer-size 30% --temporary-directory /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/temp\\r\\n'},\n",
       "    {'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': '\\r\\nreal\\t0m1.226s\\r\\nuser\\t0m1.637s\\r\\nsys\\t0m0.170s\\r\\n'}],\n",
       "   'source': '!{kgtk_command} {kgtk_options} sort2 --verbose={verbose} --gzip-command={gzip_command} \\\\\\n --input-file {temp_folder_path}/claims.sorted-by-node1.{kgtk_extension} \\\\\\n --output-file {temp_folder_path}/claims.no-datatype.{kgtk_extension}\\\\\\n --columns     id node1 label node2 \\\\\\n --extra       \"{sort_extras}\"'},\n",
       "  {'cell_type': 'markdown',\n",
       "   'metadata': {'tags': [],\n",
       "    'papermill': {'exception': False,\n",
       "     'start_time': '2020-12-24T04:01:41.372564',\n",
       "     'end_time': '2020-12-24T04:01:41.396939',\n",
       "     'duration': 0.024375,\n",
       "     'status': 'completed'}},\n",
       "   'source': '### Merge the Wikidata Property Datatypes into the claims\\nMerge the Wikidata Property Datatypes into the claims row as node2;wikidatatype. This column will be used to partition the claims by Wikidata Property Datatype ina later step.  If the claims file already has a node2;wikidatatype column, lift only when that column has an empty value.\\n'},\n",
       "  {'cell_type': 'code',\n",
       "   'execution_count': 11,\n",
       "   'metadata': {'tags': [],\n",
       "    'papermill': {'exception': False,\n",
       "     'start_time': '2020-12-24T04:01:41.422214',\n",
       "     'end_time': '2020-12-24T04:01:44.940664',\n",
       "     'duration': 3.51845,\n",
       "     'status': 'completed'},\n",
       "    'execution': {'iopub.status.busy': '2020-12-24T04:01:41.450977Z',\n",
       "     'iopub.execute_input': '2020-12-24T04:01:41.451612Z',\n",
       "     'iopub.status.idle': '2020-12-24T04:01:44.939766Z',\n",
       "     'shell.execute_reply': '2020-12-24T04:01:44.940503Z'}},\n",
       "   'outputs': [{'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': 'Timing: elapsed=0:00:03.010786 CPU=0:00:02.979860 ( 99.0%): lift --verbose=False --use-mgzip=True --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/temp/claims.no-datatype.tsv.gz --columns-to-lift label --overwrite False --label-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/metadata.property.datatypes.tsv.gz --label-value datatype --output-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.tsv.gz --columns-to-write node2;wikidatatype\\r\\n'},\n",
       "    {'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': '\\r\\nreal\\t0m3.375s\\r\\nuser\\t0m2.963s\\r\\nsys\\t0m0.135s\\r\\n'}],\n",
       "   'source': \"!{kgtk_command} {kgtk_options} lift --verbose={verbose} --use-mgzip={use_mgzip} \\\\\\n --input-file {temp_folder_path}/claims.no-datatype.{kgtk_extension} \\\\\\n --columns-to-lift label \\\\\\n --overwrite False \\\\\\n --label-file {wikidata_parts_path}/metadata.property.datatypes.{kgtk_extension}\\\\\\n --label-value datatype \\\\\\n --output-file {wikidata_parts_path}/claims.{kgtk_extension}\\\\\\n --columns-to-write 'node2;wikidatatype'\"},\n",
       "  {'cell_type': 'markdown',\n",
       "   'metadata': {'tags': [],\n",
       "    'papermill': {'exception': False,\n",
       "     'start_time': '2020-12-24T04:01:44.973475',\n",
       "     'end_time': '2020-12-24T04:01:44.999935',\n",
       "     'duration': 0.02646,\n",
       "     'status': 'completed'}},\n",
       "   'source': '### Sort the qualifiers by ID\\nSort the split qualifiers by id, node1, label, node2.\\nThis may take a while.'},\n",
       "  {'cell_type': 'code',\n",
       "   'execution_count': 12,\n",
       "   'metadata': {'tags': [],\n",
       "    'papermill': {'exception': False,\n",
       "     'start_time': '2020-12-24T04:01:45.024688',\n",
       "     'end_time': '2020-12-24T04:01:46.277644',\n",
       "     'duration': 1.252956,\n",
       "     'status': 'completed'},\n",
       "    'execution': {'iopub.status.busy': '2020-12-24T04:01:45.053828Z',\n",
       "     'iopub.execute_input': '2020-12-24T04:01:45.054512Z',\n",
       "     'iopub.status.idle': '2020-12-24T04:01:46.276755Z',\n",
       "     'shell.execute_reply': '2020-12-24T04:01:46.277477Z'}},\n",
       "   'outputs': [{'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': 'Timing: elapsed=0:00:00.971581 CPU=0:00:00.670670 ( 69.0%): sort2 --verbose=False --gzip-command=pigz --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/temp/qualifiers.sorted-by-node1.tsv.gz --output-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.tsv.gz --columns id node1 label node2 --extra --buffer-size 30% --temporary-directory /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/temp\\r\\n\\r\\nreal\\t0m1.109s\\r\\nuser\\t0m1.389s\\r\\nsys\\t0m0.159s\\r\\n'}],\n",
       "   'source': '!{kgtk_command} {kgtk_options} sort2 --verbose={verbose} --gzip-command={gzip_command} \\\\\\n --input-file {temp_folder_path}/qualifiers.sorted-by-node1.{kgtk_extension} \\\\\\n --output-file {wikidata_parts_path}/qualifiers.{kgtk_extension}\\\\\\n --columns     id node1 label node2 \\\\\\n --extra       \"{sort_extras}\"'},\n",
       "  {'cell_type': 'markdown',\n",
       "   'metadata': {'tags': [],\n",
       "    'papermill': {'exception': False,\n",
       "     'start_time': '2020-12-24T04:01:46.311060',\n",
       "     'end_time': '2020-12-24T04:01:46.342233',\n",
       "     'duration': 0.031173,\n",
       "     'status': 'completed'}},\n",
       "   'source': \"### Extract the English aliases, descriptions, labels, and sitelinks.\\nAliases, descriptions, and labels are extracted by selecting rows where the `node2` value ends in the language suffix for English (`@en`) in a KGTK language-qualified string. This is an abbreviated pattern; a more general pattern would include the single quotes used to delimit a KGTK language-qualified string. If `kgtk import-wikidata` has executed properly, the abbreviated pattern should be sufficient.\\n\\nSitelink rows do not have a language-specific marker in the `node2` value. We use the `lang` column to provide the language code for English ('en').  The `lang` column is an additional column created by `kgtk import-wikidata`.\"},\n",
       "  {'cell_type': 'code',\n",
       "   'execution_count': 13,\n",
       "   'metadata': {'tags': [],\n",
       "    'papermill': {'exception': False,\n",
       "     'start_time': '2020-12-24T04:01:46.370936',\n",
       "     'end_time': '2020-12-24T04:01:48.107618',\n",
       "     'duration': 1.736682,\n",
       "     'status': 'completed'},\n",
       "    'execution': {'iopub.status.busy': '2020-12-24T04:01:46.401568Z',\n",
       "     'iopub.execute_input': '2020-12-24T04:01:46.402217Z',\n",
       "     'iopub.status.idle': '2020-12-24T04:01:48.106672Z',\n",
       "     'shell.execute_reply': '2020-12-24T04:01:48.107445Z'}},\n",
       "   'outputs': [{'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': 'Timing: elapsed=0:00:01.442662 CPU=0:00:01.420834 ( 98.5%): filter --verbose=False --use-mgzip=True --regex --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/aliases.tsv.gz -p ;; ^.*@en$ -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/aliases.en.tsv.gz\\r\\n'},\n",
       "    {'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': '\\r\\nreal\\t0m1.591s\\r\\nuser\\t0m1.425s\\r\\nsys\\t0m0.117s\\r\\n'}],\n",
       "   'source': \"!{kgtk_command} {kgtk_options} filter --verbose={verbose} --use-mgzip={use_mgzip} --regex \\\\\\n --input-file {wikidata_parts_path}/aliases.{kgtk_extension} \\\\\\n -p ';; ^.*@en$' -o {wikidata_parts_path}/aliases.en.{kgtk_extension}\"},\n",
       "  {'cell_type': 'code',\n",
       "   'execution_count': 14,\n",
       "   'metadata': {'tags': [],\n",
       "    'papermill': {'exception': False,\n",
       "     'start_time': '2020-12-24T04:01:48.141826',\n",
       "     'end_time': '2020-12-24T04:01:49.943003',\n",
       "     'duration': 1.801177,\n",
       "     'status': 'completed'},\n",
       "    'execution': {'iopub.status.busy': '2020-12-24T04:01:48.178097Z',\n",
       "     'iopub.execute_input': '2020-12-24T04:01:48.178842Z',\n",
       "     'iopub.status.idle': '2020-12-24T04:01:49.942122Z',\n",
       "     'shell.execute_reply': '2020-12-24T04:01:49.942839Z'}},\n",
       "   'outputs': [{'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': 'Timing: elapsed=0:00:01.462942 CPU=0:00:01.445114 ( 98.8%): filter --verbose=False --use-mgzip=True --regex --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/descriptions.tsv.gz -p ;; ^.*@en$ -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/descriptions.en.tsv.gz\\r\\n'},\n",
       "    {'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': '\\r\\nreal\\t0m1.647s\\r\\nuser\\t0m1.459s\\r\\nsys\\t0m0.126s\\r\\n'}],\n",
       "   'source': \"!{kgtk_command} {kgtk_options} filter --verbose={verbose} --use-mgzip={use_mgzip} --regex \\\\\\n --input-file {wikidata_parts_path}/descriptions.{kgtk_extension} \\\\\\n -p ';; ^.*@en$' -o {wikidata_parts_path}/descriptions.en.{kgtk_extension}\"},\n",
       "  {'cell_type': 'code',\n",
       "   'execution_count': 15,\n",
       "   'metadata': {'tags': [],\n",
       "    'papermill': {'exception': False,\n",
       "     'start_time': '2020-12-24T04:01:49.976653',\n",
       "     'end_time': '2020-12-24T04:01:51.707139',\n",
       "     'duration': 1.730486,\n",
       "     'status': 'completed'},\n",
       "    'execution': {'iopub.status.busy': '2020-12-24T04:01:50.009197Z',\n",
       "     'iopub.execute_input': '2020-12-24T04:01:50.009746Z',\n",
       "     'iopub.status.idle': '2020-12-24T04:01:51.706240Z',\n",
       "     'shell.execute_reply': '2020-12-24T04:01:51.706973Z'}},\n",
       "   'outputs': [{'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': 'Timing: elapsed=0:00:01.414010 CPU=0:00:01.399556 ( 99.0%): filter --verbose=False --use-mgzip=True --regex --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/labels.tsv.gz -p ;; ^.*@en$ -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/labels.en.tsv.gz\\r\\n'},\n",
       "    {'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': '\\r\\nreal\\t0m1.578s\\r\\nuser\\t0m1.413s\\r\\nsys\\t0m0.117s\\r\\n'}],\n",
       "   'source': \"!{kgtk_command} {kgtk_options} filter --verbose={verbose} --use-mgzip={use_mgzip} --regex \\\\\\n --input-file {wikidata_parts_path}/labels.{kgtk_extension} \\\\\\n -p ';; ^.*@en$' -o {wikidata_parts_path}/labels.en.{kgtk_extension}\"},\n",
       "  {'cell_type': 'code',\n",
       "   'execution_count': 16,\n",
       "   'metadata': {'tags': [],\n",
       "    'papermill': {'exception': False,\n",
       "     'start_time': '2020-12-24T04:01:51.741666',\n",
       "     'end_time': '2020-12-24T04:01:52.912481',\n",
       "     'duration': 1.170815,\n",
       "     'status': 'completed'},\n",
       "    'execution': {'iopub.status.busy': '2020-12-24T04:01:51.775646Z',\n",
       "     'iopub.execute_input': '2020-12-24T04:01:51.776326Z',\n",
       "     'iopub.status.idle': '2020-12-24T04:01:52.911596Z',\n",
       "     'shell.execute_reply': '2020-12-24T04:01:52.912314Z'}},\n",
       "   'outputs': [{'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': 'Timing: elapsed=0:00:00.700093 CPU=0:00:00.693235 ( 99.0%): filter --verbose=False --use-mgzip=True --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/sitelinks.qualifiers.tsv.gz -p ; sitelink-language ; en -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/temp/sitelinks.language.en.tsv.gz\\r\\n'},\n",
       "    {'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': '\\r\\nreal\\t0m1.021s\\r\\nuser\\t0m0.712s\\r\\nsys\\t0m0.099s\\r\\n'}],\n",
       "   'source': \"!{kgtk_command} {kgtk_options} filter --verbose={verbose} --use-mgzip={use_mgzip} \\\\\\n --input-file {wikidata_parts_path}/sitelinks.qualifiers.{kgtk_extension} \\\\\\n -p '; sitelink-language ; en' -o {temp_folder_path}/sitelinks.language.en.{kgtk_extension}\"},\n",
       "  {'cell_type': 'code',\n",
       "   'execution_count': 17,\n",
       "   'metadata': {'tags': [],\n",
       "    'papermill': {'exception': False,\n",
       "     'start_time': '2020-12-24T04:01:52.947779',\n",
       "     'end_time': '2020-12-24T04:01:54.343297',\n",
       "     'duration': 1.395518,\n",
       "     'status': 'completed'},\n",
       "    'execution': {'iopub.status.busy': '2020-12-24T04:01:52.981706Z',\n",
       "     'iopub.execute_input': '2020-12-24T04:01:52.982265Z',\n",
       "     'iopub.status.idle': '2020-12-24T04:01:54.342572Z',\n",
       "     'shell.execute_reply': '2020-12-24T04:01:54.343081Z'}},\n",
       "   'outputs': [{'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': 'Timing: elapsed=0:00:00.827157 CPU=0:00:00.810847 ( 98.0%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/sitelinks.tsv.gz --filter-on /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/temp/sitelinks.language.en.tsv.gz --output-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/sitelinks.en.tsv.gz --input-keys id --filter-keys node1\\r\\n'},\n",
       "    {'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': '\\r\\nreal\\t0m1.247s\\r\\nuser\\t0m0.817s\\r\\nsys\\t0m0.111s\\r\\n'}],\n",
       "   'source': '!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \\\\\\n --input-file {wikidata_parts_path}/sitelinks.{kgtk_extension} \\\\\\n --filter-on {temp_folder_path}/sitelinks.language.en.{kgtk_extension} \\\\\\n --output-file {wikidata_parts_path}/sitelinks.en.{kgtk_extension} \\\\\\n --input-keys  id \\\\\\n --filter-keys node1'},\n",
       "  {'cell_type': 'code',\n",
       "   'execution_count': 18,\n",
       "   'metadata': {'tags': [],\n",
       "    'papermill': {'exception': False,\n",
       "     'start_time': '2020-12-24T04:01:54.382473',\n",
       "     'end_time': '2020-12-24T04:01:55.721828',\n",
       "     'duration': 1.339355,\n",
       "     'status': 'completed'},\n",
       "    'execution': {'iopub.status.busy': '2020-12-24T04:01:54.423482Z',\n",
       "     'iopub.execute_input': '2020-12-24T04:01:54.424161Z',\n",
       "     'iopub.status.idle': '2020-12-24T04:01:55.720973Z',\n",
       "     'shell.execute_reply': '2020-12-24T04:01:55.721666Z'}},\n",
       "   'outputs': [{'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': 'Timing: elapsed=0:00:00.733432 CPU=0:00:00.720798 ( 98.3%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/sitelinks.qualifiers.tsv.gz --filter-on /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/temp/sitelinks.language.en.tsv.gz --output-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/sitelinks.qualifiers.en.tsv.gz --input-keys node1 --filter-keys node1\\r\\n'},\n",
       "    {'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': '\\r\\nreal\\t0m1.180s\\r\\nuser\\t0m0.747s\\r\\nsys\\t0m0.113s\\r\\n'}],\n",
       "   'source': '!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \\\\\\n --input-file {wikidata_parts_path}/sitelinks.qualifiers.{kgtk_extension} \\\\\\n --filter-on {temp_folder_path}/sitelinks.language.en.{kgtk_extension} \\\\\\n --output-file {wikidata_parts_path}/sitelinks.qualifiers.en.{kgtk_extension} \\\\\\n --input-keys  node1 \\\\\\n --filter-keys node1'},\n",
       "  {'cell_type': 'markdown',\n",
       "   'metadata': {'tags': [],\n",
       "    'papermill': {'exception': False,\n",
       "     'start_time': '2020-12-24T04:01:55.757919',\n",
       "     'end_time': '2020-12-24T04:01:55.788849',\n",
       "     'duration': 0.03093,\n",
       "     'status': 'completed'}},\n",
       "   'source': '### Partition the claims by Wikidata Property Datatype\\nWikidata has two names for each Wikidata property datatype: the name that appears in the JSON dump file, and the name that appears in the TTL dump file. `kgtk import-wikidata` currently imports rows from Wikikdata JSON dump files, and these are the names that appear below.\\n\\nThe `part.other` file catches any records that have an unknown Wikidata property datatype. Additional Wikidata property datatypes may occur when processing from certain Wikidata extensions.'},\n",
       "  {'cell_type': 'code',\n",
       "   'execution_count': 19,\n",
       "   'metadata': {'tags': [],\n",
       "    'papermill': {'exception': False,\n",
       "     'start_time': '2020-12-24T04:01:55.817159',\n",
       "     'end_time': '2020-12-24T04:01:56.912792',\n",
       "     'duration': 1.095633,\n",
       "     'status': 'completed'},\n",
       "    'execution': {'iopub.status.busy': '2020-12-24T04:01:55.852920Z',\n",
       "     'iopub.execute_input': '2020-12-24T04:01:55.853478Z',\n",
       "     'iopub.status.idle': '2020-12-24T04:01:56.911876Z',\n",
       "     'shell.execute_reply': '2020-12-24T04:01:56.912620Z'}},\n",
       "   'outputs': [{'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': 'Error: Cannot find the object column \\'node2;wikidatatype\\'.\\r\\nTraceback (most recent call last):\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/exceptions.py\", line 46, in __call__\\r\\n    return_code = func(*args, **kwargs) or 0\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/filter.py\", line 1169, in run\\r\\n    return process_plain()\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/filter.py\", line 682, in process_plain\\r\\n    raise KGTKException(\"Missing columns.\")\\r\\nkgtk.exceptions.KGTKException: Missing columns.\\r\\nMissing columns.\\r\\nTiming: elapsed=0:00:00.797013 CPU=0:00:00.776128 ( 97.4%): filter --verbose=False --use-mgzip=True --first-match-only --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.tsv.gz --obj node2;wikidatatype -p ;; commonsMedia -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.commonsMedia.tsv.gz -p ;; external-id -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.external-id.tsv.gz -p ;; geo-shape -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.geo-shape.tsv.gz -p ;; globe-coordinate -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.globe-coordinate.tsv.gz -p ;; math -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.math.tsv.gz -p ;; monolingualtext -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.monolingualtext.tsv.gz -p ;; musical-notation -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.musical-notation.tsv.gz -p ;; quantity -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.quantity.tsv.gz -p ;; string -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.string.tsv.gz -p ;; tabular-data -o 
/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.tabular-data.tsv.gz -p ;; time -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.time.tsv.gz -p ;; url -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.url.tsv.gz -p ;; wikibase-form -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-form.tsv.gz -p ;; wikibase-item -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-item.tsv.gz -p ;; wikibase-lexeme -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-lexeme.tsv.gz -p ;; wikibase-property -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-property.tsv.gz -p ;; wikibase-sense -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-sense.tsv.gz --reject-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.other.tsv.gz\\r\\n'},\n",
       "    {'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': '\\r\\nreal\\t0m0.943s\\r\\nuser\\t0m0.787s\\r\\nsys\\t0m0.112s\\r\\n'}],\n",
       "   'source': \"!{kgtk_command} {kgtk_options} filter --verbose={verbose} --use-mgzip={use_mgzip} --first-match-only \\\\\\n --input-file {wikidata_parts_path}/claims.{kgtk_extension} \\\\\\n --obj 'node2;wikidatatype' \\\\\\n -p ';; commonsMedia'      -o {wikidata_parts_path}/claims.commonsMedia.{kgtk_extension} \\\\\\n -p ';; external-id'       -o {wikidata_parts_path}/claims.external-id.{kgtk_extension} \\\\\\n -p ';; geo-shape'         -o {wikidata_parts_path}/claims.geo-shape.{kgtk_extension} \\\\\\n -p ';; globe-coordinate'  -o {wikidata_parts_path}/claims.globe-coordinate.{kgtk_extension} \\\\\\n -p ';; math'              -o {wikidata_parts_path}/claims.math.{kgtk_extension} \\\\\\n -p ';; monolingualtext'   -o {wikidata_parts_path}/claims.monolingualtext.{kgtk_extension} \\\\\\n -p ';; musical-notation'  -o {wikidata_parts_path}/claims.musical-notation.{kgtk_extension} \\\\\\n -p ';; quantity'          -o {wikidata_parts_path}/claims.quantity.{kgtk_extension} \\\\\\n -p ';; string'            -o {wikidata_parts_path}/claims.string.{kgtk_extension} \\\\\\n -p ';; tabular-data'      -o {wikidata_parts_path}/claims.tabular-data.{kgtk_extension} \\\\\\n -p ';; time'              -o {wikidata_parts_path}/claims.time.{kgtk_extension} \\\\\\n -p ';; url'               -o {wikidata_parts_path}/claims.url.{kgtk_extension} \\\\\\n -p ';; wikibase-form'     -o {wikidata_parts_path}/claims.wikibase-form.{kgtk_extension} \\\\\\n -p ';; wikibase-item'     -o {wikidata_parts_path}/claims.wikibase-item.{kgtk_extension} \\\\\\n -p ';; wikibase-lexeme'   -o {wikidata_parts_path}/claims.wikibase-lexeme.{kgtk_extension} \\\\\\n -p ';; wikibase-property' -o {wikidata_parts_path}/claims.wikibase-property.{kgtk_extension} \\\\\\n -p ';; wikibase-sense'    -o {wikidata_parts_path}/claims.wikibase-sense.{kgtk_extension} \\\\\\n --reject-file {wikidata_parts_path}/claims.other.{kgtk_extension}\"},\n",
       "  {'cell_type': 'markdown',\n",
       "   'metadata': {'tags': [],\n",
       "    'papermill': {'exception': False,\n",
       "     'start_time': '2020-12-24T04:01:56.949949',\n",
       "     'end_time': '2020-12-24T04:01:56.979812',\n",
       "     'duration': 0.029863,\n",
       "     'status': 'completed'}},\n",
       "   'source': '### Partition the qualifiers\\nExtract the qualifier records for each of the Wikidata property datatype partition files.'},\n",
       "  {'cell_type': 'code',\n",
       "   'execution_count': 20,\n",
       "   'metadata': {'tags': [],\n",
       "    'papermill': {'exception': False,\n",
       "     'start_time': '2020-12-24T04:01:57.007807',\n",
       "     'end_time': '2020-12-24T04:01:58.096484',\n",
       "     'duration': 1.088677,\n",
       "     'status': 'completed'},\n",
       "    'execution': {'iopub.status.busy': '2020-12-24T04:01:57.039775Z',\n",
       "     'iopub.execute_input': '2020-12-24T04:01:57.040379Z',\n",
       "     'iopub.status.idle': '2020-12-24T04:01:58.095586Z',\n",
       "     'shell.execute_reply': '2020-12-24T04:01:58.096316Z'}},\n",
       "   'outputs': [{'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': 'Traceback (most recent call last):\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 257, in run\\r\\n    ie.process()\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/iff/kgtkifexists.py\", line 774, in process\\r\\n    very_verbose=self.very_verbose,\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 534, in open\\r\\n    source: ClosableIter[str] = cls._openfile(file_path, options=options, error_file=error_file, verbose=verbose)\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 749, in _openfile\\r\\n    verbose)\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 668, in _open_compressed_file\\r\\n    return mgzip.open(str(file_or_path), mode=\"rt\", thread=mgzip_threads) # type: ignore\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 55, in open\\r\\n    binary_file = MultiGzipFile(filename, gz_mode, compresslevel, thread=thread, blocksize=blocksize)\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 143, in __init__\\r\\n    fileobj = self.myfileobj = builtins.open(filename, mode or \\'rb\\', blocksize)\\r\\nFileNotFoundError: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.commonsMedia.tsv.gz\\'\\r\\n\\r\\nDuring handling of the above exception, another exception occurred:\\r\\n\\r\\nTraceback (most recent call last):\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/exceptions.py\", line 46, in __call__\\r\\n    return_code = func(*args, **kwargs) or 0\\r\\n  File 
\"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 264, in run\\r\\n    raise KGTKException(str(e))\\r\\nkgtk.exceptions.KGTKException: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.commonsMedia.tsv.gz\\'\\r\\n[Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.commonsMedia.tsv.gz\\'\\r\\nTiming: elapsed=0:00:00.783377 CPU=0:00:00.703171 ( 89.8%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.tsv.gz --filter-on /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.commonsMedia.tsv.gz --output-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.commonsMedia.tsv.gz --input-keys node1 --filter-keys id\\r\\n'},\n",
       "    {'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': '\\r\\nreal\\t0m0.940s\\r\\nuser\\t0m0.721s\\r\\nsys\\t0m0.106s\\r\\n'}],\n",
       "   'source': '!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \\\\\\n --input-file  {wikidata_parts_path}/qualifiers.{kgtk_extension} \\\\\\n --filter-on   {wikidata_parts_path}/claims.commonsMedia.{kgtk_extension} \\\\\\n --output-file {wikidata_parts_path}/qualifiers.commonsMedia.{kgtk_extension} \\\\\\n --input-keys  node1 \\\\\\n --filter-keys id'},\n",
       "  {'cell_type': 'code',\n",
       "   'execution_count': 21,\n",
       "   'metadata': {'tags': [],\n",
       "    'papermill': {'exception': False,\n",
       "     'start_time': '2020-12-24T04:01:58.126907',\n",
       "     'end_time': '2020-12-24T04:01:59.196539',\n",
       "     'duration': 1.069632,\n",
       "     'status': 'completed'},\n",
       "    'execution': {'iopub.status.busy': '2020-12-24T04:01:58.160443Z',\n",
       "     'iopub.execute_input': '2020-12-24T04:01:58.160980Z',\n",
       "     'iopub.status.idle': '2020-12-24T04:01:59.195572Z',\n",
       "     'shell.execute_reply': '2020-12-24T04:01:59.196302Z'}},\n",
       "   'outputs': [{'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': 'Traceback (most recent call last):\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 257, in run\\r\\n    ie.process()\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/iff/kgtkifexists.py\", line 774, in process\\r\\n    very_verbose=self.very_verbose,\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 534, in open\\r\\n    source: ClosableIter[str] = cls._openfile(file_path, options=options, error_file=error_file, verbose=verbose)\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 749, in _openfile\\r\\n    verbose)\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 668, in _open_compressed_file\\r\\n    return mgzip.open(str(file_or_path), mode=\"rt\", thread=mgzip_threads) # type: ignore\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 55, in open\\r\\n    binary_file = MultiGzipFile(filename, gz_mode, compresslevel, thread=thread, blocksize=blocksize)\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 143, in __init__\\r\\n    fileobj = self.myfileobj = builtins.open(filename, mode or \\'rb\\', blocksize)\\r\\nFileNotFoundError: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.external-id.tsv.gz\\'\\r\\n\\r\\nDuring handling of the above exception, another exception occurred:\\r\\n\\r\\nTraceback (most recent call last):\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/exceptions.py\", line 46, in __call__\\r\\n    return_code = func(*args, **kwargs) or 0\\r\\n  File 
\"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 264, in run\\r\\n    raise KGTKException(str(e))\\r\\nkgtk.exceptions.KGTKException: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.external-id.tsv.gz\\'\\r\\n[Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.external-id.tsv.gz\\'\\r\\nTiming: elapsed=0:00:00.700827 CPU=0:00:00.686953 ( 98.0%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.tsv.gz --filter-on /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.external-id.tsv.gz --output-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.external-id.tsv.gz --input-keys node1 --filter-keys id\\r\\n'},\n",
       "    {'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': '\\r\\nreal\\t0m0.918s\\r\\nuser\\t0m0.703s\\r\\nsys\\t0m0.099s\\r\\n'}],\n",
       "   'source': '!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \\\\\\n --input-file  {wikidata_parts_path}/qualifiers.{kgtk_extension} \\\\\\n --filter-on   {wikidata_parts_path}/claims.external-id.{kgtk_extension} \\\\\\n --output-file {wikidata_parts_path}/qualifiers.external-id.{kgtk_extension} \\\\\\n --input-keys  node1 \\\\\\n --filter-keys id'},\n",
       "  {'cell_type': 'code',\n",
       "   'execution_count': 22,\n",
       "   'metadata': {'tags': [],\n",
       "    'papermill': {'exception': False,\n",
       "     'start_time': '2020-12-24T04:01:59.226546',\n",
       "     'end_time': '2020-12-24T04:02:00.286191',\n",
       "     'duration': 1.059645,\n",
       "     'status': 'completed'},\n",
       "    'execution': {'iopub.status.busy': '2020-12-24T04:01:59.261423Z',\n",
       "     'iopub.execute_input': '2020-12-24T04:01:59.262067Z',\n",
       "     'iopub.status.idle': '2020-12-24T04:02:00.285460Z',\n",
       "     'shell.execute_reply': '2020-12-24T04:02:00.286054Z'}},\n",
       "   'outputs': [{'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': 'Traceback (most recent call last):\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 257, in run\\r\\n    ie.process()\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/iff/kgtkifexists.py\", line 774, in process\\r\\n    very_verbose=self.very_verbose,\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 534, in open\\r\\n    source: ClosableIter[str] = cls._openfile(file_path, options=options, error_file=error_file, verbose=verbose)\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 749, in _openfile\\r\\n    verbose)\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 668, in _open_compressed_file\\r\\n    return mgzip.open(str(file_or_path), mode=\"rt\", thread=mgzip_threads) # type: ignore\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 55, in open\\r\\n    binary_file = MultiGzipFile(filename, gz_mode, compresslevel, thread=thread, blocksize=blocksize)\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 143, in __init__\\r\\n    fileobj = self.myfileobj = builtins.open(filename, mode or \\'rb\\', blocksize)\\r\\nFileNotFoundError: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.geo-shape.tsv.gz\\'\\r\\n\\r\\nDuring handling of the above exception, another exception occurred:\\r\\n\\r\\nTraceback (most recent call last):\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/exceptions.py\", line 46, in __call__\\r\\n    return_code = func(*args, **kwargs) or 0\\r\\n  File 
\"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 264, in run\\r\\n    raise KGTKException(str(e))\\r\\nkgtk.exceptions.KGTKException: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.geo-shape.tsv.gz\\'\\r\\n[Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.geo-shape.tsv.gz\\'\\r\\nTiming: elapsed=0:00:00.693817 CPU=0:00:00.680174 ( 98.0%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.tsv.gz --filter-on /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.geo-shape.tsv.gz --output-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.geo-shape.tsv.gz --input-keys node1 --filter-keys id\\r\\n'},\n",
       "    {'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': '\\r\\nreal\\t0m0.910s\\r\\nuser\\t0m0.693s\\r\\nsys\\t0m0.102s\\r\\n'}],\n",
       "   'source': '!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \\\\\\n --input-file  {wikidata_parts_path}/qualifiers.{kgtk_extension} \\\\\\n --filter-on   {wikidata_parts_path}/claims.geo-shape.{kgtk_extension} \\\\\\n --output-file {wikidata_parts_path}/qualifiers.geo-shape.{kgtk_extension} \\\\\\n --input-keys  node1 \\\\\\n --filter-keys id'},\n",
       "  {'cell_type': 'code',\n",
       "   'execution_count': 23,\n",
       "   'metadata': {'tags': [],\n",
       "    'papermill': {'exception': False,\n",
       "     'start_time': '2020-12-24T04:02:00.317067',\n",
       "     'end_time': '2020-12-24T04:02:01.376710',\n",
       "     'duration': 1.059643,\n",
       "     'status': 'completed'},\n",
       "    'execution': {'iopub.status.busy': '2020-12-24T04:02:00.353914Z',\n",
       "     'iopub.execute_input': '2020-12-24T04:02:00.354444Z',\n",
       "     'iopub.status.idle': '2020-12-24T04:02:01.375736Z',\n",
       "     'shell.execute_reply': '2020-12-24T04:02:01.376541Z'}},\n",
       "   'outputs': [{'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': 'Traceback (most recent call last):\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 257, in run\\r\\n    ie.process()\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/iff/kgtkifexists.py\", line 774, in process\\r\\n    very_verbose=self.very_verbose,\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 534, in open\\r\\n    source: ClosableIter[str] = cls._openfile(file_path, options=options, error_file=error_file, verbose=verbose)\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 749, in _openfile\\r\\n    verbose)\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 668, in _open_compressed_file\\r\\n    return mgzip.open(str(file_or_path), mode=\"rt\", thread=mgzip_threads) # type: ignore\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 55, in open\\r\\n    binary_file = MultiGzipFile(filename, gz_mode, compresslevel, thread=thread, blocksize=blocksize)\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 143, in __init__\\r\\n    fileobj = self.myfileobj = builtins.open(filename, mode or \\'rb\\', blocksize)\\r\\nFileNotFoundError: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.globe-coordinate.tsv.gz\\'\\r\\n\\r\\nDuring handling of the above exception, another exception occurred:\\r\\n\\r\\nTraceback (most recent call last):\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/exceptions.py\", line 46, in __call__\\r\\n    return_code = func(*args, **kwargs) or 0\\r\\n  File 
\"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 264, in run\\r\\n    raise KGTKException(str(e))\\r\\nkgtk.exceptions.KGTKException: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.globe-coordinate.tsv.gz\\'\\r\\n[Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.globe-coordinate.tsv.gz\\'\\r\\nTiming: elapsed=0:00:00.685847 CPU=0:00:00.674586 ( 98.4%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.tsv.gz --filter-on /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.globe-coordinate.tsv.gz --output-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.globe-coordinate.tsv.gz --input-keys node1 --filter-keys id\\r\\n'},\n",
       "    {'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': '\\r\\nreal\\t0m0.908s\\r\\nuser\\t0m0.695s\\r\\nsys\\t0m0.100s\\r\\n'}],\n",
       "   'source': '!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \\\\\\n --input-file  {wikidata_parts_path}/qualifiers.{kgtk_extension} \\\\\\n --filter-on   {wikidata_parts_path}/claims.globe-coordinate.{kgtk_extension} \\\\\\n --output-file {wikidata_parts_path}/qualifiers.globe-coordinate.{kgtk_extension} \\\\\\n --input-keys  node1 \\\\\\n --filter-keys id'},\n",
       "  {'cell_type': 'code',\n",
       "   'execution_count': 24,\n",
       "   'metadata': {'tags': [],\n",
       "    'papermill': {'exception': False,\n",
       "     'start_time': '2020-12-24T04:02:01.417421',\n",
       "     'end_time': '2020-12-24T04:02:02.487716',\n",
       "     'duration': 1.070295,\n",
       "     'status': 'completed'},\n",
       "    'execution': {'iopub.status.busy': '2020-12-24T04:02:01.454499Z',\n",
       "     'iopub.execute_input': '2020-12-24T04:02:01.455052Z',\n",
       "     'iopub.status.idle': '2020-12-24T04:02:02.486818Z',\n",
       "     'shell.execute_reply': '2020-12-24T04:02:02.487549Z'}},\n",
       "   'outputs': [{'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': 'Traceback (most recent call last):\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 257, in run\\r\\n    ie.process()\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/iff/kgtkifexists.py\", line 774, in process\\r\\n    very_verbose=self.very_verbose,\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 534, in open\\r\\n    source: ClosableIter[str] = cls._openfile(file_path, options=options, error_file=error_file, verbose=verbose)\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 749, in _openfile\\r\\n    verbose)\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 668, in _open_compressed_file\\r\\n    return mgzip.open(str(file_or_path), mode=\"rt\", thread=mgzip_threads) # type: ignore\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 55, in open\\r\\n    binary_file = MultiGzipFile(filename, gz_mode, compresslevel, thread=thread, blocksize=blocksize)\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 143, in __init__\\r\\n    fileobj = self.myfileobj = builtins.open(filename, mode or \\'rb\\', blocksize)\\r\\nFileNotFoundError: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.math.tsv.gz\\'\\r\\n\\r\\nDuring handling of the above exception, another exception occurred:\\r\\n\\r\\nTraceback (most recent call last):\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/exceptions.py\", line 46, in __call__\\r\\n    return_code = func(*args, **kwargs) or 0\\r\\n  File 
\"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 264, in run\\r\\n    raise KGTKException(str(e))\\r\\nkgtk.exceptions.KGTKException: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.math.tsv.gz\\'\\r\\n[Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.math.tsv.gz\\'\\r\\nTiming: elapsed=0:00:00.692024 CPU=0:00:00.686177 ( 99.2%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.tsv.gz --filter-on /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.math.tsv.gz --output-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.math.tsv.gz --input-keys node1 --filter-keys id\\r\\n'},\n",
       "    {'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': '\\r\\nreal\\t0m0.915s\\r\\nuser\\t0m0.710s\\r\\nsys\\t0m0.098s\\r\\n'}],\n",
       "   'source': '!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \\\\\\n --input-file  {wikidata_parts_path}/qualifiers.{kgtk_extension} \\\\\\n --filter-on   {wikidata_parts_path}/claims.math.{kgtk_extension} \\\\\\n --output-file {wikidata_parts_path}/qualifiers.math.{kgtk_extension} \\\\\\n --input-keys  node1 \\\\\\n --filter-keys id'},\n",
       "  {'cell_type': 'code',\n",
       "   'execution_count': 25,\n",
       "   'metadata': {'tags': [],\n",
       "    'papermill': {'exception': False,\n",
       "     'start_time': '2020-12-24T04:02:02.527005',\n",
       "     'end_time': '2020-12-24T04:02:03.618049',\n",
       "     'duration': 1.091044,\n",
       "     'status': 'completed'},\n",
       "    'execution': {'iopub.status.busy': '2020-12-24T04:02:02.568577Z',\n",
       "     'iopub.execute_input': '2020-12-24T04:02:02.569635Z',\n",
       "     'iopub.status.idle': '2020-12-24T04:02:03.617155Z',\n",
       "     'shell.execute_reply': '2020-12-24T04:02:03.617884Z'}},\n",
       "   'outputs': [{'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': 'Traceback (most recent call last):\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 257, in run\\r\\n    ie.process()\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/iff/kgtkifexists.py\", line 774, in process\\r\\n    very_verbose=self.very_verbose,\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 534, in open\\r\\n    source: ClosableIter[str] = cls._openfile(file_path, options=options, error_file=error_file, verbose=verbose)\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 749, in _openfile\\r\\n    verbose)\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 668, in _open_compressed_file\\r\\n    return mgzip.open(str(file_or_path), mode=\"rt\", thread=mgzip_threads) # type: ignore\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 55, in open\\r\\n    binary_file = MultiGzipFile(filename, gz_mode, compresslevel, thread=thread, blocksize=blocksize)\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 143, in __init__\\r\\n    fileobj = self.myfileobj = builtins.open(filename, mode or \\'rb\\', blocksize)\\r\\nFileNotFoundError: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.monolingualtext.tsv.gz\\'\\r\\n\\r\\nDuring handling of the above exception, another exception occurred:\\r\\n\\r\\nTraceback (most recent call last):\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/exceptions.py\", line 46, in __call__\\r\\n    return_code = func(*args, **kwargs) or 0\\r\\n  File 
\"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 264, in run\\r\\n    raise KGTKException(str(e))\\r\\nkgtk.exceptions.KGTKException: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.monolingualtext.tsv.gz\\'\\r\\n[Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.monolingualtext.tsv.gz\\'\\r\\nTiming: elapsed=0:00:00.683551 CPU=0:00:00.673260 ( 98.5%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.tsv.gz --filter-on /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.monolingualtext.tsv.gz --output-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.monolingualtext.tsv.gz --input-keys node1 --filter-keys id\\r\\n'},\n",
       "    {'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': '\\r\\nreal\\t0m0.930s\\r\\nuser\\t0m0.713s\\r\\nsys\\t0m0.106s\\r\\n'}],\n",
       "   'source': '!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \\\\\\n --input-file  {wikidata_parts_path}/qualifiers.{kgtk_extension} \\\\\\n --filter-on   {wikidata_parts_path}/claims.monolingualtext.{kgtk_extension} \\\\\\n --output-file {wikidata_parts_path}/qualifiers.monolingualtext.{kgtk_extension} \\\\\\n --input-keys  node1 \\\\\\n --filter-keys id'},\n",
       "  {'cell_type': 'code',\n",
       "   'execution_count': 26,\n",
       "   'metadata': {'tags': [],\n",
       "    'papermill': {'exception': False,\n",
       "     'start_time': '2020-12-24T04:02:03.658843',\n",
       "     'end_time': '2020-12-24T04:02:04.744693',\n",
       "     'duration': 1.08585,\n",
       "     'status': 'completed'},\n",
       "    'execution': {'iopub.status.busy': '2020-12-24T04:02:03.696451Z',\n",
       "     'iopub.execute_input': '2020-12-24T04:02:03.697009Z',\n",
       "     'iopub.status.idle': '2020-12-24T04:02:04.743613Z',\n",
       "     'shell.execute_reply': '2020-12-24T04:02:04.744387Z'}},\n",
       "   'outputs': [{'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': 'Traceback (most recent call last):\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 257, in run\\r\\n    ie.process()\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/iff/kgtkifexists.py\", line 774, in process\\r\\n    very_verbose=self.very_verbose,\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 534, in open\\r\\n    source: ClosableIter[str] = cls._openfile(file_path, options=options, error_file=error_file, verbose=verbose)\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 749, in _openfile\\r\\n    verbose)\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 668, in _open_compressed_file\\r\\n    return mgzip.open(str(file_or_path), mode=\"rt\", thread=mgzip_threads) # type: ignore\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 55, in open\\r\\n    binary_file = MultiGzipFile(filename, gz_mode, compresslevel, thread=thread, blocksize=blocksize)\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 143, in __init__\\r\\n    fileobj = self.myfileobj = builtins.open(filename, mode or \\'rb\\', blocksize)\\r\\nFileNotFoundError: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.musical-notation.tsv.gz\\'\\r\\n\\r\\nDuring handling of the above exception, another exception occurred:\\r\\n\\r\\nTraceback (most recent call last):\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/exceptions.py\", line 46, in __call__\\r\\n    return_code = func(*args, **kwargs) or 0\\r\\n  File 
\"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 264, in run\\r\\n    raise KGTKException(str(e))\\r\\nkgtk.exceptions.KGTKException: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.musical-notation.tsv.gz\\'\\r\\n[Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.musical-notation.tsv.gz\\'\\r\\nTiming: elapsed=0:00:00.713679 CPU=0:00:00.704594 ( 98.7%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.tsv.gz --filter-on /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.musical-notation.tsv.gz --output-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.musical-notation.tsv.gz --input-keys node1 --filter-keys id\\r\\n'},\n",
       "    {'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': '\\r\\nreal\\t0m0.930s\\r\\nuser\\t0m0.720s\\r\\nsys\\t0m0.102s\\r\\n'}],\n",
       "   'source': '!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \\\\\\n --input-file  {wikidata_parts_path}/qualifiers.{kgtk_extension} \\\\\\n --filter-on   {wikidata_parts_path}/claims.musical-notation.{kgtk_extension} \\\\\\n --output-file {wikidata_parts_path}/qualifiers.musical-notation.{kgtk_extension} \\\\\\n --input-keys  node1 \\\\\\n --filter-keys id'},\n",
       "  {'cell_type': 'code',\n",
       "   'execution_count': 27,\n",
       "   'metadata': {'tags': [],\n",
       "    'papermill': {'exception': False,\n",
       "     'start_time': '2020-12-24T04:02:04.786691',\n",
       "     'end_time': '2020-12-24T04:02:05.968529',\n",
       "     'duration': 1.181838,\n",
       "     'status': 'completed'},\n",
       "    'execution': {'iopub.status.busy': '2020-12-24T04:02:04.830352Z',\n",
       "     'iopub.execute_input': '2020-12-24T04:02:04.830926Z',\n",
       "     'iopub.status.idle': '2020-12-24T04:02:05.967605Z',\n",
       "     'shell.execute_reply': '2020-12-24T04:02:05.968364Z'}},\n",
       "   'outputs': [{'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': 'Traceback (most recent call last):\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 257, in run\\r\\n    ie.process()\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/iff/kgtkifexists.py\", line 774, in process\\r\\n    very_verbose=self.very_verbose,\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 534, in open\\r\\n    source: ClosableIter[str] = cls._openfile(file_path, options=options, error_file=error_file, verbose=verbose)\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 749, in _openfile\\r\\n    verbose)\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 668, in _open_compressed_file\\r\\n    return mgzip.open(str(file_or_path), mode=\"rt\", thread=mgzip_threads) # type: ignore\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 55, in open\\r\\n    binary_file = MultiGzipFile(filename, gz_mode, compresslevel, thread=thread, blocksize=blocksize)\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 143, in __init__\\r\\n    fileobj = self.myfileobj = builtins.open(filename, mode or \\'rb\\', blocksize)\\r\\nFileNotFoundError: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.quantity.tsv.gz\\'\\r\\n\\r\\nDuring handling of the above exception, another exception occurred:\\r\\n\\r\\nTraceback (most recent call last):\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/exceptions.py\", line 46, in __call__\\r\\n    return_code = func(*args, **kwargs) or 0\\r\\n  File 
\"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 264, in run\\r\\n    raise KGTKException(str(e))\\r\\nkgtk.exceptions.KGTKException: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.quantity.tsv.gz\\'\\r\\n[Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.quantity.tsv.gz\\'\\r\\nTiming: elapsed=0:00:00.779701 CPU=0:00:00.767860 ( 98.5%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.tsv.gz --filter-on /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.quantity.tsv.gz --output-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.quantity.tsv.gz --input-keys node1 --filter-keys id\\r\\n'},\n",
       "    {'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': '\\r\\nreal\\t0m1.020s\\r\\nuser\\t0m0.794s\\r\\nsys\\t0m0.112s\\r\\n'}],\n",
       "   'source': '!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \\\\\\n --input-file  {wikidata_parts_path}/qualifiers.{kgtk_extension} \\\\\\n --filter-on   {wikidata_parts_path}/claims.quantity.{kgtk_extension} \\\\\\n --output-file {wikidata_parts_path}/qualifiers.quantity.{kgtk_extension} \\\\\\n --input-keys  node1 \\\\\\n --filter-keys id'},\n",
       "  {'cell_type': 'code',\n",
       "   'execution_count': 28,\n",
       "   'metadata': {'tags': [],\n",
       "    'papermill': {'exception': False,\n",
       "     'start_time': '2020-12-24T04:02:06.012197',\n",
       "     'end_time': '2020-12-24T04:02:07.098719',\n",
       "     'duration': 1.086522,\n",
       "     'status': 'completed'},\n",
       "    'execution': {'iopub.status.busy': '2020-12-24T04:02:06.051402Z',\n",
       "     'iopub.execute_input': '2020-12-24T04:02:06.051944Z',\n",
       "     'iopub.status.idle': '2020-12-24T04:02:07.097812Z',\n",
       "     'shell.execute_reply': '2020-12-24T04:02:07.098546Z'}},\n",
       "   'outputs': [{'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': 'Traceback (most recent call last):\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 257, in run\\r\\n    ie.process()\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/iff/kgtkifexists.py\", line 774, in process\\r\\n    very_verbose=self.very_verbose,\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 534, in open\\r\\n    source: ClosableIter[str] = cls._openfile(file_path, options=options, error_file=error_file, verbose=verbose)\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 749, in _openfile\\r\\n    verbose)\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 668, in _open_compressed_file\\r\\n    return mgzip.open(str(file_or_path), mode=\"rt\", thread=mgzip_threads) # type: ignore\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 55, in open\\r\\n    binary_file = MultiGzipFile(filename, gz_mode, compresslevel, thread=thread, blocksize=blocksize)\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 143, in __init__\\r\\n    fileobj = self.myfileobj = builtins.open(filename, mode or \\'rb\\', blocksize)\\r\\nFileNotFoundError: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.string.tsv.gz\\'\\r\\n\\r\\nDuring handling of the above exception, another exception occurred:\\r\\n\\r\\nTraceback (most recent call last):\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/exceptions.py\", line 46, in __call__\\r\\n    return_code = func(*args, **kwargs) or 0\\r\\n  File 
\"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 264, in run\\r\\n    raise KGTKException(str(e))\\r\\nkgtk.exceptions.KGTKException: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.string.tsv.gz\\'\\r\\n[Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.string.tsv.gz\\'\\r\\nTiming: elapsed=0:00:00.697405 CPU=0:00:00.686692 ( 98.5%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.tsv.gz --filter-on /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.string.tsv.gz --output-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.string.tsv.gz --input-keys node1 --filter-keys id\\r\\n'},\n",
       "    {'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': '\\r\\nreal\\t0m0.929s\\r\\nuser\\t0m0.713s\\r\\nsys\\t0m0.101s\\r\\n'}],\n",
       "   'source': '!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \\\\\\n --input-file  {wikidata_parts_path}/qualifiers.{kgtk_extension} \\\\\\n --filter-on   {wikidata_parts_path}/claims.string.{kgtk_extension} \\\\\\n --output-file {wikidata_parts_path}/qualifiers.string.{kgtk_extension} \\\\\\n --input-keys  node1 \\\\\\n --filter-keys id'},\n",
       "  {'cell_type': 'code',\n",
       "   'execution_count': 29,\n",
       "   'metadata': {'tags': [],\n",
       "    'papermill': {'exception': False,\n",
       "     'start_time': '2020-12-24T04:02:07.142287',\n",
       "     'end_time': '2020-12-24T04:02:08.230149',\n",
       "     'duration': 1.087862,\n",
       "     'status': 'completed'},\n",
       "    'execution': {'iopub.status.busy': '2020-12-24T04:02:07.180935Z',\n",
       "     'iopub.execute_input': '2020-12-24T04:02:07.181576Z',\n",
       "     'iopub.status.idle': '2020-12-24T04:02:08.229186Z',\n",
       "     'shell.execute_reply': '2020-12-24T04:02:08.229986Z'}},\n",
       "   'outputs': [{'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': 'Traceback (most recent call last):\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 257, in run\\r\\n    ie.process()\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/iff/kgtkifexists.py\", line 774, in process\\r\\n    very_verbose=self.very_verbose,\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 534, in open\\r\\n    source: ClosableIter[str] = cls._openfile(file_path, options=options, error_file=error_file, verbose=verbose)\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 749, in _openfile\\r\\n    verbose)\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 668, in _open_compressed_file\\r\\n    return mgzip.open(str(file_or_path), mode=\"rt\", thread=mgzip_threads) # type: ignore\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 55, in open\\r\\n    binary_file = MultiGzipFile(filename, gz_mode, compresslevel, thread=thread, blocksize=blocksize)\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 143, in __init__\\r\\n    fileobj = self.myfileobj = builtins.open(filename, mode or \\'rb\\', blocksize)\\r\\nFileNotFoundError: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.tabular-data.tsv.gz\\'\\r\\n\\r\\nDuring handling of the above exception, another exception occurred:\\r\\n\\r\\nTraceback (most recent call last):\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/exceptions.py\", line 46, in __call__\\r\\n    return_code = func(*args, **kwargs) or 0\\r\\n  File 
\"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 264, in run\\r\\n    raise KGTKException(str(e))\\r\\nkgtk.exceptions.KGTKException: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.tabular-data.tsv.gz\\'\\r\\n[Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.tabular-data.tsv.gz\\'\\r\\nTiming: elapsed=0:00:00.710830 CPU=0:00:00.698269 ( 98.2%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.tsv.gz --filter-on /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.tabular-data.tsv.gz --output-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.tabular-data.tsv.gz --input-keys node1 --filter-keys id\\r\\n'},\n",
       "    {'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': '\\r\\nreal\\t0m0.932s\\r\\nuser\\t0m0.719s\\r\\nsys\\t0m0.099s\\r\\n'}],\n",
       "   'source': '!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \\\\\\n --input-file  {wikidata_parts_path}/qualifiers.{kgtk_extension} \\\\\\n --filter-on   {wikidata_parts_path}/claims.tabular-data.{kgtk_extension} \\\\\\n --output-file {wikidata_parts_path}/qualifiers.tabular-data.{kgtk_extension} \\\\\\n --input-keys  node1 \\\\\\n --filter-keys id'},\n",
       "  {'cell_type': 'code',\n",
       "   'execution_count': 30,\n",
       "   'metadata': {'tags': [],\n",
       "    'papermill': {'exception': False,\n",
       "     'start_time': '2020-12-24T04:02:08.273591',\n",
       "     'end_time': '2020-12-24T04:02:09.458369',\n",
       "     'duration': 1.184778,\n",
       "     'status': 'completed'},\n",
       "    'execution': {'iopub.status.busy': '2020-12-24T04:02:08.320840Z',\n",
       "     'iopub.execute_input': '2020-12-24T04:02:08.321416Z',\n",
       "     'iopub.status.idle': '2020-12-24T04:02:09.457121Z',\n",
       "     'shell.execute_reply': '2020-12-24T04:02:09.458110Z'}},\n",
       "   'outputs': [{'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': 'Traceback (most recent call last):\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 257, in run\\r\\n    ie.process()\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/iff/kgtkifexists.py\", line 774, in process\\r\\n    very_verbose=self.very_verbose,\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 534, in open\\r\\n    source: ClosableIter[str] = cls._openfile(file_path, options=options, error_file=error_file, verbose=verbose)\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 749, in _openfile\\r\\n    verbose)\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 668, in _open_compressed_file\\r\\n    return mgzip.open(str(file_or_path), mode=\"rt\", thread=mgzip_threads) # type: ignore\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 55, in open\\r\\n    binary_file = MultiGzipFile(filename, gz_mode, compresslevel, thread=thread, blocksize=blocksize)\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 143, in __init__\\r\\n    fileobj = self.myfileobj = builtins.open(filename, mode or \\'rb\\', blocksize)\\r\\nFileNotFoundError: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.time.tsv.gz\\'\\r\\n\\r\\nDuring handling of the above exception, another exception occurred:\\r\\n\\r\\nTraceback (most recent call last):\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/exceptions.py\", line 46, in __call__\\r\\n    return_code = func(*args, **kwargs) or 0\\r\\n  File 
\"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 264, in run\\r\\n    raise KGTKException(str(e))\\r\\nkgtk.exceptions.KGTKException: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.time.tsv.gz\\'\\r\\n[Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.time.tsv.gz\\'\\r\\nTiming: elapsed=0:00:00.778076 CPU=0:00:00.765454 ( 98.4%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.tsv.gz --filter-on /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.time.tsv.gz --output-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.time.tsv.gz --input-keys node1 --filter-keys id\\r\\n'},\n",
       "    {'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': '\\r\\nreal\\t0m1.019s\\r\\nuser\\t0m0.792s\\r\\nsys\\t0m0.114s\\r\\n'}],\n",
       "   'source': '!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \\\\\\n --input-file  {wikidata_parts_path}/qualifiers.{kgtk_extension} \\\\\\n --filter-on   {wikidata_parts_path}/claims.time.{kgtk_extension} \\\\\\n --output-file {wikidata_parts_path}/qualifiers.time.{kgtk_extension} \\\\\\n --input-keys  node1 \\\\\\n --filter-keys id'},\n",
       "  {'cell_type': 'code',\n",
       "   'execution_count': 31,\n",
       "   'metadata': {'tags': [],\n",
       "    'papermill': {'exception': False,\n",
       "     'start_time': '2020-12-24T04:02:09.509146',\n",
       "     'end_time': '2020-12-24T04:02:10.643175',\n",
       "     'duration': 1.134029,\n",
       "     'status': 'completed'},\n",
       "    'execution': {'iopub.status.busy': '2020-12-24T04:02:09.568010Z',\n",
       "     'iopub.execute_input': '2020-12-24T04:02:09.569190Z',\n",
       "     'iopub.status.idle': '2020-12-24T04:02:10.642292Z',\n",
       "     'shell.execute_reply': '2020-12-24T04:02:10.643010Z'}},\n",
       "   'outputs': [{'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': 'Traceback (most recent call last):\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 257, in run\\r\\n    ie.process()\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/iff/kgtkifexists.py\", line 774, in process\\r\\n    very_verbose=self.very_verbose,\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 534, in open\\r\\n    source: ClosableIter[str] = cls._openfile(file_path, options=options, error_file=error_file, verbose=verbose)\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 749, in _openfile\\r\\n    verbose)\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 668, in _open_compressed_file\\r\\n    return mgzip.open(str(file_or_path), mode=\"rt\", thread=mgzip_threads) # type: ignore\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 55, in open\\r\\n    binary_file = MultiGzipFile(filename, gz_mode, compresslevel, thread=thread, blocksize=blocksize)\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 143, in __init__\\r\\n    fileobj = self.myfileobj = builtins.open(filename, mode or \\'rb\\', blocksize)\\r\\nFileNotFoundError: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.url.tsv.gz\\'\\r\\n\\r\\nDuring handling of the above exception, another exception occurred:\\r\\n\\r\\nTraceback (most recent call last):\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/exceptions.py\", line 46, in __call__\\r\\n    return_code = func(*args, **kwargs) or 0\\r\\n  File 
\"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 264, in run\\r\\n    raise KGTKException(str(e))\\r\\nkgtk.exceptions.KGTKException: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.url.tsv.gz\\'\\r\\n[Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.url.tsv.gz\\'\\r\\nTiming: elapsed=0:00:00.699523 CPU=0:00:00.688743 ( 98.5%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.tsv.gz --filter-on /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.url.tsv.gz --output-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.url.tsv.gz --input-keys node1 --filter-keys id\\r\\n'},\n",
       "    {'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': '\\r\\nreal\\t0m0.956s\\r\\nuser\\t0m0.726s\\r\\nsys\\t0m0.115s\\r\\n'}],\n",
       "   'source': '!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \\\\\\n --input-file  {wikidata_parts_path}/qualifiers.{kgtk_extension} \\\\\\n --filter-on   {wikidata_parts_path}/claims.url.{kgtk_extension} \\\\\\n --output-file {wikidata_parts_path}/qualifiers.url.{kgtk_extension} \\\\\\n --input-keys  node1 \\\\\\n --filter-keys id'},\n",
       "  {'cell_type': 'code',\n",
       "   'execution_count': 32,\n",
       "   'metadata': {'tags': [],\n",
       "    'papermill': {'exception': False,\n",
       "     'start_time': '2020-12-24T04:02:10.687437',\n",
       "     'end_time': '2020-12-24T04:02:11.759828',\n",
       "     'duration': 1.072391,\n",
       "     'status': 'completed'},\n",
       "    'execution': {'iopub.status.busy': '2020-12-24T04:02:10.728108Z',\n",
       "     'iopub.execute_input': '2020-12-24T04:02:10.728664Z',\n",
       "     'iopub.status.idle': '2020-12-24T04:02:11.758859Z',\n",
       "     'shell.execute_reply': '2020-12-24T04:02:11.759666Z'}},\n",
       "   'outputs': [{'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': 'Traceback (most recent call last):\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 257, in run\\r\\n    ie.process()\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/iff/kgtkifexists.py\", line 774, in process\\r\\n    very_verbose=self.very_verbose,\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 534, in open\\r\\n    source: ClosableIter[str] = cls._openfile(file_path, options=options, error_file=error_file, verbose=verbose)\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 749, in _openfile\\r\\n    verbose)\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 668, in _open_compressed_file\\r\\n    return mgzip.open(str(file_or_path), mode=\"rt\", thread=mgzip_threads) # type: ignore\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 55, in open\\r\\n    binary_file = MultiGzipFile(filename, gz_mode, compresslevel, thread=thread, blocksize=blocksize)\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 143, in __init__\\r\\n    fileobj = self.myfileobj = builtins.open(filename, mode or \\'rb\\', blocksize)\\r\\nFileNotFoundError: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-form.tsv.gz\\'\\r\\n\\r\\nDuring handling of the above exception, another exception occurred:\\r\\n\\r\\nTraceback (most recent call last):\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/exceptions.py\", line 46, in __call__\\r\\n    return_code = func(*args, **kwargs) or 0\\r\\n  File 
\"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 264, in run\\r\\n    raise KGTKException(str(e))\\r\\nkgtk.exceptions.KGTKException: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-form.tsv.gz\\'\\r\\n[Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-form.tsv.gz\\'\\r\\nTiming: elapsed=0:00:00.690773 CPU=0:00:00.679523 ( 98.4%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.tsv.gz --filter-on /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-form.tsv.gz --output-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.wikibase-form.tsv.gz --input-keys node1 --filter-keys id\\r\\n'},\n",
       "    {'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': '\\r\\nreal\\t0m0.916s\\r\\nuser\\t0m0.703s\\r\\nsys\\t0m0.099s\\r\\n'}],\n",
       "   'source': '!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \\\\\\n --input-file  {wikidata_parts_path}/qualifiers.{kgtk_extension} \\\\\\n --filter-on   {wikidata_parts_path}/claims.wikibase-form.{kgtk_extension} \\\\\\n --output-file {wikidata_parts_path}/qualifiers.wikibase-form.{kgtk_extension} \\\\\\n --input-keys  node1 \\\\\\n --filter-keys id'},\n",
       "  {'cell_type': 'code',\n",
       "   'execution_count': 33,\n",
       "   'metadata': {'tags': [],\n",
       "    'papermill': {'exception': False,\n",
       "     'start_time': '2020-12-24T04:02:11.804768',\n",
       "     'end_time': '2020-12-24T04:02:12.920111',\n",
       "     'duration': 1.115343,\n",
       "     'status': 'completed'},\n",
       "    'execution': {'iopub.status.busy': '2020-12-24T04:02:11.846434Z',\n",
       "     'iopub.execute_input': '2020-12-24T04:02:11.847239Z',\n",
       "     'iopub.status.idle': '2020-12-24T04:02:12.919170Z',\n",
       "     'shell.execute_reply': '2020-12-24T04:02:12.919936Z'}},\n",
       "   'outputs': [{'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': 'Traceback (most recent call last):\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 257, in run\\r\\n    ie.process()\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/iff/kgtkifexists.py\", line 774, in process\\r\\n    very_verbose=self.very_verbose,\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 534, in open\\r\\n    source: ClosableIter[str] = cls._openfile(file_path, options=options, error_file=error_file, verbose=verbose)\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 749, in _openfile\\r\\n    verbose)\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 668, in _open_compressed_file\\r\\n    return mgzip.open(str(file_or_path), mode=\"rt\", thread=mgzip_threads) # type: ignore\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 55, in open\\r\\n    binary_file = MultiGzipFile(filename, gz_mode, compresslevel, thread=thread, blocksize=blocksize)\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 143, in __init__\\r\\n    fileobj = self.myfileobj = builtins.open(filename, mode or \\'rb\\', blocksize)\\r\\nFileNotFoundError: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-item.tsv.gz\\'\\r\\n\\r\\nDuring handling of the above exception, another exception occurred:\\r\\n\\r\\nTraceback (most recent call last):\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/exceptions.py\", line 46, in __call__\\r\\n    return_code = func(*args, **kwargs) or 0\\r\\n  File 
\"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 264, in run\\r\\n    raise KGTKException(str(e))\\r\\nkgtk.exceptions.KGTKException: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-item.tsv.gz\\'\\r\\n[Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-item.tsv.gz\\'\\r\\nTiming: elapsed=0:00:00.726953 CPU=0:00:00.713173 ( 98.1%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.tsv.gz --filter-on /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-item.tsv.gz --output-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.wikibase-item.tsv.gz --input-keys node1 --filter-keys id\\r\\n'},\n",
       "    {'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': '\\r\\nreal\\t0m0.956s\\r\\nuser\\t0m0.734s\\r\\nsys\\t0m0.105s\\r\\n'}],\n",
       "   'source': '!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \\\\\\n --input-file  {wikidata_parts_path}/qualifiers.{kgtk_extension} \\\\\\n --filter-on   {wikidata_parts_path}/claims.wikibase-item.{kgtk_extension} \\\\\\n --output-file {wikidata_parts_path}/qualifiers.wikibase-item.{kgtk_extension} \\\\\\n --input-keys  node1 \\\\\\n --filter-keys id'},\n",
       "  {'cell_type': 'code',\n",
       "   'execution_count': 34,\n",
       "   'metadata': {'tags': [],\n",
       "    'papermill': {'exception': False,\n",
       "     'start_time': '2020-12-24T04:02:12.997290',\n",
       "     'end_time': '2020-12-24T04:02:14.332264',\n",
       "     'duration': 1.334974,\n",
       "     'status': 'completed'},\n",
       "    'execution': {'iopub.status.busy': '2020-12-24T04:02:13.075096Z',\n",
       "     'iopub.execute_input': '2020-12-24T04:02:13.075767Z',\n",
       "     'iopub.status.idle': '2020-12-24T04:02:14.331204Z',\n",
       "     'shell.execute_reply': '2020-12-24T04:02:14.332022Z'}},\n",
       "   'outputs': [{'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': 'Traceback (most recent call last):\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 257, in run\\r\\n    ie.process()\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/iff/kgtkifexists.py\", line 774, in process\\r\\n    very_verbose=self.very_verbose,\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 534, in open\\r\\n    source: ClosableIter[str] = cls._openfile(file_path, options=options, error_file=error_file, verbose=verbose)\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 749, in _openfile\\r\\n    verbose)\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 668, in _open_compressed_file\\r\\n    return mgzip.open(str(file_or_path), mode=\"rt\", thread=mgzip_threads) # type: ignore\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 55, in open\\r\\n    binary_file = MultiGzipFile(filename, gz_mode, compresslevel, thread=thread, blocksize=blocksize)\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 143, in __init__\\r\\n    fileobj = self.myfileobj = builtins.open(filename, mode or \\'rb\\', blocksize)\\r\\nFileNotFoundError: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-lexeme.tsv.gz\\'\\r\\n\\r\\nDuring handling of the above exception, another exception occurred:\\r\\n\\r\\nTraceback (most recent call last):\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/exceptions.py\", line 46, in __call__\\r\\n    return_code = func(*args, **kwargs) or 0\\r\\n  File 
\"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 264, in run\\r\\n    raise KGTKException(str(e))\\r\\nkgtk.exceptions.KGTKException: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-lexeme.tsv.gz\\'\\r\\n[Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-lexeme.tsv.gz\\'\\r\\nTiming: elapsed=0:00:00.837385 CPU=0:00:00.819865 ( 97.9%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.tsv.gz --filter-on /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-lexeme.tsv.gz --output-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.wikibase-lexeme.tsv.gz --input-keys node1 --filter-keys id\\r\\n'},\n",
       "    {'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': '\\r\\nreal\\t0m1.136s\\r\\nuser\\t0m0.864s\\r\\nsys\\t0m0.136s\\r\\n'}],\n",
       "   'source': '!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \\\\\\n --input-file  {wikidata_parts_path}/qualifiers.{kgtk_extension} \\\\\\n --filter-on   {wikidata_parts_path}/claims.wikibase-lexeme.{kgtk_extension} \\\\\\n --output-file {wikidata_parts_path}/qualifiers.wikibase-lexeme.{kgtk_extension} \\\\\\n --input-keys  node1 \\\\\\n --filter-keys id'},\n",
       "  {'cell_type': 'code',\n",
       "   'execution_count': 35,\n",
       "   'metadata': {'tags': [],\n",
       "    'papermill': {'exception': False,\n",
       "     'start_time': '2020-12-24T04:02:14.384451',\n",
       "     'end_time': '2020-12-24T04:02:15.564758',\n",
       "     'duration': 1.180307,\n",
       "     'status': 'completed'},\n",
       "    'execution': {'iopub.status.busy': '2020-12-24T04:02:14.447325Z',\n",
       "     'iopub.execute_input': '2020-12-24T04:02:14.448319Z',\n",
       "     'iopub.status.idle': '2020-12-24T04:02:15.564006Z',\n",
       "     'shell.execute_reply': '2020-12-24T04:02:15.564619Z'}},\n",
       "   'outputs': [{'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': 'Traceback (most recent call last):\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 257, in run\\r\\n    ie.process()\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/iff/kgtkifexists.py\", line 774, in process\\r\\n    very_verbose=self.very_verbose,\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 534, in open\\r\\n    source: ClosableIter[str] = cls._openfile(file_path, options=options, error_file=error_file, verbose=verbose)\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 749, in _openfile\\r\\n    verbose)\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 668, in _open_compressed_file\\r\\n    return mgzip.open(str(file_or_path), mode=\"rt\", thread=mgzip_threads) # type: ignore\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 55, in open\\r\\n    binary_file = MultiGzipFile(filename, gz_mode, compresslevel, thread=thread, blocksize=blocksize)\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 143, in __init__\\r\\n    fileobj = self.myfileobj = builtins.open(filename, mode or \\'rb\\', blocksize)\\r\\nFileNotFoundError: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-property.tsv.gz\\'\\r\\n\\r\\nDuring handling of the above exception, another exception occurred:\\r\\n\\r\\nTraceback (most recent call last):\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/exceptions.py\", line 46, in __call__\\r\\n    return_code = func(*args, **kwargs) or 0\\r\\n  File 
\"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 264, in run\\r\\n    raise KGTKException(str(e))\\r\\nkgtk.exceptions.KGTKException: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-property.tsv.gz\\'\\r\\n[Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-property.tsv.gz\\'\\r\\nTiming: elapsed=0:00:00.738577 CPU=0:00:00.726217 ( 98.3%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.tsv.gz --filter-on /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-property.tsv.gz --output-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.wikibase-property.tsv.gz --input-keys node1 --filter-keys id\\r\\n'},\n",
       "    {'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': '\\r\\nreal\\t0m0.995s\\r\\nuser\\t0m0.760s\\r\\nsys\\t0m0.116s\\r\\n'}],\n",
       "   'source': '!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \\\\\\n --input-file  {wikidata_parts_path}/qualifiers.{kgtk_extension} \\\\\\n --filter-on   {wikidata_parts_path}/claims.wikibase-property.{kgtk_extension} \\\\\\n --output-file {wikidata_parts_path}/qualifiers.wikibase-property.{kgtk_extension} \\\\\\n --input-keys  node1 \\\\\\n --filter-keys id'},\n",
       "  {'cell_type': 'code',\n",
       "   'execution_count': 36,\n",
       "   'metadata': {'tags': [],\n",
       "    'papermill': {'exception': False,\n",
       "     'start_time': '2020-12-24T04:02:15.607523',\n",
       "     'end_time': '2020-12-24T04:02:16.682153',\n",
       "     'duration': 1.07463,\n",
       "     'status': 'completed'},\n",
       "    'execution': {'iopub.status.busy': '2020-12-24T04:02:15.650863Z',\n",
       "     'iopub.execute_input': '2020-12-24T04:02:15.651397Z',\n",
       "     'iopub.status.idle': '2020-12-24T04:02:16.681159Z',\n",
       "     'shell.execute_reply': '2020-12-24T04:02:16.681914Z'}},\n",
       "   'outputs': [{'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': 'Traceback (most recent call last):\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 257, in run\\r\\n    ie.process()\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/iff/kgtkifexists.py\", line 774, in process\\r\\n    very_verbose=self.very_verbose,\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 534, in open\\r\\n    source: ClosableIter[str] = cls._openfile(file_path, options=options, error_file=error_file, verbose=verbose)\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 749, in _openfile\\r\\n    verbose)\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 668, in _open_compressed_file\\r\\n    return mgzip.open(str(file_or_path), mode=\"rt\", thread=mgzip_threads) # type: ignore\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 55, in open\\r\\n    binary_file = MultiGzipFile(filename, gz_mode, compresslevel, thread=thread, blocksize=blocksize)\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 143, in __init__\\r\\n    fileobj = self.myfileobj = builtins.open(filename, mode or \\'rb\\', blocksize)\\r\\nFileNotFoundError: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-sense.tsv.gz\\'\\r\\n\\r\\nDuring handling of the above exception, another exception occurred:\\r\\n\\r\\nTraceback (most recent call last):\\r\\n  File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/exceptions.py\", line 46, in __call__\\r\\n    return_code = func(*args, **kwargs) or 0\\r\\n  File 
\"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 264, in run\\r\\n    raise KGTKException(str(e))\\r\\nkgtk.exceptions.KGTKException: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-sense.tsv.gz\\'\\r\\n[Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-sense.tsv.gz\\'\\r\\nTiming: elapsed=0:00:00.689980 CPU=0:00:00.678144 ( 98.3%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.tsv.gz --filter-on /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-sense.tsv.gz --output-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.wikibase-sense.tsv.gz --input-keys node1 --filter-keys id\\r\\n'},\n",
       "    {'output_type': 'stream',\n",
       "     'name': 'stdout',\n",
       "     'text': '\\r\\nreal\\t0m0.912s\\r\\nuser\\t0m0.700s\\r\\nsys\\t0m0.098s\\r\\n'}],\n",
       "   'source': '!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \\\\\\n --input-file  {wikidata_parts_path}/qualifiers.{kgtk_extension} \\\\\\n --filter-on   {wikidata_parts_path}/claims.wikibase-sense.{kgtk_extension} \\\\\\n --output-file {wikidata_parts_path}/qualifiers.wikibase-sense.{kgtk_extension} \\\\\\n --input-keys  node1 \\\\\\n --filter-keys id'}],\n",
       " 'metadata': {'kernelspec': {'display_name': 'Python 3',\n",
       "   'language': 'python',\n",
       "   'name': 'python3'},\n",
       "  'language_info': {'name': 'python',\n",
       "   'version': '3.7.9',\n",
       "   'mimetype': 'text/x-python',\n",
       "   'codemirror_mode': {'name': 'ipython', 'version': 3},\n",
       "   'pygments_lexer': 'ipython3',\n",
       "   'nbconvert_exporter': 'python',\n",
       "   'file_extension': '.py'},\n",
       "  'papermill': {'default_parameters': {},\n",
       "   'parameters': {'wikidata_input_path': '/Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/all_and_qualifiers.tsv.gz',\n",
       "    'wikidata_parts_path': '/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts',\n",
       "    'temp_folder_path': '/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/temp',\n",
       "    'sort_extras': '--buffer-size 30% --temporary-directory $OUT/parts/temp',\n",
       "    'verbose': False},\n",
       "   'environment_variables': {},\n",
       "   'version': '2.2.2',\n",
       "   'input_path': '/Users/pedroszekely/Documents/GitHub/kgtk/examples/partition-wikidata.ipynb',\n",
       "   'output_path': '/Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/partition-wikidata.out.ipynb',\n",
       "   'start_time': '2020-12-24T04:01:12.363465',\n",
       "   'end_time': '2020-12-24T04:02:16.945647',\n",
       "   'duration': 64.582182,\n",
       "   'exception': None}},\n",
       " 'nbformat': 4,\n",
       " 'nbformat_minor': 4}"
      ]
     },
     "execution_count": 87,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pm.execute_notebook(\n",
    "    os.environ[\"EXAMPLES_DIR\"] + \"/partition-wikidata.ipynb\",\n",
    "    os.environ[\"TEMP\"] + \"/partition-wikidata.out.ipynb\",\n",
    "    parameters=dict(\n",
    "        wikidata_input_path = os.environ[\"TEMP\"] + \"/all_and_qualifiers.tsv.gz\",\n",
    "        wikidata_parts_path = os.environ[\"OUT\"] + \"/parts\",\n",
    "        temp_folder_path = os.environ[\"OUT\"] + \"/parts/temp\",\n",
    "        sort_extras = \"--buffer-size 30% --temporary-directory $OUT/parts/temp\",\n",
    "        verbose = False\n",
    "    )\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The partition-wikidata notebook created the following partitioned kgtk-files:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 88,
   "metadata": {
    "collapsed": true,
    "jupyter": {
     "outputs_hidden": true
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "aliases.en.tsv.gz                  metadata.property.datatypes.tsv.gz\n",
      "aliases.tsv.gz                     metadata.types.tsv.gz\n",
      "all.tsv.gz                         qualifiers.tsv.gz\n",
      "claims.tsv.gz                      sitelinks.en.tsv.gz\n",
      "descriptions.en.tsv.gz             sitelinks.qualifiers.en.tsv.gz\n",
      "descriptions.tsv.gz                sitelinks.qualifiers.tsv.gz\n",
      "labels.en.tsv.gz                   sitelinks.tsv.gz\n",
      "labels.tsv.gz                      \u001b[34mtemp\u001b[m\u001b[m\n"
     ]
    }
   ],
   "source": [
    "!ls $OUT/parts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "count(DISTINCT graph_36_c1.\"node1\")\n",
      "13153\n",
      "        2.61 real         2.55 user         0.37 sys\n"
     ]
    }
   ],
   "source": [
    "!$kypher -i $OUT/parts/claims.tsv.gz \\\n",
    "--match '(n1)-[]->()' \\\n",
    "--return 'count(distinct n1)'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Embeddings"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Graph Embeddings"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Normally, we would use `Q154ITEM`, but the partitioning failed, so we will compute it using kypher."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 78,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/bin/bash: /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-item.tsv.gz: No such file or directory\n"
     ]
    }
   ],
   "source": [
    "!zcat < \"$Q154ITEM\" | head"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 100,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  197521  811687 10791930\n"
     ]
    }
   ],
   "source": [
    "!zcat < \"$TEMP\"/Q154.edges.3.tsv.gz | wc"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 118,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "        0.83 real         0.66 user         0.15 sys\n"
     ]
    }
   ],
   "source": [
    "!$kypher -i \"$TEMP\"/Q154.edges.3.tsv.gz -i \"$TEMP\"/Q154.metadata.property.datatype.tsv.gz -i \"$Q154LABEL\" \\\n",
    "--match 'edges: (n1)-[l {label: property}]->(n2), datatype: (property)-[]->(dt:`wikibase-item`), label: (n1)-[]->(lab)' \\\n",
    "--return 'distinct l as id, n1 as node1, l.label as label, n2 as node2' \\\n",
    "-o \"$GE\"/geinput.tsv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have over 60,000 lines:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   66490  265960 3297462 /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/graph-embedding/geinput.tsv\n"
     ]
    }
   ],
   "source": [
    "!wc \"$GE\"/geinput.tsv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Compute the graph embeddings using the `translation` operator. The output file `embeddings.txt` will be in word2vec format (`-ot w2v`), so we can use it directly in gensim."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 161,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "In Processing, Please go to /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/graph-embedding/ge.log to check details\n",
      "Opening the input file: /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/graph-embedding/geinput.tsv\n",
      "KgtkReader: File_path.suffix: .tsv\n",
      "KgtkReader: reading file /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/graph-embedding/geinput.tsv\n",
      "header: id\tnode1\tlabel\tnode2\n",
      "node1 column found, this is a KGTK edge file\n",
      "KgtkReader: Special columns: node1=1 label=2 node2=3 id=0\n",
      "KgtkReader: Reading an edge file.\n",
      "Opening the output file: /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/graph-embedding/tmp_geinput.tsv\n",
      "File_path.suffix: .tsv\n",
      "KgtkWriter: writing file /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/graph-embedding/tmp_geinput.tsv\n",
      "header: id\tnode1\tlabel\tnode2\n",
      "Processing the input records.\n",
      "Processed 66489 records.\n",
      "Processed Finished.\n",
      "      193.64 real       958.24 user       107.56 sys\n"
     ]
    }
   ],
   "source": [
    "!$kgtk graph-embeddings --verbose -i \"$GE\"/geinput.tsv \\\n",
    "-o \"$GE\"/embeddings.txt \\\n",
    "--retain_temporary_data True \\\n",
    "--operator translation \\\n",
    "--workers 5 \\\n",
    "--log \"$GE\"/ge.log \\\n",
    "-T \"$GE\" \\\n",
    "-ot w2v \\\n",
    "-e 300"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at the output directory:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "total 446864\n",
      "-rw-r--r--   1 pedroszekely  staff   101K Dec 26 16:09 Q27.sim.tsv\n",
      "-rw-r--r--   1 pedroszekely  staff    44K Dec 25 22:18 Q27.tsv\n",
      "-rw-r--r--   1 pedroszekely  staff   177K Dec 26 16:09 Q29.Q45.Q142.sim.tsv\n",
      "-rw-r--r--   1 pedroszekely  staff    43K Dec 25 22:36 Q29.Q45.sim.tsv\n",
      "-rw-r--r--   1 pedroszekely  staff    85K Dec 26 16:09 Q29.sim.tsv\n",
      "-rw-r--r--   1 pedroszekely  staff    79K Dec 26 16:09 Q332378.sim.tsv\n",
      "-rw-r--r--   1 pedroszekely  staff    88K Dec 26 16:09 Q374.sim.tsv\n",
      "-rw-r--r--   1 pedroszekely  staff    87K Dec 26 16:09 Q502268.sim.tsv\n",
      "-rw-r--r--   1 pedroszekely  staff    44K Dec 25 22:11 Q502268.tsv\n",
      "-rw-r--r--   1 pedroszekely  staff   4.3K Dec 25 21:33 Q610672.tsv\n",
      "-rw-r--r--   1 pedroszekely  staff    53M Dec 23 23:23 embeddings.txt\n",
      "-rw-r--r--   1 pedroszekely  staff   480K Dec 23 23:23 ge.log\n",
      "-rw-r--r--   1 pedroszekely  staff   3.1M Dec 23 22:02 geinput.tsv\n",
      "-rw-r--r--   1 pedroszekely  staff   973K Dec 23 12:41 geinput.tsv.gz\n",
      "drwxr-xr-x  10 pedroszekely  staff   320B Dec 23 23:23 \u001b[34moutput\u001b[m\u001b[m\n",
      "-rw-r--r--   1 pedroszekely  staff    19K Dec 26 22:14 projector.qnodes.tsv\n",
      "-rw-r--r--   1 pedroszekely  staff   2.7M Dec 26 22:14 projector.vectors.tsv\n",
      "-rw-r--r--@  1 pedroszekely  staff   4.9K Dec 23 15:21 test.txt\n",
      "-rw-r--r--   1 pedroszekely  staff   1.2M Dec 23 23:20 tmp_geinput.tsv\n",
      "-rw-r--r--   1 pedroszekely  staff    11K Dec 23 16:22 translation.10.tsv\n",
      "-rw-r--r--   1 pedroszekely  staff   8.2K Dec 23 21:50 translation.1000.projector.metadata.1.tsv\n",
      "-rw-r--r--   1 pedroszekely  staff    29K Dec 23 23:00 translation.1000.projector.metadata.tsv\n",
      "-rw-r--r--   1 pedroszekely  staff   1.2M Dec 23 21:50 translation.1000.projector.vectors.tsv\n",
      "-rw-r--r--   1 pedroszekely  staff   1.2M Dec 23 21:50 translation.1000.tsv\n",
      "-rw-r--r--   1 pedroszekely  staff   1.2M Dec 23 20:59 translation.1000.txt\n",
      "-rw-r--r--   1 pedroszekely  staff   622K Dec 23 23:34 translation.10000.projector.metadata.tsv\n",
      "-rw-r--r--   1 pedroszekely  staff    12M Dec 23 23:23 translation.10000.projector.vectors.tsv\n",
      "-rw-r--r--   1 pedroszekely  staff    12M Dec 23 23:23 translation.10000.tsv\n",
      "-rw-r--r--   1 pedroszekely  staff   143K Dec 23 23:07 translation.5000.projector.metadata.tsv\n",
      "-rw-r--r--   1 pedroszekely  staff   6.0M Dec 23 23:07 translation.5000.projector.vectors.tsv\n",
      "-rw-r--r--   1 pedroszekely  staff   6.0M Dec 23 23:07 translation.5000.tsv\n",
      "-rw-r--r--   1 pedroszekely  staff   114K Dec 26 22:10 translation.projector.metadata.tsv\n",
      "-rw-r--r--   1 pedroszekely  staff    83K Dec 26 21:28 translation.projector.qnodes.lab.des.tsv\n",
      "-rw-r--r--   1 pedroszekely  staff    19K Dec 26 22:10 translation.projector.qnodes.tsv\n",
      "-rw-r--r--   1 pedroszekely  staff   2.7M Dec 26 22:10 translation.projector.vectors.tsv\n",
      "-rw-r--r--   1 pedroszekely  staff    54M Dec 23 20:25 translation.tsv\n",
      "-rw-r--r--   1 pedroszekely  staff    54M Dec 23 15:23 translation2.txt\n",
      "-rw-r--r--   1 pedroszekely  staff   7.9K Dec 23 21:58 xxx.txt\n"
     ]
    }
   ],
   "source": [
    "!ls -hl \"$GE\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's peek at the file: the first line shows that we have about 44K vectors of dimension 100."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "44419 100\n",
      "Q243611 -0.331411451 -0.152568206 -0.139386058 -0.121394955 -0.334799886 0.023394363 -0.024942441 -0.137579590 0.084599547 0.876167953 -0.222018719 -0.168754980 -0.027932534 -0.289450347 0.250572681 -0.633476973 -0.440892249 -0.178823337 0.299026161 -0.407618254 -0.036977571 0.032356881 -0.081695572 -0.055025205 -0.182957411 -0.250380307 0.535348237 -0.108279251 0.452128828 -0.346319675 0.042611640 0.338040203 0.171208084 -0.275558919 0.114576176 -0.198427215 -0.277292132 -0.149741501 -0.327517658 0.146066576 0.431715995 0.481242269 -0.124767415 -0.171481445 -0.394009471 -0.305026233 0.223357961 0.360154629 0.213194653 0.012373813 -0.405227572 0.052000813 0.084122777 0.072465442 0.241527051 0.314641565 -0.258469820 0.122197300 -0.385967076 -0.472052187 -0.090907939 -0.102187648 0.184509873 0.132856295 0.402841479 0.585462868 0.695401728 0.060416430 -0.322626084 -0.238338873 0.333650321 0.479767382 -0.366145641 0.051905960 0.275238752 0.429640323 -0.370602965 0.055560533 0.609016299 -0.264090836 0.130152687 -0.186686888 0.346337169 -0.695047677 -0.011451115 -0.673357785 -0.533024371 0.064912595 0.069889240 -0.252222359 -0.089250244 -0.509508848 0.427851468 0.018754318 -0.192092314 -0.222673357 -0.156975567 -0.142941862 0.170732170 0.495883286\n"
     ]
    }
   ],
   "source": [
    "!head -2 \"$GE\"/embeddings.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Load the vectors into gensim:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 81,
   "metadata": {},
   "outputs": [],
   "source": [
    "path = os.environ['GE'] + \"/embeddings.txt\"\n",
    "ge_vectors = KeyedVectors.load_word2vec_format(path, binary=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 82,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([-0.71844614, -0.72041976,  0.819834  , -0.07249352,  0.24403723,\n",
       "        0.60705996, -0.5666862 , -0.5559557 ,  0.686424  ,  0.6667965 ,\n",
       "       -0.46009716,  0.4207767 , -0.17946522, -0.18458156, -1.0764353 ,\n",
       "        1.056981  , -0.06046142,  0.00866301, -0.02163753, -0.3418129 ,\n",
       "       -0.03871485, -0.14953642,  0.8018838 ,  0.19381396, -0.10066328,\n",
       "        0.884025  , -0.08962934, -0.36985362, -0.3394345 ,  0.671762  ,\n",
       "        0.11509704, -0.6489555 , -0.22910565, -0.6392556 ,  0.8204702 ,\n",
       "       -0.260422  ,  0.4548083 ,  0.06683284, -0.09605702,  0.23433112,\n",
       "        0.4129733 ,  0.05630195, -0.24607319, -0.19756897,  0.3878965 ,\n",
       "        0.08242382,  0.07034106,  0.14290804,  0.07523334, -0.16040339,\n",
       "        0.02874546, -0.0554648 ,  0.00764391, -0.6856189 , -0.3701922 ,\n",
       "       -0.23979117,  0.26580626,  0.01087183, -1.2511953 ,  0.01297893,\n",
       "       -0.23593499, -0.16515297, -0.2442124 , -0.10745924,  1.16383   ,\n",
       "       -0.8887456 ,  0.7308084 , -0.02755331,  1.395485  , -0.34370282,\n",
       "        0.61988074,  0.28472528, -0.51778364, -0.5608775 ,  0.6496688 ,\n",
       "       -0.11930947, -0.4032322 ,  1.1153812 , -0.9912186 ,  0.09023302,\n",
       "       -0.3542225 ,  0.24804258,  0.26503336, -0.6374534 ,  0.13950008,\n",
       "       -0.47777557,  0.77702343,  0.0645401 , -0.16665687, -0.37595555,\n",
       "        0.70249134, -0.77693635,  0.2853018 ,  0.35154393, -0.03257728,\n",
       "       -1.2317531 , -0.41577864, -0.73989207,  1.072565  , -0.0718146 ],\n",
       "      dtype=float32)"
      ]
     },
     "execution_count": 82,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Q502268 is Johnnie Walker\n",
    "ge_vectors['Q502268']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Find the most similar qnodes to `Q15874936`, the qnode for Michelob."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 83,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('Q610672', 0.9267997741699219),\n",
       " ('Q48799234', 0.7637178897857666),\n",
       " ('Q85269976', 0.762772262096405),\n",
       " ('Q5647008', 0.7582801580429077),\n",
       " ('Q5149389', 0.7565429210662842)]"
      ]
     },
     "execution_count": 83,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ge_vectors.most_similar(positive=['Q15874936'], topn=5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is hard to use because the results are qnodes and we have no idea what they are. Let's define a function to fetch the labels and descriptions so that we can interpret the results more easily."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`kgtk_most_similar` is a wrapper around gensim's `most_similar` function, designed to output the results in KGTK format. The `kg_path` argument is required if we want to output the labels and descriptions, as this path is where the `labels.en.tsv.gz` and `descriptions.en.tsv.gz` files are stored. You can optionally provide an `output_path` to store the results in a file; otherwise the results are returned as a dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 84,
   "metadata": {},
   "outputs": [],
   "source": [
    "def kgtk_most_similar(\n",
    "    vectors,\n",
    "    positive,\n",
    "    relation_label=\"similarity_score\",\n",
    "    kg_path=None,\n",
    "    add_label_description=True,\n",
    "    output_path=None,\n",
    "    topn=25,\n",
    "):\n",
    "    \"\"\"Wrap gensim's `most_similar` and return the results as KGTK edges (file or DataFrame).\"\"\"\n",
    "    result = []\n",
    "    if add_label_description and kg_path:\n",
    "        # Write the similarity edges to a temporary KGTK file so kypher can join them with labels and descriptions\n",
    "        fp = tempfile.NamedTemporaryFile(\n",
    "            mode=\"w\", suffix=\".tsv\", delete=False, encoding=\"utf-8\"\n",
    "        )\n",
    "        fp.write(\"node1\\tlabel\\tnode2\\n\")\n",
    "        for (qnode, similarity) in vectors.most_similar(positive=positive, topn=topn):\n",
    "            fp.write(\"{}\\t{}\\t{}\\n\".format(qnode, relation_label, similarity))\n",
    "        filename = fp.name\n",
    "        fp.close()\n",
    "\n",
    "        os.environ[\"_label_graph\"] = kg_path + \"/labels.en.tsv.gz\"\n",
    "        os.environ[\"_description_graph\"] = kg_path + \"/descriptions.en.tsv.gz\"\n",
    "        os.environ[\"_temp_file\"] = filename\n",
    "\n",
    "        result = !$kypher_raw -i \"$_label_graph\" -i \"$_description_graph\" -i \"$_temp_file\" --as sim \\\n",
    "--match 'sim: (n1)-[]->(similarity), label: (n1)-[]->(lab), description: (n1)-[]->(des)' \\\n",
    "--return 'distinct n1 as node1, similarity as node2, \"similarity\" as label, lab as `node1;label`, des as `node1;description`' \\\n",
    "--order-by 'cast(similarity, float) desc' \n",
    "        \n",
    "        os.remove(filename)\n",
    "        \n",
    "    else:\n",
    "        # Lines carry no trailing newline; one is added when writing to a file\n",
    "        result.append(\"node1\\tlabel\\tnode2\")\n",
    "        for (qnode, similarity) in vectors.most_similar(positive=positive, topn=topn):\n",
    "            result.append(\"{}\\t{}\\t{}\".format(qnode, relation_label, similarity))\n",
    "\n",
    "    if output_path:\n",
    "        with open(output_path, \"w\") as handle:\n",
    "            for line in result:\n",
    "                handle.write(line)\n",
    "                handle.write(\"\\n\")\n",
    "    else:\n",
    "        columns = result[0].split(\"\\t\")\n",
    "        data = []\n",
    "        for line in result[1:]:\n",
    "            data.append(line.split(\"\\t\"))\n",
    "        return pd.DataFrame(data, columns=columns)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's give it a try:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 85,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>node1</th>\n",
       "      <th>node2</th>\n",
       "      <th>label</th>\n",
       "      <th>node1;label</th>\n",
       "      <th>node1;description</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Q610672</td>\n",
       "      <td>0.9267997741699219</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Budweiser'@en</td>\n",
       "      <td>'brand of pale lager'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Q48799234</td>\n",
       "      <td>0.7637178897857666</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Virginia Black Whiskey'@en</td>\n",
       "      <td>'super-premium brand of American Bourbon whisk...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Q85269976</td>\n",
       "      <td>0.762772262096405</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Busch Beer'@en</td>\n",
       "      <td>'brand of beer owned by Anheuser-Busch'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Q5149389</td>\n",
       "      <td>0.7565429210662842</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Colt 45'@en</td>\n",
       "      <td>'malt liquor'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Q3079990</td>\n",
       "      <td>0.752647340297699</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Four Loko'@en</td>\n",
       "      <td>'Drink'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Q96952363</td>\n",
       "      <td>0.7438719272613525</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Cronk'@en</td>\n",
       "      <td>'American drink'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Q7085533</td>\n",
       "      <td>0.7436875104904175</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Olde English 800'@en</td>\n",
       "      <td>'malt liquor'@en</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       node1               node2       label                  node1;label  \\\n",
       "0    Q610672  0.9267997741699219  similarity               'Budweiser'@en   \n",
       "1  Q48799234  0.7637178897857666  similarity  'Virginia Black Whiskey'@en   \n",
       "2  Q85269976   0.762772262096405  similarity              'Busch Beer'@en   \n",
       "3   Q5149389  0.7565429210662842  similarity                 'Colt 45'@en   \n",
       "4   Q3079990   0.752647340297699  similarity               'Four Loko'@en   \n",
       "5  Q96952363  0.7438719272613525  similarity                   'Cronk'@en   \n",
       "6   Q7085533  0.7436875104904175  similarity        'Olde English 800'@en   \n",
       "\n",
       "                                   node1;description  \n",
       "0                           'brand of pale lager'@en  \n",
       "1  'super-premium brand of American Bourbon whisk...  \n",
       "2         'brand of beer owned by Anheuser-Busch'@en  \n",
       "3                                   'malt liquor'@en  \n",
       "4                                         'Drink'@en  \n",
       "5                                'American drink'@en  \n",
       "6                                   'malt liquor'@en  "
      ]
     },
     "execution_count": 85,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Q15874936 is Michelob\n",
    "kgtk_most_similar(ge_vectors, positive=['Q15874936'], kg_path=os.environ['OUT'] + \"/parts\", topn=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Text Embeddings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "zcat: error writing to output: Broken pipe\n"
     ]
    }
   ],
   "source": [
    "!zcat < $OUT/all.tsv.gz | head -500 > $TEMP/all.500.tsv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "id\tnode1\tlabel\tnode2\n",
      "P10-P1628-32b85d-7927ece6-0\tP10\tP1628\t\"http://www.w3.org/2006/vcard/ns#Video\"\n",
      "P10-P1628-acf60d-b8950832-0\tP10\tP1628\t\"https://schema.org/video\"\n",
      "P10-P1629-Q34508-bcc39400-0\tP10\tP1629\tQ34508\n",
      "P10-P1659-P1651-c4068028-0\tP10\tP1659\tP1651\n",
      "P10-P1659-P18-5e4b9c4f-0\tP10\tP1659\tP18\n",
      "P10-P1659-P4238-d21d1ac0-0\tP10\tP1659\tP4238\n",
      "P10-P1659-P51-86aca4c5-0\tP10\tP1659\tP51\n",
      "P10-P1855-Q15075950-7eff6d65-0\tP10\tP1855\tQ15075950\n",
      "P10-P1855-Q69063653-c8cdb04c-0\tP10\tP1855\tQ69063653\n"
     ]
    }
   ],
   "source": [
    "!head $TEMP/all.500.tsv"
   ]
  },
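  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Compute text embeddings for the nodes in `all.tsv.gz`. For each node, the command builds a sentence from its label (`--label-properties`), description (`--description-properties`), is-a values (`--isa-properties`) and the values of selected properties (`--property-value`), and encodes that sentence with the `bert-large-nli-cls-token` model. Input and output are in KGTK format, and `--save-embedding-sentence` keeps the generated sentences in the output."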
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!$kgtk text-embedding -i $OUT/all.tsv.gz \\\n",
    "--embedding-projector-metadata-path none \\\n",
    "--label-properties label \\\n",
    "--isa-properties P31 P279 P452 P106 \\\n",
    "--description-properties description \\\n",
    "--property-value P186 P17 P127 P176 P169 \\\n",
    "--has-properties \"\" \\\n",
    "-f kgtk_format \\\n",
    "--output-data-format kgtk_format \\\n",
    "--save-embedding-sentence \\\n",
    "--model bert-large-nli-cls-token \\\n",
    "-o \"$TE\" \\\n",
    "> \"$TE\"/text-embedding.tsv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Duration with `--parallel 1`: `16348.11 real     16066.21 user       315.45 sys`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The text embeddings are output in KGTK format, but we need them in word2vec format (the command should be enhanced to produce w2v format directly). For now, define a function to convert the KGTK embeddings to w2v format."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 110,
   "metadata": {},
   "outputs": [],
   "source": [
    "def convert_kgtk_to_w2v(input_path, output_path, text_embedding_label=\"text_embedding\"):\n",
    "    \"\"\"\n",
    "    Convert a KGTK file (node1/label/node2) that contains embeddings to the w2v format\n",
    "    \"\"\"\n",
    "    vector_count = 0\n",
    "    vector_length = 0\n",
    "    \n",
    "    # Read the file once to count the vectors, as the w2v header needs the count and the dimension\n",
    "    with open(input_path, \"r\") as kgtk_file:\n",
    "        next(kgtk_file)  # skip the KGTK header line\n",
    "        for line in kgtk_file:\n",
    "            items = line.split(\"\\t\")\n",
    "            qnode = items[0]\n",
    "            label = items[1]\n",
    "            if label == text_embedding_label:\n",
    "                if vector_count == 0:\n",
    "                    vector_length = len(items[2].split(\",\"))\n",
    "                vector_count += 1\n",
    "\n",
    "    # Second pass: write the w2v header, then one line per vector\n",
    "    with open(output_path, \"w\") as w2v_file:\n",
    "        w2v_file.write(\"{} {}\\n\".format(vector_count, vector_length))\n",
    "        with open(input_path, \"r\") as kgtk_file:\n",
    "            next(kgtk_file)\n",
    "            for line in kgtk_file:\n",
    "                items = line.split(\"\\t\")\n",
    "                qnode = items[0]\n",
    "                label = items[1]\n",
    "                if label == text_embedding_label:\n",
    "                    # items[2] is the last column, so it still ends with the line's newline\n",
    "                    vector = items[2].replace(\",\", \" \")\n",
    "                    w2v_file.write(qnode + \" \" + vector)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 111,
   "metadata": {},
   "outputs": [],
   "source": [
    "convert_kgtk_to_w2v(os.environ['TE'] + \"/text-embedding.tsv\", os.environ['TE'] + \"/embeddings.txt\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at the output file: the embeddings have 1024 dimensions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 146,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "56017 1024\n",
      "undirected_pagerank -0.42267796 0.3995441 0.5533569 -0.71286017 0.35639343 0.23904479 -0.2763573 0.37157294 -0.4283453 1.3224101 0.6862846 0.19590487 -0.6082015 -0.11240994 0.33890438 -0.20922732 -0.23069456 -0.021294963 -1.912606 0.49719235 0.6929876 0.011938913 -1.5600294 0.20473605 -0.17875122 0.45237 -0.09061487 0.0838695 0.039139077 -0.5781012 -0.2535121 0.065458305 -0.34608266 -0.42478928 -0.4474916 -0.23409875 -0.13160512 -0.076800026 -0.6984711 0.12516521 -0.42880625 -0.85138726 0.04815936 -0.6207587 -0.08866266 -1.6658425 -0.51067406 -0.34878105 0.33144328 -0.69933593 -0.36479193 -0.6388813 0.76048696 0.12395467 -0.88557744 0.34427696 1.2574033 -0.65131736 -0.9506962 0.6257681 0.36623836 0.716814 0.36953598 -1.3571995 0.2660646 -1.2076085 0.09180403 -0.36115 0.42118248 -0.92440283 -0.32160524 -0.14557533 -0.50016695 -0.12131537 -0.74813855 0.5254087 0.42912796 -0.73770857 -0.39519224 1.1647401 0.63930184 -0.33095387 -0.17238976 0.19148383 -0.31919938 -0.7583614 0.15933603 1.0313777 0.27520698 -0.4556464 -0.63495463 -0.1864288 0.6013224 0.637127 -0.07590211 0.7430643 0.06540778 -0.0065790126 0.44254926 0.27115446 0.37154993 0.022709582 -0.73920345 0.71504974 -0.04737445 0.3215596 0.14265373 0.0013700873 -0.67682695 0.42491677 0.9620013 -0.2962407 0.40307814 0.4662022 0.38908783 -0.6515235 -0.6724364 0.20429769 0.09426039 0.10870178 -0.50047547 -0.16897413 -0.29538417 0.18928146 0.87492365 0.13553919 -0.8622958 0.21274589 -0.683947 0.36040968 -0.3770436 0.03559924 0.11785667 0.0033670748 0.079977475 -0.460622 -0.922562 0.54822904 -0.7001525 -0.13735794 0.0046447627 0.93614495 -0.04533757 1.0877196 0.18663098 -0.33188298 -1.1195552 0.22625268 0.18178236 0.44003317 -0.035616595 -0.17230903 -0.39078838 0.09534323 -0.36450732 -0.13266148 -0.5948716 -0.3778122 0.115013696 -0.48863468 0.5276801 -0.10320456 0.17860238 0.5847855 -0.55870014 1.1700139 -0.8719531 0.2900501 -0.4467073 0.26552573 -0.36334535 0.0765188 -1.2428156 0.07730358 0.08907298 0.52686894 
-0.43270507 -1.400375 0.107771374 -0.81395435 -0.24545032 -0.26216444 -0.32014206 -0.35348052 -0.024345992 0.53140754 0.08466306 -0.57038295 -0.1269843 0.58409613 0.46116874 0.94535094 0.025036573 0.057027116 -0.68037903 1.0046511 -1.2596852 -0.037459765 -0.389251 -0.21985579 0.53391653 0.55650496 0.3328932 1.0321438 -0.16949745 0.61743855 -0.06628016 -0.2838724 -0.72551495 -0.032637402 -0.5673327 -0.0897552 -0.84946555 -0.75218916 0.7547705 0.83145154 -0.26083234 0.14909117 0.11596523 0.15905048 0.7511518 1.3206866 -0.06821178 0.79532903 -0.25254253 0.28651667 0.2536638 1.0395417 0.092335254 1.2873124 0.08776725 -0.05958847 -0.41424736 0.11005009 0.8274726 -0.51250714 -0.09145787 0.27819672 0.8735276 -0.5256038 0.26121446 0.08272835 0.39796406 -0.025718834 0.50356233 0.21068689 0.30204117 -0.30575705 -0.20718881 0.56285316 -0.5681627 0.69479936 0.19411525 0.25880888 0.47330382 -0.3539255 0.31446198 0.05105017 -0.107441604 -0.19249792 -0.39843526 -0.2551087 0.22434184 -0.2916974 -0.43394834 0.9704601 0.05666099 -0.69681704 -0.116564095 -0.03787969 0.27423766 -0.19120161 -0.92002064 0.21582173 -1.139993 0.39552695 -0.43537337 -0.16907048 0.5157604 0.4224562 0.5610382 -0.08036005 -1.4928522 -0.13146974 0.49898425 -1.0245981 0.1626403 -0.38850468 -0.8772544 0.18778482 -0.97421217 -0.29288915 0.6725434 -0.69844306 -0.14755279 -0.1968449 -0.86375725 -0.33360827 -0.10161168 -0.49888122 -0.33912677 0.43528208 -0.42569768 -0.20765932 -0.5381073 1.4305749 1.0162153 0.14457884 0.5763004 0.97068405 0.39098093 -0.03216348 -0.15244858 0.40377033 -0.18645048 0.9399603 0.076710895 0.5312454 -0.26848876 -0.46861956 -0.27942383 -0.6348347 0.294985 -0.40342814 1.0414813 1.0504925 -0.00836426 -0.99118257 -0.45631418 -0.7005619 0.404519 -0.10713773 -0.07559447 0.5544991 0.3827246 -0.55512184 -0.33234987 0.7993359 -0.079852566 0.35297632 0.5477561 0.22683053 0.5069918 -0.13029772 0.36162373 -0.014001881 -0.11648651 -0.66647947 0.01226069 -0.7284193 0.48086953 0.006934624 0.22385629 
0.08074516 0.29289985 0.61216664 -0.12032819 0.1659586 -0.2181752 0.15336005 0.4407084 -1.0953207 -0.9043968 0.21611574 -0.90479344 -0.73193157 -0.62168366 0.9956651 -0.090728715 0.3878589 0.38336518 0.2604782 -0.0650832 0.05577252 0.7666885 0.14315598 0.0359419 0.44156542 -0.15730822 -0.15735826 0.10081276 -0.45704198 0.3992815 1.0245506 1.4449844 0.50542 -0.88196254 0.62593013 -0.2081841 0.60960853 -0.66418105 0.8603846 0.61228853 -1.2286749 -0.20330366 -1.0320998 1.198905 -0.16238491 0.17897743 0.16847304 0.42968208 0.1755085 0.34175223 0.49665308 -0.40418386 0.5926915 -0.6081441 1.0003483 0.3905947 -0.30414084 -0.34114298 0.8547739 -0.4670201 -0.23203468 0.5805412 0.40133566 -0.94826126 -0.23078169 -0.28718835 0.1264745 -0.70524764 0.508715 -0.024303429 0.3079768 0.98509324 0.19859965 -0.2700488 -0.50697654 -0.1804381 -0.3221201 0.22992785 -0.11842905 0.2621886 0.17650005 0.1401335 0.5725611 0.14143167 0.015926411 -0.12371779 -0.61506104 -0.61483264 -0.570195 -0.13236725 -0.11800632 -0.10830958 0.025182672 0.8578056 0.977953 -0.0059525445 -0.39955533 1.127108 -0.4665609 -0.03740844 -0.94570136 -0.1651189 -0.7827557 0.369654 0.20145196 0.50588286 -0.6361171 -0.7590097 -0.21335843 -0.5173786 0.97785115 0.47440884 1.2242765 -1.0599612 0.49780983 1.008144 -0.33477965 0.5589736 0.9486828 -0.07865547 0.82441354 -0.28226215 0.01269538 0.22909257 -1.2406305 0.74198633 0.019226547 -0.033761285 -0.25049102 -0.27017456 0.5518724 -1.0744305 -0.90507793 -0.16111492 -0.5462715 -1.9933928 0.031789362 -1.4327815 0.055561084 0.5697889 -0.5664057 -0.6227874 0.21851781 -0.726629 -1.1050928 0.1555212 -0.13036552 1.5256817 0.0031437278 -0.34641874 -0.26029167 0.2586624 0.21606264 -0.5991851 -0.5353387 -0.013069849 0.12415337 -0.59378207 -0.05707953 1.0167447 -0.41405144 -0.2853063 0.39441592 0.62434036 -0.38296816 0.015720915 1.1869724 0.7920963 0.1103225 -0.19993234 0.867546 0.67698205 1.1679859 1.0601817 -0.32352704 0.22812766 0.99878913 0.14075853 0.22087446 0.38174963 
0.63968056 -0.63889086 0.8546627 0.5452647 0.31812298 -0.11800851 -0.6306626 -0.6350914 0.5565482 -0.7874143 0.22039914 -0.5172571 0.3113776 -0.27728507 0.20026723 -1.2695498 0.043180633 0.98999727 -0.24016514 0.7123504 -0.40300757 -0.7502448 -0.8941951 -0.19905064 1.9472562 0.703084 -0.44553536 -1.5692897 -0.363004 -0.07558155 1.7863687 -0.22492197 -0.25773934 -1.5538926 -0.36908916 0.24482231 -1.495694 0.51339495 0.5043237 -0.4106086 1.9655912 0.34972793 -1.0941802 -0.744956 -1.571301 -0.38214844 0.24033594 0.37885264 0.867155 0.6672241 -0.01693214 1.1466063 -0.5114372 0.72631586 0.38685834 -0.00609982 0.918031 -1.0576688 0.68399566 -0.7276541 -1.6443924 -0.22547406 0.28392553 -0.3197943 0.4078551 -0.7731335 -0.32600537 -0.8067985 -0.23840523 0.3526526 0.0196101 0.25087988 -0.6417036 0.005255079 0.21949208 -0.12147077 0.062054902 -0.5454072 1.025671 0.38807088 0.83292055 0.14208733 0.10787519 -0.05181068 0.27549422 -0.87673503 0.29951787 0.4675076 0.7174594 -0.527458 -1.0612055 -0.73938656 0.10550579 0.28773528 -0.5872211 0.7858924 0.8159002 0.518082 -0.63988984 0.072944984 0.26428187 -0.8011928 0.85742646 -0.6546526 -0.93099636 -0.57665247 0.023779552 1.1399913 -0.06637773 0.40282077 -0.9426894 -0.6185797 -0.09437606 0.5359475 0.022806503 -1.2509018 -0.05353026 -0.18726254 1.3856194 0.25013503 -0.27004337 -0.8613362 -0.6058942 0.21644488 -0.020496178 -0.35646865 -0.06542515 -0.11639291 0.7153526 -0.1760036 0.7813124 0.93504244 -0.23096421 -0.1552721 -0.69693065 0.308117 -0.7010237 -0.28066248 -0.21433288 -0.67217493 0.7867059 0.068477064 -0.57168525 0.012380041 -0.17970753 0.31171468 -0.63663334 -0.023489561 0.22867082 -0.33117527 -0.32161456 -0.18029884 0.4430051 -0.15684946 -0.32500783 0.24891087 -0.37589657 0.1752151 -0.7131431 -0.11198734 -1.0265784 -0.82821333 -0.9937131 -0.04920406 0.2835452 -0.5676211 -0.593093 -0.410075 1.022616 1.6055924 -0.53110176 -0.6283989 -0.049254365 -0.97321147 -0.00038947538 0.519022 -0.894111 0.016800117 -0.5091581 -0.35818344 
-0.55171865 -0.42846614 -0.10952275 0.4071202 -0.3670231 0.7691647 0.735392 0.28780562 0.5646238 -0.23212996 -0.32656664 -0.73763084 -0.32413647 -0.6763478 -0.29096603 -0.3797785 0.40527463 0.08826317 -0.26290894 0.8125853 -0.56574816 -0.5180119 0.33959463 0.27818117 -0.42889327 0.66216576 0.30071586 0.043642543 0.9566169 -0.7295776 -0.6970514 0.06682913 -0.11611781 1.3372544 -0.7711051 -0.27622965 0.07858875 -0.18716207 -0.21521975 0.21165168 -0.14572033 -0.23844214 0.20200655 1.3710401 0.6067855 1.481676 1.573426 0.60474557 0.40126243 0.3611929 -0.4031999 0.56728536 -0.026211482 0.3288062 -0.691287 -0.09511359 2.0640354 -0.35376358 -0.14619523 -1.1336256 -0.4286315 -0.53594714 0.095278636 -0.04674165 -0.5994138 0.7946129 -1.098087 0.3902552 -0.36271507 0.5038213 0.75229025 0.4611937 0.0022006333 0.41274896 0.63416564 -0.83857703 0.32325786 -0.11804989 -0.4368401 0.019128636 0.28285143 0.43789893 -0.13059512 0.7616387 -0.13585262 0.2664371 0.72596914 0.6382323 -0.37144414 0.5277119 0.35573763 0.1688681 1.1595916 0.0906278 -0.4178283 -1.0203297 -0.088457964 -0.2315415 -0.20515415 0.36526158 0.29821527 -0.736996 -0.4478651 1.1028807 -0.89644456 -0.41372925 -1.2328763 -0.4640182 0.5761474 -0.27844954 0.31586835 0.015641235 0.8092839 -0.9372387 0.9934972 -0.1745011 -1.0877256 0.4443961 0.6014369 0.077761345 -0.084602356 0.19059552 -0.35350552 -1.0678735 -1.0453316 0.27547592 0.9063827 0.06397402 0.18907769 -0.8156636 -0.9964015 -0.21612515 0.37872353 0.09812939 0.2623325 0.6650963 -0.5505926 -0.41475606 -0.15147823 0.3543966 -0.32942224 -0.5251814 0.075754255 0.40572733 1.5484574 0.44342157 0.40193808 0.4500907 0.6327993 0.33049652 -0.27509007 -0.40771475 0.59853494 -0.07888409 -0.615096 0.2346958 0.1482316 -0.6923686 -0.5850022 -0.27653936 -0.65077204 -0.36599004 0.5476607 0.026976332 1.5865514 0.5274796 0.6814955 -0.65799254 0.878574 -0.12011457 0.8211617 -0.77377725 1.2645822 0.12579253 -0.27432328 -0.54965115 -1.0023885 -0.79174185 -1.0093133 1.2638149 -1.506173 
0.1553447 -0.109686226 -0.86125576 0.58364683 -0.011541486 -0.6712913 0.7731098 1.8426316 -1.6172205 0.34361455 1.1704623 0.79213065 -0.84654135 0.7066611 0.44231334 0.426677 0.6324037 -0.034689866 -0.46297157 0.5192616 -1.0597641 -0.26437494 0.97451335 -0.22925322 0.19796279 -1.3983903 -0.13544144 -0.47686645 0.11490105 0.14934665 1.0109042 -0.5283708 0.26576567 1.0581495 -1.4754385 3.4011955 0.9815066 0.5532428 0.8212732 -0.15424214 0.24511632 -0.05958248 0.27007103 -0.37704584 -0.9525652 -0.3379977 -0.11389114 0.35535258 0.40745777 -0.91808265 -0.050311718 -0.17617881 -0.64855754 -0.31649047 -0.36157423 -1.325951 -0.39904866 -0.022547163 -0.27759802 -0.02079327 1.000595 1.2254262 0.067013234 -0.4022824 -0.44132692 0.0016483366 -0.46476963 0.50223863 0.11289063 0.47185534 -0.11404542 -1.0319462 -0.5292204 0.569465 0.32238537 -0.33776748 -0.11816757 -0.16153826 -0.428957 -0.572226 0.95514035 -0.929618 0.050701976 -0.2523394 -0.5244093 -0.68299943 -0.018993558 0.14790796 -0.40893796 -1.3917955 0.32277173 -0.09575469 0.5825725 0.7617345 -0.09229246 0.18082699 0.36071482 0.1524885 -0.023177393 0.21314728 0.72768265 -0.019122198 1.0425646 -1.3852326 -0.40800434 0.5618238 -0.8775127 -0.02481873 -0.30698693 0.7367337 0.7375277 -0.5189366 0.30280325 0.9436593 0.83178425 0.38313845 0.665535 0.29839128 -0.9412593 -0.37495625 0.6025321 0.261773 -0.31901595 0.17819108 0.3722279 -0.45606178 0.507588 0.574316 -0.56879294 -0.49606207\n"
     ]
    }
   ],
   "source": [
    "!head -2 \"$TE\"/embeddings.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Load the text embeddings into gensim as `KeyedVectors`, reading them from a word2vec-format text file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 148,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The text embeddings are stored in word2vec text format, hence binary=False\n",
    "te_path = os.environ['TE'] + \"/text-embedding.w2v.txt\"\n",
    "te_vectors = KeyedVectors.load_word2vec_format(te_path, binary=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Compare the graph and text embeddings"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Most similar nodes to Johnnie Walker using the **graph embeddings**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 86,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>node1</th>\n",
       "      <th>node2</th>\n",
       "      <th>label</th>\n",
       "      <th>node1;label</th>\n",
       "      <th>node1;description</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Q4865371</td>\n",
       "      <td>0.833085298538208</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Bartlet for America'@en</td>\n",
       "      <td>'episode of The West Wing (S3 E9)'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Q7084279</td>\n",
       "      <td>0.8258047103881836</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Old Ironsides'@en</td>\n",
       "      <td>'1926 film by James Cruze'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Q7736602</td>\n",
       "      <td>0.8078582286834717</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'The Girl of the Golden West'@en</td>\n",
       "      <td>'1930 film by John Francis Dillon'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Q1799948</td>\n",
       "      <td>0.8060345649719238</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Ladies of Leisure'@en</td>\n",
       "      <td>'1930 film by Frank Capra'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Q2288328</td>\n",
       "      <td>0.8006598949432373</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'The Matinee Idol'@en</td>\n",
       "      <td>'1928 film by Walt Disney, Frank Capra'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Q628737</td>\n",
       "      <td>0.7132620811462402</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Campbeltown Single Malts'@en</td>\n",
       "      <td>'single malt Scotch whiskies distilled in the ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Q280</td>\n",
       "      <td>0.6832661032676697</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Lagavulin Distillery'@en</td>\n",
       "      <td>'Scotch whisky distillery in Lagavulin, Islay,...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Q1761185</td>\n",
       "      <td>0.6419662237167358</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Pimm\\\\\\\\'s'@en</td>\n",
       "      <td>'alcohol brand'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Q96278979</td>\n",
       "      <td>0.6371052861213684</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Lagavulin 16 years whisky'@en</td>\n",
       "      <td>'Lagavulin 16 years single malt scotch whisky'@en</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       node1               node2       label  \\\n",
       "0   Q4865371   0.833085298538208  similarity   \n",
       "1   Q7084279  0.8258047103881836  similarity   \n",
       "2   Q7736602  0.8078582286834717  similarity   \n",
       "3   Q1799948  0.8060345649719238  similarity   \n",
       "4   Q2288328  0.8006598949432373  similarity   \n",
       "5    Q628737  0.7132620811462402  similarity   \n",
       "6       Q280  0.6832661032676697  similarity   \n",
       "7   Q1761185  0.6419662237167358  similarity   \n",
       "8  Q96278979  0.6371052861213684  similarity   \n",
       "\n",
       "                        node1;label  \\\n",
       "0          'Bartlet for America'@en   \n",
       "1                'Old Ironsides'@en   \n",
       "2  'The Girl of the Golden West'@en   \n",
       "3            'Ladies of Leisure'@en   \n",
       "4             'The Matinee Idol'@en   \n",
       "5     'Campbeltown Single Malts'@en   \n",
       "6         'Lagavulin Distillery'@en   \n",
       "7                   'Pimm\\\\\\\\'s'@en   \n",
       "8    'Lagavulin 16 years whisky'@en   \n",
       "\n",
       "                                   node1;description  \n",
       "0              'episode of The West Wing (S3 E9)'@en  \n",
       "1                      '1926 film by James Cruze'@en  \n",
       "2              '1930 film by John Francis Dillon'@en  \n",
       "3                      '1930 film by Frank Capra'@en  \n",
       "4         '1928 film by Walt Disney, Frank Capra'@en  \n",
       "5  'single malt Scotch whiskies distilled in the ...  \n",
       "6  'Scotch whisky distillery in Lagavulin, Islay,...  \n",
       "7                                 'alcohol brand'@en  \n",
       "8  'Lagavulin 16 years single malt scotch whisky'@en  "
      ]
     },
     "execution_count": 86,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Q502268 is Johnnie Walker\n",
    "kgtk_most_similar(ge_vectors, positive=['Q502268'], kg_path=os.environ['OUT'] + \"/parts\", topn=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Most similar nodes to Johnnie Walker using the **text embeddings**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 150,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>node1</th>\n",
       "      <th>node2</th>\n",
       "      <th>label</th>\n",
       "      <th>node1;label</th>\n",
       "      <th>node1;description</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Q280</td>\n",
       "      <td>0.9379171133041382</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Lagavulin Distillery'@en</td>\n",
       "      <td>'Scotch whisky distillery in Lagavulin, Islay,...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Q2490031</td>\n",
       "      <td>0.9346836805343628</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'William Grant &amp; Sons'@en</td>\n",
       "      <td>'Scottish company which distills Scotch whisky...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Q1543646</td>\n",
       "      <td>0.9012988805770874</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Rob Roy'@en</td>\n",
       "      <td>'cocktail based on Scotch whisky'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Q2168523</td>\n",
       "      <td>0.8907997012138367</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'The Famous Grouse'@en</td>\n",
       "      <td>'brand of Scotch whisky'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Q1069502</td>\n",
       "      <td>0.8856703042984009</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Chivas Regal'@en</td>\n",
       "      <td>'Blended Scotch Whisky produced by Chivas Brot...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Q4821838</td>\n",
       "      <td>0.8762272596359253</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Aultmore distillery'@en</td>\n",
       "      <td>'whisky distillery in Moray, Scotland, UK'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Q4720319</td>\n",
       "      <td>0.8761684894561768</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Alexander Walker'@en</td>\n",
       "      <td>'Scottish whisky distiller'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Q1754978</td>\n",
       "      <td>0.8664095401763916</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Rusty Nail'@en</td>\n",
       "      <td>'cocktail mixing Drambuie and Scotch whisky'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Q42032478</td>\n",
       "      <td>0.8583760857582092</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Tiree Whisky Company'@en</td>\n",
       "      <td>'company that sells whisky on the island of Ti...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Q20031443</td>\n",
       "      <td>0.8488548994064331</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Something Special'@en</td>\n",
       "      <td>'blended Scotch whisky'@en</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       node1               node2       label                node1;label  \\\n",
       "0       Q280  0.9379171133041382  similarity  'Lagavulin Distillery'@en   \n",
       "1   Q2490031  0.9346836805343628  similarity  'William Grant & Sons'@en   \n",
       "2   Q1543646  0.9012988805770874  similarity               'Rob Roy'@en   \n",
       "3   Q2168523  0.8907997012138367  similarity     'The Famous Grouse'@en   \n",
       "4   Q1069502  0.8856703042984009  similarity          'Chivas Regal'@en   \n",
       "5   Q4821838  0.8762272596359253  similarity   'Aultmore distillery'@en   \n",
       "6   Q4720319  0.8761684894561768  similarity      'Alexander Walker'@en   \n",
       "7   Q1754978  0.8664095401763916  similarity            'Rusty Nail'@en   \n",
       "8  Q42032478  0.8583760857582092  similarity  'Tiree Whisky Company'@en   \n",
       "9  Q20031443  0.8488548994064331  similarity     'Something Special'@en   \n",
       "\n",
       "                                   node1;description  \n",
       "0  'Scotch whisky distillery in Lagavulin, Islay,...  \n",
       "1  'Scottish company which distills Scotch whisky...  \n",
       "2               'cocktail based on Scotch whisky'@en  \n",
       "3                        'brand of Scotch whisky'@en  \n",
       "4  'Blended Scotch Whisky produced by Chivas Brot...  \n",
       "5      'whisky distillery in Moray, Scotland, UK'@en  \n",
       "6                     'Scottish whisky distiller'@en  \n",
       "7    'cocktail mixing Drambuie and Scotch whisky'@en  \n",
       "8  'company that sells whisky on the island of Ti...  \n",
       "9                         'blended Scotch whisky'@en  "
      ]
     },
     "execution_count": 150,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Q502268 is Johnnie Walker\n",
    "kgtk_most_similar(te_vectors, positive=['Q502268'], kg_path=os.environ['OUT'] + \"/parts\", topn=10)"
   ]
  },
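  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The graph embeddings produce poor results here: the top five matches are films and a TV episode, and whisky-related nodes appear only from the sixth match onward. The text embeddings look much better, as every match is related to Scotch whisky."
   ]
  },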
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Most similar nodes to Michelob using the **graph embeddings**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 152,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>node1</th>\n",
       "      <th>node2</th>\n",
       "      <th>label</th>\n",
       "      <th>node1;label</th>\n",
       "      <th>node1;description</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Q610672</td>\n",
       "      <td>0.9267997741699219</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Budweiser'@en</td>\n",
       "      <td>'brand of pale lager'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Q48799234</td>\n",
       "      <td>0.7637178897857666</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Virginia Black Whiskey'@en</td>\n",
       "      <td>'super-premium brand of American Bourbon whisk...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Q85269976</td>\n",
       "      <td>0.762772262096405</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Busch Beer'@en</td>\n",
       "      <td>'brand of beer owned by Anheuser-Busch'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Q5149389</td>\n",
       "      <td>0.7565429210662842</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Colt 45'@en</td>\n",
       "      <td>'malt liquor'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Q3079990</td>\n",
       "      <td>0.752647340297699</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Four Loko'@en</td>\n",
       "      <td>'Drink'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Q96952363</td>\n",
       "      <td>0.7438719272613525</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Cronk'@en</td>\n",
       "      <td>'American drink'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Q7085533</td>\n",
       "      <td>0.7436875104904175</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Olde English 800'@en</td>\n",
       "      <td>'malt liquor'@en</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       node1               node2       label                  node1;label  \\\n",
       "0    Q610672  0.9267997741699219  similarity               'Budweiser'@en   \n",
       "1  Q48799234  0.7637178897857666  similarity  'Virginia Black Whiskey'@en   \n",
       "2  Q85269976   0.762772262096405  similarity              'Busch Beer'@en   \n",
       "3   Q5149389  0.7565429210662842  similarity                 'Colt 45'@en   \n",
       "4   Q3079990   0.752647340297699  similarity               'Four Loko'@en   \n",
       "5  Q96952363  0.7438719272613525  similarity                   'Cronk'@en   \n",
       "6   Q7085533  0.7436875104904175  similarity        'Olde English 800'@en   \n",
       "\n",
       "                                   node1;description  \n",
       "0                           'brand of pale lager'@en  \n",
       "1  'super-premium brand of American Bourbon whisk...  \n",
       "2         'brand of beer owned by Anheuser-Busch'@en  \n",
       "3                                   'malt liquor'@en  \n",
       "4                                         'Drink'@en  \n",
       "5                                'American drink'@en  \n",
       "6                                   'malt liquor'@en  "
      ]
     },
     "execution_count": 152,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Q15874936 is Michelob\n",
    "kgtk_most_similar(ge_vectors, positive=['Q15874936'], kg_path=os.environ['OUT'] + \"/parts\", topn=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Most similar nodes to Michelob using the **text embeddings**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 149,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>node1</th>\n",
       "      <th>node2</th>\n",
       "      <th>label</th>\n",
       "      <th>node1;label</th>\n",
       "      <th>node1;description</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Q2011473</td>\n",
       "      <td>0.9664472341537476</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Fantôme'@en</td>\n",
       "      <td>'brand of beer'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Q3315575</td>\n",
       "      <td>0.9586231708526611</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Bersalis'@en</td>\n",
       "      <td>'beer brand'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Q3518554</td>\n",
       "      <td>0.9563601016998291</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Floris'@en</td>\n",
       "      <td>'beer brand'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Q15076069</td>\n",
       "      <td>0.9531255960464478</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Marckloff'@en</td>\n",
       "      <td>'beer brand'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Q1277388</td>\n",
       "      <td>0.9511646628379822</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Pripps Blå'@en</td>\n",
       "      <td>'beer brand'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Q1917255</td>\n",
       "      <td>0.9475076794624329</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'St-Idesbald'@en</td>\n",
       "      <td>'beer'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Q263980</td>\n",
       "      <td>0.9443504810333252</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Soproni'@en</td>\n",
       "      <td>'beer mark'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Q3337782</td>\n",
       "      <td>0.9438232779502869</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Carrousel'@en</td>\n",
       "      <td>'Beer'@en</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       node1               node2       label       node1;label  \\\n",
       "0   Q2011473  0.9664472341537476  similarity      'Fantôme'@en   \n",
       "1   Q3315575  0.9586231708526611  similarity     'Bersalis'@en   \n",
       "2   Q3518554  0.9563601016998291  similarity       'Floris'@en   \n",
       "3  Q15076069  0.9531255960464478  similarity    'Marckloff'@en   \n",
       "4   Q1277388  0.9511646628379822  similarity   'Pripps Blå'@en   \n",
       "5   Q1917255  0.9475076794624329  similarity  'St-Idesbald'@en   \n",
       "6    Q263980  0.9443504810333252  similarity      'Soproni'@en   \n",
       "7   Q3337782  0.9438232779502869  similarity    'Carrousel'@en   \n",
       "\n",
       "    node1;description  \n",
       "0  'brand of beer'@en  \n",
       "1     'beer brand'@en  \n",
       "2     'beer brand'@en  \n",
       "3     'beer brand'@en  \n",
       "4     'beer brand'@en  \n",
       "5           'beer'@en  \n",
       "6      'beer mark'@en  \n",
       "7           'Beer'@en  "
      ]
     },
     "execution_count": 149,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Q15874936 is Michelob\n",
    "kgtk_most_similar(te_vectors, positive=['Q15874936'], kg_path=os.environ['OUT'] + \"/parts\", topn=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The graph embeddings contain some bad results (for example, Virginia Black Whiskey), but the top matches are better because they include beers that are more closely related to Michelob, such as Budweiser and Busch Beer. The text embeddings are reasonable, as they include only beers."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Most similar nodes to vodka using the **graph embeddings**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 153,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>node1</th>\n",
       "      <th>node2</th>\n",
       "      <th>label</th>\n",
       "      <th>node1;label</th>\n",
       "      <th>node1;description</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Q20577688</td>\n",
       "      <td>0.8814862966537476</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'.vodka'@en</td>\n",
       "      <td>'top-level Internet domain'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Q7468032</td>\n",
       "      <td>0.8503187894821167</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Vodka'@en</td>\n",
       "      <td>'Detective Conan character'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Q11328065</td>\n",
       "      <td>0.8384641408920288</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Balalaika'@en</td>\n",
       "      <td>'Japanese short drink, cocktail'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Q21189725</td>\n",
       "      <td>0.8248207569122314</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Red Eye Louie\\\\\\\\'s Vodquila'@en</td>\n",
       "      <td>'blend of vodka and tequila'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Q2206588</td>\n",
       "      <td>0.8186914920806885</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Caipiroska'@en</td>\n",
       "      <td>'cocktail prepared with vodka'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Q920412</td>\n",
       "      <td>0.8170762062072754</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Belvédère'@en</td>\n",
       "      <td>'French wine and spirits producer and distribu...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Q7151801</td>\n",
       "      <td>0.8166672587394714</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Category:Vodkas'@en</td>\n",
       "      <td>'Wikimedia category'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Q23712704</td>\n",
       "      <td>0.8152912855148315</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'EB-11 / Vodka'@en</td>\n",
       "      <td>'encyclopedic article'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Q1539525</td>\n",
       "      <td>0.8101651668548584</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Stolichnaya'@en</td>\n",
       "      <td>'vodka brand'@en</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       node1               node2       label  \\\n",
       "0  Q20577688  0.8814862966537476  similarity   \n",
       "1   Q7468032  0.8503187894821167  similarity   \n",
       "2  Q11328065  0.8384641408920288  similarity   \n",
       "3  Q21189725  0.8248207569122314  similarity   \n",
       "4   Q2206588  0.8186914920806885  similarity   \n",
       "5    Q920412  0.8170762062072754  similarity   \n",
       "6   Q7151801  0.8166672587394714  similarity   \n",
       "7  Q23712704  0.8152912855148315  similarity   \n",
       "8   Q1539525  0.8101651668548584  similarity   \n",
       "\n",
       "                         node1;label  \\\n",
       "0                        '.vodka'@en   \n",
       "1                         'Vodka'@en   \n",
       "2                     'Balalaika'@en   \n",
       "3  'Red Eye Louie\\\\\\\\'s Vodquila'@en   \n",
       "4                    'Caipiroska'@en   \n",
       "5                     'Belvédère'@en   \n",
       "6               'Category:Vodkas'@en   \n",
       "7                 'EB-11 / Vodka'@en   \n",
       "8                   'Stolichnaya'@en   \n",
       "\n",
       "                                   node1;description  \n",
       "0                     'top-level Internet domain'@en  \n",
       "1                     'Detective Conan character'@en  \n",
       "2                'Japanese short drink, cocktail'@en  \n",
       "3                    'blend of vodka and tequila'@en  \n",
       "4                  'cocktail prepared with vodka'@en  \n",
       "5  'French wine and spirits producer and distribu...  \n",
       "6                            'Wikimedia category'@en  \n",
       "7                          'encyclopedic article'@en  \n",
       "8                                   'vodka brand'@en  "
      ]
     },
     "execution_count": 153,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Q374 is vodka\n",
    "kgtk_most_similar(ge_vectors, positive=['Q374'], kg_path=os.environ['OUT'] + \"/parts\", topn=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Most similar nodes to vodka using the **text embeddings**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 154,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>node1</th>\n",
       "      <th>node2</th>\n",
       "      <th>label</th>\n",
       "      <th>node1;label</th>\n",
       "      <th>node1;description</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Q4869283</td>\n",
       "      <td>0.9598516225814819</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Batini'@en</td>\n",
       "      <td>'vodka-based cocktail'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Q3562046</td>\n",
       "      <td>0.9595369100570679</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Vodka Stinger'@en</td>\n",
       "      <td>'type of cocktail'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Q2206588</td>\n",
       "      <td>0.943680465221405</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Caipiroska'@en</td>\n",
       "      <td>'cocktail prepared with vodka'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Q22236238</td>\n",
       "      <td>0.9384630918502808</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Mariette'@en</td>\n",
       "      <td>'vodka, alcohol'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Q7939317</td>\n",
       "      <td>0.9203515648841858</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Vodka Cruiser'@en</td>\n",
       "      <td>'brand of vodka-based alcoholic drink'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Q11802565</td>\n",
       "      <td>0.9155371189117432</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Pan Tadeusz'@en</td>\n",
       "      <td>'brand of vodka'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Q268057</td>\n",
       "      <td>0.9129104614257812</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'cosmopolitan'@en</td>\n",
       "      <td>'cocktail made with vodka'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Q4782617</td>\n",
       "      <td>0.9107505679130554</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Aqua Velva'@en</td>\n",
       "      <td>'vodka and gin based cocktail'@en</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       node1               node2       label         node1;label  \\\n",
       "0   Q4869283  0.9598516225814819  similarity         'Batini'@en   \n",
       "1   Q3562046  0.9595369100570679  similarity  'Vodka Stinger'@en   \n",
       "2   Q2206588   0.943680465221405  similarity     'Caipiroska'@en   \n",
       "3  Q22236238  0.9384630918502808  similarity       'Mariette'@en   \n",
       "4   Q7939317  0.9203515648841858  similarity  'Vodka Cruiser'@en   \n",
       "5  Q11802565  0.9155371189117432  similarity    'Pan Tadeusz'@en   \n",
       "6    Q268057  0.9129104614257812  similarity   'cosmopolitan'@en   \n",
       "7   Q4782617  0.9107505679130554  similarity     'Aqua Velva'@en   \n",
       "\n",
       "                           node1;description  \n",
       "0                  'vodka-based cocktail'@en  \n",
       "1                      'type of cocktail'@en  \n",
       "2          'cocktail prepared with vodka'@en  \n",
       "3                        'vodka, alcohol'@en  \n",
       "4  'brand of vodka-based alcoholic drink'@en  \n",
       "5                        'brand of vodka'@en  \n",
       "6              'cocktail made with vodka'@en  \n",
       "7          'vodka and gin based cocktail'@en  "
      ]
     },
     "execution_count": 154,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Q374 is vodka\n",
    "kgtk_most_similar(te_vectors, positive=['Q374'], kg_path=os.environ['OUT'] + \"/parts\", topn=10)"
   ]
  },
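  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The graph embeddings are noisy: the top matches include nodes unrelated to the drink, such as an Internet top-level domain and a fictional character. The text embeddings look much better, as the matches shown are all vodka brands or vodka-based cocktails."
   ]
  },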
  {
   "cell_type": "code",
   "execution_count": 211,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>node1</th>\n",
       "      <th>node2</th>\n",
       "      <th>label</th>\n",
       "      <th>node1;label</th>\n",
       "      <th>node1;description</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Q9676</td>\n",
       "      <td>0.8613677024841309</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Isle of Man'@en</td>\n",
       "      <td>'British Crown dependency'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Q1263077</td>\n",
       "      <td>0.8335838317871094</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'DAA'@en</td>\n",
       "      <td>'company that owns and operates Dublin Airport...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Q4368623</td>\n",
       "      <td>0.8250888586044312</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Category:Republic of Ireland'@en</td>\n",
       "      <td>'Wikimedia category'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Q164421</td>\n",
       "      <td>0.8058757781982422</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Connacht'@en</td>\n",
       "      <td>'province in Ireland'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Q184760</td>\n",
       "      <td>0.8017445802688599</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'County Monaghan'@en</td>\n",
       "      <td>'county in Ireland'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Q178283</td>\n",
       "      <td>0.7986090183258057</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'County Limerick'@en</td>\n",
       "      <td>'county in Ireland'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Q186220</td>\n",
       "      <td>0.7974875569343567</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'County Longford'@en</td>\n",
       "      <td>'county in Ireland'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Q184594</td>\n",
       "      <td>0.7974545359611511</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'County Waterford'@en</td>\n",
       "      <td>'county in Ireland'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Q93195</td>\n",
       "      <td>0.793678879737854</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Ulster'@en</td>\n",
       "      <td>'province in Ireland'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Q187402</td>\n",
       "      <td>0.788328230381012</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'County Cavan'@en</td>\n",
       "      <td>'county in Ireland'@en</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      node1               node2       label  \\\n",
       "0     Q9676  0.8613677024841309  similarity   \n",
       "1  Q1263077  0.8335838317871094  similarity   \n",
       "2  Q4368623  0.8250888586044312  similarity   \n",
       "3   Q164421  0.8058757781982422  similarity   \n",
       "4   Q184760  0.8017445802688599  similarity   \n",
       "5   Q178283  0.7986090183258057  similarity   \n",
       "6   Q186220  0.7974875569343567  similarity   \n",
       "7   Q184594  0.7974545359611511  similarity   \n",
       "8    Q93195   0.793678879737854  similarity   \n",
       "9   Q187402   0.788328230381012  similarity   \n",
       "\n",
       "                         node1;label  \\\n",
       "0                   'Isle of Man'@en   \n",
       "1                           'DAA'@en   \n",
       "2  'Category:Republic of Ireland'@en   \n",
       "3                      'Connacht'@en   \n",
       "4               'County Monaghan'@en   \n",
       "5               'County Limerick'@en   \n",
       "6               'County Longford'@en   \n",
       "7              'County Waterford'@en   \n",
       "8                        'Ulster'@en   \n",
       "9                  'County Cavan'@en   \n",
       "\n",
       "                                   node1;description  \n",
       "0                      'British Crown dependency'@en  \n",
       "1  'company that owns and operates Dublin Airport...  \n",
       "2                            'Wikimedia category'@en  \n",
       "3                           'province in Ireland'@en  \n",
       "4                             'county in Ireland'@en  \n",
       "5                             'county in Ireland'@en  \n",
       "6                             'county in Ireland'@en  \n",
       "7                             'county in Ireland'@en  \n",
       "8                           'province in Ireland'@en  \n",
       "9                             'county in Ireland'@en  "
      ]
     },
     "execution_count": 211,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Q27 Ireland\n",
    "kgtk_most_similar(ge_vectors, positive=['Q27'], kg_path=os.environ['OUT'] + \"/parts\", topn=10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 210,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>node1</th>\n",
       "      <th>node2</th>\n",
       "      <th>label</th>\n",
       "      <th>node1;label</th>\n",
       "      <th>node1;description</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Q191</td>\n",
       "      <td>0.7959819436073303</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Estonia'@en</td>\n",
       "      <td>'sovereign state in Northern Europe'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Q37</td>\n",
       "      <td>0.7896063327789307</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Lithuania'@en</td>\n",
       "      <td>'sovereign state in Northeastern Europe'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Q34</td>\n",
       "      <td>0.7771986722946167</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Sweden'@en</td>\n",
       "      <td>'sovereign state in Northern Europe'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Q35</td>\n",
       "      <td>0.7717932462692261</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Denmark'@en</td>\n",
       "      <td>'sovereign state and Scandinavian country in n...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Q756617</td>\n",
       "      <td>0.7578498125076294</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Kingdom of Denmark'@en</td>\n",
       "      <td>'sovereign unitary state in Europe, the Arctic...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Q33</td>\n",
       "      <td>0.7564055919647217</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Finland'@en</td>\n",
       "      <td>'sovereign state in Northern Europe'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Q16965019</td>\n",
       "      <td>0.7521861791610718</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'North borough of Brescia'@en</td>\n",
       "      <td>'one of 5 boroughs of Brescia'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Q1526538</td>\n",
       "      <td>0.7520326972007751</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Reykjavík North'@en</td>\n",
       "      <td>'one of the six constituencies (kjördæmi) of I...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Q189</td>\n",
       "      <td>0.7486690282821655</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Iceland'@en</td>\n",
       "      <td>'sovereign state in Northern Europe, situated ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Q22</td>\n",
       "      <td>0.7369431257247925</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Scotland'@en</td>\n",
       "      <td>'country in Northwest Europe, part of the Unit...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       node1               node2       label                    node1;label  \\\n",
       "0       Q191  0.7959819436073303  similarity                   'Estonia'@en   \n",
       "1        Q37  0.7896063327789307  similarity                 'Lithuania'@en   \n",
       "2        Q34  0.7771986722946167  similarity                    'Sweden'@en   \n",
       "3        Q35  0.7717932462692261  similarity                   'Denmark'@en   \n",
       "4    Q756617  0.7578498125076294  similarity        'Kingdom of Denmark'@en   \n",
       "5        Q33  0.7564055919647217  similarity                   'Finland'@en   \n",
       "6  Q16965019  0.7521861791610718  similarity  'North borough of Brescia'@en   \n",
       "7   Q1526538  0.7520326972007751  similarity           'Reykjavík North'@en   \n",
       "8       Q189  0.7486690282821655  similarity                   'Iceland'@en   \n",
       "9        Q22  0.7369431257247925  similarity                  'Scotland'@en   \n",
       "\n",
       "                                   node1;description  \n",
       "0            'sovereign state in Northern Europe'@en  \n",
       "1        'sovereign state in Northeastern Europe'@en  \n",
       "2            'sovereign state in Northern Europe'@en  \n",
       "3  'sovereign state and Scandinavian country in n...  \n",
       "4  'sovereign unitary state in Europe, the Arctic...  \n",
       "5            'sovereign state in Northern Europe'@en  \n",
       "6                  'one of 5 boroughs of Brescia'@en  \n",
       "7  'one of the six constituencies (kjördæmi) of I...  \n",
       "8  'sovereign state in Northern Europe, situated ...  \n",
       "9  'country in Northwest Europe, part of the Unit...  "
      ]
     },
     "execution_count": 210,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Q27 Ireland\n",
    "kgtk_most_similar(te_vectors, positive=['Q27'], kg_path=os.environ['OUT'] + \"/parts\", topn=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Using the embeddings in queries to the KG"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 164,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Q281 whiskey\n",
    "# Q282 wine\n",
    "# Q3246609 mixed drink\n",
    "# Q374 vodka\n",
    "# Q332378 is absolut"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Get the 1000 nodes most similar to **Absolut**, the Swedish vodka, using the text embeddings, and write them to a file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 320,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Q332378 is absolut\n",
    "kgtk_most_similar(te_vectors, positive=['Q332378'], kg_path=os.environ['OUT'] + \"/parts\", topn=1000, output_path=os.environ['TE'] + \"/Q332378.sim.tsv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 321,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>node1</th>\n",
       "      <th>node2</th>\n",
       "      <th>label</th>\n",
       "      <th>node1;label</th>\n",
       "      <th>node1;description</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Q7312560</td>\n",
       "      <td>0.9494208097457886</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Renat'@en</td>\n",
       "      <td>'Swedish vodka'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Q406157</td>\n",
       "      <td>0.9068878293037415</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'bäsk'@en</td>\n",
       "      <td>'Swedish style spiced liquor'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Q1034035</td>\n",
       "      <td>0.8990318775177002</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Finlandia Vodka'@en</td>\n",
       "      <td>'Finnish brand of vodka'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Q374</td>\n",
       "      <td>0.8908252716064453</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'vodka'@en</td>\n",
       "      <td>'distilled alcoholic beverage'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Q2553569</td>\n",
       "      <td>0.8900324106216431</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Vodka Martini'@en</td>\n",
       "      <td>'cocktail made with vodka and vermouth'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Q2206588</td>\n",
       "      <td>0.8866583108901978</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Caipiroska'@en</td>\n",
       "      <td>'cocktail prepared with vodka'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Q268057</td>\n",
       "      <td>0.8860777616500854</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'cosmopolitan'@en</td>\n",
       "      <td>'cocktail made with vodka'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Q4021706</td>\n",
       "      <td>0.8785413503646851</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Xan'@en</td>\n",
       "      <td>'Vodka from Goygol'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Q4869283</td>\n",
       "      <td>0.8784171342849731</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Batini'@en</td>\n",
       "      <td>'vodka-based cocktail'@en</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      node1               node2       label           node1;label  \\\n",
       "0  Q7312560  0.9494208097457886  similarity            'Renat'@en   \n",
       "1   Q406157  0.9068878293037415  similarity             'bäsk'@en   \n",
       "2  Q1034035  0.8990318775177002  similarity  'Finlandia Vodka'@en   \n",
       "3      Q374  0.8908252716064453  similarity            'vodka'@en   \n",
       "4  Q2553569  0.8900324106216431  similarity    'Vodka Martini'@en   \n",
       "5  Q2206588  0.8866583108901978  similarity       'Caipiroska'@en   \n",
       "6   Q268057  0.8860777616500854  similarity     'cosmopolitan'@en   \n",
       "7  Q4021706  0.8785413503646851  similarity              'Xan'@en   \n",
       "8  Q4869283  0.8784171342849731  similarity           'Batini'@en   \n",
       "\n",
       "                            node1;description  \n",
       "0                          'Swedish vodka'@en  \n",
       "1            'Swedish style spiced liquor'@en  \n",
       "2                 'Finnish brand of vodka'@en  \n",
       "3           'distilled alcoholic beverage'@en  \n",
       "4  'cocktail made with vodka and vermouth'@en  \n",
       "5           'cocktail prepared with vodka'@en  \n",
       "6               'cocktail made with vodka'@en  \n",
       "7                      'Vodka from Goygol'@en  \n",
       "8                   'vodka-based cocktail'@en  "
      ]
     },
     "execution_count": 321,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result = !head \"$TE\"/Q332378.sim.tsv\n",
    "kgtk_to_dataframe(result)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Suppose I have Absolut vodka and want to make a cocktail. I can treat the list of nodes most similar to Absolut as a graph and search the KG for mixed drinks (`Q3246609`) that appear in that list.\n",
    "\n",
    "Here are some drinks we can make with Absolut vodka, together with the ingredients the KG records for each."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 323,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>node1</th>\n",
       "      <th>node2</th>\n",
       "      <th>node1;label</th>\n",
       "      <th>node1;description</th>\n",
       "      <th>ingredient</th>\n",
       "      <th>ingredient label</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Q2553569</td>\n",
       "      <td>0.8900324106216431</td>\n",
       "      <td>'Vodka Martini'@en</td>\n",
       "      <td>'cocktail made with vodka and vermouth'@en</td>\n",
       "      <td>Q1105343</td>\n",
       "      <td>'cocktail glass'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Q2553569</td>\n",
       "      <td>0.8900324106216431</td>\n",
       "      <td>'Vodka Martini'@en</td>\n",
       "      <td>'cocktail made with vodka and vermouth'@en</td>\n",
       "      <td>Q1621080</td>\n",
       "      <td>'olive'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Q2553569</td>\n",
       "      <td>0.8900324106216431</td>\n",
       "      <td>'Vodka Martini'@en</td>\n",
       "      <td>'cocktail made with vodka and vermouth'@en</td>\n",
       "      <td>Q26877166</td>\n",
       "      <td>'lemon twist'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Q2553569</td>\n",
       "      <td>0.8900324106216431</td>\n",
       "      <td>'Vodka Martini'@en</td>\n",
       "      <td>'cocktail made with vodka and vermouth'@en</td>\n",
       "      <td>Q26877423</td>\n",
       "      <td>'dry vermouth'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Q2553569</td>\n",
       "      <td>0.8900324106216431</td>\n",
       "      <td>'Vodka Martini'@en</td>\n",
       "      <td>'cocktail made with vodka and vermouth'@en</td>\n",
       "      <td>Q374</td>\n",
       "      <td>'vodka'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Q2206588</td>\n",
       "      <td>0.8866583108901978</td>\n",
       "      <td>'Caipiroska'@en</td>\n",
       "      <td>'cocktail prepared with vodka'@en</td>\n",
       "      <td>Q374</td>\n",
       "      <td>'vodka'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Q1966883</td>\n",
       "      <td>0.8709859848022461</td>\n",
       "      <td>'Yorsh'@en</td>\n",
       "      <td>'Russian drink of beer and vodka'@en</td>\n",
       "      <td>Q374</td>\n",
       "      <td>'vodka'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Q1966883</td>\n",
       "      <td>0.8709859848022461</td>\n",
       "      <td>'Yorsh'@en</td>\n",
       "      <td>'Russian drink of beer and vodka'@en</td>\n",
       "      <td>Q44</td>\n",
       "      <td>'beer'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Q1723060</td>\n",
       "      <td>0.8683922290802002</td>\n",
       "      <td>'Kamikaze'@en</td>\n",
       "      <td>'cocktail of vodka, triple sec and lime juice'@en</td>\n",
       "      <td>Q1105343</td>\n",
       "      <td>'cocktail glass'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Q1723060</td>\n",
       "      <td>0.8683922290802002</td>\n",
       "      <td>'Kamikaze'@en</td>\n",
       "      <td>'cocktail of vodka, triple sec and lime juice'@en</td>\n",
       "      <td>Q3539556</td>\n",
       "      <td>'triple sec'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>Q1723060</td>\n",
       "      <td>0.8683922290802002</td>\n",
       "      <td>'Kamikaze'@en</td>\n",
       "      <td>'cocktail of vodka, triple sec and lime juice'@en</td>\n",
       "      <td>Q374</td>\n",
       "      <td>'vodka'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>Q1723060</td>\n",
       "      <td>0.8683922290802002</td>\n",
       "      <td>'Kamikaze'@en</td>\n",
       "      <td>'cocktail of vodka, triple sec and lime juice'@en</td>\n",
       "      <td>Q5361217</td>\n",
       "      <td>'lime juice'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>Q5580053</td>\n",
       "      <td>0.8639324903488159</td>\n",
       "      <td>'Golden Russian'@en</td>\n",
       "      <td>'cocktail of vodka and Galliano'@en</td>\n",
       "      <td>Q1331962</td>\n",
       "      <td>'Galliano'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>Q5580053</td>\n",
       "      <td>0.8639324903488159</td>\n",
       "      <td>'Golden Russian'@en</td>\n",
       "      <td>'cocktail of vodka and Galliano'@en</td>\n",
       "      <td>Q374</td>\n",
       "      <td>'vodka'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>Q5580053</td>\n",
       "      <td>0.8639324903488159</td>\n",
       "      <td>'Golden Russian'@en</td>\n",
       "      <td>'cocktail of vodka and Galliano'@en</td>\n",
       "      <td>Q5361217</td>\n",
       "      <td>'lime juice'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>Q8032131</td>\n",
       "      <td>0.8580197095870972</td>\n",
       "      <td>'Woo Woo'@en</td>\n",
       "      <td>'alcoholic beverage made of vodka, peach schna...</td>\n",
       "      <td>Q26877133</td>\n",
       "      <td>'lime wedge'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>Q8032131</td>\n",
       "      <td>0.8580197095870972</td>\n",
       "      <td>'Woo Woo'@en</td>\n",
       "      <td>'alcoholic beverage made of vodka, peach schna...</td>\n",
       "      <td>Q26879660</td>\n",
       "      <td>'peach schnapps'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>Q8032131</td>\n",
       "      <td>0.8580197095870972</td>\n",
       "      <td>'Woo Woo'@en</td>\n",
       "      <td>'alcoholic beverage made of vodka, peach schna...</td>\n",
       "      <td>Q374</td>\n",
       "      <td>'vodka'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>Q8032131</td>\n",
       "      <td>0.8580197095870972</td>\n",
       "      <td>'Woo Woo'@en</td>\n",
       "      <td>'alcoholic beverage made of vodka, peach schna...</td>\n",
       "      <td>Q4131010</td>\n",
       "      <td>'Highball glass'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>Q8032131</td>\n",
       "      <td>0.8580197095870972</td>\n",
       "      <td>'Woo Woo'@en</td>\n",
       "      <td>'alcoholic beverage made of vodka, peach schna...</td>\n",
       "      <td>Q865448</td>\n",
       "      <td>'Cranberry juice'@en</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       node1               node2          node1;label  \\\n",
       "0   Q2553569  0.8900324106216431   'Vodka Martini'@en   \n",
       "1   Q2553569  0.8900324106216431   'Vodka Martini'@en   \n",
       "2   Q2553569  0.8900324106216431   'Vodka Martini'@en   \n",
       "3   Q2553569  0.8900324106216431   'Vodka Martini'@en   \n",
       "4   Q2553569  0.8900324106216431   'Vodka Martini'@en   \n",
       "5   Q2206588  0.8866583108901978      'Caipiroska'@en   \n",
       "6   Q1966883  0.8709859848022461           'Yorsh'@en   \n",
       "7   Q1966883  0.8709859848022461           'Yorsh'@en   \n",
       "8   Q1723060  0.8683922290802002        'Kamikaze'@en   \n",
       "9   Q1723060  0.8683922290802002        'Kamikaze'@en   \n",
       "10  Q1723060  0.8683922290802002        'Kamikaze'@en   \n",
       "11  Q1723060  0.8683922290802002        'Kamikaze'@en   \n",
       "12  Q5580053  0.8639324903488159  'Golden Russian'@en   \n",
       "13  Q5580053  0.8639324903488159  'Golden Russian'@en   \n",
       "14  Q5580053  0.8639324903488159  'Golden Russian'@en   \n",
       "15  Q8032131  0.8580197095870972         'Woo Woo'@en   \n",
       "16  Q8032131  0.8580197095870972         'Woo Woo'@en   \n",
       "17  Q8032131  0.8580197095870972         'Woo Woo'@en   \n",
       "18  Q8032131  0.8580197095870972         'Woo Woo'@en   \n",
       "19  Q8032131  0.8580197095870972         'Woo Woo'@en   \n",
       "\n",
       "                                    node1;description ingredient  \\\n",
       "0          'cocktail made with vodka and vermouth'@en   Q1105343   \n",
       "1          'cocktail made with vodka and vermouth'@en   Q1621080   \n",
       "2          'cocktail made with vodka and vermouth'@en  Q26877166   \n",
       "3          'cocktail made with vodka and vermouth'@en  Q26877423   \n",
       "4          'cocktail made with vodka and vermouth'@en       Q374   \n",
       "5                   'cocktail prepared with vodka'@en       Q374   \n",
       "6                'Russian drink of beer and vodka'@en       Q374   \n",
       "7                'Russian drink of beer and vodka'@en        Q44   \n",
       "8   'cocktail of vodka, triple sec and lime juice'@en   Q1105343   \n",
       "9   'cocktail of vodka, triple sec and lime juice'@en   Q3539556   \n",
       "10  'cocktail of vodka, triple sec and lime juice'@en       Q374   \n",
       "11  'cocktail of vodka, triple sec and lime juice'@en   Q5361217   \n",
       "12                'cocktail of vodka and Galliano'@en   Q1331962   \n",
       "13                'cocktail of vodka and Galliano'@en       Q374   \n",
       "14                'cocktail of vodka and Galliano'@en   Q5361217   \n",
       "15  'alcoholic beverage made of vodka, peach schna...  Q26877133   \n",
       "16  'alcoholic beverage made of vodka, peach schna...  Q26879660   \n",
       "17  'alcoholic beverage made of vodka, peach schna...       Q374   \n",
       "18  'alcoholic beverage made of vodka, peach schna...   Q4131010   \n",
       "19  'alcoholic beverage made of vodka, peach schna...    Q865448   \n",
       "\n",
       "        ingredient label  \n",
       "0    'cocktail glass'@en  \n",
       "1             'olive'@en  \n",
       "2       'lemon twist'@en  \n",
       "3      'dry vermouth'@en  \n",
       "4             'vodka'@en  \n",
       "5             'vodka'@en  \n",
       "6             'vodka'@en  \n",
       "7              'beer'@en  \n",
       "8    'cocktail glass'@en  \n",
       "9        'triple sec'@en  \n",
       "10            'vodka'@en  \n",
       "11       'lime juice'@en  \n",
       "12         'Galliano'@en  \n",
       "13            'vodka'@en  \n",
       "14       'lime juice'@en  \n",
       "15       'lime wedge'@en  \n",
       "16   'peach schnapps'@en  \n",
       "17            'vodka'@en  \n",
       "18   'Highball glass'@en  \n",
       "19  'Cranberry juice'@en  "
      ]
     },
     "execution_count": 323,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result = !$kypher_raw -i \"$ISA\" -i \"$P279STAR\" -i \"$TE\"/Q332378.sim.tsv -i \"$Q154CLAIMS\" -i \"$Q154LABEL\" \\\n",
    "--match 'sim: (n1)-[]->(similarity), isa: (n1)-[]->(isa), star: (isa)-[]->(class), \\\n",
    "  claims: (n1)-[:P186]->(:Q374), claims: (n1)-[:P186]->(ingredient), label: (ingredient)-[]->(i_label)' \\\n",
    "--return 'distinct n1 as node1, similarity as node2, n1.label, n1.description, \\\n",
    "  ingredient as ingredient, i_label as `ingredient label`' \\\n",
    "--order-by 'cast(similarity, float) desc' \\\n",
    "--where 'class = \"Q3246609\"' \\\n",
    "--limit 20 \n",
    "\n",
    "kgtk_to_dataframe(result)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 291,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>node1</th>\n",
       "      <th>node2</th>\n",
       "      <th>node1;label</th>\n",
       "      <th>node1;description</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Q1966883</td>\n",
       "      <td>0.7984070181846619</td>\n",
       "      <td>'Yorsh'@en</td>\n",
       "      <td>'Russian drink of beer and vodka'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Q2206588</td>\n",
       "      <td>0.7781851291656494</td>\n",
       "      <td>'Caipiroska'@en</td>\n",
       "      <td>'cocktail prepared with vodka'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Q5580053</td>\n",
       "      <td>0.7759937047958374</td>\n",
       "      <td>'Golden Russian'@en</td>\n",
       "      <td>'cocktail of vodka and Galliano'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Q2553569</td>\n",
       "      <td>0.7755716443061829</td>\n",
       "      <td>'Vodka Martini'@en</td>\n",
       "      <td>'cocktail made with vodka and vermouth'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Q26883085</td>\n",
       "      <td>0.7711346745491028</td>\n",
       "      <td>'Russian Spring Punch'@en</td>\n",
       "      <td>'sparkling cocktail'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Q455914</td>\n",
       "      <td>0.7694578170776367</td>\n",
       "      <td>'Vodka Red Bull'@en</td>\n",
       "      <td>'alcoholic beverage'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Q1723060</td>\n",
       "      <td>0.7578018307685852</td>\n",
       "      <td>'Kamikaze'@en</td>\n",
       "      <td>'cocktail of vodka, triple sec and lime juice'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Q621302</td>\n",
       "      <td>0.757564902305603</td>\n",
       "      <td>'Appletini'@en</td>\n",
       "      <td>'apple-flavored vodka cocktail'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Q8032131</td>\n",
       "      <td>0.7451797723770142</td>\n",
       "      <td>'Woo Woo'@en</td>\n",
       "      <td>'alcoholic beverage made of vodka, peach schna...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Q1507096</td>\n",
       "      <td>0.744042158126831</td>\n",
       "      <td>'Moscow mule'@en</td>\n",
       "      <td>'mule cocktail with vodka, ginger beer and lim...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       node1               node2                node1;label  \\\n",
       "0   Q1966883  0.7984070181846619                 'Yorsh'@en   \n",
       "1   Q2206588  0.7781851291656494            'Caipiroska'@en   \n",
       "2   Q5580053  0.7759937047958374        'Golden Russian'@en   \n",
       "3   Q2553569  0.7755716443061829         'Vodka Martini'@en   \n",
       "4  Q26883085  0.7711346745491028  'Russian Spring Punch'@en   \n",
       "5    Q455914  0.7694578170776367        'Vodka Red Bull'@en   \n",
       "6   Q1723060  0.7578018307685852              'Kamikaze'@en   \n",
       "7    Q621302   0.757564902305603             'Appletini'@en   \n",
       "8   Q8032131  0.7451797723770142               'Woo Woo'@en   \n",
       "9   Q1507096   0.744042158126831           'Moscow mule'@en   \n",
       "\n",
       "                                   node1;description  \n",
       "0               'Russian drink of beer and vodka'@en  \n",
       "1                  'cocktail prepared with vodka'@en  \n",
       "2                'cocktail of vodka and Galliano'@en  \n",
       "3         'cocktail made with vodka and vermouth'@en  \n",
       "4                            'sparkling cocktail'@en  \n",
       "5                            'alcoholic beverage'@en  \n",
       "6  'cocktail of vodka, triple sec and lime juice'@en  \n",
       "7                 'apple-flavored vodka cocktail'@en  \n",
       "8  'alcoholic beverage made of vodka, peach schna...  \n",
       "9  'mule cocktail with vodka, ginger beer and lim...  "
      ]
     },
     "execution_count": 291,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result = !$kypher_raw -i \"$ISA\" -i \"$P279STAR\" -i \"$TE\"/Q332378.sim.tsv -i \"$Q154CLAIMS\" \\\n",
    "--match 'sim: (n1)-[]->(similarity), isa: (n1)-[]->(isa), star: (isa)-[]->(class), claims: (n1)-[:P186]->(:Q374)' \\\n",
    "--return 'distinct n1 as node1, similarity as node2, n1.label, n1.description' \\\n",
    "--order-by 'cast(similarity, float) desc' \\\n",
    "--where 'class = \"Q3246609\"' \\\n",
    "--limit 10 \n",
    "\n",
    "kgtk_to_dataframe(result)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The results are good, with plenty of cocktails to choose from. Note that the embeddings are able to generalize from a specific vodka to vodka in general. The example also illustrates that KGTK can use the results of gensim similarity queries inside queries to the KG."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 195,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Q332378 is absolut\n",
    "kgtk_most_similar(ge_vectors, positive=['Q332378'], kg_path=os.environ['OUT'] + \"/parts\", topn=2000, output_path=os.environ['GE'] + \"/Q332378.sim.tsv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 199,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['node1\\tnode2\\tlabel\\tnode1;label\\tnode1;description', \"Q3527971\\t0.4424980580806732\\tsimilarity\\t'Ti\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\'Punch'@en\\t'cocktail'@en\", \"Q594392\\t0.38892069458961487\\tsimilarity\\t'B-52'@en\\t'cocktail of coffee liqueur, Irish cream and triple sec'@en\", \"Q7535970\\t0.37358343601226807\\tsimilarity\\t'Skittle Bomb'@en\\t'bomb shot cocktail'@en\", \"Q7209010\\t0.37143874168395996\\tsimilarity\\t'Polar Bear'@en\\t'mint chocolate cocktail'@en\", \"Q3309707\\t0.37052232027053833\\tsimilarity\\t'Hawaiian Punch'@en\\t'Fruit punch brand'@en\", \"Q12738893\\t0.3702288269996643\\tsimilarity\\t'Quentão'@en\\t'Brazilian hot drink made \\u200b\\u200bfrom cachaça and some spices'@en\", \"Q2935472\\t0.36788904666900635\\tsimilarity\\t'Campari Soda'@en\\t'pre-mixed drink made by Campari'@en\", \"Q70428\\t0.3663345277309418\\tsimilarity\\t'Karsk'@en\\t'Scandinavian cocktail'@en\", \"Q590793\\t0.3614485263824463\\tsimilarity\\t'Vesper'@en\\t'cocktail originally made of gin, vodka, and Kina Lillet'@en\"]\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>node1</th>\n",
       "      <th>node2</th>\n",
       "      <th>label</th>\n",
       "      <th>node1;label</th>\n",
       "      <th>node1;description</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Q3527971</td>\n",
       "      <td>0.4424980580806732</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Ti\\\\\\\\\\\\\\\\'Punch'@en</td>\n",
       "      <td>'cocktail'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Q594392</td>\n",
       "      <td>0.38892069458961487</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'B-52'@en</td>\n",
       "      <td>'cocktail of coffee liqueur, Irish cream and t...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Q7535970</td>\n",
       "      <td>0.37358343601226807</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Skittle Bomb'@en</td>\n",
       "      <td>'bomb shot cocktail'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Q7209010</td>\n",
       "      <td>0.37143874168395996</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Polar Bear'@en</td>\n",
       "      <td>'mint chocolate cocktail'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Q3309707</td>\n",
       "      <td>0.37052232027053833</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Hawaiian Punch'@en</td>\n",
       "      <td>'Fruit punch brand'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Q12738893</td>\n",
       "      <td>0.3702288269996643</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Quentão'@en</td>\n",
       "      <td>'Brazilian hot drink made ​​from cachaça and s...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Q2935472</td>\n",
       "      <td>0.36788904666900635</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Campari Soda'@en</td>\n",
       "      <td>'pre-mixed drink made by Campari'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Q70428</td>\n",
       "      <td>0.3663345277309418</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Karsk'@en</td>\n",
       "      <td>'Scandinavian cocktail'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Q590793</td>\n",
       "      <td>0.3614485263824463</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Vesper'@en</td>\n",
       "      <td>'cocktail originally made of gin, vodka, and K...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       node1                node2       label            node1;label  \\\n",
       "0   Q3527971   0.4424980580806732  similarity  'Ti\\\\\\\\\\\\\\\\'Punch'@en   \n",
       "1    Q594392  0.38892069458961487  similarity              'B-52'@en   \n",
       "2   Q7535970  0.37358343601226807  similarity      'Skittle Bomb'@en   \n",
       "3   Q7209010  0.37143874168395996  similarity        'Polar Bear'@en   \n",
       "4   Q3309707  0.37052232027053833  similarity    'Hawaiian Punch'@en   \n",
       "5  Q12738893   0.3702288269996643  similarity           'Quentão'@en   \n",
       "6   Q2935472  0.36788904666900635  similarity      'Campari Soda'@en   \n",
       "7     Q70428   0.3663345277309418  similarity             'Karsk'@en   \n",
       "8    Q590793   0.3614485263824463  similarity            'Vesper'@en   \n",
       "\n",
       "                                   node1;description  \n",
       "0                                      'cocktail'@en  \n",
       "1  'cocktail of coffee liqueur, Irish cream and t...  \n",
       "2                            'bomb shot cocktail'@en  \n",
       "3                       'mint chocolate cocktail'@en  \n",
       "4                             'Fruit punch brand'@en  \n",
       "5  'Brazilian hot drink made ​​from cachaça and s...  \n",
       "6               'pre-mixed drink made by Campari'@en  \n",
       "7                         'Scandinavian cocktail'@en  \n",
       "8  'cocktail originally made of gin, vodka, and K...  "
      ]
     },
     "execution_count": 199,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result = !$kypher_raw -i \"$ISA\" -i \"$P279STAR\" -i \"$GE\"/Q332378.sim.tsv \\\n",
    "--match 'sim: (n1)-[]->(similarity), isa: (n1)-[]->(isa), star: (isa)-[]->(class)' \\\n",
    "--return 'distinct n1 as node1, similarity as node2, \"similarity\" as label, n1.label, n1.description' \\\n",
    "--order-by 'cast(similarity, float) desc' \\\n",
    "--where 'class = \"Q3246609\"' \\\n",
    "--limit 10 \n",
    "\n",
    "kgtk_to_dataframe(result)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The results are poor: for the most part, the retrieved cocktails do not contain vodka. Let's try the query with vodka (`Q374`) instead of Absolut vodka."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 200,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Q374 vodka\n",
    "kgtk_most_similar(ge_vectors, positive=['Q374'], kg_path=os.environ['OUT'] + \"/parts\", topn=1000, output_path=os.environ['GE'] + \"/Q374.sim.tsv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 203,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>node1</th>\n",
       "      <th>node2</th>\n",
       "      <th>label</th>\n",
       "      <th>node1;label</th>\n",
       "      <th>node1;description</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Q11328065</td>\n",
       "      <td>0.8384641408920288</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Balalaika'@en</td>\n",
       "      <td>'Japanese short drink, cocktail'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Q2206588</td>\n",
       "      <td>0.8186914920806885</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Caipiroska'@en</td>\n",
       "      <td>'cocktail prepared with vodka'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Q3562046</td>\n",
       "      <td>0.6592038869857788</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Vodka Stinger'@en</td>\n",
       "      <td>'type of cocktail'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Q1966883</td>\n",
       "      <td>0.5952204465866089</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Yorsh'@en</td>\n",
       "      <td>'Russian drink of beer and vodka'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Q5459745</td>\n",
       "      <td>0.5736489295959473</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'flirtini'@en</td>\n",
       "      <td>'cocktail containing vodka, champagne and pine...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Q455914</td>\n",
       "      <td>0.5721926093101501</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Vodka Red Bull'@en</td>\n",
       "      <td>'alcoholic beverage'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Q5103598</td>\n",
       "      <td>0.5712590217590332</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Chocolate Cake'@en</td>\n",
       "      <td>'cocktail'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Q26879480</td>\n",
       "      <td>0.5568693280220032</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Godmother'@en</td>\n",
       "      <td>'cocktail'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Q5580053</td>\n",
       "      <td>0.5458002090454102</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Golden Russian'@en</td>\n",
       "      <td>'cocktail of vodka and Galliano'@en</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Q3900577</td>\n",
       "      <td>0.5457539558410645</td>\n",
       "      <td>similarity</td>\n",
       "      <td>'Pertini'@en</td>\n",
       "      <td>'cocktail drink with honey'@en</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       node1               node2       label          node1;label  \\\n",
       "0  Q11328065  0.8384641408920288  similarity       'Balalaika'@en   \n",
       "1   Q2206588  0.8186914920806885  similarity      'Caipiroska'@en   \n",
       "2   Q3562046  0.6592038869857788  similarity   'Vodka Stinger'@en   \n",
       "3   Q1966883  0.5952204465866089  similarity           'Yorsh'@en   \n",
       "4   Q5459745  0.5736489295959473  similarity        'flirtini'@en   \n",
       "5    Q455914  0.5721926093101501  similarity  'Vodka Red Bull'@en   \n",
       "6   Q5103598  0.5712590217590332  similarity  'Chocolate Cake'@en   \n",
       "7  Q26879480  0.5568693280220032  similarity       'Godmother'@en   \n",
       "8   Q5580053  0.5458002090454102  similarity  'Golden Russian'@en   \n",
       "9   Q3900577  0.5457539558410645  similarity         'Pertini'@en   \n",
       "\n",
       "                                   node1;description  \n",
       "0                'Japanese short drink, cocktail'@en  \n",
       "1                  'cocktail prepared with vodka'@en  \n",
       "2                              'type of cocktail'@en  \n",
       "3               'Russian drink of beer and vodka'@en  \n",
       "4  'cocktail containing vodka, champagne and pine...  \n",
       "5                            'alcoholic beverage'@en  \n",
       "6                                      'cocktail'@en  \n",
       "7                                      'cocktail'@en  \n",
       "8                'cocktail of vodka and Galliano'@en  \n",
       "9                     'cocktail drink with honey'@en  "
      ]
     },
     "execution_count": 203,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result = !$kypher_raw -i \"$ISA\" -i \"$P279STAR\" -i \"$GE\"/Q374.sim.tsv \\\n",
    "--match 'sim: (n1)-[]->(similarity), isa: (n1)-[]->(isa), star: (isa)-[]->(class)' \\\n",
    "--return 'distinct n1 as node1, similarity as node2, \"similarity\" as label, n1.label, n1.description' \\\n",
    "--order-by 'cast(similarity, float) desc' \\\n",
    "--where 'class = \"Q3246609\"' \\\n",
    "--limit 10 \n",
    "\n",
    "kgtk_to_dataframe(result)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The results are good. Somehow, the graph embeddings are able to retrieve the cocktails that contain vodka, but cannot generalize from Absolut vodka to vodka."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Produce files to load in the Google Embedding Projector\n",
    "We need two files:\n",
    "\n",
    "- a TSV file with the vectors\n",
    "- a TSV file with the metadata, in the same order as the vectors\n",
    "\n",
    "We don't want to load all the vectors in the projectors because it is too many to visualize. We will load only the following types:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 89,
   "metadata": {},
   "outputs": [],
   "source": [
    "focus_types = {\n",
    "    \"Q3246609\": \"mixed drink\",\n",
    "    \"Q44\": \"beer\",\n",
    "    \"Q282\": \"wine\",\n",
    "    \"Q281\": \"whiskey\",\n",
    "    \"Q374\": \"vodka\",\n",
    "    \"Q6256\": \"country\",\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Construct a dictionary that maps every q-node in the KG to the set of all its superclasses. We will use this dictionary later to tag each q-node with one of the focus types. For every q-node we willtest if the focus type is in the set of all super-classes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 90,
   "metadata": {},
   "outputs": [],
   "source": [
    "classes_result = !$kypher_raw -i \"$ISA\" -i \"$Q154CLAIMS\" -i \"$TEMP\"/Q154.descendant.tsv -i \"$P279STAR\" \\\n",
    "--match 'isa: (n1)-[]->(c), P279: (c)-[]->(class), claims: ()-[]->(class), descendant: (n1)-[]->()' \\\n",
    "--return 'distinct n1 as qnode, class as class' \n",
    "\n",
    "class_dict = {}\n",
    "for r in classes_result[1:]:\n",
    "    row = r.split(\"\\t\")\n",
    "    qnode = row[0]\n",
    "    isa = row[1]\n",
    "    entry = class_dict.get(qnode)\n",
    "    if entry is None:\n",
    "        class_dict[qnode] = set()\n",
    "        entry = class_dict[qnode]\n",
    "    entry.add(isa)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 91,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'Q102205',\n",
       " 'Q1048607',\n",
       " 'Q11024',\n",
       " 'Q11028',\n",
       " 'Q11064354',\n",
       " 'Q111352',\n",
       " 'Q11435',\n",
       " 'Q1150070',\n",
       " 'Q1166770',\n",
       " 'Q11795009',\n",
       " 'Q1190554',\n",
       " 'Q1194058',\n",
       " 'Q12055130',\n",
       " 'Q124291',\n",
       " 'Q12767945',\n",
       " 'Q131257',\n",
       " 'Q13878858',\n",
       " 'Q1400881',\n",
       " 'Q1422299',\n",
       " 'Q14819853',\n",
       " 'Q14912053',\n",
       " 'Q154',\n",
       " 'Q15401930',\n",
       " 'Q1554231',\n",
       " 'Q1632297',\n",
       " 'Q16686448',\n",
       " 'Q16722960',\n",
       " 'Q167270',\n",
       " 'Q1681365',\n",
       " 'Q16887380',\n",
       " 'Q16889133',\n",
       " 'Q169336',\n",
       " 'Q1704572',\n",
       " 'Q174984',\n",
       " 'Q1786828',\n",
       " 'Q1865992',\n",
       " 'Q187931',\n",
       " 'Q1914636',\n",
       " 'Q20817253',\n",
       " 'Q20937557',\n",
       " 'Q2095',\n",
       " 'Q214609',\n",
       " 'Q2150504',\n",
       " 'Q2200417',\n",
       " 'Q22269697',\n",
       " 'Q22272508',\n",
       " 'Q22294683',\n",
       " 'Q22299433',\n",
       " 'Q22299483',\n",
       " 'Q223557',\n",
       " 'Q23009552',\n",
       " 'Q23009675',\n",
       " 'Q2424752',\n",
       " 'Q25481995',\n",
       " 'Q266328',\n",
       " 'Q26717101',\n",
       " 'Q26907166',\n",
       " 'Q2695280',\n",
       " 'Q27166344',\n",
       " 'Q281',\n",
       " 'Q2844972',\n",
       " 'Q28555911',\n",
       " 'Q28728771',\n",
       " 'Q28732711',\n",
       " 'Q28823',\n",
       " 'Q28877',\n",
       " 'Q28921572',\n",
       " 'Q2944660',\n",
       " 'Q29651519',\n",
       " 'Q2990593',\n",
       " 'Q2996394',\n",
       " 'Q31464082',\n",
       " 'Q3249551',\n",
       " 'Q337060',\n",
       " 'Q34394',\n",
       " 'Q3505845',\n",
       " 'Q35120',\n",
       " 'Q35758',\n",
       " 'Q3695082',\n",
       " 'Q382947',\n",
       " 'Q386724',\n",
       " 'Q40050',\n",
       " 'Q4026292',\n",
       " 'Q427581',\n",
       " 'Q42848',\n",
       " 'Q43460564',\n",
       " 'Q4406616',\n",
       " 'Q4437984',\n",
       " 'Q46737',\n",
       " 'Q478798',\n",
       " 'Q483247',\n",
       " 'Q488383',\n",
       " 'Q492',\n",
       " 'Q5127848',\n",
       " 'Q517596',\n",
       " 'Q52948',\n",
       " 'Q5371079',\n",
       " 'Q54989186',\n",
       " 'Q551997',\n",
       " 'Q56139',\n",
       " 'Q58415929',\n",
       " 'Q58416391',\n",
       " 'Q58778',\n",
       " 'Q6005984',\n",
       " 'Q6031064',\n",
       " 'Q64732777',\n",
       " 'Q6671777',\n",
       " 'Q7184903',\n",
       " 'Q781413',\n",
       " 'Q79529',\n",
       " 'Q80071',\n",
       " 'Q813912',\n",
       " 'Q8171',\n",
       " 'Q8205328',\n",
       " 'Q82799',\n",
       " 'Q837718',\n",
       " 'Q9081',\n",
       " 'Q921513',\n",
       " 'Q9332',\n",
       " 'Q937228',\n",
       " 'novalue'}"
      ]
     },
     "execution_count": 91,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "class_dict['Q502268']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 144,
   "metadata": {},
   "outputs": [],
   "source": [
    "def focus_type(qnode):\n",
    "    for t in focus_types.keys():\n",
    "        classes = class_dict.get(qnode)\n",
    "        if classes and t in classes:\n",
    "            return focus_types[t]\n",
    "        if qnode in country_qnodes:\n",
    "            return \"country\"\n",
    "    return \"other\""
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "# Doesn't work because partition didin't work and we don't have the derived.isa file\n",
    "country_qnodes = set()\n",
    "!$kypher -i \"$Q154ISA\" \\\n",
    "--match '(n1)-[]->(:Q6256)'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Construct `country_qnodes`, the set of all country qnodes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 104,
   "metadata": {},
   "outputs": [],
   "source": [
    "country_result = !$kypher_raw -i \"$ISA\" -i \"$P279STAR\" -i \"$Q154CLAIMS\" \\\n",
    "--match 'claims: (country)-[]->(), isa: (country)-[:isa]->(c), P279: (c)-[]->(:Q6256)' \\\n",
    "--return 'distinct country as country' \n",
    "\n",
    "country_qnodes = set()\n",
    "for r in country_result[1:]:\n",
    "    country_qnodes.add(r)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Construct `alcoholic_qnodes`, the set of all alcoholic beverage qnodes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 105,
   "metadata": {},
   "outputs": [],
   "source": [
    "alcoholic_qnodes = set()\n",
    "for line in open(os.environ[\"TEMP\"] + \"/Q154.descendant.tsv\", \"r\"):\n",
    "    alcoholic_qnodes.add(line.split(\"\\t\")[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 97,
   "metadata": {},
   "outputs": [],
   "source": [
    "def build_embedding_projector_vectors(embeddings_path):\n",
    "    input_path = embeddings_path + \"/embeddings.txt\"\n",
    "    vectors_path = embeddings_path + \"/projector.vectors.tsv\"\n",
    "    qnodes_path = embeddings_path + \"/projector.qnodes.tsv\"\n",
    "\n",
    "    input_file = open(input_path, \"r\")\n",
    "    vectors_file = open(vectors_path, \"w\")\n",
    "    qnodes_file = open(qnodes_path, \"w\")\n",
    "\n",
    "    qnodes_file.write(\"node1\\n\")\n",
    "\n",
    "    with open(input_path, \"r\") as w2v_file:\n",
    "        next(w2v_file)\n",
    "        for line in w2v_file:\n",
    "            items = line.split(\" \")\n",
    "            qnode = items[0]\n",
    "            if qnode in alcoholic_qnodes or qnode in country_qnodes:\n",
    "                vectors_file.write(\"\\t\".join(items[1:]))\n",
    "                qnodes_file.write(\"{}\\n\".format(qnode))\n",
    "\n",
    "    input_file.close()\n",
    "    vectors_file.close()\n",
    "    qnodes_file.close()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 98,
   "metadata": {},
   "outputs": [],
   "source": [
    "build_embedding_projector_vectors(os.environ[\"GE\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 99,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "node1\n",
      "Q3242283\n",
      "Q3866024\n",
      "Q1112057\n",
      "Q3866020\n",
      "Q1513599\n",
      "Q17329207\n",
      "Q16620320\n",
      "Q3895013\n",
      "Q4880027\n"
     ]
    }
   ],
   "source": [
    "!head \"$GE\"/translation.projector.qnodes.tsv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 141,
   "metadata": {},
   "outputs": [],
   "source": [
    "def build_embedding_projector_metadata(embeddings_path):\n",
    "    kg_path = os.environ[\"OUT\"] + \"/parts\"\n",
    "    os.environ[\"_label_graph\"] = kg_path + \"/labels.en.tsv.gz\"\n",
    "    os.environ[\"_description_graph\"] = kg_path + \"/descriptions.en.tsv.gz\"\n",
    "    os.environ[\"_qnodes\"] = embeddings_path + \"/projector.qnodes.tsv\"\n",
    "\n",
    "    #result = !$kypher_raw -i \"$_label_graph\" -i \"$_description_graph\" -i \"$_qnodes\" \\\n",
    "    #--match 'qnodes: (n1)-[]->(), label: (n1)-[]->(lab), description: (n1)-[]->(des)' \\\n",
    "    #--return 'distinct n1 as node1, lab as `node1;label`, des as `node1;description`' \n",
    "    \n",
    "    result = !$kypher_raw -i \"$_label_graph\" -i \"$_description_graph\" -i \"$_qnodes\" \\\n",
    "    --match 'qnodes: (n1)-[]->(), label: (n1)-[]->(lab)' \\\n",
    "    --return 'distinct n1 as node1, lab as `node1;label`'\n",
    "    \n",
    "    metadata_path = embeddings_path + \"/projector.metadata.tsv\"\n",
    "    metadata_file = open(metadata_path, \"w\")\n",
    "    metadata_file.write(\"tag\\tqnode\\ttype\\n\")\n",
    "\n",
    "    qnode_dict = {}\n",
    "    for line in result[1:]:\n",
    "        items = line.split(\"\\t\")\n",
    "        qnode = items[0]\n",
    "        # qnode_dict[qnode] = \"{} ({})\".format(items[1], items[2])\n",
    "        qnode_dict[qnode] = \"{}\".format(items[1])\n",
    "\n",
    "    with open(os.environ[\"_qnodes\"]) as qnodes_file:\n",
    "        next(qnodes_file)\n",
    "        for line in qnodes_file:\n",
    "            qnode = line[:-1]\n",
    "            ftype = focus_type(qnode)\n",
    "            tag = qnode_dict.get(qnode)\n",
    "            if tag is None:\n",
    "                tag = qnode\n",
    "            tag = \"{} ({})\".format(qnode_dict.get(qnode), ftype)\n",
    "            metadata_file.write(\"{}\\t{}\\t{}\\n\".format(tag, qnode, ftype))\n",
    "\n",
    "    metadata_file.close()\n",
    "    qnodes_file.close()       "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 138,
   "metadata": {},
   "outputs": [],
   "source": [
    "build_embedding_projector_metadata(os.environ[\"GE\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Check that the file sizes are correct, the metadata file has one more line as it as headers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 130,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    2244   14157  116997 /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/graph-embedding/projector.metadata.tsv\n",
      "    2243  224300 2805636 /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/graph-embedding/projector.vectors.tsv\n",
      "    4487  238457 2922633 total\n"
     ]
    }
   ],
   "source": [
    "!wc \"$GE\"/projector.metadata.tsv \"$GE\"/projector.vectors.tsv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 106,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "-0.695853055\t-0.072303891\t0.496231377\t-0.293976039\t0.193507940\t0.096196420\t0.043117594\t-0.580413938\t-0.423150927\t0.348393738\t-0.044707101\t0.447685152\t-0.251975268\t0.192745760\t-0.357472301\t0.204551399\t-0.013355692\t0.216426134\t-0.170541272\t-0.189649135\t-0.299910724\t0.295587122\t0.594068944\t-0.064507566\t0.261834234\t-0.458304882\t-0.426072240\t-0.082138501\t0.007850863\t-0.320901960\t0.727239370\t0.642546177\t-0.339439988\t0.260855168\t0.066383749\t0.018122014\t0.614691317\t-0.109721325\t-0.066969074\t-0.123010576\t0.231307715\t0.633326292\t0.570168674\t-0.550969541\t0.073210679\t-0.459269404\t0.093307532\t0.358197242\t0.623394549\t-0.309046119\t-0.467551976\t0.312151939\t-0.491982907\t0.400699556\t-0.383774340\t-0.446712554\t0.047239214\t0.598234832\t-0.471011013\t-0.039659370\t-0.254376531\t-0.012475031\t-0.207778856\t0.335359454\t0.302034408\t0.153741017\t0.902297437\t-0.261785030\t0.502385259\t-0.139487550\t0.090193652\t-0.114394628\t-0.246014833\t-0.570263982\t0.746979654\t0.009215424\t-0.472881168\t0.205686644\t-0.781571090\t0.133758202\t-0.197057635\t-0.022827761\t-0.097072124\t-0.930668116\t-0.564921737\t-0.811056256\t-0.459467322\t-0.352878183\t-0.494716078\t0.520463228\t0.076241963\t-0.020195168\t0.423226446\t0.302821845\t-0.207172275\t-0.163210511\t0.028312737\t0.138087898\t0.582748592\t0.285810173\n"
     ]
    }
   ],
   "source": [
    "!head -1 \"$GE\"/projector.vectors.tsv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 112,
   "metadata": {},
   "outputs": [],
   "source": [
    "build_embedding_projector_vectors(os.environ[\"TE\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 145,
   "metadata": {},
   "outputs": [],
   "source": [
    "build_embedding_projector_metadata(os.environ[\"TE\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 143,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    2782   14542  118309 /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/text-embedding/projector.metadata.tsv\n",
      "    2781 2847744 31710917 /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/text-embedding/projector.vectors.tsv\n",
      "    2782    2782   24800 /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/text-embedding/projector.qnodes.tsv\n",
      "    8345 2865068 31854026 total\n"
     ]
    }
   ],
   "source": [
    "!wc \"$TE\"/projector.metadata.tsv \"$TE\"/projector.vectors.tsv \"$TE\"/projector.qnodes.tsv"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "!$kgtk lexicalize -i $OUT/all.tsv.gz \\\n",
    "--label-properties label \\\n",
    "--isa-properties P31 P279 P452 P106 \\\n",
    "--description-properties description \\\n",
    "--property-value P186 P17 P127 P176 \\\n",
    "--has-properties \"\" \\\n",
    "--add-entity-labels-from-input True \\\n",
    "-o \"$TE\"/sentences.tsv "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 197,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Q374 is vodka\n",
    "kgtk_most_similar(te_vectors, positive=['Q374'], kg_path=os.environ['OUT'] + \"/parts\", topn=1000, output_path=os.environ['TE'] + \"/Q374.sim.tsv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 198,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Q502268 is Johnnie Walker\n",
    "kgtk_most_similar(te_vectors, positive=['Q502268'], kg_path=os.environ['OUT'] + \"/parts\", topn=1000, output_path=os.environ['TE'] + \"/Q502268.sim.tsv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 199,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Q332378 is absolut\n",
    "kgtk_most_similar(te_vectors, positive=['Q332378'], kg_path=os.environ['OUT'] + \"/parts\", topn=1000, output_path=os.environ['TE'] + \"/Q332378.sim.tsv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 200,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Q27 Ireland\n",
    "kgtk_most_similar(te_vectors, positive=['Q27'], kg_path=os.environ['OUT'] + \"/parts\", topn=1000, output_path=os.environ['TE'] + \"/Q27.sim.tsv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 201,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Q29 Spain\n",
    "kgtk_most_similar(te_vectors, positive=['Q29'], kg_path=os.environ['OUT'] + \"/parts\", topn=1000, output_path=os.environ['TE'] + \"/Q29.sim.tsv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 202,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Q29 Spain, Q45 Portugal, Q142 France\n",
    "kgtk_most_similar(te_vectors, positive=['Q29', 'Q45', 'Q142'], kg_path=os.environ['OUT'] + \"/parts\", topn=2000, output_path=os.environ['TE'] + \"/Q29.Q45.Q142.sim.tsv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 203,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Q33 Finland\n",
    "kgtk_most_similar(te_vectors, positive=['Q33'], kg_path=os.environ['OUT'] + \"/parts\", topn=1000, output_path=os.environ['TE'] + \"/Q33.sim.tsv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 188,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Q502268 is Johnnie Walker\n",
    "kgtk_most_similar(vectors, positive=['Q502268'], kg_path=os.environ['OUT'] + \"/parts\", topn=1000, output_path=os.environ['GE'] + \"/Q502268.sim.tsv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 189,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Q374 is vodka\n",
    "kgtk_most_similar(vectors, positive=['Q374'], kg_path=os.environ['OUT'] + \"/parts\", topn=1000, output_path=os.environ['GE'] + \"/Q374.sim.tsv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 190,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Q332378 is absolut\n",
    "kgtk_most_similar(vectors, positive=['Q332378'], kg_path=os.environ['OUT'] + \"/parts\", topn=1000, output_path=os.environ['GE'] + \"/Q332378.sim.tsv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 191,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Q27 Ireland\n",
    "kgtk_most_similar(vectors, positive=['Q27'], kg_path=os.environ['OUT'] + \"/parts\", topn=1000, output_path=os.environ['GE'] + \"/Q27.sim.tsv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 192,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Q29 Spain\n",
    "kgtk_most_similar(vectors, positive=['Q29'], kg_path=os.environ['OUT'] + \"/parts\", topn=1000, output_path=os.environ['GE'] + \"/Q29.sim.tsv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 193,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Q29 Spain, Q45 Portugal, Q142 France\n",
    "kgtk_most_similar(vectors, positive=['Q29', 'Q45', 'Q142'], kg_path=os.environ['OUT'] + \"/parts\", topn=2000, output_path=os.environ['GE'] + \"/Q29.Q45.Q142.sim.tsv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 211,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "        0.51 real         0.38 user         0.11 sys\n",
      "node1     node2                label       node1;label            node1;description\n",
      "Q3527971  0.4424980580806732   similarity  'Ti\\\\\\\\\\\\\\\\'Punch'@en  'cocktail'@en\n",
      "Q594392   0.38892069458961487  similarity  'B-52'@en              'cocktail of coffee liqueur, Irish cream and triple sec'@en\n"
     ]
    }
   ],
   "source": [
    "# Q281 whiskey\n",
    "# Q282 wine\n",
    "# Q3246609 mixed drink\n",
    "# Q374 vodka\n",
    "# Q332378 is absolut\n",
    "!$kypher -i \"$ISA\" -i \"$P279STAR\" -i \"$GE\"/Q332378.sim.tsv \\\n",
    "--match 'sim: (n1)-[]->(similarity), isa: (n1)-[]->(isa), star: (isa)-[]->(class)' \\\n",
    "--return 'distinct n1 as node1, similarity as node2, \"similarity\" as label, n1.label, n1.description' \\\n",
    "--order-by 'cast(similarity, float) desc' \\\n",
    "--where 'class = \"Q3246609\"' \\\n",
    "--limit 10 \\\n",
    "| column -t -s $'\\t'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [],
   "source": [
    "lines = !kgtk remove-columns -i \"$Q154LABEL\" --all-except --columns node1 node2 \n",
    "label_dict = {}\n",
    "for line in lines[1:]:\n",
    "    items = line.split(\"\\t\")\n",
    "    label_dict[items[0]] = items[1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [],
   "source": [
    "lines = !kgtk remove-columns -i \"$Q154DESCRIPTION\" --all-except --columns node1 node2 \n",
    "description_dict = {}\n",
    "for line in lines[1:]:\n",
    "    items = line.split(\"\\t\")\n",
    "    description_dict[items[0]] = items[1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "metadata": {},
   "outputs": [],
   "source": [
    "def show_labels(similar_list):\n",
    "    result = []\n",
    "    for x in similar_list:\n",
    "        text = \"{}, {} ({}), {}\".format(label_dict.get(x[0]), description_dict.get(x[0]), x[0], x[1])\n",
    "        result.append((text))\n",
    "    return result"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Stop here: the stuff below is Pedro's scratchpad, will be deleted later"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Cleanup\n",
    "\n",
    "Remove `novalue` and `somevalue`"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "kgtk",
   "language": "python",
   "name": "kgtk"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}