{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Augment A Knowledge Graph Using CSV Files\n", "\n", "A common knowledge graph augmentation use case is to integrate structured data present in databases, CSV and Excel files. KGTK provides an alternative to languages such as R2RML (https://www.w3.org/TR/r2rml/) and RML (https://rml.io/specs/rml/) or tools such as Karma (https://github.com/usc-isi-i2/Web-Karma) for integrating tabluar data. These languages and tools work by defining a mapping between a structured souce and an ontology. KGTK provides capabilities to transform the original structured data into the TSV format used in KGTK to store KGs. \n", "\n", "This tutorial illustrates the KGTK approach using a CSV file downloaded from Kaggle (https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset) containing information about moves from IMDb (https://www.imdb.com/).\n", "\n", "This tutorial is divided into multiple sections:\n", "- Survey what is available in our KG and in the IMDb file\n", "- Prepare the IMDb file for ingestion in KGTK\n", "- Find overlap betwwen our KG and the IMDb KG\n", "- Integrate knowledge from the IMDb KG into our KG\n", "- Implement a better model for the data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 0: Install KGTK" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Only run the following cell if KGTK is not installed.\n", " For example, if running in [Google Colab](https://colab.research.google.com/)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install kgtk" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preamble: set up the environment and files used in the tutorial" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import io\n", "import os\n", "import subprocess\n", "import sys\n", "import csv\n", "import pandas as pd\n", "\n", "from kgtk.configure_kgtk_notebooks import ConfigureKGTK\n", "from kgtk.functions import kgtk, kypher" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "tags": [ "parameters" ] }, "outputs": [], "source": [ "# Parameters\n", "\n", "# Folder on local machine where to create the output and temporary folders\n", "input_path = None\n", "\n", "output_path = \"/tmp/projects\"\n", "project_name = \"tutorial-augment\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These are all the KG files that we use in this tutorial:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "files = [\n", " \"all\",\n", " \"label\",\n", " \"alias\",\n", " \"description\",\n", " \"external_id\",\n", " \"monolingualtext\",\n", " \"quantity\",\n", " \"string\",\n", " \"time\",\n", " \"item\",\n", " \"wikibase_property\",\n", " \"qualifiers\",\n", " \"datatypes\",\n", " \"p279\",\n", " \"p279star\",\n", " \"p31\",\n", " \"in_degree\",\n", " \"out_degree\",\n", " \"pagerank_directed\",\n", " \"pagerank_undirected\"\n", "]\n", "input_files_url = \"https://github.com/usc-isi-i2/kgtk-tutorial-files/raw/main/datasets/arnold-profiled\"\n", "ck = ConfigureKGTK(files, input_files_url=input_files_url)\n", "ck.configure_kgtk(input_graph_path=input_path,\n", " output_path=output_path,\n", " project_name=project_name)\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "TEMP: /tmp/projects/tutorial-augment/temp.tutorial-augment\n", "STORE: 
/tmp/projects/tutorial-augment/temp.tutorial-augment/wikidata.sqlite3.db\n", "kypher: kgtk query --graph-cache /tmp/projects/tutorial-augment/temp.tutorial-augment/wikidata.sqlite3.db\n", "KGTK_GRAPH_CACHE: /tmp/projects/tutorial-augment/temp.tutorial-augment/wikidata.sqlite3.db\n", "USE_CASES_DIR: /Users/amandeep/Github/kgtk-notebooks/use-cases\n", "kgtk: kgtk\n", "EXAMPLES_DIR: /Users/amandeep/Github/kgtk-notebooks/examples\n", "KGTK_LABEL_FILE: /Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/labels.en.tsv.gz\n", "KGTK_OPTION_DEBUG: false\n", "GRAPH: /Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input\n", "OUT: /tmp/projects/tutorial-augment\n", "all: /Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/all.tsv.gz\n", "label: /Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/labels.en.tsv.gz\n", "alias: /Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/aliases.en.tsv.gz\n", "description: /Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/descriptions.en.tsv.gz\n", "external_id: /Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/claims.external-id.tsv.gz\n", "monolingualtext: /Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/claims.monolingualtext.tsv.gz\n", "quantity: /Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/claims.quantity.tsv.gz\n", "string: /Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/claims.string.tsv.gz\n", "time: /Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/claims.time.tsv.gz\n", "item: /Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/claims.wikibase-item.tsv.gz\n", "wikibase_property: /Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/claims.wikibase-property.tsv.gz\n", "qualifiers: /Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/qualifiers.tsv.gz\n", "datatypes: /Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/metadata.property.datatypes.tsv.gz\n", "p279: /Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/derived.P279.tsv.gz\n", "p279star: /Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/derived.P279star.tsv.gz\n", "p31: /Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/derived.P31.tsv.gz\n", "in_degree: /Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/metadata.in_degree.tsv.gz\n", "out_degree: /Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/metadata.out_degree.tsv.gz\n", "pagerank_directed: /Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/metadata.pagerank.directed.tsv.gz\n", "pagerank_undirected: /Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/metadata.pagerank.undirected.tsv.gz\n" ] } ], "source": [ "ck.print_env_variables()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load all my files into the kypher cache so that all graph aliases are defined" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "kgtk query --graph-cache /tmp/projects/tutorial-augment/temp.tutorial-augment/wikidata.sqlite3.db -i \"/Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/all.tsv.gz\" --as all -i \"/Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/labels.en.tsv.gz\" --as label -i \"/Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/aliases.en.tsv.gz\" --as alias -i \"/Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/descriptions.en.tsv.gz\" --as description -i \"/Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/claims.external-id.tsv.gz\" --as external_id -i 
\"/Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/claims.monolingualtext.tsv.gz\" --as monolingualtext -i \"/Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/claims.quantity.tsv.gz\" --as quantity -i \"/Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/claims.string.tsv.gz\" --as string -i \"/Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/claims.time.tsv.gz\" --as time -i \"/Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/claims.wikibase-item.tsv.gz\" --as item -i \"/Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/claims.wikibase-property.tsv.gz\" --as wikibase_property -i \"/Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/qualifiers.tsv.gz\" --as qualifiers -i \"/Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/metadata.property.datatypes.tsv.gz\" --as datatypes -i \"/Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/derived.P279.tsv.gz\" --as p279 -i \"/Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/derived.P279star.tsv.gz\" --as p279star -i \"/Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/derived.P31.tsv.gz\" --as p31 -i \"/Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/metadata.in_degree.tsv.gz\" --as in_degree -i \"/Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/metadata.out_degree.tsv.gz\" --as out_degree -i \"/Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/metadata.pagerank.directed.tsv.gz\" --as pagerank_directed -i \"/Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/metadata.pagerank.undirected.tsv.gz\" --as pagerank_undirected --limit 3\n", "node1\tlabel\tnode2\tid\tnode2;wikidatatype\n", "P10\talias\t'gif'@en\tP10-alias-en-282226-0\t\n", "P10\talias\t'animation'@en\tP10-alias-en-2f86d8-0\t\n", "P10\talias\t'media'@en\tP10-alias-en-c1427e-0\t\n", "CPU times: user 3.69 ms, sys: 10.3 ms, total: 14 ms\n", "Wall time: 35.6 s\n" ] } ], "source": [ "%%time\n", "ck.load_files_into_cache()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Survey the data in the IMDb file\n", "\n", "The first step is to determine whether our KG has IMDb identifiers, as this will make it easy to integrate the IMDb data.\n", "The following query counts the number of items in our KG that have an `IMDb ID (P345)`.\n", "Fortunately, our KG has many entities with IMDb identifiers:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
movies with IMDb identifiers
06698
\n", "
" ], "text/plain": [ " movies with IMDb identifiers\n", "0 6698" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " query -i external_id\n", " --match '(movie)-[:P345]->(imdbid)'\n", " --return 'count(distinct movie) as `\"movies with IMDb identifiers\"`'\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As there are many items that have IMDb ids, we can augment our graph using data from IMDB. The following file comes from Kaggle: https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset\n", "\n", "Our KG, a subset of Wikidata is missing many fields that are present in the IMDb file. \n", "For example, when we look at `film (Q11424)` in our browser http://ckg07.isi.edu:3008/browser/Q11424, we see that there is no information about ratings." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "url = \"https://github.com/usc-isi-i2/kgtk-notebooks/raw/main/datasets/imdb\"\n", "other_files = [\n", " \"IMDB.csv.gz\",\n", " \"imdb-kg-node.tsv\",\n", " \"imdb-kg.tsv\"\n", "]\n", "\n", "for file in other_files:\n", " cmd = f\" wget {url}/{file} --directory-prefix={os.environ['GRAPH']}/imdb\"\n", " print(subprocess.getoutput(cmd))" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/qv/cxzpwz3j29x7n79vwpw253v80000gn/T/ipykernel_81921/3614653023.py:1: DtypeWarning: Columns (3) have mixed types. Specify dtype option on import or set low_memory=False.\n", " imdb = pd.read_csv(os.environ['GRAPH'] + \"/imdb/IMDB.csv.gz\")\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
imdb_title_idtitleoriginal_titleyeardate_publishedgenredurationcountrylanguagedirector...actorsdescriptionavg_votevotesbudgetusa_gross_incomeworlwide_gross_incomemetascorereviews_from_usersreviews_from_critics
0tt0000009Miss JerryMiss Jerry18941894-10-09Romance45USANoneAlexander Black...Blanche Bayliss, William Courtenay, Chauncey D...The adventures of a female reporter in the 1890s.5.9154NaNNaNNaNNaN1.02.0
1tt0000574The Story of the Kelly GangThe Story of the Kelly Gang19061906-12-26Biography, Crime, Drama70AustraliaNoneCharles Tait...Elizabeth Tait, John Tait, Norman Campbell, Be...True story of notorious Australian outlaw Ned ...6.1589$ 2250NaNNaNNaN7.07.0
2tt0001892Den sorte drømDen sorte drøm19111911-08-19Drama53Germany, DenmarkNaNUrban Gad...Asta Nielsen, Valdemar Psilander, Gunnar Helse...Two men of high rank are both wooing the beaut...5.8188NaNNaNNaNNaN5.02.0
3tt0002101CleopatraCleopatra19121912-11-13Drama, History100USAEnglishCharles L. Gaskill...Helen Gardner, Pearl Sindelar, Miss Fielding, ...The fabled queen of Egypt's affair with Roman ...5.2446$ 45000NaNNaNNaN25.03.0
4tt0002130L'InfernoL'Inferno19111911-03-06Adventure, Drama, Fantasy68ItalyItalianFrancesco Bertolini, Adolfo Padovan...Salvatore Papa, Arturo Pirovano, Giuseppe de L...Loosely adapted from Dante's Divine Comedy and...7.02237NaNNaNNaNNaN31.014.0
\n", "

5 rows × 22 columns

\n", "
" ], "text/plain": [ " imdb_title_id title original_title \\\n", "0 tt0000009 Miss Jerry Miss Jerry \n", "1 tt0000574 The Story of the Kelly Gang The Story of the Kelly Gang \n", "2 tt0001892 Den sorte drøm Den sorte drøm \n", "3 tt0002101 Cleopatra Cleopatra \n", "4 tt0002130 L'Inferno L'Inferno \n", "\n", " year date_published genre duration country \\\n", "0 1894 1894-10-09 Romance 45 USA \n", "1 1906 1906-12-26 Biography, Crime, Drama 70 Australia \n", "2 1911 1911-08-19 Drama 53 Germany, Denmark \n", "3 1912 1912-11-13 Drama, History 100 USA \n", "4 1911 1911-03-06 Adventure, Drama, Fantasy 68 Italy \n", "\n", " language director ... \\\n", "0 None Alexander Black ... \n", "1 None Charles Tait ... \n", "2 NaN Urban Gad ... \n", "3 English Charles L. Gaskill ... \n", "4 Italian Francesco Bertolini, Adolfo Padovan ... \n", "\n", " actors \\\n", "0 Blanche Bayliss, William Courtenay, Chauncey D... \n", "1 Elizabeth Tait, John Tait, Norman Campbell, Be... \n", "2 Asta Nielsen, Valdemar Psilander, Gunnar Helse... \n", "3 Helen Gardner, Pearl Sindelar, Miss Fielding, ... \n", "4 Salvatore Papa, Arturo Pirovano, Giuseppe de L... \n", "\n", " description avg_vote votes budget \\\n", "0 The adventures of a female reporter in the 1890s. 5.9 154 NaN \n", "1 True story of notorious Australian outlaw Ned ... 6.1 589 $ 2250 \n", "2 Two men of high rank are both wooing the beaut... 5.8 188 NaN \n", "3 The fabled queen of Egypt's affair with Roman ... 5.2 446 $ 45000 \n", "4 Loosely adapted from Dante's Divine Comedy and... 7.0 2237 NaN \n", "\n", " usa_gross_income worlwide_gross_income metascore reviews_from_users \\\n", "0 NaN NaN NaN 1.0 \n", "1 NaN NaN NaN 7.0 \n", "2 NaN NaN NaN 5.0 \n", "3 NaN NaN NaN 25.0 \n", "4 NaN NaN NaN 31.0 \n", "\n", " reviews_from_critics \n", "0 2.0 \n", "1 7.0 \n", "2 2.0 \n", "3 3.0 \n", "4 14.0 \n", "\n", "[5 rows x 22 columns]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "imdb = pd.read_csv(os.environ['GRAPH'] + \"/imdb/IMDB.csv.gz\")\n", "imdb.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prepare the IMDb file for ingestion in KGTK\n", "This step illustrates the power of the KGTK approach: you canb convert the CSV file into a KG by simply renaming the column in the file that contains the identifier for the records to have the column heading `id`. Once you do this, the file becomes a KGTK node file, where the `id` column identifies a node, and the other columns define properties about the node. \n", "\n", "The initial node file is only a first approaximation, as additional transformations will be needed to make the data compatible with the KG.\n", "In this step, you will use use Pandas to convert the IMDb file to a KG node file by renaming the `imdb_title_id` column to `id` as it is the identifier for a movie.\n", "\n", "We will also convert non-numeric values in the cells to strings to enable us later to use string transformations to clean the data." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idtitleoriginal_titleyeardate_publishedgenredurationcountrylanguagedirector...actorsdescriptionavg_votevotesbudgetusa_gross_incomeworlwide_gross_incomemetascorereviews_from_usersreviews_from_critics
0tt0000009Miss JerryMiss Jerry18941894-10-09Romance45USANoneAlexander Black...Blanche Bayliss, William Courtenay, Chauncey D...The adventures of a female reporter in the 1890s.5.9154NaNNaNNaNNaN1.02.0
1tt0000574The Story of the Kelly GangThe Story of the Kelly Gang19061906-12-26Biography, Crime, Drama70AustraliaNoneCharles Tait...Elizabeth Tait, John Tait, Norman Campbell, Be...True story of notorious Australian outlaw Ned ...6.1589$ 2250NaNNaNNaN7.07.0
2tt0001892Den sorte drømDen sorte drøm19111911-08-19Drama53Germany, DenmarkNaNUrban Gad...Asta Nielsen, Valdemar Psilander, Gunnar Helse...Two men of high rank are both wooing the beaut...5.8188NaNNaNNaNNaN5.02.0
3tt0002101CleopatraCleopatra19121912-11-13Drama, History100USAEnglishCharles L. Gaskill...Helen Gardner, Pearl Sindelar, Miss Fielding, ...The fabled queen of Egypt's affair with Roman ...5.2446$ 45000NaNNaNNaN25.03.0
4tt0002130L'InfernoL'Inferno19111911-03-06Adventure, Drama, Fantasy68ItalyItalianFrancesco Bertolini, Adolfo Padovan...Salvatore Papa, Arturo Pirovano, Giuseppe de L...Loosely adapted from Dante's Divine Comedy and...7.02237NaNNaNNaNNaN31.014.0
\n", "

5 rows × 22 columns

\n", "
" ], "text/plain": [ " id title original_title year \\\n", "0 tt0000009 Miss Jerry Miss Jerry 1894 \n", "1 tt0000574 The Story of the Kelly Gang The Story of the Kelly Gang 1906 \n", "2 tt0001892 Den sorte drøm Den sorte drøm 1911 \n", "3 tt0002101 Cleopatra Cleopatra 1912 \n", "4 tt0002130 L'Inferno L'Inferno 1911 \n", "\n", " date_published genre duration country \\\n", "0 1894-10-09 Romance 45 USA \n", "1 1906-12-26 Biography, Crime, Drama 70 Australia \n", "2 1911-08-19 Drama 53 Germany, Denmark \n", "3 1912-11-13 Drama, History 100 USA \n", "4 1911-03-06 Adventure, Drama, Fantasy 68 Italy \n", "\n", " language director ... \\\n", "0 None Alexander Black ... \n", "1 None Charles Tait ... \n", "2 NaN Urban Gad ... \n", "3 English Charles L. Gaskill ... \n", "4 Italian Francesco Bertolini, Adolfo Padovan ... \n", "\n", " actors \\\n", "0 Blanche Bayliss, William Courtenay, Chauncey D... \n", "1 Elizabeth Tait, John Tait, Norman Campbell, Be... \n", "2 Asta Nielsen, Valdemar Psilander, Gunnar Helse... \n", "3 Helen Gardner, Pearl Sindelar, Miss Fielding, ... \n", "4 Salvatore Papa, Arturo Pirovano, Giuseppe de L... \n", "\n", " description avg_vote votes budget \\\n", "0 The adventures of a female reporter in the 1890s. 5.9 154 NaN \n", "1 True story of notorious Australian outlaw Ned ... 6.1 589 $ 2250 \n", "2 Two men of high rank are both wooing the beaut... 5.8 188 NaN \n", "3 The fabled queen of Egypt's affair with Roman ... 5.2 446 $ 45000 \n", "4 Loosely adapted from Dante's Divine Comedy and... 7.0 2237 NaN \n", "\n", " usa_gross_income worlwide_gross_income metascore reviews_from_users \\\n", "0 NaN NaN NaN 1.0 \n", "1 NaN NaN NaN 7.0 \n", "2 NaN NaN NaN 5.0 \n", "3 NaN NaN NaN 25.0 \n", "4 NaN NaN NaN 31.0 \n", "\n", " reviews_from_critics \n", "0 2.0 \n", "1 7.0 \n", "2 2.0 \n", "3 3.0 \n", "4 14.0 \n", "\n", "[5 rows x 22 columns]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "imdb_kg = imdb.rename(columns={'imdb_title_id':'id'})\n", "imdb_kg.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In KGTK, most commands rquire KGs to be in edge format (`node1/labal/node2`), so we use the `normalize-nodes` command to convert the IMDb nodes file to edge format.\n", "The IMDb file contains 85K rows and 22 columns, gemerating 1.8 million edges.\n", "\n", "The IMDb KG is completely isolated from our KG, but it is still a valid KG that we4 can manipulate in KGTK:\n", "> In KGTK, literals, such as the string identifiers of movies, can be used as the subjects (`node1`) of triples." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After converting the original CSV file to a node file, you can use the `normmalize-nodes` command to convert the node file to and edge file usinf the `node1/label/node2` headings. The columns of the original file become the labels of the edges and appear in the second column. Having done this, the original IMDB file has become a KG that you can transform using other KGTK commands." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 9.66 s, sys: 2.19 s, total: 11.8 s\n", "Wall time: 12.6 s\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
node1labelnode2
0tt0000009titleMiss Jerry
1tt0000009original_titleMiss Jerry
2tt0000009year1894
3tt0000009date_published1894-10-09
4tt0000009genreRomance
............
1802950tt9914942usa_gross_incomeNaN
1802951tt9914942worlwide_gross_income$ 59794
1802952tt9914942metascoreNaN
1802953tt9914942reviews_from_usersNaN
1802954tt9914942reviews_from_critics2.0
\n", "

1802955 rows × 3 columns

\n", "
" ], "text/plain": [ " node1 label node2\n", "0 tt0000009 title Miss Jerry\n", "1 tt0000009 original_title Miss Jerry\n", "2 tt0000009 year 1894\n", "3 tt0000009 date_published 1894-10-09\n", "4 tt0000009 genre Romance\n", "... ... ... ...\n", "1802950 tt9914942 usa_gross_income NaN\n", "1802951 tt9914942 worlwide_gross_income $ 59794\n", "1802952 tt9914942 metascore NaN\n", "1802953 tt9914942 reviews_from_users NaN\n", "1802954 tt9914942 reviews_from_critics 2.0\n", "\n", "[1802955 rows x 3 columns]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "kgtk(imdb_kg, \"\"\"\n", " normalize-nodes\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Save the IMDb KG to a temorary file and give it the alias `imdbkg`:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2.05 s, sys: 271 ms, total: 2.32 s\n", "Wall time: 15.1 s\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
node1labelnode2
0tt0000009titleMiss Jerry
1tt0000009original_titleMiss Jerry
2tt0000009year1894
3tt0000009date_published1894-10-09
4tt0000009genreRomance
5tt0000009duration45
6tt0000009countryUSA
7tt0000009languageNone
8tt0000009directorAlexander Black
9tt0000009writerAlexander Black
\n", "
" ], "text/plain": [ " node1 label node2\n", "0 tt0000009 title Miss Jerry\n", "1 tt0000009 original_title Miss Jerry\n", "2 tt0000009 year 1894\n", "3 tt0000009 date_published 1894-10-09\n", "4 tt0000009 genre Romance\n", "5 tt0000009 duration 45\n", "6 tt0000009 country USA\n", "7 tt0000009 language None\n", "8 tt0000009 director Alexander Black\n", "9 tt0000009 writer Alexander Black" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "kgtk(imdb_kg, \"\"\"\n", " normalize-nodes -o $TEMP/imdb-kg-edge.tsv\n", "\"\"\")\n", "\n", "kgtk(\"query -i $TEMP/imdb-kg-edge.tsv --as imdbkg --limit 10\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Find overlap between our KG and the IMDb KG\n", "\n", "The IMDb KG uses IMDb identifiers as nodes, and our KG, extracted from Wikidata, defined property `IMDb ID (P345)` to record the IMDb identifiers of movies.\n", "\n", "We can use KGTK to query both graphs simultaneously to find the movies in our graph for which there are nodes in the IMDb KG.\n", "> We are only interested in the `node1` in the IMDb KG , so we don't have to list the other elements of the pattern, thee `label` and `node2`." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idP345
0Q1009788tt0071360
1Q1012216tt0095243
2Q102448tt0304141
3Q1025096tt0094824
4Q1026724tt0046035
.........
1286Q977196tt0098987
1287Q978974tt0104427
1288Q980041tt0983193
1289Q980308tt0080661
1290Q997206tt0116410
\n", "

1291 rows × 2 columns

\n", "
" ], "text/plain": [ " id P345\n", "0 Q1009788 tt0071360\n", "1 Q1012216 tt0095243\n", "2 Q102448 tt0304141\n", "3 Q1025096 tt0094824\n", "4 Q1026724 tt0046035\n", "... ... ...\n", "1286 Q977196 tt0098987\n", "1287 Q978974 tt0104427\n", "1288 Q980041 tt0983193\n", "1289 Q980308 tt0080661\n", "1290 Q997206 tt0116410\n", "\n", "[1291 rows x 2 columns]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " query -i imdbkg -i external_id\n", " --match '\n", " imdb: (imdb_id),\n", " external_id: (movie)-[:P345]->(imdb_id)\n", " '\n", " --return 'distinct movie as id, imdb_id as P345'\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Count the number of entities in our KG that have an `IMDb ID (P345)`:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
count_movies
06698
\n", "
" ], "text/plain": [ " count_movies\n", "0 6698" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " query -i external_id\n", " --match '(movie)-[:P345]->()'\n", " --return 'count(distinct movie) as count_movies'\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Find the number of `film (Q11424)` in our KG.\n", "> Our IMDb KG only has information for films." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
count_film
02447
\n", "
" ], "text/plain": [ " count_film\n", "0 2447" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " query -i all\n", " --match '(film)-[:P31]->(class)-[:P279star]->(:Q11424)'\n", " --return 'count(distinct film) as count_film'\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, over 6,000 entities in our KG have IMDb identifers, about half of these are films, and about half of those appear in our IMDb KG." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Integrate knowledge from the IMDb KG into our KG\n", "\n", "There are many fields in the IMDb KG that we could integrate into our KG. In this tutorial we will focus on three fields: the `reviews_from_users`, `reviews_from_critics` and `duration`.\n", "\n", "Approach:\n", "- Query both KGs, joining on the IMDb identifier, and extract the properties we want\n", "- ETL the data to conform to the Wikidata convetions used in our KG" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's start with `reviews_from_users`, `reviews_from_critics`, which are counts. If we search in the browser we see very specific properties named `number of ...`, but none referring to number of reviews, so we will create two new properties, and we will give them the identifiers `Pnumber_of_user_reviews` and `Pnumber_of_critic_reviews`\n", "> We follow the Wikidata convention where the identifiers for properties start with `P` and the identifiers for entities start with `Q`. KGTK does not require to use numbers following `P` or `Q`, and it does not require that we follow the `P/Q` convention.\n", "\n", "To integrate the data we will use two KGs:\n", "- The `imdbkg` with the data we pulled directly from the IMDb csv file\n", "- The `external_id` subset of our KG that contains all external identifiers for all items in our KG.\n", "\n", "The `match` clause does a join on the identifers, enabling us to pull from the `imdbkg` the items thatalso exist in our KG.\n", "The `opt` clause (optional) retrieves the `reviews_from_users` property from the `imdbkg` KG\n", "We use the `opt` clause so that if the `imdbkg` has a null, we still output the result.\n", "\n", "The `return` clause builds a new node file: we use the `film` node from our KG as the `id` for our node file, and we translate the `reviews_from_users` data from the `imdbkg` as a `Pnumber_of_user_reviews`, a new property in our KG (below we extend this query to integrate all the data).\n", "> In KGTK, you can integrate data from external sources using queries to map from one schema (ontology) to another, and do ETL transformations in one place.\n", "\n", "Our KG is starting to take shape. For every film in the intersection of the two KGs, we now have a new property `Pnumber_of_user_reviews`:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idPnumber_of_user_reviews
0Q1722418232.0
1Q1638726938.0
2Q184860216718.0
3Q1273675392.0
4Q60744822.0
.........
1286Q512744612.0
1287Q54375571.0
1288Q21427084NaN
1289Q51274879NaN
1290Q6510612NaN
\n", "

1291 rows × 2 columns

\n", "
" ], "text/plain": [ " id Pnumber_of_user_reviews\n", "0 Q172241 8232.0\n", "1 Q163872 6938.0\n", "2 Q18486021 6718.0\n", "3 Q127367 5392.0\n", "4 Q6074 4822.0\n", "... ... ...\n", "1286 Q51274461 2.0\n", "1287 Q5437557 1.0\n", "1288 Q21427084 NaN\n", "1289 Q51274879 NaN\n", "1290 Q6510612 NaN\n", "\n", "[1291 rows x 2 columns]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " query -i imdbkg -i external_id\n", " --match '\n", " external_id: (film)-[:P345]->(imdb_id),\n", " imdb: (imdb_id)'\n", " --opt '\n", " imdb: (imdb_id)-[:reviews_from_users]->(rfu)'\n", " --return 'distinct\n", " film as id, \n", " rfu as Pnumber_of_user_reviews\n", " '\n", " --order-by 'cast(rfu, int) desc'\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can extend the query to integrate `reviews_from_critics` and `duration` by adding additional `opt` clauses.\n", "Wikidata has propery `duration (P2047`), so we reuse it:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idPnumber_of_user_reviewsPnumber_of_critic_reviewsP2047
0Q1722418232.0164.0142
1Q1638726938.0423.0152
2Q184860216718.0717.0152
3Q1273675392.0340.0178
4Q60744822.0909.0138
...............
1286Q512744612.08.078
1287Q54375571.01.0100
1288Q21427084NaN6.090
1289Q51274879NaN3.071
1290Q6510612NaN2.082
\n", "

1291 rows × 4 columns

\n", "
" ], "text/plain": [ " id Pnumber_of_user_reviews Pnumber_of_critic_reviews P2047\n", "0 Q172241 8232.0 164.0 142\n", "1 Q163872 6938.0 423.0 152\n", "2 Q18486021 6718.0 717.0 152\n", "3 Q127367 5392.0 340.0 178\n", "4 Q6074 4822.0 909.0 138\n", "... ... ... ... ...\n", "1286 Q51274461 2.0 8.0 78\n", "1287 Q5437557 1.0 1.0 100\n", "1288 Q21427084 NaN 6.0 90\n", "1289 Q51274879 NaN 3.0 71\n", "1290 Q6510612 NaN 2.0 82\n", "\n", "[1291 rows x 4 columns]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " query -i imdbkg -i external_id\n", " --match '\n", " external_id: (film)-[:P345]->(imdb_id),\n", " imdb: (imdb_id)'\n", " --opt '\n", " imdb: (imdb_id)-[:reviews_from_users]->(rfu)'\n", " --opt '\n", " imdb: (imdb_id)-[:reviews_from_critics]->(rfc)'\n", " --opt '\n", " imdb: (imdb_id)-[:duration]->(duration)'\n", " --return 'distinct\n", " film as id, \n", " rfu as Pnumber_of_user_reviews,\n", " rfc as Pnumber_of_critic_reviews,\n", " duration as P2047\n", " '\n", " --order-by 'cast(rfu, int) desc'\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we browse `Terminator 2: Judgment Day (Q170564)`, we see that we ought to define units of measure (`minute (Q7727)`).\n", "In KGTK, quantities such as duration are represented as structured literals that incorporate the quantity and the units in one symbol.\n", "For example \"142 minutes\" is represented as `142Q7727`.\n", "\n", "It is easy to incorporate the units into our query by using the `printf` statement to combine multiple variables into a formatted string:\n", "> The case of units of measure is a good example of the benefit to do schema mapping and ETL in the same query.\n", "\n", "> We illustrate the use the KGTK function `kgtk_quantity_number` in the `order-by` clause to extract the numeric value of a quantity structured literals (https://kgtk.readthedocs.io/en/latest/transform/query/#functions-on-kgtk-numbers-and-quantities)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 10.4 ms, sys: 20 ms, total: 30.4 ms\n", "Wall time: 1.59 s\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idPnumber_of_user_reviewsPnumber_of_critic_reviewsP2047
0Q1218601127.022.0+260Q7727
1Q2875881.0197.0+238Q7727
2Q228186676.0146.0+228Q7727
3Q746733293.076.0+220Q7727
4Q148204233.0137.0+219Q7727
...............
1286Q51274879NaN3.0+71Q7727
1287Q553220814.09.0+71Q7727
1288Q73673151.039.0+71Q7727
1289Q43051181.0118.0+70Q7727
1290Q75495588.01.0+64Q7727
\n", "

1291 rows × 4 columns

\n", "
" ], "text/plain": [ " id Pnumber_of_user_reviews Pnumber_of_critic_reviews P2047\n", "0 Q1218601 127.0 22.0 +260Q7727\n", "1 Q2875 881.0 197.0 +238Q7727\n", "2 Q228186 676.0 146.0 +228Q7727\n", "3 Q746733 293.0 76.0 +220Q7727\n", "4 Q148204 233.0 137.0 +219Q7727\n", "... ... ... ... ...\n", "1286 Q51274879 NaN 3.0 +71Q7727\n", "1287 Q5532208 14.0 9.0 +71Q7727\n", "1288 Q736731 51.0 39.0 +71Q7727\n", "1289 Q43051 181.0 118.0 +70Q7727\n", "1290 Q7549558 8.0 1.0 +64Q7727\n", "\n", "[1291 rows x 4 columns]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "kgtk(\"\"\"\n", " query -i imdbkg -i external_id\n", " --match '\n", " external_id: (film)-[:P345]->(imdb_id),\n", " imdb: (imdb_id)'\n", " --opt '\n", " imdb: (imdb_id)-[:reviews_from_users]->(rfu)'\n", " --opt '\n", " imdb: (imdb_id)-[:reviews_from_critics]->(rfc)'\n", " --opt '\n", " imdb: (imdb_id)-[:duration]->(duration)'\n", " --return 'distinct\n", " film as id, \n", " rfu as Pnumber_of_user_reviews,\n", " rfc as Pnumber_of_critic_reviews,\n", " printf(\"+%sQ7727\", duration) as P2047\n", " '\n", " --order-by 'kgtk_quantity_number(P2047) desc'\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The query above extracts data from the simple `imdbkg` graph that uses the column headings as properties. The query maps the properties and data to new or existing Wikidata properties that we use in our KG, and performs simple data cleaning to conform to the Wikidata and KGTK format requirement. The resulting KG is in KGTK node format where the properties appear in the column headings; we save this graph in `$TEMP/imdb-import.node.tsv`:\n", "\n", "> It is possible to create the KGTK node file using Pandas as the node file has very simmilar structure to the original CSV file. One of the advantages of KGTK is that it empowers users to use tools that they are familiar with, e.g., Pandas, to transform KGTK graphs. Integration with KGTK is seamless." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "kgtk(\"\"\"\n", " query -i imdbkg -i external_id\n", " --match '\n", " external_id: (film)-[:P345]->(imdb_id),\n", " imdb: (imdb_id)'\n", " --opt '\n", " imdb: (imdb_id)-[:reviews_from_users]->(rfu)'\n", " --opt '\n", " imdb: (imdb_id)-[:reviews_from_critics]->(rfc)'\n", " --opt '\n", " imdb: (imdb_id)-[:duration]->(duration)'\n", " --return 'distinct\n", " film as id, \n", " rfu as Pnumber_of_user_reviews,\n", " rfc as Pnumber_of_critic_reviews,\n", " printf(\"+%sQ7727\", duration) as P2047\n", " '\n", " -o $TEMP/imdb-import.node.tsv\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The last step is to convert the node file to a an edge file so that it is in the standard form used by all KGTK commands.\n", "Convert the node file to edges and add edge ids:\n", "\n", "> If you had used Pandas to create the graph in node format, this same command would transform it into and edge file, and make the4 graph ready for use with any other KGTL commands." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
node1labelnode2id
0Q1009788Pnumber_of_user_reviews386.0Q1009788-Pnumber_of_user_reviews-7e80e4
1Q1009788Pnumber_of_critic_reviews169.0Q1009788-Pnumber_of_critic_reviews-ff2927
2Q1009788P2047+113Q7727Q1009788-P2047-54edad
3Q1012216Pnumber_of_user_reviews69.0Q1012216-Pnumber_of_user_reviews-d411e0
4Q1012216Pnumber_of_critic_reviews28.0Q1012216-Pnumber_of_critic_reviews-3884d7
...............
3868Q980308Pnumber_of_critic_reviews149.0Q980308-Pnumber_of_critic_reviews-c70c59
3869Q980308P2047+104Q7727Q980308-P2047-75da46
3870Q997206Pnumber_of_user_reviews69.0Q997206-Pnumber_of_user_reviews-d411e0
3871Q997206Pnumber_of_critic_reviews36.0Q997206-Pnumber_of_critic_reviews-f85f0b
3872Q997206P2047+130Q7727Q997206-P2047-a5e559
\n", "

3873 rows × 4 columns

\n", "
" ], "text/plain": [ " node1 label node2 \\\n", "0 Q1009788 Pnumber_of_user_reviews 386.0 \n", "1 Q1009788 Pnumber_of_critic_reviews 169.0 \n", "2 Q1009788 P2047 +113Q7727 \n", "3 Q1012216 Pnumber_of_user_reviews 69.0 \n", "4 Q1012216 Pnumber_of_critic_reviews 28.0 \n", "... ... ... ... \n", "3868 Q980308 Pnumber_of_critic_reviews 149.0 \n", "3869 Q980308 P2047 +104Q7727 \n", "3870 Q997206 Pnumber_of_user_reviews 69.0 \n", "3871 Q997206 Pnumber_of_critic_reviews 36.0 \n", "3872 Q997206 P2047 +130Q7727 \n", "\n", " id \n", "0 Q1009788-Pnumber_of_user_reviews-7e80e4 \n", "1 Q1009788-Pnumber_of_critic_reviews-ff2927 \n", "2 Q1009788-P2047-54edad \n", "3 Q1012216-Pnumber_of_user_reviews-d411e0 \n", "4 Q1012216-Pnumber_of_critic_reviews-3884d7 \n", "... ... \n", "3868 Q980308-Pnumber_of_critic_reviews-c70c59 \n", "3869 Q980308-P2047-75da46 \n", "3870 Q997206-Pnumber_of_user_reviews-d411e0 \n", "3871 Q997206-Pnumber_of_critic_reviews-f85f0b \n", "3872 Q997206-P2047-a5e559 \n", "\n", "[3873 rows x 4 columns]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " normalize-nodes -i $TEMP/imdb-import.node.tsv\n", " / add-id --id-style wikidata\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Store the IMDb data, now represented in the Wikidata ontology, into a KG file and give it alias `augment_imdb` so that we can use it in the next steps:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
node1labelnode2id
0Q1009788Pnumber_of_user_reviews386.0Q1009788-Pnumber_of_user_reviews-7e80e4
1Q1009788Pnumber_of_critic_reviews169.0Q1009788-Pnumber_of_critic_reviews-ff2927
\n", "
" ], "text/plain": [ " node1 label node2 \\\n", "0 Q1009788 Pnumber_of_user_reviews 386.0 \n", "1 Q1009788 Pnumber_of_critic_reviews 169.0 \n", "\n", " id \n", "0 Q1009788-Pnumber_of_user_reviews-7e80e4 \n", "1 Q1009788-Pnumber_of_critic_reviews-ff2927 " ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " normalize-nodes -i $TEMP/imdb-import.node.tsv\n", " / add-id --id-style wikidata\n", " -o $TEMP/augment.imdb.tsv\n", "\"\"\")\n", "\n", "kgtk(\"query -i $TEMP/augment.imdb.tsv --as augment_imdb --limit 2\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Deduplicate the durations (`duration (P2047)`)\n", "Wikidata defines durations for many films. If we want to integrate the IMDb data into our KG, we must be careful as there may be inconsitent durations.\n", "\n", "Approach:\n", "- Find the films where the durations in Wikidata and IMDB differ\n", "- In case of conflict, prefer the IMDb durations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, find films where the durations are different using a query that combines the orignal `all` graph and the `augment_imdb` graph.\n", "There are many differences, and we see cases where the data in Wikidata is suspect (the uncertainty is the same as the value):" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
filmduration_wdduration_imdbfilm;label
0Q1009788+109Q7727+113Q7727'The Conversation'@en
1Q102448+136Q7727+142Q7727'Harry Potter and the Prisoner of Azkaban'@en
2Q1025096+93Q7727+98Q7727'Caddyshack II'@en
3Q1027212+125[+125,+125]Q7727+126Q7727'Revolution'@en
4Q103474+143Q7727+149Q7727'2001: A Space Odyssey'@en
...............
375Q926825+128[+128,+128]Q7727+133Q7727'Flower Drum Song'@en
376Q929647+115Q7727+120Q7727'Mask'@en
377Q930372+175Q7727+179Q7727'Camelot'@en
378Q936425+128Q7727+133Q7727'A Very Long Engagement'@en
379Q936576+91Q7727+95Q7727'The Nutty Professor'@en
\n", "

380 rows × 4 columns

\n", "
" ], "text/plain": [ " film duration_wd duration_imdb \\\n", "0 Q1009788 +109Q7727 +113Q7727 \n", "1 Q102448 +136Q7727 +142Q7727 \n", "2 Q1025096 +93Q7727 +98Q7727 \n", "3 Q1027212 +125[+125,+125]Q7727 +126Q7727 \n", "4 Q103474 +143Q7727 +149Q7727 \n", ".. ... ... ... \n", "375 Q926825 +128[+128,+128]Q7727 +133Q7727 \n", "376 Q929647 +115Q7727 +120Q7727 \n", "377 Q930372 +175Q7727 +179Q7727 \n", "378 Q936425 +128Q7727 +133Q7727 \n", "379 Q936576 +91Q7727 +95Q7727 \n", "\n", " film;label \n", "0 'The Conversation'@en \n", "1 'Harry Potter and the Prisoner of Azkaban'@en \n", "2 'Caddyshack II'@en \n", "3 'Revolution'@en \n", "4 '2001: A Space Odyssey'@en \n", ".. ... \n", "375 'Flower Drum Song'@en \n", "376 'Mask'@en \n", "377 'Camelot'@en \n", "378 'A Very Long Engagement'@en \n", "379 'The Nutty Professor'@en \n", "\n", "[380 rows x 4 columns]" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " query -i all -i augment_imdb\n", " --match 'all: (film)-[:P2047]->(duration_wd), augment_imdb: (film)-[:P2047]->(duration_imdb)'\n", " --where 'duration_wd < duration_imdb'\n", " --return 'distinct film as film, duration_wd as duration_wd, duration_imdb as duration_imdb'\n", " --order-by 'film'\n", " / add-labels\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We decide to remove from our original graph the edges for durations that differ from the durations in IMDb. \n", "We enhance our query to return the ids of such edges:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
id
0Q1009788-P2047-16e299-ef37b7c7-0
1Q102448-P2047-031b29-0e3d9b06-0
2Q1025096-P2047-01387b-bea42f5f-0
3Q1027212-P2047-c18970-a8e52174-0
4Q103474-P2047-83a7f0-c5824aa3-0
......
375Q926825-P2047-7abb58-87a8eb10-0
376Q929647-P2047-b97fa8-669b35f9-0
377Q930372-P2047-ed974c-06f071f1-0
378Q936425-P2047-37faaa-7eb616cb-0
379Q936576-P2047-fc103b-bbd28441-0
\n", "

380 rows × 1 columns

\n", "
" ], "text/plain": [ " id\n", "0 Q1009788-P2047-16e299-ef37b7c7-0\n", "1 Q102448-P2047-031b29-0e3d9b06-0\n", "2 Q1025096-P2047-01387b-bea42f5f-0\n", "3 Q1027212-P2047-c18970-a8e52174-0\n", "4 Q103474-P2047-83a7f0-c5824aa3-0\n", ".. ...\n", "375 Q926825-P2047-7abb58-87a8eb10-0\n", "376 Q929647-P2047-b97fa8-669b35f9-0\n", "377 Q930372-P2047-ed974c-06f071f1-0\n", "378 Q936425-P2047-37faaa-7eb616cb-0\n", "379 Q936576-P2047-fc103b-bbd28441-0\n", "\n", "[380 rows x 1 columns]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " query -i all -i augment_imdb\n", " --match '\n", " all: (film)-[l:P2047]->(duration_wd),\n", " augment_imdb: (film)-[:P2047]->(duration_imdb)\n", " '\n", " --where 'duration_wd < duration_imdb'\n", " --return 'distinct l as id'\n", " --order-by 'l'\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Store the ids of the discrepant duration edges in a file:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "kgtk(\"\"\"\n", " query -i all -i augment_imdb\n", " --match '\n", " all: (film)-[l:P2047]->(duration_wd),\n", " augment_imdb: (film)-[:P2047]->(duration_imdb)\n", " '\n", " --where 'duration_wd < duration_imdb'\n", " --return 'distinct l as id'\n", " --order-by 'l'\n", " -o $TEMP/id.discrepant_durations.tsv\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compute the intersection and difference of the two files: the original `$quantity` file that contains all the quantity edges for our subset of Wikidata, and `$TEMP/id.discrepant_durations.tsv`, the file that contains the ids of duration edges in `$quantity` where the values in Wikidata and IMDb differ.\n", "\n", "Use the KGTK `ifexists` command (https://kgtk.readthedocs.io/en/latest/transform/ifexists/).\n", "Provide the two files, the names of the columns used as keys (`id`), and tell the command to output the `reject-file` containing edges in the input file that are not present in the filter file (this is the file we want):" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
node1labelnode2idnode2;wikidatatype
0Q1009788P2047+109Q7727Q1009788-P2047-16e299-ef37b7c7-0quantity
1Q102448P2047+136Q7727Q102448-P2047-031b29-0e3d9b06-0quantity
2Q1025096P2047+93Q7727Q1025096-P2047-01387b-bea42f5f-0quantity
3Q1027212P2047+125[+125,+125]Q7727Q1027212-P2047-c18970-a8e52174-0quantity
4Q103474P2047+143Q7727Q103474-P2047-83a7f0-c5824aa3-0quantity
..................
375Q926825P2047+128[+128,+128]Q7727Q926825-P2047-7abb58-87a8eb10-0quantity
376Q929647P2047+115Q7727Q929647-P2047-b97fa8-669b35f9-0quantity
377Q930372P2047+175Q7727Q930372-P2047-ed974c-06f071f1-0quantity
378Q936425P2047+128Q7727Q936425-P2047-37faaa-7eb616cb-0quantity
379Q936576P2047+91Q7727Q936576-P2047-fc103b-bbd28441-0quantity
\n", "

380 rows × 5 columns

\n", "
" ], "text/plain": [ " node1 label node2 id \\\n", "0 Q1009788 P2047 +109Q7727 Q1009788-P2047-16e299-ef37b7c7-0 \n", "1 Q102448 P2047 +136Q7727 Q102448-P2047-031b29-0e3d9b06-0 \n", "2 Q1025096 P2047 +93Q7727 Q1025096-P2047-01387b-bea42f5f-0 \n", "3 Q1027212 P2047 +125[+125,+125]Q7727 Q1027212-P2047-c18970-a8e52174-0 \n", "4 Q103474 P2047 +143Q7727 Q103474-P2047-83a7f0-c5824aa3-0 \n", ".. ... ... ... ... \n", "375 Q926825 P2047 +128[+128,+128]Q7727 Q926825-P2047-7abb58-87a8eb10-0 \n", "376 Q929647 P2047 +115Q7727 Q929647-P2047-b97fa8-669b35f9-0 \n", "377 Q930372 P2047 +175Q7727 Q930372-P2047-ed974c-06f071f1-0 \n", "378 Q936425 P2047 +128Q7727 Q936425-P2047-37faaa-7eb616cb-0 \n", "379 Q936576 P2047 +91Q7727 Q936576-P2047-fc103b-bbd28441-0 \n", "\n", " node2;wikidatatype \n", "0 quantity \n", "1 quantity \n", "2 quantity \n", "3 quantity \n", "4 quantity \n", ".. ... \n", "375 quantity \n", "376 quantity \n", "377 quantity \n", "378 quantity \n", "379 quantity \n", "\n", "[380 rows x 5 columns]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " ifexists -i $quantity \n", " --filter-on $TEMP/id.discrepant_durations.tsv\n", " --input-keys id\n", " --reject-file $OUT/quantity.minus_bad_durations.tsv\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sanity check: subtract the number of edges in the original quantity file `$quantity` minus the number of edges in `$OUT/quantity.minus_bad_durations.tsv`:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "380" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "old = !zcat < $quantity | wc -l\n", "new = !cat $OUT/quantity.minus_bad_durations.tsv | wc -l\n", "\n", "int(old[0]) - int(new[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Implement a better model for the IMDb data\n", "\n", "The data model we just implemented is very simple. In this section, we will improve it to be more compatible with how Wikidata models data.\n", "We will:\n", "- Add a `point in time (P585)` qualifier to record the time when the reviews were counted. We will use 2020-01-01, the date reported in the Kaggle site\n", "- Combine `Pnumber_of_user_reviews` and `Pnumber_of_critic_reviews` into a single `Pnumber_of_reviews` property, and use the `of (P642)` qualifier property \n", "to record whether the review counts are from a `customer (Q852835)` or a `critic (Q6430706)`\n", "\n", "The approach we will use is to tranform the `augment_imdb` data we created above to have the desired structure. \n", "> An alternative approach would be to generate the data in the desired structure in the first place." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following query transforms the `Pnumber_of_user_reviews` data into the desired structure. The work is done in the `return` statement:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
node1labelnode2P642P585
0Q1009788Pnumber_of_reviews386.0Q852835^2020-01-01T00:00:00Z/11
1Q1012216Pnumber_of_reviews69.0Q852835^2020-01-01T00:00:00Z/11
2Q102448Pnumber_of_reviews1600.0Q852835^2020-01-01T00:00:00Z/11
3Q1025096Pnumber_of_reviews89.0Q852835^2020-01-01T00:00:00Z/11
4Q1026724Pnumber_of_reviews20.0Q852835^2020-01-01T00:00:00Z/11
..................
1286Q977196Pnumber_of_reviews120.0Q852835^2020-01-01T00:00:00Z/11
1287Q978974Pnumber_of_reviews74.0Q852835^2020-01-01T00:00:00Z/11
1288Q980041Pnumber_of_reviews499.0Q852835^2020-01-01T00:00:00Z/11
1289Q980308Pnumber_of_reviews233.0Q852835^2020-01-01T00:00:00Z/11
1290Q997206Pnumber_of_reviews69.0Q852835^2020-01-01T00:00:00Z/11
\n", "

1291 rows × 5 columns

\n", "
" ], "text/plain": [ " node1 label node2 P642 P585\n", "0 Q1009788 Pnumber_of_reviews 386.0 Q852835 ^2020-01-01T00:00:00Z/11\n", "1 Q1012216 Pnumber_of_reviews 69.0 Q852835 ^2020-01-01T00:00:00Z/11\n", "2 Q102448 Pnumber_of_reviews 1600.0 Q852835 ^2020-01-01T00:00:00Z/11\n", "3 Q1025096 Pnumber_of_reviews 89.0 Q852835 ^2020-01-01T00:00:00Z/11\n", "4 Q1026724 Pnumber_of_reviews 20.0 Q852835 ^2020-01-01T00:00:00Z/11\n", "... ... ... ... ... ...\n", "1286 Q977196 Pnumber_of_reviews 120.0 Q852835 ^2020-01-01T00:00:00Z/11\n", "1287 Q978974 Pnumber_of_reviews 74.0 Q852835 ^2020-01-01T00:00:00Z/11\n", "1288 Q980041 Pnumber_of_reviews 499.0 Q852835 ^2020-01-01T00:00:00Z/11\n", "1289 Q980308 Pnumber_of_reviews 233.0 Q852835 ^2020-01-01T00:00:00Z/11\n", "1290 Q997206 Pnumber_of_reviews 69.0 Q852835 ^2020-01-01T00:00:00Z/11\n", "\n", "[1291 rows x 5 columns]" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " query -i augment_imdb\n", " --match '(film)-[:Pnumber_of_user_reviews]->(count)'\n", " --return '\n", " film as node1,\n", " \"Pnumber_of_reviews\" as label,\n", " count as node2,\n", " \"Q852835\" as P642,\n", " \"^2020-01-01T00:00:00Z/11\" as P585\n", " '\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As always, we want an edge file with ids, so we need to add the requisite `add-id` and `normalize` commands:\n", "> When doing this for real, we would not define multiple cells to revise the query. We would edit the cell, refining the pipeline to produce the desired results:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
node1labelnode2id
0Q1009788Pnumber_of_reviews386.0Q1009788-Pnumber_of_reviews-7e80e4
1Q1009788-Pnumber_of_reviews-7e80e4P642Q852835Q1009788-Pnumber_of_reviews-7e80e4-P642-Q85283...
2Q1009788-Pnumber_of_reviews-7e80e4P585^2020-01-01T00:00:00Z/11Q1009788-Pnumber_of_reviews-7e80e4-P585-^2020-...
3Q1012216Pnumber_of_reviews69.0Q1012216-Pnumber_of_reviews-d411e0
4Q1012216-Pnumber_of_reviews-d411e0P642Q852835Q1012216-Pnumber_of_reviews-d411e0-P642-Q85283...
...............
3868Q980308-Pnumber_of_reviews-ae91a5P642Q852835Q980308-Pnumber_of_reviews-ae91a5-P642-Q852835...
3869Q980308-Pnumber_of_reviews-ae91a5P585^2020-01-01T00:00:00Z/11Q980308-Pnumber_of_reviews-ae91a5-P585-^2020-0...
3870Q997206Pnumber_of_reviews69.0Q997206-Pnumber_of_reviews-d411e0
3871Q997206-Pnumber_of_reviews-d411e0P642Q852835Q997206-Pnumber_of_reviews-d411e0-P642-Q852835...
3872Q997206-Pnumber_of_reviews-d411e0P585^2020-01-01T00:00:00Z/11Q997206-Pnumber_of_reviews-d411e0-P585-^2020-0...
\n", "

3873 rows × 4 columns

\n", "
" ], "text/plain": [ " node1 label \\\n", "0 Q1009788 Pnumber_of_reviews \n", "1 Q1009788-Pnumber_of_reviews-7e80e4 P642 \n", "2 Q1009788-Pnumber_of_reviews-7e80e4 P585 \n", "3 Q1012216 Pnumber_of_reviews \n", "4 Q1012216-Pnumber_of_reviews-d411e0 P642 \n", "... ... ... \n", "3868 Q980308-Pnumber_of_reviews-ae91a5 P642 \n", "3869 Q980308-Pnumber_of_reviews-ae91a5 P585 \n", "3870 Q997206 Pnumber_of_reviews \n", "3871 Q997206-Pnumber_of_reviews-d411e0 P642 \n", "3872 Q997206-Pnumber_of_reviews-d411e0 P585 \n", "\n", " node2 \\\n", "0 386.0 \n", "1 Q852835 \n", "2 ^2020-01-01T00:00:00Z/11 \n", "3 69.0 \n", "4 Q852835 \n", "... ... \n", "3868 Q852835 \n", "3869 ^2020-01-01T00:00:00Z/11 \n", "3870 69.0 \n", "3871 Q852835 \n", "3872 ^2020-01-01T00:00:00Z/11 \n", "\n", " id \n", "0 Q1009788-Pnumber_of_reviews-7e80e4 \n", "1 Q1009788-Pnumber_of_reviews-7e80e4-P642-Q85283... \n", "2 Q1009788-Pnumber_of_reviews-7e80e4-P585-^2020-... \n", "3 Q1012216-Pnumber_of_reviews-d411e0 \n", "4 Q1012216-Pnumber_of_reviews-d411e0-P642-Q85283... \n", "... ... \n", "3868 Q980308-Pnumber_of_reviews-ae91a5-P642-Q852835... \n", "3869 Q980308-Pnumber_of_reviews-ae91a5-P585-^2020-0... \n", "3870 Q997206-Pnumber_of_reviews-d411e0 \n", "3871 Q997206-Pnumber_of_reviews-d411e0-P642-Q852835... \n", "3872 Q997206-Pnumber_of_reviews-d411e0-P585-^2020-0... \n", "\n", "[3873 rows x 4 columns]" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " query -i augment_imdb\n", " --match '(film)-[:Pnumber_of_user_reviews]->(count)'\n", " --return '\n", " film as node1,\n", " \"Pnumber_of_reviews\" as label,\n", " count as node2,\n", " \"Q852835\" as P642,\n", " \"^2020-01-01T00:00:00Z/11\" as P585\n", " '\n", " / add-id --id-style wikidata\n", " / normalize --add-id True\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Do, the same transformation for the counts of critic reviews, and concatenate the result for user (customaer) and critic into on result file `$TEMP/reviews.tsv`:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "kgtk(\"\"\"\n", " query -i augment_imdb\n", " --match '(film)-[:Pnumber_of_user_reviews]->(count)'\n", " --return '\n", " film as node1,\n", " \"Pnumber_of_reviews\" as label,\n", " count as node2,\n", " \"Q852835\" as P642,\n", " \"^2020-01-01T00:00:00Z/11\" as P585\n", " '\n", " / add-id --id-style wikidata\n", " / normalize --add-id True\n", " -o $TEMP/reviews.user.tsv\n", "\"\"\")\n", "\n", "kgtk(\"\"\"\n", " query -i augment_imdb\n", " --match '(film)-[:Pnumber_of_critic_reviews]->(count)'\n", " --return '\n", " film as node1,\n", " \"Pnumber_of_reviews\" as label,\n", " count as node2,\n", " \"Q6430706\" as P642,\n", " \"^2020-01-01T00:00:00Z/11\" as P585\n", " '\n", " / add-id --id-style wikidata\n", " / normalize --add-id True\n", " -o $TEMP/reviews.critic.tsv\n", "\"\"\")\n", "\n", "kgtk(\"\"\"\n", " cat -i $TEMP/reviews.user.tsv -i $TEMP/reviews.critic.tsv\n", " -o $TEMP/reviews.tsv\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we need to remove the edges using the old `Pnumber_of_user_reviews` and `Pnumber_of_critic_reviews` and put in the new ones. We will use the `filter` command to do \"grep\" on the old file, and return the lines that don't match `--invert True`. 
The `filter` command uses a simple pattern language of the form `node1 pattern ; label pattern ; node2 pattern` (https://kgtk.readthedocs.io/en/latest/transform/filter/):" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
node1labelnode2id
0Q1009788P2047+113Q7727Q1009788-P2047-54edad
1Q1012216P2047+129Q7727Q1012216-P2047-321f7d
2Q102448P2047+142Q7727Q102448-P2047-f4e24d
3Q1025096P2047+98Q7727Q1025096-P2047-6323a5
4Q1026724P2047+79Q7727Q1026724-P2047-0eb45e
...............
1286Q977196P2047+104Q7727Q977196-P2047-75da46
1287Q978974P2047+140Q7727Q978974-P2047-b00fbd
1288Q980041P2047+107Q7727Q980041-P2047-dec492
1289Q980308P2047+104Q7727Q980308-P2047-75da46
1290Q997206P2047+130Q7727Q997206-P2047-a5e559
\n", "

1291 rows × 4 columns

\n", "
" ], "text/plain": [ " node1 label node2 id\n", "0 Q1009788 P2047 +113Q7727 Q1009788-P2047-54edad\n", "1 Q1012216 P2047 +129Q7727 Q1012216-P2047-321f7d\n", "2 Q102448 P2047 +142Q7727 Q102448-P2047-f4e24d\n", "3 Q1025096 P2047 +98Q7727 Q1025096-P2047-6323a5\n", "4 Q1026724 P2047 +79Q7727 Q1026724-P2047-0eb45e\n", "... ... ... ... ...\n", "1286 Q977196 P2047 +104Q7727 Q977196-P2047-75da46\n", "1287 Q978974 P2047 +140Q7727 Q978974-P2047-b00fbd\n", "1288 Q980041 P2047 +107Q7727 Q980041-P2047-dec492\n", "1289 Q980308 P2047 +104Q7727 Q980308-P2047-75da46\n", "1290 Q997206 P2047 +130Q7727 Q997206-P2047-a5e559\n", "\n", "[1291 rows x 4 columns]" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " filter -i $TEMP/augment.imdb.tsv \n", " --pattern '; Pnumber_of_user_reviews, Pnumber_of_critic_reviews ;'\n", " --invert True\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is the unix-like grep/cat pipeline that replaces the old statements with the new ones:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "kgtk(\"\"\"\n", " filter -i $TEMP/augment.imdb.tsv \n", " --pattern '; Pnumber_of_user_reviews, Pnumber_of_critic_reviews ;'\n", " --invert True\n", " \n", " / cat -i - -i $TEMP/reviews.tsv\n", " -o $OUT/augment.imdb.tsv\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's double check the results. The data looks good.\n", "\n", "To add the new data to our KG, and view it in the browser, we have to define the new `Pnumber_of_reviews` property, give it a label, aliases and description, and most importantly, define its `datatype` as `quantity`\n", "> We will not do this step in this tutorial. See https://github.com/usc-isi-i2/kgtk/blob/dev/kgtk-properties/kgtk.properties.tsv for examples of how to define new properties." ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
node1labelnode2id
0Q1009788P2047+113Q7727Q1009788-P2047-54edad
1Q1012216P2047+129Q7727Q1012216-P2047-321f7d
2Q102448P2047+142Q7727Q102448-P2047-f4e24d
3Q1025096P2047+98Q7727Q1025096-P2047-6323a5
4Q1026724P2047+79Q7727Q1026724-P2047-0eb45e
...............
9032Q980308-Pnumber_of_reviews-c70c59P642Q6430706Q980308-Pnumber_of_reviews-c70c59-P642-Q643070...
9033Q980308-Pnumber_of_reviews-c70c59P585^2020-01-01T00:00:00Z/11Q980308-Pnumber_of_reviews-c70c59-P585-^2020-0...
9034Q997206Pnumber_of_reviews36.0Q997206-Pnumber_of_reviews-f85f0b
9035Q997206-Pnumber_of_reviews-f85f0bP642Q6430706Q997206-Pnumber_of_reviews-f85f0b-P642-Q643070...
9036Q997206-Pnumber_of_reviews-f85f0bP585^2020-01-01T00:00:00Z/11Q997206-Pnumber_of_reviews-f85f0b-P585-^2020-0...
\n", "

9037 rows × 4 columns

\n", "
" ], "text/plain": [ " node1 label \\\n", "0 Q1009788 P2047 \n", "1 Q1012216 P2047 \n", "2 Q102448 P2047 \n", "3 Q1025096 P2047 \n", "4 Q1026724 P2047 \n", "... ... ... \n", "9032 Q980308-Pnumber_of_reviews-c70c59 P642 \n", "9033 Q980308-Pnumber_of_reviews-c70c59 P585 \n", "9034 Q997206 Pnumber_of_reviews \n", "9035 Q997206-Pnumber_of_reviews-f85f0b P642 \n", "9036 Q997206-Pnumber_of_reviews-f85f0b P585 \n", "\n", " node2 \\\n", "0 +113Q7727 \n", "1 +129Q7727 \n", "2 +142Q7727 \n", "3 +98Q7727 \n", "4 +79Q7727 \n", "... ... \n", "9032 Q6430706 \n", "9033 ^2020-01-01T00:00:00Z/11 \n", "9034 36.0 \n", "9035 Q6430706 \n", "9036 ^2020-01-01T00:00:00Z/11 \n", "\n", " id \n", "0 Q1009788-P2047-54edad \n", "1 Q1012216-P2047-321f7d \n", "2 Q102448-P2047-f4e24d \n", "3 Q1025096-P2047-6323a5 \n", "4 Q1026724-P2047-0eb45e \n", "... ... \n", "9032 Q980308-Pnumber_of_reviews-c70c59-P642-Q643070... \n", "9033 Q980308-Pnumber_of_reviews-c70c59-P585-^2020-0... \n", "9034 Q997206-Pnumber_of_reviews-f85f0b \n", "9035 Q997206-Pnumber_of_reviews-f85f0b-P642-Q643070... \n", "9036 Q997206-Pnumber_of_reviews-f85f0b-P585-^2020-0... \n", "\n", "[9037 rows x 4 columns]" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"cat -i $OUT/augment.imdb.tsv\")" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Summary of this tutorial:\n", "\n", "In this tutorial we:\n", "- Extracted data from a CSV file to add to our KG\n", "- Converted a CSV file into a KGTK graph using Pandas to create a simple KG (this step is similar to doing a direct mapping using R2RML https://www.w3.org/TR/rdb-direct-mapping/)\n", "- Used KGTK to extract/transform/load (ETL) the data into a simple model consistent with the Wikidata ontology\n", "- Transformed the simple model into a better model\n", "- The output of this section is\n", " - `$OUT/augment.imdb.tsv`, containing the new edges we created from the IMDb file\n", " - `$OUT/quantity.minus_bad_durations.tsv`, a modification of the orignal `$quantity` file where we removed discrepant durations.\n", "\n", "> The original IMDb file contains many other interesting fields that would be fun to import. Many of them such as genre, director, actors require much more work as we must link the names to entities in Wikidata. Entity linking is outside the scope of KGTK; tools such as OpenRefine (https://openrefine.org/) or our own Table Linker (https://github.com/usc-isi-i2/table-linker) can be used to do entity linking. The KGTK `replace-nodes` command is helpful to process files containing probabilistic \"same-as\" statements to integrate the results of external entity linking tools." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Deploy the results (Optional)\n", "\n", "Deploy the tutorial files after completing this notebook." 
] }, { "cell_type": "raw", "metadata": {}, "source": [ "# The files produced by this notebook (see the summary above)\n", "files_to_deploy = [\n", "    \"augment.imdb.tsv\",\n", "    \"quantity.minus_bad_durations.tsv\"\n", "]\n", "\n", "# First copy all the files from the add-derived-graphs tutorial; we will overwrite the ones that change, e.g., all.tsv\n", "!cp -p {tutorial_deployment_path + \"/arnold\"}/*.tsv* {project_deployment_path}\n", "\n", "for file in files_to_deploy:\n", "    path = \"$OUT/\" + file\n", "    !cp -p {path} {project_deployment_path}\n", "\n", "all_file_path = project_deployment_path + \"/all.tsv.gz\"\n", "if os.path.exists(all_file_path):\n", "    !rm {all_file_path}\n", "!kgtk cat -i {tutorial_deployment_path + \"/arnold/all.tsv.gz\"} -i {project_deployment_path}/*.tsv -o {all_file_path}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "List all the deployed files:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "!ls -l {project_deployment_path}" ] } ], "metadata": { "kernelspec": { "display_name": "kgtk-env", "language": "python", "name": "kgtk-env" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 4 }