{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Augment A Knowledge Graph Using CSV Files\n", "\n", "A common knowledge graph augmentation use case is to integrate structured data present in databases, CSV and Excel files. KGTK provides an alternative to languages such as R2RML (https://www.w3.org/TR/r2rml/) and RML (https://rml.io/specs/rml/) or tools such as Karma (https://github.com/usc-isi-i2/Web-Karma) for integrating tabluar data. These languages and tools work by defining a mapping between a structured souce and an ontology. KGTK provides capabilities to transform the original structured data into the TSV format used in KGTK to store KGs. \n", "\n", "This tutorial illustrates the KGTK approach using a CSV file downloaded from Kaggle (https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset) containing information about moves from IMDb (https://www.imdb.com/).\n", "\n", "This tutorial is divided into multiple sections:\n", "- Survey what is available in our KG and in the IMDb file\n", "- Prepare the IMDb file for ingestion in KGTK\n", "- Find overlap betwwen our KG and the IMDb KG\n", "- Integrate knowledge from the IMDb KG into our KG\n", "- Implement a better model for the data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 0: Install KGTK" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Only run the following cell if KGTK is not installed.\n", " For example, if running in [Google Colab](https://colab.research.google.com/)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install kgtk" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preamble: set up the environment and files used in the tutorial" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import io\n", "import os\n", "import subprocess\n", "import sys\n", "import csv\n", "import pandas as pd\n", "\n", "from kgtk.configure_kgtk_notebooks import ConfigureKGTK\n", "from kgtk.functions import kgtk, kypher" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "tags": [ "parameters" ] }, "outputs": [], "source": [ "# Parameters\n", "\n", "# Folder on local machine where to create the output and temporary folders\n", "input_path = None\n", "\n", "output_path = \"/tmp/projects\"\n", "project_name = \"tutorial-augment\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These are all the KG files that we use in this tutorial:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "files = [\n", " \"all\",\n", " \"label\",\n", " \"alias\",\n", " \"description\",\n", " \"external_id\",\n", " \"monolingualtext\",\n", " \"quantity\",\n", " \"string\",\n", " \"time\",\n", " \"item\",\n", " \"wikibase_property\",\n", " \"qualifiers\",\n", " \"datatypes\",\n", " \"p279\",\n", " \"p279star\",\n", " \"p31\",\n", " \"in_degree\",\n", " \"out_degree\",\n", " \"pagerank_directed\",\n", " \"pagerank_undirected\"\n", "]\n", "input_files_url = \"https://github.com/usc-isi-i2/kgtk-tutorial-files/raw/main/datasets/arnold-profiled\"\n", "ck = ConfigureKGTK(files, input_files_url=input_files_url)\n", "ck.configure_kgtk(input_graph_path=input_path,\n", " output_path=output_path,\n", " project_name=project_name)\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "TEMP: /tmp/projects/tutorial-augment/temp.tutorial-augment\n", "STORE: 
/tmp/projects/tutorial-augment/temp.tutorial-augment/wikidata.sqlite3.db\n", "kypher: kgtk query --graph-cache /tmp/projects/tutorial-augment/temp.tutorial-augment/wikidata.sqlite3.db\n", "KGTK_GRAPH_CACHE: /tmp/projects/tutorial-augment/temp.tutorial-augment/wikidata.sqlite3.db\n", "USE_CASES_DIR: /Users/amandeep/Github/kgtk-notebooks/use-cases\n", "kgtk: kgtk\n", "EXAMPLES_DIR: /Users/amandeep/Github/kgtk-notebooks/examples\n", "KGTK_LABEL_FILE: /Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/labels.en.tsv.gz\n", "KGTK_OPTION_DEBUG: false\n", "GRAPH: /Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input\n", "OUT: /tmp/projects/tutorial-augment\n", "all: /Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/all.tsv.gz\n", "label: /Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/labels.en.tsv.gz\n", "alias: /Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/aliases.en.tsv.gz\n", "description: /Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/descriptions.en.tsv.gz\n", "external_id: /Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/claims.external-id.tsv.gz\n", "monolingualtext: /Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/claims.monolingualtext.tsv.gz\n", "quantity: /Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/claims.quantity.tsv.gz\n", "string: /Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/claims.string.tsv.gz\n", "time: /Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/claims.time.tsv.gz\n", "item: /Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/claims.wikibase-item.tsv.gz\n", "wikibase_property: /Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/claims.wikibase-property.tsv.gz\n", "qualifiers: /Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/qualifiers.tsv.gz\n", "datatypes: /Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/metadata.property.datatypes.tsv.gz\n", "p279: /Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/derived.P279.tsv.gz\n", "p279star: /Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/derived.P279star.tsv.gz\n", "p31: /Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/derived.P31.tsv.gz\n", "in_degree: /Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/metadata.in_degree.tsv.gz\n", "out_degree: /Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/metadata.out_degree.tsv.gz\n", "pagerank_directed: /Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/metadata.pagerank.directed.tsv.gz\n", "pagerank_undirected: /Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/metadata.pagerank.undirected.tsv.gz\n" ] } ], "source": [ "ck.print_env_variables()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load all my files into the kypher cache so that all graph aliases are defined" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "kgtk query --graph-cache /tmp/projects/tutorial-augment/temp.tutorial-augment/wikidata.sqlite3.db -i \"/Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/all.tsv.gz\" --as all -i \"/Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/labels.en.tsv.gz\" --as label -i \"/Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/aliases.en.tsv.gz\" --as alias -i \"/Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/descriptions.en.tsv.gz\" --as description -i \"/Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/claims.external-id.tsv.gz\" --as external_id -i 
\"/Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/claims.monolingualtext.tsv.gz\" --as monolingualtext -i \"/Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/claims.quantity.tsv.gz\" --as quantity -i \"/Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/claims.string.tsv.gz\" --as string -i \"/Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/claims.time.tsv.gz\" --as time -i \"/Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/claims.wikibase-item.tsv.gz\" --as item -i \"/Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/claims.wikibase-property.tsv.gz\" --as wikibase_property -i \"/Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/qualifiers.tsv.gz\" --as qualifiers -i \"/Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/metadata.property.datatypes.tsv.gz\" --as datatypes -i \"/Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/derived.P279.tsv.gz\" --as p279 -i \"/Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/derived.P279star.tsv.gz\" --as p279star -i \"/Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/derived.P31.tsv.gz\" --as p31 -i \"/Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/metadata.in_degree.tsv.gz\" --as in_degree -i \"/Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/metadata.out_degree.tsv.gz\" --as out_degree -i \"/Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/metadata.pagerank.directed.tsv.gz\" --as pagerank_directed -i \"/Users/amandeep/isi-kgtk-tutorial/tutorial-augment_input/metadata.pagerank.undirected.tsv.gz\" --as pagerank_undirected --limit 3\n", "node1\tlabel\tnode2\tid\tnode2;wikidatatype\n", "P10\talias\t'gif'@en\tP10-alias-en-282226-0\t\n", "P10\talias\t'animation'@en\tP10-alias-en-2f86d8-0\t\n", "P10\talias\t'media'@en\tP10-alias-en-c1427e-0\t\n", "CPU times: user 3.69 ms, sys: 10.3 ms, total: 14 ms\n", "Wall time: 35.6 s\n" ] } ], "source": [ "%%time\n", "ck.load_files_into_cache()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Survey the data in the IMDb file\n", "\n", "The first step is to determine whether our KG has IMDb identifiers, as this will make it easy to integrate the IMDb data.\n", "The following query counts the number of items in our KG that have an `IMDb ID (P345)`.\n", "Fortunately, our KG has many entities with IMDb identifiers:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
movies with IMDb identifiers
06698
\n", "
" ], "text/plain": [ " movies with IMDb identifiers\n", "0 6698" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " query -i external_id\n", " --match '(movie)-[:P345]->(imdbid)'\n", " --return 'count(distinct movie) as `\"movies with IMDb identifiers\"`'\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As there are many items that have IMDb ids, we can augment our graph using data from IMDB. The following file comes from Kaggle: https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset\n", "\n", "Our KG, a subset of Wikidata is missing many fields that are present in the IMDb file. \n", "For example, when we look at `film (Q11424)` in our browser http://ckg07.isi.edu:3008/browser/Q11424, we see that there is no information about ratings." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "url = \"https://github.com/usc-isi-i2/kgtk-notebooks/raw/main/datasets/imdb\"\n", "other_files = [\n", " \"IMDB.csv.gz\",\n", " \"imdb-kg-node.tsv\",\n", " \"imdb-kg.tsv\"\n", "]\n", "\n", "for file in other_files:\n", " cmd = f\" wget {url}/{file} --directory-prefix={os.environ['GRAPH']}/imdb\"\n", " print(subprocess.getoutput(cmd))" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/qv/cxzpwz3j29x7n79vwpw253v80000gn/T/ipykernel_81921/3614653023.py:1: DtypeWarning: Columns (3) have mixed types. Specify dtype option on import or set low_memory=False.\n", " imdb = pd.read_csv(os.environ['GRAPH'] + \"/imdb/IMDB.csv.gz\")\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
imdb_title_idtitleoriginal_titleyeardate_publishedgenredurationcountrylanguagedirector...actorsdescriptionavg_votevotesbudgetusa_gross_incomeworlwide_gross_incomemetascorereviews_from_usersreviews_from_critics
0tt0000009Miss JerryMiss Jerry18941894-10-09Romance45USANoneAlexander Black...Blanche Bayliss, William Courtenay, Chauncey D...The adventures of a female reporter in the 1890s.5.9154NaNNaNNaNNaN1.02.0
1tt0000574The Story of the Kelly GangThe Story of the Kelly Gang19061906-12-26Biography, Crime, Drama70AustraliaNoneCharles Tait...Elizabeth Tait, John Tait, Norman Campbell, Be...True story of notorious Australian outlaw Ned ...6.1589$ 2250NaNNaNNaN7.07.0
2tt0001892Den sorte drømDen sorte drøm19111911-08-19Drama53Germany, DenmarkNaNUrban Gad...Asta Nielsen, Valdemar Psilander, Gunnar Helse...Two men of high rank are both wooing the beaut...5.8188NaNNaNNaNNaN5.02.0
3tt0002101CleopatraCleopatra19121912-11-13Drama, History100USAEnglishCharles L. Gaskill...Helen Gardner, Pearl Sindelar, Miss Fielding, ...The fabled queen of Egypt's affair with Roman ...5.2446$ 45000NaNNaNNaN25.03.0
4tt0002130L'InfernoL'Inferno19111911-03-06Adventure, Drama, Fantasy68ItalyItalianFrancesco Bertolini, Adolfo Padovan...Salvatore Papa, Arturo Pirovano, Giuseppe de L...Loosely adapted from Dante's Divine Comedy and...7.02237NaNNaNNaNNaN31.014.0
\n", "

5 rows × 22 columns

\n", "
" ], "text/plain": [ " imdb_title_id title original_title \\\n", "0 tt0000009 Miss Jerry Miss Jerry \n", "1 tt0000574 The Story of the Kelly Gang The Story of the Kelly Gang \n", "2 tt0001892 Den sorte drøm Den sorte drøm \n", "3 tt0002101 Cleopatra Cleopatra \n", "4 tt0002130 L'Inferno L'Inferno \n", "\n", " year date_published genre duration country \\\n", "0 1894 1894-10-09 Romance 45 USA \n", "1 1906 1906-12-26 Biography, Crime, Drama 70 Australia \n", "2 1911 1911-08-19 Drama 53 Germany, Denmark \n", "3 1912 1912-11-13 Drama, History 100 USA \n", "4 1911 1911-03-06 Adventure, Drama, Fantasy 68 Italy \n", "\n", " language director ... \\\n", "0 None Alexander Black ... \n", "1 None Charles Tait ... \n", "2 NaN Urban Gad ... \n", "3 English Charles L. Gaskill ... \n", "4 Italian Francesco Bertolini, Adolfo Padovan ... \n", "\n", " actors \\\n", "0 Blanche Bayliss, William Courtenay, Chauncey D... \n", "1 Elizabeth Tait, John Tait, Norman Campbell, Be... \n", "2 Asta Nielsen, Valdemar Psilander, Gunnar Helse... \n", "3 Helen Gardner, Pearl Sindelar, Miss Fielding, ... \n", "4 Salvatore Papa, Arturo Pirovano, Giuseppe de L... \n", "\n", " description avg_vote votes budget \\\n", "0 The adventures of a female reporter in the 1890s. 5.9 154 NaN \n", "1 True story of notorious Australian outlaw Ned ... 6.1 589 $ 2250 \n", "2 Two men of high rank are both wooing the beaut... 5.8 188 NaN \n", "3 The fabled queen of Egypt's affair with Roman ... 5.2 446 $ 45000 \n", "4 Loosely adapted from Dante's Divine Comedy and... 7.0 2237 NaN \n", "\n", " usa_gross_income worlwide_gross_income metascore reviews_from_users \\\n", "0 NaN NaN NaN 1.0 \n", "1 NaN NaN NaN 7.0 \n", "2 NaN NaN NaN 5.0 \n", "3 NaN NaN NaN 25.0 \n", "4 NaN NaN NaN 31.0 \n", "\n", " reviews_from_critics \n", "0 2.0 \n", "1 7.0 \n", "2 2.0 \n", "3 3.0 \n", "4 14.0 \n", "\n", "[5 rows x 22 columns]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "imdb = pd.read_csv(os.environ['GRAPH'] + \"/imdb/IMDB.csv.gz\")\n", "imdb.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prepare the IMDb file for ingestion in KGTK\n", "This step illustrates the power of the KGTK approach: you canb convert the CSV file into a KG by simply renaming the column in the file that contains the identifier for the records to have the column heading `id`. Once you do this, the file becomes a KGTK node file, where the `id` column identifies a node, and the other columns define properties about the node. \n", "\n", "The initial node file is only a first approaximation, as additional transformations will be needed to make the data compatible with the KG.\n", "In this step, you will use use Pandas to convert the IMDb file to a KG node file by renaming the `imdb_title_id` column to `id` as it is the identifier for a movie.\n", "\n", "We will also convert non-numeric values in the cells to strings to enable us later to use string transformations to clean the data." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idtitleoriginal_titleyeardate_publishedgenredurationcountrylanguagedirector...actorsdescriptionavg_votevotesbudgetusa_gross_incomeworlwide_gross_incomemetascorereviews_from_usersreviews_from_critics
0tt0000009Miss JerryMiss Jerry18941894-10-09Romance45USANoneAlexander Black...Blanche Bayliss, William Courtenay, Chauncey D...The adventures of a female reporter in the 1890s.5.9154NaNNaNNaNNaN1.02.0
1tt0000574The Story of the Kelly GangThe Story of the Kelly Gang19061906-12-26Biography, Crime, Drama70AustraliaNoneCharles Tait...Elizabeth Tait, John Tait, Norman Campbell, Be...True story of notorious Australian outlaw Ned ...6.1589$ 2250NaNNaNNaN7.07.0
2tt0001892Den sorte drømDen sorte drøm19111911-08-19Drama53Germany, DenmarkNaNUrban Gad...Asta Nielsen, Valdemar Psilander, Gunnar Helse...Two men of high rank are both wooing the beaut...5.8188NaNNaNNaNNaN5.02.0
3tt0002101CleopatraCleopatra19121912-11-13Drama, History100USAEnglishCharles L. Gaskill...Helen Gardner, Pearl Sindelar, Miss Fielding, ...The fabled queen of Egypt's affair with Roman ...5.2446$ 45000NaNNaNNaN25.03.0
4tt0002130L'InfernoL'Inferno19111911-03-06Adventure, Drama, Fantasy68ItalyItalianFrancesco Bertolini, Adolfo Padovan...Salvatore Papa, Arturo Pirovano, Giuseppe de L...Loosely adapted from Dante's Divine Comedy and...7.02237NaNNaNNaNNaN31.014.0
\n", "

5 rows × 22 columns

\n", "
" ], "text/plain": [ " id title original_title year \\\n", "0 tt0000009 Miss Jerry Miss Jerry 1894 \n", "1 tt0000574 The Story of the Kelly Gang The Story of the Kelly Gang 1906 \n", "2 tt0001892 Den sorte drøm Den sorte drøm 1911 \n", "3 tt0002101 Cleopatra Cleopatra 1912 \n", "4 tt0002130 L'Inferno L'Inferno 1911 \n", "\n", " date_published genre duration country \\\n", "0 1894-10-09 Romance 45 USA \n", "1 1906-12-26 Biography, Crime, Drama 70 Australia \n", "2 1911-08-19 Drama 53 Germany, Denmark \n", "3 1912-11-13 Drama, History 100 USA \n", "4 1911-03-06 Adventure, Drama, Fantasy 68 Italy \n", "\n", " language director ... \\\n", "0 None Alexander Black ... \n", "1 None Charles Tait ... \n", "2 NaN Urban Gad ... \n", "3 English Charles L. Gaskill ... \n", "4 Italian Francesco Bertolini, Adolfo Padovan ... \n", "\n", " actors \\\n", "0 Blanche Bayliss, William Courtenay, Chauncey D... \n", "1 Elizabeth Tait, John Tait, Norman Campbell, Be... \n", "2 Asta Nielsen, Valdemar Psilander, Gunnar Helse... \n", "3 Helen Gardner, Pearl Sindelar, Miss Fielding, ... \n", "4 Salvatore Papa, Arturo Pirovano, Giuseppe de L... \n", "\n", " description avg_vote votes budget \\\n", "0 The adventures of a female reporter in the 1890s. 5.9 154 NaN \n", "1 True story of notorious Australian outlaw Ned ... 6.1 589 $ 2250 \n", "2 Two men of high rank are both wooing the beaut... 5.8 188 NaN \n", "3 The fabled queen of Egypt's affair with Roman ... 5.2 446 $ 45000 \n", "4 Loosely adapted from Dante's Divine Comedy and... 7.0 2237 NaN \n", "\n", " usa_gross_income worlwide_gross_income metascore reviews_from_users \\\n", "0 NaN NaN NaN 1.0 \n", "1 NaN NaN NaN 7.0 \n", "2 NaN NaN NaN 5.0 \n", "3 NaN NaN NaN 25.0 \n", "4 NaN NaN NaN 31.0 \n", "\n", " reviews_from_critics \n", "0 2.0 \n", "1 7.0 \n", "2 2.0 \n", "3 3.0 \n", "4 14.0 \n", "\n", "[5 rows x 22 columns]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "imdb_kg = imdb.rename(columns={'imdb_title_id':'id'})\n", "imdb_kg.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In KGTK, most commands rquire KGs to be in edge format (`node1/labal/node2`), so we use the `normalize-nodes` command to convert the IMDb nodes file to edge format.\n", "The IMDb file contains 85K rows and 22 columns, gemerating 1.8 million edges.\n", "\n", "The IMDb KG is completely isolated from our KG, but it is still a valid KG that we4 can manipulate in KGTK:\n", "> In KGTK, literals, such as the string identifiers of movies, can be used as the subjects (`node1`) of triples." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After converting the original CSV file to a node file, you can use the `normmalize-nodes` command to convert the node file to and edge file usinf the `node1/label/node2` headings. The columns of the original file become the labels of the edges and appear in the second column. Having done this, the original IMDB file has become a KG that you can transform using other KGTK commands." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 9.66 s, sys: 2.19 s, total: 11.8 s\n", "Wall time: 12.6 s\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
node1labelnode2
0tt0000009titleMiss Jerry
1tt0000009original_titleMiss Jerry
2tt0000009year1894
3tt0000009date_published1894-10-09
4tt0000009genreRomance
............
1802950tt9914942usa_gross_incomeNaN
1802951tt9914942worlwide_gross_income$ 59794
1802952tt9914942metascoreNaN
1802953tt9914942reviews_from_usersNaN
1802954tt9914942reviews_from_critics2.0
\n", "

1802955 rows × 3 columns

\n", "
" ], "text/plain": [ " node1 label node2\n", "0 tt0000009 title Miss Jerry\n", "1 tt0000009 original_title Miss Jerry\n", "2 tt0000009 year 1894\n", "3 tt0000009 date_published 1894-10-09\n", "4 tt0000009 genre Romance\n", "... ... ... ...\n", "1802950 tt9914942 usa_gross_income NaN\n", "1802951 tt9914942 worlwide_gross_income $ 59794\n", "1802952 tt9914942 metascore NaN\n", "1802953 tt9914942 reviews_from_users NaN\n", "1802954 tt9914942 reviews_from_critics 2.0\n", "\n", "[1802955 rows x 3 columns]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "kgtk(imdb_kg, \"\"\"\n", " normalize-nodes\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Save the IMDb KG to a temorary file and give it the alias `imdbkg`:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2.05 s, sys: 271 ms, total: 2.32 s\n", "Wall time: 15.1 s\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
node1labelnode2
0tt0000009titleMiss Jerry
1tt0000009original_titleMiss Jerry
2tt0000009year1894
3tt0000009date_published1894-10-09
4tt0000009genreRomance
5tt0000009duration45
6tt0000009countryUSA
7tt0000009languageNone
8tt0000009directorAlexander Black
9tt0000009writerAlexander Black
\n", "
" ], "text/plain": [ " node1 label node2\n", "0 tt0000009 title Miss Jerry\n", "1 tt0000009 original_title Miss Jerry\n", "2 tt0000009 year 1894\n", "3 tt0000009 date_published 1894-10-09\n", "4 tt0000009 genre Romance\n", "5 tt0000009 duration 45\n", "6 tt0000009 country USA\n", "7 tt0000009 language None\n", "8 tt0000009 director Alexander Black\n", "9 tt0000009 writer Alexander Black" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "kgtk(imdb_kg, \"\"\"\n", " normalize-nodes -o $TEMP/imdb-kg-edge.tsv\n", "\"\"\")\n", "\n", "kgtk(\"query -i $TEMP/imdb-kg-edge.tsv --as imdbkg --limit 10\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Find overlap between our KG and the IMDb KG\n", "\n", "The IMDb KG uses IMDb identifiers as nodes, and our KG, extracted from Wikidata, defined property `IMDb ID (P345)` to record the IMDb identifiers of movies.\n", "\n", "We can use KGTK to query both graphs simultaneously to find the movies in our graph for which there are nodes in the IMDb KG.\n", "> We are only interested in the `node1` in the IMDb KG , so we don't have to list the other elements of the pattern, thee `label` and `node2`." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idP345
0Q1009788tt0071360
1Q1012216tt0095243
2Q102448tt0304141
3Q1025096tt0094824
4Q1026724tt0046035
.........
1286Q977196tt0098987
1287Q978974tt0104427
1288Q980041tt0983193
1289Q980308tt0080661
1290Q997206tt0116410
\n", "

1291 rows × 2 columns

\n", "
" ], "text/plain": [ " id P345\n", "0 Q1009788 tt0071360\n", "1 Q1012216 tt0095243\n", "2 Q102448 tt0304141\n", "3 Q1025096 tt0094824\n", "4 Q1026724 tt0046035\n", "... ... ...\n", "1286 Q977196 tt0098987\n", "1287 Q978974 tt0104427\n", "1288 Q980041 tt0983193\n", "1289 Q980308 tt0080661\n", "1290 Q997206 tt0116410\n", "\n", "[1291 rows x 2 columns]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " query -i imdbkg -i external_id\n", " --match '\n", " imdb: (imdb_id),\n", " external_id: (movie)-[:P345]->(imdb_id)\n", " '\n", " --return 'distinct movie as id, imdb_id as P345'\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Count the number of entities in our KG that have an `IMDb ID (P345)`:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
count_movies
06698
\n", "
" ], "text/plain": [ " count_movies\n", "0 6698" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " query -i external_id\n", " --match '(movie)-[:P345]->()'\n", " --return 'count(distinct movie) as count_movies'\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Find the number of `film (Q11424)` in our KG.\n", "> Our IMDb KG only has information for films." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
count_film
02447
\n", "
" ], "text/plain": [ " count_film\n", "0 2447" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " query -i all\n", " --match '(film)-[:P31]->(class)-[:P279star]->(:Q11424)'\n", " --return 'count(distinct film) as count_film'\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, over 6,000 entities in our KG have IMDb identifers, about half of these are films, and about half of those appear in our IMDb KG." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Integrate knowledge from the IMDb KG into our KG\n", "\n", "There are many fields in the IMDb KG that we could integrate into our KG. In this tutorial we will focus on three fields: the `reviews_from_users`, `reviews_from_critics` and `duration`.\n", "\n", "Approach:\n", "- Query both KGs, joining on the IMDb identifier, and extract the properties we want\n", "- ETL the data to conform to the Wikidata convetions used in our KG" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's start with `reviews_from_users`, `reviews_from_critics`, which are counts. If we search in the browser we see very specific properties named `number of ...`, but none referring to number of reviews, so we will create two new properties, and we will give them the identifiers `Pnumber_of_user_reviews` and `Pnumber_of_critic_reviews`\n", "> We follow the Wikidata convention where the identifiers for properties start with `P` and the identifiers for entities start with `Q`. KGTK does not require to use numbers following `P` or `Q`, and it does not require that we follow the `P/Q` convention.\n", "\n", "To integrate the data we will use two KGs:\n", "- The `imdbkg` with the data we pulled directly from the IMDb csv file\n", "- The `external_id` subset of our KG that contains all external identifiers for all items in our KG.\n", "\n", "The `match` clause does a join on the identifers, enabling us to pull from the `imdbkg` the items thatalso exist in our KG.\n", "The `opt` clause (optional) retrieves the `reviews_from_users` property from the `imdbkg` KG\n", "We use the `opt` clause so that if the `imdbkg` has a null, we still output the result.\n", "\n", "The `return` clause builds a new node file: we use the `film` node from our KG as the `id` for our node file, and we translate the `reviews_from_users` data from the `imdbkg` as a `Pnumber_of_user_reviews`, a new property in our KG (below we extend this query to integrate all the data).\n", "> In KGTK, you can integrate data from external sources using queries to map from one schema (ontology) to another, and do ETL transformations in one place.\n", "\n", "Our KG is starting to take shape. For every film in the intersection of the two KGs, we now have a new property `Pnumber_of_user_reviews`:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idPnumber_of_user_reviews
0Q1722418232.0
1Q1638726938.0
2Q184860216718.0
3Q1273675392.0
4Q60744822.0
.........
1286Q512744612.0
1287Q54375571.0
1288Q21427084NaN
1289Q51274879NaN
1290Q6510612NaN
\n", "

1291 rows × 2 columns

\n", "
" ], "text/plain": [ " id Pnumber_of_user_reviews\n", "0 Q172241 8232.0\n", "1 Q163872 6938.0\n", "2 Q18486021 6718.0\n", "3 Q127367 5392.0\n", "4 Q6074 4822.0\n", "... ... ...\n", "1286 Q51274461 2.0\n", "1287 Q5437557 1.0\n", "1288 Q21427084 NaN\n", "1289 Q51274879 NaN\n", "1290 Q6510612 NaN\n", "\n", "[1291 rows x 2 columns]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " query -i imdbkg -i external_id\n", " --match '\n", " external_id: (film)-[:P345]->(imdb_id),\n", " imdb: (imdb_id)'\n", " --opt '\n", " imdb: (imdb_id)-[:reviews_from_users]->(rfu)'\n", " --return 'distinct\n", " film as id, \n", " rfu as Pnumber_of_user_reviews\n", " '\n", " --order-by 'cast(rfu, int) desc'\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can extend the query to integrate `reviews_from_critics` and `duration` by adding additional `opt` clauses.\n", "Wikidata has propery `duration (P2047`), so we reuse it:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idPnumber_of_user_reviewsPnumber_of_critic_reviewsP2047
0Q1722418232.0164.0142
1Q1638726938.0423.0152
2Q184860216718.0717.0152
3Q1273675392.0340.0178
4Q60744822.0909.0138
...............
1286Q512744612.08.078
1287Q54375571.01.0100
1288Q21427084NaN6.090
1289Q51274879NaN3.071
1290Q6510612NaN2.082
\n", "

1291 rows × 4 columns

\n", "
" ], "text/plain": [ " id Pnumber_of_user_reviews Pnumber_of_critic_reviews P2047\n", "0 Q172241 8232.0 164.0 142\n", "1 Q163872 6938.0 423.0 152\n", "2 Q18486021 6718.0 717.0 152\n", "3 Q127367 5392.0 340.0 178\n", "4 Q6074 4822.0 909.0 138\n", "... ... ... ... ...\n", "1286 Q51274461 2.0 8.0 78\n", "1287 Q5437557 1.0 1.0 100\n", "1288 Q21427084 NaN 6.0 90\n", "1289 Q51274879 NaN 3.0 71\n", "1290 Q6510612 NaN 2.0 82\n", "\n", "[1291 rows x 4 columns]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " query -i imdbkg -i external_id\n", " --match '\n", " external_id: (film)-[:P345]->(imdb_id),\n", " imdb: (imdb_id)'\n", " --opt '\n", " imdb: (imdb_id)-[:reviews_from_users]->(rfu)'\n", " --opt '\n", " imdb: (imdb_id)-[:reviews_from_critics]->(rfc)'\n", " --opt '\n", " imdb: (imdb_id)-[:duration]->(duration)'\n", " --return 'distinct\n", " film as id, \n", " rfu as Pnumber_of_user_reviews,\n", " rfc as Pnumber_of_critic_reviews,\n", " duration as P2047\n", " '\n", " --order-by 'cast(rfu, int) desc'\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we browse `Terminator 2: Judgment Day (Q170564)`, we see that we ought to define units of measure (`minute (Q7727)`).\n", "In KGTK, quantities such as duration are represented as structured literals that incorporate the quantity and the units in one symbol.\n", "For example \"142 minutes\" is represented as `142Q7727`.\n", "\n", "It is easy to incorporate the units into our query by using the `printf` statement to combine multiple variables into a formatted string:\n", "> The case of units of measure is a good example of the benefit to do schema mapping and ETL in the same query.\n", "\n", "> We illustrate the use the KGTK function `kgtk_quantity_number` in the `order-by` clause to extract the numeric value of a quantity structured literals (https://kgtk.readthedocs.io/en/latest/transform/query/#functions-on-kgtk-numbers-and-quantities)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 10.4 ms, sys: 20 ms, total: 30.4 ms\n", "Wall time: 1.59 s\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idPnumber_of_user_reviewsPnumber_of_critic_reviewsP2047
0Q1218601127.022.0+260Q7727
1Q2875881.0197.0+238Q7727
2Q228186676.0146.0+228Q7727
3Q746733293.076.0+220Q7727
4Q148204233.0137.0+219Q7727
...............
1286Q51274879NaN3.0+71Q7727
1287Q553220814.09.0+71Q7727
1288Q73673151.039.0+71Q7727
1289Q43051181.0118.0+70Q7727
1290Q75495588.01.0+64Q7727
\n", "

1291 rows × 4 columns

\n", "
" ], "text/plain": [ " id Pnumber_of_user_reviews Pnumber_of_critic_reviews P2047\n", "0 Q1218601 127.0 22.0 +260Q7727\n", "1 Q2875 881.0 197.0 +238Q7727\n", "2 Q228186 676.0 146.0 +228Q7727\n", "3 Q746733 293.0 76.0 +220Q7727\n", "4 Q148204 233.0 137.0 +219Q7727\n", "... ... ... ... ...\n", "1286 Q51274879 NaN 3.0 +71Q7727\n", "1287 Q5532208 14.0 9.0 +71Q7727\n", "1288 Q736731 51.0 39.0 +71Q7727\n", "1289 Q43051 181.0 118.0 +70Q7727\n", "1290 Q7549558 8.0 1.0 +64Q7727\n", "\n", "[1291 rows x 4 columns]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "kgtk(\"\"\"\n", " query -i imdbkg -i external_id\n", " --match '\n", " external_id: (film)-[:P345]->(imdb_id),\n", " imdb: (imdb_id)'\n", " --opt '\n", " imdb: (imdb_id)-[:reviews_from_users]->(rfu)'\n", " --opt '\n", " imdb: (imdb_id)-[:reviews_from_critics]->(rfc)'\n", " --opt '\n", " imdb: (imdb_id)-[:duration]->(duration)'\n", " --return 'distinct\n", " film as id, \n", " rfu as Pnumber_of_user_reviews,\n", " rfc as Pnumber_of_critic_reviews,\n", " printf(\"+%sQ7727\", duration) as P2047\n", " '\n", " --order-by 'kgtk_quantity_number(P2047) desc'\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The query above extracts data from the simple `imdbkg` graph that uses the column headings as properties. The query maps the properties and data to new or existing Wikidata properties that we use in our KG, and performs simple data cleaning to conform to the Wikidata and KGTK format requirement. The resulting KG is in KGTK node format where the properties appear in the column headings; we save this graph in `$TEMP/imdb-import.node.tsv`:\n", "\n", "> It is possible to create the KGTK node file using Pandas as the node file has very simmilar structure to the original CSV file. One of the advantages of KGTK is that it empowers users to use tools that they are familiar with, e.g., Pandas, to transform KGTK graphs. Integration with KGTK is seamless." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "kgtk(\"\"\"\n", " query -i imdbkg -i external_id\n", " --match '\n", " external_id: (film)-[:P345]->(imdb_id),\n", " imdb: (imdb_id)'\n", " --opt '\n", " imdb: (imdb_id)-[:reviews_from_users]->(rfu)'\n", " --opt '\n", " imdb: (imdb_id)-[:reviews_from_critics]->(rfc)'\n", " --opt '\n", " imdb: (imdb_id)-[:duration]->(duration)'\n", " --return 'distinct\n", " film as id, \n", " rfu as Pnumber_of_user_reviews,\n", " rfc as Pnumber_of_critic_reviews,\n", " printf(\"+%sQ7727\", duration) as P2047\n", " '\n", " -o $TEMP/imdb-import.node.tsv\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The last step is to convert the node file to a an edge file so that it is in the standard form used by all KGTK commands.\n", "Convert the node file to edges and add edge ids:\n", "\n", "> If you had used Pandas to create the graph in node format, this same command would transform it into and edge file, and make the4 graph ready for use with any other KGTL commands." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
node1labelnode2id
0Q1009788Pnumber_of_user_reviews386.0Q1009788-Pnumber_of_user_reviews-7e80e4
1Q1009788Pnumber_of_critic_reviews169.0Q1009788-Pnumber_of_critic_reviews-ff2927
2Q1009788P2047+113Q7727Q1009788-P2047-54edad
3Q1012216Pnumber_of_user_reviews69.0Q1012216-Pnumber_of_user_reviews-d411e0
4Q1012216Pnumber_of_critic_reviews28.0Q1012216-Pnumber_of_critic_reviews-3884d7
...............
3868Q980308Pnumber_of_critic_reviews149.0Q980308-Pnumber_of_critic_reviews-c70c59
3869Q980308P2047+104Q7727Q980308-P2047-75da46
3870Q997206Pnumber_of_user_reviews69.0Q997206-Pnumber_of_user_reviews-d411e0
3871Q997206Pnumber_of_critic_reviews36.0Q997206-Pnumber_of_critic_reviews-f85f0b
3872Q997206P2047+130Q7727Q997206-P2047-a5e559
\n", "

3873 rows × 4 columns

\n", "
" ], "text/plain": [ " node1 label node2 \\\n", "0 Q1009788 Pnumber_of_user_reviews 386.0 \n", "1 Q1009788 Pnumber_of_critic_reviews 169.0 \n", "2 Q1009788 P2047 +113Q7727 \n", "3 Q1012216 Pnumber_of_user_reviews 69.0 \n", "4 Q1012216 Pnumber_of_critic_reviews 28.0 \n", "... ... ... ... \n", "3868 Q980308 Pnumber_of_critic_reviews 149.0 \n", "3869 Q980308 P2047 +104Q7727 \n", "3870 Q997206 Pnumber_of_user_reviews 69.0 \n", "3871 Q997206 Pnumber_of_critic_reviews 36.0 \n", "3872 Q997206 P2047 +130Q7727 \n", "\n", " id \n", "0 Q1009788-Pnumber_of_user_reviews-7e80e4 \n", "1 Q1009788-Pnumber_of_critic_reviews-ff2927 \n", "2 Q1009788-P2047-54edad \n", "3 Q1012216-Pnumber_of_user_reviews-d411e0 \n", "4 Q1012216-Pnumber_of_critic_reviews-3884d7 \n", "... ... \n", "3868 Q980308-Pnumber_of_critic_reviews-c70c59 \n", "3869 Q980308-P2047-75da46 \n", "3870 Q997206-Pnumber_of_user_reviews-d411e0 \n", "3871 Q997206-Pnumber_of_critic_reviews-f85f0b \n", "3872 Q997206-P2047-a5e559 \n", "\n", "[3873 rows x 4 columns]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " normalize-nodes -i $TEMP/imdb-import.node.tsv\n", " / add-id --id-style wikidata\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Store the IMDb data, now represented in the Wikidata ontology, into a KG file and give it alias `augment_imdb` so that we can use it in the next steps:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
node1labelnode2id
0Q1009788Pnumber_of_user_reviews386.0Q1009788-Pnumber_of_user_reviews-7e80e4
1Q1009788Pnumber_of_critic_reviews169.0Q1009788-Pnumber_of_critic_reviews-ff2927
\n", "
" ], "text/plain": [ " node1 label node2 \\\n", "0 Q1009788 Pnumber_of_user_reviews 386.0 \n", "1 Q1009788 Pnumber_of_critic_reviews 169.0 \n", "\n", " id \n", "0 Q1009788-Pnumber_of_user_reviews-7e80e4 \n", "1 Q1009788-Pnumber_of_critic_reviews-ff2927 " ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " normalize-nodes -i $TEMP/imdb-import.node.tsv\n", " / add-id --id-style wikidata\n", " -o $TEMP/augment.imdb.tsv\n", "\"\"\")\n", "\n", "kgtk(\"query -i $TEMP/augment.imdb.tsv --as augment_imdb --limit 2\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Deduplicate the durations (`duration (P2047)`)\n", "Wikidata defines durations for many films. If we want to integrate the IMDb data into our KG, we must be careful as there may be inconsitent durations.\n", "\n", "Approach:\n", "- Find the films where the durations in Wikidata and IMDB differ\n", "- In case of conflict, prefer the IMDb durations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, find films where the durations are different using a query that combines the orignal `all` graph and the `augment_imdb` graph.\n", "There are many differences, and we see cases where the data in Wikidata is suspect (the uncertainty is the same as the value):" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
filmduration_wdduration_imdbfilm;label
0Q1009788+109Q7727+113Q7727'The Conversation'@en
1Q102448+136Q7727+142Q7727'Harry Potter and the Prisoner of Azkaban'@en
2Q1025096+93Q7727+98Q7727'Caddyshack II'@en
3Q1027212+125[+125,+125]Q7727+126Q7727'Revolution'@en
4Q103474+143Q7727+149Q7727'2001: A Space Odyssey'@en
...............
375Q926825+128[+128,+128]Q7727+133Q7727'Flower Drum Song'@en
376Q929647+115Q7727+120Q7727'Mask'@en
377Q930372+175Q7727+179Q7727'Camelot'@en
378Q936425+128Q7727+133Q7727'A Very Long Engagement'@en
379Q936576+91Q7727+95Q7727'The Nutty Professor'@en
\n", "

380 rows × 4 columns

\n", "
" ], "text/plain": [ " film duration_wd duration_imdb \\\n", "0 Q1009788 +109Q7727 +113Q7727 \n", "1 Q102448 +136Q7727 +142Q7727 \n", "2 Q1025096 +93Q7727 +98Q7727 \n", "3 Q1027212 +125[+125,+125]Q7727 +126Q7727 \n", "4 Q103474 +143Q7727 +149Q7727 \n", ".. ... ... ... \n", "375 Q926825 +128[+128,+128]Q7727 +133Q7727 \n", "376 Q929647 +115Q7727 +120Q7727 \n", "377 Q930372 +175Q7727 +179Q7727 \n", "378 Q936425 +128Q7727 +133Q7727 \n", "379 Q936576 +91Q7727 +95Q7727 \n", "\n", " film;label \n", "0 'The Conversation'@en \n", "1 'Harry Potter and the Prisoner of Azkaban'@en \n", "2 'Caddyshack II'@en \n", "3 'Revolution'@en \n", "4 '2001: A Space Odyssey'@en \n", ".. ... \n", "375 'Flower Drum Song'@en \n", "376 'Mask'@en \n", "377 'Camelot'@en \n", "378 'A Very Long Engagement'@en \n", "379 'The Nutty Professor'@en \n", "\n", "[380 rows x 4 columns]" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " query -i all -i augment_imdb\n", " --match 'all: (film)-[:P2047]->(duration_wd), augment_imdb: (film)-[:P2047]->(duration_imdb)'\n", " --where 'duration_wd < duration_imdb'\n", " --return 'distinct film as film, duration_wd as duration_wd, duration_imdb as duration_imdb'\n", " --order-by 'film'\n", " / add-labels\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We decide to remove from our original graph the edges for durations that differ from the durations in IMDb. \n", "We enhance our query to return the ids of such edges:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
id
0Q1009788-P2047-16e299-ef37b7c7-0
1Q102448-P2047-031b29-0e3d9b06-0
2Q1025096-P2047-01387b-bea42f5f-0
3Q1027212-P2047-c18970-a8e52174-0
4Q103474-P2047-83a7f0-c5824aa3-0
......
375Q926825-P2047-7abb58-87a8eb10-0
376Q929647-P2047-b97fa8-669b35f9-0
377Q930372-P2047-ed974c-06f071f1-0
378Q936425-P2047-37faaa-7eb616cb-0
379Q936576-P2047-fc103b-bbd28441-0
\n", "

380 rows × 1 columns

\n", "
" ], "text/plain": [ " id\n", "0 Q1009788-P2047-16e299-ef37b7c7-0\n", "1 Q102448-P2047-031b29-0e3d9b06-0\n", "2 Q1025096-P2047-01387b-bea42f5f-0\n", "3 Q1027212-P2047-c18970-a8e52174-0\n", "4 Q103474-P2047-83a7f0-c5824aa3-0\n", ".. ...\n", "375 Q926825-P2047-7abb58-87a8eb10-0\n", "376 Q929647-P2047-b97fa8-669b35f9-0\n", "377 Q930372-P2047-ed974c-06f071f1-0\n", "378 Q936425-P2047-37faaa-7eb616cb-0\n", "379 Q936576-P2047-fc103b-bbd28441-0\n", "\n", "[380 rows x 1 columns]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " query -i all -i augment_imdb\n", " --match '\n", " all: (film)-[l:P2047]->(duration_wd),\n", " augment_imdb: (film)-[:P2047]->(duration_imdb)\n", " '\n", " --where 'duration_wd < duration_imdb'\n", " --return 'distinct l as id'\n", " --order-by 'l'\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Store the ids of the discrepant duration edges in a file:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "kgtk(\"\"\"\n", " query -i all -i augment_imdb\n", " --match '\n", " all: (film)-[l:P2047]->(duration_wd),\n", " augment_imdb: (film)-[:P2047]->(duration_imdb)\n", " '\n", " --where 'duration_wd < duration_imdb'\n", " --return 'distinct l as id'\n", " --order-by 'l'\n", " -o $TEMP/id.discrepant_durations.tsv\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compute the intersection and difference of the two files: the original `$quantity` file that contains all the quantity edges for our subset of Wikidata, and `$TEMP/id.discrepant_durations.tsv`, the file that contains the ids of duration edges in `$quantity` where the values in Wikidata and IMDb differ.\n", "\n", "Use the KGTK `ifexists` command (https://kgtk.readthedocs.io/en/latest/transform/ifexists/).\n", "Provide the two files, the names of the columns used as keys (`id`), and tell the command to output the `reject-file` containing edges in the input file that are not present in the filter file (this is the file we want):" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
node1labelnode2idnode2;wikidatatype
0Q1009788P2047+109Q7727Q1009788-P2047-16e299-ef37b7c7-0quantity
1Q102448P2047+136Q7727Q102448-P2047-031b29-0e3d9b06-0quantity
2Q1025096P2047+93Q7727Q1025096-P2047-01387b-bea42f5f-0quantity
3Q1027212P2047+125[+125,+125]Q7727Q1027212-P2047-c18970-a8e52174-0quantity
4Q103474P2047+143Q7727Q103474-P2047-83a7f0-c5824aa3-0quantity
..................
375Q926825P2047+128[+128,+128]Q7727Q926825-P2047-7abb58-87a8eb10-0quantity
376Q929647P2047+115Q7727Q929647-P2047-b97fa8-669b35f9-0quantity
377Q930372P2047+175Q7727Q930372-P2047-ed974c-06f071f1-0quantity
378Q936425P2047+128Q7727Q936425-P2047-37faaa-7eb616cb-0quantity
379Q936576P2047+91Q7727Q936576-P2047-fc103b-bbd28441-0quantity
\n", "

380 rows × 5 columns

\n", "
" ], "text/plain": [ " node1 label node2 id \\\n", "0 Q1009788 P2047 +109Q7727 Q1009788-P2047-16e299-ef37b7c7-0 \n", "1 Q102448 P2047 +136Q7727 Q102448-P2047-031b29-0e3d9b06-0 \n", "2 Q1025096 P2047 +93Q7727 Q1025096-P2047-01387b-bea42f5f-0 \n", "3 Q1027212 P2047 +125[+125,+125]Q7727 Q1027212-P2047-c18970-a8e52174-0 \n", "4 Q103474 P2047 +143Q7727 Q103474-P2047-83a7f0-c5824aa3-0 \n", ".. ... ... ... ... \n", "375 Q926825 P2047 +128[+128,+128]Q7727 Q926825-P2047-7abb58-87a8eb10-0 \n", "376 Q929647 P2047 +115Q7727 Q929647-P2047-b97fa8-669b35f9-0 \n", "377 Q930372 P2047 +175Q7727 Q930372-P2047-ed974c-06f071f1-0 \n", "378 Q936425 P2047 +128Q7727 Q936425-P2047-37faaa-7eb616cb-0 \n", "379 Q936576 P2047 +91Q7727 Q936576-P2047-fc103b-bbd28441-0 \n", "\n", " node2;wikidatatype \n", "0 quantity \n", "1 quantity \n", "2 quantity \n", "3 quantity \n", "4 quantity \n", ".. ... \n", "375 quantity \n", "376 quantity \n", "377 quantity \n", "378 quantity \n", "379 quantity \n", "\n", "[380 rows x 5 columns]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " ifexists -i $quantity \n", " --filter-on $TEMP/id.discrepant_durations.tsv\n", " --input-keys id\n", " --reject-file $OUT/quantity.minus_bad_durations.tsv\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sanity check: subtract the number of edges in the original quantity file `$quantity` minus the number of edges in `$OUT/quantity.minus_bad_durations.tsv`:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "380" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "old = !zcat < $quantity | wc -l\n", "new = !cat $OUT/quantity.minus_bad_durations.tsv | wc -l\n", "\n", "int(old[0]) - int(new[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Implement a better model for the IMDb data\n", "\n", "The data model we just implemented is very simple. In this section, we will improve it to be more compatible with how Wikidata models data.\n", "We will:\n", "- Add a `point in time (P585)` qualifier to record the time when the reviews were counted. We will use 2020-01-01, the date reported in the Kaggle site\n", "- Combine `Pnumber_of_user_reviews` and `Pnumber_of_critic_reviews` into a single `Pnumber_of_reviews` property, and use the `of (P642)` qualifier property \n", "to record whether the review counts are from a `customer (Q852835)` or a `critic (Q6430706)`\n", "\n", "The approach we will use is to tranform the `augment_imdb` data we created above to have the desired structure. \n", "> An alternative approach would be to generate the data in the desired structure in the first place." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following query transforms the `Pnumber_of_user_reviews` data into the desired structure. The work is done in the `return` statement:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
node1labelnode2P642P585
0Q1009788Pnumber_of_reviews386.0Q852835^2020-01-01T00:00:00Z/11
1Q1012216Pnumber_of_reviews69.0Q852835^2020-01-01T00:00:00Z/11
2Q102448Pnumber_of_reviews1600.0Q852835^2020-01-01T00:00:00Z/11
3Q1025096Pnumber_of_reviews89.0Q852835^2020-01-01T00:00:00Z/11
4Q1026724Pnumber_of_reviews20.0Q852835^2020-01-01T00:00:00Z/11
..................
1286Q977196Pnumber_of_reviews120.0Q852835^2020-01-01T00:00:00Z/11
1287Q978974Pnumber_of_reviews74.0Q852835^2020-01-01T00:00:00Z/11
1288Q980041Pnumber_of_reviews499.0Q852835^2020-01-01T00:00:00Z/11
1289Q980308Pnumber_of_reviews233.0Q852835^2020-01-01T00:00:00Z/11
1290Q997206Pnumber_of_reviews69.0Q852835^2020-01-01T00:00:00Z/11
\n", "

1291 rows × 5 columns

\n", "
" ], "text/plain": [ " node1 label node2 P642 P585\n", "0 Q1009788 Pnumber_of_reviews 386.0 Q852835 ^2020-01-01T00:00:00Z/11\n", "1 Q1012216 Pnumber_of_reviews 69.0 Q852835 ^2020-01-01T00:00:00Z/11\n", "2 Q102448 Pnumber_of_reviews 1600.0 Q852835 ^2020-01-01T00:00:00Z/11\n", "3 Q1025096 Pnumber_of_reviews 89.0 Q852835 ^2020-01-01T00:00:00Z/11\n", "4 Q1026724 Pnumber_of_reviews 20.0 Q852835 ^2020-01-01T00:00:00Z/11\n", "... ... ... ... ... ...\n", "1286 Q977196 Pnumber_of_reviews 120.0 Q852835 ^2020-01-01T00:00:00Z/11\n", "1287 Q978974 Pnumber_of_reviews 74.0 Q852835 ^2020-01-01T00:00:00Z/11\n", "1288 Q980041 Pnumber_of_reviews 499.0 Q852835 ^2020-01-01T00:00:00Z/11\n", "1289 Q980308 Pnumber_of_reviews 233.0 Q852835 ^2020-01-01T00:00:00Z/11\n", "1290 Q997206 Pnumber_of_reviews 69.0 Q852835 ^2020-01-01T00:00:00Z/11\n", "\n", "[1291 rows x 5 columns]" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " query -i augment_imdb\n", " --match '(film)-[:Pnumber_of_user_reviews]->(count)'\n", " --return '\n", " film as node1,\n", " \"Pnumber_of_reviews\" as label,\n", " count as node2,\n", " \"Q852835\" as P642,\n", " \"^2020-01-01T00:00:00Z/11\" as P585\n", " '\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As always, we want an edge file with ids, so we need to add the requisite `add-id` and `normalize` commands:\n", "> When doing this for real, we would not define multiple cells to revise the query. We would edit the cell, refining the pipeline to produce the desired results:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
node1labelnode2id
0Q1009788Pnumber_of_reviews386.0Q1009788-Pnumber_of_reviews-7e80e4
1Q1009788-Pnumber_of_reviews-7e80e4P642Q852835Q1009788-Pnumber_of_reviews-7e80e4-P642-Q85283...
2Q1009788-Pnumber_of_reviews-7e80e4P585^2020-01-01T00:00:00Z/11Q1009788-Pnumber_of_reviews-7e80e4-P585-^2020-...
3Q1012216Pnumber_of_reviews69.0Q1012216-Pnumber_of_reviews-d411e0
4Q1012216-Pnumber_of_reviews-d411e0P642Q852835Q1012216-Pnumber_of_reviews-d411e0-P642-Q85283...
...............
3868Q980308-Pnumber_of_reviews-ae91a5P642Q852835Q980308-Pnumber_of_reviews-ae91a5-P642-Q852835...
3869Q980308-Pnumber_of_reviews-ae91a5P585^2020-01-01T00:00:00Z/11Q980308-Pnumber_of_reviews-ae91a5-P585-^2020-0...
3870Q997206Pnumber_of_reviews69.0Q997206-Pnumber_of_reviews-d411e0
3871Q997206-Pnumber_of_reviews-d411e0P642Q852835Q997206-Pnumber_of_reviews-d411e0-P642-Q852835...
3872Q997206-Pnumber_of_reviews-d411e0P585^2020-01-01T00:00:00Z/11Q997206-Pnumber_of_reviews-d411e0-P585-^2020-0...
\n", "

3873 rows × 4 columns

\n", "
" ], "text/plain": [ " node1 label \\\n", "0 Q1009788 Pnumber_of_reviews \n", "1 Q1009788-Pnumber_of_reviews-7e80e4 P642 \n", "2 Q1009788-Pnumber_of_reviews-7e80e4 P585 \n", "3 Q1012216 Pnumber_of_reviews \n", "4 Q1012216-Pnumber_of_reviews-d411e0 P642 \n", "... ... ... \n", "3868 Q980308-Pnumber_of_reviews-ae91a5 P642 \n", "3869 Q980308-Pnumber_of_reviews-ae91a5 P585 \n", "3870 Q997206 Pnumber_of_reviews \n", "3871 Q997206-Pnumber_of_reviews-d411e0 P642 \n", "3872 Q997206-Pnumber_of_reviews-d411e0 P585 \n", "\n", " node2 \\\n", "0 386.0 \n", "1 Q852835 \n", "2 ^2020-01-01T00:00:00Z/11 \n", "3 69.0 \n", "4 Q852835 \n", "... ... \n", "3868 Q852835 \n", "3869 ^2020-01-01T00:00:00Z/11 \n", "3870 69.0 \n", "3871 Q852835 \n", "3872 ^2020-01-01T00:00:00Z/11 \n", "\n", " id \n", "0 Q1009788-Pnumber_of_reviews-7e80e4 \n", "1 Q1009788-Pnumber_of_reviews-7e80e4-P642-Q85283... \n", "2 Q1009788-Pnumber_of_reviews-7e80e4-P585-^2020-... \n", "3 Q1012216-Pnumber_of_reviews-d411e0 \n", "4 Q1012216-Pnumber_of_reviews-d411e0-P642-Q85283... \n", "... ... \n", "3868 Q980308-Pnumber_of_reviews-ae91a5-P642-Q852835... \n", "3869 Q980308-Pnumber_of_reviews-ae91a5-P585-^2020-0... \n", "3870 Q997206-Pnumber_of_reviews-d411e0 \n", "3871 Q997206-Pnumber_of_reviews-d411e0-P642-Q852835... \n", "3872 Q997206-Pnumber_of_reviews-d411e0-P585-^2020-0... \n", "\n", "[3873 rows x 4 columns]" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " query -i augment_imdb\n", " --match '(film)-[:Pnumber_of_user_reviews]->(count)'\n", " --return '\n", " film as node1,\n", " \"Pnumber_of_reviews\" as label,\n", " count as node2,\n", " \"Q852835\" as P642,\n", " \"^2020-01-01T00:00:00Z/11\" as P585\n", " '\n", " / add-id --id-style wikidata\n", " / normalize --add-id True\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Do, the same transformation for the counts of critic reviews, and concatenate the result for user (customaer) and critic into on result file `$TEMP/reviews.tsv`:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "kgtk(\"\"\"\n", " query -i augment_imdb\n", " --match '(film)-[:Pnumber_of_user_reviews]->(count)'\n", " --return '\n", " film as node1,\n", " \"Pnumber_of_reviews\" as label,\n", " count as node2,\n", " \"Q852835\" as P642,\n", " \"^2020-01-01T00:00:00Z/11\" as P585\n", " '\n", " / add-id --id-style wikidata\n", " / normalize --add-id True\n", " -o $TEMP/reviews.user.tsv\n", "\"\"\")\n", "\n", "kgtk(\"\"\"\n", " query -i augment_imdb\n", " --match '(film)-[:Pnumber_of_critic_reviews]->(count)'\n", " --return '\n", " film as node1,\n", " \"Pnumber_of_reviews\" as label,\n", " count as node2,\n", " \"Q6430706\" as P642,\n", " \"^2020-01-01T00:00:00Z/11\" as P585\n", " '\n", " / add-id --id-style wikidata\n", " / normalize --add-id True\n", " -o $TEMP/reviews.critic.tsv\n", "\"\"\")\n", "\n", "kgtk(\"\"\"\n", " cat -i $TEMP/reviews.user.tsv -i $TEMP/reviews.critic.tsv\n", " -o $TEMP/reviews.tsv\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we need to remove the edges using the old `Pnumber_of_user_reviews` and `Pnumber_of_critic_reviews` and put in the new ones. We will use the `filter` command to do \"grep\" on the old file, and return the lines that don't match `--invert True`. 
The `filter` command uses a simple pattern language of the form `node1 pattern ; label pattern ; node2 pattern` (https://kgtk.readthedocs.io/en/latest/transform/filter/):" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
node1labelnode2id
0Q1009788P2047+113Q7727Q1009788-P2047-54edad
1Q1012216P2047+129Q7727Q1012216-P2047-321f7d
2Q102448P2047+142Q7727Q102448-P2047-f4e24d
3Q1025096P2047+98Q7727Q1025096-P2047-6323a5
4Q1026724P2047+79Q7727Q1026724-P2047-0eb45e
...............
1286Q977196P2047+104Q7727Q977196-P2047-75da46
1287Q978974P2047+140Q7727Q978974-P2047-b00fbd
1288Q980041P2047+107Q7727Q980041-P2047-dec492
1289Q980308P2047+104Q7727Q980308-P2047-75da46
1290Q997206P2047+130Q7727Q997206-P2047-a5e559
\n", "

1291 rows × 4 columns

\n", "
" ], "text/plain": [ " node1 label node2 id\n", "0 Q1009788 P2047 +113Q7727 Q1009788-P2047-54edad\n", "1 Q1012216 P2047 +129Q7727 Q1012216-P2047-321f7d\n", "2 Q102448 P2047 +142Q7727 Q102448-P2047-f4e24d\n", "3 Q1025096 P2047 +98Q7727 Q1025096-P2047-6323a5\n", "4 Q1026724 P2047 +79Q7727 Q1026724-P2047-0eb45e\n", "... ... ... ... ...\n", "1286 Q977196 P2047 +104Q7727 Q977196-P2047-75da46\n", "1287 Q978974 P2047 +140Q7727 Q978974-P2047-b00fbd\n", "1288 Q980041 P2047 +107Q7727 Q980041-P2047-dec492\n", "1289 Q980308 P2047 +104Q7727 Q980308-P2047-75da46\n", "1290 Q997206 P2047 +130Q7727 Q997206-P2047-a5e559\n", "\n", "[1291 rows x 4 columns]" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"\"\"\n", " filter -i $TEMP/augment.imdb.tsv \n", " --pattern '; Pnumber_of_user_reviews, Pnumber_of_critic_reviews ;'\n", " --invert True\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is the unix-like grep/cat pipeline that replaces the old statements with the new ones:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "kgtk(\"\"\"\n", " filter -i $TEMP/augment.imdb.tsv \n", " --pattern '; Pnumber_of_user_reviews, Pnumber_of_critic_reviews ;'\n", " --invert True\n", " \n", " / cat -i - -i $TEMP/reviews.tsv\n", " -o $OUT/augment.imdb.tsv\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's double check the results. The data looks good.\n", "\n", "To add the new data to our KG, and view it in the browser, we have to define the new `Pnumber_of_reviews` property, give it a label, aliases and description, and most importantly, define its `datatype` as `quantity`\n", "> We will not do this step in this tutorial. See https://github.com/usc-isi-i2/kgtk/blob/dev/kgtk-properties/kgtk.properties.tsv for examples of how to define new properties." ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
node1labelnode2id
0Q1009788P2047+113Q7727Q1009788-P2047-54edad
1Q1012216P2047+129Q7727Q1012216-P2047-321f7d
2Q102448P2047+142Q7727Q102448-P2047-f4e24d
3Q1025096P2047+98Q7727Q1025096-P2047-6323a5
4Q1026724P2047+79Q7727Q1026724-P2047-0eb45e
...............
9032Q980308-Pnumber_of_reviews-c70c59P642Q6430706Q980308-Pnumber_of_reviews-c70c59-P642-Q643070...
9033Q980308-Pnumber_of_reviews-c70c59P585^2020-01-01T00:00:00Z/11Q980308-Pnumber_of_reviews-c70c59-P585-^2020-0...
9034Q997206Pnumber_of_reviews36.0Q997206-Pnumber_of_reviews-f85f0b
9035Q997206-Pnumber_of_reviews-f85f0bP642Q6430706Q997206-Pnumber_of_reviews-f85f0b-P642-Q643070...
9036Q997206-Pnumber_of_reviews-f85f0bP585^2020-01-01T00:00:00Z/11Q997206-Pnumber_of_reviews-f85f0b-P585-^2020-0...
\n", "

9037 rows × 4 columns

\n", "
" ], "text/plain": [ " node1 label \\\n", "0 Q1009788 P2047 \n", "1 Q1012216 P2047 \n", "2 Q102448 P2047 \n", "3 Q1025096 P2047 \n", "4 Q1026724 P2047 \n", "... ... ... \n", "9032 Q980308-Pnumber_of_reviews-c70c59 P642 \n", "9033 Q980308-Pnumber_of_reviews-c70c59 P585 \n", "9034 Q997206 Pnumber_of_reviews \n", "9035 Q997206-Pnumber_of_reviews-f85f0b P642 \n", "9036 Q997206-Pnumber_of_reviews-f85f0b P585 \n", "\n", " node2 \\\n", "0 +113Q7727 \n", "1 +129Q7727 \n", "2 +142Q7727 \n", "3 +98Q7727 \n", "4 +79Q7727 \n", "... ... \n", "9032 Q6430706 \n", "9033 ^2020-01-01T00:00:00Z/11 \n", "9034 36.0 \n", "9035 Q6430706 \n", "9036 ^2020-01-01T00:00:00Z/11 \n", "\n", " id \n", "0 Q1009788-P2047-54edad \n", "1 Q1012216-P2047-321f7d \n", "2 Q102448-P2047-f4e24d \n", "3 Q1025096-P2047-6323a5 \n", "4 Q1026724-P2047-0eb45e \n", "... ... \n", "9032 Q980308-Pnumber_of_reviews-c70c59-P642-Q643070... \n", "9033 Q980308-Pnumber_of_reviews-c70c59-P585-^2020-0... \n", "9034 Q997206-Pnumber_of_reviews-f85f0b \n", "9035 Q997206-Pnumber_of_reviews-f85f0b-P642-Q643070... \n", "9036 Q997206-Pnumber_of_reviews-f85f0b-P585-^2020-0... \n", "\n", "[9037 rows x 4 columns]" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk(\"cat -i $OUT/augment.imdb.tsv\")" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Summary of this tutorial:\n", "\n", "In this tutorial we:\n", "- Extracted data from a CSV file to add to our KG\n", "- Converted a CSV file into a KGTK graph using Pandas to create a simple KG (this step is similar to doing a direct mapping using R2RML https://www.w3.org/TR/rdb-direct-mapping/)\n", "- Used KGTK to extract/transform/load (ETL) the data into a simple model consistent with the Wikidata ontology\n", "- Transformed the simple model into a better model\n", "- The output of this section is\n", " - `$OUT/augment.imdb.tsv`, containing the new edges we created from the IMDb file\n", " - `$OUT/quantity.minus_bad_durations.tsv`, a modification of the orignal `$quantity` file where we removed discrepant durations.\n", "\n", "> The original IMDb file contains many other interesting fields that would be fun to import. Many of them such as genre, director, actors require much more work as we must link the names to entities in Wikidata. Entity linking is outside the scope of KGTK; tools such as OpenRefine (https://openrefine.org/) or our own Table Linker (https://github.com/usc-isi-i2/table-linker) can be used to do entity linking. The KGTK `replace-nodes` command is helpful to process files containing probabilistic \"same-as\" statements to integrate the results of external entity linking tools." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Deploy the results (Optional)\n", "\n", "Deploy the tutorial files after completing this notebook." 
] }, { "cell_type": "raw", "metadata": {}, "source": [ "# The files produced by this notebook (see the summary above)\n", "files_to_deploy = [\n", "    \"augment.imdb.tsv\",\n", "    \"quantity.minus_bad_durations.tsv\"\n", "]\n", "\n", "# First copy all the files from the add-derived-graphs tutorial; we will overwrite the ones that change, e.g., all.tsv\n", "!cp -p {tutorial_deployment_path + \"/arnold\"}/*.tsv* {project_deployment_path}\n", "\n", "for file in files_to_deploy:\n", "    path = \"$OUT/\" + file\n", "    !cp -p {path} {project_deployment_path}\n", "\n", "all_file_path = project_deployment_path + \"/all.tsv.gz\"\n", "if os.path.exists(all_file_path):\n", "    !rm {all_file_path}\n", "!kgtk cat -i {tutorial_deployment_path + \"/arnold/all.tsv.gz\"} -i {project_deployment_path}/*.tsv -o {all_file_path}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "List all the deployed files:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "!ls -l {project_deployment_path}" ] } ], "metadata": { "kernelspec": { "display_name": "kgtk-env", "language": "python", "name": "kgtk-env" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 4 }