{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "51d077a7",
   "metadata": {},
   "source": [
    "# Query knowledge graphs with KGTK Kypher"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a0bef068",
   "metadata": {},
   "source": [
    "This tutorial gives an overview of the core features of the KGTK Kypher query language.  \n",
    "\n",
    "For additional details refer to the reading list at the end. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1617744d-8f2d-4af9-af36-9d8b570b4aa0",
   "metadata": {},
   "source": [
    "## Step 0: Install KGTK"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fd4818d0-ffd4-4bfd-bc15-6a28e6067d62",
   "metadata": {},
   "source": [
    "Only run the following cell if KGTK is not installed.\n",
    " For example, if running in [Google Colab](https://colab.research.google.com/)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "30d32722-d768-4b12-8723-3596c33cdaab",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install kgtk"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "28ca9cbc",
   "metadata": {},
   "source": [
    "### Notebook data setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "2d11ddcf",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import os.path\n",
    "\n",
    "from kgtk.configure_kgtk_notebooks import ConfigureKGTK"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "cfe8ba35",
   "metadata": {
    "tags": [
     "parameters"
    ]
   },
   "outputs": [],
   "source": [
    "# Parameters\n",
    "\n",
    "# Folders on local machine where to create the output and temporary files:\n",
    "input_path = None\n",
    "output_path = \"/tmp/projects\"\n",
    "project_name = \"tutorial-kypher\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0e28525d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# These are all the KG files that we use in this notebook:\n",
    "additional_files = {\n",
    "    \"small_graph\": \"small-graph.tsv\",\n",
    "    \"small_films\": \"small-films.tsv\",\n",
    "    \"small_props\": \"small-props.tsv\",\n",
    "    \"small_quals\": \"small-quals.tsv\",\n",
    "}\n",
    "\n",
    "big_files = [\n",
    "    \"all\",\n",
    "    \"label\",\n",
    "    \"pagerank_undirected\",\n",
    "]\n",
    "\n",
    "ck = ConfigureKGTK(big_files)\n",
    "ck.configure_kgtk(input_graph_path=input_path, \n",
    "                  output_path=output_path, \n",
    "                  project_name=project_name,\n",
    "                  additional_files=additional_files)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "22ddd656",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "kypher: kgtk query --graph-cache /tmp/projects/tutorial-kypher/temp.tutorial-kypher/wikidata.sqlite3.db\n",
      "EXAMPLES_DIR: /Users/amandeep/Github/kgtk-notebooks/examples\n",
      "KGTK_LABEL_FILE: /Users/amandeep/isi-kgtk-tutorial/tutorial-kypher_input/labels.en.tsv.gz\n",
      "KGTK_OPTION_DEBUG: false\n",
      "STORE: /tmp/projects/tutorial-kypher/temp.tutorial-kypher/wikidata.sqlite3.db\n",
      "GRAPH: /Users/amandeep/isi-kgtk-tutorial/tutorial-kypher_input\n",
      "OUT: /tmp/projects/tutorial-kypher\n",
      "kgtk: kgtk\n",
      "KGTK_GRAPH_CACHE: /tmp/projects/tutorial-kypher/temp.tutorial-kypher/wikidata.sqlite3.db\n",
      "USE_CASES_DIR: /Users/amandeep/Github/kgtk-notebooks/use-cases\n",
      "TEMP: /tmp/projects/tutorial-kypher/temp.tutorial-kypher\n",
      "all: /Users/amandeep/isi-kgtk-tutorial/tutorial-kypher_input/all.tsv.gz\n",
      "label: /Users/amandeep/isi-kgtk-tutorial/tutorial-kypher_input/labels.en.tsv.gz\n",
      "pagerank_undirected: /Users/amandeep/isi-kgtk-tutorial/tutorial-kypher_input/metadata.pagerank.undirected.tsv.gz\n",
      "small_graph: /Users/amandeep/isi-kgtk-tutorial/tutorial-kypher_input/small-graph.tsv\n",
      "small_films: /Users/amandeep/isi-kgtk-tutorial/tutorial-kypher_input/small-films.tsv\n",
      "small_props: /Users/amandeep/isi-kgtk-tutorial/tutorial-kypher_input/small-props.tsv\n",
      "small_quals: /Users/amandeep/isi-kgtk-tutorial/tutorial-kypher_input/small-quals.tsv\n"
     ]
    }
   ],
   "source": [
    "ck.print_env_variables()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "50f97f88",
   "metadata": {
    "lines_to_next_cell": 2
   },
   "source": [
    "## Usage of `query` command\n",
    "```\n",
    "usage: kgtk query [-h] [-i INPUT_FILE [INPUT_FILE ...]] [--as NAME]\n",
    "                  [--comment COMMENT] [--query QUERY] [--match PATTERN]\n",
    "                  [--where CLAUSE] [--opt PATTERN] [--with CLAUSE]\n",
    "                  [--where: CLAUSE] [--return CLAUSE] [--order-by CLAUSE]\n",
    "                  [--skip CLAUSE] [--limit CLAUSE] [--para NAME=VAL]\n",
    "                  [--spara NAME=VAL] [--lqpara NAME=VAL] [--no-header]\n",
    "                  [--force] [--index MODE [MODE ...]] [--idx SPEC [SPEC ...]]\n",
    "                  [--explain [MODE]] [--graph-cache GRAPH_CACHE_FILE]\n",
    "                  [--show-cache] [--import MODULE_LIST] [-o OUTPUT]\n",
    "\n",
    "Query one or more KGTK files with Kypher.\n",
    "IMPORTANT: input can come from stdin but chaining queries is not yet supported.\n",
    "\n",
    "Input options:\n",
    "  -i INPUT_FILE [INPUT_FILE ...], --input-files INPUT_FILE [INPUT_FILE ...]\n",
    "                        One or more input files to query, maybe compressed\n",
    "                        (May be omitted or '-' for stdin.)\n",
    "  --as NAME             alias name to be used for preceding input\n",
    "  --comment COMMENT     comment string to store for the preceding input\n",
    "                        (displayed by --show-cache)\n",
    "\n",
    "Query options:\n",
    "  --query QUERY         complete Kypher query combining all clauses, if\n",
    "                        supplied, all other specialized clause arguments will\n",
    "                        be ignored\n",
    "  --match PATTERN       MATCH pattern of a Kypher query, defaults to universal\n",
    "                        node pattern `()'\n",
    "  --where CLAUSE        WHERE clause to a preceding --match, --opt or --with\n",
    "                        clause\n",
    "  --opt PATTERN, --optional PATTERN\n",
    "                        OPTIONAL MATCH pattern(s) of a Kypher query (zero or\n",
    "                        more)\n",
    "  --with CLAUSE         WITH clause of a Kypher query (only 'WITH * ...' is\n",
    "                        currently supported)\n",
    "  --where: CLAUSE       final global WHERE clause, shorthand for 'WITH * WHERE\n",
    "                        ...'\n",
    "  --return CLAUSE       RETURN clause of a Kypher query (defaults to *)\n",
    "  --order-by CLAUSE     ORDER BY clause of a Kypher query\n",
    "  --skip CLAUSE         SKIP clause of a Kypher query\n",
    "  --limit CLAUSE        LIMIT clause of a Kypher query\n",
    "  --para NAME=VAL       zero or more named value parameters to be passed to\n",
    "                        the query\n",
    "  --spara NAME=VAL      zero or more named string parameters to be passed to\n",
    "                        the query\n",
    "  --lqpara NAME=VAL     zero or more named LQ-string parameters to be passed\n",
    "                        to the query\n",
    "\n",
    "Output options:\n",
    "  --no-header           do not generate a header row with column names\n",
    "  -o OUTPUT, --out OUTPUT\n",
    "                        output file to write to, if `-' (the default) output\n",
    "                        goes to stdout. Files with extensions .gz, .bz2 or .xz\n",
    "                        will be appropriately compressed.\n",
    "\n",
    "Control options:\n",
    "  --force               force problematic queries to run against advice\n",
    "  --index MODE [MODE ...], --index-mode MODE [MODE ...]\n",
    "                        default index creation MODE for all inputs (default:\n",
    "                        auto); can be overridden with --idx for specific\n",
    "                        inputs\n",
    "  --idx SPEC [SPEC ...], --input-index SPEC [SPEC ...]\n",
    "                        create index(es) according to SPEC for the preceding\n",
    "                        input only\n",
    "  --explain [MODE]      explain the query execution and indexing plan\n",
    "                        according to MODE (plan, full, expert, default: plan).\n",
    "                        This will not actually run or create anything.\n",
    "  --graph-cache GRAPH_CACHE_FILE\n",
    "                        database cache where graphs will be imported before\n",
    "                        they are queried (defaults to per-user temporary file)\n",
    "  --show-cache          describe the current content of the graph cache and\n",
    "                        exit (does not actually run a query or import data)\n",
    "  --import MODULE_LIST  Python modules needed to define user extensions to\n",
    "                        built-in functions\n",
    "  -h, --help            show this help message and exit\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "becbc95b",
   "metadata": {},
   "source": [
    "## Selecting edges with the `--match` clause"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6bfb0429",
   "metadata": {},
   "source": [
    "- KGTK `query` either takes a full Kypher `--query` or individual Kypher clauses such as `--match`,\n",
    "  `--return`, etc.\n",
    "- individual clauses are automatically assembled into the proper order, easier in Unix shell environment"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1bcfa3a4",
   "metadata": {},
   "source": [
    "**Example**\n",
    "- simple query on a single input graph with an anonymous edge pattern\n",
    "- for convenience we use shell variable `$GRAPH` to point to small data file\n",
    "- **IMPORTANT**: use quotes around the match pattern to protect from the shell\n",
    "- this matches all edges and returns the whole file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "1d2a9d08",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/Users/amandeep/isi-kgtk-tutorial/tutorial-kypher_input/small-graph.tsv\n"
     ]
    }
   ],
   "source": [
    "!echo $small_graph"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "5130f672",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "id\tnode1\tlabel\tnode2\n",
      "e1\tm_shriver\tspouse\ta_schwarzenegger\n",
      "e2\ta_schwarzenegger\tcoactor\tl_hamilton\n",
      "e3\tg_rydstrom\tcolleague\tg_borders\n",
      "e4\tg_rydstrom\tcolleague\ta_schwarzenegger\n",
      "e5\ta_schwarzenegger\tname\t'Arnold Schwarzenegger'@de\n",
      "e6\tl_hamilton\tname\t'Linda Hamilton'@en\n",
      "e7\tg_rydstrom\tname\t'Gary R. Rydstrom'@en\n",
      "e8\tg_borders\tname\t'Gloria S. Borders'@en\n",
      "e9\tm_shriver\tname\t'Maria Shriver'@en\n"
     ]
    }
   ],
   "source": [
    "!kgtk query -i $small_graph --match '()-[]->()'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7733bff5",
   "metadata": {},
   "source": [
    "- match pattern starts with anonymous node connecting via an\n",
    "  anonymous relation to another anonymous node\n",
    "- matched against the four core columns specifying an edge in each line of the KGTK\n",
    "  input file\n",
    "- from-node is matched against `node1`, relation is\n",
    "  matched against `id` and `label` (more on that distinction later), and\n",
    "  the to-node is matched against `node2`\n",
    "- for each KGTK line matching the pattern, output is generated according to the `--return` clause\n",
    "- default for `--return` is `*` which means all columns of a matching line will be output"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9c4fb588",
   "metadata": {},
   "source": [
    "**Equivalent query:**\n",
    "- singular anonymous node pattern will be completed to a full edge"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "db30d4a5",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "id\tnode1\tlabel\tnode2\n",
      "e1\tm_shriver\tspouse\ta_schwarzenegger\n",
      "e2\ta_schwarzenegger\tcoactor\tl_hamilton\n",
      "e3\tg_rydstrom\tcolleague\tg_borders\n",
      "e4\tg_rydstrom\tcolleague\ta_schwarzenegger\n",
      "e5\ta_schwarzenegger\tname\t'Arnold Schwarzenegger'@de\n",
      "e6\tl_hamilton\tname\t'Linda Hamilton'@en\n",
      "e7\tg_rydstrom\tname\t'Gary R. Rydstrom'@en\n",
      "e8\tg_borders\tname\t'Gloria S. Borders'@en\n",
      "e9\tm_shriver\tname\t'Maria Shriver'@en\n"
     ]
    }
   ],
   "source": [
    "!kgtk query -i $small_graph --match '()'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "547b54f2",
   "metadata": {},
   "source": [
    "**Equivalent query:**\n",
    "- that pattern is also the default for `--match`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "3510b2fc",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "id\tnode1\tlabel\tnode2\n",
      "e1\tm_shriver\tspouse\ta_schwarzenegger\n",
      "e2\ta_schwarzenegger\tcoactor\tl_hamilton\n",
      "e3\tg_rydstrom\tcolleague\tg_borders\n",
      "e4\tg_rydstrom\tcolleague\ta_schwarzenegger\n",
      "e5\ta_schwarzenegger\tname\t'Arnold Schwarzenegger'@de\n",
      "e6\tl_hamilton\tname\t'Linda Hamilton'@en\n",
      "e7\tg_rydstrom\tname\t'Gary R. Rydstrom'@en\n",
      "e8\tg_borders\tname\t'Gloria S. Borders'@en\n",
      "e9\tm_shriver\tname\t'Maria Shriver'@en\n"
     ]
    }
   ],
   "source": [
    "!kgtk query -i $small_graph"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2634d31c",
   "metadata": {},
   "source": [
    "**Equivalent command:** `kgtk cat`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "9f348d66",
   "metadata": {
    "lines_to_next_cell": 2
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "id\tnode1\tlabel\tnode2\n",
      "e1\tm_shriver\tspouse\ta_schwarzenegger\n",
      "e2\ta_schwarzenegger\tcoactor\tl_hamilton\n",
      "e3\tg_rydstrom\tcolleague\tg_borders\n",
      "e4\tg_rydstrom\tcolleague\ta_schwarzenegger\n",
      "e5\ta_schwarzenegger\tname\t'Arnold Schwarzenegger'@de\n",
      "e6\tl_hamilton\tname\t'Linda Hamilton'@en\n",
      "e7\tg_rydstrom\tname\t'Gary R. Rydstrom'@en\n",
      "e8\tg_borders\tname\t'Gloria S. Borders'@en\n",
      "e9\tm_shriver\tname\t'Maria Shriver'@en\n"
     ]
    }
   ],
   "source": [
    "!kgtk cat -i $small_graph"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cc441053",
   "metadata": {},
   "source": [
    "### Restricting output\n",
    "- use `--limit` and/or `--skip` to control output volume,\n",
    "  especially to test queries on larger data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "6807d9af",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "id\tnode1\tlabel\tnode2\n",
      "e1\tm_shriver\tspouse\ta_schwarzenegger\n",
      "e2\ta_schwarzenegger\tcoactor\tl_hamilton\n",
      "e3\tg_rydstrom\tcolleague\tg_borders\n"
     ]
    }
   ],
   "source": [
    "!kgtk query -i $small_graph --limit 3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "bbd440e1",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "id\tnode1\tlabel\tnode2\n",
      "e3\tg_rydstrom\tcolleague\tg_borders\n",
      "e4\tg_rydstrom\tcolleague\ta_schwarzenegger\n",
      "e5\ta_schwarzenegger\tname\t'Arnold Schwarzenegger'@de\n"
     ]
    }
   ],
   "source": [
    "!kgtk query -i $small_graph --skip 2 --limit 3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "caa2991c",
   "metadata": {
    "lines_to_next_cell": 2
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "id\tnode1\tlabel\tnode2\n",
      "e1\tm_shriver\tspouse\ta_schwarzenegger\n",
      "e2\ta_schwarzenegger\tcoactor\tl_hamilton\n",
      "e3\tg_rydstrom\tcolleague\tg_borders\n"
     ]
    }
   ],
   "source": [
    "!kgtk query -i $small_graph / head -n 3"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "21484bae",
   "metadata": {},
   "source": [
    "### Restricting edges\n",
    "- patterns can be made more specific by restricting some of the elements of an edge\n",
    "- for example, filter for all edges that start with `m_shriver`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "579b24f6",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "id\tnode1\tlabel\tnode2\n",
      "e1\tm_shriver\tspouse\ta_schwarzenegger\n",
      "e9\tm_shriver\tname\t'Maria Shriver'@en\n"
     ]
    }
   ],
   "source": [
    "!kgtk query -i $small_graph --match '(:m_shriver)-[]->()'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "56ac3776",
   "metadata": {},
   "source": [
    "- **Kypher vs. Cypher**: in Cypher the restriction `m_shriver` would be interpreted as a node *type* in a property graph, in KGTK Kypher\n",
    "it is interpreted as the ID of a particular node"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "af32722e",
   "metadata": {},
   "source": [
    "**Example**\n",
    "- filter on the relation of an edge\n",
    "- select all edges with label `name` using the same colon-restriction syntax we used for nodes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "fa10dac3",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "id\tnode1\tlabel\tnode2\n",
      "e5\ta_schwarzenegger\tname\t'Arnold Schwarzenegger'@de\n",
      "e6\tl_hamilton\tname\t'Linda Hamilton'@en\n",
      "e7\tg_rydstrom\tname\t'Gary R. Rydstrom'@en\n",
      "e8\tg_borders\tname\t'Gloria S. Borders'@en\n",
      "e9\tm_shriver\tname\t'Maria Shriver'@en\n"
     ]
    }
   ],
   "source": [
    "!kgtk query -i $small_graph --match '()-[:name]->()'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "70207f7c",
   "metadata": {},
   "source": [
    "- for relations, the interpretation of restrictions on the label of an\n",
    "edge (as opposed to its `id`) is more in line with standard Cypher."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "841a95e3",
   "metadata": {},
   "source": [
    "**Example**\n",
    "- node and relation restrictions can be combined\n",
    "- for example, here we select all `name` edges starting from node `a_schwarzenegger`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "d54b0533",
   "metadata": {
    "lines_to_next_cell": 2
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "id\tnode1\tlabel\tnode2\n",
      "e5\ta_schwarzenegger\tname\t'Arnold Schwarzenegger'@de\n"
     ]
    }
   ],
   "source": [
    "!kgtk query -i $small_graph --match '(:a_schwarzenegger)-[:name]->()'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7dd48f36",
   "metadata": {},
   "source": [
    "## Filtering with the `--where` clause"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0ecba3ba",
   "metadata": {},
   "source": [
    "- `--where` holds a possibly complex Boolean expression that gets evaluated as additional edge filter\n",
    "- only edges for which it evaluates to true will be returned\n",
    "- `--where` can be used to express more complex conditions and computations that cannot be expressed in `--match`\n",
    "- to connect to values selected by `--match` we need pattern *variables*\n",
    "- variables are specified with simple names in the node or relationship part of a pattern\n",
    "- below we use `p` as the variable for the starting node of the edge pattern\n",
    "- in the `--where` clause we restrict which values are allowed for `p`\n",
    "- this query is equivalent to the one above where we restricted the starting node directly in the match pattern:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "3ec0af12",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "id\tnode1\tlabel\tnode2\n",
      "e9\tm_shriver\tname\t'Maria Shriver'@en\n"
     ]
    }
   ],
   "source": [
    "!kgtk query -i $small_graph \\\n",
    "     --match '(p)-[:name]->()' \\\n",
    "     --where 'p = \"m_shriver\"'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7957db5f",
   "metadata": {},
   "source": [
    "**Equivalent query:**\n",
    "- specifies the starting node restriction twice which is ok but redundant"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "a394af91",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "id\tnode1\tlabel\tnode2\n",
      "e9\tm_shriver\tname\t'Maria Shriver'@en\n"
     ]
    }
   ],
   "source": [
    "!kgtk query -i $small_graph \\\n",
    "     --match '(p:m_shriver)-[:name]->()' \\\n",
    "     --where 'p = \"m_shriver\"'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5ffdb95d",
   "metadata": {},
   "source": [
    "- **IMPORTANT**: constants such as `m_shriver` need to be quoted when used in `--where`, similar to SQL\n",
    "- this needs extra care so quotes will not be consumed by the Unix shell"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b08921e4",
   "metadata": {},
   "source": [
    "**Example**\n",
    "- use a regular expression to filter on the names attached to nodes\n",
    "- the `=~` operator matches a value against a regular expression\n",
    "- **Kypher vs. Cypher**: Kypher use Python regexp syntax, Cypher uses Java regexps\n",
    "- select all `name` edges that lead to a name that contains a double letter:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "8df2fb4c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "id\tnode1\tlabel\tnode2\n",
      "e5\ta_schwarzenegger\tname\t'Arnold Schwarzenegger'@de\n"
     ]
    }
   ],
   "source": [
    "!kgtk query -i $small_graph \\\n",
    "     --match '(p)-[:name]->(n)' \\\n",
    "     --where 'n =~ \".*(.)\\\\1.*\"'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b3263a68",
   "metadata": {},
   "source": [
    "**Example**\n",
    "- filter based on a list of values (a form of disjunction)\n",
    "- any edge where `p` is equal to one of the listed values will be returned\n",
    "- **Kypher vs. Cypher**: Kypher only allows lists of literals such as strings\n",
    "  or numbers, Cypher also allows variables and expressions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "e0b986bf",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "id\tnode1\tlabel\tnode2\n",
      "e7\tg_rydstrom\tname\t'Gary R. Rydstrom'@en\n",
      "e8\tg_borders\tname\t'Gloria S. Borders'@en\n"
     ]
    }
   ],
   "source": [
    "!kgtk query -i $small_graph \\\n",
    "     --match '(p)-[:name]->(n)' \\\n",
    "     --where 'p IN [\"g_borders\", \"g_rydstrom\"]'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b83cbdf1",
   "metadata": {},
   "source": [
    "**Example**\n",
    "- filter based on a comparison operator and a computation using built-in functions\n",
    "- **IMPORTANT**: all columns in a KGTK file are treated as text (even if they contain numbers)\n",
    "- the expression below filters for names that start with the letter `L` or later\n",
    "- note that quotes of KGTK string literals are part of their value and need to be accounted for\n",
    "- for this we use the built-in function `substr` to extract the first letter of each name"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "f80eb655",
   "metadata": {
    "lines_to_next_cell": 2
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "id\tnode1\tlabel\tnode2\n",
      "e6\tl_hamilton\tname\t'Linda Hamilton'@en\n",
      "e9\tm_shriver\tname\t'Maria Shriver'@en\n"
     ]
    }
   ],
   "source": [
    "!kgtk query -i $small_graph \\\n",
    "     --match '(p)-[:name]->(n)' \\\n",
    "     --where \"substr(n,2,1) >= 'L'\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2d4ceebc",
   "metadata": {},
   "source": [
    "## Sorting results with the `--order-by` clause"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d513b98a",
   "metadata": {},
   "source": [
    "- use `--order-by` to sort results just like in Cypher and SQL\n",
    "- this query sorts matched results by names in ascending order:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "a0f82ae8",
   "metadata": {
    "lines_to_next_cell": 2
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "id\tnode1\tlabel\tnode2\n",
      "e6\tl_hamilton\tname\t'Linda Hamilton'@en\n",
      "e9\tm_shriver\tname\t'Maria Shriver'@en\n"
     ]
    }
   ],
   "source": [
    "!kgtk query -i $small_graph \\\n",
    "     --match '(p)-[:name]->(n)' \\\n",
    "     --where \"upper(substr(n,2,1)) >= 'L'\" \\\n",
    "     --order-by n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7a375a76",
   "metadata": {},
   "source": [
    "- order by an expression value, this time in descending order:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "180baf9d",
   "metadata": {
    "lines_to_next_cell": 2
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "id\tnode1\tlabel\tnode2\n",
      "e9\tm_shriver\tname\t'Maria Shriver'@en\n",
      "e6\tl_hamilton\tname\t'Linda Hamilton'@en\n"
     ]
    }
   ],
   "source": [
    "!kgtk query -i $small_graph \\\n",
    "     --match '(p)-[:name]->(n)' \\\n",
    "     --where \"substr(n,2,1) >= 'L'\" \\\n",
    "     --order-by \"substr(n,2,1) desc\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f27a5dac",
   "metadata": {},
   "source": [
    "## Controlling results with the `--return` clause"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7755871b",
   "metadata": {},
   "source": [
    "- use `--return` to control which columns are output, in which order, with what headers\n",
    "- by default `*` is used which means all columns of matching edge(s) are output\n",
    "- use KGTK and SQLite built-in functions to transform output values or compute additional ones\n",
    "- use `distict` to eliminate duplicates\n",
    "- use aggregation function to get counts, max, average, etc."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7ca3b0fd",
   "metadata": {},
   "source": [
    "**Example**\n",
    "- select only the `node1` and `node2` columns by referencing pattern variables `p` and `n`\n",
    "- note, that the result generated here is not valid KGTK (missing `id` and `label`)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "c2ebb33c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "node1\tnode2\n",
      "a_schwarzenegger\t'Arnold Schwarzenegger'@de\n"
     ]
    }
   ],
   "source": [
    "!kgtk query -i $small_graph \\\n",
    "     --match '(p)-[:name]->(n)' \\\n",
    "     --where 'n =~ \".*(.)\\\\1.*\"' \\\n",
    "     --return 'p, n'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f04c5488",
   "metadata": {},
   "source": [
    "**Example**\n",
    "- return all columns but switching their order\n",
    "- **Kypher vs. Cypher**: Kypher relation variables such as `r` below\n",
    "  get bound to edge IDs; other components of an edge such as its `label`\n",
    "  can then be access via property syntax such as `r.label`\n",
    "- this query now produces valid KGTK"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "6a07ab49",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "node1\tnode2\tid\tlabel\n",
      "a_schwarzenegger\t'Arnold Schwarzenegger'@de\te5\tname\n"
     ]
    }
   ],
   "source": [
    "!kgtk query -i $small_graph \\\n",
    "     --match '(p)-[r:name]->(n)' \\\n",
    "     --where 'n =~ \".*(.)\\\\1.*\"' \\\n",
    "     --return 'p, n, r, r.label'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "49970cf4",
   "metadata": {},
   "source": [
    "**Example**\n",
    "- summarize data via `distinct` to eliminate duplicates"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "6d286af7",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "label\n",
      "coactor\n",
      "colleague\n",
      "name\n",
      "spouse\n"
     ]
    }
   ],
   "source": [
    "!kgtk query -i $small_graph \\\n",
    "     --match '(p)-[r]->(n)' \\\n",
    "     --return 'distinct r.label' \\\n",
    "     --order-by r.label"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ecd5305b",
   "metadata": {},
   "source": [
    "**Example**\n",
    "- transform data with built-in functions, e.g., change value to lower case"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "fcb5c6f4",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "node1\tlabel\tlower(graph_1_c1.\"node2\")\tid\n",
      "a_schwarzenegger\tname\t'arnold schwarzenegger'@de\te5\n"
     ]
    }
   ],
   "source": [
    "!kgtk query -i $small_graph \\\n",
    "     --match '(p)-[r:name]->(n)' \\\n",
    "     --where 'n =~ \".*(.)\\\\1.*\"' \\\n",
    "     --return 'p, r.label, lower(n), r'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c0d7b19f",
   "metadata": {},
   "source": [
    "**Example**\n",
    "- specify proper column headers for output values, e.g., designate the transformed value as `node2`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "6ff2aa95",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "node1\tlabel\tnode2\tid\n",
      "a_schwarzenegger\tname\t'arnold schwarzenegger'@de\te5\n"
     ]
    }
   ],
   "source": [
    "!kgtk query -i $small_graph \\\n",
    "     --match '(p)-[r:name]->(n)' \\\n",
    "     --where 'n =~ \".*(.)\\\\1.*\"' \\\n",
    "     --return 'p, r.label, lower(n) as node2, r'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ffa27b10",
   "metadata": {},
   "source": [
    "**Example**\n",
    "- use KGTK built-in functions to manipulate KGTK values\n",
    "- here we change a symbol to a string (which is a legal `node1` in KGTK data model)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "da737147",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "node1\tlabel\tnode2\tid\n",
      "\"a_schwarzenegger\"\tname\t'Arnold Schwarzenegger'@de\te5\n"
     ]
    }
   ],
   "source": [
    "!kgtk query -i $small_graph \\\n",
    "     --match '(p)-[r:name]->(n)' \\\n",
    "     --where 'n =~ \".*(.)\\\\1.*\"' \\\n",
    "     --return 'kgtk_stringify(p) as node1, r.label, n, r'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "472d713e",
   "metadata": {},
   "source": [
    "**Example**\n",
    "- access components of structured literals with property syntax (treat them as *virtual properties*)\n",
    "- but this is just syntactic sugar for calling the KGTK function\n",
    "- extra output column is named with column path syntax (`lang` of `node2`)\n",
    "- special characters need to be backtick-quoted"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "afd031b8",
   "metadata": {
    "lines_to_next_cell": 2
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "id\tnode1\tlabel\tnode2\tnode2;lang\n",
      "e5\ta_schwarzenegger\tname\t'arnold schwarzenegger'@de\tde\n"
     ]
    }
   ],
   "source": [
    "!kgtk query -i $small_graph \\\n",
    "     --match '(p)-[r:name]->(n)' \\\n",
    "     --where 'kgtk_lqstring(n) and n.kgtk_lqstring_lang != \"en\"' \\\n",
    "     --return 'r, p, r.label, lower(n) as node2, n.kgtk_lqstring_lang as `node2;lang`'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e65fa69d",
   "metadata": {},
   "source": [
    "## Querying connected edges through graph patterns"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c5f8938d",
   "metadata": {},
   "source": [
    "- in KGs  we will often want to combine multiple edges into a query\n",
    "- connection could be within a graph or across graphs or both\n",
    "- in database parlance this is generally called a *join*\n",
    "- in Kypher (or Cypher) we can express such queries very concisely using graph patterns\n",
    "- this ASCII pattern language is probably the single-most useful aspect of K/Cypher\n",
    "- for example, find colleagues and their names:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "a25f411b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "id\tnode1\tlabel\tnode2\n",
      "e3\t'Gary R. Rydstrom'@en\tcolleague\t'Gloria S. Borders'@en\n",
      "e4\t'Gary R. Rydstrom'@en\tcolleague\t'Arnold Schwarzenegger'@de\n"
     ]
    }
   ],
   "source": [
    "!kgtk query -i $small_graph \\\n",
    "     --match '(na)<-[:name]-(a)-[r:colleague]->(b)-[:name]->(nb)' \\\n",
    "     --return 'r, na as node1, r.label, nb as node2'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0ace1986",
   "metadata": {},
   "source": [
    "**Example**\n",
    "- combine path patterns with `--where` expression for more elaborate filtering\n",
    "- below we select starting edges where at least one of the nodes has a German name"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "955b4593",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "id\tnode1\tlabel\tnode2\n",
      "e4\t'Gary R. Rydstrom'@en\tcolleague\t'Arnold Schwarzenegger'@de\n"
     ]
    }
   ],
   "source": [
    "!kgtk query -i $small_graph \\\n",
    "     --match '(na)<-[:name]-(a)-[r:colleague]->(b)-[:name]->(nb)' \\\n",
    "     --where 'na.kgtk_lqstring_lang = \"de\" OR nb.kgtk_lqstring_lang = \"de\"' \\\n",
    "     --return 'r, na as node1, r.label, nb as node2'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3b6840fc",
   "metadata": {},
   "source": [
    "**Example**\n",
    "- it is good practice to only name pattern variables that are actually needed and leave others anonymous"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "09fdf20d",
   "metadata": {
    "lines_to_next_cell": 2
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "id\tnode1\tlabel\tnode2\n",
      "e3\t'Gary R. Rydstrom'@en\tcolleague\t'Gloria S. Borders'@en\n",
      "e4\t'Gary R. Rydstrom'@en\tcolleague\t'Arnold Schwarzenegger'@de\n"
     ]
    }
   ],
   "source": [
    "!kgtk query -i $small_graph \\\n",
    "     --match '(na)<-[:name]-()-[r:colleague]->()-[:name]->(nb)' \\\n",
    "     --return 'r, na as node1, r.label, nb as node2'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4ec2f765",
   "metadata": {},
   "source": [
    "## Querying connected edges across multiple graphs"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8061cef7",
   "metadata": {},
   "source": [
    "- Kypher can combine information from different graphs described in separate KGTK files\n",
    "- allows one to mix and match information, or to organize data into different logical parts\n",
    "- to query multiple graphs we need to specify two or more input files\n",
    "- *graph variables* are used to associate pattern clauses with the graph they apply to\n",
    "- for graphs to connect they need to have some node or edge IDs in common\n",
    "- **Kypher vs. Cypher*: Cypher does not address multi-graph queries\n",
    "- to illustrate this we use a second example graph of film data shown here:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "id": "98faf82c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/Users/amandeep/isi-kgtk-tutorial/tutorial-kypher_input/small-films.tsv\n"
     ]
    }
   ],
   "source": [
    "!echo $small_films"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "id": "2834c11b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "id\tnode1\tlabel\tnode2\n",
      "t1\tterminator2_jd\tname\t'Terminator 2'@en\n",
      "t2\tterminator2_jd\tinstance_of\tfilm\n",
      "t3\tterminator2_jd\tgenre\tscience_fiction\n",
      "t4\tterminator2_jd\tgenre\taction\n",
      "t5\tterminator2_jd\tcast\ta_schwarzenegger\n",
      "t6\tterminator2_jd\tcast\tl_hamilton\n",
      "t7\tterminator2_jd\tcrew\tg_rydstrom\n",
      "t8\tterminator2_jd\tcrew\tg_borders\n",
      "t9\tterminator2_jd\taward\tacademy-best-sound-editing\n",
      "t10\tterminator2_jd\tpublication_date\t^1991-07-03T00:00:00Z/11\n",
      "t11\tterminator2_jd\tpublication_date\t^1995-07-01T00:00:00Z/11\n",
      "t12\tfilm\tsubclass_of\tvisual_artwork\n",
      "t13\ts_connor\tname\t'Sarah Connor'@en\n",
      "t14\tterminator\tname\t'Terminator'@en\n"
     ]
    }
   ],
   "source": [
    "!kgtk query -i $small_films --match '()-[]->()'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a5b19871",
   "metadata": {},
   "source": [
    "**Example**\n",
    "- let's query for people whose spouse is a film actor\n",
    "- a graph variable followed by `:` means all clauses following it are to be applied to that graph,\n",
    "  until overridden by another graph variable\n",
    "- graph variables are matched to input files in a simple greedy matching scheme (see manual),\n",
    "  for example, here we use a substring of each input file name\n",
    "- multiple edges in the pattern are separated by commas to allow for graph switching"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "id": "3100a084",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "id\tnode1\tlabel\tnode2\tnode2;acted_in\n",
      "e1\tm_shriver\tspouse\ta_schwarzenegger\tterminator2_jd\n"
     ]
    }
   ],
   "source": [
    "!kgtk query -i $small_graph -i $small_films \\\n",
    "     --match 'graph: (x)-[r:spouse]->(y), \\\n",
    "              films: (y)<-[:cast]-(f)-[:instance_of]->(:film)' \\\n",
    "     --return 'r, x, r.label, y, f as `node2;acted_in`'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "86f272e7",
   "metadata": {},
   "source": [
    "**Example**\n",
    "- let's further restrict that the acting had to be done before 1985"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "id": "1fb135c4",
   "metadata": {
    "lines_to_next_cell": 2
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "id\tnode1\tlabel\tnode2\tnode2;acted_in\tnode2;acted_when\n",
      "e1\tm_shriver\tspouse\ta_schwarzenegger\tterminator2_jd\t^1991-07-03T00:00:00Z/11\n"
     ]
    }
   ],
   "source": [
    "!kgtk query -i $small_graph -i $small_films \\\n",
    "     --match 'graph: (x)-[r:spouse]->(y), \\\n",
    "              films: (y)<-[:cast]-(f)-[:instance_of]->(:film), \\\n",
    "                     (f)-[:publication_date]->(d)' \\\n",
    "     --where 'd < \"^1995\"' \\\n",
    "     --return 'r, x, r.label, y, f as `node2;acted_in`, d as `node2;acted_when`'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "040183db",
   "metadata": {},
   "source": [
    "## Aggregation"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "878f3af7",
   "metadata": {},
   "source": [
    "- similar to SQL and Cypher, Kypher supports aggregation functions such\n",
    "  as `count`, `min`, `max`, `avg`, etc.\n",
    "- one of the simplest and most frequently used operations is counting rows via `count`"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3a27616e",
   "metadata": {},
   "source": [
    "**Example**\n",
    "- count how many edges have `terminator2_jd` as their starting node"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "id": "d093572e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "N\n",
      "11\n"
     ]
    }
   ],
   "source": [
    "!kgtk query -i $small_films \\\n",
    "     --match '(:terminator2_jd)-[r]->()' \\\n",
    "     --return 'count(r) as N'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c24a502d",
   "metadata": {},
   "source": [
    "**Example**\n",
    "- counts may include duplicate values (every selected edge will be counted)\n",
    "- we can use `distinct` keyword as the first argument to `count` (or any aggregation function) to exclude duplicates"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "id": "3d5e4e83",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "N\n",
      "7\n"
     ]
    }
   ],
   "source": [
    "!kgtk query -i $small_films \\\n",
    "     --match '(:terminator2_jd)-[r]->()' \\\n",
    "     --return 'count(distinct r.label) as N'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "838358eb",
   "metadata": {},
   "source": [
    "**Example**\n",
    "- Kypher does not have an explicit `group by` clause (different from SPARQL or SQL)\n",
    "- grouping is inferred from clause type and order in the `return` statement\n",
    "  and sorts rows into into groups before an aggregation operation is applied to each group\n",
    "- below we group by relation label and then select the maximum `node2` for each label (lexicographically)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "id": "28935f79",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "label\tnode2\tnode1\tid\n",
      "award\tacademy-best-sound-editing\tterminator2_jd\tt9\n",
      "cast\tl_hamilton\tterminator2_jd\tt6\n",
      "crew\tg_rydstrom\tterminator2_jd\tt7\n",
      "genre\tscience_fiction\tterminator2_jd\tt3\n",
      "instance_of\tfilm\tterminator2_jd\tt2\n",
      "name\t'Terminator'@en\tterminator\tt14\n",
      "publication_date\t^1995-07-01T00:00:00Z/11\tterminator2_jd\tt11\n",
      "subclass_of\tvisual_artwork\tfilm\tt12\n"
     ]
    }
   ],
   "source": [
    "!kgtk query -i $small_films \\\n",
    "     --match '(x)-[r]->(y)' \\\n",
    "     --return 'r.label, max(y) as node2, x, r'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bc9ea6c6",
   "metadata": {
    "lines_to_next_cell": 2
   },
   "source": [
    "**Mildly tricky**: the query applied the `max` function to groups of\n",
    "result rows where `r.label` had the same value.  But for this to\n",
    "work we had to move the other output variables `x` and `r` to the\n",
    "end, otherwise they would have been the grouping criterion (that's\n",
    "the drawback of implicit grouping)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f35835f3",
   "metadata": {},
   "source": [
    "## Optional match"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9e44058a",
   "metadata": {},
   "source": [
    "- real-world KGs often only have partial coverage of certain relations\n",
    "- for example, not every node might have a name or label\n",
    "- Kypher's optional match patterns can be used to hande incomplete information,\n",
    "  since they are allowed to fail and will generate NULL values for such cases\n",
    "- each Kypher query must have exactly one strict `--match` clause and\n",
    "  can have zero or more optional match clauses introduced by `--opt`\n",
    "- **Kypher vs. Cypher**: this is more than Cypher which can have any\n",
    "  number of strict and/or optional patterns in any order\n",
    "- each strict and optional match clause can have its own `--where` clause\n",
    "- below we sometimes use a third graph of edge qualifiers shown here:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "id": "16ad326b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/Users/amandeep/isi-kgtk-tutorial/tutorial-kypher_input/small-quals.tsv\n"
     ]
    }
   ],
   "source": [
    "!echo $small_quals"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "id": "11b32803",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "id\tnode1\tlabel\tnode2\n",
      "q1\tt5\trole\tterminator\n",
      "q2\tt6\trole\ts_connor\n",
      "q3\tt9\tpoint_in_time\t^1992-03-30T00:00:00Z/11\n",
      "q4 \tt9\twinner\tg_rydstrom\n",
      "q5 \tt9\twinner\tg_borders\n",
      "q6 \tt10\tlocation\tunited_states\n",
      "q8 \tt11\tlocation\trussia\n",
      "q9\te1\tstart\t^1986-04-26T08:00:00Z/11\n",
      "q10\te1\tend\t^2011-07-01T10:00:00Z/11\n"
     ]
    }
   ],
   "source": [
    "!kgtk query -i $small_quals"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d774227d",
   "metadata": {},
   "source": [
    "**Example**\n",
    "- start with a strict query to find people associated with a movie and\n",
    "  any awards they may have won"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "id": "d1cea397",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "person\tfilm\taward\n",
      "g_rydstrom\tterminator2_jd\tacademy-best-sound-editing\n",
      "g_borders\tterminator2_jd\tacademy-best-sound-editing\n"
     ]
    }
   ],
   "source": [
    "!kgtk query -i $small_graph -i $small_films -i $small_quals \\\n",
    "     --match  'graph: (p)-[:name]->(), \\\n",
    "               films: (f)-[:instance_of]->(:film), (f)-[]->(p), (f)-[ra:award]->(a), \\\n",
    "               quals: (ra)-[:winner]->(p)' \\\n",
    "     --return 'p as person, f as film, a as award'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3f394b5d",
   "metadata": {},
   "source": [
    "- result only lists some of the people, since not all participants won awards\n",
    "- this makes us miss some potentially useful data\n",
    "- to make sure we retrieve all people associated with a movie and optionally\n",
    "  any awards they may have received, we can use this query:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "id": "7a03c50b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "person\tfilm\taward\n",
      "a_schwarzenegger\tterminator2_jd\t\n",
      "l_hamilton\tterminator2_jd\t\n",
      "g_rydstrom\tterminator2_jd\tacademy-best-sound-editing\n",
      "g_borders\tterminator2_jd\tacademy-best-sound-editing\n"
     ]
    }
   ],
   "source": [
    "!kgtk query -i $small_graph -i $small_films -i $small_quals \\\n",
    "     --match  'graph: (p)-[:name]->(), \\\n",
    "               films: (f)-[:instance_of]->(:film), (f)-[]->(p)' \\\n",
    "     --opt    'films: (f)-[ra:award]->(a), \\\n",
    "               quals: (ra)-[:winner]->(p)' \\\n",
    "     --return 'p as person, f as film, a as award'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "edf97278",
   "metadata": {},
   "source": [
    "- now we get all relevant people and missing awards are simply empty (or NULL)\n",
    "- note how edge qualifiers associate information to an edge ID (e.g., the winner of an award)\n",
    "- optional patterns are either fully satisfied for a set of bindings, or not at all,\n",
    "  there are no partial matches"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5d1ab442",
   "metadata": {},
   "source": [
    "**Example**\n",
    "- multiple independent optional clauses to also get spouses and to filter awards\n",
    "  after 1990"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "id": "20da4b70",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "person\tfilm\taward\tspouse\n",
      "a_schwarzenegger\tterminator2_jd\t\tm_shriver\n",
      "l_hamilton\tterminator2_jd\t\t\n",
      "g_rydstrom\tterminator2_jd\tacademy-best-sound-editing\t\n",
      "g_borders\tterminator2_jd\tacademy-best-sound-editing\t\n"
     ]
    }
   ],
   "source": [
    "!kgtk query -i $small_graph -i $small_films -i $small_quals \\\n",
    "     --match  'graph: (p)-[:name]->(), \\\n",
    "               films: (f)-[:instance_of]->(:film), (f)-[]->(p)' \\\n",
    "     --opt    'films: (f)-[ra:award]->(a), \\\n",
    "               quals: (ra)-[:winner]->(p), (ra)-[:point_in_time]->(ad)' \\\n",
    "     --where  'ad >= 1990' \\\n",
    "     --opt    'graph: (s)-[:spouse]->(p)' \\\n",
    "     --return 'p as person, f as film, a as award, s as spouse'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3d137982",
   "metadata": {},
   "source": [
    "**Example**\n",
    "- compare that to the SQL produced"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "id": "e04ba8cf",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2022-03-16 15:15:43 query]: SQL Translation:\n",
      "---------------------------------------------\n",
      "  SELECT graph_1_c1.\"node1\" \"_aLias.person\", graph_2_c2.\"node1\" \"_aLias.film\", graph_2_c4.\"node2\" \"_aLias.award\", graph_1_c7.\"node1\" \"_aLias.spouse\"\n",
      "     FROM graph_1 AS graph_1_c1\n",
      "     INNER JOIN graph_2 AS graph_2_c2, graph_2 AS graph_2_c3\n",
      "     ON graph_1_c1.\"node1\" = graph_2_c3.\"node2\"\n",
      "        AND graph_2_c2.\"node1\" = graph_2_c3.\"node1\"\n",
      "        AND graph_1_c1.\"label\" = ?\n",
      "        AND graph_2_c2.\"label\" = ?\n",
      "        AND graph_2_c2.\"node2\" = ?\n",
      "     LEFT JOIN (graph_2 AS graph_2_c4\n",
      "         INNER JOIN graph_3 AS graph_3_c5, graph_3 AS graph_3_c6\n",
      "         ON graph_2_c4.\"id\" = graph_3_c5.\"node1\"\n",
      "            AND graph_2_c4.\"id\" = graph_3_c6.\"node1\"\n",
      "            AND graph_2_c4.\"label\" = ?\n",
      "            AND graph_3_c5.\"label\" = ?\n",
      "            AND graph_3_c6.\"label\" = ?\n",
      "            AND (graph_3_c6.\"node2\" >= ?))\n",
      "     ON graph_1_c1.\"node1\" = graph_3_c5.\"node2\"\n",
      "        AND graph_2_c2.\"node1\" = graph_2_c4.\"node1\"\n",
      "     LEFT JOIN graph_1 AS graph_1_c7\n",
      "     ON graph_1_c1.\"node1\" = graph_1_c7.\"node2\"\n",
      "        AND graph_1_c7.\"label\" = ?\n",
      "  PARAS: ['name', 'instance_of', 'film', 'award', 'winner', 'point_in_time', 1990, 'spouse']\n",
      "---------------------------------------------\n",
      "person\tfilm\taward\tspouse\n",
      "a_schwarzenegger\tterminator2_jd\t\tm_shriver\n",
      "l_hamilton\tterminator2_jd\t\t\n",
      "g_rydstrom\tterminator2_jd\tacademy-best-sound-editing\t\n",
      "g_borders\tterminator2_jd\tacademy-best-sound-editing\t\n"
     ]
    }
   ],
   "source": [
    "!kgtk --debug query -i $small_graph -i $small_films -i $small_quals \\\n",
    "     --match  'graph: (p)-[:name]->(), \\\n",
    "               films: (f)-[:instance_of]->(:film), (f)-[]->(p)' \\\n",
    "     --opt    'films: (f)-[ra:award]->(a), \\\n",
    "               quals: (ra)-[:winner]->(p), (ra)-[:point_in_time]->(ad)' \\\n",
    "     --where  'ad >= 1990' \\\n",
    "     --opt    'graph: (s)-[:spouse]->(p)' \\\n",
    "     --return 'p as person, f as film, a as award, s as spouse'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8b7017cf",
   "metadata": {
    "lines_to_next_cell": 2
   },
   "source": [
    "- see the manual for more examples\n",
    "- e.g., how to do `NOT EXISTS` with an optional clause"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "27883225",
   "metadata": {},
   "source": [
    "## Full-text search"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "17fc223a",
   "metadata": {},
   "source": [
    "- Kypher supports efficient full-text search over large graph data via SQLite's FTS5 module\n",
    "- matching uses specialized indexes which can be created easily and efficiently"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bd6efd32",
   "metadata": {},
   "source": [
    "**Example**\n",
    "- we define a text index on the `node2` column of `GRAPH` and then use `textmatch` to match against `node2` values\n",
    "- multiple indexes can be defined on the same graph column, we can provide names to disambiguate\n",
    "- **minor caveat**: use of an older Python version requires us to provide extra index definition options,\n",
    "  since the default `trigram` tokenizer is currently only supported in Python 3.9 or later\n",
    "- match scores are negative with the best being the smallest (most negative) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "id": "d00ef3eb",
   "metadata": {},
   "outputs": [],
   "source": [
    "!rm -f $KGTK_GRAPH_CACHE  # bug workaround"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "id": "b5280356",
   "metadata": {
    "lines_to_next_cell": 2
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "node1\tlabel\tnode2\tscore\n",
      "m_shriver\tspouse\ta_schwarzenegger\t-0.6991202559002525\n",
      "g_rydstrom\tcolleague\ta_schwarzenegger\t-0.6991202559002525\n",
      "a_schwarzenegger\tname\t'Arnold Schwarzenegger'@de\t-0.5994217687032093\n"
     ]
    }
   ],
   "source": [
    "!kgtk query -i $small_graph --idx auto text:node2//name=myidx//tokenize=ascii//prefix=3 \\\n",
    "     --match '(x)-[r]->(y)' \\\n",
    "     --where 'textmatch(y, \"sch*\")' \\\n",
    "     --return 'x, r.label, y, matchscore(y) as score' \\\n",
    "     --order 'score'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c723779c",
   "metadata": {},
   "source": [
    "**Example**\n",
    "- `textmatch` patterns use a phrase-based language that allows\n",
    "  multi-word phrases, Boolean expressions, multi-column expressions, suffix patterns, etc. (see manual)\n",
    "- here is an example of a Boolean expression\n",
    "- note that the index only needs to be specified the first time around, repeating it is a no-op"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "id": "8824450f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "node1\tlabel\tnode2\tscore\n",
      "l_hamilton\tname\t'Linda Hamilton'@en\t-1.6796313036328494\n",
      "m_shriver\tspouse\ta_schwarzenegger\t-0.6991202559002525\n",
      "g_rydstrom\tcolleague\ta_schwarzenegger\t-0.6991202559002525\n",
      "a_schwarzenegger\tname\t'Arnold Schwarzenegger'@de\t-0.5994217687032093\n"
     ]
    }
   ],
   "source": [
    "!kgtk query -i $small_graph \\\n",
    "     --match '(x)-[r]->(y)' \\\n",
    "     --where 'textmatch(y, \"sch* OR linda\")' \\\n",
    "     --return 'x, r.label, y, matchscore(y) as score' \\\n",
    "     --order 'score'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "766202b2",
   "metadata": {},
   "source": [
    "The `trigram` tokenizer (available with Python 3.9) also supports\n",
    "case-insensitive SQL `LIKE` patterns and case-sensitive `GLOB`\n",
    "patterns (see manual)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "84bc14e9",
   "metadata": {},
   "source": [
    "**Example**\n",
    "- text indexes can be qualified with their name to select between multiple options"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "id": "c21dada9",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "node1\tlabel\tnode2\tscore\n",
      "m_shriver\tspouse\ta_schwarzenegger\t-0.6991202559002525\n",
      "g_rydstrom\tcolleague\ta_schwarzenegger\t-0.6991202559002525\n",
      "a_schwarzenegger\tname\t'Arnold Schwarzenegger'@de\t-0.5994217687032093\n"
     ]
    }
   ],
   "source": [
    "!kgtk query -i $small_graph \\\n",
    "     --match '(x)-[r]->(y)' \\\n",
    "     --where 'textmatch(myidx.y, \"schw*\")' \\\n",
    "     --return 'x, r.label, y, matchscore(myidx.y) as score' \\\n",
    "     --order 'score'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "515437d9",
   "metadata": {},
   "source": [
    "**Example**\n",
    "- a text index may index more than one column\n",
    "- then the text match expression can use column-specific filters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "id": "e223c619",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "node1\tlabel\tnode2\tscore\n",
      "g_rydstrom\tcolleague\tg_borders\t-1.391983808592556\n",
      "g_borders\tname\t'Gloria S. Borders'@en\t-1.1761930172126576\n"
     ]
    }
   ],
   "source": [
    "!kgtk query -i $small_graph --idx auto text:node1,node2//name=multi//tokenize=ascii//prefix=3 \\\n",
    "     --match '(x)-[r]->(y)' \\\n",
    "     --where 'textmatch(multi.r, \"node1: g* AND node2 : borders\")' \\\n",
    "     --return 'x, r.label, y, matchscore(multi.r) as score' \\\n",
    "     --order 'score'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "id": "aa5b9961",
   "metadata": {
    "lines_to_next_cell": 2
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "node1\tlabel\tnode2\tscore\n",
      "g_rydstrom\tcolleague\ta_schwarzenegger\t-0.2149880837001077\n",
      "g_rydstrom\tname\t'Gary R. Rydstrom'@en\t-0.18165978747100012\n"
     ]
    }
   ],
   "source": [
    "!kgtk query -i $small_graph \\\n",
    "     --match '(x)-[r]->(y)' \\\n",
    "     --where 'textmatch(multi.r, \"node1: g* NOT node2 : borders\")' \\\n",
    "     --return 'x, r.label, y, matchscore(multi.r) as score' \\\n",
    "     --order 'score'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f639e370",
   "metadata": {},
   "source": [
    "## Input and output specification options"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6d2198c1",
   "metadata": {},
   "source": [
    "```\n",
    "  -i INPUT_FILE [INPUT_FILE ...], --input-files INPUT_FILE [INPUT_FILE ...]\n",
    "                        One or more input files to query (maybe compressed).\n",
    "                        (Required, use '-' for stdin.)\n",
    "  --as NAME             alias name to be used for preceding input\n",
    "  --comment COMMENT     comment string to store for the preceding input\n",
    "                        (displayed by --show-cache)\n",
    "  --no-header           do not generate a header row with column names\n",
    "  -o OUTPUT, --out OUTPUT\n",
    "                        output file to write to, if `-' (the default) output\n",
    "                        goes to stdout. Files with extensions .gz, .bz2 or .xz\n",
    "                        will be appropriately compressed.\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ef7fdcd3",
   "metadata": {},
   "source": [
    "**Example**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "id": "7fb32f69",
   "metadata": {
    "lines_to_next_cell": 2
   },
   "outputs": [],
   "source": [
    "!kgtk query -i $small_graph -i $small_films --as works --comment 'Film information' --no-header \\\n",
    "     --match 'graph: (x)-[]->(y), works: (f)-[r]->(x)' \\\n",
    "     --return 'r, f, r.label, x' \\\n",
    "     -o $TEMP/example-query.tsv.gz"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e62f7dca",
   "metadata": {},
   "source": [
    "## Graph cache"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ed49d72e",
   "metadata": {},
   "source": [
    "- when input files are queries they are first imported into SQLite database tables\n",
    "- this is very fast but can be noticable for larger data\n",
    "- import of WD claims file with 1B edges (~16 GB compressed) takes 20 minutes on laptop\n",
    "- resulting database is cached into a *graph cache* to amortize this over multiple queries\n",
    "- indexes built to speed up queries or text search are also cached\n",
    "- if data change is detected, data will be re-imported and re-indexed automatically"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cf1e0351",
   "metadata": {},
   "source": [
    "- location of the cache file can be controlled with `--graph-cache FILE` or the `KGTK_GRAPH_CACHE`\n",
    "  environment variable\n",
    "- otherwise a default location in `/tmp` will be used"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5168ef96",
   "metadata": {},
   "source": [
    "- cache is a *true cache*, that is it is safe to delete it and it will be rebuilt automatically\n",
    "  (as long as the relevant data files are still available)\n",
    "- however, we increasingly find cases where the cache is explicitly managed, for example,\n",
    "  to support the browser or to ship large data to somebody else\n",
    "- cache files can become very large and should be on an SSD drive for best performance"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b5e9f195",
   "metadata": {},
   "source": [
    "**Example**\n",
    "- display the current content of the cache with `--show-cache`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "id": "ee45124d",
   "metadata": {
    "lines_to_next_cell": 2
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Graph Cache:\n",
      "DB file: /tmp/projects/tutorial-kypher/temp.tutorial-kypher/wikidata.sqlite3.db\n",
      "  size:  72.00 KB   \tfree:  0 Bytes   \tmodified:  2022-03-16 15:16:57\n",
      "\n",
      "KGTK File Information:\n",
      "/Users/amandeep/isi-kgtk-tutorial/tutorial-kypher_input/small-graph.tsv:\n",
      "  size:  382 Bytes   \tmodified:  2022-03-16 15:05:47   \tgraph:  graph_1\n",
      "works:\n",
      "  size:  590 Bytes   \tmodified:  2022-03-16 15:05:48   \tgraph:  graph_2\n",
      "  comment:  Film information\n",
      "\n",
      "Graph Table Information:\n",
      "graph_1:\n",
      "  size:  40.00 KB   \tcreated:  2022-03-16 15:16:15\n",
      "  header:  ['id', 'node1', 'label', 'node2']\n",
      "graph_2:\n",
      "  size:  12.00 KB   \tcreated:  2022-03-16 15:16:57\n",
      "  header:  ['id', 'node1', 'label', 'node2']\n"
     ]
    }
   ],
   "source": [
    "!kgtk query --show-cache"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "18ec64d4",
   "metadata": {},
   "source": [
    "## Defining and using custom functions"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b87f16f5",
   "metadata": {},
   "source": [
    "- when provided built-in functions are not enough, custom functions can be executed via Python"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ca613f46",
   "metadata": {},
   "source": [
    "**Example**\n",
    "- perform some random modifications on the `GRAPH` data\n",
    "- note that values returned by `pyeval` and `pycall` must be simple literals"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "id": "d6032d24",
   "metadata": {
    "lines_to_next_cell": 2
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "name\tswapname\tisodd\tuuid\n",
      "'Arnold Schwarzenegger'@de\t'aRNOLD sCHWARZENEGGER'@DE\t0.0\ta67a6524-e993-4f9e-b09f-ac99c9d39c0f\n",
      "'Linda Hamilton'@en\t'lINDA hAMILTON'@EN\t1.0\tb8c68315-c455-477c-ac90-675ae5f4f127\n",
      "'Gary R. Rydstrom'@en\t'gARY r. rYDSTROM'@EN\t1.0\t42020bf4-47aa-43f5-a73d-e2d1b8482fa8\n",
      "'Gloria S. Borders'@en\t'gLORIA s. bORDERS'@EN\t0.0\tff6098af-5432-4c98-a47b-acdca0e2a655\n",
      "'Maria Shriver'@en\t'mARIA sHRIVER'@EN\t0.0\t4c094d9e-515a-4a6d-be77-e75feda64cf4\n"
     ]
    }
   ],
   "source": [
    "!kgtk query -i $small_graph --import 'uuid, math as m' \\\n",
    "     --match '(x)-[r:name]->(y)' \\\n",
    "     --where 'kgtk_lqstring(y)' \\\n",
    "     --return 'y as name, \\\n",
    "               pyeval(printf($FMT, y)) as swapname, \\\n",
    "               pycall(\"m.fmod\", length(y), 2) as isodd, \\\n",
    "               pycall(\"uuid.uuid4\") as uuid' \\\n",
    "     --para FMT='\"%s\".swapcase()'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2d53abb0",
   "metadata": {},
   "source": [
    "## Debugging"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7e2fee34",
   "metadata": {},
   "source": [
    "**Example**\n",
    "- `query` accepts the `--debug` and `--expert` options to show additional output\n",
    "- this shows query translation, data import and indexing, etc.\n",
    "- here we can see how built-in functions are called directly in SQL:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "id": "9c61736f",
   "metadata": {
    "lines_to_next_cell": 2
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2022-03-16 15:17:33 sqlstore]: IMPORT graph directly into table graph_1 from /Users/amandeep/isi-kgtk-tutorial/tutorial-kypher_input/small-graph.tsv ...\n",
      "[2022-03-16 15:17:33 query]: SQL Translation:\n",
      "---------------------------------------------\n",
      "  SELECT *\n",
      "     FROM graph_1 AS graph_1_c1\n",
      "     WHERE graph_1_c1.\"label\" = ?\n",
      "        AND (kgtk_lqstring_lang(graph_1_c1.\"node2\") = ?)\n",
      "  PARAS: ['name', 'de']\n",
      "---------------------------------------------\n",
      "[2022-03-16 15:17:33 sqlstore]: CREATE INDEX \"graph_1_label_idx\" ON \"graph_1\" (\"label\")\n",
      "[2022-03-16 15:17:33 sqlstore]: ANALYZE \"graph_1_label_idx\"\n",
      "id\tnode1\tlabel\tnode2\n",
      "e5\ta_schwarzenegger\tname\t'Arnold Schwarzenegger'@de\n"
     ]
    }
   ],
   "source": [
    "!rm -f $KGTK_GRAPH_CACHE\n",
    "\n",
    "!kgtk --debug query -i $small_graph \\\n",
    "     --match '(p)-[r:name]->(n)' \\\n",
    "     --where 'n.kgtk_lqstring_lang = \"de\"'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c0f3bd4a",
   "metadata": {},
   "source": [
    "- run it again to show how imported data and indexes are reused"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "id": "a0272483",
   "metadata": {
    "lines_to_next_cell": 2
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2022-03-16 15:17:42 query]: SQL Translation:\n",
      "---------------------------------------------\n",
      "  SELECT *\n",
      "     FROM graph_1 AS graph_1_c1\n",
      "     WHERE graph_1_c1.\"label\" = ?\n",
      "        AND (kgtk_lqstring_lang(graph_1_c1.\"node2\") = ?)\n",
      "  PARAS: ['name', 'de']\n",
      "---------------------------------------------\n",
      "id\tnode1\tlabel\tnode2\n",
      "e5\ta_schwarzenegger\tname\t'Arnold Schwarzenegger'@de\n"
     ]
    }
   ],
   "source": [
    "!kgtk --debug query -i $small_graph \\\n",
    "     --match '(p)-[r:name]->(n)' \\\n",
    "     --where 'n.kgtk_lqstring_lang = \"de\"'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f57eb238",
   "metadata": {},
   "source": [
    "## Querying based on edge qualifiers"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cc83ef1c",
   "metadata": {},
   "source": [
    "- one of the motivations for developing Kypher was the \"Wikidata time machine use case\"\n",
    "- find all statements with temporal annotations after a certain time and remove them\n",
    "  to only retain facts known before that time\n",
    "- in Wikidata temporal annotations are attached via edge qualifiers\n",
    "- in KGTK those are represented via edges linked to another edge's ID"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "19f71825",
   "metadata": {},
   "source": [
    "**Example**\n",
    "1. look for base edges in the `GRAPH` graph\n",
    "2. link to qualifiers in the `QUALS` graph via edge id `r`\n",
    "3. restrict the qualifiers based on edge labels\n",
    "   `ql` that are listed in the `PROPS` graph\n",
    "4. restrict to edges that have a time with year of at most 2000\n",
    "5. output the qualifying base edges with their temporal annotations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "id": "4c70b1bd",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "id\tnode1\tlabel\tnode2\n",
      "q1\tt5\trole\tterminator\n",
      "q2\tt6\trole\ts_connor\n",
      "q3\tt9\tpoint_in_time\t^1992-03-30T00:00:00Z/11\n",
      "q4 \tt9\twinner\tg_rydstrom\n",
      "q5 \tt9\twinner\tg_borders\n",
      "q6 \tt10\tlocation\tunited_states\n",
      "q8 \tt11\tlocation\trussia\n",
      "q9\te1\tstart\t^1986-04-26T08:00:00Z/11\n",
      "q10\te1\tend\t^2011-07-01T10:00:00Z/11\n"
     ]
    }
   ],
   "source": [
    "!kgtk query -i $small_quals"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "id": "af9b03bc",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "id\tnode1\tlabel\tnode2\n",
      "p11\tstart\tmember\tset1\n",
      "p12\tend\tmember\tset1\n",
      "p12\tpoint_in_time\tmember\tset1\n"
     ]
    }
   ],
   "source": [
    "!kgtk query -i $small_props"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cd7867ea",
   "metadata": {},
   "source": [
    "- the crucial part of the query below is how we use the `id` of the base edge `r` as the `node1` of the qualifier edge `q` whose label `ql` has to be one of the properties listed in `PROPS`\n",
    "- we also need to use Kypher property syntax to introduce the match variable `ql`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "id": "a96b0621",
   "metadata": {
    "lines_to_next_cell": 2
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "id\tnode1\tlabel\tnode2\ttrel\ttime\n",
      "e1\tm_shriver\tspouse\ta_schwarzenegger\tstart\t^1986-04-26T08:00:00Z/11\n"
     ]
    }
   ],
   "source": [
    "!kgtk query -i $small_graph -i $small_quals -i $small_props  \\\n",
    "     --match \"graph: (x)-[r]->(y),  \\\n",
    "              quals: (r)-[q {label: ql}]->(time), \\\n",
    "              props: (ql)-[]->()\" \\\n",
    "     --where \"time.kgtk_date_year <= 2000\" \\\n",
    "     --return \"r as id, x, r.label, y, ql as trel, time as time\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "16b1ac79",
   "metadata": {},
   "source": [
    "## Further reading"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b87e6aa7",
   "metadata": {},
   "source": [
    "- [Kypher manual](https://github.com/usc-isi-i2/kgtk/blob/dev/docs/transform/query.md)\n",
    "  - [Kypher vs. Cypher and unfinished features](https://github.com/usc-isi-i2/kgtk/blob/dev/docs/transform/query.md#differences-to-cypher)\n",
    "  - [Kypher API](https://github.com/usc-isi-i2/kgtk/blob/dev/kgtk/kypher/api.py)\n",
    "- [KGTK manual](https://kgtk.readthedocs.io/en/dev/)\n",
    "- [KGTK data model](https://kgtk.readthedocs.io/en/dev/data_model/)\n",
    "- [KGTK GitHub site](https://github.com/usc-isi-i2/kgtk)\n",
    "- [KGTK Tutorial use cases](https://github.com/usc-isi-i2/kgtk-notebooks/tree/main/tutorial)\n",
    "- [KGTK use cases](https://github.com/usc-isi-i2/kgtk/tree/dev/use-cases)\n",
    "- [openCypher](https://www.opencypher.org/)\n",
    "- [Cypher](https://neo4j.com/developer/cypher/)"
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "formats": "ipynb,py:light"
  },
  "kernelspec": {
   "display_name": "kgtk-env",
   "language": "python",
   "name": "kgtk-env"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}