{ "cells": [ { "cell_type": "markdown", "id": "51d077a7", "metadata": {}, "source": [ "# Query knowledge graphs with KGTK Kypher" ] }, { "cell_type": "markdown", "id": "a0bef068", "metadata": {}, "source": [ "This tutorial gives an overview of the core features of the KGTK Kypher query language. \n", "\n", "For additional details refer to the reading list at the end. " ] }, { "cell_type": "markdown", "id": "1617744d-8f2d-4af9-af36-9d8b570b4aa0", "metadata": {}, "source": [ "## Step 0: Install KGTK" ] }, { "cell_type": "markdown", "id": "fd4818d0-ffd4-4bfd-bc15-6a28e6067d62", "metadata": {}, "source": [ "Only run the following cell if KGTK is not installed.\n", " For example, if running in [Google Colab](https://colab.research.google.com/)" ] }, { "cell_type": "code", "execution_count": null, "id": "30d32722-d768-4b12-8723-3596c33cdaab", "metadata": {}, "outputs": [], "source": [ "!pip install kgtk" ] }, { "cell_type": "markdown", "id": "28ca9cbc", "metadata": {}, "source": [ "### Notebook data setup" ] }, { "cell_type": "code", "execution_count": 1, "id": "2d11ddcf", "metadata": {}, "outputs": [], "source": [ "import os\n", "import os.path\n", "\n", "from kgtk.configure_kgtk_notebooks import ConfigureKGTK" ] }, { "cell_type": "code", "execution_count": 2, "id": "cfe8ba35", "metadata": { "tags": [ "parameters" ] }, "outputs": [], "source": [ "# Parameters\n", "\n", "# Folders on local machine where to create the output and temporary files:\n", "input_path = None\n", "output_path = \"/tmp/projects\"\n", "project_name = \"tutorial-kypher\"" ] }, { "cell_type": "code", "execution_count": null, "id": "0e28525d", "metadata": {}, "outputs": [], "source": [ "# These are all the KG files that we use in this notebook:\n", "additional_files = {\n", " \"small_graph\": \"small-graph.tsv\",\n", " \"small_films\": \"small-films.tsv\",\n", " \"small_props\": \"small-props.tsv\",\n", " \"small_quals\": \"small-quals.tsv\",\n", "}\n", "\n", "big_files = [\n", " \"all\",\n", " \"label\",\n", " \"pagerank_undirected\",\n", "]\n", "\n", "ck = ConfigureKGTK(big_files)\n", "ck.configure_kgtk(input_graph_path=input_path, \n", " output_path=output_path, \n", " project_name=project_name,\n", " additional_files=additional_files)" ] }, { "cell_type": "code", "execution_count": 5, "id": "22ddd656", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "kypher: kgtk query --graph-cache /tmp/projects/tutorial-kypher/temp.tutorial-kypher/wikidata.sqlite3.db\n", "EXAMPLES_DIR: /Users/amandeep/Github/kgtk-notebooks/examples\n", "KGTK_LABEL_FILE: /Users/amandeep/isi-kgtk-tutorial/tutorial-kypher_input/labels.en.tsv.gz\n", "KGTK_OPTION_DEBUG: false\n", "STORE: /tmp/projects/tutorial-kypher/temp.tutorial-kypher/wikidata.sqlite3.db\n", "GRAPH: /Users/amandeep/isi-kgtk-tutorial/tutorial-kypher_input\n", "OUT: /tmp/projects/tutorial-kypher\n", "kgtk: kgtk\n", "KGTK_GRAPH_CACHE: /tmp/projects/tutorial-kypher/temp.tutorial-kypher/wikidata.sqlite3.db\n", "USE_CASES_DIR: /Users/amandeep/Github/kgtk-notebooks/use-cases\n", "TEMP: /tmp/projects/tutorial-kypher/temp.tutorial-kypher\n", "all: /Users/amandeep/isi-kgtk-tutorial/tutorial-kypher_input/all.tsv.gz\n", "label: /Users/amandeep/isi-kgtk-tutorial/tutorial-kypher_input/labels.en.tsv.gz\n", "pagerank_undirected: /Users/amandeep/isi-kgtk-tutorial/tutorial-kypher_input/metadata.pagerank.undirected.tsv.gz\n", "small_graph: /Users/amandeep/isi-kgtk-tutorial/tutorial-kypher_input/small-graph.tsv\n", "small_films: /Users/amandeep/isi-kgtk-tutorial/tutorial-kypher_input/small-films.tsv\n", "small_props: /Users/amandeep/isi-kgtk-tutorial/tutorial-kypher_input/small-props.tsv\n", "small_quals: /Users/amandeep/isi-kgtk-tutorial/tutorial-kypher_input/small-quals.tsv\n" ] } ], "source": [ "ck.print_env_variables()" ] }, { "cell_type": "markdown", "id": "50f97f88", "metadata": { "lines_to_next_cell": 2 }, "source": [ "## Usage of `query` command\n", "```\n", "usage: kgtk query [-h] [-i INPUT_FILE [INPUT_FILE ...]] [--as NAME]\n", " [--comment COMMENT] [--query QUERY] [--match PATTERN]\n", " [--where CLAUSE] [--opt PATTERN] [--with CLAUSE]\n", " [--where: CLAUSE] [--return CLAUSE] [--order-by CLAUSE]\n", " [--skip CLAUSE] [--limit CLAUSE] [--para NAME=VAL]\n", " [--spara NAME=VAL] [--lqpara NAME=VAL] [--no-header]\n", " [--force] [--index MODE [MODE ...]] [--idx SPEC [SPEC ...]]\n", " [--explain [MODE]] [--graph-cache GRAPH_CACHE_FILE]\n", " [--show-cache] [--import MODULE_LIST] [-o OUTPUT]\n", "\n", "Query one or more KGTK files with Kypher.\n", "IMPORTANT: input can come from stdin but chaining queries is not yet supported.\n", "\n", "Input options:\n", " -i INPUT_FILE [INPUT_FILE ...], --input-files INPUT_FILE [INPUT_FILE ...]\n", " One or more input files to query, maybe compressed\n", " (May be omitted or '-' for stdin.)\n", " --as NAME alias name to be used for preceding input\n", " --comment COMMENT comment string to store for the preceding input\n", " (displayed by --show-cache)\n", "\n", "Query options:\n", " --query QUERY complete Kypher query combining all clauses, if\n", " supplied, all other specialized clause arguments will\n", " be ignored\n", " --match PATTERN MATCH pattern of a Kypher query, defaults to universal\n", " node pattern `()'\n", " --where CLAUSE WHERE clause to a preceding --match, --opt or --with\n", " clause\n", " --opt PATTERN, --optional PATTERN\n", " OPTIONAL MATCH pattern(s) of a Kypher query (zero or\n", " more)\n", " --with CLAUSE WITH clause of a Kypher query (only 'WITH * ...' is\n", " currently supported)\n", " --where: CLAUSE final global WHERE clause, shorthand for 'WITH * WHERE\n", " ...'\n", " --return CLAUSE RETURN clause of a Kypher query (defaults to *)\n", " --order-by CLAUSE ORDER BY clause of a Kypher query\n", " --skip CLAUSE SKIP clause of a Kypher query\n", " --limit CLAUSE LIMIT clause of a Kypher query\n", " --para NAME=VAL zero or more named value parameters to be passed to\n", " the query\n", " --spara NAME=VAL zero or more named string parameters to be passed to\n", " the query\n", " --lqpara NAME=VAL zero or more named LQ-string parameters to be passed\n", " to the query\n", "\n", "Output options:\n", " --no-header do not generate a header row with column names\n", " -o OUTPUT, --out OUTPUT\n", " output file to write to, if `-' (the default) output\n", " goes to stdout. Files with extensions .gz, .bz2 or .xz\n", " will be appropriately compressed.\n", "\n", "Control options:\n", " --force force problematic queries to run against advice\n", " --index MODE [MODE ...], --index-mode MODE [MODE ...]\n", " default index creation MODE for all inputs (default:\n", " auto); can be overridden with --idx for specific\n", " inputs\n", " --idx SPEC [SPEC ...], --input-index SPEC [SPEC ...]\n", " create index(es) according to SPEC for the preceding\n", " input only\n", " --explain [MODE] explain the query execution and indexing plan\n", " according to MODE (plan, full, expert, default: plan).\n", " This will not actually run or create anything.\n", " --graph-cache GRAPH_CACHE_FILE\n", " database cache where graphs will be imported before\n", " they are queried (defaults to per-user temporary file)\n", " --show-cache describe the current content of the graph cache and\n", " exit (does not actually run a query or import data)\n", " --import MODULE_LIST Python modules needed to define user extensions to\n", " built-in functions\n", " -h, --help show this help message and exit\n", "```" ] }, { "cell_type": "markdown", "id": "becbc95b", "metadata": {}, "source": [ "## Selecting edges with the `--match` clause" ] }, { "cell_type": "markdown", "id": "6bfb0429", "metadata": {}, "source": [ "- KGTK `query` either takes a full Kypher `--query` or individual Kypher clauses such as `--match`,\n", " `--return`, etc.\n", "- individual clauses are automatically assembled into the proper order, easier in Unix shell environment" ] }, { "cell_type": "markdown", "id": "1bcfa3a4", "metadata": {}, "source": [ "**Example**\n", "- simple query on a single input graph with an anonymous edge pattern\n", "- for convenience we use shell variable `$GRAPH` to point to small data file\n", "- **IMPORTANT**: use quotes around the match pattern to protect from the shell\n", "- this matches all edges and returns the whole file" ] }, { "cell_type": "code", "execution_count": 6, "id": "1d2a9d08", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/Users/amandeep/isi-kgtk-tutorial/tutorial-kypher_input/small-graph.tsv\n" ] } ], "source": [ "!echo $small_graph" ] }, { "cell_type": "code", "execution_count": 7, "id": "5130f672", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id\tnode1\tlabel\tnode2\n", "e1\tm_shriver\tspouse\ta_schwarzenegger\n", "e2\ta_schwarzenegger\tcoactor\tl_hamilton\n", "e3\tg_rydstrom\tcolleague\tg_borders\n", "e4\tg_rydstrom\tcolleague\ta_schwarzenegger\n", "e5\ta_schwarzenegger\tname\t'Arnold Schwarzenegger'@de\n", "e6\tl_hamilton\tname\t'Linda Hamilton'@en\n", "e7\tg_rydstrom\tname\t'Gary R. Rydstrom'@en\n", "e8\tg_borders\tname\t'Gloria S. Borders'@en\n", "e9\tm_shriver\tname\t'Maria Shriver'@en\n" ] } ], "source": [ "!kgtk query -i $small_graph --match '()-[]->()'" ] }, { "cell_type": "markdown", "id": "7733bff5", "metadata": {}, "source": [ "- match pattern starts with anonymous node connecting via an\n", " anonymous relation to another anonymous node\n", "- matched against the four core columns specifying an edge in each line of the KGTK\n", " input file\n", "- from-node is matched against `node1`, relation is\n", " matched against `id` and `label` (more on that distinction later), and\n", " the to-node is matched against `node2`\n", "- for each KGTK line matching the pattern, output is generated according to the `--return` clause\n", "- default for `--return` is `*` which means all columns of a matching line will be output" ] }, { "cell_type": "markdown", "id": "9c4fb588", "metadata": {}, "source": [ "**Equivalent query:**\n", "- singular anonymous node pattern will be completed to a full edge" ] }, { "cell_type": "code", "execution_count": 8, "id": "db30d4a5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id\tnode1\tlabel\tnode2\n", "e1\tm_shriver\tspouse\ta_schwarzenegger\n", "e2\ta_schwarzenegger\tcoactor\tl_hamilton\n", "e3\tg_rydstrom\tcolleague\tg_borders\n", "e4\tg_rydstrom\tcolleague\ta_schwarzenegger\n", "e5\ta_schwarzenegger\tname\t'Arnold Schwarzenegger'@de\n", "e6\tl_hamilton\tname\t'Linda Hamilton'@en\n", "e7\tg_rydstrom\tname\t'Gary R. Rydstrom'@en\n", "e8\tg_borders\tname\t'Gloria S. Borders'@en\n", "e9\tm_shriver\tname\t'Maria Shriver'@en\n" ] } ], "source": [ "!kgtk query -i $small_graph --match '()'" ] }, { "cell_type": "markdown", "id": "547b54f2", "metadata": {}, "source": [ "**Equivalent query:**\n", "- that pattern is also the default for `--match`" ] }, { "cell_type": "code", "execution_count": 9, "id": "3510b2fc", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id\tnode1\tlabel\tnode2\n", "e1\tm_shriver\tspouse\ta_schwarzenegger\n", "e2\ta_schwarzenegger\tcoactor\tl_hamilton\n", "e3\tg_rydstrom\tcolleague\tg_borders\n", "e4\tg_rydstrom\tcolleague\ta_schwarzenegger\n", "e5\ta_schwarzenegger\tname\t'Arnold Schwarzenegger'@de\n", "e6\tl_hamilton\tname\t'Linda Hamilton'@en\n", "e7\tg_rydstrom\tname\t'Gary R. Rydstrom'@en\n", "e8\tg_borders\tname\t'Gloria S. Borders'@en\n", "e9\tm_shriver\tname\t'Maria Shriver'@en\n" ] } ], "source": [ "!kgtk query -i $small_graph" ] }, { "cell_type": "markdown", "id": "2634d31c", "metadata": {}, "source": [ "**Equivalent command:** `kgtk cat`" ] }, { "cell_type": "code", "execution_count": 10, "id": "9f348d66", "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id\tnode1\tlabel\tnode2\n", "e1\tm_shriver\tspouse\ta_schwarzenegger\n", "e2\ta_schwarzenegger\tcoactor\tl_hamilton\n", "e3\tg_rydstrom\tcolleague\tg_borders\n", "e4\tg_rydstrom\tcolleague\ta_schwarzenegger\n", "e5\ta_schwarzenegger\tname\t'Arnold Schwarzenegger'@de\n", "e6\tl_hamilton\tname\t'Linda Hamilton'@en\n", "e7\tg_rydstrom\tname\t'Gary R. Rydstrom'@en\n", "e8\tg_borders\tname\t'Gloria S. Borders'@en\n", "e9\tm_shriver\tname\t'Maria Shriver'@en\n" ] } ], "source": [ "!kgtk cat -i $small_graph" ] }, { "cell_type": "markdown", "id": "cc441053", "metadata": {}, "source": [ "### Restricting output\n", "- use `--limit` and/or `--skip` to control output volume,\n", " especially to test queries on larger data" ] }, { "cell_type": "code", "execution_count": 11, "id": "6807d9af", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id\tnode1\tlabel\tnode2\n", "e1\tm_shriver\tspouse\ta_schwarzenegger\n", "e2\ta_schwarzenegger\tcoactor\tl_hamilton\n", "e3\tg_rydstrom\tcolleague\tg_borders\n" ] } ], "source": [ "!kgtk query -i $small_graph --limit 3" ] }, { "cell_type": "code", "execution_count": 12, "id": "bbd440e1", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id\tnode1\tlabel\tnode2\n", "e3\tg_rydstrom\tcolleague\tg_borders\n", "e4\tg_rydstrom\tcolleague\ta_schwarzenegger\n", "e5\ta_schwarzenegger\tname\t'Arnold Schwarzenegger'@de\n" ] } ], "source": [ "!kgtk query -i $small_graph --skip 2 --limit 3" ] }, { "cell_type": "code", "execution_count": 13, "id": "caa2991c", "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id\tnode1\tlabel\tnode2\n", "e1\tm_shriver\tspouse\ta_schwarzenegger\n", "e2\ta_schwarzenegger\tcoactor\tl_hamilton\n", "e3\tg_rydstrom\tcolleague\tg_borders\n" ] } ], "source": [ "!kgtk query -i $small_graph / head -n 3" ] }, { "cell_type": "markdown", "id": "21484bae", "metadata": {}, "source": [ "### Restricting edges\n", "- patterns can be made more specific by restricting some of the elements of an edge\n", "- for example, filter for all edges that start with `m_shriver`" ] }, { "cell_type": "code", "execution_count": 14, "id": "579b24f6", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id\tnode1\tlabel\tnode2\n", "e1\tm_shriver\tspouse\ta_schwarzenegger\n", "e9\tm_shriver\tname\t'Maria Shriver'@en\n" ] } ], "source": [ "!kgtk query -i $small_graph --match '(:m_shriver)-[]->()'" ] }, { "cell_type": "markdown", "id": "56ac3776", "metadata": {}, "source": [ "- **Kypher vs. Cypher**: in Cypher the restriction `m_shriver` would be interpreted as a node *type* in a property graph, in KGTK Kypher\n", "it is interpreted as the ID of a particular node" ] }, { "cell_type": "markdown", "id": "af32722e", "metadata": {}, "source": [ "**Example**\n", "- filter on the relation of an edge\n", "- select all edges with label `name` using the same colon-restriction syntax we used for nodes" ] }, { "cell_type": "code", "execution_count": 15, "id": "fa10dac3", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id\tnode1\tlabel\tnode2\n", "e5\ta_schwarzenegger\tname\t'Arnold Schwarzenegger'@de\n", "e6\tl_hamilton\tname\t'Linda Hamilton'@en\n", "e7\tg_rydstrom\tname\t'Gary R. Rydstrom'@en\n", "e8\tg_borders\tname\t'Gloria S. Borders'@en\n", "e9\tm_shriver\tname\t'Maria Shriver'@en\n" ] } ], "source": [ "!kgtk query -i $small_graph --match '()-[:name]->()'" ] }, { "cell_type": "markdown", "id": "70207f7c", "metadata": {}, "source": [ "- for relations, the interpretation of restrictions on the label of an\n", "edge (as opposed to its `id`) is more in line with standard Cypher." ] }, { "cell_type": "markdown", "id": "841a95e3", "metadata": {}, "source": [ "**Example**\n", "- node and relation restrictions can be combined\n", "- for example, here we select all `name` edges starting from node `a_schwarzenegger`" ] }, { "cell_type": "code", "execution_count": 16, "id": "d54b0533", "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id\tnode1\tlabel\tnode2\n", "e5\ta_schwarzenegger\tname\t'Arnold Schwarzenegger'@de\n" ] } ], "source": [ "!kgtk query -i $small_graph --match '(:a_schwarzenegger)-[:name]->()'" ] }, { "cell_type": "markdown", "id": "7dd48f36", "metadata": {}, "source": [ "## Filtering with the `--where` clause" ] }, { "cell_type": "markdown", "id": "0ecba3ba", "metadata": {}, "source": [ "- `--where` holds a possibly complex Boolean expression that gets evaluated as additional edge filter\n", "- only edges for which it evaluates to true will be returned\n", "- `--where` can be used to express more complex conditions and computations that cannot be expressed in `--match`\n", "- to connect to values selected by `--match` we need pattern *variables*\n", "- variables are specified with simple names in the node or relationship part of a pattern\n", "- below we use `p` as the variable for the starting node of the edge pattern\n", "- in the `--where` clause we restrict which values are allowed for `p`\n", "- this query is equivalent to the one above where we restricted the starting node directly in the match pattern:" ] }, { "cell_type": "code", "execution_count": 17, "id": "3ec0af12", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id\tnode1\tlabel\tnode2\n", "e9\tm_shriver\tname\t'Maria Shriver'@en\n" ] } ], "source": [ "!kgtk query -i $small_graph \\\n", " --match '(p)-[:name]->()' \\\n", " --where 'p = \"m_shriver\"'" ] }, { "cell_type": "markdown", "id": "7957db5f", "metadata": {}, "source": [ "**Equivalent query:**\n", "- specifies the starting node restriction twice which is ok but redundant" ] }, { "cell_type": "code", "execution_count": 18, "id": "a394af91", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id\tnode1\tlabel\tnode2\n", "e9\tm_shriver\tname\t'Maria Shriver'@en\n" ] } ], "source": [ "!kgtk query -i $small_graph \\\n", " --match '(p:m_shriver)-[:name]->()' \\\n", " --where 'p = \"m_shriver\"'" ] }, { "cell_type": "markdown", "id": "5ffdb95d", "metadata": {}, "source": [ "- **IMPORTANT**: constants such as `m_shriver` need to be quoted when used in `--where`, similar to SQL\n", "- this needs extra care so quotes will not be consumed by the Unix shell" ] }, { "cell_type": "markdown", "id": "b08921e4", "metadata": {}, "source": [ "**Example**\n", "- use a regular expression to filter on the names attached to nodes\n", "- the `=~` operator matches a value against a regular expression\n", "- **Kypher vs. Cypher**: Kypher use Python regexp syntax, Cypher uses Java regexps\n", "- select all `name` edges that lead to a name that contains a double letter:" ] }, { "cell_type": "code", "execution_count": 19, "id": "8df2fb4c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id\tnode1\tlabel\tnode2\n", "e5\ta_schwarzenegger\tname\t'Arnold Schwarzenegger'@de\n" ] } ], "source": [ "!kgtk query -i $small_graph \\\n", " --match '(p)-[:name]->(n)' \\\n", " --where 'n =~ \".*(.)\\\\1.*\"'" ] }, { "cell_type": "markdown", "id": "b3263a68", "metadata": {}, "source": [ "**Example**\n", "- filter based on a list of values (a form of disjunction)\n", "- any edge where `p` is equal to one of the listed values will be returned\n", "- **Kypher vs. Cypher**: Kypher only allows lists of literals such as strings\n", " or numbers, Cypher also allows variables and expressions" ] }, { "cell_type": "code", "execution_count": 20, "id": "e0b986bf", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id\tnode1\tlabel\tnode2\n", "e7\tg_rydstrom\tname\t'Gary R. Rydstrom'@en\n", "e8\tg_borders\tname\t'Gloria S. Borders'@en\n" ] } ], "source": [ "!kgtk query -i $small_graph \\\n", " --match '(p)-[:name]->(n)' \\\n", " --where 'p IN [\"g_borders\", \"g_rydstrom\"]'" ] }, { "cell_type": "markdown", "id": "b83cbdf1", "metadata": {}, "source": [ "**Example**\n", "- filter based on a comparison operator and a computation using built-in functions\n", "- **IMPORTANT**: all columns in a KGTK file are treated as text (even if they contain numbers)\n", "- the expression below filters for names that start with the letter `L` or later\n", "- note that quotes of KGTK string literals are part of their value and need to be accounted for\n", "- for this we use the built-in function `substr` to extract the first letter of each name" ] }, { "cell_type": "code", "execution_count": 21, "id": "f80eb655", "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id\tnode1\tlabel\tnode2\n", "e6\tl_hamilton\tname\t'Linda Hamilton'@en\n", "e9\tm_shriver\tname\t'Maria Shriver'@en\n" ] } ], "source": [ "!kgtk query -i $small_graph \\\n", " --match '(p)-[:name]->(n)' \\\n", " --where \"substr(n,2,1) >= 'L'\"" ] }, { "cell_type": "markdown", "id": "2d4ceebc", "metadata": {}, "source": [ "## Sorting results with the `--order-by` clause" ] }, { "cell_type": "markdown", "id": "d513b98a", "metadata": {}, "source": [ "- use `--order-by` to sort results just like in Cypher and SQL\n", "- this query sorts matched results by names in ascending order:" ] }, { "cell_type": "code", "execution_count": 22, "id": "a0f82ae8", "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id\tnode1\tlabel\tnode2\n", "e6\tl_hamilton\tname\t'Linda Hamilton'@en\n", "e9\tm_shriver\tname\t'Maria Shriver'@en\n" ] } ], "source": [ "!kgtk query -i $small_graph \\\n", " --match '(p)-[:name]->(n)' \\\n", " --where \"upper(substr(n,2,1)) >= 'L'\" \\\n", " --order-by n" ] }, { "cell_type": "markdown", "id": "7a375a76", "metadata": {}, "source": [ "- order by an expression value, this time in descending order:" ] }, { "cell_type": "code", "execution_count": 23, "id": "180baf9d", "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id\tnode1\tlabel\tnode2\n", "e9\tm_shriver\tname\t'Maria Shriver'@en\n", "e6\tl_hamilton\tname\t'Linda Hamilton'@en\n" ] } ], "source": [ "!kgtk query -i $small_graph \\\n", " --match '(p)-[:name]->(n)' \\\n", " --where \"substr(n,2,1) >= 'L'\" \\\n", " --order-by \"substr(n,2,1) desc\"" ] }, { "cell_type": "markdown", "id": "f27a5dac", "metadata": {}, "source": [ "## Controlling results with the `--return` clause" ] }, { "cell_type": "markdown", "id": "7755871b", "metadata": {}, "source": [ "- use `--return` to control which columns are output, in which order, with what headers\n", "- by default `*` is used which means all columns of matching edge(s) are output\n", "- use KGTK and SQLite built-in functions to transform output values or compute additional ones\n", "- use `distict` to eliminate duplicates\n", "- use aggregation function to get counts, max, average, etc." ] }, { "cell_type": "markdown", "id": "7ca3b0fd", "metadata": {}, "source": [ "**Example**\n", "- select only the `node1` and `node2` columns by referencing pattern variables `p` and `n`\n", "- note, that the result generated here is not valid KGTK (missing `id` and `label`)" ] }, { "cell_type": "code", "execution_count": 24, "id": "c2ebb33c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "node1\tnode2\n", "a_schwarzenegger\t'Arnold Schwarzenegger'@de\n" ] } ], "source": [ "!kgtk query -i $small_graph \\\n", " --match '(p)-[:name]->(n)' \\\n", " --where 'n =~ \".*(.)\\\\1.*\"' \\\n", " --return 'p, n'" ] }, { "cell_type": "markdown", "id": "f04c5488", "metadata": {}, "source": [ "**Example**\n", "- return all columns but switching their order\n", "- **Kypher vs. Cypher**: Kypher relation variables such as `r` below\n", " get bound to edge IDs; other components of an edge such as its `label`\n", " can then be access via property syntax such as `r.label`\n", "- this query now produces valid KGTK" ] }, { "cell_type": "code", "execution_count": 25, "id": "6a07ab49", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "node1\tnode2\tid\tlabel\n", "a_schwarzenegger\t'Arnold Schwarzenegger'@de\te5\tname\n" ] } ], "source": [ "!kgtk query -i $small_graph \\\n", " --match '(p)-[r:name]->(n)' \\\n", " --where 'n =~ \".*(.)\\\\1.*\"' \\\n", " --return 'p, n, r, r.label'" ] }, { "cell_type": "markdown", "id": "49970cf4", "metadata": {}, "source": [ "**Example**\n", "- summarize data via `distinct` to eliminate duplicates" ] }, { "cell_type": "code", "execution_count": 26, "id": "6d286af7", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "label\n", "coactor\n", "colleague\n", "name\n", "spouse\n" ] } ], "source": [ "!kgtk query -i $small_graph \\\n", " --match '(p)-[r]->(n)' \\\n", " --return 'distinct r.label' \\\n", " --order-by r.label" ] }, { "cell_type": "markdown", "id": "ecd5305b", "metadata": {}, "source": [ "**Example**\n", "- transform data with built-in functions, e.g., change value to lower case" ] }, { "cell_type": "code", "execution_count": 27, "id": "fcb5c6f4", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "node1\tlabel\tlower(graph_1_c1.\"node2\")\tid\n", "a_schwarzenegger\tname\t'arnold schwarzenegger'@de\te5\n" ] } ], "source": [ "!kgtk query -i $small_graph \\\n", " --match '(p)-[r:name]->(n)' \\\n", " --where 'n =~ \".*(.)\\\\1.*\"' \\\n", " --return 'p, r.label, lower(n), r'" ] }, { "cell_type": "markdown", "id": "c0d7b19f", "metadata": {}, "source": [ "**Example**\n", "- specify proper column headers for output values, e.g., designate the transformed value as `node2`" ] }, { "cell_type": "code", "execution_count": 28, "id": "6ff2aa95", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "node1\tlabel\tnode2\tid\n", "a_schwarzenegger\tname\t'arnold schwarzenegger'@de\te5\n" ] } ], "source": [ "!kgtk query -i $small_graph \\\n", " --match '(p)-[r:name]->(n)' \\\n", " --where 'n =~ \".*(.)\\\\1.*\"' \\\n", " --return 'p, r.label, lower(n) as node2, r'" ] }, { "cell_type": "markdown", "id": "ffa27b10", "metadata": {}, "source": [ "**Example**\n", "- use KGTK built-in functions to manipulate KGTK values\n", "- here we change a symbol to a string (which is a legal `node1` in KGTK data model)" ] }, { "cell_type": "code", "execution_count": 29, "id": "da737147", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "node1\tlabel\tnode2\tid\n", "\"a_schwarzenegger\"\tname\t'Arnold Schwarzenegger'@de\te5\n" ] } ], "source": [ "!kgtk query -i $small_graph \\\n", " --match '(p)-[r:name]->(n)' \\\n", " --where 'n =~ \".*(.)\\\\1.*\"' \\\n", " --return 'kgtk_stringify(p) as node1, r.label, n, r'" ] }, { "cell_type": "markdown", "id": "472d713e", "metadata": {}, "source": [ "**Example**\n", "- access components of structured literals with property syntax (treat them as *virtual properties*)\n", "- but this is just syntactic sugar for calling the KGTK function\n", "- extra output column is named with column path syntax (`lang` of `node2`)\n", "- special characters need to be backtick-quoted" ] }, { "cell_type": "code", "execution_count": 30, "id": "afd031b8", "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id\tnode1\tlabel\tnode2\tnode2;lang\n", "e5\ta_schwarzenegger\tname\t'arnold schwarzenegger'@de\tde\n" ] } ], "source": [ "!kgtk query -i $small_graph \\\n", " --match '(p)-[r:name]->(n)' \\\n", " --where 'kgtk_lqstring(n) and n.kgtk_lqstring_lang != \"en\"' \\\n", " --return 'r, p, r.label, lower(n) as node2, n.kgtk_lqstring_lang as `node2;lang`'" ] }, { "cell_type": "markdown", "id": "e65fa69d", "metadata": {}, "source": [ "## Querying connected edges through graph patterns" ] }, { "cell_type": "markdown", "id": "c5f8938d", "metadata": {}, "source": [ "- in KGs we will often want to combine multiple edges into a query\n", "- connection could be within a graph or across graphs or both\n", "- in database parlance this is generally called a *join*\n", "- in Kypher (or Cypher) we can express such queries very concisely using graph patterns\n", "- this ASCII pattern language is probably the single-most useful aspect of K/Cypher\n", "- for example, find colleagues and their names:" ] }, { "cell_type": "code", "execution_count": 31, "id": "a25f411b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id\tnode1\tlabel\tnode2\n", "e3\t'Gary R. Rydstrom'@en\tcolleague\t'Gloria S. Borders'@en\n", "e4\t'Gary R. Rydstrom'@en\tcolleague\t'Arnold Schwarzenegger'@de\n" ] } ], "source": [ "!kgtk query -i $small_graph \\\n", " --match '(na)<-[:name]-(a)-[r:colleague]->(b)-[:name]->(nb)' \\\n", " --return 'r, na as node1, r.label, nb as node2'" ] }, { "cell_type": "markdown", "id": "0ace1986", "metadata": {}, "source": [ "**Example**\n", "- combine path patterns with `--where` expression for more elaborate filtering\n", "- below we select starting edges where at least one of the nodes has a German name" ] }, { "cell_type": "code", "execution_count": 32, "id": "955b4593", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id\tnode1\tlabel\tnode2\n", "e4\t'Gary R. Rydstrom'@en\tcolleague\t'Arnold Schwarzenegger'@de\n" ] } ], "source": [ "!kgtk query -i $small_graph \\\n", " --match '(na)<-[:name]-(a)-[r:colleague]->(b)-[:name]->(nb)' \\\n", " --where 'na.kgtk_lqstring_lang = \"de\" OR nb.kgtk_lqstring_lang = \"de\"' \\\n", " --return 'r, na as node1, r.label, nb as node2'" ] }, { "cell_type": "markdown", "id": "3b6840fc", "metadata": {}, "source": [ "**Example**\n", "- it is good practice to only name pattern variables that are actually needed and leave others anonymous" ] }, { "cell_type": "code", "execution_count": 33, "id": "09fdf20d", "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id\tnode1\tlabel\tnode2\n", "e3\t'Gary R. Rydstrom'@en\tcolleague\t'Gloria S. Borders'@en\n", "e4\t'Gary R. Rydstrom'@en\tcolleague\t'Arnold Schwarzenegger'@de\n" ] } ], "source": [ "!kgtk query -i $small_graph \\\n", " --match '(na)<-[:name]-()-[r:colleague]->()-[:name]->(nb)' \\\n", " --return 'r, na as node1, r.label, nb as node2'" ] }, { "cell_type": "markdown", "id": "4ec2f765", "metadata": {}, "source": [ "## Querying connected edges across multiple graphs" ] }, { "cell_type": "markdown", "id": "8061cef7", "metadata": {}, "source": [ "- Kypher can combine information from different graphs described in separate KGTK files\n", "- allows one to mix and match information, or to organize data into different logical parts\n", "- to query multiple graphs we need to specify two or more input files\n", "- *graph variables* are used to associate pattern clauses with the graph they apply to\n", "- for graphs to connect they need to have some node or edge IDs in common\n", "- **Kypher vs. Cypher*: Cypher does not address multi-graph queries\n", "- to illustrate this we use a second example graph of film data shown here:" ] }, { "cell_type": "code", "execution_count": 34, "id": "98faf82c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/Users/amandeep/isi-kgtk-tutorial/tutorial-kypher_input/small-films.tsv\n" ] } ], "source": [ "!echo $small_films" ] }, { "cell_type": "code", "execution_count": 35, "id": "2834c11b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id\tnode1\tlabel\tnode2\n", "t1\tterminator2_jd\tname\t'Terminator 2'@en\n", "t2\tterminator2_jd\tinstance_of\tfilm\n", "t3\tterminator2_jd\tgenre\tscience_fiction\n", "t4\tterminator2_jd\tgenre\taction\n", "t5\tterminator2_jd\tcast\ta_schwarzenegger\n", "t6\tterminator2_jd\tcast\tl_hamilton\n", "t7\tterminator2_jd\tcrew\tg_rydstrom\n", "t8\tterminator2_jd\tcrew\tg_borders\n", "t9\tterminator2_jd\taward\tacademy-best-sound-editing\n", "t10\tterminator2_jd\tpublication_date\t^1991-07-03T00:00:00Z/11\n", "t11\tterminator2_jd\tpublication_date\t^1995-07-01T00:00:00Z/11\n", "t12\tfilm\tsubclass_of\tvisual_artwork\n", "t13\ts_connor\tname\t'Sarah Connor'@en\n", "t14\tterminator\tname\t'Terminator'@en\n" ] } ], "source": [ "!kgtk query -i $small_films --match '()-[]->()'" ] }, { "cell_type": "markdown", "id": "a5b19871", "metadata": {}, "source": [ "**Example**\n", "- let's query for people whose spouse is a film actor\n", "- a graph variable followed by `:` means all clauses following it are to be applied to that graph,\n", " until overridden by another graph variable\n", "- graph variables are matched to input files in a simple greedy matching scheme (see manual),\n", " for example, here we use a substring of each input file name\n", "- multiple edges in the pattern are separated by commas to allow for graph switching" ] }, { "cell_type": "code", "execution_count": 36, "id": "3100a084", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id\tnode1\tlabel\tnode2\tnode2;acted_in\n", "e1\tm_shriver\tspouse\ta_schwarzenegger\tterminator2_jd\n" ] } ], "source": [ "!kgtk query -i $small_graph -i $small_films \\\n", " --match 'graph: (x)-[r:spouse]->(y), \\\n", " films: (y)<-[:cast]-(f)-[:instance_of]->(:film)' \\\n", " --return 'r, x, r.label, y, f as `node2;acted_in`'" ] }, { "cell_type": "markdown", "id": "86f272e7", "metadata": {}, "source": [ "**Example**\n", "- let's further restrict that the acting had to be done before 1985" ] }, { "cell_type": "code", "execution_count": 37, "id": "1fb135c4", "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id\tnode1\tlabel\tnode2\tnode2;acted_in\tnode2;acted_when\n", "e1\tm_shriver\tspouse\ta_schwarzenegger\tterminator2_jd\t^1991-07-03T00:00:00Z/11\n" ] } ], "source": [ "!kgtk query -i $small_graph -i $small_films \\\n", " --match 'graph: (x)-[r:spouse]->(y), \\\n", " films: (y)<-[:cast]-(f)-[:instance_of]->(:film), \\\n", " (f)-[:publication_date]->(d)' \\\n", " --where 'd < \"^1995\"' \\\n", " --return 'r, x, r.label, y, f as `node2;acted_in`, d as `node2;acted_when`'" ] }, { "cell_type": "markdown", "id": "040183db", "metadata": {}, "source": [ "## Aggregation" ] }, { "cell_type": "markdown", "id": "878f3af7", "metadata": {}, "source": [ "- similar to SQL and Cypher, Kypher supports aggregation functions such\n", " as `count`, `min`, `max`, `avg`, etc.\n", "- one of the simplest and most frequently used operations is counting rows via `count`" ] }, { "cell_type": "markdown", "id": "3a27616e", "metadata": {}, "source": [ "**Example**\n", "- count how many edges have `terminator2_jd` as their starting node" ] }, { "cell_type": "code", "execution_count": 38, "id": "d093572e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "N\n", "11\n" ] } ], "source": [ "!kgtk query -i $small_films \\\n", " --match '(:terminator2_jd)-[r]->()' \\\n", " --return 'count(r) as N'" ] }, { "cell_type": "markdown", "id": "c24a502d", "metadata": {}, "source": [ "**Example**\n", "- counts may include duplicate values (every selected edge will be counted)\n", "- we can use `distinct` keyword as the first argument to `count` (or any aggregation function) to exclude duplicates" ] }, { "cell_type": "code", "execution_count": 39, "id": "3d5e4e83", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "N\n", "7\n" ] } ], "source": [ "!kgtk query -i $small_films \\\n", " --match '(:terminator2_jd)-[r]->()' \\\n", " --return 'count(distinct r.label) as N'" ] }, { "cell_type": "markdown", "id": "838358eb", "metadata": {}, "source": [ "**Example**\n", "- Kypher does not have an explicit `group by` clause (different from SPARQL or SQL)\n", "- grouping is inferred from clause type and order in the `return` statement\n", " and sorts rows into into groups before an aggregation operation is applied to each group\n", "- below we group by relation label and then select the maximum `node2` for each label (lexicographically)" ] }, { "cell_type": "code", "execution_count": 40, "id": "28935f79", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "label\tnode2\tnode1\tid\n", "award\tacademy-best-sound-editing\tterminator2_jd\tt9\n", "cast\tl_hamilton\tterminator2_jd\tt6\n", "crew\tg_rydstrom\tterminator2_jd\tt7\n", "genre\tscience_fiction\tterminator2_jd\tt3\n", "instance_of\tfilm\tterminator2_jd\tt2\n", "name\t'Terminator'@en\tterminator\tt14\n", "publication_date\t^1995-07-01T00:00:00Z/11\tterminator2_jd\tt11\n", "subclass_of\tvisual_artwork\tfilm\tt12\n" ] } ], "source": [ "!kgtk query -i $small_films \\\n", " --match '(x)-[r]->(y)' \\\n", " --return 'r.label, max(y) as node2, x, r'" ] }, { "cell_type": "markdown", "id": "bc9ea6c6", "metadata": { "lines_to_next_cell": 2 }, "source": [ "**Mildly tricky**: the query applied the `max` function to groups of\n", "result rows where `r.label` had the same value. But for this to\n", "work we had to move the other output variables `x` and `r` to the\n", "end, otherwise they would have been the grouping criterion (that's\n", "the drawback of implicit grouping)." ] }, { "cell_type": "markdown", "id": "f35835f3", "metadata": {}, "source": [ "## Optional match" ] }, { "cell_type": "markdown", "id": "9e44058a", "metadata": {}, "source": [ "- real-world KGs often only have partial coverage of certain relations\n", "- for example, not every node might have a name or label\n", "- Kypher's optional match patterns can be used to hande incomplete information,\n", " since they are allowed to fail and will generate NULL values for such cases\n", "- each Kypher query must have exactly one strict `--match` clause and\n", " can have zero or more optional match clauses introduced by `--opt`\n", "- **Kypher vs. Cypher**: this is more than Cypher which can have any\n", " number of strict and/or optional patterns in any order\n", "- each strict and optional match clause can have its own `--where` clause\n", "- below we sometimes use a third graph of edge qualifiers shown here:" ] }, { "cell_type": "code", "execution_count": 41, "id": "16ad326b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/Users/amandeep/isi-kgtk-tutorial/tutorial-kypher_input/small-quals.tsv\n" ] } ], "source": [ "!echo $small_quals" ] }, { "cell_type": "code", "execution_count": 42, "id": "11b32803", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id\tnode1\tlabel\tnode2\n", "q1\tt5\trole\tterminator\n", "q2\tt6\trole\ts_connor\n", "q3\tt9\tpoint_in_time\t^1992-03-30T00:00:00Z/11\n", "q4 \tt9\twinner\tg_rydstrom\n", "q5 \tt9\twinner\tg_borders\n", "q6 \tt10\tlocation\tunited_states\n", "q8 \tt11\tlocation\trussia\n", "q9\te1\tstart\t^1986-04-26T08:00:00Z/11\n", "q10\te1\tend\t^2011-07-01T10:00:00Z/11\n" ] } ], "source": [ "!kgtk query -i $small_quals" ] }, { "cell_type": "markdown", "id": "d774227d", "metadata": {}, "source": [ "**Example**\n", "- start with a strict query to find people associated with a movie and\n", " any awards they may have won" ] }, { "cell_type": "code", "execution_count": 44, "id": "d1cea397", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "person\tfilm\taward\n", "g_rydstrom\tterminator2_jd\tacademy-best-sound-editing\n", "g_borders\tterminator2_jd\tacademy-best-sound-editing\n" ] } ], "source": [ "!kgtk query -i $small_graph -i $small_films -i $small_quals \\\n", " --match 'graph: (p)-[:name]->(), \\\n", " films: (f)-[:instance_of]->(:film), (f)-[]->(p), (f)-[ra:award]->(a), \\\n", " quals: (ra)-[:winner]->(p)' \\\n", " --return 'p as person, f as film, a as award'" ] }, { "cell_type": "markdown", "id": "3f394b5d", "metadata": {}, "source": [ "- result only lists some of the people, since not all participants won awards\n", "- this makes us miss some potentially useful data\n", "- to make sure we retrieve all people associated with a movie and optionally\n", " any awards they may have received, we can use this query:" ] }, { "cell_type": "code", "execution_count": 45, "id": "7a03c50b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "person\tfilm\taward\n", "a_schwarzenegger\tterminator2_jd\t\n", "l_hamilton\tterminator2_jd\t\n", "g_rydstrom\tterminator2_jd\tacademy-best-sound-editing\n", "g_borders\tterminator2_jd\tacademy-best-sound-editing\n" ] } ], "source": [ "!kgtk query -i $small_graph -i $small_films -i $small_quals \\\n", " --match 'graph: (p)-[:name]->(), \\\n", " films: (f)-[:instance_of]->(:film), (f)-[]->(p)' \\\n", " --opt 'films: (f)-[ra:award]->(a), \\\n", " quals: (ra)-[:winner]->(p)' \\\n", " --return 'p as person, f as film, a as award'" ] }, { "cell_type": "markdown", "id": "edf97278", "metadata": {}, "source": [ "- now we get all relevant people and missing awards are simply empty (or NULL)\n", "- note how edge qualifiers associate information to an edge ID (e.g., the winner of an award)\n", "- optional patterns are either fully satisfied for a set of bindings, or not at all,\n", " there are no partial matches" ] }, { "cell_type": "markdown", "id": "5d1ab442", "metadata": {}, "source": [ "**Example**\n", "- multiple independent optional clauses to also get spouses and to filter awards\n", " after 1990" ] }, { "cell_type": "code", "execution_count": 46, "id": "20da4b70", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "person\tfilm\taward\tspouse\n", "a_schwarzenegger\tterminator2_jd\t\tm_shriver\n", "l_hamilton\tterminator2_jd\t\t\n", "g_rydstrom\tterminator2_jd\tacademy-best-sound-editing\t\n", "g_borders\tterminator2_jd\tacademy-best-sound-editing\t\n" ] } ], "source": [ "!kgtk query -i $small_graph -i $small_films -i $small_quals \\\n", " --match 'graph: (p)-[:name]->(), \\\n", " films: (f)-[:instance_of]->(:film), (f)-[]->(p)' \\\n", " --opt 'films: (f)-[ra:award]->(a), \\\n", " quals: (ra)-[:winner]->(p), (ra)-[:point_in_time]->(ad)' \\\n", " --where 'ad >= 1990' \\\n", " --opt 'graph: (s)-[:spouse]->(p)' \\\n", " --return 'p as person, f as film, a as award, s as spouse'" ] }, { "cell_type": "markdown", "id": "3d137982", "metadata": {}, "source": [ "**Example**\n", "- compare that to the SQL produced" ] }, { "cell_type": "code", "execution_count": 47, "id": "e04ba8cf", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2022-03-16 15:15:43 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT graph_1_c1.\"node1\" \"_aLias.person\", graph_2_c2.\"node1\" \"_aLias.film\", graph_2_c4.\"node2\" \"_aLias.award\", graph_1_c7.\"node1\" \"_aLias.spouse\"\n", " FROM graph_1 AS graph_1_c1\n", " INNER JOIN graph_2 AS graph_2_c2, graph_2 AS graph_2_c3\n", " ON graph_1_c1.\"node1\" = graph_2_c3.\"node2\"\n", " AND graph_2_c2.\"node1\" = graph_2_c3.\"node1\"\n", " AND graph_1_c1.\"label\" = ?\n", " AND graph_2_c2.\"label\" = ?\n", " AND graph_2_c2.\"node2\" = ?\n", " LEFT JOIN (graph_2 AS graph_2_c4\n", " INNER JOIN graph_3 AS graph_3_c5, graph_3 AS graph_3_c6\n", " ON graph_2_c4.\"id\" = graph_3_c5.\"node1\"\n", " AND graph_2_c4.\"id\" = graph_3_c6.\"node1\"\n", " AND graph_2_c4.\"label\" = ?\n", " AND graph_3_c5.\"label\" = ?\n", " AND graph_3_c6.\"label\" = ?\n", " AND (graph_3_c6.\"node2\" >= ?))\n", " ON graph_1_c1.\"node1\" = graph_3_c5.\"node2\"\n", " AND graph_2_c2.\"node1\" = graph_2_c4.\"node1\"\n", " LEFT JOIN graph_1 AS graph_1_c7\n", " ON graph_1_c1.\"node1\" = graph_1_c7.\"node2\"\n", " AND graph_1_c7.\"label\" = ?\n", " PARAS: ['name', 'instance_of', 'film', 'award', 'winner', 'point_in_time', 1990, 'spouse']\n", "---------------------------------------------\n", "person\tfilm\taward\tspouse\n", "a_schwarzenegger\tterminator2_jd\t\tm_shriver\n", "l_hamilton\tterminator2_jd\t\t\n", "g_rydstrom\tterminator2_jd\tacademy-best-sound-editing\t\n", "g_borders\tterminator2_jd\tacademy-best-sound-editing\t\n" ] } ], "source": [ "!kgtk --debug query -i $small_graph -i $small_films -i $small_quals \\\n", " --match 'graph: (p)-[:name]->(), \\\n", " films: (f)-[:instance_of]->(:film), (f)-[]->(p)' \\\n", " --opt 'films: (f)-[ra:award]->(a), \\\n", " quals: (ra)-[:winner]->(p), (ra)-[:point_in_time]->(ad)' \\\n", " --where 'ad >= 1990' \\\n", " --opt 'graph: (s)-[:spouse]->(p)' \\\n", " --return 'p as person, f as film, a as award, s as spouse'" ] }, { "cell_type": "markdown", "id": "8b7017cf", "metadata": { "lines_to_next_cell": 2 }, "source": [ "- see the manual for more examples\n", "- e.g., how to do `NOT EXISTS` with an optional clause" ] }, { "cell_type": "markdown", "id": "27883225", "metadata": {}, "source": [ "## Full-text search" ] }, { "cell_type": "markdown", "id": "17fc223a", "metadata": {}, "source": [ "- Kypher supports efficient full-text search over large graph data via SQLite's FTS5 module\n", "- matching uses specialized indexes which can be created easily and efficiently" ] }, { "cell_type": "markdown", "id": "bd6efd32", "metadata": {}, "source": [ "**Example**\n", "- we define a text index on the `node2` column of `GRAPH` and then use `textmatch` to match against `node2` values\n", "- multiple indexes can be defined on the same graph column, we can provide names to disambiguate\n", "- **minor caveat**: use of an older Python version requires us to provide extra index definition options,\n", " since the default `trigram` tokenizer is currently only supported in Python 3.9 or later\n", "- match scores are negative with the best being the smallest (most negative) " ] }, { "cell_type": "code", "execution_count": 48, "id": "d00ef3eb", "metadata": {}, "outputs": [], "source": [ "!rm -f $KGTK_GRAPH_CACHE # bug workaround" ] }, { "cell_type": "code", "execution_count": 49, "id": "b5280356", "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "node1\tlabel\tnode2\tscore\n", "m_shriver\tspouse\ta_schwarzenegger\t-0.6991202559002525\n", "g_rydstrom\tcolleague\ta_schwarzenegger\t-0.6991202559002525\n", "a_schwarzenegger\tname\t'Arnold Schwarzenegger'@de\t-0.5994217687032093\n" ] } ], "source": [ "!kgtk query -i $small_graph --idx auto text:node2//name=myidx//tokenize=ascii//prefix=3 \\\n", " --match '(x)-[r]->(y)' \\\n", " --where 'textmatch(y, \"sch*\")' \\\n", " --return 'x, r.label, y, matchscore(y) as score' \\\n", " --order 'score'" ] }, { "cell_type": "markdown", "id": "c723779c", "metadata": {}, "source": [ "**Example**\n", "- `textmatch` patterns use a phrase-based language that allows\n", " multi-word phrases, Boolean expressions, multi-column expressions, suffix patterns, etc. (see manual)\n", "- here is an example of a Boolean expression\n", "- note that the index only needs to be specified the first time around, repeating it is a no-op" ] }, { "cell_type": "code", "execution_count": 50, "id": "8824450f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "node1\tlabel\tnode2\tscore\n", "l_hamilton\tname\t'Linda Hamilton'@en\t-1.6796313036328494\n", "m_shriver\tspouse\ta_schwarzenegger\t-0.6991202559002525\n", "g_rydstrom\tcolleague\ta_schwarzenegger\t-0.6991202559002525\n", "a_schwarzenegger\tname\t'Arnold Schwarzenegger'@de\t-0.5994217687032093\n" ] } ], "source": [ "!kgtk query -i $small_graph \\\n", " --match '(x)-[r]->(y)' \\\n", " --where 'textmatch(y, \"sch* OR linda\")' \\\n", " --return 'x, r.label, y, matchscore(y) as score' \\\n", " --order 'score'" ] }, { "cell_type": "markdown", "id": "766202b2", "metadata": {}, "source": [ "The `trigram` tokenizer (available with Python 3.9) also supports\n", "case-insensitive SQL `LIKE` patterns and case-sensitive `GLOB`\n", "patterns (see manual)." ] }, { "cell_type": "markdown", "id": "84bc14e9", "metadata": {}, "source": [ "**Example**\n", "- text indexes can be qualified with their name to select between multiple options" ] }, { "cell_type": "code", "execution_count": 51, "id": "c21dada9", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "node1\tlabel\tnode2\tscore\n", "m_shriver\tspouse\ta_schwarzenegger\t-0.6991202559002525\n", "g_rydstrom\tcolleague\ta_schwarzenegger\t-0.6991202559002525\n", "a_schwarzenegger\tname\t'Arnold Schwarzenegger'@de\t-0.5994217687032093\n" ] } ], "source": [ "!kgtk query -i $small_graph \\\n", " --match '(x)-[r]->(y)' \\\n", " --where 'textmatch(myidx.y, \"schw*\")' \\\n", " --return 'x, r.label, y, matchscore(myidx.y) as score' \\\n", " --order 'score'" ] }, { "cell_type": "markdown", "id": "515437d9", "metadata": {}, "source": [ "**Example**\n", "- a text index may index more than one column\n", "- then the text match expression can use column-specific filters" ] }, { "cell_type": "code", "execution_count": 52, "id": "e223c619", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "node1\tlabel\tnode2\tscore\n", "g_rydstrom\tcolleague\tg_borders\t-1.391983808592556\n", "g_borders\tname\t'Gloria S. Borders'@en\t-1.1761930172126576\n" ] } ], "source": [ "!kgtk query -i $small_graph --idx auto text:node1,node2//name=multi//tokenize=ascii//prefix=3 \\\n", " --match '(x)-[r]->(y)' \\\n", " --where 'textmatch(multi.r, \"node1: g* AND node2 : borders\")' \\\n", " --return 'x, r.label, y, matchscore(multi.r) as score' \\\n", " --order 'score'" ] }, { "cell_type": "code", "execution_count": 53, "id": "aa5b9961", "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "node1\tlabel\tnode2\tscore\n", "g_rydstrom\tcolleague\ta_schwarzenegger\t-0.2149880837001077\n", "g_rydstrom\tname\t'Gary R. Rydstrom'@en\t-0.18165978747100012\n" ] } ], "source": [ "!kgtk query -i $small_graph \\\n", " --match '(x)-[r]->(y)' \\\n", " --where 'textmatch(multi.r, \"node1: g* NOT node2 : borders\")' \\\n", " --return 'x, r.label, y, matchscore(multi.r) as score' \\\n", " --order 'score'" ] }, { "cell_type": "markdown", "id": "f639e370", "metadata": {}, "source": [ "## Input and output specification options" ] }, { "cell_type": "markdown", "id": "6d2198c1", "metadata": {}, "source": [ "```\n", " -i INPUT_FILE [INPUT_FILE ...], --input-files INPUT_FILE [INPUT_FILE ...]\n", " One or more input files to query (maybe compressed).\n", " (Required, use '-' for stdin.)\n", " --as NAME alias name to be used for preceding input\n", " --comment COMMENT comment string to store for the preceding input\n", " (displayed by --show-cache)\n", " --no-header do not generate a header row with column names\n", " -o OUTPUT, --out OUTPUT\n", " output file to write to, if `-' (the default) output\n", " goes to stdout. Files with extensions .gz, .bz2 or .xz\n", " will be appropriately compressed.\n", "```" ] }, { "cell_type": "markdown", "id": "ef7fdcd3", "metadata": {}, "source": [ "**Example**" ] }, { "cell_type": "code", "execution_count": 54, "id": "7fb32f69", "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "!kgtk query -i $small_graph -i $small_films --as works --comment 'Film information' --no-header \\\n", " --match 'graph: (x)-[]->(y), works: (f)-[r]->(x)' \\\n", " --return 'r, f, r.label, x' \\\n", " -o $TEMP/example-query.tsv.gz" ] }, { "cell_type": "markdown", "id": "e62f7dca", "metadata": {}, "source": [ "## Graph cache" ] }, { "cell_type": "markdown", "id": "ed49d72e", "metadata": {}, "source": [ "- when input files are queries they are first imported into SQLite database tables\n", "- this is very fast but can be noticable for larger data\n", "- import of WD claims file with 1B edges (~16 GB compressed) takes 20 minutes on laptop\n", "- resulting database is cached into a *graph cache* to amortize this over multiple queries\n", "- indexes built to speed up queries or text search are also cached\n", "- if data change is detected, data will be re-imported and re-indexed automatically" ] }, { "cell_type": "markdown", "id": "cf1e0351", "metadata": {}, "source": [ "- location of the cache file can be controlled with `--graph-cache FILE` or the `KGTK_GRAPH_CACHE`\n", " environment variable\n", "- otherwise a default location in `/tmp` will be used" ] }, { "cell_type": "markdown", "id": "5168ef96", "metadata": {}, "source": [ "- cache is a *true cache*, that is it is safe to delete it and it will be rebuilt automatically\n", " (as long as the relevant data files are still available)\n", "- however, we increasingly find cases where the cache is explicitly managed, for example,\n", " to support the browser or to ship large data to somebody else\n", "- cache files can become very large and should be on an SSD drive for best performance" ] }, { "cell_type": "markdown", "id": "b5e9f195", "metadata": {}, "source": [ "**Example**\n", "- display the current content of the cache with `--show-cache`" ] }, { "cell_type": "code", "execution_count": 55, "id": "ee45124d", "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Graph Cache:\n", "DB file: /tmp/projects/tutorial-kypher/temp.tutorial-kypher/wikidata.sqlite3.db\n", " size: 72.00 KB \tfree: 0 Bytes \tmodified: 2022-03-16 15:16:57\n", "\n", "KGTK File Information:\n", "/Users/amandeep/isi-kgtk-tutorial/tutorial-kypher_input/small-graph.tsv:\n", " size: 382 Bytes \tmodified: 2022-03-16 15:05:47 \tgraph: graph_1\n", "works:\n", " size: 590 Bytes \tmodified: 2022-03-16 15:05:48 \tgraph: graph_2\n", " comment: Film information\n", "\n", "Graph Table Information:\n", "graph_1:\n", " size: 40.00 KB \tcreated: 2022-03-16 15:16:15\n", " header: ['id', 'node1', 'label', 'node2']\n", "graph_2:\n", " size: 12.00 KB \tcreated: 2022-03-16 15:16:57\n", " header: ['id', 'node1', 'label', 'node2']\n" ] } ], "source": [ "!kgtk query --show-cache" ] }, { "cell_type": "markdown", "id": "18ec64d4", "metadata": {}, "source": [ "## Defining and using custom functions" ] }, { "cell_type": "markdown", "id": "b87f16f5", "metadata": {}, "source": [ "- when provided built-in functions are not enough, custom functions can be executed via Python" ] }, { "cell_type": "markdown", "id": "ca613f46", "metadata": {}, "source": [ "**Example**\n", "- perform some random modifications on the `GRAPH` data\n", "- note that values returned by `pyeval` and `pycall` must be simple literals" ] }, { "cell_type": "code", "execution_count": 56, "id": "d6032d24", "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "name\tswapname\tisodd\tuuid\n", "'Arnold Schwarzenegger'@de\t'aRNOLD sCHWARZENEGGER'@DE\t0.0\ta67a6524-e993-4f9e-b09f-ac99c9d39c0f\n", "'Linda Hamilton'@en\t'lINDA hAMILTON'@EN\t1.0\tb8c68315-c455-477c-ac90-675ae5f4f127\n", "'Gary R. Rydstrom'@en\t'gARY r. rYDSTROM'@EN\t1.0\t42020bf4-47aa-43f5-a73d-e2d1b8482fa8\n", "'Gloria S. Borders'@en\t'gLORIA s. bORDERS'@EN\t0.0\tff6098af-5432-4c98-a47b-acdca0e2a655\n", "'Maria Shriver'@en\t'mARIA sHRIVER'@EN\t0.0\t4c094d9e-515a-4a6d-be77-e75feda64cf4\n" ] } ], "source": [ "!kgtk query -i $small_graph --import 'uuid, math as m' \\\n", " --match '(x)-[r:name]->(y)' \\\n", " --where 'kgtk_lqstring(y)' \\\n", " --return 'y as name, \\\n", " pyeval(printf($FMT, y)) as swapname, \\\n", " pycall(\"m.fmod\", length(y), 2) as isodd, \\\n", " pycall(\"uuid.uuid4\") as uuid' \\\n", " --para FMT='\"%s\".swapcase()'" ] }, { "cell_type": "markdown", "id": "2d53abb0", "metadata": {}, "source": [ "## Debugging" ] }, { "cell_type": "markdown", "id": "7e2fee34", "metadata": {}, "source": [ "**Example**\n", "- `query` accepts the `--debug` and `--expert` options to show additional output\n", "- this shows query translation, data import and indexing, etc.\n", "- here we can see how built-in functions are called directly in SQL:" ] }, { "cell_type": "code", "execution_count": 57, "id": "9c61736f", "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2022-03-16 15:17:33 sqlstore]: IMPORT graph directly into table graph_1 from /Users/amandeep/isi-kgtk-tutorial/tutorial-kypher_input/small-graph.tsv ...\n", "[2022-03-16 15:17:33 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT *\n", " FROM graph_1 AS graph_1_c1\n", " WHERE graph_1_c1.\"label\" = ?\n", " AND (kgtk_lqstring_lang(graph_1_c1.\"node2\") = ?)\n", " PARAS: ['name', 'de']\n", "---------------------------------------------\n", "[2022-03-16 15:17:33 sqlstore]: CREATE INDEX \"graph_1_label_idx\" ON \"graph_1\" (\"label\")\n", "[2022-03-16 15:17:33 sqlstore]: ANALYZE \"graph_1_label_idx\"\n", "id\tnode1\tlabel\tnode2\n", "e5\ta_schwarzenegger\tname\t'Arnold Schwarzenegger'@de\n" ] } ], "source": [ "!rm -f $KGTK_GRAPH_CACHE\n", "\n", "!kgtk --debug query -i $small_graph \\\n", " --match '(p)-[r:name]->(n)' \\\n", " --where 'n.kgtk_lqstring_lang = \"de\"'" ] }, { "cell_type": "markdown", "id": "c0f3bd4a", "metadata": {}, "source": [ "- run it again to show how imported data and indexes are reused" ] }, { "cell_type": "code", "execution_count": 58, "id": "a0272483", "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2022-03-16 15:17:42 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT *\n", " FROM graph_1 AS graph_1_c1\n", " WHERE graph_1_c1.\"label\" = ?\n", " AND (kgtk_lqstring_lang(graph_1_c1.\"node2\") = ?)\n", " PARAS: ['name', 'de']\n", "---------------------------------------------\n", "id\tnode1\tlabel\tnode2\n", "e5\ta_schwarzenegger\tname\t'Arnold Schwarzenegger'@de\n" ] } ], "source": [ "!kgtk --debug query -i $small_graph \\\n", " --match '(p)-[r:name]->(n)' \\\n", " --where 'n.kgtk_lqstring_lang = \"de\"'" ] }, { "cell_type": "markdown", "id": "f57eb238", "metadata": {}, "source": [ "## Querying based on edge qualifiers" ] }, { "cell_type": "markdown", "id": "cc83ef1c", "metadata": {}, "source": [ "- one of the motivations for developing Kypher was the \"Wikidata time machine use case\"\n", "- find all statements with temporal annotations after a certain time and remove them\n", " to only retain facts known before that time\n", "- in Wikidata temporal annotations are attached via edge qualifiers\n", "- in KGTK those are represented via edges linked to another edge's ID" ] }, { "cell_type": "markdown", "id": "19f71825", "metadata": {}, "source": [ "**Example**\n", "1. look for base edges in the `GRAPH` graph\n", "2. link to qualifiers in the `QUALS` graph via edge id `r`\n", "3. restrict the qualifiers based on edge labels\n", " `ql` that are listed in the `PROPS` graph\n", "4. restrict to edges that have a time with year of at most 2000\n", "5. output the qualifying base edges with their temporal annotations" ] }, { "cell_type": "code", "execution_count": 59, "id": "4c70b1bd", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id\tnode1\tlabel\tnode2\n", "q1\tt5\trole\tterminator\n", "q2\tt6\trole\ts_connor\n", "q3\tt9\tpoint_in_time\t^1992-03-30T00:00:00Z/11\n", "q4 \tt9\twinner\tg_rydstrom\n", "q5 \tt9\twinner\tg_borders\n", "q6 \tt10\tlocation\tunited_states\n", "q8 \tt11\tlocation\trussia\n", "q9\te1\tstart\t^1986-04-26T08:00:00Z/11\n", "q10\te1\tend\t^2011-07-01T10:00:00Z/11\n" ] } ], "source": [ "!kgtk query -i $small_quals" ] }, { "cell_type": "code", "execution_count": 60, "id": "af9b03bc", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id\tnode1\tlabel\tnode2\n", "p11\tstart\tmember\tset1\n", "p12\tend\tmember\tset1\n", "p12\tpoint_in_time\tmember\tset1\n" ] } ], "source": [ "!kgtk query -i $small_props" ] }, { "cell_type": "markdown", "id": "cd7867ea", "metadata": {}, "source": [ "- the crucial part of the query below is how we use the `id` of the base edge `r` as the `node1` of the qualifier edge `q` whose label `ql` has to be one of the properties listed in `PROPS`\n", "- we also need to use Kypher property syntax to introduce the match variable `ql`" ] }, { "cell_type": "code", "execution_count": 61, "id": "a96b0621", "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id\tnode1\tlabel\tnode2\ttrel\ttime\n", "e1\tm_shriver\tspouse\ta_schwarzenegger\tstart\t^1986-04-26T08:00:00Z/11\n" ] } ], "source": [ "!kgtk query -i $small_graph -i $small_quals -i $small_props \\\n", " --match \"graph: (x)-[r]->(y), \\\n", " quals: (r)-[q {label: ql}]->(time), \\\n", " props: (ql)-[]->()\" \\\n", " --where \"time.kgtk_date_year <= 2000\" \\\n", " --return \"r as id, x, r.label, y, ql as trel, time as time\"" ] }, { "cell_type": "markdown", "id": "16b1ac79", "metadata": {}, "source": [ "## Further reading" ] }, { "cell_type": "markdown", "id": "b87e6aa7", "metadata": {}, "source": [ "- [Kypher manual](https://github.com/usc-isi-i2/kgtk/blob/dev/docs/transform/query.md)\n", " - [Kypher vs. Cypher and unfinished features](https://github.com/usc-isi-i2/kgtk/blob/dev/docs/transform/query.md#differences-to-cypher)\n", " - [Kypher API](https://github.com/usc-isi-i2/kgtk/blob/dev/kgtk/kypher/api.py)\n", "- [KGTK manual](https://kgtk.readthedocs.io/en/dev/)\n", "- [KGTK data model](https://kgtk.readthedocs.io/en/dev/data_model/)\n", "- [KGTK GitHub site](https://github.com/usc-isi-i2/kgtk)\n", "- [KGTK Tutorial use cases](https://github.com/usc-isi-i2/kgtk-notebooks/tree/main/tutorial)\n", "- [KGTK use cases](https://github.com/usc-isi-i2/kgtk/tree/dev/use-cases)\n", "- [openCypher](https://www.opencypher.org/)\n", "- [Cypher](https://neo4j.com/developer/cypher/)" ] } ], "metadata": { "jupytext": { "formats": "ipynb,py:light" }, "kernelspec": { "display_name": "kgtk-env", "language": "python", "name": "kgtk-env" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 5 }