{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# TPC-H Query 5 – Graphs as Clans\n", "\n", "This IPython notebook is part of a series of tutorials that introduce how data algebra facilitates querying data from multiple sources and in different structures, on the example of a modified [TPC-H][] query. \n", "\n", "The tutorials assume basic familiarity with our library; we suggest working through our [Hello_World][] tutorial first. They also assume basic knowledge of relational data, RDF and XML. \n", "\n", "In some cases, later parts of the tutorial assume knowledge of concepts introduced in earlier parts, so it is best to work through them in the listed sequence:\n", "\n", "- [1-Introduction][]: Introduces this series of tutorials, TPC-H, the TPC-H query 5 and our modifications to it.\n", "- [2-Tables][]: Introduces our representation of tabular data, on the example of CSV.\n", "- **[3-Graphs][] (this tutorial)**: Introduces our representation of RDF-style tabular graph data, on the example of Turtle.\n", "- [4-Hierarchies][]: Introduces our representation of hierarchical data, on the example of XML.\n", "- [5-Query][]: Brings it all together and explains the whole query.\n", "\n", "[TPC-H]: (TPC-H Benchmark Main Page)\n", "[Hello_World]: <../Hello_World.ipynb> (IPython Notebook: Hello World)\n", "[1-Introduction]: <1-Introduction.ipynb> (IPython Notebook: TPC-H Query 5 - Introduction)\n", "[2-Tables]: <2-Tables.ipynb> (IPython Notebook: TPC-H Query 5 - Tables)\n", "[3-Graphs]: <3-Graphs.ipynb> (IPython Notebook: TPC-H Query 5 - Graphs)\n", "[4-Hierarchies]: <4-Hierarchies.ipynb> (IPython Notebook: TPC-H Query 5 - Hierarchies)\n", "[5-Query]: <5-Query.ipynb> (IPython Notebook: TPC-H Query 5 - Query)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Graph Data In the Query\n", "\n", "We converted the table `supplier` from the original data into an RDF graph `supplier`. In this tutorial, we present our clan representation of RDF graphs, how to import them from [N-Triples][] and [Turtle][] formats and how to execute simple SPARQL matching (in the form of basic graph patterns) on the example of the SPARQL 'subquery' that is embedded in our modified query 5:\n", "\n", "``` sparql\n", "(\n", " # This is a pseudo-subquery in SPARQL. It extracts a table with the\n", " # columns 'suppkey' and 'nationkey' from the RDF graph 'supplier'\n", " # and injects them into the outer query as table 'supplier'.\n", " SELECT \n", " ?suppkey, ?nationkey\n", " FROM \n", " supplier\n", " WHERE {\n", " ?supplier ?suppkey .\n", " ?supplier ?nationkey .\n", " }\n", ") AS supplier_solutions\n", "```\n", "\n", "This query creates a table with the columns `suppkey` and `nationkey` that associates supplier keys with the nation keys of the suppliers' nation.\n", "\n", "[N-Triples]: (RDF 1.1 N-Triples - A line-based syntax for an RDF graph)\n", "[Turtle]: (RDF 1.1 Turtle - Terse RDF Triple Language)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Representation of RDF Graphs as Clans\n", "\n", "The common RDF graph serialization formats provide the graph data in tabular form, with three columns that represent subject, predicate and object. This applies most clearly to the [N-Triples][] format; [Turtle][] and other formats provide syntactic shortcuts that allow for example repetition of subject over predicate/object lists or subject and predicate over object lists, but this doesn't invalidate the principle.\n", "\n", "With this, our internal representation is not different from the one we use for tables; such a graph is represented as a [regular clan][] with the lefts `s`, `p` and `o` (representing subject, predicate and object, respectively). The description of our internal [representation of tables][] applies here as well. What makes RDF graphs different from 'normal' tables are the operations that are used for the common questions. \n", "\n", "Our Turtle import function is not a full-featured RDF implementation. The most important limitations are:\n", "\n", "- Escape sequences are not fully supported.\n", "- Unicode may or may not work.\n", "- No support for blank nodes.\n", "- No support for predicate-object lists and object lists.\n", "- Type support: Plain integers like `15` are converted into integer atoms, plain strings like `\"str\"` are converted into string atoms, and IRIs like `` are converted into `rdflib.URIRef` instances. There is no support for the IRI-based RDF type system.\n", "\n", "\n", "[N-Triples]: (RDF 1.1 N-Triples - A line-based syntax for an RDF graph)\n", "[Turtle]: (RDF 1.1 Turtle - Terse RDF Triple Language)\n", "[regular clan]: ((left-)regular clan)\n", "[Representation of Tables]: <2-Tables.ipynb#Representation-of-Tables> (IPython Notebook: TPC-H Query 5 - Tables - Representation of Tables)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# The `supplier` Graph\n", "\n", "We used a simple, straightforward method to convert the original `supplier` table into an RDF graph: subjects in the graph represent rows and are IRIs derived from the `suppkey` column value (the primary key), predicates represent columns and are IRIs derived from the column name, and the objects are the cell values for a given subject (row) and predicate (column).\n", "\n", "First we import the RDF graph `supplier` from a Turtle file `supplier.ttl`.\n", "\n", "- [`algebraixlib.io.rdf`][] provides utilities for processing RDF-style graph data.\n", "- [`iprint_latex`][] is a utility that prints our [`MathObject`][]s in LaTeX format in IPython notebooks. The `short=True` argument tells it to create abbreviated output. \n", "\n", "[`algebraixlib.io.rdf`]: (module algebraixlib.io.rdf)\n", "[`iprint_latex`]: (function algebraixlib.util.latexprinter.iprint_latex)\n", "[`MathObject`]: (class algebraixlib.mathobjects.mathobject.MathObject)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/latex": [ "$$suppliers = \\left\\{\\begin{array}{l}\\left\\{\\color{red}{\\mbox{'o'}}{\\mapsto}{\\color{blue}{\\mbox{' N kD4on9...}}},\\ \\color{red}{\\mbox{'p'}}{\\mapsto}{\\color{blue}{\\mbox{rdflib.ter...}}},\\ \\color{red}{\\mbox{'s'}}{\\mapsto}{\\color{blue}{\\mbox{rdflib.ter...}}}\\right\\},\\\\\n", "\\left\\{\\color{red}{\\mbox{'o'}}{\\mapsto}{\\color{blue}{\\mbox{' slyly bo...}}},\\ \\color{red}{\\mbox{'p'}}{\\mapsto}{\\color{blue}{\\mbox{rdflib.ter...}}},\\ \\color{red}{\\mbox{'s'}}{\\mapsto}{\\color{blue}{\\mbox{rdflib.ter...}}}\\right\\},\\\\\n", "\\left\\{\\color{red}{\\mbox{'o'}}{\\mapsto}{\\color{blue}{\\mbox{'. slyly r...}}},\\ \\color{red}{\\mbox{'p'}}{\\mapsto}{\\color{blue}{\\mbox{rdflib.ter...}}},\\ \\color{red}{\\mbox{'s'}}{\\mapsto}{\\color{blue}{\\mbox{rdflib.ter...}}}\\right\\},\\\\\n", "\\left\\{\\color{red}{\\mbox{'o'}}{\\mapsto}{\\color{blue}{\\mbox{'11-383-51...}}},\\ \\color{red}{\\mbox{'p'}}{\\mapsto}{\\color{blue}{\\mbox{rdflib.ter...}}},\\ \\color{red}{\\mbox{'s'}}{\\mapsto}{\\color{blue}{\\mbox{rdflib.ter...}}}\\right\\},\\\\\n", "\\text{...}(66)\\end{array}\\right\\}$$" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import algebraixlib.io.rdf as rdf\n", "from algebraixlib.util.latexprinter import iprint_latex\n", "\n", "suppliers = rdf.import_graph('supplier.ttl')\n", "iprint_latex('suppliers', short=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The main selection and combination mechanism in SPARQL are sequences of basic graph patterns (BGPs) like \"`?supplier ?suppkey`\". Such a BGP is a sequence of two operations:\n", "\n", "- First a superstriction that selects the relations that match the given value(s). In our first BGP we match the value `` in the couplet with left part `p`.\n", "- This is followed by a composition that selects and renames the lefts of interest. The lefts of interest are the ones associated with SPARQL variables; they are one or more of the `s`, `p` or `o` lefts and are renamed to a variable name. In our first BGP, we need both the `s` and the `o` lefts and we name them `supplier` resp. `suppkey`. \n", "\n", "A superstriction is defined on sets (relations) and sets of sets (clans) as follows:\n", "\n", "$$\n", "\\begin{align*}\n", " \\text{Set superstriction}:\\ \n", " &S \\vartriangleright T \n", " &&:= \n", " \\begin{cases}\n", " S &\\text{if } S \\supset T \\\\\n", " \\text{undefined} &\\text{otherwise}\n", " \\end{cases}\n", " &\\text{for } S, T \\in P(M)\\\\\n", " \\text{Set of sets superstriction}:\\ \n", " &\\mathbb{S} \\blacktriangleright \\mathbb{T} \n", " &&:= \n", " \\{X \\vartriangleright Y\\ : X \\in \\mathbb{S} \\text{ and } Y \\in \\mathbb{T}\\}\n", " &\\text{for } \\mathbb{S}, \\mathbb{T} \\in P^2(M)\n", "\\end{align*}\n", "$$\n", "\n", "The clan (set of sets) superstriction restricts the members of one clan to the ones that are a superset of a member of the other clan. \n", "\n", "Composition essentially selects certain couplets and changes their lefts or rights. In our case, we want to select and rename the lefts `s` and `o` and rename them to `supplier` and `suppkey`, respectively.\n", "\n", "With this, the execution of a BGP results in a clan, where the variable names in the BGP are the lefts in the clan. (We strip the leading question mark from the variable names; this is merely an artifact of the SPARQL syntax.)\n", "\n", "In data algebra, we get this:\n", "\n", "$$\n", "\\begin{align*}\n", "BGP_{suppkey,matches} &= Supplier \\blacktriangleright \\{\\{p{\\mapsto}\\text{}\\}\\} \\\\\n", "BGP_{suppkey} &= BGP_{suppkey,matches} \\circ \\{\\{supplier{\\mapsto}s, suppkey{\\mapsto}o\\}\\}\n", "\\end{align*}\n", "$$\n", "\n", "- [`rdflib`][], [`rdflib.URIRef`][]: IRIs in our RDF graphs are represented as atoms with a value type of `rdflib.URIRef`, so the IRI `` becomes `rdflib.URIRef('tpch:suppkey')` in the code.\n", "- [`algebraixlib.algebras.clans`][] contains the functions that are related to our algebra of clans.\n", "- [`clans.superstrict`][] executes the clan superstriction ($\\blacktriangleright$).\n", "- [`clans.compose`][] executes the clan composition ($\\circ$).\n", "- [`clans.from_dict`][] creates a clan with a single relation. The relation represents the data in the `dict` argument, converting the keys to lefts and the values to rights.\n", "\n", "[`rdflib`]: (A Python library for working with RDF.)\n", "[`rdflib.URIRef`]: (RDF URI Reference)\n", "[`algebraixlib.algebras.clans`]: (module algebraixlib.algebras.clans)\n", "[`clans.superstrict`]: (function algebraixlib.algebras.clans.superstrict)\n", "[`clans.compose`]: (function algebraixlib.algebras.clans.compose)\n", "[`clans.from_dict`]: (function algebraixlib.algebras.clans.from_dict)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/latex": [ "$$bgp\\_suppkey = \\left\\{\\begin{array}{l}\\left\\{\\color{red}{\\mbox{'suppkey'}}{\\mapsto}{\\color{blue}{\\mbox{1}}},\\ \\color{red}{\\mbox{'supplier'}}{\\mapsto}{\\color{blue}{\\mbox{rdflib.ter...}}}\\right\\},\\\\\n", "\\left\\{\\color{red}{\\mbox{'suppkey'}}{\\mapsto}{\\color{blue}{\\mbox{10}}},\\ \\color{red}{\\mbox{'supplier'}}{\\mapsto}{\\color{blue}{\\mbox{rdflib.ter...}}}\\right\\},\\\\\n", "\\left\\{\\color{red}{\\mbox{'suppkey'}}{\\mapsto}{\\color{blue}{\\mbox{2}}},\\ \\color{red}{\\mbox{'supplier'}}{\\mapsto}{\\color{blue}{\\mbox{rdflib.ter...}}}\\right\\},\\\\\n", "\\left\\{\\color{red}{\\mbox{'suppkey'}}{\\mapsto}{\\color{blue}{\\mbox{3}}},\\ \\color{red}{\\mbox{'supplier'}}{\\mapsto}{\\color{blue}{\\mbox{rdflib.ter...}}}\\right\\},\\\\\n", "\\text{...}(6)\\end{array}\\right\\}$$" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import rdflib\n", "import algebraixlib.algebras.clans as clans\n", "\n", "bgp_suppkey_matches = clans.superstrict(suppliers, clans.from_dict({'p': rdflib.URIRef('tpch:suppkey')}))\n", "bgp_suppkey = clans.compose(bgp_suppkey_matches, clans.from_dict({'supplier': 's', 'suppkey': 'o'}))\n", "iprint_latex('bgp_suppkey', short=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For converting the second BGP \"`?supplier ?nationkey`\", we use the same technique, only this time we remove the intermediate result ($BGP_{suppkey,matches}$ in the equation above):\n", "\n", "$$\n", "BGP_{nationkey} = (Supplier \\blacktriangleright \\{\\{p{\\mapsto}\\text{}\\}\\}) \n", " \\circ \\{\\{supplier{\\mapsto}s, nationkey{\\mapsto}o\\}\\}\n", "$$\n", "\n", "In the code we did something similar, removing the intermediate variable (`bgp_suppkey_matches` in the code above):" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/latex": [ "$$bgp\\_nationkey = \\left\\{\\begin{array}{l}\\left\\{\\color{red}{\\mbox{'nationkey'}}{\\mapsto}{\\color{blue}{\\mbox{1}}},\\ \\color{red}{\\mbox{'supplier'}}{\\mapsto}{\\color{blue}{\\mbox{rdflib.ter...}}}\\right\\},\\\\\n", "\\left\\{\\color{red}{\\mbox{'nationkey'}}{\\mapsto}{\\color{blue}{\\mbox{10}}},\\ \\color{red}{\\mbox{'supplier'}}{\\mapsto}{\\color{blue}{\\mbox{rdflib.ter...}}}\\right\\},\\\\\n", "\\left\\{\\color{red}{\\mbox{'nationkey'}}{\\mapsto}{\\color{blue}{\\mbox{11}}},\\ \\color{red}{\\mbox{'supplier'}}{\\mapsto}{\\color{blue}{\\mbox{rdflib.ter...}}}\\right\\},\\\\\n", "\\left\\{\\color{red}{\\mbox{'nationkey'}}{\\mapsto}{\\color{blue}{\\mbox{14}}},\\ \\color{red}{\\mbox{'supplier'}}{\\mapsto}{\\color{blue}{\\mbox{rdflib.ter...}}}\\right\\},\\\\\n", "\\text{...}(6)\\end{array}\\right\\}$$" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "bgp_nationkey = clans.compose(clans.superstrict(suppliers, clans.from_dict({'p': rdflib.URIRef('tpch:nationkey')})), \n", " clans.from_dict({'supplier': 's', 'nationkey': 'o'}))\n", "iprint_latex('bgp_nationkey', short=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The last step of processing a sequence of BGPs is to join the clans they generated. A solution is defined as a set where every variable has exactly one value, no matter in how many BGPs it appears. Mathematically, we represent this as functional cross-union (similar to an inner join in SQL).\n", "\n", "The implicit join of all BGPs to create a set of solutions is represented by a functional cross-union of the individual BGP result clans. It is again a clan, with a column for every variable present in any of the BGPs.\n", "\n", "Typically, there are variables in a query that are only needed for the join and that are not part of the query result (`?supplier` in our example). To get rid of these, we project the result just as we did in the [`customer` Table][] example.\n", "\n", "In data algebra:\n", "\n", "$$\n", "Supplier_{Solutions} = (BGP_{suppkey} \\underset{f}{\\blacktriangledown} BGP_{nationkey}) \\circ \\{\\{nationkey{\\mapsto}nationkey, suppkey{\\mapsto}suppkey\\}\\}\n", "$$\n", "\n", "- [`clans.project`][] executes the composition of a clan with a clan diagonal.\n", "- [`clans.cross_functional_union`][] executes the functional cross-union of two clans.\n", "\n", "[`customer` Table]: <2-Tables.ipynb#The-customer-Table> (IPython Notebook: TPC-H Query 5 - Tables - The `customer` Table)\n", "[`clans.project`]: (function algebraixlib.algebras.clans.project)\n", "[`clans.cross_functional_union`]: (function algebraixlib.algebras.clans.cross_functional_union)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/latex": [ "$$supplier\\_solutions = \\left\\{\\begin{array}{l}\\left\\{\\color{red}{\\mbox{'nationkey'}}{\\mapsto}{\\color{blue}{\\mbox{1}}},\\ \\color{red}{\\mbox{'suppkey'}}{\\mapsto}{\\color{blue}{\\mbox{3}}}\\right\\},\\\\\n", "\\left\\{\\color{red}{\\mbox{'nationkey'}}{\\mapsto}{\\color{blue}{\\mbox{10}}},\\ \\color{red}{\\mbox{'suppkey'}}{\\mapsto}{\\color{blue}{\\mbox{9}}}\\right\\},\\\\\n", "\\left\\{\\color{red}{\\mbox{'nationkey'}}{\\mapsto}{\\color{blue}{\\mbox{11}}},\\ \\color{red}{\\mbox{'suppkey'}}{\\mapsto}{\\color{blue}{\\mbox{5}}}\\right\\},\\\\\n", "\\left\\{\\color{red}{\\mbox{'nationkey'}}{\\mapsto}{\\color{blue}{\\mbox{14}}},\\ \\color{red}{\\mbox{'suppkey'}}{\\mapsto}{\\color{blue}{\\mbox{6}}}\\right\\},\\\\\n", "\\text{...}(6)\\end{array}\\right\\}$$" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "supplier_solutions = clans.project(\n", " clans.cross_functional_union(bgp_suppkey, bgp_nationkey), 'nationkey', 'suppkey')\n", "iprint_latex('supplier_solutions', short=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Next Step\n", "\n", "Continue with [4-Hierarchies][]; it introduces our representation of hierarchical data, on the example of XML.\n", "\n", "[4-Hierarchies]: <4-Hierarchies.ipynb> (IPython Notebook: TPC-H Query 5 - Hierarchies)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "----\n", "© Copyright Permission.io, Inc. (formerly known as Algebraix Data Corporation), Copyright (c) 2022.\n", "\n", "This file is part of [`algebraixlib`][] .\n", "\n", "[`algebraixlib`][] is free software: you can redistribute it and/or modify it under the terms of [version 3 of the GNU Lesser General Public License][] as published by the [Free Software Foundation][].\n", "\n", "[`algebraixlib`][] is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.\n", "\n", "You should have received a copy of the GNU Lesser General Public License along with [`algebraixlib`][]. If not, see [GNU licenses][].\n", "\n", "[`algebraixlib`]: (A Python library for data algebra)\n", "[Version 3 of the GNU Lesser General Public License]: \n", "[Free Software Foundation]: \n", "[GNU licenses]: " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.4.3" } }, "nbformat": 4, "nbformat_minor": 0 }