{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# The Relational Model\n", "\n", "- The relational model is a mathematical *abstraction* used to describe\n", "how data can be structured, queried and updated.\n", "\n", "- It is based on *set theory*.\n", "\n", "- It can be *implemented* in many different ways.\n", "\n", "- When it is implemented by storing relations on disk files, we have a *relational database*.\n", "\n", "- Functional programming languages such as Python naturally express many aspects of the relational model.\n", "\n", "- This is one of the reasons they are very useful for data science." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Overview\n", "\n", "1. The formal definition of the relational model\n", "\n", "2. Representing relations in Python using collections of tuples\n", "\n", "3. Querying relational data using Python set comprehensions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## An Example Relational Dataset\n", "\n", "The following slides use relations to describe:\n", "\n", "- students, \n", "- the courses they are taking \n", "- the prerequisites of courses\n", "- their grades\n", "- which department they are in" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namestudent_numbercourse_iddepartment
0Smith171CS
1Brown182CS
\n", "
" ], "text/plain": [ " name student_number course_id department\n", "0 Smith 17 1 CS\n", "1 Brown 18 2 CS" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "df = pd.DataFrame({'name': ['Smith', 'Brown'], 'student_number': [17, 18], 'course_id': [1, 2], 'department': ['CS', 'CS']})\n", "df" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Concepts\n", "\n", "- In a relational model, the data is a collection of *relations*.\n", "\n", "- Informally, a relation resembles a table of values.\n", "\n", "- When relations are stored on disk, they are called tables.\n", "\n", "- Each row represents a *fact* about a real-world entity or relationship.\n", "\n", "- The table name and column names are used to help interpret the meaning of the values.\n", "\n", "- A relational model is defined formally in terms of:\n", " - tuples\n", " - attributes\n", " - relations\n", " - domains" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Illustration of a relation\n", "\n", "![relation](figs/Relational_model_concepts.png)\n", "\n", "Image courtesy of [AutumnSnow](https://commons.wikimedia.org/wiki/User:AutumnSnow)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Tuples\n", "\n", "A tuple is a mathematical abstraction which:\n", "\n", "- contains several other values\n", "\n", "- has a well-defined ordering over the values\n", "\n", "- can contain duplicate values\n", "\n", "- can contain values of different types\n", "\n", "- can contain the special value `None` or `Null`\n", "\n", "- is immutable; the values contained in the tuple cannot change over time" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## The size of a tuple\n", "\n", "- We often restrict attention to tuples of a particular size or *degree*.\n", "\n", "- An $n-$tuple contains $n$ values." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Attributes\n", "\n", "- An attribute refers to the value in a particular index of a tuple." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Atomic values\n", "\n", "- Atomic values are values which are not stored in collections.\n", "\n", "- Atomic values cannot be further decomposed into other values.\n", "\n", "- A tuple is therefore *not* atomic.\n", "\n", "- A tuple that contains only atomic values is called a *flat tuple*." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Domain\n", "\n", "- A *domain* $D_i$ is a set of atomic values.\n", "\n", "- Each attribute within a relation has the *same* domain.\n", "\n", "- Intuitively, a domain specifies the allowable values in a column $i$.\n", "\n", "- Examples:\n", "\n", "$D_1 = \\mathbb{Z}$ \n", "\n", "$D_2 = \\{ 15, 16, \\ldots, 80 \\}$ \n", "\n", "$D_3 = \\{ \"CS\", \\; \"ECON\", \\; \"PHYS\" \\}$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Relation schema\n", "\n", "- A *relation schema* is denoted by $R(A_1, A_2, \\ldots, A_n)$.\n", "\n", "- Each *attribute* $A_i$ is the name of a role played by some domain $D_i$ in $R$.\n", "\n", "- $D_i$ is the *domain* of $A_i$ and is denoted by $\\operatorname{dom}(A_i)$.\n", "\n", "- The *degree* or *arity* of a relation is the number of attributes $n$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Example\n", "\n", "- What is the arity of the following relation schema?\n", "\n", "~~~\n", "STUDENT(Name, Ssn, Home_phone, Address, Office_phone, Age, Gpa)\n", "~~~\n", "\n", "- Answer: 7\n", "\n", "- What is the name of the relation?\n", "\n", "- Answer: `STUDENT`" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Example of a Domain\n", "\n", "$\\operatorname{dom}(Gpa) = [0, 4]$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Relations\n", "\n", "- The schema represents the structure of a *relation*.\n", "\n", "- A relation contains the actual data.\n", "\n", "- It is sometimes called the *relation state*, *relation intension* or *relation extension*.\n", "\n", "- Let $r(R)$ denote the relation $r$ of a relation schema $R$.\n", "\n", "- The relation $r$ consists of a set of $n$-tuples $r = \\{t_1, t_2, \\ldots, t_m\\}$.\n", "\n", "- The $i^{th}$ value in tuple $t$ corresponds to the attribute $A_i$\n", "and is denoted $t.A_i$ or $t[i]$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Constraints\n", "\n", "- Domain constraints\n", "\n", "- Key constraints\n", "\n", "- NULL values" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Relational Datasets\n", "\n", "- So far we have discussed single relations.\n", "\n", "- A typical data-set will comprise many relations.\n", "\n", "- A relational dataset schema $(S, IC)$ comprises:\n", "\n", " - a set of relation schemas $S = \\{ R_1, R_2, \\ldots, R_k \\}$\n", " \n", " - a set of integrity constraints $IC$\n", " \n", "- A relational dataset state $DB$ is a set of relation states $DB = \\{ r_1, r_2, \\ldots r_m \\}$\n", "\n", " - such that every $r_i$ satisfies every constraint in $IC$.\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ " \n", "## Data definition language\n", "\n", "- The data definition language (DDL) provides a concrete syntax and semantics for describing a relational schema.\n", "\n", "- Most commonly we use *SQL* - Structured Query Language." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Data query language\n", "\n", "- The data query language provides a concrete syntax and semantics for querying the relational dataset.\n", "\n", "- Formally a query is a function mapping from existing relation states to new relations.\n", "\n", "- That is, we map from one set of tuples to another set of tuples.\n", "\n", "- Typically the mapping involves some combination of set-theoretic functions, e.g.\n", "\n", " - subset of tuples that satisfy a predicate $p$ \n", " - $\\{ x: x \\in X \\wedge p(x) \\}$\n", " - set union $X \\cup Y$, difference $X - Y$, intersection $X \\cap Y$\n", " - Cartesian product $X \\times Y$ \n", "\n", "- The most common data query language for relational databases is again SQL.\n", "\n", ". . .\n", "\n", "- Mathematically, there is nothing stopping us from using e.g. Python as a query language." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Tuples in Python\n", "\n", "- Tuples in Python can be written by writing a sequence of values separated by\n", "commas and surrounded by round brackets. For example:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "tuple1 = (50, 6.5)\n", "tuple2 = (1, 2, 'hello')\n", "professor = ('Steve', 'Phelps', 'S6.18')\n", "student = ('John', 'Doe', None)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "- The individual values contain within a tuple can be obtained by indexing\n", "their position (counting from zero). To find the office number of the professor:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "'S6.18'" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "professor[2]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- Tuples are a very flexible way to represent single pieces of data.\n", "\n", "- We only allow *flat tuples*. The following is not allowed in a relational model:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "this_is_not_allowed = (1, 3, (50, 6.5))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Sets of Tuples\n", "\n", "- How can we use tuples to represent data-*sets* and relations?\n", "\n", "- We can use collections of tuples, e.g. a set of tuples.\n", "\n", "- So now we can represent one or more students:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# Student tuples\n", "smith = ('Smith', 17, 1, 'CS')\n", "brown = ('Brown', 8, 2, 'CS')\n", "\n", "# The student relation\n", "students = {smith, brown}" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Relational attributes in Python\n", "\n", "- Attributes are names for particular positions within a tuple.\n", "\n", "- We can use Python functions to represent relational attributes:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# The attributes of a student\n", "\n", "def student_name(s):\n", " return s[0]\n", " \n", "def student_student_number(s):\n", " return s[1]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\n", "- Note that different relations can have the same attribute.\n", "\n", "- Therefore we need some convention to distinguish attributes from different relations.\n", "\n", "- In the above code, `student_student_number` refers to the `student_number` attribute of the `student` relation.\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Queries in Python\n", "\n", "- We need some way to extract data from our data-set; i.e. to *query* the data.\n", "\n", "- A query will e.g.:\n", "\n", " - Take a subset of the tuples of a relation that satisfy a predicate.\n", " \n", " - *Join* two or more relations using a Cartesian product.\n", " \n", " - Take the intersection of tuples from two or more relations.\n", " \n", " - Take the union of tuples from two or more relations.\n", " \n", "\n", "- Python list comprehensions or set comprehensions provide all of this functionality." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Relational queries in Python" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- The set of students whose name is \"Smith\":" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{('Smith', 17, 1, 'CS')}" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "{s for s in students if student_name(s) == 'Smith'}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is equivalent to the SQL query:\n", "\n", "~~~SQL\n", "SELECT * FROM students WHERE students.name = \"SMITH\";\n", "~~~" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Joining relations\n", "\n", "- Now let's create another relation called `grades` which has tuples of the form `(ssn, course-name, mark)`:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "grades = { (17, 'python', 'A'), (17, 'algebra', 'B'), (17, 'algebra', 'A')}\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and a function to return the mark for a given grade tuple:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "def grade_mark(g):\n", " return g[2]" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Now we can join the two relations using a Cartesian product:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{('Brown', 'A'), ('Brown', 'B'), ('Smith', 'A'), ('Smith', 'B')}" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "{(student_name(s), grade_mark(g)) for s in students for g in grades}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "this is equivalent to the following SQL:\n", "\n", "~~~SQL\n", "SELECT students.name, grades.mark FROM students, grades;\n", "~~~" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- We can also combine this with a predicate:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{('Smith', 'A'), ('Smith', 'B')}" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "{(student_name(s), grade_mark(g)) for s in students for g in grades if student_name(s) == 'Smith'}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "~~~SQL\n", "SELECT students.name, grades.mark FROM students, grades WHERE students.name = \"Smith\";\n", "~~~" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Functional-relational mapping\n", "\n", "- Functional programming languages such as Python are well suited to data science because functional programming constructs such as comprehensions map directly onto the relational model.\n", "\n", "- This is called functional-relational mapping.\n", "\n", "- There are [tools](https://www.lightbend.com/blog/slick-20-ga-functional-relational-mapping-made-easy) which allow relational database to be queried directly from functional expressions.\n", "\n", "- This is in contrast to [object-relational mapping (ORM)](https://www.fullstackpython.com/object-relational-mappers-orms.html).\n", "\n", "- There is *no* isomorphic mapping between an object model and a relational model, which leads to difficulties called [the object-relational impedance mismatch](https://en.wikipedia.org/wiki/Object-relational_impedance_mismatch).\n", "\n", "- Functional programming allows us to express a computation as functions on sets of tuples, and thus comprehensions are completely isomorphic with relational queries. " ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": false, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 1 }