{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# The Relational Model\n",
    "\n",
    "- The relational model is a mathematical *abstraction* used to describe\n",
    "how data can be structured, queried and updated.\n",
    "\n",
    "- It is based on *set theory*.\n",
    "\n",
    "- It can be *implemented* in many different ways.\n",
    "\n",
    "- When it is implemented by storing relations on disk files, we have a *relational database*.\n",
    "\n",
    "- Functional programming languages such as Python naturally express many aspects of the relational model.\n",
    "\n",
    "- This is one of the reasons they are very useful for data science."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Overview\n",
    "\n",
    "1. The formal definition of the relational model\n",
    "\n",
    "2. Representing relations in Python using collections of tuples\n",
    "\n",
    "3. Querying relational data using Python set comprehensions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## An Example Relational Dataset\n",
    "\n",
    "The following slides use relations to describe:\n",
    "\n",
    "- students, \n",
    "- the courses they are taking \n",
    "- the prerequisites of courses\n",
    "- their grades\n",
    "- which department they are in"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>name</th>\n",
       "      <th>student_number</th>\n",
       "      <th>course_id</th>\n",
       "      <th>department</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Smith</td>\n",
       "      <td>17</td>\n",
       "      <td>1</td>\n",
       "      <td>CS</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Brown</td>\n",
       "      <td>18</td>\n",
       "      <td>2</td>\n",
       "      <td>CS</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    name  student_number  course_id department\n",
       "0  Smith              17          1         CS\n",
       "1  Brown              18          2         CS"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "df = pd.DataFrame({'name': ['Smith', 'Brown'], 'student_number': [17, 18], 'course_id': [1, 2], 'department': ['CS', 'CS']})\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Concepts\n",
    "\n",
    "- In a relational model, the data is a collection of *relations*.\n",
    "\n",
    "- Informally, a relation resembles a table of values.\n",
    "\n",
    "- When relations are stored on disk, they are called tables.\n",
    "\n",
    "- Each row represents a *fact* about a real-world entity or relationship.\n",
    "\n",
    "- The table name and column names are used to help interpret the meaning of the values.\n",
    "\n",
    "- A relational model is defined formally in terms of:\n",
    "    - tuples\n",
    "    - attributes\n",
    "    - relations\n",
    "    - domains"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Illustration of a relation\n",
    "\n",
    "![relation](figs/Relational_model_concepts.png)\n",
    "\n",
    "Image courtesy of [AutumnSnow](https://commons.wikimedia.org/wiki/User:AutumnSnow)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Tuples\n",
    "\n",
    "A tuple is a mathematical abstraction which:\n",
    "\n",
    "- contains several other values\n",
    "\n",
    "- has a well-defined ordering over the values\n",
    "\n",
    "- can contain duplicate values\n",
    "\n",
    "- can contain values of different types\n",
    "\n",
    "- can contain the special value `None` or `Null`\n",
    "\n",
    "- is immutable; the values contained in the tuple cannot change over time"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## The size of a tuple\n",
    "\n",
    "- We often restrict attention to tuples of a particular size or *degree*.\n",
    "\n",
    "- An $n-$tuple contains $n$ values."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Attributes\n",
    "\n",
    "- An attribute refers to the value in a particular index of a tuple."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Atomic values\n",
    "\n",
    "- Atomic values are values which are not stored in collections.\n",
    "\n",
    "- Atomic values cannot be further decomposed into other values.\n",
    "\n",
    "- A tuple is therefore *not* atomic.\n",
    "\n",
    "- A tuple that contains only atomic values is called a *flat tuple*."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Domain\n",
    "\n",
    "- A *domain* $D_i$ is a set of atomic values.\n",
    "\n",
    "- Each attribute within a relation has the *same* domain.\n",
    "\n",
    "- Intuitively, a domain specifies the allowable values in a column $i$.\n",
    "\n",
    "- Examples:\n",
    "\n",
    "$D_1 = \\mathbb{Z}$ \n",
    "\n",
    "$D_2 = \\{ 15, 16, \\ldots, 80 \\}$ \n",
    "\n",
    "$D_3 = \\{ \"CS\", \\; \"ECON\", \\; \"PHYS\" \\}$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Relation schema\n",
    "\n",
    "- A *relation schema* is denoted by $R(A_1, A_2, \\ldots, A_n)$.\n",
    "\n",
    "- Each *attribute* $A_i$ is the name of a role played by some domain $D_i$ in $R$.\n",
    "\n",
    "- $D_i$ is the *domain* of $A_i$ and is denoted by $\\operatorname{dom}(A_i)$.\n",
    "\n",
    "- The *degree* or *arity* of a relation is the number of attributes $n$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Example\n",
    "\n",
    "- What is the arity of the following relation schema?\n",
    "\n",
    "~~~\n",
    "STUDENT(Name, Ssn, Home_phone, Address, Office_phone, Age, Gpa)\n",
    "~~~\n",
    "\n",
    "- Answer: 7\n",
    "\n",
    "- What is the name of the relation?\n",
    "\n",
    "- Answer: `STUDENT`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Example of a Domain\n",
    "\n",
    "$\\operatorname{dom}(Gpa) = [0, 4]$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Relations\n",
    "\n",
    "- The schema represents the structure of a *relation*.\n",
    "\n",
    "- A relation contains the actual data.\n",
    "\n",
    "- It is sometimes called the *relation state*, *relation intension* or *relation extension*.\n",
    "\n",
    "- Let $r(R)$ denote the relation $r$ of a relation schema $R$.\n",
    "\n",
    "- The relation $r$ consists of a set of $n$-tuples $r = \\{t_1, t_2, \\ldots, t_m\\}$.\n",
    "\n",
    "- The $i^{th}$ value in tuple $t$ corresponds to the attribute $A_i$\n",
    "and is denoted $t.A_i$ or $t[i]$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Constraints\n",
    "\n",
    "- Domain constraints\n",
    "\n",
    "- Key constraints\n",
    "\n",
    "- NULL values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Relational Datasets\n",
    "\n",
    "- So far we have discussed single relations.\n",
    "\n",
    "- A typical data-set will comprise many relations.\n",
    "\n",
    "- A relational dataset schema $(S, IC)$ comprises:\n",
    "\n",
    "    - a set of relation schemas $S = \\{ R_1, R_2, \\ldots, R_k \\}$\n",
    "    \n",
    "    - a set of integrity constraints $IC$\n",
    "    \n",
    "- A relational dataset state $DB$ is a set of relation states $DB = \\{ r_1, r_2, \\ldots r_m \\}$\n",
    "\n",
    "    - such that every $r_i$ satisfies every constraint in $IC$.\n",
    " "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "  \n",
    "## Data definition language\n",
    "\n",
    "- The data definition language (DDL) provides a concrete syntax and semantics for describing a relational schema.\n",
    "\n",
    "- Most commonly we use *SQL* - Structured Query Language."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Data query language\n",
    "\n",
    "- The data query language provides a concrete syntax and semantics for querying the relational dataset.\n",
    "\n",
    "- Formally a query is a function mapping from existing relation states to new relations.\n",
    "\n",
    "- That is, we map from one set of tuples to another set of tuples.\n",
    "\n",
    "- Typically the mapping involves some combination of set-theoretic functions, e.g.\n",
    "\n",
    "    - subset of tuples that satisfy a predicate $p$  \n",
    "        - $\\{ x: x \\in X \\wedge p(x) \\}$\n",
    "    - set union $X \\cup Y$, difference $X - Y$, intersection $X \\cap Y$\n",
    "    - Cartesian product $X \\times Y$ \n",
    "\n",
    "- The most common data query language for relational databases is again SQL.\n",
    "\n",
    ". . .\n",
    "\n",
    "- Mathematically, there is nothing stopping us from using e.g. Python as a query language."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Tuples in Python\n",
    "\n",
    "- Tuples in Python can be written by writing a sequence of values separated by\n",
    "commas and surrounded by round brackets. For example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "tuple1 = (50, 6.5)\n",
    "tuple2 = (1, 2, 'hello')\n",
    "professor = ('Steve', 'Phelps', 'S6.18')\n",
    "student = ('John', 'Doe', None)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "- The individual values contain within a tuple can be obtained by indexing\n",
    "their position (counting from zero).  To find the office number of the professor:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'S6.18'"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "professor[2]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "- Tuples are a very flexible way to represent single pieces of data.\n",
    "\n",
    "- We only allow *flat tuples*.  The following is not allowed in a relational model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "this_is_not_allowed = (1, 3, (50, 6.5))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Sets of Tuples\n",
    "\n",
    "- How can we use tuples to represent data-*sets* and relations?\n",
    "\n",
    "- We can use collections of tuples, e.g. a set of tuples.\n",
    "\n",
    "- So now we can represent one or more students:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Student tuples\n",
    "smith = ('Smith', 17, 1, 'CS')\n",
    "brown = ('Brown', 8, 2, 'CS')\n",
    "\n",
    "# The student relation\n",
    "students = {smith, brown}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Relational attributes in Python\n",
    "\n",
    "- Attributes are names for particular positions within a tuple.\n",
    "\n",
    "- We can use Python functions to represent relational attributes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The attributes of a student\n",
    "\n",
    "def student_name(s):\n",
    "    return s[0]\n",
    "    \n",
    "def student_student_number(s):\n",
    "    return s[1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "\n",
    "- Note that different relations can have the same attribute.\n",
    "\n",
    "- Therefore we need some convention to distinguish attributes from different relations.\n",
    "\n",
    "- In the above code, `student_student_number` refers to the `student_number` attribute of the `student` relation.\n",
    " "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Queries in Python\n",
    "\n",
    "- We need some way to extract data from our data-set; i.e. to *query* the data.\n",
    "\n",
    "- A query will e.g.:\n",
    "\n",
    "    - Take a subset of the tuples of a relation that satisfy a predicate.\n",
    "    \n",
    "    - *Join* two or more relations using a Cartesian product.\n",
    "    \n",
    "    - Take the intersection of tuples from two or more relations.\n",
    "    \n",
    "    - Take the union of tuples from two or more relations.\n",
    "    \n",
    "\n",
    "- Python list comprehensions or set comprehensions provide all of this functionality."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Relational queries in Python"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- The set of students whose name is \"Smith\":"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{('Smith', 17, 1, 'CS')}"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "{s for s in students if student_name(s) == 'Smith'}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is equivalent to the SQL query:\n",
    "\n",
    "~~~SQL\n",
    "SELECT * FROM students WHERE students.name = \"SMITH\";\n",
    "~~~"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Joining relations\n",
    "\n",
    "- Now let's create another relation called `grades` which has tuples of the form `(ssn, course-name, mark)`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "grades = { (17, 'python', 'A'), (17, 'algebra', 'B'), (17, 'algebra', 'A')}\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "and a function to return the mark for a given grade tuple:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "def grade_mark(g):\n",
    "    return g[2]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Now we can join the two relations using a Cartesian product:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{('Brown', 'A'), ('Brown', 'B'), ('Smith', 'A'), ('Smith', 'B')}"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "{(student_name(s), grade_mark(g)) for s in students for g in grades}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "this is equivalent to the following SQL:\n",
    "\n",
    "~~~SQL\n",
    "SELECT students.name, grades.mark FROM students, grades;\n",
    "~~~"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- We can also combine this with a predicate:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{('Smith', 'A'), ('Smith', 'B')}"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "{(student_name(s), grade_mark(g)) for s in students for g in grades if student_name(s) == 'Smith'}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "~~~SQL\n",
    "SELECT students.name, grades.mark FROM students, grades WHERE students.name = \"Smith\";\n",
    "~~~"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Functional-relational mapping\n",
    "\n",
    "- Functional programming languages such as Python are well suited to data science because functional programming constructs such as comprehensions map directly onto the relational model.\n",
    "\n",
    "- This is called functional-relational mapping.\n",
    "\n",
    "- There are [tools](https://www.lightbend.com/blog/slick-20-ga-functional-relational-mapping-made-easy) which allow relational database to be queried directly from functional expressions.\n",
    "\n",
    "- This is in contrast to [object-relational mapping (ORM)](https://www.fullstackpython.com/object-relational-mappers-orms.html).\n",
    "\n",
    "- There is *no* isomorphic mapping between an object model and a relational model, which leads to difficulties called [the object-relational impedance mismatch](https://en.wikipedia.org/wiki/Object-relational_impedance_mismatch).\n",
    "\n",
    "- Functional programming allows us to express a computation as functions on sets of tuples, and thus comprehensions are completely isomorphic with relational queries. "
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  },
  "latex_envs": {
   "LaTeX_envs_menu_present": true,
   "autoclose": false,
   "autocomplete": true,
   "bibliofile": "biblio.bib",
   "cite_by": "apalike",
   "current_citInitial": 1,
   "eqLabelWithNumbers": true,
   "eqNumInitial": 1,
   "hotkeys": {
    "equation": "Ctrl-E",
    "itemize": "Ctrl-I"
   },
   "labels_anchors": false,
   "latex_user_defs": false,
   "report_style_numbering": false,
   "user_envs_cfg": false
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}