{ "cells": [ { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "# INTRODUCTION TO SQL ON POSTGRESQL SERVER" ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "**BY SERHAT ÇEVİKEL**" ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "In today's session, we will be continuing from where we left:\n", "\n", "- We explored the data\n", "- We transformed the data\n", "- Now we will start the postgresql daemon\n", "- Create a database\n", "- Import data into it\n", "- Start querying the database" ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "For simplicity purposes and due to time/resource constraints, we will import and set primary keys on only selected few tables\n", "\n", "We will skip the melting of tables, a task for which we provided an example before\n", "\n", "We already have a complete working database as imdb2\n", "\n", "First as we did before, we unzip the gzipped tsv files and trim the header rows, since PostgreSQL do not allow for header rows while importing tsv files according to [its own documentation on the COPY Command](https://www.postgresql.org/docs/current/static/sql-copy.html)\n", "\n", ">HEADER\n", "\n", "> Specifies that the file contains a header line with the names of each column in the file. On output, the first line contains the column names from the table, and on input, the first line is ignored. This option is allowed only when using CSV format.\n", "\n", "\n", "One option might be to convert a tsv into a csv by translating tab characters to commas using \"sed\" or better \"tr\". We will also use this option for selected tables" ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "## From the shell to PostgreSQL" ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "### Start the postgresql daemon on the background" ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "Since that may take some time between 5-10 minutes in our binder environment, it is better that we do it now, so that we can proceed with file operations without waiting for the daemon to start:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "sudo service postgresql start" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "If an error is returned, you should rerun the cell." ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "We can check whether the daemon has started with:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "psql -U postgres -c \"\\l\"" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "This will list all databases on our server" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Until this command returns something successfully, the PostgreSQL server will not respond.\n", "\n", "But until we reach 1.2, we do not need to connect to the PostgreSQL server." ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "### Environment variables" ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "First, declare the environment variables for paths" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "datadir=~/data\n", "imdbdir=$datadir/imdb" ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "### Unzip the gzipped files and trim headers" ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "Then create the directory for unzipped files:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "mkdir -p $imdbdir/tsv2" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "And unzip the files if they do not exist yet:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "find $imdbdir/tsv -mindepth 1 | \\\n", " parallel -k -j0 \"basenm=\\$(basename {});\n", " if [ ! -e ${imdbdir}/tsv2/\\${basenm%.gz} ];\n", " then\n", " gunzip -c {} > \\\n", " ${imdbdir}/tsv2/\\${basenm%.gz};\n", " fi\n", " \"" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "the \"basename\" function extracts the filename portion from the full path name so we can reproduce the full path name under another directory" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "tldr basename" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "The ```${basenm%.gz}``` portion is a form of \"parameter substitution\"\n", "\n", "It means:\n", "\n", "> \\\\${var%Pattern}, \\\\${var%%Pattern}\n", "\n", "> \\\\${var%Pattern} Remove from \\\\$var the shortest part of \\\\$Pattern that matches the back end of \\\\$var.\n", "\n", "\n", "> \\\\${var%%Pattern} Remove from \\\\$var the longest part of \\\\$Pattern that matches the back end of \\\\$var.\n", "\n", "(https://www.tldp.org/LDP/abs/html/parameter-substitution.html)" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Now let's check whether the gunzipped files are there:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "ls -l $imdbdir/tsv2" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Now we will trim the first lines and dump into a new directory named tsv3." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "mkdir -p $imdbdir/tsv3" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "find $imdbdir/tsv2 -mindepth 1 | \\\n", " parallel -k -j0 \"basenm=\\$(basename {});\n", " if [ ! -e ${imdbdir}/tsv3/\\${basenm/.tsv/2.tsv} ];\n", " then\n", " tail -n+2 {} > \\\n", " ${imdbdir}/tsv3/\\${basenm/.tsv/2.tsv};\n", " fi\n", " \"" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "\"basenm/.tsv/2.tsv\" part is like the substitution command of sed: find \".tsv\" and replace with \"2.tsv\"\n", "\n", "\"-e\" checks the existence of the file and proceeds only if the file does not exist yet" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Check whether row counts are decremented:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "find $imdbdir/tsv2 -mindepth 1 | \\\n", " parallel -k -j0 \"basenm=\\$(basename {});\n", " {\n", " printf '%s\\t' \\$basenm;\n", " cat $imdbdir/tsv2/\\$basenm | wc -l;\n", " cat $imdbdir/tsv3/\\${basenm/.tsv/2.tsv} | wc -l;\n", " } | tr '\\n' '\\t' | xargs echo -e\n", " \" | column -t" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "### Convert tsv's to csv's" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Now let's convert tsv's to csv's" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "mkdir -p $imdbdir/csv" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "find $imdbdir/tsv2 -mindepth 1 | \\\n", " parallel -k -j0 \"basenm=\\$(basename {});\n", " if [ ! -e ${imdbdir}/csv/\\${basenm/.tsv/.csv} ];\n", " then\n", " cat {} | tr '\\t' ',' > \\\n", " ${imdbdir}/csv/\\${basenm/.tsv/.csv};\n", " fi\n", " \"" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "List the csv directory" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "ls -l $imdbdir/csv" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "And view the head of a file with column command, with \",\" as the separator/delimiter character:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "head $imdbdir/csv/name.basics.csv | column -t -s \",\"" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "We can also use the csv files to create the database " ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "### Ways to access the postgresql server" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Remember we probed whether postgresql server is listening by:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "psql -U postgres -c \"\\l\"" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "psql is the built-in interactive terminal client for PostgreSQL. It can both be used interactively with it shell or non-interactively with the \"-c\" (command) flag as we used above " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "tldr psql" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "psql is good for embedding sql codes in shell scripts. However for interactive purposes, it does not provide much features.\n", "\n", "A better option is the \"pgcli\" interface. It is a part of the dbcli project, that provides better command line interfaces (CLI) as DBclients:\n", "\n", "https://github.com/dbcli/pgcli\n", "\n", "https://www.pgcli.com/" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "First we open a new terminal window:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "last=$(( $(ps aux | awk -F \" \" '$7 ~ /pts/ { print $7 }' | grep -Po \"\\d+\" | sort -nu | tail -1) + 1 ))\n", "echo $last" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "echo \"Follow the working one of the following links to open a terminal: (that matches the domain of the URL above)\"\n", "echo \"https://hub.gke.mybinder.org/user/serhatcevikel-bdm_2019-$(hostname | grep -Po '(?<=2019-2d).*?$')/terminals/$last\"\n", "echo \"https://hub-binder.mybinder.ovh/user/serhatcevikel-bdm_2019-$(hostname | grep -Po '(?<=2019-2d).*?$')/terminals/$last\"\n", "echo \"https://notebooks.gesis.org/binder/jupyter/user/serhatcevikel-bdm_2019-$(hostname | grep -Po '(?<=2019-2d).*?$')/terminals/$last\"" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Now please copy and paste the below command inside that terminal:\n", "\n", "```Bash\n", "screen -S 1\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "scr=$(screen -ls | grep -P \"Attached\" | head -1 | grep -Po \"^\\t+.+?(?=\\s|\\t)\" | tr -d \"\\t\")\n", "screen -S $scr -X stuff \"pgcli -U postgres\\n\"" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Some options are automatically selected at build time.\n", "\n", "- Auto completion is on\n", "- Multiline mode is on: Statements canbe multiline, to execute them, put a semi-colon \";\" at the end\n", "- Vi-mode is on: If you are fluent at using vim editor, the same key bindings apply now" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "We may be using other interfaces, and here we will prefer to execute our commands through the Jupyter notebook. However, it is better to have pgcli opened on the terminal, since the smart completion facility makes it easier to write SQL statements. You can copy the statement to the Jupyter cells to be executed" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Another client is pgAdmin4. It had a GUI interface before, but now it only works on web mode, hosted by your web browser - just as Jupyter is.\n", "\n", "Due to limitations of the binder environment, we won't be using pgAdmin4 but you can check it from https://www.pgadmin.org/" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "There are several ways to access postgresql server from the Jupyter:\n", "\n", "- Use bash kernel and send SQL commands via the \"-c\" flag of psql client\n", "- Use SQL kernel of the beakerx project\n", "- Use sql magic of python kernel through ipython_sql extension\n", "- Use postgres_kernel\n", "- Use any of the kernels through SoS kernel which is polyglot - each cell can be run on a separate kernel, which we do now" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "SQL kernel of beakerx extension is the most visually appealing and easy to use one, when used as a stand-alone notebook. However it does not play well from inside an SoS notebook, and we prefer the \"polyglot\" nature of the SoS notebook\n", "\n", "postgres_kernel is buggy\n", "\n", "So we will either use the sql magic on Python3 kernel or psql in non-interactive mode on bash_kernel" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "## Create a table space and a database" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "For the rest of the commands to work, PostgreSQL server should be listening on port 5432. Let's check:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "psql -U postgres -c \"\\l\"" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "If the command does not list the existing databases, we wait until it does so by re-running the cell. \n", "\n", "Hopefully, we started the database daemon back in the beginning of the session." ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "First we will create a tablespace - a directory on the filesystem for the PostgreSQL server to store our new database:\n", "\n", "\n", ">22.6. Tablespaces\n", ">\n", ">Tablespaces in PostgreSQL allow database administrators to define locations in the file system where the files representing database objects can be stored. Once created, a tablespace can be referred to by name when creating database objects.\n", ">\n", ">By using tablespaces, an administrator can control the disk layout of a PostgreSQL installation. This is useful in at least two ways. First, if the partition or volume on which the cluster was initialized runs out of space and cannot be extended, a tablespace can be created on a different partition and used until the system can be reconfigured.\n", "\n", "(https://www.postgresql.org/docs/current/static/manage-ag-tablespaces.html)" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Create a directory on the filesystem:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "sudo mkdir -p /pg1" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Check the user and group ownership with the stat command:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "stat -c \"%U %G\" /pg1" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "tldr stat" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "And make it owned by postgres user:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "sudo chown -R postgres:postgres /pg1" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "And check again:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "stat -c \"%U %G\" /pg1" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Note that, in this binder image, sudoer (admin) jovyan user becomes sudo without password prompt. This is for the non-interactive feature of bash_kernel on Jupyter (it cannot prompt for a password) and for better scriptability.\n", "\n", "This environment is a short lived container isolated from the rest of the operating system it sits atop. This feature should not be done in a production environment" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Let's check again whether our postgresql server responds to our requests:" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Now it is time to create a tablespace on /pg location and create a database there:" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Let's review the list of databases again:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "psql -U postgres -c \"\\l\"" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "First let's enable the sql magic on python kernel, so that we can send SQL commands to the PostgreSQL server easily with minimal additional syntax (using psql from the bash is another option):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%load_ext sql" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "And create a connection with our server into the maintenance database \"postgres\":" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%sql postgres://postgres@localhost/postgres" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "Now from inside the Python3 kernel, we can execute sql statements easily if we insert a \"%%sql\" magic in the first line as such:\n", "\n", "Comments start with \"--\"\n", "\n", "Statements end with \";\"" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "First let's create a tablespace \"pg1\" on our /pg1 directory: (we need the END; statement so that tablespace creation is executed without error)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "-- create a new tablespace\n", "END;\n", "CREATE TABLESPACE pg1 LOCATION '/pg1';" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "And let's create a database on pg1 tablespace.\n", "\n", "Note that, since we already have the \"imdb\" database, we will create imdb1 for practice purposes." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "-- create a new database imdb1 on pg1 tablespace\n", "END;\n", "CREATE DATABASE imdb1\n", "TABLESPACE = pg1;" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "Now connect to the new imdb1 database" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%sql postgres://postgres@localhost/imdb1" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "### Create and copy data into title_ratings table" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "Make sure we are connected to the newly created imdb1 database:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%sql postgres://postgres@localhost/imdb1" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "And create a new table title_ratings with defined fields and types\n", "\n", "Note that, we cannot use \".\" dot inside table and field names" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "-- create a new table\n", "CREATE TABLE title_ratings\n", "(\n", " tconst text,\n", " averageRating numeric,\n", " numVotes integer\n", ");" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "In order to copy files from the local filesystem into the postgresql server, it is best to use the psql command:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ " psql -U postgres -d imdb1 -c \"\\copy title_ratings from $imdbdir/tsv3/title.ratings2.tsv\"" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Note that this runs the COPY command of SQL under the hood, however it does not fall onto file read privilege problems:\n", "\n", "\n", "\n", ">Assuming the psql command-line tool, you may use \\copy instead of copy.\n", ">\n", ">\\copy opens the file and feeds the contents to the server, whereas copy tells the server the open the file itself and read it, which may be problematic permission-wise, or even impossible if client and server run on different machines with no file sharing in-between.\n", ">\n", ">Under the hood, \\copy is implemented as COPY FROM stdin and accepts the same options than the server-side COPY.\n", ">\n", "\n", "(https://stackoverflow.com/questions/19463074/postgres-error-could-not-open-file-for-reading-permission-denied#19466558)\n", "\n", "Another advantage of using the \"\\copy\" syntax through psql on Bash is that we can easily pass environment variables such as $imdbdir" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Now we can check the table exists by using psql meta-command to describe the tables inside a database:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "psql -U postgres -d imdb1 -c \"\\dt+\"" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "And get the size of the database:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "SELECT pg_size_pretty(pg_database_size('imdb1'));" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "Create, insert and drop operations after some time may leave a junk space inside the database. In this case, it is better to run the \"vacuum\" command:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "VACUUM FULL;" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "Now let's view the data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "SELECT *\n", "FROM title_ratings\n", "LIMIT 5;" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "To delete the data inside a table we can either use \"DELETE or TRUNCATE statements:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "DELETE FROM title_ratings;" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "Now check whether data is deleted:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "SELECT *\n", "FROM title_ratings\n", "LIMIT 5;" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "Now let's copy the data again, this time using the sql command COPY instead of psql built-in \\copy:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "COPY title_ratings from '/home/jovyan/data/imdb/tsv3/title.ratings2.tsv'" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "SELECT *\n", "FROM title_ratings\n", "LIMIT 5;" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "And delete data with TRUNCATE:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "TRUNCATE title_ratings;" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "SELECT *\n", "FROM title_ratings\n", "LIMIT 5;" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "Check the size and vacuum:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "SELECT pg_size_pretty(pg_database_size('imdb1'));" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "VACUUM FULL;" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "SELECT pg_size_pretty(pg_database_size('imdb1'));" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "### Primary keys" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "Now copy the data again using \"\\copy\" meta-command of psql:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "psql -U postgres -d imdb1 -c \"\\copy title_ratings from $imdbdir/tsv3/title.ratings2.tsv\"" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Let's report the total row count using countrows alias. Note that \"count()\" is an example of the aggregate functions to summarize data:\n", "\n", "Note that we define an alias \"countrows\" for the aggregate calculation we do after SELECT:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "SELECT count(*) as countrows\n", "FROM title_ratings;" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "More info on aggregate functions at:\n", "\n", "https://www.postgresql.org/docs/current/static/functions-aggregate.html\n", "\n", "And alias topic is covered in:\n", "\n", "https://www.postgresql.org/docs/10/static/queries-table-expressions.html" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "Now, with a mistake, run the COPY query again! (either with psql's \\copy or sql COPY) " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "COPY title_ratings\n", "FROM '/home/jovyan/data/imdb/tsv3/title.ratings2.tsv';" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "SELECT count(*) as countrows\n", "FROM title_ratings;" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "We have twice the number of rows now, because postgresql did not complain to duplicate the same data!\n", "\n", "We should define a primary key on the table and put a constraint on the uniqueness of that key\n", " But no problem we can ALTER anything in the table after it is first created:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "ALTER TABLE title_ratings\n", "ADD CONSTRAINT title_ratings_pk \n", "PRIMARY KEY (tconst);" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "It threw an integrity error: The data are duplicated and a primary key cannot have duplicate values!" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "Now let's drop the table alltogether (not delete the contents):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "DROP TABLE title_ratings;" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "Create the table again with the constraint:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "-- create a new table\n", "CREATE TABLE title_ratings\n", "(\n", " tconst text,\n", " averageRating numeric,\n", " numVotes integer,\n", "\n", "CONSTRAINT title_ratings_pk\n", " PRIMARY KEY (tconst)\n", " \n", ");" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "And import data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "COPY title_ratings\n", "FROM '/home/jovyan/data/imdb/tsv3/title.ratings2.tsv';" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "Get the row count:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "SELECT count(*) as countrows\n", "FROM title_ratings;" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "And try to import the data again:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "COPY title_ratings\n", "FROM '/home/jovyan/data/imdb/tsv3/title.ratings2.tsv';" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "It did not allow us because uniqueness constraint prevented us to create duplicate values of primary key fields" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "**EXERCISE 1:**\n", "\n", "Import the title.episode2.tsv into title_episode table in imdb1 database\n", "\n", "- First create the table and the fields with appropriate types, including the primary key\n", "- Then import the data\n", "\n", "Remember the structure of title.episode:\n", "\n", "- title.episode.tsv: 72 MB, 2,986,926 lines, 4 fields. Contains the tv episode information. Fields include:\n", "- tconst (string) - alphanumeric identifier of episode\n", "- parentTconst (string) - alphanumeric identifier of the parent TV Series\n", "- seasonNumber (integer) – season number the episode belongs to\n", "- episodeNumber (integer) – episode number of the tconst in the TV series.\n", "\n", "Note that, -- defines a commented line not to be executed\n", "\n", "And always finish your statements with a semicolon \";\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "\n", "-- type your statement here" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "psql -U postgres -d imdb1 < /dev/null)\n", "echo \"$solution\"; echo\n", "for l in \"$solution\"; do eval \"${l}\"; done" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "## Meta-commands" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "psql and pgcli clients offer some meta-commands that start with \"\\\" and that offer shortcuts to some common operations\n", "\n", "You can get help on that using \"\\?\" - a meta-command itself" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "psql -U postgres -c \"\\?\"" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "\"\\l\" lists databases:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "psql -U postgres -c \"\\l\"" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "\"\\c\" connects to database:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "psql -U postgres -c \"\\c imdb2\"" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "\"\\dt\" lists tables in a database:\n", "\"\\dt+\" gives more detail" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "psql -U postgres -d imdb2 -c \"\\dt+\"" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "\"\\d\" lists fields in a table\n", "\n", "\"\\d+\" gives more detail:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "psql -U postgres -d imdb2 -c \"\\d+ title_basics\"" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "To get detailed information on all tables and primary keys in the public schema (like a named directory of databases, tables and keys)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "psql -U postgres -d imdb2 -c \"\\d+ public.*\"" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "See that here we have some additional tables to enhance our queries, and they do not exist in the original imdb database:" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "crew_molten, splits multiple directors or writers from commas, and has an additional field for director/writer profession:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%load_ext sql\n", "%sql postgres://postgres@localhost/imdb2" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "SELECT * FROM crew_molten\n", "LIMIT 5;" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "genres_molten, has tconst and genres field split from commas:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "SELECT * FROM genres_molten\n", "LIMIT 5;" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "title_principals_melt has the tconst and principalcast field split from the commas:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "SELECT * FROM title_principals_melt\n", "LIMIT 5;" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "## Basic Query and Filter operations: SELECT with WHERE" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "Now we will use the full database \"imdb\" in our PostgreSQL server:" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "list fields and data info" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%load_ext sql" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%sql postgres://postgres@localhost/imdb2" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "In our basic \"SELECT\" queries we will start with a few clauses:\n", "\n", "* SELECT: columns and aggretations on columns\n", "* FROM: tables to query\n", "* WHERE: conditions to filter rows\n", "* ORDER: the column(s) to order the results\n", "* LIMIT: maximum number of rows to return " ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "* Select all titles which include \"Godfather\" and \"Part\" words in it, with any number of character before, between or after\n", "* Report tconst id, original title, start year, run time in minutes and genres\n", "* Order them by increasing runtimeminutes:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "\n", "SELECT tconst, originaltitle, startyear, runtimeminutes, genres\n", " FROM title_basics\n", " WHERE originaltitle ~ 'Godfather.*Part'\n", " ORDER BY runtimeminutes;" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "In fact we are not concerned with documentaries, shorts and comedy movies\n", "\n", "So we should include only dramas and exclude comedies and let's sort by runtimeminutes in descending order" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "\n", "SELECT tconst, originaltitle, startyear, runtimeminutes, genres\n", " FROM title_basics\n", " WHERE originaltitle ~ 'Godfather.*Part'\n", "\tAND genres ~ 'drama'\n", " ORDER BY runtimeminutes;" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "The query did not return any rows?\n", "\n", "Why?\n", "\n", "Because regex pattern search is case sensitive unless otherwise states and \"drama\" does not match \"Drama\" \n", "\n", "In order to make the search case insensitive we add (?i) at the beginning of the pattern or we can use ~* operator which makes the pattern case insensitive:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "\n", "SELECT tconst, originaltitle, startyear, runtimeminutes, genres\n", " FROM title_basics\n", " WHERE originaltitle ~ 'Godfather.*Part'\n", "\tAND genres ~ '(?i)drama'\n", " ORDER BY runtimeminutes DESC;" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "**EXERCISE 2:**\n", "\n", "- Select originaltitle, startyear, runtimeminutes and genres columns\n", "- From title_basics table\n", "- Where originaltitle includes \"Star Wars\"\n", "- And start year greater than 1977\n", "- Sort by descending runtimeminutes\n", "\n", "Note: You can try below cells or pgcli from the terminal to test your commands" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "\n", "-- statement here" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "psql -U postgres -d imdb1 < /dev/null)\n", "echo \"$solution\"; echo\n", "for l in \"$solution\"; do eval \"${l}\"; done" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Note that, unless we explicitly tell SQL server to exclude NA values, it also returns NA values" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "Now, back to the Godfather trilogy...\n", "\n", "We still have comedy. Now we enhance the WHERE clause so that comedy is also excluded\n", "\n", "Note that we can either alter regex as case insensitive (?i) or make the regex operator case insensitive (~*):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "\n", "SELECT tconst, originaltitle, startyear, runtimeminutes, genres\n", " FROM title_basics\n", " WHERE originaltitle ~ '.*Godfather.*Part.*'\n", "\tAND genres ~ '(?i)drama'\n", "\tAND NOT genres ~* 'comedy'\n", " ORDER BY runtimeminutes;" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "That's ok but we only want titles from the original trilogy last of which was screened in 1990" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "\n", "SELECT tconst, originaltitle, startyear, runtimeminutes, genres\n", " FROM title_basics\n", " WHERE originaltitle ~ '.*Godfather.*Part.*'\n", "\tAND genres ~ '(?i)drama'\n", "\tAND NOT genres ~ '(?i)comedy'\n", "\tAND startyear <= 1990\n", " ORDER BY runtimeminutes;" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "There, we have what we want for further queries" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "**EXERCISE 3:**\n", "\n", "- Select original title and startyear\n", "- From title_basics table\n", "- Those records in which originaltitle includes Star Wars\n", "- Genres include Sci-Fi\n", "- And not drama\n", "- Runtimeminutes shorter than 60 or longer than 150\n", "- Sort by runtimes minutes and startyear" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "\n", "-- statement here" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "psql -U postgres -d imdb1 < /dev/null)\n", "echo \"$solution\"; echo\n", "for l in \"$solution\"; do eval \"${l}\"; done" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "We can represent the runtimeminutes filter with a BETWEEN condition:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "pass1=\n", "encrypt=\"U2FsdGVkX19j4SGpgZW4DfLWs5VI0x/jrCAtXCRUsr/jp8yxZsrbRXmbN22UpfB1 kvsy3i51vT10+p4ItE9XGOXlVEVhgsvW+Zanf4+ZKyTvNkUOBdSydvElBTJM1VIZ HnB8JISJeG/kipPG2LnGuMLdhCxqL/yHtuQDK5uPF74x+Mpm9BQsg5E0MVHWulB1 6CPkEnugOjDm1pPt30v86LSXG8LvFq7ArlQd57BTBJE6cTo7PR3vNbV6z/9MLLaK kCIADJaGtGT+D8fhhUNolgG7Zv/t5uaurtd3xvGt96x7MmRuoHtfGb4KzfoVSZDC ASGVAbX2a1hiObKGrHB4CSFhg3zu748kAivxAjqMqdXb/m9WtlZAUvqiALPTSm13 pf4NqjJiGeidG2GAxSyhXHOJnqCmhH5wWKJwDavw3kc=\"\n", "solution=$(echo $encrypt | openssl enc -md sha256 -aes-128-cbc -a -d -salt -pass pass:$pass1 2> /dev/null)\n", "echo \"$solution\"; echo\n", "for l in \"$solution\"; do eval \"${l}\"; done" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "## A discussion on regex" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "Note that regex or regexp (regular expressions) is a domain specific querying language\n", "\n", "And a very powerful tool for mining, querying and manipulating textual data\n", "\n", "\n", "And is worth spending effort to learn if you deal with textual data" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "This tutorial is a good starting point:\n", "\n", "http://www.regular-expressions.info/tutorial.html" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "And this site if for testing regex patterns on some text:\n", "\n", "https://regex101.com/" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "And this competition website if good for progressing in regex and having fun at the same time:\n", "\n", "https://regexcrossword.com/" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "## DISTINCT" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Let's say we want to get the unique valued rows from the result of a query, as we did in the first session.\n", "\n", "Get unique genres from genres_molten:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "\n", "SELECT DISTINCT genres\n", "FROM genres_molten;" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "## GROUP BY, HAVING and aggregate functions" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "Let's get the average runtime and count of movies including Godfather by decades:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "\n", "SELECT (startyear / 10) * 10 as decade, avg(runtimeminutes) as runavg, count(*)\n", " FROM title_basics\n", " WHERE originaltitle ~ 'Godfather'\n", " GROUP BY decade;" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "Now let's filter records on a condition including an aggregated field such as, filter for decades with average runtimeminutes greater than 80:\n", "\n", "Note that we can define an alias for a calculated field or a table with a long name:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "\n", "SELECT (startyear / 10) * 10 as decade, avg(runtimeminutes) as runavg, count(*)\n", " FROM title_basics\n", " WHERE originaltitle ~ 'Godfather'\n", " GROUP BY decade\n", " HAVING avg(runtimeminutes) > 80;" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "**EXERCISE 4:**\n", "\n", "- Select 5 year starting position of startyear, minimum runtimeminutes and maximum runtime minutes\n", "- From title_basics\n", "- Group by 5 year periods\n", "- And filter for groups where max runtime minutes of the group is less than 140 and the five year periods is not NULL\n", "- Order by fiveyear periods" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "\n", "-- statement here" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "psql -U postgres -d imdb1 < /dev/null)\n", "echo \"$solution\"; echo\n", "for l in \"$solution\"; do eval \"${l}\"; done" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "## JOINs, subqueries and aliases" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "Now we would like to join titles from the original Godfather trilogy last of which was screened in 1990 with id's of principal cast" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "\n", "SELECT tb.tconst, tb.originaltitle, tb.startyear,\n", " tb.runtimeminutes, tb.genres, tp.principalcast\n", " FROM title_basics tb\n", " LEFT JOIN title_principals_melt tp ON tb.tconst=tp.tconst\n", " WHERE tb.originaltitle ~ 'Godfather.*Part'\n", "\tAND tb.genres ~ '(?i)drama'\n", "\tAND NOT tb.genres ~ '(?i)comedy'\n", "\tAND tb.startyear <= 1990\n", " ORDER BY tb.runtimeminutes;" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "We could also get the same result by running a subquery\n", "\n", "A subquery is a query with a alias (a name) and executed inside another query\n", "\n", "Using the alias, the subquery results can be treated as a separate table" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "\n", "SELECT subq1.*, tp.principalcast\n", " FROM\n", "(\n", "SELECT tconst, originaltitle, startyear, runtimeminutes, genres\n", " FROM title_basics\n", " WHERE originaltitle ~ 'Godfather.*Part'\n", "\tAND genres ~ '(?i)drama'\n", "\tAND NOT genres ~ '(?i)comedy'\n", "\tAND startyear <= 1990\n", " ORDER BY runtimeminutes\n", ") subq1\n", "\n", "LEFT JOIN title_principals_melt tp ON subq1.tconst=tp.tconst;" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "You can find more information on subqueries:\n", "\n", "http://www.postgresqltutorial.com/postgresql-subquery/" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "Now let's do sth more complicated:\n", "\n", "- Starting with first option (not the subquery), let's make a three way join\n", "- First join titles and principal cast on title id's (tconst)\n", "- And then join principal cast and name basics on name id's (nconst)\n", "- Filter only for actors and actresses\n", "- And sort on first names (ascendng) then title years" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "\n", "SELECT tb.tconst, tb.originaltitle, tb.startyear, tb.runtimeminutes, tb.genres,\n", " tp.principalcast,\n", " nb.primaryname, nb.birthyear, nb.deathyear, nb.primaryprofession\n", "\n", "FROM title_basics tb\n", "LEFT JOIN title_principals_melt tp ON tb.tconst=tp.tconst\n", "\tLEFT JOIN name_basics nb ON tp.principalcast=nb.nconst\n", "\n", "WHERE tb.originaltitle ~ 'Godfather.*Part'\n", "\tAND tb.genres ~ '(?i)drama'\n", "\tAND NOT tb.genres ~ '(?i)comedy'\n", "\tAND tb.startyear <= 1990\n", "\tAND nb.primaryprofession ~'actor|actress'\n", "\n", "ORDER BY nb.primaryname, tb.startyear DESC;" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "When the column names are the same in a join, we can use USING as a shorthand:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "\n", "SELECT tb.tconst, tb.originaltitle, tb.startyear, tb.runtimeminutes, tb.genres,\n", " tp.principalcast,\n", " nb.primaryname, nb.birthyear, nb.deathyear, nb.primaryprofession\n", "\n", "FROM title_basics tb\n", "LEFT JOIN title_principals_melt tp USING (tconst)\n", "\tLEFT JOIN name_basics nb ON tp.principalcast=nb.nconst\n", "\n", "WHERE tb.originaltitle ~ 'Godfather.*Part'\n", "\tAND tb.genres ~ '(?i)drama'\n", "\tAND NOT tb.genres ~ '(?i)comedy'\n", "\tAND tb.startyear <= 1990\n", "\tAND nb.primaryprofession ~'actor|actress'\n", "\n", "ORDER BY nb.primaryname, tb.startyear DESC;" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "**EXERCISE 5:**\n", "\n", "List the living principal cast names in movies directed by Stanley Kubrick (nm0000040) in 1970's and 1980's. Report these columns:\n", "- Title of movie, staryear, genres and primaryname\n", "\n", "Use the following tables:\n", "- title_crew\n", "- title_basics\n", "- title_principals_melt\n", "- name_basics\n", "\n", "Follow this plan:\n", "- Left join title_crew to title_basics using common tconst field to combine directors with movie details\n", "- Left join to title_principals using common tconst field to combine director/movie details with pricipal cast\n", "- Left join to name_basics using the relation between principalcast and nconst fields to combine with the names of the principal cast\n", "\n", "- Filter for the director code in title_crew nm0000040\n", "- Filter for 1970's and 1980's startyear in title_basics\n", "- Filter for deathyear in name_basics being NULL\n", "\n", "Order by first startyear and then primaryname\n", "\n", "\n", "Hint:\n", "- Progress stepwise:\n", " - First you may start with joining title_crew and title_basics (filtering for directors)\n", " - And later for 1970's and 1980's\n", " - After that, you may join with title_principals_melt and see the results.\n", " - And last join with name_basics to add the names and filter for deathyear.\n", "- And You can write your statement in pgcli so that you can leverage the smart completion feature (but copy here frequently)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "\n", "-- statement here" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "psql -U postgres -d imdb1 < /dev/null)\n", "echo \"$solution\"; echo\n", "for l in \"$solution\"; do eval \"${l}\"; done" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "Unfortunately, R. Lee Ermey who starred in one of the definitive scenes in movie history (opening scene of Full Metal Jacket) and was still alive in 2017 (the data this version of the database was last updated) passed away in 2018." ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "## Insert, update and delete records" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "### Insert a record" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "Now let's say we want to add a new record to title_ratings table\n", "\n", "First let's switch to imdb1 table to leave imdb2 intact" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%sql postgres://postgres@localhost/imdb1" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "SELECT *\n", "FROM title_ratings\n", "LIMIT 5;" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "The structure of a record is as follows:\n", "\n", ">tt0000005\t6.2\t1565\n", "\n", "\n", "Suppose we want to add a record for a title as such:\n", ">tt0000000 8.2 2000" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "We can insert the record as such:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "INSERT INTO title_ratings\n", "(tconst, averagerating, numvotes)\n", "VALUES ('tt0000000', 8.2, 2000);" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "Note that we may omit the (tconst, averagerating, numvotes) since we are referring to all columns" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "Running the same code again returns an error as:\n", "```\n", "IntegrityError: duplicate key value violates unique constraint \"title_ratings_pk\"\n", "DETAIL: Key (tconst)=(tt0000000) already exists.\n", "```" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "Now let's check whether the record is added: (we will go over SELECT, FROM, ORDER BY and LIMIT command/clauses later):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "SELECT * FROM title_ratings WHERE tconst = 'tt0000000';" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "For more info on INSERT command:\n", "\n", "https://www.postgresql.org/docs/current/static/sql-insert.html" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "### Update a record" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "Now suppose we want to update the record such that averagerating is now 8.7 and numvotes is 3000:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "UPDATE title_ratings\n", "SET\n", "averagerating = 8.2,\n", "numvotes = 3000\n", "WHERE tconst = 'tt0000000';" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "And check whether the record is updated:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "SELECT *\n", "FROM title_ratings\n", "WHERE tconst = 'tt0000000';" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "For more info on the UPDATE command:\n", "\n", "https://www.postgresql.org/docs/current/static/sql-update.html" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "### Delete a record" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "Now let's delete the record we just added:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "DELETE FROM title_ratings\n", "WHERE tconst = 'tt0000000';" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "SELECT *\n", "FROM title_ratings\n", "WHERE tconst = 'tt0000000';" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "To get more info on the delete command:\n", "\n", "https://www.postgresql.org/docs/current/static/sql-delete.html" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "**EXERCISE 6:**\n", "\n", "Into imdb1 database:\n", "- Insert a record into title_ratings where tconst is tt9999999, and numvotes is 10000\n", "- Filter for the tconst and print the row\n", "- Update the same record so that averagerating is 9 and numvotes is 20000\n", "- Filter for the tconst and print the row\n", "- Delete the same record\n", "- Filter for the tconst and see that it does not exist anymore" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "psql -U postgres -d imdb1 < /dev/null)\n", "echo \"$solution\"; echo\n", "for l in \"$solution\"; do eval \"${l}\"; done" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "## Ending exercise" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Now use your imagination and using any of the tables in imdb2 (you can get the details with \\dt+ and \\d+ public.* from psql/pgcli), create any interesting query of yourself using (single or multiple) joins, (single or multiple) where, group by, having, order, distinct or limit clauses. You can also use subqueries.\n", "\n", "But you should first explain in clear English, what your query is meant to do. What is the purpose of your query?\n", "\n", "Save your query for future reference. You may need it for a compulsory homework!\n", "\n", "Feel free ..." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "SoS", "language": "sos", "name": "sos" }, "language_info": { "codemirror_mode": "sos", "file_extension": ".sos", "mimetype": "text/x-sos", "name": "sos", "nbconvert_exporter": "sos_notebook.converter.SoS_Exporter", "pygments_lexer": "sos" }, "sos": { "kernels": [], "panel": { "displayed": true, "height": 0, "style": "side" }, "version": "0.19.18" }, "toc": { "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "toc_cell": false, "toc_position": {}, "toc_section_display": "block", "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 2 }