{ "cells": [ { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "# A breathing space: Review of basic shell tools and SQL\n", "\n", "**BY SERHAT ÇEVİKEL**" ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "This week, we will take a deep breath and review what we have learnt so far: basic shell tools and SQL" ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "## Before we start" ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "Please run the following line so that the postgresql server is ready when we get to the SQL section\n", "\n", "Wait for a minute and rerun on error message" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "sudo service postgresql start" ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "## Jupyter kernels" ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "A Jupyter notebook is mainly composed of two types of cells:\n", "\n", "- Informative \"markdown\" cells,\n", "- Executable \"code\" cells\n", "\n", "Each notebook is used with a kernel - a connector to the interpreter of a programming language such as Python3, R, bash etc.\n", "\n", "A notebook running on a Python3 kernel may also work with \"magics\" - a feature enabling executing a code of different language such as SQL as such:\n", "\n", "```Python\n", "%% sql\n", "```\n" ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "### SoS kernel" ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "SoS - stands for \"\"Script of Scripts\" - is a special kind of Jupyter kernel in that, each cell can be run with a separate kernel installed in the working environment. In our examples we use SoS to run bash, R and SQL codes from within the same notebook.\n", "\n", "This kind of a kernel is called as \"polyglot\" meaning multi language\n", "\n", "The dropdown menu on the right determines the kind of interpreter/programming language that the code will be executed with" ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "## Shell tools" ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "### Environment variables" ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "Let's declare the environment variable named \"datadir\" again for accessing the path of our data directory more easily:\n", "\n", "Note that ~ is a shorthand for $HOME or the home directory or the current user:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "datadir=~/data" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "And we declare another environment variable based on $datadir: imdbdir" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "imdbdir=$datadir/imdb" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "### Get help" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "You can get help on any command with any of the three:\n", "\n", "whatis: Most concise\n", "man: Most verbose\n", "tldr: Best of both worlds. Lists any most common usages\n", "\n", "Our cheatsheet for major commands is here:\n", "https://docs.google.com/spreadsheets/d/1TflHe1LaIA154GJBZOUV_CveJYvJDGf5khjdzp_tgcU\n" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "### echo" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Echo either:\n", "\n", "- prints a value to the console or standard output (stdout)\n", "- prints the contents of a variable to the console or stdout" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "echo \"Hello World\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "echo ~" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "echo $datadir\n", "echo $imdbdir" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "### ls" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "lists the file contents of a path " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "ls -l $imdbdir" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "ls -l $imdbdir/tsv" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "### glob \"*\"" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Generalizes actions by pattern matching:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "ls -l $imdbdir/tsv/*basics*" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "### redirect operator \">\", \">>\"" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Redirects the output of a former command into a file" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "echo \"Hello World\" >> ~/helloworld" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "ls -l ~/helloworld" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "### cat" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Prints the contents of a file or files" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "cat ~/helloworld" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "### wc" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Returns the line, word, character or byte count of standard input" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "wc -l ~/helloworld" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "### pipe operator \"|\"" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Connects the output of a former command into the input of a latter command. Can be chained" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "cat ~/helloworld | wc -l" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "### wc" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Returns the line, word, character or byte count of standard input" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "### mkdir" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Creates new directories" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "mkdir $imdbdir/deletethis" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "ls -l $imdbdir" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "### rm" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Remove files or directories\n", "Use with recursive (-r) option to delete directories, even if they are empty" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "rm -r $imdbdir/deletethis" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "ls -l $imdbdir" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "### cp" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Copies files and directories from a path to another path" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "cp ~/helloworld ~/helloworld1" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "cat ~/helloworld1" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "### gzip, gunzip" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Zip and unzip files/directories" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "gzip ~/helloworld" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "ls -l ~/helloworld*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "gunzip ~/helloworld.gz" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "ls -l ~/helloworld*" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "### zcat" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Print the contents of a zipped file without the need to unzip" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "gzip ~/helloworld\n", "zcat ~/helloworld.gz\n", "gunzip ~/helloworld.gz" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "### less, zless" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Pager for viewing and searching inside large files on the terminal (can work in a terminal window)\n", "\n", "zless does the same thing for zipped files without the need to unzip" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "### screen" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Terminal multiplexer. Enables multiple terminal windows to work on simultaneously on remote connections without a need to make a separate connection for each window (can work in a terminal window)" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "### head" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Print initial n lines of a file or standard input (stdin)\n", "\n", "Can also be used in order to delete last n lines of a file or stdin" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "zcat $imdbdir/tsv/name.basics.tsv.gz | head -10" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "### tail" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Print last n lines of a file or standard input (stdin)\n", "\n", "Can also be used in order to delete initial n lines of a file or stdin" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "zcat $imdbdir/tsv/name.basics.tsv.gz | tail -10" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "### find" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Find files and directories recursively based on many options and optionally execute a command on each result" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "find $imdbdir" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "### numfmt" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Pretty print number formats" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "echo 1000000000 | numfmt --to=si" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "### yes" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Print a term continuously until terminated\n", "\n", "Can be used for non-interactive inputting answers to some interactive commands (like deleting directories etc)" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "### for loop" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Repeat the same action on multiple inputs" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "for i in $imdbdir/*;\n", "do\n", " ls -l $i\n", "done" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "### while loop" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Repeats an action as long as a logical condition holds true " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "i=0\n", "\n", "while [ $i -lt 10 ]; # means while i is less than 10\n", "do\n", " echo $i;\n", " i=$(( $i + 1 )) # that is for an arithmetic operation for incrementing\n", "done" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "### if.. else" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "To test a condition and make an action on the result" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "if [[ 1 == 1 ]];\n", "then\n", " echo \"they are equal\";\n", "else\n", " echo \"they are not equal\";\n", "fi" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "if [[ 1 == 2 ]];\n", "then\n", " echo \"they are equal\";\n", "else\n", " echo \"they are not equal\";\n", "fi" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "### xargs" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Mimicks a for loop: Repeat the same action on multiple inputs, sequentially" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "# find only directories and list content of each\n", "find $imdbdir -type d | \\\n", "xargs -i ls -l \"{}\"" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "### parallel" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Similar to xargs, put operates on multiple inputs parallelly" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "find $imdbdir -type d | \\\n", "parallel -j0 ls -l \"{}\"" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "### column" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Format input into well defined columns" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "cat $imdbdir/tsv2/name.basics.tsv | head -10 | column -t" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "### seq" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Create a sequence of numbers" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "seq 5 1 15" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "### printf" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Similar to echo but you can control the output format better" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "# create a sequence from 5 to 15\n", "# print each number as two digits (with leading zeros) and put a newline between each\n", "\n", "seq 5 1 15 | xargs printf \"%02d\\n\"" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "### grep" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Simple pattern match" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "seq 100 | grep 1" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "### sed" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Stream editor. Can do many things on textual and line orinted data as substitution, line printing, append, delete, insert, etc\n", "\n", "Most basic usage is the find a pattern and substitute with something else" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "seq 20 | sed 's/1/a/g' # find 1's and change into \"a\" globally (all instances)" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "### tr" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Translate a character to something else, squeeze repeating characters into a single instance or completely delete a character\n", "\n", "Simpler version of sed substitute" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "seq 10 1 15 | tr -s \"1\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "seq 10 1 15 | tr -d \"1\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "seq 10 1 15 | tr \"1\" \"a\"" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "### awk" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "A better grep and sed for column separated data that can work on each column" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "# print the third column separated with tabs\n", "\n", "zcat $imdbdir/tsv/name.basics.tsv.gz | head -10 | awk -F \"\\t\" '{ print $3 }'" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "# print all columns for rows third column of which is equal to 1899\n", "\n", "zcat $imdbdir/tsv/name.basics.tsv.gz |\\\n", " head -10 |\\\n", " awk -F \"\\t\" '$3 == 1899 { print $0 }'" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "### sort" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Sort input numerically or as character" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "# sort as character\n", "\n", "seq 5 1 15 | sort" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "# reverse sort numerically\n", "\n", "seq 5 1 15 | sort -nr" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "### uniq" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Get unique values of a sorted input (optionally with count values" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "# in the sequence to 20, replace 2's and 3's with one and get the counts of unique values\n", "\n", "\n", "seq 20 | tr \"2\" \"1\" | tr \"3\" \"1\" | sort -n | uniq -c" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "### pr" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Pretty print input with some format options such as number of columns" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "seq 100 | pr -5 -t" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "### gnuplot" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Simple plots on the terminal" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "# print two series 1:10 and 11:20\n", "\n", "seq 20 | pr -2 -t" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "# plot those two series\n", "\n", "seq 20 |\\\n", " pr -2 -t |\\\n", " gnuplot -e \"set terminal dumb; plot '-' using 2:1 w points pt '*'\"" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "### ps" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "get info on running processes" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "ps aux | grep jupyter" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "### hostname" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Get info on hostname of the system" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "hostname\n", "hostname -I" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "### du" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Get recursive disk usage of files and directories" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "du -sh $imdbdir" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "### df" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Get info on mounted file systems" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "df -h" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "## SQL" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Check whether SQL service is running by responding to the below command:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "psql -U postgres -c \"\\l\"" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "### meta-commands of psql/pgcli" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "psql and pgcli clients offer some meta-commands that start with \"\\\" and that offer shortcuts to some common operations\n", "\n", "You can get help on that using \"\\?\" - a meta-command itself" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "psql -U postgres -c \"\\?\"" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "\"\\l\" lists databases:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "psql -U postgres -c \"\\l\"" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "\"\\c\" connects to database:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "psql -U postgres -c \"\\c imdb2\"" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "\"\\dt\" lists tables in a database:\n", "\"\\dt+\" gives more detail" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "# connect to imdb2 database as postgres user and list details of all tables in the database\n", "\n", "psql -U postgres -d imdb2 -c \"\\dt+\"" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "\"\\d\" lists fields in a table\n", "\n", "\"\\d+\" gives more detail:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "# connect to imdb2 database as postgres user and list details of columns of title_basics table in the database\n", "\n", "psql -U postgres -d imdb2 -c \"\\d+ title_basics\"" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "To get detailed information on all tables and primary keys in the public schema (like a named directory of databases, tables and keys)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "# connect to imdb2 database as postgres user and list details of columns of all tables and keys in the database\n", "\n", "psql -U postgres -d imdb2 -c \"\\d+ public.*\"" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "You can go back to this output frequently to view a \"map\" of the database: tables and columns\n", "\n", "not that \"not null\" fields collectively show the primary key of a table" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "### tables and fields of imdb2 database" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Above command is repeated here in a separate section for easy navigation.\n", "\n", "You may need to returnback to this output while writing queries\n", "\n", "You can view the same output from the imdb_database text file on the home directory" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "# connect to imdb2 database as postgres user and list details of columns of all tables and keys in the database\n", "\n", "psql -U postgres -d imdb2 -c \"\\d+ public.*\"" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "### pgcli" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Due to its syntax highlighting and smart completion features, it is easier to write an sql statement on pgcli" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "First we open a new terminal window:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "last=$(( $(ps aux | awk -F \" \" '$7 ~ /pts/ { print $7 }' | grep -Po \"\\d+\" | sort -nu | tail -1) + 1 ))\n", "echo $last" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "echo \"Follow the working one of the following links to open a terminal: (that matches the domain of the URL above)\"\n", "echo \"https://hub.gke.mybinder.org/user/serhatcevikel-bdm_2019-$(hostname | grep -Po '(?<=2019-2d).*?$')/terminals/$last\"\n", "echo \"https://hub-binder.mybinder.ovh/user/serhatcevikel-bdm_2019-$(hostname | grep -Po '(?<=2019-2d).*?$')/terminals/$last\"\n", "echo \"https://notebooks.gesis.org/binder/jupyter/user/serhatcevikel-bdm_2019-$(hostname | grep -Po '(?<=2019-2d).*?$')/terminals/$last\"" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Now please copy and paste the below command inside that terminal:\n", "\n", "```Bash\n", "screen -S 1\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "scr=$(screen -ls | grep -P \"Attached\" | head -1 | grep -Po \"^\\t+.+?(?=\\s|\\t)\" | tr -d \"\\t\")\n", "screen -S $scr -X stuff \"pgcli -U postgres\\n\"" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "### Sending sql queries to postgresql server through Python3 kernel" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "We will use the Python3 kernel with sql magic (remember, writing a code of different language from a kernel)\n", "\n", "Note that the kernel selection on the right is Python3 now, not bash above:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%load_ext sql" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%sql postgres://postgres@localhost/imdb2" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "### start with a basic select query on title_ratings table" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "Remember the fields of title_ratings table:\n", "\n", "```\n", " Table \"public.title_ratings\"\n", " Column | Type | Collation | Nullable | Default | Storage | Stats target | Description \n", "---------------+---------+-----------+----------+---------+----------+--------------+-------------\n", " tconst | text | | not null | | extended | | \n", " averagerating | numeric | | | | main | | \n", " numvotes | integer | | | | plain | | \n", "Indexes:\n", " \"tconst_pk_04\" PRIMARY KEY, btree (tconst)\n", "```" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "Now let's return the initial 10 rows of the title_ratings table, selecting onlu averagerating and numvotes columns:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "\n", "SELECT averagerating, numvotes\n", " FROM title_ratings\n", " LIMIT 10;" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "Now select only those titles with numvotes above 1,000,000 (all columns using glob \"\\*\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "\n", "SELECT *\n", " FROM title_ratings\n", " WHERE numvotes > 1000000;" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "Now select only those titles with a rating above or equal to 9.0. Limit to first 20 rows" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "\n", "SELECT *\n", " FROM title_ratings\n", " WHERE averagerating > 9\n", " LIMIT 20;" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "The results are not so informative with numvotes as low as 5,6, etc\n", "\n", "Now your first exercise. You can reuse the SQL statements from the previous week:" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "**EXERCISE 1:**\n", "\n", "Select those rows from title_ratings table\n", "- where average rating is above 8.8 and number of votes is above 500,000 (do not use equality)\n", "- sort by decreasing order of averagerating first and then decreasing order of numvotes\n", "\n", "Note: You can try below cells or pgcli from the terminal to test your commands\n", "\n", "The output should be:" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "```\n", "tconst | averagerating | numvotes \n", "-----------+---------------+----------\n", " tt0944947 | 9.5 | 1238174\n", " tt0903747 | 9.5 | 1021165\n", " tt0111161 | 9.3 | 1858757\n", " tt0068646 | 9.2 | 1269042\n", " tt1475582 | 9.2 | 601611\n", " tt0468569 | 9.0 | 1837035\n", " tt0071562 | 9.0 | 874506\n", " tt0110912 | 8.9 | 1454537\n", " tt0167260 | 8.9 | 1330244\n", " tt0108052 | 8.9 | 954632\n", " tt0108778 | 8.9 | 551546\n", " tt0060196 | 8.9 | 551112\n", " tt0050083 | 8.9 | 507580\n", "(13 rows)\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "\n", "-- statement here" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "psql -U postgres -d imdb2 < /dev/null)\n", "echo \"$solution\"; echo\n", "for l in \"$solution\"; do eval \"${l}\"; done" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "This is OK, but we don't have meaningful information unless we combine the results with other tables in the database" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "### Joins" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "ttXXXXXXX values do not mean anything.\n", "\n", "We know that details of movies reside in title_basics table.\n", "\n", "Let's recite the fields of title_basics and title_ratings table:" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "```\n", " Table \"public.title_ratings\"\n", " Column | Type | Collation | Nullable | Default | Storage | Stats target | Description \n", "---------------+---------+-----------+----------+---------+----------+--------------+-------------\n", " tconst | text | | not null | | extended | | \n", " averagerating | numeric | | | | main | | \n", " numvotes | integer | | | | plain | | \n", "Indexes:\n", " \"tconst_pk_04\" PRIMARY KEY, btree (tconst)\n", "\n", " Table \"public.title_basics\"\n", " Column | Type | Collation | Nullable | Default | Storage | Stats target | Description \n", "----------------+---------+-----------+----------+---------+----------+--------------+-------------\n", " tconst | text | | not null | | extended | | \n", " titletype | text | | | | extended | | \n", " primarytitle | text | | | | extended | | \n", " originaltitle | text | | | | extended | | \n", " isadult | boolean | | | | plain | | \n", " startyear | integer | | | | plain | | \n", " endyear | integer | | | | plain | | \n", " runtimeminutes | integer | | | | plain | | \n", " genres | text | | | | extended | | \n", "Indexes:\n", " \"tconst_pk\" PRIMARY KEY, btree (tconst)\n", "```" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "We see that tconst is the common field between two tables, we can join on that.\n", "\n", "Now let's return the titletype, primarytitle, startyear and genres information from title_basics along with all fields from title_ratings for rows filtered in the previous query\n", "\n", "We will use aliases tr and tb for the tables" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "\n", "SELECT tr.*, tb.titletype, tb.primarytitle, tb.startyear, tb.genres\n", "FROM title_ratings tr\n", " LEFT JOIN title_basics tb ON tr.tconst = tb.tconst\n", "WHERE tr.averagerating > 8.8\n", " AND tr.numvotes > 500000\n", "ORDER BY tr.averagerating DESC, tr.numvotes DESC;" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Now, let's say we are not concerned with tvSeries and exclude them for our search.\n", "\n", "Note that when the column names are the same for both sides, we can use the shorthand \"USING\" " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "\n", "SELECT tr.*, tb.titletype, tb.primarytitle, tb.startyear, tb.genres\n", " FROM title_ratings tr\n", " LEFT JOIN title_basics tb USING (tconst)\n", " WHERE tr.averagerating > 8.8\n", " AND tr.numvotes > 500000\n", " AND NOT tb.titletype = 'tvSeries'\n", " ORDER BY tr.averagerating DESC, tr.numvotes DESC;" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Now your second task:" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "**EXERCISE 2:**\n", "\n", "- Filter for those rows from title_basics and THEN title_ratings where genres include Horror, runtimeminutes is below 10 minutes and startyear is 2017 and averagerating is above 8\n", "- So you should join title_basics to title_ratings from LEFT (inverse of the previous example where we joined title_ratings into title_basics from left)\n", "- Return only titletype and primarytitle from title_basics and averagerating from title_ratings\n", "- Order by increasing averagerating\n", "\n", "Output should be:" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "```\n", "averagerating | titletype | primarytitle \n", "---------------+-----------+-------------------------------------------\n", " 8.1 | tvSeries | Gojira awakening\n", " 8.2 | short | The Hunt\n", " 8.2 | video | Leningrad: Kolshik\n", " 8.3 | short | One Silent Man\n", " 8.5 | short | Room for One More\n", " 8.5 | short | Wyrmwood: Chronicles of the Dead - Teaser\n", " 8.6 | short | Guardians\n", " 8.6 | short | House That Took My Soul\n", " 8.8 | short | Body Image\n", " 8.9 | short | A La Dolce Vita\n", " 9.0 | short | Feeling\n", " 9.2 | short | Get Away\n", " 9.2 | short | Fresh Blood\n", " 9.2 | short | The Rage\n", " 9.7 | short | Tarot\n", " 9.9 | short | Don't Open the Door!\n", " 10.0 | short | Office Terror\n", " 10.0 | short | Elise's Nightmare\n", " 10.0 | short | Darkness\n", "(19 rows)\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "\n", "-- statement here" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "psql -U postgres -d imdb2 < /dev/null)\n", "echo \"$solution\"; echo\n", "for l in \"$solution\"; do eval \"${l}\"; done" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "Of course most of them are short movies" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Now let's join a third table to get the principalcast of those movies by adding title_principals_melt table\n", "\n", "Let's return only primarytitle and principalcast fields and order by primarytitle:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "\n", "SELECT tb.primarytitle, tp.principalcast\n", "FROM title_basics tb\n", " LEFT JOIN title_ratings tr USING (tconst)\n", " LEFT JOIN title_principals_melt tp USING (tconst)\n", "WHERE tr.averagerating > 8.8\n", " AND tr.numvotes > 500000\n", " AND NOT tb.titletype = 'tvSeries'\n", "ORDER BY tb.primarytitle;" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Fairly good. But those nmXXXXXXX values do not mean nothing.\n", "\n", "Let's join these tables into name_basics to get the names of the cast\n", "\n", "Remember the structure of name_basics:\n", "\n", "```\n", " Column | Type | Collation | Nullable | Default | Storage | Stats target | Description \n", "-------------------+---------+-----------+----------+---------+----------+--------------+-------------\n", " nconst | text | | not null | | extended | | \n", " primaryname | text | | | | extended | | \n", " birthyear | integer | | | | plain | | \n", " deathyear | integer | | | | plain | | \n", " primaryprofession | text | | | | extended | | \n", " knownfortitles | text | | | | extended | | \n", " ```" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "**EXERCISE 3:**\n", "\n", "Join name_basics table to the previous query.\n", "\n", "Note that the field names for matching nmXXXXXXX values are not the same between title_principals_melt and name_basics tables. So use \"ON\" clause as before:\n", "\n", "Report primarytitle and startyear from tb, primaryname and birthyear from name_basics and averagerating from title_ratings.\n", "\n", "Order by primary name first and primarytitle later\n", "\n", "Output should be:" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "```\n", " primarytitle | startyear | primaryname | birthyear | averagerating \n", "-----------------------------------------------+-----------+-----------------------+-----------+---------------\n", " The Dark Knight | 2008 | Aaron Eckhart | 1968 | 9.0\n", " The Good, the Bad and the Ugly | 1966 | Agenore Incrocci | 1919 | 8.9\n", " The Good, the Bad and the Ugly | 1966 | Alberto Grimaldi | 1925 | 8.9\n", " The Godfather | 1972 | Albert S. Ruddy | 1930 | 9.2\n", " The Good, the Bad and the Ugly | 1966 | Aldo Giuffrè | 1924 | 8.9\n", " The Godfather | 1972 | Al Pacino | 1940 | 9.2\n", " The Godfather: Part II | 1974 | Al Pacino | 1940 | 9.0\n", " Pulp Fiction | 1994 | Andrzej Sekula | 1954 | 8.9\n", " The Lord of the Rings: The Return of the King | 2003 | Barrie M. Osborne | 1944 | 8.9\n", " The Godfather: Part II | 1974 | Barry Malkin | 1938 | 9.0\n", " Schindler's List | 1993 | Ben Kingsley | 1943 | 8.9\n", " The Shawshank Redemption | 1994 | Bob Gunton | 1945 | 9.3\n", " The Dark Knight | 2008 | Bob Kane | 1915 | 9.0\n", " 12 Angry Men | 1957 | Boris Kaufman | 1897 | 8.9\n", " Schindler's List | 1993 | Branko Lustig | 1932 | 8.9\n", " Pulp Fiction | 1994 | Bruce Willis | 1955 | 8.9\n", " 12 Angry Men | 1957 | Carl Lerner | 1912 | 8.9\n", " Schindler's List | 1993 | Caroline Goodall | 1959 | 8.9\n", " The Dark Knight | 2008 | Charles Roven | 1949 | 9.0\n", " The Dark Knight | 2008 | Christian Bale | 1974 | 9.0\n", " The Dark Knight | 2008 | Christopher Nolan | 1970 | 9.0\n", " The Good, the Bad and the Ugly | 1966 | Clint Eastwood | 1930 | 8.9\n", " The Dark Knight | 2008 | David S. Goyer | 1965 | 9.0\n", " Pulp Fiction | 1994 | David Wasco | | 8.9\n", " The Godfather | 1972 | Diane Keaton | 1946 | 9.2\n", " The Godfather: Part II | 1974 | Diane Keaton | 1946 | 9.0\n", " The Lord of the Rings: The Return of the King | 2003 | Elijah Wood | 1981 | 8.9\n", " The Good, the Bad and the Ugly | 1966 | Eli Wallach | 1915 | 8.9\n", " The Godfather | 1972 | Francis Ford Coppola | 1939 | 9.2\n", " The Godfather: Part II | 1974 | Francis Ford Coppola | 1939 | 9.0\n", " The Shawshank Redemption | 1994 | Frank Darabont | 1959 | 9.3\n", " The Lord of the Rings: The Return of the King | 2003 | Fran Walsh | 1959 | 8.9\n", " The Good, the Bad and the Ugly | 1966 | Furio Scarpelli | 1919 | 8.9\n", " Schindler's List | 1993 | Gerald R. Molen | 1935 | 8.9\n", " The Godfather | 1972 | Gordon Willis | 1931 | 9.2\n", " The Godfather: Part II | 1974 | Gordon Willis | 1931 | 9.0\n", " The Dark Knight | 2008 | Heath Ledger | 1979 | 9.0\n", " 12 Angry Men | 1957 | Henry Fonda | 1905 | 8.9\n", " The Lord of the Rings: The Return of the King | 2003 | Howard Shore | 1946 | 8.9\n", " The Lord of the Rings: The Return of the King | 2003 | Ian McKellen | 1939 | 8.9\n", " The Godfather | 1972 | James Caan | 1940 | 9.2\n", " 12 Angry Men | 1957 | John Fiedler | 1925 | 8.9\n", " Pulp Fiction | 1994 | John Travolta | 1954 | 8.9\n", " Schindler's List | 1993 | John Williams | 1932 | 8.9\n", " The Dark Knight | 2008 | Jonathan Nolan | 1976 | 9.0\n", " The Lord of the Rings: The Return of the King | 2003 | J.R.R. Tolkien | 1892 | 8.9\n", " 12 Angry Men | 1957 | Kenyon Hopkins | 1912 | 8.9\n", " Pulp Fiction | 1994 | Lawrence Bender | 1957 | 8.9\n", " 12 Angry Men | 1957 | Lee J. Cobb | 1911 | 8.9\n", " The Good, the Bad and the Ugly | 1966 | Lee Van Cleef | 1925 | 8.9\n", " Schindler's List | 1993 | Liam Neeson | 1952 | 8.9\n", " The Dark Knight | 2008 | Lorne Orleans | | 9.0\n", " The Good, the Bad and the Ugly | 1966 | Luciano Vincenzoni | 1926 | 8.9\n", " The Godfather | 1972 | Mario Puzo | 1920 | 9.2\n", " The Godfather: Part II | 1974 | Mario Puzo | 1920 | 9.0\n", " The Godfather | 1972 | Marlon Brando | 1924 | 9.2\n", " 12 Angry Men | 1957 | Martin Balsam | 1919 | 8.9\n", " The Dark Knight | 2008 | Michael Caine | 1933 | 9.0\n", " The Good, the Bad and the Ugly | 1966 | Mickey Knox | 1921 | 8.9\n", " The Shawshank Redemption | 1994 | Morgan Freeman | 1937 | 9.3\n", " The Shawshank Redemption | 1994 | Niki Marvin | | 9.3\n", " The Godfather | 1972 | Nino Rota | 1911 | 9.2\n", " The Godfather: Part II | 1974 | Nino Rota | 1911 | 9.0\n", " The Lord of the Rings: The Return of the King | 2003 | Orlando Bloom | 1977 | 8.9\n", " The Lord of the Rings: The Return of the King | 2003 | Peter Jackson | 1961 | 8.9\n", " The Lord of the Rings: The Return of the King | 2003 | Philippa Boyens | | 8.9\n", " Pulp Fiction | 1994 | Quentin Tarantino | 1963 | 8.9\n", " Schindler's List | 1993 | Ralph Fiennes | 1962 | 8.9\n", " 12 Angry Men | 1957 | Reginald Rose | 1920 | 8.9\n", " The Shawshank Redemption | 1994 | Richard Francis-Bruce | 1948 | 9.3\n", " The Godfather: Part II | 1974 | Richard Marks | 1943 | 9.0\n", " The Godfather: Part II | 1974 | Robert De Niro | 1943 | 9.0\n", " The Godfather: Part II | 1974 | Robert Duvall | 1931 | 9.0\n", " Pulp Fiction | 1994 | Roger Avary | 1965 | 8.9\n", " The Shawshank Redemption | 1994 | Roger Deakins | 1949 | 9.3\n", " Pulp Fiction | 1994 | Sally Menke | 1953 | 8.9\n", " Pulp Fiction | 1994 | Samuel L. Jackson | 1948 | 8.9\n", " The Good, the Bad and the Ugly | 1966 | Sergio Leone | 1929 | 8.9\n", " 12 Angry Men | 1957 | Sidney Lumet | 1924 | 8.9\n", " The Shawshank Redemption | 1994 | Stephen King | 1947 | 9.3\n", " Schindler's List | 1993 | Steven Spielberg | 1946 | 8.9\n", " Schindler's List | 1993 | Steven Zaillian | 1953 | 8.9\n", " Schindler's List | 1993 | Thomas Keneally | 1935 | 8.9\n", " The Shawshank Redemption | 1994 | Thomas Newman | 1955 | 9.3\n", " The Shawshank Redemption | 1994 | Tim Robbins | 1958 | 9.3\n", " Pulp Fiction | 1994 | Uma Thurman | 1970 | 8.9\n", " The Lord of the Rings: The Return of the King | 2003 | Viggo Mortensen | 1958 | 8.9\n", " The Godfather | 1972 | William Reynolds | 1910 | 9.2\n", " The Shawshank Redemption | 1994 | William Sadler | 1950 | 9.3\n", "(89 rows)\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "\n", "-- statement here" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "psql -U postgres -d imdb2 < /dev/null)\n", "echo \"$solution\"; echo\n", "for l in \"$solution\"; do eval \"${l}\"; done" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "### DISCTINCT" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "In the previous query, we had 89 rows returned. But there are some cases where a cast appeared in more than one movies.\n", "\n", "I might be interested in unique values" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Let's get the values of primaryname from the previous query (with duplicate returns):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "\n", "SELECT nb.primaryname\n", "FROM title_basics tb\n", " LEFT JOIN title_ratings tr USING (tconst)\n", " LEFT JOIN title_principals_melt tp USING (tconst)\n", " LEFT JOIN name_basics nb ON nb.nconst=tp.principalcast\n", "WHERE tr.averagerating > 8.8\n", " AND tr.numvotes > 500000\n", " AND NOT tb.titletype = 'tvSeries'\n", "ORDER BY nb.primaryname;" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "\n", "SELECT DISTINCT nb.primaryname\n", "FROM title_basics tb\n", " LEFT JOIN title_ratings tr USING (tconst)\n", " LEFT JOIN title_principals_melt tp USING (tconst)\n", " LEFT JOIN name_basics nb ON nb.nconst=tp.principalcast\n", "WHERE tr.averagerating > 8.8\n", " AND tr.numvotes > 500000\n", " AND NOT tb.titletype = 'tvSeries'\n", "ORDER BY nb.primaryname;" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "We have 83 rows returned" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "### GROUP BY" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Now I want to summarize the returned rows from the exercise 3 a little bit.\n", "\n", "I want to get the counts of each primarycast in the results along with their names:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "\n", "SELECT nb.primaryname, count(nb.primaryname)\n", "FROM title_basics tb\n", " LEFT JOIN title_ratings tr USING (tconst)\n", " LEFT JOIN title_principals_melt tp USING (tconst)\n", " LEFT JOIN name_basics nb ON nb.nconst=tp.principalcast\n", "WHERE tr.averagerating > 8.8\n", " AND tr.numvotes > 500000\n", " AND NOT tb.titletype = 'tvSeries'\n", "GROUP BY nb.primaryname;" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "See that some of the cast from the initial two installments of the Godfather Trilogy appeared in both films" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "**EXERCISE 4:**\n", "\n", "Report the primarytitles and the count of primary cast from each titles (number of rows a primarytitle appears in the results)\n", "\n", "Output should be as such:" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "```\n", " primarytitle | count \n", "-----------------------------------------------+-------\n", " 12 Angry Men | 9\n", " Pulp Fiction | 10\n", " Schindler's List | 10\n", " The Dark Knight | 10\n", " The Godfather | 10\n", " The Godfather: Part II | 10\n", " The Good, the Bad and the Ugly | 10\n", " The Lord of the Rings: The Return of the King | 10\n", " The Shawshank Redemption | 10\n", "(9 rows)\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "\n", "-- statement here" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "psql -U postgres -d imdb2 < /dev/null)\n", "echo \"$solution\"; echo\n", "for l in \"$solution\"; do eval \"${l}\"; done" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "### HAVING" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "When we use conditionals after where statement, the ROWS are filtered before they are grouped\n", "\n", "What if we want to filter the GROUPED ROWS?\n", "\n", "We will use the having statement:" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Let's filter for those movies, for which no principal cast is born before 1930" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "\n", "SELECT tb.primarytitle, min(nb.birthyear)\n", "FROM title_basics tb\n", " LEFT JOIN title_ratings tr USING (tconst)\n", " LEFT JOIN title_principals_melt tp USING (tconst)\n", " LEFT JOIN name_basics nb ON nb.nconst=tp.principalcast\n", "WHERE tr.averagerating > 8.8\n", " AND tr.numvotes > 500000\n", " AND NOT tb.titletype = 'tvSeries'\n", "GROUP BY tb.primarytitle\n", "HAVING min(nb.birthyear) > 1930;" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Whom do you think ist the oldest cast in Shawshank Redemption?\n", "\n", "We will get back to this!" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "**EXERCISE 5:**\n", "\n", "Using the first query in \"GROUP BY\" section (that returns the count of appearances of each cast in selected movies) and a HAVING clause, filter for those cast who appear more than once\n", "\n", "Output should be:" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "```\n", " primaryname | count \n", "----------------------+-------\n", " Al Pacino | 2\n", " Diane Keaton | 2\n", " Francis Ford Coppola | 2\n", " Gordon Willis | 2\n", " Mario Puzo | 2\n", " Nino Rota | 2\n", "(6 rows)\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "\n", "-- statement here" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "psql -U postgres -d imdb2 < /dev/null)\n", "echo \"$solution\"; echo\n", "for l in \"$solution\"; do eval \"${l}\"; done\n" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "### Subqueries" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "Now, we may want to see the oldest cast in each of those selected movies.\n", "\n", "Note that cast does not only include actors/actresses but also writers, directors, directors of photography and even composers\n", "\n", "This cannot be implemented in a simple query combining only existing tables easily:\n", "\n", "- We may have a separete query that includes the minimum birthyear for each movie (as we did above) and name the output of this query so that we can use it as if it is a separate table (such as mb for \"minimum birthyear\")\n", "\n", "- The output will have primarytitle, minbirthyear and tconst\n", "\n", "- And we can join this new table to title_basics, title_principals_melt and name_basics as we did before to filter for rows where birthyear of a cast is equal to the minbirthyear" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "\n", "SELECT mb.primarytitle, nb.birthyear, mb.minbirthyear, nb.primaryname\n", " FROM\n", "\n", "(\n", "-- this is a subquery to get the minimum birthyear for each title\n", "SELECT tb.primarytitle, min(nb.birthyear) as minbirthyear, tb.tconst \n", "FROM title_basics tb\n", " LEFT JOIN title_ratings tr USING (tconst)\n", " LEFT JOIN title_principals_melt tp USING (tconst)\n", " LEFT JOIN name_basics nb ON nb.nconst=tp.principalcast\n", "WHERE tr.averagerating > 8.8\n", " AND tr.numvotes > 500000\n", " AND NOT tb.titletype = 'tvSeries'\n", "GROUP BY tb.primarytitle, tb.tconst\n", ") mb\n", "\n", "-- we refer to the output of this query as if it is a separate table called \"mb\"\n", "-- and use the same joins above \n", " LEFT JOIN title_basics tb USING (tconst)\n", " LEFT JOIN title_principals_melt tp USING (tconst)\n", " LEFT JOIN name_basics nb ON nb.nconst=tp.principalcast\n", "\n", "-- and filter for the equality of the birthyear of each cast to the minbirthyear of the title\n", "WHERE nb.birthyear = mb.minbirthyear" ] }, { "cell_type": "markdown", "metadata": { "kernel": "bash" }, "source": [ "So we are sure now whom the oldest cast in The Shawshank Redemption is:\n", "\n", "The Turkish dubbed version of this famous scene is cited here to pay a tribute to great Turkish voice over actor Nur Subaşı who deceased recently:\n", "\n", "https://www.youtube.com/watch?v=vJ2NJPYOfjg" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "And again I want to pay tribute to another great Turkish voice over actor Payidar Tüfekçioğlu who also dubbed the lines of the same great actor and deceased recently:\n", "\n", "https://www.youtube.com/watch?v=g0gAAHDko8w" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "Now while you are watching these performance, your last exercise comes along:" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "**EXERCISE 6:**\n", "\n", "Start from the previous query:\n", "\n", "- Now instead of minbirthyear, calculate the maxbirtyear\n", "\n", "- Report primarytitle, primaryname, birthyear of cast, startyear of movie and age of the youngest cast at the time of the movie (just the difference of previous two columns aliased as \"age\"\n", "\n", "The output should be:" ] }, { "cell_type": "markdown", "metadata": { "kernel": "python3" }, "source": [ "```\n", " primarytitle | birthyear | startyear | age | primaryname \n", "-----------------------------------------------+-----------+-----------+-----+----------------\n", " 12 Angry Men | 1925 | 1957 | 32 | John Fiedler\n", " The Good, the Bad and the Ugly | 1930 | 1966 | 36 | Clint Eastwood\n", " The Godfather | 1946 | 1972 | 26 | Diane Keaton\n", " The Godfather: Part II | 1946 | 1974 | 28 | Diane Keaton\n", " Schindler's List | 1962 | 1993 | 31 | Ralph Fiennes\n", " Pulp Fiction | 1970 | 1994 | 24 | Uma Thurman\n", " The Shawshank Redemption | 1959 | 1994 | 35 | Frank Darabont\n", " The Lord of the Rings: The Return of the King | 1981 | 2003 | 22 | Elijah Wood\n", " The Dark Knight | 1979 | 2008 | 29 | Heath Ledger\n", "(9 rows)\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "python3" }, "outputs": [], "source": [ "%%sql\n", "\n", "-- statement here" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [ "psql -U postgres -d imdb2 < /dev/null)\n", "echo \"$solution\"; echo\n", "for l in \"$solution\"; do eval \"${l}\"; done\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "bash" }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "SoS", "language": "sos", "name": "sos" }, "language_info": { "codemirror_mode": "sos", "file_extension": ".sos", "mimetype": "text/x-sos", "name": "sos", "nbconvert_exporter": "sos_notebook.converter.SoS_Exporter", "pygments_lexer": "sos" }, "sos": { "kernels": [], "panel": { "displayed": true, "height": 0, "style": "side" }, "version": "0.19.18" }, "toc": { "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "toc_cell": false, "toc_position": { "height": "750.696px", "left": "0px", "right": "1309.09px", "top": "110.284px", "width": "436.364px" }, "toc_section_display": "block", "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 2 }