{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "SoS"
   },
   "source": [
    "# SQL ON SHELL\n",
    "\n",
    "**SERHAT ÇEVİKEL**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "SoS"
   },
   "source": [
    "First let's start our PostgreSQL server again:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "bash"
   },
   "outputs": [],
   "source": [
    "sudo service postgresql start"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "bash"
   },
   "source": [
    "In order to check whether the server is responding by:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "bash"
   },
   "outputs": [],
   "source": [
    "psql -U postgres -c \"\\l\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "bash"
   },
   "source": [
    "Now let's create our environment variables and extract the zipped tsv files if they are not yet:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "bash"
   },
   "outputs": [],
   "source": [
    "datadir=~/data\n",
    "imdbdir=$datadir/imdb"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "bash"
   },
   "outputs": [],
   "source": [
    "mkdir -p $imdbdir/tsv2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "bash"
   },
   "outputs": [],
   "source": [
    "find $imdbdir/tsv -mindepth 1 | \\\n",
    "    parallel -k -j0 \"basenm=\\$(basename {});\n",
    "        if [ ! -e ${imdbdir}/tsv2/\\${basenm%.gz} ];\n",
    "        then\n",
    "            gunzip -c {} > \\\n",
    "            ${imdbdir}/tsv2/\\${basenm%.gz};\n",
    "        fi\n",
    "        \""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "bash"
   },
   "source": [
    "Now we will go through two similar shell tools to use sql syntax on csv or tsv files:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "bash"
   },
   "source": [
    "## Q"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "bash"
   },
   "source": [
    "The source page for q project is:\n",
    "\n",
    "https://github.com/harelba/q\n",
    "\n",
    "```\n",
    "q - Text as Data\n",
    "\n",
    "q is a command line tool that allows direct execution of SQL-like queries on CSVs/TSVs (and any other tabular text files).\n",
    "\n",
    "q treats ordinary files as database tables, and supports all SQL constructs, such as WHERE, GROUP BY, JOINs etc. It supports automatic column name and column type detection, and provides full support for multiple encodings.\n",
    "\n",
    "q's web site is http://harelba.github.io/q/. It contains everything you need to download and use q in no time.\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "bash"
   },
   "source": [
    "The tsv files are as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "bash"
   },
   "outputs": [],
   "source": [
    "tldr q"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "bash"
   },
   "outputs": [],
   "source": [
    "ls $imdbdir/tsv2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "bash"
   },
   "source": [
    "A very basic query taking stdin as input: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "bash"
   },
   "outputs": [],
   "source": [
    "cat $imdbdir/tsv2/title.ratings.tsv | \\\n",
    "q -t -H \"SELECT * \\\n",
    "FROM - \\\n",
    "WHERE tconst = 'tt0000001';\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "bash"
   },
   "source": [
    "Or use the filename inside FROM clause:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "bash"
   },
   "outputs": [],
   "source": [
    "q -t -H \"SELECT * \\\n",
    "FROM $imdbdir/tsv2/title.ratings.tsv \\\n",
    "WHERE tconst = 'tt0000001';\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "bash"
   },
   "source": [
    "## CSVSQL"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "bash"
   },
   "source": [
    "Part of the csvkit suite:\n",
    "\n",
    "https://github.com/wireservice/csvkit\n",
    "\n",
    "https://csvkit.readthedocs.io/en/1.0.3/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "bash"
   },
   "outputs": [],
   "source": [
    "cat $imdbdir/tsv2/title.ratings.tsv | csvsql -t --query \"SELECT * \\\n",
    "FROM stdin \\\n",
    "WHERE tconst = 'tt0000001';\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "bash"
   },
   "source": [
    "# COMMON TABLE EXPRESSIONS IN SQL (WITH QUERIES)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "bash"
   },
   "source": [
    "In order to use sql magic on Python3 kernel:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "python3"
   },
   "outputs": [],
   "source": [
    "%load_ext sql"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "python3"
   },
   "outputs": [],
   "source": [
    "%sql postgres://postgres@localhost/imdb2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "python3"
   },
   "source": [
    "You can remember the structure of the data by either typing:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "bash"
   },
   "outputs": [],
   "source": [
    "psql -U postgres -d imdb2 -c \"\\d+ public.*\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "bash"
   },
   "source": [
    "Or viewing the file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "bash"
   },
   "outputs": [],
   "source": [
    "cat ~/imdb_database"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "bash"
   },
   "source": [
    "## Godfather cast with CTE's"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "bash"
   },
   "source": [
    "Now let's remember the query from session 2:\n",
    "\n",
    "\n",
    "- Filter for movies titled Godfather.\\\\*Part\n",
    "- Genres include drama and exclude comedy\n",
    "- Start year not after 1990\n",
    "- People with primary profession actor|actress\n",
    "\n",
    "- First join titles and principal cast on title id's (tconst)\n",
    "- And then join principal cast and name basics on name id's (nconst)\n",
    "- Filter only for actors and actresses\n",
    "- And sort on first names (ascending) then title years"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "python3"
   },
   "outputs": [],
   "source": [
    "%%sql\n",
    "\n",
    "SELECT tb.tconst, tb.originaltitle, tb.startyear, tb.runtimeminutes, tb.genres,\n",
    "    tp.principalcast,\n",
    "    nb.primaryname, nb.birthyear, nb.deathyear, nb.primaryprofession\n",
    "\n",
    "FROM title_basics tb\n",
    "LEFT JOIN title_principals_melt tp USING (tconst)\n",
    "\tLEFT JOIN name_basics nb ON tp.principalcast=nb.nconst\n",
    "\n",
    "WHERE tb.originaltitle ~ 'Godfather.*Part'\n",
    "\tAND tb.genres ~ '(?i)drama'\n",
    "\tAND NOT tb.genres ~ '(?i)comedy'\n",
    "\tAND tb.startyear <= 1990\n",
    "\tAND nb.primaryprofession ~'actor|actress'\n",
    "\n",
    "ORDER BY nb.primaryname, tb.startyear DESC;"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "python3"
   },
   "source": [
    "We can design this query so that smaller steps are combined:\n",
    "\n",
    "- Now let's rewrite the previous query with common table expressions (CTE) (aka \"WITH\" queries)\n",
    "- With CTE, we run a subquery or multiple subqueries before the main query, name it/them, refer to other subqueries with those names\n",
    "- And reuse the subqueries in multiple places in the main query - referring by its name, since PostgreSQL stores the results of subqueries\n",
    "- Just like a table!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "python3"
   },
   "outputs": [],
   "source": [
    "%%sql\n",
    "\n",
    "WITH basics_principles AS\n",
    "(\n",
    "SELECT tb.tconst, tb.originaltitle, tb.startyear, tb.runtimeminutes,\n",
    "\ttb.genres, tp.principalcast\n",
    "FROM title_basics tb\n",
    "    LEFT JOIN title_principals_melt tp\n",
    "    USING (tconst)\n",
    "WHERE tb.originaltitle ~ '.*Godfather.*Part.*'\n",
    "\tAND tb.genres ~ '(?i)drama'\n",
    "\tAND NOT tb.genres ~ '(?i)comedy'\n",
    "\tAND tb.startyear <= 1990\n",
    ")\n",
    "SELECT bp.tconst, bp.originaltitle, bp.startyear, bp.runtimeminutes,\n",
    "\tbp.genres, bp.principalcast, nb.primaryname,\n",
    "\tnb.birthyear, nb.deathyear, nb.primaryprofession\n",
    "\n",
    "FROM basics_principles bp LEFT JOIN name_basics nb ON bp.principalcast=nb.nconst\n",
    "\n",
    "WHERE nb.primaryprofession ~'actor|actress'\n",
    "\n",
    "ORDER BY nb.primaryname, bp.startyear DESC;"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "python3"
   },
   "source": [
    "## De Niro and Scorsese"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "python3"
   },
   "source": [
    "This was already easily implemented without CTE's.\n",
    "\n",
    "Now a harder example which is much challenging to implement without CTE's:\n",
    "\n",
    "Return all fields from title_basics for \"movies\" which feature Martin Scorsese as director and Robert De Niro as actor, sorted by ascending startyear"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "python3"
   },
   "outputs": [],
   "source": [
    "%%sql\n",
    "\n",
    "-- first get the nconst of Martin Scorsese\n",
    "WITH ms_id AS\n",
    "(\n",
    "SELECT nb.nconst\n",
    "FROM name_basics nb\n",
    "WHERE nb.primaryname = 'Martin Scorsese'\n",
    "\tAND nb.primaryprofession ~ 'director'\n",
    "),\n",
    "-- then get the titles directed by Martin Scorsese\n",
    "ms_titles AS\n",
    "(\n",
    "SELECT tc.tconst\n",
    "FROM ms_id LEFT JOIN title_crew tc ON ms_id.nconst=tc.directors\n",
    "),\n",
    "-- and get the nconst of Robert De Niro\n",
    "rdn_id AS\n",
    "(\n",
    "SELECT nb.nconst\n",
    "FROM name_basics nb\n",
    "WHERE nb.primaryname = 'Robert De Niro'\n",
    "),\n",
    "-- and get the titles by Robert De Niro\n",
    "rdn_titles AS\n",
    "(\n",
    "SELECT tp.tconst\n",
    "FROM rdn_id LEFT JOIN title_principals_melt tp ON rdn_id.nconst=tp.principalcast\n",
    ")\n",
    "-- and last, get the intersection of titles by De Niro and Scorsese\n",
    "SELECT tb.*\n",
    "FROM ms_titles INNER JOIN rdn_titles ON ms_titles.tconst=rdn_titles.tconst\n",
    "LEFT JOIN title_basics tb ON rdn_titles.tconst=tb.tconst\n",
    "WHERE tb.titletype = 'movie'\n",
    "ORDER BY tb.startyear\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "python3"
   },
   "source": [
    "## Ratings of Al Pacino"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "python3"
   },
   "source": [
    "Now another:\n",
    "\n",
    "\n",
    "- Return all the fields from title_basics and \"averagerating\" field from title_ratings of the \"movies\" that Al Pacino took role in\n",
    "- Exclude titles which have no \"averagerating\" info (with \"IS NOT NULL\" logical test)\n",
    "- Order by descending averageratings\n",
    "- Note that: In order to get the id of Al Pacino, from the name_basics table, you should check the \"primaryname\" field for equality with \"Al Pacino\" and birthyear with \"1940\"  \n",
    "- You can use multiple \"common table expressions\" (CTE) or \"WITH\" queries to simplify the main query\n",
    "- You can use aliases for tables to simplify your query (e.g. FROM title_basics tb)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "python3"
   },
   "outputs": [],
   "source": [
    "%%sql\n",
    "\n",
    "-- get the nconst of pacino\n",
    "WITH pacino AS\n",
    "(\n",
    "SELECT nconst\n",
    "FROM name_basics nb\n",
    "WHERE nb.primaryname = 'Al Pacino'\n",
    "\tAND nb.birthyear = 1940\n",
    "),\n",
    "-- get the titles of pacino\n",
    "t_ids AS\n",
    "(\n",
    "SELECT tp.tconst\n",
    "FROM pacino LEFT JOIN title_principals_melt tp ON pacino.nconst=tp.principalcast\n",
    ")\n",
    "-- get the details of those titles\n",
    "SELECT tb.*, tr.averagerating\n",
    "FROM t_ids LEFT JOIN title_basics tb ON t_ids.tconst=tb.tconst\n",
    "\tLEFT JOIN title_ratings tr ON tb.tconst=tr.tconst\n",
    "WHERE tb.titletype = 'movie'\n",
    "\tAND tr.averagerating IS NOT NULL\n",
    "ORDER BY tr.averagerating DESC\t"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "python3"
   },
   "source": [
    "## Youngest cast of each movie, revisited"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "python3"
   },
   "source": [
    "**EXERCISE 1:**\n",
    "\n",
    "Now remember the last query from session 3, in which we tried to get the oldest cast of each filtered movie.\n",
    "\n",
    "With CTE's we can do it in a more efficient and neat way:\n",
    "\n",
    "\n",
    "- Select movies with averagerating > 8.8, numvotes > 500000, titletype is not tvSeries\n",
    "- Now calculate the max birthyear of the cast of each movies and select those names with max birthyear, hence youngest cast\n",
    "- Report primarytitle, primaryname, birthyear of cast, startyear of movie and age of the youngest cast at the time of the movie (just the difference of previous two columns aliased as \"age\"\n",
    "\n",
    "The version with subquery was as such:\n",
    "\n",
    "\n",
    "```SQL\n",
    "SELECT mb.primarytitle, nb.birthyear, tb.startyear, tb.startyear - nb.birthyear as age,nb.primaryname\n",
    "    FROM\n",
    "(\n",
    "SELECT tb.primarytitle, max(nb.birthyear) as maxbirthyear, tb.tconst \n",
    "FROM title_basics tb\n",
    "    LEFT JOIN title_ratings tr USING (tconst)\n",
    "    LEFT JOIN title_principals_melt tp USING (tconst)\n",
    "    LEFT JOIN name_basics nb ON nb.nconst=tp.principalcast\n",
    "WHERE tr.averagerating > 8.8\n",
    "    AND tr.numvotes > 500000\n",
    "    AND NOT tb.titletype = 'tvSeries'\n",
    "GROUP BY tb.primarytitle, tb.tconst\n",
    ") mb\n",
    "    LEFT JOIN title_basics tb USING (tconst)\n",
    "    LEFT JOIN title_principals_melt tp USING (tconst)\n",
    "    LEFT JOIN name_basics nb ON nb.nconst=tp.principalcast\n",
    "WHERE nb.birthyear = mb.maxbirthyear\n",
    "```\n",
    "\n",
    "and the basic query listing the primary casts was as such:\n",
    "\n",
    "```SQL\n",
    "SELECT tb.primarytitle, tb.startyear, nb.primaryname, nb.birthyear, tr.averagerating\n",
    "FROM title_basics tb\n",
    "    LEFT JOIN title_ratings tr USING (tconst)\n",
    "    LEFT JOIN title_principals_melt tp USING (tconst)\n",
    "    LEFT JOIN name_basics nb ON nb.nconst=tp.principalcast\n",
    "WHERE tr.averagerating > 8.8\n",
    "    AND tr.numvotes > 500000\n",
    "    AND NOT tb.titletype = 'tvSeries'\n",
    "ORDER BY nb.primaryname, tb.primarytitle;\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "python3"
   },
   "outputs": [],
   "source": [
    "%%sql\n",
    "\n",
    "-- statement here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "bash"
   },
   "outputs": [],
   "source": [
    "psql -U postgres -d imdb2 <<EOF\n",
    "\n",
    "-- or you may write your statement here\n",
    "EOF"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "python3"
   },
   "source": [
    "**SOLUTION 1:**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "bash"
   },
   "outputs": [],
   "source": [
    "pass1=3082\n",
    "encrypt=$(cat encrypt_04_01)\n",
    "solution=$(echo \"$encrypt\" | openssl enc -aes-128-cbc -a -d -salt -pass pass:$pass1 2> /dev/null)\n",
    "echo \"$solution\"; echo\n",
    "for l in \"$solution\"; do eval \"${l}\"; done"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "bash"
   },
   "source": [
    "# SQL FROM R"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "bash"
   },
   "source": [
    "Now we will see how we can access to a PostgreSQL database from R"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "bash"
   },
   "source": [
    "## Using RPostgreSQL package"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "bash"
   },
   "source": [
    "First load necessary package"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "library(RPostgreSQL)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "Load PostgreSQL driver"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "drvv <- dbDriver(\"PostgreSQL\")\n",
    "drvv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "Create a connection to a database using the driver"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "con <- dbConnect(drvv,\n",
    "                 dbname = \"imdb2\",\n",
    "                 host = \"localhost\",\n",
    "                 port = 5432,\n",
    "                 user = \"postgres\")\n",
    "\n",
    "con"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "Get list of table names"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "table_names <- dbGetQuery(con,\n",
    "                          \"SELECT table_name\n",
    "                          FROM information_schema.tables \n",
    "                          WHERE table_type = 'BASE TABLE'\n",
    "                          AND table_schema = 'public'\n",
    "                          ORDER BY table_name\n",
    "                          \"\n",
    "                          )\n",
    "\n",
    "table_names\n",
    "\n",
    "class(table_names)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "Or more easily with:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "dbListTables(con)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "Good! The query returns a data frame\n",
    "\n",
    "Now we can loop through tables to automatize queries"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "List tables:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "table_names_vec <- dbListTables(con)\n",
    "table_names_vec\n",
    "class(table_names_vec)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "This yields a character vector, not a data frame"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "Get row counts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "# create a query template that we can change the parameters of\n",
    "# each \"%s\" stands for a variable that we will manipulate with sprintf\n",
    "query_text <- \"SELECT count (*) FROM %s\"\n",
    "\n",
    "# create an empty vector, the length of the table names\n",
    "# we will populate this vector with the row counts from tables\n",
    "row_counts <- rep(NA, length(table_names_vec))\n",
    "\n",
    "# update the names of the vector with table names\n",
    "names(row_counts) <- table_names_vec\n",
    "\n",
    "# for across table names indices (not the names themselves)\n",
    "for (tbl_ind in seq_along(table_names_vec))\n",
    "{\n",
    "    # replace the %s pointer in query text with table name\n",
    "    current_query <- sprintf(query_text, table_names_vec[tbl_ind])\n",
    "    \n",
    "    # run the query and get results\n",
    "    table_name <- dbGetQuery(con, current_query)\n",
    "    \n",
    "    # assign the query result to vector\n",
    "    row_counts[tbl_ind] <- table_name[[1]]\n",
    "}\n",
    "                  \n",
    "return(row_counts)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "### Play with title_ratings"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "Let's import a table as a data frame into R"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "title_ratings_df <- dbGetQuery(con, \"SELECT * from title_ratings\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "Get the head, attributes and summary of the data frame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "head(title_ratings_df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "class(title_ratings_df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "Now an easier way to read data into R"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "title_ratings_df_b <- dbReadTable(con, \"title_ratings\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "See whether they are identical objects"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "identical(title_ratings_df, title_ratings_df_b)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "So, they are identical"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "What is the object size of each data frame?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "format(object.size(title_ratings_df), units = \"auto\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "format(object.size(title_ratings_df_b), units = \"auto\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "They take too much space. Now let's get rid of one of them"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "rm(title_ratings_df_b)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "New get the classes of all columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "str(title_ratings_df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "### Play with title_basics"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "Let's query title_basics as such:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "Let's say, take rows in which \n",
    "\n",
    "name contains \"Star Wars\",\n",
    "\n",
    "not an adult movie\n",
    "\n",
    "title_type is a movie and\n",
    "\n",
    "start year is between 1977 and 2016"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "title_basics_df2 <- dbGetQuery(con,\n",
    "                          \"SELECT *\n",
    "                            FROM title_basics \n",
    "                            WHERE primarytitle ~* 'star.*wars'\n",
    "                            AND isadult = false\n",
    "                            AND titletype = 'movie'\n",
    "                            AND startyear BETWEEN 1977 AND 2016\"\n",
    "                          )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "title_basics_df2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "str(title_basics_df2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "Did not like NA's in tconst, let's enhance it"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "title_basics_df3 <- title_basics_df2[\n",
    "    !is.na(title_basics_df2$tconst),\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "title_basics_df3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "Much better!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "Now let's write this data frame into our imdb2 database as a new table"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "dbWriteTable(con, \"star_wars\", value = title_basics_df3, append = TRUE, row.names = FALSE)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "Check whether table is created:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "dbExistsTable(con, \"star_wars\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "And read from the database again"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "star_wars <- dbReadTable(con, \"star_wars\")\n",
    "\n",
    "star_wars"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "See, the only difference seems to be the rownames which we excluded on purpose\n",
    "\n",
    "Apart from that, let's check whet they are both identical"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "rownames(title_basics_df3) <- NULL\n",
    "title_basics_df3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "identical(title_basics_df3, star_wars)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "Yes they are identical!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "### Primary constraints"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "Let's remove the table"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "dbRemoveTable(con, \"star_wars\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "Check whether it still exists"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "dbExistsTable(con, \"star_wars\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "Now let's create the table again"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "dbWriteTable(con, \"star_wars\", value = title_basics_df3, append = TRUE, row.names = FALSE)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "Check that it exists:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "dbExistsTable(con, \"star_wars\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "Get its rowcount:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "dbGetQuery(con, \"SELECT count (*) FROM star_wars\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "And repeat the data write statement, with append option"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "dbWriteTable(con, \"star_wars\", value = title_basics_df3, append = TRUE, row.names = FALSE)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "Get the rowcount again"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "dbGetQuery(con, \"SELECT count (*) FROM star_wars\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "<a href=\"https://www.youtube.com/embed/tsXEToflqGs?start=10&autoplay=1\">OW! I HAVE A BAD FEELING ABOUT THIS</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "We did not define a primary key, so postgresql server allowed duplicates to be inserted in to the database"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "Now delete and recreate table"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "dbRemoveTable(con, \"star_wars\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "dbExistsTable(con, \"star_wars\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "dbWriteTable(con, \"star_wars\", value = title_basics_df3, append = TRUE, row.names = FALSE)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "dbExistsTable(con, \"star_wars\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "And add a primary key constraint"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "dbGetQuery(con,           \n",
    "\"\n",
    "ALTER TABLE star_wars\n",
    "ADD CONSTRAINT tconst_pk_10 \n",
    "PRIMARY KEY (tconst);\n",
    "\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "Let's try to import the data again"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "dbWriteTable(con, \"star_wars\", value = title_basics_df3, append = TRUE, row.names = FALSE)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "See that:\n",
    "\n",
    "***could not Retrieve the result : ERROR:  duplicate key value violates unique constraint \"tconst_pk_10***"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "Let's check the row count again:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "dbGetQuery(con, \"SELECT count (*) FROM star_wars\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "OK, we duplicates were really not allowed!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "Let's close connection"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "dbDisconnect(con)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "And unload the driver"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "dbUnloadDriver(drvv)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "## Using sqldf package"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "First let's load the package:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "library(RPostgreSQL)\n",
    "library(sqldf)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "And set the options:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "options(sqldf.RPostgreSQL.user =\"postgres\", \n",
    "  sqldf.RPostgreSQL.dbname =\"imdb2\",\n",
    "  sqldf.RPostgreSQL.host =\"localhost\", \n",
    "  sqldf.RPostgreSQL.port =5432)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "### Average runtimes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "Now let's run a query to get average runtimeminutes from title_basics and group them and order them by startyear  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "query = \"SELECT avg(runtimeminutes) runtimeminutes, startyear \n",
    "            FROM title_basics\n",
    "            GROUP BY startyear\n",
    "            ORDER BY startyear;\"\n",
    "\n",
    "av_runtime <- sqldf(query)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "av_runtime\n",
    "\n",
    "class(av_runtime)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "It returns a data frame"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "Now subset years between 1940 and 2016"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "av_runtime_subset <- av_runtime[av_runtime$startyear %in% 1940:2016,]\n",
    "\n",
    "av_runtime_subset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "Check whether any NA's exist"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "which(is.na(av_runtime_subset[[1]]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "Let's plot the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "runtime_ts <- ts(av_runtime_subset[[1]],\n",
    "                 start = av_runtime_subset[1,2],\n",
    "                 end = av_runtime_subset[nrow(av_runtime_subset),2])\n",
    "\n",
    "runtime_ts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "plot(runtime_ts)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "Well that data may have many different types of titles including shorts, etc\n",
    "\n",
    "Let's first subset for movies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "query = \"SELECT avg(runtimeminutes) runtimeminutes, startyear \n",
    "            FROM title_basics\n",
    "            WHERE titletype = 'movie'\n",
    "            GROUP BY startyear\n",
    "            ORDER BY startyear;\"\n",
    "\n",
    "av_runtime <- sqldf(query)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "av_runtime"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "av_runtime_subset <- av_runtime[av_runtime$startyear %in% 1940:2016,]\n",
    "\n",
    "av_runtime_subset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "runtime_ts <- ts(av_runtime_subset[[1]],\n",
    "                 start = av_runtime_subset[1,2],\n",
    "                 end = av_runtime_subset[nrow(av_runtime_subset),2])\n",
    "\n",
    "runtime_ts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "plot(runtime_ts)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "### Explore Hitchcock movies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "library(RPostgreSQL)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "drvv <- dbDriver(\"PostgreSQL\")\n",
    "drvv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "con <- dbConnect(drvv,\n",
    "                 dbname = \"imdb2\",\n",
    "                 host = \"localhost\",\n",
    "                 port = 5432,\n",
    "                 user = \"postgres\")\n",
    "\n",
    "con"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "Now the task is:\n",
    "- First get the count of average rating of movies by Alfred Hitchcock (excluding NULL ratings) with an SQL query\n",
    "- And summarize these values by decades (total count of movies and average rating by decade)\n",
    "- You can use floor division by floor(a/b) or its operator %/% in R\n",
    "- You can use aggregate function in R\n",
    "- You can recycle all codes that we have written sor far, or use the net\n",
    "- You can cooperate"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "query <- \"\n",
    "WITH hitch AS\n",
    "(\n",
    "SELECT nconst\n",
    "FROM name_basics nb\n",
    "WHERE nb.primaryname = 'Alfred Hitchcock'\n",
    "\tAND nb.primaryprofession ~ 'director'\n",
    "),\n",
    "t_ids AS\n",
    "(\n",
    "SELECT tp.tconst\n",
    "FROM hitch LEFT JOIN title_crew tp ON hitch.nconst=tp.directors\n",
    ")\n",
    "SELECT startyear, count(*), avg(tr.averagerating) avrate\n",
    "FROM t_ids LEFT JOIN title_basics tb ON t_ids.tconst=tb.tconst\n",
    "\tLEFT JOIN title_ratings tr ON tb.tconst=tr.tconst\n",
    "WHERE tb.titletype = 'movie'\n",
    "\tAND tr.averagerating IS NOT NULL\n",
    "GROUP BY startyear\n",
    "ORDER BY startyear\n",
    "\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "hitch_df <- dbGetQuery(con, query)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "hitch_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "In order to get an average rating for each decade weighted with the count of movies for each year, it is good to have a column for sum of ratings for a year:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "hitch_df$sumrate <- hitch_df$count * hitch_df$avrate"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "hitch_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "Now let's get a column for decades:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "hitch_df$dec <- (hitch_df$startyear %/% 10) * 10"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "hitch_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "And let's aggregate for each decade the sums of sumrate values:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "hitch_agg <- aggregate(hitch_df[,c(2,4)],\n",
    "                       by = list(hitch_df$dec),\n",
    "                       FUN = sum)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "hitch_agg"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "And let's get the average rating for each decade, deleting the sumrate column:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "hitch_agg$avrate <- hitch_agg$sumrate / hitch_agg$count"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "hitch_agg$sumrate <- NULL"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "hitch_agg"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "Let's have a scatterplot of counts versus average ratings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "plot(hitch_agg[,-1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "Let's have the total counts by decade:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "plot(x = hitch_agg[[1]], y = hitch_agg[[2]], main = \"total count by decade\")\n",
    "lines(x = hitch_agg[[1]], y = hitch_agg[[2]], type = \"l\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "And let's have the averate ratings by decade"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "plot(x = hitch_agg[[1]], y = hitch_agg[[3]], main = \"av ratings by decade\")\n",
    "lines(x = hitch_agg[[1]], y = hitch_agg[[3]], type = \"l\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "## Using dbplyr package (optional)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "Now let's go through a final example using tidyverse and data.table and an R package called \"dbplyr\" that incorporates dplyr verbes, converts a dplyr pipe to an sql query and executes remotely.\n",
    "\n",
    "Apart from dbplyr usage, this example will be a simple showcase of tidyverse and data.table tools"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "library(dbplyr)\n",
    "library(tidyverse)\n",
    "library(data.table)\n",
    "library(RPostgreSQL)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "### Data connection and querying"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "drvv <- dbDriver(\"PostgreSQL\")\n",
    "drvv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "Create a connection to a database using the driver"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "con <- dbConnect(drvv,\n",
    "                 dbname = \"imdb2\",\n",
    "                 host = \"localhost\",\n",
    "                 port = 5432,\n",
    "                 user = \"postgres\")\n",
    "\n",
    "con"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "And create a remote source object:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "title_basics <- tbl(con, \"title_basics\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "title_basics"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "Now let's create an sqlquery using dplyr verbes:\n",
    "\n",
    "- Select genres and startyear columns\n",
    "- Filter for startyear between 1950-2017, movie titletype and exclude missing genres\n",
    "- Create a new column \"count\" for the record count (of any column)\n",
    "- Group by genres and startyear and return the count of records"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "query1 <- title_basics %>% select(c(\"titletype\", \"genres\", \"startyear\")) %>%\n",
    "filter(between(startyear, 1950, 2017) &\n",
    "       !is.na(genres) &\n",
    "       titletype == \"movie\") %>%\n",
    "group_by(genres, startyear) %>%\n",
    "summarise(count = n())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "This is lazy query it is not executed and returned yet:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "query1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "Show the resulting sql query:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "query1 %>%show_query"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "Execute and collect the data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "table1 <- query1 %>% collect"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "### Data exploration and wrangling"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "Summarize the data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "str(table1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "table1 %>% glimpse"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "Get unique values:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "table1 %>% select(-count) %>% sapply(unique)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "Any missing values?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "which(!complete.cases(table1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "Convert to a data.table object:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "setDT(table1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "Now what we will do with this data is that we will explore the count of multiple genres in titles:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "Let's create a new column that shows how many genres each movie has:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "table1[,genrec := stringr::str_extract_all(genres, \",\", simplify= T) %>% length %>% \"+\"(1),\n",
    "       by = 1:nrow(table1)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "See how \"genres\" and \"genrec\" columns are related:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "table1[,.SD[1], by = genres]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "Now let's create contingency tables of genre counts as proportions for each startyear:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "genre_counts <- table1[, as.list(table(genrec) %>% prop.table %>% \"*\"(100) %>% round(2)),\n",
    "                 by = startyear]\n",
    "genre_counts"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "### Data visualization"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "And let's visualize those proportions as a stacked area chart:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "ir"
   },
   "outputs": [],
   "source": [
    "genre_counts %>%\n",
    "    gather(\"key\", \"value\", -startyear) %>%\n",
    "    ggplot(aes(x = startyear, y = value, fill = key)) +\n",
    "        geom_area()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "See that over time movies represent multiple genres as opposed to single genres to a larger extent (increase in the area of \"3\").\n",
    "\n",
    "This is probably due to changes in spectators' tastes and their desire to see more complex screenplays."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "# QUIZ"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "ir"
   },
   "source": [
    "Enter the password provided and run the following cell:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "kernel": "bash"
   },
   "outputs": [],
   "source": [
    "pass=\n",
    "cat quiz_2019_01q.ipynb.crypt | \\\n",
    "openssl enc -aes-128-cbc -a -d -salt -pass pass:$pass 2> /dev/null > quiz_2019_01q.ipynb"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "kernel": "bash"
   },
   "source": [
    "Now open quiz_2019_01q.ipynb file and follow instructions"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "SoS",
   "language": "sos",
   "name": "sos"
  },
  "language_info": {
   "codemirror_mode": "sos",
   "file_extension": ".sos",
   "mimetype": "text/x-sos",
   "name": "sos",
   "nbconvert_exporter": "sos_notebook.converter.SoS_Exporter",
   "pygments_lexer": "sos"
  },
  "sos": {
   "kernels": [],
   "panel": {
    "displayed": true,
    "height": 0,
    "style": "side"
   },
   "version": "0.20.9"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": "block",
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}