{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Writing complex SQL queries in Python\n",
    "\n",
    "The coronavirus took the entire world by surprise, changing everyone's daily routine. City dwellers no longer spent their free time outside, going to cafes and malls; more people were home, reading books. That attracted the attention of startups that rushed to develop new apps for book lovers.\n",
    "\n",
    "I've been given a database of one of the services competing in this market. It contains data on books, publishers, authors, and customer ratings and reviews of books. This information will be used to generate a value proposition for a new product.\n",
    "\n",
    "## Description of the data\n",
    "**books:**\n",
    " - Contains data on books:\n",
    "     - book_id\n",
    "     - author_id\n",
    "     - title\n",
    "     - num_pages — number of pages\n",
    "     - publication_date\n",
    "     - publisher_id\n",
    "\n",
    "**authors:**\n",
    " - Contains data on authors:\n",
    "     - author_id\n",
    "     - author\n",
    "\n",
    "**publishers:**\n",
    " - Contains data on publishers:\n",
    "     - publisher_id\n",
    "     - publisher\n",
    "\n",
    "**ratings:**\n",
    " - Contains data on user ratings:\n",
    "     - rating_id\n",
    "     - book_id\n",
    "     - username — the name of the user who rated the book\n",
    "     - rating\n",
    "\n",
    "**reviews:**\n",
    " - Contains data on customer reviews:\n",
    "     - review_id\n",
    "     - book_id\n",
    "     - username — the name of the user who reviewed the book\n",
    "     - text — the text of the review\n",
    "     \n",
    "## Tasks\n",
    " - Find the number of books released after January 1, 2000.\n",
    " - Find the number of user reviews and the average rating for each book.\n",
    " - Identify the publisher that has released the greatest number of books with more than 50 pages (this will help you exclude brochures and similar publications from your analysis).\n",
    " - Identify the author with the highest average book rating: look only at books with at least 50 ratings.\n",
    " - Find the average number of text reviews among users who rated more than 50 books."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1: Import libraries & connect to database"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# import libraries\n",
    "\n",
    "import pandas as pd\n",
    "from sqlalchemy import create_engine\n",
    "\n",
    "db_config = {'user': 'praktikum_student', # user name\n",
    "             'pwd': 'Sdf4$2;d-d30pp', # password\n",
    "             'host': 'rc1b-wcoijxj3yxfsf3fs.mdb.yandexcloud.net',\n",
    "             'port': 6432, # connection port\n",
    "             'db': 'data-analyst-final-project-db'} # the name of the data base\n",
    "\n",
    "connection_string = 'postgresql://{}:{}@{}:{}/{}'.format(db_config['user'],\n",
    "                                                         db_config['pwd'],\n",
    "                                                         db_config['host'],\n",
    "                                                         db_config['port'],\n",
    "                                                         db_config['db'])\n",
    "\n",
    "engine = create_engine(connection_string, connect_args={'sslmode':'require'})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2: Print first few rows of each table"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>book_id</th>\n",
       "      <th>author_id</th>\n",
       "      <th>title</th>\n",
       "      <th>num_pages</th>\n",
       "      <th>publication_date</th>\n",
       "      <th>publisher_id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>546</td>\n",
       "      <td>'Salem's Lot</td>\n",
       "      <td>594</td>\n",
       "      <td>2005-11-01</td>\n",
       "      <td>93</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>465</td>\n",
       "      <td>1 000 Places to See Before You Die</td>\n",
       "      <td>992</td>\n",
       "      <td>2003-05-22</td>\n",
       "      <td>336</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>407</td>\n",
       "      <td>13 Little Blue Envelopes (Little Blue Envelope...</td>\n",
       "      <td>322</td>\n",
       "      <td>2010-12-21</td>\n",
       "      <td>135</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>82</td>\n",
       "      <td>1491: New Revelations of the Americas Before C...</td>\n",
       "      <td>541</td>\n",
       "      <td>2006-10-10</td>\n",
       "      <td>309</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>125</td>\n",
       "      <td>1776</td>\n",
       "      <td>386</td>\n",
       "      <td>2006-07-04</td>\n",
       "      <td>268</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   book_id  author_id                                              title  \\\n",
       "0        1        546                                       'Salem's Lot   \n",
       "1        2        465                 1 000 Places to See Before You Die   \n",
       "2        3        407  13 Little Blue Envelopes (Little Blue Envelope...   \n",
       "3        4         82  1491: New Revelations of the Americas Before C...   \n",
       "4        5        125                                               1776   \n",
       "\n",
       "   num_pages publication_date  publisher_id  \n",
       "0        594       2005-11-01            93  \n",
       "1        992       2003-05-22           336  \n",
       "2        322       2010-12-21           135  \n",
       "3        541       2006-10-10           309  \n",
       "4        386       2006-07-04           268  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>author_id</th>\n",
       "      <th>author</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>A.S. Byatt</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>Aesop/Laura Harris/Laura Gibbs</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>Agatha Christie</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>Alan Brennert</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>Alan Moore/David   Lloyd</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   author_id                          author\n",
       "0          1                      A.S. Byatt\n",
       "1          2  Aesop/Laura Harris/Laura Gibbs\n",
       "2          3                 Agatha Christie\n",
       "3          4                   Alan Brennert\n",
       "4          5        Alan Moore/David   Lloyd"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>publisher_id</th>\n",
       "      <th>publisher</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>Ace</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>Ace Book</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>Ace Books</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>Ace Hardcover</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>Addison Wesley Publishing Company</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   publisher_id                          publisher\n",
       "0             1                                Ace\n",
       "1             2                           Ace Book\n",
       "2             3                          Ace Books\n",
       "3             4                      Ace Hardcover\n",
       "4             5  Addison Wesley Publishing Company"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>rating_id</th>\n",
       "      <th>book_id</th>\n",
       "      <th>username</th>\n",
       "      <th>rating</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>ryanfranco</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>grantpatricia</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>brandtandrea</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>2</td>\n",
       "      <td>lorichen</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>2</td>\n",
       "      <td>mariokeller</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   rating_id  book_id       username  rating\n",
       "0          1        1     ryanfranco       4\n",
       "1          2        1  grantpatricia       2\n",
       "2          3        1   brandtandrea       5\n",
       "3          4        2       lorichen       3\n",
       "4          5        2    mariokeller       2"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>review_id</th>\n",
       "      <th>book_id</th>\n",
       "      <th>username</th>\n",
       "      <th>text</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>brandtandrea</td>\n",
       "      <td>Mention society tell send professor analysis. ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>ryanfranco</td>\n",
       "      <td>Foot glass pretty audience hit themselves. Amo...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>lorichen</td>\n",
       "      <td>Listen treat keep worry. Miss husband tax but ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>3</td>\n",
       "      <td>johnsonamanda</td>\n",
       "      <td>Finally month interesting blue could nature cu...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>3</td>\n",
       "      <td>scotttamara</td>\n",
       "      <td>Nation purpose heavy give wait song will. List...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   review_id  book_id       username  \\\n",
       "0          1        1   brandtandrea   \n",
       "1          2        1     ryanfranco   \n",
       "2          3        2       lorichen   \n",
       "3          4        3  johnsonamanda   \n",
       "4          5        3    scotttamara   \n",
       "\n",
       "                                                text  \n",
       "0  Mention society tell send professor analysis. ...  \n",
       "1  Foot glass pretty audience hit themselves. Amo...  \n",
       "2  Listen treat keep worry. Miss husband tax but ...  \n",
       "3  Finally month interesting blue could nature cu...  \n",
       "4  Nation purpose heavy give wait song will. List...  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# write queries\n",
    "\n",
    "tables = ['books', 'authors', 'publishers', 'ratings', 'reviews']\n",
    "\n",
    "for table in tables:\n",
    "    query = 'SELECT * FROM ' + table + ' LIMIT 5'\n",
    "\n",
    "    display(pd.io.sql.read_sql(query, con = engine))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3: SQL queries for each task"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Task 1: Books released after January 1, 2000"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>cnt</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>819</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   cnt\n",
       "0  819"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# count number of books released after 1-1-2000\n",
    "\n",
    "query_t1 = \"\"\"SELECT\n",
    "                  COUNT (DISTINCT book_id) as cnt\n",
    "              FROM\n",
    "                  books\n",
    "               WHERE\n",
    "                   publication_date > '2000-01-01'\"\"\"\n",
    "\n",
    "pd.io.sql.read_sql(query_t1, con = engine)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "819 out of the 1000 books in the table were released after the year 2000. This represents roughly 82%."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Task 2: Number of user reviews & average rating for each book"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>avg_rating</th>\n",
       "      <th>review_cnt</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Twilight (Twilight  #1)</td>\n",
       "      <td>3.662500</td>\n",
       "      <td>1120</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>The Hobbit  or There and Back Again</td>\n",
       "      <td>4.125000</td>\n",
       "      <td>528</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>The Catcher in the Rye</td>\n",
       "      <td>3.825581</td>\n",
       "      <td>516</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Harry Potter and the Prisoner of Azkaban (Harr...</td>\n",
       "      <td>4.414634</td>\n",
       "      <td>492</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Harry Potter and the Chamber of Secrets (Harry...</td>\n",
       "      <td>4.287500</td>\n",
       "      <td>480</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Angels &amp; Demons (Robert Langdon  #1)</td>\n",
       "      <td>3.678571</td>\n",
       "      <td>420</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Harry Potter and the Order of the Phoenix (Har...</td>\n",
       "      <td>4.186667</td>\n",
       "      <td>375</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>The Lightning Thief (Percy Jackson and the Oly...</td>\n",
       "      <td>4.080645</td>\n",
       "      <td>372</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>The Fellowship of the Ring (The Lord of the Ri...</td>\n",
       "      <td>4.391892</td>\n",
       "      <td>370</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Animal Farm</td>\n",
       "      <td>3.729730</td>\n",
       "      <td>370</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                               title  avg_rating  review_cnt\n",
       "0                            Twilight (Twilight  #1)    3.662500        1120\n",
       "1                The Hobbit  or There and Back Again    4.125000         528\n",
       "2                             The Catcher in the Rye    3.825581         516\n",
       "3  Harry Potter and the Prisoner of Azkaban (Harr...    4.414634         492\n",
       "4  Harry Potter and the Chamber of Secrets (Harry...    4.287500         480\n",
       "5               Angels & Demons (Robert Langdon  #1)    3.678571         420\n",
       "6  Harry Potter and the Order of the Phoenix (Har...    4.186667         375\n",
       "7  The Lightning Thief (Percy Jackson and the Oly...    4.080645         372\n",
       "8  The Fellowship of the Ring (The Lord of the Ri...    4.391892         370\n",
       "9                                        Animal Farm    3.729730         370"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# count number of user reviews & avg book rating for each book\n",
    "\n",
    "query_t2 = \"\"\"SELECT\n",
    "              books.title,\n",
    "              subquery.avg_rating,\n",
    "              subquery.review_cnt\n",
    "              FROM\n",
    "                  (SELECT\n",
    "                       reviews.book_id as book_id,\n",
    "                       COUNT (reviews.review_id) AS review_cnt,\n",
    "                       AVG (ratings.rating) AS avg_rating\n",
    "                   FROM\n",
    "                       reviews\n",
    "                       INNER JOIN ratings ON ratings.book_id = reviews.book_id\n",
    "                   GROUP BY\n",
    "                       reviews.book_id\n",
    "                   ORDER BY\n",
    "                       review_cnt DESC,\n",
    "                       avg_rating DESC) AS subquery\n",
    "                   INNER JOIN books ON subquery.book_id = books.book_id\n",
    "              ORDER BY\n",
    "                  review_cnt DESC,\n",
    "                  avg_rating DESC\n",
    "              LIMIT 10\"\"\"\n",
    "\n",
    "pd.io.sql.read_sql(query_t2, con = engine)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Twilight received the greatest number of reviews by a long shot. It wasnt as popular as The Hobbit or any of the Harry Potter series, however. These books all received higher ratings."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Task 3: Identify the publisher with the greatest number of books with more than 50 pages"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>cnt</th>\n",
       "      <th>publisher</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>42</td>\n",
       "      <td>Penguin Books</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>31</td>\n",
       "      <td>Vintage</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>25</td>\n",
       "      <td>Grand Central Publishing</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>24</td>\n",
       "      <td>Penguin Classics</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>19</td>\n",
       "      <td>Bantam</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   cnt                 publisher\n",
       "0   42             Penguin Books\n",
       "1   31                   Vintage\n",
       "2   25  Grand Central Publishing\n",
       "3   24          Penguin Classics\n",
       "4   19                    Bantam"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# top publishers with books over 50 pages\n",
    "\n",
    "query_t3 = \"\"\"SELECT\n",
    "                   COUNT (books.book_id) AS cnt,\n",
    "                   publishers.publisher AS publisher\n",
    "              FROM\n",
    "                  books\n",
    "                  INNER JOIN publishers ON publishers.publisher_id = books.publisher_id\n",
    "              WHERE \n",
    "                  books.num_pages > 50\n",
    "              GROUP BY\n",
    "                  publishers.publisher\n",
    "              ORDER BY\n",
    "                  cnt DESC\n",
    "              LIMIT 5\"\"\"\n",
    "\n",
    "pd.io.sql.read_sql(query_t3, con = engine)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The top publishers are Penguin Books, Vintage, Grand Central Publishing, Pengin Classics, and Ballantine Books. If we group Penguin with Penguin Classics, then they are by far the largest publisher by volume of books."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Task 4: Identify the author with highest average book rating of books with over 50 ratings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>author</th>\n",
       "      <th>final_avg</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>J.K. Rowling/Mary GrandPré</td>\n",
       "      <td>4.283844</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Markus Zusak/Cao Xuân Việt Khương</td>\n",
       "      <td>4.264151</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>J.R.R. Tolkien</td>\n",
       "      <td>4.258446</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Louisa May Alcott</td>\n",
       "      <td>4.192308</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Rick Riordan</td>\n",
       "      <td>4.080645</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                              author  final_avg\n",
       "0         J.K. Rowling/Mary GrandPré   4.283844\n",
       "1  Markus Zusak/Cao Xuân Việt Khương   4.264151\n",
       "2                     J.R.R. Tolkien   4.258446\n",
       "3                  Louisa May Alcott   4.192308\n",
       "4                       Rick Riordan   4.080645"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# author with highest average rating\n",
    "\n",
    "query_t4 = \"\"\"SELECT\n",
    "                  authors.author,\n",
    "                  AVG (subquery2.avg_rating) as final_avg            \n",
    "              FROM\n",
    "                  (SELECT\n",
    "                      books.title,\n",
    "                      books.author_id,\n",
    "                      subquery1.avg_rating\n",
    "                  FROM\n",
    "                      (SELECT\n",
    "                          book_id,\n",
    "                          COUNT (rating_id) AS rating_cnt,\n",
    "                          AVG (rating) AS avg_rating\n",
    "                      FROM\n",
    "                          ratings\n",
    "                      GROUP BY\n",
    "                          book_id\n",
    "                      HAVING\n",
    "                          COUNT (rating_id) > 50) AS subquery1\n",
    "                      INNER JOIN books ON books.book_id = subquery1.book_id) AS subquery2\n",
    "                  INNER JOIN authors ON authors.author_id = subquery2.author_id\n",
    "              GROUP BY\n",
    "                  author\n",
    "              ORDER BY\n",
    "                  final_avg DESC\n",
    "              LIMIT 5\"\"\"\n",
    "\n",
    "pd.io.sql.read_sql(query_t4, con = engine)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "J.K. Rowling has the highest average rating out of all books with greater than 50 ratings. Not far behind are Markus Zusak and J.R.R. Tolkien."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Task 5: Average number of text reviews among users who rated more than 50 books"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>avg_review_cnt</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>24.333333</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   avg_review_cnt\n",
       "0       24.333333"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "query_t5 = \"\"\"SELECT\n",
    "                  AVG (subquery2.review_cnt) AS avg_review_cnt\n",
    "              FROM\n",
    "                  (SELECT\n",
    "                      COUNT (reviews.review_id) as review_cnt,\n",
    "                      subquery1.username\n",
    "                  FROM\n",
    "                      (SELECT\n",
    "                          username,\n",
    "                          COUNT (rating_id) AS rating_cnt\n",
    "                      FROM\n",
    "                          ratings\n",
    "                      GROUP BY\n",
    "                          username\n",
    "                      HAVING\n",
    "                          COUNT (rating_id) > 50) AS subquery1\n",
    "                      INNER JOIN reviews ON reviews.username = subquery1.username\n",
    "                  GROUP BY\n",
    "                      subquery1.username) AS subquery2\"\"\"\n",
    "\n",
    "pd.io.sql.read_sql(query_t5, con = engine)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Users who rated more than 50 books left on average 24.33 text reviews."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}