{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Writing complex SQL queries in Python\n", "\n", "The coronavirus took the entire world by surprise, changing everyone's daily routine. City dwellers no longer spent their free time outside, going to cafes and malls; more people were home, reading books. That attracted the attention of startups that rushed to develop new apps for book lovers.\n", "\n", "I've been given a database of one of the services competing in this market. It contains data on books, publishers, authors, and customer ratings and reviews of books. This information will be used to generate a value proposition for a new product.\n", "\n", "## Description of the data\n", "**books:**\n", " - Contains data on books:\n", " - book_id\n", " - author_id\n", " - title\n", " - num_pages — number of pages\n", " - publication_date\n", " - publisher_id\n", "\n", "**authors:**\n", " - Contains data on authors:\n", " - author_id\n", " - author\n", "\n", "**publishers:**\n", " - Contains data on publishers:\n", " - publisher_id\n", " - publisher\n", "\n", "**ratings:**\n", " - Contains data on user ratings:\n", " - rating_id\n", " - book_id\n", " - username — the name of the user who rated the book\n", " - rating\n", "\n", "**reviews:**\n", " - Contains data on customer reviews:\n", " - review_id\n", " - book_id\n", " - username — the name of the user who reviewed the book\n", " - text — the text of the review\n", " \n", "## Tasks\n", " - Find the number of books released after January 1, 2000.\n", " - Find the number of user reviews and the average rating for each book.\n", " - Identify the publisher that has released the greatest number of books with more than 50 pages (this will help you exclude brochures and similar publications from your analysis).\n", " - Identify the author with the highest average book rating: look only at books with at least 50 ratings.\n", " - Find the average number of text reviews among users who rated more than 50 books." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1: Import libraries & connect to database" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# import libraries\n", "\n", "import pandas as pd\n", "from sqlalchemy import create_engine\n", "\n", "db_config = {'user': 'praktikum_student', # user name\n", " 'pwd': 'Sdf4$2;d-d30pp', # password\n", " 'host': 'rc1b-wcoijxj3yxfsf3fs.mdb.yandexcloud.net',\n", " 'port': 6432, # connection port\n", " 'db': 'data-analyst-final-project-db'} # the name of the data base\n", "\n", "connection_string = 'postgresql://{}:{}@{}:{}/{}'.format(db_config['user'],\n", " db_config['pwd'],\n", " db_config['host'],\n", " db_config['port'],\n", " db_config['db'])\n", "\n", "engine = create_engine(connection_string, connect_args={'sslmode':'require'})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2: Print first few rows of each table" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
book_idauthor_idtitlenum_pagespublication_datepublisher_id
01546'Salem's Lot5942005-11-0193
124651 000 Places to See Before You Die9922003-05-22336
2340713 Little Blue Envelopes (Little Blue Envelope...3222010-12-21135
34821491: New Revelations of the Americas Before C...5412006-10-10309
4512517763862006-07-04268
\n", "
" ], "text/plain": [ " book_id author_id title \\\n", "0 1 546 'Salem's Lot \n", "1 2 465 1 000 Places to See Before You Die \n", "2 3 407 13 Little Blue Envelopes (Little Blue Envelope... \n", "3 4 82 1491: New Revelations of the Americas Before C... \n", "4 5 125 1776 \n", "\n", " num_pages publication_date publisher_id \n", "0 594 2005-11-01 93 \n", "1 992 2003-05-22 336 \n", "2 322 2010-12-21 135 \n", "3 541 2006-10-10 309 \n", "4 386 2006-07-04 268 " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
author_idauthor
01A.S. Byatt
12Aesop/Laura Harris/Laura Gibbs
23Agatha Christie
34Alan Brennert
45Alan Moore/David Lloyd
\n", "
" ], "text/plain": [ " author_id author\n", "0 1 A.S. Byatt\n", "1 2 Aesop/Laura Harris/Laura Gibbs\n", "2 3 Agatha Christie\n", "3 4 Alan Brennert\n", "4 5 Alan Moore/David Lloyd" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
publisher_idpublisher
01Ace
12Ace Book
23Ace Books
34Ace Hardcover
45Addison Wesley Publishing Company
\n", "
" ], "text/plain": [ " publisher_id publisher\n", "0 1 Ace\n", "1 2 Ace Book\n", "2 3 Ace Books\n", "3 4 Ace Hardcover\n", "4 5 Addison Wesley Publishing Company" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
rating_idbook_idusernamerating
011ryanfranco4
121grantpatricia2
231brandtandrea5
342lorichen3
452mariokeller2
\n", "
" ], "text/plain": [ " rating_id book_id username rating\n", "0 1 1 ryanfranco 4\n", "1 2 1 grantpatricia 2\n", "2 3 1 brandtandrea 5\n", "3 4 2 lorichen 3\n", "4 5 2 mariokeller 2" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
review_idbook_idusernametext
011brandtandreaMention society tell send professor analysis. ...
121ryanfrancoFoot glass pretty audience hit themselves. Amo...
232lorichenListen treat keep worry. Miss husband tax but ...
343johnsonamandaFinally month interesting blue could nature cu...
453scotttamaraNation purpose heavy give wait song will. List...
\n", "
" ], "text/plain": [ " review_id book_id username \\\n", "0 1 1 brandtandrea \n", "1 2 1 ryanfranco \n", "2 3 2 lorichen \n", "3 4 3 johnsonamanda \n", "4 5 3 scotttamara \n", "\n", " text \n", "0 Mention society tell send professor analysis. ... \n", "1 Foot glass pretty audience hit themselves. Amo... \n", "2 Listen treat keep worry. Miss husband tax but ... \n", "3 Finally month interesting blue could nature cu... \n", "4 Nation purpose heavy give wait song will. List... " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# write queries\n", "\n", "tables = ['books', 'authors', 'publishers', 'ratings', 'reviews']\n", "\n", "for table in tables:\n", " query = 'SELECT * FROM ' + table + ' LIMIT 5'\n", "\n", " display(pd.io.sql.read_sql(query, con = engine))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3: SQL queries for each task" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 1: Books released after January 1, 2000" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cnt
0819
\n", "
" ], "text/plain": [ " cnt\n", "0 819" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# count number of books released after 1-1-2000\n", "\n", "query_t1 = \"\"\"SELECT\n", " COUNT (DISTINCT book_id) as cnt\n", " FROM\n", " books\n", " WHERE\n", " publication_date > '2000-01-01'\"\"\"\n", "\n", "pd.io.sql.read_sql(query_t1, con = engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "819 out of the 1000 books in the table were released after the year 2000. This represents roughly 82%." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 2: Number of user reviews & average rating for each book" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
titleavg_ratingreview_cnt
0Twilight (Twilight #1)3.6625001120
1The Hobbit or There and Back Again4.125000528
2The Catcher in the Rye3.825581516
3Harry Potter and the Prisoner of Azkaban (Harr...4.414634492
4Harry Potter and the Chamber of Secrets (Harry...4.287500480
5Angels & Demons (Robert Langdon #1)3.678571420
6Harry Potter and the Order of the Phoenix (Har...4.186667375
7The Lightning Thief (Percy Jackson and the Oly...4.080645372
8The Fellowship of the Ring (The Lord of the Ri...4.391892370
9Animal Farm3.729730370
\n", "
" ], "text/plain": [ " title avg_rating review_cnt\n", "0 Twilight (Twilight #1) 3.662500 1120\n", "1 The Hobbit or There and Back Again 4.125000 528\n", "2 The Catcher in the Rye 3.825581 516\n", "3 Harry Potter and the Prisoner of Azkaban (Harr... 4.414634 492\n", "4 Harry Potter and the Chamber of Secrets (Harry... 4.287500 480\n", "5 Angels & Demons (Robert Langdon #1) 3.678571 420\n", "6 Harry Potter and the Order of the Phoenix (Har... 4.186667 375\n", "7 The Lightning Thief (Percy Jackson and the Oly... 4.080645 372\n", "8 The Fellowship of the Ring (The Lord of the Ri... 4.391892 370\n", "9 Animal Farm 3.729730 370" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# count number of user reviews & avg book rating for each book\n", "\n", "query_t2 = \"\"\"SELECT\n", " books.title,\n", " subquery.avg_rating,\n", " subquery.review_cnt\n", " FROM\n", " (SELECT\n", " reviews.book_id as book_id,\n", " COUNT (reviews.review_id) AS review_cnt,\n", " AVG (ratings.rating) AS avg_rating\n", " FROM\n", " reviews\n", " INNER JOIN ratings ON ratings.book_id = reviews.book_id\n", " GROUP BY\n", " reviews.book_id\n", " ORDER BY\n", " review_cnt DESC,\n", " avg_rating DESC) AS subquery\n", " INNER JOIN books ON subquery.book_id = books.book_id\n", " ORDER BY\n", " review_cnt DESC,\n", " avg_rating DESC\n", " LIMIT 10\"\"\"\n", "\n", "pd.io.sql.read_sql(query_t2, con = engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Twilight received the greatest number of reviews by a long shot. It wasnt as popular as The Hobbit or any of the Harry Potter series, however. These books all received higher ratings." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 3: Identify the publisher with the greatest number of books with more than 50 pages" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cntpublisher
042Penguin Books
131Vintage
225Grand Central Publishing
324Penguin Classics
419Bantam
\n", "
" ], "text/plain": [ " cnt publisher\n", "0 42 Penguin Books\n", "1 31 Vintage\n", "2 25 Grand Central Publishing\n", "3 24 Penguin Classics\n", "4 19 Bantam" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# top publishers with books over 50 pages\n", "\n", "query_t3 = \"\"\"SELECT\n", " COUNT (books.book_id) AS cnt,\n", " publishers.publisher AS publisher\n", " FROM\n", " books\n", " INNER JOIN publishers ON publishers.publisher_id = books.publisher_id\n", " WHERE \n", " books.num_pages > 50\n", " GROUP BY\n", " publishers.publisher\n", " ORDER BY\n", " cnt DESC\n", " LIMIT 5\"\"\"\n", "\n", "pd.io.sql.read_sql(query_t3, con = engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The top publishers are Penguin Books, Vintage, Grand Central Publishing, Pengin Classics, and Ballantine Books. If we group Penguin with Penguin Classics, then they are by far the largest publisher by volume of books." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 4: Identify the author with highest average book rating of books with over 50 ratings" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
authorfinal_avg
0J.K. Rowling/Mary GrandPré4.283844
1Markus Zusak/Cao Xuân Việt Khương4.264151
2J.R.R. Tolkien4.258446
3Louisa May Alcott4.192308
4Rick Riordan4.080645
\n", "
" ], "text/plain": [ " author final_avg\n", "0 J.K. Rowling/Mary GrandPré 4.283844\n", "1 Markus Zusak/Cao Xuân Việt Khương 4.264151\n", "2 J.R.R. Tolkien 4.258446\n", "3 Louisa May Alcott 4.192308\n", "4 Rick Riordan 4.080645" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# author with highest average rating\n", "\n", "query_t4 = \"\"\"SELECT\n", " authors.author,\n", " AVG (subquery2.avg_rating) as final_avg \n", " FROM\n", " (SELECT\n", " books.title,\n", " books.author_id,\n", " subquery1.avg_rating\n", " FROM\n", " (SELECT\n", " book_id,\n", " COUNT (rating_id) AS rating_cnt,\n", " AVG (rating) AS avg_rating\n", " FROM\n", " ratings\n", " GROUP BY\n", " book_id\n", " HAVING\n", " COUNT (rating_id) > 50) AS subquery1\n", " INNER JOIN books ON books.book_id = subquery1.book_id) AS subquery2\n", " INNER JOIN authors ON authors.author_id = subquery2.author_id\n", " GROUP BY\n", " author\n", " ORDER BY\n", " final_avg DESC\n", " LIMIT 5\"\"\"\n", "\n", "pd.io.sql.read_sql(query_t4, con = engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "J.K. Rowling has the highest average rating out of all books with greater than 50 ratings. Not far behind are Markus Zusak and J.R.R. Tolkien." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 5: Average number of text reviews among users who rated more than 50 books" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
avg_review_cnt
024.333333
\n", "
" ], "text/plain": [ " avg_review_cnt\n", "0 24.333333" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query_t5 = \"\"\"SELECT\n", " AVG (subquery2.review_cnt) AS avg_review_cnt\n", " FROM\n", " (SELECT\n", " COUNT (reviews.review_id) as review_cnt,\n", " subquery1.username\n", " FROM\n", " (SELECT\n", " username,\n", " COUNT (rating_id) AS rating_cnt\n", " FROM\n", " ratings\n", " GROUP BY\n", " username\n", " HAVING\n", " COUNT (rating_id) > 50) AS subquery1\n", " INNER JOIN reviews ON reviews.username = subquery1.username\n", " GROUP BY\n", " subquery1.username) AS subquery2\"\"\"\n", "\n", "pd.io.sql.read_sql(query_t5, con = engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Users who rated more than 50 books left on average 24.33 text reviews." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }