{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Practicing SQL Queries using sqlite3 in Python\n", "\n", "Recently, I worked on a case study in which I needed to write accurate SQL queries that would answer practical business questions. I wanted to ensure that my work would be correct, but since I'm a student and don't have access to a database, I wrote up this notebook so I could generate data, add the data to a `sqlite3` database, and practice queries on it. While working on this, I found Corey Schafer's [SQLite tutorial video](https://www.youtube.com/watch?v=pd-0G0MigUA&t=1026s) to be extremely useful. \n", "\n", "*Disclaimer*: Please note that this data is all **generated**, and **does not** represent real business data from any firm I have interviewed with.\n", "\n", "Now, on to the fun part!\n", "\n", "Our two questions are as follows:\n", "\n", "1. Given an “Orders” table with the columns provided, write a SQL query that shows the total quantity ordered & total revenue per product group for the month of July 2015.\n", " \n", "2. Using the “Orders” table, which products (identified by product_id) were sold at least twice in 2017 and once in 2018? Write a SQL query that will return the answer." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import sqlite3\n", "import pandas as pd\n", "import numpy.random as r" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Establish a sqlite connection (this creates orders.db if it doesn't already exist)\n", "conn = sqlite3.connect('orders.db')\n", "\n", "# Create a cursor, which we use to execute statements\n", "c = conn.cursor()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Outline our SQL statement that creates the table `orders`\n", "create_statement = \"\"\"CREATE TABLE orders (\n", " order_item_id integer,\n", " order_id integer,\n", " order_date text,\n", " product_id integer,\n", " quantity integer,\n", " unit_price real,\n", " product_group varchar)\"\"\"\n", "\n", "# If the orders table already exists, this try-except statement will drop it and create a new one\n", "try:\n", " c.execute(create_statement)\n", "except sqlite3.OperationalError: \n", " c.execute('DROP TABLE orders')\n", " conn.commit()\n", " c.execute(create_statement)\n", "\n", "conn.commit()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# Create a test entry: order_1\n", "order_1 = (881,\n", " 4,\n", " \"2018-10-20\",\n", " 9,\n", " 2,\n", " 4.99,\n", " \"cleanser\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Best practice here is to use ?s when inserting arbitrary values; most other string formatting solutions are vulnerable to SQL injections. " ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "c.execute('INSERT INTO orders VALUES (?, ?, ?, ?, ?, ?, ?)', order_1)\n", "\n", "conn.commit()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's grab our one entry using a WHERE statement. We'll find it by date:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(881, 4, '2018-10-20', 9, 2, 4.99, 'cleanser')\n" ] } ], "source": [ "sql = \"\"\"SELECT * FROM orders\n", " WHERE order_date LIKE ?\"\"\"\n", "\n", "args = ['2018' + \"%\"] # Read as: \"begins with 2018\"\n", "\n", "c.execute(sql, args)\n", "conn.commit()\n", "\n", "print(c.fetchone())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`numpy`'s `random` module doesn't include a function for generating dates, so I wrote one up quickly. It works by selecting only from days between 1 and 29, to avoid adding in rules for which days can be drawn based on the month. Year was restricted to 2015-2019 to favor the exercises." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['2015-8-29', '2018-1-9', '2015-5-19', '2017-7-7', '2016-1-7']" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def generate_dates(n):\n", " \"\"\"\n", " Generates n dates. Includes all months, but only includes days from 1-29 \n", " and years from 2016-2019.\n", " \"\"\"\n", " \n", " dates_list = []\n", " \n", " for num in range(n):\n", " dates_list.append('{}-{}-{}'.format(str(r.randint(2015, 2020)), # year\n", " str(r.randint(1, 12)), # month\n", " str(r.randint(1, 30)))) # day\n", " \n", " return dates_list\n", " \n", "# Example dates:\n", "generate_dates(5)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "def generate_data(n=10, seed=10):\n", " \"\"\"\n", " Generates a pandas DataFrame with n (default 10) data points for this particular table. \n", " Random number seed is set to 10 by default.\n", " \"\"\"\n", " \n", " r.seed(seed) # set seed; default 10\n", " \n", " order_id = r.randint(1, 100, n) # Order ID\n", " order_item_id = r.randint(100, 1000, n) # Order item ID\n", " order_date = generate_dates(n) # # Order date\n", " product_id = r.randint(1, 51, n) # Product ID\n", " quantity = r.randint(1, 5, n) # Quantity\n", " unit_price = r.choice([4.99, 7.99, 12.99, 19.99, 29.99], n) # Unit price\n", " product_group = r.choice(['cleanser', 'moisturizer', 'toner', 'treatment'], n) # Product group\n", " \n", " # Create dict\n", " d = {'order_id': order_id, \n", " 'order_item_id': order_item_id,\n", " 'order_date': order_date,\n", " 'product_id': product_id,\n", " 'quantity': quantity,\n", " 'unit_price': unit_price,\n", " 'product_group': product_group}\n", " \n", " # Create the DataFrame from the dict\n", " df = pd.DataFrame(data = d)\n", " \n", " return df" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
order_idorder_item_idorder_dateproduct_idquantityunit_priceproduct_group
0113982017-4-113344.99cleanser
1743862016-5-224424.99toner
2856562019-1-194617.99treatment
3808602015-1-264037.99treatment
4812612015-3-2141312.99toner
\n", "
" ], "text/plain": [ " order_id order_item_id order_date product_id quantity unit_price \\\n", "0 11 398 2017-4-11 33 4 4.99 \n", "1 74 386 2016-5-22 44 2 4.99 \n", "2 85 656 2019-1-19 46 1 7.99 \n", "3 80 860 2015-1-26 40 3 7.99 \n", "4 81 261 2015-3-21 41 3 12.99 \n", "\n", " product_group \n", "0 cleanser \n", "1 toner \n", "2 treatment \n", "3 treatment \n", "4 toner " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Generate 500 data entries\n", "fake_data = generate_data(n=500, seed=6)\n", "\n", "fake_data.head()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# Add fake data to orders table\n", "fake_data.to_sql('orders', conn, if_exists='append', index=False)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(881, 4, '2018-10-20', 9, 2, 4.99, 'cleanser')\n", "(398, 11, '2017-4-11', 33, 4, 4.99, 'cleanser')\n", "(386, 74, '2016-5-22', 44, 2, 4.99, 'toner')\n", "(656, 85, '2019-1-19', 46, 1, 7.99, 'treatment')\n", "(860, 80, '2015-1-26', 40, 3, 7.99, 'treatment')\n", "(261, 81, '2015-3-21', 41, 3, 12.99, 'toner')\n", "(937, 63, '2017-3-26', 37, 4, 4.99, 'cleanser')\n", "(960, 26, '2017-7-10', 5, 1, 29.99, 'cleanser')\n", "(977, 2, '2015-4-24', 35, 4, 7.99, 'moisturizer')\n", "(291, 76, '2019-1-17', 23, 3, 12.99, 'moisturizer')\n" ] } ], "source": [ "# First 10 entries of the table\n", "sql = 'SELECT * FROM orders LIMIT 10'\n", "\n", "c.execute(sql)\n", "conn.commit()\n", "\n", "for entry in c.fetchall():\n", " print(entry)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Question 1\n", "\n", "Given an “Orders” table with the above columns, write a SQL query that shows the total quantity ordered & total revenue per product group for the month of July 2015." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('cleanser', 4, 44.96)\n", "('moisturizer', 9, 193.90999999999997)\n", "('toner', 14, 277.86)\n", "('treatment', 12, 155.88)\n" ] } ], "source": [ "# Query needs to:\n", "# Restrict data to July 2015 orders only\n", "# Group by product group \n", "# Sum the quantity ordered across all orders\n", "# Sum the unit price * quantity across all orders\n", "\n", "sql = \"\"\"SELECT product_group, SUM(quantity), SUM(quantity * unit_price) AS revenue\n", " FROM orders\n", " WHERE order_date LIKE ?\n", " GROUP BY product_group\"\"\"\n", "\n", "args = ['2015-7' + '%']\n", "\n", "c.execute(sql, args)\n", "for entry in c.fetchall():\n", " print(entry)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The reason we need to use question marks is because Python has special instructions for when it sees a `%` symbol (causing a syntax error when passing arguments directly with `sqlite3`), so it's simpler to pass arguments separately rather than embedding them in the query. \n", "\n", "With the date embedded into the query, it'd look like this:\n", "\n", "```SELECT product_group, SUM(quantity), SUM(quantity * unit_price) AS revenue\n", " FROM orders\n", " WHERE order_date LIKE 2015-7%\n", " GROUP BY product_group```" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
order_idorder_item_idorder_dateproduct_idquantityunit_priceproduct_group
20326372015-7-532129.99cleanser
72706802015-7-2322119.99moisturizer
149286362015-7-164414.99treatment
184892542015-7-1729219.99toner
\n", "
" ], "text/plain": [ " order_id order_item_id order_date product_id quantity unit_price \\\n", "20 32 637 2015-7-5 32 1 29.99 \n", "72 70 680 2015-7-23 22 1 19.99 \n", "149 28 636 2015-7-16 44 1 4.99 \n", "184 89 254 2015-7-17 29 2 19.99 \n", "\n", " product_group \n", "20 cleanser \n", "72 moisturizer \n", "149 treatment \n", "184 toner " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Use the pandas DataFrame to verify that the quantities match\n", "july_2015 = fake_data.loc[fake_data['order_date'].str.contains('2015-7')]\n", "july_2015.head(4)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "product_group\n", "cleanser 4\n", "moisturizer 9\n", "toner 14\n", "treatment 12\n", "Name: quantity, dtype: int64" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Sanity check\n", "july_2015.groupby('product_group')['quantity'].sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Question 2\n", "\n", "Using the “Orders” table, which products (identified by `product_id`) were sold at least twice in 2017 and once in 2018? Write a SQL query that will return the answer." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(1,)\n", "(2,)\n", "(4,)\n", "(5,)\n", "(7,)\n", "(12,)\n", "(15,)\n", "(16,)\n", "(21,)\n", "(22,)\n", "(25,)\n", "(26,)\n", "(27,)\n", "(30,)\n", "(31,)\n", "(32,)\n", "(33,)\n", "(38,)\n", "(41,)\n", "(42,)\n", "(44,)\n", "(45,)\n", "(46,)\n", "(47,)\n", "(48,)\n", "(50,)\n" ] } ], "source": [ "sql = \"\"\"SELECT product_id\n", " FROM (SELECT product_id, COUNT(*) FROM orders\n", " AS sold_2_2017\n", " WHERE order_date LIKE ?\n", " GROUP BY product_id \n", " HAVING COUNT(*) >= 2)\n", " WHERE product_id IN (SELECT product_id FROM orders WHERE order_date LIKE ?)\"\"\"\n", "\n", "args = [('2017' + '%'), ('2018' + '%')]\n", "\n", "c.execute(sql, args)\n", "for entry in c.fetchall():\n", " print(entry)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To break down the above query: What we're doing is selecting products that were *also* sold in 2018 from a nested table of products that were sold twice in 2017. \n", "\n", "Given the date structure \"YYYY-MM-DD\", it would look like this as a regular SQL query:\n", "\n", "```SELECT product_id\n", " FROM (SELECT product_id, COUNT(*) FROM orders\n", " AS sold_2_2017\n", " WHERE order_date LIKE 2017%\n", " GROUP BY product_id \n", " HAVING COUNT(*) >= 2)\n", " WHERE product_id IN (SELECT product_id FROM orders WHERE order_date LIKE 2018%)```" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "# Close up the connection\n", "c.close()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.2" } }, "nbformat": 4, "nbformat_minor": 2 }