{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Practicing SQL Queries using sqlite3 in Python\n",
"\n",
"Recently, I worked on a case study in which I needed to write accurate SQL queries that would answer practical business questions. I wanted to ensure that my work would be correct, but since I'm a student and don't have access to a database, I wrote up this notebook so I could generate data, add the data to a `sqlite3` database, and practice queries on it. While working on this, I found Corey Schafer's [SQLite tutorial video](https://www.youtube.com/watch?v=pd-0G0MigUA&t=1026s) to be extremely useful. \n",
"\n",
"*Disclaimer*: Please note that this data is all **generated**, and **does not** represent real business data from any firm I have interviewed with.\n",
"\n",
"Now, on to the fun part!\n",
"\n",
"Our two questions are as follows:\n",
"\n",
"1. Given an “Orders” table with the columns provided, write a SQL query that shows the total quantity ordered & total revenue per product group for the month of July 2015.\n",
" \n",
"2. Using the “Orders” table, which products (identified by product_id) were sold at least twice in 2017 and once in 2018? Write a SQL query that will return the answer."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import sqlite3\n",
"import pandas as pd\n",
"import numpy.random as r"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Establish a sqlite connection (this creates orders.db if it doesn't already exist)\n",
"conn = sqlite3.connect('orders.db')\n",
"\n",
"# Create a cursor, which we use to execute statements\n",
"c = conn.cursor()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# Outline our SQL statement that creates the table `orders`\n",
"create_statement = \"\"\"CREATE TABLE orders (\n",
" order_item_id integer,\n",
" order_id integer,\n",
" order_date text,\n",
" product_id integer,\n",
" quantity integer,\n",
" unit_price real,\n",
" product_group varchar)\"\"\"\n",
"\n",
"# If the orders table already exists, this try-except statement will drop it and create a new one\n",
"try:\n",
" c.execute(create_statement)\n",
"except sqlite3.OperationalError: \n",
" c.execute('DROP TABLE orders')\n",
" conn.commit()\n",
" c.execute(create_statement)\n",
"\n",
"conn.commit()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# Create a test entry: order_1\n",
"order_1 = (881,\n",
" 4,\n",
" \"2018-10-20\",\n",
" 9,\n",
" 2,\n",
" 4.99,\n",
" \"cleanser\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Best practice here is to use ?s when inserting arbitrary values; most other string formatting solutions are vulnerable to SQL injections. "
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"c.execute('INSERT INTO orders VALUES (?, ?, ?, ?, ?, ?, ?)', order_1)\n",
"\n",
"conn.commit()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's grab our one entry using a WHERE statement. We'll find it by date:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(881, 4, '2018-10-20', 9, 2, 4.99, 'cleanser')\n"
]
}
],
"source": [
"sql = \"\"\"SELECT * FROM orders\n",
" WHERE order_date LIKE ?\"\"\"\n",
"\n",
"args = ['2018' + \"%\"] # Read as: \"begins with 2018\"\n",
"\n",
"c.execute(sql, args)\n",
"conn.commit()\n",
"\n",
"print(c.fetchone())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`numpy`'s `random` module doesn't include a function for generating dates, so I wrote one up quickly. It works by selecting only from days between 1 and 29, to avoid adding in rules for which days can be drawn based on the month. Year was restricted to 2015-2019 to favor the exercises."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['2015-8-29', '2018-1-9', '2015-5-19', '2017-7-7', '2016-1-7']"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def generate_dates(n):\n",
" \"\"\"\n",
" Generates n dates. Includes all months, but only includes days from 1-29 \n",
" and years from 2016-2019.\n",
" \"\"\"\n",
" \n",
" dates_list = []\n",
" \n",
" for num in range(n):\n",
" dates_list.append('{}-{}-{}'.format(str(r.randint(2015, 2020)), # year\n",
" str(r.randint(1, 12)), # month\n",
" str(r.randint(1, 30)))) # day\n",
" \n",
" return dates_list\n",
" \n",
"# Example dates:\n",
"generate_dates(5)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"def generate_data(n=10, seed=10):\n",
" \"\"\"\n",
" Generates a pandas DataFrame with n (default 10) data points for this particular table. \n",
" Random number seed is set to 10 by default.\n",
" \"\"\"\n",
" \n",
" r.seed(seed) # set seed; default 10\n",
" \n",
" order_id = r.randint(1, 100, n) # Order ID\n",
" order_item_id = r.randint(100, 1000, n) # Order item ID\n",
" order_date = generate_dates(n) # # Order date\n",
" product_id = r.randint(1, 51, n) # Product ID\n",
" quantity = r.randint(1, 5, n) # Quantity\n",
" unit_price = r.choice([4.99, 7.99, 12.99, 19.99, 29.99], n) # Unit price\n",
" product_group = r.choice(['cleanser', 'moisturizer', 'toner', 'treatment'], n) # Product group\n",
" \n",
" # Create dict\n",
" d = {'order_id': order_id, \n",
" 'order_item_id': order_item_id,\n",
" 'order_date': order_date,\n",
" 'product_id': product_id,\n",
" 'quantity': quantity,\n",
" 'unit_price': unit_price,\n",
" 'product_group': product_group}\n",
" \n",
" # Create the DataFrame from the dict\n",
" df = pd.DataFrame(data = d)\n",
" \n",
" return df"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" order_id | \n",
" order_item_id | \n",
" order_date | \n",
" product_id | \n",
" quantity | \n",
" unit_price | \n",
" product_group | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 11 | \n",
" 398 | \n",
" 2017-4-11 | \n",
" 33 | \n",
" 4 | \n",
" 4.99 | \n",
" cleanser | \n",
"
\n",
" \n",
" 1 | \n",
" 74 | \n",
" 386 | \n",
" 2016-5-22 | \n",
" 44 | \n",
" 2 | \n",
" 4.99 | \n",
" toner | \n",
"
\n",
" \n",
" 2 | \n",
" 85 | \n",
" 656 | \n",
" 2019-1-19 | \n",
" 46 | \n",
" 1 | \n",
" 7.99 | \n",
" treatment | \n",
"
\n",
" \n",
" 3 | \n",
" 80 | \n",
" 860 | \n",
" 2015-1-26 | \n",
" 40 | \n",
" 3 | \n",
" 7.99 | \n",
" treatment | \n",
"
\n",
" \n",
" 4 | \n",
" 81 | \n",
" 261 | \n",
" 2015-3-21 | \n",
" 41 | \n",
" 3 | \n",
" 12.99 | \n",
" toner | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" order_id order_item_id order_date product_id quantity unit_price \\\n",
"0 11 398 2017-4-11 33 4 4.99 \n",
"1 74 386 2016-5-22 44 2 4.99 \n",
"2 85 656 2019-1-19 46 1 7.99 \n",
"3 80 860 2015-1-26 40 3 7.99 \n",
"4 81 261 2015-3-21 41 3 12.99 \n",
"\n",
" product_group \n",
"0 cleanser \n",
"1 toner \n",
"2 treatment \n",
"3 treatment \n",
"4 toner "
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Generate 500 data entries\n",
"fake_data = generate_data(n=500, seed=6)\n",
"\n",
"fake_data.head()"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"# Add fake data to orders table\n",
"fake_data.to_sql('orders', conn, if_exists='append', index=False)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(881, 4, '2018-10-20', 9, 2, 4.99, 'cleanser')\n",
"(398, 11, '2017-4-11', 33, 4, 4.99, 'cleanser')\n",
"(386, 74, '2016-5-22', 44, 2, 4.99, 'toner')\n",
"(656, 85, '2019-1-19', 46, 1, 7.99, 'treatment')\n",
"(860, 80, '2015-1-26', 40, 3, 7.99, 'treatment')\n",
"(261, 81, '2015-3-21', 41, 3, 12.99, 'toner')\n",
"(937, 63, '2017-3-26', 37, 4, 4.99, 'cleanser')\n",
"(960, 26, '2017-7-10', 5, 1, 29.99, 'cleanser')\n",
"(977, 2, '2015-4-24', 35, 4, 7.99, 'moisturizer')\n",
"(291, 76, '2019-1-17', 23, 3, 12.99, 'moisturizer')\n"
]
}
],
"source": [
"# First 10 entries of the table\n",
"sql = 'SELECT * FROM orders LIMIT 10'\n",
"\n",
"c.execute(sql)\n",
"conn.commit()\n",
"\n",
"for entry in c.fetchall():\n",
" print(entry)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Question 1\n",
"\n",
"Given an “Orders” table with the above columns, write a SQL query that shows the total quantity ordered & total revenue per product group for the month of July 2015."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"('cleanser', 4, 44.96)\n",
"('moisturizer', 9, 193.90999999999997)\n",
"('toner', 14, 277.86)\n",
"('treatment', 12, 155.88)\n"
]
}
],
"source": [
"# Query needs to:\n",
"# Restrict data to July 2015 orders only\n",
"# Group by product group \n",
"# Sum the quantity ordered across all orders\n",
"# Sum the unit price * quantity across all orders\n",
"\n",
"sql = \"\"\"SELECT product_group, SUM(quantity), SUM(quantity * unit_price) AS revenue\n",
" FROM orders\n",
" WHERE order_date LIKE ?\n",
" GROUP BY product_group\"\"\"\n",
"\n",
"args = ['2015-7' + '%']\n",
"\n",
"c.execute(sql, args)\n",
"for entry in c.fetchall():\n",
" print(entry)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The reason we need to use question marks is because Python has special instructions for when it sees a `%` symbol (causing a syntax error when passing arguments directly with `sqlite3`), so it's simpler to pass arguments separately rather than embedding them in the query. \n",
"\n",
"With the date embedded into the query, it'd look like this:\n",
"\n",
"```SELECT product_group, SUM(quantity), SUM(quantity * unit_price) AS revenue\n",
" FROM orders\n",
" WHERE order_date LIKE 2015-7%\n",
" GROUP BY product_group```"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" order_id | \n",
" order_item_id | \n",
" order_date | \n",
" product_id | \n",
" quantity | \n",
" unit_price | \n",
" product_group | \n",
"
\n",
" \n",
" \n",
" \n",
" 20 | \n",
" 32 | \n",
" 637 | \n",
" 2015-7-5 | \n",
" 32 | \n",
" 1 | \n",
" 29.99 | \n",
" cleanser | \n",
"
\n",
" \n",
" 72 | \n",
" 70 | \n",
" 680 | \n",
" 2015-7-23 | \n",
" 22 | \n",
" 1 | \n",
" 19.99 | \n",
" moisturizer | \n",
"
\n",
" \n",
" 149 | \n",
" 28 | \n",
" 636 | \n",
" 2015-7-16 | \n",
" 44 | \n",
" 1 | \n",
" 4.99 | \n",
" treatment | \n",
"
\n",
" \n",
" 184 | \n",
" 89 | \n",
" 254 | \n",
" 2015-7-17 | \n",
" 29 | \n",
" 2 | \n",
" 19.99 | \n",
" toner | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" order_id order_item_id order_date product_id quantity unit_price \\\n",
"20 32 637 2015-7-5 32 1 29.99 \n",
"72 70 680 2015-7-23 22 1 19.99 \n",
"149 28 636 2015-7-16 44 1 4.99 \n",
"184 89 254 2015-7-17 29 2 19.99 \n",
"\n",
" product_group \n",
"20 cleanser \n",
"72 moisturizer \n",
"149 treatment \n",
"184 toner "
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Use the pandas DataFrame to verify that the quantities match\n",
"july_2015 = fake_data.loc[fake_data['order_date'].str.contains('2015-7')]\n",
"july_2015.head(4)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"product_group\n",
"cleanser 4\n",
"moisturizer 9\n",
"toner 14\n",
"treatment 12\n",
"Name: quantity, dtype: int64"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Sanity check\n",
"july_2015.groupby('product_group')['quantity'].sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Question 2\n",
"\n",
"Using the “Orders” table, which products (identified by `product_id`) were sold at least twice in 2017 and once in 2018? Write a SQL query that will return the answer."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(1,)\n",
"(2,)\n",
"(4,)\n",
"(5,)\n",
"(7,)\n",
"(12,)\n",
"(15,)\n",
"(16,)\n",
"(21,)\n",
"(22,)\n",
"(25,)\n",
"(26,)\n",
"(27,)\n",
"(30,)\n",
"(31,)\n",
"(32,)\n",
"(33,)\n",
"(38,)\n",
"(41,)\n",
"(42,)\n",
"(44,)\n",
"(45,)\n",
"(46,)\n",
"(47,)\n",
"(48,)\n",
"(50,)\n"
]
}
],
"source": [
"sql = \"\"\"SELECT product_id\n",
" FROM (SELECT product_id, COUNT(*) FROM orders\n",
" AS sold_2_2017\n",
" WHERE order_date LIKE ?\n",
" GROUP BY product_id \n",
" HAVING COUNT(*) >= 2)\n",
" WHERE product_id IN (SELECT product_id FROM orders WHERE order_date LIKE ?)\"\"\"\n",
"\n",
"args = [('2017' + '%'), ('2018' + '%')]\n",
"\n",
"c.execute(sql, args)\n",
"for entry in c.fetchall():\n",
" print(entry)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To break down the above query: What we're doing is selecting products that were *also* sold in 2018 from a nested table of products that were sold twice in 2017. \n",
"\n",
"Given the date structure \"YYYY-MM-DD\", it would look like this as a regular SQL query:\n",
"\n",
"```SELECT product_id\n",
" FROM (SELECT product_id, COUNT(*) FROM orders\n",
" AS sold_2_2017\n",
" WHERE order_date LIKE 2017%\n",
" GROUP BY product_id \n",
" HAVING COUNT(*) >= 2)\n",
" WHERE product_id IN (SELECT product_id FROM orders WHERE order_date LIKE 2018%)```"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"# Close up the connection\n",
"c.close()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}