{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Practicing SQL Queries using sqlite3 in Python\n",
    "\n",
    "Recently, I worked on a case study in which I needed to write accurate SQL queries that would answer practical business questions. I wanted to ensure that my work would be correct, but since I'm a student and don't have access to a database, I wrote up this notebook so I could generate data, add the data to a `sqlite3` database, and practice queries on it. While working on this, I found Corey Schafer's [SQLite tutorial video](https://www.youtube.com/watch?v=pd-0G0MigUA&t=1026s) to be extremely useful. \n",
    "\n",
    "*Disclaimer*: Please note that this data is all **generated**, and **does not** represent real business data from any firm I have interviewed with.\n",
    "\n",
    "Now, on to the fun part!\n",
    "\n",
    "Our two questions are as follows:\n",
    "\n",
    "1. Given an “Orders” table with the columns provided, write a SQL query that shows the total quantity ordered & total revenue per product group for the month of July 2015.\n",
    " \n",
    "2. Using the “Orders” table, which products (identified by product_id) were sold at least twice in 2017 and once in 2018? Write a SQL query that will return the answer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sqlite3\n",
    "import pandas as pd\n",
    "import numpy.random as r"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Establish a sqlite connection (this creates orders.db if it doesn't already exist)\n",
    "conn = sqlite3.connect('orders.db')\n",
    "\n",
    "# Create a cursor, which we use to execute statements\n",
    "c = conn.cursor()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Outline our SQL statement that creates the table `orders`\n",
    "create_statement = \"\"\"CREATE TABLE orders (\n",
    "                    order_item_id integer,\n",
    "                    order_id integer,\n",
    "                    order_date text,\n",
    "                    product_id integer,\n",
    "                    quantity integer,\n",
    "                    unit_price real,\n",
    "                    product_group varchar)\"\"\"\n",
    "\n",
    "# If the orders table already exists, this try-except statement will drop it and create a new one\n",
    "try:\n",
    "    c.execute(create_statement)\n",
    "except sqlite3.OperationalError: \n",
    "    c.execute('DROP TABLE orders')\n",
    "    conn.commit()\n",
    "    c.execute(create_statement)\n",
    "\n",
    "conn.commit()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a test entry: order_1\n",
    "order_1 = (881,\n",
    "           4,\n",
    "          \"2018-10-20\",\n",
    "          9,\n",
    "          2,\n",
    "          4.99,\n",
    "          \"cleanser\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Best practice here is to use ?s when inserting arbitrary values; most other string formatting solutions are vulnerable to SQL injections. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "c.execute('INSERT INTO orders VALUES (?, ?, ?, ?, ?, ?, ?)', order_1)\n",
    "\n",
    "conn.commit()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's grab our one entry using a WHERE statement. We'll find it by date:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(881, 4, '2018-10-20', 9, 2, 4.99, 'cleanser')\n"
     ]
    }
   ],
   "source": [
    "sql = \"\"\"SELECT * FROM orders\n",
    "            WHERE order_date LIKE ?\"\"\"\n",
    "\n",
    "args = ['2018' + \"%\"] # Read as: \"begins with 2018\"\n",
    "\n",
    "c.execute(sql, args)\n",
    "conn.commit()\n",
    "\n",
    "print(c.fetchone())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`numpy`'s `random` module doesn't include a function for generating dates, so I wrote one up quickly. It works by selecting only from days between 1 and 29, to avoid adding in rules for which days can be drawn based on the month. Year was restricted to 2015-2019 to favor the exercises."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['2015-8-29', '2018-1-9', '2015-5-19', '2017-7-7', '2016-1-7']"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def generate_dates(n):\n",
    "    \"\"\"\n",
    "    Generates n dates. Includes all months, but only includes days from 1-29 \n",
    "    and years from 2016-2019.\n",
    "    \"\"\"\n",
    "    \n",
    "    dates_list = []\n",
    "    \n",
    "    for num in range(n):\n",
    "        dates_list.append('{}-{}-{}'.format(str(r.randint(2015, 2020)), # year\n",
    "                                            str(r.randint(1, 12)), # month\n",
    "                                            str(r.randint(1, 30)))) # day\n",
    "      \n",
    "    return dates_list\n",
    "    \n",
    "# Example dates:\n",
    "generate_dates(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "def generate_data(n=10, seed=10):\n",
    "    \"\"\"\n",
    "    Generates a pandas DataFrame with n (default 10) data points for this particular table. \n",
    "    Random number seed is set to 10 by default.\n",
    "    \"\"\"\n",
    "    \n",
    "    r.seed(seed) # set seed; default 10\n",
    "    \n",
    "    order_id = r.randint(1, 100, n) # Order ID\n",
    "    order_item_id = r.randint(100, 1000, n) # Order item ID\n",
    "    order_date = generate_dates(n) # # Order date\n",
    "    product_id = r.randint(1, 51, n) # Product ID\n",
    "    quantity = r.randint(1, 5, n) # Quantity\n",
    "    unit_price = r.choice([4.99, 7.99, 12.99, 19.99, 29.99], n) # Unit price\n",
    "    product_group = r.choice(['cleanser', 'moisturizer', 'toner', 'treatment'], n) # Product group\n",
    "    \n",
    "    # Create dict\n",
    "    d = {'order_id': order_id, \n",
    "         'order_item_id': order_item_id,\n",
    "         'order_date': order_date,\n",
    "         'product_id': product_id,\n",
    "         'quantity': quantity,\n",
    "         'unit_price': unit_price,\n",
    "         'product_group': product_group}\n",
    "    \n",
    "    # Create the DataFrame from the dict\n",
    "    df = pd.DataFrame(data = d)\n",
    "    \n",
    "    return df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>order_id</th>\n",
       "      <th>order_item_id</th>\n",
       "      <th>order_date</th>\n",
       "      <th>product_id</th>\n",
       "      <th>quantity</th>\n",
       "      <th>unit_price</th>\n",
       "      <th>product_group</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>11</td>\n",
       "      <td>398</td>\n",
       "      <td>2017-4-11</td>\n",
       "      <td>33</td>\n",
       "      <td>4</td>\n",
       "      <td>4.99</td>\n",
       "      <td>cleanser</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>74</td>\n",
       "      <td>386</td>\n",
       "      <td>2016-5-22</td>\n",
       "      <td>44</td>\n",
       "      <td>2</td>\n",
       "      <td>4.99</td>\n",
       "      <td>toner</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>85</td>\n",
       "      <td>656</td>\n",
       "      <td>2019-1-19</td>\n",
       "      <td>46</td>\n",
       "      <td>1</td>\n",
       "      <td>7.99</td>\n",
       "      <td>treatment</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>80</td>\n",
       "      <td>860</td>\n",
       "      <td>2015-1-26</td>\n",
       "      <td>40</td>\n",
       "      <td>3</td>\n",
       "      <td>7.99</td>\n",
       "      <td>treatment</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>81</td>\n",
       "      <td>261</td>\n",
       "      <td>2015-3-21</td>\n",
       "      <td>41</td>\n",
       "      <td>3</td>\n",
       "      <td>12.99</td>\n",
       "      <td>toner</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   order_id  order_item_id order_date  product_id  quantity  unit_price  \\\n",
       "0        11            398  2017-4-11          33         4        4.99   \n",
       "1        74            386  2016-5-22          44         2        4.99   \n",
       "2        85            656  2019-1-19          46         1        7.99   \n",
       "3        80            860  2015-1-26          40         3        7.99   \n",
       "4        81            261  2015-3-21          41         3       12.99   \n",
       "\n",
       "  product_group  \n",
       "0      cleanser  \n",
       "1         toner  \n",
       "2     treatment  \n",
       "3     treatment  \n",
       "4         toner  "
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Generate 500 data entries\n",
    "fake_data = generate_data(n=500, seed=6)\n",
    "\n",
    "fake_data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Add fake data to orders table\n",
    "fake_data.to_sql('orders', conn, if_exists='append', index=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(881, 4, '2018-10-20', 9, 2, 4.99, 'cleanser')\n",
      "(398, 11, '2017-4-11', 33, 4, 4.99, 'cleanser')\n",
      "(386, 74, '2016-5-22', 44, 2, 4.99, 'toner')\n",
      "(656, 85, '2019-1-19', 46, 1, 7.99, 'treatment')\n",
      "(860, 80, '2015-1-26', 40, 3, 7.99, 'treatment')\n",
      "(261, 81, '2015-3-21', 41, 3, 12.99, 'toner')\n",
      "(937, 63, '2017-3-26', 37, 4, 4.99, 'cleanser')\n",
      "(960, 26, '2017-7-10', 5, 1, 29.99, 'cleanser')\n",
      "(977, 2, '2015-4-24', 35, 4, 7.99, 'moisturizer')\n",
      "(291, 76, '2019-1-17', 23, 3, 12.99, 'moisturizer')\n"
     ]
    }
   ],
   "source": [
    "# First 10 entries of the table\n",
    "sql = 'SELECT * FROM orders LIMIT 10'\n",
    "\n",
    "c.execute(sql)\n",
    "conn.commit()\n",
    "\n",
    "for entry in c.fetchall():\n",
    "    print(entry)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Question 1\n",
    "\n",
    "Given an “Orders” table with the above columns, write a SQL query that shows the total quantity ordered & total revenue per product group for the month of July 2015."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "('cleanser', 4, 44.96)\n",
      "('moisturizer', 9, 193.90999999999997)\n",
      "('toner', 14, 277.86)\n",
      "('treatment', 12, 155.88)\n"
     ]
    }
   ],
   "source": [
    "# Query needs to:\n",
    "# Restrict data to July 2015 orders only\n",
    "# Group by product group \n",
    "# Sum the quantity ordered across all orders\n",
    "# Sum the unit price * quantity across all orders\n",
    "\n",
    "sql = \"\"\"SELECT product_group, SUM(quantity), SUM(quantity * unit_price) AS revenue\n",
    "            FROM orders\n",
    "            WHERE order_date LIKE ?\n",
    "            GROUP BY product_group\"\"\"\n",
    "\n",
    "args = ['2015-7' + '%']\n",
    "\n",
    "c.execute(sql, args)\n",
    "for entry in c.fetchall():\n",
    "    print(entry)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The reason we need to use question marks is because Python has special instructions for when it sees a `%` symbol (causing a syntax error when passing arguments directly with `sqlite3`), so it's simpler to pass arguments separately rather than embedding them in the query. \n",
    "\n",
    "With the date embedded into the query, it'd look like this:\n",
    "\n",
    "```SELECT product_group, SUM(quantity), SUM(quantity * unit_price) AS revenue\n",
    "    FROM orders\n",
    "    WHERE order_date LIKE 2015-7%\n",
    "    GROUP BY product_group```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>order_id</th>\n",
       "      <th>order_item_id</th>\n",
       "      <th>order_date</th>\n",
       "      <th>product_id</th>\n",
       "      <th>quantity</th>\n",
       "      <th>unit_price</th>\n",
       "      <th>product_group</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>32</td>\n",
       "      <td>637</td>\n",
       "      <td>2015-7-5</td>\n",
       "      <td>32</td>\n",
       "      <td>1</td>\n",
       "      <td>29.99</td>\n",
       "      <td>cleanser</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>72</th>\n",
       "      <td>70</td>\n",
       "      <td>680</td>\n",
       "      <td>2015-7-23</td>\n",
       "      <td>22</td>\n",
       "      <td>1</td>\n",
       "      <td>19.99</td>\n",
       "      <td>moisturizer</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>149</th>\n",
       "      <td>28</td>\n",
       "      <td>636</td>\n",
       "      <td>2015-7-16</td>\n",
       "      <td>44</td>\n",
       "      <td>1</td>\n",
       "      <td>4.99</td>\n",
       "      <td>treatment</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>184</th>\n",
       "      <td>89</td>\n",
       "      <td>254</td>\n",
       "      <td>2015-7-17</td>\n",
       "      <td>29</td>\n",
       "      <td>2</td>\n",
       "      <td>19.99</td>\n",
       "      <td>toner</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     order_id  order_item_id order_date  product_id  quantity  unit_price  \\\n",
       "20         32            637   2015-7-5          32         1       29.99   \n",
       "72         70            680  2015-7-23          22         1       19.99   \n",
       "149        28            636  2015-7-16          44         1        4.99   \n",
       "184        89            254  2015-7-17          29         2       19.99   \n",
       "\n",
       "    product_group  \n",
       "20       cleanser  \n",
       "72    moisturizer  \n",
       "149     treatment  \n",
       "184         toner  "
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Use the pandas DataFrame to verify that the quantities match\n",
    "july_2015 = fake_data.loc[fake_data['order_date'].str.contains('2015-7')]\n",
    "july_2015.head(4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "product_group\n",
       "cleanser        4\n",
       "moisturizer     9\n",
       "toner          14\n",
       "treatment      12\n",
       "Name: quantity, dtype: int64"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Sanity check\n",
    "july_2015.groupby('product_group')['quantity'].sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Question 2\n",
    "\n",
    "Using the “Orders” table, which products (identified by `product_id`) were sold at least twice in 2017 and once in 2018? Write a SQL query that will return the answer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(1,)\n",
      "(2,)\n",
      "(4,)\n",
      "(5,)\n",
      "(7,)\n",
      "(12,)\n",
      "(15,)\n",
      "(16,)\n",
      "(21,)\n",
      "(22,)\n",
      "(25,)\n",
      "(26,)\n",
      "(27,)\n",
      "(30,)\n",
      "(31,)\n",
      "(32,)\n",
      "(33,)\n",
      "(38,)\n",
      "(41,)\n",
      "(42,)\n",
      "(44,)\n",
      "(45,)\n",
      "(46,)\n",
      "(47,)\n",
      "(48,)\n",
      "(50,)\n"
     ]
    }
   ],
   "source": [
    "sql = \"\"\"SELECT product_id\n",
    "             FROM (SELECT product_id, COUNT(*) FROM orders\n",
    "                AS sold_2_2017\n",
    "                WHERE order_date LIKE ?\n",
    "                GROUP BY product_id \n",
    "                HAVING COUNT(*) >= 2)\n",
    "            WHERE product_id IN (SELECT product_id FROM orders WHERE order_date LIKE ?)\"\"\"\n",
    "\n",
    "args = [('2017' + '%'), ('2018' + '%')]\n",
    "\n",
    "c.execute(sql, args)\n",
    "for entry in c.fetchall():\n",
    "    print(entry)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To break down the above query: What we're doing is selecting products that were *also* sold in 2018 from a nested table of products that were sold twice in 2017. \n",
    "\n",
    "Given the date structure \"YYYY-MM-DD\", it would look like this as a regular SQL query:\n",
    "\n",
    "```SELECT product_id\n",
    "     FROM (SELECT product_id, COUNT(*) FROM orders\n",
    "        AS sold_2_2017\n",
    "        WHERE order_date LIKE 2017%\n",
    "        GROUP BY product_id \n",
    "        HAVING COUNT(*) >= 2)\n",
    "    WHERE product_id IN (SELECT product_id FROM orders WHERE order_date LIKE 2018%)```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Close up the connection\n",
    "c.close()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}