{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "name": "DuckDB_Exercise2.ipynb", "provenance": [], "collapsed_sections": [], "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "x4UIW8GUQaqt" }, "source": [ "# **SETUP**" ] }, { "cell_type": "markdown", "metadata": { "id": "XXvRwAVPGtiX" }, "source": [ "First we need to install DuckDB.\n" ] }, { "cell_type": "code", "metadata": { "id": "IFC0Acb9Gu8W" }, "source": [ "!pip install duckdb --pre" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "ryMJZSckLmqG" }, "source": [ "# **Loading The Data**\n" ] }, { "cell_type": "markdown", "metadata": { "id": "EJSvVxkOL6af" }, "source": [ "We will work with a generated dataset from the TPC-H benchmark. DuckDB has built-in support for generating the dataset using the `dbgen` procedure.\n", "\n", "We create an in-memory database and generate the data inside DuckDB using the following code snippet.\n", "\n" ] }, { "cell_type": "code", "metadata": { "id": "nBPmYEMLNQ63" }, "source": [ "import duckdb\n", "con = duckdb.connect(':memory:')\n", "con.execute(\"CALL dbgen(sf=0.1)\")" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "sZ6-kBpMiJA1" }, "source": [ "# **Inspecting the Dataset**" ] }, { "cell_type": "markdown", "metadata": { "id": "jF6R9TWOfk57" }, "source": [ "The dataset consists of eight tables. We can see which tables are present in the database using the `SHOW TABLES` command.\n", "\n", "Note that we append `.df()` to the query, this fetches the result as a Pandas DataFrame which renders nicely in Colab." ] }, { "cell_type": "code", "metadata": { "id": "0HyO83Gmfnfk" }, "source": [ "con.execute(\"SHOW TABLES\").df()" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "p9ZB5Jfkf6wu" }, "source": [ "Using the `DESCRIBE` command, we can inspect the columns that are present in each of the tables. For example, we can inspect the lineitem table as follows:" ] }, { "cell_type": "code", "metadata": { "id": "p4HYVUZEfyuX" }, "source": [ "con.execute(\"DESCRIBE lineitem\").df()" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "1EFUAH2siHML" }, "source": [ "We can use the `LIMIT` clause to inspect the first few rows of the lineitem table and display them." ] }, { "cell_type": "code", "metadata": { "id": "XuF7DKkziC2-" }, "source": [ "con.execute(\"SELECT * FROM lineitem LIMIT 10\").df()" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "91OEuc7qf6DR" }, "source": [ "To get a better feeling of what a table contains, we can use the `SUMMARIZE` command. This prints out several statistics about each of the columns of the table, such as the min and max value, how many unique values there are, the average value in the column, etc." ] }, { "cell_type": "code", "metadata": { "id": "qlavVmRrf1f0" }, "source": [ "con.execute(\"SUMMARIZE lineitem\").df()" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "LcZafWWgeNQA" }, "source": [ "# **Testing and Benchmarking**" ] }, { "cell_type": "markdown", "metadata": { "id": "1dROJlGXeVkj" }, "source": [ "Let us start our assignment by running a microbenchmark against the TPC-H dataset.\n", "\n", "In order to make the benchmarking more interesting, let's compare against the SQLite database. 
{ "cell_type": "markdown", "metadata": { "id": "LcZafWWgeNQA" }, "source": [ "# **Testing and Benchmarking**" ] },
{ "cell_type": "markdown", "metadata": { "id": "1dROJlGXeVkj" }, "source": [ "Let us start our assignment by running a microbenchmark against the TPC-H dataset.\n", "\n", "To make the benchmark more interesting, let's compare against SQLite, a typical OLTP (transactional) database that is included with every Python installation.\n" ] },
{ "cell_type": "markdown", "metadata": { "id": "vaW5C0HJheWe" }, "source": [ "### **SQLite Setup**" ] },
{ "cell_type": "markdown", "metadata": { "id": "clF4mkBohg1j" }, "source": [ "We start out by creating a new in-memory database, just as we did in DuckDB." ] },
{ "cell_type": "code", "metadata": { "id": "8Z2gmvPveXnJ" }, "source": [ "import sqlite3\n", "sqlite_con = sqlite3.connect(':memory:', check_same_thread=False)" ], "execution_count": null, "outputs": [] },
{ "cell_type": "markdown", "metadata": { "id": "tUgwAT89hwpd" }, "source": [ "We can transfer the data from DuckDB to SQLite using a Pandas DataFrame. First, we export each table from DuckDB into a DataFrame using the `.df()` method. Then we use the `to_sql` function to write the data to our SQLite database, passing `index=False` so that the DataFrame index is not stored as an extra column." ] },
{ "cell_type": "code", "metadata": { "id": "fz-420csd1zo" }, "source": [ "table_list = con.execute(\"SHOW TABLES\").fetchall()\n", "\n", "for table in table_list:\n", "    tname = table[0]\n", "    # Export the table from DuckDB as a DataFrame ...\n", "    table_data = con.table(tname).df()\n", "    # ... and write it to SQLite, without the DataFrame index.\n", "    table_data.to_sql(tname, sqlite_con, index=False)" ], "execution_count": null, "outputs": [] },
{ "cell_type": "markdown", "metadata": { "id": "c9yYI7wjihsv" }, "source": [ "# **Running the Benchmark**" ] },
{ "cell_type": "markdown", "metadata": { "id": "yW6Y_qpujGcq" }, "source": [ "The query below resembles a (simplified) version of query 3 from the TPC-H benchmark:" ] },
{ "cell_type": "code", "metadata": { "id": "EGFeecgris_-" }, "source": [ "query = \"\"\"\n", "SELECT\n", " l_orderkey,\n", " sum(l_extendedprice * (1 - l_discount)) AS revenue,\n", " o_orderdate,\n", " o_shippriority\n", "FROM\n", " customer\n", " JOIN orders ON (c_custkey=o_custkey)\n", " JOIN lineitem ON (l_orderkey=o_orderkey)\n", "WHERE\n", " c_mktsegment = 'BUILDING'\n", " AND o_orderdate < CAST('1995-03-15' AS date)\n", " AND l_shipdate > CAST('1995-03-15' AS date)\n", "GROUP BY\n", " l_orderkey,\n", " o_orderdate,\n", " o_shippriority\n", "ORDER BY\n", " revenue DESC,\n", " o_orderdate\n", "LIMIT 10;\n", "\"\"\"" ], "execution_count": null, "outputs": [] },
{ "cell_type": "markdown", "metadata": { "id": "YDM3ENVrj2-t" }, "source": [ "Let's run the query in both SQLite and DuckDB and measure the execution time." ] },
{ "cell_type": "code", "metadata": { "id": "JJjqeZ6vj1DS" }, "source": [ "import time\n", "import pandas as pd\n", "\n", "\n", "def run_query(con_obj, q):\n", "    # Time a single execution of the query, including fetching all rows.\n", "    start = time.time()\n", "    con_obj.execute(q).fetchall()\n", "    end = time.time()\n", "    return str(round(end - start, 3)) + 's'\n", "\n", "duckdb_results = [run_query(con, query)]\n", "sqlite_results = [run_query(sqlite_con, query)]\n", "\n", "pd.DataFrame.from_dict({\n", "    'DuckDB': duckdb_results,\n", "    'SQLite': sqlite_results\n", "})" ], "execution_count": null, "outputs": [] },
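{ "cell_type": "markdown", "metadata": {}, "source": [ "Timings from a single run can be noisy, especially on a shared Colab machine. Below is a minimal sketch of a more robust measurement (the helper `run_query_repeated` is our own illustrative addition): repeat the query several times and report the fastest run." ] },
{ "cell_type": "code", "metadata": {}, "source": [ "def run_query_repeated(con_obj, q, repeats=5):\n", "    # Illustrative helper: run the query several times and report the\n", "    # fastest run, which reduces the impact of environmental noise.\n", "    timings = []\n", "    for _ in range(repeats):\n", "        start = time.time()\n", "        con_obj.execute(q).fetchall()\n", "        timings.append(time.time() - start)\n", "    return str(round(min(timings), 3)) + 's'\n", "\n", "run_query_repeated(con, query)" ], "execution_count": null, "outputs": [] },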
] }, { "cell_type": "code", "metadata": { "id": "JlAsNIvV-f0x" }, "source": [ "con.execute(\"PRAGMA disable_optimizer\")\n", "duckdb_unoptimized_results = [run_query(con, query)]\n", "con.execute(\"PRAGMA enable_optimizer\")\n", "\n", "pd.DataFrame.from_dict({\n", " 'DuckDB': duckdb_results,\n", " 'DuckDB (Unoptimized)': duckdb_unoptimized_results,\n", " 'SQLite': sqlite_results\n", "})" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "OQCvdodnNpKl" }, "source": [ "# **Inspecting the Query Plan**\n", "The query plan of a query can be inspected by prefixing the query with`EXPLAIN`. By default, only the physical query plan is returned. You can use `PRAGMA explain_output='all'` to output the unoptimized logical plan, the optimized logical plan and the physical plan as well." ] }, { "cell_type": "code", "metadata": { "id": "kU6-xxAWNth9" }, "source": [ "def explain_query(query):\n", " print(con.execute(\"EXPLAIN \" + query).fetchall()[0][1])\n", "\n", "explain_query(query)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "YKAtnEB2PuCp" }, "source": [ "# **Profiling Queries**\n", "Rather than only viewing the query plan, we can also run the query and look at the profile output. The function `run_and_profile_query` below performs this profiling by enabling the profiling, writing the profiling output to a file, and then printing the contents of that file to the console.\n", "\n", "The profiler output shows extra information for every operator; namely how much time was spent executing that operator, and how many tuples have moved from that operator to the operator above it. \n", "\n", "For a `SEQ_SCAN` (sequential scan), for example, it shows how many tuples have been read from the base table. For a `FILTER`, it shows how many tuples have passed the filter predicate. For a``HASH_GROUP_BY`, it shows how many groups were created and aggregated.\n", "\n", "These intermediate cardinalities are important because they do a good job of explaining why an operator takes a certain amount of time, and in many cases these intermediates can be avoided or drastically reduced by modifying the way in which a query is executed.\n", "\n" ] }, { "cell_type": "code", "metadata": { "id": "ejriXVhQP1X2" }, "source": [ "def run_and_profile_query(query):\n", " con.execute(\"PRAGMA enable_profiling\")\n", " con.execute(\"PRAGMA profiling_output='out.log'\")\n", " con.execute(query)\n", " con.execute(\"PRAGMA disable_profiling\")\n", " with open('out.log', 'r') as f:\n", " output = f.read()\n", " print(output)\n", " \n", "run_and_profile_query(query)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "VRNkx2dK7_wZ" }, "source": [ "\n", "# **Query Optimizations**\n", "\n", "An important component of a database system is the optimizer. The optimizer changes the query plan so that it is logically equivalent to the original plan, but (hopefully) executes much faster.\n", "\n", "In an ideal world, the optimizer allows the user not to worry about how to formulate a query: the user only needs to describe what result they want to see, and the database figures out the most efficient way of retrieving that result.\n", "\n", "In practice, this is certainly not always true, and in some situations it is necessary to rephrase a query. 
{ "cell_type": "markdown", "metadata": { "id": "YKAtnEB2PuCp" }, "source": [ "# **Profiling Queries**\n", "Rather than only viewing the query plan, we can also run the query and look at the profiler output. The function `run_and_profile_query` below performs this profiling by enabling profiling, writing the profiling output to a file, and then printing the contents of that file to the console.\n", "\n", "The profiler output shows extra information for every operator; namely how much time was spent executing that operator, and how many tuples have moved from that operator to the operator above it.\n", "\n", "For a `SEQ_SCAN` (sequential scan), for example, it shows how many tuples have been read from the base table. For a `FILTER`, it shows how many tuples have passed the filter predicate. For a `HASH_GROUP_BY`, it shows how many groups were created and aggregated.\n", "\n", "These intermediate cardinalities are important because they do a good job of explaining why an operator takes a certain amount of time, and in many cases these intermediates can be avoided or drastically reduced by changing the way in which a query is executed.\n", "\n" ] },
{ "cell_type": "code", "metadata": { "id": "ejriXVhQP1X2" }, "source": [ "def run_and_profile_query(query):\n", "    con.execute(\"PRAGMA enable_profiling\")\n", "    con.execute(\"PRAGMA profiling_output='out.log'\")\n", "    con.execute(query)\n", "    con.execute(\"PRAGMA disable_profiling\")\n", "    with open('out.log', 'r') as f:\n", "        output = f.read()\n", "    print(output)\n", "\n", "run_and_profile_query(query)" ], "execution_count": null, "outputs": [] },
{ "cell_type": "markdown", "metadata": { "id": "VRNkx2dK7_wZ" }, "source": [ "\n", "# **Query Optimizations**\n", "\n", "An important component of a database system is the optimizer. The optimizer changes the query plan so that it is logically equivalent to the original plan, but (hopefully) executes much faster.\n", "\n", "In an ideal world, the optimizer allows the user not to worry about how to formulate a query: the user only needs to describe what result they want to see, and the database figures out the most efficient way of retrieving that result.\n", "\n", "In practice, this is certainly not always true, and in some situations it is necessary to rephrase a query. Nevertheless, optimizers generally do a very good job of optimizing queries, and save users a lot of time that would otherwise go into manually reformulating them.\n", "\n", "Let us run the following query and see how it performs:" ] },
{ "cell_type": "code", "metadata": { "id": "R5dTktM58Dje" }, "source": [ "unoptimized_query = \"\"\"\n", "SELECT\n", " l_orderkey,\n", " sum(l_extendedprice * (1 - l_discount)) AS revenue,\n", " o_orderdate,\n", " o_shippriority\n", "FROM\n", " customer,\n", " orders,\n", " lineitem\n", "WHERE\n", " c_mktsegment = 'BUILDING'\n", " AND c_custkey = o_custkey\n", " AND l_orderkey = o_orderkey\n", " AND o_orderdate < CAST('1995-03-15' AS date)\n", " AND l_shipdate > CAST('1995-03-15' AS date)\n", "GROUP BY\n", " l_orderkey,\n", " o_orderdate,\n", " o_shippriority\n", "ORDER BY\n", " revenue DESC,\n", " o_orderdate\n", "LIMIT 10;\n", "\"\"\"\n", "\n", "run_and_profile_query(unoptimized_query)\n" ], "execution_count": null, "outputs": [] },
{ "cell_type": "markdown", "metadata": { "id": "vyDSVNG1TmPi" }, "source": [ "\n", "\n", "# **Manual Query Optimizations**\n", "\n", "To get a better idea of how query optimizers work, we are going to perform *manual* query optimization. To do that, we will disable all query optimizers in DuckDB, which means the query will run *as-is*. We can then change the way the query is physically executed by altering the query text. Let's disable the optimizer and look at the query plan:\n", "\n" ] },
{ "cell_type": "code", "metadata": { "id": "sy7O9XS0T9kJ" }, "source": [ "con.execute(\"PRAGMA disable_optimizer\")\n", "explain_query(unoptimized_query)\n" ], "execution_count": null, "outputs": [] },
{ "cell_type": "markdown", "metadata": { "id": "AnE6A2olT954" }, "source": [ "Looking at the plan, you now see that the hash joins used before have been replaced by cross products followed by a filter. This is literally what was written in the query; however, cross products are extremely expensive! We could run this query, but because of the cross products it would take extremely long.\n", "\n", "Let's rewrite the query to explicitly use joins instead, so that we can actually run it:" ] },
{ "cell_type": "code", "metadata": { "id": "oypnNM8qZel2" }, "source": [ "query = \"\"\"\n", "SELECT\n", " l_orderkey,\n", " sum(l_extendedprice * (1 - l_discount)) AS revenue,\n", " o_orderdate,\n", " o_shippriority\n", "FROM\n", " customer\n", " JOIN orders ON (c_custkey=o_custkey)\n", " JOIN lineitem ON (l_orderkey=o_orderkey)\n", "WHERE\n", " c_mktsegment = 'BUILDING'\n", " AND o_orderdate < CAST('1995-03-15' AS date)\n", " AND l_shipdate > CAST('1995-03-15' AS date)\n", "GROUP BY\n", " l_orderkey,\n", " o_orderdate,\n", " o_shippriority\n", "ORDER BY\n", " revenue DESC,\n", " o_orderdate\n", "LIMIT 10;\n", "\"\"\"\n", "\n", "run_and_profile_query(query)\n" ], "execution_count": null, "outputs": [] },
{ "cell_type": "markdown", "metadata": { "id": "zNEso9LqaXYw" }, "source": [ "# **Assignment**\n", "\n", "Now the query actually finishes; however, it is still much slower than before. There are more changes that can be made to the query to make it run faster. Your assignment (and challenge!) is to adjust the query so that it runs at a speed similar to the query with optimizations enabled. You will be the human query optimizer, replacing the disabled one.\n", "\n", "Hints:\n", "\n", "1. Join order matters!\n", "2. DuckDB always builds the hash table on the *right side* of a hash join (see the sketch below the hints).\n", "3. Filters? Projections?\n", "\n", "Another important consideration is that the optimized query must still produce the same result! An optimizer that changes the query result is incorrect and is considered a bug.\n" ] },
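{ "cell_type": "markdown", "metadata": {}, "source": [ "To make hint 2 concrete, here is a small sketch: the same join explained in both orders while the optimizer is disabled. The table on the *right-hand side* of the `JOIN` ends up on the build side of the hash join, so it generally pays off to put the smaller input there." ] },
{ "cell_type": "code", "metadata": {}, "source": [ "# The right-hand table of each JOIN is the hash-table (build) side.\n", "# With the optimizer disabled, DuckDB keeps the join order as written.\n", "explain_query(\"SELECT count(*) FROM lineitem JOIN orders ON (l_orderkey=o_orderkey)\")\n", "explain_query(\"SELECT count(*) FROM orders JOIN lineitem ON (l_orderkey=o_orderkey)\")" ], "execution_count": null, "outputs": [] },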
{ "cell_type": "code", "metadata": { "id": "0M2rSHRlamG3" }, "source": [ "# Adjust this query to make it run faster (the optimizer is still disabled).\n", "query = \"\"\"\n", "SELECT\n", " l_orderkey,\n", " sum(l_extendedprice * (1 - l_discount)) AS revenue,\n", " o_orderdate,\n", " o_shippriority\n", "FROM\n", " customer\n", " JOIN orders ON (c_custkey=o_custkey)\n", " JOIN lineitem ON (l_orderkey=o_orderkey)\n", "WHERE\n", " c_mktsegment = 'BUILDING'\n", " AND o_orderdate < CAST('1995-03-15' AS date)\n", " AND l_shipdate > CAST('1995-03-15' AS date)\n", "GROUP BY\n", " l_orderkey,\n", " o_orderdate,\n", " o_shippriority\n", "ORDER BY\n", " revenue DESC,\n", " o_orderdate\n", "LIMIT 10;\n", "\"\"\"\n", "\n", "run_and_profile_query(query)\n" ], "execution_count": null, "outputs": [] },
{ "cell_type": "markdown", "metadata": { "id": "p6JUwPDfFJnI" }, "source": [ "# **Bonus Assignment 1**\n", "\n", "The TPC-H queries can be loaded from DuckDB using the query `SELECT * FROM tpch_queries()`. Run all the queries in both DuckDB and SQLite and compare the results.\n", "\n", "Note: Not all queries will work as-is in SQLite, and some might need to be (slightly) rewritten to accommodate SQLite's (more limited) SQL dialect." ] },
{ "cell_type": "markdown", "metadata": { "id": "AhBLdZ0XTnG7" }, "source": [ "# **Bonus Assignment 2**\n", "\n", "As a bonus assignment, here is another query that you can optimize." ] },
{ "cell_type": "code", "metadata": { "id": "bYS2uTs1BB-q" }, "source": [ "query = \"\"\"\n", "SELECT\n", " nation,\n", " o_year,\n", " sum(amount) AS sum_profit\n", "FROM (\n", " SELECT\n", " n_name AS nation,\n", " extract(year FROM o_orderdate) AS o_year,\n", " l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity AS amount\n", " FROM\n", " part,\n", " supplier,\n", " lineitem,\n", " partsupp,\n", " orders,\n", " nation\n", " WHERE\n", " s_suppkey = l_suppkey\n", " AND ps_suppkey = l_suppkey\n", " AND ps_partkey = l_partkey\n", " AND p_partkey = l_partkey\n", " AND o_orderkey = l_orderkey\n", " AND s_nationkey = n_nationkey\n", " AND p_name LIKE '%green%') AS profit\n", "GROUP BY\n", " nation,\n", " o_year\n", "ORDER BY\n", " nation,\n", " o_year DESC;\n", "\"\"\"\n", "\n", "run_and_profile_query(query)" ], "execution_count": null, "outputs": [] } ] }