{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [], "collapsed_sections": [] }, "kernelspec": { "name": "python3", "display_name": "Python 3" } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "x4UIW8GUQaqt" }, "source": [ "# **SETUP**" ] }, { "cell_type": "markdown", "metadata": { "id": "XXvRwAVPGtiX" }, "source": [ "*Install* DuckDB" ] }, { "cell_type": "code", "metadata": { "id": "IFC0Acb9Gu8W" }, "source": [ "!pip install duckdb" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "25WiZPNQhU3f" }, "source": [ "Import libraries" ] }, { "cell_type": "code", "metadata": { "id": "lzZg8HsyknbW" }, "source": [ "import pandas as pd\n", "import duckdb\n", "import time" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "XDQF0N7HJIQ_" }, "source": [ "Download:\n", "* [NYC Taxi Dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)\n", "* Weather Dataset\n", "\n" ] }, { "cell_type": "code", "metadata": { "id": "9RXn51F_JMpn" }, "source": [ "!wget \"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2016-01.parquet\"\n", "!wget \"https://pdet.github.io/assets/data/weather.csv\"" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "ryMJZSckLmqG" }, "source": [ "# **Load the data** " ] }, { "cell_type": "markdown", "metadata": { "id": "u_Fx_Qetmhzm" }, "source": [ "To Load the data in Pandas, we can use the read_csv function." ] }, { "cell_type": "code", "metadata": { "id": "ZAhvgXjQmn2d" }, "source": [ "pandas_taxi = pd.read_parquet('yellow_tripdata_2016-01.parquet')\n", "pandas_weather = pd.read_csv('weather.csv')" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "fCJZW9oOmqnm" }, "source": [ "There are two main ways of loading the data in DuckDB\n", "1. Through the [SQL API](https://duckdb.org/docs/sql/statements/create_table)\n" ] }, { "cell_type": "code", "metadata": { "id": "JMalyz3qm-MW" }, "source": [ "#Get a DuckDB Connection, pass it a string to store data on Disk.\n", "duck_conn = duckdb.connect(\"path.db\")\n", "\n", "# Use the create table syntax to read data from a CSV File\n", "load_cab_trip_sql = ''\n", "load_weather_sql = ''\n", "duck_conn.execute(load_cab_trip_sql)\n", "duck_conn.execute(load_weather_sql)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "wLTa4KeKngIl" }, "source": [ "2. Through the [Relational API](https://github.com/cwida/duckdb/blob/master/examples/python/duckdb-python.py)\n" ] }, { "cell_type": "code", "metadata": { "id": "Sg86ph-TqLCv" }, "source": [ "# Use the from_csv_auto function to point to the table\n", "duck_rel = ...\n", "print(duck_rel)\n", "\n", "duck_rel = ...\n", "print(duck_rel)\n", "\n", "#Note that by using the Relational API the data is not yet stored in a table, \n", "# so future access to it will read the CSV files again.\n", "\n", "#Use the table function to get a Python Relational object that references a table.\n", "# This will allow us to use the relational API on a table.\n", "duck_taxi = duck_conn.table('cab_trip')\n", "duck_weather = duck_conn.table('weather')" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "bBWk1B0gmMMV" }, "source": [ "# **Queries** " ] }, { "cell_type": "markdown", "metadata": { "id": "W4wqtjzlT30T" }, "source": [ "# **Q1 (Simple Aggregation)** \n", "**Average of** (passenger_count, trip_distance, fare_amount e total_amount.)\n" ] }, { "cell_type": "code", "metadata": { "id": "sy7O9XS0T9kJ" }, "source": [ "# Fill SQL with query\n", "sql = \"\"\" \"\"\"\n", "\n", "start_time = time.monotonic()\n", "result = pandas_taxi.agg(\n", " passenger_count=('passenger_count', 'mean'),\n", " trip_distance=('trip_distance', 'mean'),\n", " fare_amount=('fare_amount', 'mean'),\n", " total_amount=('total_amount', 'mean')\n", " )\n", "print (\"Pandas: \" + str(time.monotonic() - start_time))\n", "\n", "duck_conn.execute('PRAGMA threads=1')\n", "\n", "start_time = time.monotonic()\n", "res = duck_conn.execute(sql).fetchall()\n", "print (\"DuckDB: \" + str(time.monotonic() - start_time))\n", "\n", "duck_conn.execute('PRAGMA threads=2')\n", "\n", "start_time = time.monotonic()\n", "res = duck_conn.execute(sql).fetchall()\n", "print (\"DuckDB (2T): \" + str(time.monotonic() - start_time))\n", "\n", "duck_conn.execute('PRAGMA threads=1')\n", "\n", "# TIP: Use aggregate function of duck_taxi\n", "start_time = time.monotonic()\n", "res = ...\n", "print (\"DuckDB (RAPI): \" + str(time.monotonic() - start_time))" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "AnE6A2olT954" }, "source": [ "# **Q2 (Aggregation + Group By)**\n", "**Average of** (passenger_count, trip_distance, fare_amount e total_amount.)\n", "\n", "**Grouped by** passenger_count" ] }, { "cell_type": "code", "metadata": { "id": "oypnNM8qZel2" }, "source": [ "# Fill SQL with query\n", "sql = \"\"\" \"\"\"\n", "\n", "start_time = time.monotonic()\n", "result = pandas_taxi.groupby(['passenger_count']).agg(\n", " trip_distance=('trip_distance', 'mean'),\n", " fare_amount=('fare_amount', 'mean'),\n", " total_amount=('total_amount', 'mean')\n", " )\n", "print (\"Pandas: \" + str(time.monotonic() - start_time))\n", "\n", "duck_conn.execute('PRAGMA threads=1')\n", "\n", "start_time = time.monotonic()\n", "res = duck_conn.execute(sql).fetchall()\n", "print (\"DuckDB: \" + str(time.monotonic() - start_time))\n", "\n", "duck_conn.execute('PRAGMA threads=2')\n", "\n", "start_time = time.monotonic()\n", "res = duck_conn.execute(sql).fetchall()\n", "print (\"DuckDB (2T): \" + str(time.monotonic() - start_time))\n", "\n", "duck_conn.execute('PRAGMA threads=1')\n", "\n", "start_time = time.monotonic()\n", "# TIP: Use aggregate function of duck_taxi\n", "res = ...\n", "print (\"DuckDB (RAPI): \" + str(time.monotonic() - start_time))" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "NmXEJYbqqjzz" }, "source": [ "# **Q3 (Aggregation + Group By + Filter)**\n", "**Average of** (passenger_count, trip_distance, fare_amount e total_amount.)\n", "\n", "**Grouped by** passenger_count\n", "\n", "**Filter** total_amount < 10" ] }, { "cell_type": "code", "metadata": { "id": "z2D-wbPbqkA4" }, "source": [ "# Fill SQL with query\n", "sql = \"\"\" \"\"\"\n", "\n", "start_time = time.monotonic()\n", "filtered_df = pandas_taxi[pandas_taxi['total_amount'] < 10]\n", "result = filtered_df.groupby(['passenger_count']).agg(\n", " trip_distance=('trip_distance', 'mean'),\n", " fare_amount=('fare_amount', 'mean'),\n", " total_amount=('total_amount', 'mean')\n", " )\n", "print (\"Pandas: \" + str(time.monotonic() - start_time))\n", "\n", "duck_conn.execute('PRAGMA threads=1')\n", "\n", "start_time = time.monotonic()\n", "res = duck_conn.execute(sql).fetchall()\n", "print (\"DuckDB: \" + str(time.monotonic() - start_time))\n", "\n", "duck_conn.execute('PRAGMA threads=2')\n", "\n", "start_time = time.monotonic()\n", "res = duck_conn.execute(sql).fetchall()\n", "print (\"DuckDB (2T): \" + str(time.monotonic() - start_time))\n", "\n", "duck_conn.execute('PRAGMA threads=1')\n", "\n", "start_time = time.monotonic()\n", "# TIP: Use filter function of duck_taxi\n", "duck_filter = ...\n", "# TIP: Use aggregate function of duck_filter\n", "res = ...\n", "print (\"DuckDB (RAPI): \" + str(time.monotonic() - start_time))" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "qGtq0ThftaS_" }, "source": [ "# **Q4 (Aggregation + Group By + Filter + Join)**\n", "**Average of** (passenger_count, trip_distance, fare_amount e total_amount.)\n", "\n", "**Join** cab_trip.tpep_pickup_datetime::DATE = weather.date\n", "\n", "**Grouped by** passenger_count\n", "\n", "**Filter** total_amount < 10 and maximum_temperature < 50" ] }, { "cell_type": "code", "metadata": { "id": "x67nQKLztbdH" }, "source": [ "# Fill SQL with query (Note the cast in the join key cab_trip.tpep_pickup_datetime::DATE = weather.date )\n", "\n", "sql = \"\"\" \"\"\"\n", "\n", "pandas_taxi.tpep_pickup_datetime = pandas_taxi.tpep_pickup_datetime.astype('datetime64[ns]')\n", "pandas_weather.date = pandas_weather.date.astype('datetime64[ns]')\n", "\n", "start_time = time.monotonic()\n", "\n", "merged = pandas_taxi.merge(\n", " pandas_weather,\n", " left_on='tpep_pickup_datetime',\n", " right_on='date')\n", "\n", "filtered_df = merged[merged['total_amount'] < 10]\n", "\n", "filtered_df = filtered_df[filtered_df['maximum_temperature'] < 50 ]\n", "\n", "result = filtered_df.groupby(['passenger_count']).agg(\n", " trip_distance=('trip_distance', 'mean'),\n", " fare_amount=('fare_amount', 'mean'),\n", " total_amount=('total_amount', 'mean')\n", " )\n", "\n", "print (\"Pandas: \" + str(time.monotonic() - start_time))\n", "\n", "duck_conn.execute('PRAGMA threads=1')\n", "\n", "start_time = time.monotonic()\n", "res = duck_conn.execute(sql).fetchall()\n", "print (\"DuckDB: \" + str(time.monotonic() - start_time))\n", "\n", "duck_conn.execute('PRAGMA threads=2')\n", "\n", "start_time = time.monotonic()\n", "res = duck_conn.execute(sql).fetchall()\n", "print (\"DuckDB (2T): \" + str(time.monotonic() - start_time))\n", "\n", "duck_conn.execute('PRAGMA threads=1')\n", "\n", "start_time = time.monotonic()\n", "# TIP: Use filter function of duck_taxi\n", "duck_taxi_filter = ...\n", "# TIP: Use filter function of duck_weather\n", "duck_weather_filter = ...\n", "# TIP: Use join function of duck_taxi_filter with duck_weather_filter\n", "duck_join = ...\n", "# TIP: Use aggregate function\n", "res = ...\n", "\n", "print (\"DuckDB (RAPI): \" + str(time.monotonic() - start_time))" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "LZSpVEDu6VnO" }, "source": [ "# **Transactions** " ] }, { "cell_type": "markdown", "metadata": { "id": "HSG2lKb9JQGc" }, "source": [ "Let's say that no car in NYC can fit more than 5 passengers.\n", "Consider all trips with more than 5 passengers as dirty, and remove them.\n" ] }, { "cell_type": "code", "metadata": { "id": "2C6WHjy_vUDE" }, "source": [ "print(duck_conn.execute('select count(*) from cab_trip where passenger_count > 6').fetchall())" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "X20uKwviJPog" }, "source": [ "res = duck_conn.execute('select count(*) from cab_trip where passenger_count > 6').fetchall()\n", "\n", "start_time = time.monotonic()\n", "result = pandas_taxi[pandas_taxi['total_amount'] < 6]\n", "result.to_parquet('result.parquet.tmp', sep=',')\n", "print (\"Pandas: \" + str(time.monotonic() - start_time))\n", "\n", "start_time = time.monotonic()\n", "# TIP: Use delete from SQL\n", "res = ...\n", "print (\"DuckDB: \" + str(time.monotonic() - start_time))" ], "execution_count": null, "outputs": [] } ] }