{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": [],
      "collapsed_sections": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "x4UIW8GUQaqt"
      },
      "source": [
        "# **SETUP**"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "XXvRwAVPGtiX"
      },
      "source": [
        "*Install* DuckDB"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "IFC0Acb9Gu8W"
      },
      "source": [
        "!pip install duckdb"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "25WiZPNQhU3f"
      },
      "source": [
        "Import libraries"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "lzZg8HsyknbW"
      },
      "source": [
        "import pandas as pd\n",
        "import duckdb\n",
        "import time"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "XDQF0N7HJIQ_"
      },
      "source": [
        "Download:\n",
        "*   [NYC Taxi Dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)\n",
        "*   Weather Dataset\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "9RXn51F_JMpn"
      },
      "source": [
        "!wget \"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2016-01.parquet\"\n",
        "!wget \"https://pdet.github.io/assets/data/weather.csv\""
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ryMJZSckLmqG"
      },
      "source": [
        "# **Load the data** "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "u_Fx_Qetmhzm"
      },
      "source": [
        "To Load the data in Pandas, we can use the read_csv function."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "ZAhvgXjQmn2d"
      },
      "source": [
        "pandas_taxi = pd.read_parquet('yellow_tripdata_2016-01.parquet')\n",
        "pandas_weather = pd.read_csv('weather.csv')"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "fCJZW9oOmqnm"
      },
      "source": [
        "There are two main ways of loading the data in DuckDB\n",
        "1.   Through the [SQL API](https://duckdb.org/docs/sql/statements/create_table)\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "JMalyz3qm-MW"
      },
      "source": [
        "#Get a DuckDB Connection, pass it a string to store data on Disk.\n",
        "duck_conn = duckdb.connect(\"path.db\")\n",
        "\n",
        "# Use the create table syntax to read data from a CSV File\n",
        "load_cab_trip_sql = ''\n",
        "load_weather_sql = ''\n",
        "duck_conn.execute(load_cab_trip_sql)\n",
        "duck_conn.execute(load_weather_sql)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "wLTa4KeKngIl"
      },
      "source": [
        "2.   Through the [Relational API](https://github.com/cwida/duckdb/blob/master/examples/python/duckdb-python.py)\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Sg86ph-TqLCv"
      },
      "source": [
        "# Use the from_csv_auto function to point to the table\n",
        "duck_rel = ...\n",
        "print(duck_rel)\n",
        "\n",
        "duck_rel = ...\n",
        "print(duck_rel)\n",
        "\n",
        "#Note that by using the Relational API the data is not yet stored in a table, \n",
        "# so future access to it will read the CSV files again.\n",
        "\n",
        "#Use the table function to get a Python Relational object that references a table.\n",
        "# This will allow us to use the relational API on a table.\n",
        "duck_taxi = duck_conn.table('cab_trip')\n",
        "duck_weather = duck_conn.table('weather')"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "bBWk1B0gmMMV"
      },
      "source": [
        "# **Queries** "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "W4wqtjzlT30T"
      },
      "source": [
        "# **Q1 (Simple Aggregation)** \n",
        "**Average of** (passenger_count, trip_distance, fare_amount e total_amount.)\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "sy7O9XS0T9kJ"
      },
      "source": [
        "# Fill SQL with query\n",
        "sql = \"\"\" \"\"\"\n",
        "\n",
        "start_time = time.monotonic()\n",
        "result = pandas_taxi.agg(\n",
        "    passenger_count=('passenger_count', 'mean'),\n",
        "    trip_distance=('trip_distance', 'mean'),\n",
        "    fare_amount=('fare_amount', 'mean'),\n",
        "    total_amount=('total_amount', 'mean')\n",
        "  )\n",
        "print (\"Pandas: \" + str(time.monotonic() - start_time))\n",
        "\n",
        "duck_conn.execute('PRAGMA threads=1')\n",
        "\n",
        "start_time = time.monotonic()\n",
        "res = duck_conn.execute(sql).fetchall()\n",
        "print (\"DuckDB: \" + str(time.monotonic() - start_time))\n",
        "\n",
        "duck_conn.execute('PRAGMA threads=2')\n",
        "\n",
        "start_time = time.monotonic()\n",
        "res = duck_conn.execute(sql).fetchall()\n",
        "print (\"DuckDB (2T): \" + str(time.monotonic() - start_time))\n",
        "\n",
        "duck_conn.execute('PRAGMA threads=1')\n",
        "\n",
        "# TIP: Use aggregate function of duck_taxi\n",
        "start_time = time.monotonic()\n",
        "res = ...\n",
        "print (\"DuckDB (RAPI): \" + str(time.monotonic() - start_time))"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "AnE6A2olT954"
      },
      "source": [
        "# **Q2 (Aggregation + Group By)**\n",
        "**Average of** (passenger_count, trip_distance, fare_amount e total_amount.)\n",
        "\n",
        "**Grouped by** passenger_count"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "oypnNM8qZel2"
      },
      "source": [
        "# Fill SQL with query\n",
        "sql = \"\"\" \"\"\"\n",
        "\n",
        "start_time = time.monotonic()\n",
        "result = pandas_taxi.groupby(['passenger_count']).agg(\n",
        "    trip_distance=('trip_distance', 'mean'),\n",
        "    fare_amount=('fare_amount', 'mean'),\n",
        "    total_amount=('total_amount', 'mean')\n",
        "  )\n",
        "print (\"Pandas: \" + str(time.monotonic() - start_time))\n",
        "\n",
        "duck_conn.execute('PRAGMA threads=1')\n",
        "\n",
        "start_time = time.monotonic()\n",
        "res = duck_conn.execute(sql).fetchall()\n",
        "print (\"DuckDB: \" + str(time.monotonic() - start_time))\n",
        "\n",
        "duck_conn.execute('PRAGMA threads=2')\n",
        "\n",
        "start_time = time.monotonic()\n",
        "res = duck_conn.execute(sql).fetchall()\n",
        "print (\"DuckDB (2T): \" + str(time.monotonic() - start_time))\n",
        "\n",
        "duck_conn.execute('PRAGMA threads=1')\n",
        "\n",
        "start_time = time.monotonic()\n",
        "# TIP: Use aggregate function of duck_taxi\n",
        "res = ...\n",
        "print (\"DuckDB (RAPI): \" + str(time.monotonic() - start_time))"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "NmXEJYbqqjzz"
      },
      "source": [
        "# **Q3 (Aggregation + Group By + Filter)**\n",
        "**Average of** (passenger_count, trip_distance, fare_amount e total_amount.)\n",
        "\n",
        "**Grouped by** passenger_count\n",
        "\n",
        "**Filter** total_amount < 10"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "z2D-wbPbqkA4"
      },
      "source": [
        "# Fill SQL with query\n",
        "sql = \"\"\"  \"\"\"\n",
        "\n",
        "start_time = time.monotonic()\n",
        "filtered_df = pandas_taxi[pandas_taxi['total_amount'] < 10]\n",
        "result = filtered_df.groupby(['passenger_count']).agg(\n",
        "    trip_distance=('trip_distance', 'mean'),\n",
        "    fare_amount=('fare_amount', 'mean'),\n",
        "    total_amount=('total_amount', 'mean')\n",
        "  )\n",
        "print (\"Pandas: \" + str(time.monotonic() - start_time))\n",
        "\n",
        "duck_conn.execute('PRAGMA threads=1')\n",
        "\n",
        "start_time = time.monotonic()\n",
        "res = duck_conn.execute(sql).fetchall()\n",
        "print (\"DuckDB: \" + str(time.monotonic() - start_time))\n",
        "\n",
        "duck_conn.execute('PRAGMA threads=2')\n",
        "\n",
        "start_time = time.monotonic()\n",
        "res = duck_conn.execute(sql).fetchall()\n",
        "print (\"DuckDB (2T): \" + str(time.monotonic() - start_time))\n",
        "\n",
        "duck_conn.execute('PRAGMA threads=1')\n",
        "\n",
        "start_time = time.monotonic()\n",
        "# TIP: Use filter function of duck_taxi\n",
        "duck_filter = ...\n",
        "# TIP: Use aggregate function of duck_filter\n",
        "res = ...\n",
        "print (\"DuckDB (RAPI): \" + str(time.monotonic() - start_time))"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "qGtq0ThftaS_"
      },
      "source": [
        "# **Q4 (Aggregation + Group By + Filter + Join)**\n",
        "**Average of** (passenger_count, trip_distance, fare_amount e total_amount.)\n",
        "\n",
        "**Join** cab_trip.tpep_pickup_datetime::DATE = weather.date\n",
        "\n",
        "**Grouped by** passenger_count\n",
        "\n",
        "**Filter** total_amount < 10 and maximum_temperature < 50"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "x67nQKLztbdH"
      },
      "source": [
        "# Fill SQL with query (Note the cast in the join key cab_trip.tpep_pickup_datetime::DATE = weather.date )\n",
        "\n",
        "sql = \"\"\"  \"\"\"\n",
        "\n",
        "pandas_taxi.tpep_pickup_datetime = pandas_taxi.tpep_pickup_datetime.astype('datetime64[ns]')\n",
        "pandas_weather.date = pandas_weather.date.astype('datetime64[ns]')\n",
        "\n",
        "start_time = time.monotonic()\n",
        "\n",
        "merged = pandas_taxi.merge(\n",
        "  pandas_weather,\n",
        "  left_on='tpep_pickup_datetime',\n",
        "  right_on='date')\n",
        "\n",
        "filtered_df = merged[merged['total_amount'] < 10]\n",
        "\n",
        "filtered_df = filtered_df[filtered_df['maximum_temperature'] < 50 ]\n",
        "\n",
        "result = filtered_df.groupby(['passenger_count']).agg(\n",
        "    trip_distance=('trip_distance', 'mean'),\n",
        "    fare_amount=('fare_amount', 'mean'),\n",
        "    total_amount=('total_amount', 'mean')\n",
        "  )\n",
        "\n",
        "print (\"Pandas: \" + str(time.monotonic() - start_time))\n",
        "\n",
        "duck_conn.execute('PRAGMA threads=1')\n",
        "\n",
        "start_time = time.monotonic()\n",
        "res = duck_conn.execute(sql).fetchall()\n",
        "print (\"DuckDB: \" + str(time.monotonic() - start_time))\n",
        "\n",
        "duck_conn.execute('PRAGMA threads=2')\n",
        "\n",
        "start_time = time.monotonic()\n",
        "res = duck_conn.execute(sql).fetchall()\n",
        "print (\"DuckDB (2T): \" + str(time.monotonic() - start_time))\n",
        "\n",
        "duck_conn.execute('PRAGMA threads=1')\n",
        "\n",
        "start_time = time.monotonic()\n",
        "# TIP: Use filter function of duck_taxi\n",
        "duck_taxi_filter = ...\n",
        "# TIP: Use filter function of duck_weather\n",
        "duck_weather_filter = ...\n",
        "# TIP: Use join function of duck_taxi_filter with duck_weather_filter\n",
        "duck_join = ...\n",
        "# TIP: Use aggregate function\n",
        "res = ...\n",
        "\n",
        "print (\"DuckDB (RAPI): \" + str(time.monotonic() - start_time))"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "LZSpVEDu6VnO"
      },
      "source": [
        "# **Transactions** "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "HSG2lKb9JQGc"
      },
      "source": [
        "Let's say that no car in NYC can fit more than 5 passengers.\n",
        "Consider all trips with more than 5 passengers as dirty, and remove them.\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "2C6WHjy_vUDE"
      },
      "source": [
        "print(duck_conn.execute('select count(*) from cab_trip where passenger_count > 6').fetchall())"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "X20uKwviJPog"
      },
      "source": [
        "res = duck_conn.execute('select count(*) from cab_trip where passenger_count > 6').fetchall()\n",
        "\n",
        "start_time = time.monotonic()\n",
        "result = pandas_taxi[pandas_taxi['total_amount'] < 6]\n",
        "result.to_parquet('result.parquet.tmp', sep=',')\n",
        "print (\"Pandas: \" + str(time.monotonic() - start_time))\n",
        "\n",
        "start_time = time.monotonic()\n",
        "# TIP: Use delete from SQL\n",
        "res = ...\n",
        "print (\"DuckDB: \" + str(time.monotonic() - start_time))"
      ],
      "execution_count": null,
      "outputs": []
    }
  ]
}