{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [] }, "kernelspec": { "name": "python3", "display_name": "Python 3" } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "Ysi7m4vvxpxk" }, "source": [ "In this assignment, you will experiment with the NYC Cab dataset from 2016. This dataset provides information (e.g., pickup/dropoff time, # of passengers, trip distance, fare) about cab trips done in New York City during 2016. You can learn more about the dataset clicking [here!](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)\n", "\n", "You will load this dataset into pandas, sqlite, and duckdb. You will compare the performance of multiple data-science like queries, including performing a fare estimation (i.e., predicting how much a ride will cost depending on distance) using machine learning.\n", "\n", "In the first section you will implement the loader in duckdb **[5 points].**\n", "\n", "The second section has two data-science like queries, the implementation in pandas is already given, and you should use it as a logical/correctness reference to write the queries for sqlite and duckdb, remember to compare the performance of the three different systems **[25 points]**.\n", "\n", "Finally, in the third section you will implement a simple machine learning algorithm in duckdb to predict fare costs. A full implementation of pandas is given and a partial of sqlite. Again, use them as a logical/correctness reference and compare the performance of the three different systems. **[40 points]**\n", "\n", "Remember to submit your notebook with the answers to all sections as well as a PDF document (max two papes) listing all experienced execution times and reasoning about the performance difference in these systems." ] }, { "cell_type": "markdown", "metadata": { "id": "x4UIW8GUQaqt" }, "source": [ "# **SETUP**" ] }, { "cell_type": "markdown", "metadata": { "id": "XXvRwAVPGtiX" }, "source": [ "First we need to install duckdb." ] }, { "cell_type": "code", "metadata": { "id": "IFC0Acb9Gu8W" }, "source": [ "!pip install duckdb==0.3.1" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "XDQF0N7HJIQ_" }, "source": [ "Now we download the NYC Cab dataset and decompress it. \n", "\n", "Since SQLite's python client doesn't provide a built-in copy into function, we also provide a downloadable version of a loaded SQLite instance." ] }, { "cell_type": "code", "metadata": { "id": "9RXn51F_JMpn" }, "source": [ "!wget \"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2016-01.parquet\"" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "ryMJZSckLmqG" }, "source": [ "# **Loading The Data** [5pts]" ] }, { "cell_type": "markdown", "metadata": { "id": "EJSvVxkOL6af" }, "source": [ "In this section your goal is to write code to load the data in duckdb." ] }, { "cell_type": "code", "metadata": { "id": "nBPmYEMLNQ63" }, "source": [ "import pandas\n", "import sqlite3\n", "# Creates Pandas Dataframe from Parquet\n", "yellow_tripdata_pandas = pandas.read_parquet('yellow_tripdata_2016-01.parquet')\n", "\n", "# Generates SQLite database\n", "db_sqlite = sqlite3.connect('yellow_tripdate_sqlite.db') \n", "yellow_tripdata_pandas.to_sql('yellow_tripdata_2016_01', db_sqlite, index=False)\n" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "YKAtnEB2PuCp" }, "source": [ "How do we load data in DuckDB? 
{ "cell_type": "code", "metadata": { "id": "ejriXVhQP1X2" }, "source": [ "import duckdb\n", "import time\n", "\n", "db = duckdb.connect('yellow_tripdata_duck.db')\n", "duck_cursor = db.cursor()\n", "\n", "# Fill query here!\n", "sql_load = \"\"\" \"\"\"\n", "\n", "duck_cursor.execute(sql_load)" ], "execution_count": null, "outputs": [] },
{ "cell_type": "markdown", "metadata": { "id": "vyDSVNG1TmPi" }, "source": [ "# **Queries** [25 pts]" ] },
{ "cell_type": "markdown", "metadata": { "id": "W4wqtjzlT30T" }, "source": [ "**Query Q1 [10 pts]**: Count the distinct values for each column.\n", "\n", "**TIP:** We provide you with the pandas implementation; be sure to use it to check for correctness!\n", "\n", "In the first query we want to compare the performance of simple hash-based aggregation in our Database Management Systems (DBMSs). To do so, we implement and run a query that counts the distinct values of each column.\n", "In general, most systems use hash-based algorithms to find the distinct values in a large set of values (e.g., a database column).\n", "Note: sort-based algorithms are an alternative, but they are usually slower.\n" ] },
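{ "cell_type": "markdown", "metadata": {}, "source": [ "As a syntax reminder, a distinct count is a plain aggregate in both SQLite and DuckDB. Here is a minimal sketch on a hypothetical table `t` with columns `a` and `b`; your query should cover every column of the trip table:\n", "\n", "```python\n", "# Sketch only, on placeholder names: one COUNT(DISTINCT ...) per column,\n", "# all in a single SELECT. The same statement runs on both systems.\n", "sql_example = 'SELECT COUNT(DISTINCT a), COUNT(DISTINCT b) FROM t'\n", "```" ] },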
{ "cell_type": "code", "metadata": { "id": "sy7O9XS0T9kJ" }, "source": [ "import time\n", "\n", "# Fill query here (the same syntax works for both SQLite and DuckDB)\n", "sql_01 = \"\"\" \"\"\"\n", "\n", "sqlite_cursor = db_sqlite.cursor() # Creates a cursor that allows us to issue SQL statements\n", "\n", "# Pandas implementation (provided for reference)\n", "def query_01_pandas():\n", " print(yellow_tripdata_pandas['VendorID'].nunique(),\n", " yellow_tripdata_pandas['tpep_pickup_datetime'].nunique(),\n", " yellow_tripdata_pandas['tpep_dropoff_datetime'].nunique(),\n", " yellow_tripdata_pandas['passenger_count'].nunique(),\n", " yellow_tripdata_pandas['trip_distance'].nunique(),\n", " yellow_tripdata_pandas['RatecodeID'].nunique(),\n", " yellow_tripdata_pandas['store_and_fwd_flag'].nunique(),\n", " yellow_tripdata_pandas['PULocationID'].nunique(),\n", " yellow_tripdata_pandas['DOLocationID'].nunique(),\n", " yellow_tripdata_pandas['payment_type'].nunique(),\n", " yellow_tripdata_pandas['fare_amount'].nunique(),\n", " yellow_tripdata_pandas['extra'].nunique(),\n", " yellow_tripdata_pandas['mta_tax'].nunique(),\n", " yellow_tripdata_pandas['tip_amount'].nunique(),\n", " yellow_tripdata_pandas['tolls_amount'].nunique(),\n", " yellow_tripdata_pandas['improvement_surcharge'].nunique(),\n", " yellow_tripdata_pandas['total_amount'].nunique(),\n", " yellow_tripdata_pandas['congestion_surcharge'].nunique(),\n", " yellow_tripdata_pandas['airport_fee'].nunique())\n", "\n", "def query_01_sqlite():\n", "\tsqlite_cursor.execute(sql_01)\n", "\tresult = sqlite_cursor.fetchone()\n", "\tprint(result)\n", "\n", "def query_01_duckdb():\n", "\tduck_cursor.execute(sql_01)\n", "\tresult = duck_cursor.fetchall()\n", "\tprint(result)\n", "\n", "start = time.time()\n", "query_01_pandas()\n", "end = time.time()\n", "print(\"Query in Pandas : \" + str(end - start))\n", "\n", "start = time.time()\n", "query_01_sqlite()\n", "end = time.time()\n", "print(\"Query in SQLite : \" + str(end - start))\n", "\n", "start = time.time()\n", "query_01_duckdb()\n", "end = time.time()\n", "print(\"Query in DuckDB : \" + str(end - start))" ], "execution_count": null, "outputs": [] },
{ "cell_type": "markdown", "metadata": { "id": "AnE6A2olT954" }, "source": [ "In our second query, we want to compare the performance of aggregation/grouping functions.\n", "\n", "**Query Q2 [15 pts]**: What is the avg/max/min frequency of events grouped by day and hour? (Recall that each record in the dataset represents one event.)\n", "\n", "**TIP_1**: Start by writing a subquery that groups the data by day/hour and counts all events.\n", "\n", "For DuckDB:\n", "EXTRACT(DOY FROM tpep_pickup_datetime), EXTRACT(HOUR FROM tpep_pickup_datetime)\n", "\n", "For SQLite:\n", "strftime('%j', tpep_pickup_datetime), strftime('%H', tpep_pickup_datetime)\n", "\n", "**TIP_2**: After writing the subquery, take the avg, min, and max of its counts.\n", "\n", "For your convenience, we already provide the pandas implementation; remember to use it as a correctness/logic reference." ] },
{ "cell_type": "code", "metadata": { "id": "oypnNM8qZel2" }, "source": [ "# Implement SQLite query\n", "sql_02_sqlite = \"\"\" \"\"\"\n", "# Implement DuckDB query\n", "sql_02_duckdb = \"\"\" \"\"\"\n", "\n", "\n", "def query_02_pandas():\n", "\t# .copy() avoids pandas' SettingWithCopyWarning when adding columns below\n", "\tdf = yellow_tripdata_pandas[[\"VendorID\", \"tpep_pickup_datetime\"]].copy()\n", "\t# Day of month; for this single-month (January) dataset it matches the day of year used in the SQL tips\n", "\tdf['day'] = pandas.to_datetime(df['tpep_pickup_datetime'], format='%Y-%m-%d %H:%M:%S').dt.day\n", "\tdf['hour'] = pandas.to_datetime(df['tpep_pickup_datetime'], format='%Y-%m-%d %H:%M:%S').dt.hour\n", "\tres = df.groupby(['day', 'hour']).agg({'VendorID': 'count'})\n", "\tprint(res[\"VendorID\"].min(), res[\"VendorID\"].mean(), res[\"VendorID\"].max())\n", "\n", "def query_02_sqlite():\n", "\tsqlite_cursor.execute(sql_02_sqlite)\n", "\tresult = sqlite_cursor.fetchall()\n", "\tprint(result)\n", "\n", "def query_02_duckdb():\n", "\tduck_cursor.execute(sql_02_duckdb)\n", "\tresult = duck_cursor.fetchall()\n", "\tprint(result)\n", "\n", "start = time.time()\n", "query_02_pandas()\n", "end = time.time()\n", "print(\"Query in Pandas : \" + str(end - start))\n", "\n", "start = time.time()\n", "query_02_sqlite()\n", "end = time.time()\n", "print(\"Query in SQLite : \" + str(end - start))\n", "\n", "start = time.time()\n", "query_02_duckdb()\n", "end = time.time()\n", "print(\"Query in DuckDB : \" + str(end - start))" ], "execution_count": null, "outputs": [] },
{ "cell_type": "markdown", "metadata": { "id": "CqqmSeKvIN2L" }, "source": [ "# **Machine Learning : Fare Estimation** [40 pts]" ] },
{ "cell_type": "markdown", "metadata": { "id": "9-AA_rgyIYpI" }, "source": [ "Here you will use the data to do the most basic form of machine learning: [linear regression](https://en.wikipedia.org/wiki/Ordinary_least_squares#Simple_linear_regression_model). You can train a simple estimator that takes the trip distance and estimates the price. For this exercise we provide the linear regression in pandas. Your goal is to mimic the pandas computation in both SQLite and DuckDB, and to compare the computation time up to the calculation of the linear regression's parameters alpha and beta **(which must be the same in all three systems; use this to check that your answer is correct)**." ] },
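{ "cell_type": "markdown", "metadata": {}, "source": [ "Concretely, with $x$ = `trip_distance`, $y$ = `fare_amount`, and $n$ rows, the closed-form estimator computed by the pandas cell below is\n", "\n", "$$\\beta = \\frac{\\sum_i x_i y_i - \\frac{1}{n}\\sum_i x_i \\sum_i y_i}{\\sum_i x_i^2 - \\frac{1}{n}\\left(\\sum_i x_i\\right)^2}, \\qquad \\alpha = \\bar{y} - \\beta\\,\\bar{x},$$\n", "\n", "so everything you need reduces to SUM/COUNT/AVG aggregates, which all three systems provide." ] },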
] }, { "cell_type": "code", "metadata": { "id": "AR6hF_QgI3Y4" }, "source": [ "cov_xy = (yellow_tripdata_pandas[\"trip_distance\"] * yellow_tripdata_pandas[\"fare_amount\"]).sum() - (\n", " yellow_tripdata_pandas[\"trip_distance\"].sum() * yellow_tripdata_pandas[\"fare_amount\"].sum()\n", ") / len(yellow_tripdata_pandas)\n", "var_xy = (yellow_tripdata_pandas[\"trip_distance\"] ** 2).sum() - yellow_tripdata_pandas[\"trip_distance\"].sum() ** 2 / len(yellow_tripdata_pandas)\n", "beta = cov_xy / var_xy\n", "alpha = yellow_tripdata_pandas[\"fare_amount\"].mean() - beta * yellow_tripdata_pandas[\"trip_distance\"].mean()\n", "\n", "alpha\n", "# 12.486907739140417\n", "beta\n", "# 4.6752084884145456e-06\n", "\n", "# Select some sample data and see how well we can fit the price\n", "sample = yellow_tripdata_pandas.sample(10000)\n", "sample[\"price\"] = alpha + beta * sample[\"trip_distance\"]\n", "ax = sample.plot.scatter(x=\"trip_distance\", y=\"fare_amount\")\n", "sample.plot.line(x=\"trip_distance\", y=\"price\", ax=ax, color=\"red\")" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "MSNgeIoyJeBN" }, "source": [ "As you can tell from the above image, as with any real life dataset, the New York City trip dataset also contains outliers that we need to clean to get a good regression. Hence we need to get rid of the noisy data that disturbs our basic regression example.\n", "\n", "**Note**: The pandas program above is merely to show the dirty data/outliers, consider the following one for the rest of this section." ] }, { "cell_type": "code", "metadata": { "id": "6SFDISlWJ01R" }, "source": [ "start = time.time()\n", "cap_fare = yellow_tripdata_pandas[\"fare_amount\"].mean() + 3 * yellow_tripdata_pandas[\"fare_amount\"].std()\n", "cap_distance = yellow_tripdata_pandas[\"trip_distance\"].mean() + 3 * yellow_tripdata_pandas[\"trip_distance\"].std()\n", "df_filtered = yellow_tripdata_pandas.query(\n", " f\"trip_distance > 0 and trip_distance < {cap_distance} and fare_amount > 0 and fare_amount < {cap_fare}\"\n", ")\n", "\n", "# Train on the filtered data\n", "cov_xy = (df_filtered[\"trip_distance\"] * df_filtered[\"fare_amount\"]).sum() - (\n", " df_filtered[\"trip_distance\"].sum() * df_filtered[\"fare_amount\"].sum()\n", ") / len(df_filtered)\n", "var_xy = (df_filtered[\"trip_distance\"] ** 2).sum() - df_filtered[\n", " \"trip_distance\"\n", "].sum() ** 2 / len(df_filtered)\n", "beta = cov_xy / var_xy\n", "alpha = df_filtered[\"fare_amount\"].mean() - beta * df_filtered[\"trip_distance\"].mean()\n", "\n", "alpha\n", "# 4.651606864471554\n", "beta\n", "# 2.661444816924383\n", "end = time.time()\n", "print(\"Alpha: \" + str(alpha))\n", "print(\"Beta: \" + str(beta))\n", "print(\"ML Regression in Pandas: \" +str(end - start))\n", "# Plot and check whether it fits better this time\n", "sample[\"price\"] = alpha + beta * sample[\"trip_distance\"]\n", "ax = sample.plot.scatter(x=\"trip_distance\", y=\"fare_amount\")\n", "sample.plot.line(x=\"trip_distance\", y=\"price\", ax=ax, color=\"red\")" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "g0ZKXzBaKS7v" }, "source": [ "Now, your task is to perform the same linear regression using both duckdb and sqlite." ] }, { "cell_type": "markdown", "metadata": { "id": "IBwc6FGCySin" }, "source": [ "**SQLite**:\n", "Since SQLite doesn't have a function to compute the standard deviation and its a bit more complex than doing it in duckdb. 
{ "cell_type": "code", "metadata": { "id": "shfAO9MpyYMO" }, "source": [ "import math\n", "start = time.time()\n", "sqlite_cursor.execute(\"\"\"\n", " SELECT \n", " count(*),\n", " sub.avg_fa as avg_fare,\n", " sum((fare_amount - sub.avg_fa) * (fare_amount - sub.avg_fa)) as var_fare,\n", " sub.avg_td as avg_distance,\n", " sum((trip_distance - sub.avg_td) * (trip_distance - sub.avg_td)) as var_distance\n", " FROM \n", " yellow_tripdata_2016_01,\n", " (\n", " SELECT\n", " AVG(fare_amount) as avg_fa,\n", " AVG(trip_distance) as avg_td\n", " FROM yellow_tripdata_2016_01\n", " ) as sub\n", "\"\"\")\n", "n, avg_fare, var_fare, avg_distance, var_distance = sqlite_cursor.fetchone()\n", "max_fare = avg_fare + 3 * math.sqrt(var_fare / (n - 1))\n", "max_distance = avg_distance + 3 * math.sqrt(var_distance / (n - 1))\n", "max_fare, max_distance\n", "\n", "sqlite_cursor.execute(f\"\"\"\n", "SELECT\n", " (SUM(trip_distance * fare_amount) - SUM(trip_distance) * SUM(fare_amount) / COUNT(*)) /\n", " (SUM(trip_distance * trip_distance) - SUM(trip_distance) * SUM(trip_distance) / COUNT(*)) AS beta,\n", " AVG(fare_amount) AS avg_fare_amount,\n", " AVG(trip_distance) AS avg_trip_distance\n", "FROM yellow_tripdata_2016_01\n", "WHERE \n", " fare_amount > 0 AND\n", " fare_amount < {max_fare} AND \n", " trip_distance > 0 AND\n", " trip_distance < {max_distance}\n", "\"\"\")\n", "beta, avg_fare_amount, avg_trip_distance = sqlite_cursor.fetchone()\n", "alpha = avg_fare_amount - beta * avg_trip_distance\n", "end = time.time()\n", "print(\"Alpha: \" + str(alpha))\n", "print(\"Beta: \" + str(beta))\n", "print(\"ML Regression in SQLite: \" + str(end - start))\n", "\n", "\n", "# Get a uniform random sample (size 1000) of trip_distance and fare_amount for plotting\n", "# TIP: Use ORDER BY RANDOM() and LIMIT\n", "sqlite_sample = \"\"\n", "\n", "# Get a pandas DataFrame from SQLite\n", "# TIP: pandas has a function to transform SQLite query results to a DataFrame\n", "sqlite_df_sample = ...\n", "\n", "sqlite_df_sample[\"price\"] = alpha + beta * sqlite_df_sample[\"trip_distance\"]\n", "ax = sqlite_df_sample.plot.scatter(x=\"trip_distance\", y=\"fare_amount\")\n", "sqlite_df_sample.plot.line(x=\"trip_distance\", y=\"price\", ax=ax, color=\"red\")" ], "execution_count": null, "outputs": [] },
{ "cell_type": "markdown", "metadata": { "id": "vmYLsxUIyQEN" }, "source": [ "**DUCKDB**\n", "\n", "*TIP: DuckDB has a STDDEV_SAMP function (i.e., a function that calculates the sample standard deviation), so you can do the whole regression using one query.*\n", "\n", "**Remember: alpha and beta MUST have the same values as in the pandas solution above.**" ] },
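{ "cell_type": "markdown", "metadata": {}, "source": [ "A minimal sketch of the STDDEV_SAMP syntax, assuming your DuckDB table is named `yellow_tripdata_2016_01` like the SQLite one (adjust to whatever name you used in your loader). The full regression query still needs the distance cutoff, the WHERE filter, and the SUM/AVG/COUNT aggregates:\n", "\n", "```python\n", "# Sketch: an outlier cutoff computed as mean + 3 * sample standard\n", "# deviation in a single DuckDB aggregate query.\n", "duck_cursor.execute('SELECT AVG(fare_amount) + 3 * STDDEV_SAMP(fare_amount) FROM yellow_tripdata_2016_01')\n", "print(duck_cursor.fetchone())\n", "```" ] },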
{ "cell_type": "code", "metadata": { "id": "cSvOOki7Lf9W" }, "source": [ "start = time.time()\n", "# Implement the regression query\n", "duck_regression = \"\"\"\n", "\"\"\"\n", "\n", "\n", "duck_cursor.execute(duck_regression)\n", "# Get the values of beta and alpha from the query (replace the ... with code):\n", "beta, avg_fare_amount, avg_trip_distance = ...\n", "alpha = avg_fare_amount - beta * avg_trip_distance\n", "end = time.time()\n", "print(\"Alpha: \" + str(alpha))\n", "print(\"Beta: \" + str(beta))\n", "print(\"ML Regression in DuckDB: \" + str(end - start))\n", "\n", "# Get a uniform random sample (size 1000) of trip_distance and fare_amount for plotting\n", "# TIP: Use ORDER BY RANDOM() and LIMIT\n", "duck_sample = \"\"\n", "\n", "# Get a pandas DataFrame from DuckDB\n", "# TIP: DuckDB's Python client has a function to transform query results to a DataFrame\n", "duck_df_sample = ...\n", "\n", "duck_df_sample[\"price\"] = alpha + beta * duck_df_sample[\"trip_distance\"]\n", "ax = duck_df_sample.plot.scatter(x=\"trip_distance\", y=\"fare_amount\")\n", "duck_df_sample.plot.line(x=\"trip_distance\", y=\"price\", ax=ax, color=\"red\")" ], "execution_count": null, "outputs": [] } ] }