{ "cells": [ { "cell_type": "code", "execution_count": 196, "id": "d03c0654", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Python 3.12.1\n" ] } ], "source": [ "!python -V" ] }, { "cell_type": "markdown", "id": "8c0c3eac", "metadata": {}, "source": [ "## Install packages\n", "With uv + VS Code, there are two options (I went with the first).\n", "1. Add them to the project. CLI: `uv add pandas`, or Jupyter: `!uv add pandas`\n", "2. Install them, bypassing `pyproject.toml`. Jupyter: `!uv pip install pandas`\n", "\n", "More in README.MD" ] }, { "cell_type": "code", "execution_count": 197, "id": "350c4ced", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "import pickle\n", "from sklearn.feature_extraction import DictVectorizer\n", "from sklearn.linear_model import LinearRegression\n", "from sklearn.linear_model import Lasso\n", "from sklearn.linear_model import Ridge\n", "from sklearn.metrics import root_mean_squared_error # , mean_squared_error be" ] }, { "cell_type": "code", "execution_count": null, "id": "63d916e8", "metadata": {}, "outputs": [], "source": [ "df = pd.read_parquet(\"../data/green_tripdata_2021-01.parquet\")" ] }, { "cell_type": "code", "execution_count": 199, "id": "127f1b09", "metadata": {}, "outputs": [], "source": [ "df.lpep_dropoff_datetime = pd.to_datetime(df.lpep_dropoff_datetime)\n", "df.lpep_pickup_datetime = pd.to_datetime(df.lpep_pickup_datetime)" ] }, { "cell_type": "code", "execution_count": 200, "id": "73a983eb", "metadata": {}, "outputs": [], "source": [ "# trip duration as a timedelta (dropoff - pickup)\n", "df[\"duration\"] = df.lpep_dropoff_datetime - df.lpep_pickup_datetime\n", "# convert each timedelta to minutes\n", "df.duration = df.duration.dt.total_seconds() / 60" ] }, { "cell_type": "code", "execution_count": 201, "id": "f3c01972", "metadata": {}, "outputs": [], "source": [ "# 
https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_green.pdf\n", "# df = df[df.trip_type == 2] # not required" ] }, { "cell_type": "code", "execution_count": 202, "id": "2da00479", "metadata": {}, "outputs": [], "source": [ "# Slight deviation from the lecture: in this case we filter before plotting\n", "df = df[(df.duration >= 1) & (df.duration <= 60)]" ] }, { "cell_type": "code", "execution_count": 203, "id": "a613991b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 203, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# distplot is deprecated; the general alternative is displot.\n", "\n", "# Kernel density estimation (KDE) is enabled (it draws a smoothed curve over the histogram).\n", "# Density: how likely values are to occur within a given range.\n", "# Density is normalized, so distributions can be compared, unlike raw frequencies or counts.\n", "\n", "# The .set part is optional; it sometimes helps with readability\n", "sns.displot(df.duration, kde=True, stat=\"density\") # .set(xlim=(0, 100),ylim=(0, 0.06))" ] }, { "cell_type": "code", "execution_count": 204, "id": "b3f22cbc", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 73908.000000\n", "mean 16.852578\n", "std 11.563163\n", "min 1.000000\n", "50% 14.000000\n", "95% 41.000000\n", "98% 48.781000\n", "99% 53.000000\n", "max 60.000000\n", "Name: duration, dtype: float64" ] }, "execution_count": 204, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check percentiles to see below which durations most rides fall\n", "df.duration.describe(percentiles=[0.95, 0.98, 0.99])" ] }, { "cell_type": "code", "execution_count": null, "id": "d6de5cba", "metadata": {}, "outputs": [], "source": [ "categorical = [\"PULocationID\", \"DOLocationID\"]\n", "numerical = [\"trip_distance\"]" ] }, { "cell_type": "code", "execution_count": 206, "id": "6e3bc33e", "metadata": {}, "outputs": [], "source": [ "df[categorical] = df[categorical].astype(str)" ] }, { "cell_type": "code", "execution_count": null, "id": "bb7bed9f", "metadata": {}, "outputs": [], "source": [ "train_dict = df[categorical + numerical].to_dict(orient=\"records\")" ] }, { "cell_type": "markdown", "id": "7fc5c394", "metadata": {}, "source": [ "Notes:\n", "## One-hot encoding\n", "\n", "> One-hot encoding is a technique to convert categorical variables (like strings or IDs) into a format that can be provided to machine learning algorithms.\n", "\n", "### How it works:\n", "- For each unique value in a 
categorical column, a new column is created.\n", "- In each row, the column corresponding to the value is set to 1, and all others are set to 0.\n", "\n", "#### Example\n", "\n", "Before:\n", "\n", "| Color |\n", "|-------|\n", "| Red |\n", "| Blue |\n", "| Green |\n", "\n", "After one-hot:\n", "\n", "| Color=Red | Color=Green | Color=Blue |\n", "|-----------|-------------|------------|\n", "| 1 | 0 | 0 |\n", "| 0 | 0 | 1 |\n", "| 0 | 1 | 0 |\n", "\n", "This represents categorical data numerically, which makes it usable by most machine learning models.\n", "\n", "## Matrix\n", "In this case each value of `PULocationID` and `DOLocationID` becomes its own column holding 0 or 1, and each row is a ride. `trip_distance` remains unchanged.\n", "\n", "After `DictVectorizer`, the matrix will look roughly like this (column order may differ):\n", "\n", "| PULocationID=10 | PULocationID=15 | DOLocationID=20 | DOLocationID=30 | trip_distance |\n", "|-----------------|-----------------|-----------------|-----------------|--------------|\n", "| 1 | 0 | 1 | 0 | 2.5 |\n", "| 0 | 1 | 0 | 1 | 1.2 |\n", "| 1 | 0 | 0 | 1 | 3.8 |\n", "\n", "The final matrix has as many columns as there are unique categorical values (from both columns) plus one column for each numerical feature." 
] }, { "cell_type": "code", "execution_count": 208, "id": "3739e6e9", "metadata": {}, "outputs": [], "source": [ "dv = DictVectorizer()\n", "X_train = dv.fit_transform(train_dict)" ] }, { "cell_type": "code", "execution_count": 209, "id": "833c52eb", "metadata": {}, "outputs": [], "source": [ "# Convert the column into a NumPy array\n", "# and use it as the target for learning.\n", "target = \"duration\"\n", "y_train = df[target].values" ] }, { "cell_type": "code", "execution_count": 210, "id": "ea5dd27c", "metadata": {}, "outputs": [], "source": [ "lr = LinearRegression()\n", "lr.fit(X_train, y_train) # learn the mapping from features to duration\n", "# predict() takes only features; y is not needed because the mapping was learned in .fit.\n", "# We can predict on \"any\" feature matrix; here it is the training data itself.\n", "y_pred = lr.predict(X_train)" ] }, { "cell_type": "code", "execution_count": 211, "id": "d7427748", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# distplot is deprecated; for overlapping plots I found\n", "# two alternatives: histplot and kdeplot. kdeplot looks more informative.\n", "\n", "sns.histplot(\n", "    y_pred, kde=True, stat=\"density\", label=\"prediction\", color=\"C0\", alpha=0.5\n", ")\n", "sns.histplot(y_train, kde=True, stat=\"density\", label=\"actual\", color=\"C1\", alpha=0.5)\n", "plt.legend() # render legend labels\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 212, "id": "fb756bab", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "9.838799799829626" ] }, "execution_count": 212, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# root_mean_squared_error is the same as the older mean_squared_error(..., squared=False)\n", "\n", "# Measure prediction quality by comparing actual vs predicted values:\n", "# RMSE = sqrt(mean((y_true - y_pred) ** 2))\n", "root_mean_squared_error(y_train, y_pred)\n", "# The model is not great: the RMSE is about 9.8 minutes" ] }, { "cell_type": "code", "execution_count": 213, "id": "b73bcc72", "metadata": {}, "outputs": [], "source": [ "# refactored into a reusable function\n", "def read_dataframe(filename):\n", "    if filename.endswith(\".csv\"):\n", "        df = pd.read_csv(filename)\n", "\n", "        # CSV stores timestamps as strings, so parse them\n", "        df.lpep_dropoff_datetime = pd.to_datetime(df.lpep_dropoff_datetime)\n", "        df.lpep_pickup_datetime = pd.to_datetime(df.lpep_pickup_datetime)\n", "    elif filename.endswith(\".parquet\"):\n", "        df = pd.read_parquet(filename)\n", "    else:\n", "        raise ValueError(f\"Unsupported file type: {filename}\")\n", "\n", "    # trip duration in minutes\n", "    df[\"duration\"] = df.lpep_dropoff_datetime - df.lpep_pickup_datetime\n", "    df.duration = df.duration.dt.total_seconds() / 60\n", "\n", "    # keep rides of 1 to 60 minutes; copy to avoid SettingWithCopyWarning below\n", "    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()\n", "\n", "    categorical = [\"PULocationID\", \"DOLocationID\"]\n", "    df[categorical] = df[categorical].astype(str)\n", "\n", "    return df" ] }, { "cell_type": "code", "execution_count": null, "id": "a7312bdf", "metadata": {}, "outputs": [], "source": [ 
"df_train = read_dataframe(\"../data/green_tripdata_2021-01.parquet\")\n", "df_val = read_dataframe(\"../data/green_tripdata_2021-02.parquet\")" ] }, { "cell_type": "code", "execution_count": 215, "id": "9689b41f", "metadata": {}, "outputs": [], "source": [ "# This way the model treats each combined PU/DO pair as its own category.\n", "# I guess it helps the model learn patterns specific to each PU/DO combination.\n", "df_train[\"PU_DO\"] = df_train[\"PULocationID\"] + \"_\" + df_train[\"DOLocationID\"]\n", "df_val[\"PU_DO\"] = df_val[\"PULocationID\"] + \"_\" + df_val[\"DOLocationID\"]" ] }, { "cell_type": "markdown", "id": "a401cf22", "metadata": {}, "source": [ "- Training: January 2021\n", "- Validation: February 2021" ] }, { "cell_type": "code", "execution_count": 216, "id": "eb71cd89", "metadata": {}, "outputs": [], "source": [ "categorical = [\"PU_DO\"] # combined 'PULocationID', 'DOLocationID'\n", "numerical = [\"trip_distance\"]\n", "\n", "dv = DictVectorizer()\n", "\n", "# fit the vectorizer on the training data only\n", "train_dicts = df_train[categorical + numerical].to_dict(orient=\"records\")\n", "X_train = dv.fit_transform(train_dicts)\n", "\n", "# reuse the fitted vectorizer (transform, not fit_transform) for validation\n", "val_dicts = df_val[categorical + numerical].to_dict(orient=\"records\")\n", "X_val = dv.transform(val_dicts)" ] }, { "cell_type": "code", "execution_count": 217, "id": "7c3359b2", "metadata": {}, "outputs": [], "source": [ "target = \"duration\"\n", "y_train = df_train[target].values\n", "y_val = df_val[target].values" ] }, { "cell_type": "code", "execution_count": 218, "id": "2ed32658", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "7.758715209092169" ] }, "execution_count": 218, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lr = LinearRegression()\n", "lr.fit(X_train, y_train)\n", "\n", "y_pred = lr.predict(X_val)\n", "\n", "root_mean_squared_error(y_val, y_pred)" ] }, { "cell_type": "code", "execution_count": 219, "id": "107f8ea3", "metadata": {}, "outputs": [], "source": [ "# export the fitted vectorizer together with the model\n", "with open(\"../models/lin_reg.bin\", \"wb\") as 
f_out:\n", "    pickle.dump((dv, lr), f_out)" ] }, { "cell_type": "code", "execution_count": 220, "id": "7d39c7a2", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "7.616617770546549" ] }, "execution_count": 220, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lr = Lasso(alpha=0.0001) # alpha sets the strength of the L1 regularization penalty\n", "lr.fit(X_train, y_train)\n", "\n", "y_pred = lr.predict(X_val)\n", "\n", "root_mean_squared_error(y_val, y_pred)" ] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.11" } }, "nbformat": 4, "nbformat_minor": 5 }