{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![LOGO](https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/img/MODIN_ver2_hrz.png?raw=True)\n",
    "\n",
    "<center><h2>Scale your pandas workflows by changing one line of code</h2>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Getting Started"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To install the most recent stable release for Modin run the following code on your command line:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install modin[all] "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For further instructions on how to install Modin with conda or for specific platforms or engines, see our detailed [installation guide](https://modin.readthedocs.io/en/latest/getting_started/installation.html).\n",
    "\n",
    "Modin acts as a drop-in replacement for pandas so you can simply change a single line of import to speed up your pandas workflows. To use Modin, you simply have to replace the import of pandas with the import of Modin, as follows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import modin.pandas as pd\n",
    "import pandas"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2022-01-07 07:29:30,173\tINFO services.py:1250 -- View the Ray dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265\u001b[39m\u001b[22m\n"
     ]
    }
   ],
   "source": [
    "#############################################\n",
    "### For the purpose of timing comparisons ###\n",
    "#############################################\n",
    "import time\n",
    "import ray\n",
    "ray.init()\n",
    "from IPython.display import Markdown, display\n",
    "def printmd(string):\n",
    "    display(Markdown(string))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Dataset: NYC taxi trip data\n",
    "\n",
    "Link to raw dataset: https://modin-test.s3.us-west-1.amazonaws.com/yellow_tripdata_2015-01.csv (**Size: ~200MB**)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('taxi.csv', <http.client.HTTPMessage at 0x1307faf70>)"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# This may take a few minutes to download\n",
    "import urllib.request\n",
    "s3_path = \"https://modin-test.s3.us-west-1.amazonaws.com/yellow_tripdata_2015-01.csv\"\n",
    "urllib.request.urlretrieve(s3_path, \"taxi.csv\")  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Faster Data Loading with Modin's ``read_csv``"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "DtypeWarning: Columns (6) have mixed types.Specify dtype option on import or set low_memory=False.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Time to read with pandas: 2.744 seconds\n"
     ]
    }
   ],
   "source": [
    "start = time.time()\n",
    "\n",
    "pandas_df = pandas.read_csv(\"taxi.csv\", parse_dates=[\"tpep_pickup_datetime\", \"tpep_dropoff_datetime\"], quoting=3)\n",
    "\n",
    "end = time.time()\n",
    "pandas_duration = end - start\n",
    "print(\"Time to read with pandas: {} seconds\".format(round(pandas_duration, 3)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Time to read with Modin: 1.35 seconds\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "UserWarning: `read_*` implementation has mismatches with pandas:\n",
      "Data types of partitions are different! Please refer to the troubleshooting section of the Modin documentation to fix this issue.\n"
     ]
    },
    {
     "data": {
      "text/markdown": [
       "## Modin is 2.03x faster than pandas at `read_csv`!"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "start = time.time()\n",
    "\n",
    "modin_df = pd.read_csv(\"taxi.csv\", parse_dates=[\"tpep_pickup_datetime\", \"tpep_dropoff_datetime\"], quoting=3)\n",
    "\n",
    "end = time.time()\n",
    "modin_duration = end - start\n",
    "print(\"Time to read with Modin: {} seconds\".format(round(modin_duration, 3)))\n",
    "\n",
    "printmd(\"## Modin is {}x faster than pandas at `read_csv`!\".format(round(pandas_duration / modin_duration, 2)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can quickly check that the result from pandas and Modin is exactly the same."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>VendorID</th>\n",
       "      <th>tpep_pickup_datetime</th>\n",
       "      <th>tpep_dropoff_datetime</th>\n",
       "      <th>passenger_count</th>\n",
       "      <th>trip_distance</th>\n",
       "      <th>RatecodeID</th>\n",
       "      <th>store_and_fwd_flag</th>\n",
       "      <th>PULocationID</th>\n",
       "      <th>DOLocationID</th>\n",
       "      <th>payment_type</th>\n",
       "      <th>fare_amount</th>\n",
       "      <th>extra</th>\n",
       "      <th>mta_tax</th>\n",
       "      <th>tip_amount</th>\n",
       "      <th>tolls_amount</th>\n",
       "      <th>improvement_surcharge</th>\n",
       "      <th>total_amount</th>\n",
       "      <th>congestion_surcharge</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "      <td>2021-01-01 00:30:10</td>\n",
       "      <td>2021-01-01 00:36:12</td>\n",
       "      <td>1.0</td>\n",
       "      <td>2.10</td>\n",
       "      <td>1.0</td>\n",
       "      <td>N</td>\n",
       "      <td>142</td>\n",
       "      <td>43</td>\n",
       "      <td>2.0</td>\n",
       "      <td>8.00</td>\n",
       "      <td>3.00</td>\n",
       "      <td>0.5</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.3</td>\n",
       "      <td>11.80</td>\n",
       "      <td>2.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1.0</td>\n",
       "      <td>2021-01-01 00:51:20</td>\n",
       "      <td>2021-01-01 00:52:19</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.20</td>\n",
       "      <td>1.0</td>\n",
       "      <td>N</td>\n",
       "      <td>238</td>\n",
       "      <td>151</td>\n",
       "      <td>2.0</td>\n",
       "      <td>3.00</td>\n",
       "      <td>0.50</td>\n",
       "      <td>0.5</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.3</td>\n",
       "      <td>4.30</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1.0</td>\n",
       "      <td>2021-01-01 00:43:30</td>\n",
       "      <td>2021-01-01 01:11:06</td>\n",
       "      <td>1.0</td>\n",
       "      <td>14.70</td>\n",
       "      <td>1.0</td>\n",
       "      <td>N</td>\n",
       "      <td>132</td>\n",
       "      <td>165</td>\n",
       "      <td>1.0</td>\n",
       "      <td>42.00</td>\n",
       "      <td>0.50</td>\n",
       "      <td>0.5</td>\n",
       "      <td>8.65</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.3</td>\n",
       "      <td>51.95</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1.0</td>\n",
       "      <td>2021-01-01 00:15:48</td>\n",
       "      <td>2021-01-01 00:31:01</td>\n",
       "      <td>0.0</td>\n",
       "      <td>10.60</td>\n",
       "      <td>1.0</td>\n",
       "      <td>N</td>\n",
       "      <td>138</td>\n",
       "      <td>132</td>\n",
       "      <td>1.0</td>\n",
       "      <td>29.00</td>\n",
       "      <td>0.50</td>\n",
       "      <td>0.5</td>\n",
       "      <td>6.05</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.3</td>\n",
       "      <td>36.35</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2.0</td>\n",
       "      <td>2021-01-01 00:31:49</td>\n",
       "      <td>2021-01-01 00:48:21</td>\n",
       "      <td>1.0</td>\n",
       "      <td>4.94</td>\n",
       "      <td>1.0</td>\n",
       "      <td>N</td>\n",
       "      <td>68</td>\n",
       "      <td>33</td>\n",
       "      <td>1.0</td>\n",
       "      <td>16.50</td>\n",
       "      <td>0.50</td>\n",
       "      <td>0.5</td>\n",
       "      <td>4.06</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.3</td>\n",
       "      <td>24.36</td>\n",
       "      <td>2.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1369760</th>\n",
       "      <td>NaN</td>\n",
       "      <td>2021-01-25 08:32:04</td>\n",
       "      <td>2021-01-25 08:49:32</td>\n",
       "      <td>NaN</td>\n",
       "      <td>8.80</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>135</td>\n",
       "      <td>82</td>\n",
       "      <td>NaN</td>\n",
       "      <td>21.84</td>\n",
       "      <td>2.75</td>\n",
       "      <td>0.5</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.3</td>\n",
       "      <td>25.39</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1369761</th>\n",
       "      <td>NaN</td>\n",
       "      <td>2021-01-25 08:34:00</td>\n",
       "      <td>2021-01-25 09:04:00</td>\n",
       "      <td>NaN</td>\n",
       "      <td>5.86</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>42</td>\n",
       "      <td>161</td>\n",
       "      <td>NaN</td>\n",
       "      <td>26.67</td>\n",
       "      <td>2.75</td>\n",
       "      <td>0.5</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.3</td>\n",
       "      <td>30.22</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1369762</th>\n",
       "      <td>NaN</td>\n",
       "      <td>2021-01-25 08:37:00</td>\n",
       "      <td>2021-01-25 08:53:00</td>\n",
       "      <td>NaN</td>\n",
       "      <td>4.45</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>14</td>\n",
       "      <td>106</td>\n",
       "      <td>NaN</td>\n",
       "      <td>25.29</td>\n",
       "      <td>2.75</td>\n",
       "      <td>0.5</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.3</td>\n",
       "      <td>28.84</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1369763</th>\n",
       "      <td>NaN</td>\n",
       "      <td>2021-01-25 08:28:00</td>\n",
       "      <td>2021-01-25 08:50:00</td>\n",
       "      <td>NaN</td>\n",
       "      <td>10.04</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>175</td>\n",
       "      <td>216</td>\n",
       "      <td>NaN</td>\n",
       "      <td>28.24</td>\n",
       "      <td>2.75</td>\n",
       "      <td>0.5</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.3</td>\n",
       "      <td>31.79</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1369764</th>\n",
       "      <td>NaN</td>\n",
       "      <td>2021-01-25 08:38:00</td>\n",
       "      <td>2021-01-25 08:50:00</td>\n",
       "      <td>NaN</td>\n",
       "      <td>4.93</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>248</td>\n",
       "      <td>168</td>\n",
       "      <td>NaN</td>\n",
       "      <td>20.76</td>\n",
       "      <td>2.75</td>\n",
       "      <td>0.5</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.3</td>\n",
       "      <td>24.31</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>1369765 rows × 18 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "         VendorID tpep_pickup_datetime tpep_dropoff_datetime  passenger_count  \\\n",
       "0             1.0  2021-01-01 00:30:10   2021-01-01 00:36:12              1.0   \n",
       "1             1.0  2021-01-01 00:51:20   2021-01-01 00:52:19              1.0   \n",
       "2             1.0  2021-01-01 00:43:30   2021-01-01 01:11:06              1.0   \n",
       "3             1.0  2021-01-01 00:15:48   2021-01-01 00:31:01              0.0   \n",
       "4             2.0  2021-01-01 00:31:49   2021-01-01 00:48:21              1.0   \n",
       "...           ...                  ...                   ...              ...   \n",
       "1369760       NaN  2021-01-25 08:32:04   2021-01-25 08:49:32              NaN   \n",
       "1369761       NaN  2021-01-25 08:34:00   2021-01-25 09:04:00              NaN   \n",
       "1369762       NaN  2021-01-25 08:37:00   2021-01-25 08:53:00              NaN   \n",
       "1369763       NaN  2021-01-25 08:28:00   2021-01-25 08:50:00              NaN   \n",
       "1369764       NaN  2021-01-25 08:38:00   2021-01-25 08:50:00              NaN   \n",
       "\n",
       "         trip_distance  RatecodeID store_and_fwd_flag  PULocationID  \\\n",
       "0                 2.10         1.0                  N           142   \n",
       "1                 0.20         1.0                  N           238   \n",
       "2                14.70         1.0                  N           132   \n",
       "3                10.60         1.0                  N           138   \n",
       "4                 4.94         1.0                  N            68   \n",
       "...                ...         ...                ...           ...   \n",
       "1369760           8.80         NaN                NaN           135   \n",
       "1369761           5.86         NaN                NaN            42   \n",
       "1369762           4.45         NaN                NaN            14   \n",
       "1369763          10.04         NaN                NaN           175   \n",
       "1369764           4.93         NaN                NaN           248   \n",
       "\n",
       "         DOLocationID  payment_type  fare_amount  extra  mta_tax  tip_amount  \\\n",
       "0                  43           2.0         8.00   3.00      0.5        0.00   \n",
       "1                 151           2.0         3.00   0.50      0.5        0.00   \n",
       "2                 165           1.0        42.00   0.50      0.5        8.65   \n",
       "3                 132           1.0        29.00   0.50      0.5        6.05   \n",
       "4                  33           1.0        16.50   0.50      0.5        4.06   \n",
       "...               ...           ...          ...    ...      ...         ...   \n",
       "1369760            82           NaN        21.84   2.75      0.5        0.00   \n",
       "1369761           161           NaN        26.67   2.75      0.5        0.00   \n",
       "1369762           106           NaN        25.29   2.75      0.5        0.00   \n",
       "1369763           216           NaN        28.24   2.75      0.5        0.00   \n",
       "1369764           168           NaN        20.76   2.75      0.5        0.00   \n",
       "\n",
       "         tolls_amount  improvement_surcharge  total_amount  \\\n",
       "0                 0.0                    0.3         11.80   \n",
       "1                 0.0                    0.3          4.30   \n",
       "2                 0.0                    0.3         51.95   \n",
       "3                 0.0                    0.3         36.35   \n",
       "4                 0.0                    0.3         24.36   \n",
       "...               ...                    ...           ...   \n",
       "1369760           0.0                    0.3         25.39   \n",
       "1369761           0.0                    0.3         30.22   \n",
       "1369762           0.0                    0.3         28.84   \n",
       "1369763           0.0                    0.3         31.79   \n",
       "1369764           0.0                    0.3         24.31   \n",
       "\n",
       "         congestion_surcharge  \n",
       "0                         2.5  \n",
       "1                         0.0  \n",
       "2                         0.0  \n",
       "3                         0.0  \n",
       "4                         2.5  \n",
       "...                       ...  \n",
       "1369760                   0.0  \n",
       "1369761                   0.0  \n",
       "1369762                   0.0  \n",
       "1369763                   0.0  \n",
       "1369764                   0.0  \n",
       "\n",
       "[1369765 rows x 18 columns]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pandas_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>VendorID</th>\n",
       "      <th>tpep_pickup_datetime</th>\n",
       "      <th>tpep_dropoff_datetime</th>\n",
       "      <th>passenger_count</th>\n",
       "      <th>trip_distance</th>\n",
       "      <th>RatecodeID</th>\n",
       "      <th>store_and_fwd_flag</th>\n",
       "      <th>PULocationID</th>\n",
       "      <th>DOLocationID</th>\n",
       "      <th>payment_type</th>\n",
       "      <th>fare_amount</th>\n",
       "      <th>extra</th>\n",
       "      <th>mta_tax</th>\n",
       "      <th>tip_amount</th>\n",
       "      <th>tolls_amount</th>\n",
       "      <th>improvement_surcharge</th>\n",
       "      <th>total_amount</th>\n",
       "      <th>congestion_surcharge</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "      <td>2021-01-01 00:30:10</td>\n",
       "      <td>2021-01-01 00:36:12</td>\n",
       "      <td>1.0</td>\n",
       "      <td>2.10</td>\n",
       "      <td>1.0</td>\n",
       "      <td>N</td>\n",
       "      <td>142</td>\n",
       "      <td>43</td>\n",
       "      <td>2.0</td>\n",
       "      <td>8.00</td>\n",
       "      <td>3.00</td>\n",
       "      <td>0.5</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.3</td>\n",
       "      <td>11.80</td>\n",
       "      <td>2.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1.0</td>\n",
       "      <td>2021-01-01 00:51:20</td>\n",
       "      <td>2021-01-01 00:52:19</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.20</td>\n",
       "      <td>1.0</td>\n",
       "      <td>N</td>\n",
       "      <td>238</td>\n",
       "      <td>151</td>\n",
       "      <td>2.0</td>\n",
       "      <td>3.00</td>\n",
       "      <td>0.50</td>\n",
       "      <td>0.5</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.3</td>\n",
       "      <td>4.30</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1.0</td>\n",
       "      <td>2021-01-01 00:43:30</td>\n",
       "      <td>2021-01-01 01:11:06</td>\n",
       "      <td>1.0</td>\n",
       "      <td>14.70</td>\n",
       "      <td>1.0</td>\n",
       "      <td>N</td>\n",
       "      <td>132</td>\n",
       "      <td>165</td>\n",
       "      <td>1.0</td>\n",
       "      <td>42.00</td>\n",
       "      <td>0.50</td>\n",
       "      <td>0.5</td>\n",
       "      <td>8.65</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.3</td>\n",
       "      <td>51.95</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1.0</td>\n",
       "      <td>2021-01-01 00:15:48</td>\n",
       "      <td>2021-01-01 00:31:01</td>\n",
       "      <td>0.0</td>\n",
       "      <td>10.60</td>\n",
       "      <td>1.0</td>\n",
       "      <td>N</td>\n",
       "      <td>138</td>\n",
       "      <td>132</td>\n",
       "      <td>1.0</td>\n",
       "      <td>29.00</td>\n",
       "      <td>0.50</td>\n",
       "      <td>0.5</td>\n",
       "      <td>6.05</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.3</td>\n",
       "      <td>36.35</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2.0</td>\n",
       "      <td>2021-01-01 00:31:49</td>\n",
       "      <td>2021-01-01 00:48:21</td>\n",
       "      <td>1.0</td>\n",
       "      <td>4.94</td>\n",
       "      <td>1.0</td>\n",
       "      <td>N</td>\n",
       "      <td>68</td>\n",
       "      <td>33</td>\n",
       "      <td>1.0</td>\n",
       "      <td>16.50</td>\n",
       "      <td>0.50</td>\n",
       "      <td>0.5</td>\n",
       "      <td>4.06</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.3</td>\n",
       "      <td>24.36</td>\n",
       "      <td>2.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1369760</th>\n",
       "      <td>NaN</td>\n",
       "      <td>2021-01-25 08:32:04</td>\n",
       "      <td>2021-01-25 08:49:32</td>\n",
       "      <td>NaN</td>\n",
       "      <td>8.80</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>135</td>\n",
       "      <td>82</td>\n",
       "      <td>NaN</td>\n",
       "      <td>21.84</td>\n",
       "      <td>2.75</td>\n",
       "      <td>0.5</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.3</td>\n",
       "      <td>25.39</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1369761</th>\n",
       "      <td>NaN</td>\n",
       "      <td>2021-01-25 08:34:00</td>\n",
       "      <td>2021-01-25 09:04:00</td>\n",
       "      <td>NaN</td>\n",
       "      <td>5.86</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>42</td>\n",
       "      <td>161</td>\n",
       "      <td>NaN</td>\n",
       "      <td>26.67</td>\n",
       "      <td>2.75</td>\n",
       "      <td>0.5</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.3</td>\n",
       "      <td>30.22</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1369762</th>\n",
       "      <td>NaN</td>\n",
       "      <td>2021-01-25 08:37:00</td>\n",
       "      <td>2021-01-25 08:53:00</td>\n",
       "      <td>NaN</td>\n",
       "      <td>4.45</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>14</td>\n",
       "      <td>106</td>\n",
       "      <td>NaN</td>\n",
       "      <td>25.29</td>\n",
       "      <td>2.75</td>\n",
       "      <td>0.5</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.3</td>\n",
       "      <td>28.84</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1369763</th>\n",
       "      <td>NaN</td>\n",
       "      <td>2021-01-25 08:28:00</td>\n",
       "      <td>2021-01-25 08:50:00</td>\n",
       "      <td>NaN</td>\n",
       "      <td>10.04</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>175</td>\n",
       "      <td>216</td>\n",
       "      <td>NaN</td>\n",
       "      <td>28.24</td>\n",
       "      <td>2.75</td>\n",
       "      <td>0.5</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.3</td>\n",
       "      <td>31.79</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1369764</th>\n",
       "      <td>NaN</td>\n",
       "      <td>2021-01-25 08:38:00</td>\n",
       "      <td>2021-01-25 08:50:00</td>\n",
       "      <td>NaN</td>\n",
       "      <td>4.93</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>248</td>\n",
       "      <td>168</td>\n",
       "      <td>NaN</td>\n",
       "      <td>20.76</td>\n",
       "      <td>2.75</td>\n",
       "      <td>0.5</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.3</td>\n",
       "      <td>24.31</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>1369765 rows x 18 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "         VendorID tpep_pickup_datetime tpep_dropoff_datetime  passenger_count  \\\n",
       "0             1.0  2021-01-01 00:30:10   2021-01-01 00:36:12              1.0   \n",
       "1             1.0  2021-01-01 00:51:20   2021-01-01 00:52:19              1.0   \n",
       "2             1.0  2021-01-01 00:43:30   2021-01-01 01:11:06              1.0   \n",
       "3             1.0  2021-01-01 00:15:48   2021-01-01 00:31:01              0.0   \n",
       "4             2.0  2021-01-01 00:31:49   2021-01-01 00:48:21              1.0   \n",
       "...           ...                  ...                   ...              ...   \n",
       "1369760       NaN  2021-01-25 08:32:04   2021-01-25 08:49:32              NaN   \n",
       "1369761       NaN  2021-01-25 08:34:00   2021-01-25 09:04:00              NaN   \n",
       "1369762       NaN  2021-01-25 08:37:00   2021-01-25 08:53:00              NaN   \n",
       "1369763       NaN  2021-01-25 08:28:00   2021-01-25 08:50:00              NaN   \n",
       "1369764       NaN  2021-01-25 08:38:00   2021-01-25 08:50:00              NaN   \n",
       "\n",
       "         trip_distance  RatecodeID store_and_fwd_flag  PULocationID  \\\n",
       "0                 2.10         1.0                  N           142   \n",
       "1                 0.20         1.0                  N           238   \n",
       "2                14.70         1.0                  N           132   \n",
       "3                10.60         1.0                  N           138   \n",
       "4                 4.94         1.0                  N            68   \n",
       "...                ...         ...                ...           ...   \n",
       "1369760           8.80         NaN                NaN           135   \n",
       "1369761           5.86         NaN                NaN            42   \n",
       "1369762           4.45         NaN                NaN            14   \n",
       "1369763          10.04         NaN                NaN           175   \n",
       "1369764           4.93         NaN                NaN           248   \n",
       "\n",
       "         DOLocationID  payment_type  fare_amount  extra  mta_tax  tip_amount  \\\n",
       "0                  43           2.0         8.00   3.00      0.5        0.00   \n",
       "1                 151           2.0         3.00   0.50      0.5        0.00   \n",
       "2                 165           1.0        42.00   0.50      0.5        8.65   \n",
       "3                 132           1.0        29.00   0.50      0.5        6.05   \n",
       "4                  33           1.0        16.50   0.50      0.5        4.06   \n",
       "...               ...           ...          ...    ...      ...         ...   \n",
       "1369760            82           NaN        21.84   2.75      0.5        0.00   \n",
       "1369761           161           NaN        26.67   2.75      0.5        0.00   \n",
       "1369762           106           NaN        25.29   2.75      0.5        0.00   \n",
       "1369763           216           NaN        28.24   2.75      0.5        0.00   \n",
       "1369764           168           NaN        20.76   2.75      0.5        0.00   \n",
       "\n",
       "         tolls_amount  improvement_surcharge  total_amount  \\\n",
       "0                 0.0                    0.3         11.80   \n",
       "1                 0.0                    0.3          4.30   \n",
       "2                 0.0                    0.3         51.95   \n",
       "3                 0.0                    0.3         36.35   \n",
       "4                 0.0                    0.3         24.36   \n",
       "...               ...                    ...           ...   \n",
       "1369760           0.0                    0.3         25.39   \n",
       "1369761           0.0                    0.3         30.22   \n",
       "1369762           0.0                    0.3         28.84   \n",
       "1369763           0.0                    0.3         31.79   \n",
       "1369764           0.0                    0.3         24.31   \n",
       "\n",
       "         congestion_surcharge  \n",
       "0                         2.5  \n",
       "1                         0.0  \n",
       "2                         0.0  \n",
       "3                         0.0  \n",
       "4                         2.5  \n",
       "...                       ...  \n",
       "1369760                   0.0  \n",
       "1369761                   0.0  \n",
       "1369762                   0.0  \n",
       "1369763                   0.0  \n",
       "1369764                   0.0  \n",
       "\n",
       "[1369765 rows x 18 columns]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "modin_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Faster Append with Modin's ``concat``"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our previous ``read_csv`` example operated on a relatively small dataframe. In the following example, we duplicate the same taxi dataset 100 times and then concatenate them together."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Time to concat with pandas: 34.144 seconds\n"
     ]
    }
   ],
   "source": [
    "N_copies= 100\n",
    "start = time.time()\n",
    "\n",
    "big_pandas_df = pandas.concat([pandas_df for _ in range(N_copies)])\n",
    "\n",
    "end = time.time()\n",
    "pandas_duration = end - start\n",
    "print(\"Time to concat with pandas: {} seconds\".format(round(pandas_duration, 3)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Time to concat with Modin: 0.564 seconds\n"
     ]
    },
    {
     "data": {
      "text/markdown": [
       "### Modin is 60.57x faster than pandas at `concat`!"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "start = time.time()\n",
    "\n",
    "big_modin_df = pd.concat([modin_df for _ in range(N_copies)])\n",
    "\n",
    "end = time.time()\n",
    "modin_duration = end - start\n",
    "print(\"Time to concat with Modin: {} seconds\".format(round(modin_duration, 3)))\n",
    "\n",
    "printmd(\"### Modin is {}x faster than pandas at `concat`!\".format(round(pandas_duration / modin_duration, 2)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The result dataset is around 19GB in size."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[2m\u001b[36m(apply_list_of_funcs pid=73415)\u001b[0m \n",
      "\u001b[2m\u001b[36m(apply_list_of_funcs pid=73416)\u001b[0m \n",
      "<class 'modin.pandas.dataframe.DataFrame'>\n",
      "Int64Index: 136976500 entries, 0 to 1369764\n",
      "Data columns (total 18 columns):\n",
      " #   Column                 Non-Null Count      Dtype         \n",
      "---  ---------------------  ------------------  -----         \n",
      " 0   VendorID               127141300 non-null  float64\n",
      " 1   tpep_pickup_datetime   136976500 non-null  datetime64[ns]\n",
      " 2   tpep_dropoff_datetime  136976500 non-null  datetime64[ns]\n",
      " 3   passenger_count        127141300 non-null  float64\n",
      " 4   trip_distance          136976500 non-null  float64\n",
      " 5   RatecodeID             127141300 non-null  float64\n",
      " 6   store_and_fwd_flag     127141300 non-null  object\n",
      " 7   PULocationID           136976500 non-null  int64\n",
      " 8   DOLocationID           136976500 non-null  int64\n",
      " 9   payment_type           127141300 non-null  float64\n",
      " 10  fare_amount            136976500 non-null  float64\n",
      " 11  extra                  136976500 non-null  float64\n",
      " 12  mta_tax                136976500 non-null  float64\n",
      " 13  tip_amount             136976500 non-null  float64\n",
      " 14  tolls_amount           136976500 non-null  float64\n",
      " 15  improvement_surcharge  136976500 non-null  float64\n",
      " 16  total_amount           136976500 non-null  float64\n",
      " 17  congestion_surcharge   136976500 non-null  float64\n",
      "dtypes: float64(13), datetime64[ns](2), int64(2), object(1)\n",
      "memory usage: 19.4 GB\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "UserWarning: Distributing <class 'int'> object. This may take some time.\n"
     ]
    }
   ],
   "source": [
    "big_modin_df.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Faster ``apply`` over a single column\n",
    "\n",
    "The performance benefits of Modin becomes aparent when we operate on large gigabyte-scale datasets. For example, let's say that we want to round up the number across a single column via the ``apply`` operation. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Time to apply with pandas: 43.969 seconds\n"
     ]
    }
   ],
   "source": [
    "start = time.time()\n",
    "rounded_trip_distance_pandas = big_pandas_df[\"trip_distance\"].apply(round)\n",
    "\n",
    "end = time.time()\n",
    "pandas_duration = end - start\n",
    "print(\"Time to apply with pandas: {} seconds\".format(round(pandas_duration, 3)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Time to apply with Modin: 1.225 seconds\n"
     ]
    },
    {
     "data": {
      "text/markdown": [
       "### Modin is 35.88x faster than pandas at `apply` on one column!"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "start = time.time()\n",
    "\n",
    "rounded_trip_distance_modin = big_modin_df[\"trip_distance\"].apply(round)\n",
    "\n",
    "end = time.time()\n",
    "modin_duration = end - start\n",
    "print(\"Time to apply with Modin: {} seconds\".format(round(modin_duration, 3)))\n",
    "\n",
    "printmd(\"### Modin is {}x faster than pandas at `apply` on one column!\".format(round(pandas_duration / modin_duration, 2)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Summary\n",
    "\n",
    "Hopefully, this tutorial demonstrated how Modin delivers significant speedup on pandas operations without the need for any extra effort. Throughout example, we moved from working with 100MBs of data to 20GBs of data all without having to change anything or manually optimize our code to achieve the level of scalable performance that Modin provides.\n",
    "\n",
    "Note that in this quickstart example, we've only shown ``read_csv``, ``concat``, ``apply``, but these are not the only pandas operations that Modin optimizes for. In fact, Modin covers [more than 90% of the pandas API](https://github.com/modin-project/modin/blob/master/README.md#pandas-api-coverage), yielding considerable speedups for many common operations."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}