{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "![LOGO](https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/img/MODIN_ver2_hrz.png?raw=True)\n", "\n", "Scale your pandas workflows by changing one line of code\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Getting Started" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To install the most recent stable release of Modin, run the following command:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install \"modin[all]\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For further instructions on how to install Modin with conda or for specific platforms or engines, see our detailed [installation guide](https://modin.readthedocs.io/en/latest/getting_started/installation.html).\n", "\n", "Modin is a drop-in replacement for pandas: to speed up your pandas workflows, replace the pandas import with the Modin import, as follows. Plain pandas is imported alongside it here only so that the two libraries can be timed against each other." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import modin.pandas as pd\n", "import pandas" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2022-01-07 07:29:30,173\tINFO services.py:1250 -- View the Ray dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265\u001b[39m\u001b[22m\n" ] } ], "source": [ "#############################################\n", "### For the purpose of timing comparisons ###\n", "#############################################\n", "import time\n", "import ray\n", "ray.init()\n", "from IPython.display import Markdown, display\n", "def printmd(string):\n", "    display(Markdown(string))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dataset: NYC taxi trip data\n", "\n", "Link to raw dataset: https://modin-test.s3.us-west-1.amazonaws.com/yellow_tripdata_2015-01.csv (**Size: ~200MB**)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('taxi.csv', )" ] }, "execution_count": 3, 
"metadata": {}, "output_type": "execute_result" } ], "source": [ "# This may take a few minutes to download\n", "import urllib.request\n", "s3_path = \"https://modin-test.s3.us-west-1.amazonaws.com/yellow_tripdata_2015-01.csv\"\n", "urllib.request.urlretrieve(s3_path, \"taxi.csv\") " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Faster Data Loading with Modin's ``read_csv``" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "DtypeWarning: Columns (6) have mixed types.Specify dtype option on import or set low_memory=False.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Time to read with pandas: 2.744 seconds\n" ] } ], "source": [ "start = time.time()\n", "\n", "pandas_df = pandas.read_csv(\"taxi.csv\", parse_dates=[\"tpep_pickup_datetime\", \"tpep_dropoff_datetime\"], quoting=3)\n", "\n", "end = time.time()\n", "pandas_duration = end - start\n", "print(\"Time to read with pandas: {} seconds\".format(round(pandas_duration, 3)))" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Time to read with Modin: 1.35 seconds\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "UserWarning: `read_*` implementation has mismatches with pandas:\n", "Data types of partitions are different! Please refer to the troubleshooting section of the Modin documentation to fix this issue.\n" ] }, { "data": { "text/markdown": [ "## Modin is 2.03x faster than pandas at `read_csv`!" 
], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "start = time.time()\n", "\n", "modin_df = pd.read_csv(\"taxi.csv\", parse_dates=[\"tpep_pickup_datetime\", \"tpep_dropoff_datetime\"], quoting=3)\n", "\n", "end = time.time()\n", "modin_duration = end - start\n", "print(\"Time to read with Modin: {} seconds\".format(round(modin_duration, 3)))\n", "\n", "printmd(\"## Modin is {}x faster than pandas at `read_csv`!\".format(round(pandas_duration / modin_duration, 2)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can quickly check that the result from pandas and Modin is exactly the same." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table>\n",
"<thead>\n",
"<tr><th></th><th>VendorID</th><th>tpep_pickup_datetime</th><th>tpep_dropoff_datetime</th><th>passenger_count</th><th>trip_distance</th><th>RatecodeID</th><th>store_and_fwd_flag</th><th>PULocationID</th><th>DOLocationID</th><th>payment_type</th><th>fare_amount</th><th>extra</th><th>mta_tax</th><th>tip_amount</th><th>tolls_amount</th><th>improvement_surcharge</th><th>total_amount</th><th>congestion_surcharge</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>1.0</td><td>2021-01-01 00:30:10</td><td>2021-01-01 00:36:12</td><td>1.0</td><td>2.10</td><td>1.0</td><td>N</td><td>142</td><td>43</td><td>2.0</td><td>8.00</td><td>3.00</td><td>0.5</td><td>0.00</td><td>0.0</td><td>0.3</td><td>11.80</td><td>2.5</td></tr>\n",
"<tr><th>1</th><td>1.0</td><td>2021-01-01 00:51:20</td><td>2021-01-01 00:52:19</td><td>1.0</td><td>0.20</td><td>1.0</td><td>N</td><td>238</td><td>151</td><td>2.0</td><td>3.00</td><td>0.50</td><td>0.5</td><td>0.00</td><td>0.0</td><td>0.3</td><td>4.30</td><td>0.0</td></tr>\n",
"<tr><th>2</th><td>1.0</td><td>2021-01-01 00:43:30</td><td>2021-01-01 01:11:06</td><td>1.0</td><td>14.70</td><td>1.0</td><td>N</td><td>132</td><td>165</td><td>1.0</td><td>42.00</td><td>0.50</td><td>0.5</td><td>8.65</td><td>0.0</td><td>0.3</td><td>51.95</td><td>0.0</td></tr>\n",
"<tr><th>3</th><td>1.0</td><td>2021-01-01 00:15:48</td><td>2021-01-01 00:31:01</td><td>0.0</td><td>10.60</td><td>1.0</td><td>N</td><td>138</td><td>132</td><td>1.0</td><td>29.00</td><td>0.50</td><td>0.5</td><td>6.05</td><td>0.0</td><td>0.3</td><td>36.35</td><td>0.0</td></tr>\n",
"<tr><th>4</th><td>2.0</td><td>2021-01-01 00:31:49</td><td>2021-01-01 00:48:21</td><td>1.0</td><td>4.94</td><td>1.0</td><td>N</td><td>68</td><td>33</td><td>1.0</td><td>16.50</td><td>0.50</td><td>0.5</td><td>4.06</td><td>0.0</td><td>0.3</td><td>24.36</td><td>2.5</td></tr>\n",
"<tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"<tr><th>1369760</th><td>NaN</td><td>2021-01-25 08:32:04</td><td>2021-01-25 08:49:32</td><td>NaN</td><td>8.80</td><td>NaN</td><td>NaN</td><td>135</td><td>82</td><td>NaN</td><td>21.84</td><td>2.75</td><td>0.5</td><td>0.00</td><td>0.0</td><td>0.3</td><td>25.39</td><td>0.0</td></tr>\n",
"<tr><th>1369761</th><td>NaN</td><td>2021-01-25 08:34:00</td><td>2021-01-25 09:04:00</td><td>NaN</td><td>5.86</td><td>NaN</td><td>NaN</td><td>42</td><td>161</td><td>NaN</td><td>26.67</td><td>2.75</td><td>0.5</td><td>0.00</td><td>0.0</td><td>0.3</td><td>30.22</td><td>0.0</td></tr>\n",
"<tr><th>1369762</th><td>NaN</td><td>2021-01-25 08:37:00</td><td>2021-01-25 08:53:00</td><td>NaN</td><td>4.45</td><td>NaN</td><td>NaN</td><td>14</td><td>106</td><td>NaN</td><td>25.29</td><td>2.75</td><td>0.5</td><td>0.00</td><td>0.0</td><td>0.3</td><td>28.84</td><td>0.0</td></tr>\n",
"<tr><th>1369763</th><td>NaN</td><td>2021-01-25 08:28:00</td><td>2021-01-25 08:50:00</td><td>NaN</td><td>10.04</td><td>NaN</td><td>NaN</td><td>175</td><td>216</td><td>NaN</td><td>28.24</td><td>2.75</td><td>0.5</td><td>0.00</td><td>0.0</td><td>0.3</td><td>31.79</td><td>0.0</td></tr>\n",
"<tr><th>1369764</th><td>NaN</td><td>2021-01-25 08:38:00</td><td>2021-01-25 08:50:00</td><td>NaN</td><td>4.93</td><td>NaN</td><td>NaN</td><td>248</td><td>168</td><td>NaN</td><td>20.76</td><td>2.75</td><td>0.5</td><td>0.00</td><td>0.0</td><td>0.3</td><td>24.31</td><td>0.0</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>1369765 rows × 18 columns</p>\n",
"</div>
" ], "text/plain": [ " VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count \\\n", "0 1.0 2021-01-01 00:30:10 2021-01-01 00:36:12 1.0 \n", "1 1.0 2021-01-01 00:51:20 2021-01-01 00:52:19 1.0 \n", "2 1.0 2021-01-01 00:43:30 2021-01-01 01:11:06 1.0 \n", "3 1.0 2021-01-01 00:15:48 2021-01-01 00:31:01 0.0 \n", "4 2.0 2021-01-01 00:31:49 2021-01-01 00:48:21 1.0 \n", "... ... ... ... ... \n", "1369760 NaN 2021-01-25 08:32:04 2021-01-25 08:49:32 NaN \n", "1369761 NaN 2021-01-25 08:34:00 2021-01-25 09:04:00 NaN \n", "1369762 NaN 2021-01-25 08:37:00 2021-01-25 08:53:00 NaN \n", "1369763 NaN 2021-01-25 08:28:00 2021-01-25 08:50:00 NaN \n", "1369764 NaN 2021-01-25 08:38:00 2021-01-25 08:50:00 NaN \n", "\n", " trip_distance RatecodeID store_and_fwd_flag PULocationID \\\n", "0 2.10 1.0 N 142 \n", "1 0.20 1.0 N 238 \n", "2 14.70 1.0 N 132 \n", "3 10.60 1.0 N 138 \n", "4 4.94 1.0 N 68 \n", "... ... ... ... ... \n", "1369760 8.80 NaN NaN 135 \n", "1369761 5.86 NaN NaN 42 \n", "1369762 4.45 NaN NaN 14 \n", "1369763 10.04 NaN NaN 175 \n", "1369764 4.93 NaN NaN 248 \n", "\n", " DOLocationID payment_type fare_amount extra mta_tax tip_amount \\\n", "0 43 2.0 8.00 3.00 0.5 0.00 \n", "1 151 2.0 3.00 0.50 0.5 0.00 \n", "2 165 1.0 42.00 0.50 0.5 8.65 \n", "3 132 1.0 29.00 0.50 0.5 6.05 \n", "4 33 1.0 16.50 0.50 0.5 4.06 \n", "... ... ... ... ... ... ... \n", "1369760 82 NaN 21.84 2.75 0.5 0.00 \n", "1369761 161 NaN 26.67 2.75 0.5 0.00 \n", "1369762 106 NaN 25.29 2.75 0.5 0.00 \n", "1369763 216 NaN 28.24 2.75 0.5 0.00 \n", "1369764 168 NaN 20.76 2.75 0.5 0.00 \n", "\n", " tolls_amount improvement_surcharge total_amount \\\n", "0 0.0 0.3 11.80 \n", "1 0.0 0.3 4.30 \n", "2 0.0 0.3 51.95 \n", "3 0.0 0.3 36.35 \n", "4 0.0 0.3 24.36 \n", "... ... ... ... 
\n", "1369760 0.0 0.3 25.39 \n", "1369761 0.0 0.3 30.22 \n", "1369762 0.0 0.3 28.84 \n", "1369763 0.0 0.3 31.79 \n", "1369764 0.0 0.3 24.31 \n", "\n", " congestion_surcharge \n", "0 2.5 \n", "1 0.0 \n", "2 0.0 \n", "3 0.0 \n", "4 2.5 \n", "... ... \n", "1369760 0.0 \n", "1369761 0.0 \n", "1369762 0.0 \n", "1369763 0.0 \n", "1369764 0.0 \n", "\n", "[1369765 rows x 18 columns]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pandas_df" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table>\n",
"<thead>\n",
"<tr><th></th><th>VendorID</th><th>tpep_pickup_datetime</th><th>tpep_dropoff_datetime</th><th>passenger_count</th><th>trip_distance</th><th>RatecodeID</th><th>store_and_fwd_flag</th><th>PULocationID</th><th>DOLocationID</th><th>payment_type</th><th>fare_amount</th><th>extra</th><th>mta_tax</th><th>tip_amount</th><th>tolls_amount</th><th>improvement_surcharge</th><th>total_amount</th><th>congestion_surcharge</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>1.0</td><td>2021-01-01 00:30:10</td><td>2021-01-01 00:36:12</td><td>1.0</td><td>2.10</td><td>1.0</td><td>N</td><td>142</td><td>43</td><td>2.0</td><td>8.00</td><td>3.00</td><td>0.5</td><td>0.00</td><td>0.0</td><td>0.3</td><td>11.80</td><td>2.5</td></tr>\n",
"<tr><th>1</th><td>1.0</td><td>2021-01-01 00:51:20</td><td>2021-01-01 00:52:19</td><td>1.0</td><td>0.20</td><td>1.0</td><td>N</td><td>238</td><td>151</td><td>2.0</td><td>3.00</td><td>0.50</td><td>0.5</td><td>0.00</td><td>0.0</td><td>0.3</td><td>4.30</td><td>0.0</td></tr>\n",
"<tr><th>2</th><td>1.0</td><td>2021-01-01 00:43:30</td><td>2021-01-01 01:11:06</td><td>1.0</td><td>14.70</td><td>1.0</td><td>N</td><td>132</td><td>165</td><td>1.0</td><td>42.00</td><td>0.50</td><td>0.5</td><td>8.65</td><td>0.0</td><td>0.3</td><td>51.95</td><td>0.0</td></tr>\n",
"<tr><th>3</th><td>1.0</td><td>2021-01-01 00:15:48</td><td>2021-01-01 00:31:01</td><td>0.0</td><td>10.60</td><td>1.0</td><td>N</td><td>138</td><td>132</td><td>1.0</td><td>29.00</td><td>0.50</td><td>0.5</td><td>6.05</td><td>0.0</td><td>0.3</td><td>36.35</td><td>0.0</td></tr>\n",
"<tr><th>4</th><td>2.0</td><td>2021-01-01 00:31:49</td><td>2021-01-01 00:48:21</td><td>1.0</td><td>4.94</td><td>1.0</td><td>N</td><td>68</td><td>33</td><td>1.0</td><td>16.50</td><td>0.50</td><td>0.5</td><td>4.06</td><td>0.0</td><td>0.3</td><td>24.36</td><td>2.5</td></tr>\n",
"<tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"<tr><th>1369760</th><td>NaN</td><td>2021-01-25 08:32:04</td><td>2021-01-25 08:49:32</td><td>NaN</td><td>8.80</td><td>NaN</td><td>NaN</td><td>135</td><td>82</td><td>NaN</td><td>21.84</td><td>2.75</td><td>0.5</td><td>0.00</td><td>0.0</td><td>0.3</td><td>25.39</td><td>0.0</td></tr>\n",
"<tr><th>1369761</th><td>NaN</td><td>2021-01-25 08:34:00</td><td>2021-01-25 09:04:00</td><td>NaN</td><td>5.86</td><td>NaN</td><td>NaN</td><td>42</td><td>161</td><td>NaN</td><td>26.67</td><td>2.75</td><td>0.5</td><td>0.00</td><td>0.0</td><td>0.3</td><td>30.22</td><td>0.0</td></tr>\n",
"<tr><th>1369762</th><td>NaN</td><td>2021-01-25 08:37:00</td><td>2021-01-25 08:53:00</td><td>NaN</td><td>4.45</td><td>NaN</td><td>NaN</td><td>14</td><td>106</td><td>NaN</td><td>25.29</td><td>2.75</td><td>0.5</td><td>0.00</td><td>0.0</td><td>0.3</td><td>28.84</td><td>0.0</td></tr>\n",
"<tr><th>1369763</th><td>NaN</td><td>2021-01-25 08:28:00</td><td>2021-01-25 08:50:00</td><td>NaN</td><td>10.04</td><td>NaN</td><td>NaN</td><td>175</td><td>216</td><td>NaN</td><td>28.24</td><td>2.75</td><td>0.5</td><td>0.00</td><td>0.0</td><td>0.3</td><td>31.79</td><td>0.0</td></tr>\n",
"<tr><th>1369764</th><td>NaN</td><td>2021-01-25 08:38:00</td><td>2021-01-25 08:50:00</td><td>NaN</td><td>4.93</td><td>NaN</td><td>NaN</td><td>248</td><td>168</td><td>NaN</td><td>20.76</td><td>2.75</td><td>0.5</td><td>0.00</td><td>0.0</td><td>0.3</td><td>24.31</td><td>0.0</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>1369765 rows x 18 columns</p>\n",
"</div>
" ], "text/plain": [ " VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count \\\n", "0 1.0 2021-01-01 00:30:10 2021-01-01 00:36:12 1.0 \n", "1 1.0 2021-01-01 00:51:20 2021-01-01 00:52:19 1.0 \n", "2 1.0 2021-01-01 00:43:30 2021-01-01 01:11:06 1.0 \n", "3 1.0 2021-01-01 00:15:48 2021-01-01 00:31:01 0.0 \n", "4 2.0 2021-01-01 00:31:49 2021-01-01 00:48:21 1.0 \n", "... ... ... ... ... \n", "1369760 NaN 2021-01-25 08:32:04 2021-01-25 08:49:32 NaN \n", "1369761 NaN 2021-01-25 08:34:00 2021-01-25 09:04:00 NaN \n", "1369762 NaN 2021-01-25 08:37:00 2021-01-25 08:53:00 NaN \n", "1369763 NaN 2021-01-25 08:28:00 2021-01-25 08:50:00 NaN \n", "1369764 NaN 2021-01-25 08:38:00 2021-01-25 08:50:00 NaN \n", "\n", " trip_distance RatecodeID store_and_fwd_flag PULocationID \\\n", "0 2.10 1.0 N 142 \n", "1 0.20 1.0 N 238 \n", "2 14.70 1.0 N 132 \n", "3 10.60 1.0 N 138 \n", "4 4.94 1.0 N 68 \n", "... ... ... ... ... \n", "1369760 8.80 NaN NaN 135 \n", "1369761 5.86 NaN NaN 42 \n", "1369762 4.45 NaN NaN 14 \n", "1369763 10.04 NaN NaN 175 \n", "1369764 4.93 NaN NaN 248 \n", "\n", " DOLocationID payment_type fare_amount extra mta_tax tip_amount \\\n", "0 43 2.0 8.00 3.00 0.5 0.00 \n", "1 151 2.0 3.00 0.50 0.5 0.00 \n", "2 165 1.0 42.00 0.50 0.5 8.65 \n", "3 132 1.0 29.00 0.50 0.5 6.05 \n", "4 33 1.0 16.50 0.50 0.5 4.06 \n", "... ... ... ... ... ... ... \n", "1369760 82 NaN 21.84 2.75 0.5 0.00 \n", "1369761 161 NaN 26.67 2.75 0.5 0.00 \n", "1369762 106 NaN 25.29 2.75 0.5 0.00 \n", "1369763 216 NaN 28.24 2.75 0.5 0.00 \n", "1369764 168 NaN 20.76 2.75 0.5 0.00 \n", "\n", " tolls_amount improvement_surcharge total_amount \\\n", "0 0.0 0.3 11.80 \n", "1 0.0 0.3 4.30 \n", "2 0.0 0.3 51.95 \n", "3 0.0 0.3 36.35 \n", "4 0.0 0.3 24.36 \n", "... ... ... ... 
\n", "1369760 0.0 0.3 25.39 \n", "1369761 0.0 0.3 30.22 \n", "1369762 0.0 0.3 28.84 \n", "1369763 0.0 0.3 31.79 \n", "1369764 0.0 0.3 24.31 \n", "\n", " congestion_surcharge \n", "0 2.5 \n", "1 0.0 \n", "2 0.0 \n", "3 0.0 \n", "4 2.5 \n", "... ... \n", "1369760 0.0 \n", "1369761 0.0 \n", "1369762 0.0 \n", "1369763 0.0 \n", "1369764 0.0 \n", "\n", "[1369765 rows x 18 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "modin_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Faster Append with Modin's ``concat``" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our previous ``read_csv`` example operated on a relatively small dataframe. In the following example, we duplicate the same taxi dataset 100 times and then concatenate them together." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Time to concat with pandas: 34.144 seconds\n" ] } ], "source": [ "N_copies= 100\n", "start = time.time()\n", "\n", "big_pandas_df = pandas.concat([pandas_df for _ in range(N_copies)])\n", "\n", "end = time.time()\n", "pandas_duration = end - start\n", "print(\"Time to concat with pandas: {} seconds\".format(round(pandas_duration, 3)))" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Time to concat with Modin: 0.564 seconds\n" ] }, { "data": { "text/markdown": [ "### Modin is 60.57x faster than pandas at `concat`!" 
], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "start = time.time()\n", "\n", "big_modin_df = pd.concat([modin_df for _ in range(N_copies)])\n", "\n", "end = time.time()\n", "modin_duration = end - start\n", "print(\"Time to concat with Modin: {} seconds\".format(round(modin_duration, 3)))\n", "\n", "printmd(\"### Modin is {}x faster than pandas at `concat`!\".format(round(pandas_duration / modin_duration, 2)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The resulting dataset is around 19 GB in size." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(apply_list_of_funcs pid=73415)\u001b[0m \n", "\u001b[2m\u001b[36m(apply_list_of_funcs pid=73416)\u001b[0m \n", "\n", "Int64Index: 136976500 entries, 0 to 1369764\n", "Data columns (total 18 columns):\n", " # Column Non-Null Count Dtype \n", "--- --------------------- ------------------ ----- \n", " 0 VendorID 127141300 non-null float64\n", " 1 tpep_pickup_datetime 136976500 non-null datetime64[ns]\n", " 2 tpep_dropoff_datetime 136976500 non-null datetime64[ns]\n", " 3 passenger_count 127141300 non-null float64\n", " 4 trip_distance 136976500 non-null float64\n", " 5 RatecodeID 127141300 non-null float64\n", " 6 store_and_fwd_flag 127141300 non-null object\n", " 7 PULocationID 136976500 non-null int64\n", " 8 DOLocationID 136976500 non-null int64\n", " 9 payment_type 127141300 non-null float64\n", " 10 fare_amount 136976500 non-null float64\n", " 11 extra 136976500 non-null float64\n", " 12 mta_tax 136976500 non-null float64\n", " 13 tip_amount 136976500 non-null float64\n", " 14 tolls_amount 136976500 non-null float64\n", " 15 improvement_surcharge 136976500 non-null float64\n", " 16 total_amount 136976500 non-null float64\n", " 17 congestion_surcharge 136976500 non-null float64\n", "dtypes: float64(13), datetime64[ns](2), int64(2), object(1)\n", "memory 
usage: 19.4 GB\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "UserWarning: Distributing object. This may take some time.\n" ] } ], "source": [ "big_modin_df.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Faster ``apply`` over a single column\n", "\n", "The performance benefits of Modin become apparent when we operate on large gigabyte-scale datasets. For example, let's say that we want to round the values in a single column via the ``apply`` operation. " ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Time to apply with pandas: 43.969 seconds\n" ] } ], "source": [ "start = time.time()\n", "rounded_trip_distance_pandas = big_pandas_df[\"trip_distance\"].apply(round)\n", "\n", "end = time.time()\n", "pandas_duration = end - start\n", "print(\"Time to apply with pandas: {} seconds\".format(round(pandas_duration, 3)))" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Time to apply with Modin: 1.225 seconds\n" ] }, { "data": { "text/markdown": [ "### Modin is 35.88x faster than pandas at `apply` on one column!" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "start = time.time()\n", "\n", "rounded_trip_distance_modin = big_modin_df[\"trip_distance\"].apply(round)\n", "\n", "end = time.time()\n", "modin_duration = end - start\n", "print(\"Time to apply with Modin: {} seconds\".format(round(modin_duration, 3)))\n", "\n", "printmd(\"### Modin is {}x faster than pandas at `apply` on one column!\".format(round(pandas_duration / modin_duration, 2)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Summary\n", "\n", "Hopefully, this tutorial demonstrated how Modin delivers significant speedups on pandas operations without the need for any extra effort. 
Throughout this example, we moved from working with about 200 MB of data to roughly 20 GB of data, all without having to change anything or manually optimize our code to achieve the level of scalable performance that Modin provides.\n", "\n", "Note that in this quickstart example, we've only shown ``read_csv``, ``concat``, and ``apply``, but these are not the only pandas operations that Modin optimizes. In fact, Modin covers [more than 90% of the pandas API](https://github.com/modin-project/modin/blob/master/README.md#pandas-api-coverage), yielding considerable speedups for many common operations." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.9" } }, "nbformat": 4, "nbformat_minor": 2 }